Decoding Methods for Neural Narrative Generation

Narrative generation is an open-ended NLP task in which a model generates a story given a prompt. The task is similar to neural response generation for chatbots; however, innovations in response generation are often not applied to narrative generation, despite the similarity between these tasks. We aim to bridge this gap by applying and evaluating advances in decoding methods for neural response generation to neural narrative generation. In particular, we employ GPT-2 and perform ablations across nucleus sampling thresholds and diverse decoding hyperparameters—specifically, maximum mutual information—analyzing results over multiple criteria with automatic and human evaluation. We find that (1) nucleus sampling is generally best with thresholds between 0.7 and 0.9; (2) a maximum mutual information objective can improve the quality of generated stories; and (3) established automatic metrics do not correlate well with human judgments of narrative quality on any qualitative metric.


Introduction
Narrative generation (or story generation) is the task of generating a creative response given an input prompt. This output can be a story closure, a paragraph, or a structured story with multiple paragraphs. This input and output setup is similar to the response generation task of chatbots, as both tasks convert some variable-length sequential input from a user to an automatically generated variable-length sequential output. Thus, the neural models and methods proposed to date for story generation and dialogue generation have been similar.
However, as narrative generation is largely focused on coherence across long outputs, the strategies used in this subfield have evolved separately from those in chatbot response generation; the latter has been more concerned with generating interesting and diverse (and typically short) outputs. Thus, while many beneficial techniques may have arisen from one domain, they are not often employed in the other. One decoding method, nucleus sampling (Holtzman et al., 2020), has recently been applied to narrative generation (Ippolito et al., 2020), but a thorough evaluation of its various p thresholds has not been performed with human judgments using narrative-specific criteria, as this can be time- and labor-intensive. Also, recent advances in decoding methods for response generation, notably the application of the maximum mutual information (MMI) objective (Li et al., 2016a), have resulted in more interesting dialog according to human evaluators (Zhang et al., 2020b); nonetheless, this objective also has not been applied to narrative generation. Indeed, the MMI objective has been confined to short-form and less open-ended generation tasks thus far.
* Equal contribution.
† Work performed while at Johns Hopkins University.
Thus, we apply techniques from neural response generation to neural narrative generation in order to investigate the potential benefits, and pitfalls, of applying these methods in this underexplored domain. This study aims to connect research developments across tasks by sweeping nucleus sampling thresholds and applying diverse decoding to generate long-form creative outputs. We perform human and automatic evaluations of automatically generated stories in these settings in order to investigate the following phenomena:
1. The effect of the nucleus sampling threshold p on narrative quality.
2. The effect of the maximum mutual information (MMI, Li et al. 2016b) diverse decoding objective with various diversity strengths λ on narrative quality.
3. The correlation (or lack thereof) between human evaluations of narrative quality and automatic metrics for response generation.
As this domain generates longer and less constrained outputs than other natural language generation (NLG) tasks, we expect to find different ideal settings than those found for short-form or constrained generation.
Our preprocessing, training, generation, and analysis scripts are available publicly. 1
1 https://github.com/AADeLucia/gpt2-narrative-decoding

Related Work
Narrative generation tasks. Work on narrative generation is split between cloze tasks, open-ended generation, and guided generation. In a cloze task, a full story except for a final word, phrase, or sentence is given, and a model generates a completion. This could be cast as a short generation problem or, more commonly in this domain, a multiple-choice problem (Mostafazadeh et al., 2016; Weston et al., 2015; Hill et al., 2015; Ippolito et al., 2019a).
Open-ended generation is the task of generating long-form output conditioned on a prompt (Figure 1). Fan et al. (2018) create a paired prompt and response dataset from the subreddit r/WritingPrompts 2 to train a sequence-to-sequence "fusion model." See et al. (2019) extend Fan et al. (2018), but use GPT-2 Small and perform a top-k decoding parameter sweep. We focus on this open-ended narrative generation task in our investigation, but primarily focus on GPT-2 Medium and on the effect of nucleus sampling thresholds and diverse decoding strengths on narrative quality. While Nadeem et al. (2020) similarly perform a hyperparameter search over sampling algorithms in a language generation setting, they perform human evaluations using a convincingness metric on a short-form news generation task; long-form narrative generation is not bound by realism (and may actually benefit from less realistic output), and thus requires different metrics and evaluation setups.
2 https://www.reddit.com/r/WritingPrompts/
Figure 1 (an example prompt and the start of its response):
Prompt: [ WP ] You live in a world where there has never been sickness , and you are the first to have ever experienced being sick .
Response: I open my eyes in a panic , sweat beading and then falling down my face . I look around and the sun in shining through the maroon curtains of my studio apartment . Everything seems to be as I left it the afternoon before , but there is a heavy , unfamiliar air in the room .
Guided generation is the middle ground of cloze and open-ended generation. The model is provided more context, such as characters, plot information, and potentially other information, and then generates a story based on all of the provided structural and semantic information (Peng et al., 2018; Akoury et al., 2020).
Decoding methods for generation. Decoding refers to the inference methods used in natural language generation: given input sequence S, how should we construct the output sequence T? Since finding the exact most probable token at each time step often does not produce human-like or high-quality results (Zhang et al., 2020a; Holtzman et al., 2020), search and sampling are used to overcome label bias and generate more human-like language. One popular search method is beam search, where at each time step, the algorithm keeps track of the top B most probable partial hypotheses. When B = 1, this method reduces to the greedy decoder, which chooses the argmax over the model's token distribution at each time step.
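The relationship between beam search and greedy decoding can be illustrated with a minimal sketch. The `toy_model` table, the `<eos>` marker, and all function names below are hypothetical and chosen only for illustration; they are not taken from the released code.

```python
import math
from typing import Callable, Dict, Tuple

# Hypothetical interface: maps a prefix to next-token log-probabilities.
NextLogProbs = Callable[[Tuple[str, ...]], Dict[str, float]]


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Made-up next-token distribution; unknown prefixes end the sequence."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    probs = table.get(prefix, {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


def beam_search(next_log_probs: NextLogProbs, beam_size: int, max_len: int,
                eos: str = "<eos>") -> Tuple[Tuple[str, ...], float]:
    """Keep the top `beam_size` partial hypotheses at every time step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished hypothesis
                continue
            for tok, lp in next_log_probs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
        if all(p and p[-1] == eos for p, _ in beams):
            break
    return beams[0]


greedy = beam_search(toy_model, beam_size=1, max_len=3)  # B = 1 is greedy decoding
wider = beam_search(toy_model, beam_size=2, max_len=3)
```

In this toy table, the greedy decoder commits to "a" (probability 0.6) and ends with a sequence of probability 0.3, while B = 2 recovers the more probable sequence starting with "b" (probability 0.36).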
An alternative to search is sampling-based approaches, which select a token with likelihood proportional to a (typically constrained) probability distribution at each time step. Such methods include top-k (Fan et al., 2018) which restricts the sampling space to the top k most probable tokens at every time step, and "nucleus sampling" 3 (Holtzman et al., 2020) which thresholds the cumulative token probability distribution according to a hyperparameter p. We focus on nucleus sampling, as it has tended to be a more effective decoding method in various response generation settings (Zhang et al., 2020a;Ippolito et al., 2020).
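As a rough illustration of the two truncation strategies described above, the following sketch filters a single next-token distribution. The example distribution and the function names are assumptions made for illustration; a real system would apply the same idea to the model's full vocabulary at every time step.

```python
import random
from typing import Dict


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def nucleus_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def sample(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one token with likelihood proportional to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Made-up next-token distribution for illustration only.
dist = {"the": 0.5, "a": 0.3, "dragon": 0.15, "zygote": 0.05}
nucleus_07 = nucleus_filter(dist, 0.7)  # keeps "the" and "a"
greedy_like = nucleus_filter(dist, 0.0)  # keeps only the argmax token
full = nucleus_filter(dist, 1.0)  # keeps every token
token = sample(nucleus_07, random.Random(0))
```

Unlike top-k, the number of tokens kept by the nucleus filter changes with the shape of the distribution: a peaked distribution yields a small nucleus, and a flat one yields a large nucleus.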
An approach to control sampling is temperature (Ackley et al., 1985), which modifies the softmax estimating the token probability distribution. This has been applied widely in neural text generation (Ficler and Goldberg, 2017;Caccia et al., 2018), especially when using top-k or random sampling. Low temperatures bias the model toward high-probability events, which tends to increase generation quality while decreasing token diversity (Hashimoto et al., 2019). Temperature sampling has been investigated extensively in natural language generation over multiple sampling methods, and nucleus sampling has been found to be a more effective method of controlling the sampling distribution (Holtzman et al., 2020), so we do not investigate this here.
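For reference, temperature scaling can be sketched as a softmax over logits divided by a temperature value. The logits below are made up, and the function name is an assumption for illustration.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits / temperature; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
plain = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
```

With these logits, the most probable token receives the most mass at temperature 0.5 and the least at temperature 2.0, which matches the bias toward high-probability events described above.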
Decoding objective. In chatbot response generation, top-k and nucleus sampling have been known to generate fluent, but uninteresting and simple high-probability responses which do not address the input (Li et al., 2016b). This issue is commonly referred to as the "I don't know" problem, where the response to all inputs is often the high-probability phrase "I don't know." Proposed solutions to this response blandness issue involve altering the decoding objective. Some recent work in this domain includes Nakamura et al. (2018), who use Inverse Token Frequency to reweight generated tokens. Xu et al. (2018) and Zhang et al. (2018) use adversarial loss to optimize for diversity, informativeness, and fluency. Martins et al. (2020) propose entmax sampling to generate more effectively from sparse distributions and address the train-test mismatch in text generation.
Another approach explores variants of the standard log-likelihood loss, applying different objectives during inference. An example of this is maximum mutual information (MMI, Li et al. 2016b), an objective that promotes more diverse responses in the neural response generation task. This mitigates the "I don't know" problem in which all responses tend to converge to some high-probability sequence with no real content conveyed in response to the input sequence. Two versions are introduced in Li et al. (2016b): a bidirectional objective (MMI-bidi) and an anti-language model objective (MMI-antiLM). The typical decoding objective is defined as
\hat{T} = \arg\max_{T} \log p(T \mid S),
where S is the input sequence, T is a possible target sequence, and \hat{T} is the selected target. We use a slightly modified form of the MMI-antiLM objective (Li et al., 2016a), defined as follows:
\hat{T} = \arg\max_{T} \{ \log p(T \mid S) - \lambda \log p(T) \},
where λ is a hyperparameter controlling the degree to which the language modeling objective is subtracted from the sequential transduction objective. Intuitively, this is meant to increase the likelihood of relevant targets while penalizing popular generic responses (e.g. "okay").
3 Also referred to as "top-p".
This diverse decoding objective has been applied to response generation but has not yet been applied to the narrative generation task; here, we evaluate the effect of the MMI-antiLM objective on narrative generation quality.
The WRITINGPROMPTS dataset (Fan et al., 2018) was built from the subreddit r/WritingPrompts, where users post a "prompt" consisting of up to a few sentences, and other users reply to the post with a story continuing the prompt (the "response"). An example prompt and response pair is in Table 1.
To create datasets of varying lengths, and to make the dataset compatible with our model (GPT-2, discussed more in §3.2), we preprocess the WRITINGPROMPTS dataset as follows:
[...] first line break/the first 100 tokens, (2) before the third line break/the first 256 tokens, and (3) the entire response/the first 1024 tokens, respectively. These are referred to as the "small", "medium", and "large" datasets/response lengths, and are treated as separate corpora. Thus, we have 3 train, validation, and test corpora for a total of 9.
3. Combine the source (prompt) and target (response) strings into one, as in Figure 2.
During step 2, we create multiple versions of the training set with varying response lengths to evaluate the quality of narrative generation for outputs of various lengths. We use line breaks instead of a token cutoff as in Fan et al. (2018), because line breaks are more likely to provide complete sentences. See Table 2 for the sizes of these datasets.

Narrative Generation with GPT-2
Instead of the convolutional-sequential model used in Fan et al. (2018), we focus on the generative Transformer-based model GPT-2 (Radford et al., 2019). We employ this model because it is currently the state-of-the-art publicly available text generation model, though this may change when GPT-3 (Brown et al., 2020) is released publicly.
We investigate the small and medium GPT-2 models for output quality comparison. GPT-2 Large was infeasible to train on the medium and large datasets, even on a machine with multiple Tesla P100 GPUs. GPT-2 is pre-trained on WebText. For this work, we fine-tune GPT-2 Small and Medium on the small, medium, and large versions of the WRITINGPROMPTS dataset discussed in §3.1. We fine-tune for one epoch using Adam with a learning rate of 5×10^−5, epsilon of 1×10^−8, and batch size of 4. Fine-tuning is performed on Google Cloud instances using NVIDIA Tesla K80s or T4s. Inference is performed by feeding GPT-2 a string of the format in Figure 2 up to the [RESPONSE] token.

Decoding Methods
After GPT-2 is fine-tuned on the WRITINGPROMPTS dataset, we evaluate the model's generated responses with a parameter sweep of p for nucleus sampling. We also provide a small comparison with top-k sampling in Appendix C.
Holtzman et al. (2020) use a threshold of p = 0.95 for chatbot response generation; we perform an ablation over values of p here to discover which value best suits narrative generation. Specifically, we investigate the thresholds of 0.3, 0.5, 0.7, 0.9, and 0.95, and also include greedy search and full random sampling, represented by p = 0 and p = 1, respectively.
Once we find the best p, we apply the diverse decoding objective to narrative generation to investigate whether this generates better stories. Specifically, we implement the MMI-antiLM (anti-language model) objective for GPT-2.
We also perform an ablation over λ values for the antiLM objective, testing the values 0.1, 0.2, 0.35, 0.5; λ = 0 represents not using diverse decoding. As this objective was originally designed to increase the specificity of a response with respect to a prompt, we expect this to increase interestingness and relevance (but perhaps decrease fluency and coherence, since we are subtracting the language modeling objective from the response generation objective). We only employ the antiLM objective when generating the first 20 tokens of the target sequence, after which we use the regular log-likelihood loss. This follows the approach of Li et al. (2016b), who find that ungrammatical sequences often arise later in the output sequence and that the first few tokens have a large effect on the rest of the output sequence; thus, they threshold the objective to only apply to the first few tokens during generation.
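One way to picture the thresholded antiLM objective is as a per-token rescoring step. The sketch below is a token-level approximation under stated assumptions: it takes two hypothetical log-probability tables, one conditioned on the prompt and one from a prompt-free language model, and it is not a description of the released implementation. The example probabilities are made up.

```python
import math
from typing import Dict


def anti_lm_scores(cond_log_probs: Dict[str, float],
                   lm_log_probs: Dict[str, float],
                   step: int, lam: float, first_n: int = 20) -> Dict[str, float]:
    """Score tokens with log p(t | S, T_<i) - lam * log p(t | T_<i) for early steps only."""
    if step >= first_n or lam == 0.0:
        return dict(cond_log_probs)  # regular log-likelihood objective
    return {tok: lp - lam * lm_log_probs[tok] for tok, lp in cond_log_probs.items()}


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn scores back into a probability distribution with a softmax."""
    peak = max(scores.values())
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Made-up probabilities: "okay" is generic, "dragon" is specific to the prompt.
cond = {tok: math.log(p) for tok, p in {"okay": 0.5, "dragon": 0.3, "the": 0.2}.items()}
lm = {tok: math.log(p) for tok, p in {"okay": 0.6, "dragon": 0.05, "the": 0.35}.items()}

early = normalize(anti_lm_scores(cond, lm, step=0, lam=0.5))
late = normalize(anti_lm_scores(cond, lm, step=20, lam=0.5))
best_early = max(early, key=early.get)
best_late = max(late, key=late.get)
```

In this example, the generic token is preferred once the penalty is switched off (step 20 and later), while the prompt-specific token is preferred during the first 20 steps.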
There is an established quality-diversity tradeoff (Zhang et al., 2020a) in natural language generation, so we expect that strong diverse decoding (e.g., λ = 0.5) will generate lower-quality narratives overall compared to lower λ values, which may increase interestingness more than they decrease fluency.

Evaluation
The qualities important for narrative generation are interestingness, coherence, fluency, and relevance to the prompt. These metrics are also evaluated in Akoury et al. (2020), though they measure "likeability" instead of interestingness.
A combination of automatic and human evaluation is used to assess the quality of generated narratives. For automatic evaluation, we employ test perplexity, lexical diversity (dist-n, Li et al. 2016b), and a BERT-based sentence similarity metric, Sentence-BERT (sent-BERT, Reimers and Gurevych 2019). Perplexity is used to evaluate language models and may correlate with fluency. The latter two may act as proxies for interestingness, since they measure n-gram diversity within an output and sentence embedding diversity across outputs, respectively. We use sent-BERT as an output diversity metric by using the cosine distance instead of cosine similarity. Our motivation in choosing these diversity metrics is from Tevet and Berant (2020), who identify dist-n and sent-BERT as the best metrics to evaluate two targeted types of diversity-diverse word choice and diverse content, respectively.
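The two diversity metrics can be sketched as follows. The dist-n function follows the usual definition (distinct n-grams divided by total n-grams). The embedding vectors are stand-ins for Sentence-BERT outputs, and the mean pairwise aggregation is an assumption made for illustration.

```python
import math
from itertools import combinations
from typing import List, Sequence


def dist_n(tokens: Sequence[str], n: int) -> float:
    """Number of distinct n-grams divided by the total number of n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """One minus cosine similarity, so larger values mean more different."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(embeddings: List[Sequence[float]]) -> float:
    """Average cosine distance over all pairs of output embeddings (assumed aggregation)."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


story = "the cat sat on the mat".split()
dist_1 = dist_n(story, 1)  # 5 distinct unigrams out of 6
dist_2 = dist_n(story, 2)  # all 5 bigrams are distinct

# Placeholder vectors standing in for sentence embeddings of three outputs.
fake_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
diversity = mean_pairwise_distance(fake_embeddings)
```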
For human evaluation, we employ 4-point Likert scales to evaluate narratives for interestingness, coherence, fluency, and relevance. For the purpose of evaluation, we define interestingness as the enjoyment of reading the story, coherence as the level of cohesion between sentences in a narrative, and fluency as the grammaticality and naturalness of the English output; these metrics judge the quality of a generated narrative independently from the input prompt. Relevance is a metric we employ to measure how well the response follows from the input prompt. We evaluate 100 narratives per p and per λ, and we have 5 human annotators per narrative. We judge quality on medium-length outputs, as these are less variable in length than large narratives while being long enough to properly judge our metrics. Appendix B contains a thorough description and example of our Mechanical Turk setup.

Baseline
We employ the fusion model (the previous state-of-the-art approach for narrative generation before pre-trained Transformer models) from Fan et al. (2018) as a baseline. This model is an ensemble of two convolutional seq2seq models, where the first is pre-trained on the training set and is then used to boost a second model. We employ this model on the WritingPrompts dataset and evaluate on different narrative lengths.

[Table 3: perplexity of each model on the small, medium, and large response lengths.]
The perplexities of each model on each narrative length are shown in Table 3. GPT-2 Medium had the lowest perplexity within each dataset size. GPT-2 Small had a fairly close perplexity to GPT-2 Medium despite having significantly fewer parameters. Comparatively, the fusion model had a high perplexity, though scores are not directly comparable across models due to tokenization differences. In general, perplexity decreased as the length of the response increased, though perplexities are also not necessarily comparable across dataset sizes since this is a per-word metric. Nonetheless, these results suggest that we should generally expect GPT-2 Medium to be marginally more fluent than GPT-2 Small, and that both of these will output far better English than the fusion model. We confirm this qualitatively; see Appendix A. We thus focus on GPT-2 Medium for the following analyses.
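To make the comparability caveat concrete, the sketch below computes perplexity from a list of log-probabilities and shows how the normalization unit changes the result. The function name and the example numbers are assumptions for illustration only.

```python
import math
from typing import List, Optional


def perplexity(log_probs: List[float], n_units: Optional[int] = None) -> float:
    """exp of the negative mean log-probability, normalized by n_units.

    By default n_units is the number of scored tokens. Passing a different
    count (for example, words rather than subword tokens) changes the result,
    which is why scores from differently tokenized models are hard to compare.
    """
    n = n_units if n_units is not None else len(log_probs)
    return math.exp(-sum(log_probs) / n)


# Four subword tokens, each assigned probability 0.25 (made-up values).
token_log_probs = [math.log(0.25)] * 4
per_token = perplexity(token_log_probs)  # normalized over 4 tokens
per_word = perplexity(token_log_probs, n_units=2)  # same text counted as 2 words
```

Here the same total log-probability yields a perplexity of 4 per token but 16 per word.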
Next, we sweep over various values of p for nucleus sampling using GPT-2 Medium on the medium-length dataset, evaluating using human annotators (Figure 3). We found that p = 0.7 performed best on average for all metrics except interestingness, where p = 0.9 was best. p = 0.9 was a close second overall, and the difference in performance between these two settings was not high. Increasing p past 0.9 or decreasing p below 0.7 more notably decreased performance. Inter-annotator agreement (measured with Fleiss' kappa) was 0.13 for interestingness and coherence, 0.12 for fluency, and 0.10 for relevance; these are similar to agreements found in Akoury et al. (2020) when prompts are included.
To test the effect of diverse decoding on narrative quality, we use the same human annotator setup as for the p sweep. We decode with nucleus sampling using p = 0.7 and vary the λ hyperparameter (Figure 4). Higher λ indicates a larger modification from the original decoding objective. We found that setting λ = 0.1 increased the quality of narratives for all metrics. Interestingness and relevance further increased at λ = 0.2, which is expected given that the p(T | S) term in the decoding objective becomes more prominent than p(T) as λ increases; however, fluency and coherence began to decline here. Higher settings of λ tended to reduce quality on all metrics.
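For readers unfamiliar with the agreement statistic reported above, Fleiss' kappa can be computed from a table of rating counts as in the following sketch. The counts are made up, and the function name is an assumption for illustration.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j; every item needs the same rater count."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])
    # Share of all ratings that fall in each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_categories)]
    # Observed agreement among rater pairs for each item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    p_expected = sum(p * p for p in p_j)
    return (p_bar - p_expected) / (1 - p_expected)


# Made-up ratings: 3 stories, 5 annotators each, 2 rating categories.
example = [[5, 0], [0, 5], [3, 2]]
kappa = fleiss_kappa(example)  # equals 67/112 for this table
```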
Next, we discuss the relationship between model size and the diversity of outputs. Table 4 contains dist-n and sent-BERT scores for all model sizes, p values in nucleus sampling, and response lengths. For any given p value and response length, GPT-2 Medium tended to use a slightly larger variety of tokens per response than GPT-2 Small. Meanwhile, the diversity of the fusion model outputs was quite low in comparison, typically due to the degeneracy of the output. We also note that the dist-n scores were the same for the medium and large response lengths; this is also due to the degeneracy of the output and the surprisingly short stories generated, even when trained on large data and when allowed to generate up to 1,000 tokens.
Dist-n and sent-BERT scores both declined with increasing response lengths. We believe that the former is due to the normalization constant (the number of n-grams in the narrative) in dist-n calculations. Larger responses tend to repeat tokens more than shorter responses, so increasing response length increases the normalization constant more quickly than the number of unique n-grams. The latter may be due to the way sentence embeddings are calculated: as the number of tokens grows, sentence embeddings may grow more similar on average, since they are calculated as the mean of the token embeddings that compose the sentence.
Relatedly, even though we allow the fusion models trained on the large dataset to generate longer responses, they often generated responses which were of similar lengths to medium responses (i.e., they often did not generate to their maximum allowed sequence length). This may explain the lack of distinction between the scores obtained in Table 4 between medium and large narratives.
Finally, we analyze the effect of various p values as well as different strengths of the MMI-antiLM objective on narrative token diversity (Figure 5). There was an expected consistent positive correlation between p and dist-n, as well as a positive correlation between λ and diversity; since dist-n increases monotonically with both hyperparameters, ρ_s = 1. Sent-BERT consistently decreased with higher p when p > 0, indicating lower levels of difference between narratives as p increases. Sent-BERT decreased monotonically with respect to λ.

Qualitative Results
In this section, we analyze the quality of narratives by directly observing the outputs. Appendix A shows generated narratives from a variety of model architectures, sizes, and decoding hyperparameters.

Nucleus Sampling
When p was high, we generally observed more interesting and vivid narratives with good diction and fluency scores, but which had no single cohesive plot. When p was low, we saw more repetitive word choice but higher cohesion. However, when p was very low (p ≤ 0.3), the output was degenerate. Generally, when p was around 0.7, we observed consistently good stories compared to other p values. With values of p = 0.9 and higher, we generally saw output stories with more variable quality (i.e., whose quality is often either higher or lower than stories with p = 0.7). This is intuitive with respect to how p restricts the sampling space: when p is too small, too many options are removed and the model cannot generate fluent text. When p is large, we more closely approach random sampling and fewer tokens are removed from the sampling space, so the probability tail increases the likelihood for the model to choose unlikely tokens; this can produce interesting output, but tends to reduce fluency and coherence. A discussion of the number of tokens sampled for each p is in Appendix E.

Diverse Decoding
For smaller values of λ, MMI had a smaller effect on the output of the models. Within a given p value, increasing MMI values up to 0.2 seemed to result in slightly more interesting diction for the small models. Coherence seemed to be unaffected by changing values of λ, though we saw a notable drop in the grammaticality of output at 0.35 and higher.
More interesting is that the intensity of the subject matter seemed to increase with λ, especially notable around 0.2 and 0.35. Indeed, we generally observed more cursing, violent content, and jokes featuring sexuality and dark humor as λ increased. This may not necessarily be a positive or negative trend; if one wishes to generate stories which are more vivid, and one's language model is sufficiently high-quality to start, then this may be a beneficial method to employ. Nonetheless, we do not have a clear mathematical explanation for this, since the MMI-antiLM objective simply increases the importance of the prompt while decreasing the importance of the language model. Perhaps these more intense subjects are somewhat less probable than more tame content, hence why subtracting the language model could increase the likelihood of seeing these darker themes.

Correlating Automatic Metrics with Quality
Thus far, we have observed how perplexity, dist-n, and sent-BERT vary with various model architectures/sizes, decoding approaches, and hyperparameters. However, what do these quantities say about the quality of generated narratives? In general, we note the following qualitative trends:
(1) Lower perplexity is better. This correlates mainly with fluency and non-degenerate output.
(2) Very low dist-n scores indicate consistent neural text degeneration.
(3) Very high dist-n scores indicate variable-quality narratives.
Dist-n demonstrated a moderate correlation 7 with interestingness (ρ_s = .75, P < .1) across top-p values. The two metrics correlated well up to top-p = 0.9, but it is possible that decreased fluency and coherence at higher values of p overshadowed the increased number of distinct tokens per response, thus negating any interestingness gains. For all other human metrics, dist-n did not correlate well (ρ_s ≤ .5, P > .1). Thus, we do not recommend optimizing over dist-n. Rather, this quantity can be a helpful heuristic when comparing across model configurations at a high level, and both very high and very low dist-n scores can be indicative of distinct problems in narrative generation despite having little inherent meaning in isolation.
7 All correlations here are measured using Spearman's rank correlation (ρ_s) along with measures of significance (capital P).
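Spearman's rank correlation, used for all correlations in this section, is the Pearson correlation of the ranks of two variables. The sketch below is a plain-Python version with average ranks for ties. The metric values in the example are invented for illustration and are not results from this study; significance testing is omitted.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


p_values = [0.0, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0]
# Invented metric values that rise monotonically with p (illustration only).
rising_metric = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
rho_monotonic = spearman(p_values, rising_metric)
```

Any strictly increasing relationship gives a coefficient of 1 regardless of its shape, which is why a monotonic rise alone yields ρ_s = 1.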
Sent-BERT did not correlate well with any of our metrics (0 ≤ ρ_s ≤ .43, P > .1), indicating that it is either not a sufficient method for sentence diversity measurement when applied to narratives, or that it does not correlate with factors that make for interesting narratives. When p was lower, we observed stories that were degenerate in different ways, whereas when p was higher, we observed stories that were always more token-diverse, and thus generally more similar on a sentential level.
We find a less marked diversity-quality trade-off in the narrative generation setting compared to recent natural language generation papers in other settings (Ippolito et al., 2019b;Zhang et al., 2020a;Nadeem et al., 2020). If this trade-off were strong, we would expect generally decreasing human evaluation scores with higher p and higher λ, since dist-n increases monotonically with both hyperparameters. While this held to an extent with λ (and even then not monotonically, since λ = 0.1 showed higher performance on all metrics), it was certainly not true for p up to very high values. Perhaps this is due to the more open-ended nature of narrative generation, as stories can benefit from higher levels of diversity without needing to maintain realism or a specific writing style.

Conclusions
Our results suggest that p values lower than those suggested for other tasks (Holtzman et al., 2020) are ideal in narrative generation, and that small magnitudes of diverse decoding may produce better and more vivid stories. We also find that distinct-n and sentence-BERT do not correlate well with any of our human perceptions of narrative quality, and that the quality-diversity trade-off is less strong in narrative generation than in other generation tasks. The latter finding is preliminary, though supported by Martins et al. (2020), who find increases in both diversity and human scores with their proposed method.
Our findings aim to inform future efforts in the narrative generation domain by establishing future baselines given our recommended hyperparameters, and by facilitating further investigation of decoding objectives for better narrative generation. Once GPT-3 (Brown et al., 2020) is released for public use, it is very likely that this model will outperform GPT-2; thus, we encourage future work to investigate similar hyperparameters and sampling methods to see whether these trends are stable across model sizes.

Ethical Considerations
Our contributions include a story generation model to be used by other researchers and AI hobbyists. This model was fine-tuned on WritingPrompts (Fan et al., 2018), which is a collection of prompts and responses from a popular creative writing subreddit r/WritingPrompts. To the best of our knowledge, this dataset was not examined for hate speech or gender bias, and we did not perform such inspections here. Also, the released code has no post-generation filter to flag potentially offensive narratives.
We did not pursue any of these filters or offensive text detection because our work was focused on evaluating generated narratives for stylistic measures of quality, and was not focused on content-based sources of bias. However, one should look to relevant work in the field on bias and hate speech detection (Sheng et al., 2020; MacAvaney et al., 2019) before deploying such models as creative writing tools. Besides the clear ethical obligation to vet such a tool, a "creative" writing tool which propagates or amplifies the bias of its training set would potentially hinder the quality of output narratives. Normative and stereotypical narratives would likely be uninteresting.

A Example Outputs
All examples start on the following page. We report narrative responses given a single prompt for various model architectures/sizes, decoding methods, and hyperparameter sweeps.

B Human Annotator Survey Details
As discussed in §3.4, we created a survey on Amazon Mechanical Turk for the human evaluation. Evaluating all of the prompts was infeasible, so we sampled 100 prompts and generated one story for each nucleus sampling p value ({0.0, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0}), for a total of 700 stories. We wanted stories that were long enough to give the worker sufficient context to evaluate a passage, but not so long as to take too much time per story. We used the GPT-2 Medium model (best performing, see §4) trained on the medium-length dataset because it fit our requirements. Due to the projected length of time to complete the survey, we paid $1 per human intelligence task (HIT). Each HIT was seen by five workers.
The generated stories were shuffled, and split into groups of five for each HIT. The story display is shown in Figure 9. In addition to the five stories, each HIT had one "attention check." There were a total of 140 HITs. The definitions for interesting, fluent, coherent, and relevant were explained, along with guidelines for each of the [1, 4] Likert scale options (shown in Figure 7). For convenience, the definitions were available as a tooltip when a mouse hovered over a question or option. Example ratings were available to the worker under the "Examples" tab (not shown).
As mentioned earlier, each HIT included one attention check. The attention check was used to check if a worker was paying attention to the task or selecting options at random. The check, shown in Figure 8, asked the worker to fill in the same answers as for the previous story. In addition to the attention checks, we supervised the workers by only releasing 20 HITs at a time (total of seven batches), and iteratively removing workers who did a poor job. While this task was very subjective (a handful of workers left us comments about the difficulty of the task), we consider performance subpar for any combination of the following: (1) finishing the task unreasonably quickly (under 5 minutes), (2) failing an attention check, (3) having low agreement with other annotators, and (4) completing many HITs in a short amount of time. We spot-checked work from those who were automatically flagged as suspicious by checking their task answers. Overall, we removed 28 workers from the final results.
Once the highest-rated nucleus sampling parameter was chosen (p = 0.7), we repeated the same setup for the antiLM λ parameter sweep. Using the same 100 prompts as before, we generated stories with GPT-2 Medium trained on the medium-length dataset, with p = 0.7 and λ = {0.1, 0.2, 0.35, 0.5}. We also included λ = 0.0 (i.e., without the antiLM objective) to help with worker calibration. The 500 stories were split into 100 HITs (five batches of 20 HITs).
The total cost of both the nucleus sampling and antiLM sweeps was $1,440.
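The story and HIT counts above follow directly from the sweep sizes, and the arithmetic can be checked with a short sketch. The appendix does not break down the $1,440 figure; the 20% platform fee below is an assumption (not stated in the text) that happens to reproduce the reported total.

```python
STORIES_PER_HIT = 5      # from the text: stories were split into groups of five
WORKERS_PER_HIT = 5      # from the text: each HIT was seen by five workers
REWARD_DOLLARS = 1       # from the text: $1 per HIT
ASSUMED_FEE_PERCENT = 20 # assumption: platform fee, not stated in the text


def sweep_size(num_prompts, num_settings):
    """Return (number of stories, number of HITs) for one hyperparameter sweep."""
    stories = num_prompts * num_settings
    return stories, stories // STORIES_PER_HIT


nucleus_stories, nucleus_hits = sweep_size(100, 7)  # seven p values
antilm_stories, antilm_hits = sweep_size(100, 5)    # four λ values plus λ = 0.0

# Worker rewards across both sweeps, in whole dollars.
rewards = (nucleus_hits + antilm_hits) * WORKERS_PER_HIT * REWARD_DOLLARS
fee = rewards * ASSUMED_FEE_PERCENT // 100
total = rewards + fee

print(nucleus_stories, nucleus_hits)
print(antilm_stories, antilm_hits)
print(rewards, fee, total)
```

Under these assumptions, worker rewards account for $1,200 and the assumed fee for the remaining $240.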
C Top-k vs. Nucleus Sampling

C.1 Setup
For top-k sampling, we use k = 40; our motivation for choosing this value is that it is the one used in Radford et al. (2019) for "conditional" (prompted) generation, and in Fan et al. (2018).
The following is a qualitative review performed by the authors.

C.2 Qualitative Evaluation
For most reasonable settings of p, nucleus sampling tends to produce stories which are dramatic, vivid, and fun to read, but which do not often stay on topic. Indeed, the outputs demonstrate two main types of errors: (1) cramming too many topics into one story, and (2) sudden shifts in topic. Example outputs are in Table 8.
Top-k sampling, however, demonstrates quite extreme variance. Some of the generated stories feel almost human-like in how on-topic they remain for multiple paragraphs, but they are about safe and boring topics and generally employ very common token collocates, which makes the output feel uncreative and uninteresting. Other stories are dramatic, but almost dream-like due to their incoherent, stream-of-consciousness flow. Yet other stories are completely unintelligible and show signs of neural text degeneration. Holtzman et al. (2020) find nucleus sampling to be generally preferable to top-k sampling, and we find this to be true in the narrative generation task. p seems to correlate more closely with narrative quality than k.

C.3 Conclusions
As we had expected, we preferred the stories generated with nucleus sampling decoding. Nucleus sampling is essentially a dynamic top-k algorithm (i.e., each step has a different number of tokens that constitutes the top x% of the probability mass), and even small nucleus sampling values can have a large number of tokens to choose from (a large effective k). This aligns with the results of See et al. (2019), who found large k to be preferred according to automatic evaluations.
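To make the "dynamic top-k" description concrete, the following is a minimal pure-Python sketch of the two token filters. It is an illustration under stated assumptions, not the implementation used in the experiments: the function names and the toy four-token distributions are hypothetical, and the nucleus is taken to be the smallest set of top-ranked tokens whose cumulative probability reaches p.

```python
def _ranked(probs):
    """Token indices sorted from most to least probable."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


def _renormalize(probs, keep):
    """Restrict the distribution to `keep` and rescale it to sum to 1."""
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_k_filter(probs, k):
    """Keep a fixed number of tokens: the k most probable ones."""
    return _renormalize(probs, _ranked(probs)[:k])


def nucleus_filter(probs, p):
    """Keep the smallest top-ranked set whose cumulative probability is >= p."""
    keep, cumulative = [], 0.0
    for i in _ranked(probs):
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


# Toy next-token distributions (hypothetical): one peaked, one flat.
peaked = [0.9, 0.05, 0.03, 0.02]
flat = [0.25, 0.25, 0.25, 0.25]

for name, dist in (("peaked", peaked), ("flat", flat)):
    print(name,
          "top-k size:", len(top_k_filter(dist, k=2)),
          "nucleus size:", len(nucleus_filter(dist, p=0.7)))
```

With a fixed k the candidate set has the same size at every step, whereas the nucleus shrinks to a single token on the peaked distribution and grows on the flat one. Under this convention, p = 0.0 always keeps exactly one token, which corresponds to greedy decoding.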

E A Closer Look at Nucleus Sampling
How does the nucleus sampling token filter compare to the top-k filter? For example, when a token is sampled from p = 0.3, how many tokens are in the sampling space? Figure 6 shows the cumulative distribution function (CDF) for the p values tested in the nucleus sampling hyperparameter sweep. Using the same set of 100 prompts from the human evaluation, we re-generate the responses and collect the number of tokens in the sampling space at each step. Each p is represented by the raw number of tokens in the distribution across all 100 prompts and is not averaged for each generated story.
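The measurement described above (counting the tokens in the sampling space at each decoding step, then pooling the counts into a CDF) can be sketched as follows. This is a hedged illustration only: `nucleus_size`, `empirical_cdf`, and the toy per-step distributions are hypothetical stand-ins for the model's real next-token distributions.

```python
from collections import Counter


def nucleus_size(probs, p):
    """Number of top-ranked tokens needed for cumulative probability >= p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    # Rounding can leave the total slightly below p; fall back to all tokens.
    return len(probs)


def empirical_cdf(sizes):
    """Return [(size, fraction of steps with nucleus size <= size)], sorted by size."""
    counts = Counter(sizes)
    total = len(sizes)
    cdf, running = [], 0
    for size in sorted(counts):
        running += counts[size]
        cdf.append((size, running / total))
    return cdf


# Toy next-token distributions for three decoding steps (hypothetical values).
steps = [
    [0.9, 0.05, 0.03, 0.02],
    [0.25, 0.25, 0.25, 0.25],
    [0.6, 0.3, 0.05, 0.05],
]

# Pool the raw per-step counts rather than averaging per story, as in the text.
sizes = [nucleus_size(step, p=0.7) for step in steps]
print(sizes)
print(empirical_cdf(sizes))
```

A nucleus size of 1 at a given step means that nucleus sampling behaves exactly like greedy decoding at that step.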
Surprisingly, nucleus sampling often reduced to the greedy decoder (token space reduced to a single token), even at higher p values. Despite high probability thresholds, 0.7 ≤ p ≤ 0.95 still skewed towards sampling from a relatively low number of tokens, as these tokens tended to have very high probability in certain contexts. All p values frequently sampled from fewer than 1,000 tokens, which is less than 2% of the total number of tokens in the vocabulary (50,260). As p increased, we observed a larger sample space, which indicates more of a return to the long-tailed token probability distribution seen in random sampling.
This shows that it may not be correct to compare a nucleus sampling p value directly against a single top-k value. Nucleus sampling is essentially dynamic top-k sampling, which makes a direct comparison unfair without first checking the distribution of the number of tokens sampled by p for the model and task in question.  GPT-2 Small, p = 0.7 The last time I saw this place was in a very old diner, that was a pretty nice diner. The waiters were in all black suits and shoes, and the only reason they didn't make it in time was because I didn't want to. I had my smartphone in my hands, but I 'd had it the whole time. <newline><newline>" Don't take it anymore, it's been a while. " A voice said, and I smiled. " I don't want to take it anymore. " <newline><newline>I smiled and held my phone out. I was still on the phone, but I could feel my stomach churning. I could feel the sun on my face, so I held it out, but the lights went out. I wasn't sure how long I had left, but it wasn't long before I woke up. GPT-2 Med., p = 0.7 " Hi, I'm Mr. Smith, and I'm here for the Purge! " <newline><newline>The building's boss sat at the desk with a huge grin. I 'd been working in the building for about five months now and I 'd grown tired of it. <newline><newline>" Now, I understand the concerns, but we have a very strict policy about employees not coming into work after 12:00 PM. If you're caught cheating, or even the slightest bit annoying, you're fired. We also don't allow anyone over 12 hours without the company paying a visit. " Fusion Model, p = 0.7 " Hello ! My name is Randy , " I shouted into the air . " I ' m the <unk>" <newline><newline>" Well , it ' s just that I need a new way to put this . I ' m not going to let you take my own life. " Randy said . " You ' re not going to allow me to use my own life . But I ' m not going to let you take your own life ! " <newline><newline>" Oh , you ' re sorry ! I was just thinking about your life ! 
You ' re going to make a decent living , and all I need to do is come back to the living room . I need to be happy . But you ' re not going to let me get my own life ! " <newline><newline>" You ' re going to have to put my own life together , <unk>! " Randy laughed . " If I ' m going to take your own life , I ' ll take you on to your own life . " <newline><newline>" You '  , hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, he p = 0.3 I was nervous. <newline><newline>I was nervous about the job. <newline><newline>I was nervous about the pay. p = 0.5 The first thing I noticed was the absence of people. I wasn't sure why, but I couldn't shake the feeling that it wasn't just the people that were gone. I mean, it wasn't like I hadn't been here before. I mean, I was in the office for a month, but it felt like a year. <newline><newline>I wasn't sure what was going on, but I could tell that something was wrong. <newline><newline>" Hello? " p = 0.7 My boss was a loner. He liked to work late and I didn't know why. I didn't want to work late. I just didn't have the patience for it. <newline><newline>I don't know what happened. <newline><newline>" You got ta be kidding me, " he said, " a sixteen year old would just walk into the office and run out of work? 
" p = 0.9 All four of the chimes clicked and the chandelier snapped, as the room dimmed and grew dim until the only light came from an overhead television, showing the " The Purge " live. In the background, a towering wall of reds, greens, and yellows flashed in contrast to the background colors, casting warm green shimmers across the television. <newline><newline>The chandelier struck the ceiling, crumpling under the weight. A small circle of lights in the ceiling blinked and illuminated what had been a desk, until two figures entered it. <newline><newline>The first man, dressed in a yellow and white baseball cap and sunglasses, stood tall, wearing khakis and a button-down shirt. A muscular build and hairy arms waved about as he spoke. p = 0.95 Those who survive are always in their offices when they finish, they wake up, make sure they run the water for a while, then try to get up and make the 9 to 5 as fast as possible. The rest are scared to take a break from that, they're afraid the DIT will find them and kill them and they have to do it. Well those people, these employees, they're all so qualified to run around the office. I mean they're trained. I took the systems Administrator and Job Writer ( same person from last year ) and fired them in 15 minutes. The guy, one of the better systems engineers had died from a heart attack and there are only 3 employees left. I've never heard of a DIT being fired like that before. " <newline><newline>Dr. Neutson was writing these words down as they told her to, he 'd always loved his wife Sandy too and she was absolutely dying. Mrs. Neutson looked as though she was about to make another coffee, now that she realized it was someone else who was in her office, there was tears streaming down her face, she was going to finish today. " Great, I'm going to have no more" p = 1.0 Chad's Melatonin Impaired Heart implant had occurred during the Golden elimination pilot. 
" Critically, " the docs explained to the board, " Third lapse, unlocking the minimum Carol Corporation-mandated for the Restricted Solid Species Program. We penalize those who have such lapses as early and late entry personnel. " <newline><newline>Chad felt pain in the core of his neck -compressing saliva into his neckline, seeing his neighbors ' streaks poking out, " That doesn't work, I still live in Seattle ' s Milliner Square until May 17th, " thought Chad as he fell down the empty stairs the hospital had installed that morning. Without warning, his head slammed into the steel sill beneath him and he felt a sharp pain begin to run down his spine. The pain apparently doomed him to eternity in a single meditation, the disease never getting better. <newline><newline>After five ER visits, nine Lab ultrasound tests, six minor surgeries, pressure checks, one heart-heated ultrasound, Chad came back for surgeries, six of which were removeable. There were only two of us left in the hospital. The nurses wouldn't let me leave. I was in pain. My mother was looking at me with a confused look on her face. <newline><newline>"Can you tell me what's wrong, honey?" she asked. <newline><newline>"Honey, it's okay." I whispered. I couldn't hear her, and the nurses couldn't hear me either. λ = 0.2 My wife came home from work around 3 AM to find her child covered in blood and screaming. Her heart was racing and her skin was boiling with the pain of being cut. <newline><newline>I rushed into the living room, eyes wide and bloodshot. She was hysterical, sobbing, and cackling as she knelt over my body, gasping for air. <newline><newline>"Wh-who are you?" I asked. She was gasping for air. λ = 0.35 Sitting next to her bed hung cardboard boxes. Five candles burning brightly in their candles hung in the ceiling. A bright yellow heart in a jar was stuck in the middle. There were three other jars on the floor. The left was empty, with the word "SHIT" written on it. 
The other two had the words "WAIT" written on them. <newline><newline>"Come on, Mom, it's just a puppy!" He was angry. <newline><newline>"It's a girl!" I responded. "What are you talking about? You have two of them, and they're twins!" λ = 0.5 "Daddy? Daddy what's wrong honey? Daddy why are you crying honey? Daddy pick up your child and leave the room, you have to get to work" <newline><newline>I picked up my daughter and we walked out into the kitchen. <newline><newline>I held her close and whispered into her ear "It's ok honey, I'll be ok." Table 7: Medium-length stories generated using GPT-2 Medium with nucleus sampling (p = 0.7) and various diverse decoding strengths λ.  The worker is asked to input the same answers as they did for the previous story. A worker gets flagged for review if they fail at least one attention check. Figure 9: An example narrative display to a worker along with the questions. There were five stories in each HIT and one attention check. The definitions of each metric and option are shown when the mouse hovers over the questions and radio buttons, respectively.
Prompt: "[WP] You just started a new job in a large office building and what you do n't know is that at twelve o'clock every day , there is a office wide Purge style battle to the death , those who survive an hour get a significant pay raise and can come back tomorrow . A bell just went off , it 's twelve o'clock ."

GPT-2 Med., p = 0.9
All four of the chimes clicked and the chandelier snapped, as the room dimmed and grew dim until the only light came from an overhead television, showing the " The Purge " live. In the background, a towering wall of reds, greens, and yellows flashed in contrast to the background colors, casting warm green shimmers across the television. <newline><newline>The chandelier struck the ceiling, crumpling under the weight. A small circle of lights in the ceiling blinked and illuminated what had been a desk, until two figures entered it. <newline><newline>The first man, dressed in a yellow and white baseball cap and sunglasses, stood tall, wearing khakis and a button-down shirt. A muscular build and hairy arms waved about as he spoke.

GPT-2 Med., k = 40
The first time it happened I remember it vividly. The first time I met it. We got our first coffee, we sat on the train and talked and made plans but when I first saw it coming I knew something was going to happen. There was a man in white with a clipboard. I said " how you doing? " and he asked " what the hell did I just read you about ". I told him nothing he went through what the hell I knew everything about the Purge and I knew that I had to be careful because if I wasn't careful it would happen again. He looked at me expectantly and I knew immediately that I had to go and try to help people ' s lives. No one else would go so I did as I was told. <newline><newline>He asked me if I had time for something more serious than just reading

Fusion Model, p = 0.9
He was n't ready to die . I did . The name on his chest would kill him if he wanted to know . They wanted me to accept him as he wanted me to be not ready . I did it for him but he loved me . <newline><newline>I was the one that started a revolution . It was a simple matter of encouraging him and all the loyal soldiers that murdered him . And yet I never seemed to know what he had to do to deserve my right to be born and robbed . If only I knew . I could feel the faintest drop of power beneath my footsteps as he died . I felt the need to put some strings around him . But he had so much to offer . He seemed to feel I was the leader of the group and I was ready . He taught me the fact that he was supposed to be the most hated man in the country . So the world slowed . <newline><newline>And so he fought me . I was ready . He lost every step of his very being when he

Fusion Model, k = 40
It was a sunny Monday morning when I woke up to the noise of my alarm going off . I got up from my bed , got out of bed , and went into the bathroom and took off my coat . It was n't exactly a normal morning . I walked into the bathroom and put on my shoes , and put on some pants , and went to the bathroom . The light from the bathroom was n't going to change anything . I walked out of the bathroom and went to the bathroom . It was a good morning . My morning routine was going well in bed , and I was going to see some shit , so it was good . <newline>I went to the bathroom . It was the first step in my morning shift , so I took off my pants and

Table 8: Medium-length responses from GPT-2 Medium and the Fusion (baseline) model with top-k and nucleus sampling.