Evaluating the Evaluation of Diversity in Natural Language Generation

Despite growing interest in natural language generation (NLG) models that produce diverse outputs, there is currently no principled method for evaluating the diversity of an NLG system. In this work, we propose a framework for evaluating diversity metrics. The framework measures the correlation between a proposed diversity metric and a diversity parameter, a single parameter that controls some aspect of diversity in generated text. For example, a diversity parameter might be a binary variable used to instruct crowdsourcing workers to generate text with either low or high content diversity. We demonstrate the utility of our framework by: (a) establishing best practices for eliciting diversity judgments from humans, (b) showing that humans substantially outperform automatic metrics in estimating content diversity, and (c) demonstrating that existing methods for controlling diversity by tuning a "decoding parameter" mostly affect form but not meaning. Our framework can advance the understanding of different diversity metrics, an essential step on the road towards better NLG systems.


Introduction
An important desideratum of natural language generation (NLG) systems is to produce outputs that are not only correct, but also diverse. For example, a dialog system (Adiwardana et al., 2020) should permit many responses for the prompt "How are you today?". Similarly, we expect diverse responses in NLG tasks such as story generation (Li et al., 2018), question generation (Pan et al., 2019) and abstractive question answering (Fan et al., 2019).
Despite growing effort to produce more diverse models (Li et al., 2016c,a; Holtzman et al., 2019; Du and Black, 2019), there is currently no standard evaluation metric for measuring model diversity. Thus, different papers evaluate diversity differently (if at all), making it difficult to fairly compare competing approaches (Hashimoto et al., 2019). Having a principled and consensual diversity evaluation metric is hence fundamental for advancing the field of NLG.
Figure 1: Our diversity metric evaluation framework checks the capability of metrics to capture different aspects of diversity. Presented are two sets of responses to the same question, generated by crowdsourcing workers. While both sets are diverse in terms of the form of the sentences, only set A is diverse in terms of content. Each graph presents the distribution over a diversity metric for sets with high content diversity (blue) and low content diversity (orange). Distributions are approximated over 200 sets such as the two presented. We observe that the human score metric (absHDS) separates the two distributions, while an n-gram based metric (distinct-n) fails, illustrating that n-gram metrics do not capture content diversity. The dotted lines correspond to the specific sets A and B presented above.
A key challenge in developing diversity evaluation metrics is that it is difficult to determine their efficacy. Unlike metrics for evaluating the quality of generated text, where one can measure the correlation between an automatic metric (such as BLEU (Papineni et al., 2002) or METEOR (Banerjee and Lavie, 2005)) and human judgement (Zhang et al., 2019a; Sagarkar et al., 2018), it is unknown whether humans can reliably estimate diversity.
In this paper, we propose a framework for evaluating diversity metrics (see Figure 2). We assume that a tester (human or model) is generating sets of sentences, conditioned on some diversity parameter that controls the diversity of the output sentences. We evaluate the diversity of the sentences using a proposed diversity metric, and measure the correlation between the proposed metric and the diversity parameter. High correlation indicates that the metric indeed captures how the diversity parameter affects the model output.
We instantiate this framework with two tests. In the decoding test, the tester is a neural generation model and the diversity parameter is a decoding parameter, such as softmax temperature (Ackley et al., 1985). This parameter controls the skewness of the distribution in every generated token, and is known to affect model diversity (Holtzman et al., 2019;Caccia et al., 2018). In the content test (see Figure 1), the tester is a human, and the diversity parameter is a binary variable, where the human is instructed to generate sets of sentences with either high or low diversity in content.
We evaluate three families of popular diversity metrics with these tests: (a) n-gram-based metrics that estimate diversity based on surface patterns in a set of generated sentences, (b) neural metrics: we propose a reduction from evaluating sentence similarity to evaluating diversity, then evaluate diversity using state-of-the-art sentence similarity models, and (c) human evaluation: we explore multiple ways in which humans can be asked to estimate diversity, resulting in multiple Human Diversity Score (HDS) variations.
We find that n-gram-based metrics succeed in detecting diversity that is driven by decoding parameters, suggesting that such parameters mostly control the form of generated text rather than its content. Conversely, n-gram-based metrics perform poorly in the content test. While neural metrics outperform n-gram-based metrics, we establish that humans are substantially better than any automatic metric at detecting content diversity. This is illustrated in Figure 1, where a human score clearly distinguishes between sets that have high (blue) and low (orange) content diversity, while n-gram-based metrics fail to do so.
To conclude, our main contributions are:
• A framework for evaluating diversity metrics.
• Tests instantiating this framework, measuring the sensitivity of metrics to content and form.
• Best practices for obtaining diversity evaluations from crowdsourcing workers.
• Establishing that humans outperform current automatic metrics in detecting content diversity.
• The collected data, test scores and code are publicly available, and can be used to easily compare new diversity metrics to existing results in our framework.

Background: Diversity Evaluation
Recently, interest in diversity in NLG has increased (Du and Black, 2019;Holtzman et al., 2019;Hashimoto et al., 2019;Dušek et al., 2020), resulting in multiple proposals for its evaluation. We describe recent approaches, highlighting the need for a standard way to evaluate metrics.
Perplexity is the standard metric in language modeling (LM), measuring the proximity of a LM, P LM , to the true distribution, P ref , by empirically approximating the cross-entropy H(P ref , P LM ) with held-out data sampled from P ref . Thus, perplexity captures diversity to some extent. For example, a dialog model that puts all probability mass on the output "I don't know" for any given context will obtain infinite perplexity once it encounters any other response. This property makes perplexity popular in LM-based NLG models, and often it is the only reported measure for diversity (Lewis et al., 2017; Fan et al., 2018; Wang et al., 2019; Li et al., 2019). However, perplexity does not purely measure diversity, and high perplexity does not entail low diversity. For example, a LM with a uniform distribution over the vocabulary for each decoded token has high diversity, but its perplexity will be extremely high, due to its low quality. Moreover, perplexity evaluates a LM, while the diversity of a NLG system is also strongly affected by the decoding procedure. For example, Top-k and nucleus sampling are popular decoding schemes that trade off quality and diversity by ignoring some of the LM probability mass (Holtzman et al., 2019).
Last, some NLG models, such as Generative Adversarial Networks (GANs) (Yu et al., 2017) are not based on a LM at all. While it is possible to approximate perplexity for such models (Tevet et al., 2019), a metric should ideally not be tied to model specifics.
N-gram-based metrics A popular metric is distinct n-grams (Li et al., 2016b), which computes the proportion of unique n-grams out of the total number of n-grams in a set of generated sentences. For example, distinct unigrams is the ratio of word types to word tokens, alluding to the richness of the vocabulary. Dušek et al. (2020) calculated Shannon entropy (Manning et al., 1999) based on different n-grams as a measure of lexical diversity. Self-BLEU (Zhu et al., 2018; Shu et al., 2019) measures the BLEU score of a generated sentence with respect to another generated sentence (rather than a gold reference). High average Self-BLEU indicates high similarity between generated sentences and low diversity. In §5 we expand this idea and suggest a reduction from any similarity metric to a diversity metric. By design, n-gram based metrics are sensitive to diversity in the form of language, rather than its meaning.
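As a concrete illustration of the distinct n-grams definition above, the following is a minimal sketch. The function names, the lowercasing, and the whitespace tokenization are assumptions made for the example; they are not taken from the original work.

```python
from typing import List, Tuple


def extract_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list (empty if too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(responses: List[str], n: int) -> float:
    """Proportion of unique n-grams out of all n-grams in a response set.

    Assumption: lowercased whitespace tokenization; a real setup may use
    a proper tokenizer.
    """
    all_ngrams: List[Tuple[str, ...]] = []
    for response in responses:
        all_ngrams.extend(extract_ngrams(response.lower().split(), n))
    if not all_ngrams:
        return 0.0  # no n-grams of this order in the set
    return len(set(all_ngrams)) / len(all_ngrams)
```

For instance, a set that repeats the same sentence twice yields a distinct unigram score of 0.5, whereas a set with no repeated words yields 1.0.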
Embedding-based metrics A new line of metrics embeds generated sentences in a latent space, then evaluates them in this space. Du and Black (2019) suggest clustering the embedded sentences with k-means, then using its inertia as a measure for diversity. Recently, Lai et al. (2020) suggested considering the volume induced by the embedded sentences as a diversity metric.
Human evaluation Yang et al. (2019) asked humans to evaluate the internal diversity of a generated essay. Ghandeharioun et al. (2019) let crowdsourcing workers interact with a dialog chat-bot, then asked them to evaluate the diversity of a single conversation. In contrast, this paper focuses on the diversity of different responses given a context, as in Zhang et al. (2019b).
To conclude, increasing interest in diversity resulted in multiple proposed diversity metrics. However, there is no consensus on how to evaluate diversity and what each metric actually measures.

Evaluating Diversity Metrics
We now describe our framework for evaluating diversity metrics. We note that diversity has many facets (see discussion in §7): for instance, a set of sentences can be diverse in terms of their content, while another may have similar content, but diverse form (see Figure 1). Our framework provides a way to evaluate metrics for different aspects of diversity under moderate assumptions. We define a diversity metric m div (S c ) ∈ R as a function that takes a set of generated responses S c as an input, and outputs a diversity score. Each response s ∈ S c is generated for the same input context c, hence S c is a sample from a generative distribution P gen (s | c). The overall diversity score of a generative model can be obtained by averaging m div over sets S c sampled from the model given multiple contexts c ∈ C.
Figure 2: An overview of our diversity metrics evaluation framework. The tester (machine or human) generates a response set (S c,d ) given a diversity parameter (d) and a context (c). The test score of a metric m div is the correlation between the metric score for S c,d and d. (Diagram example: the context "How are you today?" with the responses "Very good!", "Fine thank you." and "Couldn't be better."; the test score is ρ(m div , d).)
To evaluate m div (·), our framework assumes access to some deterministic diversity parameter d that controls an aspect of diversity in S c . Our framework tests the relation between m div and the parameter d. By varying d and measuring m div , we can compute the correlation ρ between m div and an aspect of diversity, represented by d. Because our goal is to measure the ability of metrics to rank the diversity level of generated text, we use Spearman's ρ rank correlation as our test score. Figure 2 illustrates the flow of a test in our framework.
In practice, to control the diversity level of S c using d, we use a tester: a generative model that takes a context c and a diversity parameter d as input, and outputs a response set S c,d . We stress that the tester can be either a neural model or a human. A good tester should reliably represent the diversity level quantified by d.
As a hypothetical example, c can be a movie name and d can represent sentiment diversity, that is, the number of different sentiments in a collection of generated reviews S c about that movie. A human tester can observe c and d, and produce reviews accordingly (such data can be easily mined from IMDB). A collection of such (d, S c,d ) makes a test, in which Spearman's ρ correlation between m div (S c,d ) and d is a measure for the sensitivity of m div to sentiment diversity.
We note that perplexity cannot be evaluated as a diversity metric in our framework, because it requires a sample from P ref , while we assume a response set sampled from P gen .
We now describe two tests that instantiate this framework, roughly corresponding to the two main aspects of diversity: form diversity and content diversity.

Decoding Test
The diversity of a NLG system constructed from a LM and a decoder is dependent on the decoding scheme. For example, beam search approximates the most probable output, and thus dramatically reduces diversity. Conversely, pure sampling from the LM distribution leads to high diversity, but low quality output (Holtzman et al., 2019).
Consequently, a popular method to control diversity in NLG systems is to vary some decoding parameter. Variations include (a) softmax temperature (Ackley et al., 1985), where a temperature parameter τ controls the skewness of the softmax distribution at each step, (b) Nucleus (Top-p) sampling (Holtzman et al., 2019), where one samples at each step from the minimal set of most probable tokens whose cumulative probability is at least p, and (c) Top-k sampling, which samples from the top-k most probable tokens at each step. All methods skew the LM distribution in a way that avoids low-probability tokens and leads to higher quality (Holtzman et al., 2019), providing a decoding parameter that trades off quality and diversity (Caccia et al., 2018).
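To make the three decoding parameters concrete, the sketch below applies each of them to a single next-token distribution. It is a toy illustration over a dictionary of token scores, not the decoding code of any particular model; all function names are hypothetical.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], tau: float) -> Dict[str, float]:
    """Softmax with temperature tau: tau < 1 sharpens, tau > 1 flattens."""
    scaled = {tok: score / tau for tok, score in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - top) for tok, score in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def renormalize(probs: Dict[str, float]) -> Dict[str, float]:
    """Rescale the kept probabilities so that they sum to one."""
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep only the k most probable tokens."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return renormalize(dict(ranked[:k]))


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the minimal set of most probable tokens with cumulative mass >= p."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)
```

A token can then be drawn from the resulting distribution, for example with `random.choices(list(dist), weights=list(dist.values()))`.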
In the decoding test (decTest), we define the tester to be a strong LM, such as GPT-2 (Radford et al., 2019), and the diversity parameter d to be a decoding parameter such as temperature. We check how different diversity metrics m div correlate with decoding parameters. This can shed light both on the quality of the metrics and on how decoding parameters actually affect the output of a NLG system.

Content Test
In the content test (conTest), our goal is to evaluate how different diversity metrics capture the notion of content diversity, that is, whether a set of responses are diverse in terms of their content. Measuring content diversity requires deep understanding of the semantics of responses in S c .
To isolate content diversity from form diversity, we aim to generate sets of responses with a similar level of form diversity, but where the level of content diversity is controlled by the diversity parameter d. To do this, we use crowdsourcing workers as testers, and a binary diversity parameter d ∈ {0, 1}, corresponding to low or high content diversity. A worker observes a context c and produces a set of responses S c based on the value of d. We encourage workers to use different words and phrases in different responses regardless of the value of d, such that form diversity is generally high in all examples. Examples from this data are presented in Figure 1 and Appendix B.
In §6, we will focus on whether automatic diversity metrics can perform as well as humans on the task of estimating content diversity.

Human Diversity Score
One of the core questions we tackle is: Can humans evaluate diversity reliably?
Although a few papers (Ghandeharioun et al., 2019; Yang et al., 2019; Zhang et al., 2019b) asked humans to evaluate the diversity of their models, to the best of our knowledge no work thoroughly investigated this question. The importance of this question is clear when comparing to quality evaluation in NLG systems. There, human judgment is considered the gold standard, and automatic quality metrics are established by showing high correlation with human score. Thus, understanding whether humans can reliably judge diversity is important for improving diversity metrics. In this work, we use crowdsourcing workers to compute a human diversity score: we show workers a context followed by a set of generated responses, and ask them to rate the diversity of the set.
To establish best practices, we experiment with multiple variations of HDS (detailed in §6.2), asking humans to rate the diversity of a response set, and then evaluating each practice with our framework. We focus on the following questions and present results in §6:
• Should humans rate the absolute diversity score of a set of sentences or only rank whether one set is more diverse than another? (tl;dr: absolute scoring is more informative but rank scoring is moderately easier for humans.)
• Should humans rate diversity of a set or similarity between pairs in the set, from which diversity can be inferred? (tl;dr: diversity)
• Can humans evaluate different aspects of diversity well? (tl;dr: not effectively)
As a preliminary step, we conducted pilot experiments among a group of NLP graduate students. The main insights were: (a) Humans are biased toward quality. For example, if a generated set has high diversity but low quality, humans will rate diversity lower than if the quality of the samples was higher. To neutralize this effect, we explicitly ask workers to evaluate the quality of one of the responses in the set S c , and then instruct them to ignore quality in the diversity questions; (b) To make sure a worker reads the context c, we ask them to generate a sentence s before having them rate the diversity of a response set; (c) It is difficult for workers to evaluate the diversity of a set with more than 10 responses. Our crowdsourcing tasks are provided in Appendix A.

Diversity to Similarity Reduction
We expand the idea introduced by Zhu et al. (2018) and suggest a method to construct a diversity metric from any 2-sentence similarity metric.
Given m sim (s 1 , s 2 ) ∈ R, a symmetric similarity metric that gets a pair of input sentences (s 1 , s 2 ) and returns a similarity score, we can define a diversity metric m div as the negation of the mean similarity score across all (unordered) pairs of S c :
$$m_{div}(S_c) = -\frac{1}{\binom{|S_c|}{2}} \sum_{s_i, s_j \in S_c,\; i > j} m_{sim}(s_i, s_j)$$
This reduction allows us to easily define new diversity metrics based on past work on sentence similarity (Gomaa et al., 2013; Devlin et al., 2019; Zhang et al., 2019a; Reimers and Gurevych, 2019). In §6 we show that both n-gram-based similarity metrics and neural semantic similarity metrics provide useful diversity metrics.
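The reduction can be expressed as a small higher-order function. The sketch below is an illustration under stated assumptions: `similarity_to_diversity` and the toy `jaccard_similarity` are hypothetical names, and the Jaccard word overlap merely stands in for whichever symmetric similarity metric one wishes to plug in.

```python
from itertools import combinations
from typing import Callable, Sequence


def similarity_to_diversity(
    m_sim: Callable[[str, str], float],
) -> Callable[[Sequence[str]], float]:
    """Turn a symmetric pairwise similarity into a set-level diversity metric.

    The diversity is the negated mean similarity over all unordered pairs.
    """
    def m_div(responses: Sequence[str]) -> float:
        pairs = list(combinations(responses, 2))
        if not pairs:
            raise ValueError("a response set needs at least two responses")
        return -sum(m_sim(a, b) for a, b in pairs) / len(pairs)

    return m_div


def jaccard_similarity(s1: str, s2: str) -> float:
    """Toy similarity (an assumption for the example): word-set Jaccard overlap."""
    words1, words2 = set(s1.lower().split()), set(s2.lower().split())
    union = words1 | words2
    return len(words1 & words2) / len(union) if union else 1.0
```

With this toy similarity, a set of identical responses receives the minimal score of -1, and a set of responses with no shared words receives 0.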

Experiments
We now turn to our empirical investigation.

NLG Tasks
We apply our evaluation procedure on three different NLG tasks (in English), in which diversity is essential.
• Story completion (storyGen): We use the ROC Stories dataset (Mostafazadeh et al., 2016), in which the context c is the first four sentences of a story, and the response s is a single sentence that ends the story. We use the contexts C from this data and generate response sets S c for each context using our testers. The long contexts characterizing this data narrow down the space of possible responses, making this a "low-entropy" generation task, where the output is constrained, but diversity is still essential.
• Dialog response generation (respGen): A comment-response pairs dataset extracted from the website reddit.com and pre-processed by Hashimoto et al. (2019). We use the comments from their data as contexts C and generate response sets S c for each context using our testers. Since comments are single sentences, the response is less constrained, making this a "medium-entropy" generation task.
• 3-word prompt completion (promptGen): Contexts C are 3-word prompts, extracted from the Cornell Movie-Dialogs Corpus (Danescu-Niculescu-Mizil and Lee, 2011) by taking the first three words from each original context. The response sets S c are completions of the prompts, generated by our testers. This context provides minimal constraints, making this a "high-entropy" generation task.
Samples of the contexts extracted for each task, along with generated response sets, are presented in Appendix B. We intentionally avoid NLG tasks where diversity is not necessarily desired, such as summarization and machine translation.

Evaluated Metrics
N-gram-based metrics We evaluate distinct n-grams (distinct-n), as described in §2. We also evaluate n-gram cosine similarity (cos-sim): a similarity measure computing the cosine between the vectors representing two sentences, where each vector is a count vector over the n-grams that appear in the response. We use the reduction from §5 to convert this to a diversity measure. In both metrics, rather than choosing the order of the n-grams, we average over n ∈ {1, . . . , 5}, which we found to outperform any single choice of n.
Neural metrics We use the cosine similarity between the embeddings of two responses as a similarity metric. In our experiments we used bert-large-nli-stsb-mean-tokens.
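As an illustration of the n-gram cosine similarity (cos-sim) described above, the following sketch builds n-gram count vectors for two sentences and averages their cosine over n ∈ {1, ..., 5}. The whitespace tokenization, the function names, and the choice to score an order with no n-grams as 0 are assumptions for the example.

```python
import math
from collections import Counter


def ngram_counts(sentence: str, n: int) -> Counter:
    """Count vector over the n-grams of a sentence (whitespace tokens)."""
    tokens = sentence.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(counts1: Counter, counts2: Counter) -> float:
    """Cosine between two sparse count vectors; 0 if either is empty."""
    dot = sum(counts1[g] * counts2[g] for g in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(v * v for v in counts1.values()))
    norm2 = math.sqrt(sum(v * v for v in counts2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def ngram_cos_sim(s1: str, s2: str, max_n: int = 5) -> float:
    """Average n-gram cosine similarity over n = 1..max_n."""
    total = sum(
        cosine(ngram_counts(s1, n), ngram_counts(s2, n))
        for n in range(1, max_n + 1)
    )
    return total / max_n
```

Because the reduction from §5 is a mean over pairs, averaging the similarity over n before applying the reduction gives the same value as averaging the per-n diversity scores.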
Human Metrics We examine four methods for evaluating diversity with humans (see §4), to investigate best practices for obtaining diversity judgments from humans. In all metrics (except ranking), ratings are from 5 (highest diversity/similarity) to 1 (lowest). The original tasks presented to workers are in Appendix A.
• Absolute HDS (absHDS): Given a context c and a set of generated responses S c , rate the level of diversity of S c .
• Ranking HDS (rnkHDS): Given a context c and two sets S c,d 1 , S c,d 2 generated with different values of the diversity parameter d, indicate which set is more diverse.
• Similarity HDS (simHDS): Given a context c and a set of generated responses S c , rate the similarity of each pair of sentences in S c , and then apply the reduction from §5.
• Aspects HDS (aspHDS): Identical to absHDS, except we explicitly ask about a specific aspect of diversity, either form or content.
Context: Fire next door. John woke up smelling like something was burning. He went outside. He saw the fire next door. He called the authorities.
Response set (τ = 0.25):
• It was a minor fire and they put it out.
• It was a fire.
• It was a fire.
• It was a fire.
• It was a fire.
Response set (τ = 0.8):
• They arrived and put out the fire.
• It was a fire.
• It was a fire.
• It turned out to be a fire.
• It was a minor fire night.
Response set (τ = 1.1):
• It turned out to be a mechanic.
• Before the fire was put out it was a fire.
• It was a fire.
• They co-worker matter how bad the fire was.
• Several shells, the fire department came just in time.

Decoding Test
In decTest we measure the correlation between diversity metrics (m div ) and the softmax temperature decoding parameter (d). The tester generating the response sets (S c ) is a neural NLG model.
For each HDS metric, we collected 10 ratings per query from Amazon Mechanical Turk (AMT) workers. Whereas absHDS demands one query per response set, in order to perform simHDS at a reasonable cost, we chose |S c | = 5 (the first half of the original set), resulting in $\binom{5}{2} = 10$ crowdsourcing queries instead of $\binom{10}{2} = 45$ per set.
Absolute scoring results Table 2 presents the results of absHDS, simHDS, as well as all automatic metrics. In general, n-gram based metrics succeed in capturing the diversity induced by a temperature sweep, beating HDS and neural metrics. Figure 3 provides a more detailed analysis, where each point represents a single set of responses generated at some temperature. We observe that while rank correlation for cosine similarity is high, it is far from linear and reaches high values even at low temperatures, scoring 0.6 Pearson correlation. Conversely, the correlation for BERT-STS and absHDS is more linear, scoring 0.75 and 0.77 Pearson correlation respectively. Thus, Pearson and Spearman correlations disagree in this case on the quality of the different metrics. This result shows that humans perform worse than automatic metrics in this experimental setup, hinting that temperature mostly controls superficial changes to the generated text. Additionally, simHDS performs worse than absHDS although it is 3x more expensive, showing that rating the entire set rather than averaging over pairs is useful.
Ranking results To examine whether we can improve correlation by asking humans to rank whether one set is more diverse than another, rather than providing an absolute score, we conduct a ranking experiment. Each context is given along with two sets (5 samples each), produced with different temperature values. We sweep over temperature differences instead of the absolute temperature values. The human metric in this setting is rnkHDS (see §6.2), and the automatic metrics are the difference between the scores of the two sets. We report two measures. The first is Spearman's ρ between the metric and the temperature difference. The second is accuracy, i.e., whether the metric can predict which set has higher temperature (e.g., in automatic metrics this is whether the sign of the temperature difference and the sign of metric score difference agree). Table 3 summarizes the ranking test results. We observe that humans are better at ranking compared to giving absolute scores, and do as well as automatic metrics. However, the scores of all automatic metrics also improve, making it difficult to separate between the different metrics.
Table 3: decTest ranking results: Spearman's (ρ) correlation between temperature differences and each metric score. Accuracy (acc) of classifying which set has the higher temperature. Standard deviation is up to 0.02 for all automatic metrics for both Spearman's correlation and accuracy.
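The accuracy measure used in the ranking experiment can be sketched as a sign-agreement check. The function name is hypothetical, and two choices are assumptions for the example because the text does not specify them: pairs with zero temperature difference are skipped, and a metric difference of exactly zero counts as a wrong prediction.

```python
from typing import Sequence


def ranking_accuracy(temperature_diffs: Sequence[float],
                     metric_diffs: Sequence[float]) -> float:
    """Fraction of set pairs where the metric picks the higher-temperature set.

    Each entry is (value for set 1) - (value for set 2) for one pair of sets.
    """
    # Assumption: pairs generated at equal temperature carry no label.
    pairs = [(t, m) for t, m in zip(temperature_diffs, metric_diffs) if t != 0]
    if not pairs:
        raise ValueError("no pairs with a non-zero temperature difference")
    # Assumption: a tie in the metric (m == 0) is counted as incorrect.
    correct = sum(1 for t, m in pairs if m != 0 and (t > 0) == (m > 0))
    return correct / len(pairs)
```

For example, with temperature differences `[0.3, -0.5, 0.2, -0.1]` and metric differences `[0.1, -0.2, -0.05, -0.3]`, three of the four signs agree, giving an accuracy of 0.75.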
Other decoding parameters To examine the robustness of our conclusions to other decoding parameters, we repeat the test with two additional decoding methods: (a) in Nucleus (Top-p) sampling we swept linearly over 100 values of p in the range [0.1, 1.0]; (b) in Top-k sampling we swept k in logarithmic scale over 100 values in the range [1, 30K] and present the correlation between the metrics and log 10 (k). While softmax temperature enables skewing P LM to a more diverse P gen using τ > 1, both Top-p and Top-k enable only skewing P LM to a sharper (hence less diverse) P gen . Table 4 presents results for all automatic metrics using the three decoding methods over promptGen. Although the correlation in Top-k is significantly lower, and the variance is higher, all three decoding methods reflect a similar ordering between the metrics. Results for other tasks are in Appendix C.
[...] Spearman's ρ (mean and standard deviation) of automatic metrics for promptGen.

Content Test
In conTest, we measure the correlation between diversity metrics (m div ) and content diversity, represented by a binary parameter d ∈ {0, 1}. The testers are AMT workers, guided to create sets with a high level of form diversity and high or low content diversity according to d.
Data and settings For each task, we collected 200 sets of 5 responses each (100 sets per class). For the high content diversity class, we asked the workers to give 5 responses for a context, with as different content and structure as possible. Then we asked the same workers to choose a single response they wrote, and rephrase it 5 times such that the original content is preserved while the form changes; this set is used for the low content diversity class. A sample from this data is in Figure 1 and more samples are in Appendix B. For each HDS metric, we collected 10 ratings from crowdsourcing workers, different from the ones who composed the sets.

Results
In addition to Spearman's ρ between m div and d, we report the optimal single-threshold classifier accuracy (OCA), that is, the best accuracy that can be achieved in predicting the class of a response set (high or low content diversity) given any threshold η on m div , such that if m div (S c ) > η the classifier predicts high diversity, and otherwise predicts low diversity. Table 5 shows the test results. This time, n-gram-based metrics perform poorly, indicating they do not measure content diversity well. Neural models perform better than n-gram-based metrics (especially sent-BERT), but there is still a clear gap between automatic metrics and humans. Figure 4 illustrates the typical distributions of n-gram, neural and human metrics. Clearly, HDS separates high and low content diversity much better than neural metrics. In addition, n-gram-based metrics saturate both classes to near maximal values, similarly to decTest.
Since conTest isolates content diversity, we used aspHDS to ask workers to directly rate content diversity and form diversity. Content aspHDS gets similar scores to absHDS, implying that there is no additional gain in asking directly about the tested aspect. Form aspHDS gets substantially lower scores compared to absHDS, validating that the form diversity of the two classes is similar.
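The OCA definition above lends itself to a short exhaustive search over thresholds. The sketch below is an illustration rather than the released implementation; the function name and return format are assumptions. Only minus infinity and the observed score values need to be tried as η, since any other threshold induces the same split as one of these.

```python
from typing import Optional, Sequence, Tuple


def optimal_classifier_accuracy(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[float, Optional[float]]:
    """Best accuracy of the rule 'predict high diversity iff score > eta'.

    labels: 1 for high content diversity, 0 for low content diversity.
    Returns (best accuracy, a threshold eta achieving it).
    """
    candidates = [float("-inf")] + sorted(set(scores))
    best_accuracy, best_eta = -1.0, None
    for eta in candidates:
        correct = sum(
            1 for score, label in zip(scores, labels)
            if (score > eta) == (label == 1)
        )
        accuracy = correct / len(scores)
        if accuracy > best_accuracy:
            best_accuracy, best_eta = accuracy, eta
    return best_accuracy, best_eta
```

For perfectly separated classes, such as scores `[0.1, 0.2, 0.8, 0.9]` with labels `[0, 0, 1, 1]`, the search returns an accuracy of 1.0 with η = 0.2.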

HDS Stability: Picking Parameter Values
HDS experiments demand expensive human labor. Thus, we need to carefully choose the number of sets and the number of different ratings we ask per set, to get reliable results within a reasonable budget. In Figure 5 we measure HDS results for different numbers of sets and different numbers of ratings. Empirically, the test results are stable starting from 7 ratings and 150 sets. Hence, we used 10 ratings and 200 sets for HDS experiments.
Figure 4: conTest: histograms of metric values of n-gram (distinct n-grams), neural (BERT-Score) and human (absHDS) metrics for promptGen. The orange histogram represents the distribution of the low content diversity class, the blue histogram represents the distribution of the high content diversity class and brown is the intersection between the two. Downward-pointing triangles represent the threshold η of the optimal classifiers. The histograms show how each metric separates the two classes.
[...] set's class (1: high content diversity, 0: low content diversity) and each metric score. The optimal classifier accuracy (OCA) between the two classes over the metrics' score.

Aspects of Diversity
In this work, we focused on the two primary aspects of diversity: content diversity (What to say?) and form diversity (How to say it?). In Figure 1, both sets are diverse, but Set B is only form diverse, as all answers deliver the same message, whereas Set A is diverse in both form and content.
Furthermore, we can observe aspects of diversity as having a tree-like structure, where both content and form diversity can be divided into sub-aspects: Content diversity (e.g. answering the question "How are you today?") can be expressed by using different sentiment ("I'm doing good." vs. "I'm so glad you asked! I'm really doing good."), different relevance ("I'm fine" vs. "Did you see the game last night?"), and more. Form diversity can be divided into sub-aspects as well: syntactic diversity ("Someone took it from me." vs. "It was taken from me.") or lexical diversity ("I feel fine." vs. "I feel very well."). Even those sub-aspects can be further divided. For example, a sub-aspect of lexical diversity is register diversity ("How are you?" vs. "Sup bro?").
Another observation is that different aspects are not orthogonal, that is, changing one aspect may lead to changes in other aspects. Specifically, we observe that while it is relatively easy to produce high form diversity with low content diversity (Set B in Figure 1), it is almost impossible to diversify content without changing form. This observation was important during the design of conTest.

Conclusions
This work presents a novel framework for evaluating diversity metrics as a step toward standardized evaluation. We limit the scope of this work to the differences between form and content diversity, which we consider key towards understanding the different aspects of diversity. Future work can explore other sub-aspects of diversity as detailed in §7, e.g., testing sentiment diversity, as proposed in §3. We urge researchers to use this framework as a platform for developing new diversity metrics and establishing their efficacy.

A HDS Questionnaires
All human scores for HDS metrics were collected on the AMT crowdsourcing platform from native English-speaking workers who were specifically qualified for this task. Figure 6 presents the warm-up part, common to all HDS questionnaires. Before asking workers to rate the diversity of each set, we first asked them to generate a response for the context themselves, to make sure they read it. To neutralize the effect of the responses' quality on the workers, we also asked the workers to rate the quality of the first response in the set, then explicitly instructed them to ignore quality when rating diversity. Figures 7 to 10 present the diversity questions of absHDS, aspHDS, rnkHDS and simHDS as they appeared in the AMT questionnaires.
Costs. For HDS metrics that require one query per response set (i.e., absHDS, rnkHDS, aspHDS), the cost of a single rating was $0.18. We collected 10 ratings per response set and conducted each experiment with 200 sets, hence the total cost of an experiment was $360. In the case of simHDS, the response set size was 5, and the number of queries needed per set is \(\binom{5}{2} = 10\), one per pair of responses. The cost of a single rating for this task was $0.056, and with the same multipliers, the total cost of an experiment was $1120, roughly three times more expensive.
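The cost arithmetic can be checked with a short sketch. The function name and its parameters are hypothetical helpers introduced here for illustration; the prices, the 10 ratings per query, the 200 sets, and the set size of 5 are the figures reported above.

```python
from math import comb


def experiment_cost(cost_per_rating, ratings_per_query, num_sets, queries_per_set=1):
    """Total crowdsourcing cost (in dollars) of one experiment."""
    return cost_per_rating * ratings_per_query * num_sets * queries_per_set


# absHDS, rnkHDS, aspHDS: a single query per response set.
per_set_cost = experiment_cost(0.18, ratings_per_query=10, num_sets=200)

# simHDS: one query per unordered pair among the 5 responses of a set.
pairs_per_set = comb(5, 2)
sim_cost = experiment_cost(0.056, ratings_per_query=10, num_sets=200,
                           queries_per_set=pairs_per_set)

print(f"per-set metrics: ${per_set_cost:.0f}")
print(f"simHDS: {pairs_per_set} queries per set, ${sim_cost:.0f}")
print(f"ratio: {sim_cost / per_set_cost:.2f}")
```

The ratio comes out at about 3.1, which is the "three times" figure quoted above.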

B.1 Decoding Test (decTest)
Tables 6 to 14 present data samples from storyGen, respGen and promptGen with the neural testers of decTest, as detailed in §6. Each table presents two contexts and three response sets per context. Each response set was generated with a different value of the decoding parameter for the three decoding methods: softmax temperature, Nucleus sampling, and Top-k.
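For readers less familiar with the three decoding parameters, the sketch below shows how each one reshapes a next-token distribution before sampling. It is a toy illustration in pure Python over a hypothetical four-token vocabulary; the function names and the logits are assumptions made for this example and do not reproduce the generation code of the neural testers.

```python
import math
import random


def temperature_softmax(logits, tau):
    """Turn logits into probabilities; a smaller tau concentrates mass on the top token."""
    scaled = [x / tau for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero out tokens outside `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Top-k: keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, set(order[:k]))


def nucleus_filter(probs, p):
    """Nucleus sampling: keep the smallest top set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return _renormalize(probs, keep)


def sample_token(probs, rng):
    """Draw one token index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token scores
base = temperature_softmax(toy_logits, tau=1.0)
rng = random.Random(0)
variants = [
    ("tau = 0.25", temperature_softmax(toy_logits, tau=0.25)),
    ("Top-k, k = 2", top_k_filter(base, k=2)),
    ("nucleus, p = 0.8", nucleus_filter(base, p=0.8)),
]
for name, probs in variants:
    print(name, [round(q, 3) for q in probs], "sampled:", sample_token(probs, rng))
```

In all three cases a more restrictive parameter value (lower τ, smaller k, smaller p) removes probability mass from unlikely tokens, which is why the low-parameter response sets in the tables are highly repetitive.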

B.2 Content Test (conTest)
Tables 15 to 17 present data samples from storyGen, respGen and promptGen with the human testers of conTest, as detailed in §6. Each table presents two contexts and two response sets per context: one for the low content diversity class and one for the high content diversity class.

C Additional Results
Compared with the other tasks (Table 2), the decTest results for storyGen show noisier scores for all metrics (Figures 3 and 11), hence lower ρ values and higher variance. A possible explanation is a larger effect of c on the distribution \(P_{gen}(s|c)\) in this task.
Tables 4, 18 and 19 present decTest absolute scoring experiment using temperature, nucleus sampling and Top-k decoding parameters as d.
Top-k consistently yields lower ρ compared to other decoding parameters, especially for the storyGen task. This implies that Top-k represents diversity less reliably than other methods.
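As a reminder of what the ρ values measure, the following sketch computes Spearman's rank correlation between a diversity parameter d and a metric's scores, using only the standard library. The helper names and the toy scores are hypothetical and serve only to illustrate the computation; they are not results from our experiments.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of the 1-based positions start..end
        for j in range(start, end + 1):
            ranks[order[j]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y (undefined if either is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical example: one temperature value per response set and made-up metric scores.
d = [0.25, 0.25, 0.8, 0.8, 1.1, 1.1]
metric_scores = [0.21, 0.35, 0.30, 0.62, 0.58, 0.90]
print("Spearman's rho:", round(spearman_rho(d, metric_scores), 3))
```

A metric that tracks the diversity parameter well yields ρ close to 1, whereas noisy scores, as observed for storyGen and for Top-k, pull ρ towards 0.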

Context: Fire next door. John woke up smelling like something was burning. He went outside. He saw the fire next door. He called the authorities.
Response set (τ = 0.25):
• It was a minor fire and they put it out.
• It was a fire.
• It was a fire.
• It was a fire.
• It was a fire.
• It was a fire.
• It was a minor fire.
• It was a fire.
• It was a fire.
• It was a minor fire and they put it out.
Response set (τ = 0.8):
• They arrived and put out the fire.
• It was a fire.
• It was a fire.
• It turned out to be a fire.
• It was a minor fire night.
• They arrived and put it out.
• It was a scary but beautiful fire next time he went in.
• It was a fire.
• It was a put out and John was able to put it out.
• It was a scary place to live.
Response set (τ = 1.1):
• It turned out to be a mechanic.
• Before the fire was put out it was a fire.
• It was a fire.
• They co-worker matter how bad the fire was.
• Several shells, the fire department came just in time.
• They arrived and put out the fire.
• It turned out to be a clear, fact.
• It was a fire.
• It was a scary meal for wet clothes.
• It was a scary stealing from him.

Context: Charles Shaves His Chest. Charles decides he no longer wants chest hair. He shaves his chest hair with a razor. Although his skin is irritated, he hair is now gone. Charles is satisfied.
Response set (τ = 0.25):
• He is glad that he finally shaved.
• He is glad that he finally shaved.
• He is very happy to have a chest hair.
• He is glad that he finally shaved.
• He is glad that he finally shaved.
• He is glad that he finally shaved.
• He is glad that he finally shaved.
• He is glad that he finally shaved.
• He is glad that he finally shaved.
• He is glad that he finally shaved.
Response set (τ = 0.8):
• He is very happy to have a chest hair.
• He is glad that he finally shaved.
• He is very happy that he no longer has chest hair.
• He is glad that he finally shaved.
• He is glad that he finally shaved.
• He is glad that his chest hair is no longer shaved.
• He is glad that he finally shaved
• He is glad that he finally shaved.
• He is glad that he finally of shaving.
• He is glad that he finally shashock his chest hair.
Response set (τ = 1.1):
• He is glad that he finally peaceful his voice again.
• He is glad that he finally shaavoided his chest hair.
• He is glad that he finally downcome.
• He is ready to start better cooperation his fireplace.
• He is glad that he finally shaved.
• He is glad that he no longer has chest hair.
• He is happy to finally be shaved.
• He is very happy to have ster good ride of his life.
• He is glad that he finally shaved.
• He is glad that he finally has chest hair.

When it arrived, it was hot and smelled wonderful. We devoured it with gusto.
• After a few hours of take it home we all enjoyed its night.
• After a few hours of take it home we all enjoyed its night.
• After a few hours of take it home we all enjoyed its night.
• After a few hours of eating everyone was satisfied.
• After a few hours of take it home we all enjoyed its night.
• After a few hours of eating everyone was satisfied.
• After a few hours of take it home we all enjoyed its night.
• After a few hours of take it home we all enjoyed its night.
• After a few hours of take it home we all enjoyed its night.
• After a few hours of eating everyone was satisfied.

• After dinner, we all went home to cook Mexican food.
• After a few hours of cooking she was tired and ready to eat.
• After dinner, I always put got ready for Christmas.
• After dinner, I helped her do the dishes.
• After a few hours of dinner, the food was amazing.
• After a few hours of take it home we all enjoyed margaria.
• After dinner, I was ready to take on work the next day of
• After dinner, I was sad to say goodbye to her
• After a few hours of take it home we all enjoyed one bite.
• After a few hours of eating everyone was satisfied.

• Even though my stomach was gone, I was sad it was finally pockets
• After dinner, I alone.
• All in all the family while my mom finished the food.
• After a few hours of dancing, she pianed.
• Afterwards I'd never finish single night a week.
• Afterwards we all went to sleep in my woods.
• Afterwards I helped her do much better than my wife.
• Afterwards my mom helped me finish my dinner.
• After a few hours we all enjoyed storm blood from the Italy.
• After dinner, we all enjoyed some good food together.
Table 7: decTest data samples for storyGen task and different p values (nucleus sampling).

Context: Loud Noise. Jane was trying to take a nap. She heard a loud bang in the kitchen. It woke her up. A dish had fallen off the counter.
Response set (k = 3):
• Jane pulled over and started to clean herself.
• Jane was horrified and dropped her favorite food.
• Jane was able to finish her car and take a nap.
• Jane was able to finish her nap.
• Jane was able to finish her nap.
• Jane stopped at the store to buy a new one.
• Jane was able to grab a nap afterwards.
• Jane was unable to finish her nap since it was lost.
• Jane pulled over and started to clean up.
• Jane was able to finish her nap.
Response set (k = 32):
• Jane was able to catch a car using the seat.
• Jane stopped at the store to buy a new book.
• Jane was sad her cat dropped out of the kitchen.
• Jane screamed.
• Jane was horrified to find her car broken down on the floor.
• Jane was horrified and dropped her pay phone.
• Jane was easily able to grab a nap.
• Jane pulled over and started to cry.
• Jane pulled over and started to cry.
• Jane stopped at the store to buy a new dish from the store.
Response set (k = 318):
• Jane comes, noticed a lot of food left under it.
• Jane was horrified and dropped her book.
• Jane remembered to take a nap.
• Jane was since she took a nap while she waited for the refun
• Jane knew she had no time to finish her book.
• Jane was glad.
• Jane was annoyed and began to cry.
• Jane stopped at the store to buy a new one.
• Jane wanted to have her car back.
• Jane was monthed.

Context: Headache. Kate was wearing big over the ear headphones. But they were tight and squeezing her head. She tried to adjust them to relieve the tension. But nothing really worked.
Response set (k = 3):
• Kate decided to go to the store and buy some ear phones.
• She decided to go to the store and buy some headphones instead.
• She decided to go to the store and buy some ear phones.
• She decided to go buy a pair of headphones instead.
• She decided to go to the store and buy some headphones instead.
• She decided to go to the store and buy some headphones instead.
• She decided to go to the doctor and have some rest.
• Kate decided to go to the store and buy some headphones instead.
• She decided to go to the store and buy some ear phones.
• She decided to go buy some ear plugs.
Response set (k = 32):
• She decided to go buy a hat so she could enjoy the the long lin
• Kate decided to go to the store to buy some candy since she was ti
• Kate decided to go outside to rest.
• Kate decided to go to the store and buy some ran phones.
• Kate decided to go to the store and buy some headphones instead.
• Kate decided to go back to her old dishes.
• She decided to go buy a big pair of headphones instead.
• Kate decided to go to the store and wearing some headphones.
• She was forced to go to the store to buy some cash.
• She decided to go to the store and buy some headphones instead.
Response set (k = 318):
• She decided to go to a ess instead.
• Kate decided to go to the store and buy some fake headphones.
• She decided to go to the store and buy some ear phones.
• She decided to go to the store and buy some headphones instead.
• She decided to go to the store and buy some headphones instead.
• She decided to go buy some ear phones.
• Kate decided to go to the gym and use some saw no more watching T
• Kate decided to go to the store and buy some cute phones.
• She decided to go buy a fake headphones instead.
• She decided to go to the store and buy some concert.
Table 8: decTest data samples for storyGen task and different k values (Top-k).

Context: it is doing okay i just want people to talk more
Response set (τ = 0.25):
• i think you mean people
• i think its working now
• i just want people to talk more
• i think i just want people to talk more
• i dont think you understand what youre saying
• i like to think so
• i know that feel
• i dont know what to say
• i think its pretty cool
• i think its working now
Response set (τ = 0.8):
• i cant believe you didnt think this through i actually havent had
• may i ask why
• i just hope its not going to happen
• its alright dont even want to talk
• what did it say i was using windows 95 anyway xd
• thanks for reading the thread
• i understand and hope they dont mind that
• want to talk more
• its doing ok body language great voice acting and music
• all good i know your pain
Response set (τ = 1.1):
• im just hawaii everyone wants to be friends lt go talk to people from
• its pretty smart we didnt run a year ago and have turned back but
• id lick your teeth but tha t sucks smart people
• it really didnt stop i think you need to go back to normal now
• sounds like a plan to me
• we we can try forever practice beer
• okay but another two more 200 radio views im going to ore my comment of

Context: watching curry play in his prime is truly a privilege
Response set (k = 3):
• i know i just dont want him to play for us
• he has to be a good center for that
• he is a great center of football in his prime
• hes been playing in his prime for a long time
• he was a great back in the day
• hes been playing for a while now
• i dont know about that he was pretty damn good at that
• i dont think he was ever in his prime
• i dont think he is a prime minister
• i dont know why but i think he is a very good player and
Response set (k = 32):
• and his career as well
• agreed the way hes playing is awesome
• it has to be
• this is just called a job
• and then being on the field for the first time
• i dont see him doing that often enough
• he just likes to party in the kitchen
• at this point hes going to be a great star for the rest of the
• only if he pays well
• the only thing that can make that kind of difference is how much time you
Response set (k = 318):
• yeah my feeling i mean we dont like it but it happens all the
• you are one for real
• they still have a rule saying they might not be injured yet
• it really is a necessary thing to do
• finally some reason to continue watching him at some point
• yet that would be epic
• not to mention eating curry dinner is a privilege
• i just dont want to turn over for this goal like he does in
• gt playing in his prime is truly a privilege ftfy
• so is saying he is in high school

Context: im going this evening when she usually works
Response set (k = 3):
• i think you accidentally a word
• you are a good man
• i hope she works
• im going to the same time as you when she usually works
• i am so sorry to hear that
• i hope she works for you
• i am so jealous of your work i am so jealous
• i hope you have fun
• i hope you get a job at a local bar
• i hope she works for you
Response set (k = 32):
• i see what you did there
• where in the world are you going to put your socks on
• she usually does but she has to keep up to date with her in the
• awesome what do you want for it
• how does a women have a relationship with someone
• did you get the job
• thats where she goes
• i know im also going this friday
• thats a great idea
• well make sure youre there too good luck
Response set (k = 318):
• thats what you like to hear
• go see her later this evening
• i read that as she usually fails
• okay ill see you there brah
• i always thought that it was for the sake of having the girl play as
• that can be expected here
• thats it im down now
• i dont see why it would matter if shes married
• i will be the ex gf
• she still does make her phone calls the whole office
Table 11: decTest data samples for respGen task and different k values (Top-k).
Context: Not the hacking
Response set (τ = 0.25):
• Not the hacking. The hacking is the fact that the DNC was hacked. !
• Not the hacking. The hacking is the real problem. The hacking is the!
• Not the hacking of the DNC, but the leaks of the emails of the Democratic National Committee.!
• Not the hacking, but the way it was done. The FBI's investigation into the!
• Not the hacking of the DNC, but the hacking of the emails of the Democratic National Committee.!
• Not the hacking of the DNC, but the leaking of the emails. The DNC's!
• Not the hacking of the DNC. The hacking of the DNC was a "false flag!
• Not the hacking of the DNC. But the hacking of the RNC. The DNC hack!
• Not the hacking. The hacking is the problem. The hacking is the problem!
• Not the hacking of the DNC, but the leaking of the emails. The DNC was hacked,!
Response set (τ = 0.8):
• Not the hacking after all? I'm sure the nation-states that are involved in!
• Not the hacking that happened on the internal networks of the Energy Department. In fact, according to!
• Not the hacking of the American public but rather the fraudulent Heisenberg principle that seemed to be!
• Not the hacking that took place in the DNC last year or the release of hacked emails during the!
• Not the hacking futurists Cardboard inventor and self-described tinkerer Dennis!
• Not the hacking alone. In the first half of the report, the hackers tried to create fake!
• Not the hacking. The hacking is the NSA's new SHIELD technology. It is!
• Not the hacking and hacking and hacking of the world government. I know this man is a man!
• Not the hacking aspect, but the pressure exerted by the Trumpistas. But also the Russia angle!
• Not the hacking, but the willingness." The evidence of interest in this case comes in!
Response set (τ = 1.1):
• Not the hacking experience of a CIA VRO crunch nine months agoJumpStart for 2016 jumps!
• Not the hacking, David.) The directory was flagged in a document it created in late last year!
• Not the hacking of Democratic Party systems -said the Russian team's activity represented "just the beginning!
• Not the hacking, of course -which these sources sounded more concerned about than being attacked 140 times!
• Not the hacking story is over. But yet there's another reason not to rush out such statements!
• Not the hacking-either.-These were scattered in the workshop.(Expanded-being guys with!
• Not the hacking of private material of elected officials, e.g. emails, even if the!
• Not the hacking has happened yet!!!!!!!!!!!!!!
• Not the hacking rumours have cost him any of his followers, least of all the proprietors of!
• Not the hacking group behind the breach of Sony, which has posted the staffer's information online,!

Context: How is our
Response set (τ = 0.25):
• How is our new technology helping us to do that? We are using a new technology!
• How is our system different from that of the United States? The United States is a!
• How is our approach different from that of the other major European countries? The European Commission!
• How is our country going to be able to compete with the rest of the world if we don!
• How is our country going to be able to compete with China in the future?" he asked.!
• How is our work different from that of other organizations? The work of the Center for!
• How is our work different from other research in this area? We are not the first!
• How is our system of government supposed to work? The reason we have a government is!
• How is our system different from the one that was used in the past? The system!
• How is our country supposed to be a beacon of hope for the world if we have to look!
Response set (τ = 0.8):
• How is our government going to catch up with the cyber criminals?" he said. "I'm!
• How is our society selling humanity on slavery? The answers to these questions are also important for us!
• How is our minister giving it to you? Isn't it? It's got a bit of!
• How is our research different from other studies? This study examined the effects of peer-!
• How is our mission different from Seniors' Service Corps (SSC) other than the fact!
• How is our challenge different? The only difference is that this challenge is about building an!
• How is our nation governed?" As Obama moved into his second term, he is increasingly!
• How is our recommendation different from what more traditional veterinarians do? We don't believe!
• How is our rapid abandonment of critical thinking, knowledge, and values, and the subsequent burial of!
• How is our education system designed for our futures? We are the children of immigrants,!
Response set (τ = 1.1):
• How is our Internet even even connected with our corporate tracks? Every cell phone on the planet knows!
• How is our developer name attached to the icon? Since the Planetside icon is use internally!
• How is our food paradise created? Artificial chemical fertilizers. So these aren't GMOs, but!
• How is our acquisition* worth -BOARD ROLL (Least Significant Equivalents)!
• How is our transit plan addressing this problem? Under our old plans, Burlington Buses!
• How is our mind different than any other part of the body?" A Broader View!
• How is our campaign working? Bitcoin launches alongside psychological research showing that people pay a lot!
• How is our mentioning application related to a related method (#five with two in queue) page such!
• How is our having to resort to roundabout hypotheticals to argue that Stewart may secretly want!
• How is our blood working out for you?" a statewide voter got an outpouring of rename and!

-k). Bold text is the 3-words prompt context.

Context: Sold Out. Jane wanted to watch a big new action movie. She had been waiting a long time for it to come out. When tickets became available she was too busy. By the time she had a chance to buy some it was sold out.
Response set (high content diversity):
• Jane cried over the fact that she couldn't watch it and just gave up looking for a ticket.
• Jane decided to look for a scalper that would sell her the ticket for the movie that she really wanted to see.
• Jane thought it was okay since she can still have a chance to watch it once it gets uploaded in video and movie streaming applications.
• Jane posted a status on her social media accounts asking her friends for any spare ticket that she is willing to buy.
• Jane resorted to contacting her old friend who is working at a huge movie theater hoping she can help her get a ticket.
Response set (low content diversity):
• Jane remembered that she has an old friend who is a manager at a big movie theater so she contacted that friend in the hopes that she can buy any spare ticket.
• Desperate to watch the movie, Jane called her friend, who works at a movie theater, asking for a ticket to that movie.
• Jane recalled that her friend works at a movie theater and hoped that she can help get a ticket for that movie.
• Jane decided to look for her friend who could possibly have access to tickets for that movie since that friend currently works at a movie theater.
• Jane realized that her friend might have spare tickets since she is a manager of a movie theater showing that film.

Context: Beavers. My friend has some beavers in his backyard. They come up from the creek by his house. He invites my over and we watch them. We take pictures of them and send them to our friends.
Response set (high content diversity):
• They are fascinating animals.
• Our friends love getting the pictures.
• Sometimes his dogs chase them.
• They are building a dam on the creek.
• They won't let us get too close to them.
Response set (low content diversity):
• They are busy gathering sticks to make a dam.
• The dam they are building is almost complete.
• It's fascinating to see their workmanship building a dam.
• They are turning the creek into a pond by building a dam.
• They all work together with careful engineering to build a dam.

Context: kill la kill is still going new episode every thursday
Response set (high content diversity):
• That show sucks
• OMG I can't wait
• I thought they canceled it
• What channel is it on
• I only watch nature programs on BBC
Response set (low content diversity):
• Lead actor is soooo hot
• Did you see the cliffhanger at the end of the season
• I've been waiting for it to return for weeks
• I'm totally gonna binge watch last season
• I just got into this show and can't stop watching

Context: places apple slices in a bowl so they'll stay fresh
Response set (high content diversity):
• Oh boy, I love apples.
• I don't need you telling me how to keep things fresh, take a hike.
• Girl, you're the fresh one around here.
• This post might be better in the life hacks section.
• This is actually a useful bit of advice.
Response set (low content diversity):
• I find merit in this input.
• That information will serve me well.
• Thanks, that's really good to know!
• Such knowledge is certainly beneficial.
• Wise words, I will heed them.

The worker is asked to generate response of hers/his own and rate the quality of the tester's response.

Context: Suppose there's an
Response set (high content diversity):
• Suppose there's an escape plan we haven't thought of yet.
• Suppose there's an omelet that is the most amazing ever.
• Suppose there's an airplane ticket that's even cheaper.
• Suppose there's an actual deadline for this paper.
• Suppose there's an event that we can go to this weekend.
Response set (low content diversity):
• Suppose there's an airline that costs less.
• Suppose there's an flight that isn't as expensive.
• Suppose there's an air travel fare, but doesn't cost as much.
• Suppose there's an way to fly there that is low cost.
• Suppose there's an flight going there and it's not a lot of money

Context: Nothing remotely like
Response set (high content diversity):
• Nothing remotely like eating a big breakfast.
• Nothing remotely like dancing with your wife at the wedding.
• Nothing remotely like singing Justin Bieber's greatest hits
• Nothing remotely like falling down a hill
• Nothing remotely like getting yelled at
Response set (low content diversity):
• Nothing remotely like being super full and satisfied.
• Nothing remotely like getting to taste many different foods.
• Nothing remotely like starting the day off right.
• Nothing remotely like doing exactly what I want to do.
• Nothing remotely like feeding myself with great food.
Table 17: conTest data samples for promptGen task. Bold text is the 3-words prompt context.

Spearman's ρ (mean and standard deviation) of automatic metrics for respGen.