What Comes Next? Evaluating Uncertainty in Neural Text Generators Against Human Production Variability

In Natural Language Generation (NLG) tasks, for any input, multiple communicative goals are plausible, and any goal can be put into words, or produced, in multiple ways. We characterise the extent to which human production varies lexically, syntactically, and semantically across four NLG tasks, connecting human production variability to aleatoric or data uncertainty. We then inspect the space of output strings shaped by a generation system's predicted probability distribution and decoding algorithm to probe its uncertainty. For each test input, we measure the generator's calibration to human production variability. Following this instance-level approach, we analyse NLG models and decoding strategies, demonstrating that probing a generator with multiple samples and, when possible, multiple references provides the level of detail necessary to gain an understanding of a model's representation of uncertainty.


Introduction
Humans display great variability in language production, in particular when the context or the task is open-ended, such as in storytelling or in dialogue. Given a story prompt, for example, there are many plausible ways in which different humans (or a single writer, if asked multiple times) may tell the story (Fan et al., 2018). We refer to this phenomenon as production variability. Production variability in humans has two main sources. First, when situated in a context, speakers may entertain variable communicative goals (Searle, 1969; Sacks et al., 1974; Austin, 1975), and the number and variety of plausible communicative goals depends on the production task (Jokinen, 1996). Translation, for instance, defines the communicative goal almost unequivocally, while a dialogue context might allow for a wide variety of communicative goals (expressed, e.g., as a request, an assertion, or a yes-no question). The second source of variability is the fact that even when context and communicative goal are fixed, speakers' linguistic realisations of the communicative goal may vary (Levelt, 1993). Both sources of variability apply to individuals as well as to populations: if an expert is asked to simplify a complicated sentence multiple times, they may perform different rewriting transformations (e.g., paraphrasing, reordering, or sentence splitting) and produce different texts (Alva-Manchego et al., 2021); the same is true if multiple experts are asked to perform the task (Xu et al., 2015). If we are to regard a Natural Language Generation (NLG) system (or text generator) as a good model of human production, it should capture the variability observed in humans.
Text generators combine two mechanisms: (i) an underlying statistical model, typically an autoregressive factorisation of the probability of sequences, with conditional token probabilities predicted by a neural network; and (ii) an iterative decoding algorithm that chains samples from next-token distributions into a complete production. Together, these two mechanisms specify a probability distribution over sequences of tokens, which can be regarded as a representation of the model's uncertainty about productions for a given generation context (see Baan et al. (2023) for a detailed discussion). In this work, we assess whether this representation of uncertainty is in compliance with the production variability exhibited by a population of humans, which in turn, we argue, can be regarded as an expression of aleatoric uncertainty, i.e., irreducible uncertainty due to the stochastic nature of the data-generating process (Der Kiureghian and Ditlevsen, 2009; Hüllermeier and Waegeman, 2021). In other words, we compare the distribution over productions of a text generator against the distribution over productions of a population of human speakers, given the same context (Figure 1).
Quantifying the closeness in distribution between a text generator and a human population is difficult: we only have an iterative view into the generator's distribution; the 'human distribution' is an implicit or even hypothetical object; and in both cases, the sample space is large or even unbounded. We can, however, compare these two objects via the samples they produce and assess their statistical distance, which is what we propose here. For each individual generation context, we compare scalar properties of generations (through repeated model sampling) and human productions (using multi-reference NLG datasets). In particular, we probe for lexical, syntactic, and semantic distance between productions, thus allowing for a quantitative and interpretable assessment of uncertainty.
We find that the uncertainty of neural text generators is higher than justified by human production variability in open-ended tasks, like story generation and open-domain dialogue, and that it is lower on more constrained tasks, like machine translation and text simplification. Popular decoding algorithms, which bias away from the distribution of the generator's underlying statistical model (e.g., top-k, top-p, or locally typical sampling, rather than ancestral sampling), have a limited impact on the generator's ability to faithfully represent human variability. We complement our quantitative assessments with a detailed analysis of individual generation contexts, which sheds light on whether a generator has robustly learned to reproduce the degrees and aspects of human variability plausible for the communicative task.
Beyond the experimental results obtained on our selection of models and tasks, our work has important implications for NLG evaluation and data collection. Multiple samples and, when possible, multiple references should be used to assess the statistical fit of text generators. Our approach, complementary to other types of automatic evaluation, makes model assessments particularly insightful and trustworthy because it does not judge a model only by a single output but also, intuitively, by what it could have generated, and it does so for each individual input in the test set. We therefore hope our framework will be used by the community as an evaluation criterion for NLG systems, especially to assess them in more open-ended tasks.

Related Work
Automatic approaches to the evaluation of NLG systems are of high practical importance: they allow for model selection at scale and power quality-aware decoding algorithms (Borgeaud and Emerson, 2020; Eikema and Aziz, 2020; Fernandes et al., 2022; Suzgun et al., 2022). In spite of their known limitations (Gehrmann et al., 2022), they are a necessary complement to human evaluation (Belz and Reiter, 2006; van der Lee et al., 2021).
Reference-based evaluation. The most common way of automatically evaluating text generators is via metrics that estimate the similarity between candidate generations and references, such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), COMET (Rei et al., 2020), BLEURT (Sellam et al., 2020), and BERTScore (Zhang et al., 2020a). Reference-based metrics are less suited for open-ended tasks such as story generation and dialogue, where a single reference (or even a handful) cannot be representative of the large space of plausible communicative goals and realisations.
Reference-free evaluation. A popular, reference-free alternative is to train evaluation models that discriminate human from model output (e.g., Bruni and Fernández, 2017; Gehrmann et al., 2019; Hashimoto et al., 2019), score the appropriateness of input-output pairs (e.g., Sinha et al., 2020; Fomicheva et al., 2020), or model human judgements directly (e.g., Lowe et al., 2017; De Mattei et al., 2021; Rei et al., 2021). Neural language models themselves have been proposed as evaluators (e.g., Yuan et al., 2021; Deng et al., 2021) and used to assess generations along interpretable evaluation dimensions (Zhong et al., 2022), yet they have been criticised for being biased (toward models similar to the evaluator) and thus limited in their ability to evaluate generated text (Deutsch et al., 2022).
Statistical evaluation. Statistical evaluation compares model generations to human productions in distribution, through real-valued statistics (e.g., Zipf's coefficient, type-token ratio, length) rather than through the strings themselves. These statistics are typically compared marginally, at the corpus level (Eikema and Aziz, 2020; Meister and Cotterell, 2021; Pillutla et al., 2021; Pimentel et al., 2022), supporting general claims about model performance in relation to humans. More recently, Barkhof and Aziz (2022) and Deng et al. (2022) compared statistics at the instance level, supporting claims about models' performance in relation to humans for individual inputs. In this work, we craft statistics that evaluate generators' uncertainty at the instance level against the variability over sequences observed in multi-reference NLG datasets. Although evaluating uncertainty is gaining traction in NLP (e.g., Desai and Durrett, 2020; Glushkova et al., 2021; Baan et al., 2022), there is relatively little work on sequence-level uncertainty (Ott et al., 2018; Malinin and Gales, 2020; Aina and Linzen, 2021; Kuhn et al., 2022).
Diversity in NLG. Our analysis is related to NLG studies on output diversity. Some have evaluated the diversity induced by different models and decoding strategies, yet without using human levels of variability as a target (Wiher et al., 2022), or have used human judgements to evaluate diversity metrics themselves (Tevet and Berant, 2021; Stasaski and Hearst, 2022). Others have developed diversity-enhancing objective functions (Li et al., 2016) and decoding algorithms (Vijayakumar et al., 2018; Shu et al., 2019; Weir et al., 2020; Meister et al., 2021). In our study, where the aim is to evaluate the uncertainty of NLG systems, we focus on unbiased sampling and the most widely used decoding algorithms.

Probing Language Processes for Production Variability
We interpret language production, by humans or NLG systems, as captured by a probability distribution over natural language strings (productions), a random variable Y, given a linguistic context X = x. The context x can be a source sentence in translation, a story prompt in story generation, or more generally the input to a language process. In turn, a production is a piece of text y such as a single translation, a story, or more generally the output of a language process.

Production Variability
For any language process, production variability is fully characterised by a conditional probability distribution p Y |X=x representing uncertainty about the output Y given input X = x. Intuitively, the uniform distribution maximises production variability and the Dirac delta (one-hot) distribution minimises it. Analysing this distribution directly is difficult. Notably, for human language processes, we do not have an explicit representation of p Y |X=x. This prevents a direct comparison through measures of statistical divergence, or summaries like entropy.
Through data collection we can, however, draw conditional samples from the human language process (i.e., gather references given a context). On the other hand, for NLG models, we do have an algorithmic representation of p Y |X=x, which is usually sufficient to enable sampling, but the unbounded sample space and lack of conditional independence assumptions make statistical divergences and summaries like entropy intractable. Instead, we propose to analyse language processes through their samples. This in turn introduces other difficulties, as text is a high-dimensional, structured, non-numerical data type. For tractable analysis, we exploit a set of real-valued and interpretable statistics, or production probes, to re-express a language process distribution in terms of how, given an input, its outputs relate to the outputs of another language process. When both processes are independent humans performing a task, we obtain a sense of how plausible human productions relate (or vary with respect) to other plausible human productions, along a linguistically interpretable dimension. When we swap one or both processes for an NLG model, we obtain tools to analyse how model generations relate to plausible human productions, thus assessing a model's representation of uncertainty against the variability observed in humans. Specifically, given a context x, two language processes with distributions p Ŷ |X=x and p Y |X=x, and a choice of distance metric k(•, •) ∈ R, our probe for production variability is a real random variable k(Ŷ , Y ). This random variable captures the joint distribution of distances between any two outputs drawn conditionally from the two processes. The distribution of the probe k(Ŷ , Y ) is also intractable, but we can estimate it via simulation, by drawing productions from the two processes and evaluating the distance metric on sampled pairs, as illustrated in Figure 1.
Consider analysing the human process (§5) through k(Y, Y ′): when multiple realisations of the output are dissimilar (e.g., given the input 'How is your day?' and the outputs 'Fantastic, thank you!' and 'I asked you first'), production variability is high along the dimension captured by k.
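The simulation just described can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the paper's implementation: it uses the unigram version of the lexical probe as the metric k, and two small lists of toy strings stand in for samples from the two language processes.

```python
from collections import Counter
from itertools import product

def unigram_distance(a: str, b: str) -> float:
    """Fraction of non-matching unigram occurrences between two strings."""
    u, v = Counter(a.split()), Counter(b.split())
    total = sum(u.values()) + sum(v.values())
    matching = sum((u & v).values())  # multiset intersection
    return (total - 2 * matching) / total

def probe_distribution(samples_p, samples_q, k):
    """Monte Carlo sample of the probe k(Y_hat, Y): distances over all pairs."""
    return [k(y1, y2) for y1, y2 in product(samples_p, samples_q)]

# Toy stand-ins for two language processes conditioned on the same input.
human = ["Fantastic , thank you !", "I asked you first"]
model = ["Fantastic , thank you !", "Pretty good so far"]
dists = probe_distribution(model, human, unigram_distance)
```

The returned list `dists` is an empirical sample of the probe's distribution; with more draws from each process, its histogram approximates the curves shown in Figure 2.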

Production Probes
We instantiate our production probes with three distance functions, each returning values from 0 to 1. We hope that future work will experiment with alternative probes that capture other linguistic or extra-linguistic levels of analysis.
Lexical: The fraction of distinct n-grams in two strings, with n ∈ [1, 2, 3] (i.e., the number of non-matching n-gram occurrences divided by the total number of n-grams in both strings).
Syntactic: Analogous to lexical distance, but computed on part-of-speech tag n-grams.
Semantic: The cosine distance between the sentence embeddings of two strings (Reimers and Gurevych, 2019).

Experimental Setup

Data and Models
We experiment with four NLG datasets that contain 5+ human references per input instance and for which we expect humans to display different degrees of production variability. For each task, we select models that are publicly available, are reasonably sized, have been used previously on the task, and are conventionally accepted as suitable for it. All datasets are in English; for translation, the target language is German. Table 1 (Appendix C) shows relevant statistics. The reference collection procedure varies across datasets, and we discuss how this may impact our analysis in the Limitations section.
Machine translation. We use 500 sentences from the WMT-14 En-De test set (newstest2014; Bojar et al., 2014), which have been annotated by Ott et al. (2018) with 10 additional reference translations produced by as many human translators. As a generator, we use Helsinki-NLP's Transformer-Align model trained on Opus-MT (Tiedemann and Thottingal, 2020).
Text simplification. We use the 2,000 instances of the ASSET validation set (Alva-Manchego et al., 2020). For each source sentence, originally from the TurkCorpus (Xu et al., 2016), ASSET includes 10 additional simplifications by as many crowdworkers. On this dataset, we test Flan-T5-large (Chung et al., 2022), an instruction-finetuned version of the T5 language model (Raffel et al., 2020), which we further finetune on the ASSET training set.
Storytelling (story generation). We use the 759 instances from the WritingPrompts test set (Fan et al., 2018) for which at least 5 human references are available. Prompts and stories were originally scraped from r/WritingPrompts, a Reddit forum of stories written by online users in response to story prompts designed by other users. The number of stories available per prompt (9.56 ± 7.67) varies from 5 to 92. We use GPT2-large (Radford et al., 2018) finetuned on the WritingPrompts training set.
Open-domain dialogue. We use the development set of DailyDialog++ (Sai et al., 2020), which contains 5 additional references for 1,028 conversations from the DailyDialog corpus (Li et al., 2017). The dialogues are short (less than 8 turns) and cover a broad list of topics; for each dialogue, 2-3 annotators were asked to generate 1-3 alternative responses. For this task, we use the pretrained DialoGPT-medium (Zhang et al., 2020b).

Human Production Variability Across NLG Tasks
Consider p Y |X=x, the distribution that describes the human language process, and define the following special case of our probe for human production variability:

H k (x) := k(Y, Y ′) with Y, Y ′ drawn independently from p Y |X=x (1)

Estimating this probe by drawing pairs of human productions provides an interpretable view on plausible variability, i.e., aleatoric uncertainty, along the dimension captured by k. Figure 2 shows H k (x) marginalised over inputs for the four NLG tasks. We use unigram distance for the lexical probe, POS bigram distance for the syntactic probe, and cosine distance for the semantic probe. High distance indicates high variability, and vice versa.
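In practice, H k (x) can be estimated from a multi-reference dataset by evaluating k on all unordered pairs of references collected for the same input. A minimal sketch, where `length_distance` is a hypothetical toy probe used purely for illustration (any of the three probes above would take its place):

```python
from itertools import combinations

def human_variability(references, distance):
    """Empirical sample of H_k(x): distances between all pairs of references."""
    return [distance(a, b) for a, b in combinations(references, 2)]

# Hypothetical probe: normalised absolute length difference (illustration only).
def length_distance(a, b):
    la, lb = len(a.split()), len(b.split())
    return abs(la - lb) / max(la, lb)

refs = ["Okay ! Please .", "Sure , why not !", "Well ! Go on ."]
h_sample = human_variability(refs, length_distance)  # C(3, 2) = 3 values
```

With n references per input, this yields n(n - 1)/2 distance values per instance; marginalising these samples over all inputs produces histograms like those in Figure 2.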
Translation and text simplification. Humans show low production variability in these two tasks. While translations of a given source sentence are more lexically and semantically varied, simplifications exhibit a higher degree of syntactic variability, probably as a result of the instructions used during data collection (writers were asked to use varying rewriting transformations). Overall, low levels of variability are to be expected as, in both tasks, content preservation is part of the communicative goal.

Open-domain dialogue. We observe the highest production variability in this task across all probes. Many output pairs are lexically and syntactically completely dissimilar, as indicated by the rightmost bin in Figures 2a and 2b. Lexical variability is even more extreme when looking at bigrams and trigrams (Figure 7 in Appendix D), suggesting that while responses rarely share words or phrases, they still sometimes convey similar meaning (Figure 2c). Overall, the fact that dialogue appears to be the most open-ended task can be explained by the wide variety of communicative goals that can plausibly follow from a dialogue context and, in part, by the fact that individual annotators produced multiple responses for the DailyDialog++ dataset and were thus able to monitor the diversity of their outputs.
Do Neural Text Generators Reproduce Human Production Variability?
Consider, now, a second language process: a text generator with distribution p Ŷ |X=x. We study this generator's uncertainty about outputs given an input x under two lenses. In §6.1, we study how outputs vary with respect to one another, which is analogous to human production variability H k (x).
We refer to this as the generator's self-variability:

M k (x) := k(Ŷ , Ŷ ′) with Ŷ , Ŷ ′ drawn independently from p Ŷ |X=x (2)

In §6.2, instead, we study how model generations vary with respect to a language process known to be plausible: a human language process p Y |X=x. We refer to this as cross-variability:

C k (x) := k(Ŷ , Y ) with Ŷ ∼ p Ŷ |X=x and Y ∼ p Y |X=x drawn independently (3)

Our expectation is that generators with a good representation of aleatoric uncertainty reproduce human production variability along both axes. As we employ a distance metric, it may look like we should regard a model as a good approximation to the human process whenever C k (x) concentrates about small positive values. To some extent, this is the interpretation exploited by most automatic evaluation metrics (single- or multi-reference). In this work, we refrain from taking any one human production as a 'reference' to be closely 'matched'; rather, we take statistical properties of human productions as illustrative of plausible variability, and thus as targets to be reproduced. We quantify deviation from plausible human variability by estimating a notion of statistical divergence.

The Underlying Statistical Model
In this section, we criticise the underlying statistical model (the result of parameter estimation via MLE) using unbiased sampling. As models observe variability only marginally (multiple references are rarely used during training), it is interesting to study whether their self-variability is calibrated to human variability: given individual input instances, do distances between unbiased model samples distribute similarly to distances between human productions? To distinguish over-estimation from under-estimation of variability, we report a signed notion of divergence, µ M k (x) − µ H k (x). When M k (x) and H k (x) distribute similarly, their mean difference is low for a given x. Positive differences imply that models overestimate variability, i.e., model samples vary more with respect to one another than human samples do. Negative differences indicate that models underestimate variability.
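This signed score can be computed per input as the difference between the mean pairwise distance among model samples and the mean pairwise distance among human references. A minimal sketch, where `dist` is a hypothetical 0/1 exact-match distance standing in for any of the probes:

```python
from itertools import combinations
from statistics import mean

def mean_pairwise(outputs, distance):
    """Mean distance over all unordered pairs of outputs."""
    return mean(distance(a, b) for a, b in combinations(outputs, 2))

def signed_divergence(model_samples, references, distance):
    """mu_{M_k(x)} - mu_{H_k(x)}: positive means the model overestimates
    variability for this input, negative means it underestimates it."""
    return mean_pairwise(model_samples, distance) - mean_pairwise(references, distance)

# Toy probe and data: model samples nearly identical, references varied.
dist = lambda a, b: 0.0 if a == b else 1.0
models = ["vorsichtig reagiert"] * 3
humans = ["vorsichtig reagiert", "zurückhaltend reagiert", "reagierten verhalten"]
score = signed_divergence(models, humans, dist)  # -1.0: under-estimation
```

The toy example mirrors the OpusMT case discussed in §7: identical model samples against six distinct human phrasings yield a strongly negative score.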
Figure 3 shows how mean differences distribute across each task-specific test set for the models in Section 4. We use up to 10 human productions (5 for dialogue) and 10 generations. The first two rows show that µ M k (x) − µ H k (x) distributes far below 0 for translation (OpusMT) and somewhat below 0 for simplification (Flan-T5), indicating that the two models substantially underestimate variability. The opposite is true for dialogue and story generation: both GPT-2 and DialoGPT moderately overestimate the open-endedness of these tasks. We also inspect cross-variability, µ C k (x) − µ H k (x), finding similar patterns, with slightly better overall cross-variability calibration for translation and simplification (Figure 8, Appendix D).

The Effect of Decoding Algorithms
We now study text generators obtained by varying the sampling procedure. We analyse their representation of uncertainty by assessing the divergence between the distribution of generator-human cross-variability C k (x) and human variability H k (x). While µ C k (x) − µ H k (x) can inform us about the direction of miscalibration, we observe only a handful of cases where different decoding strategies yield both under- and over-estimation for the same model (see Figures 10 and 11 in Appendix D). Instead, as we sometimes observe distributions with multiple modes, causing their difference in means to paint an incomplete picture, we additionally report a measure of divergence that is more robust to such multi-modal distributions: the Wasserstein 1-distance D W 1 (•, H k (x)). Results for self-variability M k (x) and mean distance can be found in Appendix D, Figures 9 to 11.
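In one dimension, the Wasserstein 1-distance between two empirical samples reduces to the integral of the absolute difference of their empirical CDFs. A stdlib-only sketch of this computation (scipy.stats.wasserstein_distance computes the same quantity):

```python
from bisect import bisect_right

def wasserstein_1d(u, v):
    """W1 distance between two 1-D empirical samples via their CDFs."""
    u, v = sorted(u), sorted(v)
    pts = sorted(u + v)
    total = 0.0
    for a, b in zip(pts, pts[1:]):
        cu = bisect_right(u, a) / len(u)  # empirical CDF of u on [a, b)
        cv = bisect_right(v, a) / len(v)
        total += abs(cu - cv) * (b - a)
    return total

# Toy probe samples: model-human distances vs. human-human distances.
c_sample = [0.9, 0.8, 0.95, 0.7]   # C_k(x): model overestimates variability
h_sample = [0.3, 0.4, 0.2, 0.35]   # H_k(x)
divergence = wasserstein_1d(c_sample, h_sample)
```

Unlike a difference in means, this divergence stays informative when the distance distributions are multi-modal, which is why we prefer it for comparing decoding strategies.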
Human control group. The blue curve in Figure 4 shows how D W 1 (C k (x), H k (x)) distributes over inputs for unbiased samples from GPT-2 on story generation. To contextualise this observation, we report a human control group (the orange curve): this is D W 1 measured between two human populations (i.e., we make two disjoint samples from the available human productions for each prompt, use those to estimate H k (x) and an analogous Ĥ k (x), and compute D W 1 (Ĥ k (x), H k (x))). We can now appreciate what a plausible Wasserstein distance curve between two human-based processes looks like, and with that, we can better discern that this particular system gives good but not perfect representation to human levels of production variability (note the overlap between the two distributions). Upon visual inspection of divergence distributions (like Figure 4) for different sampling strategies, we find similar shapes. We exploit this finding and summarise each divergence distribution using its mean. This is shown in Figure 5, which presents results for many decoding settings, tasks, and probes. The leftmost red dots indicate the human control group. We observe that two human groups agree more on the meaning of translations and simplifications than on their form, while for story generation the two groups agree more on surface form and basic structures and less on the semantic content of the stories.
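The control group amounts to a disjoint split of the available references: pairwise distances are computed within each half, and the divergence between the two resulting samples sets a noise floor for what counts as human-like. A minimal sketch under simplifying assumptions (a hypothetical 0/1 distance; for equal-sized halves, W1 reduces to the mean absolute difference of the sorted samples):

```python
import random
from itertools import combinations

def control_divergence(references, distance, seed=0):
    """W1 between pairwise-distance samples of two disjoint halves of the references."""
    refs = references[:]
    random.Random(seed).shuffle(refs)
    half = len(refs) // 2
    a, b = refs[:half], refs[half:2 * half]
    da = sorted(distance(x, y) for x, y in combinations(a, 2))
    db = sorted(distance(x, y) for x, y in combinations(b, 2))
    # Equal-sized sorted samples: W1 is the mean absolute elementwise difference.
    return sum(abs(x - y) for x, y in zip(da, db)) / len(da)

dist = lambda x, y: 0.0 if x == y else 1.0
refs = ["a", "b", "c", "d", "e", "f"]  # six distinct toy references
floor = control_divergence(refs, dist)
```

Averaging this quantity over prompts gives the red control dots in Figure 5; a generator whose divergence from humans approaches this floor is, for practical purposes, indistinguishable from a second human population.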
Results. Overall, Figure 5 shows that most decoding settings are close to unbiased sampling, which in turn is in the same ballpark (mean Wasserstein distance always lower than 0.1) as the human control. This indicates that text generators capture the space of plausible human productions well when coupled with most decoding algorithms, though not as well as another human language process. Decoding settings form many clusters, and for all tasks except open-domain dialogue, unbiased samples best match human variability. This suggests that, within the limits of decoding configurations typically considered appropriate, different token-level decoding strategies often have a similar effect on a generator's ability to reproduce human production variability along our three probes. Altogether, these findings inform us about an often neglected aspect of decoding algorithms, namely their effect on the model's representation of uncertainty (rather than their ability to select individual high-quality generations).

Qualitative Instance-Level Analysis
We now qualitatively analyse individual inputs for which a generator's uncertainty is miscalibrated with respect to human variability, as detected by D W 1 . For each task, we use up to 10 human productions (5 for dialogue) and 10 generations. Figures accompanying the examples in this section are in Appendix E. While it is not a replacement for more standard NLG evaluation procedures, we argue that this level of analysis is complementary and crucial to gain a deeper understanding of a generator's representation of uncertainty.
Variability underestimation in translation and simplification. We have seen that in translation and simplification, generators' self-variability is lower than human variability (§6.1). We now zoom in on examples from these two tasks, inspecting instances that show inadequate model fit on all linguistic levels (i.e., D W 1 (M k (x), H k (x)) is high for all k). The most severe cases of miscalibration for OpusMT are all instances of variability underestimation. For most of these, generations are virtually or completely identical, while a few present slightly higher but still substantially lower variability than human productions. For example, ten humans translated the phrase 'reacted cautiously' in the English source sentence 'Several companies have thus far reacted cautiously when it comes to hiring' in six different ways ('vorsichtig reagiert', 'zurückhaltend reagiert', 'mit Vorsichtsmaßnahmen reagiert', 'reagierten mit Zurückhaltung', 'mit Vorsicht reagiert', 'reagierten verhalten'), while all ten generated samples contain the German phrase 'vorsichtig reagiert', signalling that the generator's lexical rephrasing abilities do not generalise to this input instance. For text simplification, we focus on instances where Flan-T5's uncertainty is not calibrated to human syntactic variability. We observe that simplifications sampled from the generator are always syntactically more similar to each other than humans' are, indicating that the generator struggles to capture an important aspect of text simplification: that many semantically equivalent rewritings are possible if a text's syntactic structure is altered.
Variability overestimation in dialogue. According to our estimates of human variability (§5), dialogue is the most open-ended task on all linguistic levels. We have hypothesised that this is due to the large variety of communicative act types plausible given any dialogue context. We have also seen that DialoGPT generally overestimates production variability (§6.1); Figure 1 is one such example. Now we further inspect instances where cross-variability is miscalibrated with respect to human outputs. We find that the generator's bad fit can be due to very short and generic responses (e.g., 'Well...', 'haha', 'Ahem', 'Well done!'), but is more often due to the presence of fluent yet very diverse and often inadequate samples. For such instances, not only is the generator's cross-variability miscalibrated; self-variability, too, is overestimated on all linguistic levels. In particular, the generator's poor calibration to lexical and syntactic variability is related to its inability to choose the correct dialogue acts (or to its favouring an excessive variety of dialogue acts). In an example instance where the last dialogue turn goes 'I've got a business call that I really need to take', humans all reply with short affirmative responses ('Okay! Please.', 'Well! Go on.', 'Sure, why not!', 'Sure! Go ahead.', 'Yes! Sure.'), while the model's responses are mostly lengthy statements, sometimes not particularly coherent ones (e.g., 'You don't need a business call. You need a friend').
Variability in lack of situational grounding. We have observed that human-written stories in the WritingPrompts dataset show lower variability than human dialogue responses, and hypothesised that this may be in part due to contextual pressures that constrain variability (§5). We now analyse instances flagged by our probe as cases of badly calibrated semantic cross-variability for GPT-2 (D W 1 (C k (x), H k (x)) > 0.3, with k the cosine distance). For one of these, the prompt refers to a portion of the situational context the model does not have access to ('all top level comments in this prompt take place in the same world, so make them all fit together'). Because they are conditioned on and reuse that context, human stories are quite similar to each other; generations, instead, show much higher pairwise distance, both when sampled jointly with the human productions (see Figure 6) and with themselves. The lack of relevant situational grounding makes the model more uncertain than it should be for this instance.

Conclusion
Variability is an intrinsic property of human language production. Text generators, if they are to be considered good statistical models of human written production, should exhibit plausible levels of variability. However, in NLG, the widespread practice is (i) collecting only one 'reference' production for each input and (ii) evaluating only a single generation. To appreciate the impact of this incongruity empirically, we analyse multiple-reference datasets for four NLG tasks, and show that each task has its own plausible levels of lexical, syntactic, and semantic variability. We connect production variability to aleatoric uncertainty, the irreducible uncertainty of the language production process, and evaluate neural text generators in terms of whether their representation of uncertainty is calibrated to the levels of variability observed in humans. We find that NLG models overestimate production variability in open-ended tasks and underestimate it in more constrained tasks, and that most popular decoding algorithms have a similar, limited effect on the generators' ability to reproduce human variability.
We advocate for more widespread usage of instance-level probing of NLG systems as a way to evaluate their statistical fit, not just along the dimensions we cover in this study but with respect to any other quality of interest. This approach contrasts with corpus-level analyses of NLG systems (e.g., Pillutla et al., 2021; Meister and Cotterell, 2021; Pimentel et al., 2022) and, thanks to its greater interpretability, it builds trust in the ability of generators to reproduce human-like statistics when situated in specific linguistic contexts, rather than 'globally', over a possibly heterogeneous corpus. In the future, we plan to devise new ways of improving the calibration of models' uncertainty (Zhao et al., 2022; Zhang et al., 2022), e.g., by steering generators with sequence-level decoding algorithms (Eikema and Aziz, 2022), and to investigate the relation between uncertainty and perceived generation quality (e.g., Kuhn et al., 2022): while we use human levels of variability as a target, desirable levels of variability may deviate from human statistics for specific applications.
Future work should also study production variability as a function of a more complex notion of discourse context (Giulianelli and Fernández, 2021; Giulianelli et al., 2023) and attempt to disentangle uncertainty over communicative goals and realisations (Stasaski and Hearst, 2023). This is an important avenue not only toward more practically useful generators but also toward reliable computational models of language production.

Limitations
Our analysis relies on multiple-reference datasets, which are scarce for NLG tasks. Even though, for single-reference datasets, we cannot perform a similar instance-level analysis, this does not entail that our observations do not apply to such datasets; we might simply not have the data to expose them.
Impact of data collection. The way in which multiple references are gathered may impact the variability of productions. For example, asking a single annotator to produce several distinct references might artificially increase the diversity of responses. Conversely, asking several independent annotators might decrease diversity, because they may resort to similar responses that quickly come to mind (or, in fact, the opposite if they interpret the linguistic context differently). To summarise, there are two levels of uncertainty in human production data: one at the individual level, the other at the population level. In this work, we do not distinguish between the two, although the analysis tools we propose allow for it. For example, one could collect human productions from one individual (e.g., for personalisation) or from sub-populations (e.g., to improve fit for underrepresented communities).
Other quality dimensions. It is possible that a model fits various statistical properties of the human process (under M_k(x), under C_k(x), and for various choices of k) while none of its probable responses are acceptable to humans as a whole. This is why our tools should be thought of as statistical probes. We indeed find instances that show good fit in terms of our distance probes but whose outputs may be perceived as inadequate. Manual inspection reveals that a marriage proposal in one of the dialogues (Figure 16 in the Appendix) is followed by a few incoherent model responses (e.g., 'Thank you. It's not a question of the strength or weakness of the plot. I think it all falls within my capacity.'), some dispreferred ones ('If you want to have a hug?'; see Levinson, 1983), and some with negative affect ('I don't need your love. I know where you are coming from and I trust you will do the same.'). Exhaustively defining all aspects of perceived quality (or human-likeness) is a strenuous endeavour which is highly dependent on the use case of the generation system. Our probes can be replaced with (possibly asymmetric) quality metrics which capture aspects (e.g., affective content, toxicity, or readability) considered relevant for any given application.

References
Laura Aina and Tal Linzen. 2021. The language model understood the prompt was ambiguous: Probing syntactic uncertainty through generation. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 42-57, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot, and Lucia Specia. 2020. ASSET: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4668-4679, Online. Association for Computational Linguistics.

A A Note on Entropy
Entropy is an information-theoretic concept that is often used to summarise uncertainty about a random variable. As useful as it may be in various contexts, entropy is not itself a complete characterisation of uncertainty (e.g., two different distributions may have the same entropy yet represent different uncertainty about their respective random variables). As we discuss in § 3.1, uncertainty about a random variable is fully represented by its underlying probability distribution (Halpern, 2017, Chapter 2). Consider a discrete random variable X with distribution p_X and probability mass function (pmf) p_X(x). Define the surprisal of an outcome X = x as the quantity − log p_X(x). Then, Shannon entropy (or just entropy for short) is defined as the surprisal of X taken in expectation under p_X (MacKay, 2003).

Due to the unbounded sample space and lack of conditional independence assumptions, entropy is intractable to compute exactly for neural text generators. In some cases, a Monte Carlo (MC) estimate of entropy can be formed with a reasonable amount of computation. For example, consider an autoregressive language model that assigns probability f(x; θ) to a complete sequence x using a neural network with parameters θ (e.g., an LSTM or Transformer). When we use ancestral sampling (Bishop, 2006) to decode a sample x^(s) from this model, the surprisal of x^(s) is directly available as − log f(x^(s); θ), and the sample mean −(1/S) Σ_{s=1}^{S} log f(x^(s); θ) over S ancestral samples forms an unbiased MC estimate of the entropy of X. In most cases, however, the generator's pmf is unknown and the surprisal of an outcome is not available. That is the case, for example, whenever we employ decoding algorithms that bias away from the underlying distribution of the autoregressive LM (top-p, top-k, and typical sampling are all examples). The resulting pmf is then hard (or impossible) to characterise.
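The MC estimate described above can be sketched with a toy explicit distribution over three 'sequences' standing in for a neural LM (the outcomes and probabilities are illustrative); because the pmf is explicit, the exact entropy is computable and the estimate can be checked against it.

```python
import math
import random

# Toy "model": an explicit pmf over complete sequences, so the exact
# entropy is available and the MC estimate can be verified against it.
pmf = {"a": 0.5, "b": 0.25, "c": 0.25}
exact_entropy = -sum(p * math.log(p) for p in pmf.values())

def ancestral_sample(rng: random.Random) -> str:
    """Draw one sequence proportionally to its model probability."""
    seqs, probs = zip(*pmf.items())
    return rng.choices(seqs, weights=probs, k=1)[0]

rng = random.Random(0)
S = 20000
# Unbiased MC estimate of entropy: mean surprisal of S ancestral samples.
mc_entropy = sum(-math.log(pmf[ancestral_sample(rng)])
                 for _ in range(S)) / S
```

With a real LM, the per-sample surprisal would come from the model's log-probability of the sampled sequence rather than a lookup in an explicit pmf; the estimator is otherwise identical.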
Furthermore, as much as Shannon entropy can be interpreted in its own information-theoretic terms, it is not immediately obvious how it can inform an analyst interested in a generator's faithfulness to human production variability. That said, an analyst may be interested in knowing, for example, that the entropy of the generator is similar to that of the 'human distribution', regardless of their ability to assign any useful interpretation to entropy proper. While we accept that some analysts may be curious about that question, we refrain from performing such an analysis ourselves because (a) MC estimation is not available for most of the popular decoders we want to analyse, and (b) estimating the entropy of the human distribution requires a faithful model of it (that is, we would need a perfectly faithful text generator to play the role of a 'gold standard').

B A Note on the Wasserstein 1-Distance
The Wasserstein 1-distance W1(·, ·) quantifies a notion of distance between two probability measures and is particularly convenient because it can be estimated directly from samples (Dirac deltas; Peyré et al., 2019), more easily than alternatives such as the Kolmogorov-Smirnov and total variation distances (which require binning the measurements into empirical cdfs/pdfs). W1(M_k(x), H_k(x)) and W1(C_k(x), H_k(x)) have an interpretation in terms of the 'mass' (in units of k) that has to be moved, on average, to transform one set of samples into the other.
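As a concrete illustration, here is a minimal sketch (ours, not the paper's implementation) of estimating W1 from two equal-size 1-D samples, e.g., two sets of pairwise distances: sort both samples and average the absolute differences of the matched order statistics. For unequal sample sizes, a library routine such as `scipy.stats.wasserstein_distance` generalises this.

```python
def wasserstein_1(u: list[float], v: list[float]) -> float:
    """W1 between two equal-size 1-D empirical distributions:
    average absolute difference of matched order statistics."""
    assert len(u) == len(v), "this sketch assumes equal sample sizes"
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)

# Shifting every sample point by c moves W1 by exactly c ('mass' moved).
d = wasserstein_1([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])  # -> 1.0
```

The result is in the same units as the underlying probe k, which is what gives W1 its 'mass to be moved' reading in the text above.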

C Data Statistics
Table 1 shows relevant statistics for the four multiple-reference datasets presented in § 4.

E Examples Discussed in the Qualitative Instance-Level Analysis

F Decoding Configurations
Tables 2 to 5 show mean divergences for all the analysed decoding strategies, in terms of Wasserstein 1-distance as well as mean distance.

Figure 1: Production variability observed in 5 human responses vs. 10 responses generated by DialoGPT. The graph presents the distribution of pairwise cosine distances: generated responses exhibit higher semantic variability than human responses. The generator's semantic uncertainty is too high in this dialogue context.

Figure 2: Human production variability across four NLG tasks. The values on the horizontal axis are single samples of lexical (unigram), syntactic (POS bigram), or semantic (cosine) distance between two randomly sampled productions for each input (see Section 3). Re-sampling productions results in nearly identical marginal distributions. Probability mass on the right side signals high distance and thus high variability, and vice versa.

Figure 3: Distribution of µ_Mk(x) − µ_Hk(x) over instances. Values greater than zero indicate that the model overestimates the variability of the task (higher mean pairwise distance); values below zero indicate variability underestimation.

Figure 5: Mean Wasserstein distances D_W1(C(x), H(x)) for (task, probe, decoder) tuples. Base models for each task are described in Section 4. Colour refers to decoding algorithm with various parameter settings (fully reported in Table 4, Appendix F). Human control group in red. Clusters suggest that decoders often have a similar effect. Unbiased sampling is competitive.

Figure 6: Example of poor cross-variability calibration for GPT-2 with typical sampling on story generation.

Figure 7 shows human production variability over lexical and syntactic unigrams, bigrams, and trigrams (complementing Figure 2 in the main paper). Figure 8 shows the distribution of µ_Ck(x) − µ_Hk(x) over instances for our four tasks (complementing Figure 3 in the main paper). Figures 9 to 11 show, for all (task, probe, decoding algorithm) tuples, mean Wasserstein distances and mean differences in pairwise distance.

Figures 12-16 show examples of model fitness for the instances discussed in § 7.

Figure 7: Human production variability across four NLG tasks (the remaining settings not reported in the main paper). The values on the x-axis are single samples of lexical or syntactic distance between two productions for each input (see Section 3). Probability mass on the right side signals high distance and thus high variability, and vice versa. A large spread indicates that production variability varies widely across inputs, and as such that a task does not define a specific level of variability.

Figure 8: Distribution of µ_Ck(x) − µ_Hk(x) over instances. Values greater than zero indicate that the model overestimates the variability of the task (higher mean pairwise distance); values below zero indicate variability underestimation.

Figure 9: Mean Wasserstein distances D_W1(M(x), H(x)) for (task, probe, decoding algorithm) tuples. Base models for each task are described in Section 4. Tuples that share a colour have different decoding parameters. Human control group in red, except for dialogue, where 5 references are too few to create a control group.

Figure 10: Mean of distances µ_M(x) − µ_H(x) for (task, probe, decoding algorithm) tuples across test sets. Base models for each task are described in Section 4. Tuples that share a colour have different decoding parameters. Human control group in red, except for dialogue, where 5 references are too few to create a control group.

Figure 11: Mean of distances µ_C(x) − µ_H(x) for (task, probe, decoding algorithm) tuples across test sets. Base models for each task are described in Section 4. Tuples that share a colour have different decoding parameters. Human control group in red, except for dialogue, where 5 references are too few to create a control group.

Figure 12: Example 1 of poor fit (D_W1(Mk(x), Hk(x))) to human variability for the Opus MT model.

Figure 13: Example 2 of poor fit (D_W1(Mk(x), Hk(x))) to human variability for the Opus MT model.

Figure 14: Example of good fit (D_W1(Mk(x), Hk(x))) to human variability for the Opus MT model.
Story generation. Variability in story generation is strongly dependent on the probe. It is low at the syntactic level (close to translation and simplification), while lexical and semantic probes place this task closer to open-domain dialogue. Stories generated from a given prompt may vary a lot in content, but basic syntactic structures and lexical material are shared. Although this task may a priori be perceived as at least as 'open-ended' as dialogue, lower levels of variability may result from contextual factors specific to the WritingPrompts dataset that we are not explicitly modelling, such as writers reading stories contributed by other users.

Table 1: Length statistics (mean ± std, median, and range) for each of the four multiple-reference datasets. Number of tokens obtained with the tokenisers of the language models used for generation.

Table 2: Mean D_W1(M(x), H(x)) results for different decoder settings.