On the Efficacy of Sampling Adapters

Sampling-based decoding strategies are widely employed for generating text from probabilistic models, yet standard ancestral sampling often results in text that is degenerate or incoherent. To alleviate this issue, various modifications to a model's sampling distribution, such as top-p or top-k sampling, have been introduced and are now ubiquitously used in language generation systems. We propose a unified framework for understanding these techniques, which we term sampling adapters. Sampling adapters often lead to qualitatively better text, which raises the question: From a formal perspective, how are they changing the token-level distributions of language generation models? And why do these local changes lead to higher-quality text? We argue that the shift they enforce can be viewed as a trade-off between precision and recall: while the model loses its ability to produce certain strings, its precision rate on desirable text increases. While this trade-off is not reflected in standard metrics of distribution quality (such as perplexity), we find that several precision-emphasizing measures indeed indicate that sampling adapters can lead to probability distributions more aligned with the true distribution. Further, these measures correlate with higher sequence-level quality scores, specifically, MAUVE.


Introduction
The vast majority of natural language generation systems take a probabilistic approach. The backbone of such an approach is a probability distribution over strings p_θ for a specific target domain. While modern language models have achieved remarkable performance on standard measures of distribution quality, e.g., perplexity (Brown et al., 2020; Chowdhery et al., 2022; Hoffmann et al., 2022; OpenAI, 2023), they often fall short when applied out of the box for language generation tasks: both sampling directly from them and searching for the maximum-probability string under them can lead to dull, incoherent, and degenerate text (Holtzman et al., 2020; Eikema and Aziz, 2020; Welleck et al., 2020).
Surprisingly, applying a post-hoc modification to p_θ(· | y_{<t}) often serves to dramatically improve the quality of the generated text (Nadeem et al., 2020; Pillutla et al., 2021; Wiher et al., 2022; Hewitt et al., 2022; Li et al., 2022). In this paper, we give a name to these methods, dubbing them sampling adapters. A sampling adapter can be formally defined as a simplex-to-simplex map α : Δ^{|V̄|−1} → Δ^{|V̄|−1} that systematically modifies the conditional distribution of an autoregressive language model p_θ(· | y_{<t}), thus creating another language model α(p_θ(· | y_{<t})) with a desired set of characteristics, e.g., it may only assign non-zero probability to items assigned high probability under the original model. Sampling adapters often require little to no fine-tuning and can be implemented in just a few lines of code. Presumably due to their simplicity, sampling adapters have become a default tool in text generation pipelines, serving as the core component of baseline decoding strategies in various tasks (Welleck et al., 2020; Pillutla et al., 2021; Pimentel et al., 2023).
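As an illustration of how little code a sampling adapter requires, the sketch below implements temperature scaling, one of the adapters studied later, as a function from one probability vector to another. The function and variable names are our own, not taken from any particular library.

```python
import math

def temperature_adapter(p, tau):
    """A minimal sampling adapter: rescale log-probabilities by 1/tau
    and renormalize, so the output is again a distribution over V."""
    logits = [math.log(q) / tau for q in p]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

p = [0.5, 0.3, 0.15, 0.05]
q = temperature_adapter(p, tau=0.5)       # tau < 1 sharpens the distribution
```

Note that the output still sums to one: the map takes a point on the probability simplex to another point on it, which is exactly the simplex-to-simplex property in the definition above.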
The fact that sampling adapters often lead to qualitatively better text, however, evokes a simple question: How do they change our language generation models such that the distribution p_θ(· | y_{<t}) places more probability mass on what we qualitatively deem to be "better" text? Most sampling adapters have been found through trial and error, with only intuitive motivations given for their efficacy. Moreover, standard evaluation measures do not immediately shed light on why sampling adapters work well, because most sampling adapters make language generation models substantially worse according to these measures: e.g., they often reduce the probability assigned to certain strings to zero, which can yield a perplexity of ∞.
In this paper, we posit that the change of distribution induced by sampling adapters can be analyzed in terms of a precision-recall trade-off, using the generalizations of these terms from the field of generative modeling (Sajjadi et al., 2018; Lucic et al., 2018; Djolonga et al., 2020). While a model loses its ability to produce certain strings, its ability to produce desirable text increases. We experiment with various sampling adapters that have been proposed (Fan et al., 2018; Holtzman et al., 2020; Meister et al., 2023; Hewitt et al., 2022) and find that, while the use of these adapters negatively affects recall-emphasizing performance measures, certain choices of hyperparameters increase performance in terms of measures that balance precision and recall or that are precision-emphasizing. Comparing trends in these measures, we see evidence of a precision-recall trade-off, which offers a quantitative motivation for the efficacy of sampling adapters. We further find that precision-emphasizing measures correlate most highly with sequence-level quality metrics, offering a potential avenue for efficiently choosing sampling adapter hyperparameter values. The formal framework and empirical analysis presented here should pave the way for the development of theoretically motivated sampling adapters, and provide a straightforward means for both analysis of and comparison between adapters.

Probability Distributions over Strings
Most language generation systems are based on probabilistic models, i.e., models of the probability distribution over natural language strings V*, where V* is the Kleene closure of an alphabet V. In words, V* is the set of all strings that can be generated from a vocabulary of (sub)words V. (Notably, these distributions might be conditioned on an input string, as in machine translation or summarization.) A common modeling choice is to break down string probabilities autoregressively and locally normalize p_θ, i.e., instead of directly modeling the full sequence probability p_θ(y), one models (sub)word probabilities p_θ(y | y_{<t}) conditioned on the prior context y_{<t} := ⟨y_1, . . ., y_{t−1}⟩ ∈ V*. Note that here, we have y ∈ V̄ for V̄ := V ∪ {EOS}, where EOS is a special end-of-string token required for an autoregressive p_θ to define a valid probability distribution over V*. The sequence-level probability can then be computed via the chain rule of probability:

p_θ(y) = p_θ(EOS | y) ∏_{t=1}^{|y|} p_θ(y_t | y_{<t})

See Du et al. (2023) for a characterization of when these models are tight, i.e., when the probability mass assigned to finite-length strings is 1.
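The factorization above can be sanity-checked with a toy sketch (names ours): a string's log-probability is the sum of per-token conditional log-probabilities, with EOS appended so the model covers finite strings.

```python
import math

def string_log_prob(p_cond, y):
    """log p(y) = sum_t log p(y_t | y_<t), with EOS appended so that
    the model defines a distribution over finite strings."""
    tokens = list(y) + ["EOS"]
    return sum(math.log(p_cond(tokens[:t], tok))
               for t, tok in enumerate(tokens))

# A toy, context-independent conditional distribution over V ∪ {EOS}:
def p_cond(context, tok):
    return {"a": 0.5, "b": 0.3, "EOS": 0.2}[tok]

# p("ab") = p(a) * p(b | a) * p(EOS | ab) = 0.5 * 0.3 * 0.2 = 0.03
prob_ab = math.exp(string_log_prob(p_cond, ["a", "b"]))
```

A real model's p_cond would of course depend on the context; the toy one here only serves to make the chain-rule bookkeeping concrete.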
The parameters θ of these models are typically chosen by (numerically) maximizing the log-likelihood of the training data D, where the log-likelihood is defined as:

L(θ) := ∑_{y ∈ D} log p_θ(y)

Note that this is equivalent to minimizing the (forward) cross-entropy between the empirical distribution p_D induced by the training data D and p_θ.

Decoding Strategies
In order to produce text from a model, one must use a decoding strategy, which provides a set of decision rules according to which tokens are sequentially chosen from the distribution p_θ to form a string. Decoding strategies can be broadly taxonomized as either maximization-based or sampling-based. Maximization-based strategies aim to find the candidate string that scores highest under some objective; finding the string with the highest probability under the model is a common example. Sampling-based strategies instead sample tokens according to some distribution derived from the model. While maximization-based strategies may make intuitive sense, they often lead to dull or degenerate text in open-generation settings (Cohen and Beck, 2019; Eikema and Aziz, 2020; Nadeem et al., 2020). Sampling-based strategies likewise have shortcomings: they introduce randomness into the generated text, which may lead to a disruption in coherence or fluency when units are sampled from low-probability regions of the distribution (Holtzman et al., 2020; Hewitt et al., 2022). A class of methods has been developed to address the problems observed when sampling directly from the model, specifically by altering the distribution from which tokens are sampled. We term these methods sampling adapters, formally defining them in the next section.
Formally, sampling adapters are simplex-to-simplex mappings, i.e., functions α : Δ^{|V̄|−1} → Δ^{|V̄|−1} that take a probability distribution over V̄ as input and map it to another one over V̄. We use the notation p̃ to denote the output of this map, as applied to the distribution p:

p̃(y) := α(p)(y)

One popular way of formulating sampling adapters in the literature has been via truncation functions, i.e., functions where vocabulary units that do not meet a certain criterion are re-assigned zero probability. We write these functions as:

p̃(y) ∝ p(y) · 1{y ∈ C(p)}

where C : Δ^{|V̄|−1} → P(V̄) is a function that finds the set of (sub)words that meets said criterion; P(·) denotes the powerset operator. Truncation sampling methods aim to eliminate probability mass placed on tokens deemed likely to lead to undesirable text, reallocating their probability mass to the remaining options. We now specify several common truncation-based sampling adapters.
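The truncation template can be sketched directly: given a criterion function C that selects the allowed set, zero out everything else and renormalize. This is an illustrative implementation with our own names, not the paper's code; the threshold criterion shown is just a placeholder, not one of the adapters studied here.

```python
def truncation_adapter(p, criterion):
    """Return the distribution with mass outside C(p) removed and the
    remainder renormalized: alpha(p)(y) ∝ p(y) · 1{y in C(p)}."""
    keep = criterion(p)
    z = sum(p[y] for y in keep)
    return {y: (p[y] / z if y in keep else 0.0) for y in p}

# Placeholder criterion: keep tokens whose probability exceeds a threshold.
def above_threshold(p, tau=0.1):
    return {y for y, q in p.items() if q > tau}

p = {"the": 0.5, "a": 0.3, "dog": 0.15, "xyz": 0.05}
q = truncation_adapter(p, above_threshold)
# "xyz" is truncated; the remaining mass is rescaled to sum to one
```

Any truncation-based adapter then differs only in its choice of criterion function, which is what the examples below instantiate.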
Example 3.3. We recover top-k sampling (Fan et al., 2018) when

C(p) = argmax_{V′ ⊆ V̄, |V′| = k} ∑_{y ∈ V′} p(y)

i.e., a function that returns the top-k most-probable (sub)words.
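As a concrete sketch of this criterion (function names ours), top-k keeps the k most probable (sub)words and renormalizes:

```python
def top_k(p, k):
    """Keep the k most probable (sub)words and renormalize."""
    keep = set(sorted(p, key=p.get, reverse=True)[:k])
    z = sum(p[y] for y in keep)
    return {y: (p[y] / z if y in keep else 0.0) for y in p}

p = {"the": 0.5, "a": 0.3, "dog": 0.15, "xyz": 0.05}
q = top_k(p, 2)
# keeps {"the", "a"}: q["the"] ≈ 0.5/0.8 = 0.625, q["a"] ≈ 0.3/0.8 = 0.375
```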
Example 3.5. We recover locally typical sampling (Meister et al., 2023) when

C(p) = argmin_{V′ ⊆ V̄} ∑_{y ∈ V′} | log p(y) + H(p) |  s.t.  ∑_{y ∈ V′} p(y) ≥ π

i.e., the set of items with log-probability closest to the (sub)word-level entropy that collectively have probability mass ≥ π.
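A greedy sketch of this criterion (our own simplification, not the authors' implementation): rank (sub)words by how close their negative log-probability is to the entropy H(p), keep the smallest such set whose mass reaches π, and renormalize.

```python
import math

def locally_typical(p, pi):
    """Greedy sketch: rank items by |−log p(y) − H(p)|, keep them in
    order until their total mass is at least pi, then renormalize."""
    H = -sum(q * math.log(q) for q in p.values() if q > 0)
    ranked = sorted((y for y in p if p[y] > 0),
                    key=lambda y: abs(-math.log(p[y]) - H))
    keep, mass = set(), 0.0
    for y in ranked:
        keep.add(y)
        mass += p[y]
        if mass >= pi:
            break
    z = sum(p[y] for y in keep)
    return {y: (p[y] / z if y in keep else 0.0) for y in p}

p = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
q = locally_typical(p, pi=0.8)
# "d", whose surprisal is farthest from H(p), is truncated at pi = 0.8
```

Note that, unlike top-k, the selected set need not contain the single most probable token first: items are admitted by typicality, not by raw probability.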
Other methods can similarly be cast in the sampling adapter framework, such as Mirostat (Basu et al., 2021) and the re-calibration method proposed by Braverman et al. (2020). Moreover, the general form of sampling adapters given above suggests that one direction for future research is learning a sampling adapter α. While many previously proposed adapters are truncation-based, adapters that reallocate mass in a different manner may also prove effective. Indeed, equipping α with tunable parameters could prove useful as a lightweight fine-tuning method.
An Unintuitive Effect. The motivation behind the use of sampling adapters with language generation models is to readjust their distribution, shifting mass away from tokens deemed likely to lead to undesirable text and onto tokens that will generate high-quality text. Yet why are such transformations even necessary? Standard measures of distribution quality, such as perplexity, would suggest that our models' estimates of the ground-truth distribution over natural language strings are quite good (Brown et al., 2020; Wang and Komatsuzaki, 2021; Hoffmann et al., 2022). This, in turn, implies that the heuristic shifts performed by sampling adapters should lead to worse language generators. We argue that the disparity between the quality of language generation systems using sampling-adapted models and the quality of these same models according to standard measures can be reconciled using probabilistic analogs of precision and recall.

A Precision-Recall Hypothesis
We begin by reviewing generalizations of the concepts of precision and recall in the field of generative modeling. We then discuss the shortcomings of current language generation models and how sampling adapters may address these shortcomings.

Generalizations of Precision and Recall
A series of recent papers have related the precision of a learned distribution p_θ to the average quality of generated samples, where high-quality samples are assumed to be those with high probability under the data-generating distribution p. Additionally, they relate the recall of p_θ to its coverage of p (Sajjadi et al., 2018; Lucic et al., 2018; Djolonga et al., 2020, inter alia), i.e., high overlap in the supports of p_θ and p. Following this line of reasoning, the notions of precision and recall can naturally be operationalized using measures of the difference between two distributions, specifically, ones that enable different penalizations of over- and under-coverage of our reference distribution.
There are several measures that, when considered together, naturally operationalize precision, recall, or some combination of the two. In this paper, we focus on cross-entropy, KL divergence, total variation distance (TVD), and Jensen-Shannon (JS) divergence. We introduce each in greater detail below. We note that for all of these measures, a larger value indicates a greater discrepancy between two distributions, and that all but the cross-entropy will be zero when the two distributions are identical. Further, not all of the measures are symmetric, i.e., their values change depending on the order in which the distributions are given as arguments. By convention, when the reference distribution is provided as the first argument, we call this the forward variant of the measure; when the reference distribution is the second argument, we call it the reverse variant. We define all measures in terms of generic distributions p_1 and p_2, which we assume both have (not necessarily identical) supports that are a subset of V̄.
Precision-emphasizing Measures. We first consider the cross-entropy between p_1 and p_2:

H(p_1, p_2) := − ∑_{y ∈ V̄} p_1(y) log p_2(y)

Upon inspection, we can see that the reverse cross-entropy, i.e., where p_1 is the distribution being evaluated and p_2 is a (fixed) reference distribution, rewards high precision. Specifically, it rewards p_1 for assigning probability mass where p_2 is large, implicitly penalizing p_1 for assigning high probability where p_2 is small. In fact, the reverse cross-entropy is minimized when p_1 places all probability on the most probable token under p_2. A related measure is the reverse KL divergence

KL(p_1 || p_2) := ∑_{y ∈ V̄} p_1(y) log (p_1(y) / p_2(y))

which is equivalent to the cross-entropy up to the subtraction of the entropy term H(p_1). As with cross-entropy, the reverse KL divergence rewards high precision. This property is reflected by a common intuition provided about this measure when it is used as a learning objective: it is referred to as a mode-seeking objective, i.e., it aims to place mass on the modes of the reference distribution. Importantly, the distributions that minimize the reverse variants of the cross-entropy and the KL divergence will not necessarily be equivalent, because the latter takes into account p_1's entropy. So which of these two metrics should we use? As we are interested in metrics that operationalize the notion of precision, the entropy of the distribution under evaluation is irrelevant. Thus, we will use the reverse cross-entropy as our primary precision-emphasizing metric.
Recall-emphasizing Measures. On the other hand, the forward variants of the cross-entropy and the KL divergence, where p_2 is now the distribution under evaluation and p_1 is assumed to be fixed, reward recall. This is evident when taking a closer look at their definitions: if p_2 fails to place probability on all elements y assigned probability by p_1, then both the cross-entropy and the KL divergence will be ∞. Analogously to the reverse KL's description as mode-seeking, the forward KL is referred to as mean-seeking. Note that using the forward variants of cross-entropy and KL divergence as learning objectives is equivalent, since H(p_1) is constant with respect to p_2. Further, the forward KL and cross-entropy, as well as the reverse KL, are minimized when p_2 = p_1.
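The asymmetry between the two directions can be seen numerically. In the sketch below (names ours), a truncated model has a finite reverse cross-entropy against a full-support reference, but an infinite forward cross-entropy, precisely because it misses part of the reference's support.

```python
import math

def cross_entropy(p1, p2):
    """H(p1, p2) = -sum_y p1(y) log p2(y); infinite as soon as p2
    assigns zero mass to an item in the support of p1."""
    total = 0.0
    for y, q in p1.items():
        if q == 0.0:
            continue
        if p2.get(y, 0.0) == 0.0:
            return math.inf
        total -= q * math.log(p2[y])
    return total

p_ref = {"a": 0.7, "b": 0.2, "c": 0.1}       # full-support reference
p_trunc = {"a": 0.78, "b": 0.22, "c": 0.0}   # mass on "c" truncated away

forward = cross_entropy(p_ref, p_trunc)      # recall-emphasizing: infinite
reverse = cross_entropy(p_trunc, p_ref)      # precision-emphasizing: finite
```

This is exactly the situation created by truncation adapters: perplexity (a forward, recall-emphasizing measure) explodes, while the precision-emphasizing reverse direction can improve.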
Balanced Measures. The definitions of TVD and the JS divergence, which are both symmetric measures, suggest a balance between the characteristics of precision and recall:

TVD(p_1, p_2) := (1/2) ∑_{y ∈ V̄} | p_1(y) − p_2(y) |

JS(p_1, p_2) := (1/2) KL(p_1 || m) + (1/2) KL(p_2 || m)

where m(y) = (p_1(y) + p_2(y)) / 2 for y ∈ V̄ is a pointwise average. Practically, the JS divergence can informally be viewed as an interpolation between the forward and reverse KL divergences. Indeed, several divergences that generalize the forward and reverse KL recover the JS divergence given a particular choice of hyperparameter (Huszár, 2015; Meister et al., 2020; Pillutla et al., 2021). TVD can be similarly motivated: Sajjadi et al. (2018) recover TVD in their precision-recall operationalization for generative models when assigning equal importance to precision and recall. Further, a standard result demonstrates that the JS divergence is a lower bound on TVD (Lin, 1991). With these measures in hand, we can more effectively assess the shifts to precision and recall that sampling adapters induce in a model.

To avoid the possibility of an infinite cross-entropy or KL divergence, one can use an ε-smoothed variant of p_2, i.e., p_2^ε(y) ∝ p_2(y) + ε. This trick is often employed to evaluate methods that do not produce distributions covering the entire support, e.g., Peters et al. (2019) and Martins et al. (2020). As many of the sampling adapters that we analyze produce sparse distributions (specifically, the truncation sampling methods), we will likewise employ this variant where necessary.
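Both balanced measures are a few lines each; the sketch below (names ours) follows the definitions above and remains finite even when the two supports differ, unlike the forward KL.

```python
import math

def kl(p1, p2):
    """KL(p1 || p2); assumes p2 > 0 wherever p1 > 0."""
    return sum(q * math.log(q / p2[y]) for y, q in p1.items() if q > 0)

def tvd(p1, p2):
    """Total variation distance: half the L1 distance."""
    support = set(p1) | set(p2)
    return 0.5 * sum(abs(p1.get(y, 0.0) - p2.get(y, 0.0)) for y in support)

def js(p1, p2):
    """Jensen-Shannon divergence via the pointwise mixture m."""
    support = set(p1) | set(p2)
    m = {y: 0.5 * (p1.get(y, 0.0) + p2.get(y, 0.0)) for y in support}
    return 0.5 * kl(p1, m) + 0.5 * kl(p2, m)

p1 = {"a": 0.7, "b": 0.2, "c": 0.1}
p2 = {"a": 0.5, "b": 0.5, "c": 0.0}
# tvd(p1, p2) = 0.5 * (0.2 + 0.3 + 0.1) = 0.3; js(p1, p2) is finite
```

JS never needs ε-smoothing because the mixture m has mass wherever either argument does; this is one practical reason it is attractive for comparing truncated distributions.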

Current Modeling Shortcomings
It is not clear that the objective with which probabilistic language generators are typically trained imparts characteristics that align with the goals of building good language generators. Any form of maximum-likelihood training is equivalent to minimizing H(p_D, p_θ), often with an additional form of regularization. Thus, it encourages high recall: p_θ(y_t | y_{<t}) must be nonzero for all tokens y_t in every string y in the training set D for the objective to be finite. This, in turn, results in p_θ allocating some probability mass to all (sub)words y ∈ V̄ for all contexts y_{<t}. In language modeling, this is perhaps a desirable property: we often care about the relative probabilities of strings, and assigning strings zero probability would be counter-productive towards this goal. Yet, this property can prove problematic when such models are used out of the box as language generators. For language generation systems, high precision is arguably a higher priority, i.e., the goal is for all of the generated sequences to be of high quality. An intuitive argument for this is that a single bad output can leave a lasting poor impression on the user, while the inability to generate a single sequence may go unnoticed, especially if the difference between that sequence and one the model can produce is a single, exchangeable token.
In this light, a possible explanation for the efficacy of sampling adapters is as follows: while model parameters are chosen to minimize a recall-prioritizing objective, sampling adapters re-align the distribution with a more appropriate precision-prioritizing probabilistic objective. That is, sampling adapter hyperparameter combinations that work well perhaps do so because they minimize an objective that balances precision and recall. If this is indeed the case, it should not be surprising that the transformation induced by sampling adapters leads to worse models according to standard, recall-emphasizing measures: any generator that assigns zero probability to a valid string, as is the case when top-π or top-k sampling is applied, will have both infinite cross-entropy and infinite perplexity with respect to the natural language distribution. They may, however, lead to better models according to more balanced (or even precision-emphasizing) measures, which is what we now empirically test.

Experiments
To test the hypothesis that the operations performed by sampling adapters are akin to a re-prioritization of precision over recall in the output of the model, we evaluate the effects of sampling adapters on measures that emphasize recall, precision, or a balance of the two, as outlined in §4.1. We then observe how these measures vary as a function of the sampling adapters' hyperparameters. Further, we look at these measures' Spearman correlations with MAUVE, a sequence-level quality metric.
We consider five different adapters: temperature, η (eta), top-π, top-k, and locally typical sampling, each over a wide range of hyperparameters. Note that for the latter three adapters, a smaller hyperparameter value corresponds to a larger shift between p_θ and p̃_θ. For η-sampling, the reverse is true, and for temperature sampling, hyperparameter values farther from 1 imply a larger shift. For reproducibility, we leverage the Hugging Face framework (Wolf et al., 2020) and its implementation of sampling adapters for all but η-sampling, for which we rely on the original authors' implementation. Error bars for all plots indicate 95% confidence intervals for the observed values; note that the bars are often small enough that they are not visible.

Setup
We focus on the task of open-ended text generation. We use GPT-2 small and large (Radford et al., 2019), as well as GPT-Neo (small) (Gao et al., 2020), as our generation models. The main results of this paper use the test set of a public version of the WebText dataset as our reference text. Results using the WikiText test set (Merity et al., 2016) are qualitatively similar and can be found in App. A.
Sequence-level Metrics. Following Pillutla et al. (2021), we use the first 35 tokens of samples from our reference text as a prompt to generate continuations y ∼ p̃_θ(· | y_{<t}) until |y| = 512 or EOS is sampled. We generate 1000 samples for each adapter-hyperparameter setting.

Token-level Measures. In this analysis, we compare (sub)word-level distributions p̃_θ(· | y_{<t}) and p(· | y_{<t}). The former is our generation model after the application of a sampling adapter and the latter is a reference distribution. We present results using both the empirical distribution induced by our test set and the distribution given by the GPT-J model (Wang and Komatsuzaki, 2021) as our reference distribution. Here, y is a string from the test set. Results are mean-aggregated across both t = 1, . . ., |y| and all y. Note that when we compute either the cross-entropy or KL divergence and it is not guaranteed that the support of p_1 is a subset of the support of p_2, we make use of the ε-smoothed version of the metrics, as specified in §4.1, with ε = 1e-6.
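The ε-smoothing used here is a one-liner; a minimal sketch (helper name ours) of the variant described in §4.1:

```python
def epsilon_smooth(p, eps=1e-6):
    """p_eps(y) ∝ p(y) + eps: every (sub)word regains nonzero mass, so
    forward cross-entropy/KL against a truncated model stays finite."""
    z = sum(q + eps for q in p.values())
    return {y: (q + eps) / z for y, q in p.items()}

# A truncated distribution whose zeros would make the forward KL infinite:
p_trunc = {"the": 0.625, "a": 0.375, "dog": 0.0, "xyz": 0.0}
p_eps = epsilon_smooth(p_trunc)
# every entry is now strictly positive and the whole thing still sums to one
```

The choice of ε matters only mildly; as Fig. 8 in App. A shows, moving from 1e-6 to 1e-8 leaves the reported trends essentially unchanged.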

Results
Trends in Probabilistic Measures. We first present our analysis of how different adapter-hyperparameter settings affect the relationship of the model to a reference distribution (either probabilities according to GPT-J or the empirical distribution). Note that if our hypothesis in §4.1 is correct, we would expect to see that certain sampling adapter-hyperparameter settings lead to lower values of measures that emphasize precision, such as reverse cross-entropy, while simultaneously increasing measures that emphasize recall, such as forward cross-entropy. We show the reverse and forward cross-entropy, as well as TVD, in Fig. 1. Both the forward and reverse cross-entropy results align closely with our hypothesis: a larger adapter shift generally leads to a higher forward cross-entropy and a lower reverse cross-entropy. This observation holds when using either the empirical distribution or GPT-J as our reference. Interestingly, we see that the trends reverse when we consider the reverse KL divergence (as opposed to the reverse cross-entropy; see Fig. 3). This is perhaps expected given that the entropy of the model's distribution monotonically decreases after the application of sampling adapters (see Fig. 7).

Figure 1: Forward/reverse cross-entropy and TVD of the model with GPT-J and the empirical distribution (WebText test set) after different sampling adapter methods have been applied to the output distribution. Note that, as described in §4.1, the ε-variant is used in all cross-entropy estimates except for reverse estimates with GPT-J. Dashed lines represent divergence with the unmodified distribution, i.e., the equivalent of using ancestral sampling.
Lastly, the trends in TVD differ largely depending on the distribution used as a reference. When GPT-J is used, we see that TVD monotonically increases as adapter strength increases. The reverse trend appears to hold when considering the empirical distribution: TVD generally decreases with adapter strength. The reason for this difference is not immediately obvious. Closer inspection reveals that when GPT-J is the reference, the trends in TVD mimic what we would expect from a metric that interpolates between the forward and reverse cross-entropies. Since TVD is motivated as a metric that balances precision and recall, our results therefore make intuitive sense. On the other hand, the observed trends for the empirical distribution do not have a clear explanation.
Critically, we find that the observed trends are stable across various design choices; see App. A for results with the WikiText dataset and with different choices of ε for the ε-smoothed versions of the metrics. (We also observed that the trends were very stable across the choice of reference model, i.e., when using GPT2-XL or the 1.5B-parameter version of GPT-Neo rather than GPT-J; we omit these results from the appendix to reduce clutter.)

A Precision-Recall Trade-Off. We next look at whether the shifts induced by common sampling adapters correspond to a precision-recall trade-off according to our probabilistic measures. In Fig. 2, we compare the reverse and forward cross-entropies (with GPT-J used as the reference) across the adapter hyperparameter settings used. Results using the empirical distribution are similar (see Fig. 10 in App. A). Fig. 2 indeed suggests a quite direct trade-off between our operationalizations of precision and recall. Notably, the highest sequence-level quality scores do not correspond to the sampling adapter-hyperparameter settings that achieve the best precision, i.e., the lowest reverse cross-entropy. (MAUVE scores for all adapter-hyperparameter settings and both datasets can be seen in Fig. 4.) Rather, they correspond to an intermediate point along the line, suggesting the importance of balancing precision and recall.
Correlations. The previous observations motivate us to look at correlations between (sub)word-level probabilistic measures and sequence-level quality metrics. We consider both the WebText and WikiText results when computing correlations. In Tab. 1, we see that the reverse KL of the generation model with GPT-J has the highest (rank) correlation with our quality metrics, closely followed by TVD. This finding suggests that reverse KL with another model could be a useful metric for selecting sampling adapter hyperparameters, as its computation is much faster than standard methods for choosing such hyperparameters, e.g., human annotations or sequence-level quality scores, which require the generation of full sequences.

Related Work
Precision and Recall in Language Generation. This is by no means the first work to focus on the notions of precision and recall in the context of language generation. Language generator evaluation metrics have historically intentionally prioritized precision-based measures due to their higher correlation with human quality judgments. For example, BLEU (Papineni et al., 2002) is computed using n-gram precision, and the original work on CHRF (Popović, 2015), which is a precision-recall-based metric, found that variants of the metric that placed more weight on precision correlated better with human judgments. More recently, Pimentel et al. (2023) report that the reverse KL divergence between multinomial distributions over embeddings of text from language models and of text from humans correlated more with human quality judgments than other divergence measures. On the other hand, measures that place higher importance on recall of the model with respect to some test set, such as perplexity, are known not to be good indicators of text quality (Holtzman et al., 2020; Cohen and Beck, 2019; Meister et al., 2023). In terms of model training, alternative objectives that emphasize precision have been proposed in an attempt to alleviate the zero-avoiding effect induced by optimization for maximum likelihood (Kang and Hashimoto, 2020; Pang and He, 2021).
Analysis of Language Generation Models. The effect of sampling adapters on language models has previously been discussed in the framework of a quality-diversity trade-off (Zhang et al., 2021; Meister et al., 2022). For instance, Nadeem et al. (2020) and Wiher et al. (2022) catalog various sampling adapters and analyze their properties with respect to a quality-diversity trade-off using a wide range of automatic metrics. Hashimoto et al. (2019) propose an evaluation framework that combines human and statistical evaluation. In contrast, our work makes an explicit connection to the concepts of precision and recall and analyzes the effect of sampling adapters using measures of differences in distributions. While Pillutla et al. (2021) likewise use notions of precision and recall for assessing language generators, they look at quantized distributions over language embedding spaces rather than directly at distributions over (sub)words.

Conclusion
In this work, we offer a formal treatment of sampling adapters and provide an analysis that aims to uncover why they are effective when used with probabilistic models for language generation. To this end, we first introduce a general framework that encompasses most of the transformations performed by previously proposed sampling adapters.
We then offer an intuition as to why sampling adapters may lead to better language generators. Using the notions of precision and recall proposed for generative models, which can be quantified in terms of standard probabilistic measures, we perform an empirical analysis. We find evidence that the application of sampling adapters increases the precision of a distribution at the expense of its recall; this observation is robust across several experimental design choices. We further find a high correlation between sequence-level quality metrics and the reverse KL divergence of the generation model with a reference model.

Limitations
A clear limitation of this work is that the results have been shown only for English. Further work should consider other model architectures, as well as datasets that span a variety of languages and domains. Another limitation is that we do not conduct human evaluations. Given the large number of adapter and hyperparameter settings that we chose to explore, acquiring the human evaluations that would have allowed us to draw statistically significant conclusions about the relationships between text quality, distribution-level measures, and adapter-hyperparameter settings would have been financially prohibitive. Instead, we chose to look at automatic sequence-level quality metrics that are known to correlate highly with human quality judgments. Further, it has been observed that crowd-sourced judgments of text quality are far from perfect (Clark et al., 2021), so it is not obvious that human evaluation would have been the better option.

Ethical Considerations
The use of language models for text generation comes with several ethical concerns. Especially when using sampling-based decoding algorithms, as is promoted in this work, the text generated by probabilistic models may contain malicious or hallucinatory content. This may be the intention of the user, but it can also occur simply due to the training data that the model was exposed to, which is often not carefully filtered for undesirable material that a model then learns to mimic. The goal of works like this one, to help create systems that can produce more human-like text, may also make it easier to automatically produce such content, which can ultimately have negative downstream effects. We caution designers and users of text generation systems to publicly advertise when content was created by a machine, and to implement checks to prevent the production of harmful material.

A Additional Results
Figure 3: Reverse and forward KL divergence of the model with GPT-J and the empirical distribution (WebText test set) after different sampling adapter methods have been applied to the output distribution. Note that the ε-method, as described in §4.1, is used in all but reverse KL estimates of models with GPT-J. Dashed lines represent divergence with the unmodified distribution, i.e., the equivalent of using ancestral sampling.

Figure 2: Reverse cross-entropy versus forward cross-entropy (the latter uses ε-smoothing) of the model with GPT-J for various sampling adapter and hyperparameter settings. Stars correspond to the values at which hyperparameter settings achieved the highest MAUVE scores. The black dot corresponds to ancestral sampling.

Figure 4: MAUVE scores for text generated using WebText prefixes and different sampling adapters. The dashed lines indicate the scores of samples generated using ancestral sampling.

Figure 5: JS divergence of the model with the empirical distribution (first row) and with GPT-J (second row) after different sampling adapter methods have been applied to the output distribution. Dashed lines represent the divergence with the unmodified distribution. We observe that at lower temperature values, some NaNs are produced by the JS computation with the empirical distribution.

Figure 6: Average entropy of the distribution p̃_θ(· | y_{<t}) for different sampling adapter-hyperparameter combinations. Dashed lines correspond to the entropy of the unmodified distribution.

Figure 7: Average model token coverage per sequence y (i.e., the percentage of tokens to which the adapter assigns non-zero probability) on the WebText test set after different sampling adapter methods have been applied to the output distribution. Dashed lines correspond to the unmodified distribution, which always assigns probability mass to each token.

Figure 8: Same plot as Fig. 1, albeit using a smaller ε (1e-8 instead of 1e-6) in the computation of the ε-variants of the measures. Results are essentially unchanged, except for a slight shift in axis values.

Figure 9: Same plot as Fig. 1, except using the test set of WikiText as our set of strings y and to construct the empirical distribution.

Figure 10: Reverse cross-entropy versus forward cross-entropy (both using ε-smoothing) of the model with the empirical distribution for various sampling adapter and hyperparameter settings. Stars correspond to the values at which hyperparameter settings achieved the highest MAUVE scores. The black dot corresponds to ancestral sampling.

Table 1: Spearman correlations of (sub)word-level probabilistic measures with MAUVE. We use * to indicate significance with a p-value < 0.001.