Language Model Evaluation Beyond Perplexity

We propose an alternate approach to quantifying how well language models learn natural language: we ask how well they match the statistical tendencies of natural language. To answer this question, we analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained. We provide a framework, paired with significance tests, for evaluating the fit of language models to these trends. We find that neural language models appear to learn only a subset of the tendencies considered, but align much more closely with empirical trends than proposed theoretical distributions (when present). Further, the fit to different distributions is highly dependent on both model architecture and generation strategy. As concrete examples, text generated under the nucleus sampling scheme adheres more closely to the type–token relationship of natural language than text produced using standard ancestral sampling; text from LSTMs reflects the natural language distributions over length, stopwords, and symbols surprisingly well.


Introduction
Neural language models 1 have become shockingly good at modeling natural language data in recent years (Merity et al., 2017; Conneau and Lample, 2019; Radford et al., 2019). Thus, to test just how well neural language models capture language, NLP researchers have started to look beyond standard evaluation metrics such as perplexity, endeavoring to understand which underlying attributes of human language these models are learning. To this end, a nascent literature has emerged that focuses on probing language models (Belinkov and Glass, 2019), i.e., determining whether models encode linguistic phenomena. For the most part, these works have been limited to analyses of sentence-level phenomena, such as subject-verb agreement (Gulordava et al., 2018) and garden path effects (van Schijndel and Linzen, 2018) among a myriad of other properties (Blevins et al., 2018; Chowdhury and Zamparelli, 2018, inter alia).
1 In this work, we do not use the term language model to refer to cloze language models such as BERT (Devlin et al., 2019), which do not give us a distribution over strings.
Figure 1: Average number of unique words vs. document length, i.e., type-token, in text sampled from language models. Values from models' test set are plotted for reference.
In this work, we attempt to understand which macro-level phenomena of human language today's language models reflect. That is, we pose the question: Do neural language models exhibit the statistical tendencies of human language? Phenomena that can be measured at this level provide an alternate view of a model's comprehension; for example, rather than exploring whether morphological agreement is captured, we look at whether our models learn the trends across a corpus as a whole, e.g., the token rank-frequency (Zipf's) relationship. In comparison to standard probing techniques, this framework does not require that we know a priori how linguistic phenomena should manifest themselves. That is, when there is no law stating the theoretical tendencies of an attribute of natural language or we have reason to believe our language domain does not follow such a law, we can use the statistical tendencies present in empirical data as our baseline. This characteristic both allows us to assess a model's fit to highly corpus-dependent distributions, like the length distribution, and mitigates the biases introduced by our own preconceptions regarding properties of natural language.
More concretely, our paper describes an experimental design and accompanying hypothesis tests to determine precisely whether text generated from language models follows the same empirical trends as human language. Our experiments reveal that adherence to natural language tendencies varies widely with both model architecture and generation strategy, e.g., Fig. 1 shows varying degrees of adherence to the empirical type-token relationship, an artifact that perplexity alone could not reveal. Our findings suggest this framework is a valuable tool for gaining a deeper understanding of where today's language models are succeeding and failing at capturing human language.

Language Models
Language models are probability distributions over natural language sentences. We define the support of a language model $p_\theta$ with parameters $\theta$ as
$$\mathcal{Y} := \{\text{BOS} \circ \mathbf{v} \circ \text{EOS} \mid \mathbf{v} \in V^*\} \quad (1)$$
where $V$ is the model's vocabulary, the tokens BOS and EOS demarcate the beginning and end of a string, respectively, and $V^*$ is the Kleene closure of $V$. In this paper, we term vocabularies consisting of words closed and those consisting of BPE tokens (Sennrich et al., 2016) open.
In the case when $p_\theta$ is locally normalized, which is the predominant case for language models, $p_\theta$ is defined as the product of probability distributions:
$$p_\theta(\mathbf{y}) = \prod_{t=1}^{|\mathbf{y}|} p_\theta(y_t \mid \mathbf{y}_{<t}) \quad (2)$$
where each $p_\theta(\cdot \mid \mathbf{y}_{<t})$ is a distribution with support over $\bar{V} := V \cup \{\text{EOS}\}$ and $\mathbf{y}_{<1} = y_0 := \text{BOS}$. To estimate model parameters $\theta$, one typically optimizes the log-likelihood function over a corpus $C_{\text{train}}$:
$$\mathcal{L}(\theta; C_{\text{train}}) = \sum_{\mathbf{y} \in C_{\text{train}}} \log p_\theta(\mathbf{y}) \quad (3)$$
where we call each string $\mathbf{y}$ a document. To determine the goodness of fit of a model to the empirical distribution (defined by $C_{\text{train}}$), it is standard practice to measure perplexity on a held-out dataset, which is simply a monotonic function of average (per token) log-likelihood under that model. While low perplexity on an evaluation set undoubtedly reflects some level of fit to natural language, it does not give us a fine-grained view of which linguistic attributes a model has learned.
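To make the relationship between perplexity and average per-token log-likelihood concrete, the following minimal Python sketch computes perplexity from token log-probabilities. The function name `perplexity` and the input layout (one list of natural-log token probabilities per document) are assumptions made for this illustration; they are not taken from the experimental code.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-document lists of natural-log token probabilities.

    Perplexity is exp(-average log-likelihood per token), so it decreases
    monotonically as the average log-likelihood increases.
    """
    total_log_prob = sum(sum(doc) for doc in token_log_probs)
    total_tokens = sum(len(doc) for doc in token_log_probs)
    return math.exp(-total_log_prob / total_tokens)

# Toy held-out set: two documents with three scored tokens in total.
docs = [[math.log(0.5), math.log(0.25)], [math.log(0.5)]]
ppl = perplexity(docs)
```

Two models with the same perplexity can still differ in, e.g., the length or type-token behavior of their samples, which is the gap the rest of the paper addresses.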

Statistical Tendencies of Language
Human languages are thought to exhibit statistical tendencies, several of which are explicitly quantified by laws (Altmann and Gerlach, 2016). In this section, we review a subset of these distributions, both with and without well-established forms, over which we subsequently perform analyses.

Classical Laws
Rank-Frequency. Zipf's law (1949), otherwise known as the rank-frequency law, states that the frequency of a word in a corpus decays as a power of the frequency rank of that word, i.e., the frequency $\omega(\cdot)$ of the $k$-th most frequent word $w_k$ follows the power-law distribution $\omega(w_k) \propto k^{-s}$. When fit to natural language text, the free parameter $s$ is typically close to 1. Zipf's law also has a probabilistic interpretation: the marginal probability that a random word in our corpus takes on the value of the $k$-th most frequent can be expressed as
$$p_{\text{zipf}}(w_k) = \zeta(s)\, k^{-s} \quad (4)$$
where $\zeta(s) = 1 / \sum_{k=1}^{\infty} k^{-s}$ is the normalizing constant of our probability mass function (pmf). The adherence of language to Zipf's law has been widely studied and is considered one of the canonical laws of quantitative linguistics (Baroni, 2009; Li et al., 2010; Moreno-Sánchez et al., 2016).
Estimating $s$ from an observed set of rank-frequency pairs can be done using standard estimation techniques. Here we use the maximum-likelihood estimate (MLE), employing numerical optimization to solve for $s$ since the MLE of the discrete power law lacks a closed-form solution.
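The numerical estimation of $s$ can be sketched as follows. This is a minimal illustration rather than the procedure used in the experiments: it assumes the pmf is normalized over the finite set of observed ranks (a truncated Zipf distribution), and it uses a plain golden-section search, which is adequate here because the negative log-likelihood is convex in $s$. The function names and the search bounds are hypothetical choices.

```python
import math

def zipf_neg_log_likelihood(s, freqs):
    """Negative log-likelihood of rank counts under a truncated Zipf pmf.

    freqs[k-1] is the count of the k-th most frequent word; the pmf is
    assumed to be k^{-s} normalized over ranks 1..len(freqs).
    """
    log_norm = math.log(sum(k ** -s for k in range(1, len(freqs) + 1)))
    weighted_log_rank = sum(f * math.log(k) for k, f in enumerate(freqs, start=1))
    return s * weighted_log_rank + sum(freqs) * log_norm

def fit_zipf_exponent(freqs, lo=0.01, hi=5.0, tol=1e-6):
    """Golden-section search for the s that minimizes the negative log-likelihood."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if zipf_neg_log_likelihood(c, freqs) < zipf_neg_log_likelihood(d, freqs):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return (a + b) / 2

# Synthetic counts proportional to 1/k (up to rounding), so s should be near 1.
counts = [round(100000 / k) for k in range(1, 501)]
s_hat = fit_zipf_exponent(counts)
```

In practice a library optimizer could replace the hand-written search; the pure-Python version is used here only to keep the example self-contained.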
Type-Token. Heaps' law (Herdan, 1960), also known as the type-token relationship, states that the rate at which new unique tokens (i.e., types) appear in a document diminishes as its length increases. Formally, we can express the expected number of types $u(\cdot)$ as a function of the length $l(\cdot)$ of the string $\mathbf{y}$ via the relationship $u(\mathbf{y}) \propto l(\mathbf{y})^{\beta}$ where $\beta < 1$ is a free parameter. Types may be, e.g., unigrams or bigrams.
The above formulation of Heaps' law lacks an obvious probabilistic interpretation. However, if we frame Heaps' law as modeling the expected value of the number of types for any given length document, then we can model the relation as a Poisson process, where the marginal distribution over document length follows Heaps' proposed power law. Specifically, we model the number of types for a document of a given length as a non-homogeneous Poisson process (NHPP; Ross, 1996) where our rate parameter $\lambda(l(\mathbf{y}))$ is Heaps' power law relation. The probability that there are $k$ types in a document of length $t$ is then
$$p_{\text{heaps}}(u(\mathbf{y}) = k \mid l(\mathbf{y}) = t) = \frac{\lambda(t)^k \, e^{-\lambda(t)}}{k!}$$
for $\lambda(l(\mathbf{y})) = \alpha \cdot l(\mathbf{y})^{\beta}$. Similarly to Eq. (4), we can fit parameters $\alpha, \beta$ using MLE (see App. A).
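One way to fit such a Poisson model is sketched below. This is an assumption-laden illustration and not necessarily the procedure of App. A: it treats each document as an independent Poisson observation with rate $\alpha \cdot l^{\beta}$, uses the closed-form maximizer of $\alpha$ for a fixed $\beta$, and searches over $\beta$ with a golden-section search. The function names and the search interval are hypothetical.

```python
import math

def heaps_log_pmf(k, length, alpha, beta):
    """log P(k types | document length) under a Poisson rate alpha * length**beta."""
    rate = alpha * length ** beta
    return k * math.log(rate) - rate - math.lgamma(k + 1)

def fit_heaps(lengths, types, lo=0.0, hi=1.5, tol=1e-6):
    """Maximum-likelihood (alpha, beta) via a 1-D search over beta."""
    def profile(beta):
        # For fixed beta, setting the derivative w.r.t. alpha to zero gives
        # alpha = (total types) / (sum of length**beta).
        alpha = sum(types) / sum(l ** beta for l in lengths)
        log_lik = sum(heaps_log_pmf(k, l, alpha, beta) for k, l in zip(types, lengths))
        return alpha, log_lik

    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if profile(c)[1] > profile(d)[1]:
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    beta_hat = (a + b) / 2
    return profile(beta_hat)[0], beta_hat

# Synthetic (length, number of types) pairs generated from 3 * length**0.7.
lengths = [10, 50, 100, 500, 1000, 5000]
types = [round(3 * l ** 0.7) for l in lengths]
alpha_hat, beta_hat = fit_heaps(lengths, types)
```

The profile likelihood is concave in $\beta$ for this log-linear Poisson model, which is why a simple bracketing search suffices.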

Other Tendencies
Natural language has other quantifiable distributions, e.g., over document length or unigrams. While there may not exist well-established laws for the behavior of these (often highly corpus-dependent) distributions, we can observe their empirical distributions w.r.t. a corpus. We review a few here and leave the exploration of others to future work.
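Length. Using notation from earlier, we estimate the pmf of the distribution over the length of documents in a corpus $C$ as
$$\hat{p}_l(k) = \frac{1}{|C|} \sum_{\mathbf{y} \in C} \mathbb{1}\{l(\mathbf{y}) = k\}$$
We can additionally compute statistics of this distribution, such as the sample mean: $\hat{\mu}_l(C) = \frac{1}{|C|} \sum_{\mathbf{y} \in C} l(\mathbf{y})$.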
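Unigram. Notably, the rank-frequency law of §3.1 leaves the categorical distribution over words unspecified, i.e., it defines the frequency for the $k$-th ranked word without specifying the word itself. In order to make explicit comparisons, we define the unigram distribution w.r.t. corpus $C$ as
$$\hat{p}_{\text{uni}}(w) = \frac{\sum_{\mathbf{y} \in C} \sum_{t=1}^{l(\mathbf{y})} \mathbb{1}\{y_t = w\}}{\sum_{\mathbf{y} \in C} l(\mathbf{y})}$$
Stopwords and Symbols. Certain percentages of words in a string consist of either symbols, i.e., numbers and punctuation, or stopwords, i.e., common words such as "that" or "so" that primarily serve a syntactic function. We can model this percentage as a (continuous) random variable $S$ and estimate its probability density function (pdf) as
$$\hat{p}_{\text{stop}}(s) \propto \sum_{\mathbf{y} \in C} \mathbb{1}\left\{ \frac{\#\text{stop}(\mathbf{y})}{l(\mathbf{y})} = s \right\}$$
where $\#\text{stop}(\mathbf{y})$ denotes the number of stopwords in $\mathbf{y}$. The pdf for symbols is defined similarly. As with our length distribution, we can compute the means $\hat{\mu}_{\text{stop}}, \hat{\mu}_{\text{sym}}$ of these distributions.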
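The empirical quantities of this subsection can be computed with a few lines of Python. The sketch below assumes whitespace tokenization, lower-casing, and non-empty documents; the tiny `STOPWORDS` set is a hypothetical stand-in for a real stopword list, and the function names are illustrative only.

```python
from collections import Counter

# Hypothetical miniature stopword list, used only for this example.
STOPWORDS = {"the", "a", "of", "that", "so"}

def length_pmf(corpus):
    """Empirical pmf over document length (number of whitespace tokens)."""
    counts = Counter(len(doc.split()) for doc in corpus)
    return {k: c / len(corpus) for k, c in counts.items()}

def unigram_pmf(corpus):
    """Empirical unigram distribution over lower-cased tokens."""
    counts = Counter(tok for doc in corpus for tok in doc.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def stopword_fractions(corpus, stopwords=STOPWORDS):
    """Per-document fraction of tokens that are stopwords."""
    fractions = []
    for doc in corpus:
        tokens = doc.lower().split()
        fractions.append(sum(tok in stopwords for tok in tokens) / len(tokens))
    return fractions

corpus = ["the cat sat", "a dog barked at the cat", "so it goes on"]
lengths = length_pmf(corpus)
unigrams = unigram_pmf(corpus)
stop_fracs = stopword_fractions(corpus)
mean_stop = sum(stop_fracs) / len(stop_fracs)
```

The symbol fraction can be obtained the same way by swapping the membership test for one that checks whether a token consists solely of punctuation and digits.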

Statistical Distances
In this work, we aim to quantify the degree to which the linguistic distributions of text generated from language models match, or differ from, those of natural language. To this end, we propose the use of several probability metrics (Mostafaei and Kordnourie, 2011; Rachev et al., 2013) as our notion of statistical distance. For each of these metrics, we present nonparametric statistical significance tests, i.e., tests that may be used when the underlying distribution of observed data is not known.

Primary Metrics
Perhaps the simplest method for measuring the distance between two random variables is through differences in expectations, e.g., means or variances.
(Semi-)distances of this nature are formally called primary metrics. To estimate this distance, we can use observations from random samples $S_1$ and $S_2$, e.g.,
$$\phi(S_1, S_2) = \hat{\mu}(S_1) - \hat{\mu}(S_2)$$
Observing a value of $\phi(S_1, S_2) \neq 0$ on its own is not enough to confirm a difference between $\mu_1$ and $\mu_2$; we need to assess whether the observed distance is significantly above or below 0. Formally, our null and alternative hypotheses are:
$$H_0: \phi(S_1, S_2) = 0 \qquad H_a: \phi(S_1, S_2) \neq 0$$
In our setting, we typically do not know the theoretical distributions of the random variables generating $S_1$ and $S_2$, nor of an arbitrary test statistic $\phi$. Consequently, we use resampling techniques to construct the sampling distribution of $\phi(S_1, S_2)$.
Permutation Tests. In a nutshell, a permutation test provides a simple method for constructing the sampling distribution of a test statistic through empirical observations. The method uses the value of $\phi$ over all possible rearrangements of the observed data points to represent the distribution of the test statistic under the null hypothesis. Using this distribution, we can determine the probability of observing a value of the test statistic (or a more extreme value), which, if low, may give us reason to reject a specific null hypothesis. In this work, we only consider statistics $\phi(\cdot, \cdot)$ over two samples. We provide pseudocode for this case in App. B.
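A two-sample permutation test of this kind can be sketched as follows. This is not the pseudocode of App. B; it is a minimal Monte Carlo variant that draws random rearrangements instead of enumerating all of them, which is an assumption made here to keep the computation tractable. The names `permutation_test` and `mean_difference`, the number of resamples, and the add-one correction are illustrative choices.

```python
import random

def mean_difference(s1, s2):
    """Signed difference in sample means."""
    return sum(s1) / len(s1) - sum(s2) / len(s2)

def permutation_test(s1, s2, statistic=mean_difference, n_resamples=2000, seed=0):
    """Two-sided Monte Carlo permutation test for a two-sample statistic.

    Returns the observed statistic and an estimated p-value: the share of
    random relabelings whose statistic is at least as extreme as observed.
    """
    rng = random.Random(seed)
    observed = statistic(s1, s2)
    pooled = list(s1) + list(s2)
    n1 = len(s1)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if abs(statistic(pooled[:n1], pooled[n1:])) >= abs(observed):
            extreme += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_resamples + 1)

# Toy samples, e.g., document lengths from two corpora with shifted means.
sample_a = [10.0 + 0.1 * i for i in range(30)]
sample_b = [12.0 + 0.1 * i for i in range(30)]
observed, p_value = permutation_test(sample_a, sample_b)
```

Any two-sample statistic with the same call signature (for instance, a difference in variances or a distance between empirical distributions) can be passed as `statistic`.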

Simple Metrics
Primary metrics provide only a weak measure of the sameness of random variables as they are completely dependent on a single statistic of a distribution. On the other hand, we know a random variable can be completely described by its distribution function. As such, we turn to simple metrics of distance between random variables. Given cumulative distribution functions (cdfs) $P_1$ and $P_2$ over one-dimensional random variables, the Kolmogorov-Smirnov (KS) metric is
$$D(P_1, P_2) = \sup_{y} |P_1(y) - P_2(y)|$$
where $D \in [0, 1]$ and $D(\cdot, \cdot) = 0$ indicates the distributions are identical. However, not all random variables can be described in terms of a cdf. For categorical distributions where the support of our random variable is not ordinal, the natural counterpart to the KS metric is the Chi-square distance. This metric has a number of drawbacks (discussed in App. C), primarily that its value can be hard to interpret, and so we instead turn to the total variation distance (TVD), a widely used metric of distance between probability distributions. Given two pmfs $p_1$ and $p_2$, we define TVD as
$$\text{TVD}(p_1, p_2) = \frac{1}{2} \sum_{y} |p_1(y) - p_2(y)|$$
where similarly to the KS metric, TVD is bounded above by 1 and a value of 0 indicates identical distributions. In our setting, we consider two use cases for the KS metric and TVD: as distance metrics between an empirical and theoretical distribution (one-sample) and between two empirical distributions (two-sample). The corresponding hypotheses that we can test with these metrics are:
One-Sample Case: $H_0$: Sample $S$ is drawn from $p$; $H_a$: Sample $S$ is not drawn from $p$
Two-Sample Case: $H_0$: Samples $S_1$ and $S_2$ are drawn from same $p$; $H_a$: Samples $S_1$ and $S_2$ are not drawn from same $p$
where in the two-sample case, the exact form of $p$ does not need to be known. These hypotheses require the following tests.
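Before turning to the tests, the two metrics can be illustrated on samples. The sketch below computes the two-sample KS metric as the largest gap between empirical cdfs and the TVD as half the summed absolute difference between empirical pmfs (the standard definition, assumed here). Function names are hypothetical.

```python
from collections import Counter

def ks_metric(sample1, sample2):
    """Largest absolute gap between the empirical cdfs of two ordinal samples."""
    c1, c2 = Counter(sample1), Counter(sample2)
    cdf1 = cdf2 = gap = 0.0
    # Both empirical cdfs only change at observed values, so it suffices
    # to compare them at every value seen in either sample.
    for value in sorted(set(c1) | set(c2)):
        cdf1 += c1[value] / len(sample1)
        cdf2 += c2[value] / len(sample2)
        gap = max(gap, abs(cdf1 - cdf2))
    return gap

def total_variation(sample1, sample2):
    """Total variation distance between the empirical pmfs of two samples."""
    c1, c2 = Counter(sample1), Counter(sample2)
    support = set(c1) | set(c2)
    return 0.5 * sum(abs(c1[x] / len(sample1) - c2[x] / len(sample2)) for x in support)

# Document lengths are ordinal, so the KS metric applies.
d_lengths = ks_metric([3, 5, 5, 8], [3, 5, 9, 12])
# Word identities are categorical, so TVD is used instead.
tvd_words = total_variation(list("aab"), list("abb"))
```

Both functions return 0 for identical samples and 1 for samples with disjoint support (for the KS metric, when one sample lies entirely below the other).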
The Kolmogorov-Smirnov Test. The KS test (Smirnov, 1948) is a nonparametric goodness-of-fit test originally designed to assess the fit of a continuous cdf to empirically-observed data; the two-sample version tests whether two samples come from the same distribution. The method has since been extended to discrete distributions and is regarded as one of the most widely applicable nonparametric goodness-of-fit tests for comparing two distributions (Horn, 1977; Moreno-Sánchez et al., 2016). The test uses the KS metric $D$ as its test statistic; under our null hypothesis, $D$ converges to 0 almost surely in the limit as our number of samples $n \to \infty$ by the Glivenko-Cantelli theorem. We may reject the null hypothesis if our test statistic is greater than the critical value, which is computed based on our sample size and a desired significance level.
A Test for TVD. Unlike the KS metric, we do not have a (theoretical) limiting distribution for TVD between samples from the same distribution that holds for all density functions (Devroye and Győrfi, 1990). However, we can construct this distribution using resampling techniques. Formally, when $S_1$ and $S_2$ are drawn from the same distribution $p$, where $p$ need not be known, the test statistic $\text{TVD}(p_{S_1}, p_{S_2})$ follows the sampling distribution $Z_p$, i.e., $\text{TVD}(p_{S_1}, p_{S_2}) \sim Z_p$. The distribution of $Z_p$ can be computed using permutations of our samples, in the same manner as defined in §4.1.
Figure 2: Vocabulary statistics for each sample. Only Transformer (AS) and trigram models have a closed vocabulary; the higher red line is the size of the former.
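The decision rule of the two-sample KS test can be sketched with the textbook large-sample critical value $c(\alpha)\sqrt{(n+m)/(nm)}$, where $c(\alpha) = \sqrt{-\tfrac{1}{2}\ln(\alpha/2)}$. This asymptotic formula is an assumption of the example (it is derived for continuous distributions and is not necessarily the exact computation used in the experiments); the function names are hypothetical.

```python
import math

def ks_critical_value(n, m, alpha=0.05):
    """Asymptotic two-sample KS critical value for sample sizes n and m."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))

def reject_same_distribution(d_statistic, n, m, alpha=0.05):
    """True if the KS statistic exceeds the critical value at level alpha."""
    return d_statistic > ks_critical_value(n, m, alpha)

# With 100 samples per set, only fairly large gaps are significant.
crit_small = ks_critical_value(100, 100)
# With one million samples per set, the threshold becomes very small.
crit_large = ks_critical_value(10**6, 10**6)
```

The shrinking threshold illustrates why, at very large sample sizes, even tiny differences between distributions lead to rejection of the null hypothesis.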

Experiments
We use the above framework to assess the degree to which language models learn various distributions of natural language, i.e., we report metrics outlined in §4 measured over the distributions and quantities defined in §3. We compare samples generated from language models to a reserved test set taken from the same corpus as the model's training data. Each set contains 1 million samples. 8 We tokenize all samples using the Moses decoder toolkit (Koehn et al., 2007). All text is lower-cased and only complete unigrams are considered, i.e., when BPE is used, only the detokenized unigram is considered. Length of a string is computed as the number of tokens separated by whitespace. Note that when reporting the KS metric ($D$), we always report the metric between (a) an empirical cdf computed over the respective model-generated samples and (b) a reference cdf, where $D_p$ indicates direct comparison with the empirical cdf of the test set. $D_{p_\theta}$ and $D_{\hat{p}}$ indicate comparison with cdfs of a parametric distribution, whose parameters are estimated on the model and test set, respectively.
Natural Language Corpus. We use English Wikipedia Dumps, 9 preprocessing data following the steps used for XLM (Conneau and Lample, 2019) albeit with a 44.7e6 train / 1e4 valid / 1e6 test split. The test set is used in all statistical tests; however, we estimate standard deviations for statistics in Tab. 4 (in the Appendix) using samples from the training set; see this table for, e.g., parameter estimates over the test set.
8 Due to our large sample sizes, we should anticipate that our results will almost always be significant, even when effect sizes are trivially small. As such, we will almost assuredly reject our null hypotheses that model-generated samples come from the same distribution as natural language ones. While in this light, the presentation of hypothesis tests in §4 may seem pointless, we provide them for cases where generating many samples for each model setting is computationally prohibitive.
9 dumps.wikimedia.org/
Simulating Corpora from Language Models. Given the distribution $p_\theta$, we may exactly compute statistics and distributions for language models over the entire set $\mathcal{Y}$, weighting examples by the probability assigned to each string; however, doing so is infeasible due to the size of the output space and non-Markovian structure of most neural models. Rather, we turn to sampling to create a representative set $S = \langle \mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(N)} \rangle$ from $p_\theta$. We explore three sampling schemes: ancestral random sampling (Random), nucleus sampling (Nucleus), and beam sampling (Beam). 10 In ancestral random sampling, $\mathbf{y}^{(i)}$ are constructed iteratively according to the distribution
$$y_t^{(i)} \sim p_\theta(\cdot \mid \mathbf{y}_{<t}^{(i)}) \quad (14)$$
where $y_0 = \text{BOS}$. Under the local normalization scheme of Eq. (2), sampling according to Eq. (14) is equivalent to sampling $\mathbf{y}^{(i)}$ directly from $p_\theta$. In nucleus sampling, our distribution is truncated to the most probable items covering portion $n \in (0, 1]$ of the probability mass. Formally, we now sample
$$y_t^{(i)} \sim \begin{cases} p_\theta(y \mid \mathbf{y}_{<t}^{(i)}) / Z & \text{if } y \in V_n(p_\theta(\cdot \mid \mathbf{y}_{<t}^{(i)})) \\ 0 & \text{otherwise} \end{cases}$$
where $V_n(p) \subseteq \bar{V}$ is the smallest subset such that $\sum_{y \in V_n(p)} p(y) \geq n$ and $Z := \sum_{y \in V_n(p)} p(y)$. Beam sampling uses Eq. (14) as the sampling distribution, but extends a "beam" of $k$ sequences at each sampling iteration, i.e., $k$ extensions are sampled from $p_\theta(\cdot \mid \mathbf{y}_{<t}^{(i)})$ for each sequence on the beam and the $k$ most probable of the $k^2$ sampled items remain on the beam; note that unlike standard beam search, this is a stochastic procedure. 11 We use a beam size of 5 in all experiments.
..., 2017). All models are implemented and trained using fairseq. 12 We train models on corpora processed both with and without BPE. We include details for each model in Tab. 1. We additionally estimate a trigram model on the training data; formally, we build a model where the probability of observing token $x \in \bar{V}$ at position $i$ of the text is estimated as
$$p(x \mid x_{i-2}, x_{i-1}) = \frac{c(x_{i-2}, x_{i-1}, x)}{c(x_{i-2}, x_{i-1})}$$
where $c(\cdot)$ denotes the function counting occurrences of a sequence in some implicit $C$.
Note that we do not employ smoothing techniques in this model; thus, perplexity over a held-out dataset may diverge and so is not reported in Tab. 1. Vocabulary statistics for each sample are shown in Fig. 2. We provide samples of model-generated text in App. E.
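The truncation step of nucleus sampling described above can be sketched for a single next-token distribution. This is a minimal illustration, not the fairseq implementation: the distribution is given as a plain dictionary, and the names `nucleus_truncate` and `sample_token` as well as the toy probabilities are assumptions of the example. Setting `n=1.0` recovers ancestral random sampling.

```python
import random

def nucleus_truncate(probs, n):
    """Renormalized distribution over the smallest top-probability set with mass >= n."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= n:
            break
    # Dividing by the retained mass Z makes the kept probabilities sum to 1.
    return {token: p / mass for token, p in kept}

def sample_token(probs, rng):
    """Draw one token from a (possibly truncated) next-token distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

next_token_probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "EOS": 0.05}
truncated = nucleus_truncate(next_token_probs, n=0.9)
token = sample_token(truncated, random.Random(0))
```

A full sampler would repeat this step token by token, conditioning the model on the prefix generated so far, until EOS is drawn.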
10 ... to two "new" distributions, $p_\theta$, respectively.
11 Note that this is the default sampling scheme for language generation in the fairseq library.
12 github.com/pytorch/fairseq/

Rank-Frequency
To understand the rank-frequency relationship implicitly learned by language models, and how it relates to the rank-frequency distribution present in natural language, we compute the three KS metrics previously described: $D_{p_\theta}$, $D_{\hat{p}}$, and $D_p$. Specifically, for the first two values, we use the cdf of a Zipfian distribution parameterized by $s$ as our reference, where $s$ is estimated using model-generated samples or the test set, respectively. These metrics give us a sense of how well the rank-frequency distribution under our language models matches a Zipfian distribution. Since the power-law behavior of the token rank-frequency distribution is known to fall off at higher ranks (Piantadosi, 2014; Moreno-Sánchez et al., 2016), we consider solely the first 10,000 ranks in each sample, including when computing $D_p$. We report these values in Tab. 2. Values of estimates of $s$ and plots of rank-frequency are shown in App. D.
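The comparison against a Zipfian reference can be illustrated with a short sketch that measures the KS metric between the empirical cdf over ranks and the cdf of a Zipf distribution truncated to the same ranks. Normalizing the reference over the retained ranks only, and the function name `ks_to_zipf`, are assumptions made for this example.

```python
def ks_to_zipf(freqs, s, max_rank=10000):
    """KS metric between an empirical rank cdf and a truncated Zipf cdf.

    freqs[k-1] is the count of the k-th most frequent word; only the first
    max_rank ranks are used, and both cdfs are normalized over those ranks.
    """
    freqs = freqs[:max_rank]
    total = sum(freqs)
    norm = sum(k ** -s for k in range(1, len(freqs) + 1))
    empirical_cdf = zipf_cdf = gap = 0.0
    for k, f in enumerate(freqs, start=1):
        empirical_cdf += f / total
        zipf_cdf += k ** -s / norm
        gap = max(gap, abs(empirical_cdf - zipf_cdf))
    return gap

# Counts exactly proportional to 1/k match a Zipfian cdf with s = 1.
zipfian_counts = [1.0 / k for k in range(1, 1001)]
d_zipfian = ks_to_zipf(zipfian_counts, s=1.0)
# Uniform counts deviate strongly from the same reference.
d_uniform = ks_to_zipf([1.0] * 1000, s=1.0)
```

Replacing the Zipf cdf with the empirical rank cdf of a second sample gives the direct, law-free comparison.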
Our results indicate that our models' empirical rank-frequency distributions do not adhere very closely to a standard Zipfian distribution (as shown by $D_{p_\theta}$ and $D_{\hat{p}} \gg 0$), despite appearing to at a superficial level (see App. D). However, the same is true for our test set ($D_{\hat{p}} = 0.148$), which suggests that our models fit a Zipfian distribution perhaps no more poorly than natural language does. Rather, the model that produces qualitatively the worst text (see App. E), a trigram model under the beam sampling generation strategy, follows a power-law trend the most closely of any of our samples. On the other hand, the small values of $D_p$ suggest our models learn the empirical rank-frequency trends of human text quite well, something that would not be evident by simply looking at adherence to a Zipfian distribution. The combination of these results suggests the limitation of using adherence to Zipf's law as a gauge for a model's consistency with natural language.
Table 2: ... (Wood and Altavela, 1978)) for all KS metrics are $\ll 0.001$. For the unigram distribution, we report TVD between empirical cdfs of model and test set. All p-values are < 0.001 (see App. D).
Figure 4: KS metrics (lower implies closer fit) with reference distributions for the type-token relationship as a function of document length. $D_{p_\theta}$ and $D_{\hat{p}}$ are the statistical distances from the NHPP distribution for parameters fit to model text and test sets, respectively; $D_p$ is computed directly against the empirical cdf of the test set. Shading indicates significance of the statistic.

Type-Token
Fig. 3 shows the type-token trend for all corpora and generation schemes. While most models appear not to follow the same trend as the natural language distribution (as depicted by our test set), we observe that transformers under the nucleus sampling generation scheme match it most closely. Indeed, both models based on the transformer architecture exhibit remarkably similar trends in these experiments, despite having different vocabulary sizes and hyperparameters: both in their generally close fit to the natural language type-token distribution and in their visible fall-off for longer length sequences. The latter observation reveals a deficiency that is seemingly specific to the transformer architecture, one that may be linked to observations in natural language generation tasks. More specifically, we take this as quantitative evidence for recent qualitative observations that when left to generate lots of text, neural language models based on the transformer architecture tend to babble repetitively (Holtzman et al., 2020; Cohen and Beck, 2019; Eikema and Aziz, 2020).
To provide a more mathematically rigorous analysis, we compute KS metrics, again presenting three values: $D_{p_\theta}$, $D_{\hat{p}}$, and $D_p$. In Fig. 4, we can see that model-generated text follows an NHPP parameterized by Heaps' law moderately well ($D_{p_\theta}$); there are larger divergences at the tails of document length. However, most do not follow an NHPP with the same parameters as our test set ($D_{\hat{p}}$). Further, in contrast to rank-frequency, the type-token distribution is more disparate from the empirical natural language distribution than our parameterized ones, as shown by high values of $D_p$. While both transformers exhibit the closest fit for all document lengths, which is in line with our observations in Fig. 3, statistical distance from the natural language distribution for all models and in all settings increases with document length.

Unigram Distribution
Because we do not have a well-established law dictating the form of the natural language unigram distribution, we compare only empirical pmfs from model-generated samples and the test set directly. Further, as the distribution over unigrams is categorical, we employ TVD following §4.2. Our results in Tab. 2 indicate that language models generally capture the unigram distribution quite well. The transformer (AS), which has a closed vocabulary, consistently performs poorly in comparison to other models. While we might speculate this outcome is a result of disparate tails between empirical cdfs (i.e., the part of the distribution over infrequent words, which may have been omitted from the closed vocabulary but could still be generated using BPE), the TVD metric in this setting should generally be robust to tail probabilities. 15 This suggests that BPE (or similar) vocabulary schemes may lead to models that can better fit this natural language distribution.
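A robustness check of the kind mentioned for tail probabilities, namely computing TVD only over the union of the most frequent unigrams of each sample, can be sketched as below. Whether the truncated distributions are renormalized is not stated in the text; this example assumes they are, and the function name `truncated_tvd` is hypothetical.

```python
from collections import Counter

def truncated_tvd(tokens1, tokens2, top_k=1000):
    """TVD between unigram pmfs restricted to the union of each sample's top_k words.

    Assumption: both restricted distributions are renormalized to sum to 1.
    """
    c1, c2 = Counter(tokens1), Counter(tokens2)
    keep = {w for w, _ in c1.most_common(top_k)} | {w for w, _ in c2.most_common(top_k)}
    z1 = sum(c1[w] for w in keep)
    z2 = sum(c2[w] for w in keep)
    return 0.5 * sum(abs(c1[w] / z1 - c2[w] / z2) for w in keep)

model_tokens = list("aaab")
test_tokens = list("aaac")
# Keeping only the single most frequent word hides the tail difference.
tvd_top1 = truncated_tvd(model_tokens, test_tokens, top_k=1)
# Keeping more words exposes the disagreement on the rare words "b" and "c".
tvd_top2 = truncated_tvd(model_tokens, test_tokens, top_k=2)
```

If the truncated and untruncated values nearly coincide on real data, the tail of the distribution contributes little to the reported distance.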

Length, Stopwords and Symbols
Similarly to the unigram distribution, for length, stopwords, and symbols, we compare solely empirical cdfs. We use the set of English stopwords defined by NLTK (Bird et al., 2009). We define the set of symbols as tokens consisting solely of punctuation and numerical values. Our results in Tab. 3 demonstrate that our language models, at least when using random and nucleus sampling, mimic these natural language distributions quite well. Notably, text generated from an LSTM using random sampling follows all three distributions the closest of any model, suggesting LSTMs may have an inductive bias that is helpful for capturing these distributions. On the other hand, using beam sampling leads to strong divergence from natural language distributions across the board. Results for differences in distribution means in the permutation testing framework can be found in App. D.
With respect to the length distribution, these results are perhaps surprising: the local-normalization scheme used by the majority of language generation models (and by those in these experiments) has been claimed to result in models that favor shorter than typical sequences (Sountsov and Sarawagi, 2016; Murray and Chiang, 2018). The results in Tab. 3 and Fig. 5 suggest otherwise. Specifically, we see that our models fit the natural language length distribution of our corpus quite closely, in terms of both overall distributions and means (see App. D). Rather, it appears that the generation strategy may be the cause of prior observations. This finding raises further questions: since models capture the length distribution well, is a language model more likely to produce degenerate text (e.g., repetitions) than the EOS token if only long documents are used in training? We posit that corpus preprocessing should perhaps be more carefully considered in light of these results.
15 We observe this empirically; calculating TVD between distributions truncated to the (union of the) first 1000 ranked unigrams leads to almost the exact same result.

Consistent Trends
Across results, we observe that text generated using the nucleus sampling decoding scheme often aligns with natural language more closely than text produced using other generation strategies. This suggests that nucleus sampling performs a helpful alteration to a standard distribution learned via MLE, which may in turn provide motivation for recent efforts to employ truncated or sparse probability distributions directly at training time, e.g., truncated loss (Kang and Hashimoto, 2020) or $\alpha$-entmax loss (Peters et al., 2019).
We additionally observe large discrepancies in both §5.1 and §5.2 between the results when using empirical natural language cdfs vs. parametric ones. We take this as a warning that assumptions about the forms of linguistic distributions-such as the ones employed by challenge tasks in probing-can have significant effects on results.

Related Work
In the last few years, a number of works have extended language model analysis beyond simple evaluation metrics, like perplexity, in order to understand what attributes of human language these models are learning. Some use task-based approaches, i.e., they design a set of tasks that require a specific subset of linguistic knowledge then evaluate model performance on these tasks (Linzen et al., 2016; Gulordava et al., 2018; Jiang et al., 2020, inter alia). Others use model-based approaches, where a separate model is trained to perform some auxiliary task on representations learned by the model under test (Blevins et al., 2018; Giulianelli et al., 2018; Sorodoc et al., 2020, inter alia). We direct readers to Belinkov and Glass (2019) for a full survey of probing methods.
Table 3: KS metrics ($D_p$) between empirical length, stopword, and symbol distributions of test set and model generated text.
These approaches have drawbacks; for example, introducing a secondary model to determine what the original model has learned presents confounding factors (Hewitt and Liang, 2019). The designing of auxiliary tasks for assessing linguistic knowledge requires large manual effort and lends itself to implicit bias about how linguistic phenomena should manifest. In contrast, our work allows us to take a hands-off approach to analyzing language models. We see the benefit of this in §5, where our results without an assumed model of statistical tendencies give us a much different sense of which empirical properties of human-generated text our models have learned.
Our work is closest to that of Tanaka-Ishii (2017, 2019) who use model generated text to visually analyze whether language models reflect well-established statistical tendencies. In contrast, our work provides a quantitative framework, along with appropriate significance tests, for evaluating distribution fits. We additionally assess the fit of language models to our test set directly, rather than solely to established laws. Further, our analysis includes different generation strategies, multiple neural architectures, and a wider variety of empirical language distributions.

Conclusion and Future Directions
In this work, we present a framework for determining the linguistic properties learned by language models through analysis of statistical trends in generated text. We find that neural language models accurately capture only a subset of natural language distributions and that this subset is highly dependent on both model architecture and generation strategy; no one configuration stands out as capturing all linguistic distributions. Ultimately, we see this analysis framework as a means for a more fine-grained evaluation of language models than perplexity alone can provide. Uncovering which linguistic properties language models have learned, and which they have not, should help us to understand both the inductive biases of various models and via which avenues they can still be improved.
There are a number of important axes of variation that this work does not explore: perhaps most importantly, our results are limited to a single corpus in the English language. A cross-linguistic analysis may reveal whether different model architectures exhibit inductive biases compatible with different languages; observing how these metrics change as a function of corpus size would have implications about the effects of data availability. An exploration of the correlation of these metrics with other quantifications of model performance, such as perplexity or a model's ability to capture sentence-level phenomena, may help us understand how comprehensive other evaluation metrics are. We leave these analyses as future work.