Model Criticism for Long-Form Text Generation

Language models have demonstrated the ability to generate highly fluent text; however, it remains unclear whether their output retains coherent high-level structure (e.g., story progression). Here, we propose to apply a statistical tool, model criticism in latent space, to evaluate the high-level structure of the generated text. Model criticism compares the distributions between real and generated data in a latent space obtained according to an assumptive generative process. Different generative processes identify specific failure modes of the underlying model. We perform experiments on three representative aspects of high-level discourse—coherence, coreference, and topicality—and find that transformer-based language models are able to capture topical structures but have a harder time maintaining structural coherence or modeling coreference.


Introduction
It is now broadly accepted that neural language models can consistently generate fluent text (Radford et al., 2019; Shoeybi et al., 2019; Brown et al., 2020; Smith et al., 2022). Yet, while large language models make few local word-level errors, human studies have shown that they still often make "high-level" errors such as incoherence, self-contradiction, and off-topic generation (Dou et al., 2022). We hypothesize that researchers have focused on local fluency partly because it is easy to evaluate automatically through metrics such as perplexity and n-gram matching. Automatic assessment of high-level text generation quality has received less attention, partly because no single general-purpose metric exists.
This work takes a step toward the automatic evaluation of the high-level structure of generated text by applying a tool from statistics, model criticism in latent space (Dey et al., 1998; Seth et al., 2019). Under this approach, we first project data to a latent space based on an assumptive generative process, and then compare the implied latent distributions between real data and language model samples. This approach unifies past work on evaluating text generation under a single framework, including existing dimensionality reduction techniques such as probabilistic PCA (Wold et al., 1987), as well as previous applications of model criticism that were restricted to topic models (Mimno and Blei, 2011).
By making different assumptions in the underlying generative process, model criticism in latent space identifies specific failure modes of the generated language. We demonstrate this on three representative high-level properties of generated discourse: coherence (Barzilay and Lapata, 2005), coreference (Chomsky, 1993), and topicality (Blei and Lafferty, 2006), as well as on a synthetic dataset for which the true data generating process is known.
Experiments using our proposed framework enable us to make four observations about modern language models. First, we find that it is possible for a model to achieve strong word-level perplexity yet fail to capture longer-term dynamics. Second, we find that transformer language models perform poorly in terms of coherence, in line with previous observations (Dou et al., 2022; Sun et al., 2021; Krishna et al., 2022; Sun et al., 2022), particularly when they do not have access to explicit lexical markers in the context. Third, we show that transformer language models do not model coreference structures well. Last, we show that transformer language models can capture topical correlations (Blei and Lafferty, 2006). All results, data, and code are publicly available at https://github.com/da03/criticize_text_generation.

Model Criticism in Latent Space
Figure 1: Illustration of applying model criticism in latent space to evaluate discourse coherence. Instead of word-level errors, we identify improper high-level section transitions (those that are rare in real data), as marked by red crosses. The article shown is generated by GPT-2 finetuned on WIKI. See Section 4 for more explanations.

Model criticism (O'Hagan, 2003) quantifies the relationship between a data distribution P_data(x) and a model P_model(x) by comparing statistics over
these two distributions. While model criticism can be applied in the observation space, in many applications we are interested in "higher-level" aspects of the data, such as the underlying topics of a document (Mimno and Blei, 2011) or the latent factors of an image (Seth et al., 2019). Model criticism in latent space (Dey et al., 1998; Seth et al., 2019) lifts the criticism approach to a latent space in order to compute higher-level comparative statistics. How do we critique latent properties of arbitrary, and perhaps unknown, distributions? For example, given a language model, how do we know how well it captures section transitions at the discourse level (Figure 1)? Lacking access to the generative process, we introduce a critic generative process P_c with latent variables z ∈ Z and observations x ∈ X: x ∼ P_c(x|z).
Based on this generative process, the posterior distribution P_c(z|x) projects x to the latent space. For a single data point x, we can evaluate the negative log-likelihood of the projected latent variables z ∼ P_c(z|x) under the prior P_c: T_c(x) := −E_{z∼P_c(z|x)} log P_c(z) = H(P_c(z|x), P_c(z)), where H(p, q) := −E_p log q denotes the cross-entropy between two distributions p and q.¹ This process is illustrated in Figure 2.
Given an arbitrary distribution P_x over x, we can take an expected negative log-likelihood, T_c(P_x) := −E_{x∼P_x(x)} E_{z∼P_c(z|x)} log P_c(z).
We term T_c(P_x) the Latent NLL.² This value is the cross-entropy between the aggregated posterior distribution and the prior distribution of z: T_c(P_x) = H(E_{x∼P_x(x)} P_c(z|x), P_c(z)).

¹ We discuss the difference between being likely in the latent space versus the observed space in Appendix A.
² When z is the same as x (P_c(z|x) = 1[z = x]), Latent NLL is the same as the negative log-likelihood of the language model samples under the data distribution (Zhao et al., 2018).
In practice, we cannot compute T_c(P_x) analytically due to the two expectations E_{x∼P_x(x)} and E_{z∼P_c(z|x)}, but we can approximate both using Monte-Carlo sampling.
When z is a sequence of M discrete states, we define a metric Latent PPL analogous to perplexity: Latent PPL := exp(T_c(P_x) / M). With a critic chosen, we can compare P_data(x) and P_model(x) in the latent space by estimating and comparing T_c(P_data) and T_c(P_model). Similar to a two-sample test (Hotelling, 1951), when P_data and P_model are the same, the two statistics will also stay close. Furthermore, with a powerful critic, T_c(P_model) is meaningful by itself: a higher value means that model generations are less likely in the latent space, whereas a lower value implies that samples match the critic along the latent projection.³ The approach can also be applied to individual points, T_c(x), to identify outliers.
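Both expectations above are approximated with Monte-Carlo samples. The following is a minimal sketch of the estimator; the function names and toy samplers are illustrative, not from the paper:

```python
import math

def latent_nll(sample_x, sample_z_given_x, log_prior, n_docs=1000, n_z=10):
    """Monte-Carlo estimate of T_c(P_x) = -E_{x~P_x} E_{z~P_c(z|x)} log P_c(z).

    sample_x: draws a document from P_x (real data or a language model).
    sample_z_given_x: draws a latent projection z from the critic posterior P_c(z|x).
    log_prior: evaluates log P_c(z) under the critic prior.
    """
    total = 0.0
    for _ in range(n_docs):
        x = sample_x()
        for _ in range(n_z):
            z = sample_z_given_x(x)
            total += -log_prior(z)
    return total / (n_docs * n_z)

def latent_ppl(nll, num_states):
    """Latent PPL = exp(Latent NLL / M) when z is a sequence of M discrete states."""
    return math.exp(nll / num_states)
```

Estimating the statistic separately on real documents and on model samples then gives the two quantities T_c(P_data) and T_c(P_model) to compare.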
How to select the critic P_c  Choosing the critic P_c is obvious only when we know the true latent variables and the generative process of the data. In other cases, it depends on the data properties of interest. For example, if we want to criticize the topicality of text, we can use a topic model (Blei et al., 2003) to induce a latent space over topics. Note that the selected critic P_c may underperform P_model as a model of x while still providing useful latent structure. By criticizing strong models using simpler latent models designed to capture a particular aspect of text as P_c, we provide a sanity check for the stronger model along a specific target axis. This property motivates the use of this approach with powerful yet opaque models. The likelihood of z is evaluated using P_c(z) to measure how likely the samples are in the latent space.

A Surprising Text Generation Failure
As a preliminary experiment, we show a language model with strong word-level perplexity that fails to capture simple long-term dynamics, as demonstrated by model criticism. We assume that P_data is known and follows a basic pattern. It has a latent high-level sequence of M = 50 discrete states z_1, z_2, ..., z_M, where each state can take one of 256 possible values. These states are generated from a transition distribution P(z_m | z_{m-1}). At the observation level, each latent state z_m generates a sub-sequence of words x_{m,1}, x_{m,2}, ..., x_{m,N_m} conditioned on z_m from an emission distribution P(x_{m,1}, ..., x_{m,N_m} | z_m). We also restrict the model so that each sub-sequence can only come from one latent state. The observed sequence is the concatenation of all sub-sequences. The joint distribution of the latent states and the tokens is: P(x, z) = ∏_{m=1}^{M} P(z_m | z_{m-1}) P(x_{m,1}, ..., x_{m,N_m} | z_m). With this generative process, we sample a dataset.⁵ We apply a transformer language model as P_model and train it on this dataset. Given the simplicity of the generative process and the small vocabulary size, we expect this model to do quite well. And in fact we do see that the model achieves a strong perplexity of 2.28, which nearly matches the true P_data perplexity of 1.99. Model criticism gives a different method for quantifying model fit. Since the true data generating process is known, we can directly use P_c = P_data as the critic to induce the latent space.
To project an observation x to the latent space, we need to perform posterior inference P_c(z|x). By construction, this mapping is deterministic, since each sub-sequence comes from a unique latent state (see Appendix D for details). We then apply model criticism T_c by sampling a sequence of transformer outputs, mapping them to a sequence of latent states, counting to compute the aggregated posterior, and then comparing to the known prior. This process is shown in Figure 3.
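The criticism step can be sketched as follows, assuming we already have the inferred state sequences and the critic's known initial and transition distributions (all names here are illustrative):

```python
import math

def latent_ppl_of_state_sequences(state_seqs, trans, init):
    """Score inferred latent state sequences under the known critic prior
    P_c(z) = init[z_1] * prod_m trans[z_{m-1}][z_m], then exponentiate the
    per-state NLL to obtain Latent PPL."""
    total_nll, total_states = 0.0, 0
    for seq in state_seqs:
        logp = math.log(init[seq[0]])
        for prev, cur in zip(seq, seq[1:]):
            logp += math.log(trans[prev][cur])
        total_nll += -logp
        total_states += len(seq)
    return math.exp(total_nll / total_states)
```

A model whose generations follow the true transition dynamics scores a low Latent PPL here even if its word-level perplexity is unremarkable, and vice versa.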
Table 1 presents the results. Surprisingly, the transformer gets a much worse Latent PPL than a hidden semi-Markov model (HSMM, the true model class) fit to the data (64.80 vs. 47.24), the latter being near-optimal. This result implies that even though the transformer is nearly as good at predicting the next word in the sequence, it has not learned the higher-level transition structure. Producing reasonable estimates of the next token evidently does not reflect an ability to capture the longer-range dynamics of this system.

Motivation  Given this result, we ask whether similar issues are present in language models applied in more realistic scenarios. We therefore turn to experiments that consider model criticism for long-form generation, and ask whether language models capture properties of discourse coherence (Section 4), coreference (Section 5), and topicality (Section 6).

Critiquing Discourse Coherence
Text generation from large language models rarely exhibits local fluency errors, but there is evidence of failures like those in the previous section (Dou et al., 2022; Sun et al., 2021; Krishna et al., 2022; Sun et al., 2022). In this section, we apply model criticism to assess the discourse coherence (Barzilay and Lapata, 2005) of large LMs. We study this through an experiment on generating long-form documents divided into explicit sections. While we do not know the true data generating process, knowing the distribution of section types allows us to assess the latent structure of LM generations. Figure 1 illustrates the experiment. Here, an LM generates an article. Each word transition is fluent, but the system makes two section transition errors: first, it generates two sections of type "background"; second, it generates a section of type "personal life" following the last "background" section, with both transitions being unlikely in the data.⁶ We aim to separate the evaluation of these high-level coherence errors from word-level errors.
To apply model criticism, we posit a simple critic generative process to capture the section changes. We adapt a hidden semi-Markov model (HSMM), which is commonly used to represent segmentations of this form. Specifically, the high-level latent variables z_1, ..., z_M model transitions among section types, and the bottom level generates text conditioned on the current section type:⁷ P(x, z) = ∏_{m=1}^{M} P(z_m | z_{m-1}) P(x_m | z_m), where x_m denotes the text of the m-th section. We can then evaluate on datasets with known (ground-truth) section titles and use these section titles as z. We use three English datasets, PUBMED, ARXIV, and WIKI (Cohan et al., 2018).⁸ We compare two language modeling settings, one trained with all section titles removed ("W/O Title") and one with section titles before each section ("W/ Title"), since we hypothesize that the existence of explicit section type markers might help the model learn the dynamics, inspired by Nye et al. (2021) and Wei et al. (2022b). Sections are separated by a special marker, and a special end-of-sequence symbol marks the end of the generation. Since all three datasets are relatively small (especially considering that we use them to generate entire articles), we leverage the pretrained language models GPT-2 small (LM_1) (Radford et al., 2019) and GPT-Neo small (LM_2) (Black et al., 2021), the latter trained on a more diverse dataset (Gao et al., 2020). We finetune these LMs as P_model.
To generate, we sample from the language model until we hit the end-of-sequence symbol.No tempering/truncation (Holtzman et al., 2019) is used during sampling, since we are more interested in the learned distribution rather than its mode here.For the "W/ Title" setting, we discard the generated section titles in a postprocessing step.
To infer the section types for a generated article, we need approximate posterior inference to compute T_c. We make the simplifying assumption that the posterior section title of each section only depends on its corresponding text: P_c(z_m | x) ≈ P_c(z_m | x_m), which we approximate with a trained section-type classifier (the classifier is mostly over 90% certain about its predictions). More details can be found in Appendix E.
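Under this simplifying assumption, inference reduces to classifying each section independently. A sketch, assuming a per-section classifier `classify` that returns a distribution over section types (the classifier itself is not shown and its name is illustrative):

```python
def infer_section_types(article_sections, classify):
    """Approximate posterior inference under the simplifying assumption
    P_c(z_m | x) ~ P_c(z_m | x_m): classify each section's text independently.
    `classify` maps a section's text to a dict {section_type: probability};
    we keep the argmax since the posterior is near-deterministic in practice."""
    types = []
    for text in article_sections:
        probs = classify(text)
        types.append(max(probs, key=probs.get))
    return types
```

The resulting sequence of section types is then scored under the HSMM transition prior to compute Latent PPL.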
Results  Table 2 gives results for the coherence experiments. We first note that both models have strong word-level perplexity across datasets, with LM_2 doing better on two of the three datasets. We also note that removing titles has a negligible impact on the perplexity of the models. However, Latent PPL tells a different story. We find that LM_1 greatly outperforms LM_2 when criticizing with respect to the latent sections.⁹ It is also interesting that transformer LMs are sensitive to title words being explicitly included in the training data (i.e., the W/ Title setting). For example, LM_1 W/ Title gets a Latent PPL of 6.72 on ARXIV, whereas LM_1 W/O Title gets a Latent PPL of 9.52, despite the two having very close word-level PPLs (13.94 vs. 14.13). These observations indicate that, lacking explicit markers, the tested transformer LMs do not learn the long-term dynamics necessary for discourse coherence.
Using explicit section topic markers might serve a similar function to chain-of-thought prompting in language-model-based question answering (Wei et al., 2022b). One concern is that the difference between W/ Title and W/O Title is a side effect of language models having a limited context window (1024 tokens for LM_1 and 2048 for LM_2), since two adjacent sections might not fit within the context window (whereas one section plus the next section title is more likely to fit). To check whether this is the case, we filter WIKI to only include articles with maximum section length 500 to form a new dataset, WIKI-SHORT. In this dataset, any two adjacent sections fit within the context window of both LM_1 and LM_2. Table 3 shows that even in this case W/ Title still outperforms W/O Title, indicating that the difference is not due to the limited context window size.
Figure 4 visualizes the section transition errors made by LM_1 (W/O Title) for the most common section types on WIKI. We find that the language model tends to generate the same section topic repeatedly, although there are other transition errors as well. More detailed error analysis can be found in Appendix E.

A natural question is whether increasing model size improves coherence. To this end, in addition to GPT-2 small (GPT-2 S, a.k.a. LM_1, 117M parameters), we apply model criticism to GPT-2 medium (GPT-2 M, 345M parameters), GPT-2 large (GPT-2 L, 742M parameters), and the full GPT-2 (GPT-2 XL, 1.5B parameters) on WIKI. The results are summarized in Table 5. We see that increasing model size improves PPL but not Latent PPL.

Critiquing Coreference Chains
Coreference tracks how multiple mention expressions are used to refer to the same underlying entity (Karttunen, 1969; Gordon and Hendrick, 1998). While coreference represents a ubiquitous and important discourse-level phenomenon (Jurafsky and Martin, 1999; Kunz and Hardmeier, 2019), there is evidence that large neural language models make elementary coreference mistakes (Pagnoni et al., 2021), such as referring to non-existent discourse entities (Schuster and Linzen, 2022).
Figure 5: Critiquing coreference chains on a sample from LM_1. We first extract entity mentions from the text and only keep the genders of proper nouns to form z, then a 5-gram P_c is used to score z. [·]_i denotes a mention with entity id i; "." marks sentence boundaries. Original text: "... [Lisa]_0 runs off to find [him]_1 and [they]_2 kiss passionately. Afterwards, [Josh]_1 tells [her]_0 the reason why [he]_1's going to [their]_2 first gig, and that [Lisa]_0 is going to do it, too..."

In this experiment, we compare the coreference chain (Jurafsky and Martin, 1999) distributions between real data and LM generations. A coreference chain consists of a sequence of coreferent mentions. To simplify the representation, we use gender features to replace non-pronominal tokens, as illustrated in Figure 5.¹⁰ Presumably these latent chains should be similar in generated text and in real data.
For the critic P_c, we use a 5-gram language model with Kneser-Ney smoothing (Ney et al., 1994) over chains. To infer z, we use an off-the-shelf coreference resolution tool.¹¹ To avoid data sparsity issues, we relabel entity clusters within each n-gram. We apply model criticism to compare real data and LMs trained on WIKI (W/ Title), after filtering the data to only consider articles about films, since they contain richer reference structures.

Results
Table 7 shows the Latent PPLs on real data and LM generations. We can see that in general there is a mismatch between the coreference distributions. Interestingly, while the LM_1 models outperformed the LM_2 models on discourse coherence, for this task the LM_2 models are better.
Table 6 shows the 10 coreference-chain n-grams that contributed most to this difference. Some are intuitively implausible: in the fourth row, [His]_1 does not have a local antecedent; in the second-to-last row, [himself]_2 also does not have a local antecedent. Others are rare but possible: in the last row, a proper noun [Male]_0 is used after a pronoun [his]_0 has already referred to the same entity in the same sentence.¹² The learned critic P_c can also be used to identify unlikely coreference chains, as shown in Table 15 in Appendix G. Appendix G also contains more qualitative examples and analyses.
Lastly, we evaluate whether scaling model size improves coreference modeling. The results are summarized in Table 8. We see that increasing model size does not improve Latent PPL, similar to our observations when critiquing discourse coherence.

Critiquing Topic Correlations
Topical structure is another important aspect of long-form document generation (Serrano et al., 2009). Certain topics are more likely to appear together: for example, a document containing a topic related to "poets" is more likely to also contain one related to "publisher" than one related to "football". A text generation model should capture these topical relations. For this experiment, we again sample documents from the trained language model P_model. Specifically, we utilize the transformer-based LMs trained on the datasets in Section 4 (W/O Title).
To explore the topical structure in the generated documents, we need a critic P_c. While LDA (Blei et al., 2003) is the most commonly used generative process for topic modeling, its Dirichlet prior does not explicitly model topic correlations in documents. We therefore use the correlated topic model (CTM), which is specifically designed to model topical structure (Blei and Lafferty, 2006). Model criticism then compares the latent space of the real data with that of the generated texts.
For each document, a CTM with M topics first generates a topic coefficient latent variable z ∈ R^M from a multivariate Gaussian distribution P_c(z) := N(z; µ, Σ).
Each coefficient of z can be interpreted as the "strength" of a topic in a document, so the covariance matrix Σ captures the correlations among different topics. These weights z are then normalized using a softmax function, the result of which parameterizes the distribution over the topic t_n for the n-th word. Each topic t_n induces a categorical distribution over word types P(x_n | t_n) = φ_{t_n, x_n}, where φ_{ij} parameterizes the probability of emitting word type j conditioned on topic i. The joint probability of a document with N words is: P(z, t, x) = P_c(z) ∏_{n=1}^{N} softmax(z)_{t_n} φ_{t_n, x_n}. Since we are only interested in criticizing the document-level z, we marginalize out the topic assignments of individual words: P(z, x) = P_c(z) ∏_{n=1}^{N} Σ_{i=1}^{M} softmax(z)_i φ_{i, x_n}. To fit this generative process to data, we use variational inference and maximize the ELBO following Blei and Lafferty (2006). We set M to 100. Since analytical posterior inference is intractable, we also use variational inference to estimate P_c(z|x).
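Given posterior samples of z, the Latent NLL under this critic reduces to scoring them under the Gaussian prior. A sketch using NumPy (the variational inference that produces the z samples is not shown; function names are illustrative):

```python
import numpy as np

def gaussian_log_density(z, mu, Sigma):
    """log N(z; mu, Sigma) for the CTM topic-coefficient prior P_c(z)."""
    M = len(mu)
    diff = z - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (M * np.log(2 * np.pi) + logdet + quad)

def latent_nll_ctm(z_samples, mu, Sigma):
    """Latent NLL: average -log P_c(z) over posterior samples of z,
    one (or more) per document."""
    return -np.mean([gaussian_log_density(z, mu, Sigma) for z in z_samples])
```

Comparing this quantity for z inferred from real documents versus z inferred from model generations is the criticism step for topicality.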
Results  Table 9 shows the main results. The Latent NLLs of LM generations and real data are close on all three datasets (there are outlier pathological generations that we can identify using T_c(x), as shown in Appendix F). In Figure 6, we visualize and compare the covariance matrices of the aggregated posterior distributions of LM generations and real data, and find that transformers are able to model the correlations among topics well.
These results indicate that topic correlation is well represented in text generation systems, and is likely an easier task to model than ordered coherence.

Related Work
Text Generation Evaluation  Traditional evaluation metrics include perplexity and n-gram overlap metrics for translation-type problems, such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Lavie and Agarwal, 2007), and NIST (Martin and Przybocki, 2000). In recent years, with the emergence of neural models that learn contextual representations (Devlin et al., 2019; Liu et al., 2019), researchers have proposed projecting text to contextual representations and computing distances in this space (Zhang et al., 2019; Zhao et al., 2019; Pillutla et al., 2021). The closest work to ours is Eikema and Aziz (2020), which evaluates different decoding strategies in machine translation by comparing statistics of the produced text. While these past works mainly concern word-level string/meaning representation matching, the goal of our work is to check high-level aspects of the generated text such as coherence. Moreover, word-level matching is not suitable for evaluating open-ended generation tasks due to the existence of too many plausible references (Celikyilmaz et al., 2020), while our work projects text to a more manageable lower-dimensional latent space to make the evaluation of open-ended generation feasible.

Evaluation of Long-Form Text
There is a long line of research evaluating the discourse coherence of text (Grosz et al., 1995; Poesio et al., 2004; Barzilay and Lapata, 2005; Lai and Tetreault, 2018; Logeswaran et al., 2018; Persing et al., 2010). Most learn a predictor that maps features such as the distribution of entities (Barzilay and Lapata, 2005) or the transitions of topics (Persing et al., 2010) to manually labeled coherence scores. Our work differs in two important ways: first, we unify the evaluation of different high-level aspects of text using the formalism of model criticism; second, we do not assume any annotated coherence scores; we only specify a generative process in order to project text to a latent space for the comparison between machine-generated text and real text. Recently, there have been works targeting the evaluation of discourse-level coherence, such as BARTScore (Yuan et al., 2021) and DiscoScore (Zhao et al., 2022). These methods presume either a conditional generation setting or require textual references. We also note that model criticism does not use a generic neural representation, but focuses on a latent space induced by an explicit generative process.

Model Criticism  Model criticism is similar to a two-sample test (Hotelling, 1951) in that it computes and compares statistics on the real data and on the samples to determine whether they are close enough. While the statistics may be directly computed in the observation space, in many applications we are interested in criticizing latent aspects of the data such as topics (Mimno and Blei, 2011) or latent factors (Seth et al., 2019). Model criticism has been studied extensively in the statistics literature (Brant, 1988; Dey et al., 1995; Weiss, 1995; Dey et al., 1998; O'Hagan, 2003; Seth et al., 2019).
Recently, Barkhof and Aziz (2022) proposed using model criticism to evaluate VAEs. Model criticism in latent space forms the basis of our work, with two major differences: first, we apply model criticism to models with point estimates of parameters, such as commonly used neural language models, instead of models with uncertainty over their parameters. Second, we allow the generative model that induces the latent space to differ from the model we criticize. By separating the model to be criticized from the generative process used for projecting data to the latent space, our approach allows criticizing different views of the data depending on user needs, and criticizing generative models without any latent variables, such as neural language models. For qualitative analysis and outlier identification, our work applies visual posterior predictive checks (Gabry et al., 2019; Gelman, 1997), a graphical version of model criticism.

Limitations
One limitation of the proposed approach is its reliance on choosing a critic generative process P_c, which presumes some knowledge of the true data generating process. An improperly specified critic does not expose the latent space that we intend to criticize. However, since we compare statistics between real data and model generations (similar to two-sample tests), for a good model the statistics should be close even under improper critics. Another limitation is that observing no differences does not imply that the model generations conform to the unknown data distribution; it simply means that they are close with regard to the latent aspects that we criticize (O'Hagan, 2003).
Recently, researchers have found that certain capabilities, such as reasoning under augmented prompts, only emerge in large LMs beyond tens of billions of parameters (Wei et al., 2022a). Since the largest LM tested in this paper has only 1.5 billion parameters, future work is required to investigate whether the high-level issues observed here can be solved by further scaling model size.

Conclusions
We consider the problem of evaluating long-form text generation for specific discourse properties. We propose a statistical tool, model criticism in latent space, which projects text to a latent space based on an assumptive generative process and compares the implied latent distributions. Different critic generative processes focus on different properties of the data. We apply this tool to analyze three representative document properties: coherence, coreference, and topicality, using transformer-based language models. Experiments find that while transformer LMs can capture topical structures well, they are not currently strong at modeling discourse coherence without explicit markers, or at modeling coreference.

Ethical Considerations
In our experiment of critiquing coreference chains, we used a gender binary (Hyde et al., 2019) to categorize proper nouns, but there are many individuals who do not adhere to the gender binary that this simple categorization fails to consider (Bamman et al., 2014). The reason for the gender binary is primarily that personal pronouns are typically gendered in English, which makes the qualitative and statistical analysis clearer. For example, one coreference error detected by the approach is using pronouns of different genders to refer to the same person, as shown in Appendix G. In Appendix G, we describe the exact procedure through which the genders of proper nouns are determined, to make explicit what our "gender" definition is (Larson, 2017). Going forward, exploring other features of proper nouns, such as their syntactic features (Shieber and Tao, 2003), to replace gender assignments might further mitigate this concern.

B The Optimal Prior is the Aggregated Posterior

We show that the prior P_c(z) maximizing the data likelihood is the aggregated posterior distribution under the data distribution (P_agg(z) := E_{x∼P_data(x)} P_c(z|x)).
To find the optimal P_c(z) that maximizes the data likelihood, we use the decomposition from Appendix A, log P_c(x) = E_{P_c(z|x)} log P_c(z) + E_{P_c(z|x)} log P_c(x|z) + H(P_c(z|x)), and take the expectation of both sides w.r.t. P_data(x): E_{P_data(x)} log P_c(x) = −KL(P_agg(z) || P_c(z)) − H(P_agg(z)) + E_{P_data(x)} E_{P_c(z|x)} log P_c(x|z) + E_{P_data(x)} H(P_c(z|x)). On the right-hand side, the only term containing P_c(z) is the first term, −KL(P_agg(z) || P_c(z)). Therefore, the optimal P_c(z) that maximizes the likelihood of the data is P_agg(z), although the optimization algorithm is not guaranteed to find this optimum.
At its optimum, P_c(z) equals the aggregated posterior distribution P_agg(z), in which case T_c(P_x) can also be interpreted as the cross-entropy between the aggregated posterior under model generations (E_{x∼P_x(x)} P_c(z|x)) and the aggregated posterior under the real data distribution (E_{x∼P_data(x)} P_c(z|x)): T_c(P_x) = H(E_{x∼P_x(x)} P_c(z|x), E_{x∼P_data(x)} P_c(z|x)).
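This optimality is simply the fact that the cross-entropy H(p, q) over a discrete latent space is minimized at q = p, where it equals the entropy of p. A small numeric check (the distributions are illustrative):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_z p(z) log q(z) over a discrete latent space."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# An aggregated posterior over a 3-state latent space (illustrative numbers).
p_agg = [0.5, 0.3, 0.2]
# Any other prior incurs extra cross-entropy: the gap is exactly KL(p_agg || q).
alt_prior = [0.4, 0.4, 0.2]
assert cross_entropy(p_agg, p_agg) <= cross_entropy(p_agg, alt_prior)
```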

C Detection of Code-Mixing
In this section, we show that model criticism can generalize some previous high-level evaluation metrics. In particular, we replicate the machine translation experiment of Zhou et al. (2019) under our framework. In this experiment, an English sentence may be translated into Spanish, German, or French, but never into a mix of different languages. Therefore, one failure mode of a model is generating text that contains code-mixing.
To criticize the existence of code-mixing, we need a model of how languages mix within a document. LDA (Blei et al., 2003) is suitable for this purpose, as each language is analogous to a topic, and the document-topic latent variable parameterizes how topics (languages) are mixed in a document.

Table 10: Learned topics.
Topic 1 (German): die der und in zu den von für dass ist wir des nicht auf das eine werden es im auch
Topic 2 (Spanish): de la que en y el a los las del se una para un por no con es al sobre
Topic 3 (French): de la et des à les le que en ' du dans nous pour qui une un est au pas
In LDA, each document is associated with a topic coefficient latent variable z ∈ ∆(N), where N = {1, 2, ..., M} is a set of M topics and ∆(N) is the probability simplex over these topics, such that z can be used to parameterize a categorical distribution. The prior over z is modeled using a Dirichlet distribution with parameters α: P(z) := Dirichlet(z; α).
The document-topic coefficient latent variable z defines a categorical distribution over the topic t_n for each word x_n in the document, and each topic in turn induces a categorical distribution over word types P(x_n | t_n) = φ_{t_n, x_n}, where φ_{ij} parameterizes the probability of observing word type j conditioned on topic i, so the joint probability of topics and words for a document with N words is: P(z, t, x) = P(z) ∏_{n=1}^{N} z_{t_n} φ_{t_n, x_n}. Since we are only interested in criticizing the document-topic coefficient latent variable z, we marginalize out the topic assignments of each word. Assuming there are M topics, the marginal distribution is: P(z, x) = P(z) ∏_{n=1}^{N} Σ_{i=1}^{M} z_i φ_{i, x_n}. To fit this generative process to data, we set M = 3 (since there are three target languages). We treat φ as a latent variable with a Dirichlet prior, and then use collapsed Gibbs sampling to sample topic assignments t from P(t|x) (both z and φ are collapsed). β is fixed at 0.01, α is optimized every 100 iterations with initial value 1/M, and we use the MAP of P(φ|t, x) as a point estimate φ*.¹⁴ For posterior inference, we use a two-stage sampling approach: since P(z, t|x; φ*) = P(t|x; φ*) P(z|t, x; φ*), we again apply collapsed Gibbs sampling to sample from P(t|x; φ*) first; sampling from P(z|t, x; φ*) is then trivial, since P(z|t, x; φ*) = P(z|t; φ*) = Dirichlet(z; α′), where α′_i = α_i + n_i and n_i is the number of words assigned to topic i.

We evaluate two probabilistic formulations of transformer LMs in terms of code-mixing. The first model is an autoregressive LM, which assumes that each word depends on all previous words; the second is a non-autoregressive LM, which assumes that different words are generated independently of each other (Gu et al., 2018). We train both LMs on the same English-German/Spanish/French dataset as in Zhou et al. (2019).¹⁵

Model Settings  For both autoregressive and non-autoregressive LMs, we use a transformer with 6 layers, 8 attention heads, model dimension 512, and hidden dimension 2048.
16The autoregressive LM has 64.80M parameters, and the non-autoregressive LM has 65.98M parameters.Training takes about 16 hours on a single Nvidia A100 GPU.
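The collapsed Gibbs fitting procedure described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the function name, the scalar α and β hyperparameters, and the use of the final sample as a point estimate of φ are simplifying assumptions of this sketch.

```python
import numpy as np

def collapsed_gibbs_lda(docs, M, V, alpha, beta, n_iters=200, seed=0):
    """One-chain collapsed Gibbs sampler for LDA topic assignments.

    docs: list of lists of word ids in [0, V). M topics, V word types.
    Both z (document-topic weights) and phi (topic-word matrix) are
    collapsed; we only sample the per-token topic assignments t.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), M))  # document-topic counts
    n_kw = np.zeros((M, V))          # topic-word counts
    n_k = np.zeros(M)                # topic totals
    t = [rng.integers(M, size=len(d)) for d in docs]  # random init
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_dk[d, t[d][i]] += 1; n_kw[t[d][i], w] += 1; n_k[t[d][i]] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = t[d][i]  # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # p(t_i = k | rest) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(M, p=p / p.sum())
                t[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    # Smoothed point estimate of phi from the final assignments.
    phi = (n_kw + beta) / (n_k[:, None] + V * beta)
    return t, phi
```

Each row of the returned φ is a normalized distribution over word types, so the per-document marginal P(x|z) above can be evaluated directly from it.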

Results
Table 10 shows the learned topics, which largely correspond to the three underlying languages. Figure 7 visualizes samples from the posterior P(z|x). For both the ground-truth data and the autoregressive LM generations, the posterior concentrates at the corners of the simplex (which is why fewer distinct points are visible), indicating that each translation mostly uses a single topic (underlying language). In contrast, the posterior for the non-autoregressive LM generations is dispersed, indicating that the model is unable to fully commit to a single topic during generation due to its strong independence assumption. This reproduces the finding of Zhou et al. (2019) without relying on external lexicons.

D Details of "A Surprising Text Generation Failure"

Data We set M to 50, |Z| to 256, and V to the set of upper- and lower-case letters plus a special end-of-sequence symbol (so |V| = 53). We uniformly sample 10k distinct subsequences of tokens x^m_1, ..., x^m_{N_m}: we first sample a length uniformly between 4 and 11, and then draw each token uniformly from the set of letters (except for the last token x^m_{N_m}, which is always end-of-sequence). For each subsequence, we sample a state uniformly from Z and only allow emissions of this subsequence from the sampled state (so that the posterior P(z_m | x^m_1, ..., x^m_{N_m}) is a delta distribution). The entries of the transition matrix P_c(z_m | z_{m-1}) are initialized from a normal distribution, divided by a temperature of 0.5, and normalized using softmax. The entries of the emission matrix P_c(x^m_1, ..., x^m_{N_m} | z_m) are initialized from a normal distribution and divided by a temperature of 0.3; we then mask out disallowed emissions and normalize with softmax.
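The construction of the transition and emission tables can be sketched as below. This is a simplified sketch under stated assumptions: subsequences are indexed by integer ids, each subsequence gets exactly one emission slot, and the function and argument names are invented for illustration.

```python
import numpy as np

def build_synthetic_tables(n_subseqs, n_states, temp_trans=0.5,
                           temp_emit=0.3, seed=0):
    """Gaussian logits divided by a temperature, masked so each subsequence
    can only be emitted by its uniformly assigned state, then softmaxed."""
    rng = np.random.default_rng(seed)
    # Transition matrix P(z_m | z_{m-1}).
    trans_logits = rng.normal(size=(n_states, n_states)) / temp_trans
    trans = np.exp(trans_logits - trans_logits.max(axis=1, keepdims=True))
    trans /= trans.sum(axis=1, keepdims=True)
    # Each subsequence is assigned one uniformly chosen state, so the
    # posterior over states given a subsequence is a delta distribution.
    state_of = rng.integers(n_states, size=n_subseqs)
    emit_logits = rng.normal(size=(n_states, n_subseqs)) / temp_emit
    mask = np.full((n_states, n_subseqs), -np.inf)
    mask[state_of, np.arange(n_subseqs)] = 0.0
    masked = emit_logits + mask
    # Row-wise softmax; exp(-inf) = 0 kills the masked-out emissions.
    row_max = np.max(np.where(np.isfinite(masked), masked, -1e30),
                     axis=1, keepdims=True)
    emit = np.exp(masked - row_max)
    sums = emit.sum(axis=1, keepdims=True)
    emit = np.divide(emit, sums, out=np.zeros_like(emit), where=sums > 0)
    return trans, emit, state_of
```

With many more subsequences than states, every state almost surely receives at least one subsequence, so every emission row is a proper distribution.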
Posterior Inference Given a sequence x, the goal of posterior inference is to infer P(z|x). This can be done in two steps. First, we segment x into subsequences (each subsequence corresponds to one hidden state); this segmentation is deterministic because of the end-of-sequence tokens. Next, we map each subsequence to its hidden state z_m by a simple lookup, because the masked emission matrix allows only one hidden state per subsequence. Therefore, P(z|x) is a delta distribution.
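The two-step inference above amounts to a split at end-of-sequence tokens followed by a dictionary lookup. A minimal sketch (the function name and the `state_of_subseq` mapping are illustrative assumptions):

```python
def infer_states(tokens, eos, state_of_subseq):
    """Deterministic posterior inference for the synthetic HMM: split the
    sequence at end-of-sequence tokens, then map each subsequence to its
    unique hidden state. Since the emission matrix allows only one state
    per subsequence, P(z|x) is a delta distribution on this output."""
    states, cur = [], []
    for tok in tokens:
        cur.append(tok)
        if tok == eos:
            states.append(state_of_subseq[tuple(cur)])
            cur = []
    return states
```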

Model
The HSMM LM has 800 states. It is parameterized with the logits of its transition matrix P(z_m | z_{m-1}), its length emission matrix P(N_m | z_m), and its emission matrix P(x^m_1, ..., x^m_{N_m} | z_m, N_m), where N_m ranges from 1 to 11. To parameterize the emission matrix, we take the 250k most common n-grams in the training dataset for n from 1 to 11. The HSMM has 1.61B parameters due to the large number of possible emissions (the true data distribution P_data has only 2.63M parameters). The transformer LM has 6 layers, 4 attention heads, model dimension 512, and hidden dimension 1024; it has 18.94M parameters.
Optimization We optimize the HSMM using stochastic gradient descent on the log marginal likelihood log P(x). To marginalize out z_m and N_m, we use PyTorch-Struct (Rush, 2020). The model parameters are initialized with Xavier initialization (Glorot and Bengio, 2010). We use a batch size of 8 and train for 10 epochs with the Adam optimizer (Kingma and Ba, 2014) on an Nvidia A100 GPU. The learning rate is initialized to 3e-1 and halved whenever the validation log marginal likelihood does not improve for 240 steps, with a minimum learning rate of 3e-4. We found it necessary to pretrain the emission matrix P(x^m_1, ..., x^m_{N_m} | z_m, N_m) for one epoch with a learning rate of 1e-1 while fixing the other parameters, to avoid under-utilization of states. Pretraining takes about a day and training takes about a week, due to the large number of parameters and the small batch size we can afford. The transformer LM is also optimized with Adam, with a batch size of 4096 tokens and 4k warmup steps up to a maximum learning rate of 5e-4. It is trained for 120k steps in total (about 19 epochs), following fairseq's default setting for conditional language modeling on IWSLT14 De-En. Training the transformer LM takes about 4 hours on an Nvidia A100 GPU.
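The halve-on-plateau schedule described above can be sketched as a small helper; this is an illustrative stand-in (the function name and the per-step validation input are assumptions), not the training code itself.

```python
def halving_schedule(val_nlls, lr0=3e-1, lr_min=3e-4, patience=240):
    """Halve the learning rate whenever the validation NLL has not
    improved for `patience` steps, with a floor of `lr_min`.
    `val_nlls` is the validation NLL observed at each step; returns
    the learning rate used at each step."""
    lr, best, since = lr0, float("inf"), 0
    lrs = []
    for nll in val_nlls:
        if nll < best:
            best, since = nll, 0  # improvement resets the counter
        else:
            since += 1
            if since >= patience:
                lr, since = max(lr / 2, lr_min), 0
        lrs.append(lr)
    return lrs
```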

E Details of "Critiquing Discourse Coherence"

Data - PUBMED and ARXIV The PUBMED and ARXIV datasets in Section 4 are based on the datasets of Cohan et al. (2018), where each article consists of a list of sections with section titles. We process the dataset in a few steps. First, we standardize the section titles by lemmatizing each word in the title, removing any numbers, and mapping each word to a standard spelling (e.g., "acknowledgement" is mapped to "acknowledgment").
Next, we remove "see also", "external link", "reference", "further reading", "note", and "source" sections from each article. Then we filter out articles with fewer than 3 remaining sections, or with sections of more than 2k tokens or fewer than 30 tokens (tokens are counted with the GPT-2 tokenizer). Finally, we remove articles containing infrequent section titles, where the frequency threshold is 500 for PUBMED and 200 for ARXIV (all counted on the training dataset).
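The filtering steps above can be sketched as a small pipeline. This is a simplified sketch: whitespace tokenization stands in for the GPT-2 tokenizer, title frequencies are counted on the filtered set rather than the training split, and the function name is invented for illustration.

```python
from collections import Counter

DROP_TITLES = frozenset({"see also", "external link", "reference",
                         "further reading", "note", "source"})

def filter_articles(articles, min_sections=3, min_toks=30, max_toks=2000,
                    min_title_count=500, n_tokens=lambda s: len(s.split())):
    """`articles` is a list of [(title, text), ...] with already-normalized
    titles. Drops unwanted sections, then articles that are too short or
    have out-of-range sections, then articles with infrequent titles."""
    kept = []
    for art in articles:
        secs = [(t, x) for t, x in art if t not in DROP_TITLES]
        if len(secs) < min_sections:
            continue
        if any(not (min_toks <= n_tokens(x) <= max_toks) for _, x in secs):
            continue
        kept.append(secs)
    # Finally, drop articles containing infrequent section titles.
    counts = Counter(t for art in kept for t, _ in art)
    return [a for a in kept if all(counts[t] >= min_title_count for t, _ in a)]
```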
Data - WIKI We download the English Wikipedia dump of Dec 1, 2021. We then use the Python package mwparserfromhell to extract top-level sections from each article. We ignore redirect pages, disambiguation pages, links, files, and images, and strip away code; we also ignore articles about years. We then process the dataset in the same way as PUBMED and ARXIV, except that we remove articles with fewer than 4 sections (since we always count the introductory paragraph of each Wikipedia article as an "abstract" section) and remove infrequent section titles that appear fewer than 4k times in the training data.

Dataset Statistics
The statistics of all three datasets can be found in Table 11.

Error Analysis Table 12 presents the most common section transition errors across all section types for the different settings (W/ Title and W/O Title). We notice again that a very common transition error is to generate a section repeatedly. However, even the ground-truth data contains repeated sections, such as "career → career" (appearing 0.08% of the time), due to misclassifications by the BERT classifier used for the inference network.

Repetition Errors Repetition is a common type of error found by model criticism: repeated sections account for a sizable fraction of transitions on LM 1 W/O Title and 17.89% on LM 1 W/ Title (same criterion as in Table 12). While previous work has shown that neural language models tend to repeat at the level of phrases (Holtzman et al., 2019) and sentences (Welleck et al., 2019), we find that repetition can happen at an even higher level, as shown in the qualitative example in Table 13.
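Counting transition errors under the critic reduces to thresholding P_c(z_m | z_{m-1}) over adjacent section-type pairs. A minimal sketch (the function name and the dict-based critic are assumptions of this illustration):

```python
from collections import Counter

def transition_errors(sequences, trans_prob, threshold=0.01):
    """Count section-transition errors: a transition (z_{m-1} -> z_m) is
    deemed an error when P_c(z_m | z_{m-1}) < threshold. `trans_prob`
    maps (prev, cur) label pairs to probabilities under the critic;
    repetitions surface as error pairs with identical labels."""
    errors, total = Counter(), 0
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            total += 1
            if trans_prob.get((prev, cur), 0.0) < threshold:
                errors[(prev, cur)] += 1
    freq = {pair: c / total for pair, c in errors.items()}
    return errors.most_common(), freq
```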

F Details of "Critiquing Topic Correlations"
Data We use the same datasets as in Section 4.
For topic modeling, we remove word types that appear in more than 50% of training documents, and we also remove LaTeX commands such as \xmath.

Outlier Detection While Section 6 has shown that in aggregate the Latent NLL of the LM 1 generations is close to that of real data, we can identify outliers by finding x for which T(x) = −E_{z∼P_c(z|x)}[log P_c(z)] is high. These outliers are usually pathological cases with a very different distribution of topics, as shown in Table 14.
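The outlier statistic T(x) can be estimated by Monte Carlo over posterior samples of the document-topic vector, assuming a Dirichlet prior P_c(z). A minimal sketch with a hand-rolled Dirichlet log-density (the function names are illustrative):

```python
import numpy as np
from math import lgamma

def dirichlet_logpdf(z, alpha):
    """Log density of Dirichlet(alpha) at a point z on the simplex."""
    z = np.asarray(z, float)
    alpha = np.asarray(alpha, float)
    log_b = sum(lgamma(a) for a in alpha) - lgamma(alpha.sum())
    return float(np.sum((alpha - 1) * np.log(np.clip(z, 1e-12, None))) - log_b)

def latent_nll(z_samples, alpha):
    """Monte Carlo estimate of T(x) = -E_{z ~ P_c(z|x)}[log P_c(z)] for one
    document; high values flag documents whose topic mixture is unlikely
    under the prior, i.e. outliers."""
    return -float(np.mean([dirichlet_logpdf(z, alpha) for z in z_samples]))
```

Ranking documents by `latent_nll` and inspecting the top few recovers the pathological cases shown in Table 14.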
More Visualizations Section 6 visualized the covariance matrices on WIKI; we also plot the covariance matrices on PUBMED and ARXIV in Figure 8 and Figure 9. Note that we reorder topics using hierarchical clustering of the covariance matrix on the test set, and we clamp the values of the covariance matrix to the range [-5, 5] for plotting.
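The reorder-and-clamp step before plotting can be sketched as below, assuming scipy's hierarchical clustering utilities; the function name and the average-linkage choice are assumptions of this sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def prepare_cov_for_plot(z_samples, order=None, clamp=5.0):
    """Compute the topic covariance of posterior samples z (one row per
    document), reorder topics by hierarchical clustering, and clamp to
    [-clamp, clamp] for plotting. Pass the `order` computed on the test
    set when plotting model generations, so all panels are comparable."""
    cov = np.cov(np.asarray(z_samples).T)
    if order is None:
        order = leaves_list(linkage(cov, method="average"))
    cov = cov[np.ix_(order, order)]
    return np.clip(cov, -clamp, clamp), order
```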

G Details of "Critiquing Coreference Chains"
Data All experiments in this section use a subset of the WIKI dataset: we apply a simple filter that only keeps articles about films, by matching the first section of the article against the regular expression .*is a.*film.*.
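A minimal sketch of this regular-expression filter (the helper name is an illustrative assumption):

```python
import re

FILM_PATTERN = re.compile(r".*is a.*film.*")

def is_film_article(first_section_text):
    """Keep an article only if its first section matches the film pattern,
    e.g. "X is a 2021 drama film ..."."""
    return bool(FILM_PATTERN.match(first_section_text))
```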

Coreference Resolution
We use an off-the-shelf neural coreference resolution system, neuralcoref, to infer z given an article. We limit our study to person entities.

Gender Assignment
To avoid the open-vocabulary problem of proper nouns, and because personal pronouns are usually gendered in English, we replace proper nouns with their genders (Male/Female/Plural/None of the above). To identify the gender of a proper noun, we take a majority vote over the genders of the pronouns that corefer with it (for example, "she" corresponds to Female, "he" to Male, and "they" to Plural). If no gendered pronouns corefer with a given proper noun, we assign "None of the above". The caption of Figure 13 presents an example of the gender assignment procedure.
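The majority-voting procedure can be sketched as follows; the pronoun-to-gender table and function name are illustrative assumptions (the paper does not enumerate its exact pronoun list).

```python
from collections import Counter

PRONOUN_GENDER = {"he": "Male", "him": "Male", "his": "Male",
                  "she": "Female", "her": "Female", "hers": "Female",
                  "they": "Plural", "them": "Plural", "their": "Plural"}

def assign_gender(cluster_mentions):
    """Majority vote over the gendered pronouns in one coreference cluster,
    falling back to "None of the above" when the cluster contains no
    gendered pronouns."""
    votes = Counter(PRONOUN_GENDER[m.lower()] for m in cluster_mentions
                    if m.lower() in PRONOUN_GENDER)
    return votes.most_common(1)[0][0] if votes else "None of the above"
```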
Language Models We use LM 1 (W/O Title) trained on WIKI. We apply the same filtering process as for real data, keeping only generations about films.
What does this critic learn? Table 15 shows a random subset of coreference chain n-grams in LM 1 generations that the critic deems unlikely. The learned critic makes intuitive sense. For example, in the first row, [They] 1 is created even though the previous context contains only a single entity (although it is possible that entities outside this context window make "They" plausible); in the second row, "she" is used to refer to a male; in the third row, "her" has no antecedent.

More Results Table 16 shows the coreference chain n-grams that occur more frequently in LM generations than in real data (we again used 5-gram LMs with Kneser-Ney smoothing to estimate the probabilities). Some of these are implausible, similar to the observations in the main paper: for example, in the second-to-last row, a proper noun [Male] 0 is used after the pronoun [he] 0 has already referred to the same entity in the sentence. Table 17 shows the other direction: coreference chains that occur more frequently in real data than in LM generations. While this also reveals places where the coreference distributions do not match, the coreference structures themselves are not unlikely, since they appear frequently in real data.

Table 17: The top 5 coreference chain n-grams with the lowest log probability differences between LM generations and real data (log P_LM(z_5|z_<5) − log P_data(z_5|z_<5)). We only consider n-grams that appear at least 5 times in both the test set and the LM generations. M (Male), F (Female), P (Plural), and N (None of the above). Blank: padding.
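The log-probability-difference ranking behind Tables 16 and 17 can be sketched as below. This sketch substitutes simple add-α smoothing for the Kneser-Ney-smoothed 5-gram models used in the paper, and the function name and padding symbol are assumptions of this illustration.

```python
import math
from collections import Counter

def ngram_logprob_diffs(data_seqs, lm_seqs, n=5, alpha=0.1, min_count=5):
    """Rank n-grams of coreference labels by
    log P_LM(z_n | z_<n) - log P_data(z_n | z_<n), keeping only n-grams
    seen at least `min_count` times in both corpora. Sequences are
    left-padded so every position yields one n-gram."""
    vocab = {t for s in data_seqs + lm_seqs for t in s}
    v = len(vocab) + 1  # +1 for the padding symbol

    def counts(seqs):
        ctx, full = Counter(), Counter()
        for s in seqs:
            s = ["<pad>"] * (n - 1) + list(s)
            for i in range(n - 1, len(s)):
                gram = tuple(s[i - n + 1: i + 1])
                full[gram] += 1
                ctx[gram[:-1]] += 1
        return ctx, full

    dctx, dfull = counts(data_seqs)
    lctx, lfull = counts(lm_seqs)

    def logp(full, ctx, g):  # add-alpha conditional estimate
        return math.log((full[g] + alpha) / (ctx[g[:-1]] + alpha * v))

    diffs = {g: logp(lfull, lctx, g) - logp(dfull, dctx, g)
             for g in lfull if lfull[g] >= min_count and dfull[g] >= min_count}
    return sorted(diffs.items(), key=lambda kv: -kv[1])
```

Reading the list from the top gives the Table 16 direction (over-represented in LM generations); reading from the bottom gives the Table 17 direction.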
Qualitative Examples Figure 10 and Figure 11 show two examples where coreference abnormalities are successfully detected by the model. Figure 12 shows an example where, due to the limited context window of the 5-gram critic, a pronoun is flagged as unlikely even though it is appropriate, because its antecedent falls outside the context window. Figure 13 shows an example where coreference resolution errors are intertwined with genuine coreference errors; this type of error would likely go away as more powerful coreference resolution systems are developed.

Potential Improvements By discarding all words other than entity mentions, we lose much information about the sentence, including syntactic information such as c-command structure (Chomsky, 1993). Augmenting the entity mentions with syntactic features would likely make the critic even more powerful at identifying nuanced abnormalities of language model generations.

H Human Evaluation
Inspired by Persing et al. (2010), we evaluate the coherence of an article by asking human annotators to first label the type of each section, and then label whether the article is coherent based on the organization of section types. Our human evaluation system is based on Amazon Mechanical Turk (Crowston, 2012). Each human annotator first goes through a training phase to learn the typical organization of articles in the training dataset, as shown in Figure 14. After this training phase, the annotator uses the interface shown in Figure 15 to annotate whether an article is coherent, first labeling the section type of each section.

Figure 14: The training interface of human evaluation. Upon clicking "Verify", the selected section types are compared against gold section types.
Figure 15: The testing interface of human evaluation. The human annotator needs to first label all section types and then label whether the article is coherent based on the labeled section types.

Figure 2 :
Figure 2: Model criticism in latent space. Given a sample x, we first map it to latent states z using P_c(z|x). The likelihood of z is evaluated using P_c(z) to measure how likely the samples are in the latent space.

Figure 3 :
Figure 3: Applying model criticism to synthetic data.

Figure 6 :
Figure 6: Topic covariance matrix for the induced z (on WIKI). Left: test set (P_data). Middle: LM 1 generations (P_model). Right: generations of LM 1 trained on PUBMED, as a visual baseline. The Latent NLLs are 124.70, 123.30, and 140.10, respectively. Topic ids are rearranged using hierarchical clustering to facilitate visual comparison.

Figure 7 :
Figure 7: Scatterplot of samples from the topic posterior E_{P_x} P_c(z|x). Left: real data. Middle: autoregressive LM generations. Right: non-autoregressive LM generations. Real data and autoregressive LM generations contain much less code-mixing than non-autoregressive LM generations.

Figure 8 :
Figure 8: Topic covariance matrix for the induced z (on PUBMED). Left: test set (P_data). Middle: LM 1 generations (P_model). Right: generations of LM 1 trained on ARXIV, as a visual baseline.

Figure 9 :
Figure 9: Topic covariance matrix for the induced z (on ARXIV). Left: test set (P_data). Middle: LM 1 generations (P_model). Right: generations of LM 1 trained on PUBMED, as a visual baseline.

Figure 11 :
Figure 11: A qualitative example where the critic correctly identifies an implausible coreference n-gram. The argmax at the circled position is [him] 0, with probability exp(−1.43). We only highlight the root of each entity mention to avoid clutter.

Figure 12 :
Figure 12: A qualitative example where the critic incorrectly flags a coreference n-gram as implausible, because the limited context window does not contain the antecedent of the pronoun. The argmax at the circled position is [her] 1, with probability exp(−1.49). We only highlight the root of each entity mention to avoid clutter.

Figure 16 :
Figure 16: Instructions for the training interface of human evaluation. These instructions are shown upon clicking "Instructions" in the training interface (Figure 14).

Figure 17 :
Figure 17: Instructions for the testing interface of human evaluation. These instructions are shown upon clicking "Instructions" in the testing interface (Figure 15).

Table 1 :
Evaluation results of the transformer and HSMM on the synthetic dataset. Word-level PPL values are estimated on the test set, and Latent PPL values are estimated using the same number of samples (6.4k).

Table 2 :
Results of the coherence experiments. W/ Title is the setting where section titles are included in the training data for LMs; W/O Title removes section titles from the training data.

Table 3 :
Section transition errors on WIKI, where each edge is labeled with the difference between P(z_m|z_{m-1}) under LM 1 (W/O Title) and under the data, and its width is proportional to the absolute difference. Red marks unlikely transitions (P_c(z_m|z_{m-1}) < 0.05). For clarity, we only show the top 20 section titles and remove singletons. Latent PPLs are reported on WIKI-SHORT, where all section transitions fit within the context window size.
Table 4 correlates automatic metrics with human judgments of coherence. Each human annotator first labels the section title of each section (after a training phase where they labeled sections and received feedback on real data), and then labels whether the organization of the section titles makes sense (Persing et al., 2010). The baseline MAUVE (Pillutla et al., 2021) is a metric that compares the distribution of GPT-3 hidden states between real data and model generations. From this table, we can observe that both MAUVE and Latent PPL align much better with human judgments than PPL does. Comparing the two, Latent PPL aligns better with humans: LM 1 W/O Title is considered better than LM 1 W/ Title under MAUVE, but both human evaluation and Latent PPL consider LM 1 W/ Title to be much better.

Table 5 :
The results of scaling model size on WIKI. Increasing model size improves PPL but not Latent PPL.

Table 6 :
Coreference chain n-grams ranked by contribution to the difference in Latent NLL (LM 1). P denotes the empirical frequency of an n-gram, in percentages. M (Male), F (Female), and N (None of the above). Blank: padding.
Table 8: Critiquing coreference chains on larger models.Increasing model size does not improve Latent PPL.

Table 9 :
Latent NLL of topic correlation modeling.Transformer LMs perform similarly to the real data.
Our work is similar in spirit to recently proposed suite-based evaluations, such as the Language Model Evaluation Harness (Gao et al., 2021) and BIG-bench (Srivastava et al., 2022), which utilize many different skill-based metrics to probe high-level aspects of text.
Model Criticism Model criticism (Box, 1980; Gelman et al., 1995; Stern and Sinharay, 2005; O'Hagan, 2003) is a general framework for checking whether a generative model fits the data well. Model criticism differs from the aforementioned metrics such as PPL, and is similar to two-sample tests. To this end, Dey et al. (1998) introduce model criticism in latent space, which measures the discrepancy between real data and model generations in the latent space induced by a generative model.

Table 10 :
Learned topics largely correspond to languages.The top 20 words per topic are shown.

Posterior Inference We finetune a BERT classifier (Devlin et al., 2019) to estimate P(z_m | x_m) using the Adam optimizer, with a batch size of 32 and a learning rate of 2e-5, for 3 epochs. The validation accuracies are 89.48%, 72.52%, and 88.15% on PUBMED, ARXIV, and WIKI, respectively. Finetuning takes up to a few hours on a single Nvidia A100 GPU.

Table 11 :
Data statistics. Section lengths are counted using the GPT-2 tokenizer (Radford et al., 2019; Wolf et al., 2020). Section statistics are based on the validation set. More details on data processing can be found in Appendix E.

Table 12 :
The top 5 section transition errors on WIKI. Frequency is the frequency of observing the specific transition error across all transitions in the (generated) dataset. A transition is deemed an error if P_c(z_m | z_{m-1}) < 0.01. Here we use the better-performing GPT-2-based LMs (LM 1).
Inferred Section Title / Section Text

abstract: Toledo Township is a township in Dauphin County, Pennsylvania, United States. As of the 2010 census, the township population was 1,655. It is part of the Dutchess/Berwick Micropolitan Statistical Area. Toledo Township was organized in 1867, and named for the United States senator, Judge John Toledo.

geography: Toledo Township is in northern Dutchess County, bordered by the city of Altoona to the north, the borough of Ritzel to the southeast, and Tuskegee Township and the unincorporated hamlet of Westmoreland Township to the south. According to the United States Census Bureau, the township has a total area of, of which is land and, or 0.62%, is water. It is bordered on the south by the Tullahonas River, on the west by the Delaware Channel, on the south by the Mohawk River and on the west by Tullahonas Creek, whose tributaries are the Westmoreland and Trenton rivers. Pennsylvania Route 11, which runs between Routes 11 and N, crosses the township via the Tuskegee River ...

demographic: As of the census of 2000, there were 1,638 people, 809 households, and 595 families residing in the township. The population density was 787.1 people per square mile (285.2/km2). There were 944 housing units at an average density of 331.2 per square mile (126.5/km2). The racial makeup of the township was 95.07% White, 1.81% African American, 0.46% Native American, 0.36% Asian, 0.06% Pacific Islander, 0.42% from other races, and 1.06% from two or more races. Hispanic or Latino of any race were 1.13% of the population. There were 809 households, out of which 32.4% had children under the age of 18 living with them, 49.0% were married couples living together, 11.1% had a female householder with no husband present, and 30.0% were non-families. 26.5% of all households were made up of individuals, and 12.9% had someone living alone who was 65 years of age or older ...

Table 13 :
An example section-level repetition error (LM 1 W/O Title on WIKI). The most common section type after "demographic" is "education". The structures of the repeated sections are similar, yet the facts are different.

Table 14 :
Top 2 outliers identified by T(x) on the generations from LM 1 finetuned on the ARXIV dataset (W/O Title). The average T(x) (Latent NLL) is 161.40.

Table 15 :
Unlikely z_5 | z_<5 and the corresponding log P_c(z_5|z_<5) in LM 1 generations according to the learned critic (log P_c(z_5|z_<5) < −7). To give a better sense of what the critic considers likely, we also show z*_5 = argmax_{z_5} P_c(z_5|z_<5) and log P_c(z*_5|z_<5). M (Male), F (Female), P (Plural), and N (None of the above). Blank: padding.

Table 16 :
The top 5 coreference chain n-grams with the largest log probability differences between LM generations and real data (log P_LM(z_5|z_<5) − log P_data(z_5|z_<5)). We only consider n-grams that appear at least 5 times in both the test set and the LM generations. M (Male), F (Female), and N (None of the above).