A Measure-Theoretic Characterization of Tight Language Models

Language modeling, a central task in natural language processing, involves estimating a probability distribution over strings. In most cases, the estimated distribution sums to 1 over all finite strings. However, in some pathological cases, probability mass can "leak" onto the set of infinite sequences. In order to characterize the notion of leakage more precisely, this paper offers a measure-theoretic treatment of language modeling. We prove that many popular language model families are in fact tight, meaning that they will not leak in this sense. We also generalize characterizations of tightness proposed in previous works.


Introduction
Language modeling is a core task in natural language processing. As canonically defined, language modeling involves estimating a distribution over the set of strings over a given alphabet. If the alphabet is the set of words in a language, then a language model can be thought of as a distribution over the language's sentences. Since Shannon (1948), language modeling has been used to estimate statistical properties of language and has become essential for computational linguistics research (Hale, 2001; Meister et al., 2021). Further, it is also central to a wide range of natural language processing applications, whether as a source model in a noisy channel architecture (Weaver, 1955; Jelinek, 1976), as a way of learning better representations of language (Peters et al., 2018), or, more recently, for prompted generation (Brown et al., 2020), where the distribution defined by a language model is employed in tasks like question-answering (Petroni et al., 2019), style transfer (Reif et al., 2022), and even sequence tagging (Liu et al., 2022).
More formally, a language model is typically defined to be a distribution over the countably infinite set Σ* of all (finite) strings (Booth and Thompson, 1973).[2] However, it has been shown that some classes of autoregressive language models have parameter settings in which the generative process terminates with probability < 1. Welleck et al. (2020) discuss this issue for recurrent neural network language models. Models whose generative process may fail to terminate are called non-tight (Chi, 1999, who discussed non-tight PCFGs). If an autoregressive language model is non-tight, it may generate infinite sequences, and MCMC algorithms over such models will not mix to the correct distribution.
It is here that a subtlety arises: the set of infinite sequences is uncountably infinite. Properly treating a distribution over this sample space requires a modicum of measure theory.[3] To clarify the situation, we review the measure-theoretic treatment of distributions over infinite sequences. We then make use of a termination symbol EOS to define a random variable whose value can be a string, i.e., an element of Σ*, or an infinite sequence. In a tight language model, this random variable has probability 1 of being a string and hence finite.
Beyond offering a measure-theoretic formalization, our paper also demonstrates how tightness relates to the Borel-Cantelli lemmata, simplifying a recent result by Meister et al. (2022). To conclude our paper, we analyze several language modeling architectures and give conditions on their tightness. We demonstrate that n-gram language models, and more generally language models defined by stochastic finite-state automata, can be non-tight, and we give a simple necessary and sufficient condition for tightness in terms of the inverse of the automaton's transition matrix. This builds on a known result due to Lehmann (1977). We also discuss when neural language models become non-tight. We prove that Transformer-based language models (Vaswani et al., 2017; Brown et al., 2020) are always tight and that recurrent neural language models are always tight when they employ a bounded activation function. However, we also exhibit a recurrent neural network (RNN) language model with a ReLU activation (Nair and Hinton, 2010) that is non-tight, in a simpler construction than the one offered by Chen et al. (2018). As a byproduct, we also generalize and strengthen the results given by Welleck et al. (2020), who give a sufficient condition for tightness of recurrent neural language models in terms of the norm of the hidden state.

[2] Recall that Σ* def= ∪_{n≥0} Σ^n, where for n ≥ 0, Σ^n is the set of strings of length n. The * is the Kleene closure operator.
[3] Indeed, our treatment resolves an imprecision present in the literature. For instance, Welleck et al. (2020) discuss the probability of infinite sequences despite using the canonical definition of a language model as a distribution over Σ*.

Motivating Examples
Let Σ be an alphabet, i.e., a finite set of symbols, and let EOS ∉ Σ be a distinguished end-of-sequence symbol. Let Σ̄ def= Σ ∪ {EOS}. A string of length n ≥ 0 is a finite sequence x = x_1 x_2 ⋯ x_n where each x_t ∈ Σ. By convention, we say that x_{n+1} = EOS, although x_{n+1} is not an element of the sequence x. For any integer 1 ≤ t ≤ n + 1, we write x_{<t} for the prefix x_1 x_2 ⋯ x_{t−1}.
We now begin to distinguish between "language models" and "sequence models." As is traditional in the NLP literature, we henceforth define a language model to be a probability distribution over the countable set Σ* of all strings (see Def. 3.4). It is popular to specify such a distribution in terms of its conditional probabilities p̄(x_t | x_{<t}).
Definition 2.1. An autoregressive sequence model (ASM) is a conditional probability distribution p̄(x_t | x_{<t}) where x_t ∈ Σ̄ and x_{<t} ∈ Σ*.
If p̄ is an ASM, then we define a non-negative function p over Σ* by

p(x) def= ∏_{t=1}^{n+1} p̄(x_t | x_{<t}),

where n denotes the length of x (recall that x_{n+1} = EOS by convention).
But is p a language model? Not always, since as we will see below, it is possible for p(Σ*) def= Σ_{x∈Σ*} p(x) < 1. Of course this "bad" case never happens if the ASM's conditional probabilities are derived from a known LM, in which case p simply recovers that LM.[4] In this case clearly p(Σ*) = 1.
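The "good" case can be checked numerically. The sketch below uses a hypothetical ASM whose conditionals ignore the context entirely, p̄(a | x_{<t}) = 0.4, p̄(b | x_{<t}) = 0.3, p̄(EOS | x_{<t}) = 0.3 (these numbers are our own illustration, not from the paper); summing p(x) over all strings then reduces to a geometric series that sums to exactly 1:

```python
from itertools import product

# Hypothetical context-free conditionals (they sum to 1 over Σ̄ = {a, b, EOS}):
P_BAR = {"a": 0.4, "b": 0.3, "EOS": 0.3}

def p(x):
    """p(x) = p̄(EOS | x) * prod_t p̄(x_t | x_{<t})."""
    prob = 1.0
    for sym in x:
        prob *= P_BAR[sym]
    return prob * P_BAR["EOS"]

# Summing p over all strings of length < 16 matches the geometric series
# sum_n (0.4 + 0.3)^n * 0.3 term by term.
total_enum = sum(p(x) for n in range(16) for x in product("ab", repeat=n))
total_closed = sum(0.7 ** n * 0.3 for n in range(16))
assert abs(total_enum - total_closed) < 1e-12

# Letting the length go to infinity: 0.3 / (1 - 0.7) = 1, so p(Σ*) = 1.
total = sum(0.7 ** n * 0.3 for n in range(1000))
print(total)  # ≈ 1.0: this ASM is a language model
```

Since the EOS probability here is a constant 0.3, no mass can leak to infinite sequences.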
It follows that if p(Σ*) < 1, then the ASM's conditional probabilities do not match the conditional probabilities of any language model p_0.
We now exhibit such a "bad" ASM. Although the conditional probability distributions p̄(· | x_{<t}) each sum to 1 over Σ̄, they fail to combine into a model p that sums to 1 over Σ* (i.e., a language model).
Example 2.2 (non-tight bigram model). Consider the bigram model defined in Fig. 1a over the alphabet Σ = {a, b}. Under this model, any finite string that contains the symbol b will have probability 0, since p̄(EOS | b) = p̄(a | b) = 0. This implies

p(Σ*) = Σ_{n=0}^∞ p(a^n) = Σ_{n=0}^∞ (0.7)^n · 0.1 = 0.1/(1 − 0.7) = 1/3 < 1.

But if p is not a language model, what is it? It is intuitive to suspect that, in a model with p(Σ*) < 1, the remainder of the probability mass "leaks" to infinite sequences, i.e., the generative process may continue forever with probability > 0. We will make this intuition formal in §3. By analogy with Chi and Geman (1998) and Cohen and Johnson (2013), we refer to such models as non-tight.[5]

The non-tightness of Ex. 2.2 is related to the fact that the probability of EOS is 0 at some states, in contrast to Ex. 2.3. However, requiring p̄(EOS | x_{<t}) > 0 for all prefixes x_{<t} is neither necessary nor sufficient to ensure tightness. It is not necessary because one can, for example, construct an ASM in which p̄(EOS | x_{<t}) = 0.1 when t is even and 0 otherwise. Such a model generates only odd-length strings but is tight. It is not sufficient because of the following example, in which p̄(EOS | x_{<t}) is always positive but decays so rapidly toward 0 that the generative process might continue forever.
Example 2.4 (non-tight RNN). Consider an RNN over the augmented alphabet Σ̄ = {a, EOS} with the following hidden state recurrence (starting from h_0 = 0):

h_t = ReLU(h_{t−1} + v_{x_t}).

Setting the (1-dimensional) embedding of the alphabet to be v_a = 1 and v_EOS = 0, the hidden state admits a closed-form expression h_t = t ∈ R while the model keeps generating a, and we arrive at

p̄(EOS | x_{<t}) = exp(v_EOS h_{t−1}) / (exp(v_EOS h_{t−1}) + exp(v_a h_{t−1})) = 1/(e^{t−1} + 1).

The EOS probability is always strictly positive, but Thm. 4.7 shows that this sequence model is non-tight. Numerically, p(Σ*) ≈ 0.702 < 1. ■

[4] Then by the chain rule of probability, p(x) = p_0(x) for each x ∈ Σ*. Thus p = p_0, so p is a language model.
[5] In Chi and Geman (1998) and Cohen and Johnson (2013), a PCFG is non-tight if its generative process may not terminate, and consequently the total probability of all finite trees is less than 1.

Figure 1: Tight and non-tight bigram models, expressed as Mealy machines (see §5.1). Transitions with conditional probability of 0 are omitted. The termination probability at a state is represented by an EOS arc from that state.

On the other hand, an ASM may be tight after all if the probability of EOS decays more slowly, even when it still approaches 0.
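The termination probability of such an RNN can be estimated numerically. The sketch below assumes the reconstructed, exponentially decaying EOS probabilities p̄(EOS | x_{<t}) = 1/(e^{t−1} + 1) (our reading of Ex. 2.4's indexing), and accumulates the probability that generation stops at some finite step:

```python
import math

def p_eos(t):
    # Reconstructed EOS probability at step t for the ReLU RNN of Ex. 2.4:
    # the hidden state equals t - 1 after reading t - 1 copies of `a`.
    return 1.0 / (math.exp(t - 1) + 1.0)

# p(Σ*) = Σ_t P(no EOS before step t) * p̄(EOS at step t)
survive, total = 1.0, 0.0
for t in range(1, 100):
    total += survive * p_eos(t)
    survive *= 1.0 - p_eos(t)

print(round(total, 3))  # ≈ 0.702 < 1: probability mass leaks to Σ^∞
```

The terms decay roughly geometrically, so truncating the sum at t = 100 already gives the limit to many digits.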
Example 2.5 (tight RNN). Consider again an RNN over Σ̄ = {a, EOS} with the following recurrence using a softplus activation:[6]

h_{t+1} = softplus(h_t) = log(1 + e^{h_t}).

Starting from h_1 = 0 = log 1, a simple induction argument shows that h_t = log t. Again, setting v_a = 1 and v_EOS = 0, we arrive at

p̄(EOS | x_{<t}) = exp(v_EOS h_t) / (exp(v_EOS h_t) + exp(v_a h_t)) = 1/(1 + t).

This decays slowly to 0: lim_{t→∞} p̄(EOS | x_{<t}) = 0, but since Σ_{t=1}^∞ p̄(EOS | x_{<t}) = ∞, Prop. 4.3 below implies that this ASM is tight. ■

Finally, we illustrate the peril of not treating distributions over uncountable sets carefully.

Example 2.6 (infinite coin toss). Consider the infinite independent fair coin toss model, where we aim to place a distribution over {H, T}^∞, the uncountable set of infinite sequences of {H, T}. Intuitively, such a distribution corresponds to an ASM in which for all x_{<t}, p̄(H | x_{<t}) = p̄(T | x_{<t}) = 1/2 and p̄(EOS | x_{<t}) = 0. Clearly, each individual infinite sequence over {H, T} should be assigned probability (1/2)^∞ = 0. Without a formal foundation, one may arrive at the following paradox:

1 = P({H, T}^∞) = P(∪_{ω∈{H,T}^∞} {ω}) "=" Σ_{ω∈{H,T}^∞} P({ω}) = 0,

where the quoted equality misapplies countable additivity to an uncountable union. Together, these examples suggest that one must take care to characterize tightness. And, to the authors' surprise, it does not appear as if such a careful characterization yet exists in the NLP literature.

[6] We use softplus instead of ReLU to simplify the arithmetic.
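The tightness of the softplus RNN can be seen concretely: the induction h_t = log t can be checked by iterating the recurrence, and the survival probability after T steps telescopes to exactly 1/(T+1), which vanishes. A small sketch, assuming the reconstructed probabilities p̄(EOS | x_{<t}) = 1/(t + 1):

```python
import math

def softplus(x):
    return math.log1p(math.exp(x))

# h_1 = 0; h_{t+1} = softplus(h_t). Induction gives h_t = log t.
h = 0.0
for t in range(1, 50):
    assert abs(h - math.log(t)) < 1e-9
    h = softplus(h)

# Survival probability after T steps telescopes:
# prod_{t=1}^{T} (1 - 1/(t+1)) = prod_{t=1}^{T} t/(t+1) = 1/(T+1) -> 0.
survive = 1.0
for t in range(1, 1000):
    survive *= 1.0 - 1.0 / (t + 1)
print(survive)  # ≈ 1/1000: the survival probability vanishes, so the model is tight
```

This is exactly the mechanism behind Prop. 4.3: a divergent series of EOS probabilities forces the product of survival probabilities to 0.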

The Language Model Measure
In this section, we rigorously characterize the kind of distribution induced by an ASM. As mentioned earlier, an ASM can lose probability mass to the set of infinite sequences, Σ ∞ . However, Σ ∞ , unlike Σ * , is uncountable, and it is due to this fact that we need to work explicitly with the measure-theoretic formulation of probability.

Measure-Theoretic Background
The goal of measure-theoretic probability is to assign probability to subsets of an outcome space Ω. For instance, in Ex. 2.6, Ω = {H, T}^∞. However, in the course of the study of measure theory, it has become clear that for many common Ω, it is impossible to assign probabilities in a way that satisfies a set of reasonable desiderata. Consequently, the standard approach to probability theory resorts to only assigning probability to certain "nice" subsets of Ω, which are referred to as events or measurable subsets, as in the theory of integration or functional analysis. The set of measurable subsets is commonly denoted as F (Def. 3.1), and a probability measure P : F → [0, 1] is the function that assigns a probability to each measurable subset. As it turns out, the following simple and reasonable requirements imposed on F and P are enough to rigorously discuss probability (Kolmogorov, 1933).
Definition 3.1. Let P(Ω) be the powerset of Ω. Then F ⊆ P(Ω) is called a σ-algebra (or σ-field) over Ω if the following conditions hold: 1) Ω ∈ F; 2) if E ∈ F, then its complement E^c ∈ F; 3) if E_1, E_2, … is a finite or infinite sequence of sets in F, then ∪_n E_n ∈ F. If F is a σ-algebra over Ω, we call the tuple (Ω, F) a measurable space.
A measurable space guarantees that operations on countably many sets are always valid, and hence permits the following definition.
Definition 3.2. A probability measure P over a measurable space (Ω, F) is a function P : F → [0, 1] such that 1) P(Ω) = 1, and 2) if E_1, E_2, … is a finite or infinite sequence of disjoint sets in F, then P(∪_n E_n) = Σ_n P(E_n). In this case we call (Ω, F, P) a probability space. Note that P assigns measure only to the sets in F; other sets are said to be non-measurable.

Sequence Models
As we saw in §2, sampling successive symbols from a non-tight ASM has probability > 0 of continuing forever. Hence, we hope to regard the ASM as defining a probability space over Ω = Σ* ∪ Σ^∞, where Σ^∞ denotes the set of infinite strings over Σ. Note that this set Ω is uncountable whenever |Σ| ≥ 2. We will first need to turn it into a measurable space by defining an appropriate σ-algebra.
This type of distribution is more general than a language model, which takes Ω to be the set Σ * of finite strings. To distinguish the two, we refer to a distribution over Σ * ∪ Σ ∞ as a sequence model. Definition 3.3. A sequence model is a probability measure P over the set Σ * ∪ Σ ∞ .
Intuitively (we will make this precise later), the event Σ ∞ ⊂ Σ * ∪ Σ ∞ in Def. 3.3 represents nontermination of the generating process, i.e., it attempts to generate an infinitely long sequence. If this never happens, we have a language model. Definition 3.4. A language model is a probability measure P over just Σ * . Equivalently, it is a sequence model P such that P (Σ ∞ ) = 0.
Our goal in the rest of this section is to rigorously construct a sequence model P* that encodes the conditional probabilities of a given ASM. Since the ASM specifies conditional distributions over the augmented alphabet Σ̄, we first use it to construct a probability measure P over a measurable space (Σ̄^∞, σ(C)). We then derive our sequence model P* from P as the probability measure of a random variable X taking values in a measurable space (Σ* ∪ Σ^∞, σ(C̄)).

Pre-Measure
As mentioned in §3.1, it is often impossible to measure the probability of every single subset of Ω. For example, in the infinite coin toss model in Ex. 2.6, we might begin by reasonably assigning probability 0 to each individual sequence ω ∈ {H, T} ∞ . Unfortunately, it is then impossible to assign probability to every subset of {H, T} ∞ ; we must restrict our measurable space to a strict subset of P(Ω), where P() is the powerset operator. Theorem 3.5. Assuming the Axiom of Choice and the Continuum Hypothesis, there exists no probability measure P over ({H, T} ∞ , P({H, T} ∞ )) such that P({ω}) = 0 for each ω ∈ {H, T} ∞ .
Proof. This is a direct consequence of Ulam (1930). See App. C.1.1 for a discussion. ■ We will address this with well-known methods. A versatile theorem of Carathéodory provides a natural way to construct a probability space for sequences, in which prefix probabilities are well-defined. We first review two needed definitions. Definition 3.6. A ⊆ P(Ω) is called an algebra (or field) over Ω if 1) Ω ∈ A; 2) if E ∈ A, then E^c ∈ A; 3) if E_1, E_2 ∈ A, then E_1 ∪ E_2 ∈ A. Definition 3.7. Let A be an algebra over some set Ω. A probability pre-measure over (Ω, A) is a function P_0 : A → [0, 1] such that 1) P_0(Ω) = 1, and 2) if E_1, E_2, … is a (countable) sequence of disjoint sets in A whose (countable) union is also in A, then P_0(∪_{n=1}^∞ E_n) = Σ_{n=1}^∞ P_0(E_n). Note that the only difference between a σ-algebra (Def. 3.1) and an algebra is that condition 3 is weakened from countable to finite, and the only difference between a probability measure (Def. 3.2) and a pre-measure is that the latter is defined with respect to an algebra instead of a σ-algebra.
The idea behind Carathéodory's Extension Theorem is that there is often a simple construction of an algebra A over Ω such that there is a natural way to define a probability pre-measure. One can then extend this probability pre-measure to a probability measure that is both minimal and unique in a precise sense. For example, the standard Lebesgue measure over the real line can be constructed in this way. For our case of infinite sequences, we will first construct an algebra over Ω = Σ̄^∞. Then, assuming we are given an ASM p̄ over Σ̄, we can associate the algebra with a pre-measure that is consistent with p̄. We will make use of the following definition to construct the algebra: Definition 3.8. Given any set H ⊆ Σ̄^k, define its cylinder set (of rank k) to be C(H) def= {xω : x ∈ H, ω ∈ Σ̄^∞}. In essence, a cylinder set of rank k is the set of infinite sequences that share their k-prefix with some string x ∈ H ⊆ Σ̄^k. For a length-k string x = x_1 ⋯ x_k, the rank-k cylinder set C(x) def= C({x}) is the set of all infinite sequences prefixed by x. We denote the collection of all rank-k cylinder sets by C_k def= {C(H) : H ∈ P(Σ̄^k)} and define C def= ∪_{k=1}^∞ C_k to be the collection of all cylinder sets over Ω. Lemma 3.9. C ⊂ P(Ω) is an algebra over Ω = Σ̄^∞.
Proof. See App. C.1.2. ■ We are now ready to define the pre-measure P_0 for the cylinder algebra C. Given an ASM p̄ and any set C(H) ∈ C, let

P_0(C(H)) def= Σ_{x∈H} p̄(x),

where, denoting the length of x by k,

p̄(x) def= ∏_{t=1}^{k} p̄(x_t | x_{<t}).

We confirm in Prop. C.2 that P_0 is well-defined even though the same cylinder set C(H) may also arise as C(H′) for a different prefix set H′. Lemma 3.10. P_0 is a pre-measure over C.
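A quick consistency check of P_0: for every rank k, the cylinder sets C(x) with x ∈ Σ̄^k partition Σ̄^∞, so their pre-measures must sum to 1. The sketch below uses a hypothetical bigram ASM in the spirit of §2 (our own illustrative numbers), with the additional assumption that the conditionals make EOS absorbing so that p̄ is defined on all of Σ̄^k:

```python
from itertools import product

# Hypothetical bigram ASM (the initial context behaves like state `a`);
# we assume EOS is absorbing so conditionals exist for every context in Σ̄*.
P_BAR = {
    "a": {"a": 0.7, "b": 0.2, "EOS": 0.1},
    "b": {"a": 0.0, "b": 1.0, "EOS": 0.0},
    "EOS": {"a": 0.0, "b": 0.0, "EOS": 1.0},
}

def prefix_prob(x):
    """p̄(x) = prod_{t=1}^{k} p̄(x_t | x_{<t}) = P_0(C(x))."""
    prob, last = 1.0, "a"
    for sym in x:
        prob *= P_BAR[last][sym]
        last = sym
    return prob

# For every rank k, the cylinder sets C(x), x ∈ Σ̄^k, partition Σ̄^∞,
# so their pre-measures must sum to P_0(Σ̄^∞) = 1.
for k in range(1, 8):
    total = sum(prefix_prob(x) for x in product(("a", "b", "EOS"), repeat=k))
    assert abs(total - 1.0) < 1e-12
print("P_0 is consistent across ranks")
```

This is the finite-additivity half of Lemma 3.10; the countable-additivity condition is what Carathéodory's theorem then exploits.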

Extension of Pre-Measure
We have now gathered all the ingredients needed to state Carathéodory's theorem.
Theorem 3.11 (Carathéodory's Extension Theorem). Given an algebra A over some set Ω and a probability pre-measure P_0 : A → [0, 1], there exists a probability space (Ω, F, P) such that A ⊂ F and P|_A = P_0. Furthermore, the σ-algebra F depends only on A and is minimal and unique (thus we may denote it by σ(A)), and the probability measure P is unique.
Applying Carathéodory's Extension Theorem to our cylinder algebra C and pre-measure P_0, we see that there exists a probability space (Σ̄^∞, σ(C), P) over Σ̄^∞ that agrees with the ASM p̄'s probabilities.
It is a fair question to ask what kinds of sets are non-measurable under this construction; we discuss this in App. C.2.2.

A String-Valued Random Variable
Having constructed the probability space (Σ̄^∞, σ(C), P), we now demonstrate how to use it to induce a probability space over Σ* ∪ Σ^∞, as required by Def. 3.3. We will achieve this through the use of a random variable.
Definition 3.12 (random variable). A mapping X : Ω → A between two measurable spaces (Ω, F) and (A, G) is an (A, G)-valued random variable, or a measurable mapping, if, for all B ∈ G,

X^{-1}(B) def= {ω ∈ Ω : X(ω) ∈ B} ∈ F.

To construct a random variable that takes values in Σ* ∪ Σ^∞, Def. 3.12 requires us to first construct a σ-algebra over Σ* ∪ Σ^∞. We will do so in a similar fashion as we constructed (Σ̄^∞, σ(C)). Given H ⊆ Σ^k, define a rank-k cylinder set in Σ* ∪ Σ^∞ to be

C̄(H) def= {xω : x ∈ H, ω ∈ Σ* ∪ Σ^∞}.

Let C̄_k be the set of all rank-k cylinder sets. Define C̄ def= ∪_{k=1}^∞ C̄_k. Then σ(C̄) is a σ-algebra, by the same reasoning as in Lemma 3.9 and Thm. 3.11. We can now define the random variable X by[12]

X(ω) def= x_{<k} if ω = x_{<k} EOS ω′, where k is the position of the first EOS in ω; and X(ω) def= ω if ω ∈ Σ^∞ contains no EOS.

We claim that X is well-defined: Proposition 3.13. The function X : (Σ̄^∞, σ(C)) → (Σ* ∪ Σ^∞, σ(C̄)) is a measurable mapping. Any measurable function induces a probability measure on the output space, called the pushforward measure (cf. §2.4 in Tao, 2011), given by

P*(E) def= P(X^{-1}(E)) for E ∈ σ(C̄).

One can check that P*, defined using P, is indeed a probability measure on (Σ* ∪ Σ^∞, σ(C̄)) and hence (Σ* ∪ Σ^∞, σ(C̄), P*) is a probability space.
We have therefore shown that, given any ASM, we can construct an associated sequence model as defined in Def. 3.3. Under the formulation of a probability space together with a random variable, useful probability quantities arise naturally and intuitively. In particular, when x ∈ Σ* is a finite string, we have

P(X = x) = p(x),   (14)

with the definition of p from §2. Additionally, as we will show in the next section, the probability of the set of infinite strings, P(X ∈ Σ^∞), is the probability of generating an infinite string.[13]

[12] In this definition, the position k ≤ ∞ of the first EOS (a stopping time) is itself a random variable.
[13] An important detail left out in this discussion is that both the singleton set {x} and Σ^∞ need to be measurable in (Σ* ∪ Σ^∞, σ(C̄)) for the above to make sense. This is addressed by Prop. C.7 and Prop. C.8.

Deriving EOS. As an aside, the preceding construction allows us to motivate the EOS token in an ASM as a construct that emerges naturally. Specifically, for any x ∈ Σ*, rearranging Eq. (14):

p̄(EOS | x) = p(x) / p̄(x) = P(X = x) / P(X ∈ C̄(x)),
where we have used p̄(x) = P(C(x)) = P(X^{-1}(C̄(x))) = P(X ∈ C̄(x)). This means that the EOS probability in an ASM emerges as the conditional probability that, given that we must generate a string with prefix x ∈ Σ*, the string is exactly x.
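The identity p̄(EOS | x) = P(X = x) / P(X ∈ C̄(x)) can be verified numerically. The sketch below uses a hypothetical bigram ASM in the spirit of §2 (illustrative numbers of our own): dividing the probability of the whole string by the probability of the prefix event recovers the EOS conditional.

```python
# Hypothetical bigram ASM (the initial context behaves like state `a`).
P_BAR = {"a": {"a": 0.7, "b": 0.2, "EOS": 0.1},
         "b": {"a": 0.0, "b": 1.0, "EOS": 0.0}}

def prefix_prob(x):
    """P(X ∈ C̄(x)) = p̄(x): probability the sampled string starts with x."""
    prob, last = 1.0, "a"
    for sym in x:
        prob *= P_BAR[last][sym]
        last = sym
    return prob

def string_prob(x):
    """P(X = x) = p(x) = p̄(x) * p̄(EOS | x)."""
    last = x[-1] if x else "a"
    return prefix_prob(x) * P_BAR[last]["EOS"]

x = "aaa"
ratio = string_prob(x) / prefix_prob(x)
assert abs(ratio - P_BAR["a"]["EOS"]) < 1e-12
print(ratio)  # ≈ 0.1 = p̄(EOS | aaa), recovered as a ratio of probabilities
```

In other words, the EOS conditional is fully determined by the measure P and the random variable X; it is not an extra modeling ingredient.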

Characterizing Tightness
Beyond the measure-theoretic formalization, a goal of this paper is to provide an exact characterization of tightness in ASMs. The results presented in this section generalize Lemma 3.2 in Welleck et al. (2020). First, we consider the event

A_k def= {ω ∈ Σ̄^∞ : ω_k = EOS},   (16)

i.e., the event that an EOS symbol appears at position k in the sequence. Note that under this definition the A_k are not disjoint. For example, the sequence ω = ab EOS c EOS dd ⋯ lives in the intersection of A_3 and A_5, since EOS appears at both position 3 and position 5. Using Eq. (16), we can express the event consisting of all finite strings as ∪_{k=1}^∞ A_k. It follows that we can express the event of an infinite string as (∪_{k=1}^∞ A_k)^c = ∩_{k=1}^∞ A_k^c. Thus, using the random variable X, we can express the probability of generating an infinite string as

P(X ∈ Σ^∞) = P(∩_{k=1}^∞ A_k^c).

Hence, we can now formalize the notion of tightness, which we have introduced in §2 and Def. 3.4. Definition 4.1. A sequence model is said to be tight if P(X ∈ Σ^∞) = 0, in which case it is also a language model (cf. Prop. C.9). Otherwise, we say that it is non-tight.
Note that the definition of A_k only uses a sequence's k-prefix, and hence A_k is a cylinder set of rank k. Recalling that the cylinder sets are measurable, and so are the sets countably generated by them, we see that both the event consisting of all finite strings and the event consisting of all infinite strings are measurable. Thus, P(∪_{k=1}^∞ A_k) and P(∩_{k=1}^∞ A_k^c) are well defined.
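One can also estimate P(X ∈ Σ^∞) empirically: sample ω symbol by symbol and check whether any of the events A_k with k ≤ K occurs. A seeded Monte Carlo sketch for a hypothetical non-tight bigram ASM in the spirit of Ex. 2.2 (our own numbers, truncating at K = 500):

```python
import random

# Hypothetical non-tight bigram ASM; state b never emits EOS.
P_BAR = {"a": [("a", 0.7), ("b", 0.2), ("EOS", 0.1)],
         "b": [("b", 1.0)]}

def hits_some_A_k(rng, max_len=500):
    """Sample ω symbol by symbol; True iff some A_k with k ≤ max_len occurs."""
    last = "a"                      # initial context behaves like state `a`
    for _ in range(max_len):
        syms, probs = zip(*P_BAR[last])
        last = rng.choices(syms, probs)[0]
        if last == "EOS":
            return True
    return False

rng = random.Random(0)
n = 20000
frac = sum(hits_some_A_k(rng) for _ in range(n)) / n
print(frac)  # close to 1/3; the remaining ~2/3 of the mass escapes to Σ^∞
```

Truncation at K only underestimates P(∪_k A_k) by the mass of strings longer than K, which is negligible here since the per-step EOS probability in state a is a constant 0.1.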

A Lower Bound Result
We have characterized tightness in terms of the probability of a specific event, P(∩_{k=1}^∞ A_k^c), a quantity we now seek to determine.
Using Lemma 4.2, we can derive the following useful sufficient condition for a sequence model derived from an ASM to be tight. It applies when the probability of EOS does not decay too rapidly with the length of the prefix.

Proposition 4.3. If p̄(EOS | x_{<t}) ≥ f(t) for all t ≥ 1 and all prefixes x_{<t} ∈ Σ^{t−1}, where Σ_{t=1}^∞ f(t) = ∞, then the sequence model is tight.
This test implies tightness for all of the tight examples in §2, but not for the non-tight ones. Note that the lower-bounding function f depends only on the length of the prefix, not its content. f does not have to be monotonic; in the case of the even/odd example from §2, it is not.

The Borel-Cantelli Lemmata
It turns out that Prop. 4.3 admits a converse statement, in which we can prove a similar property of p̄ by assuming that the model is tight. To prove this result, we will use a fundamental pair of results from probability theory: the Borel-Cantelli lemmata. The Borel-Cantelli lemmata are useful for our purposes because they relate the probability measure of sets of the form ∪_{n=0}^∞ A_n or ∩_{n=0}^∞ A_n to a series Σ_{n=0}^∞ p_n. We will only state the lemmata here without supplying their proofs; however, we point out that Lemma 4.2 can be viewed as a parallel statement to the Borel-Cantelli lemmata, and one can prove the lemmata using a very similar proof (cf. proof of Thm. 2.3.7 in Durrett, 2019).
Concretely, given a sequence of events {A_n}_{n=1}^∞ in some probability space, the Borel-Cantelli lemmata are statements about the event

{A_n i.o.} def= ∩_{N=1}^∞ ∪_{n≥N} A_n,

where i.o. stands for "infinitely often." Intuitively, {A_n i.o.} is the set of outcomes that appear in infinitely many sets in the collection {A_n}_{n=1}^∞ (hence the name). We will not use Borel-Cantelli directly, but the lemmata offer a probabilistic proof of a key result (Cor. 4.6) which will in turn lead to the desired statement about tightness. We formally state the first and second Borel-Cantelli lemmata below.

Lemma 4.4 (Borel-Cantelli I). If Σ_{n=1}^∞ P(A_n) < ∞, then P(A_n i.o.) = 0.

Lemma 4.5 (Borel-Cantelli II). If Σ_{n=1}^∞ P(A_n) = ∞ and the events {A_n}_{n=1}^∞ are mutually independent, then P(A_n i.o.) = 1.
Using the Borel-Cantelli lemmata, we can prove the following useful fact.

Corollary 4.6. Let {p_n} be a sequence with p_n ∈ [0, 1). Then Σ_{n=1}^∞ p_n = ∞ if and only if ∏_{n=1}^∞ (1 − p_n) = 0.

We now turn to proving a more general version of Prop. 4.3, which would imply its converse. First, we define the quantity

p_EOS(t) def= P(A_t | A_1^c ∩ ⋯ ∩ A_{t−1}^c),   (20)

which can be viewed as the EOS probability at step t, given that EOS was not generated at any earlier step. In Eq. (48a) in App. D.2, we show that, when p_EOS(t) is defined, it has the same value as

p_EOS(t) = Σ_{x∈Σ^{t−1}} p̄(EOS | x) p̄(x) / Σ_{x∈Σ^{t−1}} p̄(x).   (21)

We can now completely characterize the tightness of an ASM with the following theorem.

Theorem 4.7. An ASM is tight if and only if p_EOS(t) = 1 for some t or Σ_{t=1}^∞ p_EOS(t) = ∞.

We remark that Thm. 4.7 is a generalization of Prop. 4.3, since if p_EOS(t) is lower-bounded by an f(t) whose series diverges, its own series also diverges. However, since p_EOS(t) involves the computation of a partition function in its denominator, it can be intractable to calculate (Lin et al., 2021). Hence, Prop. 4.3 will be our main tool for determining tightness.
Finally, we note that Thm. 4.7 generalizes claims in previous work. For example, Welleck et al. (2020) require f(t) = c > 0 for some constant c to determine tightness. Hence, their bound is not helpful in determining tightness in either Ex. 2.4 or Ex. 2.5, because the EOS probability can be arbitrarily small in both cases. Applying Thm. 4.7 and Prop. 4.3, we see that (1) the ASM in Ex. 2.4 is non-tight, because the series Σ_{t=1}^∞ 1/(e^{t−1} + 1) is convergent, and (2) the ASM in Ex. 2.5 is tight, since the series Σ_{t=1}^∞ 1/(t + 1) is divergent.
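The contrast between the two series can be seen from partial sums: the exponentially decaying EOS probabilities (our reconstruction of Ex. 2.4) have a vanishing tail, while the harmonic-type probabilities of Ex. 2.5 grow like log T without bound. A quick numerical illustration:

```python
import math

def partial(f, T):
    return sum(f(t) for t in range(1, T + 1))

conv = lambda t: 1.0 / (math.exp(t - 1) + 1.0)  # Ex. 2.4: convergent series
div = lambda t: 1.0 / (t + 1.0)                 # Ex. 2.5: divergent series

# The convergent series barely moves after its first few terms...
assert partial(conv, 500) - partial(conv, 10) < 1e-3
# ...while the divergent one keeps growing like log T.
assert partial(div, 10**6) > partial(div, 10**3) + 6.0
print(partial(conv, 500))  # ≈ 0.96: bounded, so tightness fails by Thm. 4.7
```

By Cor. 4.6, a bounded series of EOS probabilities leaves the product of survival probabilities strictly positive, which is exactly the leaked mass of Ex. 2.4.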

Analysis of Common Language Models
We now put into practice the foundations built up in the previous sections and discuss the tightness of several classes of ASMs.

Stochastic Finite-State Language Models
Language modeling based on n-grams has been historically influential in NLP (Jurafsky and Martin, 2009, Ch. 4). However, as Fig. 1 illustrates, n-gram language models are specific cases of the more general stochastic finite-state language models (Vidal et al., 2005). Tightness is more naturally characterized in this more general setting, as it turns out. We begin with a linear-algebraic definition of stochastic finite-state language models-or, more precisely, sequence models, since in this paper we do not consider the non-tight ones to be language models.
Definition 5.1. A stochastic finite-state sequence model (SFSSM) is a quadruple (Σ, s, {P^{(a)}}_{a∈Σ}, t), where Σ is an alphabet, each P^{(a)} ∈ R_{≥0}^{Q×Q} gives the transition probabilities on symbol a,[15] s ∈ R_{≥0}^Q is a vector of initial state probabilities, and t ∈ R_{≥0}^Q is a vector of termination probabilities, i.e., probabilities of generating EOS in each state.[16] We further require that Σ_{q=1}^Q s_q = 1 and that t_q + Σ_{q′=1}^Q P_{qq′} = 1 for all 1 ≤ q ≤ Q, where P def= Σ_{a∈Σ} P^{(a)} is the transition sum matrix.
Given an SFSSM (Σ, s, {P^{(a)}}_{a∈Σ}, t), the probability of a string x = x_1 ⋯ x_n ∈ Σ* is defined by

p(x) def= s^⊤ P^{(x_1)} ⋯ P^{(x_n)} t.   (22)

Definition 5.2. A state q of an SFSSM (1 ≤ q ≤ Q) is accessible if there is a positive-probability path to q from some state r with s_r > 0; it is co-accessible if there is a positive-probability path from q to some state r with t_r > 0. It is useful if it is both accessible and co-accessible, i.e., q appears on some positive-probability accepting path.
Def. 5.2 allows a simple characterization of tight SFSSMs, namely Thm. 5.3, and a straightforward proof of Cor. 5.4.[17] In fact, we can express the termination probability of an SFSSM in simple linear-algebraic terms.

[15] For simplicity, we have disallowed ε-transitions.
[16] We use Q to denote the number of states, as Q is the traditional notation for the set of states in a finite-state automaton.
[17] Cor. 5.4 is a special case of Chi and Geman (1998), who showed that MLE estimates of PCFGs are tight.
Definition 5.5. Trimming an SFSSM means removing its non-useful (useless) states to obtain a substochastic finite-state sequence model.[18] This does not affect the string probabilities in Eq. (22). Removing the non-useful states means removing their rows and columns from P as well as their rows from s and t, yielding possibly smaller P′, s′, and t′.
Theorem 5.6. Let P′ be the transition sum matrix of a trimmed substochastic FSSM. Then I − P′ is invertible and p(Σ*) = s′^⊤ (I − P′)^{-1} t′ ≤ 1.
The well-known matrix inversion formula used above finds the total weight of all accepting paths in any weighted graph (Tarjan, 1981).[19] The formula can be seen as a special case of Lehmann's (1977) algebraic path algorithm.
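Thm. 5.6 gives a direct recipe: trim, then evaluate s′^⊤(I − P′)^{-1} t′. For a hypothetical non-tight bigram model in the spirit of Ex. 2.2 (our own numbers), state b is not co-accessible, so trimming leaves the single state a:

```python
import numpy as np

# Trimmed substochastic FSSM: state b is useless, so only state a remains.
s = np.array([1.0])      # initial probability of state a
P = np.array([[0.7]])    # transition sum matrix P' after trimming
t = np.array([0.1])      # termination (EOS) probability at state a

p_finite = s @ np.linalg.inv(np.eye(1) - P) @ t
print(p_finite)  # ≈ 1/3, matching the geometric-series computation in Ex. 2.2

# If the EOS probability absorbed all remaining mass (t = 0.3), the model
# would be tight: s'^T (I - P')^{-1} t' = 1 exactly.
assert abs(s @ np.linalg.inv(np.eye(1) - P) @ np.array([0.3]) - 1.0) < 1e-12
```

Trimming matters here: with the useless state b included, the sub-matrix for b has spectral radius 1 and I − P would be singular.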

Transformer Language Models
We now prove that all Transformer language models are tight. Key to our proof of the tightness of various neural architectures, including the Transformer, is the following basic fact in topology.
Theorem 5.7. Let X be a compact topological space and Y be any topological space. If f : X → Y is continuous, then f(X) ⊆ Y is also compact.
To address the variable-length nature of modern deep NLP models, we will mathematically abstract them as a function on vector tuples,[20] f : (R^d)^+ → (R^d)^+, that is length-preserving in the sense that f(R^{t×d}) ⊆ R^{t×d} for all t > 0.

[18] We use the term substochastic rather than stochastic here because the trimmed model satisfies t′_q + Σ_{q′=1}^{Q′} P′_{qq′} ≤ 1, but might no longer achieve equality as required by Def. 5.1.
[19] This is assuming the total weight is finite (which we guarantee by substochasticity) and the matrix is invertible (which we guarantee by trimming).
[20] Here (R^d)^+ is the set of nonempty tuples of vectors in R^d.
Intuitively, this definition says that f maps a nonempty tuple of vectors to another tuple of the same length,

(h_1, …, h_t) def= f(v_1, …, v_t),   (23)

where v_i ∈ R^d is commonly the embedding of the input symbol x_i. In particular, we can take the function f : (R^d)^+ → (R^d)^+ to be the function defined by a stack of Transformer layers. This setup will help us state the following.
Lemma 5.8. Let f : (R^d)^+ → (R^d)^+ be the function defined by a finite number of Transformer layers (e.g., n layers) with any continuous activation function. Given a compact set K ⊂ R^d, there exists a compact set K′ ⊂ R^d such that f(K^t) ⊆ (K′)^t for every t > 0.

Proof. See App. E.2. ■

Recall that a Transformer language model, or more precisely a Transformer ASM, defines the conditional probabilities using the softmax transformation

p̄(x | x_{≤t}) = exp(u_x^⊤ h_t) / Σ_{x′∈Σ̄} exp(u_{x′}^⊤ h_t),   (25)

where u_x ∈ R^d is the output symbol embedding of x ∈ Σ̄ and h_t is defined from the input embeddings of x_{≤t} via Eq. (23). Using Lemma 5.8, together with the finiteness of the vocabulary Σ̄ and the continuity of the softmax transformation (25), readily yields our main result on Transformers.
Theorem 5.9. The autoregressive sequence model defined by any (fixed-depth) Transformer is tight.
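The crux of the argument can be illustrated numerically: if the hidden states stay in a compact set, the logits are bounded, so the softmax gives EOS a probability bounded below by a positive constant, and Prop. 4.3 applies with a constant lower bound f. A sketch with hypothetical dimensions and random embeddings (all names and sizes are our own illustration):

```python
import math
import random

random.seed(0)
d, vocab = 8, 5                 # hypothetical sizes; index 0 plays EOS
U = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(vocab)]

def softmax_eos(h):
    logits = [sum(u_i * h_i for u_i, h_i in zip(u, h)) for u in U]
    z = sum(math.exp(l) for l in logits)
    return math.exp(logits[0]) / z

B = 2.0                         # bound on the hidden-state norm (compactness)
# With |logit| ≤ max_x ||u_x|| * B by Cauchy-Schwarz, every symbol keeps
# probability at least exp(-2 * max_logit) / vocab, a constant independent of t.
max_u = max(math.sqrt(sum(c * c for c in u)) for u in U)
floor = math.exp(-2 * max_u * B) / vocab

for _ in range(1000):
    h = [random.uniform(-B / math.sqrt(d), B / math.sqrt(d)) for _ in range(d)]
    assert softmax_eos(h) >= floor
print("EOS probability is bounded below by a positive constant")
```

The same reasoning fails for sparse output transformations that can assign EOS probability exactly 0, which is why the proof exploits the strict positivity of softmax (see Limitations).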

Recurrent Neural Language Models
Recall that the hidden state of an RNN is typically defined by the recurrence

h_t = σ(W h_{t−1} + U v_t + b),

where v_t ∈ R^d is the embedding of the input symbol x_t, as above, and σ(·) is some activation function (Elman, 1990). The conditional probabilities are usually defined in the same way as Eq. (25). Using Thm. 5.7 and the same strategy of proof as in Thm. 5.9, one can also easily prove the tightness of any RNN ASM with bounded activations (e.g., tanh or sigmoid). However, as we saw in Ex. 2.4, an unbounded activation function (e.g., ReLU) can indeed lead to non-tightness by making the probability of EOS decay too fast. The condition derived in Thm. 4.7 precisely determines how fast such decay can be without losing the tightness of the language model. Below, we generalize this result as well as Lemma 3.2 of Welleck et al. (2020), and show that if the norm of the hidden state eventually grows at most logarithmically, the RNN is still tight.
Proposition 5.10. Again let the output symbol vector be u_x ∈ R^d for x ∈ Σ̄, and set k def= sup_{x∈Σ̄} ∥u_x − u_EOS∥_2. Additionally, for each t > 0, let ∥ĥ_t∥_2 be the maximum attainable hidden state norm for any context x ∈ Σ^t. Such a sequence model is tight if k∥ĥ_t∥_2 ≤ log t for all sufficiently large t.
This result is weaker than Thm. 5.9 because in an RNN, unlike a Transformer, the depth of the computation graph grows with the sequence length.
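For bounded activations the condition above holds trivially: tanh keeps every coordinate of h_t in [−1, 1] no matter how long the input, so ∥ĥ_t∥_2 is uniformly bounded and eventually falls below (log t)/k. A small randomized sketch with hypothetical weights:

```python
import math
import random

random.seed(1)
d = 4
W = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]

def step(h, v):
    # h_t = tanh(W h_{t-1} + v_t): tanh maps every coordinate into [-1, 1]
    return [math.tanh(sum(W[i][j] * h[j] for j in range(d)) + v[i])
            for i in range(d)]

h = [0.0] * d
for _ in range(10000):
    v = [random.uniform(-1, 1) for _ in range(d)]
    h = step(h, v)
    assert all(abs(c) <= 1.0 for c in h)  # hidden states stay in [-1, 1]^d

print("hidden states remain bounded, so k*||h_t|| <= log t for large t")
```

By contrast, the ReLU construction of Ex. 2.4 has ∥h_t∥ growing linearly in t, which violates the logarithmic budget and leaks mass to Σ^∞.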

Conclusion
This paper presents a measure-theoretic treatment of language modeling and its tightness. Practical implications of our results include determining when sampling from an autoregressive sequence model is guaranteed to terminate and whether MCMC algorithms over such models will mix to the correct distribution.
To this end, we first defined various components of language modeling in measure-theoretic terminology. This in turn allowed us to understand the portion of probability mass allocated to infinite-length strings. Importantly, this presentation formalizes a definition of sequence modeling under which the probability of producing an infinite-length sequence may be non-zero; while today's models are often capable of producing such strings, previously there was no rigorous treatment of this case.
Indeed, such a definition is useful when considering a number of neural architectures (e.g., a simple RNN as in Elman, 1990) and language generation systems (e.g., the distribution induced by nucleus sampling; Holtzman et al., 2020). In particular, we showed that perhaps the most commonly-used NLP architecture, the Transformer language model, is indeed a language model, i.e., a tight distribution over finite strings, a property that had been called into question by previous work.

Limitations
Our discussion in this paper leaves out the consideration of computability of measures over languages. Specifically, we note that there exist works on computable measure theory developed in the context of theoretical computer science (de Leeuw et al., 1956) and probabilistic programming languages (Roy, 2011). Additional machinery needs to be developed for a proper treatment of computability, and we leave this for future work.
Another notable limitation is that we exclusively focused on the autoregressive production of language. Importantly, our formalism might not be compatible with other models of language production such as those induced by a PCFG.
Finally, our proofs of Thm. 5.9 and Prop. 5.10 exploit the strict positivity of the softmax function. Importantly, they do not apply to models with sparse distributions (Martins and Astudillo, 2016; Peters et al., 2019; Martins, 2021).

Ethics
There are no ethical implications of this paper to the best knowledge of the authors.

A Related Work
The issue of tightness has been studied extensively in the context of probabilistic context-free grammars (PCFGs; Chi and Geman, 1998; Chi, 1999; Cohen and Johnson, 2013), although Chi (1999) refers to non-tight models as improper. Specifically, Chi (1999) gave algorithms for determining the tightness of a PCFG by formalizing a PCFG as a branching process. Chi (1999) further proved that any maximum-likelihood estimator yields a tight PCFG. Several previous works study the ability of language models to place probability mass on infinite-length strings (Booth and Thompson, 1973; Nederhof and Satta, 2006; Chen et al., 2018; Welleck et al., 2020), where they refer to non-tight language models as inconsistent.
In some cases, this behavior can be attributed to the discrepancy between the language model itself and the distribution induced by a (possibly stochastic) decoding algorithm: the decoder may have a lower probability of generating the EOS token. For example, on the tight bigram model of Ex. 2.3, a greedy decoder will always generate a and never EOS. Yet in other examples, it is the model itself that leaks probability mass to infinite-length strings, i.e., it may be non-tight; this is the problem we focus on in this work, providing a characterization of tightness. Notably, the conditions we propose are more general than those of Welleck et al. (2020). Several other works consider the limitations of common neural network architectures for modeling distributions over finite sequences (strings), albeit focusing specifically on other attributes, such as their computational complexity for problems like equivalence or undecidability (Chen et al., 2018; Lin et al., 2021; Lin and McCarthy, 2022; Lin, 2022). In contrast, this work provides a formal treatment of language models by enlarging the sample space to Σ* ∪ Σ^∞, although to ensure tightness, Σ^∞ must receive probability 0. Such definitions are not uncommon in probability theory. For example, while the Wiener process (i.e., the standard Brownian motion) is a distribution over all functions, the definition ensures that the set of discontinuous functions is assigned probability 0 (Durrett, 2019, Ch. 7).
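The model-versus-decoder discrepancy mentioned above can be made concrete with a small simulation. The probabilities below are hypothetical (not the exact numbers of Ex. 2.3): the model is tight, since ancestral sampling terminates almost surely, yet argmax decoding never emits EOS.

```python
import random

# A toy tight model over {a, EOS}: at every step p(a) = 0.6, p(EOS) = 0.4
# (hypothetical numbers). The model is tight: P(length >= t) = 0.6^t -> 0.
P = {"a": 0.6, "EOS": 0.4}

def sample_length(rng, max_steps=10_000):
    # Ancestral sampling: geometric termination, finite with probability 1.
    for t in range(max_steps):
        if rng.random() < P["EOS"]:
            return t
    return max_steps  # effectively unreachable

rng = random.Random(0)
lengths = [sample_length(rng) for _ in range(10_000)]
mean_len = sum(lengths) / len(lengths)   # ~ 0.6 / 0.4 = 1.5

# Greedy decoding always picks the argmax symbol, which is 'a' (0.6 > 0.4),
# so the *decoder* never terminates even though the *model* is tight.
greedy_choice = max(P, key=P.get)
print(mean_len, greedy_choice)
```

Here the leakage is entirely a property of the decoding algorithm; the underlying distribution places all of its mass on finite strings.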
Meister et al. (2022) similarly address the notion of a language model as a distribution over infinite sequences by casting such models as stochastic processes. They use this framing in order to motivate decoding, without providing comprehensive measure-theoretic foundations of such distributions.

B Details for Motivating Ex. 2.3
Here, we lay out the steps to calculate P(Σ*) from Fig. 1b.

This theorem is an impossibility-of-measure theorem. Generally speaking, the existence of a non-measurable set implies some form of impossibility of measure. The most famous examples of non-measurable sets are the Vitali sets, which exist given the Axiom of Choice. Vitali's 1905 construction is typically described in introductory texts on measure theory (Royden, 1988; Billingsley, 1995; Axler, 2020). The existence of Vitali sets shows that it is impossible to define a probability measure that satisfies translational invariance on the measurable space ([0, 1), P([0, 1))). Thus, to achieve translational invariance, the Lebesgue measure uses a σ-algebra smaller than P([0, 1)), in which the Vitali sets are among the non-measurable sets. However, the translational invariance desideratum is not relevant to our space of discrete sequences. A theorem by Ulam (1930) reveals a deeper reason that some sets must be non-measurable. We shall state the theorem below as given in Oxtoby (1980) and omit its proof. We refer interested readers to Chapter 5 in Oxtoby (1980), which contains an accessible proof and an excellent discussion of the theorem, including its generalizations and historical context.

Theorem C.1 (Ulam, 1930). Assuming the Axiom of Choice, a finite measure µ defined for all subsets of a set X of cardinality ℵ₁ vanishes identically [that is, equals zero for all subsets] if it is equal to zero for every one-element subset.
In the statement above, ℵ₁ denotes the cardinality of the first uncountable ordinal. We can see that Thm. 3.5 is a straightforward consequence of Thm. C.1.
Proof. First, Ω ∈ C, since Ω is a cylinder set of rank 0, or indeed of any rank k: Ω = C(Σ^k). Second, C is closed under complements: given a cylinder set C(H) of rank k, that is, with H ⊆ Σ^k, its complement C(H)^c = C(Σ^k \ H) is also a cylinder set of rank k. Finally, C is closed under union: the union of cylinder sets C(H₁) and C(H₂) of ranks k₁ ≤ k₂ is a cylinder set of rank k₂, since both can be regarded as cylinder sets of rank k₂. Hence, C is an algebra over Ω. ■

Proposition C.2. P₀ as defined in Eq. (8) is a well-defined function.
Proof. Suppose a cylinder set has two representations, C(H₁) = C(H₂), with H₁ ⊆ Σ^{k₁} and H₂ ⊆ Σ^{k₂}. We must show that ∑_{x∈H₁} p̄(x) = ∑_{x′∈H₂} p̄(x′). Without loss of generality, assume that k₁ ≤ k₂. The definition of C(H₂) (Def. 3.8) implies that H₂ consists of all length-k₂ prefixes of strings in C(H₂). But C(H₂) = C(H₁), so the definition of C(H₁) (Def. 3.8) implies that its length-k₂ prefixes are exactly the strings of the form xy where x ∈ H₁ and y ∈ Σ^{k₂−k₁}. Hence we can write H₂ in terms of H₁ as H₂ = {xy : x ∈ H₁, y ∈ Σ^{k₂−k₁}}, and therefore

∑_{x′∈H₂} p̄(x′) = ∑_{x∈H₁} ∑_{y∈Σ^{k₂−k₁}} p̄(xy) = ∑_{x∈H₁} p̄(x),

where the last equality is true because p̄ is defined by the locally normalized product (9). ■

Lemma 3.10. P₀ is a pre-measure over C.
For the proof of Lemma 3.10, we will mostly follow the proof of Thm 2.3 in Billingsley (1995), with the exception of invoking the Tychonoff theorem directly. This proof depends on the following lemma, which is Example 2.10 in Billingsley (1995). We repeat the statement and proof here for the reader's convenience.
Lemma C.3. Let P₀ be a finitely additive probability pre-measure over C such that, for every decreasing sequence of sets A₁ ⊃ A₂ ⊃ ⋯ in C with ⋂_{n=1}^∞ Aₙ = ∅, it holds that lim_{n→∞} P₀(Aₙ) = 0. Then P₀ is also countably additive over C.
Proof. Let {Aₙ} be a sequence of disjoint sets in C such that A = ⋃ₙ Aₙ ∈ C. Then, defining Bₙ = ⋃_{m>n} Aₘ, we see that B₁ ⊃ B₂ ⊃ ⋯ and ⋂ₙ Bₙ = ∅. Notice that, for any n, A = A₁ ∪ ⋯ ∪ Aₙ ∪ Bₙ, and hence, by the finite additivity of P₀,

P₀(A) = ∑_{m=1}^n P₀(Aₘ) + P₀(Bₙ),

or equivalently

∑_{m=1}^n P₀(Aₘ) = P₀(A) − P₀(Bₙ). (31)

Since Bₙ ↓ ∅ implies that P₀(Bₙ) ↓ 0 by assumption, taking the limits on both sides of Eq. (31) yields

∑_{m=1}^∞ P₀(Aₘ) = P₀(A),

which shows countable additivity. ■

We also recall the Tychonoff theorem.²¹

Theorem C.4 (Tychonoff). Let {X_α}_{α∈J} be an indexed family of compact topological spaces. Then their product ∏_{α∈J} X_α is also compact in the product topology.
We can now give the proof for Lemma 3.10.
Proof of Lemma 3.10. We first show that P₀ is finitely additive over C. Let C(H₁) and C(H₂) be two disjoint cylinder sets. By Prop. C.2, we may assume without loss of generality that they are of the same rank k, in which case H₁ and H₂ are disjoint subsets of Σ^k. Then

C(H₁) ∪ C(H₂) = {xω : x ∈ H₁ ∪ H₂, ω ∈ Σ^∞} = C(H₁ ∪ H₂),

which leads to

P₀(C(H₁) ∪ C(H₂)) = ∑_{x∈H₁∪H₂} p̄(x) = ∑_{x∈H₁} p̄(x) + ∑_{x∈H₂} p̄(x) = P₀(C(H₁)) + P₀(C(H₂)).

Hence, P₀ is finitely additive.

Now, equip Σ with the discrete topology. Since Σ is finite, it is compact under the discrete topology, and so is Σ^∞ by Thm. C.4. Then, by properties of the product topology over discrete finite spaces, all cylinder sets in Σ^∞ are compact. To apply Lemma C.3, let C₁ ⊃ C₂ ⊃ ⋯ be a decreasing sequence of cylinder sets with empty intersection. Suppose, to the contrary, that lim_{n→∞} P₀(Cₙ) > 0. This would imply that all Cₙ are nonempty (any of these being empty would result in measure 0). However, by Cantor's intersection theorem,²² ⋂ₙ Cₙ would then be nonempty, contradicting the assumption of empty intersection. Hence lim_{n→∞} P₀(Cₙ) = 0, and by Lemma C.3, P₀ is countably additive. ■

²¹ See §37 in Munkres (2000) for a detailed and well-written treatise.
²² Cantor's intersection theorem states that a decreasing sequence of nonempty compact sets has a nonempty intersection. A version of this result in introductory real analysis is the Nested Interval Theorem.
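The consistency property at the heart of Prop. C.2, namely that P₀(C(H)) does not depend on the rank at which a cylinder is represented, can be spot-checked numerically for a toy locally-normalized model (the conditional probabilities below are arbitrary illustrative values):

```python
import itertools

SIGMA = ("a", "b")

def p_cond(sym, prefix):
    # Toy locally-normalized conditionals (arbitrary values):
    # p(a | ...) is 0.4 after a 'b', otherwise 0.7.
    p_a = 0.4 if (prefix and prefix[-1] == "b") else 0.7
    return p_a if sym == "a" else 1.0 - p_a

def p_bar(x):
    # pbar(x) = prod_t p(x_t | x_{<t}); locally normalized, so for each
    # prefix the conditionals over SIGMA sum to 1.
    prob = 1.0
    for i, sym in enumerate(x):
        prob *= p_cond(sym, x[:i])
    return prob

def P0(H, k):
    # P0(C(H)) = sum_{x in H} pbar(x) for a rank-k cylinder.
    assert all(len(x) == k for x in H)
    return sum(p_bar(x) for x in H)

# The rank-1 cylinder C({a}) equals the rank-2 cylinder C({aa, ab});
# Prop. C.2 says P0 must agree on the two representations.
v1 = P0({("a",)}, 1)
v2 = P0({("a", "a"), ("a", "b")}, 2)

# Finite additivity: the disjoint rank-2 thin cylinders partition Omega.
total = P0(set(itertools.product(SIGMA, repeat=2)), 2)
print(v1, v2, total)
```

The two representations agree (both equal 0.7), and the rank-2 cylinders' masses sum to 1, exactly the finite additivity used in the proof of Lemma 3.10.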
C.2 Details in §3.4

C.2.1 Carathéodory's Extension Theorem

Theorem 3.11 (Carathéodory's Extension Theorem). Given an algebra A over some set Ω and a probability pre-measure P₀ : A → [0, 1], there exists a probability space (Ω, F, P) such that A ⊂ F and P|_A = P₀. Furthermore, the σ-algebra F depends only on A and is minimal and unique (thus we may denote it by σ(A)), and the probability measure P is unique.
Proof Sketch. First, construct an outer measure by approximation with countable coverings. Then, show that the collection of sets that is measurable with respect to this outer measure is a σ-algebra F that contains A. Finally, restricting the outer measure to this σ-algebra, one is then left with a probability space. To show minimality, one can show that F is contained in any σ-algebra that contains A. Uniqueness is given by applying Dynkin's π-λ theorem (Theorem 3.2 in Billingsley, 1995).
Great care must be taken in each step of the outline above. Addressing these steps in detail is well beyond the scope of this paper, and we refer the reader to the many excellent texts with a proof of this theorem, such as Chapter 12 in Royden (1988) and Chapter 11 in Billingsley (1995). ■

C.2.2 The Space of Non-measurable Sets
Non-measurable sets are, in general, difficult to find. Even when we can exhibit such sets, they tend to be very abstract and counter-intuitive. Vitali's and Bernstein's sets are two prominent examples for the Lebesgue measure. Blackwell and Diaconis (1996) offer a construction of a non-measurable set in the cylinder σ-algebra.²³ As another approach to understanding this better, we can consider how our collection σ(C) of all measurable sets, i.e., our σ-algebra, is constructed from our algebra C of cylinder sets (as opposed to simply knowing from Carathéodory's Extension Theorem that it exists). Concretely, as in §1.6 in Folland (1999), we can intuitively consider the following process to build from the collection of cylinder sets C, which is a countable collection, all the way up to its generated σ-algebra, whose cardinality is unknown just yet:

• Let C₀ = C.
• Let C₁ be the collection that includes all countable unions of sets in C₀, together with the complements of such unions.
• Repeat this process to build Cₙ for every n ∈ ℕ.

One might then take the union ⋃_{n∈ℕ} Cₙ of this increasing sequence of collections of sets, and ask if it is the same as σ(C). In general, the answer is no (as one might expect if one is familiar with the Borel hierarchy). However, we can obtain σ(C) if we perform this construction for every countable ordinal α. Abbreviating the operation in the second step above as δ, i.e., C₁ = δ(C₀), and letting ω₁ be the collection of all countable ordinals,²⁴ we can define

C_α ≝ δ(⋃_{β<α} C_β)   for each ordinal α ∈ ω₁.

This will give us the desired set as follows:

Proposition C.5 (Proposition 1.23, Folland, 1999). σ(C) = ⋃_{α∈ω₁} C_α.
Next, we recall the following basic fact from cardinality theory.
That is, most subsets of Σ^∞ are non-measurable, though explicit examples have rarely been constructed (Blackwell and Diaconis, 1996). App. C.3 below establishes that the common subsets of Σ^∞ that we work with are measurable.
Proof. To show that X is measurable, it suffices to show the measurability of the preimages of a generating set²⁵ of the σ-algebra σ(C) on Σ* ∪ Σ^∞. Such a generating set is formed by the thin cylinders C(x) ≝ C({x}) for x ∈ Σ*. (Recall that cylinders in Σ* ∪ Σ^∞ are defined by Eq. (11).) Note that the set A_k, defined by Eq. (16), is a cylinder of Σ^∞, representing the event of terminating by step k. Given x ∈ Σ*, the preimage X^{−1}(C(x)) is formed by countable operations (unions, intersections, and complements) over such measurable sets (cylinders) of Σ^∞, and is hence measurable. So X is a measurable function. ■

Proposition C.7. In the measure space (Σ* ∪ Σ^∞, σ(C)), {x} is measurable for all x ∈ Σ*.
Proposition C.8. In the measure space (Σ* ∪ Σ^∞, σ(C)), the set Σ^∞ is measurable.

Proof. First, Σ* ∪ Σ^∞ is the entire outcome space, which is measurable by the definition of a σ-algebra. Notice that

Σ^∞ = (Σ* ∪ Σ^∞) \ ⋃_{x∈Σ*} {x}.

Since each {x} in the above is measurable by Prop. C.7 and Σ* is a countable set, Σ^∞ is then measurable. ■

The measurability of Σ^∞ in (Σ* ∪ Σ^∞, σ(C)) (Prop. C.8) was assumed by our definition of tightness (Def. 4.1). As we have also established that each {x} is measurable (Prop. C.7), we can give an alternative characterization.
Proof. We defined a sequence model to be tight if and only if P(Σ^∞) = 0 (Def. 4.1). By Propositions C.7 and C.8, we can write

P(Σ^∞) = P(Σ* ∪ Σ^∞) − P(Σ*) = 1 − ∑_{x∈Σ*} P({x}),

so that the model is tight if and only if ∑_{x∈Σ*} P({x}) = 1. ■

The result below is stated without proof as Exercise 4.3.5 in Durrett (2019).
Note that P(⋂_{m=1}^n A_m^c) > 0 for any n, for otherwise the conditional probabilities would be undefined. Let pₙ ≝ P(⋂_{m=1}^n A_m^c). Then we have that pₙ > 0 for all n, and, by the chain rule,

pₙ = ∏_{m=1}^n (1 − P(Aₘ | A₁^c ∩ ⋯ ∩ A_{m−1}^c)) ≤ exp(−∑_{m=1}^n P(Aₘ | A₁^c ∩ ⋯ ∩ A_{m−1}^c)),

where the inequality uses 1 − x ≤ e^{−x}. Since the conditional probabilities sum to ∞ by assumption, the right-hand side tends to 0 as n → ∞, and hence

P(⋂_{m=1}^∞ A_m^c) = lim_{n→∞} pₙ = 0.

In other words, P is tight.
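The dichotomy used throughout this section, namely that an infinite product ∏(1 − pₙ) vanishes exactly when ∑ pₙ diverges (Cor. 4.6), is easy to observe numerically. The two sequences below are chosen so that the partial products telescope to known values: pₙ = 1/(n+1) gives ∏_{n≤N}(1 − pₙ) = 1/(N+1) → 0, while pₙ = 1/(n+1)² gives (N+2)/(2N+2) → 1/2.

```python
def partial_product(p, N):
    # prod_{n=1}^{N} (1 - p(n)): the probability that "no event ever
    # occurs" for independent events with P(A_n) = p(n).
    prod = 1.0
    for n in range(1, N + 1):
        prod *= 1.0 - p(n)
    return prod

# Divergent sum (p_n = 1/(n+1)): the product telescopes to 1/(N+1) -> 0.
divergent = [partial_product(lambda n: 1.0 / (n + 1), N)
             for N in (10**2, 10**4, 10**6)]

# Convergent sum (p_n = 1/(n+1)^2): telescopes to (N+2)/(2N+2) -> 1/2 > 0.
convergent = [partial_product(lambda n: 1.0 / (n + 1) ** 2, N)
              for N in (10**2, 10**4, 10**6)]

print(divergent, convergent)
```

Reading pₙ as a per-step EOS probability, 1 minus these products is the termination probability: it tends to 1 in the divergent case (tight) and to roughly 1/2 in the convergent case (non-tight).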
Proposition 4.3. If p̄(EOS | x) ≥ f(t) for every t ∈ ℕ and every x ∈ Σ^{t−1}, where f : ℕ → [0, 1] satisfies ∑_{t=1}^∞ f(t) = ∞, then the sequence model p̄ is tight.

Proof. In the proof, we rename the index t to n to match the usual presentation of the Borel–Cantelli lemmata. We are given that p̄(EOS | x) ≥ f(n) for all x ∈ Σ^{n−1}. To apply Lemma 4.2, we observe that

P(A₁) = p̄(EOS | ε) ≥ f(1),

and similarly, for n ≥ 2,

P(Aₙ | A₁^c ∩ ⋯ ∩ A_{n−1}^c) = ∑_{x∈Σ^{n−1}} p̄(EOS | x) P(x | A₁^c ∩ ⋯ ∩ A_{n−1}^c) ≥ f(n).

Since ∑_{n=1}^∞ f(n) = ∞, and hence ∑_{n=2}^∞ f(n) = ∞, the above inequality shows that the condition of Lemma 4.2 holds. Hence, by Lemma 4.2, the event of a string never terminating, i.e., ⋂_{k=1}^∞ A_k^c, has probability measure P(⋂_{k=1}^∞ A_k^c) = 0. In summary, if the EOS probability of a language model is lower-bounded at every step by the terms of a divergent series, then the event that this language model terminates has probability 1. ■

Corollary 4.6. Given a sequence {pₙ} where pₙ ∈ [0, 1), we have ∏_{n=1}^∞ (1 − pₙ) = 0 if and only if ∑_{n=1}^∞ pₙ = ∞.

Proof. We can use a product measure to construct a sequence of independent events {Aₙ}_{n=1}^∞ such that P(Aₙ) = pₙ. (The product measure ensures independence.) Then, by the definition in Eq. (18),

P([Aₙ i.o.]) = P(⋂_{m=1}^∞ ⋃_{n=m}^∞ Aₙ) = lim_{m→∞} P(⋃_{n=m}^∞ Aₙ) = lim_{m→∞} (1 − ∏_{n=m}^∞ (1 − pₙ)),

where the last equality uses the independence of the Aₙ. Suppose ∑_{n=1}^∞ pₙ = ∞. Then, by the second Borel–Cantelli lemma, P([Aₙ i.o.]) = 1, and so lim_{m→∞} ∏_{n=m}^∞ (1 − pₙ) = 0. Observe that ∏_{n=m}^∞ (1 − pₙ) is a non-decreasing sequence in m; to see this, note that as m grows larger we multiply strictly fewer values (1 − pₙ) ∈ (0, 1]. Since this sequence is non-negative, non-decreasing, and tends to 0, it follows that ∏_{n=m}^∞ (1 − pₙ) = 0 for every m; in particular, ∏_{n=1}^∞ (1 − pₙ) = 0. Conversely, suppose ∑_{n=1}^∞ pₙ < ∞. Then there exists m such that ∑_{n=m}^∞ pₙ < 1, whence ∏_{n=m}^∞ (1 − pₙ) ≥ 1 − ∑_{n=m}^∞ pₙ > 0; since each factor 1 − pₙ is positive, it follows that ∏_{n=1}^∞ (1 − pₙ) > 0. ■

Case 1. Suppose that p_EOS(t) < 1 for all t. Consider the termination probability again:

P(⋃_{t=1}^∞ Aₜ) = 1 − P(⋂_{t=1}^∞ Aₜ^c) = 1 − ∏_{t=1}^∞ (1 − p_EOS(t)).

In the above, we have used that P(A₁^c ∩ ⋯ ∩ Aₜ^c) > 0 for all t, which is true by the assumption that p_EOS(t) < 1. Hence, by Cor. 4.6, the product above, and with it the probability of non-termination, is 0 if and only if ∑ₜ p_EOS(t) = ∞.
Case 2. If p_EOS(t) = 1 for some t = t₀, then P(A₁^c ∩ ⋯ ∩ A_{t₀}^c) = 0, hence P(⋂_{t=1}^∞ Aₜ^c) = 0, and such a language model is guaranteed to terminate by step t₀. ■

Proof. We refer to a state q as initial if s_q > 0 and as final if t_q > 0. (These are sometimes called source and sink states.) We prove each direction of the theorem in turn:

(⇒): Assume the SFSSM is tight. Let q be an accessible state. Since q is accessible, the SFSSM has at least one positive-probability path to q from an initial state, so there is a positive probability of reaching q during generation. If there were no positive-probability path from q to a final state, then the SFSSM would never terminate on the occasions when it reached q, contradicting the assumption of tightness. Hence q must be co-accessible.
(⇐): Assume that all accessible states are co-accessible. We construct a Markov chain whose states are the SFSSM's accessible states Q A ⊆ {1, . . . , Q} together with an EOS state. In this Markov chain, the initial probability of q is given by s q when q ∈ Q A and by 0 when q = EOS; the transition probability from q to q ′ is given by P qq ′ when q, q ′ ∈ Q A , by t q when q ∈ Q A and q ′ = EOS, by 1 when q = q ′ = EOS, and by 0 otherwise. The probability that the Markov chain is in state q ∈ Q A after t steps equals the probability that the SFSSM is in state q after t steps (note that the SFSSM never reaches any state q / ∈ Q A ). The probability that it is in state EOS after t steps equals the probability that the SFSSM has terminated after ≤ t steps.
Clearly EOS is an absorbing state of the Markov chain, meaning that once the Markov chain reaches this state, it never leaves. A fundamental result on finite-state Markov chains (Grinstead and Snell, 1997, Theorem 11.3) is that if every state can reach an absorbing state, then with probability 1, the chain reaches an absorbing state ("is absorbed") in finite time. Every state can in fact reach EOS, by the co-accessibility of Q_A. This further implies that EOS is the only absorbing state (as an absorbing state cannot reach any other state). So by the result cited above, the Markov chain reaches EOS with probability 1 in finite time. Consequently, the SFSSM terminates after finitely many steps with probability 1; that is, the SFSSM is tight. ■

Corollary 5.4. Maximum likelihood estimates of n-gram models based on some corpus are tight.
Proof. The SFSSM for an n-gram model has states that correspond to (n − 1)-grams and transitions that correspond to characters (unigrams), as illustrated by Fig. 1. When the SFSSM's probabilities are estimated with MLE, the accessible states are (n − 1)-grams that have appeared in some string in the corpus. Such states must also be co-accessible so that they can generate the rest of that string. Hence, by Thm. 5.3, this SFSSM is tight. ■
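The accessibility/co-accessibility check behind Cor. 5.4 can be sketched for a bigram MLE estimated from a three-string toy corpus (the corpus and the BOS/EOS markers below are our own illustrative choices):

```python
from collections import defaultdict

BOS, EOS = "<s>", "</s>"
corpus = ["ab", "aab", "ba"]   # toy training corpus (hypothetical)

# MLE bigram transition counts; a positive count corresponds to a
# positive-probability transition in the SFSSM.
counts = defaultdict(lambda: defaultdict(int))
for string in corpus:
    prev = BOS
    for ch in string:
        counts[prev][ch] += 1
        prev = ch
    counts[prev][EOS] += 1

def reachable(start, fwd):
    # States reachable from `start` along positive-probability transitions.
    seen, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for q2 in fwd.get(q, {}):
            if q2 not in seen:
                seen.add(q2)
                stack.append(q2)
    return seen

accessible = reachable(BOS, counts)
# Thm. 5.3's condition: every accessible state must be co-accessible,
# i.e., EOS must be reachable from it. For MLE models this always holds,
# since each accessible state occurred in a training string that the
# model can finish the same way the string did.
co_accessible_ok = all(EOS in reachable(q, counts) for q in accessible)
print(sorted(accessible), co_accessible_ok)
```

For any corpus, not just this one, the check succeeds by the argument in the proof above, which is exactly what Cor. 5.4 asserts.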

E.1.2 Proofs for Substochastic FSSMs
To prove Thm. 5.6, we will make use of the following useful lemma.
Lemma E.1. Let P′ be the transition sum matrix of a trimmed substochastic FSSM. Then ρ(P′) < 1, where ρ(·) denotes the spectral radius.

Proof. To begin with, we wish to apply the following result, which connects the row sums of a matrix to its spectral radius. Below, Mₙ denotes the set of n × n matrices, and |||A|||_∞ = max_{1≤i≤n} ∑_{j=1}^n |A_ij| denotes the operator ∞-norm.

Proposition E.2 (Horn and Johnson, 2012). Let A ∈ Mₙ be non-negative and irreducible. Then ρ(A) ≤ |||A|||_∞, and the inequality is strict unless all row sums of A are equal.

However, the transition sum matrix P′ of a substochastic FSSM may be reducible, whereas the irreducibility condition in Prop. E.2 cannot be dropped. Hence, we need to "decompose" P′ in a way that recovers irreducibility. We use the Frobenius normal form (also known as the irreducible normal form) to achieve this.

Proposition E.3 (§8.3.P8, Horn and Johnson, 2012). Let A ∈ Mₙ be non-negative. Then either A is irreducible, or there exists a permutation matrix Π such that Π^⊤ A Π is block upper triangular with irreducible diagonal blocks A₁, …, A_k (each possibly a 1 × 1 zero matrix). This is called a Frobenius normal form (or irreducible normal form) of A. Additionally, λ(A) = λ(A₁) ∪ ⋯ ∪ λ(A_k), where λ(·) denotes the set of eigenvalues of a matrix.
Notice that the permutation in the Frobenius normal form merely renumbers the states of the trimmed FSSM. We may check that, as a result, the termination probability given in Thm. 5.6 is unchanged:²⁶

s′^⊤ (I − P′)^{−1} t′ = (Π^⊤ s′)^⊤ (I − Π^⊤ P′ Π)^{−1} (Π^⊤ t′).

Hence, with an appropriate renumbering, we may assume without loss of generality that P′ is already given in Frobenius normal form, with irreducible diagonal blocks P′₁, …, P′_k. Since the transition sum matrix P′ of a trimmed substochastic FSSM is a substochastic matrix, each P′ᵢ is also substochastic. In fact, each P′ᵢ is strictly substochastic, meaning that at least one of its rows sums to less than 1. To see this, suppose to the contrary that some P′ᵢ is stochastic. Since the FSSM is trimmed, every state is both accessible and co-accessible. Being accessible implies that there is a positive probability of reaching every state corresponding to P′ᵢ. However, the stochasticity of P′ᵢ forces the corresponding entries of t′ to be 0. Hence, none of these states can transition to EOS, meaning that they are not co-accessible, contradicting the assumption. Hence, every P′ᵢ is strictly substochastic. Then, for each i, either all row sums of P′ᵢ are less than 1 (in which case |||P′ᵢ|||_∞ < 1), or some row sums equal 1 and some are less than 1 (in which case |||P′ᵢ|||_∞ = 1 and P′ᵢ has unequal row sums). In either case, Prop. E.2 implies that ρ(P′ᵢ) < 1, for all 1 ≤ i ≤ k. Finally, the last sentence of Prop. E.3 entails that ρ(P′) = max{ρ(P′₁), …, ρ(P′_k)}. Since each ρ(P′ᵢ) < 1, we have ρ(P′) < 1. ■

Theorem 5.6. Let P′ be the transition sum matrix of a trimmed substochastic FSSM. Then I − P′ is invertible and P(X ∈ Σ*) = s′^⊤ (I − P′)^{−1} t′ ≤ 1.
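Thm. 5.6 and the spectral-radius argument above can be sanity-checked on a small trimmed substochastic FSSM (the transition weights below are arbitrary toy values): the spectral radius of P′ comes out below 1, and s′^⊤ (I − P′)^{−1} t′ evaluates to 1, i.e., this FSSM is tight.

```python
import math

# Trimmed substochastic FSSM with two states (arbitrary toy weights).
# The row-sum deficits of P' are exactly the EOS weights t'.
P = [[0.5, 0.3],
     [0.2, 0.6]]
t = [0.2, 0.2]   # t'_q = 1 - sum_j P'[q][j]
s = [1.0, 0.0]   # initial distribution s'

# Spectral radius of the 2x2 matrix via the characteristic polynomial.
tr = P[0][0] + P[1][1]
det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
disc = math.sqrt(tr * tr - 4.0 * det)     # real for this matrix
rho = max(abs(tr + disc), abs(tr - disc)) / 2.0

# Termination probability s'^T (I - P')^{-1} t' via the 2x2 inverse.
a, b = 1.0 - P[0][0], -P[0][1]
c, d = -P[1][0], 1.0 - P[1][1]
det_i = a * d - b * c                     # nonzero since rho < 1
inv = [[d / det_i, -b / det_i],
       [-c / det_i, a / det_i]]
term = sum(s[i] * inv[i][j] * t[j] for i in range(2) for j in range(2))
print(rho, term)
```

Here ρ(P′) = 0.8 < 1, and because t′ equals the row-sum deficit (I − P′)𝟙, the termination probability is exactly 1; dropping some EOS mass from t′ would make it strictly less than 1.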

E.2 Proofs for Transformer Result ( §5.2)
Again, the following theorem is well-known: Theorem 5.7. Let X be a compact topological space and Y be any topological space. If f : X → Y is continuous, then f (X) ⊆ Y is also compact.
Proof. Let {U_α}_{α∈A} be any open cover of f(X). By continuity, f^{−1}(U_α) ⊆ X is open for any α ∈ A, and hence {f^{−1}(U_α)}_{α∈A} is also an open cover of X. By the compactness of X, there is a finite sub-cover {f^{−1}(U_{α_i})}_{i=1}^n, in which case {U_{α_i}}_{i=1}^n forms a finite sub-cover of f(X). ■

Lemma 5.8. Let f : (ℝ^d)^+ → (ℝ^d)^+ be the function defined by a finite number of Transformer layers (say, n layers) with any continuous activation function. Given a compact set K ⊂ ℝ^d, there exists a compact set K′ ⊂ ℝ^d such that, for every t ∈ ℤ_{>0}, f(K^t) ⊆ (K′)^t.

Note. We make use of the following notation in the proof below: △^{t−1} = {y ∈ ℝ^t : y ≥ 0, 1^⊤ y = 1} denotes the (t − 1)-dimensional simplex; B_r(z) = {v ∈ ℝ^d : dist(z, v) < r} denotes the open ball centered at z with radius r; Ā denotes the closure of a set A.
Proof. Let K₀ = K. In an autoregressive Transformer, each of the n layers consists of two blocks: a self-attention block and a feedforward block. We will use induction on the 2n blocks to build up compact sets K₁, K₂, …, K_{2n} that contain the output vectors of these respective blocks, and then take K′ = K_{2n}. The self-attention block is a function from (ℝ^d)^+ to (ℝ^d)^+. So, let t ∈ ℤ_{>0} be arbitrary and consider any sequence of input vectors (v₁, …, vₜ) such that vᵢ ∈ K₀ for all i. Denote the output vectors of the attention block by (v′₁, …, v′ₜ). By the definition of attention, each output vector is v′_j = ∑_{i=1}^t α_i^{(j)} vᵢ, where α^{(j)} ∈ △^{t−1} is the attention weight vector obtained through the softmax function. Compact sets in ℝ^d are bounded (by the Heine–Borel theorem), and hence there exists M > 0 such that K₀ ⊆ B_M(0). Noting that the norm function ∥·∥ on ℝ^d is convex, we have

∥v′_j∥ = ∥∑_{i=1}^t α_i^{(j)} vᵢ∥ ≤(*) ∑_{i=1}^t α_i^{(j)} ∥vᵢ∥ ≤ max_{1≤i≤t} ∥vᵢ∥ < M,  (62)

where (*) results from Jensen's inequality. Eq. (62) shows that each of the output vectors v′_j lies in the closed ball B̄_M(0), which is compact. Hence, setting K₁ = B̄_M(0), we have shown that, for any t ∈ ℤ_{>0}, the attention block maps K₀^t into K₁^t. Note that we cannot use Thm. 5.7 here because the attention block defines a different function ℝ^{t×d} → ℝ^{t×d} for each t, and Thm. 5.7 would only imply that there exists a separate length-dependent output compact set Kᵗ ⊂ ℝ^{t×d} for each t, which is different from this lemma's statement.
The feedforward function is a continuous function on R d → R d , and therefore, by Thm. 5.7, maps its input compact set K 1 to an output compact set, which we call K 2 .
Finally, residual connections and layer norms are also continuous functions acting on each of the input vectors, and hence by the same reasoning also preserve compactness. Now we can use induction to show that there exist compact sets K₃, K₄, …, K_{2n−1}, K_{2n}, where K_{2n} contains the output set of the final layer. Setting K′ = K_{2n}, we have proven the statement. ■

Theorem 5.9. The autoregressive sequence model defined by any (fixed-depth) Transformer is tight.
Proof. Given the Transformer, there exists a fixed compact set K that will contain all inputs vᵢ ∈ ℝ^d to the first layer. This is true because each vᵢ is the sum of a word embedding, which falls in a finite set since Σ is finite, and a position embedding, which lies in the compact set [−1, 1]^d. Hence, by Lemma 5.8, there exists a fixed compact set K′ that contains all output embedding vectors (regardless of how long the sequence is). The final output probability is given by a multiplication with the word embedding matrix followed by the softmax function, as in Eq. (25). This process amounts to composing two continuous functions. In particular, we can extract the EOS probability as a continuous ℝ-valued function g_EOS : K′ → (0, 1) (neither 0 nor 1 is in the range of the softmax function). By the continuity of g_EOS and Thm. 5.7, K″ ≝ g_EOS(K′) ⊆ (0, 1) is compact. Since K″ is compact, and hence closed, inf K″ ∈ K″. Thus inf K″ ∈ (0, 1) and in particular inf K″ > 0. Therefore, taking ϵ = inf K″, we have shown that the EOS probability of a Transformer is bounded below by some ϵ > 0 (regardless of the length of the sequence).
Hence, by Prop. 4.3, any Transformer ASM is tight and thus defines a language model. ■
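The key step of the proof, namely that logits drawn from a compact set yield a length-independent lower bound ε on the EOS probability, can be checked numerically. Below, the logit bound B and the vocabulary size V are arbitrary toy values, and index 0 plays the role of EOS; compactness guarantees that some such B exists for a real Transformer.

```python
import math, random

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

B, V = 5.0, 8   # logit bound and vocab size incl. EOS (toy values)
# Provable floor: e^{-B} / (V * e^{B}) = e^{-2B} / V > 0.
eps = math.exp(-2.0 * B) / V

rng = random.Random(0)
worst = 1.0
for _ in range(10_000):
    logits = [rng.uniform(-B, B) for _ in range(V)]
    worst = min(worst, softmax(logits)[0])   # index 0 plays the role of EOS

print(eps, worst)
```

No matter how the bounded logits are drawn, the sampled EOS probability never dips below ε, which is exactly the property that Prop. 4.3 turns into a tightness guarantee.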

E.3 Proofs for RNN Result ( §5.3)
Proposition 5.10. Given an RNN ASM over Σ. Again let the output symbol vector be u x ∈ R d for x ∈ Σ, and set k def = sup x∈Σ ∥u x − u EOS ∥ 2 . Additionally, for each t > 0, let ∥ h t ∥ 2 be the maximum attainable hidden state norm for any context x ∈ Σ t . Such a sequence model is tight if k∥ h t ∥ 2 ≤ log t for all sufficiently large t.
Proof. Let Xₜ(ω) be the random variable that is equal to the t-th token in an outcome ω ∈ Ω. Also let h_x be the hidden representation of the RNN after processing some finite list of tokens x ∈ Σ*, and let u_x ∈ ℝ^d be the output embedding of x ∈ Σ. Then, for any t ∈ ℕ and any x ∈ Σ^t, we have

p̄(EOS | x) = exp(u_EOS^⊤ h_x) / ∑_y exp(u_y^⊤ h_x)   (the sum ranging over all output symbols, including EOS)
 = 1 / (1 + ∑_{y∈Σ} exp((u_y − u_EOS)^⊤ h_x))
 ≥ 1 / (1 + ∑_{y∈Σ} exp(∥u_y − u_EOS∥₂ ∥h_x∥₂))   (Cauchy–Schwarz)
 ≥ 1 / (1 + ∑_{y∈Σ} exp(k ∥h_x∥₂))
 = 1 / (1 + |Σ| exp(k ∥h_x∥₂)).

Now define ∥h̃ₜ∥₂ ≝ sup_{x∈Σ^t} ∥h_x∥₂. We then have, for all t ∈ ℕ and all x ∈ Σ^t,

p̄(EOS | x) ≥ 1 / (1 + |Σ| exp(k ∥h̃ₜ∥₂)).

Hence, by Prop. 4.3, if the series ∑ₜ 1 / (1 + |Σ| exp(k ∥h̃ₜ∥₂)) diverges, then the language model is tight.
We will show that this condition holds if ∃N ∈ N such that ∀t ≥ N , k∥ h t ∥ 2 ≤ log t.
Hence, if k∥h̃ₜ∥₂ ≤ log t for all sufficiently large t (that is, for all t ≥ N), then exp(k∥h̃ₜ∥₂) ≤ t for all t ≥ N, and therefore

∑_{t≥N} 1 / (1 + |Σ| exp(k ∥h̃ₜ∥₂)) ≥ ∑_{t≥N} 1 / (1 + |Σ| t) = ∞,

since the series on the right diverges like the harmonic series. The divergence condition above is thus met, so the RNN ASM is tight and thus defines a language model. ■
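The chain of inequalities in the proof of Prop. 5.10 (the softmax identity followed by the Cauchy–Schwarz bound) can be verified numerically with random embeddings; the dimensions and sizes below are arbitrary toy choices.

```python
import math, random

rng = random.Random(42)
d, V = 16, 5   # hidden size and |Sigma| excluding EOS (toy values)

def rand_vec():
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

u_eos = rand_vec()
us = [rand_vec() for _ in range(V)]
diffs = [[a - b for a, b in zip(u, u_eos)] for u in us]
k = max(norm(w) for w in diffs)   # k = sup_x ||u_x - u_EOS||_2

for _ in range(100):
    h = rand_vec()   # stands in for a hidden state h_x
    # Softmax identity: p(EOS | h) = 1 / (1 + sum_y exp((u_y - u_EOS) . h))
    p_eos = 1.0 / (1.0 + sum(math.exp(dot(w, h)) for w in diffs))
    # Cauchy-Schwarz bound from the proof: >= 1 / (1 + |Sigma| e^{k ||h||})
    bound = 1.0 / (1.0 + V * math.exp(k * norm(h)))
    assert p_eos >= bound
print("bound verified; k =", round(k, 3))
```

The assertion never fires, regardless of the random seed, because each inequality in the chain holds pointwise; the bound is loose precisely when h is nearly orthogonal to the embedding differences.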