Entailment Semantics Can Be Extracted from an Ideal Language Model

Language models are often trained on text alone, without additional grounding. There is debate as to how much of natural language semantics can be inferred from such a procedure. We prove that entailment judgments between sentences can be extracted from an ideal language model that has perfectly learned its target distribution, assuming the training sentences are generated by Gricean agents, i.e., agents who follow fundamental principles of communication from the linguistic theory of pragmatics. We also show entailment judgments can be decoded from the predictions of a language model trained on such Gricean data. Our results reveal a pathway for understanding the semantic information encoded in unlabeled linguistic data and a potential framework for extracting semantics from language models.


Implications
We believe near contradiction may be rare compared to entailment, so the test may still tend to identify entailment correctly assuming realistic data distributions. However, our paper's main goal was never to propose a practical NLI method, but rather to make the theoretical claim that mastering the LM objective perfectly implies acquiring a full model of entailment. The inability of our distributional entailment test to distinguish entailment from near contradiction means the reconstruction of semantics it would extract from an idealized, perfect LM could still be fundamentally lossy. Future work should investigate whether distributional semantics must fundamentally confuse entailment and near contradiction, or whether there is some other way to distinguish them with form alone.

The Conceptual Problem
The edge case that breaks Theorem 2 is simple to describe conceptually. Our entailment test attempts to use the co-occurrence probability p(xy) of two sentences x, y to infer something about their semantic relationship. Under a Gricean speaker, the probability of a redundant pair of utterances x, y should be ∼0, and the probability of contradictory utterances x, y should be exactly 0. The issue arises when y nearly contradicts x, e.g.:

x = I'm not in North America.
y = I'm in a US state.

y nearly contradicts x because they are both satisfiable only in worlds where the speaker is in Hawaii, which we assume p(w) makes unlikely. In such instances of near contradiction, it is possible that p(xy) is slightly above 0 (as for entailment), and thus Theorem 2 detects entailment incorrectly.

Detailed Problem and Revised Analysis
The technical problem in the original proof of Theorem 2 was that Lemma 1 was applied with unmet preconditions. Specifically, it assumed that the speaker utility I_ℓ(z; w) is at least 0 for all utterances z and worlds w, but, in fact, this utility is −∞ in worlds where z is false. Near contradictions can then achieve an average exponentiated information content of 1 because exp(−∞) in many worlds is balanced out by large positive information in a few worlds. Formally, let Y = ⟦x⟧ ∩ ⟦y⟧ be the worlds where x, y are both true. We show in §H that the original entailment test is 0 when

p(Y) · I_Y = 1,    (1)

where I_Y is the average exponentiated conditional information exp(I_ℓ(y | x; w)) over worlds w ∈ Y. We can see that (1) has two distinct solutions:

Entailment Solution. As expected, (1) is satisfied when x entails y, since p(Y) = 1 and I_Y = 1.

Near-Contradiction Solution. Assume for simplicity that I_ℓ(y | x; w) = I_Y for all w ∈ Y. If y nearly contradicts x, p(Y) is small because there are very few contexts where x, y are both true. On the other hand, I_Y is large because y is very informative when it is true. It is possible to calibrate Y such that these factors multiply to 1.

Figure 1 illustrates these two solutions using the Gricean speakers from §6. For two utterances x, y, we vary the number of worlds where x is true but y is false, ranging from entailment to contradiction. The test score crosses 0 twice: for entailment on the left and near contradiction on the right.

Introduction
Recent advances in building computational models of language have been powered by distributional semantics: the idea that the possible surrounding contexts for a text span encode its meaning (Firth, 1957). In particular, large pretrained language models (LMs; Peters et al., 2018; Devlin et al., 2019; Brown et al., 2020) have become an integral part of NLP systems: the representations that emerge from training to predict missing words in a text are empirically useful for natural language understanding tasks.
Despite this empirical progress, Bender and Koller (2020) argue LMs cannot learn to understand the semantics of sentences. This is because of a mismatch between the LM training objective, predicting missing words in text ("form"), and Bender and Koller's conception of meaning as the relation of a sentence to the external world. Thus Bender and Koller claim "that the language modeling task, because it only uses form as training data, cannot in principle lead to learning of meaning." In this paper, we argue meaning can be learned from form because the communicative goals of human authors encode semantic information in unlabeled text. We show how this semantic information can be extracted to resolve semantic relations between sentences (e.g., whether one sentence entails another): in this inferentialist sense, ideal LMs encode the meaning of sentences. This argument has been raised speculatively by others (Michael, 2020; Potts, 2020; Bommasani et al., 2021), but we will rigorously justify it here with formal results.
To give the simplest (and least general) illustration of our argument, we first assume training data is generated by overly idealized uniformly truthful speakers: agents who decide what to say by picking sentences they consider true uniformly at random.1 This very coarsely captures human authors' goal of being informative (rather than misleading) to their listeners (Grice, 1975). In Theorem 1, we prove a sentence x entails sentence y if and only if, after uttering x, a uniformly truthful speaker is just as likely to say y as to repeat x. Thus, entailment semantics can be extracted from probabilistic languages generated by uniformly truthful speakers.
Uniformly truthful speakers are not a realistic model of humans: while humans favor true sentences to false ones (Grice, 1975), not all true sentences are equally likely to be produced. It is a common principle in linguistic theories of pragmatics that human speakers choose their utterances in order to balance two competing objectives: (a) conveying information to their listener and (b) brevity (Levinson et al., 1983; Grice, 1975). We define a class of Gricean speakers who optimize for these objectives, and prove in Theorem 2 that x entails y if and only if a simple equation holds in terms of text probabilities produced by such speakers. Thus, entailment semantics can be decoded from probabilistic languages generated by Gricean speakers.
The previous results assume access to a language's ideal likelihood function, but, in practice, one only ever receives a corpus sampled from the language. Moving to the corpus setting, we analyze how much data allows approximately computing our derived entailment test using probabilities estimated from sentence frequencies in a corpus. We find that the corpus size needed to guarantee the entailment test holds approximately is inversely related to the likelihood of the sentences. We estimate that approximating the entailment test between 4-word sentences using corpus frequencies is possible with ∼10^10 sentences, about the size of the GPT-3 training data (Brown et al., 2020). On the other hand, approximating the entailment test for 10-word sentences should be possible with ∼10^17 sentences, or ∼10^7 GPT-3 corpora. Thus, extracting entailment judgments using corpus frequencies requires an infeasible amount of data, even by modern NLP standards.
To overcome this limitation, one might hope to use probabilities estimated by LMs to extract entailment judgments between longer sentences that are rare even in a large corpus. With synthetic data generated by Gricean speakers, we find that entailment can be decoded from n-gram LM predictions to some extent. However, we speculate that current neural LMs may not score the probability of rare text well enough to enable decoding entailment judgments between natural language sentences. In summary, our main contribution is to show a correspondence between the semantics of text and its likelihood, assuming the likelihood function matches models of human text production from linguistic theory. Determining whether a sentence in a probabilistic language entails another sentence can be reduced to modeling the probabilities of strings in the language. In practice, entailment judgments between very short sentences can be extracted from corpus frequencies, but this becomes infeasible for slightly longer sentences. LMs can in principle be used to extrapolate the likelihood of longer strings, but we hypothesize current LMs are not well-suited for doing so well enough to enable extracting entailment from natural language. Our theory demonstrates a formal sense in which unlabeled text data encodes linguistic meaning and makes quantitative predictions for (a) how to extract semantics from text corpora and (b) how much data this requires.

Sentences and Worlds
Let X be a finite set of sentences, and W a countable2 set of possible world states. A sentence x is a string whose denotation ⟦x⟧ is a proposition, i.e., a set of world states (⊆ W) where x is true. Following standard conventions in formal semantics (cf. Heim and Kratzer, 1998), the set ⟦x⟧ can be equivalently viewed as a function mapping a world state w to {0, 1} that indicates whether x is true in w, which we will write as ⟦x⟧(w). We imagine w to encode a partial description of the world, much like the concept of a situation in formal semantics (Kratzer, 2021). For simplicity, we assume an individual's subjective belief state can be modeled as the unique, maximal w that fully describes the facts which they believe to be true.

Example Let x = John has at least two cats. Let W = {w_0, w_1, w_2, w_3} be the set of possible worlds, where w_n denotes the state in which John has n cats. Then ⟦x⟧ = {w_2, w_3}, because John has at least two cats in these worlds. Furthermore, it holds that ⟦x⟧(w_2) = 1, but ⟦x⟧(w_1) = 0.
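The running example can be written down directly, with denotations as Python sets (a minimal sketch; the integer encoding of worlds is our own choice):

```python
# Worlds w_0..w_3: "John has n cats" for n = 0..3, represented as ints.
WORLDS = {0, 1, 2, 3}

# [[x]] for x = "John has at least two cats": the worlds where x is true.
x_den = {2, 3}

def truth(den, w):
    """[[x]](w): view a denotation as a {0, 1}-valued function of worlds."""
    return int(w in den)

assert truth(x_den, 2) == 1
assert truth(x_den, 1) == 0
```

The set view and the indicator-function view are interchangeable, which is all the later constructions need.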

Speakers and Texts
We refer to a sequence of sentences z ∈ X* as a text.3 The meaning of a text is the set of worlds consistent with all its sentences, i.e., ⟦z⟧ = ⋂_{t=1}^{|z|} ⟦z_t⟧.
We will imagine that a text z ∈ X* is produced by iteratively sampling z_t ∈ X ∪ {$} from a speaker model p(z_t | z_<t, w), which represents the probability of saying sentence z_t with belief state w after having said z_1 ⋯ z_{t−1}. Let $ ∉ X be a special end-of-sequence token satisfying ⟦$⟧ = W. We refer to any text ending with $ as complete. Given a world w, an incomplete text z ∈ X* or complete text z ∈ X*$ has conditional probability

p(z | w) = ∏_{t=1}^{|z|} p(z_t | z_<t, w).

The conditional probability of an incomplete text represents the probability of observing z as the prefix of a text written by a human with beliefs w. In contrast, the probability of a complete text represents the probability that a speaker produces z and no further text. The conditional distribution p(z | w) cannot be observed directly by a LM, since w is a latent variable missing from the training data. Rather, a LM has access to texts that have been generated by speakers across many possible belief states. Mathematically, this can be expressed by saying a LM's target distribution is a marginal distribution over z ∈ X* ∪ X*$ according to some prior distribution over worlds p(w):

p(z) = ∑_{w ∈ W} p(w) p(z | w).

The prior p(w) represents the probability that a speaker contributing to the corpus will have belief state w. We make no assumptions about its form besides that p(w) > 0 for all w ∈ W, and that, for every sentence, there is some world state that makes that sentence true. In contrast to p(z), which corresponds to the expected corpus frequency of z, we denote by p(⟦z⟧) the probability that z is true.4

Example Let z be the 2-sentence text:5

z_1 = We swung our swords.
z_2 = That was ever so long ago.

Let p be the distribution of all possible English web texts. The marginal probability p(z) can be decomposed across many possible worlds. One such world w_1 might be the world where the speaker is the semi-legendary Viking hero Ragnar Loðbrók (in modern English translation); another world w_2 might be the perspective of a Reddit user reviewing a coffee maker. Each of these worlds corresponds to one term in the sum over all worlds. We expect p(z | w_1) to be higher than p(z | w_2), since it is more likely for a medieval literary character to utter z than a modern product reviewer. Finally, p(z | w_1) can be factored as

p(z | w_1) = p(z_1 | w_1) p(z_2 | z_1, w_1).

In contrast to p(z), which counts all contexts where z is the beginning of a longer text, p(z$) measures the frequency of z_1 z_2 followed by nothing else.
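The distinction between prefix and complete-text probability can be made concrete with a two-world marginalization (all numbers below are assumed for illustration):

```python
# Prior over belief states and conditional text probabilities (assumed values).
p_w = {"w1": 0.3, "w2": 0.7}                # p(w)
p_z_given_w = {"w1": 0.02, "w2": 0.001}     # p(z1 z2 | w), as a prefix
p_end_given_zw = {"w1": 0.5, "w2": 0.4}     # p($ | z1 z2, w)

# p(z) marginalizes over worlds; p(z$) additionally requires stopping.
p_prefix = sum(p_w[w] * p_z_given_w[w] for w in p_w)
p_complete = sum(p_w[w] * p_z_given_w[w] * p_end_given_zw[w] for w in p_w)

assert p_complete < p_prefix  # stopping is only one way to continue the prefix
```

Here p(z) = 0.0067 while p(z$) = 0.00328: every complete occurrence of z is also a prefix occurrence, but not vice versa.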

Distributional and Semantic Relations
Distributional Relations A distributional relation d is a relation over sentences x and y defined in terms of the likelihood of different texts under some distribution p. Let d_p(x, y) be the value of the distributional relation d between sentences x, y according to distribution p. If we train an LM on texts sampled from a target distribution p, the LM estimates a predictive distribution p̂. Thus, any LM parameterizes d_p̂: an instantiation of the distributional relation d with respect to the probabilities learned by the LM. If the LM perfectly approximates p(x) for all x, then d_p̂ = d_p by construction.
Example Define the distributional relation d^> (with respect to some distribution p) such that d^>_p(x, y) ⇐⇒ p(x) > p(y). In words, d^>_p(x, y) says x is more likely than y according to p. If p̂ represents the predictions of an LM trained on the target distribution p, then d^>_p̂(x, y) says whether the LM predicts that a sentence x is more likely than another sentence y.
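As a sketch, a distributional relation is just a predicate computed from text probabilities (the toy distribution below is assumed):

```python
# A toy distribution p over sentences (probabilities are made up).
p = {"it is raining": 0.030, "it is raining frogs": 0.0002}

def d_greater(x, y):
    """The example relation d^>_p(x, y): x is more likely than y under p."""
    return p[x] > p[y]

assert d_greater("it is raining", "it is raining frogs")
assert not d_greater("it is raining frogs", "it is raining")
```

An LM supplies its own table of probabilities in place of p, yielding the parameterized relation d^>_p̂.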

Semantic Relations
In contrast, a semantic relation between x and y is a relation defined in terms of their denotations ⟦x⟧ and ⟦y⟧. We will focus on the key semantic relation of entailment:

Definition 1 For two sentences x, y ∈ X, x entails y if and only if ⟦x⟧ ⊆ ⟦y⟧.
It is not clear prima facie if LMs can represent entailment relations. However, it could be that a semantic relation s can somehow equivalently be written as a distributional relation d_p. If so, a LM that perfectly approximates p could be understood to encode s, since s can be extracted from p via d_p.
Formally, we can ask if a semantic relation can be alternatively expressed as a distributional relation by analyzing whether there exists an isomorphism between a semantic relation s(⟦x⟧, ⟦y⟧) and some distributional relation d_p(x, y):

Definition 2 A semantic relation s is isomorphic to a distributional relation d_p if, for all sentences x, y ∈ X, s(⟦x⟧, ⟦y⟧) ⇐⇒ d_p(x, y).

If Definition 2 holds under a speaker model p, then predicting whether s holds between two sentences is reducible to perfectly modeling the probabilities of texts generated by p. Our goal going forward will be to derive distributional relations isomorphic to entailment assuming p models the goals of humans when they produce text.
We start by illustrating our research question and technical approach assuming an overly simple model of humans as uniformly truthful speakers. A uniformly truthful speaker chooses a sentence to produce by selecting one of the true sentences that holds in their belief state uniformly at random. This very coarsely captures the property of natural language pragmatics that subjectively true sentences tend to be more likely than false ones, although it does not account for many other factors that influence human speech patterns in complex ways (Grice, 1975).6 Let n(w) be the number of sentences true in world w. We can formally define a uniformly truthful speaker as follows:

Definition 3 A speaker p is uniformly truthful if, for all z_t ∈ X ∪ {$}, p(z_t | z_<t, w) = ⟦z_t⟧(w) / n(w).

In other words, p uniformly spreads probability mass across all sentences that are true in world w. We will show that, if the corpus consists of text written by uniformly truthful speakers, entailment can be decided by a distributional relation. The following lemma will be a core technical tool in our analysis. Informally, it is useful because it establishes a correspondence between relations over sets of worlds and probabilities.
Lemma 1 Let 1_S be the indicator function for set S. For sets A, B such that A ⊆ B ⊆ W, and any positive weights c : W → R_{>0},

∑_{w ∈ W} 1_A(w) c(w) = ∑_{w ∈ W} 1_B(w) c(w) ⇐⇒ A = B.

Proof. We will prove that B ⊆ A by contradiction. Assume there exists w ∈ B such that w ∉ A. Then the right sum contains the positive term c(w), while the left sum does not. Because all terms in the right sum are positive, the left sum must contain at least one term c(w′) that the right sum does not. Thus, w′ ∈ A but w′ ∉ B. But this violates our assumption that A ⊆ B.
We now use Lemma 1 to derive a simple distributional relation that is isomorphic to entailment.
Theorem 1 If p is a uniformly truthful speaker, then entailment is isomorphic to a distributional relation. Specifically, for all sentences x, y ∈ X,

⟦x⟧ ⊆ ⟦y⟧ ⇐⇒ p(xy) = p(xx).

Proof. p(xy) = p(xx) holds if and only if

E_{w∼p(w)}[⟦x⟧(w) ⟦y⟧(w) / n(w)^2] = E_{w∼p(w)}[⟦x⟧(w) / n(w)^2].

An expectation in a countable space is a sum weighted by probability masses. So, by Lemma 1 (taking A = ⟦x⟧ ∩ ⟦y⟧, B = ⟦x⟧, and c(w) = p(w)/n(w)^2), this holds iff ⟦x⟧ = ⟦xy⟧ = ⟦x⟧ ∩ ⟦y⟧. We conclude p(xy) = p(xx) if and only if ⟦x⟧ ⊆ ⟦y⟧.
A similar proof suffices to show that the following isomorphism also holds:

Corollary 1.1 If p is a uniformly truthful speaker, the following isomorphism holds for all x, y ∈ X: ⟦x⟧ ⊆ ⟦y⟧ ⇐⇒ p(xy) = p(x$).
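Theorem 1 and Corollary 1.1 can be checked by brute force in the cats example, under a uniformly truthful speaker with a uniform prior over worlds. The sentence inventory below is our own toy choice, and we count $ among the true utterances when normalizing; this is a sketch of the construction, not the paper's code:

```python
from itertools import product

# Worlds w_n: "John has n cats"; denotations are assumed toy examples.
WORLDS = [0, 1, 2, 3]
DENS = {
    "at_least_1": {1, 2, 3},
    "at_least_2": {2, 3},
    "exactly_2":  {2},
    "at_most_1":  {0, 1},
    "$":          {0, 1, 2, 3},   # the end token is true everywhere
}
SENTS = [s for s in DENS if s != "$"]

def n_true(w):
    """Number of utterances (including $) true in world w."""
    return sum(w in den for den in DENS.values())

def p_pair(x, y):
    """p(xy): probability a uniformly truthful speaker utters x then y."""
    return sum((w in DENS[x]) * (w in DENS[y]) / n_true(w) ** 2
               for w in WORLDS) / len(WORLDS)

for x, y in product(SENTS, repeat=2):
    entails = DENS[x] <= DENS[y]
    # Theorem 1: [[x]] ⊆ [[y]] iff p(xy) = p(xx).
    assert entails == (abs(p_pair(x, y) - p_pair(x, x)) < 1e-12)
    # Corollary 1.1: [[x]] ⊆ [[y]] iff p(xy) = p(x$).
    assert entails == (abs(p_pair(x, y) - p_pair(x, "$")) < 1e-12)
```

The equalities hold exactly for entailed pairs because the per-world term ⟦x⟧(w)⟦y⟧(w)/n(w)² coincides with ⟦x⟧(w)/n(w)², and fail otherwise because p(w) > 0 contributes a missing positive term, just as in the proof via Lemma 1.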

Discussion
Uniformly truthful speakers resemble humans in that they mimic the tendency of humans to tell the truth about what they believe. However, they are clearly too simple to account for human speech patterns. Most crucially, humans generally aim to produce informative speech, rather than sampling true sentences at random. More fundamentally, natural language has a countably infinite number of possible sentences, so a uniform distribution over all true sentences is not even mathematically well-defined. These limitations motivate our more involved analysis of Gricean speakers, which will adapt the technical tools used in this section.

Gricean Speakers
In this section, we will define a new class of speakers who pick sentences in order to be informative to their listener, while also trying to be concise. To do this, we will draw on information theory to formalize what it means for a speaker to be informative. We will then derive a distributional relation that is isomorphic to entailment for Gricean speakers, which is a generalization of the relation for uniformly truthful speakers from §3.

Definition
Information The first step towards formalizing Gricean speakers is to define a notion of the semantic information contained in a sentence. We formalize a listener ℓ(w | z) as the inverse of a speaker: given a text z ∈ X*, a listener produces a distribution over possible world states. Then, in a given world w, we can define the information that a text conveys to the listener as the reduction in the number of bits needed to transmit w to ℓ after they have read z compared to before they have read z:

Definition 4 I_ℓ(z; w) = log(1 / ℓ(w)) − log(1 / ℓ(w | z)) = log ℓ(w | z) − log ℓ(w),

where ℓ(w) denotes the listener's distribution before reading any text.
In other words, the information content of a text is the reduction in ℓ's code length for the world after having read the text compared to beforehand. We can naturally extend Definition 4 to measure the conditional information conveyed by sentence y given that x has already been produced:

Definition 5 I_ℓ(y | x; w) = I_ℓ(xy; w) − I_ℓ(x; w) = log ℓ(w | xy) − log ℓ(w | x).

Informative Speaker We now define a Gricean speaker in terms of I_ℓ. Our definition generalizes the rational speech acts model (Goodman and Frank, 2016), but makes weaker assumptions about the listener and allows a dynamic semantics where later sentences can condition on previous ones (Lewis, 1979; Kamp, 1981; Heim, 1982). We define an utterance's utility as a convex combination of its information content and its cost to produce, operationalizing the Gricean idea that speakers pick utterances by weighing their informativeness against their cost. The cost function c : X* ∪ X*$ → R can be any measure of sentence complexity (e.g., length) satisfying c(xy) = c(x) + c(y) for x, y ∈ X* ∪ X*$.8

Definition 6 A speaker p is Gricean if there exists a listener ℓ(w | z), some α > 0, and a cost function c such that, for all z ∈ X* ∪ X*$:

p(z | w) ∝ exp(α (I_ℓ(z; w) − c(z))).

Further, ℓ must satisfy the following for all x ∈ X*, y ∈ X ∪ {$}, and w ∈ W:

I_ℓ(y | x; w) = 0 whenever ⟦x⟧ ⊆ ⟦y⟧.

In other words, the speaker must be trying to convey information about the state of the world to some listener who fully absorbs the semantic information in all sentences they have already heard: clarifying already established information will not benefit the listener. We can formalize this by deriving p(y | x, w) for x ∈ X* and y ∈ X ∪ {$}:

p(y | x, w) ∝ exp(α (I_ℓ(y | x; w) − c(y))).

Notably, the probability of y given x depends on the conditional information of y given x, which means only information conveyed by y that is nonredundant with x will make y more likely.10
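A minimal Gricean speaker can be sketched over the cats example. The listener below is a literal listener, uniform over the worlds consistent with the context, and the softmax form exp(α(information − cost)) is one plausible instantiation of Definition 6, not the paper's exact parameterization; all names and constants are our own:

```python
import math

# Worlds w_n: "John has n cats"; a two-sentence toy vocabulary plus $.
WORLDS = {0, 1, 2, 3}
DENS = {"at_least_1": {1, 2, 3}, "exactly_2": {2}, "$": WORLDS}
ALPHA = 1.0
COST = {"at_least_1": 1.0, "exactly_2": 1.0, "$": 0.0}

def info(context, utt, w):
    """I_l(utt | context; w) for a literal listener uniform over the context."""
    new = context & DENS[utt]
    return math.log(len(context) / len(new)) if w in new else -math.inf

def speaker(context, w):
    """p(z_t | context, w): softmax over utterances true in w (others score 0)."""
    scores = {u: math.exp(ALPHA * (info(context, u, w) - COST[u]))
              for u in DENS if w in (context & DENS[u])}
    Z = sum(scores.values())
    return {u: s / Z for u, s in scores.items()}

# In w_2, "exactly_2" is more informative than "at_least_1", so it is preferred.
p = speaker(WORLDS, 2)
assert p["exactly_2"] > p["at_least_1"]

# After "exactly_2", repeating it conveys no new information (the absorption
# axiom), so it is dispreferred relative to stopping by exactly exp(-alpha*cost).
p2 = speaker(DENS["exactly_2"], 2)
assert math.isclose(p2["exactly_2"] / p2["$"], math.exp(-ALPHA * 1.0))
```

The second assertion is precisely the mechanism the Results section exploits: for redundant continuations, conditional information is 0, so only the cost term survives.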

Results
Proofs are in §C. Under a Gricean speaker, the cost of an utterance can be expressed in terms of text probabilities:

Lemma 2 Under any Gricean speaker p, for all y ∈ X, c(y) − c($) = −(1/α) log(p(yy) / p(y$)).

Lemma 2 says that a sentence is costly to the extent that it is unlikely to be repeated twice, giving an intuitive characterization of this quantity in terms of text probabilities. Now, we will use this characterization of cost to derive a distributional relation that is isomorphic to entailment.
Theorem 2 Under any Gricean speaker p, entailment is isomorphic to a distributional relation. Specifically, for all sentences x, y ∈ X,

⟦x⟧ ⊆ ⟦y⟧ ⇐⇒ p(xy)/p(x$) = p(yy)/p(y$).

If we allow our decision rule to depend on the cost function c in addition to probabilities, we can simplify Theorem 2 as follows:

Corollary 2.1 Under any Gricean speaker p, for all sentences x, y ∈ X, ⟦x⟧ ⊆ ⟦y⟧ if and only if

p(xy)/p(x$) = exp(−α (c(y) − c($))).

If we imagine c(y) − c($) = 0 for a uniformly truthful speaker, we see that the equation in Theorem 2 is a generalization of the equation in Theorem 1.
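The statistic g_p(x, y) = log(p(xy)/p(x$)) − log(p(yy)/p(y$)) defined in §5 can be computed exactly in a toy instantiation. The speaker below, with a literal listener uniform over consistent worlds and softmax utility exp(α(information − cost)), is one assumed instantiation of Definition 6, and the sentence inventory is our own:

```python
import math

# Worlds w_n: "John has n cats"; denotations are assumed toy examples.
WORLDS = [0, 1, 2, 3]
DENS = {"at_least_1": {1, 2, 3}, "at_least_2": {2, 3},
        "exactly_2": {2}, "at_most_1": {0, 1}}
ALPHA, COST, COST_END = 1.0, 1.0, 0.0

def speaker(context, w):
    """p(z_t | context, w) over X ∪ {$} with utility alpha*(info - cost)."""
    scores = {}
    for y, den in DENS.items():
        new = context & den
        if w in new:  # otherwise the information is -inf and the score is 0
            information = math.log(len(context) / len(new))
            scores[y] = math.exp(ALPHA * (information - COST))
    scores["$"] = math.exp(-ALPHA * COST_END)  # $ is true everywhere
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

def p_text(x, second):
    """Marginal prefix probability p(x second) under a uniform world prior."""
    total = 0.0
    for w in WORLDS:
        p1 = speaker(set(WORLDS), w).get(x, 0.0)
        if p1 > 0:
            total += p1 * speaker(DENS[x], w).get(second, 0.0) / len(WORLDS)
    return total

def g(x, y):
    """Theorem 2's test statistic: 0 for entailment (barring near contradiction)."""
    return math.log(p_text(x, y) / p_text(x, "$")) \
         - math.log(p_text(y, y) / p_text(y, "$"))
```

In this setup, g is 0 to machine precision for every entailed pair and clearly nonzero for overlapping non-entailed pairs such as ("at_least_1", "at_least_2"); contradictory pairs have p(xy) = 0 and must be handled separately.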

Discussion
Gricean speakers are a general enough model of human speakers to capture the basic pragmatic principles influencing speech production. Thus, it is notable that Theorem 2 establishes a closed-form distributional relation isomorphic to entailment.
One conceptual limitation of Gricean speakers is that their simulated listener must fully absorb information, such that redundantly conveying the same information twice will not lead to any information gain the second time. This contrasts with real speech, where potential interpretation errors by the listener incentivize the speaker to be somewhat redundant (Degen et al., 2019). Mathematically, this would violate the axiom of Definition 6 that I_ℓ(y | x; w) = 0 whenever ⟦x⟧ ⊆ ⟦y⟧. Extending Theorem 2 to speakers who use redundancy to account for noise and interpretation errors is an interesting direction for future work.
Another interesting extension would be formalizing speakers who aim to be informative regarding some question under discussion, rather than being generally informative about w (cf. Goodman and Lassiter, 2015). This could encompass both "what" questions that aim to clarify some aspect of the world, and "why" questions that aim to convey explanations for established facts.

Decoding Entailment from Empirical Text Frequencies
We have so far shown that entailment judgments can be extracted from the sentence probabilities in the ideal distribution p(z). What happens if, more practically, we estimate the probability of a sentence by its frequency in a large corpus sampled from p(z)? We prove this method enables feasible extraction of entailment judgments between very short sentences, but the corpus size may become intractably large for longer sentences.
Imagine we have a finite corpus of i.i.d. texts {Z_i}_{i=1}^n, each sampled from p(z). Let p̂(z) be the empirical frequency of a text z in the corpus, i.e., if π(z, z′) returns whether text z is a prefix of text z′,

p̂(z) = (1/n) ∑_{i=1}^n π(z, Z_i).

Since p(z) encodes entailment via our extraction rules, p̂(z) will encode entailment between sentences if p̂(z) is close to p(z). A naive notion of closeness is to guarantee, for all ϵ, there exists some number of texts n such that, with high probability, |p̂(z) − p(z)| < ϵ. But this notion is not strict enough: if p(z) is small, this difference will also be small, even if p̂(z) is not a good approximation of p(z) on a relative scale. Instead, we want to guarantee that p̂(z)/p(z) converges to 1, or, equivalently, that their difference as log probabilities converges to 0. This ensures that convergence will still be meaningful for low-probability sentences, which most sentences are in natural language.
Under this standard, rarer sentences take more samples to approximate. Define the sentence complexity K_p(z) = 1/p(z). We bound the approximation error in terms of K_p(z).11

Lemma 3 (informal) For z ∈ X* ∪ X*$ and δ > 0, it holds with probability at least 1 − δ − (1 − p(z))^n that |log p̂(z) − log p(z)| is bounded by a term on the order of √(K_p(z) log(1/δ) / n).

To make this bound non-vacuous, n must be large enough to counteract K_p(z) and bring (1 − p(z))^n close to 0. Thus, good approximation requires fewer samples for more common sentences. To get a more concrete view of the number of samples required to extract entailment judgments from an LM, we analyze K_p(z) for Gricean speakers.12 Recall that we write c(z) for the cost that a Gricean speaker assigns to producing a sentence z. For Gricean speakers, K_p(z) is related to c(z) as well as the probability z is true.

Theorem 3 (informal) Assume that p(z | w) is a Gricean speaker with respect to listener ℓ and that ⟦z⟧(w) = 1 ⇐⇒ I_ℓ(z; w) ≥ 0. Let g_p(x, y) = log(p(xy)/p(x$)) − log(p(yy)/p(y$)), and let q = 1 − min{p(xy), p(yy)}. Then, for all x, y ∈ X such that p(⟦xy⟧) > 0 and all δ > 0, it holds with probability at least 1 − δ − 4q^n that |g_p(x, y) − g_p̂(x, y)| is bounded by a term that shrinks as O(1/√n) but grows exponentially with the costs c(x) and c(y).

Theorem 3 says we can use text frequencies to decode entailment between sentences x, y from a Gricean corpus, but the number of training sentences needed to guarantee this grows exponentially with the cost of x and y. Thus, we probably cannot expect to extract entailment judgments from text frequencies except between very short sentences.
We make this more quantitative in Figure 2, where we estimate the number of training sentences needed to ensure g_p and g_p̂ are close on sentences of length ≤ k, as a function of k. The main assumption behind this calculation is that a sentence's probability vanishes exponentially in its length, where the exponential base is the perplexity of the language. §E documents the underlying assumptions in more detail. Figure 2 predicts g_p and g_p̂ can be made close for length-4 sentences using ∼10^10 training sentences: about as much data as GPT-3 was trained on. In contrast, handling (still short) sentences of length 10 can be done with ∼10^17 training sentences, or ∼10^7 GPT-3 corpora. Thus, relying solely on corpus frequencies is likely not a feasible way to extract entailment relations from text generated by Gricean speakers.
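Lemma 3's log-scale notion of convergence can be simulated directly. The toy text distribution below is assumed; the point is only that, at a fixed corpus size, rarer texts (larger K_p(z)) leave more room for log-probability estimation error:

```python
import math
import random

random.seed(0)

# Toy "language": four complete texts with assumed true probabilities.
TEXTS = ["a $", "b $", "a b $", "b a $"]
PROBS = [0.5, 0.3, 0.15, 0.05]

def log_error(z, p_true, n):
    """|log p_hat(z) - log p(z)| for a frequency estimate from n sampled texts."""
    corpus = random.choices(TEXTS, weights=PROBS, k=n)
    p_hat = corpus.count(z) / n
    return abs(math.log(p_hat) - math.log(p_true))

# With n = 20000, both estimates are usable on a log scale, but the rare text
# gets far fewer effective samples (expected count 1000 vs. 10000).
err_common = log_error("a $", 0.5, 20_000)
err_rare = log_error("b a $", 0.05, 20_000)
```

Averaging over repeated draws makes the gap between the two error levels explicit; the analysis in the text predicts the required n scales with K_p(z) = 1/p(z), which is why even "slightly longer" sentences quickly become infeasible.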

Decoding Entailment from LMs
We have just analyzed how many samples are necessary to decode entailment relations from the text frequencies in a finite corpus. As shown by Theorem 3, this approach will require intractably many samples for sentences of nontrivial length because longer strings will appear infrequently (if at all) in the corpus. In order to estimate the probability of rare, longer strings, what if we use an LM to estimate p(z) instead of text frequencies? Perhaps a smoothed LM should allow us to extrapolate p(z) well enough for long sentences to extract entailment judgments between them. In this section, we briefly discuss some limitations of this approach.
It is tempting to take low LM perplexity as evidence that an LM estimates sentence probabilities well enough to approximately satisfy the isomorphism in Theorem 2. After all, low test perplexity implies that p̂(z) is, on average, a good approximation of p(z): if the perplexity is bounded below ϵ, then the KL divergence KL(p, p̂) is bounded below log ϵ. ϵ decreases with the amount of training data n at a rate between Ω(1/√n) and Ω(1/n) (Wang et al., 2013; Li and Liu, 2021). Thus, with enough data, p̂(z) will closely approximate p(z) for an average sentence z in the training distribution.
But low error on an average z does not establish that entailment can be decoded from p̂, because d_p̂, as derived in Theorem 2, depends on the text z = yy, which is very unlikely in natural language.13 Poorly estimating p(yy) has little impact on KL(p, p̂), so LMs trained to minimize KL(p, p̂) have no reason to estimate p(yy) well unless they are imbued with strong inductive biases. Thus, we expect that LMs trained with a standard cross-entropy loss may not produce reliable entailment judgments because they poorly estimate the probability of key valid (but unlikely) texts.14 However, we find in the next section that they do succeed in the easier setting of small artificial languages and fully Gricean speakers.

Experiments: Extracting Semantics from Simulated Gricean Corpora
We test empirically whether we can extract entailment judgments from LMs trained on unlabeled text.15 Natural language corpora are unlikely to adhere exactly to our idealized assumptions about the speakers generating texts, so we generate the training corpora from a simulated Gricean speaker (see §4). To make learning semantics more tractable with limited computation, we set |W| = 3 and restrict the vocabulary X to 7 utterances, each denoting one of the 7 non-empty subsets of W. Each sentence in the training corpus is generated by sampling utterances from a Gricean speaker, conditioned on a uniformly sampled world state and the previously generated utterance, until the tautological utterance is generated. The semantic value of a sentence is taken to be the conjunction over all of its utterances. We set the rationality parameter α and the cost function heuristically (details in §G).
We generate training sets varying in size from 2 texts to 10M texts, and train two types of models on each: a simple empirical text frequency model as described in §5, and a trigram model implemented using NLTK (Bird, 2006). Then, for all sentence pairs (x, y), where x and y have 6 utterances or fewer and each denotes a non-empty proposition, we compute g_p̂(x, y) from §5. Theorem 2 shows that, under the true distribution p, g_p(x, y) = 0 if and only if x entails y.
The results are plotted in Figure 3. We arrive at the following conclusions:

Entailment relations can be extracted with greater-than-chance performance from LM predictions. The value of g_p̂(x, y) is much closer to 0 on average for entailed pairs than for non-entailed pairs. This is predicted by Theorem 2.

The size of the corpus needed to extract entailment grows predictably with sentence length. For entailed pairs, the average value of g_p̂(x, y) for shorter sentences approaches 0 more quickly with a large training corpus. This is in line with the predictions of Theorem 4.

Model inductive bias impacts the ease of extracting entailment. Entailed and non-entailed pairs are better distinguished by the trigram model than the text frequency model. Specifically, g_p̂(x, y) is closer to 0 for the trigram model for a given amount of data, and the trigram model's predictions are less sensitive to sentence length.
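A miniature version of this pipeline can be sketched end to end. For simplicity we substitute a uniformly truthful speaker (the cost-free case noted after Theorem 2) so the corpus is cheap to simulate, and estimate g from raw text frequencies; the world space, sentence inventory, and corpus size are our own assumptions:

```python
import math
import random
from itertools import product

random.seed(0)

WORLDS = [0, 1, 2, 3]                       # w_n: "John has n cats"
SENTS = {"at_least_1": {1, 2, 3}, "at_least_2": {2, 3},
         "exactly_2": {2}, "at_most_1": {0, 1}}

def sample_text(w):
    """Utter sentences true in w uniformly at random until $ is produced."""
    true_here = [x for x, den in SENTS.items() if w in den] + ["$"]
    text = []
    while True:
        text.append(random.choice(true_here))
        if text[-1] == "$":
            return tuple(text)

corpus = [sample_text(random.choice(WORLDS)) for _ in range(100_000)]

def p_hat(*z):
    """Empirical frequency: exact match for complete texts, else prefix count."""
    match = (lambda t: t == z) if z[-1] == "$" else (lambda t: t[:len(z)] == z)
    return sum(map(match, corpus)) / len(corpus)

def g_hat(x, y):
    return math.log(p_hat(x, y) / p_hat(x, "$")) - \
           math.log(p_hat(y, y) / p_hat(y, "$"))

entailed = [abs(g_hat(x, y)) for x, y in product(SENTS, repeat=2)
            if SENTS[x] <= SENTS[y]]
non_entailed = [abs(g_hat(x, y)) for x, y in product(SENTS, repeat=2)
                if not SENTS[x] <= SENTS[y] and p_hat(x, y) > 0]

assert max(entailed) < 0.25 < min(non_entailed)
```

With 100k sampled texts, |ĝ| separates entailed from non-entailed pairs cleanly; pairs whose empirical co-occurrence count is zero (contradictions) are skipped here rather than smoothed, which is where an n-gram model's inductive bias would start to matter.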

Generality of Extracting Semantics
Our main result that entailment judgments can be extracted from an ideal LM assumes the corpus was produced by Gricean speakers. While pragmatic theory supports this assumption, real human speakers are undoubtedly more complex. What if we relax the assumption that speakers are Gricean? In Theorem 6 in §F, we show that any semantic relation is isomorphic to some distributional relation as long as, for any pair of possible semantics, there is some text whose probability distinguishes between the two candidate semantics.
We take it to be uncontroversial that semantics influences speech production, so we interpret Theorem 6 to say that all semantic relations are fully encoded in ideal LMs. In contrast to Theorem 2, however, this result is nonconstructive, so we do not know which algorithm to use to decide entailment between two sentences, even though one exists. Further, without additional assumptions about the speaker, we cannot guarantee the extraction relation is efficiently computable, or even computable at all.

Conclusion
Given a general, linguistically motivated model of human text production, we proved that entailment judgments can be decoded from the likelihood function for texts because of semantic artifacts created by human authors. We also showed empirically that entailment could be extracted from n-gram LMs trained on simple formal languages. Thus, we have given one explanation for why distributional information encodes semantic information (Firth, 1957) and how semantic relations are, in principle, extractable from LMs. It is an open question whether entailment judgments might be extractable from current large LMs, but we hypothesize that the complexity of natural language makes this substantially more challenging than in our synthetic experiments, and that the loss function and inductive biases of current neural LMs are not well suited for doing so without an infeasible amount of data.
A natural next step for future work is to test this hypothesis empirically by measuring whether entailment judgments can be extracted from large LMs using our theory. Similarly, it would be interesting to think about how LMs could be modified so that they can better pick up on the semantic information encoded in their training distribution.

A Limitations
We derived a recipe for computing entailment in terms of text probabilities, hinting that entailment judgments may be decodable from LM predictions. Yet two key concerns qualify this conclusion.
Learnability We reduce entailment classification to computing probabilities in the target distribution of an LM, not probabilities predicted by an LM. In §6, we argue that the loss function of current LMs is not well suited to producing models from which entailment can be extracted.
Speaker Assumptions Gricean speakers capture important factors influencing speech production in pragmatic theory, but human speakers are undoubtedly more complex. Based on §8, we expect a similar isomorphism to hold under any reasonable speaker model, but the mathematical form may change and it may become harder to compute.

B Uncountable World Spaces
In this section, we assume W is an uncountably infinite set with a probability density function p(w). We then define "almost sure" entailment as follows:

Definition 7 For x, y ∈ X, we say x almost surely entails y (i.e., x ⊑ y) if and only if p(⟦x⟧ \ ⟦y⟧) = 0.

Note that if W is countable, then A ⊑ B reduces to A ⊆ B. We can generalize Lemma 1 as follows, which shows that all our results go through for almost sure entailment when W is uncountable.
Lemma 4 Let 1_S be the indicator function for set S. Let f : W → R be some function such that

Proof. If p(B \ A) = 0, then the condition follows by construction. We thus only need to show that p(B \ A) = 0 follows from the condition. Let q = p(B \ A). By linearity of expectation, we rewrite the premise condition as

We apply this identity to both sides of the fraction in the lemma statement:

C Gricean Speaker Proofs
Since x ⊆ $ and x ⊆ x, we know that the conditional information of both $ and x given x is 0, and, thus, the claim follows.

Theorem 2 Under any Gricean speaker p, entailment is isomorphic to a distributional relation. Specifically, for all sentences x, y ∈ X,

x ⊆ y ⟺ p(xy)/p(x$) = p(yy)/p(y$).
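As a concrete illustration, the test in Theorem 2 can be computed from any oracle for text probabilities. The sketch below is illustrative only: `p` is a lookup over a hypothetical toy distribution whose ratios were chosen by hand to match, not the paper's speaker model.

```python
import math

def entailment_gap(p, x, y, eos="$"):
    """Distributional entailment test from Theorem 2 (sketch).

    p: maps a text (a string) to its probability under the speaker.
    Returns log[p(xy)/p(x$)] - log[p(yy)/p(y$)]; under a Gricean
    speaker, this gap is 0 exactly when x entails y.
    """
    return math.log(p(x + y) / p(x + eos)) - math.log(p(y + y) / p(y + eos))

# Hypothetical toy probabilities, constructed so the two ratios match:
toy = {"ab": 0.10, "a$": 0.20, "bb": 0.05, "b$": 0.10}
gap = entailment_gap(toy.get, "a", "b")  # ~0 -> test says "a" entails "b"
```

Note that, per the erratum in §H, a gap of 0 is also produced by near contradiction, so a 0 score alone does not certify entailment.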
Proof. Recall from the proof of Lemma 2 that there exists a function g(x, w) such that, for all x ∈ X* and y ∈ X ∪ {$},

By Lemma 1 (here is the error: Lemma 1 does not apply! See §H), this holds if and only if, for all w,

exp(αI_ℓ(y | x; w)) = exp(αI_ℓ(x | x; w)).

We conclude that the distributional relation holds if and only if x ⊆ y.

D Proofs for Learning Bounds
Lemma 3 For z ∈ X* ∪ X*$ and δ > 0, it holds with probability at least

Proof. Without loss of generality, assume p(z) > 0. With probability 1 − (1 − p(z))^n over the draw of our sample, the random variable log p̂(z) has finite variance defined by

With finite variance, we can apply Chebyshev's inequality to conclude that

Solving for δ ≤ Pr[|log p̂(z) − log p(z)|], we get

We conclude that with probability

We now characterize the complexity factor K_p(z) for uniformly truthful speakers.
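For completeness, the Chebyshev step above instantiates the standard inequality, stated here for a generic random variable X with finite variance (the proof applies it to the estimator log p̂(z)):

```latex
\Pr\bigl[\,|X - \mathbb{E}[X]| \geq \delta\,\bigr] \;\leq\; \frac{\mathrm{Var}[X]}{\delta^{2}}
```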
Lemma 5 For all z ∈ X* ∪ X*$ such that ⟦z⟧(p) > 0, it holds that
Proof.We start by deriving a lower bound on p(z).
Applying this inequality to the definition of K_p(z), we conclude that

Lemma 5 lets us derive the following guarantee for estimating entailment scores using a corpus produced by uniformly truthful speakers:

Theorem 4 For a uniformly truthful speaker p, let u_p(x, y) = log p(x$) − log p(xy). For x, y ∈ X such that ⟦xy⟧(p) > 0 and δ > 0, it holds with probability at least

Proof. We expand the difference in scores as follows:

We then apply Lemma 3 with δ/2. Since p(x$) ≥ p(xy), this implies that with probability

Finally, we apply Lemma 5 to conclude that

We now characterize the complexity factor for Gricean speakers.
Because z ∈ X* ∪ X*$, all terms where ⟦z⟧(w) = 1 contribute at least 0 information; other terms contribute negative information. Thus, we lower-bound the information content of the "true" terms by 0 and ignore the other terms to get the lower bound.
Plugging this into K_p(z), we conclude that

Theorem 3 Assume that p(z | w) is a Gricean speaker with respect to listener ℓ and that ⟦z⟧(w) = 1 ⟺ I_ℓ(z; w) ≥ 0. Let

g_p(x, y) = log [p(xy)/p(x$)] − log [p(yy)/p(y$)].

Let q = 1 − min{p(xy), p(yy)}. Then, for all x, y ∈ X such that ⟦xy⟧(p) > 0 and all δ > 0, it holds with probability at least 1 − δ − 4q^n that |g_p(x, y) − g_p̂(x, y)| is at most

Proof. We apply Lemma 3 to each term with δ/4. Since p(yy) ≤ p(y$) and p(xy) ≤ p(x$), we get that with probability at least 1 − δ − 4q^n,

Finally, we apply Lemma 6 to conclude that, with probability at least 1 − δ − 4q^n,

We can use Corollary 2.1 to derive a tighter version of Theorem 3 by removing the dependence on the uncommon string yy:

Theorem 5 Let s_p(x, y) = log [p(x$)/p(xy)] − c(y) + c($). Then, for all x, y ∈ X such that ⟦xy⟧(p) > 0 and all δ > 0, the following holds with probability

The proof follows analogously to Theorem 3. The main improvement of Theorem 5 over Theorem 3 is that the probability with which the bound holds no longer depends on the unlikely probability p(yy). We also get the benefit that the cost complexity factor now depends only on c(xy), and we obtain better constants (2√2 instead of 8), although these changes are likely less important than removing the dependence on p(yy). Of course, the drawback is that we assume access to the cost function c(y). If we have such access, though, the improvements in the bound suggest we may be able to extract entailment from a finite corpus of Gricean text with better sample complexity than if we did not.

E Sample Complexity Estimation Details
Assuming the approximation error in Theorem 3 is ≤ ϵ, we aim to solve the following inequality for n:

Sentence Length We make the simplifying assumption that max{c(xy), c(yy)} = 2w(ℓ + 1), where ℓ is a variable representing sentence length. Let Σ be the word-level vocabulary of English. We estimate the value w by assuming q(z) = exp(−w(|z| + 1)) is a valid prior over Σ* and solving for the unique value of w that satisfies this condition. This reveals that w should be set ≥ 1, but the question remains how to set |Σ|. In practice, we assume the speaker prior is defined over the support of all syntactically valid or likely strings in English, not over all possible strings as derived above. Letting S be the word-level perplexity of English, we set w according to w ≈ log(S + 1).
Making the prior less strong, i.e., increasing |Σ| to be greater than this perplexity estimate, would only increase the number of samples needed to extract entailment judgments.
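The normalization condition behind w ≈ log(S + 1) can be checked numerically: summing q(z) = exp(−w(|z| + 1)) over the |Σ|^n strings of each length n gives the geometric series Σ_n |Σ|^n e^{−w(n+1)}, which equals 1 exactly when e^w = |Σ| + 1. A small sketch (the vocabulary size 100 below is an arbitrary stand-in, not a value from the paper):

```python
import math

def prior_mass(w, vocab_size, max_len=5000):
    """Total mass of q(z) = exp(-w * (|z| + 1)) over strings up to max_len.

    There are vocab_size**n strings of length n, so the mass is the
    geometric series sum_n vocab_size**n * exp(-w * (n + 1)),
    accumulated iteratively to avoid huge intermediate integers.
    """
    ratio = vocab_size * math.exp(-w)  # < 1 when w > log(vocab_size)
    term = math.exp(-w)                # contribution of the length-0 string
    total = 0.0
    for _ in range(max_len + 1):
        total += term
        term *= ratio
    return total

S = 100              # stand-in vocabulary size (or perplexity estimate)
w = math.log(S + 1)  # the paper's choice
mass = prior_mass(w, S)  # ~1.0: q is a valid prior at this w
```

Any larger w makes the total mass fall below 1, which is why w = log(|Σ| + 1) is the unique normalizing value.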
Truth Probability We conservatively assume p(⟦xy⟧) = 1/2, although in practice it may be smaller for more informative sentences. Reducing it would lead to higher sample complexity estimates.
Final Form Putting together our estimates for sentence length and truth probability yields the final form, which captures the intuition that the likelihood of a string vanishes exponentially with its length, and that the base of this decay is roughly inversely proportional to the perplexity of the language. In practice, we set δ = 0.1 and ϵ = 1.0. Changing the value of ϵ (the desired approximation accuracy) would shift the curve.

F General Relations and Speakers
So far, we have characterized concrete distributional relations that are isomorphic to entailment for different classes of speaker models. In this section, we analyze the conditions under which a distributional relation isomorphic to a semantic relation exists, given no assumptions about the speaker. Informally, we prove in Theorem 6 that a distributional isomorphism exists if and only if the speaker model depends on semantics "at all". This is a very weak condition and should be satisfied by any reasonable model of natural speakers. Thus, we take this as evidence that any speaker model, not just the ones we have considered, admits a distributional relation isomorphic to entailment.

G.2 Speaker Model Parameters
We model the listener of the informative speaker as a literal listener (Goodman and Frank, 2016), which means our informative speaker is a rational speaker of depth 1 in the language of rational speech acts.
We set c(x) = 0.1 · |x|, where |x| is the length of the string x, and we set the rationality parameter α = 5. These choices were made heuristically, by inspecting the properties of the speaker's output, as summarized in Figure 4. These parameters led to a relatively uniform distribution over utterances (except for the stop token 111, which is present in all texts) and a variety of text lengths without excessive redundancy. We found that larger values of α or of the coefficient on the cost function produced short texts, biased toward maximally informative utterances (i.e., 100, 010, or 001), while smaller values produced long, repetitive utterances or sometimes empty utterances.
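A depth-1 rational speaker over a literal listener can be sketched in a few lines. The encoding of worlds and utterance semantics below is an illustrative guess, not the experimental code; only α = 5 and c(x) = 0.1 · |x| come from the text above.

```python
import math
from itertools import product

ALPHA = 5.0

def cost(u):
    # c(x) = 0.1 * |x|, as in the text
    return 0.1 * len(u)

def literal_listener(u, worlds, meaning):
    """L0(w | u): uniform over the worlds where u is true."""
    true_worlds = [w for w in worlds if meaning(u, w)]
    return {w: 1.0 / len(true_worlds) for w in true_worlds}

def rsa_speaker(w, utterances, worlds, meaning, alpha=ALPHA):
    """S1(u | w) ∝ exp(alpha * (log L0(w | u) - cost(u)))."""
    scores = {}
    for u in utterances:
        l0 = literal_listener(u, worlds, meaning)
        if w in l0:  # the speaker only produces true utterances
            scores[u] = math.exp(alpha * (math.log(l0[w]) - cost(u)))
    z = sum(scores.values())
    return {u: s / z for u, s in scores.items()}

# Illustrative semantics: a "1" bit in the utterance asserts that
# the corresponding proposition holds in the world.
worlds = list(product([0, 1], repeat=3))
utterances = ["100", "010", "001", "111"]
meaning = lambda u, w: all(w[i] == 1 for i in range(3) if u[i] == "1")

dist = rsa_speaker((1, 1, 1), utterances, worlds, meaning)
```

In world (1, 1, 1), the most specific true utterance ("111") is pinpointed exactly by the literal listener, so the speaker strongly prefers it over the single-bit utterances, matching the informativeness pressure described above.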

G.3 Training and Evaluation
We sample a dataset from a speaker by independently sampling n texts from the speaker model. We generate datasets of varying size from each speaker, with the number of texts n decreasing by factors of 2 from 10^7 texts down to just 2 texts.
We train models of two kinds: a text frequency model and a trigram model. The text frequency model simply assigns a probability to a text proportional to its frequency in the training data, assigning a small ϵ = 10^−20 probability to an unknown sequence. The trigram model is trained using NLTK's (Bird, 2006) MLE implementation, i.e., the probabilities are unsmoothed. We do not need to use smoothing due to the small number of possible trigrams in the language.
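For concreteness, an unsmoothed trigram MLE over tokenized texts can be written from scratch in a few lines. This is a sketch of the idea only; the experiments use NLTK's implementation, and the example corpus below is hypothetical (it reuses the stop token 111 from §G.2).

```python
import math
from collections import Counter

def train_trigram(texts):
    """Unsmoothed MLE trigram model: p(token | two preceding tokens)."""
    ctx_counts, tri_counts = Counter(), Counter()
    for toks in texts:
        padded = ["<s>", "<s>"] + list(toks)
        for i in range(2, len(padded)):
            ctx = (padded[i - 2], padded[i - 1])
            ctx_counts[ctx] += 1
            tri_counts[ctx + (padded[i],)] += 1

    def prob(token, ctx):
        ctx = tuple(ctx)
        return tri_counts[ctx + (token,)] / ctx_counts[ctx] if ctx_counts[ctx] else 0.0

    return prob

def text_logprob(prob, toks):
    """log p(text) under the trigram model; -inf for any unseen trigram."""
    lp, padded = 0.0, ["<s>", "<s>"] + list(toks)
    for i in range(2, len(padded)):
        p = prob(padded[i], (padded[i - 2], padded[i - 1]))
        if p == 0.0:
            return float("-inf")
        lp += math.log(p)
    return lp

# Hypothetical corpus of utterance sequences ending in the stop token.
corpus = [["100", "010", "111"], ["100", "111"], ["010", "111"]]
prob = train_trigram(corpus)
```

The resulting `text_logprob` plays the role of log p̂ in the entailment-extraction experiments: text probabilities estimated from the corpus rather than read off the true speaker.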
For evaluation data, we generate pairs of texts labeled for entailment. We include all pairs where each text is 6 utterances or shorter, excluding utterances that are contradictory or consist only of the end-of-sequence token. The total number of test pairs is about 1.1M.

H Erratum Derivation
Formally, ⟦x⟧ can be partitioned into two sets of worlds:
1. Y = ⟦x⟧ ∩ ⟦y⟧, where x and y are both true
2. Ỹ = ⟦x⟧ \ ⟦y⟧, where x is true but y is false
Following the initial reasoning in the original proof of Theorem 2, the entailment test is 0 if and only if

Figure 1 :
Figure 1: The entailment test score between x and y as a function of the number of worlds where x is true but y is false. Entailment (left intercept with 0) and near contradiction (right intercept with 0) look the same!

Lemma 2
For any Gricean speaker p and x ∈ X,

Under a Gricean speaker, for all x ∈ X, c(x) = log p(x$) − log p(xx) + c($).

Figure 2 :
Figure 2: Estimated number of training sentences for guaranteeing that g_p̂ closely approximates g_p, where p̂ is estimated using empirical text frequencies.
p̂(x) given by trigram model.

Figure 3 :
Figure 3: Plot of g p(x, y) = log p(xy) p(x$) − log p(yy) p(y$) as a function of the number of sentences in the training corpus and the length |xy|.Given the true distribution p, g p (x, y) = 0 iff x entails y.We exclude pairs x, y where both xy and yy are absent from the training data.
Figure 3: Plot of g_p̂(x, y) = log [p̂(xy)/p̂(x$)] − log [p̂(yy)/p̂(y$)] as a function of the number of sentences in the training corpus and the length |xy|. Given the true distribution p, g_p(x, y) = 0 iff x entails y. We exclude pairs x, y where both xy and yy are absent from the training data.
Figure 4 :
Figure 4: Properties of the data generated by the speaker in our experiments, with α = 5 and c(x) = 0.1 • |x|.