Natural Language Generation with Vocabulary Constraints

We investigate data driven natural language generation under the constraints that all words must come from a fixed vocabulary and a specified word must appear in the generated sentence, motivated by the possibility for automatic generation of language education exercises. We present fast and accurate approximations to the ideal rejection samplers for these constraints and compare various sentence level generative language models. Our best systems produce output that is with high frequency both novel and error free, which we validate with human and automatic evaluations.


Introduction
Freeform data driven Natural Language Generation (NLG) is a topic explored by academics and artists alike, but motivating its empirical study is a difficult task. While many language models used in statistical NLP are generative and can easily produce sample sentences by running their "generative mode", if all that is required is a plausible sentence one might as well pick a sentence at random from any existing corpus.
NLG becomes useful when constraints exist such that only certain sentences are valid. The majority of NLG applies a semantic constraint of "what to say", producing sentences with communicative goals. Other work, such as ours, investigates constraints in structure, producing sentences of a certain form without concern for their specific meaning.
We study two constraints concerning the words that are allowed in a sentence. The first sets a fixed vocabulary such that only sentences where all words are in-vocab are allowed. The second demands not only that all words are in-vocab, but also requires the inclusion of a specific word somewhere in the sentence.
These constraints are natural in the construction of language education exercises, where students have small known vocabularies and exercises that reinforce the knowledge of arbitrary words are required. To provide an example, consider a Chinese teacher composing a quiz that asks students to translate sentences from English to Chinese. The teacher cannot ask students to translate words that have not been taught in class, and would like to ensure that each vocabulary word from the current book chapter is included in at least one sentence.
Using a system such as ours, she could easily generate a number of usable sentences that contain a given vocab word and select her favorite, repeating this process for each vocab word until the quiz is complete.
The construction of such a system presents two primary technical challenges. First, while highly parameterized models trained on large corpora are a good fit for data driven NLG, sparsity is still an issue when constraints are introduced. Traditional smoothing techniques used for prediction based tasks are inappropriate, however, as they liberally assign probability to implausible text. We investigate smoothing techniques better suited for NLG that smooth more precisely, sharing probability only between words that have strong semantic connections.
The second challenge arises from the fact that both vocabulary and word inclusion constraints are easily handled with a rejection sampler that repeatedly generates sentences until one that obeys the constraints is produced. Unfortunately, for models with a sufficiently wide range of outputs the computation wasted by rejection quickly becomes prohibitive, especially when the word inclusion constraint is applied. We define models that sample directly from the possible outputs for each constraint without rejection or backtracking, and closely approximate the distribution of the true rejection samplers.
We contrast several generative systems through both human and automatic evaluation. Our best system effectively captures the compositional nature of our training data, producing error-free text with nearly 80 percent accuracy without wasting computation on backtracking or rejection. When the word inclusion constraint is introduced, we show clear empirical advantages over the simple solution of searching a large corpus for an appropriate sentence.

Related Work
The majority of NLG focuses on the satisfaction of a communicative goal, with examples such as Belz (2008) which produces weather reports from structured data or Mitchell et al. (2013) which generates descriptions of objects from images. Our work is more similar to NLG work that concentrates on structural constraints such as generative poetry (Greene et al., 2010) (Colton et al., 2012) (Jiang and Zhou, 2008) or song lyrics (Wu et al., 2013) (Ramakrishnan A et al., 2009), where specified meter or rhyme schemes are enforced. In these papers soft semantic goals are sometimes also introduced that seek responses to previous lines of poetry or lyric.
Computational creativity is another subfield of NLG that often does not fix an a priori meaning in its output. Examples such as Özbal et al. (2013) and Valitutti et al. (2013) use template filling techniques guided by quantified notions of humor or how catchy a phrase is.
Our motivation for generation of material for language education exists in work such as Sumita et al. (2005) and Mostow and Jang (2012), which deal with automatic generation of classic fill in the blank questions. Our work is naturally complementary to these efforts, as their methods require a corpus of in-vocab text to serve as seed sentences.

Freeform Generation
For clarity in our discussion, we phrase the sentence generation process in the following general terms based around two classes of atomic units: contexts and outcomes. In order to specify a generation system, we must define
1. the set C of contexts c
2. the set O of outcomes o
3. the "Imply" function I(c, o) → List[c ∈ C]
4. M : derivation tree ↔ sentence
where I(c, o) defines the further contexts implied by the choice of outcome o for the context c. Beginning with a unique root context, a derivation tree is created by repeatedly choosing an outcome o for a leaf context c and expanding c to the new leaf contexts specified by I(c, o). M converts between derivation tree and sentence text form.
This is simply a convenient rephrasing of the Context Free Grammar formalism, and as such the systems we describe all have some equivalent CFG interpretation. Indeed, to describe a traditional CFG, let C be the set of symbols, O be the rules of the CFG, and I(c, o) return a list of the symbols on the right hand side of the rule o. To define an n-gram model, a context is a list of words, an outcome a single word, and I(c, o) can be procedurally defined to drop the first element of c and append o.
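The context/outcome vocabulary above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy grammar `RULES`, the uniform choice standing in for P(o|c), and all function names are hypothetical and not taken from the paper.

```python
import random

# Hypothetical toy grammar: a context is a nonterminal, an outcome is a right-hand side.
RULES = {
    "S": [("NP", "VP")],
    "NP": [("the", "N")],
    "N": [("dog",), ("cat",)],
    "VP": [("barks",), ("sleeps",)],
}

def imply_cfg(context, outcome):
    """I(c, o) for a CFG: the nonterminals on the right-hand side of the chosen rule."""
    return [sym for sym in outcome if sym in RULES]

def derive(context, rng):
    """Expand a context recursively; returns (context, outcome, child subtrees)."""
    outcome = rng.choice(RULES[context])  # uniform choice stands in for sampling P(o|c)
    children = [derive(c, rng) for c in imply_cfg(context, outcome)]
    return (context, outcome, children)

def to_sentence(tree):
    """M: read the words off a derivation tree from left to right."""
    _, outcome, children = tree
    kids = iter(children)
    words = []
    for sym in outcome:
        if sym in RULES:
            words.extend(to_sentence(next(kids)))
        else:
            words.append(sym)
    return words

def imply_ngram(context, outcome):
    """I(c, o) for an n-gram model: drop the first word of c and append o."""
    return [context[1:] + (outcome,)]

sentence = to_sentence(derive("S", random.Random(0)))
```

Both `imply_cfg` and `imply_ngram` play the role of I(c, o); only the definition of a context and an outcome changes between the two model families.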
To perform the sampling required for derivation tree construction we must define P(o|c). Using M, we begin by converting a large corpus of sentence segmented text into a training set of derivation trees. Maximum likelihood estimation of P(o|c) is then as simple as normalizing the counts of the observed outcomes for each observed context. However, in order to obtain contexts for which the conditional independence assumption of P(o|c) is appropriate, it is necessary to condition on a large amount of information. This leads to sparse estimates even on large amounts of training data, a problem that can be addressed by smoothing. We identify two complementary types of smoothing, and illustrate them with the following sentences.
The furry dog bit me. The cute cat licked me.
An unsmoothed bigram model trained on this data can only generate the two sentences verbatim. If, however, we know that the tokens "dog" and "cat" are semantically similar, we can smooth by assuming the words that follow "cat" are also likely to follow "dog". This is easily handled with traditional smoothing techniques that interpolate between distributions estimated for both coarse, P(w | w_{-1} = [animal]), and fine, P(w | w_{-1} = "dog"), contexts. We refer to this as context smoothing.
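As a small worked illustration of context smoothing on these two sentences, the sketch below estimates the fine and coarse bigram distributions by count normalization and interpolates them. The class map `CLASS` and the interpolation weight `lam=0.5` are assumptions made for the example, not values from the paper.

```python
from collections import Counter, defaultdict

corpus = ["the furry dog bit me", "the cute cat licked me"]
CLASS = {"dog": "[animal]", "cat": "[animal]"}  # assumed coarse classes

def mle(pairs):
    """Maximum likelihood P(o|c): normalize outcome counts within each context."""
    counts = defaultdict(Counter)
    for c, o in pairs:
        counts[c][o] += 1
    return {c: {o: n / sum(cnt.values()) for o, n in cnt.items()}
            for c, cnt in counts.items()}

bigrams = []
for s in corpus:
    w = ["<s>"] + s.split() + ["</s>"]
    bigrams += list(zip(w, w[1:]))

fine = mle(bigrams)                                          # P(w | w_-1 = "dog")
coarse = mle([(CLASS.get(c, c), o) for c, o in bigrams])     # P(w | w_-1 = [animal])

def smoothed(o, c, lam=0.5):
    """Interpolate the fine and coarse estimates; lam is an assumed weight."""
    p_fine = fine.get(c, {}).get(o, 0.0)
    p_coarse = coarse.get(CLASS.get(c, c), {}).get(o, 0.0)
    return lam * p_fine + (1 - lam) * p_coarse
```

Under the unsmoothed estimate "licked" never follows "dog"; after interpolation with the [animal] context it receives probability 0.25, while "bit" keeps 0.75.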
However, we would also like to capture the intuition that words which can be followed by "dog" can also be followed by "cat", which we will call outcome smoothing. We extend our terminology to describe a system that performs both types of smoothing with the following: smoothing functions S_C and S_O, which map each context c to a smooth context c̄ = S_C(c) and each outcome o to a smooth outcome ō = S_O(o).
We describe the smoothed generative process with the flowchart shown in Figure 1. In order to choose an outcome for a given context, two decisions must be made. First, we must decide which context we will employ, the true context or the smooth context, marked by edges 1 or 2 respectively. Next, we choose to generate a true outcome or a smooth outcome, and if we select the latter we use edge 6 to choose a true outcome given the smooth outcome. The decision between edges 1 and 2 can be sampled from a Bernoulli random variable with parameter λ_c, with one variable estimated for each context c. The decision between edges 5 and 3 and the one between 4 and 7 can also be made with Bernoulli random variables, with parameter sets γ_c and γ̄_c̄ respectively.
This yields the full form of the unconstrained probabilistic generative model; its smooth-context component (edge 2) is

$$P_2(o \mid \bar{c}) = \bar{\gamma}_{\bar{c}}\, P_7(o \mid \bar{c}) + (1 - \bar{\gamma}_{\bar{c}})\, P_6(o \mid \bar{o})\, P_4(\bar{o} \mid \bar{c}) \tag{1}$$

requiring estimation of the λ and γ variables as well as the five multinomial distributions P_3 through P_7. This can be done with a straightforward application of EM.
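To make the flowchart concrete, the sketch below sums the probability of an outcome over every complete path, following the edge numbering used in the prose (edges 5 and 7 emit a true outcome directly, edges 3 and 4 emit a smooth outcome, edge 6 maps a smooth outcome to a true one). It assumes that λ_c is the probability of taking edge 1, that γ is the probability of emitting a true outcome directly, and that each outcome has exactly one smooth outcome S_O(o). The toy tables are invented for illustration.

```python
# Hypothetical toy model: contexts are previous words, outcomes are next words.
MODEL = {
    "S_C": lambda c: "[animal]",   # context smoothing function
    "S_O": lambda o: "[verb]",     # outcome smoothing function
    "lam": {"dog": 0.5, "cat": 0.5},
    "gamma": {"dog": 0.5, "cat": 0.5},
    "gamma_bar": {"[animal]": 0.5},
    "P3": {"dog": {"[verb]": 1.0}, "cat": {"[verb]": 1.0}},   # true context -> smooth outcome
    "P4": {"[animal]": {"[verb]": 1.0}},                        # smooth context -> smooth outcome
    "P5": {"dog": {"bit": 1.0}, "cat": {"licked": 1.0}},        # true context -> true outcome
    "P6": {"[verb]": {"bit": 0.5, "licked": 0.5}},              # smooth outcome -> true outcome
    "P7": {"[animal]": {"bit": 0.5, "licked": 0.5}},            # smooth context -> true outcome
}

def outcome_prob(o, c, model):
    """P(o|c) summed over the paths 1-5, 1-3-6, 2-7 and 2-4-6."""
    cb, ob = model["S_C"](c), model["S_O"](o)
    lam = model["lam"][c]
    g, gb = model["gamma"][c], model["gamma_bar"][cb]
    p6 = model["P6"].get(ob, {}).get(o, 0.0)
    via_true = g * model["P5"].get(c, {}).get(o, 0.0) \
        + (1 - g) * p6 * model["P3"].get(c, {}).get(ob, 0.0)
    via_smooth = gb * model["P7"].get(cb, {}).get(o, 0.0) \
        + (1 - gb) * p6 * model["P4"].get(cb, {}).get(ob, 0.0)
    return lam * via_true + (1 - lam) * via_smooth

p_licked = outcome_prob("licked", "dog", MODEL)
p_bit = outcome_prob("bit", "dog", MODEL)
```

In this toy setting "licked" is reachable after "dog" only through the smoothed paths, and the two outcome probabilities still sum to one.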

Limiting Vocabulary
A primary concern in the generation of language education exercises is the working vocabulary of the students. If efficiency were not a concern, the natural solution to the vocabulary constraint would be rejection sampling: simply generate sentences until one happens to obey the constraint. In this section we show how to generate a sentence directly from this constrained set with a distribution closely approximating that of the rejection sampler.
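For reference, the rejection sampler that serves as the ideal baseline can be sketched as follows. The toy generator, the word lists, and the `max_tries` safeguard are assumptions introduced for the example; the optional `must_include` argument anticipates the word inclusion constraint.

```python
import random

SUBJECTS = ["dog", "cat", "aardvark"]
VERBS = ["sleeps", "barks", "perambulates"]

def toy_generate(rng):
    """Stand-in for an unconstrained generative model."""
    return ["the", rng.choice(SUBJECTS), rng.choice(VERBS)]

def rejection_sample(generate, vocab, rng, must_include=None, max_tries=10_000):
    """Generate sentences until one obeys the constraints; also report the tries used."""
    for tries in range(1, max_tries + 1):
        words = generate(rng)
        in_vocab = all(w in vocab for w in words)
        has_word = must_include is None or must_include in words
        if in_vocab and has_word:
            return words, tries
    raise RuntimeError("no acceptable sentence found")

VOCAB = {"the", "dog", "cat", "sleeps", "barks"}
sentence, tries = rejection_sample(toy_generate, VOCAB, random.Random(0), must_include="cat")
```

The number of tries grows as the accepted set shrinks, which is the wasted computation the direct samplers are meant to avoid.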

Pruning
The first step is to prune the space of possible sentences to those that obey the vocabulary constraint. For the models we investigate there is a natural predicate V(o) that is true if and only if an outcome introduces a word that is out of vocab, and so the vocabulary constraint is equivalent to the requirement that V(o) is false for all possible outcomes o. Considering transitions along edges in Figure 1, the removal of all transitions along edges 5, 6, and 7 that lead to outcomes where V(o) is true satisfies this property. Our remaining concern is that the generation process does not reach a failure case. Again considering transitions in Figure 1, failure occurs when we require P(o|c) for some c and there is no transition to c on edge 1 or to S_C(c) along edge 2. We refer to such a context as invalid. Our goal, which we refer to as consistency, is that for all valid contexts c, all outcomes o that can be reached in Figure 1 satisfy the property that all members of I(c, o) are valid contexts.
To see how we might end up in failure, consider a trigram model on POS/word pairs for which S_C is the identity function and S_O backs off to the POS tag. Given a context c = (t_{-2} w_{-2}, t_{-1} w_{-1}), if we generate along a path using edge 6 we will choose a smooth outcome t_0 that we have seen following c in the data and then independently choose a w_0 that has been observed with tag t_0. This implies a following context (t_{-1} w_{-1}, t_0 w_0). If we have estimated our model with observations from data, there is no guarantee that this context ever appeared, and if it did not there will be no available transition along edges 1 or 2.
Let the list Ī(c, o) be the result of the mapped application of S_C to each element of I(c, o). In order to define an efficient algorithm, we require the following property D referring to the amount of information needed to determine Ī(c, o). Simply put, D states that if the smoothed context and outcome are fixed, then the implied smooth contexts are determined.
To highlight the statement D makes, consider the trigram POS/word model described above, but let S_C also map the POS/word pairs in the context to their POS tags alone. D holds here because given S_C(c) = (t_{-2}, t_{-1}) and S_O(o) = t_0 from the outcome, we are able to determine the implied smooth context (t_{-1}, t_0). If context smoothing instead produced S_C(c) = (t_{-2}), D would not hold.
If D holds then we can show consistency based on the transitions in Figure 1 alone, as any complete path through Figure 1 defines both c̄ and ō. By D we can determine Ī(c, o) for any path and verify that all its members have possible transitions along edge 2. If the verification passes for all paths then the model is consistent.
Algorithm 1 produces a consistent model by verifying each complete path in the manner just described. One important feature is that it preserves the invariant that if a context c can be reached on edge 1, then S_C(c) can be reached on edge 2. This means that if the verification fails then the complete path produces an invalid context, even though we have only checked the members of Ī(c, o) against path 2.
If a complete path produces an invalid context, some transition along that path must be removed. It is never optimal to remove transitions from edges 1 or 2 as this unnecessarily removes all downstream complete paths as well, and so for invalid complete paths along 1-5 and 2-7 Algorithm 1 removes the transitions along edges 5 and 7. The choice is not so simple for the complete paths 1-3-6 and 2-4-6, as there are two remaining choices. Fortunately, D implies that breaking the connection on edge 3 or 4 is optimal, as regardless of which outcome is chosen on edge 6, Ī(c, o) will still produce the same invalid c̄.
After removing transitions in this manner, some transitions on edges 1-4 may no longer have any outgoing transitions. The subroutine FIXUP removes such transitions, checking edges 3 and 4 before 1 and 2. If FIXUP does not modify edge 2 then the model is consistent and Algorithm 1 terminates.
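The fixed-point flavor of this pruning can be illustrated with a deliberately simplified sketch. It is not Algorithm 1: it ignores the smoothing edges of Figure 1 entirely and works on a plain context-to-outcome table, removing out-of-vocab outcomes and then repeatedly removing any outcome that implies a context with no remaining outcomes. The function name and the toy model are hypothetical.

```python
def prune(model, in_vocab):
    """model maps context -> {outcome: implied contexts}; returns a consistent sub-model."""
    # Drop every outcome that introduces an out-of-vocab word.
    model = {c: {o: kids for o, kids in outs.items() if in_vocab(o)}
             for c, outs in model.items()}
    changed = True
    while changed:
        changed = False
        # A context is valid only while it still has at least one outcome.
        valid = {c for c, outs in model.items() if outs}
        for outs in model.values():
            bad = [o for o, kids in outs.items() if any(k not in valid for k in kids)]
            for o in bad:
                del outs[o]
                changed = True
    return {c: outs for c, outs in model.items() if outs}

toy = {
    "S": {"a": ["X"], "b": ["Y"]},
    "X": {"dog": []},
    "Y": {"aardvark": []},
}
pruned = prune(toy, in_vocab=lambda o: o != "aardvark")
```

Here context Y loses its only outcome, becomes invalid, and the outcome of S that implied it is removed on the next pass.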

Estimation
In order to replicate the behavior of the rejection sampler, which uses the original probability model P(o|c) from Equation 1, we must set the probabilities P_V(o|c) of the pruned model appropriately. We note that for moderately sized vocabularies it is feasible to recursively enumerate C_V, the set of all reachable contexts in the pruned model. In further discussion we simplify the representation of the model to a standard PCFG with C_V as its symbol set and its PCFG rules indexed by outcomes. This also allows us to construct the reachability graph for C_V, with an edge from c_i to c_j for each c_j ∈ I(c_i, o). Such an edge is given weight P(o|c), the probability under the unconstrained model, and zero weight edges are not included.
Our goal is to retain the form of the standard incremental recursive sampling algorithm for PCFGs. The correctness of this algorithm comes from the fact that the probability of a rule R expanding a symbol X is precisely the probability of all trees rooted at X whose first rule is R. This implies that the correct sampling distribution is simply the distribution over rules itself. When constraints that disallow certain trees are introduced, the probability of all trees whose first rule is R only includes the mass from valid trees, and the correct sampling distribution is the renormalization of these values.
Let the goodness of a context G(c) be the probability that a full subtree generated from c using the unconstrained model obeys the vocabulary constraint. Knowledge of G(c) for all c ∈ C_V allows the calculation of probabilities for the pruned model with

$$P_V(o \mid c) \propto P(o \mid c) \prod_{c' \in I(c, o)} G(c') \tag{2}$$

While G(c) can be defined recursively as

$$G(c) = \sum_{o} P(o \mid c) \prod_{c' \in I(c, o)} G(c')$$

where the sum ranges over the outcomes retained in the pruned model, its calculation requires that the reachability graph be acyclic. We approximate an acyclic graph by listing all edges in order of decreasing weight and introducing edges as long as they do not create cycles. This can be done efficiently with a binary search over the edges by weight. Note that this approximate graph is used only in recursive estimation of G(c), and the true graph can still be used in Equation 2.
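A minimal sketch of the goodness recursion and the resulting renormalization, assuming an acyclic toy model in which every context keeps at least one in-vocab outcome. The table layout (outcome, P(o|c), I(c, o), V(o)) and all names are illustrative choices, not the authors' implementation.

```python
from functools import lru_cache

# context -> list of (outcome, P(o|c), I(c, o), V(o)); V(o) is True for out-of-vocab outcomes.
MODEL = {
    "S": [("NP VP", 0.5, ("NP", "VP"), False), ("NP", 0.5, ("NP",), False)],
    "NP": [("dog", 0.5, (), False), ("aardvark", 0.5, (), True)],
    "VP": [("barks", 0.75, (), False), ("perambulates", 0.25, (), True)],
}

def mass(prob, children):
    """P(o|c) times the goodness of every implied context."""
    m = prob
    for child in children:
        m *= goodness(child)
    return m

@lru_cache(maxsize=None)
def goodness(c):
    """G(c): probability that a subtree generated from c stays in vocab (acyclic case)."""
    return sum(mass(p, kids) for _, p, kids, oov in MODEL[c] if not oov)

def pruned_probs(c):
    """P_V(o|c): renormalized mass of the valid trees that start with each outcome."""
    g = goodness(c)
    return {o: mass(p, kids) / g for o, p, kids, oov in MODEL[c] if not oov}
```

In this toy model the two outcomes of S are equally likely before pruning, but the outcome that also requires a VP subtree loses mass because a VP is less likely to stay in vocab, so the pruned model prefers the shorter rule.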

Generating Up
In this section we show how to efficiently generate sentences that contain an arbitrary word w* in addition to obeying the vocabulary constraint. We assume the ability to easily find C_w*, a subset of C_V whose use guarantees that the resulting sentence contains w*. Our goal is once again to efficiently emulate the rejection sampler, which generates a derivation tree T and accepts if and only if it contains at least one member of C_w*. Let T_w* be the set of derivation trees that would be accepted by the rejection sampler. We present a three stage generative model and its associated probability distribution P_w*(τ) over items τ for which there is a functional mapping into T_w*.
In addition to the probabilities P_V(o|c) from the previous section, we require an estimate of E(c), the expected number of times each context c appears in a single tree. This can be computed efficiently using the mean matrix, described in Miller and O'Sullivan (1992). This |C_V| × |C_V| matrix M has its entries defined as

$$M_{ij} = \sum_{o} \#(c_j, I(c_i, o))\, P_V(o \mid c_i)$$

where the operator # returns the number of times context c_j appears in I(c_i, o). Defining a 1 × |C_V| start state vector z_0 that is zero everywhere and 1 in the entry corresponding to the root context gives

$$E = \sum_{i=0}^{\infty} z_0 M^i$$

which can be iteratively computed with sparse matrix multiplication. Note that the ith term in the sum corresponds to expected counts at depth i in the derivation tree. With definitions of context and outcome for which very deep derivations are improbable, it is reasonable to approximate this sum by truncation.
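The truncated sum can be sketched without forming M explicitly, by pushing the depth-i mass vector through the outcome table one level at a time. The toy model, the `depth` cutoff, and the function name are assumptions for illustration; a dictionary plays the role of a sparse vector.

```python
# context -> list of (P_V(o|c), I(c, o)); a hypothetical toy model with a recursive NP.
MODEL = {
    "S": [(1.0, ["NP", "VP"])],
    "NP": [(0.5, []), (0.5, ["NP"])],
    "VP": [(1.0, [])],
}

def expected_counts(model, root, depth=60):
    """E(c): sum over i of (z_0 M^i)[c], truncated after `depth` levels."""
    z = {root: 1.0}   # expected counts at the current depth
    E = dict(z)       # the i = 0 term
    for _ in range(depth):
        nxt = {}
        for ci, m in z.items():
            for prob, children in model[ci]:
                for cj in children:   # each occurrence of c_j adds to M[i][j]
                    nxt[cj] = nxt.get(cj, 0.0) + m * prob
        if not nxt:
            break
        for c, m in nxt.items():
            E[c] = E.get(c, 0.0) + m
        z = nxt
    return E

E = expected_counts(MODEL, "S")
```

Because the recursive NP rule is taken with probability one half, the mass at each extra level halves and the expected number of NP contexts per tree converges to 2, so truncation loses very little.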
Our generation model operates in three phases.
1. Choose a start context c_0 ∈ C_w*
2. Generate a spine S of contexts and outcomes connecting c_0 to the root context
3. Fill in the full derivation tree T below all remaining unexpanded contexts
In the first phase, c_0 is sampled from the multinomial

$$P(c_0) = \frac{E(c_0)}{\sum_{c \in C_{w^*}} E(c)}$$

The second step produces a spine S, which is formally an ordered list of triples. Each element of S records a context c_i, an outcome o_i, and the index k in I(c_i, o_i) of the child along which the spine progresses. The members of S are sampled independently given the previously sampled context, starting from c_0 and terminating when the root context is reached. Intuitively this is equivalent to generating the path from the root to c_0 in a bottom up fashion.
We define the probability P_σ of a triple (c_i, o_i, k) given a previously sampled context c_j as

$$P_\sigma((c_i, o_i, k) \mid c_j) = \begin{cases} \dfrac{E(c_i)\, P_V(o_i \mid c_i)}{E(c_j)} & \text{if } I(c_i, o_i)[k] = c_j \\ 0 & \text{otherwise} \end{cases}$$

Let S = (c_1, o_1, k_1) . . . (c_n, o_n, k_n) be the results of this recursive sampling algorithm, where c_n is the root context, and c_1 is the parent context of c_0. The total probability of a spine S is then

$$P(S \mid c_0) = \prod_{i=1}^{n} P_\sigma((c_i, o_i, k_i) \mid c_{i-1})$$

where each factor equals E(c_i) P_V(o_i | c_i) / E(c_{i-1}), which cancels nearly all of the expected counts from the full product. Along with the fact that the expected count of the root context is one, the formula simplifies to

$$P(S \mid c_0) = \frac{1}{E(c_0)} \prod_{i=1}^{n} P_V(o_i \mid c_i)$$

The third step generates a final tree T by filling in subtrees below unexpanded contexts on the spine S using the original generation algorithm, yielding results with probability

$$P(T \mid S) = \prod_{(c, o) \in T/S} P_V(o \mid c)$$

where the set T/S includes all contexts that are not ancestors of c_0, as their outcomes are already specified in S. We validate this algorithm by considering its distribution over complete derivation trees T ∈ T_w*. The algorithm generates τ = (T, S, c_0) and has a simple functional mapping into T_w* by extracting the first member of τ.
Combining the probabilities of our three steps gives

$$P_{w^*}(\tau) = \rho\, P_V(T)$$

where ρ is a constant and

$$P_V(T) = \prod_{(c, o) \in T} P_V(o \mid c)$$

is the probability of T under the original model. Note that several τ may map to the same T by using different spines, and so

$$P_{w^*}(T) = \eta(T)\, \rho\, P_V(T) \tag{12}$$

where η(T) is the number of possible spines, or equivalently the number of contexts c ∈ C_w* in T.
Recall that our goal is to efficiently emulate the output of a rejection sampler. An ideal system P_w* would produce the complete set of derivation trees accepted by the rejection sampler using P_V, with probabilities of each derivation tree T satisfying

$$P_{w^*}(T) \propto P_V(T) \tag{13}$$

Consider the implications of the following assumption
A: each T ∈ T_w* contains exactly one c ∈ C_w*
A ensures that η(T) = 1 for all T, unifying Equations 12 and 13. A does not generally hold in practice, but its clear exposition allows us to design models for which it holds most of the time, leading to a tight approximation.
The most important consideration of this type is to limit redundancy in C_w*. For illustration consider a dependency grammar model with parent annotation where a context is the current word and its parent word. When specifying C_w* for a particular w*, we might choose all contexts in which w* appears as either the current or parent word, but a better choice that more closely satisfies A is to choose contexts where w* appears as the current word only.
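The three phases of this section can be sketched end to end on a toy dependency-style model in which every context is a word, so that C_w* = {w*} and assumption A holds. Everything here is hypothetical: the model, the names, and the choice to interleave phase 3 with phase 2 by filling sibling subtrees as soon as each spine step is sampled, which gives the same distribution because the fills are independent.

```python
import random

ROOT = "ROOT"
# context -> list of (P_V(o|c), I(c, o)); every non-root context is a word.
MODEL = {
    "ROOT": [(0.5, ["barks"]), (0.5, ["sees"])],
    "barks": [(0.5, ["dog"]), (0.5, ["cat"])],
    "sees": [(0.5, ["dog", "cat"]), (0.5, ["cat", "dog"])],
    "dog": [(1.0, [])],
    "cat": [(1.0, [])],
}

def expected_counts(depth=20):
    """Truncated expected count E(c) of each context in a single tree."""
    z, E = {ROOT: 1.0}, {ROOT: 1.0}
    for _ in range(depth):
        nxt = {}
        for ci, m in z.items():
            for p, kids in MODEL[ci]:
                for cj in kids:
                    nxt[cj] = nxt.get(cj, 0.0) + m * p
        for c, m in nxt.items():
            E[c] = E.get(c, 0.0) + m
        z = nxt
    return E

def parents_of(cj, E):
    """Triples (c_i, outcome index, k) with I(c_i, o_i)[k] = c_j, weighted by E(c_i) P_V(o_i|c_i)."""
    return [((ci, oi, k), E.get(ci, 0.0) * p)
            for ci, outs in MODEL.items()
            for oi, (p, kids) in enumerate(outs)
            for k, kid in enumerate(kids) if kid == cj]

def fill(c, rng):
    """Ordinary top-down sampling of the subtree below context c."""
    _, kids = rng.choices(MODEL[c], weights=[p for p, _ in MODEL[c]])[0]
    return (c, [fill(k, rng) for k in kids])

def generate_up(targets, rng):
    E = expected_counts()
    # Phase 1: choose c_0 in C_w* in proportion to its expected count.
    c = rng.choices(targets, weights=[E[t] for t in targets])[0]
    tree = fill(c, rng)
    # Phase 2: grow the spine upward until the root context is reached.
    while c != ROOT:
        triples, weights = zip(*parents_of(c, E))
        ci, oi, k = rng.choices(triples, weights=weights)[0]
        kids = MODEL[ci][oi][1]
        # Phase 3 (interleaved): fill the siblings that are off the spine.
        tree = (ci, [tree if j == k else fill(kid, rng) for j, kid in enumerate(kids)])
        c = ci
    return tree

def contexts(tree):
    """All contexts of a derivation tree in depth-first order."""
    c, kids = tree
    return [c] + [x for kid in kids for x in contexts(kid)]

sample = generate_up(["cat"], random.Random(0))
```

Every tree produced this way contains the target context without any rejection. In this toy model each accepted tree contains "cat" exactly once, so the sampler matches the rejection sampler's distribution; with a redundant choice of C_w*, trees containing several target contexts would be overrepresented by the factor η(T).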

END END
We investigate data driven natural language generation under the constraint that all words must come from a fixed arbitrary vocabulary. This constraint is then extended such that a user specified word must also appear in the sentence. We present fast approximations to the ideal rejection samplers and increase variability in generated text through controlled smoothing. Data driven Natural Language Generation (NLG) is a fascinating topic explored by academics and artists alike, but motivating its empirical study is a difficult task. While many language models used in statistical NLP are generative and can easily produce sample sentences from distributions estimated from data, if all that is required is a plausible sentence one might as well pick one at random from any existing corpus.
NLG is useful when constraints are applied such that only certain plausible sentences are valid. The majority of NLG applies the semantic constraint of "what to say", producing sentences with communicative goals. Other work such as ours investigates constraints in structure; producing sentences of a certain form without concern for their meaning.
We motivate two specific constraints concerning the words that are allowed in a sentence. The first sets a fixed vocabulary such that only sentences where all words are in-vocab are allowed. The second demands not only that all words are in-vocab, but specifies the inclusion of a single arbitrary word somewhere in the sentence. These contraints are most natural in the case of language education, where students have small known vocabularies and exercises that reinforce the knowledge of arbitrary words are required. This use

Abstract
We investigate data driven natural language generation under the constraint that all words must come from a fixed arbitrary vocabulary. This constraint is then extended such that a user specified word must also appear in the sentence. We present fast approximations to the ideal rejection samplers and increase variability in generated text through controlled smoothing. Data driven Natural Language Generation (NLG) is a fascinating topic explored by academics and artists alike, but motivating its empirical study is a difficult task. While many language models used in statistical NLP are generative and can easily produce sample sentences from distributions estimated from data, if all that is required is a plausible sentence one might as well pick one at random from any existing corpus.
NLG is useful when constraints are applied such that only certain plausible sentences are valid. The majority of NLG applies the semantic constraint of "what to say", producing sentences with communicative goals. Other work such as ours investigates constraints in structure; producing sentences of a certain form without concern for their meaning.
We motivate two specific constraints concerning the words that are allowed in a sentence. The first sets a fixed vocabulary such that only sentences where all words are in-vocab are allowed. The second demands not only that all words are in-vocab, but specifies the inclusion of a single arbitrary word somewhere in the sentence. These contraints are most natural in the case of language education, where students have small known vocabularies and exercises that reinforce the knowledge of arbitrary words are required. This use

Abstract
We investigate data driven natural language generation under the constraint that all words must come from a fixed arbitrary vocabulary. This constraint is then extended such that a user specified word must also appear in the sentence. We present fast approximations to the ideal rejection samplers and increase variability in generated text through controlled smoothing. Data driven Natural Language Generation (NLG) is a fascinating topic explored by academics and artists alike, but motivating its empirical study is a difficult task. While many language models used in statistical NLP are generative and can easily produce sample sentences from distributions estimated from data, if all that is required is a plausible sentence one might as well pick one at random from any existing corpus.
NLG is useful when constraints are applied such that only certain plausible sentences are valid. The majority of NLG applies the semantic constraint of "what to say", producing sentences with communicative goals. Other work such as ours investigates constraints in structure; producing sentences of a certain form without concern for their meaning.
We motivate two specific constraints concerning the words that are allowed in a sentence. The first sets a fixed vocabulary such that only sentences where all words are in-vocab are allowed. The second demands not only that all words are in-vocab, but specifies the inclusion of a single arbitrary word somewhere in the sentence. These contraints are most natural in the case of language education, where students have small known vocabularies and exercises that reinforce the knowledge of arbitrary words are required. This use

Abstract
Figure 2: The generation system SPINEDEP draws on dependency tree syntax where we use the term node to refer to a POS/word pair. Contexts consist of a node, its parent node, and grandparent POS tag, as shown in squares. Outcomes, shown in squares with rounded right sides, are full lists of dependents or the END symbol. The shaded rectangles contain the results of I(c, o) from the indicated (c, o) pair.

Experiments
We train our models on sentences drawn from the Simple English Wikipedia.¹ We obtained these sentences from a data dump, which we liberally filtered to remove items such as lists and sentences longer than 15 words or shorter than 3 words. We parsed this data with the recently updated Stanford Parser (Socher et al., 2013) to Penn Treebank constituent form, and removed any sentence that did not parse to a top level S containing at least one NP and one VP child. Even with such strong filters, we retained over 140K sentences for use as training data, and provide this exact set of parse trees for use in future work.²
Inspired by the application in language education, for our vocabulary list we use the English Vocabulary Profile (Capel, 2012), which predicts student vocabulary at different stages of learning English as a second language. We take the most basic American English vocabulary (the A1 list), and retrieve all inflections for each word using SimpleNLG (Gatt and Reiter, 2009), yielding a vocabulary of 1226 simple words and punctuation.
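As a minimal sketch of the length filter and the in-vocab test described above, assuming whitespace tokenization and a vocabulary held as a Python set (the function names are ours for illustration and are not taken from the released code; the parse-based filter is omitted):

```python
def keep_sentence(tokens, min_len=3, max_len=15):
    """Return True if the sentence has between min_len and max_len words."""
    return min_len <= len(tokens) <= max_len


def in_vocab(tokens, vocab):
    """Return True if every token is a member of the fixed vocabulary."""
    return all(token in vocab for token in tokens)


def filter_corpus(sentences):
    """Keep sentences whose whitespace token count is within the bounds."""
    return [s for s in sentences if keep_sentence(s.split())]
```

The bounds are inclusive, so sentences of exactly 3 or exactly 15 words are retained.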
To mitigate noise in the data, we discard any pair of context and outcome that appears only once in the training data, and estimate the parameters of the unconstrained model using EM.
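The noise filter can be sketched as follows, assuming the training data has already been reduced to a list of hashable (context, outcome) pairs; this representation and the name `prune_singletons` are assumptions made for illustration:

```python
from collections import Counter


def prune_singletons(pairs, min_count=2):
    """Count (context, outcome) pairs and drop those seen fewer than min_count times."""
    counts = Counter(pairs)
    return {pair: n for pair, n in counts.items() if n >= min_count}
```

The surviving counts would then serve as the starting point for parameter estimation, which is not shown here.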

Model Comparison
We experimented with many generation models before converging on SPINEDEP, described in Figure 2, which we use in these experiments. SPINEDEP uses dependency grammar elements, with parent and grandparent information in the contexts to capture such distinctions as that between main and clausal verbs. Its outcomes are full configurations of dependents, capturing coordinations such as subject-object pairings. This specificity greatly increases the size of the model and in turn reduces the speed of the true rejection sampler, which fails over 90% of the time to produce an in-vocab sentence. We found that large amounts of smoothing quickly diminish the amount of error free output, and so we smooth very cautiously, mapping words in the contexts and outcomes to fine semantic classes. We compare the use of human annotated hypernyms from WordNet (Miller, 1995) with automatic word clusters from word2vec (Mikolov et al., 2013), based on vector space word embeddings, evaluating both 500 and 5000 clusters for the latter.
We compare these models against several baseline alternatives, shown in Figure 3. To determine correctness, we used Amazon Mechanical Turk, asking the question: "Is this sentence plausible?". We further clarified this question in the instructions with alternative definitions of plausibility as well as both positive and negative examples. Every sentence was rated by five reviewers and its correctness was determined by majority vote, with a Fleiss' kappa of 0.496. To avoid spammers, we limited our HITs to Turkers with an approval rating over 95%.
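For reference, the majority vote and the agreement statistic can be computed as in the sketch below. It uses the standard definition of Fleiss' kappa; the table layout (one row per sentence, one column per rating category) is an assumption about how the ratings might be stored.

```python
def majority_vote(votes):
    """Return True if more than half of the boolean votes are True."""
    return sum(votes) > len(votes) / 2


def fleiss_kappa(table):
    """Fleiss' kappa for table[i][j] = number of raters placing item i in category j.

    Every row must sum to the same number of raters (five in our setting).
    Undefined (division by zero) if all ratings fall in a single category.
    """
    n_items = len(table)
    n_raters = sum(table[0])
    n_cats = len(table[0])
    # Observed agreement for each item, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_items) / n_items
    # Agreement expected by chance from the overall category proportions.
    p_cats = [
        sum(row[j] for row in table) / (n_items * n_raters) for j in range(n_cats)
    ]
    p_exp = sum(p * p for p in p_cats)
    return (p_bar - p_exp) / (1 - p_exp)
```

A row such as `[3, 2]` records three "plausible" and two "implausible" ratings for one sentence, which the majority vote would mark as correct.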
Traditional language modeling techniques such as the Dependency Model with Valence (Klein and Manning, 2004) and 5-gram Kneser-Ney (Chen and Goodman, 1996) perform poorly, which is unsurprising as they are designed for tasks in recognition rather than generation. For n-gram models, accuracy can be greatly increased by decreasing the amount of smoothing, but it becomes difficult to find long n-grams that are completely in-vocab and results become redundant, parroting the few completely in-vocab sentences from the training data. The DMV is more flexible, but makes assumptions of conditional independence that are far too strong. As a result it is unable to avoid red flags such as sentences not ending in punctuation or strange subject-object coordinations. Without smoothing, SPINEDEP suffers from a problem similar to that of unsmoothed n-gram models: high accuracy but quickly vanishing productivity.
All of the smoothed SPINEDEP systems show clear advantages over their competitors. The tradeoff between correctness and generative capacity is also clear, and our results suggest that the number of clusters created from the word2vec embeddings can be used to trace this curve. The ideal position in this tradeoff depends on the specific application, so we leave that decision to future work and arbitrarily use SPINEDEP WordNet for our following experiments.

Fixed Vocabulary
To show the tightness of the approximation presented in Section 4.2, we evaluate three settings for the probabilities of the pruned model. The first is a weak baseline that sets all distributions to uniform. For the second, we simply renormalize the true model's probabilities, which is equivalent to setting G(c) = 1 for all c in Equation 2. Finally, we use our proposed method to estimate G(c).
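The second setting, plain renormalization, can be illustrated with the sketch below. It assumes a toy representation in which the model maps each context to a dictionary of outcome probabilities and each outcome is a tuple of words; both the representation and the function name are hypothetical. It covers only the G(c) = 1 baseline, not our proposed estimate of G(c), and it does not handle contexts that become unreachable after pruning.

```python
def prune_and_renormalize(model, vocab):
    """Drop outcomes containing out-of-vocab words, then renormalize per context.

    model: dict mapping context -> dict mapping outcome (tuple of words) -> probability.
    Contexts left with no allowed outcome are removed from the result.
    """
    pruned = {}
    for context, dist in model.items():
        kept = {o: p for o, p in dist.items() if all(w in vocab for w in o)}
        total = sum(kept.values())
        if total > 0:
            pruned[context] = {o: p / total for o, p in kept.items()}
    return pruned
```

Renormalizing locally in this way ignores how much probability mass each remaining outcome loses further down the derivation, which is the quantity that G(c) is meant to account for.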
The corresponding figure shows that our proposed method more closely approximates the distribution of the rejection sampler. We draw 500K samples from each model and compare them with 500K samples from the rejection sampler itself. We quantify this comparison with the likelihood ratio statistic, evaluating the null hypothesis that the two samples were drawn from the same distribution. Not only does our method more closely emulate the rejection sampler, but we also see welcome evidence that closeness to the true distribution is correlated with correctness.
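One common form of the likelihood ratio statistic for comparing two samples is the G-test of homogeneity, sketched below over counts of distinct sentences. That this is the exact form used in our experiments is an assumption; the sketch is meant only to make the comparison concrete.

```python
import math
from collections import Counter


def g_statistic(sample_a, sample_b):
    """Likelihood ratio (G) statistic for the hypothesis that both samples
    were drawn from the same distribution over items."""
    counts_a, counts_b = Counter(sample_a), Counter(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    total = n_a + n_b
    g = 0.0
    for item in set(counts_a) | set(counts_b):
        pooled = counts_a[item] + counts_b[item]
        for observed, n in ((counts_a[item], n_a), (counts_b[item], n_b)):
            if observed > 0:
                # Expected count under the pooled (shared) distribution.
                expected = pooled * n / total
                g += observed * math.log(observed / expected)
    return 2.0 * g
```

Smaller values indicate that the two samples are more alike, so a model whose samples yield a lower statistic against the rejection sampler is a tighter approximation.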

Word Inclusion
To explore the word inclusion constraint, for each word in our vocabulary list we sample 1000 sentences that are constrained to include that word using both unsmoothed and WordNet smoothed SPINEDEP. We compare these results to the "Corpus" model that simply searches the training data and uniformly samples from the existing sentences that satisfy the constraints. This corpus search approach is quite a strong baseline, as it is trivial to implement and we assume perfect correctness for its results.
This experiment is especially relevant to our motivation of language education. The natural question when proposing any NLG approach is whether or not the ability to automatically produce sentences outweighs the requirement of a postprocess to ensure goal-appropriate output. This is a challenging task in the context of language education, as most applications such as exam or homework creation require only a handful of sentences. In order for an NLG solution to be appropriate, the constraints must be so strong that a corpus search based method will frequently produce too few options to be useful. The word inclusion constraint highlights the strengths of our method as it is not only highly plausible in a language education setting but difficult to satisfy by chance in large corpora.
Figure 5 shows that the corpus search approach fails to find more than ten sentences that obey the word inclusion constraints for most target words. Moreover, it is arguably the case that unsmoothed SPINEDEP is even worse due to its inferior correctness. With the addition of smoothing, however, we see a drastic shift in the number of words for which a large number of sentences can be produced. For the majority of the vocabulary words this model generates over 100 sentences that obey both constraints, of which approximately 80% are valid English sentences.
Figure 5: Using systems that implement the word inclusion constraint, this table shows the number of words for which the number of unique sentences out of 1000 samples was less than 10 or greater than 100, along with the correctness of each system.
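The corpus search baseline and the counts reported in Figure 5 can be sketched as follows, assuming whitespace tokenized training sentences and a vocabulary stored as a set. All function names are illustrative and are not taken from the released code.

```python
import random


def corpus_matches(corpus, vocab, target):
    """Training sentences that are fully in-vocab and contain the target word."""
    matches = []
    for sentence in corpus:
        tokens = sentence.split()
        if target in tokens and all(t in vocab for t in tokens):
            matches.append(sentence)
    return matches


def count_unique_samples(corpus, vocab, target, k=1000, rng=random):
    """Uniformly sample k matching sentences and count the distinct ones."""
    matches = corpus_matches(corpus, vocab, target)
    if not matches:
        return 0
    return len(set(rng.choices(matches, k=k)))


def bucket_counts(unique_per_word, low=10, high=100):
    """Number of target words with fewer than `low` or more than `high` unique sentences."""
    few = sum(1 for n in unique_per_word.values() if n < low)
    many = sum(1 for n in unique_per_word.values() if n > high)
    return few, many
```

Because the baseline can only return sentences that already exist in the training data, its unique count for a target word is bounded by the number of matching sentences, regardless of how many samples are drawn.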

Conclusion
In this work we address two novel NLG constraints, fixed vocabulary and fixed vocabulary with word inclusion, that are motivated by language education scenarios. We showed that under these constraints a highly parameterized model based on dependency tree syntax can produce a wide range of accurate sentences, outperforming the strong baselines of popular generative language models. We developed a pruning and estimation algorithm for the fixed vocabulary constraint and showed that it not only closely approximates the true rejection sampler but also that the tightness of approximation is correlated with human judgments of correctness. We showed that under the word inclusion constraint, precise semantic smoothing produces a system whose abilities exceed the simple but powerful alternative of looking up sentences in large corpora.
SPINEDEP works surprisingly well given the widely held stigma that freeform NLG produces either memorized sentences or gibberish. Still, we expect that better models exist, especially in terms of the definition of smoothing operators. We have presented our algorithms in the flexible terms of context and outcome, and clearly stated the properties that are required for the full use of our methodology. We have also implemented our code in these general terms;³ it performs EM based parameter estimation as well as efficient generation under the constraints discussed above. All systems used in this work, with the exception of 5-gram interpolated Kneser-Ney, were implemented in this way, are included with the code, and can be used as templates.
We recognize several avenues for continued work on this topic. The use of form-based constraints such as word inclusion has clear application in language education, but many other constraints are also desirable. The clearest is perhaps the ability to constrain results based on a "vocabulary" of syntactic patterns such as "Not only ... but also ...". Another extension would be to incorporate the rough communicative goal of response to a previous sentence as in Wu et al. (2013) and attempt to produce in-vocab dialogs such as are ubiquitous in language education textbooks.
Another possible direction is in the improvement of the context-outcome framework itself. While we have assumed a data set of one derivation tree per sentence, our current methods easily extend to sets of weighted derivations for each sentence. This suggests the use of techniques that have proved effective in grammar estimation and that reason over large numbers of possible derivations, such as Bayesian tree substitution grammars or unsupervised symbol refinement.