Simpler Context-Dependent Logical Forms via Model Projections

We consider the task of learning a context-dependent mapping from utterances to denotations. With only denotations at training time, we must search over a combinatorially large space of logical forms, which is even larger with context-dependent utterances. To cope with this challenge, we perform successive projections of the full model onto simpler models that operate over equivalence classes of logical forms. Though less expressive, we find that these simpler models are much faster and can be surprisingly effective. Moreover, they can be used to bootstrap the full model. Finally, we collected three new context-dependent semantic parsing datasets, and develop a new left-to-right parser.


Introduction
Suppose we are only told that a piece of text (a command) in some context (state of the world) has some denotation (the effect of the command)-see Figure 1 for an example.How can we build a system to learn from examples like these with no initial knowledge about what any of the words mean?
We start with the classic paradigm of training semantic parsers that map utterances to logical forms, which are executed to produce the denotation (Zelle and Mooney, 1996;Zettlemoyer and Collins, 2005;Wong and Mooney, 2007;Zettlemoyer and Collins, 2009;Kwiatkowski et al., 2010).More recent work learns directly from denotations (Clarke et al., 2010;Liang, 2013;Berant et al., 2013;Artzi and Zettlemoyer, 2013), but in this setting, a constant struggle is to contain the exponential explosion of possible logical forms.With no initial lexicon and longer contextdependent texts, our situation is exacerbated.In this paper, we propose projecting a full semantic parsing model onto simpler models over equivalence classes of logical form derivations.As illustrated in Figure 2, we consider the following sequence of models: • Model A: our full model that derives logical forms (e.g., in Figure 1, the last utterance maps to mix(args [1][1])) compositionally from the text so that spans of the utterance (e.g., "it") align to parts of the logical form (e.g., args [1][1], which retrieves an argument from a previous logical form).This is based on standard semantic parsing (e.g., Zettlemoyer and Collins (2005)).
• Model B: collapse all derivations with the same logical form; we map utterances to full logical forms, but without an alignment between the utterance and logical forms.This "floating" approach was used in Pasupat and Liang (2015) and Wang et al. (2015).
• Model C: further collapse all logical forms whose top-level arguments have the same denotation.In other words, we map utterances to flat logical forms (e.g., mix(beaker2)), where the arguments of the top-level predicate are objects in the world.This model is in the spirit of Yao et al. (2014) and Bordes et al. (2014), who directly predicted concrete paths in a knowledge graph for question answering.
Model A excels at credit assignment: the latent derivation explains how parts of the logical form are triggered by parts of the utterance.The price is an unmanageably large search space, given that we do not have a seed lexicon.At the other end, Model C only considers a small set of logical forms, but the mapping from text to the correct logical form is more complex and harder to model.
We collected three new context-dependent semantic parsing datasets using Amazon Mechanical Turk: ALCHEMY (Figure 1), SCENE (Figure 3), and TANGRAMS (Figure 4).Along the way, we develop a new parser which processes utterances left-to-right but can construct logical forms without an explicit alignment.
Our empirical findings are as follows: First, Model C is surprisingly effective, mostly surpassing the other two given bounded computational resources (a fixed beam size).Second, on a synthetic dataset, with infinite beam, Model A outperforms the other two models.Third, we can bootstrap up to Model A from the projected models with finite beam.

Context:
Text: A man in a red shirt and orange hat leaves to the right, leaving behind a man in a blue shirt in the middle.He takes a step to the left. Denotation:

Task
In this section, we formalize the task and describe the new datasets we created for the task.

Setup
First, we will define the context-dependent semantic parsing task.Define w 0 as the initial world state, which consists of a set of entities (beakers in ALCHEMY) and properties (location, color(s), and amount filled).The text x is a sequence of utterances x 1 , . . ., x L .For each utterance x i (e.g., "mix"), we have a latent logical form z i (e.g., mix(args[1][2])).Define the context c i = (w 0 , z 1:i−1 ) to include the initial world state w 0 and the history of past logical forms z 1:i−1 .Each logical form z i is executed on the context c i to produce the next state: for each i = 1, . . ., L. Overloading notation, we write w L = Exec(w 0 , z), where z = (z 1 , . . ., z L ).
The learning problem is: given a set of training examples {(w 0 , x, w L )}, learn a mapping from the text x to logical forms z = (z 1 , . . ., z L ) that produces the correct final state (w L = Exec(w 0 , z)).

Datasets
We created three new context-dependent datasets, ALCHEMY, SCENE, and TANGRAMS (see Table 1 for a summary), which aim to capture a diverse set of context-dependent linguistic phenomena such as ellipsis (e.g., "mix" in ALCHEMY), anaphora on entities (e.g., "he" in SCENE), and anaphora on actions (e.g., "repeat step 3", "bring it back" in TANGRAMS).
For each dataset, we have a set of properties and actions.In ALCHEMY, properties are color, and amount; actions are pour, drain, and mix.In SCENE, properties are hat-color and shirt-color; actions are enter, leave, move, and trade-hats.In TANGRAMS, there is one property (shape), and actions are add, remove, and swap.In addition, we include the position property (pos) in each dataset.Each example has L = 5 utterances, each denoting some transformation of the world state.
Our datasets are unique in that they are grounded to a world state and have rich linguistic context-dependence.In the context-dependent ATIS dataset (Dahl et al., 1994) used by Zettlemoyer and Collins (2009), logical forms of utterances depend on previous logical forms, though there is no world state and the linguistic phenomena is limited to nominal references.In the map navigation dataset (Chen and Mooney, 2011), used by Artzi and Zettlemoyer (2013), utterances only reference the current world state.Vlachos and Clark (2014) released a corpus of annotated dialogues, which has interesting linguistic contextdependence, but there is no world state.Data collection.Our strategy was to automatically generate sequences of world states and ask Amazon Mechanical Turk (AMT) workers to describe the successive transformations.Specifically, we started with a random world state w 0 .For each i = 1, . . ., L, we sample a valid action and argument (e.g., pour(beaker1, beaker2)).To encourage context-dependent descriptions, we upweight recently used ac-tions and arguments (e.g., the next action is more like to be drain(beaker2) rather than drain(beaker5)).Next, we presented an AMT worker with states w 0 , . . ., w L and asked the worker to write a description in between each pair of successive states.
In initial experiments, we found it rather nontrivial to obtain interesting linguistic contextdependence in these micro-domains: often a context-independent utterance such as "beaker 2" is just clearer and not much longer than a possibly ambiguous "it".We modified the domains to encourage more context.For example, in SCENE, we removed any visual indication of absolute position and allowed people to only move next to other people.This way, workers would say "to the left of the man in the red hat" rather than "to position 2".

Model
We now describe Model A, our full contextdependent semantic parsing model.First, let Z denote the set of candidate logical forms (e.g., pour(color(green),color(red))).Each logical form consists of a top-level action with arguments, which are either primitive values (green, 3, etc.), or composed via selection and superlative operations.See Table 2 for a full description.One notable feature of the logical forms is the context dependency: for example, given some context (w 0 , z 1:4 ), the predicate actions[2] refers to the action of z 2 and args We use the term anchored logical forms (a.k.a.derivations) to refer to logical forms augmented with alignments between sub-logical forms of z i and spans of the utterance x i .In the example above, color(green) might align with "green beaker" from Figure 1; see Figure 2 for another example.
top-level action applied to arguments v1, v2 Table 2: Grammar that defines the space of candidate logical forms.Values include numbers, colors, as well as special tokens args[i][j] (for all i ∈ {1, . . ., L} and j ∈ {1, 2}) that refer to the j-th argument used in the i-th logical form.Actions include the fixed domain-specific set plus special tokens actions[i] (for all i ∈ {1, . . ., L}), which refers to the i-th action in the context.

Derivation condition Example
(F1) zi contains predicate r (zi contains predicate pour, "pour") (F2) property p of zi.bj is y (color of arg 1 is green, "green") (F3) action zi.a is a and property p of zi.yj is y (action is pour and pos of arg 2 is 2, "pour, 2") (F4) properties p of zi.v1 is y and p of zi.v2 is y (color of arg 1 is green and pos of arg 2 is 2, "first green, 2") (F5) arg zi.vj is one of zi−1's args (arg reused, "it") (F6) action zi.a = zi−1.a(action reused, "pour") (F7) properties p of zi.yj is y and p of zi−1.yk is y (pos of arg 1 is 2 and pos of prev.arg 2 is 2, "then") (F8) t1 < s2 spans don't overlap Table 3: Features φ(x i , c i , z i ) for Model A: The left hand side describes conditions under which the system fires indicator features, and right hand side shows sample features for each condition.For each derivation condition (F1)-(F7), we conjoin the condition with the span of the utterance that the referenced actions and arguments align to.For condition (F8), we just fire the indicator by itself.
Log-linear model.We place a conditional distribution over anchored logical forms z i ∈ Z given an utterance x i and context c i = (w 0 , z 1:i−1 ), which consists of the initial world state w 0 and the history of past logical forms z 1:i−1 .We use a standard log-linear model: where φ is the feature mapping and θ is the parameter vector (to be learned).Chaining these distributions together, we get a distribution over a sequence of logical forms z = (z 1 , . . ., z L ) given the whole text x: Features.Our feature mapping φ consists of two types of indicators: 1.For each derivation, we fire features based on the structure of the logical form/spans.
2. For each span s (e.g., "green beaker") aligned to a sub-logical form z (e.g., color(green)), we fire features on unigrams, bigrams, and trigrams inside s conjoined with various conditions of z.
The exact features given in Table 3, references the first two utterances of Figure 1 and the associated logical forms below: x 1 = "Pour the last green beaker into beaker 2." z 1 = pour(argmin(color(green),pos),pos(2)) x 2 = "Then into the first beaker." We describe the notation we use for Table 3, restricting our discussion to actions that have two or fewer arguments.Our featurization scheme, however, generalizes to an arbitrary number of arguments.Given a logical form z i , let z i .abe its action and (z i .b 1 , z i .b 2 ) be its arguments (e.g., color(green)).The first and second arguments are anchored over spans [s 1 , t 1 ] and [s 2 , t 2 ], respectively.Each argument z i .bj has a corresponding value z i .vj (e.g., beaker1), obtained by executing z i .bj on the context c i .Finally, let j, k ∈ {1, 2} be indices of the arguments.For example, we would label the constituent parts of z 1 (defined above) as follows: 4 Left-to-right parsing We describe a new parser suitable for learning from denotations in the context-dependent setting.Like a shift-reduce parser, we proceed left to right, but each shift operation advances an entire utterance rather than one word.We then sit on the utterance for a while, performing a sequence of build operations, which either combine two logical forms on the stack (like the reduce operation) or generate fresh logical forms, similar to what is done in the floating parser of Pasupat and Liang (2015).
Our parser has two desirable properties: First, proceeding left-to-right allows us to build and score logical forms z i that depend on the world state w i−1 , which is a function of the previous logical forms.Note that w i−1 is a random variable in our setting, whereas it is fixed in Zettlemoyer and Collins (2009).Second, the build operation allows us the flexibility to handle ellipsis (e.g., "Mix.") and anaphora on full logical forms (e.g., "Do it again."),where there's not a clear alignment between the words and the predicates generated.
The parser transitions through a sequence of hypotheses.Each hypothesis is h = (i, b, σ), where i is the index of the current utterance, where b is the number of predicates constructed on utterance x i , and σ is a stack (list) of logical forms.The stack includes both the previous logical forms z 1:i−1 and fragments of logical forms built on the current utterance.When processing a particular hypothesis, the parser can choose to perform either the shift or build operation: Shift: The parser moves to the next utterance by incrementing the utterance index i and resetting b, which transitions a hypothesis from (i, b, σ) to (i + 1, 0, σ).

Build:
The parser creates a new logical form by combining zero or more logical forms on the stack.There are four types of build operations: 1. Create a predicate out of thin air (e.g., args[1][1] in Figure 5).This is useful when the utterance does not explicitly reference the arguments or action.For example, in Figure 5, we are able to generate the logical form args[1][1] in the presence of ellipsis.
2. Create a predicate anchored to some span of the utterance (e.g., actions[1] anchored to "Repeat").This allows us to do credit assignment and capture which part of the utterance explains which part of the logical form.
3. Pop z from the stack σ and push z onto σ, where z is created by applying a rule in Table 2 to z.
4. Pop z, z from the stack σ and push z onto σ, where z is created by applying a rule in Table 2 to z, z (e.g.
The build step stops once a maximum number of predicates B have been constructed or when the top-level rule is applied.We have so far described the search space over logical forms.In practice, we keep a beam of the K hypotheses with the highest score under the current log-linear model.

Model Projections
Model A is ambitious, as it tries to learn from scratch how each word aligns to part of the logical form.For example, when Model A parses "Mix it", one derivation will correctly align "Mix" to mix, but others will align "Mix" to args[1] [1], "Mix" to pos(2), and so on (Figure 2).
As we do not assume a seed lexicon that could map "Mix" to mix, the set of anchored logical forms is exponentially large.For example, parsing just the first sentence of Figure 1 would generate 1,216,140 intermediate anchored logical forms.
How can we reduce the search space?The key is that the space of logical forms is much smaller than the space of anchored logical forms.Even though both grow exponentially, dealing directly with logical forms allows us to generate pour without the combinatorial choice over alignments.We thus define Model B over the space of these logical forms.Figure 2 shows that the two anchored logical forms, which are treated differently in Model A are collapsed in Model B. This dramatically reduces the search space; parsing the first sentence of Figure 1 generates 7,047 intermediate logical forms.
We can go further and notice that many compositional logical forms reduce to the same flat logical form if we evaluate all the arguments.For example, in Figure 2, mix(args [1] [1]) and mix(pos(2)) are equivalent to mix(beaker2).We define Model C to be the space of these flat logical forms which consist of a top-level action plus primitive arguments.Using Model C, parsing the first sentence of Figure 1 generates only 349 intermediate logical forms.
Note that in the context-dependent setting, the number of flat logical forms (Model C) still increases exponentially with the number of utterances, but it is an overwhelming improvement over Model A. Furthermore, unlike other forms of relaxation, we are still generating logical forms that can express any denotation as before.The gains from Model B to Model C hinge on the fact that in our world, the number of denotations is much smaller than the number of logical forms.
Projecting the features.While we have defined the space over logical forms for Models B and C, we still need to define a distribution over these spaces to to complete the picture.To do this, we propose projecting the features of the log-linear model (1).Define Π A→B to be a map from a anchored logical form z A (e.g., mix(pos(2) ) aligned to "mix") to an unanchored one z B (e.g., mix(pos(2))), and define Π B→C to be a map from z B to the flat logical form z C (e.g., mix(beaker2)).
We construct a log-linear model for Model B by constructing features φ(z B ) (omitting the dependence on x i , c i for convenience) based on the Model A features φ(z A ). Specifically, φ(z B ) is the component-wise maximum of φ(z A ) over all z A that project down to z B ; φ(z C ) is defined similarly: Concretely, Model B's features include indicator features over LF conditions in Table 3 conjoined with every n-gram of the entire utterance, as there is no alignment.This is similar to the model of Pasupat and Liang (2015).Note that most of the derivation conditions (F2)-(F7) already depend on properties of the denotations of the arguments, so in Model C, we can directly reason over the space of flat logical forms z C (e.g., mix(beaker2)) rather than explicitly computing the max over more complex logical forms z B (e.g., mix(color(red))).
Expressivity.In going from Model A to Model C, we gain in computational efficiency, but we lose in modeling expressivity.For example, for "second green beaker" in Figure 1, instead of predicting color(green)[2], we would have to predict beaker3, which is not easily explained by the words "second green beaker" using the simple features in Table 3.
At the same time, we found that simple features can actually simulate some logical forms.For example, color(green) can be explained by the feature that looks at the color property of beaker3.Nailing color(green)[2], however, is not easy.Surprisingly, Model C can use a conjunction of features to express superlatives (e.g., argmax(color(red),pos)) by using one feature that places more mass on selecting objects that are red and another feature that places more mass on objects that have a greater position value.

Experiments
Our experiments aim to explore the computationexpressivity tradeoff in going from Model A to Model B to Model C. We would expect that under the computational constraint of a finite beam size, Model A will be hurt the most, but with an infinite beam, Model A should perform better.We evaluate all models on accuracy, the fraction of examples that a model predicts correctly.A predicted logical form z is deemed to be correct for an example (w 0 , x, w L ) if the predicted logical form z executes to the correct final world state w L .We also measure the oracle accuracy, which is the fraction of examples where at least one z on the beam executes to w L .All experiments train for 6 iterations using AdaGrad (Duchi et al., 2010) and L 1 regularization with a coefficient of 0.001.

Real data experiments
Setup.We use a beam size of 500 within each utterance, and prune to the top 5 between utterances.For the first two iterations, Models B and C train on only the first utterance of each example (L = 1).In the remaining iterations, the models train on two utterance examples.We then evaluate on examples with L = 1, . . ., 5, which tests our models ability to extrapolate to longer texts.
Accuracy with finite beam.We compare models B and C on the three real datasets for both L = 3 and L = 5 utterances (Model A was too expensive to use).Table 4 shows that on 5 utterance examples, the flatter Model C achieves an average accuracy of 20% higher than the more compositional Model B. Similarly, the average oracle accuracy is 39% higher.This suggests that (i) the correct logical form often falls off the beam for Model B due to a larger search space, and (ii) the expressivity of Model C is sufficient in many cases.
On the other hand, Model B outperforms Model C on the TANGRAMS dataset.This happens for two reasons.The TANGRAMS dataset has the smallest search space, since all of the utterances refer to objects using position only.Additionally, many utterances reference logical forms that Model C is unable to express, such as "repeat the first step", or "add it back".
Figure 6 shows how the models perform as the number of utterances per example varies.When the search space is small (fewer number of utterances), Model B outperforms or is competitive with Model C.However, as the search space increases (tighter computational constraints), Model C does increasingly better.
Overall, both models perform worse as L increases, since to predict the final world state w L correctly, a model essentially needs to predict an entire sequence of logical forms z 1 , . . ., z L , and errors cascade.Furthermore, for larger L, the utterances tend to have richer context-dependence.

Artificial data experiments
Setup.Due to the large search space, running model A on real data is impractical.In order feasibly evaluate Model A, we constructed an artificial dataset.The worlds are created using the procedure described in Section 2.2.We use a simple template to generate utterances (e.g., "drain 1 from the 2 green beaker").
To reduce the search space for Model A, we only allow actions (e.g., drain) to align to verbs and property values (e.g., green) to align to adjectives.Using these linguistic constraints provides a slightly optimistic assessment of Model A's performance.
We train on a dataset of 500 training examples and evaluate on 500 test examples.We repeat this procedure for varying beam sizes, from 40 to 260.The model only uses features (F1) through (F3).
Accuracy under infinite beam.Since Model A is more expressive, we would expect it to be more powerful when we have no computational constraints.Figure 7 shows that this is indeed the case: When the beam size is greater than 250, all models attain an oracle of 1, and Model A outperforms Model B, which performs similarly to Model C.This is because the alignments provide a powerful signal for constructing the logical forms.Without alignments, Models B and C learn noisier features, and accuracy suffers accordingly.
Bootstrapping.Model A performs the best with unconstrained computation, and Model C performs the best with constrained computation.Is there some way to bridge the two?Even though Model C has limited expressivity, it can still learn  to associate words like "green" with their corresponding predicate green.These should be useful for Model A too.
To operationalize this, we first train Model C and use the parameters to initialize model A. Then we train Model A. Figure 7 shows that although Model A and C predict different logical forms, the initialization allows Model C to A to perform well in constrained beam settings.This bootstrapping works here because Model C is a projection of Model A, and thus they share the same features.

Error Analysis
We randomly sampled 20 incorrect predictions on 3 utterance examples from each of the three real datasets for Model B and Model C. We categorized each prediction error into one of the following categories: (i) logical forms falling off the beam; (ii) choosing the wrong action (e.g., mapping "drain" to pour); (iii) choosing the wrong argument due to misunderstanding the description (e.g., mapping "third beaker" to pos(1)); (iv) choosing the wrong action or argument due to misunderstanding of context (see Figure 8); (v) noise in the dataset.Table 5 shows the fraction of each error category.

Related Work and Discussion
Context-dependent semantic parsing.Utterances can depend on either linguistic context or world state context.Zettlemoyer and Collins (2009) developed a model that handles references to previous logical forms; Artzi and Zettlemoyer (2013) developed a model that handles references to the current world state.Our system considers both types of context, handling linguistic phenomena such as ellipsis and anaphora that reference both previous world states and logical forms.
Logical form generation. Traditional semantic parsers generate logical forms by aligning each part of the logical form to the utterance (Zelle and Mooney, 1996;Wong and Mooney, 2007;Zettlemoyer and Collins, 2007;Kwiatkowski et al., 2011).In general, such systems rely on a lexicon, which can be hand-engineered, extracted (Cai and Yates, 2013;Berant et al., 2013), or automatically learned from annotated logical forms (Kwiatkowski et al., 2010;Chen, 2012).
Recent work on learning from denotations has moved away from anchored logical forms.Pasupat and Liang (2014) and Wang et al. (2015) proposed generating logical forms without alignments, similar to our Model B. Yao et al. (2014) and Bordes et al. (2014) have explored predicting paths in a knowledge graph directly, which is similar to the flat logical forms of Model C.
Relaxation and bootstrapping.The idea of first training a simpler model in order to work up to a more complex one has been explored other contexts.In the unsupervised learning of generative models, bootstrapping can help escape local optima and provide helpful regularization (Och and Ney, 2003;Liang et al., 2009).When it is difficult to even find one logical form that reaches the denotation, one can use the relaxation technique of Steinhardt and Liang (2015).
Recall that projecting from Model A to C creates a more computationally tractable model at the cost of expressivity.However, this is because Model C used a linear model.One might imagine that a non-linear model would be able to recuperate some of the loss of expressivity.Indeed, Neelakantan et al. (2016) use recurrent neural networks attempt to perform logical operations.
One could go one step further and bypass logical forms altogether, performing all the logical reasoning in a continuous space (Bowman et al., 2014;Weston et al., 2015;Guu et al., 2015;Reed and de Freitas, 2016).This certainly avoids the combinatorial explosion of logical forms in Model A, but could also present additional optimization challenges.It would be worth exploring this avenue to completely understand the computationexpressivity tradeoff.

Figure 1 :
Figure 1: Our task is to learn to map a piece of text in some context to a denotation.An example from the ALCHEMY dataset is shown.In this paper, we ask: what intermediate logical form is suitable for modeling this mapping?

Figure 2 :
Figure 2: Derivations generated for the last utterance in Figure 1.All derivations above execute to mix(beaker2).Model A generates anchored logical forms (derivations) where words are aligned to predicates, which leads to multiple derivations with the same logical form.Model B discards these alignments, and Model C collapses the arguments of the logical forms to denotations.

Figure 3 :Figure 4 :
Figure 3: SCENE dataset: Each person has a shirt of some color and a hat of some color.They enter, leave, move around on a stage, and trade hats.

Figure 5 :
Figure 5: Suppose we have already constructed delete(pos(2)) for "Delete the second figure."Continuing,we shift the utterance "Repeat".Then, we build action[1] aligned to the word "Repeat."followed by args[1][1], which is unaligned.Finally, we combine the two logical forms.

Figure 6 :
Figure 6: Test results on our three datasets as we vary the number of utterances.The solid lines are the accuracy, and the dashed line are the oracles: With finite beam, Model C significantly outperforms Model B on ALCHEMY and SCENE, but is slightly worse on TANGRAMS.

Figure 7 :
Figure 7: Test results on our artificial dataset with varying beam sizes.The solid lines are the accuracies, and the dashed line are the oracle accuracies.Model A is unable to learn anything with beam size < 240.However, for beam sizes larger than 240, Model A attains 100% accuracy.Model C does better than Models A and B when the beam size is small < 40, but otherwise performs comparably to Model B. Bootstrapping Model A using Model C parameters outperforms all of the other models and attains 100% even with smaller beams.

Figure 8 :
Figure 8: Predicted logical forms for this text: The logical form add takes a figure and position as input.Model B predicts the correct logical form.Model C does not understand that "back" refers to position 5, and adds the cat figure to position 1.

Table 1 :
We collected three datasets.The number of examples, train/test split, number of tokens per example, along with interesting phenomena are shown for each dataset.

Table 5 :
Percentage of errors for Model B and C: