Learning Algebraic Recombination for Compositional Generalization

Neural sequence models exhibit limited compositional generalization ability in semantic parsing tasks. Compositional generalization requires algebraic recombination, i.e., dynamically recombining structured expressions in a recursive manner. However, most previous studies mainly concentrate on recombining lexical units, which is an important but not sufficient part of algebraic recombination. In this paper, we propose LeAR, an end-to-end neural model to learn algebraic recombination for compositional generalization. The key insight is to model the semantic parsing task as a homomorphism between a latent syntactic algebra and a semantic algebra, thus encouraging algebraic recombination. Specifically, we learn two modules jointly: a Composer for producing latent syntax, and an Interpreter for assigning semantic operations. Experiments on two realistic and comprehensive compositional generalization benchmarks demonstrate the effectiveness of our model. The source code is publicly available at https://github.com/microsoft/ContextualSP.


Introduction
The principle of compositionality is an essential property of language: the meaning of a complex expression is fully determined by its structure and the meanings of its constituents. Based on this principle, human intelligence exhibits compositional generalization, the algebraic capability to understand and produce a potentially infinite number of novel expressions by dynamically recombining known components. For example, people who know the meaning of "John teaches the girl" and "Tom's daughter" must know the meaning of "Tom teaches John's daughter's daughter" (Figure 1a), even though they have never seen such complex sentences before. In recent years, there has been accumulating evidence that end-to-end deep learning models lack such ability in semantic parsing (i.e., translating natural language expressions to machine-interpretable semantic meanings) tasks (Lake and Baroni, 2018; Tsarkov et al., 2020).

[Figure 1: (a) Compositional generalization requires algebraic recombination. (b) Most previous studies mainly concentrate on recombining lexical units, which is an important but not sufficient part of algebraic recombination.]
Compositional generalization requires algebraic recombination, i.e., dynamically recombining structured expressions in a recursive manner. In the example in Figure 1a, understanding "John's daughter's daughter" is a prerequisite for understanding "Tom teaches John's daughter's daughter", while "John's daughter's daughter" is itself a novel compound expression, which requires recombining "John" and "Tom's daughter" recursively.
Most previous studies on compositional generalization mainly concentrate on recombining lexical units (e.g., words and phrases) (Lake, 2019; Akyürek et al., 2020), of which an example is shown in Figure 1b. This is a necessary part of algebraic recombination, but it is not sufficient for compositional generalization. There have been some studies on algebraic recombination (Liu et al., 2020). However, they are highly specific to a relatively simple domain, SCAN (Lake and Baroni, 2018), and can hardly generalize to more complex domains.
In this paper, our key idea for achieving algebraic recombination is to model semantic parsing as a homomorphism between a latent syntactic algebra and a semantic algebra. Based on this formalism, we focus on learning the high-level mapping between latent syntactic operations and semantic operations, rather than the direct mapping between expression instances and semantic meanings.
Motivated by this idea, we propose LEAR (Learning Algebraic Recombination), an end-to-end neural architecture for compositional generalization. LEAR consists of two modules: a Composer and an Interpreter. Composer learns to model the latent syntactic algebra, thus it can produce the latent syntactic structure of each expression in a bottom-up manner; Interpreter learns to assign semantic operations to syntactic operations, thus we can transform a syntactic tree into the final composed semantic meaning.

Compositionality: An Algebraic View
A semantic parsing task aims to learn a meaning-assignment function m : L → M, where L is the set of (simple and complex) expressions in the language, and M is the set of available semantic meanings for the expressions in L. Many end-to-end deep learning models are built upon this simple and direct formalism, in which the principle of compositionality is not leveraged, thus exhibiting limited compositional generalization.
To address this problem, in this section we put forward the formal statement that "compositionality requires the existence of a homomorphism between the expressions of a language and the meanings of those expressions".
Let us consider a language as a partial algebra L = ⟨L, (f_γ)_{γ∈Γ}⟩, where Γ is the set of underlying syntactic (grammar) rules, and we use f_γ : L^k → L to denote the syntactic operation with a fixed arity k for each γ ∈ Γ. Note that f_γ is a partial function, which means that we allow f_γ to be undefined for certain expressions. Therefore, L is a partial algebra, and we call it a syntactic algebra. In a semantic parsing task, L is latent, and we need to model it by learning from data.
Consider now M = ⟨M, G⟩, where G is the set of semantic operations upon M. M is also a partial algebra, and we call it a semantic algebra. In a semantic parsing task, we can easily define this algebra (by enumerating all available semantic primitives and semantic operations), since M is a machine-interpretable formal system.
The key to compositionality is that the meaning-assignment function m should be a homomorphism from L to M. That is, for each k-ary syntactic operation f_γ in L, there exists a k-ary semantic operation g_γ ∈ G such that, whenever f_γ(e_1, ..., e_k) is defined,

m(f_γ(e_1, ..., e_k)) = g_γ(m(e_1), ..., m(e_k)).  (1)

Based on this formal statement, the task of learning the meaning-assignment function m can be decomposed into two sub-tasks: (1) learning the latent syntax of expressions (i.e., modeling the syntactic algebra L); (2) learning the operation-assignment function (f_γ)_{γ∈Γ} → G.

Learning latent syntax. We need to learn a syntactic parser that can produce the syntactic structure of each given expression. To ensure compositional generalization, there must be an underlying grammar (i.e., Γ), and we hypothesize that Γ is a context-free grammar.

Learning operation assignment. In the syntax tree, for each nonterminal node with k nonterminal children, we assign a k-ary semantic operation to it. This operation assignment depends entirely on the underlying syntactic operation γ of the node.
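As a toy illustration of Equation 1 (not part of LEAR itself), the homomorphism can be made concrete with a hypothetical two-operation language whose semantic algebra is integer arithmetic; all names below are invented for illustration:

```python
# Toy illustration of Equation 1: a syntactic algebra (tree-building rules)
# mapped homomorphically onto a semantic algebra (integer arithmetic).
# All rule and primitive names here are hypothetical.

# Semantic algebra M: primitives are integers, operations are k-ary functions.
g_ops = {
    "PLUS": lambda a, b: a + b,   # 2-ary semantic operation
    "NEG": lambda a: -a,          # 1-ary semantic operation
}

# Meaning assignment for primitive expressions.
m_lex = {"one": 1, "two": 2}

def meaning(tree):
    """Interpret a syntax tree bottom-up.

    A tree is either a primitive string, or a tuple (rule, child_1, ..., child_k).
    The operation-assignment function maps each syntactic rule gamma to a
    semantic operation g_gamma, so that
        m(f_gamma(e1, ..., ek)) = g_gamma(m(e1), ..., m(ek)).
    """
    if isinstance(tree, str):
        return m_lex[tree]
    rule, *children = tree
    return g_ops[rule](*(meaning(c) for c in children))

# "one plus two", parsed as PLUS(one, two):
print(meaning(("PLUS", "one", "two")))           # -> 3
# Recursive recombination: NEG(PLUS(one, two)):
print(meaning(("NEG", ("PLUS", "one", "two"))))  # -> -3
```

The point of the sketch is that `meaning` never memorizes whole expressions; it only stores the rule-to-operation mapping, which is exactly what makes novel recombinations interpretable.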
In semantic parsing tasks, we do not have separate supervision for these two sub-tasks. Therefore, we need to learn them jointly from the end-to-end supervision D ⊂ L × M alone.

[Figure 2: An overview of LEAR (example input: "Who executive produced M0"): (1) Composer C_θ(z|x) is a neural network based on latent Tree-LSTM, which produces the latent syntax tree z of input expression x; (2) Interpreter I_φ(g|x, z) is a neural network that assigns a semantic operation to each nonterminal node in z.]

Model
We propose a novel end-to-end neural model LEAR (Learning Algebraic Recombination) for compositional generalization in semantic parsing tasks. Figure 2 shows its overall architecture. LEAR consists of two parts: (1) Composer C_θ(z|x), which produces the latent syntax tree z of input expression x; (2) Interpreter I_φ(g|x, z), which assigns a semantic operation to each nonterminal node in z. θ and φ refer to the learnable parameters of the two modules respectively. We generate a semantic meaning m(x) according to the predicted z and g in a symbolic manner, then check whether it is semantically equivalent to the ground-truth semantic meaning y to produce rewards for optimizing θ and φ.

Composer
We use x = [x_1, ..., x_T] to denote an input expression of length T. Composer C_θ(z|x) will produce a latent binary tree z given x.

Latent Tree-LSTM
We build up the latent binary tree z in a bottom-up manner based on a Tree-LSTM encoder, called latent Tree-LSTM. Given the input sequence x of length T, the latent Tree-LSTM merges two nodes into one parent node at each merge step, constructing a binary tree after T − 1 merge steps. The merge process is implemented by selecting the adjacent node pair with the highest merging score.
At the t-th (1 ≤ t < T) merge step, the selected pair of adjacent nodes is composed by a Tree-LSTM cell, where "Tree-LSTM" is the standard child-sum tree-structured LSTM encoder (Tai et al., 2015). We use v_i^t to denote the i-th cell at layer t (the t-th merge step is determined by the t-th layer), and use r_i^t to denote the representation of v_i^t. We then obtain an unlabeled binary tree, in which {v_1^1, v_2^1, ..., v_T^1} are leaf nodes, and the T − 1 nodes produced by the merge steps are non-leaf nodes.
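The control flow of the bottom-up merge procedure can be sketched as follows. This is a schematic stand-in, not the paper's implementation: the real model scores adjacent pairs with learned Tree-LSTM representations, while here `score` is an arbitrary function passed in so the loop is runnable:

```python
# Schematic sketch of the latent Tree-LSTM merge loop: T-1 greedy merges of
# the highest-scoring adjacent pair. The learned scorer and the Tree-LSTM
# cell update are abstracted away behind score().

def build_latent_tree(tokens, score):
    """Greedily merge the highest-scoring adjacent pair until one node remains.

    Returns a nested-tuple binary tree over the input tokens.
    """
    nodes = list(tokens)  # layer-1 (leaf) nodes
    while len(nodes) > 1:
        # pick the adjacent pair (i, i+1) with the highest merging score
        i = max(range(len(nodes) - 1),
                key=lambda j: score(nodes[j], nodes[j + 1]))
        nodes[i:i + 2] = [(nodes[i], nodes[i + 1])]  # merge into one parent
    return nodes[0]

# A toy constant scorer: ties resolve leftmost, giving a left-branching tree.
tree = build_latent_tree(["who", "directed", "M0"], score=lambda a, b: 0)
print(tree)  # -> (('who', 'directed'), 'M0')
```

In LEAR the merge decisions are stochastic actions of the Composer policy, so during training a pair is sampled from the score distribution rather than taken by argmax.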

Abstraction by Nonterminal Symbols
As discussed in Section 2, our hypothesis is that the underlying grammar Γ is context-free. Therefore, each syntactic rule γ ∈ Γ can be expressed in the form A → α, where A is drawn from a finite set of nonterminals N, and α is a sequence over N ∪ Σ, with Σ a finite set of terminal symbols.
Abstraction is an essential property of context-free grammar: each compound expression e is abstracted as a simple nonterminal symbol N(e), and can then be combined with other expressions to produce more complex expressions, regardless of the internal details of e. This setup may benefit generalizability, thus we incorporate it as an inductive bias into our model.
Concretely, we assume that there are at most N latent nonterminals in language L (i.e., N = {N_1, ..., N_N}, where N is a hyper-parameter). For each node v_i^t in tree z, we perform an (N + 1)-class classification (the extra class marks nodes that are not abstracted), and assign the predicted nonterminal N_ĉ to the node. In nonterminal nodes, the bottom-up message passing is then reduced from the full representation r_i^t to the nonterminal symbol N(v_i^t) (Equation 6), thus mimicking the abstraction setup in context-free grammar.
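The abstraction step can be illustrated with a minimal sketch (the embedding table and vectors below are invented placeholders): once a node is assigned a nonterminal symbol, the message it passes upward is the symbol's embedding rather than its full span representation.

```python
# Sketch of the abstraction setup: at a nonterminal node, the upward message
# is replaced by the embedding of its nonterminal symbol, discarding
# span-internal detail. Embeddings here are hypothetical toy vectors.

nonterminal_embedding = {"N1": [1.0, 0.0], "N2": [0.0, 1.0]}

def upward_message(r, symbol):
    """Return the message passed to the parent node.

    r: the node's full Tree-LSTM representation.
    symbol: assigned nonterminal name, or None if the node is not abstracted.
    """
    if symbol is None:          # not a nonterminal node: pass r unchanged
        return r
    return nonterminal_embedding[symbol]

print(upward_message([0.3, 0.7], "N1"))   # -> [1.0, 0.0]
print(upward_message([0.3, 0.7], None))   # -> [0.3, 0.7]
```

Because every span abstracted to the same symbol sends an identical message upward, the model cannot condition higher-level composition on span-internal details, which is exactly the context-free abstraction the section describes.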

Interpreter
For each nonterminal node v ∈ V_z, Interpreter I_φ(g|x, z) assigns a semantic operation g_v to it.
We divide nonterminal nodes into two categories: (1) lexical nodes, whose corresponding sub-trees contain no other nonterminal node; (2) algebraic nodes, i.e., the remaining nonterminal nodes.
Interpreting Lexical Nodes For each lexical node v, Interpreter assigns a semantic primitive (i.e., 0-ary semantic operation) to it. Take the CFQ benchmark as an example: it uses SPARQL queries to annotate semantic meanings, thus semantic primitives in CFQ are entities (e.g., m.0gwm wy), predicates (e.g., ns:film.director.film) and attributes (e.g., ns:people.person.gender m 05zppz).
We use a classifier to predict the semantic primitive (Equation 7), where G_lex is the collection of semantic primitives in the domain, and h_{v,x} is the contextual representation of the span corresponding to v (implemented using a Bi-LSTM). Contextually conditioned variation is an important phenomenon in language: the meaning of lexical units varies according to the contexts in which they appear. For example, "editor" means the predicate "film.editor.film" in the expression "Is M0 an editor of M1?", while it means the attribute "film.editor" in "Is M0 an Italian editor?". This is why we use a contextual representation in Equation 7.
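A sketch of this classification step, under the assumption that Equation 7 is a softmax over candidate primitives (the logits below are made-up numbers standing in for scores computed from the Bi-LSTM span representation h_{v,x}):

```python
import math

# Sketch of the lexical-node classifier: a softmax over semantic primitives,
# with candidates restricted to those allowed for the span (as done by the
# phrase-table constraint). Logits are hypothetical stand-ins for scores
# derived from the contextual span representation.

def predict_primitive(logits, allowed):
    """Masked softmax over G_lex restricted to `allowed`; return argmax + probs."""
    masked = {p: s for p, s in logits.items() if p in allowed}
    z = sum(math.exp(s) for s in masked.values())
    probs = {p: math.exp(s) / z for p, s in masked.items()}
    return max(probs, key=probs.get), probs

# "editor" is polysemous; only its two plausible primitives are candidates,
# and (toy) context-dependent logits decide between predicate and attribute.
logits = {"film.editor.film": 1.2, "film.editor": 0.4, "film.director.film": 2.0}
pred, probs = predict_primitive(logits, allowed={"film.editor.film", "film.editor"})
print(pred)  # -> film.editor.film
```

Note how the restriction removes the highest-scoring but disallowed primitive ("film.director.film") before normalization, which is the effect of the phrase-table constraint on Equation 7.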
Interpreting Algebraic Nodes For each algebraic node v, Interpreter assigns a semantic operation to it. The collection of all possible semantic operations G_opr also depends on the domain. Take the CFQ benchmark as an example: this domain has two operations (detailed in Table 1): ∧ (conjunction) and JOIN.
We also use a classifier to predict the semantic operation of v (Equation 8), where r_v is the latent Tree-LSTM representation of node v (see Equation 6).
In Equation 8, we do not use any contextual information from outside v. This setup is based on the assumption of semantic locality: each compound expression should mean the same thing in different contexts.

Training
Denote τ = {z, g} as the trajectory produced by our model, where z and g are actions produced by Composer and Interpreter respectively, and R(τ) as the reward of trajectory τ (elaborated in Section 4.1). Using policy gradient with the likelihood-ratio trick, our model can be optimized by ascending the gradient

∇J(θ, φ) = E_τ [∇ log π_{θ,φ}(τ) · R(τ)],  (9)

where θ and φ are the learnable parameters in Composer and Interpreter respectively, and ∇ abbreviates ∇_{θ,φ}. The REINFORCE algorithm (Williams, 1992) is leveraged to approximate Eq. 9, and the mean-reward baseline (Weaver and Tao, 2001) is employed to reduce variance.
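The baseline-adjusted REINFORCE estimate can be sketched numerically. This is a simplification: the per-trajectory gradient of the log-probability is represented by a single float rather than a parameter vector, so only the weighting scheme is shown:

```python
# Sketch of the REINFORCE update with a mean-reward baseline.
# grad_logp values are scalar stand-ins for the real autograd quantities.

def reinforce_gradient(trajectories):
    """trajectories: list of (grad_logp, reward) pairs sampled from the policy.

    Returns the baseline-adjusted estimate:
        (1/K) * sum_k grad_logp_k * (R_k - mean(R)).
    Subtracting the mean reward reduces variance without biasing the estimate.
    """
    rewards = [r for _, r in trajectories]
    baseline = sum(rewards) / len(rewards)
    return sum(g * (r - baseline) for g, r in trajectories) / len(trajectories)

# Trajectories with above-average reward push their log-prob up, others down.
grad = reinforce_gradient([(2.0, 1.0), (1.0, 0.0)])
print(grad)  # -> 0.25
```

With the baseline at 0.5, the successful trajectory contributes positively and the failed one negatively, matching the intuition that the policy should shift probability mass toward trajectories whose symbolic execution matches the gold meaning.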

Reward Design
The reward R(τ) combines two parts (Equation 10): a logic-based reward R_1(τ) and a primitive-based reward R_2(τ).

Logic-based Reward R_1(τ). We use m(x) and y to denote the predicted semantic meaning and the ground-truth semantic meaning respectively. Each semantic meaning can be converted to a conjunctive normal form 2. We use S_{m(x)} and S_y to denote the sets of conjunctive components in m(x) and y, and define R_1(τ) as their Jaccard similarity (i.e., intersection over union):

R_1(τ) = |S_{m(x)} ∩ S_y| / |S_{m(x)} ∪ S_y|.

Primitive-Based Reward R_2(τ). We use S'_{m(x)} and S'_y to denote the sets of semantic primitives occurring in m(x) and y, and define R_2(τ) based on the overlap of these primitive sets.
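The logic-based reward is straightforward to compute once both meanings are decomposed into conjunctive components; the SPARQL-like strings below are illustrative, not taken from CFQ:

```python
# Sketch of the logic-based reward R1: Jaccard similarity (intersection over
# union) between the conjunctive components of the predicted and gold meanings.

def jaccard_reward(predicted, gold):
    """Intersection-over-union of two sets of conjunctive components."""
    p, g = set(predicted), set(gold)
    if not p and not g:
        return 1.0  # both empty: perfect match by convention
    return len(p & g) / len(p | g)

pred = {"?x a film.film", "?x film.directed_by M0"}
gold = {"?x a film.film", "?x film.directed_by M0", "?x film.edited_by M1"}
print(jaccard_reward(pred, gold))  # -> 0.666...
```

Because the reward is set-based, it is invariant to the ordering of conjuncts, so two syntactically different but logically equivalent conjunctions receive the same score.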

Reducing Search Space
To reduce the huge search space of τ, we impose two constraints.

Parameter Constraint. Consider v, a tree node with n (n > 0) nonterminal children. Composer will never make v a nonterminal node if no semantic operation has n parameters.

Phrase Table Constraint. Following the strategy proposed in previous work, we build a "phrase table" consisting of lexical units (i.e., words and phrases) paired with the semantic primitives that frequently co-occur with them 3. Composer will never produce a lexical node outside of this table, and Interpreter uses this table to restrict the candidates in Equation 7.
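A simplified stand-in for the phrase-table construction is sketched below. The paper builds the table with statistical word alignment (see the Appendix); here we only keep primitives that co-occur with a lexical unit in a large fraction of training pairs, and all data is invented:

```python
from collections import Counter

# Simplified phrase-table construction: keep a primitive for a lexical unit
# only if it co-occurs with that unit in at least `threshold` of the
# training examples containing the unit. The real table uses word alignment.

def build_phrase_table(pairs, threshold=0.6):
    """pairs: list of (set_of_lexical_units, set_of_primitives) per example."""
    unit_count = Counter()
    co_count = Counter()
    for units, prims in pairs:
        for u in units:
            unit_count[u] += 1
            for p in prims:
                co_count[(u, p)] += 1
    return {
        u: {p for (u2, p), c in co_count.items()
            if u2 == u and c / unit_count[u] >= threshold}
        for u in unit_count
    }

pairs = [
    ({"directed"}, {"film.director.film"}),
    ({"directed", "edited"}, {"film.director.film", "film.editor.film"}),
]
table = build_phrase_table(pairs)
print(table["directed"])  # -> {'film.director.film'}
```

Even this crude frequency filter prunes spurious unit-primitive pairings ("directed" is not linked to the editor primitive, which co-occurs with it only half the time), which is what allows the table to safely restrict both Composer's lexical nodes and Interpreter's candidates.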

Curriculum Learning
To help the model converge better, we use a simple curriculum learning strategy. Specifically, we first train the model on samples whose input length is less than a cut-off N_CL, and then further train it on the full train set.
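The two-stage schedule amounts to a simple length filter over the training data; the examples below are invented placeholders:

```python
# Minimal sketch of the two-stage curriculum: first train on inputs shorter
# than the cut-off N_CL, then on the full training set.

def curriculum_stages(train_set, cutoff):
    """Return (stage-1 subset of short examples, stage-2 full train set)."""
    short = [(x, y) for x, y in train_set if len(x.split()) < cutoff]
    return short, train_set

data = [
    ("who directed M0", "MEANING_A"),
    ("who directed and edited M0 and M1", "MEANING_B"),
]
stage1, stage2 = curriculum_stages(data, cutoff=4)
print(len(stage1), len(stage2))  # -> 1 2
```

For CFQ the cut-off is 11, chosen (as noted in the experimental setup) as the smallest value for which the stage-1 subset still covers the full vocabulary, so the lexical Interpreter sees every lexical unit before long compositions appear.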
3 Mainly based on the statistical word alignment technique in machine translation; detailed in the Appendix.

[Table 2 note: Input/output pattern coverage is the percentage of test x/y whose patterns occur in the train data. Output patterns are determined by anonymizing semantic primitives, and input patterns are determined by anonymizing their lexical units.]

Experimental Setup
Benchmarks. We mainly evaluate LEAR on CFQ (Keysers et al., 2020) and COGS (Kim and Linzen, 2020), two comprehensive and realistic benchmarks for measuring compositional generalization. They use different semantic formalisms: CFQ uses SPARQL queries, and COGS uses logical forms (Figure 3 shows examples of them). We list dataset statistics in Table 2. The input/output pattern coverage indicates that CFQ mainly measures algebraic recombination, while COGS measures both lexical recombination (∼78%) and algebraic recombination (∼22%).
In addition to these two compositional generalization benchmarks, in which utterances are synthesized by formal grammars, we also evaluate LEAR on GEO (Zelle and Mooney, 1996), a widely used semantic parsing benchmark, to see whether LEAR can generalize to utterances written by real users. We use the variable-free FunQL as the semantic formalism, and we follow the compositional train/test split to evaluate compositional generalization.

Baselines. For CFQ, we consider three groups of models as baselines: sequence-to-sequence models (with and without pre-training), specialized compositional generalization models such as Neural Shuffle Exchange and CGPS, and the state-of-the-art model HPD. For COGS, we quote the baseline results in the original paper. For GEO, we take previously reported baseline results, and also compare with two specially designed methods: SpanBasedSP and PDE.

Evaluation Metric. We use accuracy as the evaluation metric, i.e., the percentage of test samples for which the predicted semantic meaning m(x) is semantically equivalent to the ground truth y.

Hyper-Parameters. We set N = 3/2/3 (the number of nonterminal symbols) and α = 0.5/1.0/0.9 for CFQ/COGS/GEO respectively. In CFQ, the curriculum cut-off N_CL is set to 11, as we statistically find that this is the smallest curriculum that contains the complete vocabulary. We do not apply the curriculum learning strategy to COGS and GEO, as LEAR works well without it on both benchmarks. Learnable parameters (θ and φ) are optimized with AdaDelta (Zeiler, 2012), and the setting of learning rates is discussed in Section 6.1. We take the model that performs best on the validation set.

[Table 4: COGS results. Transformer: 35 ± 6; LSTM (Bi): 16 ± 8; LSTM (Uni): 32 ± 6; LEAR: 97.7 ± 0.7; w/o Abstraction: 94.5 ± 2.8; w/o Semantic locality: 94.0 ± 3.6; w/o Tree-LSTM: 80.7 ± 4.3.]

[Table 5: GEO results. Seq2Seq: 46.0; BERT2Seq: 49.6; GRAMMAR: 54.0; PDE: 81.2; SpanBasedSP: 82.2; LEAR: 84.1.]

Results and Discussion

Table 3 shows average accuracy and 95% confidence intervals on the three splits of CFQ. LEAR achieves an average accuracy of 90.9% on these three splits, outperforming all baselines by a large margin. We list some observations as follows.

Methods for lexical recombination cannot generalize to algebraic recombination. Many methods for compositional generalization have proved effective for lexical recombination; Neural Shuffle Exchange and CGPS are two representatives. However, the experimental results show that they cannot generalize to CFQ, which focuses on algebraic recombination.

Knowledge of semantics is important for compositional generalization. Seq2seq models show poor compositional generalization ability (∼20%).
Pre-training helps a lot (∼20% → ∼40%), but is still not satisfying. HPD and LEAR incorporate knowledge of semantics (i.e., semantic operations) into the models, rather than simply modeling semantic meanings as sequences. This brings a large gain.
Exploring latent compositional structure in a bottom-up manner is key to compositional generalization. HPD uses an LSTM to encode input expressions, while LEAR uses a latent Tree-LSTM, which explicitly explores the latent compositional structure of expressions. This is the key to the large accuracy gain (67.3% → 90.9%).

Table 4 shows the results on the COGS benchmark. They demonstrate that LEAR generalizes well to domains with different semantic formalisms, by specifying the domain-specific G_lex (semantic primitives) and G_opr (semantic operations). Table 5 shows the results on the GEO benchmark, demonstrating that LEAR also generalizes well to utterances written by real users (i.e., non-synthetic utterances).

Tables 3 and 4 also report results of some ablation models. Our observations are as follows.

Abstraction by nonterminal symbols brings a gain. We use "w/o abstraction" to denote the ablation model in which Equation 6 is disabled. This ablation leads to a 5.5%/3.2% accuracy drop on CFQ/COGS.

Incorporating semantic locality into the model brings a gain. We use "w/o semantic locality" to denote the ablation model in which a Bi-LSTM layer is added before the latent Tree-LSTM. This ablation leads to a 3.0%/3.7% accuracy drop on CFQ/COGS.

Tree-LSTM contributes significantly to compositional generalization. In the ablation "w/o Tree-LSTM", we replace the Tree-LSTM encoder with a span-based encoder, in which each span is represented by concatenating its start and end LSTM representations. In Tables 3 and 4, we can see that the span-based encoder severely hurts performance, with results much worse than those of "w/o abstraction" and "w/o semantic locality". This hints that Tree-LSTM is the main inductive bias of compositionality in our model.

Primitive-based reward helps the model converge better. The ablation "w/o primitive-based reward" leads to a 5.6% accuracy drop on CFQ, and the model variance becomes much larger.
The key insight is that the primitive-based reward guides the model to interpret polysemous lexical units more effectively, thus helping the model converge better.

Ablation Study
Curriculum learning helps the model converge better. The ablation "w/o curriculum learning" leads to a 19% accuracy drop on CFQ, and the model variance becomes much larger. This indicates the importance of curriculum learning. On COGS, LEAR performs well without curriculum learning. We speculate that there are two main reasons: (1) expressions in COGS are much shorter than in CFQ; (2) the input/output pattern coverage of COGS is much higher than that of CFQ.
Higher components with smaller learning rates. Inspired by the differential update strategy used in Liu et al. (2020) (i.e., the higher a component is positioned in the model, the slower its parameters should be updated), we set three different learning rates for the three components of LEAR (in bottom-up order): the lexical Interpreter, the Composer, and the algebraic Interpreter. We fix the learning rate of the lexical Interpreter to 1, and adjust the ratios of the learning rates of the Composer and the algebraic Interpreter to that of the lexical Interpreter. Table 6 shows the results on CFQ. The hierarchical learning-rate setup (1 : 0.5 : 0.1) achieves the best performance.

[Figure 5: Two error cases. Solid nodes denote predicted nonterminal nodes; incorrect parts are colored red. (a) Composer error. (b) Interpreter error: the first "influenced" should be assigned the semantic primitive "influence.influence_node.influenced", while Interpreter incorrectly assigns "influence.influence_node.influenced_by" (abbreviated "INFLU BY" in the figure) to it.]

Closer Analysis
We also conduct closer analysis to the results of LEAR as follows.

Performance by Input Length
Intuitively, understanding longer expressions requires stronger algebraic recombination ability than understanding shorter ones. Therefore, we expect our model to keep good and stable performance as input length increases. Figure 4 shows the performance of LEAR and HPD (the state-of-the-art model on CFQ) under different input lengths. Specifically, test instances are divided into 6 groups by length: [1, 5], [6, 10], ..., [26, 30], and we report accuracy for each group separately. The results indicate that LEAR maintains stable, high performance across input lengths, with only a slow decline as length increases. Even on the group with the longest inputs, LEAR maintains an average accuracy of 86.3% across the three MCD splits.

Error Analysis
To understand the source of errors, we take a closer look at the failed test instances of LEAR on CFQ. These failed test instances account for 9.1% of the test dataset. We categorize them into two error types: Composer error (CE), i.e., test cases where Composer produces incorrect syntactic structures (only considering nonterminal nodes). Figure 5a shows an example. As we do not have ground-truth syntactic structures, we determine whether a failed test instance belongs to this category based on handcrafted syntactic templates.
Interpreter error (IE), i.e., test cases where Composer produces correct syntactic structures but Interpreter assigns one or more incorrect semantic primitives or operations. Figure 5b shows an example, which contains an incorrect semantic primitive assignment. Table 7 shows the distribution of these two error types. On average, 39.19% of failed instances are composer errors, and the remaining 60.81% are interpreter errors.

Limitations
Our approach is implicitly built upon the assumption of primitive alignment, that is, each primitive in the meaning representation can align to at least one span in the utterance. This assumption holds in most cases of various semantic parsing tasks, including CFQ, COGS, and GEO. However, for robustness and generalizability, we also need to consider cases that do not meet this assumption. For example, consider the utterance "Obama's brother", of which the corresponding meaning representation is "Sibling(People[Obama]) ∧ Gender[Male]". Neither "Sibling" nor "Gender[Male]" can align to a span in the utterance, because their composed meaning is expressed by a single word ("brother"). Therefore, LEAR is more suitable for formalisms in which primitives align well to natural language.
In addition, while our approach is general across semantic parsing tasks, the collection of semantic operations needs to be redesigned for each task. We need to ensure that these semantic operations are k-ary projections (as described in Section 2), and that all meaning representations are covered by the operation collection. This is tractable, but still requires some effort from domain experts.

Related Work

Compositional Generalization
Recently, exploring compositional generalization (CG) on neural networks has attracted large attention in the NLP community. For SCAN (Lake and Baroni, 2018), the first benchmark to test CG on seq2seq models, many solutions have been proposed, which can be classified into two tracks: data augmentation (Andreas, 2019; Akyürek et al., 2020) and specialized architectures (Lake, 2019). However, most of these works only focus on lexical recombination. Some works on SCAN have stepped towards algebraic recombination (Liu et al., 2020), but they do not generalize well to other tasks such as CFQ and COGS.
Before our work, there was no satisfactory solution for CFQ and COGS. Previous works on CFQ demonstrated that MLM pre-training and iterative back-translation can improve traditional seq2seq models. HPD, the state-of-the-art solution before ours, was shown to be effective on CFQ, but is still far from satisfactory. As for COGS, to the best of our knowledge, there was no prior solution.

Compositional Semantic Parsing
In contrast to neural semantic parsing models, which are mostly constructed under a fully seq2seq paradigm, compositional semantic parsing models predict partial meaning representations and compose them to produce a full meaning representation in a bottom-up manner (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2012). Our model takes advantage of compositional semantic parsing without requiring any handcrafted lexicon or syntactic rules.

Unsupervised Parsing
Unsupervised parsing (or grammar induction) trains syntax-dependent models to produce syntactic trees of natural language expressions without direct syntactic annotation. Compared to these approaches, our model learns syntax and semantics jointly.

Conclusion
In this paper, we introduce LEAR, a novel end-to-end neural model for compositional generalization in semantic parsing tasks. Our contribution is four-fold: (1) LEAR focuses on algebraic recombination, thus it exhibits stronger compositional generalization ability than previous methods that focus on simpler lexical recombination.
(2) We model the semantic parsing task as a homomorphism between two partial algebras, thus encouraging algebraic recombination.
(3) We propose the model architecture of LEAR, which consists of a Composer (to learn latent syntax) and an Interpreter (to learn operation assignments). (4) Experiments on two realistic and comprehensive compositional generalization benchmarks demonstrate the effectiveness of our model.

Ethical Consideration
The experiments in this paper are conducted on existing datasets. We describe the model architecture and training method in detail, and provide more explanations in the supplemental materials. All the data and code will be released with the paper. The resources required to reproduce the experiments are a single Tesla P100 GPU, and for the COGS benchmark even one CPU is sufficient. Since the compositional generalization ability explored in this paper is a fundamental problem of artificial intelligence and has not yet involved real applications, there are no social consequences or ethical issues.

This is the Appendix for the paper: "Learning Algebraic Recombination for Compositional Generalization".

A Semantic Operations in COGS
The semantic primitives used in the COGS benchmark are entities (e.g., Emma and cat(x_1)), predicates (e.g., eat) and propositions (e.g., eat.agent(x_1, Emma)). The semantic operations in COGS are listed in Table 8.
The operations with " −1 " (e.g., ON −1 ) are right-to-left operations (e.g., ON −1 (cake, table)→table.ON.cake) while the operations without "-1" represent the left-to-right operations (e.g., ON(cake, table)→cake.ON.table). For operation FillFrame, the entity in its arguments will be filled into predicate/proposition as an AGENT, THEME or RECIPIENT, which is decided by model.

B Semantic Operations in GEO and Post-processing

The semantic primitives used in the GEO benchmark are entities (e.g., var0), predicates (e.g., state()) and propositions (e.g., state(var0)). The semantic operations in GEO are listed in Table 9.
To fit the FunQL formalism, we design two post-processing rules for the final semantics generated by the model. First, if the final semantics is a predicate (not a proposition), it is converted into a proposition by filling in the entity all. Second, the predicate most is shifted forward two positions in the final semantics.

C Policy Gradient and Differential Update
In this section, we give more details of our RL training based on policy gradient, and of how the differential update strategy is applied to it. Denote τ = {z, g} as the trajectory of our model, where z and g are actions (or results) produced by Composer and Interpreter respectively, and R(τ) as the reward of trajectory τ (elaborated in Section 4.1). The training objective of our model is to maximize the expected reward

J(θ, φ) = E_{τ∼π_{θ,φ}} [R(τ)],

where π_{θ,φ} is the policy of the whole model, and θ and φ are the parameters of Composer and Interpreter respectively. Applying the likelihood-ratio trick, θ and φ can be optimized by ascending the gradient

∇J(θ, φ) = E_{τ∼π_{θ,φ}} [∇ log π_{θ,φ}(τ) · R(τ)],

which is the same as Eq. 9. As described in Section 3, the interpreting process is divided into two stages, interpreting lexical nodes and interpreting algebraic nodes, so the action g can be split into the semantic primitives of lexical nodes g_l and the semantic operations of algebraic nodes g_a. In our implementation, we use two independent neural modules for these two stages, with parameters φ_l and φ_a respectively. Therefore, ∇ log π_{θ,φ}(τ) in Eq. 9 can be expanded via the chain rule as:

∇ log π_{θ,φ}(τ) = ∇ log π_θ(z|x) + ∇ log π_{φ_l}(g_l|x, z) + ∇ log π_{φ_a}(g_a|x, z, g_l).  (15)
Furthermore, in our experiments, the AdaDelta optimizer (Zeiler, 2012) is employed to optimize our model.

D Phrase Table

The phrase table consists of lexical units paired with semantic primitives; the phrase table for CFQ is shown in Table 10. As for COGS, for each possible lexical unit, we first filter out the semantic primitives that exactly co-occur with it, and delete lexical units with no semantic primitive. Among the remaining lexical units, for those containing only one semantic primitive, we record their co-occurring semantic primitives as ready semantic primitives. For lexical units with more than one semantic primitive, we delete the ready semantic primitives from their co-occurring semantic primitives. Finally, we obtain 731 lexical units, each paired with exactly one semantic primitive. As GEO is quite small, we build its phrase table by hand.

[Table 8: Semantic operations in COGS. "Pred" and "Prop" abbreviate "Predicate" and "Proposition"; "AGE", "THE" and "REC" abbreviate "AGENT", "THEME" and "RECIPIENT"; "-" omits similar examples. Some operations contain a "NONE" example, indicating that no example in the dataset utilizes them.]

[Table 9: Semantic operations in GEO. "Pred" and "Prop" abbreviate "Predicate" and "Proposition"; "INTER", "EXC" and "CONCAT" abbreviate "INTERSECTION", "EXCLUDE" and "CONCATENATION".]

E More Examples
We show more examples of generated tree structures and semantics in Figure 6.