Few-Shot Novel Concept Learning for Semantic Parsing

Humans are capable of learning novel concepts from very few examples; in contrast, state-of-the-art machine learning algorithms typically need thousands of examples to do so. In this paper, we propose an algorithm for learning novel concepts by representing them as programs over existing concepts. This way, the concept learning problem is naturally a program synthesis problem, and our algorithm learns from a few examples to synthesize a program representing the novel concept. In addition, we perform a theoretical analysis of our approach for the case where the program defining the novel concept over existing ones is context-free. We show that given a learned grammar-based parser and a novel production rule, we can augment the parser with the production rule in a way that provably generalizes. We evaluate our approach by learning concepts in the semantic parsing domain extended to the few-shot novel concept learning setting, showing that our approach significantly outperforms end-to-end neural semantic parsers.


Introduction
A key feature of human intelligence is few-shot learning, namely the ability to learn novel concepts from as few as one or two examples. In contrast, current deep learning approaches face significant challenges with such tasks. This limitation affects deep learning across many applications: many domains contain a large number of concepts, and for many of these concepts, only a few examples are available for learning.
There has been substantial recent interest in studying few-shot learning in the context of semantic parsing (Loula et al., 2018), which is the task of mapping natural language utterances to an executable meaning representation (Mooney, 2007). This setting provides a rich opportunity for concept learning, since concepts can naturally be grounded in symbolic representations in the form of programs. As a consequence, there is an opportunity to learn novel, unseen concepts in a compositional way: the novel concept can be represented as a composition of existing concepts, and can then be composed with existing concepts to form more complex ones. While this compositional structure exists in many natural language tasks, it is explicit in semantic parsing.
As a concrete example, consider the novel concept 4 times in the utterance run 4 times, where run is an existing concept. The program representing 4 times is the program λx . (REPEAT 4 x), which is composed with the program RUN representing run to form the program (REPEAT 4 RUN).
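To make this composition concrete, the following is a minimal sketch of an interpreter for such programs. The nested-tuple encoding and the REPEAT/SEQ operator names are illustrative assumptions, not the paper's actual representation:

```python
# Minimal interpreter for SCAN-style programs, for illustration only.
# Programs are nested tuples; "REPEAT" and "SEQ" are assumed operators.

def denote(program):
    """Evaluate a program to its denotation: a list of action tokens."""
    if isinstance(program, str):       # primitive concept, e.g. "RUN"
        return [program]
    op = program[0]
    if op == "REPEAT":                 # ("REPEAT", n, sub): repeat sub n times
        _, n, sub = program
        return denote(sub) * n
    if op == "SEQ":                    # ("SEQ", p1, p2): p1 followed by p2
        _, p1, p2 = program
        return denote(p1) + denote(p2)
    raise ValueError(f"unknown operator: {op}")

# "run 4 times and walk" corresponds to (REPEAT 4 RUN) WALK:
program = ("SEQ", ("REPEAT", 4, "RUN"), "WALK")
```

Here, evaluating `program` yields the action sequence RUN RUN RUN RUN WALK.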
In this paper, we propose a novel algorithm for learning novel concepts for semantic parsing. At a high level, leveraging the fact that concepts can be represented as programs, our algorithm is based on program synthesis (also known as program induction). In particular, given a set of input-output examples, synthesis algorithms search over the space of possible programs to identify one that is consistent with these examples; they can often find the correct program from very few examples.
The key challenge in applying program synthesis to our setting is that the given training examples are for the whole semantic parsing task rather than for the specific sub-program corresponding to the novel concept. In more detail, we consider the problem of learning a semantic parser from denotations alone, i.e., each training example consists of an utterance (the user's input) labeled with the execution of the corresponding program (the user's desired output) rather than the program itself. We assume that the utterance contains a single new natural language concept (e.g., a word or phrase) that we are trying to learn (i.e., synthesize a program representing that word or phrase).

Figure 1: A pictorial overview of our proposed algorithm for concept learning. Using a single teaching example, our semantic parser is able to learn the 4 times concept and use it in unseen contexts.

To address this issue, our algorithm proceeds as follows:
• Sketch synthesis: Since we already know the remaining concepts in the utterance, we want to avoid synthesizing them as well. To this end, our algorithm first synthesizes a sketch (Solar-Lezama, 2008), which is an incomplete program where one of its expressions has been left as a hole that remains to be filled in. This hole is supposed to be filled by the novel concept we are trying to synthesize.
• Hole synthesis: Next, our algorithm enumerates possible sub-programs to fill the hole, with the goal of identifying one such that the entire program evaluates to the user's desired output. The resulting sub-program is our synthesized representation of the novel concept.
Finally, whenever the novel concept is encountered in future examples, we can substitute it with the synthesized sub-program. Next, we perform a theoretical analysis of our approach, which is designed to more generally elucidate why an approach such as ours can enable few-shot learning of novel concepts. In our analysis, we assume that a representation of the novel concept has already been synthesized; our goal is instead to illustrate why augmenting the existing model with this new concept can generalize well. In particular, the main issue is that the novel concept can result in a shift in the distribution of decisions that must be learned by the model (e.g., applications of parsing rules). We focus on the problem of parsing context-free grammars, which is simpler since it is a classification problem instead of a structured prediction problem. Then, we show that if the learned model is grammar-based (i.e., it learns which production rules are in the grammar), augmenting it with the novel concept (i.e., a novel production rule) can generalize well.
Finally, we experimentally evaluate our approach on two semantic parsing benchmarks, SCAN and GeoQuery (Zelle, 1996), extended to our problem of few-shot novel concept learning. We demonstrate that while end-to-end deep learning baselines fail to learn the novel concepts, our approach does so effectively.
In summary, our key contributions are:
• We propose a novel algorithm, Substitution-Driven Concept Learning (SDCL), for synthesizing programmatic representations of novel concepts in the context of semantic parsing.
• We prove generalization bounds on our approach adapted to context-free parsing; the key challenge is bounding the distribution shift induced by adding the novel concept.
• We empirically validate our approach on the extended SCAN and GeoQuery datasets, showing that SDCL substantially outperforms end-to-end deep learning approaches.
Example. Consider Fig. 1. The user provides a single example of the novel concept 4 times, i.e., an utterance run 4 times and walk and its denotation RUN RUN RUN RUN WALK (but not the program (REPEAT 4 RUN) WALK). First, we infer the type of 4 times by substituting it with other concepts; in particular, when 4 times is substituted with twice or thrice, we observe that the resulting sentence is grammatical. For these substitutions, the semantic parser produces the programs (REPEAT 2 RUN) WALK and (REPEAT 3 RUN) WALK, respectively. Then, we compute the difference of these programs to derive the sketch (REPEAT ?? RUN) WALK. Finally, we enumerate implementations of ??; filling ?? with 4 produces (REPEAT 4 RUN) WALK, whose denotation is RUN RUN RUN RUN WALK, as desired. In the future, given an unlabeled utterance jump after walk 4 times, we produce the sketch (REPEAT ?? WALK) JUMP, substitute ?? with 4, and return WALK WALK WALK WALK JUMP.
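The sketch-derivation step of this example can be sketched in code as follows. The dictionary `parser` stands in for the trained neural semantic parser, and the token-level alignment is a simplifying assumption (the full algorithm operates on program trees):

```python
# Sketch synthesis by substitution: parse the utterance with each known
# same-type concept substituted in, then keep only the program tokens on
# which all parses agree; disagreements become the hole "??".

HOLE = "??"

def derive_sketch(parser, utterance, concept, substitutes):
    programs = [parser[utterance.replace(concept, c)] for c in substitutes]
    return [tok if all(p[i] == tok for p in programs) else HOLE
            for i, tok in enumerate(programs[0])]

# hard-coded stand-in for the trained semantic parser f_theta
parser = {
    "run twice and walk":  ["(", "REPEAT", "2", "RUN", ")", "WALK"],
    "run thrice and walk": ["(", "REPEAT", "3", "RUN", ")", "WALK"],
}
sketch = derive_sketch(parser, "run 4 times and walk", "4 times",
                       ["twice", "thrice"])
```

The two substituted parses agree everywhere except the repetition count, so the resulting sketch is ( REPEAT ?? RUN ) WALK.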

Related Work
Semantic parsing. Traditional semantic parsers consist of a grammar, which defines the space of derivations from the input utterances to output logical forms, along with a model, which ranks derivations in the grammar in order of likelihood (Kwiatkowksi et al., 2010; Kate and Mooney, 2006; Artzi and Zettlemoyer, 2013). More recently, deep learning approaches have treated semantic parsing as a sequence-to-sequence problem, directly mapping the input to the output without use of a grammar (Dong and Lapata, 2016; Jia and Liang, 2016); this formulation provides a high degree of flexibility since it does not require designing a grammar for every new domain. One promising recent approach is to have the neural network decode a sequence of grammar rules to produce the output sequence (Krishnamurthy et al., 2017; Dasigi et al., 2019; Yin and Neubig, 2018); this strategy only requires a grammar over logical forms (which is typically available), not one over utterances. In this work, we investigate an important shortcoming of deep learning approaches, namely few-shot learning of novel concepts. We provide an algorithm that enables learning novel concepts from a few examples in the context of deep learning approaches.
Systematicity in deep learning. Prior work has demonstrated that neural networks do not possess systematicity (Fodor and Pylyshyn, 1988; Loula et al., 2018), a property where the capacity to learn certain concepts implies the capacity to learn novel, structurally related concepts. Lake and Baroni (2018) empirically investigate novel concept learning on the SCAN dataset. They consider models that have seen the concept jump during training only in isolation, where jump is represented by the program JUMP, whose denotation is also JUMP. At test time, the models have to correctly predict the output sequence for jump in various contexts, e.g., jump and look. Models are expected to be able to do so since they have seen walk and look and run and look during training, and are expected to extrapolate from these examples to jump. They find that several sequence-to-sequence models perform very poorly on this task. However, the kinds of concepts they use to assess generalization are very limited. For instance, in SCAN, the jump primitive is an independent concept with no relation to existing concepts such as walk, look, run, after, twice, around. In contrast, we consider a more general notion of a novel concept as a program that may be composed of existing concepts, and compare the ability of different models as well as our approach to learn such concepts from few examples.
Data augmentation. Recently, there have been several approaches attempting to improve the systematicity of deep neural networks using data augmentation. One approach, called data recombination (Jia and Liang, 2016), is to substitute concepts with other words of the same type. However, their approach assumes the type of the concept word is known, whereas we do not make this assumption. Furthermore, their approach is restricted to shallow concepts similar to the jump concept in SCAN, and does not extend to higher-order concepts. Another approach is "Good Enough Compositional Data Augmentation" (Andreas, 2020), which uses overlap with other concepts of the same type in the training data to perform data augmentation. In our setting, because we only provide a single teaching example, there is no context overlap with other concepts of the same type in the training data, so their approach is unable to produce any new examples; thus, it performs the same as the end-to-end approach.

Substitution-Driven Concept Learning
In this section, we describe our neurosymbolic algorithm for synthesizing programmatic representations of novel concepts from few examples.

Problem Formulation
We assume that we already have a trained semantic parser f_θ : Σ* → Π, which maps utterances x ∈ X = Σ* to programs π = f_θ(x), along with denotational semantics ⟦·⟧ : Π → Y mapping programs π to denotations y = ⟦π⟧. Now, we are given a novel concept c, which is a word (or phrase) that does not occur in the data used to train f_θ. In particular, we assume we are given an utterance x ∈ X such that x = x_0 c x_1, but f_θ cannot be used to parse x. In addition, we assume the user provides the desired denotation y ∈ Y, i.e., y = ⟦π⟧, where π = f*(x) is the desired program (which we are not given). Then, our goal is to infer the program representing the novel concept c from the given example (x, y).
We assume access to two additional pieces of information. First, we assume we have a repository of other concepts c' ∈ C_train such that f_θ can correctly parse utterances composed of these concepts. Second, we assume we know the type of each concept c' ∈ C_train as well as of the given concept c; we describe a heuristic for inferring the type of c below. In particular, the type of a concept c is the type of the value ⟦f*(c)⟧. This information is used by our algorithm to substitute c in x with other concepts c' ∈ C_train of the same type as c; as described below, these substitutions are needed to help construct a sketch for x.

Overall Algorithm
Our algorithm, which we call Substitution-Driven Concept Learning (SDCL), is summarized in Algorithm 1. As can be seen, it is divided into two steps: sketch synthesis (the SKETCHSYNTH subroutine) and hole synthesis (the HOLESYNTH subroutine).
Sketch synthesis. At a high level, the sketch synthesis subroutine computes a sketch π̂, which is an incomplete program π̂ = π_0 ?? π_1 with a hole represented by the symbol ??. We use π̂ to denote incomplete programs and π to denote complete programs. For simplicity, we represent programs (and partial programs) as sequences, but in our implementation, we represent them as trees.
To synthesize a sketch, this subroutine first substitutes the novel concept c in x = x_0 c x_1 with each known concept c' ∈ C_train that has the same type as c, obtaining a modified utterance x' = x_0 c' x_1. Since the type of c' is the same, the resulting utterance x' is valid, and furthermore, all concepts in x' are known; thus, we can run the learned semantic parser on x' to obtain a program π_c' = f_θ(x'). By compositionality, the concept c' should parse to a program h_c' such that the overall program π_c' omitting h_c' is independent of c', i.e., π_c' = π_0 h_c' π_1 for some π_0, π_1. Here, h_c' is a complete program; we use h instead of π to indicate that it is the representation of the concept c'. Intuitively, the pair (π_0, π_1) represents the portion of the program corresponding to the context (x_0, x_1) of c' in x'.
Finally, this subroutine returns the sketch π̂ = π_0 ?? π_1 for the original utterance x; in particular, the hole ?? represents the missing portion of the program, which is to be filled with the programmatic representation of the concept c.
Hole synthesis. Next, our algorithm searches over possible programs h that can be used to fill the hole in π̂ = π_0 ?? π_1. In particular, it enumerates programs h of the same type as c, constructs the complete program π = π_0 h π_1, executes π to obtain y' = ⟦π⟧, and checks whether y' = y, where y is the desired result provided by the user. If so, it returns h; otherwise, it continues its search.
The search order over programs h is important, since multiple programs might evaluate to the desired denotation y. A typical strategy is to search for the smallest program h; intuitively, this choice serves as regularization, since smaller programs correspond to simpler functionality. In this case, we enumerate h using breadth-first search (assuming the possibilities are represented by a context-free grammar over programs), which ensures that we identify the smallest one.
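The hole-synthesis search can be sketched as follows. Grouping candidates by size mimics breadth-first enumeration over a program grammar, and the toy `evaluate` function is an assumption standing in for the real program executor:

```python
# Hole synthesis: enumerate candidate fills smallest-first, and return the
# first one whose completed program evaluates to the target denotation.
from collections import deque

def hole_synthesis(candidates_by_size, sketch, evaluate, target):
    queue = deque(h for size in sorted(candidates_by_size)
                    for h in candidates_by_size[size])
    while queue:
        h = queue.popleft()
        program = [h if tok == "??" else tok for tok in sketch]
        if evaluate(program) == target:
            return h
    return None                         # no consistent fill found

def evaluate(tokens):
    # toy executor for programs of shape ( REPEAT n ACT ) ACT2
    n = int(tokens[2])
    return [tokens[3]] * n + [tokens[5]]

fill = hole_synthesis({1: ["2", "3", "4", "5"]},
                      ["(", "REPEAT", "??", "RUN", ")", "WALK"],
                      evaluate, ["RUN"] * 4 + ["WALK"])
```

Because candidates are dequeued in order of size, the smallest consistent fill (here, 4) is returned first.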
Leveraging learned concepts. Finally, given a new utterance x containing the novel concept c, we parse it as follows. First, we synthesize a sketch π_0 ?? π_1 for x as before. Then, we fill the hole ?? with h_c, where h_c is the program we synthesized to represent c, obtaining the complete program π_0 h_c π_1.

Algorithm Details
Multiple examples. We have described our approach for using a single example (x, y) to learn the novel concept; it can easily be extended to multiple examples (x_1, y_1), ..., (x_n, y_n). In particular, we run sketch synthesis independently for each example, obtaining sketches π̂_1, ..., π̂_n, where π̂_i = π_{i,0} ?? π_{i,1} for each i ∈ {1, ..., n}. Hole synthesis (run only once) is given all of these sketches; it replaces the condition ⟦π_0 h π_1⟧ = y with the condition ∧_{i=1}^n (⟦π_{i,0} h π_{i,1}⟧ = y_i). That is, it ensures that h is consistent with all examples. Using multiple examples can reduce ambiguity (i.e., there may be multiple programs h that are consistent with a single example (x, y)).
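The multi-example consistency condition can be written directly as a conjunction over sketches; the toy `evaluate` function below is again an illustrative stand-in for program execution:

```python
# A candidate fill h must reproduce the denotation of every example.

def evaluate(tokens):
    # toy executor for programs of shape ( REPEAT n ACT ) ACT2
    n = int(tokens[2])
    return [tokens[3]] * n + [tokens[5]]

def consistent(h, examples):
    """True iff filling each sketch with h yields that example's denotation."""
    return all(evaluate([h if tok == "??" else tok for tok in sketch]) == y
               for sketch, y in examples)

examples = [
    (["(", "REPEAT", "??", "RUN", ")", "WALK"], ["RUN"] * 4 + ["WALK"]),
    (["(", "REPEAT", "??", "WALK", ")", "JUMP"], ["WALK"] * 4 + ["JUMP"]),
]
```

Only the fill 4 satisfies both examples, illustrating how extra examples prune spuriously consistent programs.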
Grammaticality-based substitution. So far, we have assumed that the type of the novel concept c is known. We now describe a strategy for inferring its type; note that we continue to assume that the types of existing concepts c' ∈ C_train are known. At a high level, we separately train a dedicated model to detect whether a given substitution is grammatical. In particular, for each type τ, let C^τ_train ⊆ C_train denote the known concepts of type τ. Now, for each type τ, we substitute c with each concept c' ∈ C^τ_train in x = x_0 c x_1 to obtain x' = x_0 c' x_1. Then, we run a model p_θ(x') ∈ [0, 1] that predicts the probability that x' is grammatical. We choose the type τ for which these substitutions are grammatical with the highest confidence. To train p_θ, we generate training data using our known concepts C_train, including both valid substitutions (labeled 1) and invalid ones (labeled 0).
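This type-inference heuristic can be sketched as follows; the scoring function is a stand-in for the learned grammaticality model p_θ, and averaging scores per type is one possible aggregation:

```python
# Infer the type of a novel concept: for each type, substitute in known
# concepts of that type and score how grammatical the results look.

def infer_type(score_grammatical, utterance, concept, concepts_by_type):
    best_type, best_score = None, float("-inf")
    for typ, known in concepts_by_type.items():
        avg = sum(score_grammatical(utterance.replace(concept, c))
                  for c in known) / len(known)
        if avg > best_score:
            best_type, best_score = typ, avg
    return best_type

# stand-in for the learned grammaticality model p_theta
GRAMMATICAL = {"run twice and walk", "run thrice and walk"}
score = lambda utt: 1.0 if utt in GRAMMATICAL else 0.0

concepts_by_type = {"quantifier": ["twice", "thrice"],
                    "action": ["jump", "look"]}
inferred = infer_type(score, "run 4 times and walk", "4 times",
                      concepts_by_type)
```

Substituting quantifiers yields grammatical sentences while substituting actions does not, so the quantifier type wins.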

Generalization Bounds
In this section, we prove generalization bounds on our approach adapted to context-free parsing.

Problem Formulation
We consider the problem of parsing-i.e., given a sentence x ∈ X = Σ * , decide whether x ∈ L(C * ).
Here, C* = (V, Σ, R, S) is an unknown context-free grammar (CFG), where V is the set of nonterminals, Σ the terminals, R the productions, and S ∈ V the start symbol.¹ We assume that C* is normalized, i.e., all productions are either unary A → B or binary A → BD. For simplicity, we consider fixed-length sentences (i.e., X = Σ^K for some K ∈ N); we also assume all productions in C* and C̃* are binary (i.e., there are no unary productions). In addition, we assume given a probability distribution P(x) over sentences; our goal is then to achieve good performance on sentences x ∼ P.
We consider a novel concept in the form of a single production r̃ = Ã → B̃D̃ added to C* to obtain a modified CFG C̃* = (V, Σ, R̃, S), where R̃ = R ∪ {r̃}. That is, C̃* equals C* but with the extra production r̃. For our theoretical analysis, we assume we are given:
• A learned model g : Σ* → {0, 1} such that g(w) ≈ 1(w ∈ L(C*)) (more precisely, they are equal with high probability).
• The novel production r̃ = Ã → B̃D̃.
Then, our goal is to augment g with r̃ to obtain a new classifier g̃ that works well for C̃* on x ∼ P.

Grammar-Based Approach
Next, we propose and analyze an algorithm for augmenting a learned grammar-based parser with the given novel production r̃.
Model. This strategy encodes the CFG as a function φ that indicates, for each candidate binary production A → BD, whether it is included in the grammar; in other words, φ is the indicator function over all |W|³ possible productions, where W denotes the set of nonterminals. Then, given a CFG C_φ, we construct a classifier f_φ(x) = 1(x ∈ L(C_φ)). To implement this check, we assume f_φ runs a CYK parser on the input x = σ_1...σ_K, i.e., it computes the sets V^{i,j}_{φ,x} of nonterminals that derive σ_i...σ_j for each i < j. Then, f_φ checks whether the start symbol is contained in the parse of the input x, i.e., f_φ(x) = 1(S ∈ V^{1,K}_{φ,x}).
Algorithm. We consider an algorithm that takes as input a pretrained model f_φ designed to parse C*, along with the novel production r̃; this algorithm then returns the modified model f_φ̃, where φ̃(r) = 1 if r = r̃ and φ̃(r) = φ(r) otherwise. That is, this algorithm simply overrides φ when r = r̃, thus augmenting it with the novel production.
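A minimal sketch of this grammar-based parser and the augmentation step follows. The encoding of productions as triples of nonterminals plus a separate lexicon for terminals is an illustrative assumption:

```python
# CYK recognizer for a binary-normalized CFG, plus the augmentation step:
# adding a single novel production to the learned production set.

def cyk_accepts(x, lexicon, productions, start):
    """x: list of terminals; lexicon: terminal -> set of nonterminals;
    productions: set of triples (A, B, D) encoding A -> B D."""
    K = len(x)
    V = {(i, i): set(lexicon[x[i]]) for i in range(K)}
    for span in range(2, K + 1):
        for i in range(K - span + 1):
            j = i + span - 1
            V[(i, j)] = set()
            for k in range(i, j):
                for (A, B, D) in productions:
                    # this membership test plays the role of querying phi
                    if B in V[(i, k)] and D in V[(k + 1, j)]:
                        V[(i, j)].add(A)
    return start in V[(0, K - 1)]       # is the start symbol in the parse?

def augment(productions, novel_rule):
    """Override phi on the novel production, leaving it unchanged elsewhere."""
    return productions | {novel_rule}

lexicon = {"a": {"A"}, "b": {"B"}}
rules = {("S", "A", "B")}               # S -> A B
```

For example, the grammar above accepts "a b" but rejects "b a"; after augmenting with the novel production S → B A, "b a" is accepted as well.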

Theoretical Analysis
Bounded error assumption. To obtain generalization bounds, we need to assume that the accuracy of the given model f_φ is bounded. Specifically, we assume the accuracy of φ is bounded, and then bound the accuracy of f_φ in terms of the accuracy of φ. In particular, we say φ is ε-correct if φ(r) = φ*(r) with probability at least 1 − ε over r ∼ p, where φ*(r) = 1(r ∈ R), and p(r) is the distribution over productions encountered by the CYK algorithm when computing f_φ(x) for a random sample x ∼ P; see Appendix A.1. That is, φ is ε-correct if it equals the ground truth φ* at least a 1 − ε fraction of the time. Similarly, we say f_φ is ε-correct if f_φ(x) = f*(x) with probability at least 1 − ε over x ∼ P, where f*(x) = 1(x ∈ L(C*)). Then, Lemma 1 bounds the error of f_φ in terms of the error of φ; we give a proof in Appendix A.2.
Bounded shift assumption. In addition, we need to ensure that the shift from R to R̃ does not induce too large a shift in the distribution over productions, i.e., the distribution p(r) of productions encountered while parsing x ∼ P using C* should not be too different from the distribution p̃(r) of productions encountered while parsing x using C̃*. To this end, we need to bound the degree to which the novel production r̃ affects p(r). In particular, we say r̃ is α-bounded if, for all i ≤ k < j, the probability over x ∼ P that B̃ derives σ_i...σ_k and D̃ derives σ_{k+1}...σ_j is at most α. In other words, when parsing using C*, the probability that r̃ = Ã → B̃D̃ would apply is bounded by α. Then:
Lemma 2. If r̃ is α-bounded, then Σ_{r∈R̃} |p(r) − p̃(r)| ≤ K³|R̃|α.
We give a proof in Appendix A.3.
Main result. Finally, our main result bounds the error of the modified model f_φ̃ on C̃*.

Experiments
In this section, we provide empirical evidence that our approach can perform few-shot novel concept learning. We use two existing datasets, SCAN and GeoQuery, which we have extended to our setting.

The HigherSCAN Dataset
Dataset. The SCAN dataset is a benchmark for evaluating systematicity in neural networks (Loula et al., 2018). We extend SCAN to include different categories of novel concepts. The SCAN benchmark tests whether models can learn to understand instructions involving a novel primitive action such as jump without having seen jump in any context during training. However, jump is a terminal concept, since it maps directly to the output token JUMP. We augment SCAN with higher-order novel concepts, where the novel concepts are programs composed of primitive concepts, and thus affect the structure of the output sequence. We consider the following novel concepts:
• Extended Quantification: We introduce the n-times input token, whose semantics are to repeat a given action n times.
• Composite Actions: We introduce a new input token whose denotation is a composite action, i.e., a sequence of primitive actions. For example, consider the novel concept jog = WALK RUN. The input instruction jog twice and run should have denotation WALK RUN WALK RUN RUN. We introduce several composite actions of varying complexity (i.e., length of the denotation).
The dataset including novel concepts is generated from the SCAN grammar augmented with these concepts. The modified training set consists of the original SCAN dataset, which does not include any of our novel concepts, along with a single example using the novel concept. The test set consists of examples that use our novel concepts. The original SCAN grammar generates 20910 unique examples. In HigherSCAN, we have 7706 new utterances corresponding to each novel composite action concept, and 11594 new utterances corresponding to each novel extended quantification concept. 2
Our approach. We first train the neural semantic parser, which has a sequence-to-sequence encoder-decoder architecture (same as the baseline described below), on the original SCAN dataset. Then, for each novel concept, we run SDCL (Algorithm 1) with this semantic parser and a single training example for that novel concept. To train the grammaticality model, we randomly substitute words of each type with those of different types in the original SCAN training examples to generate ungrammatical sentences, sampling a number of ungrammatical sentences equal to the number of SCAN training examples. Then, we train a one-layered LSTM (with 50 hidden units) to predict the probability that an instruction is grammatical.
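The negative-example generation for the grammaticality model can be sketched as follows; the tiny vocabulary and type assignments are illustrative assumptions:

```python
# Generate an ungrammatical sentence by swapping one word for a word of a
# different type (the label-0 examples for training the grammaticality model).
import random

def make_ungrammatical(tokens, type_of, words_by_type, rng):
    idx = rng.randrange(len(tokens))                      # position to corrupt
    wrong_types = [t for t in words_by_type if t != type_of[tokens[idx]]]
    corrupted = list(tokens)
    corrupted[idx] = rng.choice(words_by_type[rng.choice(wrong_types)])
    return corrupted

tokens = ["run", "twice"]
type_of = {"run": "action", "twice": "quantifier",
           "walk": "action", "thrice": "quantifier"}
words_by_type = {"action": ["walk"], "quantifier": ["thrice"]}
corrupted = make_ungrammatical(tokens, type_of, words_by_type,
                               random.Random(0))
```

Each corrupted sentence differs from the original in exactly one position, where a word of the wrong type has been inserted.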
Baseline. We compare to an end-to-end approach that uses a sequence-to-sequence neural network as the semantic parser, trained on the modified training set. We tried several architecture choices: LSTM cells vs. GRU cells, one vs. two layers, 100 vs. 200 hidden units, and with vs. without attention. We report results for the one-layered LSTM with 100 hidden units and with attention, which performed best on our validation set.
In addition, we compare to two variants of our approach SDCL: (i) one that uses oracle substitutions (i.e., the type of the novel concept is known), and (ii) one that uses confidence-based substitutions. For (ii), we try two approaches: (a) training a separate model to predict whether the substituted utterance x' is grammatical, and (b) simply using the confidence of the semantic parser, i.e., we take p_θ(x') to be the confidence of our semantic parser in its predicted program for x'. Also, we use the product of the confidences rather than the average, which we find works better.
Results. In Table 1, we compare the performance of our algorithm SDCL to the baselines for each of the concept categories. If we know the type of the novel concept (i.e., the oracle), then we are able to achieve near-perfect accuracy. Furthermore, using a separate model trained to distinguish grammatical sentences from ungrammatical ones is highly effective. Even the crude approach of using the parser confidence to determine the type of the novel concept significantly outperforms end-to-end learning on the one-shot concept learning task.
Next, we demonstrate that end-to-end models cannot perform well even with a significantly larger number of examples. In Fig. 2, we show the performance of a sequence-to-sequence encoder-decoder model on the extended quantification and composite actions novel concepts, as a function of the number of times the novel concept is seen during training. As can be seen, sequence-to-sequence models perform very poorly in the few-shot setting, and performance gradually improves as more examples of the novel concept are given.
One important observation is that both categories of novel concepts can make the output program longer compared to examples in the original training data, which poses a challenge for end-to-end sequence models, especially when the concept has been seen only in a few instructions during training. Poor length extrapolation has also been observed to cause poor generalization in other contexts.

The GeoQuery Dataset
Dataset. To evaluate our approach beyond the synthetic SCAN dataset, we consider a modification of the GeoQuery dataset (Zelle, 1996) that includes the extended quantification concepts. In particular, extended GeoQuery has concepts such as n-th longest/shortest/largest/smallest river/mountain/state, etc. (we also change the corresponding predicates in the logical forms to include an argument n). As an example, consider the question "What are the major cities in the smallest state in the US?", whose logical form contains the sub-expression const(C, countryid(USA)). Then, we run SDCL with our trained semantic parser and a single teaching example of the novel concept, averaging results over 5 different choices of this example.
Baselines. We compare to the end-to-end model from Jia and Liang (2016), which is a single-layer sequence-to-sequence encoder-decoder architecture with attention, with 200 hidden units, trained for 30 epochs using stochastic gradient descent (with a learning rate of 0.1, halved every 5 epochs starting from epoch 15). The baseline model is trained on the extended training set and the teaching example (repeated 24 times).
Results. Table 2 shows the accuracy of each approach on the extended quantification 4 times and 5 times concepts. As before, the end-to-end model is unable to learn the novel concepts from a single training example, whereas SDCL is able to learn the novel concepts with high accuracy. For this dataset, the grammaticality model for substitution is able to perfectly identify the correct type.

Conclusion
We have proposed a novel approach for few-shot novel concept learning in semantic parsing. Our approach, SDCL, leverages substitutions to infer a sketch of the target program, and then uses program synthesis to infer the sub-program corresponding to the novel concept. Thus, SDCL incorporates symbolic techniques that are able to learn from few examples into flexible end-to-end deep learning models. We have provided a theoretical analysis of how SDCL enables few-shot learning. Finally, we have empirically demonstrated that SDCL can learn novel concepts from a single example on two semantic parsing benchmarks, which we have extended to the novel concept learning setting.

A.1 Preliminaries
To facilitate our analysis, we let R^{i,j}_{φ,x} be the productions relevant to constructing V^{i,j}_{φ,x} from V^{i,k}_{φ,x} and V^{k+1,j}_{φ,x} (for i ≤ k < j), i.e., R^{i,i}_{φ,x} = ∅ and, for i < j, R^{i,j}_{φ,x} = {A → BD | B ∈ V^{i,k}_{φ,x} and D ∈ V^{k+1,j}_{φ,x} for some i ≤ k < j}. Then, the set of all productions relevant to parsing input x is R_{φ,x} = ∪_{i≤j} R^{i,j}_{φ,x}. In addition, we let π^{i,j}_{φ,x}(A) = 1(A ∈ V^{i,j}_{φ,x}), π^{i,j}_{φ,x}(r) = 1(r ∈ R^{i,j}_{φ,x}), and π_{φ,x}(r) = 1(r ∈ R_{φ,x}).
In general, we use X̃ to denote the variant of a quantity X defined using R̃ instead of R. Also, we omit φ when φ = φ*. For example, we have π̃_x(r) = 1(r ∈ R̃_{φ*,x}). Finally, the distribution over productions used is p(r) = E_{x∼P}[p(r | x)], where p(r | x) = Uniform(r; R_{φ,x}).
That is, we need to correctly predict all productions considered by the CYK algorithm to avoid an error.