Few-Shot Semantic Parsing for New Predicates

In this work, we investigate the problem of semantic parsing in a few-shot learning setting. In this setting, we are provided with k utterance-logical form pairs per new predicate. The state-of-the-art neural semantic parsers achieve less than 25% accuracy on benchmark datasets when k = 1. To tackle this problem, we propose to i) apply a designated meta-learning method to train the model; ii) regularize attention scores with alignment statistics; iii) apply a smoothing technique in pre-training. As a result, our method consistently outperforms all the baselines in both one- and two-shot settings.


Introduction
Semantic parsing is the task of mapping natural language (NL) utterances to structured meaning representations, such as logical forms (LFs). One key obstacle preventing the wide application of semantic parsing is the lack of task-specific training data. New tasks often require new predicates in LFs. Suppose a personal assistant (e.g. Alexa) is capable of booking flights. Due to new business requirements, it needs to book ground transport as well. A user could ask the assistant "How much does it cost to go from Atlanta downtown to the airport?". The corresponding LF is as follows:

(lambda $0 e (exists $1 (and (ground_transport $1) (to_city $1 atlanta:ci) (from_airport $1 atlanta:ci) (= (ground_fare $1) $0))))

where both ground_transport and ground_fare are new predicates, while the other predicates, such as to_city and from_airport, are already used in flight booking.

As manual construction of large parallel training data is expensive and time-consuming, we consider the few-shot formulation of the problem, which requires only a handful of utterance-LF training pairs for each new predicate. The cost of preparing few-shot training examples is low, thus the corresponding techniques permit significantly faster prototyping and development than supervised approaches for business expansions.

Semantic parsing in the few-shot setting is challenging. In our experiments, the accuracy of the state-of-the-art (SOTA) semantic parsers drops to less than 25% when there is only one example per new predicate in the training data. Moreover, the SOTA parsers achieve less than 32% accuracy on five widely used corpora when the LFs in the test sets do not share LF templates with the training sets (Finegan-Dollak et al., 2018). An LF template is derived by normalizing the entities and attribute values of an LF into typed variable names (Finegan-Dollak et al., 2018). The few-shot setting imposes two major challenges for SOTA neural semantic parsers.
First, there is insufficient data to learn effective representations for new predicates in a supervised manner. Second, new predicates bring in new LF templates, which are mixtures of known and new predicates. In contrast, the tasks (e.g. image classification) studied by prior work on few-shot learning (Snell et al., 2017; Finn et al., 2017) consider an instance as belonging exclusively to either a known class or a new class. Thus, it is non-trivial to apply conventional few-shot learning algorithms to generate LFs with mixed types of predicates.
To address the above challenges, we present ProtoParser, a transition-based neural semantic parser, which applies a sequence of parse actions to transduce an utterance into an LF template and then fills the corresponding slots. The parser is pre-trained on a training set with known predicates, followed by fine-tuning on a support set that contains few-shot examples of new predicates. It extends the attention-based sequence-to-sequence architecture (Sutskever et al., 2014) with the following novel techniques to alleviate the specific problems of the few-shot setting:

• Predicate-dropout. Predicate-dropout is a meta-learning technique to improve representation learning for both known and new predicates. We empirically found that known predicates are better represented with embeddings learned in a supervised manner, while new predicates are better initialized by a metric-based few-shot learning algorithm (Snell et al., 2017). In order to let the two types of embeddings work together in a single model, we devise a training procedure called predicate-dropout to simulate the testing scenario during pre-training.
• Attention regularization. In this work, new predicates appear approximately once or twice during training. Thus, there is insufficient data to learn reliable attention scores for those predicates in the sequence-to-sequence architecture. In the spirit of supervised attention (Liu et al., 2016), we propose to regularize the attention scores with alignment scores estimated from co-occurrence statistics and string similarity between words and predicates. The prior work on supervised attention is not applicable, because it requires either large parallel data (Liu et al., 2016) or significant manual effort (Bao et al., 2018; Rabinovich et al., 2017), or it is designed for applications other than semantic parsing (Liu et al., 2017; Kamigaito et al., 2017).
• Pre-training smoothing. The vocabulary of predicates in fine-tuning is larger than that in pre-training, which leads to a distribution discrepancy between the two training stages. Inspired by Laplace smoothing (Manning et al., 2008), we achieve a significant performance gain by applying a smoothing technique during pre-training to alleviate this discrepancy.
Our extensive experiments on three benchmark corpora show that ProtoParser outperforms the competitive baselines by a significant margin. The ablation study demonstrates the effectiveness of each proposed technique. The results are statistically significant with p ≤ 0.05 according to the Wilcoxon signed-rank test (Wilcoxon, 1992).

Related Work
Semantic parsing There is ample work on machine learning models for semantic parsing. The recent surveys (Kamath and Das, 2018; Zhu et al., 2019) cover a wide range of work in this area. The semantic formalisms of meaning representations range from lambda calculus (Montague, 1973) and SQL to abstract meaning representation (Banarescu et al., 2013). At the core of most recent models (Chen et al., 2018; Cheng et al., 2019; Lin et al., 2019; Zhang et al., 2019b) is SEQ2SEQ with attention (Bahdanau et al., 2014), which formulates the task as a machine translation problem. COARSE2FINE (Dong and Lapata, 2018) reports the highest accuracy on GEOQUERY (Zelle and Mooney, 1996) and ATIS (Price, 1990) in a supervised setting. IRNET (Guo et al., 2019) and RATSQL (Wang et al., 2019) are the two best performing models on the Text-to-SQL benchmark SPIDER. They are also designed to generalize to unseen database schemas. However, supervised models perform well only when there is sufficient training data.
Data Sparsity Most semantic parsing datasets are small in size. To address this issue, one line of research augments existing datasets with automatically generated data (Su and Yan, 2017; Jia and Liang, 2016; Cai and Yates, 2013). Another line of research exploits available resources, such as knowledge bases (Krishnamurthy et al., 2017; Herzig and Berant, 2018; Chang et al., 2019; Lee, 2019; Zhang et al., 2019a; Guo et al., 2019; Wang et al., 2019), semantic features in different domains (Dadashkarimi et al., 2018; Li et al., 2020), or unlabeled data (Kočiskỳ et al., 2016; Sun et al., 2019). Those works are orthogonal to our setting, because our approach aims to efficiently exploit a handful of labeled examples of new predicates, which are not limited to the ones in knowledge bases. Our setting also does not require humans in the loop, unlike active learning (Duong et al., 2018; Ni et al., 2019) and crowd-sourcing (Herzig and Berant, 2019). We assume the availability of resources different from the prior work and focus on the problems caused by new predicates. We develop an approach that generalizes to unseen LF templates consisting of both known and new predicates.
Few-Shot Learning Few-shot learning is a type of machine learning problem that provides only a handful of labeled training examples for a specific task. The survey (Zhu et al., 2019) gives a comprehensive overview of the data, models, and algorithms proposed for this type of problem. It categorizes the models into multitask learning (Hu et al., 2018), embedding learning (Snell et al., 2017; Vinyals et al., 2016), learning with external memory (Lee and Choi, 2018; Sukhbaatar et al., 2015), and generative modeling (Reed et al., 2017), in terms of what prior knowledge is used. Huang et al. (2018) tackle the problem of poor generalization across SQL templates for SQL query generation in the one-shot learning setting. In their setting, they assume all the SQL templates in the test set are shared with the templates in the support set. In contrast, we assume only the sharing of new predicates between a support set and a test set. In our one-shot setting, only around 10% of the LF templates in the test set are shared with the ones in the support set of the GEOQUERY dataset.

Semantic Parser
ProtoParser follows the SOTA neural semantic parsers (Dong and Lapata, 2018; Guo et al., 2019) in mapping an utterance into an LF in two steps: template generation and slot filling. It implements a designated transition system to generate templates, followed by filling the slot variables with values extracted from utterances. To address the challenges in the few-shot setting, we propose three training methods, detailed in Sec. 4.
Many LFs differ only in the mentioned atoms, such as entities and attribute values. An LF template is created by replacing the atoms in LFs with typed slot variables. As an example, the LF template of our example in Sec. 1 is created by substituting i) a typed atom variable v_e for the entity "atlanta:ci"; ii) a shared variable name v_a for the variables "$0" and "$1".
(lambda v_a e (exists v_a (and (ground_transport v_a) (to_city v_a v_e) (from_airport v_a v_e) (= (ground_fare v_a) v_a))))

Formally, let x = {x_1, ..., x_n} denote an NL utterance, and let its LF be represented as a semantic tree y = (V, E), where V = {v_1, ..., v_m} denotes the node set with v_i ∈ V, and E ⊆ V × V is its edge set. The node set V = V_p ∪ V_v is further divided into a template predicate set V_p and a slot value set V_v. A template predicate node represents a predicate symbol or a term, while a slot value node represents an atom mentioned in utterances. Thus, a semantic tree y is composed of an abstract tree τ_y representing a template and a set of slot value nodes V_{v,y} attached to the abstract tree.
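As a rough illustration, the atom-to-slot substitution above can be sketched with two regular-expression rewrites. The patterns are assumptions about the LF surface form (typed atoms written as name:type, lambda variables written as $0, $1, ...), not the exact implementation used in the paper.

```python
import re

def lf_to_template(lf: str) -> str:
    """Normalize an LF into a template: replace typed atoms with v_e
    and all lambda variables with a shared name v_a (a sketch of the
    substitution described in the text, not the authors' code)."""
    # Replace typed entity atoms (e.g. "atlanta:ci") with v_e.
    template = re.sub(r"\b\w+:\w+\b", "v_e", lf)
    # Replace all lambda variables ($0, $1, ...) with the shared name v_a.
    template = re.sub(r"\$\d+", "v_a", template)
    return template

lf = ("(lambda $0 e (exists $1 (and (ground_transport $1) "
      "(to_city $1 atlanta:ci) (from_airport $1 atlanta:ci) "
      "(= (ground_fare $1) $0))))")
print(lf_to_template(lf))
```

Applied to the running example, this yields exactly the template shown above.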
In the few-shot setting, we are provided with a train set D_train, a support set D_s, and a test set D_test. Each example in any of those sets is an utterance-LF pair. The support set contains a handful of examples per new predicate, and each new predicate also appears in the test set. The goal is to maximize the accuracy of estimating LFs given utterances in D_test by using a parser trained on D_train ∪ D_s.

Transition System
We apply the transition system (Cheng et al., 2019) to perform a sequence of transition actions to generate the template of a semantic tree. The transition system maintains partially-constructed outputs using a stack. The parser starts with an empty stack. At each step, it performs one of the following transition actions to update the parsing state and generate a tree node. The process repeats until the stack contains a complete tree.
• GEN [y] creates a new leaf node y and pushes it on top of the stack.
• REDUCE [r]. The reduce action identifies an implication rule head :- body. The rule body is first popped from the stack. A new subtree is formed by attaching the rule head as a new parent node to the rule body. Then the whole subtree is pushed back onto the stack. Table 1 shows such an action sequence for generating the above LF template. Each action produces known or new predicates.
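The two actions can be sketched as a minimal stack machine. Representing tree nodes as nested tuples and attaching a fixed arity to each REDUCE are simplifying assumptions for illustration; the paper's transition system operates over typed grammar rules.

```python
def execute(actions):
    """Minimal sketch of the transition system: GEN pushes a leaf node;
    REDUCE pops the rule body from the stack, attaches the rule head as
    a new parent node, and pushes the subtree back. The process ends
    with a single complete tree on the stack."""
    stack = []
    for kind, payload in actions:
        if kind == "GEN":
            stack.append(payload)         # push a new leaf node
        elif kind == "REDUCE":
            head, arity = payload
            body = stack[-arity:]         # pop the rule body
            del stack[-arity:]
            stack.append((head, *body))   # push the new subtree
    assert len(stack) == 1, "a complete tree remains on the stack"
    return stack[0]

# e.g. build the subtree (to_city v_a v_e)
tree = execute([("GEN", "v_a"), ("GEN", "v_e"),
                ("REDUCE", ("to_city", 2))])
print(tree)  # ('to_city', 'v_a', 'v_e')
```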

Base Parser
ProtoParser generates an LF in two steps: i) template generation; ii) slot filling. The base architecture largely resembles that of Cheng et al. (2019).
Template Generation Given an utterance, the task is to generate a sequence of actions a = a_1, ..., a_k to build an abstract tree τ_y. We found that LFs often contain idioms, which are frequent subtrees shared across LF templates. Thus we apply a template normalization procedure similar to that of (Iyer et al., 2019) to preprocess all LF templates. It collapses idioms into single units such that all LF templates are converted into a compact form. The neural transition system consists of an encoder and a decoder for estimating action probabilities.
Encoder We apply a bidirectional Long Short-Term Memory (LSTM) network (Gers et al., 1999) to map a sequence of n words into a sequence of contextual word representations {e_i}_{i=1}^n.

Template Decoder The decoder applies a stack-LSTM (Dyer et al., 2015) to generate action sequences. A stack-LSTM is a unidirectional LSTM augmented with a pointer. The pointer points to a particular hidden state of the LSTM, which represents a particular state of the stack. It moves to a different hidden state to indicate a different state of the stack.
At time t, the stack-LSTM produces a hidden state h_t^d = stack-LSTM(µ_t, h_{t-1}^d), where µ_t is the concatenation of the embedding c_{a_{t-1}} of the action estimated at time t − 1 and the representation h_{t-1}^y of the partial tree generated by the history of actions up to time t − 1.
As a common practice, h_t^d is concatenated with an attended representation h_t^a over the encoder hidden states: h_t = tanh(W [h_t^d ; h_t^a]), where W is a weight matrix and h_t^a = Σ_i P(e_i | h_t^d) e_i is created by soft attention. We apply the dot product to compute the normalized attention scores P(e_i | h_t^d) (Luong et al., 2015). Supervised attention (Rabinovich et al., 2017) is also applied to facilitate the learning of attention weights. Given h_t, the probability of an action is estimated by

P(a_t | h_t) = exp(c_{a_t} · h_t) / Σ_{a' ∈ A_t} exp(c_{a'} · h_t)   (3)

where c_a denotes the embedding of action a, and A_t denotes the set of applicable actions at time t. The initialization of these embeddings will be explained in the following section.
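The attention and action-probability computations can be sketched in a few lines. Plain Python lists stand in for tensors, and the tanh(W[h^d; h^a]) projection is omitted for brevity; this is an illustrative sketch, not the paper's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attended_state(h_d, enc_states):
    """Dot-product attention (Luong et al., 2015):
    P(e_i | h^d_t) ∝ exp(e_i · h^d_t), and h^a_t = Σ_i P(e_i | h^d_t) e_i."""
    dots = [sum(a * b for a, b in zip(e, h_d)) for e in enc_states]
    probs = softmax(dots)
    h_a = [sum(p * e[k] for p, e in zip(probs, enc_states))
           for k in range(len(h_d))]
    return probs, h_a

def action_probs(h_t, action_embs):
    """Eq. (3): P(a | h_t) ∝ exp(c_a · h_t), normalized over the
    applicable action set A_t (here, all rows of action_embs)."""
    return softmax([sum(c * h for c, h in zip(c_a, h_t))
                    for c_a in action_embs])
```

Both distributions sum to one over their respective candidate sets, which is what the pre-training smoothing in Sec. 4 later perturbs.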
Slot Filling A tree node in a semantic tree may contain more than one slot variable due to template normalization. Since there are two types of slot variables, given a tree node with slot variables, we employ an LSTM-based decoder with the same architecture as the template decoder to fill each type of slot variable, respectively. The output of such a decoder is a value sequence of the same length as the number of slot variables of that type in the given tree node.

Few-Shot Model Training
We take two steps to train our model: i) pre-training on the train set; ii) fine-tuning on the support set. Its predictive performance is measured on the test set. We take the two-step approach because i) our experiments show that it performs better than training on the union of the train set and the support set; ii) for any new support set, it is computationally more efficient than training from scratch on the union of the train set and the support set.
Because of the distribution discrepancy between the train set and the support set caused by new predicates, meta-learning algorithms (Snell et al., 2017; Finn et al., 2017) suggest simulating the testing scenario during pre-training by splitting each batch into a meta-support set and a meta-test set. The models utilize the information (e.g. prototype vectors) acquired from the meta-support set to minimize errors on the meta-test set. In this way, the meta-support and meta-test sets simulate the support and test sets sharing new predicates.
However, we cannot directly apply such a training procedure, for two reasons. First, each LF in the support and test sets is a mixture of both known and new predicates. To simulate the support and test sets, the meta-support and meta-test sets should include both types of predicates as well; we cannot assume that there is only one type of predicate. Second, our preliminary experiments show that when there is sufficient training data, it is better to train the action embeddings c of known predicates (Eq. (3)) in a supervised way, while action embeddings initialized by a metric-based meta-learning algorithm (Snell et al., 2017) perform better for rarely occurring new predicates. Therefore, we cope with the differences between known and new predicates by using a customized initialization method in fine-tuning and a designated pre-training procedure that mimics fine-tuning on the train set. In the following, we introduce fine-tuning first because it helps in understanding our pre-training procedure.

Fine-tuning
During fine-tuning, the model parameters and the action embeddings in Eq. (3) for known predicates are obtained from the pre-trained model. The embeddings c_{a_t} of actions that produce new predicates are initialized using prototype vectors, as in prototypical networks (Snell et al., 2017). The prototype representations act as a type of regularization, sharing a similar idea with deep learning techniques based on pre-trained models.
A prototype vector of an action a_t is constructed by using the hidden states of the template decoder collected at the time of predicting a_t on a support set. Following (Snell et al., 2017), a prototype vector is built by taking the mean of such a set of hidden states h_t:

c_{a_t} = (1 / |M|) Σ_{h_t ∈ M} h_t   (4)

where M denotes the set of all hidden states at the time of applying the action a_t. After initialization, the model parameters and the action embeddings are further improved by fine-tuning the model on the support set with a supervised training objective L_f.
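Eq. (4) is a plain mean over collected decoder states; a minimal sketch with list-based vectors:

```python
def prototype(hidden_states):
    """Prototype vector for an action a_t (Eq. 4): the mean of the
    template-decoder hidden states collected when a_t is predicted
    on the support set (Snell et al., 2017). hidden_states is a
    non-empty list of equal-length vectors."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[k] for h in hidden_states) / n for k in range(dim)]

# two hidden states collected at the time of applying a_t
c_at = prototype([[1.0, 3.0], [3.0, 5.0]])
print(c_at)  # [2.0, 4.0]
```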
The objective is L_f = L_s + λΩ, where L_s is the cross-entropy loss and Ω is an attention regularization term explained below. The degree of regularization is adjusted by λ ∈ R^+.

Attention Regularization
We address the poorly learned attention scores P(e_i | h_t^d) of infrequent actions by introducing a novel attention regularization. We observe that the conditional probability P(a_j | x_i) = count(a_j, x_i) / count(x_i) and the character similarity between the predicates generated by action a_j and the token x_i are often strong indicators of their alignment. The indicators can be further strengthened by manually annotating the predicates with their corresponding natural language tokens. In our work, we adopt char_sim(a_j, x_i) = 1 − dist(a_j, x_i) as the character similarity, where dist(a_j, x_i) is the normalized Levenshtein distance (Levenshtein, 1966). Both measures are in the range [0, 1], thus we compute alignment scores by g(a_j, x_i) = σ(w_p h_t^d) P(a_j | x_i) + (1 − σ(w_p h_t^d)) char_sim(a_j, x_i), where the sigmoid function σ(w_p h_t^d) combines the two constant measures into a single score. The corresponding normalized alignment score is given by P̃(x_i | a_j) = g(a_j, x_i) / Σ_{i'} g(a_j, x_{i'}). The attention scores P(x_i | a_j) should be similar to P̃(x_i | a_j). Thus, we define the regularization term as Ω = Σ_{ij} |P(x_i | a_j) − P̃(x_i | a_j)| during training.
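A hedged sketch of the alignment estimate and the regularization term follows. The co-occurrence counts, the similarity matrix, and the scalar `gate` (standing in for the learned σ(w_p h_t^d)) are illustrative inputs, not the paper's exact data structures.

```python
def alignment_scores(counts, char_sim, gate):
    """Alignment estimate of the text: counts[j][i] is the
    co-occurrence count of action a_j with token x_i, so that
    P(a_j | x_i) = count(a_j, x_i) / count(x_i). The two measures
    are mixed by `gate` and row-normalized into P~(x_i | a_j)."""
    n_tok = len(counts[0])
    count_x = [sum(row[i] for row in counts) for i in range(n_tok)]
    aligned = []
    for j, row in enumerate(counts):
        g = [gate * (row[i] / count_x[i])
             + (1 - gate) * char_sim[j][i] for i in range(n_tok)]
        z = sum(g)
        aligned.append([v / z for v in g])
    return aligned

def attention_reg(attn, aligned):
    """Omega = sum_ij |P(x_i | a_j) - P~(x_i | a_j)|."""
    return sum(abs(p - q)
               for row_p, row_q in zip(attn, aligned)
               for p, q in zip(row_p, row_q))
```

When the model's attention matches the alignment statistics exactly, the penalty is zero; it grows linearly with any deviation.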

Pre-training
The pre-training objectives are two-fold: i) learn action embeddings for known predicates in a supervised way; ii) ensure our model can quickly adapt to the actions of new predicates, whose embeddings are initialized by prototype vectors before fine-tuning.
Predicate-dropout Starting with randomly initialized model parameters, we alternately use one batch for the meta-loss L_m and one batch for optimizing the supervised loss L_s. In a batch for L_m, we split the data into a meta-support set and a meta-test set. In order to simulate the existence of new predicates, we randomly select a subset of predicates as "new", and replace their action embeddings c with prototype vectors constructed by applying Eq. (4) over the meta-support set. The actions of the remaining predicates keep the embeddings learned from previous batches. The resulting action embedding matrix C is the combination of both:
C = m ⊙ C_m + (1 − m) ⊙ C_s   (7)

where C_s is the embedding matrix learned in a supervised way, and C_m is constructed by using prototype vectors on the meta-support set. The mask vector m is generated by setting the indices of actions of the "new" predicates to ones and the others to zeros. We refer to this operation as predicate-dropout. The training algorithm for the meta-loss is summarized in Algorithm 1.

Algorithm 1: Training with the meta-loss
  for each sampled template t do
    Sample a meta-support example s with template t from D without replacement
    Sample a meta-test set Q' of size n with template t from D
    S = S ∪ {s}; Q = Q ∪ Q'
  end
  Build a prototype matrix C_m on S
  Extract a predicate set P from S
  Sample a subset P_s of size r × |P| from P as new predicates
  Build a mask m using P_s
  With C_s, C_m and m, apply Eq. (7) to compute C
  Compute L_m, the cross-entropy on Q with C
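The masked combination of the two embedding matrices in Eq. (7) can be sketched as follows; rows of C index actions, and the set of "new" actions is passed in explicitly (in training it is randomly sampled per batch).

```python
def predicate_dropout(C_s, C_m, new_actions):
    """Eq. (7): C = m ⊙ C_m + (1 − m) ⊙ C_s. Rows for actions of
    predicates sampled as "new" take their prototype embedding from
    C_m; all other rows keep the supervised embedding from C_s."""
    m = [1.0 if a in new_actions else 0.0 for a in range(len(C_s))]
    return [[mi * cm + (1 - mi) * cs for cm, cs in zip(row_m, row_s)]
            for mi, row_m, row_s in zip(m, C_m, C_s)]

C_s = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # supervised embeddings
C_m = [[9.0, 9.0], [8.0, 8.0], [7.0, 7.0]]  # prototype embeddings
C = predicate_dropout(C_s, C_m, {1})        # action 1 sampled as "new"
print(C)  # [[1.0, 1.0], [8.0, 8.0], [3.0, 3.0]]
```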
In a batch for L_s, we update the model parameters and all action embeddings with the cross-entropy loss L_s, together with the attention regularization. Thus, the overall training objective combines L_m, L_s, and the attention regularization term Ω.

Pre-training smoothing Due to the new predicates, the number of candidate actions during prediction in fine-tuning and testing is larger than during pre-training. That leads to a distribution discrepancy between pre-training and testing.
To minimize this difference, we assume prior knowledge of the number of actions for new predicates by adding a constant k to the denominator of Eq. (3) when estimating the action probability P(a_t | h_t) during pre-training.
We do not apply this smoothing technique during fine-tuning and testing. Despite its simplicity, the experimental results show a significant performance gain on the benchmark datasets.
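The smoothing amounts to one extra constant in the softmax denominator of Eq. (3); a minimal sketch (plain lists stand in for tensors, and the specific vectors are illustrative):

```python
import math

def action_prob_smoothed(h_t, action_embs, k=0):
    """Action distribution of Eq. (3) with pre-training smoothing:
    a constant k is added to the softmax denominator to stand in for
    the probability mass of the as-yet-unseen actions of new
    predicates. k > 0 only during pre-training; k = 0 recovers the
    ordinary softmax used in fine-tuning and testing."""
    exps = [math.exp(sum(c * h for c, h in zip(c_a, h_t)))
            for c_a in action_embs]
    z = sum(exps) + k
    return [e / z for e in exps]
```

With k = 0 the probabilities sum to one; with k > 0 some probability mass is reserved, deflating every known action's score, which mimics the larger action set seen at test time.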

Experiments
Datasets. We use three semantic parsing datasets: JOBS, GEOQUERY, and ATIS. JOBS contains 640 question-LF pairs in Prolog about job listings. GEOQUERY (Zelle and Mooney, 1996) and ATIS (Price, 1990) include 880 and 5,410 utterance-LF pairs in lambda calculus about US geography and flight booking, respectively. The numbers of predicates in JOBS, GEOQUERY, and ATIS are 15, 24, and 88, respectively. All atoms in the datasets are anonymized as in (Dong and Lapata, 2016).
For each dataset, we randomly selected m predicates as the new predicates: 3 for JOBS, and 5 for GEOQUERY and ATIS. Then we split each dataset into a train set and an evaluation set, removing the instances whose template is unique within the dataset. The number of such instances is around 100, 150, and 600 in JOBS, GEOQUERY, and ATIS, respectively. The ratios between the evaluation set and the train set are 1:4, 2:5, and 1:7 in JOBS, GEOQUERY, and ATIS, respectively. Each LF in an evaluation set contains at least one new predicate, while an LF in a train set contains only known predicates. To evaluate k-shot learning, we build a support set by randomly sampling k pairs per new predicate without replacement from an evaluation set, and keep the remaining pairs as the test set. To avoid evaluation bias caused by randomness, we repeat the above process six times to build six different splits of support and test sets from each evaluation set: one for hyperparameter tuning and the rest for evaluation. We consider at most 2-shot learning due to the limited number of instances per new predicate in each evaluation set.
Training Details. We pre-train our parser on the training sets for {80, 100} epochs with the Adam optimizer (Kingma and Ba, 2014). The batch size is fixed to 64. The initial learning rate is 0.0025, and the weights are decayed after 20 epochs with decay rate 0.985. The predicate-dropout rate is 0.5. The smoothing term is set to {3, 6}. The number of meta-support examples is 30, and the number of meta-test examples per support example is 15. The coefficient of attention regularization is set to 0.01 on JOBS and 1 on the other datasets. We employ 200-dimensional GloVe embeddings (Pennington et al., 2014) to initialize the word embeddings for utterances. The hidden state size of all LSTM models (Hochreiter and Schmidhuber, 1997) is 256. During fine-tuning, the batch size is 2, and the learning rates and epochs are selected from {0.001, 0.0005} and {20, 30, 40, 60, 120}, respectively.

Baselines. We compare our method with five competitive baselines: SEQ2SEQ with attention (Luong et al., 2015), COARSE2FINE (Dong and Lapata, 2018), IRNET (Guo et al., 2019), PT-MAML (Huang et al., 2018), and DA (Li et al., 2020). COARSE2FINE is the best performing supervised model on the standard splits of the GEOQUERY and ATIS datasets. PT-MAML is a few-shot learning semantic parser that adopts Model-Agnostic Meta-Learning (Finn et al., 2017); we adapt PT-MAML to our scenario by treating a group of instances that share the same template as a pseudo-task. DA is the most recently proposed neural semantic parser applying domain adaptation techniques. IRNET is the strongest semantic parser that can generalize to unseen database schemas. In our case, we treat the list of predicates in support sets as the columns of a new database schema and incorporate the schema encoding module of IRNET into the encoder of our base parser. We choose IRNET over RATSQL (Wang et al., 2019) because IRNET achieves superior performance on our datasets. We consider three different supervised learning settings.
First, we pre-train a model on a train set, followed by fine-tuning it on the corresponding support set, coined pt. Second, a model is trained on the combination of a train set and a support set, coined cb. Third, the support set in cb is oversampled 10 times and 5 times for one-shot and two-shot, respectively, coined os.
Evaluation Details. Following prior work (Dong and Lapata, 2018; Li et al., 2020), we report the accuracy of exactly matched LFs as the main evaluation metric.
To investigate whether the results are statistically significant, we conducted the Wilcoxon signed-rank test, which assesses whether our model consistently performs better than another baseline across all evaluation sets. It is considered superior to the t-test in our case, because it supports comparison across different support sets and does not assume normality of the data (Demšar, 2006). We include the corresponding p-values in our result tables.

Table 2 shows the average accuracies and significance test results of all parsers compared on all three datasets. Overall, ProtoParser outperforms all baselines by at least 2% on average in terms of accuracy in both the one-shot and two-shot settings. The results are statistically significant w.r.t. the strongest baselines, IRNET (cb) and COARSE2FINE (pt), with p-values of 0.00276 and 0.000148, respectively. Given one-shot examples on JOBS, our parser achieves 7% higher accuracy than the best baseline, and the gap is 4% on GEOQUERY with two-shot examples. In addition, none of the SOTA baseline parsers consistently outperforms the others when there is little parallel data for new predicates. In the one-shot setting, the best supervised baseline, IRNET (cb), achieves the best results among all baselines on GEOQUERY and JOBS, whereas in the two-shot setting it performs best only on GEOQUERY. It is also difficult to achieve good performance by adapting existing meta-learning or transfer learning algorithms to our problem, as evidenced by the moderate performance of PT-MAML and DA on all datasets.

Results and Discussion
The few-shot results demonstrate the challenges imposed by infrequent predicates. There are significant proportions of infrequent predicates in the existing datasets. For example, in GEOQUERY, 10 predicates contribute only 4% of the total frequency of all 24 predicates, while the top two most frequent predicates amount to 42%. As a result, the SOTA parsers achieve less than 25% and 44% accuracy with one-shot and two-shot examples, respectively. In contrast, those parsers achieve more than 84% accuracy on the standard splits of the same datasets in the supervised setting. Infrequent predicates in semantic parsing can also be viewed as a class imbalance problem when support sets and train sets are combined in a certain manner. In this work, the ratio between the support set and the train set in JOBS, GEOQUERY, and ATIS is 1:130, 1:100, and 1:1000, respectively. Different models prefer different ways of using the train and support sets. The best option for COARSE2FINE and SEQ2SEQ is to pre-train on a train set followed by fine-tuning on the corresponding support set, while IRNET favors oversampling in the two-shot setting.

Ablation Study
We examine the effect of different components of our parser by removing each of them individually and reporting the corresponding average accuracy. As shown in Table 3, removing any of the components almost always leads to a statistically significant drop in performance. The corresponding p-values are all less than 0.00327.
To investigate predicate-dropout, we exclude either the supervised loss during pre-training (-sup) or the initialization of new predicate embeddings with prototype vectors before fine-tuning (-proto). It is clear from Table 3 that ablating either the supervised action embeddings or the prototype vectors hurts performance severely.
We further study the efficacy of attention regularization by removing it completely (-reg), removing only the string similarity feature (-strsim), or removing only the conditional probability feature (-cond). Removing the regularization completely degrades performance sharply, except on JOBS in the one-shot setting. Our further inspection shows that model learning is easier on JOBS than on the other two datasets: each predicate in JOBS almost always aligns to the same words in input utterances. The performance drop with -strsim and -cond indicates that we cannot rely on a single statistical measure for regularization. For instance, predicates do not always take the same string form as the corresponding words in input utterances. In fact, the proportion of predicates present in input utterances is only 42%, 38%, and 44% on JOBS, ATIS, and GEOQUERY, respectively. Furthermore, without pre-training smoothing (-smooth), the accuracy drops by at least 1.6% in terms of mean accuracy on all datasets. Smoothing enables better model parameter training through more accurate modelling in pre-training.

Support Set Analysis
We observe that all models consistently achieve high accuracy on certain support sets of the same dataset, while obtaining low accuracies on the others. We illustrate the reasons for this effect by plotting the evaluation set of GEOQUERY. Each data point in Figure 1 depicts a representation generated by the encoder of our parser after pre-training. We applied t-SNE (Maaten and Hinton, 2008) for dimensionality reduction, and highlight two support sets used in the one-shot setting on GEOQUERY. The examples in the highest performing support set tend to scatter evenly and cover different dense regions of the feature space, while the examples in the lowest performing support set are far from a significant number of dense regions. Thus, the examples in good support sets are more representative of the underlying distribution than the ones in poor support sets. When we leave out each example in the highest performing support set and re-evaluate our parser each time, we observe that the good ones (e.g. the green box in Figure 1) lie either in or close to some of the dense regions.

Conclusion and Future Work
We propose a novel few-shot learning based semantic parser, coined ProtoParser, to cope with new predicates in LFs. To address the challenges in few-shot learning, we train the parser with a pre-training procedure involving predicate-dropout, attention regularization, and pre-training smoothing. The resulting model achieves superior results over competitive baselines on three benchmark datasets.