On The Ingredients of an Effective Zero-shot Semantic Parser

Semantic parsers map natural language utterances into meaning representations (e.g., programs). Such models are typically bottlenecked by the paucity of training data due to the required laborious annotation efforts. Recent studies have performed zero-shot learning by synthesizing training examples of canonical utterances and programs from a grammar, and further paraphrasing these utterances to improve linguistic diversity. However, such synthetic examples cannot fully capture patterns in real data. In this paper we analyze zero-shot parsers through the lenses of the language and logical gaps (Herzig and Berant, 2019), which quantify the discrepancy of language and programmatic patterns between the canonical examples and real-world user-issued ones. We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods over canonical examples that most likely reflect real user intents. Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.


Introduction
Semantic parsers translate natural language (NL) utterances into formal meaning representations. In particular, task-oriented semantic parsers map user-issued utterances (e.g., Find papers in ACL) into machine-executable programs (e.g., a database query), playing a key role in providing natural language interfaces to applications like conversational virtual assistants (Gupta et al., 2018; Andreas et al., 2020), robot instruction following (Artzi and Zettlemoyer, 2013; Fried et al., 2018), as well as querying databases (Li and Jagadish, 2014; Yu et al., 2018) or generating Python code (Yin and Neubig, 2017).
Learning semantic parsers typically requires parallel data of utterances annotated with programs, which demands significant expertise and cost (Berant et al., 2013). Thus, the field has explored alternative approaches using supervision that is cheaper to acquire, such as execution results (Clarke et al., 2010) or unlabeled utterances (Poon, 2013). In particular, the seminal OVERNIGHT approach (Wang et al., 2015) synthesizes parallel data by using a synchronous grammar to align programs and their canonical NL expressions (e.g., Filter(paper, venue=?) ↔ papers in ? and acl ↔ ACL), and then generating examples of compositional utterances (e.g., Papers in ACL) paired with programs (e.g., Filter(paper, venue=acl)). The synthesized utterances are paraphrased by annotators, a much easier task than writing programs.
Recently, Xu et al. (2020b) build upon OVERNIGHT and develop a zero-shot semantic parser, replacing the manual paraphrasing process with an automatic paraphrase generator (§2). While promising, several challenges remain open. First, such systems are not truly zero-shot: they still require labeled validation data (e.g., to select the best checkpoint during training). Next, to ensure the quality and broad coverage of synthetic canonical examples, these models rely on heavily curated grammars (e.g., with 800 production rules), which are cumbersome to maintain. More importantly, as suggested by Herzig and Berant (2019), who study OVERNIGHT models using manual paraphrases, systems trained on synthetic samples suffer from fundamental mismatches between the distributions of the automatically generated examples and the natural ones issued by real users. Specifically, there are two types of gaps. First, there is a logical gap between the synthetic and real programs, as real utterances (e.g., Paper coauthored by Peter and Jane) may exhibit logic patterns outside of those covered by the grammar (e.g., Paper by Jane). The second is the language gap between the synthetic and real utterances: paraphrased utterances (e.g., u′1 in Fig. 1) still follow linguistic patterns similar to the canonical ones they are paraphrased from (e.g., u1), while user-issued utterances are more linguistically diverse (e.g., u2).
In this paper we analyze zero-shot parsers through the lenses of language and logical gaps, and propose methods to close those gaps (§3). Specifically, we attempt to bridge the language gap using stronger paraphrasers and more expressive grammars tailored to domain-specific idiomatic language patterns. We replace the large grammars of previous work with a highly compact grammar of only 46 domain-general production rules, plus a small set of domain-specific productions that capture idiomatic language patterns (e.g., u2 in Fig. 1, §3.1.1). We demonstrate that models equipped with such a smaller but more expressive grammar catered to the domain can generate utterances with more idiomatic and diverse language styles.
On the other hand, closing the logical gap is non-trivial: canonical examples are generated by exhaustively enumerating all possible programs from the grammar up to a certain depth, and increasing this threshold to cover more complex real-world examples leads to exponentially more canonical samples, which is computationally intractable to use. To tackle the exponentially exploding sample space, we propose an efficient sampling approach that retains the canonical samples most likely to appear in real data (§3.1.2). Specifically, we approximate the likelihood of canonical examples using the probabilities of their utterances measured by pre-trained language models (LMs). This enables us to improve the logical coverage of programs while maintaining a tractable number of highly probable examples as training data.
In experiments, we show that by bridging the language and logical gaps, our system achieves strong results on two datasets featuring realistic utterances (SCHOLAR and GEO). Despite using zero annotated data for training and validation, our model outperforms other supervised methods like OVERNIGHT and GRANNO (Herzig and Berant, 2019) that require manual annotation. Analysis shows that current models are far from perfect, suggesting that the logical gap remains an issue, while stronger paraphrasers are needed to further close the language gap.

Zero-shot Semantic Parsing via Data Synthesis
Problem Definition  Semantic parsers translate a user-issued NL utterance u into a machine-executable program z (Fig. 1). We consider a zero-shot learning setting without access to parallel data in the target domain. Instead, the system is trained on a collection of machine-synthesized examples.
Overview  Our system is inspired by the existing zero-shot parser of Xu et al. (2020b). Fig. 1 illustrates our framework. Intuitively, we automatically create training examples with canonical utterances from a grammar, which are then paraphrased to increase diversity in language style. Specifically, there are two stages. First, a set of seed canonical examples (Fig. 1b) is generated from a synchronous grammar, which defines compositional rules for NL expressions to form utterances (Fig. 1a).
Next, in the iterative training stage, a paraphrase generation model rewrites the canonical utterances into more natural and linguistically diverse alternatives (Fig. 1c). The paraphrased examples are then used to train a semantic parser. To mitigate noisy paraphrases, a filtering model, which is the parser trained on previous iterations, rejects paraphrases that are potentially incorrect. This step of paraphrasing and training can proceed for multiple iterations, with the parser trained on a dataset with growing diversity of language styles.
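The loop can be summarized with the following minimal sketch; the Paraphraser/Parser interfaces (generate, train, predict) are illustrative stand-ins rather than the actual implementation.

def iterative_paraphrase_and_train(canonical_data, paraphraser, parser, num_iterations=2):
    """Minimal sketch of the iterative paraphrasing/training loop (assumed interfaces)."""
    train_data = list(canonical_data)          # seed canonical examples (Fig. 1b)
    parser = parser.train(train_data)          # initial parser trained on canonical data
    for _ in range(num_iterations):
        accepted = []
        for utterance, program in train_data:
            for paraphrase in paraphraser.generate(utterance):
                # Filtering model: keep a paraphrase only if the parser from the
                # previous iteration still predicts its original program.
                if parser.predict(paraphrase) == program:
                    accepted.append((paraphrase, program))
        train_data = train_data + accepted     # growing diversity of language styles
        parser = parser.train(train_data)      # retrain the semantic parser
    return parser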
Synchronous Grammar  Seed canonical examples are generated from a synchronous context-free grammar (SCFG). Fig. 1a lists simplified production rules in the grammar. Intuitively, productions specify how utterances are composed from lower-level language constructs and domain lexicons. For instance, given a database entity allan_turing with a property citations, u3 in Fig. 1 could be generated using r1. Productions can be applied recursively to derive more compositional utterances (e.g., u2 using r2, r4 and r6). Our SCFG is based on Herzig and Berant (2019); a toy illustration of this synchronous generation process is sketched at the end of this section.

Paraphrase Generation and Filtering  The paraphrase generation model rewrites a canonical utterance u into more natural and diverse alternatives u′. u′ is then paired with u's program to create a new example. We fine-tune a BART model on the dataset by Krishna et al. (2020), a subset of the PARANMT corpus (Wieting and Gimpel, 2018) that contains lexically and syntactically diverse paraphrases. The model therefore learns to produce paraphrases with a variety of linguistic patterns, which is essential for closing the language gap when paraphrasing from canonical utterances (§4). Still, some paraphrases are noisy or potentially vague (marked in Fig. 1c). We follow Xu et al. (2020b) and use the parser trained on previous iterations as the filtering model, rejecting paraphrases for which the parser cannot predict their programs.
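As a toy, purely hypothetical illustration of synchronous generation (our actual grammar uses λ-calculus semantic functions rather than string templates), each production below pairs an NL template with a program template, and nonterminal slots are expanded recursively:

import itertools

# Toy synchronous rules: "$" tokens are nonterminals; every production pairs an
# NL template with a program template. This is an illustration, not our grammar.
RULES = {
    "$EntSet": [("$Type whose $Rel is $Ent", "filter($Type, $Rel=$Ent)")],
    "$Type":   [("paper", "paper")],
    "$Rel":    [("venue", "venue"), ("topic", "topic")],
    "$Ent":    [("ACL", "acl"), ("deep learning", "deep_learning")],
}

def expand(symbol):
    """Yield all (canonical utterance, program) pairs derivable from `symbol`."""
    for nl_tmpl, prog_tmpl in RULES[symbol]:
        slots = [tok for tok in nl_tmpl.split() if tok.startswith("$")]
        if not slots:                      # terminal production
            yield nl_tmpl, prog_tmpl
            continue
        for filling in itertools.product(*(list(expand(s)) for s in slots)):
            nl, prog = nl_tmpl, prog_tmpl
            for slot, (nl_val, prog_val) in zip(slots, filling):
                nl = nl.replace(slot, nl_val, 1)
                prog = prog.replace(slot, prog_val, 1)
            yield nl, prog

# list(expand("$EntSet")) includes, e.g.,
# ("paper whose venue is ACL", "filter(paper, venue=acl)")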

Bridging the Gaps between Canonical and Natural Data
Language and Logical Gaps  The synthesis approach in §2 generates a large set of paraphrased canonical data (denoted D_par). However, as noted by Herzig and Berant (2019) (hereafter HB19), the synthetic examples cannot capture all the language and programmatic patterns of real-world natural examples from users (denoted D_nat). There are two mismatches between D_par and D_nat. First, there is a logical gap between the programs in D_nat capturing real user intents and the synthetic ones in D_par. Notably, since programs are exhaustively enumerated from the grammar up to a certain compositional depth, D_par will not cover the more complex programs in D_nat beyond that threshold. Ideally we could improve coverage using a higher threshold; however, the space of possible programs grows exponentially, and combinatorial explosion occurs even with small thresholds.

(Footnote: "… authors) by first generating all the canonical samples and then filtering those that violate the constraints.")
Next, there is a language gap between paraphrased canonical utterances and real-world user-issued ones. Real utterances (e.g., u2 in Fig. 1, modeled later in §3.1.1) enjoy more lexical and syntactic diversity, while the auto-paraphrased ones (e.g., u′1) are typically biased towards the monotonous and verbose language style of their canonical source (e.g., u1). While we could increase diversity via iterative rounds of paraphrasing (e.g., u2 → u′2 → u″2), the paraphraser could still fail on canonical utterances that are not natural English sentences at all, like u1.

Bridging Language and Logical Gaps
We introduce improvements to the system to close the language (§3.1.1) and logical (§3.1.2) gaps.

Idiomatic Productions
To close the language gap, we augment the grammar with productions capturing domain-specific idiomatic language styles. Such productions compress clunky canonical expressions (e.g., u1 in Fig. 1) into more succinct and natural alternatives (e.g., u2). We focus on two language patterns.

Non-compositional Expressions for Multi-hop Relations  Compositional canonical utterances typically feature chained multi-hop relations joined together (e.g., Author that writes paper whose topic is NLP), which can be compressed into more succinct phrases that denote the relation chain, where the intermediary pivoting entities (e.g., paper) are omitted (e.g., Author that works on NLP). This pattern is referred to as sub-lexical compositionality in Wang et al. (2015) and is used there by annotators to compress verbose canonical utterances, whereas we model it directly with grammar rules. Refer to Appendix B for more details.

Idiomatic Comparatives and Superlatives
The general grammar in Fig. 1a uses canonical constructs for comparative (e.g., smaller than) and superlative (e.g., largest) utterances (e.g., u1), which is not ideal for entity types with special units (e.g., time, length). We therefore create productions specifying idiomatic comparative and superlative expressions (e.g., paper published before 2014, and u2 in Fig. 1). Sometimes, answering a superlative utterance also requires reasoning over other pivoting entities. For instance, the relation in "venue that X publishes mostly in" between authors and venues implicitly involves counting the papers that X publishes. For such cases, we create "macro" productions, with the NL phrase mapped to a program that captures the computation involving the pivoting entity (Appendix B).
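To make this concrete, the following hypothetical sketch represents idiomatic productions as NL templates paired with program templates; the production format, program syntax, and slot names are illustrative only and do not correspond to the system's actual λ-calculus rules (Appendix B):

IDIOMATIC_PRODUCTIONS = [
    {   # Non-compositional expression for a 2-hop relation (author -> paper -> topic)
        "nl": "author that works on $topic",
        "program": "filter(author, writes(filter(paper, topic=$topic)))",
    },
    {   # Idiomatic comparative for a relation with date-typed objects
        "nl": "paper published before $year",
        "program": "filter(paper, publication_year < $year)",
    },
    {   # "Macro" superlative: the pivoting entity (paper) and the count stay implicit in the NL
        "nl": "venue that $author publish mostly in",
        "program": "argmax(venue, count(filter(paper, author=$author, venue=venue)))",
    },
]

def instantiate(production, **slots):
    """Fill the NL and program templates with the same slot values."""
    nl, program = production["nl"], production["program"]
    for name, value in slots.items():
        nl = nl.replace(f"${name}", str(value))
        program = program.replace(f"${name}", str(value))
    return nl, program

# instantiate(IDIOMATIC_PRODUCTIONS[0], topic="NLP")
# -> ("author that works on NLP", "filter(author, writes(filter(paper, topic=NLP)))")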

Discussion
In line with Su and Yan (2017) and Marzoev et al. (2020), we remark that such functionality-driven grammar engineering, which covers representative patterns in real data using a small set of curated production rules, is more efficient and cost-effective than example-driven annotation, which requires labeling a sufficient number of parallel samples to effectively train a data-hungry neural model over a variety of underlying meanings and surface language styles. In contrast, our approach follows Xu et al. (2020b) and automatically synthesizes complex compositional samples from the user-specified productions, which are further paraphrased to significantly increase their linguistic diversity.

Naturalness-driven Data Selection
To cover real programs in D_nat with complex structures while tackling the exponential sample space, we propose an efficient approach that sub-samples a small set of examples from this space as the seed canonical data D_can (Fig. 1b) for paraphrasing. Our core idea is to retain only the examples (u, z) that most likely reflect the intents of real users. We use the probability p_LM(u) measured by a language model to approximate the "naturalness" of canonical examples. Specifically, given all canonical examples allowed by the grammar, we form buckets based on their derivation depth d. For each bucket, we compute p_LM(u) for its examples, and group the examples using program templates as the key (e.g., u1 and u2 in Fig. 1 are grouped together). For each group, we find the example (u*, z) with the highest p_LM(u*), and discard the other examples (u, z) for which log p_LM(u*) − log p_LM(u) > δ (δ = 5.0), removing unlikely utterances from the group (e.g., u1). Finally, we rank all groups in the bucket based on p_LM(u*), and retain the examples in the top-K groups. This method offers a trade-off between program coverage and efficiency and, perhaps surprisingly, we show that using only the 0.2%∼1% top-ranked examples also results in significantly better final accuracy (§4).
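A minimal sketch of this selection procedure is shown below, assuming examples arrive as (utterance, program, depth) triples and that lm_log_prob and program_template are available helpers returning log p_LM(u) and a program's template, respectively; both helper names are assumptions for illustration.

from collections import defaultdict

DELTA = 5.0   # log-probability margin within a program-template group
TOP_K = 2000  # number of groups retained per derivation depth

def select_canonical(examples, lm_log_prob, program_template):
    """examples: iterable of (utterance, program, depth); returns the seed data D_can."""
    selected = []
    by_depth = defaultdict(list)
    for utterance, program, depth in examples:
        by_depth[depth].append((utterance, program))
    for bucket in by_depth.values():
        # Group the bucket's examples by program template.
        groups = defaultdict(list)
        for utterance, program in bucket:
            groups[program_template(program)].append(
                (lm_log_prob(utterance), utterance, program))
        ranked = []
        for members in groups.values():
            members.sort(reverse=True)                      # most natural utterance first
            best_lp = members[0][0]
            # Discard utterances far less likely than the group's best one.
            kept = [(u, z) for lp, u, z in members if best_lp - lp <= DELTA]
            ranked.append((best_lp, kept))
        # Keep only the TOP_K most natural groups of this depth.
        ranked.sort(key=lambda g: g[0], reverse=True)
        for _, kept in ranked[:TOP_K]:
            selected.extend(kept)
    return selected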

Generating Validation Data
Zero-shot learning is non-trivial without a high-quality validation set, as the model might overfit to the (paraphrased) canonical data, which is subject to language and logical mismatch. While existing methods (Xu et al., 2020b) circumvent the issue using real validation data, in this work we create validation sets from paraphrased examples, making our method truly free of labeled data. Specifically, we consider a two-stage procedure. First, we run the iterative paraphrasing algorithm (§2) without validation, and then sample (u, z) from its output with probability p(u, z) ∝ p_LM(u)^α (α = 0.4), ensuring the resulting sampled set D_par^val is representative. Second, we restart training using D_par^val for validation to find the best checkpoint. The paraphrase filtering model is also initialized with the parser trained in the first stage, which has higher precision and accepts more valid paraphrases. This is similar to iterative training of weakly supervised semantic parsers (Dasigi et al., 2019), where the model searches for candidate programs for unlabeled utterances over multiple stages of learning.
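The validation-set sampling step can be sketched as follows; function and variable names are assumptions, and sampling is done with replacement for simplicity.

import math
import random

ALPHA = 0.4  # flattening exponent for p(u, z) proportional to p_LM(u)^alpha

def sample_validation_set(paraphrased_examples, lm_log_prob, size=2000, seed=0):
    """paraphrased_examples: list of (utterance, program) pairs from the first-stage run."""
    rng = random.Random(seed)
    # Unnormalized weights p_LM(u)^alpha, computed in log space for numerical stability.
    log_weights = [ALPHA * lm_log_prob(u) for u, _ in paraphrased_examples]
    max_lw = max(log_weights)
    weights = [math.exp(lw - max_lw) for lw in log_weights]
    return rng.choices(paraphrased_examples, weights=weights,
                       k=min(size, len(paraphrased_examples)))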

Experiments
We evaluate our zero-shot parser on two datasets. SCHOLAR (Iyer et al., 2017) is a collection of utterances querying an academic database (Fig. 1). Examples were collected from users interacting with a parser and later augmented with Turker paraphrases. We use the version from HB19, with programs represented as λ-calculus logical forms. The sizes of the train/test splits are 579/211. Entities in utterances and programs (e.g., semantic parsing paper in ACL) are canonicalized to typed slots (e.g., keyphrase0, venue0) as in Dong and Lapata (2016), and are recovered when programs are executed during evaluation. We found that in the original dataset by HB19, slots are paired with random entities for execution (e.g., keyphrase0 → optics); reference programs are therefore likely to execute to empty results, making metrics like answer accuracy more prone to false positives. We manually fix all such examples in the dataset, as well as those with execution errors. GEO (Zelle and Mooney, 1996) is a classical dataset with queries about U.S. geography (e.g., Which rivers run through states bordering California?). Its database contains basic geographical entities like cities, states, and rivers. We also use the release from HB19, with train/test splits of size 537/237.

Models and Configuration
Our semantic parser is a sequence-to-sequence model with a pre-trained BERT-Base encoder (Devlin et al., 2019) and an LSTM decoder augmented with a copy mechanism. The paraphraser is a BART-Large model (Lewis et al., 2020). We use the same set of hyper-parameters for both datasets. Specifically, we synthesize canonical examples from the SCFG with a maximal program depth of 6, and collect the top-K (K = 2,000) GPT-scored sample groups for each depth as the seed canonical data D_can (§3.1.2). We perform the iterative paraphrasing and training procedure (§2) for two iterations. We create validation sets of size 2,000 in the first stage of learning (§3.2), and perform validation using perplexity in the second stage. Refer to Appendix C for more details. Note that our model only uses the natural examples in both datasets for evaluation purposes; the training and validation splits are not used during learning.

Measuring Language and Logical Gaps
We measure the language mismatch between utterances in the paraphrased canonical data (D_par) and the natural data (D_nat) using the perplexity of natural utterances in D_nat under a GPT-2 LM fine-tuned on D_par.
For the logical gap, we follow HB19 and compute the coverage of natural programs z ∈ D_nat in D_par, i.e., the percentage of real programs that are covered by D_par.
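Both measures can be sketched as below; the snippet loads an off-the-shelf GPT-2 from HuggingFace as a stand-in for the LM fine-tuned on D_par, and program_template is an assumed helper that maps a program to its template.

import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")   # in practice: GPT-2 fine-tuned on D_par
model.eval()

def utterance_perplexity(utterance):
    """Perplexity of a natural utterance under the (fine-tuned) LM."""
    ids = tokenizer(utterance, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss        # mean token-level negative log-likelihood
    return math.exp(loss.item())

def logical_coverage(natural_programs, synthetic_programs, program_template):
    """Fraction of natural programs whose template appears in the synthetic data."""
    covered = {program_template(z) for z in synthetic_programs}
    return sum(program_template(z) in covered for z in natural_programs) / len(natural_programs)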
Metric  We use denotation accuracy on the execution results of model-predicted programs. We report the mean and standard deviation over five random restarts.

Results
We first compare our model with existing approaches that use labeled data. Next, we analyze how our proposed methods close the language and logical gaps. Tab. 1 reports the accuracies of various systems on the test sets, as well as their forms of supervision. Specifically, the supervised parser uses a standard parallel corpus D_nat of real utterances annotated with programs. OVERNIGHT uses paraphrased synthetic examples D_par like our model, but with manually written paraphrases. GRANNO uses unlabeled real utterances u_nat ∈ D_nat, and manual paraphrase detection to pair u_nat with the canonical examples D_can. Our model outperforms existing approaches on the two benchmarks without using any annotated data, while GRANNO, the currently most cost-effective approach, still spends $155 on manual annotation (besides collecting real utterances) to create training data for the two datasets (HB19). Overall, the results demonstrate that a zero-shot parser based on idiomatic synchronous grammars and automatic paraphrasing with pre-trained LMs is a data-efficient and cost-effective paradigm for training semantic parsers in emerging domains. Still, our system falls behind fully supervised models trained on natural datasets D_nat, due to the language and logical gaps between D_par and D_nat. In the following experiments, we explore whether our proposed methods are effective at narrowing these gaps and improving accuracy. Since the validation splits of the two datasets are small (e.g., only 99 samples for SCHOLAR), we use the full training/validation splits for evaluation to obtain more reliable results.
More expressive grammars narrow language and logical gaps  We capture domain-specific language patterns using idiomatic productions to close the language mismatch (§3.1.1). Tables 2 and 3 list the results as we gradually improve the expressiveness of the grammar by adding different types of idiomatic productions. We observe that more expressive grammars help close the language gap, as indicated by the decreasing perplexities. This is especially important for SCHOLAR, which features diverse idiomatic NL expressions that are hard to infer from plain canonical utterances. For instance, it can be non-trivial to paraphrase canonical utterances with multi-hop (e.g., Author that cites paper by X) or superlative relations (e.g., Topic of the most number of ACL paper) into more idiomatic alternatives (e.g., "Author that cites X" and "The most popular topic for ACL paper"), while directly including such patterns in the grammar (+Multihop Rel. and +Superlative) is helpful. Additionally, we observe that more expressive grammars also improve logical coverage. The last columns (Logical Coverage) of Tables 2 and 3 report the percentage of real programs that are covered by the seed canonical data before (D_can) and after (D_par) iterative paraphrasing. Intuitively, idiomatic grammar rules can capture compositional program patterns like multi-hop relations and complex superlative queries (e.g., Author that publish mostly in ACL, §3.1.1) within a single production, enabling the grammar to generate more compositional programs under the same threshold on the derivation depth. Notably, when adding all the idiomatic productions on SCHOLAR, the number of exhaustively generated examples with a program depth of 6 is tripled (530K → 1,700K).
Moreover, recall that the seed canonical dataset D_can contains examples with highly likely utterances under the LM (§3.1.2). Therefore, examples created by idiomatic productions are more likely to be included in D_can, as their more natural and well-formed utterances often have higher LM scores. However, note that this could also be counterproductive, as examples created with idiomatic productions could dominate the LM-filtered D_can, "crowding out" other useful examples with lower LM scores. This likely explains the slightly decreased logical coverage on GEO (Tab. 3), as more than 30% of samples in the filtered D_can include idiomatic multi-hop relations directly connecting geographic entities with their countries (e.g., "City in US", cf. "City in state in US"), while such examples account for only ∼8% of real data. While the over-representation issue might not negatively impact accuracy, we leave generating more balanced synthetic data as important future work.
Finally, we note that the logical coverage drops after paraphrasing (D_can vs. D_par in Tables 2 and 3). This is because for some samples in D_can, the paraphrase filtering model rejects all their paraphrases. We provide further analysis later in a case study.

Do smaller logical gaps entail better performance?  As described in §3.1.2, the seed canonical data D_can consists of the top-K highest-scoring examples under GPT-2 for each program depth. This data selection method makes it possible to train the model efficiently in the iterative paraphrasing stage using a small set of canonical samples that most likely appear in natural data, out of the exponentially large sample space. However, using a smaller cutoff threshold K might sacrifice logical coverage, as fewer examples remain in D_can. To investigate this trade-off, we report results with varying K in Tab. 4. Notably, with K = 1,000 and around 3K seed canonical examples D_can (before iterative paraphrasing), D_can already covers 88% and 80% of natural programs on SCHOLAR and GEO, respectively. This small portion of samples accounts for only 0.2% (1%) of the full set of 1.7M+ (0.27M) canonical examples exhaustively generated from the grammar on SCHOLAR (GEO). This demonstrates that our data selection approach is effective at maintaining learning efficiency while closing the logical gap. In contrast, the baseline data selection strategy of randomly choosing canonical examples from each level of program depth, instead of using the top-K highest-scored samples, is less effective. For example, this baseline achieves accuracies of 69.7% and 65.5% on SCHOLAR and GEO, respectively, when K = 2,000, which is around 7% lower than the accuracy achieved by our approach (77.8% and 72.8%, Tab. 4). More interestingly, while a larger K yields higher logical form coverage, accuracy might not improve. This is possibly because while the recall of real programs improves, the percentage of such programs in the paraphrased canonical data D_par (numbers in parentheses) actually drops. Of the remaining 90%+ samples in D_par whose programs are not in D_nat, many have unnatural intents that real users are unlikely to issue (e.g., "Number of titles of papers with the smallest citations", or "Mountain whose elevation is the length of Colorado River"). Such unlikely samples are potentially harmful to the model, causing worse language mismatch, as suggested by the increasing perplexity when K = 8,000. Similar to HB19, empirically we observe that around one-third of samples in D_can and D_par are unlikely. As shown later in the case study, such unlikely utterances have noisier paraphrases, which hurts the quality of D_par.
Does the model generalize to out-of-distribution samples?  Next, to investigate whether the model generalizes to utterances with out-of-distribution program patterns not seen in the training data D_par, we report accuracies on the splits whose program templates are covered (In Coverage) and not covered (Out of Coverage) by D_par. Not surprisingly, the model performs significantly better on the in-coverage sets, which exhibit less language mismatch.⁴ Our results are also in line with recent research on compositional generalization of semantic parsers (Lake and Baroni, 2018; Finegan-Dollak et al., 2018), which suggests that existing models generalize poorly to utterances with novel compositional patterns (e.g., conjunctive objects like Most cited paper by X and Y) not seen during training. Surprisingly, our model still generalizes reasonably well to compositionally novel (out-of-coverage) splits, registering 30%∼50% accuracies, in contrast to HB19, who report accuracies below 10% on similar benchmarks for OVERNIGHT. We hypothesize that synthesizing compositional samples increases the number of unique program templates in training, which could be helpful for compositional generalization (Akyürek et al., 2021). As an example, the number of unique program templates in D_par when K = 2,000 on SCHOLAR and GEO is 1.9K and 1.7K, respectively, compared to only 125 and 187 in D_nat. This finding is reminiscent of data augmentation strategies for supervised parsers using synthetic samples induced from (annotated) parallel data (Jia and Liang, 2016; Wang et al., 2021b).

⁴ … to specify compositional constraints (e.g., ACL 2021 parsing papers), a language style common for in-coverage samples but not captured by the grammar. With smaller K and D_can, it is less likely for the paraphrased data D_par to capture similar syntactic patterns. Another factor that makes the out-of-coverage PPL smaller when K = 500 is that there are more (simpler) examples in that set compared to K > 500, and the relatively simple utterances also bring down the PPL.

Impact of Validation Data  Our system generates validation data from samples of the paraphrased data in an initial run (§3.2). Tab. 5 compares this strategy of generating validation data with a baseline approach, which randomly splits the seed canonical examples in D_can into training and validation sets, and runs the iterative paraphrasing and training algorithm on the two sets in parallel. In each iteration, the checkpoint that achieves the best perplexity on the paraphrased validation examples is saved. We use the paraphrase filtering model learned on the training set to filter the paraphrases of validation examples. This baseline approach performs reasonably well. Still, empirically we find that this strategy creates larger logical gaps, as some canonical samples whose program patterns appear in the natural data D_nat could be partitioned into the validation data and not used for training.

Impact of Paraphrasers  Our system relies on strong paraphrasers to generate diverse utterances in order to close the language gap. Tab. 6 compares the performance of the system trained with our paraphraser and the one used in Xu et al. (2020b).
Both models are based on BART, while our paraphraser is fine-tuned to encourage lexically and syntactically diverse outputs (Appendix A). We measure lexical diversity using token-level F1 between the original and paraphrased utterances u, u′ (Rajpurkar et al., 2016; Krishna et al., 2020). For syntactic divergence, we use Kendall's τ (Lapata, 2006) to compute the ordinal correlation between u and u′, which intuitively measures the number of swaps of tokens in u needed to obtain u′ under bubble sort (both measures are sketched in code at the end of this section). Our paraphraser generates more diverse paraphrases (e.g., What is the biggest state in US?) from the source (e.g., State in US and that has the largest area), as indicated by lower token-level overlaps and ordinal coefficients, compared to the existing paraphraser (e.g., The state in US with the largest surface area). Nevertheless, our paraphraser is still not perfect, as discussed next.

Imperfect Paraphraser  The imperfect paraphraser can generate semantically incorrect predictions (e.g., u1,1), especially when the source canonical utterance contains uncommon or polysemous concepts (e.g., venue in u1), which tend to be ignored or interpreted as other entities (e.g., sites). Besides rare concepts, the paraphraser can also fail on utterances that follow special compositionality patterns. For instance, u*_nat in Example 2 uses compound nouns to denote the occurrence of a conference, which is difficult to automatically paraphrase from u2 (which uses prepositional phrases) without any domain knowledge. While the model can still correctly answer u*_nat in this case, u*_nat's perplexity is high, suggesting language mismatch.
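The following is a rough sketch of the two diversity measures, token-level F1 and an ordinal Kendall's τ computed over positions of shared tokens; it is intended as an approximation of the measures described above rather than the exact implementation.

from collections import Counter
from scipy.stats import kendalltau

def token_f1(source, paraphrase):
    """Token-level F1 between a canonical utterance and its paraphrase (lower = more lexical change)."""
    src, par = source.lower().split(), paraphrase.lower().split()
    overlap = sum((Counter(src) & Counter(par)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(par), overlap / len(src)
    return 2 * precision * recall / (precision + recall)

def ordinal_correlation(source, paraphrase):
    """Kendall's tau over positions of shared tokens (lower = more word reordering)."""
    src, par = source.lower().split(), paraphrase.lower().split()
    shared = [tok for tok in src if tok in par]
    if len(shared) < 2:
        return 1.0
    src_positions = [src.index(tok) for tok in shared]
    par_positions = [par.index(tok) for tok in shared]
    tau, _ = kendalltau(src_positions, par_positions)
    return tau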

Limitations and Discussion
Unnatural Utterances  While we have attempted to close the language gap by generating canonical utterances that are more idiomatic in language style, some of these synthetic utterances are still not natural enough for the paraphraser to rewrite. This is especially problematic for relations not covered by our idiomatic productions. For instance, our SCFG does not cover the co-authorship relation in Example 3; the generated synthetic utterance u3 therefore uses a clumsy multi-hop query to express this intent, which is non-trivial for the model to paraphrase into an idiomatic expression such as u*_nat. While this issue could potentially be mitigated using additional production rules, grammar engineering could still remain challenging, as elaborated later in this section.
Unlikely Examples  Related to the issue of unnatural canonical utterances, another challenge is the presence of unlikely examples with convoluted logical forms that rarely appear in real data. As discussed earlier in §4, D_can contains around 30% such unlikely canonical examples (e.g., u4). Similar to the case of unnatural utterances, paraphrases of these logically unlikely examples are also much noisier (e.g., u4,*). Empirically, we observe the paraphraser's accuracy is only around 30% for utterances of such unlikely samples, compared to 70% for the likely ones. The filtering model is also less effective on unlikely examples (false positives in Tab. 7). These noisy samples eventually hurt the performance of the parser. We leave modeling utterance naturalness as important future work.
Cost of Grammar Engineering  Our approach relies on an expressive SCFG to bridge the language and logical gaps between synthetic and real data. While in §3.1.1 we identified a set of representative categories of grammar patterns necessary to capture domain-specific language style, and attempted to standardize the process of grammar construction by designing idiomatic productions following those categories, grammar engineering remains a non-trivial task. One needs a good sense of the idiomatic language patterns that frequently appear in real-world data, which requires performing a user study or having access to sampled data. Additionally, encoding those language patterns as production rules assumes that the user understands the grammar formalism (λ-calculus) used by our system, which could limit the applicability of the approach to general users. Still, as discussed in §3.1.1, we remark that for users proficient in the grammar formalism, curating a handful of idiomatic production rules is still more efficient than labeling parallel samples to exhaustively cover compositional logical patterns and diverse language styles, and the number of annotated samples required could be orders of magnitude larger than the size of the grammar. Meanwhile, the process of creating production rules could potentially be simplified by allowing users to define them in natural language instead of λ-calculus logical rules, similar in spirit to studies on naturalizing programs using canonical language (Wang et al., 2017; Shin et al., 2021; Herzig et al., 2021).

Conclusion
In this paper, we propose a zero-shot semantic parser that closes the language and logical gaps between synthetic and real data. On SCHOLAR and GEO, our system outperforms other annotation-efficient approaches with zero labeled data.

B.1 Idiomatic Productions
Multi-hop Relations  We create idiomatic productions for non-compositional NL phrases of multi-hop relations (e.g., Author that writes paper in ACL). We augment the database with entries for those multi-hop relations (e.g., ⟨X, author.publish_in, acl⟩), and then create productions in the grammar aligning those relations with their NL phrases (e.g., r1 in Tab. 9).

Comparatives and Superlatives
We also create productions for idiomatic comparative and superlative expressions. These productions specify the NL expressions for the comparative/superlative form of certain relations. For example, for the relation paper.publication_year with objects of date-time values, its …

[Figure 1: Illustration of the learning process of our zero-shot semantic parser with real model outputs. (a) Synchronous grammar with production rules over categories such as entity types with relative clauses (e.g., paper), prepositional phrases (e.g., in deep learning), complementary phrases (e.g., that has the largest citation count), conjunctives, and idiomatic superlative expressions. (b) Canonical examples of utterances with programs (only z2, superlative(filter(paper, topic=DL), key=year), is shown) are generated from the grammar (colored spans show productions used); unnatural utterances like u1 can be discarded, as in §3.1.2. (c) At each iteration, canonical examples are paraphrased to increase diversity in language style, and a semantic parser is trained on the paraphrased examples; potentially noisy or vague paraphrases are filtered using the parser trained on previous iterations.]
Our parser still lags behind the fully supervised model (Tab. 1). To understand the remaining bottlenecks, we show representative examples in Tab. 7.

Low Recall of Filter Model  First, the recall of the paraphrase filtering model is low. The filtering model uses the parser trained on the paraphrased data generated in previous iterations. Since this model is less accurate, it can incorrectly reject valid paraphrases u′ (marked in Tab. 7), especially when u′ uses a different sentence type (e.g., questions) than the source (e.g., statements). Empirically, we found the recall of the filtering model at the first iteration of the second-stage training (§3.2) is only around 60%. This creates logical gaps, as paraphrases of examples in the seed canonical data D_can could be rejected by the conservative filtering model, leaving no samples with the same programs in D_par.

Table 7: Case study on SCHOLAR. We show the seed canonical utterance ui, the paraphrases ui,j, and the relevant natural examples u*_nat. Symbols in the original table (omitted here) denote the correctness of paraphrases, the false negatives of the filtering model (correct paraphrases that are filtered), and the false positives (incorrect paraphrases that are accepted). Entities are canonicalized with indexed typed slots.

Example 1 (Uncommon Concept)
u1: Venue of paper by author0 and published in year0
u1,1: author0's paper, published in year0
u1,2: Where the paper was published by author0 in year0?
u1,3: Where the paper was published in year0 by author0?
u*_nat: Where did author0 publish in year0? (Wrong Answer)

Example 2 (Novel Language Pattern)
u2: Author of paper published in venue0 and in year0
u2,1: Author of papers published in venue0 in year0
u2,2: Who wrote a paper for venue0 in year0
u2,3: Who wrote the venue0 paper in year0
u*_nat: venue0 year0 authors (Correct)

Example 3 (Unnatural Utterance)
u3: Author of paper by author0
u3,1: Author of the paper written by author0
u3,2: Author of author0's paper
u3,3: Who wrote the paper author0 wrote?
u*_nat: Co-authors of author0 (Wrong Answer)

Example 4 (Unlikely Example)
u4: Paper in year0 and whose author is not the most cited author
u4,1: A paper published in year0 that isn't the most cited author
u4,2: What's not the most cited author in year0
u4,3: In year0, he was not the most cited author

Table 1 :
Accuracy and standard deviation on TEST sets. Results are averaged over five random restarts. † Models originally from Herzig and Berant (2019) and run with five random restarts. Results from our model are tested against GRANNO using a paired permutation test with p < 0.05.

Table 2 :
Ablation of grammar categories on SCHOLAR.

Table 3 :
Ablation study of grammar categories on GEO.

Table 4 :
Results on SCHOLAR and GEO with varying amounts of canonical examples in the seed training data.

Table 6 :
Systems with different paraphrasers. We report end-to-end denotation accuracy, as well as F1 and Kendall's τ rank coefficients between utterances and their paraphrases.

Table 8 :
Example domain-general production rules in the SCFG (Id; syntactic body; description; semantic function):

r1: NP → SuperlativeAdj NP (e.g., most recent ?). Semantic function: lambda rel, sub (call superlative (var sub) (string max) (var rel)) — a lambda function that returns the subject sub with the largest relation rel.
r2: NP → NP+CP — a noun phrase head NP and a complementary phrase body CP (e.g., paper in deep learning).
r3: NP+CP → UnaryNP CP. Semantic function: performs beta reduction, applying the function from CP (e.g., in deep learning) to the value of UnaryNP (e.g., paper).
r4: UnaryNP → TypeNP — entity types, e.g., paper. Semantic function: IdentityFn.
r5: CP → FilterCP. Semantic function: IdentityFn.
r6: FilterCP → Prep NP (e.g., in deep learning). Semantic function: lambda rel, obj, sub (call filter (var sub) (var rel) (string =) (var obj)) — creates a lambda function that filters entities in a list sub such that their relation rel (e.g., topic) equals obj (e.g., deep learning).
r5: NP → Entity — entity noun phrases, e.g., deep learning. Semantic function: IdentityFn.

# Get all entities whose type is paper
$UnaryNP: call getProperty (call singleton fb:en.paper) (string !type)

# A lambda function that returns entities in x whose relation paper.keyphrase is deep_learning

Where the program of UnaryNP is an entity set of papers, and the program of NP is a lambda function with a variable x, which filters the entity set. The semantic function of r3 specifies how these two programs should be composed to form the program of their parent node NP+CP, which performs β-reduction, assigning the entity set returned by UnaryNP to the variable x:

# Get all papers whose keyphrase is deep learning
$NP+CP: (call filter (call getProperty (call singleton fb:en.paper) (string !type)) (string paper.keyphrase) (string =) (fb:en.keyphrase.deep_learning))