Compositional Generalization and Natural Language Variation: Can a Semantic Parsing Approach Handle Both?

Sequence-to-sequence models excel at handling natural language variation, but have been shown to struggle with out-of-distribution compositional generalization. This has motivated new specialized architectures with stronger compositional biases, but most of these approaches have only been evaluated on synthetically-generated datasets, which are not representative of natural language variation. In this work we ask: can we develop a semantic parsing approach that handles both natural language variation and compositional generalization? To better assess this capability, we propose new train and test splits of non-synthetic datasets. We demonstrate that strong existing approaches do not perform well across a broad set of evaluations. We also propose NQG-T5, a hybrid model that combines a high-precision grammar-based approach with a pre-trained sequence-to-sequence model. It outperforms existing approaches across several compositional generalization challenges on non-synthetic data, while also being competitive with the state-of-the-art on standard evaluations. While still far from solving this problem, our study highlights the importance of diverse evaluations and the open challenge of handling both compositional generalization and natural language variation in semantic parsing.


Introduction
Sequence-to-sequence (seq2seq) models have been widely used in semantic parsing (Dong and Lapata, 2016; Jia and Liang, 2016) and excel at handling the natural language variation [1] of human-generated queries. However, evaluations on synthetic [2] tasks such as SCAN (Lake and Baroni, 2018) have shown that seq2seq models often generalize poorly to out-of-distribution compositional utterances, such as "jump twice" when only "jump", "walk", and "walk twice" are seen during training. This ability to generalize to novel combinations of the elements observed during training is referred to as compositional generalization. This has motivated many specialized architectures that improve performance on SCAN (Li et al., 2019; Russin et al., 2019; Gordon et al., 2019; Lake, 2019; Liu et al., 2020; Nye et al., 2020; Chen et al., 2020). However, most approaches have only been evaluated on synthetic datasets. While synthetic datasets enable precise, interpretable evaluation of specific phenomena, they are less representative of the natural language variation that a real-world semantic parsing system must handle.

[1] We use the term natural language variation in a broad sense to refer to the many different ways humans can express the same meaning in natural language, including differences in word choice and syntactic constructions.
[2] We make a coarse distinction between synthetic datasets, where natural language utterances are generated by a program, and non-synthetic datasets, where natural language utterances are collected from humans.

In this paper, we ask: can we develop a semantic parsing approach that handles both natural language variation and compositional generalization?

Surprisingly, this question is understudied. As visualized in Figure 1, most prior work evaluates either out-of-distribution compositional generalization on synthetic datasets, or in-distribution performance on non-synthetic datasets. Notably, designing approaches that can handle both compositional generalization and the natural language variation of non-synthetic datasets is difficult. For example, large pre-trained seq2seq models that perform well on in-distribution evaluations do not address most of the compositional generalization challenges proposed in SCAN (Furrer et al., 2020).
Our research question has two important motivations. First, humans have been shown to be adept compositional learners (Lake et al., 2019). Several authors have argued that a greater focus on compositional generalization is an important path to more human-like generalization and NLU (Lake et al., 2017;Battaglia et al., 2018). Second, it is practically important to assess performance on non-synthetic data and out-of-distribution examples, as random train and test splits can overestimate real-world performance and miss important error cases (Ribeiro et al., 2020). Therefore, we are interested in approaches that do well not only on controlled synthetic challenges of compositionality or in-distribution natural utterances, but across all of the diverse set of evaluations shown in Figure 2.
Our contributions are two-fold. First, on the evaluation front, we show that performance on SCAN is not well-correlated with performance on non-synthetic tasks. In addition, strong existing approaches do not perform well across all evaluations in Figure 2. We also propose new Target Maximum Compound Divergence (TMCD) train and test splits, extending the methodology of Keysers et al. (2020) to create challenging evaluations of compositional generalization for non-synthetic datasets. We show that TMCD splits complement existing evaluations by focusing on different aspects of the problem.
Second, on the modeling front, we propose NQG, a simple and general grammar-based approach that solves SCAN and also scales to natural utterances, obtaining high precision for non-synthetic data. In addition, we introduce and evaluate NQG-T5, a hybrid model that combines NQG with T5 (Raffel et al., 2020), leading to improvements across several compositional generalization evaluations while also being competitive on the standard splits of GEOQUERY (Zelle and Mooney, 1996) and SPIDER (Yu et al., 2018).

Figure 2: We evaluate semantic parsing approaches across a diverse set of evaluations focused on natural language variation, compositional generalization, or both, including MCD (Keysers et al., 2020), Add Primitive (Lake and Baroni, 2018), Length, TMCD, Template (Finegan-Dollak et al., 2018), and Random splits. We add TMCD splits to complement existing evaluations. Ordering within each cell is arbitrary.

Our results indicate that NQG-T5 is a strong baseline for our challenge of developing approaches that perform well across a diverse set of evaluations focusing on either natural language variation, compositional generalization, or both. Comparing five approaches across eight evaluations on SCAN and GEOQUERY, its average rank is 1, with the rank of the best previous approach (T5) being 2.9; performance is also competitive across several evaluations on SPIDER. While still far from affirmatively answering our research question, our study highlights the importance of a diverse set of evaluations and the open challenge of handling both compositional generalization and natural language variation.

Background and Related Work
In this section, we survey recent work related to compositional generalization in semantic parsing.
Evaluations To evaluate a model's ability to generalize to novel compositions, previous work has proposed several methods for generating train and test splits, as well as several synthetic datasets.
A widely used synthetic dataset for assessing compositional generalization is SCAN (Lake and Baroni, 2018), which consists of natural language commands (e.g., "jump twice") mapping to action sequences (e.g., "I_JUMP I_JUMP"). One split for SCAN is the length split, where examples are separated by length such that the test set contains longer examples than the training set. Another is the primitive split, where a given primitive (e.g., "jump") is seen by itself during training, but the test set consists of the primitive recombined with other elements observed during training (e.g., "jump twice"). Other synthetic datasets have been developed to evaluate aspects of compositional generalization beyond SCAN, including NACS (Bastings et al., 2018), CFQ (Keysers et al., 2020), and COGS (Kim and Linzen, 2020).
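Both split types described above are simple to construct programmatically. The following is a minimal illustrative sketch (not SCAN's official tooling; the `source`/`target` dictionary keys are an assumption of this example):

```python
def length_split(examples, max_train_len):
    """Length split: test examples have longer action sequences than any
    training example."""
    train = [ex for ex in examples if len(ex["target"].split()) <= max_train_len]
    test = [ex for ex in examples if len(ex["target"].split()) > max_train_len]
    return train, test

def primitive_split(examples, primitive):
    """Primitive split: the primitive appears only in isolation during
    training, while its compositions with other elements are held out."""
    train = [ex for ex in examples
             if primitive not in ex["source"].split() or ex["source"] == primitive]
    test = [ex for ex in examples
            if primitive in ex["source"].split() and ex["source"] != primitive]
    return train, test
```

For instance, `primitive_split(examples, "jump")` keeps the bare "jump" command in training while moving "jump twice" and similar compositions to the test set.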
In addition to introducing the CFQ dataset, Keysers et al. (2020) propose Maximum Compound Divergence (MCD) splits based on the notion of a compound distribution. Their algorithm generates train and test splits that maximize the divergence of their respective compound distributions while bounding the divergence of their respective atom distributions. We extend their methodology to create new TMCD splits for non-synthetic datasets.
Another method for generating train and test splits is the template split (also referred to as a query split; Finegan-Dollak et al., 2018). Unlike the aforementioned evaluations, template splits have been applied to non-synthetic datasets, primarily for text-to-SQL. In template splits, any parse template (defined as the target SQL query with entities anonymized) appearing in the training set cannot appear in the test set. We analyze and discuss template splits in § 6.1.
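As a concrete sketch of the template-split idea, the helpers below anonymize string and numeric literals to form a template and then partition examples by it. The anonymization here is a simplification of my own (real datasets may also normalize aliases and whitespace), so treat these functions as illustrative assumptions:

```python
import re

def sql_template(sql):
    """Anonymize entities (string and numeric literals) in a SQL query to
    obtain a simplified parse template."""
    sql = re.sub(r"'[^']*'", "'<ENT>'", sql)
    return re.sub(r"\b\d+(\.\d+)?\b", "<NUM>", sql)

def template_split(examples, held_out_templates):
    """Template split: any example whose template is held out is excluded
    from training entirely."""
    train = [ex for ex in examples
             if sql_template(ex["sql"]) not in held_out_templates]
    test = [ex for ex in examples
            if sql_template(ex["sql"]) in held_out_templates]
    return train, test
```

Two queries differing only in their entities (e.g. 'austin' vs. 'boston') share a template and therefore always land on the same side of the split.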
Finally, Herzig and Berant (2019) study biases resulting from methods for efficiently collecting human-labeled data, providing further motivation for out-of-distribution evaluations.
Approaches Many specialized architectures have been developed to address the compositional generalization challenges of SCAN. Several of them have recently reached 100% accuracy across multiple SCAN challenges (Liu et al., 2020;Nye et al., 2020;Chen et al., 2020). Similarly to the NQG-T5 approach we propose in § 4, all of these models incorporate discrete structure. However, unlike NQG-T5, they have only been evaluated on synthetic parsing tasks.
Recently, Herzig and Berant (2020) also begin to address our research question, proposing an approach that not only solves several SCAN challenges but also achieves strong performance on the standard and template splits of the non-synthetic dataset GEOQUERY. However, their approach requires some manual task-specific engineering. We compare NQG-T5 with this approach and other SCAN-inspired architectures. Oren et al. (2020) and Zheng and Lapata (2020) also explored compositional generalization on non-synthetic datasets by focusing on the template splits proposed by Finegan-Dollak et al. (2018), demonstrating improvements over standard seq2seq models.
The effect of large-scale pre-training on compositional generalization ability has also been studied. Furrer et al. (2020) finds that pre-training alone cannot solve several compositional generalization challenges, despite its effectiveness across NLP tasks such as question answering (Raffel et al., 2020).
While our work focuses on modeling approaches, compositional data augmentation techniques have also been proposed (Jia and Liang, 2016;Andreas, 2020). NQG-T5 outperforms previously reported results for these methods, but more in-depth analysis is needed.

Target Maximum Compound Divergence (TMCD) Splits
The existing evaluations targeting compositional generalization for non-synthetic tasks are template splits and length splits. Here we propose an additional method which expands the set of available evaluations by generating data splits that maximize compound divergence over non-synthetic datasets, termed Target Maximum Compound Divergence (TMCD) splits. As we show in § 6, it results in a generalization problem with different characteristics that can be much more challenging than template splits, and contributes to the comprehensiveness of evaluation. In standard MCD splits (Keysers et al., 2020), the notion of compounds requires that both source and target are generated by a rule-based procedure, and therefore cannot be applied to existing non-synthetic datasets where natural language utterances are collected from humans. For TMCD, we propose a new notion of compounds based only on the target representations, leveraging their known syntactic structure to define atoms and compounds. For instance, example atoms in FunQL are longest and river, and an example compound is longest(river). Detailed definitions of atoms and compounds for each dataset we study can be found in Appendix B.3.
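To make the atom and compound notions concrete, here is one plausible, simplified way to extract them from a FunQL string. The exact definitions live in Appendix B.3, so these regex-based helpers are illustrative assumptions rather than the paper's implementation:

```python
import re

def funql_atoms(funql):
    """Atoms: the individual predicate and argument symbols."""
    return set(re.findall(r"\w+", funql))

def funql_compounds(funql):
    """Compounds: parent(child) combinations of directly nested symbols.
    A lookahead lets nested matches overlap, e.g. longest(river(all))
    yields both longest(river) and river(all)."""
    return {f"{m.group(1)}({m.group(2)})"
            for m in re.finditer(r"(\w+)\s*\(\s*(?=(\w+))", funql)}
```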
Given this definition of compounds, our definition of compound divergence, D_C, is the same as that of Keysers et al. (2020). Specifically,

D_C = 1 − C_α(F_TRAIN ‖ F_TEST),

where F_TRAIN and F_TEST are the weighted frequency distributions of compounds in the training and test sets, respectively, and C_α(P ‖ Q) = Σ_k p_k^α q_k^(1−α) is the Chernoff coefficient (Chung et al., 1989), used with α = 0.1.
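The divergence computation itself is small. A sketch, assuming compounds are supplied as plain lists (the original definition uses weighted compound frequencies, which this simplification replaces with raw counts):

```python
from collections import Counter

def chernoff(p, q, alpha=0.1):
    """Chernoff coefficient C_alpha(P || Q) = sum_k p_k^alpha * q_k^(1-alpha)."""
    keys = set(p) | set(q)
    return sum((p.get(k, 0.0) ** alpha) * (q.get(k, 0.0) ** (1 - alpha))
               for k in keys)

def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def compound_divergence(train_compounds, test_compounds, alpha=0.1):
    """D_C = 1 - C_alpha(F_train || F_test) over compound frequency
    distributions."""
    p = normalize(Counter(train_compounds))
    q = normalize(Counter(test_compounds))
    return 1.0 - chernoff(p, q, alpha)
```

Identical compound distributions give a divergence of 0, and fully disjoint ones give 1.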
For TMCD, we constrain atom divergence by requiring that every atom appear at least once in the training set. An atom constraint is desirable so that the model knows the possible target atoms to generate. A greedy algorithm similar to the one of Keysers et al. (2020) is used to generate splits that approximately maximize compound divergence. First, we randomly split the dataset. Then, we swap examples until the atom constraint is satisfied. Finally, we sequentially identify example pairs that can be swapped between the train and test sets to increase compound divergence without violating the atom constraint, breaking when a swap can no longer be identified.
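The swap loop described above can be sketched as follows. `divergence_fn` and `atoms_fn` are injected stand-ins for the compound divergence and atom definitions in the text, and this O(n²)-per-pass version omits the efficiency measures a real implementation would need:

```python
def greedy_swap_pass(train, test, divergence_fn, atoms_fn, max_passes=10):
    """Greedy TMCD-style swapping: repeatedly swap a train example with a
    test example whenever the swap increases divergence_fn(train, test)
    while every atom in the test set remains covered by the training set."""
    def atoms_covered():
        seen = set().union(set(), *(atoms_fn(x) for x in train))
        return all(atoms_fn(x) <= seen for x in test)

    for _ in range(max_passes):
        best = divergence_fn(train, test)
        improved = False
        for i in range(len(train)):
            for j in range(len(test)):
                train[i], test[j] = test[j], train[i]  # tentative swap
                d = divergence_fn(train, test)
                if d > best and atoms_covered():
                    best, improved = d, True  # keep the swap
                else:
                    train[i], test[j] = test[j], train[i]  # undo the swap
        if not improved:
            break
    return train, test
```

The loop terminates when a full pass finds no divergence-increasing swap, mirroring the stopping condition in the text.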

Proposed Approach: NQG-T5
We propose NQG-T5, a hybrid semantic parser that combines a grammar-based approach with a seq2seq model. The two components are motivated by prior work focusing on compositional generalization and natural language variation, respectively, and we show in § 5 that their combination sets a strong baseline for our challenge.
The grammar-based component, NQG, consists of a discriminative Neural parsing model and a flexible Quasi-synchronous Grammar induction algorithm which can operate over arbitrary pairs of strings. Like other grammar-based approaches, NQG can fail to produce an output for certain inputs. As visualized in Figure 3, in cases where NQG fails to produce an output, we return the output from T5 (Raffel et al., 2020), a pre-trained seq2seq model. This simple combination can work well because NQG often has higher precision than T5 for cases where it produces an output, especially in out-of-distribution settings.
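The control flow of the hybrid is deliberately trivial. A sketch, where the two function arguments are placeholders for the separately trained components:

```python
def nqg_t5_predict(x, nqg_parse, t5_generate):
    """Return the high-precision grammar-based parse when one exists;
    otherwise fall back to the seq2seq model's output."""
    y = nqg_parse(x)  # assumed to return None when no derivation is found
    return y if y is not None else t5_generate(x)
```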
We train NQG and T5 separately. Training data for both components consists of pairs of source and target strings, referred to as x and y, respectively.

Figure 3: Overview of how predictions are generated by NQG-T5, a simple yet effective combination of T5 (Raffel et al., 2020) with a high-precision grammar-based approach, NQG.

NQG combines a QCFG induction algorithm with a neural parsing model. Training is a two-stage process. First, we employ a compression-based grammar induction technique to construct our grammar. Second, based on the induced grammar, we build the NQG semantic parsing model via a discriminative latent variable model, using a powerful neural encoder to score grammar rule applications anchored in the source string x.

NQG Grammar Induction
Grammar Formalism Synchronous context-free grammars (SCFGs) synchronously generate strings in both a source and target language. Compared to related work based on SCFGs for machine translation (Chiang, 2007) and semantic parsing, NQG uses a slightly more general grammar formalism that allows repetition of a non-terminal with the same index on the target side. Therefore, we adopt the terminology of quasi-synchronous context-free grammars (Smith and Eisner, 2006), or QCFGs, to refer to our induced grammar G. Our grammar G contains a single non-terminal symbol, NT. We restrict source rules to ones containing at most 2 non-terminal symbols, and do not allow unary productions as source rules. This enables efficient parsing using an algorithm similar to CKY (Cocke, 1969; Kasami, 1965; Younger, 1967) that does not require binarization of the grammar.
Induction Procedure To induce G from the training data, we propose a QCFG induction algorithm that does not rely on task-specific heuristics or pre-computed word alignments. Notably, our approach makes no explicit assumptions about the source or target languages, beyond those implicit in the QCFG formalism. Table 1 shows examples of induced rules. Our grammar induction algorithm is guided by the principle of Occam's razor, which leads us to seek the smallest, simplest grammar that explains the data well. We follow the Minimum Description Length (MDL) principle (Rissanen, 1978; Grunwald, 2004) as a way to formalize this intuition. Specifically, we use standard two-part codes to compute description length, where we are interested in an encoding of targets y given the inputs x, across a dataset D consisting of these pairs. A two-part code encodes the model and the targets encoded using the model; the two parts measure the simplicity of the model and the extent to which it can explain the data, respectively. For grammar induction, our model is simply our grammar, G. The codelength can therefore be expressed as

H(G) − Σ_{⟨x,y⟩∈D} log₂ P_G(y | x),

where H(G) corresponds to the codelength of some encoding of G. We approximate H(G) by counting terminal (C_T) and non-terminal (C_N) symbols in the grammar's rules, R. For P_G, we assume a uniform distribution over the set of possible derivations. As the only mutable aspect of the grammar during induction is the set of rules R, we abuse notation slightly and write our approximate codelength objective as a function of R only:

L(R) = l_N · C_N(R) + l_T · C_T(R) + Σ_{⟨x,y⟩∈D} log₂ ( |Z^G_{x,*}| / |Z^G_{x,y}| ),

where Z^G_{x,y} is the set of all derivations in G that yield the pair of strings ⟨x, y⟩, while Z^G_{x,*} ⊃ Z^G_{x,y} is the set of derivations that yield source string x and any target string. The constants l_N and l_T can be interpreted as the average bitlength for encoding non-terminal and terminal symbols, respectively. In practice, these are treated as hyperparameters.
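Under the uniform-derivation assumption, the two-part codelength is straightforward to evaluate once derivation counts are available. A sketch, where the rule encoding and the precomputed counts are simplifying assumptions of this example:

```python
import math

def codelength(rules, derivation_counts, l_nt=1.0, l_t=8.0):
    """Approximate two-part MDL codelength of a grammar.

    rules: (source_tokens, target_tokens) pairs; tokens starting with "NT"
      are non-terminals.
    derivation_counts: per training example, a pair (|Z_{x,y}|, |Z_{x,*}|)
      of derivation counts.
    """
    # Model cost: bits to encode every symbol in every rule.
    model_bits = sum(l_nt if tok.startswith("NT") else l_t
                     for src, tgt in rules for tok in src + tgt)
    # Data cost: -log2 P(y|x) under a uniform distribution over derivations.
    data_bits = sum(math.log2(n_any / n_xy)
                    for n_xy, n_any in derivation_counts)
    return model_bits + data_bits
```

Adding a rule raises the model cost but can lower the data cost by letting other rules be removed, which is the trade-off the greedy search exploits.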
We use a greedy search algorithm to find a grammar that approximately minimizes this codelength objective. We initialize G by creating a rule N T → x, y for every training example (x, y). By construction, the initial grammar perfectly fits the training data, but is also very large. Our algorithm iteratively identifies a rule that can be added to G that decreases our codelength objective by enabling ≥ 1 rule(s) to be removed, under the invariant constraint that G can still derive all training examples. The search completes when no rule that decreases the objective can be identified. In practice, we use several approximations to efficiently select a rule at each iteration. Additional details regarding the grammar induction algorithm are described in Appendix A.2.

NQG Semantic Parsing Model
Based on the induced grammar G, we train a discriminative latent variable parsing model, using a method similar to that of Blunsom et al. (2008). We define p(y | x) as:

p(y | x) = Σ_{z ∈ Z^G_{x,y}} p(z | x),

where Z^G_{x,y} is the set of derivations of ⟨x, y⟩ in G. We define p(z | x) as:

p(z | x) = exp(s(z, x)) / Σ_{z′ ∈ Z^G_{x,*}} exp(s(z′, x)),

where s(z, x) is a derivation score and the denominator is a global partition function. Similarly to the Neural CRF model of Durrett and Klein (2015), the scores decompose over anchored rules. Unlike Durrett and Klein (2015), we compute these scores based on contextualized representations from a BERT (Devlin et al., 2019) encoder. Additional details regarding the model architecture can be found in Appendix A.3. At training time, we use a Maximum Marginal Likelihood (MML) objective. We preprocess each example to produce parse forest representations for both Z^G_{x,y} and Z^G_{x,*}, which correspond to the numerator and denominator of our MML objective, respectively. By using dynamic programming to efficiently sum derivation scores inside the training loop, we can compute the exact MML objective without requiring approximations such as beam search.
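For a single example, the MML objective reduces to the difference of two log-sums over derivation scores. A minimal sketch under the simplifying assumption that derivation scores are enumerated explicitly (real training sums these quantities with dynamic programming over parse forests instead):

```python
import math

def mml_loss(scores_xy, scores_x_any):
    """-log p(y|x), where p(y|x) is the total exponentiated score of
    derivations yielding (x, y) divided by that of all derivations of x."""
    log_num = math.log(sum(math.exp(s) for s in scores_xy))
    log_den = math.log(sum(math.exp(s) for s in scores_x_any))
    return log_den - log_num
```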
At inference time, we select the highest scoring derivation using an algorithm similar to CKY that considers anchored rule scores generated by the neural parsing model. We output the corresponding target if it can be derived by a CFG defining valid target constructions for the given task.

NQG Discussion
We note that NQG is closely related to work that uses synchronous grammars for hierarchical statistical machine translation, such as Hiero (Chiang, 2007). Unlike Hiero, NQG does not rely on an additional word alignment component. Moreover, Hiero simply uses relative frequency to learn rule weights. Additionally, in contrast with traditional SCFG models for machine translation applied to semantic parsing (Wong and Mooney, 2006; Andreas et al., 2013), our neural model conditions on global context from the source x via contextual word embeddings, and our grammar's rules do not need to carry source context to aid disambiguation.

Experiments
We evaluate existing approaches and the newly proposed NQG-T5 across a diverse set of evaluations to assess compositional generalization and handling of natural language variation. We aim to understand how the approaches compare to each other for each type of evaluation and in aggregate, and how the performance of a single approach may vary across different evaluation types.

Experiments on SCAN and GEOQUERY
For our main experiments, we focus on evaluation across multiple splits of two datasets with compositional queries: SCAN (Lake and Baroni, 2018) and GEOQUERY (Zelle and Mooney, 1996;Tang and Mooney, 2001). The two datasets have been widely used to study compositional generalization and robustness to natural language variation, respectively. Both datasets are closed-domain and have outputs with straightforward syntax, enabling us to make clear comparisons between synthetic vs. non-synthetic setups.
Approaches For NQG-T5, to assess the effect of model size, we compare two sizes of the underlying T5 model: Base (220 million parameters) and 3B (3 billion parameters). To evaluate NQG individually, we treat any example where no output is provided as incorrect when computing accuracy.
We select strong approaches from prior work that have performed well in at least one setting. We group them into two families of approaches described in Figure 1. First, for general-purpose models that have shown strong ability to handle natural language variation, we consider T5, a pretrained seq2seq model, in both Base and 3B sizes.
Second, for specialized methods with strong compositional biases, we consider approaches that have been developed for SCAN. Some previous approaches for SCAN require task-specific information such as the mapping of atoms (Lake, 2019; Gordon et al., 2019) or a grammar mimicking the training data (Nye et al., 2020), and as such are difficult to adapt to non-synthetic datasets. Among the approaches that do not need task-specific resources, we evaluate two models with publicly available code: Syntactic Attention (Russin et al., 2019) and CGPS (Li et al., 2019). We report results on SCAN from the original papers as well as new results on our proposed data splits.
Datasets For the SCAN dataset, we evaluate using the length split and two primitive splits, jump and turn left, included in the original dataset (Lake and Baroni, 2018). We also evaluate using the SCAN MCD splits from Keysers et al. (2020). GEOQUERY (Zelle and Mooney, 1996) contains natural language questions about US geography. Similarly to prior work (Dong and Lapata, 2016, 2018), we replace entity mentions with placeholders. We use a variant of Functional Query Language (FunQL) as the target representation (Kate et al., 2005). In addition to the standard split of Zettlemoyer and Collins (2005), we evaluate on template, TMCD, and length splits of GEOQUERY. We report exact-match accuracy for both datasets. Hyperparameters and pre-processing details can be found in Appendix B.

Results
The results are presented in Table 2. The results for T5 on SCAN are from Furrer et al. (2020). Additionally, we include results for GECA (Andreas, 2020), a data augmentation method, as well as LANE (Liu et al., 2020) and NSSM (Chen et al., 2020). We also compare with SpanBasedSP (Herzig and Berant, 2020).
From the results, we first note that the relative performance of approaches on compositional splits of SCAN is not very predictive of their relative performance on compositional splits of GEOQUERY. For example, CGPS is better than T5 on the length split of SCAN but is significantly worse than T5 on the length split of GEOQUERY. Similarly, the ranking of most methods is different on the (T)MCD splits of the two datasets. Second, the proposed NQG-T5 approach combines the strengths of T5 and NQG to achieve superior results across all evaluations. It improves over T5 on compositional generalization for both synthetic and non-synthetic data while maintaining T5's performance on handling in-distribution natural language variation, leading to an average rank of 1.0 compared to 2.9 for T5. (To the best of our knowledge, both T5 and NQG-T5 achieve new state-of-the-art accuracy on the standard split of GEOQUERY.) Finally, we note that there is substantial room for improvement on handling both compositional generalization and natural language variation.

Experiments on SPIDER
We now compare the approaches on SPIDER (Yu et al., 2018), a non-synthetic text-to-SQL dataset that includes the further challenges of schema linking and modeling complex SQL syntax.
SPIDER contains 10,181 questions and 5,693 unique SQL queries across 138 domains. The primary evaluation is in the cross-database setting, where models are evaluated on examples for databases not seen during training. The primary challenge in this setting is generalization to new database schemas, which is not our focus. Therefore, we use a setting where the databases are shared between train and test examples. We generate splits following Suhr et al. (2020), and use SPIDER-SSP to refer to these same-database splits and SPIDER-XSP to refer to the standard cross-database setting. We prepend the name of the target database to the source sequence. For T5, we also serialize the database schema as a string and append it to the source sequence, similarly to Suhr et al. (2020). We report exact set match without values, the standard SPIDER evaluation metric (Yu et al., 2018).

Table 3 shows the results of T5 and NQG-T5 on different splits of SPIDER-SSP. We also show T5-Base performance without the schema string appended. The text-to-SQL mapping is not well modeled by NQG. Nevertheless, the performance of NQG-T5 is competitive with T5, indicating a strength of the hybrid approach. Table 4 shows the results on SPIDER-XSP, which focuses on handling unseen schema rather than compositional generalization. To our surprise, T5-3B proves to be competitive with the state-of-the-art (Choi et al., 2020) for approaches without access to database contents beyond the table and column names. As NQG-T5 simply uses T5's output when the induced grammar lacks coverage, it too is competitive. Length splits are also very challenging, but they lead to a more predictable error pattern for seq2seq models, as discussed next.

T5 Analysis
We analyze NQG-T5's components, starting with T5. On length splits, there is a consistent pattern to the errors: T5's outputs on the test set are not significantly longer than the maximum length observed during training, leading to poor performance. This phenomenon was explored by Newman et al. (2020). Diagnosing the large generalization gap on the (T)MCD splits is more challenging, but we noticed several error patterns. For T5-Base on the GEOQUERY TMCD split, in 52 of the 201 incorrect predictions (26%), the first incorrectly predicted symbol occurs when the gold symbol has 0 probability under a trigram language model fit to the training data. This suggests that the decoder's implicit target language model might have over-fitted to the distribution of target sequences in the training data, hampering its ability to generate novel compositions. Non-exclusively with these errors, 53% of the incorrect predictions occur when the gold target contains an atom that is seen in only 1 example during training, suggesting that T5 struggles with single-shot learning of new atoms. In other cases, the errors appear to reflect over-fitting to spurious correlations between inputs and outputs.
Some error examples are shown in Appendix B.6.
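The trigram check used in this analysis can be reproduced in a few lines. A sketch, assuming whitespace-tokenized targets; smoothing and the exact tokenization are simplifications of this example:

```python
from collections import defaultdict

def trigram_counts(targets):
    """Map each bigram context to the set of next tokens seen in training."""
    counts = defaultdict(set)
    for t in targets:
        toks = ["<s>", "<s>"] + t.split() + ["</s>"]
        for i in range(2, len(toks)):
            counts[(toks[i - 2], toks[i - 1])].add(toks[i])
    return counts

def first_zero_prob_position(gold, counts):
    """Index of the first gold token never seen after its bigram context
    in training (i.e. zero probability under an unsmoothed trigram LM),
    or -1 if every token was seen in context."""
    toks = ["<s>", "<s>"] + gold.split()
    for i in range(2, len(toks)):
        if toks[i] not in counts.get((toks[i - 2], toks[i - 1]), set()):
            return i - 2
    return -1
```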

NQG Analysis
To analyze NQG, we compute its coverage (fraction of examples where NQG produces an output) and precision (fraction of examples with a correct output among ones where an output is produced) on different data splits. The results in Table 5 show that NQG has high precision but struggles at coverage on some data splits.
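Coverage and precision as used here are simple to compute. A sketch, with `None` marking examples where the grammar produced no output:

```python
def coverage_and_precision(predictions, golds):
    """Coverage: fraction of examples with any output. Precision: fraction
    of correct outputs among the examples where an output was produced."""
    produced = [(p, g) for p, g in zip(predictions, golds) if p is not None]
    coverage = len(produced) / len(golds)
    precision = (sum(p == g for p, g in produced) / len(produced)
                 if produced else 0.0)
    return coverage, precision
```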
There is a significant difference in the effectiveness of the grammar induction procedure among the three datasets. Induction is particularly unsuccessful for SPIDER, as SQL has complicated syntax and often requires complex coordination across discontinuous clauses. Most of the induced rules are limited to simply replacing table and column names or value literals with non-terminals, such as the rule shown in Table 1, rather than representing nested sub-structures. The degree of span-to-span correspondence between natural language and SQL is seemingly lower than for other formalisms such as FunQL, which limits the effectiveness of grammar induction. Intermediate representations for SQL such as SemQL (Guo et al., 2019) may help increase the correspondence between source and target syntax.
For both GEOQUERY and SPIDER, NQG is limited by the expressiveness of QCFGs and the simple greedy search procedure used for grammar induction, which can lead to sub-optimal approximations of the induction objective. Notably, QCFGs cannot directly represent relations between source strings, such as semantic similarity, or relations between target strings, such as logical equivalence (e.g. intersect(a,b) ⇔ intersect(b,a)), that could enable greater generalization. However, such extensions pose additional scalability challenges, requiring new research in more flexible approaches for both learning and inference.

Conclusions
Our experiments and analysis demonstrate that NQG and T5 offer different strengths. NQG generally has higher precision for out-of-distribution examples, but is limited by the syntactic constraints of the grammar formalism and by requiring exact lexical overlap with induced rules in order to provide a derivation at inference time. T5's coverage is not limited by such constraints, but precision can be significantly lower for out-of-distribution examples. With NQG-T5, we offer a simple combination of these strengths. While accuracy is still limited for out-of-distribution examples where NQG lacks coverage, we believe it sets a strong and simple baseline for future work.
More broadly, our work highlights that evaluating on a diverse set of benchmarks is important, and that handling both out-of-distribution compositional generalization and natural language variation remains an open challenge for semantic parsing.

Ethical Considerations
This paper proposed to expand the set of benchmarks used to evaluate compositional generalization in semantic parsing. We hope that ensuring semantic parsing approaches perform well across a diverse set of evaluations, including ones that test out-of-distribution compositional generalization, will lead to systems that generalize better to languages not well represented in small training sets. However, we have only evaluated our methods on semantic parsing datasets in English.
Our NQG-T5 method uses a pre-trained T5 model, which is computationally expensive in fine-tuning and inference, especially for larger models (see Appendix B.1 for details on running time and compute architecture). Our method does not require pre-training of large models, as it uses pre-existing model releases. NQG-T5-Base outperforms or is comparable in accuracy to T5-3B on the non-SQL datasets, leading to relative savings of computational resources.

In this appendix we describe the NQG grammar induction algorithm and parsing model in detail, starting with relevant background and notation for QCFGs.

A.1 Background: SCFGs and QCFGs
Synchronous Context-Free Grammars (SCFGs) have been used to model the hierarchical mapping between pairs of strings in areas such as compiler theory (Aho and Ullman, 1972) and natural language processing. Informally, SCFGs can be viewed as an extension of Context-Free Grammars (CFGs) that synchronously generate strings in both a source and target language. We write SCFG rules as S → ⟨α, β⟩, where S is a non-terminal symbol, and α and β are strings of non-terminal and terminal symbols. An SCFG rule can be viewed as two CFG rules, S → α and S → β, with a pairing between the occurrences of non-terminal symbols in α and β. This pairing is indicated by assigning each non-terminal in α and β an index ∈ ℕ. Non-terminals sharing the same index are called linked. Following convention, we denote the index for a non-terminal using a boxed subscript, e.g. NT[1]. A complete SCFG derivation is a pair of parse trees, one for the source language and one for the target language. An example derivation is shown in Figure 4.
The ⇒_r operator denotes a derives relation: ⟨α_1, β_1⟩ ⇒_r ⟨α_2, β_2⟩ states that the string pair ⟨α_2, β_2⟩ can be generated from ⟨α_1, β_1⟩ by applying the rule r. We write ⇒ to leave the rule unspecified, assuming the set of possible rules is clear from context. We write ⇒⇒ to indicate a chain of two rule applications, omitting the intermediate string pair. Finally, we write ⇒* to denote the reflexive transitive closure of ⇒.
Quasi-Synchronous Context-Free Grammars (QCFGs) generalize SCFGs in various ways, notably relaxing the restriction of a strict one-to-one alignment between source and target non-terminals (Smith and Eisner, 2006).
Compositionality Notably, grammar formalisms such as SCFGs and QCFGs capture the formal notion of the principle of compositionality as a homomorphism between source and target structures (Montague, 1970; Janssen and Partee, 1997).

A.2 NQG Grammar Induction Details
Having defined the codelength scoring function that we use to compare grammars in section 4.1.1, we now describe our greedy search algorithm for finding a grammar that approximately minimizes this objective.14

Initialization We initialize R to {NT → ⟨x, y⟩ | ⟨x, y⟩ ∈ D}. We also add identity rules for substrings that exactly match between source and target examples, e.g. NT → ⟨k, k⟩ where k is a substring of both x and y for some ⟨x, y⟩ ∈ D.15

Optimization Algorithm Our algorithm was designed with simplicity in mind, and therefore uses a simple greedy search process that future work could likely improve upon significantly. At a high level, the greedy algorithm iteratively identifies a rule whose addition to R decreases the codelength by enabling one or more rules in R to be removed, while maintaining the invariant that G can derive all of the training examples, i.e. ⟨NT, NT⟩ ⇒* ⟨x, y⟩ for every ⟨x, y⟩ ∈ D. The search completes when no rule that decreases L(R) can be identified.
To describe the implementation, we first define several operations over rules and sets of rules. First, we define the set of rules that can be derived from a given set of rules, R. Second, we define an operation SPLIT that generates the possible ways of splitting a rule into two new rules, g and h, such that the pair maintains the invariant that ⟨NT, NT⟩ ⇒* ⟨x, y⟩ for every ⟨x, y⟩ ∈ D, even if the original rule is eliminated.16 SPLIT can be implemented by considering pairs of substrings in α and β to replace with a new indexed non-terminal symbol. For example, the rule NT → ⟨largest state, largest ( state )⟩ can be split into the rules NT → ⟨largest NT[1], largest ( NT[1] )⟩ and NT → ⟨state, state⟩. This step can require re-indexing of non-terminals.

14 The induction objective contains hyperparameters representing the bitlength of terminal and non-terminal symbols. For all experiments we use l_N = 1. For GEOQUERY and SPIDER we use l_T = 8, and for SCAN we use l_T = 32.

15 These initialization rules are used for GEOQUERY and SPIDER; SCAN does not contain any exact token overlap between source and target languages.

[Figure: example induced rules, e.g. NT → ⟨how many NT[1] pass through NT[2], answer ( count ( intersection ( NT[1] , loc_1 ( NT[2] ) ) ) )⟩; NT → ⟨rivers, river⟩; NT → ⟨the largest NT[1], largest ( NT[1] )⟩]
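A minimal sketch of SPLIT, restricted to rules whose right-hand sides contain only terminals (our own simplification; the full operation must also handle rules that already contain non-terminals and perform re-indexing):

```python
def split_rule(source, target, new_index):
    """Sketch of SPLIT for a rule with terminal-only right-hand sides:
    yield pairs (g, h) where g replaces a (source-span, target-span) pair
    with a fresh linked non-terminal and h derives the extracted spans."""
    nt = ("NT", new_index)
    results = []
    for i in range(len(source)):
        for j in range(i + 1, len(source) + 1):
            for k in range(len(target)):
                for l in range(k + 1, len(target) + 1):
                    if (i, j) == (0, len(source)) and (k, l) == (0, len(target)):
                        continue  # would just rename the whole rule
                    g = (source[:i] + (nt,) + source[j:],
                         target[:k] + (nt,) + target[l:])
                    h = (source[i:j], target[k:l])
                    results.append((g, h))
    return results

# The example from the text: splitting NT -> <largest state, largest ( state )>
splits = split_rule(("largest", "state"), ("largest", "(", "state", ")"), 1)
target_split = ((("largest", ("NT", 1)), ("largest", "(", ("NT", 1), ")")),
                (("state",), ("state",)))
print(target_split in splits)  # True
```

As the nested loops suggest, the number of candidates is polynomial in the source and target lengths, which motivates the restriction to shorter examples described below.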
During our greedy search, we only split rules when one of the two resulting rules can already be derived given R. We therefore define a function NEW that returns the set of candidate rules to consider. Similarly, we can compute the set of rules, f, that are made redundant, and can therefore be eliminated, by introducing one of these candidate rules.

16 We optionally allow SPLIT to introduce repeated target non-terminals when the target string has repeated substrings. Otherwise, we do not allow SPLIT to replace a repeated substring with a non-terminal, as this can lead to an ambiguous choice. We enable this option for SCAN and SPIDER but not for GEOQUERY, as FunQL does not require such repetitions.
We can then define the codelength reduction, ∆L, of adding a particular rule,17 and finally select the rule with the largest −∆L. Conceptually, after initialization, the algorithm then proceeds by repeatedly adding the selected rule and removing the rules it makes redundant, until no further reduction is possible. For efficiency, we select the shortest N examples from the training dataset, and only consider these during the induction procedure. Avoiding longer examples is helpful, as the number of candidates returned by SPLIT is polynomial with respect to source and target length. Once induction has completed, we then determine which of the longer examples cannot be derived based on the set of induced rules, and add rules for these examples.18 Our algorithm maintains a significant amount of state between iterations to cache computations that are not affected by particular rule changes, based on overlap in terminal symbols. We developed the algorithm and selected some hyperparameters by assessing the size of the induced grammars over the training sets of SCAN and GEOQUERY.
Our grammar induction algorithm is similar to the transduction grammar induction method for machine translation by Saers et al. (2013). More broadly, compression-based criteria have been successfully used by a variety of models for language (Grünwald, 1995;Tang and Mooney, 2001;Ravi and Knight, 2009;Poon et al., 2009).

A.3 NQG Parsing Model Details
In this section we provide details on how we generate derivation scores, s(z, x), using a neural model, as introduced in § 4.1. The derivation scores decompose over anchored rules from our grammar:

s(z, x) = Σ_{(r, i, j) ∈ z} φ(r, i, j, x),

where r is an index for a rule in G and i and j are indices defining the anchoring in x. The anchored rule scores, φ(r, i, j, x), are based on contextualized representations from a BERT (Devlin et al., 2019) encoder: the score combines [w_i, w_j], the concatenation of the BERT representations for the first and last wordpiece in the anchored span; f_r, a feed-forward network with hidden size d that outputs a vector ∈ R^d; f_s, a feed-forward network with hidden size d that outputs a scalar; and e_r, an embedding ∈ R^d for the rule index r. Our formulation for encoding spans is similar to that used in other neural span-factored models (Stern et al., 2017; Lee et al., 2017).
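The span-factored scoring can be sketched as follows, with random weights standing in for trained parameters. The text above specifies the shapes of f_r, f_s, and e_r but not exactly how they are combined, so the additive composition below is our assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, bert_dim, num_rules = 256, 128, 10
W_r = rng.normal(size=(2 * bert_dim, d))  # f_r: span features -> R^d
W_s = rng.normal(size=(d,))               # f_s: R^d -> scalar
E = rng.normal(size=(num_rules, d))       # e_r: embedding for rule index r

def phi(w, r, i, j):
    """Score rule r anchored over the span from wordpiece i to wordpiece j,
    given contextualized representations w of shape [seq_len, bert_dim]."""
    span = np.concatenate([w[i], w[j]])     # [w_i, w_j]
    h = np.maximum(span @ W_r, 0.0) + E[r]  # f_r output combined with e_r (assumed)
    return float(h @ W_s)                   # f_s -> scalar score

def derivation_score(w, anchored_rules):
    # s(z, x) decomposes as a sum over the anchored rules (r, i, j) in z.
    return sum(phi(w, r, i, j) for (r, i, j) in anchored_rules)

w = rng.normal(size=(12, bert_dim))  # stand-in for BERT encoder outputs
score = derivation_score(w, [(0, 0, 3), (1, 4, 7)])
```

Because the score factors over anchored rules, derivations can be scored incrementally during chart parsing without re-encoding the input.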

B.1 Model Hyperparameters and Runtime
We selected reasonable hyperparameter values and performed some minimal hyperparameter tuning for T5 and NQG based on random splits of the training sets for GEOQUERY and SPIDER. We used the same hyperparameters for all splits of a given dataset.

18 We use N = 500 for SCAN and N = 1000 for SPIDER. As the GEOQUERY training set contains fewer than 500 unique examples, we use the entire training set.
For T5, we selected a learning rate of 1e-4 from {1e-3, 1e-4, 1e-5}, which we used for all experiments. Otherwise, we used the default hyperparameters for fine-tuning. We fine-tune for 3,000 steps for GEOQUERY and 10,000 for SPIDER. T5-Base trained with a learning rate of 1e-4 reached 94.2% accuracy at 3,000 steps on a random split of the standard GEOQUERY training set into 500 training and 100 validation examples.
For the NQG neural model, we use the pre-trained BERT Tiny model of Turc et al. (2019) (4.4M parameters) for SCAN and SPIDER, and BERT Base (110.1M parameters) for GEOQUERY, where there is more headroom for improved scoring. We do not freeze the pre-trained BERT parameters during training. For all experiments, we use d = 256 dimensions for computing anchored rule scores, fine-tune for 256 steps, use a learning rate of 1e-4, and use a batch size of 256.
We train NQG on 8 V100 GPUs. Training NQG takes < 5 minutes for SCAN and SPIDER (BERT Tiny), and up to 90 minutes for GEOQUERY (BERT Base). We fine-tune T5 on 32 Cloud TPU v3 cores. For GEOQUERY, fine-tuning T5 takes approximately 5 and 37 hours for Base and 3B, respectively. For SPIDER, fine-tuning T5 takes approximately 5 and 77 hours for Base and 3B, respectively.

B.2 Dataset Preprocessing
For GEOQUERY, we use the version of the dataset with variable-free FunQL logical forms (Kate et al., 2005), and expand certain functions based on their logical definitions, such that state(next_to_1(state(all))) becomes the more conventional intersection(state, next_to_1(state)).
We replace entity mentions with placeholders (e.g. "m0", "m1") in both the source and target. For SPIDER, we prepend the name of the target database to the source sequence. For T5, we also serialize the database schema as a string and append it to the source sequence, similarly to Suhr et al. (2020). This schema string contains the names of all tables in the database, and the names of the columns for each table. As we use a maximum source sequence length of 512 for T5, this leads to some schema strings being truncated (affecting about 5% of training examples). SCAN did not require any dataset-specific preprocessing.

Table 7: Example T5-Base errors on the GEOQUERY TMCD split.

Source: how many states are next to major rivers
Target: answer ( count ( intersection ( state , next_to_2 ( intersection ( major , river ) ) ) ) )
Prediction: answer ( count ( intersection ( state , next_to_2 ( intersection ( major , intersection ( river , m0 ) ) ) ) ) )
Notes: The trigram "major , intersection" occurs 28 times during training, but "major , river" occurs 0 times. In this case, T5 also hallucinates "m0" despite no entity placeholder occurring in the source.

Source: which state has the highest peak in the country
Target: answer ( intersection ( state , loc_1 ( highest ( place ) ) ) )
Prediction: answer ( highest ( intersection ( state , loc_2 ( highest ( intersection ( mountain , loc_2 ( m0 ) ) ) ) ) )
Notes: The token "highest" occurs after "answer (" in 83% of instances in which "highest" occurs in the training set. Note that T5 also hallucinates "m0" in this case.
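The SPIDER input serialization described in this section can be sketched as follows. The exact separators and schema string format are not specified above, so the "|" delimiter and whitespace tokenization below are illustrative assumptions only:

```python
def serialize_input(question, database, schema, max_len=512):
    """Prepend the database name, append a flat schema string (table names
    followed by their column names), and truncate to a maximum length.
    Truncation here is over whitespace tokens for simplicity; T5 actually
    operates over sentencepiece tokens."""
    parts = [database, ":", question]
    for table, columns in schema.items():
        parts.extend(["|", table] + list(columns))
    return " ".join(parts).split()[:max_len]

tokens = serialize_input(
    "how many heads of the departments are older than 56 ?",
    "department_management",
    {"department": ["department_id", "name"], "head": ["head_id", "age"]},
)
```

For long schemas, everything past the `max_len` cutoff is silently dropped, which is the truncation effect noted above.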

B.3 Atom and Compound Definitions
For GEOQUERY, the tree structure of FunQL is given by explicit bracketing. We define atoms as individual FunQL symbols, and compounds as combinations between parent and child symbols in the FunQL tree. Example atoms are longest, river, and exclude, and example compounds are longest(river) and exclude(longest(_), _).
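A sketch of this extraction (our own reconstruction; the helper names are hypothetical, and compounds are simplified here to parent-child symbol pairs rather than the partially-instantiated forms shown above):

```python
import re

def parse_funql(s):
    """Parse e.g. 'longest(river)' into a (symbol, [children]) tree using
    the explicit bracketing of FunQL."""
    tokens = re.findall(r"[A-Za-z_0-9]+|\(|\)|,", s)
    pos = 0
    def parse():
        nonlocal pos
        sym = tokens[pos]; pos += 1
        children = []
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1  # consume "("
            while tokens[pos] != ")":
                if tokens[pos] == ",":
                    pos += 1
                    continue
                children.append(parse())
            pos += 1  # consume ")"
        return (sym, children)
    return parse()

def atoms_and_compounds(tree):
    """Atoms are individual symbols; compounds are parent-child pairs."""
    sym, children = tree
    atoms, compounds = {sym}, set()
    for child in children:
        compounds.add((sym, child[0]))
        child_atoms, child_compounds = atoms_and_compounds(child)
        atoms |= child_atoms
        compounds |= child_compounds
    return atoms, compounds

a, c = atoms_and_compounds(parse_funql("exclude(longest(river), state)"))
print(sorted(a))  # ['exclude', 'longest', 'river', 'state']
print(sorted(c))  # [('exclude', 'longest'), ('exclude', 'state'), ('longest', 'river')]
```

Atom and compound divergence between train and test splits can then be measured over these sets.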
For SPIDER, we tokenize the SQL string and define atoms as individual tokens. To define compounds, we parse the SQL string using an unambiguous CFG, and define compounds from the resulting parse tree. We define compounds over both first and second order edges in the resulting parse tree.

B.4 Grammar Sizes
Induced grammar sizes for a selected split of each dataset are shown in Table 8. For SPIDER, the number of induced rules is larger than the number of examples in the original dataset, due to the identity rules added during initialization.

B.5 GEOQUERY Variance
In Tables 2 and 5 we report the mean of 3 runs of NQG for GEOQUERY. The standard deviations for these runs are reported in Table 9. The reported standard deviations for NQG-T5 use the same fine-tuned T5 checkpoint, so they do not reflect any additional variance from different fine-tuned T5 checkpoints.

B.6 T5 GEOQUERY Errors
We include several example T5-Base errors on the GEOQUERY TMCD split in Table 7.