Unobserved Local Structures Make Compositional Generalization Hard

While recent work has shown that sequence-to-sequence models struggle to generalize to new compositions (termed compositional generalization), little is known about what makes compositional generalization hard on a particular test instance. In this work, we investigate the factors that make generalization to certain test instances challenging. We first substantiate that some examples are more difficult than others by showing that different models consistently fail or succeed on the same test instances. Then, we propose a criterion for the difficulty of an example: a test instance is hard if it contains a local structure that was not observed at training time. We formulate a simple decision rule based on this criterion and empirically show it predicts instance-level generalization well across 5 different semantic parsing datasets, substantially better than alternative decision rules. Last, we show local structures can be leveraged to create difficult adversarial compositional splits and to improve compositional generalization under limited training budgets by strategically selecting examples for the training set.


Introduction
Recent analyses of pre-trained sequence-to-sequence (seq2seq) models have revealed that they do not perform well in a compositional generalization setup, i.e., when tested on structures that were not observed at training time (Lake and Baroni, 2018; Keysers et al., 2020). However, while performance drops on average, models do not categorically fail in compositional setups, and in fact are often able to successfully emit new unseen structures. This raises a natural question: What are the conditions under which compositional generalization occurs in seq2seq models?
Measuring compositional generalization at the dataset level obscures the fact that for a particular test instance, performance depends on the instance's sub-structures and the examples observed during training. Consequently, it might be possible to predict how difficult a test instance is given the test instance and the training set. Indeed, papers that provide multiple compositional splits (Kim and Linzen, 2020; Bogin et al., 2021a) have demonstrated high variance in accuracy across splits.
In this paper, we investigate the question of what makes compositional generalization hard in the context of semantic parsing, the task of mapping natural language utterances to executable programs. First, we create a large set of compositional train/test data splits over multiple datasets, in which there is no overlap between the programs of the test set and the training set. We then fine-tune and evaluate multiple seq2seq models on these splits. Our first finding is that different models tend to agree on which test examples are difficult and which are not. This indicates example difficulty can be mostly explained by the example itself and the training data, independent of the model. This calls for a better characterization of what makes a test instance hard.
To this end, we analyze the factors that make compositional generalization hard at the instance level. We formulate a simple decision rule that predicts the difficulty of test instances across multiple splits and datasets. Our main observation is that a test instance is hard if it contains a local structure that was not observed at training time. An unobserved local structure is defined as a small connected sub-graph that occurs in the program of the test instance, but does not occur in the training set. Moreover, unobserved structures render an instance difficult only if there are no observed structures that are similar to the unobserved one, where similarity is defined through a simple distributional similarity metric. Fig. 1 presents two splits that contain a tree with the same local structure in the test set. Split 1 is easy because the training set contains similar local structures: exists is similar to all and some. Conversely, in Split 2 emitting the unobserved structure will be hard, as there are no similar observed structures in the training set.

Table 1: Dataset examples (COVR and Overnight rows).
COVR
  Utterance: What is the number of gray square cat that is looking at mouse?
  Program: count (with_relation (filter (gray, filter (square, find(cat))), looking_at, find (mouse)))
Overnight (synthetic †, natural •)
  † number of played games of player kobe bryant whose number of assists is 3
  • how many games has kobe bryant made more than 3 assists
  Program: (listValue (getProperty (filter (getProperty kobe_bryant (reverse player)) num_assists = 3) num_games_played))
We empirically evaluate our decision rule on five different datasets with diverse semantic formalisms, and show it predicts instance-level generalization well for both synthetic and natural language inputs, with an area under the curve (AUC) score ranging from 78.4 to 93.3 (across datasets), substantially outperforming alternative rules. Moreover, we compare our approach to MCD (Keysers et al., 2020), a metric that has been used to characterize difficulty at the dataset level (and not the instance level). We show our rule can be generalized to the dataset level and outperforms MCD in predicting the difficulty of various compositional splits. Last, we find that our decision rule applies not just to Transformers, but also to LSTM decoders.
With these insights, we use our decision rule for two purposes. First, we develop a method for creating difficult compositional splits by picking a set of similar local structures, and holding out instances that include any of these structures. We show that seq2seq models get much lower accuracy on these splits compared to prior approaches. Second, we propose a data-efficient approach for selecting training examples in a way that improves compositional generalization. Given a large pool of examples, we select a training set that maximizes the number of observed local structures. We show this leads to better compositional generalization for a fixed budget of training examples. We release code and data at https://github.com/benbogin/unobserved-local-structures.

Setup
We focus on semantic parsing, where the task is to parse an utterance x into an executable program z. In a compositional generalization setup, examples are split into a training and a test set, such that there is no overlap between programs in the two sets.
We use two methods to generate compositional splits. The first is the template split, proposed in Finegan-Dollak et al. (2018), where examples are randomly split based on an abstract version of their programs. Additionally, we propose a grammar split, which can be used whenever we have access to the context-free grammar rules that generate the dataset programs. A grammar split is advantageous because, unlike the template split, it creates meaningful splits where particular structures are held out. For details on these split methods see App. B.

Datasets
We use five different datasets, covering a wide variety of domains and semantic formalisms. Importantly, since we focus on compositional generalization, we choose datasets for which a random i.i.d. split yields high accuracy. Thus, errors in the compositional setup can be attributed to the compositional challenge, and not to confounding issues such as lexical alignments or ambiguity.
We consider both synthetic datasets, generated with a synchronous context-free grammar (SCFG) that generates utterance-program pairs, and datasets with natural language utterances. The datasets we use are (see Table 1 for examples):
COVR: A synthetic dataset based on Bogin et al. (2021a) with a variable-free functional language.
Overnight (Wang et al., 2015): Contains both synthetic and natural utterances covering multiple domains; uses the Lambda-DCS formal language (Liang et al., 2011). Examples are generated using an SCFG, providing synthetic utterances, which are then manually paraphrased to natural language.
Schema2QA (S2Q, Xu et al. 2020): Uses the ThingTalk language (Campagna et al., 2019). We only use the synthetic examples from the people domain, provided by Oren et al. (2021).
ATIS (Hemphill et al., 1990; Dahl et al., 1994): A dataset with natural language questions about aviation, and λ-calculus as its formal language.

Model Agreement across Examples
Our goal is to predict the difficulty of test instances. However, first we must confirm that (1) splits contain both easy and difficult instances and (2) different models agree on how difficult a specific instance is. To check this, we evaluate different models on multiple splits.

Splits
We experiment with the five datasets listed in §2. For COVR, we use all of the 124 grammar splits that were generated. For Overnight, we generate 5 template splits for each domain, using the same splits for the synthetic and the natural language versions. For ATIS and S2Q, we generate 5 template splits. We provide further details in App. B.3.

Models and experimental setting
We experiment with four pre-trained seq2seq models: T5-Base/Large and BART-Base/Large (Raffel et al., 2020; Lewis et al., 2020) and fine-tune them on each of the splits. See App. C for training details.

Table 2: [...] agreement rate for at least 3 models (3+/4) and all 4 models (4/4), and random agreement rate (4/4). For Overnight, we show results both on the synthetic and the natural language settings.
We evaluate performance with exact match (EM), that is, whether the predicted output is equal to the gold output. However, since EM is often too strict and results in false negatives (i.e., a predicted target could be different from the gold target but still semantically equivalent), we use a few manually defined relaxations (detailed in App. A). This is important for our analysis since we want to focus on meaningful failures rather than evaluation failures.
Results. Table 2 (top) shows average results across splits. We see that the accuracy of different models is roughly similar across datasets. Models tend to do well on most compositional splits, except in Overnight with natural language inputs. To measure agreement, we compute agreement rate, that is, the fraction of times all (or most) models either output a correct prediction or output an erroneous one. We observe that agreement rate is high (Table 2, middle): at least 3 models agree (3+/4) in 93.1%-96.4% of the cases, and all (4/4) models agree in 75.3%-86.8%. We compare this to random agreement rate (Table 2, bottom), where we compute agreement rate assuming that a model with accuracy p outputs a correct prediction for a random fraction p of the examples and an incorrect prediction for the rest. Agreement rate is dramatically higher than random agreement rate.
Importantly, the fact that models have a high agreement rate suggests that instance difficulty depends mostly on the instance itself and the training set, and not on the model.
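The agreement computation described above can be illustrated with a minimal sketch. It is not the released code: the function names and input layout are hypothetical, and the random agreement rate assumes that the models' random subsets of correct examples are drawn independently of each other.

```python
from itertools import product


def agreement_rates(correct):
    """correct[m][i] is True if model m gets test instance i right.

    Returns (rate where all but at most one model agree, rate where all agree).
    """
    n_models = len(correct)
    n_instances = len(correct[0])
    most_agree = all_agree = 0
    for i in range(n_instances):
        n_correct = sum(1 for model in correct if model[i])
        majority = max(n_correct, n_models - n_correct)
        if majority == n_models:
            all_agree += 1
        if majority >= n_models - 1:
            most_agree += 1
    return most_agree / n_instances, all_agree / n_instances


def random_agreement_rate(accuracies, min_agree):
    """Expected agreement if a model with accuracy p were correct on a random
    fraction p of the examples (assumption: independently across models)."""
    n_models = len(accuracies)
    total = 0.0
    # Enumerate every correct/incorrect outcome pattern over the models.
    for outcome in product([True, False], repeat=n_models):
        n_correct = sum(outcome)
        if max(n_correct, n_models - n_correct) >= min_agree:
            prob = 1.0
            for p, ok in zip(accuracies, outcome):
                prob *= p if ok else (1.0 - p)
            total += prob
    return total
```

With four models, `random_agreement_rate(accs, 4)` corresponds to the 4/4 setting and `random_agreement_rate(accs, 3)` to the 3+/4 setting.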

What makes an instance hard?
The results in §3 provide a set of test instances of various difficulties from a variety of compositional splits. We analyze these instances and propose a hypothesis for what makes generalization to a test instance difficult.

Unobserved Local Structure
We conjecture that a central factor in determining whether a test instance is difficult is whether its program contains any unobserved local structures. We represent a program as a graph, and a local structure as a connected sub-graph within it. We formally define these concepts next.
The output $z$ in our setting is a sequence of tokens which defines a program. Each token in $z$ represents a program symbol, which is a function or a value, except for structure tokens (namely, parentheses and commas) that define parent-child relations between function symbols and their arguments. We parse $z$ into a tree $T = (V, E)$, such that each node $v \in V$ is labeled by the symbol it represents in $z$, and the set of edges $E = \{(p, c)\}$ expresses parent-child relations between the nodes. We additionally add a root node <s> connected as a parent to the original root in $T$.
To also capture sibling relations, we define a graph based on the tree $T$ that contains an edge set $E_{sib}$ of sibling edges: $G = (V, E \cup E_{sib})$. Specifically, for each parent node $p$, the program $z$ induces an order over the children of $p$: $(c^p_1, \ldots, c^p_{N_p})$, where $N_p$ is the number of children. We then define $E_{sib} = \bigcup_{p} \{(c^p_i, c^p_{i+1})\}_{i=1}^{N_p - 1}$, that is, all consecutive siblings will be connected by edges. Fig. 2 (left) shows an example program $z$ and its graph $G$.
We define local structures as connected sub-graphs in $G$, with $2 \le n \le 4$ nodes, that have a particular structure, presented next. We term a local structure with $n$ nodes an n-LS. The set of 2-LSs refers to all pairs of parents and their children, and all pairs of consecutive siblings (Fig. 2, structures 1 and 2). The set of 3-LSs includes all 2-LSs, and also structures with (1) two parent-child relations, (2) two sibling relations and (3) a parent with two siblings (structures 3, 4 and 5 in the figure, respectively). The structure 4-LS is a natural extension of 2-LS and 3-LS, defined in App. D. Importantly, the structures we consider are local since they are connected: We do not consider, for example, a grandparent-grandchild pair, or nonconsecutive siblings.
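To make the definitions concrete, the following minimal sketch parses a program into a tree with an added <s> root and collects its 2-LSs. It is only an illustration and makes several assumptions: programs use the functional call syntax `f(a, g(b))` (as in COVR), and the names `parse_program` and `two_ls` as well as the tuple encoding of structures are hypothetical. Larger structures (3-LS, 4-LS) are not covered.

```python
import re


def parse_program(program):
    """Parse 'f(a, g(b))' into nested (label, children) tuples under a <s> root."""
    # Symbols are maximal runs without whitespace, parentheses, or commas.
    tokens = re.findall(r"[^\s(),]+|[(),]", program)
    pos = 0

    def parse_node():
        nonlocal pos
        label = tokens[pos]
        pos += 1
        children = []
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1  # consume "("
            while tokens[pos] != ")":
                if tokens[pos] == ",":
                    pos += 1  # skip argument separator
                    continue
                children.append(parse_node())
            pos += 1  # consume ")"
        return (label, children)

    return ("<s>", [parse_node()])


def two_ls(tree):
    """Collect all 2-LSs: parent-child pairs and consecutive-sibling pairs."""
    label, children = tree
    structures = set()
    for child in children:
        structures.add(("parent-child", label, child[0]))
        structures |= two_ls(child)
    for left, right in zip(children, children[1:]):
        structures.add(("sibling", left[0], right[0]))
    return structures
```

For instance, `two_ls(parse_program("count(filter(gray, find(cat)))"))` contains the parent-child pair (filter, find) and the sibling pair (gray, find), but no grandparent-grandchild pair such as (count, find).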

Similarity of Local Structures
Our hypothesis is that if a model observes a test instance with an unobserved local structure, this instance will be difficult. We relax this hypothesis and propose that an example might be easy even if it contains an unobserved local structure $s_1$, if the training set contains a similar structure $s_2$.
The similarity $\mathrm{sim}(s_1, s_2)$ of two structures is defined as follows. Two structures can have positive similarity only if (1) they are isomorphic, that is, they have the same number of nodes and the same types of edges between the nodes, and (2) they are identical up to flipping the label of a single node. In any other case, similarity will be 0.0. If all nodes are identical, similarity is 1.0. When the structures differ by the label of a single node, we define their similarity using the symbol similarity of the two symbols $m_1$ and $m_2$ that are different, which is computed using a distributional similarity metric $\mathrm{sim}(m_1, m_2)$.
To compute symbol similarity we use the set of programs $P_o$ in all training examples. We use this set to find the context that co-occurs with each symbol, that is, the set of symbols that have appeared as parents, children or siblings of a given symbol $m$. Specifically, we consider four types of contexts $c \in C$, including children, parents, left siblings, and right siblings. We define $\mathrm{ctx}_c(m)$ as the set of symbols that have appeared in the context $c$ of the symbol $m$. Finally, given two symbols $m_1$ and $m_2$, we average the Jaccard similarity across the set of relevant context types $\tilde{C} \subseteq C$:
$$\mathrm{sim}(m_1, m_2) = \frac{1}{|\tilde{C}|} \sum_{c \in \tilde{C}} \mathrm{jac}(\mathrm{ctx}_c(m_1), \mathrm{ctx}_c(m_2)),$$
where $\mathrm{jac}$ denotes Jaccard similarity and $\tilde{C}$ contains a context type $c$ iff the set of context symbols $\mathrm{ctx}_c(\cdot)$ is not empty for $m_1$ or $m_2$.
Consider the top example in Fig. 3. The token exists appears with 2 different parents: or and and. The token most appears with exactly the same set of parents, thus their "parent" similarity is 100%. Likewise, the "children" similarity of the two is 50% since they share 1 out of 2 distinct children (find). Finally, the similarity between the tokens is the average of the two types of similarities, 75.0% (for brevity, sibling contexts are not considered in the figure).
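The symbol similarity can be sketched in a few lines. This is a minimal illustration, assuming contexts have already been collected into a dictionary; the function names and the dictionary layout are hypothetical. The toy contexts in the usage note mirror the Fig. 3 example, except that the second child of most (here filter) is an assumption, since the text only names the shared child find.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def symbol_similarity(ctx, m1, m2):
    """Average Jaccard similarity over the relevant context types.

    ctx maps a symbol to {context type -> set of co-occurring symbols}.
    """
    context_types = ("parents", "children", "left_siblings", "right_siblings")
    scores = []
    for c in context_types:
        s1 = ctx.get(m1, {}).get(c, set())
        s2 = ctx.get(m2, {}).get(c, set())
        if s1 or s2:  # a context type is relevant if non-empty for m1 or m2
            scores.append(jaccard(s1, s2))
    return sum(scores) / len(scores) if scores else 0.0
```

With `ctx = {"exists": {"parents": {"or", "and"}, "children": {"find"}}, "most": {"parents": {"or", "and"}, "children": {"find", "filter"}}}`, the parent similarity is 1.0, the children similarity is 0.5, and `symbol_similarity(ctx, "exists", "most")` is their average, 0.75.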

Decision Rule
We can now use the similarity between structures to predict the difficulty of an unobserved program $p_u$ given a set of observed programs $P_o$. For exposition purposes, instead of predicting difficulty, we predict its complement: easiness.
We compute the easiness of a program $p_u$ by comparing its local structures with the local structures in $P_o$. Thus, we extract from $p_u$ the set $S_u$ of n-LSs as defined in §4.1, for a chosen $n$. Similarly, we extract the set $S_o$ of all n-LSs in the set of training programs $P_o$, and define the easiness:
$$\hat{e}(p_u) = \min_{s_u \in S_u} \max_{s_o \in S_o} \mathrm{sim}(s_u, s_o),$$
that is, the easiness of $p_u$ is determined based on the least easy (most difficult) unobserved structure in $S_u$. The easiness of a particular structure $s_u$ is determined by the structure in $S_o$ that is most similar to it.
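The decision rule can be sketched as follows. This is a minimal illustration rather than the released implementation: structures are encoded as hypothetical `(kind, labels)` tuples, where `kind` stands in for the isomorphism class, and the symbol similarity is passed in as a function.

```python
def structure_similarity(s1, s2, symbol_sim):
    """Similarity of two local structures encoded as (kind, labels) tuples."""
    kind1, labels1 = s1
    kind2, labels2 = s2
    if kind1 != kind2 or len(labels1) != len(labels2):
        return 0.0  # not isomorphic
    diffs = [(a, b) for a, b in zip(labels1, labels2) if a != b]
    if not diffs:
        return 1.0  # identical structures
    if len(diffs) == 1:
        return symbol_sim(*diffs[0])  # differ by the label of a single node
    return 0.0


def easiness(test_structures, train_structures, symbol_sim):
    """Min over test structures of the max similarity to any training structure."""
    if not test_structures:
        return 1.0  # assumption: a program without structures is trivially easy
    return min(
        max(
            (structure_similarity(su, so, symbol_sim) for so in train_structures),
            default=0.0,
        )
        for su in test_structures
    )
```

Passing a symbol similarity that always returns 0.0 yields the strict variant of the rule, in which any unobserved structure gives an easiness of 0.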

Alternative Decision Rules
We discuss alternative decision rules, which will be evaluated as baselines in §5.
N-grams over sequence. While we define structures over program graphs, seq2seq models observe inputs and outputs as a flat sequence of symbols. Thus, it is possible that unobserved n-grams in test instances, rather than local structures, explain difficulty. To test this, we define our decision rule to be identical, but replace local structures with consecutive n-grams in the program sequence, and consider only two context types (left/right co-occurrence) to compute symbol similarity.
Length. It is plausible that long sequences are more difficult to generalize to. To test this, given a set of training programs $P_o$, we measure the number of symbols $m_l$ in the longest program in $P_o$, and define the easiness of a program $p_u$ of length $m_u$ to be $\max\left(1 - \frac{m_u}{m_l}, 0\right)$.
TMCD. The MCD and TMCD methods (Keysers et al., 2020; Shaw et al., 2021) have been used to create compositional splits, by maximizing compound divergence across the training and test splits. A compound, which is analogous to our n-LS, is defined as any sub-graph of up to a certain size in the program tree, and divergence is computed over the distributions of compounds across the two sets (see papers for details). We can also use this method to predict difficulty instead of creating splits, and compare it to our approach. While the two methods are not directly comparable, since we focus on instance-level generalization, we extend our approach for computing the easiness of a split and compare to TMCD in §5.
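As a small illustration of the two simplest baselines, the sketch below extracts consecutive n-grams from a program token sequence and computes the length-based easiness. The function names are hypothetical and the token lists are assumed to contain program symbols only.

```python
def sequence_ngrams(tokens, n=2):
    """Consecutive n-grams over the flat program sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def length_easiness(program_length, train_lengths):
    """max(1 - m_u / m_l, 0), where m_l is the longest training program length."""
    longest = max(train_lengths)
    return max(1.0 - program_length / longest, 0.0)
```

A test program at least as long as the longest training program receives an easiness of 0 under the length rule.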

Experiments
We now empirically test how well our decision rule predicts the easiness of test instances.
Setup. We formalize our setup as a binary classification task where we predict the easiness $\hat{e}(p_u) \in [0, 1]$ of a test instance (utterance-program pair) with program $p_u$, and compare it to the "gold" easiness $e(p_u) \in \{0, 1\}$, as defined next. For each test instance, we have the EM accuracy of four models (§3). We thus define $e(p_u)$ to be the majority EM on $p_u$ between the four models. We discard examples with no majority (3.6%-6.9% of the cases). In each dataset, we combine all test instances across all splits. We evaluate with Area Under the Curve (AUC), a standard metric for binary classification that does not require setting a threshold, where we compute the area under the curve of the true positive rate against the false positive rate.
Decision rules. We compare variations of our n-LS decision rule with $n \in \{2, 3, 4\}$. Additionally, we evaluate the N-grams over sequence baseline (2-BIGRAM) and LENGTH. The RANDOM baseline samples a number between 0 and 1 uniformly.
We also conduct three ablations. The first, 2-LS-NOSIB, ignores sibling relations. Similarly, 2-LS-NOPC ignores parent-child relations. Last, 2-LS-NOSIM tests a stricter decision rule that ignores structure similarity, i.e., $\hat{e}(p_u) = 0$ for a test program $p_u$ that has any unobserved 2-LS, even if the unobserved structures have similar observed structures in the training set.
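The evaluation setup can be sketched as follows: gold easiness is the majority EM over the models, ties are discarded, and AUC is computed from the predicted scores. This is a minimal sketch with hypothetical function names; it uses the pairwise-ranking formulation of AUC, which equals the area under the ROC curve, and returns a value in [0, 1] rather than a percentage.

```python
def gold_easiness(em_results):
    """Majority EM over the models (1 = easy, 0 = hard); None when tied."""
    n_correct = sum(em_results)
    if 2 * n_correct == len(em_results):
        return None  # no majority, the instance is discarded
    return 1 if 2 * n_correct > len(em_results) else 0


def auc(scores, labels):
    """Probability that a random easy instance is scored above a random hard
    one, counting ties as one half."""
    easy = [s for s, y in zip(scores, labels) if y == 1]
    hard = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((e > h) + 0.5 * (e == h) for e in easy for h in hard)
    return wins / (len(easy) * len(hard))
```

For example, predicted scores `[0.9, 0.2, 0.3, 0.1]` with gold labels `[1, 1, 0, 0]` rank three of the four easy-hard pairs correctly, giving an AUC of 0.75.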

Results
AUC scores for all decision rules are in Table 3, showing that our n-LS classifiers get high AUC scores, ranging from 78.4 to 93.3, outperforming the baselines and ablations.
Comparing performance across the order n of LSs, there is some variance across datasets, and the best n may depend on the dataset. Still, local structure explains generalization better than the baselines for practically all values of n.
The two graph-relation ablations (2-LS-NOSIB and 2-LS-NOPC) show that parent-child relations are more important than sibling relations, but both contribute to the final easiness score. The strict similarity ablation suggests that considering similarity between structures is important in COVR, but less in Overnight and ATIS, and not at all in S2Q. We hypothesize that similarity is important in COVR since the splits were created with a grammar, and not with templates. In grammar splits, often a single group of similar structures is split across the training and test sets (e.g., half of the quantifiers are in the training set, and half in the test set). In such cases, considering local structure similarity is important, since such test instances are easier according to our conjecture. The experiment we conduct in §6.1, where we specifically create such splits, supports this claim.
The low accuracy of 2-BIGRAM indicates that unobserved structures in program space are better predictors than unobserved sequences. This is interesting since models are trained on symbol sequences. Last, we see that predicting difficulty by example length is as bad as a random predictor.
Performance on datasets with natural language (ATIS and Overnight-paraphrased) is lower than on synthetic datasets. One reason is that natural language introduces additional challenges such as ambiguity, lexical alignment and evaluation errors: mistakes that cannot be explained by our decision rule. We further discuss limitations in §8.

Token-level analysis
We now analyze the relation between the first incorrect token the models emit and the unobserved structures. Consider the example in Table 4: given a test example with an unobserved parent-child structure or-exists, all models emit a wrong token precisely where they should have output the child, exists.
We measure the frequency of this by looking at model outputs where (1) the model is wrong and (2) unobserved 2-LSs $S_u$ were found in the gold program. For each such model output $\hat{z}$ and gold output $z$, we find the index $i$ of the first token where $\hat{z}_i \neq z_i$. We then count the fraction of cases where there is an unobserved 2-LS (a pair of symbols) $(m_1, m_2) \in S_u$ such that both $m_1$ appears in the program prefix $\{z_j\}_{j=1}^{i-1}$ and $m_2 = z_i$. For COVR, this happens in 76.5% of the cases. For Overnight, S2Q and ATIS, this happens in 28.4%, 31.8% and 45.1% of the cases respectively, providing strong evidence that models struggle to emit local structures they have not observed during training.
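The per-output check in this analysis can be sketched as follows. The function name and the token lists are hypothetical, and the sketch assumes that a prediction which is a strict prefix of the gold program (or vice versa) has no mismatching index and is therefore not counted.

```python
def first_error_matches_unobserved(gold, pred, unobserved_pairs):
    """True if the first wrong token completes an unobserved 2-LS (m1, m2):
    m1 occurs in the gold prefix and m2 is the gold token at the error index."""
    i = next((k for k, (g, p) in enumerate(zip(gold, pred)) if g != p), None)
    if i is None:
        return False  # assumption: no mismatch within the shared prefix
    prefix = set(gold[:i])
    return any(m1 in prefix and m2 == gold[i] for m1, m2 in unobserved_pairs)
```

The reported percentages correspond to the fraction of wrong outputs with unobserved 2-LSs for which this check returns True.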

Comparison to TMCD
As discussed (§4.4), Maximum Compound Divergence (MCD) and its variation TMCD are recently proposed metrics for estimating the difficulty of a test set, while we measure difficulty at the instance level. To measure how well local structures can predict performance at the split level, we average the EM scores that the 4 models get on each split and use this average as the "gold" easiness of that split. We then average the easiness predictions of our decision rule across all instances in a split to obtain an easiness prediction for that split. For TMCD, we compute the compound divergence of each split (high compound divergence indicates a more difficult split, or lower easiness) following Shaw et al. (2021); see details in App. E.
We evaluate by measuring Pearson correlation between the predicted scores and gold scores. For TMCD we take the negative of the predicted score, such that for both methods a higher correlation is better. We show results for COVR and Overnight only, since the number of splits in these two datasets is large enough (124 and 50 splits respectively). The results, shown in Table 5, demonstrate that n-LS correlates better with the EM of models.
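The split-level comparison can be sketched as below: instance-level easiness predictions are averaged per split and correlated with the per-split EM averaged over models. The function names and input layout are hypothetical. For a divergence-style score such as TMCD, the negated score would be passed to `pearson` instead.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def split_level_correlation(predicted_by_split, em_by_split):
    """predicted_by_split: one list of instance easiness predictions per split.
    em_by_split: EM of each split, averaged over the models."""
    predicted = [mean(preds) for preds in predicted_by_split]
    return pearson(predicted, em_by_split)
```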

Model Architecture Effect
We now check if our proposed decision rule generalizes beyond Transformer-based seq2seq models.
To that end, we repeat our experiments with an LSTM (Hochreiter and Schmidhuber, 1997) decoder with a copying mechanism (Gu et al., 2016), and BERT-Base as the encoder. Results are given in Table 6, showing that AUC scores with LSTM are close to the scores with a Transformer.

Table 4, utterance: Either the number of dog is greater than the number of white animal or there is dog

Leveraging Local Structures
We show we can take advantage of the insights presented to (1) create challenging compositional splits (§6.1) and (2) improve data sampling efficiency for compositional generalization (§6.2).

n-LS Splits
We showed that unobserved local structures explain compositional generalization failures. Next, we test our conjecture from the opposite direction: we evaluate accuracy when testing on adversarial splits designed to contain unobserved local structures.
Our goal is to create splits such that the similarity between any structure in the entire set of training programs and any structure in the set of test programs will be minimal. We do this by going over the set of all n-LSs, $S$, in the set of all programs $P$. For each $s \in S$ we define the set $S_u$ which contains all structures that are similar enough to $s$.
We then attempt to create a new split such that its test set will contain all examples that have any of the structures in $S_u$. See App. B.4 for full details. We test two variations: one where $S$ includes all 2-LS structures, and one where we ignore sibling relations (2-LS-NOSIB). In addition, we create another set of splits where instead of setting $S_u$ to include all structures that are similar to $s$, we randomly sample half of them, meaning some similar structures will exist in the training set (2-LS-NOSIB-HALF). For each configuration, we generate multiple splits (see App. B.4), and compute average accuracy across splits and models. For the template split, we use 5 random splits.
Table 7 shows that the EM of models on our adversarial splits is dramatically lower than on template splits, especially for the 2-LS-NOSIB variation, with a difference of 74.9 or 45.8 absolute points. Moreover, the scores on 2-LS-NOSIB-HALF are much higher, reinforcing our hypothesis that unobserved structures are only hard if there are no similar observed structures.
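The core of the split construction can be sketched as follows. The full procedure is given in App. B.4 and is not reproduced here; this sketch only illustrates the held-out set $S_u$ for a single anchor structure. The function name, the similarity threshold of 0.5 used to decide what counts as "similar enough", and the input layout are all assumptions.

```python
def n_ls_split(example_structures, anchor, similarity, threshold=0.5):
    """Hold out every example containing a structure similar to `anchor`.

    example_structures: dict mapping example id -> set of local structures.
    similarity: function (s1, s2) -> value in [0, 1].
    threshold: hypothetical cut-off for 'similar enough'.
    """
    all_structures = set().union(*example_structures.values())
    held_out = {s for s in all_structures if similarity(anchor, s) >= threshold}
    test = sorted(e for e, structs in example_structures.items() if structs & held_out)
    test_ids = set(test)
    train = sorted(e for e in example_structures if e not in test_ids)
    return train, test, held_out
```

A variant in the spirit of 2-LS-NOSIB-HALF would sample half of `held_out` before selecting the test examples, leaving some similar structures in the training set.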

Efficient Sampling
We now test if we can leverage unobserved local structures to choose examples that will lead to better compositional generalization. We assume access to a large set of examples $D_{pool}$, but can train on only a limited budget of $B$ examples, $D_{train} \subseteq D_{pool}$. We propose a simple iterative algorithm, starting with $D_{train} = \emptyset$. We want our model to observe a variety of different local structures to increase the chance of seeing all local structures in $D_{test}$. Thus, at each step, we add an example $e \in D_{pool}$ that contains a local structure that is unobserved in $D_{train}$. We do this by first sampling an unseen local structure (if all structures are already observed, we uniformly sample one), and then randomly picking an example that contains this structure. We continue until $|D_{train}| = B$.
We compare our algorithm against random sampling without replacement across a spectrum of training budgets. We evaluate both methods on 5 different template splits for each dataset, and for each template split, budget and sampling algorithm we perform the experiment with 3 different random seeds. Figure 4 shows the average compositional test accuracies on $D_{test}$ for the two algorithms on the three datasets. Our sampling scheme outperforms or matches the random sampling method on almost all datasets and training budgets.
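The iterative selection procedure described in this section can be sketched as follows. This is a minimal illustration, not the released code: the function name and pool layout are hypothetical, and when all structures are already observed the sketch samples uniformly among structures that still occur in unselected examples (an assumption made so that an example can always be picked).

```python
import random


def select_training_set(pool, budget, seed=0):
    """Pick up to `budget` example ids from `pool` (id -> set of local structures)."""
    rng = random.Random(seed)
    remaining = dict(pool)
    observed, selected = set(), []
    while len(selected) < budget and remaining:
        # Structures that can still be obtained from the unselected examples.
        available = sorted(set().union(*remaining.values()))
        if not available:
            break  # assumption: stop if no remaining example has any structure
        unseen = [s for s in available if s not in observed]
        # Prefer an unseen structure; otherwise sample one uniformly.
        target = rng.choice(unseen or available)
        candidates = sorted(e for e, structs in remaining.items() if target in structs)
        chosen = rng.choice(candidates)
        selected.append(chosen)
        observed |= remaining.pop(chosen)
    return selected
```

Because every step targets an unseen structure while one exists, a budget at least as large as the number of distinct structures in the pool is enough for the selected set to cover all of them.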

Related Work
Benchmarks. Compositional splits have been defined either manually (Lake and Baroni, 2018; Bastings et al., 2018; Bahdanau et al., 2019; Kim and Linzen, 2020; Ruis et al., 2020; Bogin et al., 2021a) or automatically in a dataset-agnostic manner (as in §6.1). Automatic methods include splitting by output length (Lake and Baroni, 2018), by anonymizing programs (Finegan-Dollak et al., 2018), and by maximizing divergence between the training and test sets (Keysers et al., 2020; Shaw et al., 2021).
Improving generalization. Many approaches have been proposed to examine and improve generalization, including the effect of data size and architecture (Furrer et al., 2020), data augmentation (Andreas, 2020; Akyürek et al., 2021; Guo et al., 2021), data sampling (Oren et al., 2021), model architecture (Herzig and Berant, 2021; Bogin et al., 2021b; Chen et al., 2020), intermediate representations (Herzig et al., 2021) and different training techniques (Oren et al., 2020; Csordás et al., 2021).
Measuring compositional difficulty. The most closely related methods to our work are MCD and its variation TMCD (Keysers et al., 2020; Shaw et al., 2021), designed to create compositional splits. Both methods define a tree over the program of a given instance. In MCD, it is created from the tree of derivation rules that generates the program, and in TMCD from the program parse tree, similar to our approach. Then, certain sub-trees are considered as compounds, which are analogous to n-LSs (with the main exception that n-LSs contain consecutive siblings). Splits are created to maximize the divergence between the distributions of compounds in the training and test sets.
However, since MCD and TMCD were designed to generate compositional splits, they were not tested on whether they predict the difficulty of other splits, such as the template split. In addition, while in TMCD the difficulty is over an entire test set, we predict the difficulty of specific instances. Instance-level analysis can better characterize the challenges of compositional generalization, and as we show in §5.2 it is a better predictor of difficulty compared to TMCD even at the split level. Moreover, our approach can also be used to create challenging compositional splits (§6.1). Another important distinction is that we introduce the concept of structure similarity, which further improves the ability to predict the difficulty of instances.

Discussion
Limitations. In this work we only use the programs of test instances to predict difficulty, but ignore the utterance. Thus, we do not take into account language variability, which could play a factor in compositional splits. In addition, we analyze datasets where models get high i.i.d. accuracies. Our decision rule may not work when other difficulties are conflated with the compositional challenge.

Conclusion
We have shown that unobserved local structures have a critical role in explaining the difficulty of compositional generalization over a variety of datasets, formal languages, and model architectures, and demonstrated how these insights can be used for the evaluation of compositional generalization, and to improve data efficiency given a limited training budget. We hope our insights will be useful in future work for improving generalization in sequence-to-sequence models.

A Datasets
In this appendix we enumerate the different preprocessing steps, anonymization functions and specific evaluation methods used for each dataset, if any, and provide the number of splits/instances created in each splitting method. Note that the described anonymization functions are used only in the template method splits, and are not used in the grammar splits or adversarial splits.

A.1 COVR Generation
The COVR dataset that we use is generated with a manually written grammar that is based on the visual question answering (VQA) dataset of the same name (Bogin et al., 2021a), with utterances that contain either true/false statements or questions regarding different objects in an image. However, the dataset we use is separate from the VQA dataset since it generates only the utterances and the executable programs, without any dependence on a scene graph or image. While we use a grammar to generate the pairs, during generation some pre-defined pruning rules prevent specific cases where illogical programs are produced. The entire grammar, pruning rules and generation code are available in our codebase.
Anonymization. We anonymize the following groups of symbols (symbols in each group are replaced with a group-specific constant): numbers, entities (dog, cat, mouse, animal), relations (chasing, playing_with, looking_at), types of attributes (color, shape), attribute values (black, white, brown, gray, round, square, triangle) and logical operators (and, or).

A.2 Overnight
Preprocessing. We remove redundant parts of the programs, namely, the scope prefix that is used in functions and entities (edu.stanford.nlp.sempre.overnight.SimpleWorld.XXX) and declarations of types (string, number, date, call). The regex dataset is not used due to parsing issues.
Anonymization. We anonymize the following groups of symbols: strings, entities and numbers. We use the type declarations that are removed in the preprocessing part to identify these groups.
Evaluation. When evaluating the synthetic versions of Overnight, we did not encounter any evaluation issues. However, on the paraphrased version, some of the instances yielded false negatives, mostly due to inconsistent order of filter conditions. For example, the program for the paraphrased utterance "find an 800 sq ft housing unit posted on january 2" first defines the posting date as a condition, and only then the square feet size. This could happen whenever the crowdsourcing workers changed the order of the conditions as part of paraphrasing. In such cases, most of the time, models output conditions in the exact order given in the utterance, which would be scored as an exact match failure, even though the program is essentially correct. We normalize the order of the conditions to prevent such cases. We have encountered other evaluation issues as well which we did not address, since they were not as common.

A.3 S2Q
Preprocessing The raw generated S2Q dataset provided by Oren et al. (2021) has several issues that made the i.i.d. results of different models too low (due to issues that are not related to compositionality, described below), and made n-LSs sometimes non-meaningful (since the Thingtalk language is not entirely functional). Thus, we perform several preprocessing steps that were not necessary for the other datasets. First, similar to Overnight, we remove redundant parts that define scope (e.g. @org.schema.Person.Person is replaced with Person). Next, we make slight changes to the programs of S2Q such that their parsed tree better describes the hierarchy of the functions and arguments. For example, see Tab. 8 (top), where we swap the positions of filter and Person, such that filter is the function that calls Person as its argument. Next, since S2Q programs were generated with thousands of different random entity names which are often interleaved in a non-natural way in the utterance (e.g. "people which are alumni of seems like some people , , with job title containing clarifier , or", where "seems like some people" and "clarifier , or" are entities), we anonymize these strings by replacing their occurrences in the program with a constant value STR_VAL (we do so for numbers as well). To anonymize the utterance, we do not replace string values with just a constant value, due to another related issue: in some cases, filter conditions are used in the program without the utterance mentioning the relevant column that should be filtered. For example, the phrase "which are the person ... that have alger lake , mi" refers to people that have "alger lake , mi" in the column workLocation; however, the usage of this specific column cannot be inferred from the name of the entity. We thus replace every string value in the utterance with the name of the column that it should use. See Tab. 8 (bottom) for an example. Note that unlike the other datasets, here anonymization is used not only for template splitting: models are also trained and evaluated with these anonymized utterances and programs.

Orig. utterance: which are the person which have either learning of true university or ky. in the works for and having job title containing animal health technician that have alger lake , mi
Anonymized: which are the person which have either worksFor or worksFor in the works for and having job title containing jobTitle that have workLocation

Table 8: An example of the preprocessing and anonymization we perform for the S2Q dataset. In the top example, we remove redundant parts, anonymize numbers and make minor modifications such that filter is the function that calls the other arguments. In the bottom example, we anonymize the utterance to prevent ambiguity regarding the column type of the entity alger lake , mi (see description in text). Entities are bold.
Anonymization String values and numbers are already anonymized, as described above. For the purpose of template splitting, we additionally replace the names of fields (e.g. worksFor, alumniOf, faxNumber, etc.) in the programs with a constant value, and likewise the different operators (==, >=, <=, =, contains, asc and desc).
Evaluation We normalize S2Q programs during evaluation to prevent false negatives, where the predicted program differs from the gold program but the two have the same meaning (i.e. would provide the same answer given any input). We address two specific cases. First, we normalize and and or clauses, such that clauses are sorted in alphabetical order while taking into account the precedence between the two operators. Second, we remove a redundant call of the compute count function that is used twice and can thus be omitted (see our codebase for exact implementation).
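To make the first normalization concrete, the following is a minimal sketch on a toy condition syntax. It is not the implementation from our codebase: it assumes a flat, whitespace-tokenized condition without parentheses, in which `and` binds tighter than `or`, and the function name `normalize_boolean_clauses` is hypothetical.

```python
def normalize_boolean_clauses(condition: str, and_op: str = "and", or_op: str = "or") -> str:
    """Sort conjuncts inside each disjunct, then sort the disjuncts.

    Assumes a flat condition (no parentheses) where `and_op` has higher
    precedence than `or_op`, so the condition is an OR of ANDs.
    """
    disjuncts, conjuncts, atom = [], [], []
    # A trailing or_op acts as a sentinel that flushes the last clause.
    for token in condition.split() + [or_op]:
        if token in (and_op, or_op):
            conjuncts.append(" ".join(atom))
            atom = []
            if token == or_op:
                # Close the current AND-group, sorted alphabetically.
                disjuncts.append(f" {and_op} ".join(sorted(conjuncts)))
                conjuncts = []
        else:
            atom.append(token)
    return f" {or_op} ".join(sorted(disjuncts))
```

With this sketch, two predictions that differ only in the order of their clauses normalize to the same string, so an exact match comparison after normalization no longer penalizes the reordering.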

A.4 ATIS
Preprocessing The ATIS dataset contains numbered variables ($0, $1, etc.). However, it has some inconsistencies in the way these variables are given: sometimes the number appears directly after the dollar sign ($0), and sometimes the letter "v" appears between them ($v0). Additionally, we standardize the order of these numbers so that variables are numbered in a consistent manner: the number n_i of the i-th variable (where the order of variables is defined by the position of their first appearance in the program) is set to n_{i-1} + 1, with the first variable numbered 0.
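The renumbering can be sketched in a few lines. This is an illustrative sketch under assumptions, not the code we used: it assumes that variables match the pattern `$<digits>` or `$v<digits>`, that `$0` and `$v0` denote the same variable, and that all variables are emitted in the `$<digits>` form; the function name `renumber_variables` is hypothetical.

```python
import re

# Assumed variable syntax: "$0" or "$v0".
VARIABLE = re.compile(r"\$v?(\d+)")


def renumber_variables(program: str) -> str:
    """Renumber variables 0, 1, 2, ... by order of first appearance."""
    mapping = {}

    def replace(match: re.Match) -> str:
        original = match.group(1)
        # The i-th new variable gets the previous number plus one, starting at 0.
        if original not in mapping:
            mapping[original] = len(mapping)
        return f"${mapping[original]}"

    # re.sub scans left to right, so first appearances are seen in order.
    return VARIABLE.sub(replace, program)
```

For example, a program that uses `$v1` and then `$3` would be rewritten to use `$0` and `$1`, regardless of the original numbering.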

B Splits
B.1 Template Split
The random template split, proposed in Finegan-Dollak et al. (2018), first converts programs into abstract templates using a program anonymization function that replaces certain program tokens, such as string values, numbers and entities, with their abstract type (e.g., for the input "are there people that works for crossbars ?" in Table 1, the string value crossbars is replaced with STR_VAR). See App. A for the exact anonymization function used in each dataset.
We group examples according to their abstract template, randomly split the templates into a training set and a test set, and place each example in the train/test set according to its abstract template. While not a part of the original procedure, to make sure splits are "solvable", we verify in each split that there is no token that appears in the test set but does not appear in the training set (otherwise, we discard the split). To obtain multiple splits, we perform multiple random splits of templates.
To make sure the distribution of the number of examples per template is not very skewed, we limit the number of examples for each program template to $k_{train} = 1000$ in the training set. If there are more, we randomly sample $k_{train}$ examples for that template and discard the rest. We follow the same procedure for the test split, with $k_{test} = 10$.
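The procedure above can be summarized in a short sketch. It is illustrative only: the function name `template_split`, the `(utterance, program)` pair representation, whitespace tokenization, and checking only program tokens for the "solvable" condition are assumptions of this example, and `anonymize` stands for whichever dataset-specific anonymization function is used.

```python
import random
from collections import defaultdict


def template_split(examples, anonymize, test_fraction=0.2,
                   k_train=1000, k_test=10, seed=0):
    """Split (utterance, program) pairs by abstract program template.

    Returns (train, test), or None when the split is discarded because a
    test token never occurs in the training set.
    """
    rng = random.Random(seed)
    by_template = defaultdict(list)
    for example in examples:
        by_template[anonymize(example[1])].append(example)

    templates = sorted(by_template)
    rng.shuffle(templates)
    n_test = max(1, round(test_fraction * len(templates)))
    test_templates = set(templates[:n_test])

    train, test = [], []
    for template in templates:
        group = by_template[template]
        cap, target = (k_test, test) if template in test_templates else (k_train, train)
        if len(group) > cap:
            # Cap the number of examples per template to limit skew.
            group = rng.sample(group, cap)
        target.extend(group)

    # Assumption: tokens are whitespace-separated program tokens.
    train_tokens = {tok for _, program in train for tok in program.split()}
    test_tokens = {tok for _, program in test for tok in program.split()}
    if not test_tokens <= train_tokens:
        return None  # unsolvable split, discard
    return train, test
```

Calling the function with different seeds corresponds to performing multiple random splits of templates; seeds that return None correspond to discarded splits.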

B.2 Grammar Split
We propose a grammar split, which can be used when we have the set of context-free grammar rules G that generate the dataset programs. This can create meaningful splits, since generation is not random. For example, this can create a split where the test set contains exactly the set of examples that have the token and as a parent of the token exists (in the program tree). Such a split is unlikely to emerge in a template split.
To create a grammar split for a set of examples, we look at the derivation $d = (r_1, \ldots, r_{N_d})$ of each example program, i.e., the sequence of production rules from the grammar that generate that program, where $r_i \in G$. We create a compositional split by holding out a small number $N_U$ of pairs of grammar rules, $U = \{(r^1_i, r^2_i)\}_{i=1}^{N_U}$, and define the test set to contain any example whose program derivation contains a pair of rules $r^1$ and $r^2$, where $(r^1, r^2) \in U$. For example, if $U$ contains a single pair of rules $(r^1, r^2)$, where $r^1$ is a rule that produces the terminal and and $r^2$ is a rule that produces the terminal exists, then the test set will include only (and all) examples that have both these rules in their derivation. Due to the structure of the grammar, this means the test programs will always have exists as a child of and in the program tree. Importantly, the training set will still have examples with each of these two terminals separately. We repeat this process with different instances of $U$, as we describe next.
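The assignment of examples to the test set given a set of held-out rule pairs can be sketched as follows. This is a minimal illustration: representing a derivation as a list of rule identifiers and the function name `grammar_split` are assumptions of this example.

```python
def grammar_split(examples, held_out_pairs):
    """Split examples given held-out pairs of grammar rules.

    `examples` is an iterable of (example, derivation) pairs, where a
    derivation is a sequence of rule identifiers. An example goes to the
    test set iff its derivation contains both rules of some held-out pair.
    """
    train, test = [], []
    for example, derivation in examples:
        rules = set(derivation)
        in_test = any(r1 in rules and r2 in rules for r1, r2 in held_out_pairs)
        (test if in_test else train).append(example)
    return train, test
```

Note that an example whose derivation contains only one rule of a held-out pair stays in the training set, which is why both terminals can still be observed separately during training.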
A grammar rule $r_i = A \to \gamma \in G$ comprises its left-hand side (LHS) $A$, which is a non-terminal, and its right-hand side (RHS) $\gamma$, which is a sequence of terminals and/or non-terminals. In theory, we could iterate over all possible sets of pairs and create a split out of each of these sets, but this would be impractical. Instead, we propose the following method to pick different instances of $U$.
First, we consider only the set $L$ of non-terminals that are "meaningful". We consider a non-terminal $l$ to be meaningful iff (1) there are at least two rules in $G$ where $l$ is their LHS, or (2) $l$ is a non-terminal such that for at least two different rules $r_a, r_b \in G$, $l$ belongs to the RHS sequence of both $r_a$ and $r_b$. By selecting only meaningful non-terminals, we avoid considering redundant rules that always appear together with other rules. For example, in Fig. 5, the non-terminal count is the LHS of just a single rule, and it belongs to the RHS of just a single rule, thus it is not considered meaningful. In this example, $L$ will consist of the meaningful non-terminals boolean_pair and boolean_single.
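The two conditions for a meaningful non-terminal can be checked directly on a list of rules. The sketch below is illustrative: the `(lhs, rhs)` tuple representation, the convention that a symbol is a non-terminal iff it appears as the LHS of some rule, and the function name `meaningful_nonterminals` are assumptions of this example.

```python
from collections import Counter


def meaningful_nonterminals(grammar):
    """Return the non-terminals that satisfy condition (1) or condition (2).

    `grammar` is a list of (lhs, rhs) rules, where rhs is a tuple of symbols.
    """
    nonterminals = {lhs for lhs, _ in grammar}
    # Condition (1): number of rules in which the symbol is the LHS.
    lhs_counts = Counter(lhs for lhs, _ in grammar)
    # Condition (2): number of distinct rules whose RHS contains the symbol.
    rhs_counts = Counter()
    for _, rhs in grammar:
        for symbol in set(rhs):  # count each rule at most once per symbol
            if symbol in nonterminals:
                rhs_counts[symbol] += 1
    return {nt for nt in nonterminals
            if lhs_counts[nt] >= 2 or rhs_counts[nt] >= 2}
```

A non-terminal that is the LHS of one rule and appears in the RHS of one rule fails both conditions and is skipped, since holding out its rule would be equivalent to holding out the rule that always accompanies it.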
We then iterate over all possible unique pairs $(l_1, l_2)$ of non-terminals in $L$. For each such pair, we take the set $G_{l_1}$ of grammar rules whose LHS is $l_1$, and the set $G_{l_2}$ of grammar rules whose LHS is $l_2$. In the example in Fig. 5, there is one such pair of non-terminals $(l_1, l_2)$: boolean_pair and boolean_single. The figure shows the set $G_{l_1}$, which contains all rules with boolean_pair as their LHS, and similarly $G_{l_2}$.
Given the two sets of rules $G_{l_1}$ and $G_{l_2}$, we set $U_{comb}$ to be the set of all possible combinations of a rule from $G_{l_1}$ and a rule from $G_{l_2}$ (the Cartesian product of the two sets). Next, we use the following pseudo code to generate different instances of $U$ (each set that we yield is a different generated instance of $U$):

1: yield $U_{comb}$
2: for $(r^1, r^2) \in U_{comb}$ do
3:     yield $\{(r^1, r^2)\}$
4: end for
5: for $r \in (G_{l_1} \cup G_{l_2})$ do
6:     yield $\{(r^1, r^2) \in U_{comb}$ if $r^1 = r$ or $r^2 = r\}$
7: end for

Following the figure, $U_1$ will be generated in line 1, $U_{2..5}$ in line 3, and $U_{6..7}$ in line 6. Finally, we discard each generated instance of $U$ that produces an invalid split (where symbols in the test set do not occur in the training set).
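The pseudo code translates almost line by line into a Python generator. This is a sketch under assumptions: rules are represented by hashable identifiers, the function name `generate_held_out_sets` is hypothetical, and the final filtering of invalid splits is left out because it depends on the dataset.

```python
from itertools import product


def generate_held_out_sets(rules_l1, rules_l2):
    """Yield candidate instances of U for the rule sets of two non-terminals."""
    u_comb = list(product(rules_l1, rules_l2))  # Cartesian product
    # Line 1: hold out every combination at once.
    yield frozenset(u_comb)
    # Lines 2-4: hold out each single pair of rules.
    for pair in u_comb:
        yield frozenset({pair})
    # Lines 5-7: hold out every pair that involves a given rule.
    for rule in list(rules_l1) + list(rules_l2):
        yield frozenset(pair for pair in u_comb if rule in pair)
```

Each yielded set can then be used as the held-out pairs of a grammar split, and discarded if the resulting split contains test symbols that never occur in the training set.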
Example grammar splits We show a few selected examples of the generated grammar splits in Table 9.

B.3 Sizes of Splits
For Overnight and ATIS, we hold out 20% of all program templates. For Schema2QA, we hold out 70%, since otherwise model performance is too high.
For each dataset, we list the number of splits, the total number of instances over which we compute AUC scores, and the average number of train/test templates (after anonymization) in Table 10.

B.4 n-LS Splits
As we mention in §6.1, we create adversarial splits by leveraging n-LSs, then compare the accuracy of models on these splits against accuracy on a random template split, where we hold out a fraction K = 0.3 of the program templates.
To create an adversarial split, we want to split the entire set of programs $P$ into a set of training programs $P_o$ and test programs $P_u$, such that if we look at the set $S_o$ of observed n-LSs in $P_o$ and the set $S_u$ of unobserved n-LSs in $P_u$, the similarity between any structure in $S_o$ and any structure in $S_u$ is minimal. Given $S_u$, a split of examples is created by setting the test set to contain all examples that have any of the structures in $S_u$, and the training set to contain all other examples.
Concretely, to create an adversarial split, we go over the set of all n-LSs, $S$, in the set of programs $P$, and from each $s \in S$ we attempt to create an adversarial split independently. Given an n-LS $s$ and a similarity threshold $t$, a candidate set of unobserved structures is $S_u(t) = \{s_u \in S \text{ s.t. } \mathrm{sim}(s_u, s) > t\}$, that is, the set of all structures that are similar to $s$ (which includes $s$ by definition). When $t$ is low, splits are hard, since structures in $S_o$ will be less similar to those in $S_u$. On the other hand, when $t$ is too low, we might hold out a large fraction of programs, which might (a) create invalid splits, where symbols in the test set do not occur in the training set, or (b) cause the fraction of test set programs to exceed $K$, rendering comparison to a random template split unfair. Thus, we pick $S_u$ with the lowest $t$ such that (1) $S_u(t)$ is valid (all test symbols occur in the training set) and (2) the fraction of program templates in the test set is at most $K$. We merge identical splits that are created from different structures $s, s' \in S$. In addition, because our process does not guarantee a difficult split, we discard any split whose easiness score is higher than a threshold $\tau$, as described next.
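The search for the lowest admissible threshold can be sketched as a simple scan. This is an illustrative sketch, not our implementation: the candidate threshold grid, the callbacks `is_valid_split` and `test_fraction` (which depend on the dataset), and the function name `choose_unobserved_structures` are all assumptions of this example.

```python
def choose_unobserved_structures(s, all_structures, sim, thresholds,
                                 is_valid_split, test_fraction, max_fraction=0.3):
    """Return (t, S_u) for the lowest threshold t that yields an admissible split.

    sim(a, b) is a similarity function between structures; is_valid_split(S_u)
    says whether all test symbols occur in training; test_fraction(S_u) is the
    fraction of program templates that would be held out.
    """
    for t in sorted(thresholds):
        # S_u(t): every structure whose similarity to s exceeds t.
        s_u = {candidate for candidate in all_structures if sim(candidate, s) > t}
        s_u.add(s)  # s itself is always part of the held-out set
        if is_valid_split(s_u) and test_fraction(s_u) <= max_fraction:
            return t, s_u
    return None  # no threshold gives an admissible split for this structure
```

Scanning thresholds in increasing order means the first admissible set is also the largest one, i.e. the one that removes the most structures similar to $s$ from the training set.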
We choose $\tau$ using the n-LS classifiers we evaluate in §5.1, for which we report AUC scores in Table 3. For each decision rule and each dataset, we find the threshold that maximizes the $F_1$ score on our test instances, and use it as the threshold $\tau$ for the matching n-LS split.
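A threshold that maximizes $F_1$ can be found by scanning the observed scores. The following is a minimal sketch under assumptions that are not stated in the text: an instance is predicted "easy" when its easiness score is at least the threshold, the positive class is "the model was correct", and candidate thresholds are the distinct observed scores. The function name `best_f1_threshold` is hypothetical.

```python
def best_f1_threshold(scores, labels):
    """Return (threshold, f1) maximizing F1 of the rule `score >= threshold`.

    `scores` are easiness scores; `labels` are truthy when the instance was
    solved correctly (the assumed positive class).
    """
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(scores)):
        tp = fp = fn = 0
        for score, label in zip(scores, labels):
            predicted = score >= threshold
            if predicted and label:
                tp += 1
            elif predicted and not label:
                fp += 1
            elif not predicted and label:
                fn += 1
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1
```

Ties are resolved in favor of the lowest threshold here, which is an arbitrary choice of this sketch.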

Test example
either the color of cat that is playing with cat is equal to black or there is mouse
or(eq(query_attr[color](with_relation(find(cat),playing_with,find(cat))), black),exists(find(mouse)))
Predicted easiness for split (2-LS): 0.09

Train examples
1. either some of animal are brown or the number of cat is less than 2
or(some(find(animal),filter(brown,scene())),lt(count(find(cat)),2))
2. both the number of dog that is chasing cat that is chasing dog is equal to 4 and there is brown mouse that is chasing mouse
and(eq(...),exists(with_relation(filter(brown,find(mouse)),chasing,find(mouse))))

Test example
either most of round white mouse are square or there is animal that is playing with cat
or(most(...),exists(with_relation(find(animal),playing_with,find(cat))))
Predicted easiness for split (2-LS): 0.94

discarded) and for Overnight, τ = 0.2 (44 splits are generated, 11 kept after filtering). For 2-LS-NOSIB-HALF, COVR, we do not use τ, but instead take the top 15 hardest splits according to the easiness score. We cannot use τ in these cases since the easiness scores of all splits were higher than the thresholds (by the definition of our decision rule, the easiness score is high in splits where all unobserved structures have similar observed structures).

C Training
All models are fine-tuned on each of the generated splits for a total of 64 epochs with a batch size of 28 (except for experiments with T5-large, for which we used a batch size of 8). We use a learning rate of 3e-5 with polynomial decay. Each experiment was run on an Nvidia Titan RTX GPU. We perform early stopping using the test set, since we do not have a development set. As our goal is not to improve performance but only to analyze instance difficulty, we argue this is an acceptable choice in our setting.

D 4-LS
In §4.1, we define 2-LS and 3-LS over the program graph G. The structure 4-LS is a natural extension of these structures. It includes all 2-LSs and 3-LSs, and also structures with (1) three parent-child relations, (2) three sibling relations, (3) a grandparent with its child and two sibling grandchildren, and (4) a parent with three siblings (structures 6, 7, 8 and 9 in Figure 6, respectively).

E Measuring TMCD
We measure the compound divergence of the distributions of compounds and atoms on the program graph, following Keysers et al. (2020) and Shaw et al. (2021). Similarly to TMCD, we define atoms and compounds over the program tree T, based on the graph defined in §4.1. Each tree node is considered an atom, and the compounds are all sub-trees of up to depth 2. We use the same Chernoff coefficient as in the original paper, α = 0.5, to compute compound divergence.
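For reference, compound divergence as defined by Keysers et al. (2020) is one minus the Chernoff coefficient $C_\alpha(P \| Q) = \sum_k p_k^{\alpha} q_k^{1-\alpha}$ between the train and test compound distributions. The sketch below computes it from raw compound occurrences; using unweighted relative frequencies and the function names are simplifying assumptions of this example, and extracting the sub-tree compounds from programs is not shown.

```python
from collections import Counter


def chernoff_coefficient(p_counts, q_counts, alpha=0.5):
    """C_alpha(P || Q) = sum_k p_k^alpha * q_k^(1 - alpha)."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    # Compounds missing from either side contribute zero for 0 < alpha < 1.
    shared = set(p_counts) & set(q_counts)
    return sum((p_counts[k] / p_total) ** alpha * (q_counts[k] / q_total) ** (1 - alpha)
               for k in shared)


def compound_divergence(train_compounds, test_compounds, alpha=0.5):
    """1 - C_alpha between train and test compound frequency distributions."""
    return 1.0 - chernoff_coefficient(Counter(train_compounds),
                                      Counter(test_compounds), alpha)
```

The divergence is 0 when the two distributions are identical and 1 when they share no compounds; the same function can be applied to atoms by passing tree nodes instead of sub-trees.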

Figure 1: Unobserved local structures are harder for models to generalize to whenever there are no similar structures that were observed during training.

Figure 2: An example program z and the structure of its program graph G (top left), with solid edges for parent-child relations and dashed edges for consecutive siblings. In the other parts of the figure we enumerate all 2-LS and 3-LS structures over this graph.

Figure 3: Example of the computation of similarity $\mathrm{sim}(m_1, m_2)$ between two symbols $m_1$, $m_2$.

Figure 5: An illustration of our grammar split. The bottom part shows the 7 different splits produced. See text for details.

Figure 6: An example program z and the structure of its program graph G (left), with solid edges for parent-child relations and dashed edges for consecutive siblings. The right side enumerates all 4-LS structures over this graph.

Table 2: Average test EM across splits for all datasets,

Table 3: AUC scores of different decision rules for each dataset, computed across the test instances in all splits.

Table 4: Example predictions of four models, showing a typical case where all models emit wrong tokens exactly when encountering an unobserved parent-child structure: or-exists (L in model name stands for large version).

Table 5: Pearson correlation between the easiness score of a split and the average model EM, shown for n-LS and TMCD, for COVR and Overnight (synthetic).

Table 6: AUC scores of our decision rule on each dataset, for experiments with an LSTM-based decoder (bottom) compared to a Transformer (top).

Table 7: EM on adversarial splits versus template splits. Numbers are averaged across all created splits and across 4 models. We show standard deviation across the EM scores of all models on all generated splits.
can only use a small subset $D_{train} \subset D_{pool}$ for fine-tuning, where the budget for training is $|D_{train}| \le B$. Our goal is to improve accuracy on an unseen compositional test set $D_{test}$ by choosing examples that are likely to reduce the number of unobserved structures in the test set. To simulate this, we use the template split method to generate D Oren et al. (2021) for COVR, ATIS and S2Q, holding out 20% of the program templates for COVR and ATIS, and 70% for S2Q (the number of templates in Overnight was too low to use). Improving compositional generalization under budget constraints was explored by Oren et al. (2021), but as our setup is different, results are not directly comparable.
Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1332-1342, Beijing, China. Association for Computational Linguistics.

Table 9: Selected splits generated with the grammar split method, with selected examples for each split.

Table 10: Split statistics, including the number of generated splits for each split method and dataset, the total number of test instances across all splits, and the average number of train/test templates and examples across the splits.