Finding needles in a haystack: Sampling Structurally-diverse Training Sets from Synthetic Data for Compositional Generalization

Modern semantic parsers suffer from two principal limitations. First, training requires expensive collection of utterance-program pairs. Second, semantic parsers fail to generalize at test time to new compositions/structures that have not been observed during training. Recent research has shown that automatic generation of synthetic utterance-program pairs can alleviate the first problem, but its potential for the second has thus far been under-explored. In this work, we investigate automatic generation of synthetic utterance-program pairs for improving compositional generalization in semantic parsing. Given a small training set of annotated examples and an “infinite” pool of synthetic examples, we select a subset of synthetic examples that are structurally-diverse and use them to improve compositional generalization. We evaluate our approach on a new split of the Schema2QA dataset, and show that it leads to dramatic improvements in compositional generalization as well as moderate improvements in the traditional i.i.d setup. Moreover, structurally-diverse sampling achieves these improvements with as few as 5K examples, compared to 1M examples when sampling uniformly at random – a 200x improvement in data efficiency.


Introduction
Semantic parsers map natural language utterances to executable programs (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005). A worrying weakness of semantic parsers that has recently been exposed is their inability to generalize at test time to new compositions (Finegan-Dollak et al., 2018; Lake and Baroni, 2018; Keysers et al., 2020; Kim and Linzen, 2020; Gu et al., 2021). For example, a virtual assistant trained on the examples "Show me Thai restaurants that allow pets" and "How many hotels are there in Tokyo" might not generalize to "How many hotels in Tokyo allow pets?". This type of out-of-domain generalization to new compositions constructed from components seen during training is commonly termed compositional generalization.
Two high-level approaches have been considered for tackling compositional generalization: (a) designing models with a stronger compositional inductive bias (Liu et al., 2020; Russin et al., 2020; Zheng and Lapata, 2020; Herzig and Berant, 2021), and (b) adding training data that will encourage a compositional solution (Akyürek et al., 2021; Guo et al., 2020c; Wang et al., 2021; Guo et al., 2020a). In the latter approach, typically a model is trained from labeled data (Jia and Liang, 2016; Yu et al., 2020; Zhong et al., 2020) and is later used to generate new examples. An alternative approach is to use a manually-built synchronous grammar that automatically generates programs paired with synthetic utterances (Wang et al., 2015; Cheng et al., 2018; Weir et al., 2020). Using a grammar allows generating large amounts of synthetic data that cover a wide range of program structures. This has been shown to be useful in the i.i.d setup, and combined with paraphrase models, has led to high-accuracy parsers that are trained from synthetic data only (Xu et al., 2020b).
In this work, we investigate the potential of using synthetic data, generated from a synchronous grammar, to improve compositional generalization. Tsarkov et al. (2020) have shown that large quantities of such data improve compositional generalization. However, they evaluated on synthetic utterances only, and did not examine generalization to natural language. Moreover, error rates were high in some compositional splits even when the training set was as large as 1M examples. In this work, we ask whether we can strategically sample a small and structurally-diverse training set and improve compositional generalization without incurring a high cost to training and consequently, to the environment (Schwartz et al., 2020). We hypothesize that a training set that encompasses a diverse set of structures can steer the model towards a compositional solution.
We examine a realistic setup, where we have a small labeled training set (∼1,000 examples) and a large pool of synthetic utterances paired with programs ( Fig. 1), which are queries over a database (DB) in our setup. We propose two methods for strategically sampling a diverse synthetic training set. In the first, termed uniform abstract template (UAT) sampling, we abstract queries by replacing DB constants with abstract tokens (e.g., replacing petsAllowed with property), and then skew the distribution towards uniform sampling over the derived templates. This increases structure diversity in the training set, which intuitively should lead to better compositional generalization. In the second method, termed compound maximum entropy (CMaxEnt) sampling, we consider the tree structure of every DB query, and following Tsarkov et al. (2020) define compounds, which are sub-trees in the query. We then heuristically solve an optimization problem, where our goal is to select a training set that has maximal entropy over compounds. This results in a training set with a diverse set of substructures, which should enhance compositional generalization.
We evaluate our approach on a new split of the Schema2QA dataset (Xu et al., 2020a), for which it has been shown that synthetic data can lead to high-accuracy parsers in the i.i.d setup (Xu et al., 2020b).
We train an encoder-decoder model on synthetic data and subsequently fine-tune it on the small annotated dataset. We show that random sampling of synthetic data improves performance, and that gradually increasing the size of the synthetic training set further improves compositional generalization. With 1M randomly-sampled examples, accuracy on the compositional split improves from 20.1→37.7, and on the i.i.d split from 81.2→85.0. When sampling structurally-diverse data, compositional generalization improves from 20.1 to >40 with as few as 5K examples, outperforming training with 1M synthetic examples, while i.i.d generalization remains comparable to random sampling. UAT and CMAXENT both lead to large improvements in compositional generalization, but UAT is more effective than CMAXENT.
Overall, our work demonstrates that sampling diverse structures from synthetic data can lead to dramatic gains in compositional generalization at negligible cost, while preserving or improving performance in the i.i.d setup. Our code and data can be downloaded from http://github.com/inbaroren/scfg-sampling-for-comp-gen.

Problem Setup
We assume access to a small dataset of natural language utterances paired with queries, D_train, and a large pool of synthetic utterances paired with queries, D^Syn_train. In this work, D^Syn_train is generated with a synchronous context-free grammar, which provides wide coverage of query structures and tight control over the generated queries, but other methods of generating synthetic examples are possible (Andreas, 2020; Guo et al., 2020a; Wang et al., 2021). Table 1 provides examples of natural language utterances, synthetic utterances, and queries in ThingTalk, a language designed for virtual assistants that we use in this work (through the Schema2QA dataset; Xu et al., 2020a,b).
Our goal is to train a model using D_train and D^Syn_train and generalize to a test set D_test sampled from the same distribution as D_train. More importantly, our model should generalize to a compositional test set, D^Comp_test, which contains structures/compositions that are not observed in D_train or D^Syn_train. We construct this test set by abstracting each query into an abstract template, z_abs, obtained by replacing DB constants with abstract tokens, and splitting the data such that the training and test sets are disjoint with respect to their abstract templates (see Table 3 in §3.1).

Table 1: Examples of natural language utterances (x), synthetic utterances (x'), queries (z), and an abstract template (z_abs).
x:  show me a book with at least 2 awards .
x': which books have more than 2 awards
z:  ( @Book ) filter count ( award:Array(String) ) >= 2
x:  search for any books with a rating of 5 that also have 100 pages or more
x': what book gets number of pages at least 100 and gets the 5 mark ?
z:  ( @Book ) filter ratingValue:Number == 5 and numberOfPages:Number >= 100
x:  show me hotels with a fitness center
x': is there any hotels having fitness center in its amenity features
z:  ( @Hotel ) filter amenityFeature:Array(LocationFeatureSpecification) contains "fitness center"
x:  can you find a hotel that accepts dogs ?
x': what hotels having pets allowed ?
z:  ( @Hotel ) filter petsAllowed:Boolean == true
z_abs: ( @table ) filter property:type op entity
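To make abstract templates concrete, the following toy sketch abstracts the last query in Table 1 into its template. It is purely illustrative: the token classes and operator list below are our own assumptions, and the actual abstraction in this work is defined over the grammar's parse trees rather than surface tokens.

```python
import re

def abstract_query(query: str) -> str:
    """Toy abstraction of a whitespace-tokenized ThingTalk-style filter query."""
    out = []
    for tok in query.split():
        if tok.startswith("@"):                       # table reference, e.g. @Hotel
            out.append("@table")
        elif re.fullmatch(r"\w+:\S+", tok):           # typed property, e.g. petsAllowed:Boolean
            out.append("property:type")
        elif tok in {"==", ">=", "<=", ">", "<", "contains"}:
            out.append("op")
        elif tok in {"(", ")", "filter", "and", "or"}:
            out.append(tok)                           # structural keywords are kept
        else:
            out.append("entity")                      # literals: true, 5, "fitness center"
    return " ".join(out)

print(abstract_query("( @Hotel ) filter petsAllowed:Boolean == true"))
# ( @table ) filter property:type op entity
```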

Sampling a Structurally-diverse Training Set
We first succinctly describe our model and training procedure ( §3.1) and then turn to methods for sampling structurally-diverse training sets.

Model and Training
In this work, we start from a pre-trained encoder-decoder model, as such models have been shown to provide a good initialization for fine-tuning semantic parsers. We then train our model in two steps (Yu et al., 2020; Wang et al., 2021): first on synthetic utterance-query pairs (x', z) ∈ D^Syn_train, and then on natural language utterance-query pairs (x, z) ∈ D_train. Training in two steps mitigates the gap in language variation between D^Syn_train and D_train. We train with the standard maximum-likelihood objective, maximizing the probability of the gold sequence, z.

Table 3: A compositional split prohibits the same abstract template from appearing in both the training and test set, and hence tests compositional generalization. Examples 1-2 and 3-4 share the same template, so in an i.i.d split they can be assigned to different sets, while in a compositional split they must all be in either the training or the test set.
1. x': please search the hotels with pets allowed (i.i.d: train, comp.: train)
   z: ( @Hotel ) filter petsAllowed:Boolean == true
   z_abs: ( @table ) filter property:type op entity
2. x': please search books with ebook format (i.i.d: test, comp.: train)
   z: ( @Book ) filter format:Enum == ebook
   z_abs: ( @table ) filter property:type op entity
3. x': how many people are there (i.i.d: train, comp.: test)
   z: aggregate count of ( @Person )
   z_abs: func ( @table )
4. x': how many hotels are there (i.i.d: test, comp.: test)
   z: aggregate count of ( @Hotel )
   z_abs: func ( @table )
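Below is a minimal sketch of this two-step training, using the HuggingFace transformers interface with BART-base (the paper's implementation uses AllenNLP; hyper-parameters, batching, and early stopping are omitted here, and the example pairs are illustrative).

```python
from torch.optim import AdamW
from transformers import BartForConditionalGeneration, BartTokenizerFast

def train_stage(model, tokenizer, pairs, epochs=1, lr=3e-5, batch_size=8, device="cpu"):
    """One training stage: maximum-likelihood on (utterance, query) pairs."""
    model.to(device).train()
    optimizer = AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for i in range(0, len(pairs), batch_size):
            utterances, queries = zip(*pairs[i:i + batch_size])
            batch = tokenizer(list(utterances), text_target=list(queries),
                              padding=True, truncation=True, return_tensors="pt").to(device)
            loss = model(**batch).loss          # cross-entropy over the gold query tokens
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Step 1: pre-train on sampled synthetic pairs; Step 2: fine-tune on annotated pairs.
synthetic_pairs = [("which books have more than 2 awards",
                    "( @Book ) filter count ( award:Array(String) ) >= 2")]
annotated_pairs = [("show me a book with at least 2 awards .",
                    "( @Book ) filter count ( award:Array(String) ) >= 2")]
model = train_stage(model, tokenizer, synthetic_pairs)
model = train_stage(model, tokenizer, annotated_pairs)
```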
Uniform sampling Our baseline sampling method is to construct the sampled training set D̂^Syn_train by sampling from D^Syn_train uniformly. This simulates sampling from the synchronous grammar directly, and can inform us whether synthetic data improves generalization to D^Comp_test, even if D̂^Syn_train is very large.

Uniform Abstract Template Sampling
We conjecture that a model is more likely to converge to a "compositional solution" if it observes at training time a multitude of different structures, and learns that sub-structures can occur in multiple contexts. For example, if two properties always co-occur in the training set (e.g., ratingValue and numberOfPages in the second example of Table 1), then the model might erroneously learn to always decode them together. To achieve this goal, we define a sampling process, UAT, that results in a more uniform distribution over abstract templates. In this process, we first sample an abstract template, and then sample an example conditioned on that template. We skew the distribution over templates to be close to uniform, which causes templates with few examples to be over-represented in the sampled training set D̂^Syn_train. Thus, minimizing the loss over the training set will take into account a large number of templates, which should improve compositional generalization. Typically, even if the number of examples in D^Syn_train is large (∼6M in our experiments), the number of abstract templates is much smaller (251 in our experiments, see §4). Fig. 2 illustrates this sampling process.

Figure 2: UAT sampling. When α = 1, UAT is a uniform sample over D^Syn_train. As α → 0, the probability over abstract templates (z_abs) becomes uniform. Consequently, the sample includes more abstract templates.
Formally, we construct D̂^Syn_train by sampling from D^Syn_train without replacement using the following procedure. Let T be the set of templates in D^Syn_train, T(z_i) be the abstract template of a query z_i, and c(T(z_i)) be the number of times T(z_i) occurs in D^Syn_train. We estimate a probability distribution over templates, p(T(z_i)) = c(T(z_i)) / |D^Syn_train|, and a uniform distribution over examples conditioned on a particular template, p((x_i, z_i) | T(z_i)) = 1 / c(T(z_i)). We then sample a synthetic utterance-query pair from the distribution

    p_α(x_i, z_i) ∝ p(T(z_i))^α · p((x_i, z_i) | T(z_i)),

where α ∈ [0, 1]. When α = 1, this corresponds to the aforementioned uniform sampling over examples (factorized over templates), and when α = 0, it corresponds to uniform sampling over the templates for which there still remain examples to be sampled. Values between 0 and 1 allow a smooth transition between uniform sampling over examples and over templates. In §4, we examine the effect of various values of α on compositional generalization for varying sizes of the sampled training set, D̂^Syn_train.
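One way to realize this procedure is sketched below. It is an illustration under our own assumptions, not the exact implementation: template probabilities are estimated once from the full pool, each draw picks a template with probability proportional to p(T)^α among templates that still have unsampled examples, and then an example is drawn uniformly within that template.

```python
import random
from collections import defaultdict

def uat_sample(pool, get_template, k, alpha, seed=0):
    """Sample k (utterance, query) pairs from `pool` without replacement.

    alpha = 1 approximates uniform sampling over examples;
    alpha = 0 is uniform over templates that still have remaining examples.
    """
    rng = random.Random(seed)
    by_template = defaultdict(list)
    for utterance, query in pool:
        by_template[get_template(query)].append((utterance, query))
    base_prob = {t: len(exs) / len(pool) for t, exs in by_template.items()}  # p(T)
    for exs in by_template.values():
        rng.shuffle(exs)  # uniform draws within a template via pop()

    sample = []
    while len(sample) < k and by_template:
        templates = list(by_template)
        weights = [base_prob[t] ** alpha for t in templates]  # skewed by alpha
        t = rng.choices(templates, weights=weights, k=1)[0]
        sample.append(by_template[t].pop())
        if not by_template[t]:
            del by_template[t]  # template exhausted
    return sample
```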

Compound Maximum Entropy
UAT sampling does not consider the similarity between different abstract templates, treating each template independently. However, different templates potentially share some sub-structure that can be used for obtaining more diversity at the sub-structure level. We now consider CMAXENT, a method for sampling a synthetic training set with diverse sub-structures.

Figure 3: A ThingTalk parse tree (with abstract entities). Each node is an atom, and any subgraph of height at most 2 that has at least one terminal is a compound. Two compounds are marked in green.
Recently, Keysers et al. (2020) and Shaw et al. (2020) used the notion of sub-structures to construct a compositional test set. Given a program tree, they define atoms to be nodes in the tree and compounds to be sub-structures in the tree. Then, given a pool of examples, they partition it into two sets (train and test), such that the distribution over atoms is similar, but the distribution over compounds is different. Here, we adopt their definition of atoms and compounds, but for a different objective. We aim to sample a set D̂^Syn_train such that the entropy over atoms and compounds is maximized. This will expose the model to a diverse set of atoms and compounds in multiple contexts, which should lead to compositional generalization.
Atom and compound distributions Queries in formal languages can be parsed into trees. We adopt the definition of Keysers et al. (2020), and define atoms as any node in the tree, and compounds as any sub-tree of height ≤ 2 that includes at least one terminal. We reduce the space of compounds by abstracting entity tokens (e.g., "tokyo" → entity). Fig. 3 shows an example tree with two compounds.
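The sketch below illustrates one simple reading of these definitions for trees encoded as nested tuples. The encoding and the exact treatment of single-node sub-trees are our assumptions; the compound enumeration of Keysers et al. (2020) is more involved.

```python
from collections import Counter
from typing import Tuple

Tree = Tuple[str, tuple]  # (label, (child_tree, ...)); a leaf has an empty child tuple

def atoms(tree: Tree) -> Counter:
    """Every node label is an atom."""
    label, children = tree
    counts = Counter([label])
    for child in children:
        counts += atoms(child)
    return counts

def truncate(tree: Tree, depth: int) -> Tree:
    """Keep the node and its descendants down to `depth` levels below it."""
    label, children = tree
    if depth == 0:
        return (label, ())
    return (label, tuple(truncate(c, depth - 1) for c in children))

def has_terminal_within(tree: Tree, depth: int) -> bool:
    """True if the tree has a leaf within `depth` levels of this node."""
    label, children = tree
    if not children:
        return True
    if depth == 0:
        return False
    return any(has_terminal_within(c, depth - 1) for c in children)

def compounds(tree: Tree) -> Counter:
    """Sub-trees of height <= 2 that keep at least one terminal of the original tree."""
    label, children = tree
    counts = Counter()
    if children and has_terminal_within(tree, 2):
        counts[truncate(tree, 2)] += 1
    for child in children:
        counts += compounds(child)
    return counts

# ( @Hotel ) filter petsAllowed:Boolean == true, with the entity abstracted:
query = ("filter", (("@Hotel", ()),
                    ("==", (("petsAllowed:Boolean", ()), ("entity", ())))))
print(atoms(query).most_common())   # 5 atoms, each occurring once
print(len(compounds(query)))        # 2 compounds
```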
For a sampled set D̂^Syn_train, we use p(a) to denote the frequency distribution of atoms in D̂^Syn_train, and p(c) to denote the weighted frequency distribution of compounds in D̂^Syn_train. The compounds are weighted following Keysers et al. (2020), to avoid double-counting compounds that mostly co-occur with their super-compounds. Our objective is to select the subset D̂^Syn_train that maximizes the entropy of these two distributions. Finding the subset that maximizes this objective is computationally hard, hence we use a greedy heuristic, termed CMAXENT, to approximate it. Despite its simplicity, we show in §4 that this approach improves entropy compared to a random sample. Specifically, we start with an empty training set, and in each iteration go over all examples in D^Syn_train (with abstract entities), choose the example that maximizes our objective, and add one of the corresponding non-abstract examples to D̂^Syn_train. We determine the number of iterations according to the desired target size of D̂^Syn_train. Fig. 4 illustrates a single iteration in this procedure.
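A sketch of the greedy selection loop is given below. For brevity it maximizes the entropy of the pooled compound counts only, omitting the atom distribution and the Keysers-style compound weighting; the per-example compound Counters could be produced, e.g., by the compounds() sketch above. Treat this as an illustration of the heuristic, not the exact implementation.

```python
import math
from collections import Counter

def entropy(counts: Counter) -> float:
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((v / total) * math.log(v / total) for v in counts.values() if v > 0)

def greedy_max_entropy(candidates, k):
    """candidates: list of (example_id, Counter of compounds). Returns k selected ids."""
    selected = []
    pool_counts = Counter()           # compound counts of the set selected so far
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        best_idx = max(range(len(remaining)),
                       key=lambda i: entropy(pool_counts + remaining[i][1]))
        example_id, comp_counts = remaining.pop(best_idx)
        pool_counts += comp_counts    # add the chosen example's compounds
        selected.append(example_id)
    return selected
```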
Hybrid approach Our last approach combines our two sampling procedures, UAT and CMAXENT. Specifically, in each iteration we sample an abstract template uniformly, and then use the greedy procedure for maximizing entropy over the queries that correspond to the sampled abstract template. This results in a process that still improves the maximum-entropy objective, but is skewed towards a uniform distribution over abstract templates.
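A sketch of the hybrid procedure under the same assumptions as the previous sketches (the data layout and grouping are ours, not the exact implementation):

```python
import math
import random
from collections import Counter, defaultdict

def entropy(counts: Counter) -> float:
    total = sum(counts.values())
    return 0.0 if total == 0 else -sum((v / total) * math.log(v / total)
                                       for v in counts.values() if v > 0)

def hybrid_sample(candidates, get_template, k, seed=0):
    """candidates: list of (example_id, query, Counter of compounds).
    Each iteration: pick a template uniformly, then the max-entropy example within it."""
    rng = random.Random(seed)
    by_template = defaultdict(list)
    for example_id, query, comp in candidates:
        by_template[get_template(query)].append((example_id, comp))

    selected, pool_counts = [], Counter()
    while len(selected) < k and by_template:
        template = rng.choice(list(by_template))              # uniform over templates
        group = by_template[template]
        best_idx = max(range(len(group)),
                       key=lambda i: entropy(pool_counts + group[i][1]))
        example_id, comp = group.pop(best_idx)
        pool_counts += comp
        selected.append(example_id)
        if not group:
            del by_template[template]
    return selected
```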

Experiments
We empirically evaluate the contribution of synthetic data to compositional generalization and the extent to which structurally-diverse sampling can improve data efficiency.

Experimental Setting
Data Our work assumes access to a small annotated training set and a large pool of synthetic data, generated from a wide-coverage synchronous grammar. We build on the work of Xu et al. (2020a), who created the Schema2QA dataset, which contains natural language utterance-query pairs in the ThingTalk language for 6 domains of the Schema.org ontology: restaurants, people, hotels, books, movies, and music. Moreover, Xu et al. (2020b) presented AutoQA as part of the Genie toolkit for generating synthetic examples for Schema.org from a synchronous grammar (Campagna et al., 2019). To obtain enough data, we use the manually-labeled data from all 6 domains as our annotated data, and term it NL-SCHEMAORG (see §4.3). We generate 7M synthetic examples using AutoQA, and term this dataset SYN-SCHEMAORG.
We construct an i.i.d split and a compositional split of NL-SCHEMAORG, resulting in one common training set D_train, containing only 1,619 examples, compositional development and test sets, and i.i.d development and test sets (see Table 4 for statistics). The training set, compositional development set, and compositional test set are all disjoint w.r.t. their abstract templates (see §2). The i.i.d development and test sets are sampled from the same distribution as D_train. We create a compositional split of D^Syn_train, resulting in training/development/test sets that are disjoint in terms of their abstract templates, according to the templates in the compositional split of NL-SCHEMAORG. We describe the exact procedure for splitting the data in Appendix A.
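The construction can be sketched as follows. This is a simplified illustration; the test fraction, tie-breaking, and the development/test splitting details are assumptions, and the actual procedure is described in Appendix A.

```python
import random
from collections import defaultdict

def compositional_split(examples, get_template, test_fraction=0.3, seed=0):
    """Split examples so that train and test share no abstract template."""
    rng = random.Random(seed)
    by_template = defaultdict(list)
    for example in examples:                      # example = (utterance, query)
        by_template[get_template(example[1])].append(example)

    templates = sorted(by_template)
    rng.shuffle(templates)
    n_test = int(len(templates) * test_fraction)
    test_templates = set(templates[:n_test])

    train = [ex for t in templates[n_test:] for ex in by_template[t]]
    test = [ex for t in test_templates for ex in by_template[t]]
    return train, test, test_templates

# The synthetic pool is split consistently: synthetic examples whose abstract
# template falls in test_templates are excluded from the synthetic training pool.
```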
Evaluation metric We evaluate models using exact match accuracy, that is, whether the predicted query is identical to the gold query. We denote accuracy on the compositional and i.i.d splits as EM_comp and EM_iid, respectively. We report the average and standard deviation over 15 models, which are obtained by training on 3 different random samples D̂^Syn_train, each with 5 different random seeds. In all experiments, we use EM_iid on the development set to determine early stopping and to tune the batch size, learning rate, and number of warmup steps (see hyper-parameter values in Appendix C).
Evaluated models Our baseline parser is fine-tuned on the training set of NL-SCHEMAORG (without pre-training), and is termed BASELINE. We denote the rest of our experiments by the sampling method used to obtain D̂^Syn_train, where UNIFORM denotes uniform sampling, UAT denotes abstract template sampling (§3.2), and CMAXENT denotes compound maximum entropy sampling (§3.3). We denote the hybrid approach, combining the latter two methods, as CMAXENT+UAT.
Importantly, we evaluate the effectiveness of our methods across different sizes of D̂^Syn_train. Overall, we are interested in the interactions between compositional generalization, i.i.d generalization, the sampling method, and the size of the synthetic training set. We are especially interested in the effectiveness of our suggested sampling methods with smaller samples, and hence limit the sample size to 120K. We denote the size of D̂^Syn_train by concatenating it to the model name, e.g., UAT+5K corresponds to UAT sampling with 5K synthetic examples.
As another baseline, we use GECA (Andreas, 2020) as an alternative source of synthetic data. We use the publicly available code, which takes the training set of NL-SCHEMAORG and augments it with 1,342 new examples. We use these examples as D̂^Syn_train in our setup.

Results
Uniform abstract templates When further increasing the size of the synthetic data, the improvement roughly plateaus, reaching 43.0 EM_comp for 120K examples. A possible explanation for this effect is that as the size of D̂^Syn_train grows, the distribution over templates becomes more skewed, as shown in Fig. 5. Changing the composition of D^Syn_train to contain more abstract templates by modifying the generation procedure in the AutoQA toolkit, and examining whether this leads to even greater gains in compositional generalization, is an important question for future work.

To test whether a smooth transition from α = 1 to α = 0 indeed leads to a smooth transition in compositional generalization, we train models with multiple values of α. Fig. 6 confirms that tuning α from 1 to 0 yields a gradual improvement in EM_comp. Last, EM_iid is comparable to UNIFORM, and is even higher for smaller samples.

Figure 6: Accuracy on the compositional development set by sample size and α value. α = 1 is equivalent to uniform sampling over examples; as α decreases, the distribution over abstract templates becomes closer to uniform. We report the average over 5 random seeds, and bars denote 95% confidence intervals. The x-axis is in log scale.
Compound maximum entropy CMAXENT improves compositional generalization with greater data efficiency compared to UNIFORM, as it improves EM_comp at all sample sizes. With 120K examples, CMAXENT reaches 40.2 EM_comp, and surpasses UNIFORM+1M, at 37.7.
Still, UAT outperforms CMAXENT in all cases. There are several possible explanations for this phenomenon. First, it might be the case that the distribution over abstract templates is the important factor in determining compositional generalization. In Appendix D we show that the distribution over abstract templates of CMAXENT is indeed more skewed compared to UAT. Second, our heuristic for optimizing entropy might be sub-optimal. While we do observe an increase from 6.1→7.1 in compound entropy, and from 3.9→4.4 in atom entropy, it is possible that a better optimization procedure or a better definition of the notion of a compound would yield further gains in compositional generalization. We leave this investigation for future work.
Hybrid sampling Combining CMAXENT and UAT leads to improvements in EM_comp over CMAXENT for all sample sizes (except 5K), but the overall performance does not surpass UAT.
GECA Results are slightly lower than BASELINE, at 79.0 EM_iid and 19.7 EM_comp. Sampling 1,342 examples with GECA is better than UNIFORM+2K, but worse than UAT+2K. Thus, an advantage of the synchronous grammar is that we can easily sample a large number of synthetic examples that cover a wider range of structures.
NL-SCHEMAORG comprises data from 6 different domains. In Table 10 in Appendix F, we show development EM per domain. While the number of examples in each domain is small (a few dozen examples per domain), we still observe similar trends across all domains.
To summarize, our results suggest that abstract template diversity is an important factor in compositional generalization, and generating synthetic data with many abstract templates can dramatically improve compositional generalization.

Analysis
We perform a manual analysis of model predictions on the NL-SCHEMAORG compositional development set. We inspect 40 predictions of 8 models, and identify three major error types. The first type is structural errors, which include syntax errors, misplaced parentheses, and incorrect typing of properties. The second type is linking errors, including, e.g., hallucination of entities, returning unsorted results, and missing an aggregation step. Third, we identify benign errors, where the predicted query is valid and equivalent to the target query; an example is using a wrong operator that nevertheless does not change the meaning of the query. We also measure robustness to DB entity replacement in a query by grouping together any subset of development examples that differ only in a DB entity, and counting for how many groups all the examples are predicted correctly. Table 6 shows the results of the analysis. Our findings suggest two main benefits of a larger D̂^Syn_train: (a) more frequently, the errors are benign, and (b) generalization is more robust to DB entity replacement. In addition, we find that using UAT reduces structural errors, but increases linking errors, e.g., missing necessary sort or filter steps. Last, linking errors are the most common error type across all models.
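The entity-replacement robustness measure can be sketched as follows. The grouping key below (masking quoted strings and numbers in the gold query) is a toy heuristic of ours, not necessarily the exact grouping used in the analysis.

```python
import re
from collections import defaultdict

def entity_signature(query: str) -> str:
    """Group key: the gold query with DB entity literals masked out."""
    masked = re.sub(r'"[^"]*"', "ENTITY", query)            # quoted string entities
    return re.sub(r"\b\d+(\.\d+)?\b", "ENTITY", masked)     # numeric entities

def entity_replacement_robustness(gold_queries, is_correct):
    """gold_queries: list of gold queries; is_correct(query) -> bool for the model.
    Returns the fraction of multi-example groups predicted entirely correctly."""
    groups = defaultdict(list)
    for query in gold_queries:
        groups[entity_signature(query)].append(query)
    multi = [g for g in groups.values() if len(g) > 1]       # groups differing only in an entity
    if not multi:
        return 0.0
    return sum(all(is_correct(q) for q in g) for g in multi) / len(multi)
```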
Inspecting the predictions of UNIFORM+1M and UAT+5K on the development set, we find that the abstract templates of correct predictions constitute roughly 40% of the templates in the set, and are almost identical between the two models. We notice that "hard" templates are not necessarily longer or more nested.

Related Work
Data augmentation Previous work has studied different data augmentation techniques to improve i.i.d generalization in semantic parsing, including synchronous grammars (Jia and Liang, 2016; Yu et al., 2020; Xu et al., 2020b), target-side grammars with neural generation models (Tran and Tan, 2020; Wang et al., 2021), and pre-training with auxiliary tasks (Yin et al., 2020; Deng et al., 2021). In the context of compositional generalization, data augmentation has been performed by recombining fragments of training examples (Andreas, 2020; Akyürek et al., 2021), or by back-translation (Guo et al., 2020c). Conversely, we generate data from an independent wide-coverage grammar and investigate data-efficient sampling through structural diversity.
Outside of semantic parsing, it has been shown in a grounded learning setup (Hill et al., 2020) that increasing lexical diversity can improve out-of-distribution generalization.
Compositional Generalization In contrast to our work, which focuses on sampling synthetic data, many other approaches have been suggested to improve compositional generalization in semantic parsing. These include new or modified model architectures (Gordon et al., 2020; Guo et al., 2020b; Oren et al., 2020; Zheng and Lapata, 2020; Herzig and Berant, 2021; Shaw et al., 2020), pre-trained language models, intermediate representations, and meta learning (Lake, 2019; Conklin et al., 2021).
Data Selection Our work is related to algorithmic approaches for reducing biases in datasets, such as adversarial filtering (Sakaguchi et al., 2020) and representation debiasing (Li and Vasconcelos, 2019). Our approach utilizes the structural nature of executable queries, and focuses on biases related to structural diversity.

Conclusion
In this work, we explore for the first time whether generating large amounts of synthetic data from a synchronous grammar improves compositional generalization, and propose sampling methods that allow for more efficient training by selecting structurally-diverse training sets. We find that synthetic data dramatically improves compositional generalization and moderately improves i.i.d generalization, and that by uniformly sampling abstract templates, we can improve data efficiency by a factor of 200x.
In the past year, a myriad of approaches have been proposed for encouraging compositional generalization through modeling innovations, clever training procedures, and data augmentation techniques. Our work adds to the body of work showing that data augmentation is an effective strategy even with small amounts of augmented data, when examples are carefully constructed. Moreover, data augmentation techniques can be easily combined with new models and training procedures, potentially leading to further gains in compositional generalization.
In addition, we believe our findings can be generalized to other NLP tasks to improve data efficiency.

B Dataset Statistics
Table 7 shows the domain distribution in the synthetic and annotated datasets.

C Training
We implement and train our models using AllenNLP with PyTorch as the backend, and initialize them with BART-base. We conduct experiments on a machine with 8 NVIDIA GeForce GTX 2080 GPUs and 40 Intel(R) Xeon(R) Silver 4114 CPUs. The OS is Ubuntu 18.04.3 LTS.
Hyper-parameters We use the Adam optimizer (Loshchilov and Hutter, 2019). The batch size is selected separately for sample sizes ≤ 5,000, and from {24, 48, 64} for larger samples. We use a learning rate scheduler with polynomial decay and select the number of warm-up steps from {1000, 1500, 2000} for sample sizes ≤ 1000, and from {2500, 3000, 3500, 4000} for larger samples. We use a patience of 5 epochs in pre-training, and 10 epochs in fine-tuning. We use EM on the i.i.d development set as the metric for early stopping and for selecting the best hyper-parameters. Hyper-parameters are tuned for each sampling method and sample size separately on a single sample of SYN-SCHEMAORG. The patience is selected from {5, 8, 10} when training on samples different from the one used for tuning. Hyper-parameters for the models fine-tuned on NL-SCHEMAORG are tuned once on a UNIFORM sample.

D Sample Diversity
We measure structural diversity in terms of the abstract template distribution, atom entropy, and compound entropy. Table 8 compares the total number of abstract templates seen during both pre-training and fine-tuning between sampling methods and sample sizes. Figure 7 compares the normalized frequency of abstract templates between sampling methods for two sample sizes. Figure 8 compares the atom and compound entropies between sampling methods and sample sizes. The above statistics are computed on a single sample from each method and size.
E Generalization of Synthetic vs. Annotated Data

Figure 9 shows, for each sampling method and sample size, the EM_comp of the pre-trained model on the SYN-SCHEMAORG development set, and the EM_comp of the fine-tuned model on the NL-SCHEMAORG development set. The two are correlated for sample sizes smaller than 10K. We report the average over 5 random seeds and a single sample.

Figure 9: EM_comp of the pre-trained model on the SYN-SCHEMAORG development set, and EM_comp of the fine-tuned model on the NL-SCHEMAORG development set. We report the average over 5 random seeds on a single sample.

Table 9: Development EM for all sampling methods. We report the average over 15 models, which are obtained by training on 3 different samples, each with 5 different random seeds.