Structurally Diverse Sampling for Sample-Efficient Training and Comprehensive Evaluation

A growing body of research has demonstrated the inability of NLP models to generalize compositionally and has tried to alleviate it through specialized architectures, training schemes, and data augmentation, among other approaches. In this work, we study a different approach: training on instances with diverse structures. We propose a model-agnostic algorithm for subsampling such sets of instances from a labeled instance pool with structured outputs. Evaluating on both compositional template splits and traditional IID splits of 5 semantic parsing datasets of varying complexity, we show that structurally diverse training using our algorithm leads to comparable or better generalization than prior algorithms in 9 out of 10 dataset-split type pairs. In general, we find structural diversity to consistently improve sample efficiency compared to random train sets. Moreover, we show that structurally diverse sampling yields comprehensive test sets that are a lot more challenging than IID test sets. Finally, we provide two explanations for improved generalization from diverse train sets: 1) improved coverage of output substructures, and 2) a reduction in spurious correlations between these substructures.


Introduction
Systematic compositionality-expressing novel complex concepts as systematic compositions of expressions for simpler concepts-is the property underlying human languages' expressive power (Lake et al., 2017;Fodor and Pylyshyn, 1988).However, NLP models struggle to generalize to novel composite expressions in the context of semantic parsing (Lake and Baroni, 2018;Loula et al., 2018;Kim and Linzen, 2020).
Data augmentation has been extensively explored for compositional generalization (Akyürek et al., 2021;Guo et al., 2021;Wang et al., 2021;Guo et al., 2020;Qiu et al., 2022).However, instances in semantic parsing possess structure, such as abstract syntax trees (ASTs) of target programs (Figure 1).Randomly selecting instances from a grammar, or even from human annotators, while ignoring similarity in their structure is likely to produce skewed distributions with spurious correlations between substructures.This task ambiguity and underspecification are known to cause overparameterized models to have high variance and poor generalization (D'Amour et al., 2020;Geirhos et al., 2020).In contrast, a structurally diverse set of instances containing a variety of structures should better cover the combinatorial space of structures and yield more sample-efficient train sets.Moreover, this improved coverage should yield more comprehensive test sets than the traditional IID test sets.
Prior work on diverse sampling has shown improved sample efficiency in compositional (non-IID) splits, by selecting instances to have diverse program templates (Oren et al., 2021) or AST bigrams (Bogin et al., 2022) (see Figure 1 for examples).However, for reasonably-sized programs, templates are very large structures, and bigrams very small; focusing on either does not necessarily improve diversity across a range of differently-sized structures.Moreover, there is little evidence of sample efficiency of diverse train sets in the IID setting and no exploration of diverse test sets.
In this study, we take a broader look at the advantages of structural diversity.We propose a general, substructure, and model-agnostic recipe for structurally diverse subsampling algorithms that interleaves selecting unseen substructures with selecting instances containing those substructures.Experimenting with different substructure and instance selection criteria, we show improved performance from 1) prioritizing the selection of more frequent unseen substructures in the training pool, 2) using subtrees 1 as substructures that are neither too granular (bigrams) nor too large (templates), and 3) simultaneously diversifying over both small (subtree) and larger (template) substructures.
We evaluate on template and IID splits of five semantic parsing datasets of varying complexities and find that training on diverse train sets consistently outperforms random train sets.In particular, our proposed subtree diversity algorithm is the most consistent, performing comparably with or better than the template and bigram diversity in 9 out of 10 splits, with both bigram and template algorithms often performing worse than random subsampling in IID splits.We study the efficacy of diversely sampled test sets and show that our diverse sampling algorithm yields test sets that are harder and more comprehensive than IID tests, especially with IID training sets.Finally, by comparing random and structurally diverse subsamples, we show that the latter 1) better cover the space of substructures, especially those in the long tail, and 2) have weaker spurious correlations between substructures, explaining their benefits in training and evaluation.
To conclude, our results demonstrate the effectiveness of structurally diverse train sets in inducing generalization and of structurally diverse test sets as comprehensive evaluations.We hope these insights will encourage the use of structural diversity as a criterion when sampling train or test sets from a pool of labeled instances.A setting where we expect our algorithm will be particularly useful is that of prototyping semantic parsers in zero-data settings, where a pool of programs is sampled from a grammar and mapped to canonical utterances (Wang et al., 2015; Herzig and Berant,   1 These could also be called subgraphs but we'll stick to "subtree" as the whole structures are trees.2019;Campagna et al., 2020;Yin et al., 2022).Another potential application could be in exemplar selection for in-context learning.Our code, data, and models are available at https://github.com/Shivanshu-Gupta/structural-diversity.

Setup
Semantic Parsing, the task we focus on, involves parsing an utterance x into a program or logical form y. Following Oren et al. (2021) and Bogin et al. (2022), we will assume access to a pool of instances D pool = {e i , . . ., e n }, where e i = (x i , y i ), from which we wish to sample instances.As the example in Figure 1 shows, logical forms possess hierarchical structure which can be represented as trees called abstract syntax trees (ASTs).These structures are composed of smaller substructures such as subtrees of the AST, that are shared across logical forms.We can thus view each instance as a bag of output substructures: where C is the mapping from instances to their substructures and C is the set of all unique substructures.
The goal of structurally-diverse sampling is to select instances that contain among them a variety of different substructures.Our hypotheses, that we verify in §6, are that a collection of such instances would both better cover the combinatorial space of structures and reduce spurious correlations between substructures compared to random instances.A structurally diverse train set should thus give the model more information about the system underlying the instances' structure and hence improve generalization.Similarly, its improved coverage should also make a structurally diverse test set more comprehensive.This is analogous to how one would choose to evaluate on a test set with balanced classes even when the training data might have a considerable imbalance.

Substructures
Prior work (Oren et al., 2021;Bogin et al., 2022) has used program templates and bigrams in program ASTs (Figure 1) for their diverse subsampling algorithms.Bigrams are pairs of parent-child and sibling nodes in program ASTs.Program templates represent the reasoning pattern in the program and are obtained by replacing certain program tokens such as strings and numbers with their abstract type.
Their exact implementation depends on the dataset and is described in §4.2.
While we follow Oren et al. (2021) and Bogin et al. (2022) in extracting substructures from programs alone, we choose to use different substructures, as templates and bigrams do not seem to hve the optimal granularity to diversify over.Templates are too coarse; they still share a lot of structure, and there may be a combinatorially large number of them.On the other hand, bigrams are too fine and may be unable to capture many salient structural patterns.We thus use subtrees of the program AST up to a size d.The ASTs are constructed as in Bogin et al. (2022): tokens in a program are categorized as either functions, values, or structural tokens (such as parentheses or commas) that define the hierarchical structure of the program.

Algorithm
As motivated in §2, we want to sample instances containing among them a variety of substructures.Additionally, we wish to experiment with: 1) prioritizing substructures more common in the pool, as we expect it to reduce the divergence of the subsample from the pool distribution; and 2) simultaneously diversifying over both fine (subtree) and coarse (template) granularity substructures.One approach would be constructing an optimization problem that directly selects an optimally diverse set of instances.However, this has the challenge of defining a measure of diversity that is also tractable to optimize for large pools and a large number of substructures (see Table 2).Additonally, it would be harder to experiment with the different variations of diverse sampling described above.
We thus take the route of an iterative algorithm (pseudo-code in Algorithm 1) that alternates between picking a substructure and picking an instance with that substructure till the requisite number, B, of instances has been sampled.Here, w c and w e specify the substructure and instance selection criteria by assigning weights based on the current state comprising sampled instances D sample , substructures C sample , and templates T sample .The algorithm resets C sample and T sample once all substructures and templates, respectively have been sampled to allow cycling over them repeatedly.In the following sections, we will show that with different substructure definitions and substructure and instance-weighting schemes, Algorithm 1 can subsume both the template diversity and bigram diver- we(e, D pool , D sample , T sample ) Oren et al. (2021) and Bogin et al. (2022).Additionally, it will allow us to experiment with the variations described above.

Subtree Diversity
Our proposed subsampling algorithm diversifies over subtrees as defined in §3.1, i.e., C(x, y) is the set of subtrees of size ≤ d in the AST of y.We will use d = 4. Since we want to prioritize selecting more frequent unsampled substructures in the pool, we will use where F pool (c) is the number of instances in the pool containing c.
For selecting instances given a sampled substructure, we use these instance weighting schemes: 1. RANDEX samples an instance uniformly at random: w e (e) is a constant.The last two schemes seek to improve the coverage of templates to allow the diversifying of both subtrees and templates.

Template and Bigram Diversity
Template Diversity from Oren et al. (2021), henceforth referred to as TEMPLATE, can be implemented in the framework of Algorithm 1 as: • C(x, y) is a singleton set containing the template for y. • w c (c) is a constant function i.e. a random template.
• w e (e) is a constant function (RANDEX).Additionally, we also experimented with selecting only among the unsampled templates i.e. w c (c) = and with prioritizing more frequent unsampled templates using the substructure weighting scheme of §3.3.We found both of these to improve performance but only include the results for the latter.We will refer to it as TEMPLATE [FREQ].
Bigram Diversity from Bogin et al. (2022) , hereafter referred to as BIGRAM, can also be implemented in the framework of Algorithm 1 as: • C(x, y) is the set of bigrams in y's AST as defined by Bogin et al. (2022).• w e (e) is a constant function (RANDEX).We also experimented with bigram diversity with the substructure weighting scheme of §3.3.We will refer to this algorithm as BIGRAM [FREQ].

Experiments
Given a dataset D of utterance-program pairs, we create three different types of splits with each split consisting of a training pool D pool and test set D test .We then compare the various subsampling algorithms described in §3 by using them to sample training sets D train of varying budget size B from D pool and evaluating on D test .

Splits
IID split For this we randomly sample instances from D to use as D test , keeping the rest for D pool .
Template split This is a type of compositional split proposed by (Finegan-Dollak et al., 2018).Here instances are grouped based on their program template as described in §3.1 The split is then created by randomly splitting the set of templates into a train set and a test set and using examples for train or test templates as D pool or D test respectively.We follow the procedure of Bogin et al. (2022) to obtain solvable template splits where every token in the test set also occurs in the train set.
Subtree split In §2 we argued that diversely subsampled sets of instances should also make for more comprehensive test sets.We thus experiment with a third type of split: we use SUB-TREE [FREQNEWT] diverse subsampling to sample test sets D test from D, keeping the rest as D pool .

Datasets
We use five semantic parsing datasets from diverse domains and complexity for our analysis, with both synthetic and natural language input utterances.Tables 1 and 2 show a few examples and statistics regarding number of instances and different types of substructures.COVR: A synthetic dataset that uses a variablefree functional query language and is generated using a synchronous context-free grammar (SCFG) adapted from the VQA dataset of Bogin et al. (2021a).We use the SCFG to generate 100K examples for our experiments.ATIS (Hemphill et al., 1990;Dahl et al., 1994): A dataset of natural language queries about aviation paired with λ-calculus programs.Overnight (Wang et al., 2015): A dataset containing both synthetic and natural language utterances from 11 domains (e.g.socialnetwork, restaurants, etc.) paired with Lambda-DCS logical forms.Schema2QA (Xu et al., 2020): Uses the ThingTalk language (Campagna et al., 2019).We use the synthetic instances from the people domain generated by Oren et al. (2021).SM-CalFlow (Andreas et al., 2020): Consists of dialogs paired with LISP programs.Each instance is a single dialogue turn from one of two domains about creating calendar events or querying an org chart.
Except for SM-CalFlow, which we took from Andreas et al. (2020), we used the preprocessed versions of the above datasets provided by Bogin et al. (2022) and used their code2 to anonymize programs and produce ASTs.For SM-CalFlow, we anonymized strings and numbers such as "staff what people are named aideliz li ( Person ) filter id = "aideliz li" ATIS (natural) a flight on continental airlines leaving boston and going to denver ( lambda $0 e ( and ( flight $0 ) ( airline $0 co : al ) ( from $0 boston : ci ) ( to $0 denver : ci ) ) )

SM-CALFLOW (natural)
When is my next staff meeting scheduled for?(Yield (Event.start(FindNumNextEvent (Event.subject?(? = "staff meeting")) 1L))) meeting" and "1L" in Table 1 and used nltk 3 to produce ASTs.Additionally, we excluded all instances with a particular program that accounted for more than 12% of the original dataset.

Model and Training
We use the pre-trained BART-base model (Lewis et al., 2020) for our experiments, fine-tuning it on subsamples for each dataset, split, and subsampling algorithm.For each dataset, we use 4 different random seeds to create 4 splits of each type.Then for each split and training budget, we use 3 different seeds to subsample 3 training sets for each subsampling algorithm.See App.A for more details.

Structurally Diverse Train Sets
Figure 2 compares the various structurally diverse sampling algorithms, and random sampling on Template and the IID splits of the different datasets.

Structural Diversity improves sample efficiency.
Random sampling is outperformed by structurally diverse sampling in 9 out of 10 dataset-split 3 https://www.nltk.org/type combinations with about 2x sample efficiency in template splits of all five datasets except COVR and IID splits of SCHEMA2QA and OVERNIGHT.The only exception is the IID split of SM-CALFLOW.We believe this is due to a greater imbalance in substructures in SM-CALFLOW where over a third of templates (as well as bigrams and subtrees) only appear in a single instance.
Subtree diversity is most consistent.Our proposed subtree diversity algorithm is the most consistent among the various diverse sampling algorithms.In all but 1 of the 10 splits, it outperforms or matches both template and bigram diversity which often lag behind even random sampling.The only exception is template diversity in the template split of SCHEMA2QA, which has very few templates (see Table 2).These results suggest that the optimal substructure granularity depends on the task.For datasets with little structural diversity, diversifying over templates is sufficient.However, diversifying over smaller substructures is more beneficial for complex datasets with many program templates overlapping in structure as it exploits the structural similarity of programs.
Combining subtree and template diversity improves performance.
Comparing SUB-TREE [RANDNEWT] and SUBTREE [RANDEX] we see that the former performs better in template splits with minor reductions in IID splits.This suggests that simultaneously diversifying over fine and coarse substructures, subtrees, and templates here, can be more effective than either alone.
Prioritizing more frequent compounds improves efficiency.Figure 3: Subtree splits are more comprehensive than IID splits and hence are much harder with randomly sampled train sets but not with diverse train sets.PLATE[FREQ] respectively.It is evident that on IID splits, bigram and template diversity greatly benefit from prioritizing more frequent substructures instead of random selection, with only minor degradation in template splits.

Structurally Diverse Test Sets
Figure 3 compares the performance of random and diverse (SUBTREE [RANDEX]) subsamples in IID and Subtree splits.It is evident that Subtree splits are more challenging than IID splits.More importantly, this split is much harder with random train sets than with diverse train sets, which, as the training budget increases, quickly close the gap with IID performance on all datasets except ATIS.This is expected since, unlike template splits, this split is challenging not because of a systematic distributional gap between train and test sets, but because it tests for a lot more structures that a random train set may not cover.We thus believe that structurally diverse test sets also enable more comprehensive evaluation.

Analysis
Having seen that structurally diverse datasets improve generalization, we now investigate why they do so.In §2 we hypothesize that structural diversity has two benefits over random sampling: (1) improved coverage of the space of substructures, and (2) reduced spurious correlations.We now evaluate whether this is indeed the case.
We will view each instance as the set of subtrees contained in it, i.e., e i ⊂ C where C is the set of all subtrees in the dataset.We define the frequency of a subtree, s, in a set of instances A as the number of instances containing it, i.e.F A (s) = |{s ∈ e : e ∈ A}|.The frequency of a pair of subtrees, s i , s j , is analogously defined as F A (s i , s j ) = |{s i , s j ∈ e : e ∈ A}|.The rank of a subtree in A is defined as its position in a list of all subtrees descending in frequency, with equally-frequent subtrees assigned the average of their ranks.

Improved Coverage of the Long tail
To verify whether structural diversity improves the coverage of substructures, we use random and SUB-TREE [RANDEX] subsampling to take equal-sized subsamples from pools consisting of entire datasets, D, and compare the distribution of subtrees.Figure 4 shows the number of unique subtrees in random and diverse subsamples bucketed by rank in D.
As can be seen in the figure, structurally diverse  sampling improves the coverage of substructures, especially those in the long tail.Crucially, because substructures are chosen by frequency, this coverage of the long tail does not come at the expense of frequent substructures.We refer the reader to Appendix C for additional analyses showing that although SUBTREE[RANDEX] sampling directly diversifies over only output subtrees, it also improves coverage of program templates as well as n-grams in input utterances.

Weakened Spurious Correlations
Structurally diverse datasets might improve generalization by weakening the spurious correlations between substructures that would exist in any dataset of bounded size.

Measure for Correlations
We measure correlations between substructures in a set of instances D as the sum of pairwise mutual information (MI) between them, normalized to account for the larger number of distinct substructures in diverse subsamples.Let S = s 1 , . . ., s m be the set of all subtrees in a sample.We define an indicator random variable I s for each subtree s that is 1 if s is present in a random instance, 0 otherwise.MI between two subtrees, s i and s j is then defined as, (1) where p(I s = 1) and p(I s i = 1, I s j = 1) are  empirical probabilities defined as, Other probabilities are defined analogously.Finally, our measure of correlations is the average mutual information (AMI) across all subtree pairs: Results For each dataset, we took 3 samples using random and subtree subsampling algorithms, treating the entire dataset as pool, and computed the AMI over their substructures.Figure 5 shows that diverse subsamples have lower AMI than random subsamples across different training budgets and the different split types, confirming our hypothesis that structurally diverse datasets have lower correlation.Moreover, examining Spearman correlations of AMI and accuracy (Table 3) for train sets subsampled using random, SUBTREE[RANDEX] and SUBTREE[RANDNEWT] subsampling algorithms from the experiments in §4, we find that they are significantly negatively correlated, giving further credence to our measure.These results substantiate our hypothesis that reducing spurious correlations is indeed one way diverse subsamples improve generalization.

Related Work
Generalization Work on improving compositional generalization has considered many approaches including specialized architectures (Herzig and Berant, 2021;Bogin et al., 2021b;Chen et al., 2020;Gordon et al., 2020;Yin et al., 2021), data augmentation (Andreas, 2020;Akyürek et al., 2021;Guo et al., 2021), modifications to training methodology (Oren et al., 2020;Csordás et al., 2021) (Lewis and Catlett, 1994;Settles and Craven, 2008), where instances are iteratively selected for annotation based on criteria such as model uncertainty (Gal et al., 2017) or diversity (Sener and Savarese, 2018); and 2. adversarial selection of instances that model fails on (Bras et al., 2020;Sakaguchi et al., 2020;Wallace et al., 2019;Nie et al., 2020;Kiela et al., 2021).However, given that these methods use a model in the loop for selection and generally work in the input domain, they are not comparable to our method.Additionally, the coupling of the dataset and model, and the lack of generalizability of AL heuristics has been shown to limit the effectiveness of active learning in practice (Lowell et al., 2019;Karamcheti et al., 2021).Tamkin et al. (2022) recently showed that pretrained models benefit more from AL by preferring ambiguous instances or ones with uncommon feature(s) thereby reducing task under-specification which can lead to higher variance and instability in models under-constrained by their training datasets (D'Amour et al., 2020;Geirhos et al., 2020).Our analyses from §6 show that structurally diverse sampling also attempts to reduce under-specification albeit by directly sampling from a pool model-agnostically.
Evaluation Out-of-distribution (OOD) tests are essectial as testing in-distribution may not penalize models for learning spurious patterns (Linzen, 2020).Our work on sampling diverse test sets is most related to methods for creating splits to test for compositional generalization from existing datasets (Keysers et al., 2020;Shaw et al., 2021;Bogin et al., 2022).However, unlike these, diverse test sets do not attempt to create a systematic gap between the train and test sets, but, as shown in §6.1, cover the space of structures better.

Conclusion
In this work, we studied the benefits of structural diversity for the task of semantic parsing.We proposed a novel model-agnostic algorithm for sampling structurally diverse instances by diversifying over subtrees in program ASTs.Evaluating on multiple datasets with varying complexities and on both IID and compositional splits, we demonstrated that diversity almost always, and often significantly, improves generalization.We further demonstrated that structural diversity also yields more comprehensive test sets than traditional IID test sets.Finally, we showed that these benefits of structural diversity are likely a manifestation of improved coverage of the long-tail of substructures as well as a reduction in spurious correlations between them.We hope that our results demonstrating the importance of structural diversity will encourage future research on better structurally diverse sampling algorithms as well as their use to guide train and test set construction.The algorithms in this work are applicable whenever a large pool of possible output programs is available.One such setting is of prototyping a semantic parser in a zero-data setting where a pool of outputs can be sampled from a grammar and mapped to canonical utterances that can subsequently be paraphrased for linguistic variation either manually or automatically (Wang et al., 2015;Campagna et al., 2020).Another scenario is that of selecting instances for in-context or fewshot learning (Brown et al., 2020).
Perhaps more importantly, our results indicate that the failure of NLP models on compositional generalization may be due in part to the lack of diversity in them, suggesting the need to create more diverse benchmarks.

A Training
Models for COVR, ATIS, Schema2QA and Overnight datasets were trained for different number of epochs depending on train set size as shown in Table 4. Models for Schema2QA were all trained for 240 epochs.Training was run with batch sizes ranging from 8 to 20, depending on the maximum number of example tokens in each dataset and a learning rate of 3e −5 with polynomial decay.Each experiment was run with a Nvidia Titan RTX GPU and took between a few minutes to a couple of hours as we varied the training set size and number of epochs.We used exact match accuracy as our metric and following (Bogin et al., 2022), we do early stopping using the test set.As our goal is to estimate the train set quality and not the model, we argue this is an acceptable choice in our setting.

B Additional Results
The original Bigram and Template diversity algorithms (BIGRAM and TEMPLATE), while good in template splits, are worse than even random subsampling in IID splits.However, replacing their substructure selection scheme with one that prioritizes frequent substructures (BIGRAM [FREQ] and TEMPLATE [FREQ]) improves both of their efficiencies in IID splits with minor degradation in template splits (Figure 6).The only exception is the template diversity in COVR.This is expected given that it was generated from a synchronous grammar with production rules sampled uniformly at random and hence is dominated by instances with shorter templates.Thus, TEMPLATE[FREQ] will only pick these.Additionally, the slightly poorer performance subtree diversity than template diversity on SM-CALFLOW can be attributed to (1) imperfect program to tree conversion and (2) the presence of free-form strings in programs which together lead to a large number of spurious subtrees.

C Coverage
Figure 7 replicates the analysis of § 6.1 using program templates and n-grams in the input utterances as substructures showing that improving coverage over subtrees also leads to improved coverage of program templates as well as n-grams input utterances.
Figure 6: Prioritizing more frequent substructures greatly improves bigram and template diversity in IID splits with relatively minor degradation in template splits.

Figure 1 :
Figure1: An instance from COVR and its abstract syntax tree (AST) and different types of substructures in it, including templates, bigrams, and subtrees.Note that subtrees will also include each node in the AST.
2. RANDNEWT samples an instance with unseen template: w e (e) = 1[T (e) / ∈ T sample ]. 3. FREQNEWT samples instance with the most frequent unsampled template: w e (e) = 1[T (e) / ∈ T sample ]F pool (T (e)) where T (e) is e's template and F pool (t) is the number of instances in the pool with template t.
• Bogin et al. (2022)'s bigram diversity algorithm randomly samples from unsampled bigrams until there is still an unsampled bigram and then any random bigram.This can be formulated as: w c (c) = 1 c / ∈ Csample where Csample = e∈D sample C(e) is the set of all bigrams in the current sample.
Figure 6 compares BIGRAM and TEMPLATE with BIGRAM[FREQ] and TEM-

Figure 2 :
Figure 2: Comparing different subsampling algorithms on IID and Template splits for varying budgets.

Figure 4 :
Figure4: The number of unique subtrees in diverse and random subsamples bucketed by the rank of their frequency in the pool.Titles include the size of the subsamples and the pool, and the legends include the number of unique subtrees in the subsamples v/s the pool.

Figure 5 :
Figure 5: Average Mutual Information (AMI) of random subsamples compared with that of diverse subsamples.

Figure 7 :
Figure 7: Number of unique program templates (top) and input utterance n-grams of size up to three (bottom) in diverse and random subsamples bucketed by pool ranks.Titles include size of the subsample v/s the pool while the legends include the number of unique substructures in each subsample v/s the pool.
Algorithm 1 Structurally Diverse Subsampling Require: Instance pool D pool ; set of all substructures C, and templates T in pool: instance-to-substructure mapping C; template mapping T ; substructure weight function wc; instance weight function we; training budget B D sample , C sample , T sample ← ϕ c∈C wc(c, D pool , D sample , C sample ) e = argmax e∈D pool s.t.c∈C(e)

Table 1 :
Examples of input utterance and target program pairs for the datasets used in this work.

Table 2 :
Number of instances, bigrams, subtrees (size ≤ 4), and templates in the datasets used for structural diversity experiments.

Table 3 :
Spearman correlations of AMI and Accuracy for train sets sampled randomly and using SUB- TREE[RANDEX]and SUBTREE[RANDNEWT] algorithms averaged across 4 split seeds (from § 4).

Table 4 :
Giovanni Campagna, Agata Foryciarz, MehradMoradshahi, and Monica Lam.2020.Zero-shot transfer learning with synthesized data for multi-domain dialogue state tracking.In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 122-132, Online.Association for Computational Linguistics.Henry Conklin, Bailin Wang, Kenny Smith, and Ivan Titov.2021.Meta-learning to compositionally generalize.In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3322-3335, Online.Association for Computational Linguistics.Number of training epochs for COVR, ATIS, Schema2QA and Overnight depending on train set size.