Meta-Learning to Compositionally Generalize

Natural language is compositional; the meaning of a sentence is a function of the meaning of its parts. This property allows humans to create and interpret novel sentences, generalizing robustly outside their prior experience. Neural networks have been shown to struggle with this kind of generalization, in particular performing poorly on tasks designed to assess compositional generalization (i.e. where training and testing distributions differ in ways that would be trivial for a compositional strategy to resolve). Their poor performance on these tasks may in part be due to the nature of supervised learning which assumes training and testing data to be drawn from the same distribution. We implement a meta-learning augmented version of supervised learning whose objective directly optimizes for out-of-distribution generalization. We construct pairs of tasks for meta-learning by sub-sampling existing training data. Each pair of tasks is constructed to contain relevant examples, as determined by a similarity metric, in an effort to inhibit models from memorizing their input. Experimental results on the COGS and SCAN datasets show that our similarity-driven meta-learning can improve generalization performance.


Introduction
Compositionality is the property of human language that allows for the meaning of a sentence to be constructed from the meaning of its parts and the way in which they are combined (Cann, 1993). By decomposing phrases into known parts we can generalize to novel sentences despite never having encountered them before. In practice this allows us to produce and interpret a functionally limitless number of sentences given finite means (Chomsky, 1965).
Whether or not neural networks can generalize in this way remains unanswered. Prior work asserts that there exist fundamental differences between cognitive and connectionist architectures that make compositional generalization by the latter unlikely (Fodor and Pylyshyn, 1988). However, recent work has shown these models' capacity for learning some syntactic properties. Hupkes et al. (2018) show how some architectures can handle hierarchy in an algebraic context and generalize in a limited way to unseen depths and lengths. Work looking at the latent representations learned by deep machine translation systems shows how these models seem to extract constituency and syntactic class information from data (Blevins et al., 2018; Belinkov et al., 2018). These results, and the more general fact that neural models perform a variety of NLP tasks with high fidelity (e.g. Vaswani et al., 2017; Dong and Lapata, 2016), suggest these models have some sensitivity to syntactic structure and by extension may be able to learn to generalize compositionally.
Recently there have been a number of datasets designed to more formally assess connectionist models' aptitude for compositional generalization (Kim and Linzen, 2020; Lake and Baroni, 2018; Hupkes et al., 2019). These datasets frame the problem of compositional generalization as one of out-of-distribution generalization: the model is trained on one distribution and tested on another which differs in ways that would be trivial for a compositional strategy to resolve. A variety of neural network architectures have shown mixed performance across these tasks, failing to show conclusively that connectionist models are reliably capable of generalizing compositionally (Keysers et al., 2020; Lake and Baroni, 2018).

Natural language requires a mixture of memorization and generalization (Jiang et al., 2020): memorizing exceptions and atomic concepts with which to generalize. Previous work looking at compositional generalization has suggested that models may memorize large spans of sentences multiple words in length (Hupkes et al., 2019; Keysers et al., 2020). This practice may not harm in-domain performance, but if at test time the model encounters a sequence of words it has not encountered before, it will be unable to interpret it, having not learned the atoms (words) that comprise it.

Griffiths (2020) looks at the role of limitations in the development of human cognitive mechanisms. Humans' finite computational ability and limited memory may be central to the emergence of robust generalization strategies like compositionality. A hard upper-bound on the amount we can memorize may be in part what forces us to generalize as we do. Without the same restriction, models may prefer a strategy that memorizes large sections of the input, potentially inhibiting their ability to compositionally generalize.
In a way the difficulty these models have generalizing out of distribution is unsurprising: supervised learning assumes that training and testing data are drawn from the same distribution, and therefore does not necessarily favour strategies that are robust out of distribution. Data necessarily underspecifies the generalizations that produced it. Accordingly, for a given dataset there may be a large number of generalization strategies that are compatible with the data, only some of which will perform well outside of training (D'Amour et al., 2020). It seems connectionist models do not reliably extract the strategies from their training data that generalize well outside of the training distribution. Here we focus on an approach that tries to introduce a bias during training such that the model arrives at a more robust strategy.
To do this we implement a variant of the model-agnostic meta-learning algorithm (MAML, Finn et al., 2017a). The approach used here follows Wang et al. (2020a), which implements an objective function that explicitly optimizes for out-of-distribution generalization in line with Li et al. (2018). Wang et al. (2020a) create pairs of tasks for each batch (which here we call meta-train and meta-test) by sub-sampling the existing training data. Each meta-train, meta-test task pair is designed to simulate the divergence between training and testing: meta-train is designed to resemble the training distribution, and meta-test to resemble the test distribution. The training objective then requires that update steps taken on meta-train are also beneficial for meta-test. This serves as a kind of regularizer, inhibiting the model from taking update steps that only benefit meta-train. By manipulating the composition of meta-test we can control the nature of the regularization applied. Unlike other meta-learning methods, this is not used for few- or zero-shot performance. Instead it acts as a kind of meta-augmented supervised learning that helps the model to generalize robustly outside of its training distribution.
The approach taken by Wang et al. (2020a) relies on knowledge of the test setting. While it does not assume access to the test distribution itself, it assumes access to the family of test distributions from which the actual test distribution will be drawn. While substantially less restrictive than the standard i.i.d. setting, this still poses a problem if we do not know the test distribution, or if the model is evaluated in a way that does not lend itself to being represented by discrete pairs of tasks (i.e. if test and train differ in a variety of distinct ways). Here we propose a more general approach that aims to generate meta-train, meta-test pairs which are populated with similar (rather than divergent) examples, in an effort to inhibit the model from memorizing its input. Similarity is determined by a string or tree kernel, so that for each meta-train task a corresponding meta-test task is created from examples deemed similar.
By selecting for similar examples we design the meta-test task to include examples with many of the same words as meta-train, but in novel combinations. As our training objective encourages gradient steps that are beneficial for both tasks, we expect the model to be less likely to memorize large chunks, which are unlikely to occur in both tasks, and therefore to generalize more compositionally. This generalizes the approach of Wang et al. (2020a) by using the meta-test task to apply a bias not strictly related to the test distribution: the design of the meta-test task allows us to design the bias which it applies. It is worth noting that other recent approaches to this problem have leveraged data augmentation to make the training distribution more representative of the test distribution (Andreas, 2020). We believe this line of work is orthogonal to ours, as it does not focus on getting a model to generalize compositionally, but rather on making the task simple enough that compositional generalization is not needed. Our method is model agnostic, and does not require prior knowledge of the target distribution.
We summarise our contributions as follows:

• We approach the problem of compositional generalization with a meta-learning objective that tries to explicitly reduce input memorization using similarity-driven virtual tasks.

• We perform experiments on two text-to-semantic compositional datasets: COGS and SCAN. Our new training objectives lead to significant improvements in accuracy over a baseline parser trained with conventional supervised learning.

Methods
We introduce the meta-learning augmented approach to supervised learning from Li et al. (2018) and Wang et al. (2020a) that explicitly optimizes for out-of-distribution generalization. Central to this approach is the generation of tasks for meta-learning by sub-sampling training data. We introduce three kinds of similarity metrics used to guide the construction of these tasks.

Problem Definition
Compositional Generalization Lake and Baroni (2018) and Kim and Linzen (2020) introduce datasets designed to assess compositional generalization. These datasets are created by generating synthetic data with different distributions for testing and training. The differences between the distributions are trivially resolved by a compositional strategy. At their core these tasks tend to assess three key components of compositional ability: systematicity, productivity, and primitive application. Systematicity allows for the use of known parts in novel combinations as in (a). Productivity enables generalization to longer sequences than those seen in training as in (b). Primitive application allows for a word only seen in isolation during training to be applied compositionally at test time as in (c).

[Algorithm 1: MAML training with similarity-driven virtual tasks. Final update: θ ← Update(θ, ∇_θ L_τ(θ)).]

A compositional grammar like the one that generated the data would be able to resolve these three kinds of generalization easily, and therefore performance on these tasks is taken as an indication of a model's compositional ability.
Conventional Supervised Learning The compositional generalization datasets we look at are semantic parsing tasks, mapping between natural language and a formal representation. A usual supervised learning objective for semantic parsing is to minimize the negative log-likelihood of the correct formal representation given a natural language input sentence, i.e. minimising

L_B(θ) = −(1/N) Σ_{i=1}^{N} log p_θ(y_i | x_i)     (1)

where N is the size of batch B, y is a formal representation and x is a natural language sentence. This approach assumes that the training and testing data are independent and identically distributed.
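As a concrete illustration, the batch negative log-likelihood above can be sketched as follows. This is a minimal sketch with names of our own choosing; in practice the token log-probabilities would come from a neural parser rather than be passed in directly.

```python
import math

def batch_nll(log_probs_per_example):
    """Mean negative log-likelihood over a batch.

    log_probs_per_example: list where each entry holds the token-level
    values log p(y_t | y_<t, x) for one example's target sequence.
    """
    total = 0.0
    for token_log_probs in log_probs_per_example:
        # -log p(y | x) for one example is the negated sum over tokens.
        total += -sum(token_log_probs)
    return total / len(log_probs_per_example)
```

Minimizing this quantity over batches drawn i.i.d. from the training set is the conventional objective that the meta-learning variant below augments.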
Task Distributions Following from Wang et al. (2020a), we utilize a learning algorithm that can enable a parser to benefit from a distribution of virtual tasks, denoted by p(τ), where τ refers to an instance of a virtual compositional generalization task that has its own training and test examples.

MAML Training
Once we have constructed our pairs of virtual tasks, we need a training algorithm that encourages compositional generalization in each. Like Wang et al. (2020a), we turn to optimization-based meta-learning algorithms (Finn et al., 2017b; Li et al., 2018) and apply DG-MAML (Domain Generalization with Model-Agnostic Meta-Learning), a variant of MAML (Finn et al., 2017b). Intuitively, DG-MAML encourages optimization on meta-training examples to have a positive effect on the meta-test examples as well.
During each learning episode of MAML training we randomly sample a task τ which consists of a training batch B_t and a generalization batch B_g, and conduct optimization in two steps, namely meta-train and meta-test.
Meta-Train The meta-train task is sampled at random from the training data. The model performs one stochastic gradient descent step on this batch:

θ′ = θ − α ∇_θ L_Bt(θ)     (2)

where α is the meta-train learning rate.

Meta-Test
The fine-tuned parameters θ′ are evaluated on the accompanying generalization task, meta-test, by computing their loss on it, denoted L_Bg(θ′). The final objective for a task τ is then to jointly optimize the following:

L_τ(θ) = L_Bt(θ) + L_Bg(θ′) = L_Bt(θ) + L_Bg(θ − α ∇_θ L_Bt(θ))     (3)

The objective now becomes to reduce the joint loss of both the meta-train and meta-test tasks. Optimizing in this way ensures that updates on meta-train are also beneficial to meta-test. The loss on meta-test acts as a constraint on the loss from meta-train. This is unlike traditional supervised learning (L_τ(θ) = L_Bt(θ) + L_Bg(θ)), where the loss on one batch does not constrain the loss on another. With a random B_t and B_g, the joint loss function can be seen as a kind of generic regularizer, ensuring that update steps are not overly beneficial to meta-train alone. By constructing B_t and B_g in ways which we expect to be relevant to compositionality, we aim to allow the MAML algorithm to apply specialized regularization during training. Here we design meta-test to be similar to the meta-train task because we believe this highlights the systematicity generalization that is key to compositional ability: selecting for examples comprised of the same atoms but in different arrangements. By constraining each update step on meta-train by performance on similar examples in meta-test, we expect the model to dis-prefer a strategy that does not also work for meta-test, like memorization of whole phrases or large sections of the input.

Neighbours of "The girl rolled a drink beside the table ." using LevDistance:

The girl liked a dealer beside the table .      -2.00
The girl cleaned a teacher beside the table .   -2.00
The girl froze a bear beside the table .        -2.00
The girl grew a pencil beside the table .       -2.00
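The joint meta-train/meta-test update described above can be sketched as follows. This is a simplified, first-order illustration with hypothetical names, using a flat list of floats in place of real network parameters; DG-MAML proper differentiates L_Bg(θ′) through the inner update (a second-order term that this sketch drops for clarity).

```python
def dg_maml_step(theta, grad_loss_train, grad_loss_test, alpha, beta):
    """One first-order DG-MAML update on a single virtual task.

    theta            -- current parameters (list of floats)
    grad_loss_train  -- maps parameters to the gradient of the
                        meta-train batch loss L_Bt
    grad_loss_test   -- maps parameters to the gradient of the
                        meta-test batch loss L_Bg
    alpha, beta      -- inner (meta-train) and outer learning rates
    """
    g_train = grad_loss_train(theta)
    # Inner step: simulated fine-tuning on the meta-train batch.
    theta_prime = [t - alpha * g for t, g in zip(theta, g_train)]
    # Joint objective L_tau(theta) = L_Bt(theta) + L_Bg(theta');
    # in the first-order approximation its gradient is just the sum
    # of the meta-train gradient and the meta-test gradient at theta'.
    g_test = grad_loss_test(theta_prime)
    return [t - beta * (gt + gg) for t, gt, gg in zip(theta, g_train, g_test)]
```

Because the meta-test gradient is evaluated at the fine-tuned parameters θ′, update directions that help meta-train while hurting meta-test are penalized, which is the regularization effect the text describes.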

Similarity Metrics
Ideally, the design of virtual tasks should reflect specific generalization cases for each dataset. However, in practice this requires some prior knowledge of the distribution to which the model will be expected to generalize, which is not always available. Instead we aim to naively structure the virtual tasks to resemble each other. To do this we use a number of similarity measures intended to help select examples which highlight the systematicity of natural language. Inspired by kernel density estimation (Parzen, 1962), we define a relevance distribution for each example:

p̃(x′, y′ | x, y) ∝ exp( k([x, y], [x′, y′]) / η )     (4)

where k is the similarity function, [x, y] is a training example, and η is a temperature that controls the sharpness of the distribution. Based on our extended interpretation of relevance, a high p̃ implies that [x, y] is systematically relevant to [x′, y′]: containing many of the same atoms but in a novel combination. We look at three similarity metrics to guide sub-sampling existing training data into meta-test tasks, sampling proportional to each example's p̃.

Levenshtein Distance First, we consider Levenshtein distance, a kind of edit distance widely used to measure the dissimilarity between strings. We compute the negative Levenshtein distance at the word level between the natural language sentences of two examples:

k_lev([x, y], [x′, y′]) = −LevDistance(x, x′)

where LevDistance returns the number of edit operations required to transform x into x′. See Table 1 for examples. Another family of similarity metrics for discrete structures are convolution kernels (Haussler, 1999).
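As an illustration, word-level Levenshtein distance and the relevance distribution it induces might be computed as follows (a minimal sketch; the function names are ours, and a real implementation would operate over the whole training set rather than a candidate list):

```python
import math

def lev_distance(xs, ys):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(ys) + 1))
    for i, x in enumerate(xs, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(ys, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (x != y))  # substitution
    return d[-1]

def relevance_distribution(anchor, candidates, eta=1.0):
    """p~(x' | x) proportional to exp(k(x, x') / eta), k = -LevDistance."""
    scores = [math.exp(-lev_distance(anchor, c) / eta) for c in candidates]
    z = sum(scores)
    return [s / z for s in scores]
```

Lower edit distance yields a higher sampling probability, so meta-test batches drawn from this distribution share most of their words with meta-train while differing in a few positions.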

String-Kernel Similarity
We use the string subsequence kernel (Lodhi et al., 2002):

k_str([x, y], [x′, y′]) = SSK(x, x′)

where SSK computes the number of common subsequences between natural language sentences at the word level. See Table 1 for examples.

Tree-Kernel Similarity In semantic parsing, the formal representation y usually has a known grammar which can be used to represent it as a tree structure. In light of this we use tree convolution kernels to compute the similarity between examples:

k_tree([x, y], [x′, y′]) = TreeKernel(y, y′)

where the TreeKernel function is a convolution kernel (Collins and Duffy, 2001) applied to trees. Here we consider a particular case where y is represented as a dependency structure, as shown in Figure 1. We use the partial tree kernel (Moschitti, 2006), which is designed for application to dependency trees. For a given dependency tree, partial tree kernels generate the series of all possible partial trees: any set of one or more connected nodes. Given two trees, the kernel returns the number of partial trees they have in common, interpreted as a similarity score. Compared with string-based similarity, this kernel prefers sentences that share common syntactic sub-structures, some of which are not assigned high scores by string-based similarity metrics, as shown in Table 1.
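For intuition, a simplified, unweighted variant of the string subsequence kernel (counting the distinct common word subsequences of two sentences) can be computed with a standard dynamic program. Note this is our own illustrative simplification: the full SSK of Lodhi et al. (2002) additionally weights each subsequence by a decay factor over the span it occupies.

```python
def common_subsequences(xs, ys):
    """Number of distinct common (possibly non-contiguous) word
    subsequences of xs and ys, including the empty subsequence."""
    n, m = len(xs), len(ys)
    # dp[i][j] counts distinct common subsequences of xs[:i], ys[:j];
    # an empty prefix shares only the empty subsequence.
    dp = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Inclusion-exclusion over dropping the last token of each.
            dp[i][j] = dp[i - 1][j] + dp[i][j - 1] - dp[i - 1][j - 1]
            if xs[i - 1] == ys[j - 1]:
                # A match extends every common subsequence of the prefixes.
                dp[i][j] += dp[i - 1][j - 1]
    return dp[n][m]
```

Sentences that share many words in any order score highly here, whereas Levenshtein distance rewards shared words only when they also occupy similar positions.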
Though tree-structured formal representations are more informative in obtaining relevance, not all logical forms can be represented as tree structures. In SCAN (Lake and Baroni, 2018) y are action sequences without given grammars. As we will show in the experiments, string-based similarity metrics have a broader scope of applications but are less effective than tree kernels in cases where y can be tree-structured.
Sampling for Meta-Test Using our kernels we compute the relevance distribution in Eq 4 to construct virtual tasks for MAML training. We show the resulting procedure in Algorithm 1. In order to construct a virtual task τ , a meta-train batch is first sampled at random from the training data (line 2), then the accompanying meta-test batch is created by sampling examples similar to those in meta-train (line 5).
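The task-construction step can be sketched as follows (hypothetical names; `relevance` stands in for the cached p̃ distribution of Eq 4, and the model-update steps of the full algorithm are omitted):

```python
import random

def make_virtual_task(train_data, relevance, batch_size, rng=random):
    """Construct one virtual task: a (meta-train, meta-test) batch pair.

    train_data -- list of (x, y) examples
    relevance  -- maps an example index to sampling weights over the
                  whole training set (the similarity-driven p~)
    """
    # Sample the meta-train batch uniformly at random.
    meta_train_idx = rng.sample(range(len(train_data)), batch_size)
    # For each meta-train example, draw a similar example for meta-test.
    meta_test_idx = [
        rng.choices(range(len(train_data)), weights=relevance(i), k=1)[0]
        for i in meta_train_idx
    ]
    meta_train = [train_data[i] for i in meta_train_idx]
    meta_test = [train_data[i] for i in meta_test_idx]
    return meta_train, meta_test
```

Each MAML episode would then take one inner step on `meta_train` and evaluate the joint objective on `meta_test`, as described in Section "MAML Training".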
We use Lev-MAML, Str-MAML and Tree-MAML to denote the meta-training using Levenshtein distance, string-kernel and tree-kernel similarity, respectively.

Datasets and Splits
We evaluate our methods on the following semantic parsing benchmarks that target compositional generalization.
SCAN contains a set of natural language commands and their corresponding action sequences (Lake and Baroni, 2018). We use the Maximum Compound Divergence (MCD) splits (Keysers et al., 2020), which are created based on the principle of maximizing the divergence between the compound (e.g., patterns of 2 or more action sequences) distributions of the training and test sets. We apply Lev-MAML and Str-MAML to SCAN, where the similarity measures are applied to the natural language commands. Tree-MAML (which uses a tree kernel) is not applied, as the action sequences do not have an underlying dependency tree structure.
COGS contains a diverse set of natural language sentences paired with logical forms based on lambda calculus (Kim and Linzen, 2020). Compared with SCAN, it covers various systematic linguistic abstractions (e.g., passive to active) including examples of lexical and structural generalization, and thus better reflects the compositionality of natural language. In addition to the standard splits of Train/Dev/Test, COGS provides a generalization (Gen) set drawn from a different distribution that specifically assesses compositional generalization. We apply Lev-MAML, Str-MAML and Tree-MAML to COGS; Lev-MAML and Str-MAML make use of the natural language sentences while Tree-MAML uses the dependency structures reconstructed from the logical forms.

Baselines
In general, our method is model-agnostic and can be coupled with any semantic parser to improve its compositional generalization. Additionally, Lev-MAML and Str-MAML are dataset-agnostic provided the dataset has a natural language input. In this work, we apply our methods to two widely used sequence-to-sequence models.

LSTM-based Seq2Seq has been the backbone of many neural semantic parsers (Dong and Lapata, 2016; Jia and Liang, 2016). It utilizes LSTMs (Hochreiter and Schmidhuber, 1997) and attention (Bahdanau et al., 2014) under an encoder-decoder (Sutskever et al., 2014) framework.
Transformer-based Seq2Seq also follows the encoder-decoder framework, but it uses Transformers (Vaswani et al., 2017) to replace the LSTM for encoding and decoding. It has proved successful in many NLP tasks e.g., machine translation. Recently, it has been adapted for semantic parsing (Wang et al., 2020b) with superior performance.
We try to see whether our MAML training can improve the compositional generalization of contemporary semantic parsers, compared with standard supervised learning. Moreover, we include a meta-baseline, referred to as Uni-MAML, that constructs meta-train and meta-test splits by uniformly sampling training examples. By comparing with this meta-baseline, we show the effect of similarity-driven construction of meta-learning splits. Note that we do not focus on making comparisons with other methods that feature specialized architectures for SCAN datasets (see Section 5), as these methods do not generalize well to more complex datasets.
GECA We additionally apply the good-enough compositional augmentation (GECA) method laid out in Andreas (2020) to the SCAN MCD splits. Data augmentation of this kind tries to make the training distribution more representative of the test distribution. This approach is distinct from ours, which focuses on the training objective, but the two can be combined with better overall performance, as we will show. Specifically, we show the results of GECA applied to the MCD splits, as well as GECA combined with our Lev-MAML variant. Note that we elect not to apply GECA to COGS, as the time and space complexity of GECA proves very costly for COGS in our preliminary experiments.

Construction of Virtual Tasks
The similarity-driven sampling distribution p̃ in Eq 4 requires computing the similarity between every pair of training examples, which can be very expensive depending on the size of the dataset. As the sampling distributions are fixed during training, we compute and cache them beforehand. However, they take an excess of disk space to store, as essentially we need to store an N × N matrix where N is the number of training examples. To allow efficient storage and sampling, we use the following approximation. First, we found that usually each example only has a small set of neighbours that are relevant to it. Motivated by this observation, we only store the top 1000 relevant neighbours for each example, sorted by similarity, and use them to construct the sampling distribution, denoted p̃_top1000.
To allow examples outside the top 1000 to be sampled, we use a linear interpolation between p̃_top1000 and a uniform distribution. Specifically, we end up using the following sampling distribution:

p̃(x′, y′ | x, y) = λ p̃_top1000(x′, y′ | x, y) + (1 − λ) · 1/N

where p̃_top1000 assigns 0 probability to examples outside the top 1000, N is the number of training examples, and λ is a hyperparameter for interpolation. In practice, we set λ to 0.5 in all experiments. To sample from this distribution, we first decide whether the sample is in the top 1000 by sampling from a Bernoulli distribution parameterized by λ. If it is, we use p̃_top1000 to do the sampling; otherwise, we uniformly sample an example from the training set.
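This interpolated sampling scheme can be sketched as follows (names are ours; `top_k` stands for one anchor example's cached top-1000 neighbour distribution):

```python
import random

def sample_neighbour(top_k, n_train, lam=0.5, rng=random):
    """Sample a meta-test example index for one anchor example.

    top_k   -- list of (index, probability) pairs for the anchor's cached
               most-similar neighbours (probabilities sum to 1)
    n_train -- total number of training examples
    lam     -- interpolation weight between cached and uniform sampling
    """
    # Bernoulli(lam) decides which mixture component to draw from.
    if rng.random() < lam:
        # Sample from the cached top-k relevance distribution.
        indices, probs = zip(*top_k)
        return rng.choices(indices, weights=probs, k=1)[0]
    # Otherwise fall back to a uniform draw over all training examples.
    return rng.randrange(n_train)
```

Storing only the top neighbours per example keeps the cache linear in N while the uniform component preserves support over the full training set.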

Development Set
Table 3: Main results on the COGS dataset. We show the mean and variance (standard deviation) of 10 runs. Cells with a grey background are results obtained in this paper, whereas cells with a white background are from Kim and Linzen (2020).

Many tasks that assess out-of-distribution generalization do not provide a Dev set that is representative of the generalization distribution. This is desirable, as a parser in principle should never have knowledge of the Gen set during training. In practice, though, the lack of an O.O.D. Dev set makes model selection extremely difficult and not reproducible. In this work, we propose the following strategy to alleviate this issue: 1) we sample a small subset from the Gen set, denoted 'Gen Dev', for tuning meta-learning hyperparameters; 2) we use two disjoint sets of random seeds for development and testing respectively, i.e., retraining the selected models from scratch before applying them to the final test set. In this way, we make sure that our tuning is not exploiting the models resulting from specific random seeds: we do not perform random seed tuning. At no point are any of our models trained on the Gen Dev set.

Main Results
On SCAN, as shown in Table 2, Lev-MAML substantially helps both base parsers achieve better performance across three different splits constructed according to the MCD principle. Though our models do not utilize pre-training such as T5 (Raffel et al., 2019), our best model (Lev-MAML + LSTM) still outperforms T5-based models significantly in MCD1 and MCD2. We show that GECA is also effective for the MCD splits (especially MCD1). More importantly, augmenting GECA with Lev-MAML further boosts performance substantially in MCD1 and MCD2, signifying that our MAML training is complementary to GECA to some degree.

Table 3 shows our results on COGS. Tree-MAML boosts the performance of both the LSTM and Transformer base parsers by a large margin: 6.5% and 8.1% respectively in average accuracy. Moreover, Tree-MAML is consistently better than the other MAML variants, showing the effectiveness of exploiting the tree structure of formal representations to construct virtual tasks.

SCAN Discussion
The application of our string-similarity-driven meta-learning approaches to the SCAN dataset improved the performance of the LSTM baseline parser. Our results are reported on three splits of the dataset generated according to the maximum compound divergence (MCD) principle. We report results only on the MCD tasks for SCAN, as these tasks explicitly focus on the systematicity of language. As such they assess a model's ability to extract sufficiently atomic concepts from its input, such that it can still recognize those concepts in a new context (i.e. as part of a different compound). To succeed here a model must learn atoms from the training data and apply them compositionally at test time. The improvement in performance our approach achieves on this task suggests that it does disincentivise the model from memorizing large sections, or entire compounds, from its input.
GECA applied to the SCAN MCD splits does improve the performance of the baseline, however not to the same extent as when applied to other SCAN tasks in Andreas (2020). GECA's improvement is comparable to our meta-learning method's, despite the fact that our method does not leverage any data augmentation. This means that our method achieves high performance by generalizing robustly outside of its training distribution, rather than by making its training data more representative of the test distribution. The application of our Lev-MAML approach to GECA-augmented data results in further improvements in performance, suggesting that these approaches aid the model in distinct yet complementary ways. (On COGS, the improvements of all of our MAML variants applied to the Transformer are significant (p < 0.03) compared to the baseline; of our methods applied to LSTMs, Tree-MAML is significant (p < 0.01) compared to the baseline.)

COGS Discussion
All variants of our meta-learning approach improved both the LSTM and Transformer baseline parsers' performance on the COGS dataset. The Tree-MAML method outperforms the Lev-MAML, Str-MAML, and Uni-MAML versions. The only difference between these methods is the similarity metric used, and so differences in performance must be driven by what each metric selects for. For further analysis of the metrics refer to the appendix.
The strong performance of the Uni-MAML variant highlights the usefulness of our approach generally in improving models' generalization performance. Even without a specially designed metatest task this approach substantially improves on the baseline Transformer model. We see this as evidence that this kind of meta-augmented supervised learning acts as a robust regularizer particularly for tasks requiring out of distribution generalization.
Although the Uni-MAML, Lev-MAML, and Str-MAML versions perform similarly overall on the COGS dataset, they may select for different generalization strategies. The COGS generalization set is comprised of 21 sub-tasks which can be used to better understand the ways in which a model is generalizing (refer to Table 4 for examples of sub-task performance). Despite having very similar overall performance, Uni-MAML and Str-MAML perform distinctly on individual COGS tasks, with their performance appearing to diverge on a number of them. This would suggest that the design of the meta-test task may have a substantive impact on the kind of generalization strategy that emerges in the model. For further analysis of COGS sub-task performance see the appendix.
Our approaches' strong results on both of these datasets suggest that they aid compositional generalization generally. However, it is worth noting that both datasets shown here are synthetic, and although COGS endeavours to be similar to natural data, the application of our methods outside of synthetic datasets is important future work.

Related Work
Compositional Generalization A large body of work on compositional generalization provides models with a strong compositional bias, such as specialized neural architectures (Russin et al., 2019; Gordon et al., 2019), or grammar-based models that accommodate alignments between natural language utterances and programs (Shaw et al., 2020; Herzig and Berant, 2020). Another line of work utilizes data augmentation via fixed rules (Andreas, 2020) or a learned network (Akyürek et al., 2020) in an effort to transform the out-of-distribution compositional generalization task into an in-distribution one. Our work follows an orthogonal direction, injecting compositional bias using a specialized training algorithm. A related area of research looks at the emergence of compositional languages, often showing that languages which seem to lack natural-language-like compositional structure may still be able to generalize to novel concepts (Kottur et al., 2017; Chaabouni et al., 2020). This may help to explain the ways in which models can generalize robustly on in-distribution data unseen during training while still struggling on tasks specifically targeting compositionality.

[Table 4 (excerpt). Case: Primitive noun → Subject (common noun); training: "shark"; generalization: "A shark examined the child." Training: "Paula"; generalization: "The child helped Paula."]
Meta-Learning for NLP Meta-learning methods (Vinyals et al., 2016; Ravi and Larochelle, 2016; Finn et al., 2017b) that are widely used for few-shot learning have been adapted for NLP applications like machine translation (Gu et al., 2018) and relation classification (Obamuyide and Vlachos, 2019). In this work, we extend the conventional MAML (Finn et al., 2017b) algorithm, which was initially proposed for few-shot learning, as a tool to inject inductive bias, inspired by Li et al. (2018) and Wang et al. (2020a). For compositional generalization, Lake (2019) proposes a meta-learning procedure to train a memory-augmented neural model. However, its meta-learning algorithm is specialized for the SCAN dataset (Lake and Baroni, 2018) and not suitable for more realistic datasets.

Conclusion
Our work highlights the importance of training objectives that select for robust generalization strategies. The meta-learning augmented approach to supervised learning used here allows for the specification of different constraints on learning through the design of the meta-tasks. Our similarity-driven task design improved on baseline performance on two different compositional generalization datasets, by inhibiting the model's ability to memorize large sections of its input. Importantly, though, the overall approach used here is model agnostic, with portions of it (Str-MAML, Lev-MAML, and Uni-MAML) proving dataset agnostic as well, requiring only that the input be a natural language sentence. Our methods are simple to implement compared with other approaches to improving compositional generalization, and we look forward to their use in combination with other techniques to further improve models' compositional ability.

Of the ten random seeds used for Lev-MAML + Transformer on COGS, the best performing seed obtains 73% whereas the lowest performing seed obtains 54%. Thus, it is important to compare different models using the same set of random seeds, and not to tune the random seeds in any model. To alleviate these two concerns, we choose the protocol that is mentioned in the main paper. This protocol helps to make the results reported in our paper reproducible.

A.3 Details of Training and Evaluation
Following Kim and Linzen (2020), we train all models from scratch using randomly initialized embeddings. For SCAN, models are trained for 1,000 steps with a batch size of 128. We choose model checkpoints based on their performance on the Dev set. For COGS, models are trained for 6,000 steps with a batch size of 128. We choose the meta-train learning rate α in Equation 2 and the temperature η in Equation 4 based on performance on the Gen Dev set. Finally we use the chosen α and η to train models with new random seeds, and only the last checkpoints (at step 6,000) are used for evaluation on the Test and Gen sets.

A.4 Other Splits of SCAN
The SCAN dataset contains many splits, such as Add-Jump, Around Right, and the Length split, each assessing a particular case of compositional generalization. We think that the MCD splits are more representative of compositional generalization due to the nature of the principle of maximum compound divergence. Moreover, they are more challenging than the other splits (except the Length split). That GECA, which obtains 82% accuracy on the Add-Jump and Around Right splits, obtains < 52% accuracy on the MCD splits in our experiments confirms that the MCD splits are more challenging.

A.5 Kernel Analysis
The primary difference between the tree-kernel and string-kernel methods is in the diversity of the examples they select for the meta-test task. The tree kernel selects a broader range of lengths, often including atomic examples, a single word in length, matching a word in the original example from meta-train (see Table 5). By design the partial tree kernel will always assign a non-zero value to an example that is an atom contained in the original sentence. We believe the diversity of the sentences selected by the tree kernel accounts for the superior performance of Tree-MAML compared with the other MAML conditions. The selection of a variety of lengths for meta-test constrains model updates on the meta-train task such that they must also accommodate the diverse and often atomic examples selected for meta-test. This constraint would seem to better inhibit memorizing large spans of the input unlikely to be present in meta-test.

Neighbours of "The crocodile liked a girl ." using LevDistance:

The boy hoped that a girl juggled .   -3.00
The cat hoped that a girl sketched .  -3.00
The cat hoped that a girl smiled .    -3.00
Emma liked that a girl saw .          -4.00

A.6 Meta-Test Examples
In Table 6, we show the top scoring examples retrieved by the similarity metrics for two sentences. We found that in some cases (e.g., the right part of Table 6), the tree kernel can retrieve examples that diverge in length but are still semantically relevant. In contrast, string-based similarity metrics, especially LevDistance, tend to choose examples with similar lengths.

A.7 COGS Subtask Analysis
We notice distinct performance for different conditions on the different subtasks from the COGS dataset. In Figure 2 we show the performance of the Uni-MAML and Str-MAML conditions compared with the mean of those conditions. Where the bars are equal to zero the models' performance on that task is roughly equal.