How Do In-Context Examples Affect Compositional Generalization?

Compositional generalization–understanding unseen combinations of seen primitives–is an essential reasoning capability of human intelligence. The AI community mainly studies this capability by fine-tuning neural networks on large numbers of training samples, while it is still unclear whether and how in-context learning–the prevailing few-shot paradigm based on large language models–exhibits compositional generalization. In this paper, we present COFE, a test suite to investigate in-context compositional generalization. We find that compositional generalization performance can be easily affected by the selection of in-context examples, thus raising the research question of what the key factors are that make good in-context examples for compositional generalization. We study three potential factors: similarity, diversity and complexity. Our systematic experiments indicate that in-context examples should be structurally similar to the test case, diverse from each other, and individually simple. Furthermore, two strong limitations are observed: in-context compositional generalization on fictional words is much weaker than that on commonly used ones; it is still critical that the in-context examples cover the required linguistic structures, even though the backbone model has been pre-trained on a large corpus. We hope our analysis will facilitate the understanding and utilization of the in-context learning paradigm.


Introduction
Compositional generalization is an essential capability of human intelligence. It means understanding and producing novel expressions by recombining known components of language (Chomsky, 1957; Montague, 1974; Fodor and Lepore, 2002). Taking the example in Figure 1, after learning the combination "baby in a room", human intelligence can easily generalize to "Jackson in a room". To explore this human-like capability in deep learning models, several benchmarks such as SCAN (Lake and Baroni, 2018), CFQ (Keysers et al., 2019) and COGS (Kim and Linzen, 2020) have been proposed based on semantic parsing tasks. In these benchmarks, the training set covers all the primitives while lacking certain combinations, and the test set focuses on these missing combinations. By fine-tuning generic neural models on these benchmarks, much work has reported that such models exhibit poor compositional generalization (Furrer et al., 2020; Shaw et al., 2021; Bogin et al., 2022).
* Work done during an internship at Microsoft Research.
Recently, in-context learning with large language models has exhibited impressive performance on various tasks (Brown et al., 2020; Rae et al., 2021; Wei et al., 2022). By conditioning on few-shot in-context examples, the pre-trained language model, with an extremely large model size and pre-training corpus, can perform downstream tasks without any update to the pre-trained parameters.
Behind the impressive performance of in-context learning, we are curious whether this prevailing paradigm can take a step towards compositional generalization. To investigate this, we first take an initial exploration: for each test case in COGS, we select in-context examples from its training set and ensure that all primitives in each test case are covered by the equipped in-context examples. Our initial exploration suggests that compositional generalization can be easily affected by in-context examples: with only primitives covered, davinci 175B lags behind fine-tuned GPT2-Large by 24.2% accuracy (similar to the observation in Qiu et al. (2022)); with some local structures also covered (inspired by Bogin et al. (2022)), davinci outperforms fine-tuned GPT2-Large by 3.9% accuracy. Based on these initial observations, we raise and investigate the question: how do in-context examples affect compositional generalization?
We construct the test suite COFE (based on COGS) to facilitate our systematic investigation. Taking the coverage of primitives as a basic principle in COFE, we further define and inject three factors in selecting in-context examples: similarity, diversity, and complexity. Similarity is considered as the matching of hidden structures behind concrete expressions. Diversity reflects whether the context presents repeated patterns or not. Complexity portrays the amount of information contained in each example. By controlling these factors in constructing COFE, we can systematically investigate how in-context examples influence performance on compositional generalization.
Our experiments demonstrate that all three factors matter for in-context compositional generalization. We leverage six large language models in the GPT series: davinci, code-cushman-001, code-cushman-002, text-davinci-002, text-chat-davinci-002, and code-davinci-002. The observations are consistent across models: to better perform compositional generalization, all backbone models prefer in-context examples with higher structural similarity to the test case, higher diversity among different examples, and lower complexity in each individual example. Furthermore, beyond the influence of these factors, in-context compositional generalization still faces two challenges. One is that in-context learning has difficulty recombining fictional words (e.g., random tokens) rather than commonly used ones. The other is that in-context examples are still required to cover the linguistic structures in NL expressions, even though the backbone model has been pre-trained on a large corpus.
Our contributions are three-fold: 1) to answer the research question posed, we investigate three factors in selecting in-context examples and draw consistent conclusions across models; 2) we construct COFE to conduct our systematic investigation, and will release it to facilitate further exploration of in-context compositional generalization; 3) we also point out two remaining challenges that in-context learning still struggles to handle. We hope our analysis will provide insights on how to select proper in-context examples, and shed light on future research into in-context compositional generalization. COFE is publicly available at https://github.com/microsoft/ContextualSP/tree/master/cofe.

In-Context Compositional Generalization
In-context compositional generalization refers to understanding and producing novel combinations by recombining the building blocks presented in in-context examples. We first introduce some basic settings for testing this desired capability, then show our initial observations.

Principles for Measuring In-Context Compositional Generalization
To measure in-context compositional generalization under a test suite, each test case and its equipped in-context examples should satisfy two principles.
• Combination held-out principle: to test generalization on certain combinations, in-context examples should exclude these combinations while test cases contain them.
• Primitive coverage principle: the primitives contained in each test case should be fully covered by in-context examples. Primitives are the minimum indivisible units in expressions. In this work, we mainly consider primitives as lexical items (e.g., the noun "baby" and the verb "observed" in Figure 1).
We say that a model exhibits in-context compositional generalization if it performs well on a test suite that satisfies these two principles.
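These two principles can be checked mechanically. Below is a minimal sketch, assuming each example and the test case are represented simply by sets of their primitives (the function and argument names are illustrative, not from the paper):

```python
def satisfies_principles(test_primitives, aiming_combination, examples):
    """Check the two principles for a candidate set of in-context examples.

    test_primitives: set of primitives in the test case.
    aiming_combination: set of primitives forming the held-out combination.
    examples: list of sets, each holding the primitives of one example.
    """
    # Primitive coverage principle: every test primitive must appear
    # in at least one in-context example.
    covered = set().union(*examples) if examples else set()
    coverage_ok = test_primitives <= covered

    # Combination held-out principle: no single example may contain
    # the full aiming combination.
    held_out_ok = all(not (aiming_combination <= ex) for ex in examples)

    return coverage_ok and held_out_ok
```

Under this representation, a set of examples that jointly covers "baby" and "room" but never combines them in one example satisfies both principles.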

COGS (Under In-Context Learning)
COGS is a compositional generalization benchmark designed for the fine-tuning paradigm: based on a semantic parsing task, the training set of COGS covers all primitives in this task, while several combinations of primitives in the test set are excluded from the training set. We term these excluded combinations as aiming combinations.
We measure in-context compositional generalization based on COGS, by converting it from the original fine-tuning paradigm to the in-context learning paradigm. For each COGS test case, we select in-context examples from the training set B, ensuring that the two principles are satisfied. Note that, for each test case, there are usually different collections of in-context examples satisfying the two principles. Our basic setting is to use a random one among them, and we show that this casual strategy could lead to an underestimation of in-context compositional generalization (Section 2.3).
To facilitate testing on more complex logical forms, we reconstruct some target-side clauses from the chain structure into the nested-function format (illustrated in Figure 2). This reconstruction follows An et al. (2023) and is similar to the conversion from Lambda calculus to FunQL in Geo domain (Zelle and Mooney, 1996;Kate et al., 2005;Zettlemoyer and Collins, 2012). Moreover, to improve human readability, we omitted two types of details: the special marker for definite descriptions and the Skolem constants. These details do not affect the testing of compositional generalization.
Apart from these omitted details, the logical forms in COFE unambiguously represent the main semantics in the domain of COGS, such as semantic roles, modifications, and orders among clauses and modifications. More details about COFE logical forms are contained in Appendix A.
Categories of aiming combinations. The aiming combinations in COGS can be divided into five categories, of which two are low-level combinations (i.e., focusing on specific primitives) and three are high-level combinations (i.e., focusing on high-level structures), illustrated in Figure 2.

In-Context Learning vs Fine-Tuning
Compositional generalization under the fine-tuning paradigm has been widely studied (Furrer et al., 2020;Shaw et al., 2021;Bogin et al., 2022), while there is little observation under in-context learning. To first get a general sense about in-context compositional generalization, we conduct an initial exploration to compare with a fine-tuning baseline.
Models and setups. We test in-context compositional generalization with six large models in the GPT series: davinci, code-cushman-001 (cushman001), code-cushman-002 (cushman002), text-davinci-002 (text002), text-chat-davinci-002 (chat002), and code-davinci-002 (code002). Code-cushman-001 contains 12B parameters and the other models contain 175B parameters. The sampling temperature is 0 (i.e., greedy decoding), and the max decoding length is 500. The reported metric is exact-match accuracy. To set a fine-tuning baseline, we take GPT2-Large with 0.7B parameters. We fine-tune it on the whole training set B and test without in-context examples. Appendix B includes more details.
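The exact-match metric itself is straightforward. A minimal sketch, assuming predictions and references are compared after whitespace normalization (the normalization step is our assumption; the paper does not specify one):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match their references,
    after collapsing runs of whitespace (an illustrative convention)."""
    def normalize(s):
        return " ".join(s.split())
    matches = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return matches / len(references)
```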
Casual selection leads to low performance on in-context compositional generalization. For selecting in-context examples, we first take a casual selection: while satisfying the primitive coverage principle, we randomly select 10 examples without other preference. We conduct this initial exploration on the PrimSubs category. Figure 3 shows that under the casual selection, all six models lag behind the fine-tuned GPT2-Large on PrimSubs. In particular, although the size of davinci is more than 200 times that of GPT2-Large, there is a 24.2% accuracy gap between davinci and the fine-tuned GPT2-Large. These observations are close to Qiu et al. (2022). However, we suppose the potential of in-context learning is still not fully revealed. Specifically, the selection of in-context examples does not yet take full advantage of the available examples in B. In the next try, while still following the primitive coverage principle, we inject some additional preference into the selection of in-context examples.
Preference in selection can bring huge improvements on PrimSubs. Inspired by Bogin et al. (2022), which suggests the influence of unobserved local structures, we prioritize examples that have hidden structures similar to the test case. Figure 3 shows that with this preference in selection, results on PrimSubs change hugely: davinci now outperforms the fine-tuned GPT2-Large; code-davinci-002 even performs near-perfectly. These changes strongly suggest that the selection of in-context examples can significantly affect in-context compositional generalization.
Based on these initial results, to further reveal the potential of in-context learning, we perform in-depth investigations into how the selection of in-context examples affects compositional generalization.

Factors Under In-Context Examples
To facilitate our systematic investigation, we construct COFE (COmpositional generalization with FEw-shot examples), which is derived from COGS. For selecting in-context examples in constructing COFE, we identify, inject, and control three potential factors: similarity, diversity, and complexity.

Conceptual Definitions
We first give conceptual definitions of our considered factors and discuss our intuitions behind them.
Similarity has been widely considered the main factor in selecting in-context examples (Liu et al., 2022; Shin et al., 2021; Rubin et al., 2021; Poesia et al., 2021). The primitive coverage principle can be regarded as a basic lexical similarity on the surface of expressions. Beyond this surface similarity, we consider that the structural similarity hidden behind expressions could be a beneficial factor. From the view of syntactic structure, the recombination of primitives is equivalent to the reconstruction of the parse tree. Similar structures would ease the difficulty of recombination because the model does not need to completely reconstruct the entire structure of in-context examples. Moreover, some work has suggested that the challenge of compositional generalization under fine-tuning lies in unobserved structures (Keysers et al., 2019; Shaw et al., 2021; Bogin et al., 2022).

Diversity reflects whether the context presents repeated patterns or not. We suppose that low diversity, i.e., in-context examples repeating the same patterns, could bias the model towards these patterns and thus block generalization to novel structures.

Complexity reflects the amount of information contained in each individual in-context example. Higher complexity means that the example could provide more information to the model, but this information could be redundant. In addition, the difficulty of directly learning from complex examples has been flagged at the intersection of cognitive science and machine learning (Elman, 1993; Bengio et al., 2009). Such difficulty may be more severe for in-context learning, since the parameters of the model cannot be updated to fit these complex examples. Thus, we suppose that too high a complexity might hinder performance.

Incorporating Three Factors Into the Test Suite
To inject these factors into the selection of in-context examples, we design a matching score based on the parse trees behind concrete expressions. Formally, considering primitive coverage, structural similarity, diversity and complexity, the matching score of two parse trees T and T' is defined as follows:

Match(T, T') = w_p · |P(T) ∩ P(T')| + w_s · |(S(T) − S(C)) ∩ S(T')| − w_c · depth(T')    (1)

in which P(·) contains primitives, S(·) contains partial structures (defined later), C contains the already selected examples, S(T) − S(C) means excluding the parts already covered in S(C) from S(T), and depth(·) reflects the complexity of the tree. The meaning of the three factors in Equation 1 is that: structural similarity means covering S(T), high diversity means avoiding repeatedly covering the same element of S(T), and low complexity means prioritizing low-depth structures.

Figure 4: T_S^1 and T_S^{>1} in the parse tree of the expression "Jackson observed a baby". T_S^1 contains four one-depth sub-structures; we illustrate only one combination in T_S^{>1}, composed from three of them. Leaf nodes T_L are in bold and internal nodes T_N are underlined.
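As a concrete illustration, a matching score of this shape can be sketched as follows. The weight values follow the hyper-parameter settings described later; the `Tree` container and its fields are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Tree:
    prims: set   # P(T): primitives (leaf nodes)
    strus: set   # S(T): partial structures
    depth: int   # depth of the parse tree

def match_score(test, cand, covered, w_p=100, w_s=1, w_c=0.01):
    """Sketch of the matching score between a test tree and a candidate.

    `covered` plays the role of S(C): structures already covered by
    previously selected examples, which are not rewarded again.
    """
    prim = w_p * len(test.prims & cand.prims)               # primitive coverage
    stru = w_s * len((test.strus - covered) & cand.strus)   # similarity + diversity
    comp = -w_c * cand.depth                                # complexity (penalized when w_c > 0)
    return prim + stru + comp
```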
Based on this matching score, the overall ranking score between the test case (X, Y) and a candidate (X_c, Y_c) is calculated as follows, in which both source-side (i.e., NL expressions) and target-side (i.e., logical forms) matching are considered:

Score((X, Y), (X_c, Y_c)) = Match(X, X_c) + Match(Y, Y_c)    (2)

where Match operates on the parse trees of the expressions. Poesia et al. (2021) demonstrated the importance of target-side similarity in semantic parsing and code generation tasks, and this work further investigates the necessity of source-side matching. In the following, we give a more detailed description of the notations in Equation 1.
Detailed description: Figure 4 illustrates the notations. Considering an expression e with parse tree T, T_L represents the leaf nodes (e.g., "Jackson") and T_N contains the internal nodes (e.g., "subject"). T_S^1 contains the one-depth sub-structures in T. Each element of T_S^1 contains one parent node (e.g., "root") and a set of child nodes (e.g., "subject", "verb" and "object"). T_S^{>1} contains deeper sub-structures that are composed from several one-depth sub-structures in T_S^1. In Equation 1, the primitives P(T) = T_L, and the partial structures S(T) = T_S^1 ∪ T_S^{>1}. Note that the aiming combinations are contained in S(T). Appendix E includes more details.
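For illustration, the one-depth sub-structures T_S^1 can be enumerated from a nested parse tree as follows; the `(label, children)` tuple encoding of parse trees, with plain strings as leaves, is our assumption:

```python
def one_depth_substructures(tree):
    """Enumerate one-depth sub-structures: each is a parent label paired
    with the tuple of its children's labels. Trees are nested
    (label, [children]) pairs; leaves are plain strings (primitives)."""
    subs = []

    def walk(node):
        if isinstance(node, str):      # leaf node: no sub-structure of its own
            return
        label, children = node
        child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
        subs.append((label, child_labels))
        for c in children:
            walk(c)

    walk(tree)
    return subs
```

Deeper sub-structures in T_S^{>1} would then be compositions of several of these one-depth pieces.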

Experimental Settings and Hyper-Parameters
We take a greedy-search algorithm to sequentially select 10 examples for each test case. Models and setups follow our initial explorations in Section 2.3. For the investigation of each factor, the hyper-parameters in Equation 1 are set as follows. In all settings, we prioritize the matching of primitives (i.e., the |P(T) ∩ P(T')| term in Equation 1), since the primitive coverage principle should be satisfied first. Concretely, we set w_p = 100 and ensure w_p ≫ w_s and w_c in all settings. For investigating structural similarity, we set w_s = 1 and w_c = 0, and exclude the S(C) term.
For investigating the effect of higher diversity, we add the S(C) term and keep other settings.
For complexity, we set |w_c| · max(depth(T')) < w_s, such that the preference on complexity does not influence the priority of structural similarity. Concretely, as max(depth(T')) = 12 in COFE, we set w_c = 0.01 for the low-complexity experiments and w_c = −0.01 for the high-complexity experiments, and exclude the S(C) term.
Some basic statistics for COFE under the full similarity setting are listed in Table 2, and Appendix C.5 contains statistics under other settings. These statistics show that the primitive coverage principle is well satisfied, since the coverage rates of T_L are almost 100%. Note that the coverage of T_S^1 ∪ T_S^{>1} must be lower than 100%, since the aiming combination must be excluded.

Similarity
Structural similarity brings significant gains. The results in Table 1 indicate that in-context examples with high structural similarity to the test case are essential for compositional generalization.
More precise structural similarity brings larger gains. As mentioned in Section 3.2, structural similarity considers matching S(T), which contains two parts, T_S^1 and T_S^{>1}. Specifically, we regard T_S^1 as describing the rough structure of T, and T_S^{>1} as determining a more precise structure. Based on the results in Table 1, we are curious whether a rough structural similarity is enough. To verify this, we remove T_S^{>1} from S(T), which means that we no longer restrict the selected in-context examples to match the precise structures of test cases. Figure 5 shows that performance on four categories drops significantly with only rough structural similarity, indicating that matching the precise structure of the test case is still required of in-context examples. The only exception is PhraReco. It suggests that similarity is not the only influential factor for in-context compositional generalization. In Section 4.3, we show that low diversity and high complexity potentially cause this exception.
With structural similarity, low-level combinations are almost solved while high-level combinations still leave large room for improvement. Specifically, code-davinci-002, which exhibits the best performance among all backbone models, performs near-perfectly on low-level combinations (i.e., PrimSubs and PrimAlte) while still not achieving >95% accuracy on high-level combinations (i.e., PhraReco, LongChain and DeepNest). Although in-context learning greatly exceeds the fine-tuning baseline on high-level combinations, we suppose there is still potential for improvement. Compared to low-level combinations, handling high-level ones requires more creation than imitation, thus considering only similarity in in-context examples is not enough. In the following, we further investigate these high-level combinations from the view of diversity and complexity.

Diversity and Complexity
High diversity brings considerable gains on PhraReco. Figure 6 shows how diversity among in-context examples affects generalization on high-level combinations. Increasing diversity brings considerable gains on PhraReco, while not affecting the other two categories. For PhraReco, the improvements from higher diversity are in line with our speculation in Section 3.1 that low diversity leads to biased observations, thus blocking high-level structural generalization. For LongChain and DeepNest, beyond biased structures, their difficulty also lies in length generalization, thus merely increasing structural diversity has less effect on them.
Low complexity brings considerable gains on PhraReco. Figure 7 shows how the complexity of each individual example affects generalization on high-level combinations. For PhraReco, there are ∼10% gains in accuracy when the high-complexity setting is changed to the low-complexity setting. We suppose the reason behind this gain is that simple examples reduce the learning difficulty for the model. Moreover, simple examples also contain less redundant information and thus would not confuse the model. For LongChain and DeepNest, there is again little change in performance. Note that the max depth in these two categories is 13, while the max depth in the whole example bank is only 3. Therefore, changing the complexity of in-context examples brings negligible influence for test cases in LongChain and DeepNest.

The main results are robust to prompt order. Table 3 shows that performance only slightly changes under different prompt orders. These results indicate that the main results revealed by COFE are consistent and reliable. It also indicates that in-context learning can be less sensitive to prompt order when the in-context examples are chosen properly.

Discussion: Difficulty in DeepNest
Among all five categories, in-context learning performs worst on DeepNest. Compared to LongChain, which also tests recursive structures, the results on DeepNest lag far behind. There is an interesting observation from the study of error cases (such as Figure 10): in-context learning frequently makes word-level mistakes, while the overall nested structure in the prediction is close to the ground truth. It suggests that the performance bottleneck in DeepNest is to correctly fill in the details of the complex structure, rather than to generate the sketch of the structure. Appendix F.1 provides further analysis.

Remaining Challenges
Our investigation has revealed the huge potential of in-context learning for performing compositional generalization; Appendix G further shows the results of assembling the studied factors. Despite this potential, achieving ideal in-context compositional generalization still faces the following two challenges.
In-context examples are still required to match linguistic structures in NL expressions. Since all backbone models have been pre-trained on large natural language corpora, we expected that these models could already handle the high variety of NL expressions without further hints from in-context examples. Motivated by this, we conduct experiments on another variant of COFE: the source-side term Match(X, X_c) is removed from Equation 2, and the coverage of S(X) is limited (detailed in Appendix C.6). Figure 8 shows that on all five categories, the performance consistently drops if in-context examples do not match the NL-side structure. It suggests that even having been pre-trained on a large corpus, in-context learning still struggles to effectively recognize the semantic equivalence among different linguistic structures behind NL expressions (detailed in Appendix F.3).
In-context learning has difficulty leveraging fictional words. Ideal compositional generalization requires that the recombination of primitives be independent of the surface form of the primitives. In COFE, we set the target-side primitives as the uppercase of the source-side ones (e.g., "cat"→"CAT"). Such case conversion is commonly used in semantic parsing tasks. To test whether in-context learning can use fictional words, we replace each target-side word with random characters (e.g., replacing "CAT" with "MXR", detailed in Appendix C.7). Figure 9 shows huge drops after changing words. Moreover, we investigate structural accuracy by keeping only the structural terminals (e.g., parentheses and commas) in predictions. Figure 9 shows that structural accuracy is also affected by fictional words. It indicates that when performing in-context compositional generalization, the prediction of the structural sketch is not decoupled from word-level patterns.

Conclusion and Future Work
This work investigates how in-context compositional generalization is affected by the selection of in-context examples. The test suite COFE is constructed to study three factors. Experiments show the benefits of structural similarity, higher diversity and lower complexity. Two challenges for in-context compositional generalization are further revealed. To apply the revealed factors outside the COFE test suite, one main challenge for future work is to determine the hidden structures behind expressions without knowing the exact generative grammar. Here, we consider two potential approaches. One is to use a pre-trained parser to generate a parse tree for the input query and then measure tree similarity. The other is to pre-train an embedding model with a structure-aware training objective and then compute embedding similarity.

Limitations
GPU resources. This work utilizes extremely large language models and thus has a high cost in GPU resources. Concretely, experiments are conducted on an 8 × NVIDIA A100 GPU station. The maximum inference time for each version of COFE (containing 4,785 test cases) is ∼8 hours. The maximum estimate of the computing resources consumed in this study is ∼500 × 8 GPU hours.
Synthetic data. As in most previous work on compositional generalization (Lake and Baroni, 2018; Keysers et al., 2019; Kim and Linzen, 2020), the COFE dataset is constructed from synthetic data rather than natural data. The source-side sentences in COFE are from COGS, whose structures account for 70-80% of naturally-occurring English sentences (Kim and Linzen, 2020; Roland et al., 2007). Thus, this synthetic test suite can be close to real-world application scenarios.
Single run. Due to the high cost of computing resources, we do not take multiple runs with different sets of examples, nor do we take multiple samples with temperature > 0. Observations under different prompt orders (Section 4.4) imply that with the desired factors in selecting in-context examples, there could be low variance across runs.

Ethics Statement
Due to the utilization of pre-trained language models, this work could be exposed to some potential risks of ethical issues on general deep learning models (such as social bias and privacy breaches). As explored in this work that the model behavior can be hugely influenced by the provided context, we call for further investigation into how ethical issues can be avoided by controlling the provided context.


A Grammar
Part of the grammar used in constructing COFE is listed in Table 4. Note that the max recursion depth of the R-Production rules is 2 in prompting examples and 12 in test cases. The target-side grammar follows the reconstruction in An et al. (2023). Overall, the original target grammar of COGS is reconstructed to be chain-structured. Concretely, first, the original output tokens in COGS are capitalized; then, the variables (e.g., "x_1") in the original grammar are aligned and replaced with their corresponding terminals; finally, the output clauses are grouped into the function format, in which the function name belongs to "PRED-FUNC" and the arguments are ordered as "AGENT", "THEME", and "RECIPIENT". Moreover, if "PRED-FUNC" lacks one or more arguments, the positions of these arguments are filled with the "NONE" terminal. For the two R-Production rules in Table 4, the first is in chain structure and the second is in nested structure. Moreover, the whole nested "PP-FUNC" is filled into the "PRED-FUNC" as an argument, rather than concatenated to the tail of the "CLAUSE".

B Details of Fine-Tuning
The fine-tuned GPT2-Large contains 762M parameters. For fine-tuning, we take 50,000 training steps with batch size 8 and learning rate 1e-5 (without a warm-up strategy). We set weight decay to 1e-2 and the label smoothing factor to 1e-1. For inference with GPT2-Large, we set the beam size to 5 and the max length to 1,024.

C.1 Algorithm
Algorithm 1 shows the greedy searching algorithm for constructing COFE.
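A minimal Python sketch of such a greedy selection loop, under the assumption that a `score_fn` returns both an example's score and the structures it covers (names are illustrative, not the paper's implementation):

```python
def greedy_select(test_case, bank, score_fn, k=10):
    """Greedily pick k in-context examples from the bank.

    score_fn(test_case, candidate, covered) -> (score, structures), where
    `covered` accumulates the structures of already chosen examples, so
    repeated structures stop being rewarded (encouraging diversity).
    """
    selected, covered = [], set()
    remaining = list(bank)
    for _ in range(min(k, len(remaining))):
        # Pick the highest-scoring remaining candidate given current coverage.
        best = max(remaining, key=lambda c: score_fn(test_case, c, covered)[0])
        selected.append(best)
        covered |= score_fn(test_case, best, covered)[1]
        remaining.remove(best)
    return selected
```

With a scoring function built from Equation 1, this loop sequentially assembles the 10 examples used for each test case.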
• element ∉ S(C): already covered elements are not rewarded again, thus encouraging high diversity.
• depth(T): returns the depth of the tree. Note that depth(X_i) = depth(Y_i) in COFE.

C.3 Prompt Order
We take the structure-closer order, i.e., the examples in C with a higher stru_score are placed closer to the test case. In Section 4.4, we show the robustness to the other two orders: random order, i.e., all selected in-context examples in C are randomly shuffled, and atom-closer order, i.e., the examples in C with a higher prim_score are placed closer to the test case.

C.5 Statistics
Since the max repetition times for LongChain and DeepNest are 2 (as described in Section 2.2), we set the max depth in T_S^{>1} to 2 in S(T). While changing diversity and complexity in the variants of COFE in Section 4.3, primitive coverage and structural similarity are still satisfied. Table 5 shows that on PhraReco, the coverage statistics under the different diversity and complexity settings are kept identical to the full similarity setting in COFE.

C.6 Excluding NL-Side Matching
For excluding source-side matching in Section 5, besides removing the first term in Equation 2, we also limit the matching of X_S^1. Concretely, we require that the sentence rule of the test case not be covered by in-context examples. The sentence rule is an N-Production rule that contains the non-terminal "sentence" as its left-hand side. To achieve this, we filter out test cases that cannot meet this constraint. Finally, 1,037 out of 4,785 test cases are kept in this variant of COFE.

C.7 Fictional Words
For each target-side word containing l characters, we sequentially and randomly sample l characters from the alphabet as a fictional word to replace the original word. In addition, for the experiments on fictional words, we take the atom-closer prompt order, since the model performs better with this order than with the default structure-closer order.
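A minimal sketch of this replacement, assuming the alphabet is uppercase ASCII; the seeding is illustrative:

```python
import random
import string

def fictionalize(word, rng=None):
    """Replace a target-side word with a same-length string of random
    uppercase characters, e.g., "CAT" -> a random 3-letter token."""
    rng = rng or random.Random(0)  # illustrative seed for reproducibility
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(len(word)))
```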

D Excluding Target-Side Matching
In Section 5, we show that performance drops when source-side matching is excluded. Here, we examine the effect of target-side matching. For constructing data, we directly remove the second term in Equation 2. As shown in Table 6, the performances with and without target-side matching are nearly identical. This observation is similar to the comparison between the oracle and non-oracle settings in Qiu et al. (2022), which also utilized the COGS benchmark, but different from Poesia et al. (2021), which suggested the importance of target-side similarity in code generation tasks. We suppose there are mainly two reasons that could cause this difference. On the one hand, different from general code generation tasks, a test suite for compositional generalization requires the exclusion of certain aiming combinations; therefore, the performance bottleneck in compositional generalization benchmarks mainly lies in these excluded aiming combinations. On the other hand, in most compositional generalization benchmarks, source-side matching can largely take over target-side matching, since the terminals and rules in the source grammar of these benchmarks are mapped many-to-one onto the target grammar. Therefore, when seeking source-side matching, target-side matching is also improved.

E Illustration of Defined Notations

Figure 11 illustrates the notations defined in Section 3.2 based on the concrete expression "Jackson in a room observed a baby". Note that for all sub-structures in T_S^1 ∪ T_S^{>1}, we require them to be complete sub-structures.
Definition: Complete sub-structure (CSS). A CSS is a subgraph in a tree T, satisfying that if an internal node in T and one of its child nodes are covered in this CSS, all other child nodes must be also covered in this CSS.
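The CSS condition can be expressed as a small check over a tree given as a parent-to-children mapping. This is an illustrative sketch of the definition, not the paper's implementation:

```python
# Check the CSS property: a node set is a complete sub-structure if,
# whenever it contains an internal node together with at least one of
# that node's children, it contains all of the node's children.

def is_complete_substructure(tree, nodes):
    nodes = set(nodes)
    for parent, children in tree.items():
        if parent in nodes and any(c in nodes for c in children):
            if not all(c in nodes for c in children):
                return False
    return True

tree = {"S": ["NP", "VP"], "NP": ["Det", "N"]}
print(is_complete_substructure(tree, {"S", "NP", "VP"}))  # True
print(is_complete_substructure(tree, {"S", "NP"}))        # False: VP missing
```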

F Case Study
We provide case studies to further understand the compositional generalization performance observed in the main text. For ease of reading, we also include the following contents in the captions of the corresponding figures.

F.1 Two Types of Errors in DeepNest
Figure 12 shows two error cases in DeepNest with the code-davinci-002 model and the full similarity setting. The overall structures of the predictions are close to the ground truth, but the model makes mistakes on some local parts. Concretely, some local semantics are incorrect (in red), and some words are redundant (in gray).
Moreover, besides the instance-level accuracy, we also calculate a word-level error rate on DeepNest. We find that 96.8% of the words in the ground truth are contained in the predictions from code-davinci-002 (while only 48.8% for GPT2-Large). This indicates that the low instance-level accuracy is mainly caused by words placed in wrong positions and by redundant words.
Figure 13 shows the comparison of performance between fictional words (left) and commonly used words (right). For the provided contexts on the left and right, the only difference is that the target-side words on the left are randomly selected characters, while on the right they are the uppercase forms of the source-side words. It shows that by changing only the target-side words, the model not only makes word-level errors (i.e., missing the two words "ES" and "NVCWI" in the prediction), but also generates a wrong parentheses structure (i.e., it generates a 2-depth structure while the ground truth is 3-depth).
Figure 14 shows the comparison of performances between excluding NL-side matching (left) and containing NL-side matching (right). The test input "Matthew shipped the professor a chair ." contains the sentence structure "subject verb object_1 object_2" behind the NL expression. The context on the left does not explicitly contain this sentence structure, but it contains a semantically equivalent one (i.e., "subject verb object_2 to object_1"). However, the model generates the correct prediction on the right but fails on the left. Concretely, according to the wrong prediction on the left, the model perhaps considers the semantics of "subject verb object_1 object_2" to be equivalent to "subject verb object_1 to object_2".
Figure 15 shows the comparison of performances on PhraReco under high diversity (left) and low diversity (right). For the test input "A girl in the house slept", "subject slept" is one element contained in T^{>1}_S. This element is repeatedly covered in the context on the right (low diversity), while it is covered only once on the left (high diversity). However, under high repetitiveness, the model fails on the test case, but succeeds when the repetitiveness is low.
Figure 16 shows the comparison of performance on PhraReco under low complexity (left) and high complexity (right). With low complexity, the test case is covered by simple and short in-context examples, and the model succeeds on the test case. With high complexity, the test case is covered by more complex and longer examples, and the model fails on the test case.
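The word-level coverage discussed above is not specified exactly in the text; one plausible token-level version, counting multiplicity, is the following sketch (the function name is ours):

```python
from collections import Counter

def word_coverage(ground_truth, prediction):
    """Fraction of ground-truth tokens that also occur in the prediction,
    counting multiplicity. Extra (redundant) predicted tokens and wrong
    token positions do not lower this score, which is why coverage can be
    high while instance-level accuracy stays low."""
    gt, pred = Counter(ground_truth.split()), Counter(prediction.split())
    covered = sum(min(count, pred[word]) for word, count in gt.items())
    return covered / max(sum(gt.values()), 1)

# All ground-truth tokens appear, despite a redundant "x" in the prediction.
gt = "room ( x , baby ( y ) )"
pred = "room ( x , x , baby ( y ) )"
print(word_coverage(gt, pred))  # 1.0
```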

G Full Results
Due to the page limitation of the main text, here we list our full results for Section 4. The results in Assembling are the best performance under each category among all combinations of factors.
Figure 12: Two error cases in DeepNest with the code-davinci-002 model and the full similarity setting.
Figure 13: Comparison of performance between fictional words (left) and commonly used words (right).
Figure 14: Comparison of performances between excluding NL-side matching (left) and containing NL-side matching (right).
Figure 15: Comparison of performances on PhraReco under high diversity (left) and low diversity (right).