Structural generalization is hard for sequence-to-sequence models

Sequence-to-sequence (seq2seq) models have been successful across many NLP tasks, including ones that require predicting linguistic structure. However, recent work on compositional generalization has shown that seq2seq models achieve very low accuracy in generalizing to linguistic structures that were not seen in training. We present new evidence that this is a general limitation of seq2seq models that is present not just in semantic parsing, but also in syntactic parsing and in text-to-text tasks, and that this limitation can often be overcome by neurosymbolic models that have linguistic knowledge built in. We further report on experiments that give initial answers about the reasons for these limitations.


Introduction
Humans are able to understand and produce linguistic structures they have never observed before (Chomsky, 1957; Fodor and Pylyshyn, 1988; Fodor and Lepore, 2002). From limited, finite observations, they generalize at an early age to an infinite variety of novel structures using recursion. They can also assign meaning to these structures, using the Principle of Compositionality. This ability to generalize to unseen structures is important for NLP systems in low-resource settings, such as under-resourced languages or projects with a limited annotation budget, where a user can easily use structures that had no annotations in training.
Over the past few years, large pretrained sequence-to-sequence (seq2seq) models, such as BART (Lewis et al., 2020) and T5 (Raffel et al., 2020), have brought tremendous progress to many NLP tasks. This includes linguistically complex tasks such as broad-coverage semantic parsing, where e.g. a lightly modified BART set a new state of the art on AMR parsing (Bevilacqua et al., 2021). However, there have been some concerns that seq2seq models may have difficulties with compositional generalization, a class of tasks in semantic parsing where the training data is structurally impoverished in comparison to the test data (Lake and Baroni, 2018; Keysers et al., 2020). We focus on the COGS dataset of Kim and Linzen (2020) because some of its generalization types specifically target structural generalization, i.e. the ability to generalize to unseen structures.
In this paper, we make two contributions. First, we offer evidence that structural generalization is systematically hard for seq2seq models. On the semantic parsing task of COGS, seq2seq models don't fail on compositional generalization as a whole, but specifically on the three COGS generalization types that require generalizing to unseen linguistic structures, achieving accuracies below 10%. This is true both for BART and T5 and for seq2seq models that were specifically developed for COGS. What's more, BART and T5 fail similarly on syntax and even POS tagging variants of COGS (introduced in this paper), indicating that they do not only struggle with compositional generalization in semantics, but with structural generalization more generally. Structure-aware models, such as the compositional semantic parsers of Liu et al. (2021) and Weißenhorn et al. (2022) and the Neural Berkeley Parser (Kitaev and Klein, 2018), achieve perfect accuracy on these tasks.
Second, we conduct a series of experiments to investigate what makes structural generalization so hard for seq2seq models. It is not because the encoder loses structurally relevant information: one can train a probe to predict COGS syntax from BART encodings, in line with earlier work (Hewitt and Manning, 2019; Tenney et al., 2019a); but the decoder does not learn to use it for structural generalization. We find further that the decoder does not even learn to generalize semantically when the input is enriched with syntactic structure. Finally, it is not merely because the COGS tasks require the mapping of language into symbolic representations. We introduce a new text-to-text variant of COGS called QA-COGS, where questions about COGS sentences must be answered in English. We find that T5 performs well on structural generalization with the original COGS sentences, but all models still struggle with a harder text-to-text task involving structural disambiguation.
The code (https://github.com/coli-saar/Seq2seq-on-COGS) and datasets (https://github.com/coli-saar/Syntax-COGS) are available online.

Related work
The recent interest in compositional generalization has raised concerns about limitations of seq2seq models. For instance, the SCAN dataset (Lake and Baroni, 2018) requires a model to translate natural-language instructions into symbolic action sequences; it has multiple splits in which the test data contains new combinations of commands or instructions that are systematically longer than in training. The PCFG dataset (Hupkes et al., 2020) builds upon SCAN and adds instructions with recursive structure. The CFQ dataset (Keysers et al., 2020) maps questions to SPARQL queries, and splits the data according to a measure of compositional complexity (MCD). In all of these papers, simple seq2seq models based on LSTMs and transformers were shown to perform poorly when the test data was more complex than the training data. Since then, followup research has shown that generic transformer-based models (Ontanon et al., 2022; Csordás et al., 2021), general-purpose pretrained models (Furrer et al., 2020), and seq2seq models that are specialized for the task can all achieve higher accuracies than the ones reported in the papers introducing the datasets. Nonetheless, there is a sense that despite the best efforts of the community, pure seq2seq models are hitting a ceiling on compositional generalization tasks.
In this paper, we shed some light on the issue by (a) clarifying that seq2seq models do not struggle with compositional generalization per se, but with structural generalization, and (b) demonstrating that this type of generalization remains hard for seq2seq models even after heavy pretraining. This is in contrast to most previous research, which has avoided pretraining and focused on length or MCD as the primary source of difficulty. Our data includes instances where the structure, but not the length, differs between training and testing, and therefore allows us to differentiate between the two. The importance of structure to compositional generalization is also recognized by Bogin et al. (2022).
The difficulty of structural generalization for neural models has also been studied in more targeted ways. For instance, Yu et al. (2019) show empirically that LSTM-based seq2seq models cannot learn to close the brackets of Dyck languages, and Hahn (2020) proves that transformers cannot learn to distinguish well-bracketed Dyck expressions. McCoy et al. (2020) find empirically that seq2seq models struggle to learn the structural operations necessary to rewrite declarative English sentences into questions, whereas tree-based models work better.

Structural generalization in COGS
COGS (Kim and Linzen, 2020) is a synthetic semantic parsing dataset in which English sentences must be mapped to logical meaning representations. Most generalization types in COGS are lexical: they recombine known grammatical structures with words that were not observed in these particular structures in training. An example is the generalization type "subject to object" (Fig. 1a), in which a noun ("hedgehog") is only seen as a subject in training, whereas it is only used as an object at test time. The syntactic structure at test time was already observed in training; only the words change.
By contrast, structural generalization involves generalizing to linguistic structures that were not seen in training (cf. Fig. 1b,c). Examples are the generalization types "PP recursion", where training instances contain prepositional phrases of depth up to two and generalization instances have PPs of depth 3-12; and "object PP to subject PP", where PPs modify only objects in training and only subjects at test time. These structural changes are illustrated in Fig. 2.
Structural generalization requires learning about recursion and compositionality, and is thus a more thorough test of human-like language use, whereas lexical generalization amounts to smart template filling. In this paper, we investigate how well structural generalization can be solved by different classes of model architectures: seq2seq models and structure-aware models. We define a model as "structure-aware" if it is explicitly designed to encode linguistic knowledge beyond the fact that sentences are sequences of tokens. This captures a large class of models that can be as "deep" as a compositional semantic parser or as "shallow" as a POS tagger that requires that each input token gets exactly one POS tag.

Structural generalization is hard for seq2seq
We begin with some evidence that structural generalization in COGS is hard for seq2seq models, while structure-aware models learn it quite easily.
We first collect some results on the original semantic parsing task of COGS, extending it with numbers for BART and T5. We then transform COGS into a corpus for syntactic parsing and POS tagging and investigate the ability of BART and T5 to generalize structurally on these tasks.

Experimental setup: COGS
We follow standard COGS practice and evaluate all models on the generalization set. We report exact match accuracies, averaged across 5 training runs.

Seq2seq models. We train BART (Lewis et al., 2020) and T5 (Raffel et al., 2020) as semantic parsers on COGS. Both models are strong representatives of seq2seq models and perform well across many NLP tasks. To apply these models to COGS, we directly fine-tune the pretrained bart-base and t5-base models with the corresponding tokenizers; see Appendix A for details. We also report results for a wide range of published seq2seq models for COGS (Kim and Linzen, 2020; Conklin et al., 2021; Csordás et al., 2021; Akyürek and Andreas, 2021; Zheng and Lapata, 2022; Qiu et al., 2021).
Structure-aware models. We report evaluation results for LeAR (Liu et al., 2021) and the AM parser (Weißenhorn et al., 2022). Both models learn to predict a tree structure which is decoded into COGS meaning representations using the Principle of Compositionality. Thus both models are structure-aware.

Results
We report the results by generalization type in the "semantic" rows in Table 1. We will explain "BART+syn" in Section 5.3 and the "syntactic" and "POS" sections in Section 4.3.
Structural generalization is hard. We can observe that all recent models achieve near-perfect accuracy on the 18 lexical generalization types. However, all pure seq2seq models achieve very low accuracy on the structural generalization types, whereas structure-aware models are still very accurate. One outlier is the seq2seq model of Qiu et al. (2021). It employs heavy data augmentation based on (structure-aware) synchronous grammars encoding the Principle of Compositionality, which provides training instances of higher recursive depth to the seq2seq model. The seq2seq model then still generalizes to the recursive depth which it has seen in training, but not beyond (Peter Shaw, p.c.).
Note that the mean accuracy is dominated by the lexical generalization types; to really measure the ability of a model to generalize to unseen structures, it is important to focus on the structural generalization types. Note further that BART and T5 perform very well among the class of seq2seq models, outperforming many models that are specialized to COGS. We will focus on these two models in the experiments below.
It is important to note that although the generalization instances for PP and CP recursion are longer than the training instances, the low accuracy of the seq2seq models cannot be explained exclusively in terms of their known weakness in length generalization (Hupkes et al., 2020). For the "Object to Subject PP" generalization type, the generalization and training sentences have the same length, but different structures. Thus our results point towards a specific weakness in structural generalization.

Depth generalization. The accuracy of the seq2seq models depends on the difference in complexity between the test instance and the training data. For instance, all training instances for the "PP recursion" type have recursion depth two or less; Fig. 3 shows how the accuracy depends on the recursion depth of the test instance. As we see, the accuracy of BART (even when informed by syntax, cf. Section 5.3) degrades quickly with recursion depth. By contrast, LeAR and the AM parser maintain high accuracy across all recursion depths.

Syntax-COGS and POS-COGS
While these results on semantic parsing are suggestive, they could be explained away in many ways. For instance, the weakness of seq2seq models with respect to structural generalization might be specific to semantics, or the semantic representations chosen in COGS might be idiosyncratic and unfair to seq2seq models.
We therefore investigate structural generalization on a syntax variant of COGS. We convert each training and generalization instance of the COGS corpus into a pair of the sentence with its syntax tree (Syntax-COGS) and a pair of the sentence with its POS tag sequence (POS-COGS). This is possible because COGS is generated from an unambiguous context-free grammar; we reconstruct the unique syntax trees that underlie each instance in COGS.
We replace the very fine-grained non-terminals (e.g. NP_animate_dobj_noPP) of the original COGS grammar with more general ones (e.g. NP) and remove duplicate rules (e.g. NP→NP) resulting from this. We extract the POS tag sequences from the preterminal nodes of the syntax trees.
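As an illustration, this relabeling and duplicate-rule removal can be sketched as follows. This is a minimal sketch, assuming syntax trees are represented as nested lists; the function name and tree encoding are our own, not part of the released preprocessing code:

```python
def simplify(tree):
    """Relabel fine-grained nonterminals (e.g. NP_animate_dobj_noPP -> NP)
    and collapse duplicate unary rules (e.g. NP -> NP) that result."""
    if isinstance(tree, str):                   # leaf token
        return tree
    label = tree[0].split("_")[0]               # drop the fine-grained suffix
    children = [simplify(c) for c in tree[1:]]
    # a duplicate rule such as NP -> NP becomes a unary chain; splice it out
    if len(children) == 1 and isinstance(children[0], list) and children[0][0] == label:
        return children[0]
    return [label] + children

tree = ["NP_animate_dobj_noPP", ["Det", "a"], ["N_common", "rose"]]
print(simplify(tree))   # ['NP', ['Det', 'a'], ['N', 'rose']]
```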
We train BART and T5 to predict linearized constituency trees and the POS tag sequences from the input sentences. As a structure-aware model, we use the Neural Berkeley Parser (Kitaev and Klein, 2018), which consists of a self-attention encoder and a chart decoder and therefore has the notion of a tree and its recursive structure built into the parsing model. On the POS tagging task, our "structure-aware" model is constrained to predict exactly one POS tag for each input token. Specifically, we determine the most frequent POS tag in the training data for each word type and assign it to all occurrences of the word during inference.
Results. The results are shown in the "syntactic" and "POS" rows of Table 1. We find the same pattern as in the semantic parsing case: the seq2seq models do well on LEX, but struggle with STRUCT. The structure-aware models handle all generalization types well. Thus, the difficulties that seq2seq models have with structural generalization on COGS are not limited to semantics: rather, they seem to be a general limitation in the ability of seq2seq models to learn linguistic structure from structurally simple examples and use it productively. We also present an example for the obj_pp_to_subj_pp type across different tasks in Figure 4. For the sentence The baby on a tray in the house screamed, T5 consistently predicted wrong symbol sequences. For example, in semantic parsing, T5 tends to predict tray as the theme of scream with a PP structure. This might be due to a preference of T5 to reuse the pattern for object-PP sentences in the train set even if the intransitive verb does not license it. T5 also displays an unawareness of word order that is reminiscent of the difficulties that seq2seq models otherwise face in relating syntax to word order (McCoy et al., 2020). For recursion generalization types, we find that the main error is that the decoder cannot generate long or deep enough sequences.
We now turn to the second question: why do seq2seq models struggle with structural generalization? We start by investigating at which point the model loses the structural information: does the encoder not represent it, or can the decoder not make use of it? This also addresses an apparent tension between our findings and previous work demonstrating that pretrained models contain rich linguistic information (Hewitt and Manning, 2019; Tenney et al., 2019b), which should be sufficient to at least solve Syntax-COGS.

Probing for structural information
We use the well-established probing methodology (Peters et al., 2018; Tenney et al., 2019a) to analyze what information is present in the outputs of the BART encoder. We define both a syntactic and a semantic probing task.

Constituent labeling. The goal of this task is to predict correct labels for all constituency spans in a sentence. We treat spans that are not constituents as if they were annotated with the None label. The gold annotations are derived from Syntax-COGS.
Semantic role labeling. To measure the presence of structural semantic information, we define a probing task that predicts role labels for all predicate-argument relations in a sentence. For example, in the sentence Emma slept, the goal is to recognize that slept is a predicate with Emma being its agent. This task captures most of the information in the original COGS meaning representations as relations between tokens in the sentence. We extract data for this task (given two tokens, predict if the second is an argument of the first and with what role label) from the COGS meaning representations. We refer to Appendix C for details.
We train probe classifiers in a similar way as Tenney et al. (2019a). For each task, we train a multi-layer perceptron to predict the target label from the outputs of the frozen pretrained encoder. For constituent labeling, the MLP reads a span representation obtained by subtracting the encodings of the tokens at the span boundaries from each other (Stern et al., 2017). For semantic role labeling, the input of the MLP is the concatenation of the encodings for the predicate and argument tokens.
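The two probe inputs can be sketched as follows, with numpy arrays standing in for the frozen BART encodings; the toy dimensions and function names are our own, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # toy hidden size (bart-base uses 768)
enc = rng.normal(size=(6, d))           # frozen encoder outputs, one row per token

def span_repr(enc, i, j):
    """Constituent-labeling probe input (Stern et al., 2017):
    difference of the encodings at the span boundaries."""
    return enc[j] - enc[i]

def pair_repr(enc, pred, arg):
    """SRL probe input: concatenation of predicate and argument encodings."""
    return np.concatenate([enc[pred], enc[arg]])

print(span_repr(enc, 1, 4).shape)   # (8,)
print(pair_repr(enc, 0, 2).shape)   # (16,)
```

The MLP probes are then trained on these fixed-size vectors while the encoder stays frozen.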
We evaluate the probes in two ways. First, we train the probes on the original training split of COGS ("orig"). However, this conflates the presence of structural information in the encodings with the ability of the probing MLP itself to perform structural generalization. We therefore also evaluate on a second split ("probe") in which we add 60% of the generalization set (randomly selected) to the training set and 10% to the development set, and keep the rest as the probe test set. This makes the probe test set in-distribution with respect to the probe training set. The encoder remains frozen and can therefore not adapt to the modified training set; we still obtain meaningful results about whether the pretrained encodings contain the information that is needed to learn to predict structure in COGS.

Results
We report the sentence-level accuracy in Table 2. For better comparison, all accuracies are measured on the test set from the "probe" split. We find that the probes learn to solve both tasks accurately on the "probe" split, indicating that the pretrained encodings of BART contain all the information that is needed to make structural predictions. By contrast, when we replace the BART encodings with random vectors of the same size ("Random" rows), the probe fails to learn. The probes also perform badly on the "orig" split, suggesting that the probe "decoder" does not generalize structurally either. These findings suggest that the BART encoder captures all the necessary information about the input sentence, but the BART decoder cannot use it to learn to generalize structurally.

Enriching seq2seq with structure
Can we make things easier for the decoder by making the structural information explicit in the input? To investigate this, we inject the gold syntax tree into the BART encoder to see if this improves structural generalization in semantic parsing.
We retrain BART on COGS, but instead of feeding it the raw sentence, we provide as input the linearized gold constituency tree ("(NP (Det a) (N rose))"), both for training and inference. This method is similar to Li et al. (2017) and Currey and Heafield (2019), but we allow attention over special tokens such as "(" during decoding.
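The linearization itself is straightforward; a minimal sketch, assuming trees are represented as nested lists (our own encoding, not the released preprocessing code):

```python
def linearize(tree):
    """Linearize a constituency tree into the bracketed string fed to BART,
    e.g. (NP (Det a) (N rose)). Trees are nested lists [label, child, ...]."""
    if isinstance(tree, str):       # leaf token
        return tree
    return "(" + tree[0] + " " + " ".join(linearize(c) for c in tree[1:]) + ")"

print(linearize(["NP", ["Det", "a"], ["N", "rose"]]))   # (NP (Det a) (N rose))
```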
We report the results as "BART+syn" in Table 1 and Fig. 3; the overall accuracy increases by 1.5% over BART. This is mostly because providing the syntax tree allows BART to generalize correctly on LEX. However, STRUCT remains out of reach for BART+syn, confirming the deep difficulty of structural generalization for seq2seq models.
We also explored other ways to inform BART with syntax, through multi-task learning (Sennrich et al., 2016; Currey and Heafield, 2019) and syntax-based masking in the self-attention encoder (Kim et al., 2021). Neither method substantially improved the accuracy of BART on the COGS generalization set (+1.0% and -6.4% overall accuracy, respectively). We conclude that the weakness of the BART decoder towards structural generalization persists even when the input makes the structure explicit.


Text-to-text structural generalization

We will now turn our attention to a novel text-to-text variant of COGS. The difficulty of structural generalization for seq2seq models has been primarily studied on tasks where sentences must be mapped into symbolic representations of some kind, such as the semantic and syntactic representations in Section 4. But although pretrained seq2seq models like BART and T5 achieve excellent accuracy on broad-coverage semantic parsing tasks, one might argue that they were originally designed for tasks where the output sequence is natural language as well, and thus should be evaluated on such tasks.
We therefore propose a new dataset, QA-COGS, which presents structural generalization examples based on COGS sentences in a question-answering format. Given a context sentence and a question sentence as input, the goal is to output the correct answer, which should be a consecutive span of tokens in the context sentence. The dataset consists of two sections: QA-COGS-base directly asks questions about COGS sentences (Section 6.1), whereas QA-COGS-disamb combines COGS sentences in novel coordinating structures (Section 6.2). Following the original COGS design, each section consists of four subsets: training set, development set, in-distribution test set, and out-of-distribution generalization set.

QA-COGS-base
The QA-COGS-base dataset uses the sentences of COGS as context sentences, and then asks one or more questions about each sentence that can be answered by a contiguous substring (see Fig. 6). For example, given Noah ate the cake on the plate as context, we ask What did Noah eat? and Who ate the cake on the plate?, and the answers should be the cake on the plate and Noah, respectively.
To generate question-answer pairs, we identify the semantic roles and arguments for each predicate in all sentences of COGS, as in the SRL probing task (Section 5.1). We generate question-answer pairs out of these based on handwritten templates (i.e. at least one per COGS instance) and split them into train/test/generalization sets as in the original COGS. We refer to Appendix D for more details.
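A toy sketch of this template-based generation; the templates and role names here are invented for illustration (the actual handwritten templates also handle verb inflection, which this sketch ignores):

```python
def qa_pairs(pred, roles):
    """Turn one predicate and its role -> argument map into (question, answer)
    pairs via simple templates. Role names 'agent'/'theme' are illustrative."""
    pairs = []
    if "agent" in roles:
        pairs.append(("Who " + pred + "?", roles["agent"]))
    if "agent" in roles and "theme" in roles:
        pairs.append(("What did " + roles["agent"] + " " + pred + "?",
                      roles["theme"]))
    return pairs

print(qa_pairs("ate", {"agent": "Noah", "theme": "the cake on the plate"}))
```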
The original COGS training set contains "primitive" instances in which the sentence consists of a single word, and the meaning representation is the word itself (e.g. Paula ⇒ Paula). We include these instances in QA-COGS-base by using a special token <prim> as the question sentence and the primitive word as context and answer (i.e., Paula <prim> ⇒ Paula).

QA-COGS-disamb
We add QA-COGS-disamb as a second, harder text-to-text task based on COGS. This task exploits the interplay of the syntactic structure of a sentence with constraints on tense and number agreement. For instance, in sentences of the form "N1 V1 that N2 V2 and V3" (where N1, N2 are noun phrases and V1, V2, V3 are verbs), V3 belongs to the same clause as V1 or V2 depending on which one it agrees with. Thus, the agreement between verbs disambiguates a structural ambiguity of the sentence. Some concrete examples are shown in Fig. 5. The idea that agreement interacts with syntax is reminiscent of Linzen et al. (2016), but here we predict the syntactic structure rather than the agreement feature.
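The disambiguation logic for this case can be sketched as a simple rule. This is our own toy formulation; it assumes V1 and V2 differ in tense, as in the disambiguated cells of Fig. 5:

```python
def attach_v3(tense_v1, tense_v2, tense_v3):
    """For "N1 V1 that N2 V2 and V3": V3 coordinates with whichever verb it
    agrees with in tense. Assumes V1 and V2 have different tenses; otherwise
    the sentence remains ambiguous."""
    if tense_v3 == tense_v2:
        return "V2"          # coordination inside the embedded clause
    if tense_v3 == tense_v1:
        return "V1"          # coordination in the main clause
    return "ambiguous"

print(attach_v3("past", "present", "past"))   # V1
```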
QA-COGS-disamb consists of two parts. The subcorpus cc_cp consists of sentences as above, where tense agreement disambiguates the structural ambiguity between CP embedding and coordination. The subcorpus rc_pp contains sentences where number agreement disambiguates the attachment of a relative clause. In both cases, we construct context sentences using a context-free grammar adapted from the one that generates COGS. We generate questions of the form "What is the ccomp of said?" along with their answers from the context sentences using a small number of handwritten heuristics. Answering these questions correctly amounts to disambiguating the structure of the sentence. We create training (4k instances), development (1k), and in-domain test sets (1k) for QA-COGS-disamb out of three of the four combinations of the agreement features of the two verbs (white cells in Fig. 5). We create a generalization set (2k instances) from the fourth, unseen combination of agreement features (gray cells).

Figure 6: Examples for the QA-COGS-base dataset with regard to each structural generalization type. In each example, the first sentence is the context sentence, the second sentence is the question sentence, and the bold token span is the corresponding answer.

Models
We conduct a series of experiments in which a model receives the concatenation of context sentence and question as input and must predict the answer. We fine-tune BART and T5 on QA-COGS and compare against two structure-aware models. Details of the training setup are discussed in Appendix A.
First, we compare against an extractive model we call BART-QA. Given a context sentence and question, BART-QA predicts the start and end position of the answer within the context sentence. The start and end positions are each predicted by an MLP trained from scratch which takes the outputs of the pretrained BART encoder as input.
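For illustration, the extractive span prediction can be sketched as a naive search over spans with start ≤ end; note that combining the two logit vectors into a joint span score is our assumption here, since the model predicts start and end with separate MLPs:

```python
import numpy as np

def extract_span(start_logits, end_logits):
    """Pick the (start, end) token pair with the highest combined score,
    subject to start <= end. O(n^2) over the context length n."""
    n = len(start_logits)
    best, span = -np.inf, (0, 0)
    for i in range(n):
        for j in range(i, n):
            score = start_logits[i] + end_logits[j]
            if score > best:
                best, span = score, (i, j)
    return span

print(extract_span(np.array([0.1, 2.0, 0.3]), np.array([0.0, 0.1, 1.5])))  # (1, 2)
```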
Second, we use a more informed model called BART-QA+struct specifically for QA-COGS-disamb. BART-QA+struct shares the same encoder as BART-QA, but its decoder is constrained to select a span which exists in the gold syntax tree of the sentence. This model accesses information that is usually not available at test time, and we offer it only as a point of comparison.

Results
The exact match accuracies on the generalization sets are shown in Table 3. Similar to the earlier experiments, all models perform well on LEX; we mainly discuss results on STRUCT below.
QA-COGS-base. All models solve "Object to Subject PP" perfectly, with T5 and BART-QA also achieving perfect accuracy on PP and CP recursion. While these positive results on structural generalization seem to go against the grain of our earlier discussion, it is important to note that QA-COGS-base is an extractive task which only requires selecting a substring of the input; and further, that this substring is in a very specific position of the string, making the task amenable to learning simple heuristics (e.g. the subject is everything to the left of the verb). Thus, these results indicate that structural generalization is hard only if the decoder's task is sufficiently complex. Note that unlike BART, T5 sees question answering tasks during training, which may help explain the difference in accuracy.
QA-COGS-disamb. However, BART and T5 both achieve low accuracy on QA-COGS-disamb, suggesting that even text-to-text tasks involving structural generalization can be difficult; string-level heuristics are not successful on this task. In this case, the task is still hard for the structure-aware model BART-QA. It can be solved by BART-QA+struct, but note that this model has access to gold syntax information which makes the task much easier. Note that since the training and generalization sentences in QA-COGS-disamb are of similar length, the difficulty comes exclusively from structural rather than length generalization.

Conclusion
We have presented evidence that structural generalization is hard for seq2seq models, both on semantic and syntactic parsing (COGS and Syntax-COGS) and on some text-to-text tasks (QA-COGS-disamb). In many of these cases, structure-aware models generalize successfully where seq2seq models struggle. Unlike earlier work, we have shown that this effect persists when the seq2seq models are pretrained.
We have then presented a number of experiments to help pinpoint the cause of this limitation. We found that the BART encoder still provides structural information, but the decoder does not use it to generalize, both in the parsing tasks and in the probing tasks on the original splits, and not even when the input is enriched with syntactic information. We further found that when the decoder's task is simple enough, as in QA-COGS-base, seq2seq models learn to generalize structurally as well as structure-aware models. To improve the ability of seq2seq models to generalize structurally, it seems promising to focus on the decoder, especially by including structure-aware elements.

Limitations
Our experiments are limited to a synthetic corpus (COGS) and its derivatives. While it seems plausible to us that negative results like ours can be justified with a synthetic corpus, it must be recognized that the distribution of language in COGS is not the same as in English as a whole, which might undermine the ability of both seq2seq and structure-aware models to learn to generalize.
Furthermore, claims about a whole class of models (seq2seq) can only be supported, never completely proved, through empirical experiments on a finite set of representatives. Nonetheless, we think that this paper has considered a sufficiently wide range of models and tasks to make careful statements about seq2seq models as a class.

A Training details
Evaluation metrics. We use sequence-level exact match accuracy as our evaluation metric for all experiments. Thus a predicted sequence is correct only if each output token in it is correctly predicted.
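A minimal sketch of this metric (our own illustrative implementation):

```python
def exact_match_accuracy(preds, golds):
    """Sequence-level exact match: a prediction counts as correct only if
    the whole output sequence is identical to the gold sequence."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

print(exact_match_accuracy([["a", "b"], ["a"]], [["a", "b"], ["b"]]))   # 0.5
```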
Hyperparameters. We used the following hyperparameter values in our experiments. For all experiments we reported, we use bart-base for the BART model and t5-base for the T5 model. We always use the Adam optimizer (Kingma and Ba, 2015) and 8 gradient accumulation steps. Exact match accuracy is used as the validation metric.
We use the same hyperparameter settings for the semantic parsing, syntactic parsing and POS tagging experiments. For BART, we use batch size 64 and learning rate 2e-4. For T5, we use batch size 32 and learning rate 5e-4.
In the probing experiments, we probe the encoder of BART. The hidden size of the MLP classifier is 1024 and the dropout is 0.1. We use batch size 64 and learning rate 1e-3 for the span prediction task and 1e-4 for the semantic role labeling task.
In the QA-COGS experiments, we adapt the question answering module of BART for BART-QA and BART-QA+struct. In the evaluation of such extractive models, we do not consider the capitalization of the determiners (e.g. The boy is equivalent to the boy). We use batch size 64 and learning rate 2e-4 for these two models. For seq2seq models, most hyperparameters are the same as the ones in the parsing tasks. The only difference is that we use learning rate 1e-4 for the T5 model.
Model selection. Csordás et al. (2021) find that using an in-distribution development set can lead to ineffective model selection, and they select their best model based on the accuracy on the generalization set. We follow Zheng and Lapata (2022) by sampling a subset of the generalization set as an out-of-distribution development set.
In-distribution set performance. The exact match accuracy is at least 99 for both the (in-distribution) development set and the (in-distribution) test set in all experiments for the parsing and tagging tasks.
On the QA-COGS-base dataset, all models (i.e. BART, T5 and BART-QA) achieve at least 99 accuracy on the in-distribution development and test sets. On the QA-COGS-disamb dataset, we find that T5 and BART-QA can achieve an accuracy of 100 on the in-distribution development and test sets across different random seeds. However, the performance of BART is not stable with regard to different random seeds. The mean accuracy averaged over 5 runs is 95 ± 8.1 for cc_cp and 73.8 ± 27.8 for rc_pp.
Other details. Training takes 4 hours for BART (about 50 epochs) and 4 hours for T5 (about 30 epochs). Inference on the generalization set takes about 1 hour. All experiments are run on Tesla V100 GPU cards (32GB). BART has 140 million parameters and T5 has 220 million.
Results from other papers. Kim and Linzen (2020) provide two training sets: train (24,155 samples) and train100 (39,500 samples). train100 simply extends train with 100 samples for each exposure example. For example, for the generalization type in Fig. 1(a), the train set contains only one sentence with hedgehog as the subject (the exposure example), whereas train100 contains 100 different sentences with hedgehog as the subject. Since train100 does not introduce new structures, it only helps with lexical generalization.
All semantic parsing models in Table 1 are trained on the train set, except those of Kim and Linzen (2020), Conklin et al. (2021) and Weißenhorn et al. (2022). We noticed that Kim and Linzen (2020) and Conklin et al. (2021) achieve higher performance with train100 and therefore report their train100 numbers. Although the number for Weißenhorn et al. (2022) is also based on train100, their model performs well on structural generalization even when trained on the train set; train100 only improves its performance on the lexical generalization types. Their model thus still supports the point that structural generalization can be solved by structure-aware models.

B Dataset details
We use COGS (Kim and Linzen, 2020) and variants of COGS (i.e. Syntax-COGS, POS-COGS and QA-COGS) as our datasets. We report dataset statistics for all our datasets in Table 4.
Syntactic annotations. To obtain syntactic annotations for Syntax-COGS, we use NLTK to parse each sentence in COGS with the context-free grammar that was used to generate COGS. In our experiments, we find that this parsing process yields a unique tree for each sentence in COGS. The original grammar contains rules such as NP→NP_animate_dobj_noPP. We replace such fine-grained nonterminals (e.g. NP_animate_dobj_noPP) with general nonterminals (e.g. NP). This results in duplicate patterns (e.g. NP→NP), which we remove from the output tree.
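The relabeling and deduplication step can be sketched as follows. This is a minimal sketch assuming trees are nested lists [label, child1, ...] with strings as leaves; the paper's pipeline presumably operates on NLTK parse trees instead, and the naming convention (general category before the first underscore) is our reading of the grammar.

```python
def simplify(tree):
    # Replace fine-grained nonterminals (e.g. NP_animate_dobj_noPP) with
    # their general category (e.g. NP), then collapse the resulting
    # duplicate unary patterns (e.g. NP -> NP).
    if isinstance(tree, str):          # leaf token
        return tree
    label = tree[0].split("_")[0]      # strip fine-grained features
    children = [simplify(c) for c in tree[1:]]
    # collapse a unary chain whose single child carries the same label
    if (len(children) == 1 and isinstance(children[0], list)
            and children[0][0] == label):
        return children[0]
    return [label] + children
```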

C Semantic role labeling
We give more details here about the semantic role labeling task described in Section 5.1. In contrast to the semantic parsing task, where the output is a sequence encoding the meaning representation, the goal of this task is to predict the semantic role graph of a sentence.
An example of a semantic role graph is shown in Fig. 7. A special symbol marks that the column word is not an argument of the row word; we capture this with the special class None in the data.
We align tokens in the sentence with predicate symbols in the meaning representation based on the variable names, which specify positions in the sentence (e.g. x_1 corresponds to the second token in the string). This allows us to project the predicate-argument relations in the meaning representation onto relations between the tokens. For a predicate verb, we draw an edge to each of its arguments in the meaning representation (e.g. drew has an Agent edge to girl).
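This projection can be sketched as follows. The sketch assumes a simplified, unspaced rendering of the COGS logical form (terms like pred.role(x_i, x_j)); the real format and the function name here are our own illustrative choices.

```python
import re

def project_edges(tokens, meaning_repr):
    # Variable names x_i refer to position i in the sentence (0-indexed),
    # so x_1 is the second token. We read off every binary role term
    # pred.role(x_i, x_j) and turn it into a token-token edge.
    edges = []
    for pred, role, i, j in re.findall(
            r"(\w+)\.(\w+)\(x_(\d+),\s*x_(\d+)\)", meaning_repr):
        head, dep = tokens[int(i)], tokens[int(j)]
        edges.append((head, role.capitalize(), dep))
    return edges
```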
The COGS grammar also contains prepositional phrases (e.g. a bowl on the table). To represent this modification relation, we draw an Nmod edge from the modified noun to the modifier noun (e.g. bowl has an Nmod edge to table).
For common nouns, we add a DefN self-edge to denote that the noun has a definite determiner (e.g. girl has a DefN edge to itself) and an IndefN self-edge to denote an indefinite determiner (e.g. bat has an IndefN edge to itself).
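The determiner self-edges can be sketched in the same style. We assume the COGS convention that definite nouns appear in the logical form as *noun(x_i) and indefinite nouns as plain noun(x_i); as above, the compact unspaced format and function name are illustrative assumptions.

```python
import re

def determiner_edges(tokens, meaning_repr):
    # Unary noun terms carry definiteness: a leading "*" marks a definite
    # determiner (DefN), its absence an indefinite one (IndefN). Each
    # edge connects the noun token to itself.
    edges = []
    for star, noun, i in re.findall(r"(\*?)(\w+)\(x_(\d+)\)", meaning_repr):
        label = "DefN" if star else "IndefN"
        edges.append((tokens[int(i)], label, tokens[int(i)]))
    return edges
```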

D QA-COGS
D.1 QA-COGS-base
To create QA-COGS-base, we first obtain the frame for each predicate in a sentence from its gold meaning representation. We define the frame of a predicate as the combination of argument types it takes. Possible frames in our dataset and corresponding examples are shown in

Figure 1: Some examples from the COGS dataset. LEX represents lexical generalization and STRUCT denotes structural generalization.

Figure 3: Influence of PP recursion depth on overall PP depth generalization accuracy.

Figure 4: Example for the obj_to_subj_pp type. We list the semantic parse, syntax tree and POS tag annotations with corresponding T5 predictions.

Figure 5: Construction of QA-COGS-disamb: top is cc_cp, bottom is rc_pp. The answer to the example question is highlighted in bold.
Table 6: Detailed results for our models across COGS, Syntax-COGS, POS-COGS and QA-COGS.

Table 1: Exact match accuracies on the individual generalization types. Column LEX reports mean accuracy over the 18 lexical generalization types.

Table 2: Exact match accuracy for probing on the individual generalization types.
Obj to Subj PP:
  Noah ate the cake on the plate. What did Noah eat?
  The cake on the plate burned. What was burned?
PP recursion:
  Ava saw a ball in a bowl on the table. What did Ava see?
  Ava saw a ball in a bowl on the table on the floor. What did Ava see?
CP recursion:
  Ava said that Emma liked that a dog ran. What did Ava say?
  Ava said that Emma liked that Noah noticed that a dog ran. What did Ava say?

Table 3: Exact match accuracy on the individual generalization types on the sections of QA-COGS.

Table 5. We generate questions for all predicate-argument pairs in