Compositional Generalization in Dependency Parsing

Compositionality, the ability to combine familiar units like words into novel phrases and sentences, has been the focus of intense interest in artificial intelligence in recent years. To test compositional generalization in semantic parsing, Keysers et al. (2020) introduced Compositional Freebase Queries (CFQ). This dataset maximizes the similarity between the test and train distributions over primitive units, like words, while maximizing the compound divergence: the dissimilarity between test and train distributions over larger structures, like phrases. Dependency parsing, however, lacks a compositional generalization benchmark. In this work, we introduce a gold-standard set of dependency parses for CFQ, and use it to analyze the behaviour of a state-of-the-art dependency parser (Qi et al., 2020) on the CFQ dataset. We find that increasing compound divergence degrades dependency parsing performance, although not as dramatically as it degrades semantic parsing performance. Additionally, we find that the dependency parser's performance does not degrade uniformly with compound divergence, and that the parser performs differently on different splits with the same compound divergence. We explore a number of hypotheses for what causes this non-uniform degradation, and identify a small set of syntactic structures that drive the parser's lower performance on the most challenging splits.


Introduction
People understand novel combinations of familiar words in part due to the principle of compositionality: We expect the meaning of a phrase to be a predictable composition of the meanings of its parts. Unlike humans, many neural models fail to generalize compositionally; a growing interest in this area has led to novel architectures and datasets designed to test compositional generalization (see § 7). One recently-introduced semantic parsing dataset, Compositional Freebase Queries (CFQ), consists of English questions with corresponding database queries written in SPARQL. Figure 1 shows an example question and SPARQL query. To test compositional generalization, CFQ includes test and train sets with a highly similar distribution of primitive units (like words) and increasingly divergent distributions of larger compound units (like phrases). The most challenging of these splits, with the highest compound divergence, are dubbed maximum compound divergence (MCD) splits.

Figure 1: An example question from the CFQ dataset, with the associated SPARQL query and dependency parse.
Although CFQ has proven to be a valuable resource, the difficulty of its splits appears to be influenced by factors other than compositional generalization. First, some evidence suggests that the complexity of the SPARQL output is partly responsible for low CFQ performance (Herzig et al., 2021). Furthermore, splits of the same compound divergence are not equally difficult. One possible explanation is a difference in the syntactic constructions of different splits; however, this has not yet been explored in CFQ. To address these issues, we created a dependency-parsing version of CFQ. Using our dataset, we evaluated a state-of-the-art dependency parser for compositional generalization, and used the dependency annotations to identify syntactic structures predictive of parsing failure on each MCD split.
We found that the dependency parser is more robust to increased compound divergence than the semantic parser, but its performance still decreased with higher compound divergence. We also found that the dependency parser, like semantic parsers, varied widely in performance on different splits of the same compound divergence. Finally, we found that a small number (seven or fewer) of syntactic constructions seem to drive the difficulty of the MCD splits. Our dataset will be made publicly available.1

1 https://github.com/emilygoodwin/CFQ-dependencies

Motivation for Dependency Parsing
In this section, we discuss three problems of CFQ, and our motivation for studying compositional generalization in dependency parsing.
First, CFQ is hard: seq2seq models trained from scratch score at most 12% on the MCD2 and MCD3 sets. Because of this difficulty, CFQ may lack the sensitivity to capture small but significant progress in neural modelling of compositionality. Second, recent work shows that CFQ's difficulty is in part due to the output representation being raw SPARQL: Models perform better when the outputs are replaced with compressed versions of SPARQL that are more closely aligned with the natural-language questions (Herzig et al., 2021). In interpreting performance on CFQ, we might therefore be conflating the challenges of compositional generalization with challenges related to the output representation.
Third, different splits of the same compound divergence vary widely in difficulty: eight of the nine semantic parsers currently listed on the leaderboard perform at least twice as well on MCD1 as on MCD2, despite the splits having the same compound divergence.2 Performance on CFQ is thus heavily influenced by some other property of the splits.

2 See https://github.com/google-research/google-research/tree/master/cfq
Finally, CFQ lacks a description of the specific syntactic generalizations tested by each split. Related benchmarks, like COGS (Kim and Linzen, 2020) and CLOSURE (Bahdanau et al., 2020), test a clearly-defined set of generalizations (for example, training on a noun in subject position and testing it in object position). CFQ splits, by contrast, optimize a gross metric over the distribution of all syntactic compounds in the dataset. This complicates in-depth analyses of CFQ results: For a particular split, it is unclear which syntactic constructions are tested in out-of-distribution contexts. Meanwhile, for a particular test sentence, it is unclear which of its syntactic structures caused the model to fail.
To address these issues with the CFQ semantic parsing benchmark, we studied compositional generalization in syntactic parsing. While syntactic parsing is simpler than mapping to a complete meaning representation, a language-to-SPARQL semantic parser must still understand the question's syntax. For example, to generate the triple ?x0 ns:film.editor.film M0 in the SPARQL query shown in Figure 1, a semantic parser must first identify that "actor" is the subject of "edit".
We chose dependency trees as the target syntactic formalism due to the maturity of the Universal Dependencies annotation standard, the popularity of dependency trees among NLP practitioners, and the availability of high-performance software such as Stanza (Qi et al., 2020). Importantly, dependency parsing does not require autoregressive models; instead, graph-based dependency parsers predict edges and their labels independently. This different way of employing deep learning for parsing allows us to separate the challenge of compositional generalization from challenges related to autoregressive models' teacher-forcing training. Finally, having gold dependency annotations for CFQ questions enables a detailed analysis of the relation between model errors and the syntactic discrepancies featured in the MCD splits.

Compound Divergence in CFQ
CFQ is designed to test compositional generalization by combining familiar units in novel ways. To ensure the primitive units are familiar to the learner, CFQ test and train sets are sampled in a way that ensures a low divergence between their frequency distributions of atoms. Here, atoms refers to individual predicates or entities (like "produced" or "Inception"), and to the rules used to generate questions. To ensure the compounds in test are novel, train and test sets are sampled in a way that ensures a higher divergence between their frequency distributions of compounds, weighted to prevent double-counting of compounds that are highly correlated with their sub-compounds.
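Concretely, both constraints are instances of the same divergence measure; the sketch below follows Keysers et al. (2020), with the Chernoff-coefficient form and the α values taken from that paper.

```latex
% Divergence between train distribution P and test distribution Q, where
% p_k and q_k are the (weighted) relative frequencies of atom/compound k:
\[
  \mathcal{D}_{\alpha}(P \,\|\, Q) \;=\; 1 \;-\; \sum_{k} p_k^{\alpha}\, q_k^{\,1-\alpha}
\]
% Keysers et al. (2020) use alpha = 0.5 for atom divergence and alpha = 0.1
% for compound divergence; the asymmetric alpha = 0.1 weights test compounds
% that are underrepresented in train more heavily than the reverse.
```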

Corpus Construction: Dependency Parses for CFQ
To train a dependency parser and analyze syntactic structures in the CFQ dataset, we created a corpus of gold dependency parses. Because the questions in CFQ are synthetically generated, we were able to write a full-coverage context-free grammar (CFG) for the CFQ language. Using this grammar and the chart parser available in Python's Natural Language Toolkit (NLTK), we generated a constituency parse for each question. Finally, we designed an algorithm to map each constituency parse to a dependency parse.
To map from constituency to dependency parses, we wrote a dependency-mapping rule for each production rule in the CFG (Collins, 2003). Each dependency rule describes the dependency relations between the elements of the constituent; for example, if the production rule is VP → V NP, the dependency-mapping rule attaches the head of the right-hand node (the head of the NP) as a dependent of the left-hand node (the V), with the arc label OBJ. We follow version two of the Universal Dependencies annotation standards (Nivre et al., 2020), but simplify the categorization of nominal subjects for active and passive verbs into one category (NSUBJ), and do not include part-of-speech tags in the dataset.
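As an illustration, the sketch below applies this style of rule-based mapping to a toy grammar fragment; the GRAMMAR, HEAD_RULES table, and arc labels are illustrative stand-ins, not our full CFQ grammar or rule set.

```python
# A minimal sketch of the constituency-to-dependency mapping, assuming a
# toy grammar fragment; GRAMMAR and HEAD_RULES are illustrative only.
import nltk

GRAMMAR = nltk.CFG.fromstring("""
    S  -> V NP VP
    VP -> V NP
    NP -> 'actor' | 'M0'
    V  -> 'Did' | 'edit'
""")

# For each production: the index of the head child, and the arc label
# assigned to each non-head child.
HEAD_RULES = {
    ("S", ("V", "NP", "VP")): (2, {0: "aux", 1: "nsubj"}),
    ("VP", ("V", "NP")): (0, {1: "obj"}),
}

def to_dependencies(tree, arcs):
    """Return the lexical head of `tree`, appending (head, dependent, label)
    arcs for every non-head child along the way."""
    if isinstance(tree, str):      # a bare token is its own head
        return tree
    child_heads = [to_dependencies(child, arcs) for child in tree]
    if len(tree) == 1:             # preterminal or unary rule: pass head up
        return child_heads[0]
    key = (tree.label(), tuple(child.label() for child in tree))
    head_idx, labels = HEAD_RULES[key]
    for i, dep_head in enumerate(child_heads):
        if i != head_idx:
            arcs.append((child_heads[head_idx], dep_head, labels[i]))
    return child_heads[head_idx]

parse = next(nltk.ChartParser(GRAMMAR).parse("Did actor edit M0".split()))
arcs = []
arcs.append(("ROOT", to_dependencies(parse, arcs), "root"))
print(arcs)  # (edit, M0, obj), (edit, Did, aux), (edit, actor, nsubj), (ROOT, edit, root)
```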
Our algorithm then recursively walks the constituency tree from bottom to top, mapping non-head children of each node to their syntactic heads and passing the head of each constituent up the tree. A number of sentences in the CFQ dataset exhibit dependency structures that cannot be directly read off the constituency parse in this manner: such right-node-raising constructions involve a word without a syntactic head in its immediate constituent. For example, in "Was Tonny written by and executive produced by Mark Marabella?" the first instance of "by" is a dependent of "Mark Marabella", but its immediate constituent is "written by". To handle right-node-raising cases, our dependency-mapping algorithm identifies prepositions with no head in the immediate constituent, and passes them up the tree until they can be attached to their appropriate syntactic head.
Finally, we performed a form of anonymization on the questions, replacing entities with single-word proper names. This reflects the anonymization strategy used by Keysers et al. (2020), and prevents the dependency parser from failing on named entities with particularly complex internal syntax (for example, "Did a Swedish film producer edit Giliap and Who Saw Him Die?").

The experiments in this paper are based on the original CFQ splits. However, the original validation sets are constructed from the same distribution as the test sets, so some information about the test distribution would be available during training. To ensure that the model only had access to the training distribution during the training phase, we followed the suggestion of Keysers et al. (2020) and discarded the MCD validation sets, randomly sampling 20% of the training data to use instead (see § 5.1 of that paper for more details). The resulting splits have 11,968 test sentences and 76,595 train sentences.

Compound Divergence Effect on Dependency Parsing

Training Stanza
To evaluate the effect of compound divergence on dependency parsing, we trained Stanza (Qi et al., 2020), a state-of-the-art dependency parser, on the gold-standard dependency parses described in § 3. We trained Stanza five times on each of 22 splits from the CFQ release: one random split (which has a compound divergence of 0), 18 splits with increasing compound divergence (ranging from .1 to .6), and three MCD splits (divergence of .7). To evaluate performance on each test set, we used the CoNLL 2018 shared task dependency parsing evaluation script, which gives a Labeled Attachment Score (LAS) and a Content-word Labeled Attachment Score (CLAS), reflecting how many of the total dependency arcs in the test set were correctly labeled, and how many of the arcs connecting content words were correctly labeled, respectively.
In addition, we calculated the percentage of test questions for which every content-word arc was correctly labeled, which we call Whole-Sentence Content-word Labeled Attachment Score (WSCLAS). This all-or-nothing evaluation scheme more closely resembles the exact-match accuracy used in semantic parser evaluation.
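A minimal sketch of the WSCLAS computation, assuming each sentence is represented as a list of (head index, relation) pairs aligned between gold and predicted parses; the set of content-word relations below is abbreviated from the CLAS definition.

```python
# Content-word relations, abbreviated from the CLAS definition (functional
# relations such as aux, cop, case, and det are excluded).
CONTENT_RELS = {"nsubj", "obj", "iobj", "obl", "nmod", "amod",
                "advmod", "xcomp", "ccomp", "root"}

def wsclas(gold_sents, pred_sents):
    """Fraction of sentences whose content-word arcs are all correct."""
    correct = 0
    for gold, pred in zip(gold_sents, pred_sents):
        content = [i for i, (_, rel) in enumerate(gold) if rel in CONTENT_RELS]
        if all(gold[i] == pred[i] for i in content):
            correct += 1
    return correct / len(gold_sents)

# Example: one sentence fully correct on its content arcs, one not.
gold = [[(2, "aux"), (2, "nsubj"), (0, "root")], [(2, "nsubj"), (0, "root")]]
pred = [[(2, "aux"), (2, "nsubj"), (0, "root")], [(1, "nsubj"), (0, "root")]]
print(wsclas(gold, pred))  # 0.5
```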

Dependency Parsing Results
We plot Stanza's performance as a function of the split's compound divergence in Figure 2. Increasing compound divergence had a negative effect on performance: Stanza's accuracy on the random split (zero compound divergence) was near perfect, with an average CLAS of 99.98% and WSCLAS of 99.89%. Meanwhile, accuracy on the three MCD splits (divergence of .7) dropped to an average CLAS of 92.85% and WSCLAS of 74.92%. A linear regression predicting CLAS found a slope of −6.91, and one predicting WSCLAS found a slope of −28.89; in other words, for each .1 increase in compound divergence, the linear model predicts a 2.889% drop in WSCLAS and a .691% drop in CLAS. These linear fits are also shown in Figure 2.
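The fits above are ordinary least-squares lines; a short sketch with hypothetical per-split scores (the real per-split numbers appear in Figure 2 and are not reproduced here):

```python
# Linear fit of WSCLAS against compound divergence; the scores below are
# hypothetical placeholders, not the paper's per-split results.
import numpy as np

divergence = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
wsclas     = np.array([99.9, 97.2, 94.1, 90.8, 92.5, 86.3, 81.7, 74.9])

slope, intercept = np.polyfit(divergence, wsclas, deg=1)
print(slope)  # the paper reports a slope of about -28.89 on the real data
```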
We note, however, two exceptions to this generally negative relationship between compound divergence and accuracy, which indicate that other characteristics of the test set also have a large effect. First, the parser performed better on all splits with a target compound divergence of .4 than on those at divergences .2 and .3. Second, we observed considerable variation in performance on different splits with the same compound divergence, particularly the MCD splits.
Stanza's performance on the three maximum-compound-divergence splits and one random split is shown in Table 1. Means and standard deviations are calculated over five randomly-seeded runs. The semantic parsing scores are reproduced from Keysers et al. (2020); the mean exact-match of 5 experiments, with 95% confidence intervals, is reported for each MCD split in their GitHub repository. While all three MCD splits were harder than the random split, performance varied from 96.57% WSCLAS (MCD1) to 56.76% WSCLAS (MCD3). Thus, while compound divergence is a factor in performance, idiosyncrasies of the individual splits also have large effects on performance.
Finally, we note that while Stanza was more robust to compound divergence than semantic parsers, it also ranked the splits differently in difficulty. Table 1 reproduces mean accuracies from Keysers et al. (2020)'s strongest-performing semantic parser, a universal transformer (Dehghani et al., 2019). The universal transformer's exact-match accuracy is lower than Stanza's WSCLAS on every MCD split. Additionally, while Stanza performed worst on MCD3, the universal transformer and most other semantic parsers on the CFQ leaderboard performed worst on MCD2. In the next sections, we explore what causes the variation in performance across the MCD splits.

Construction Complexity and the MCD Splits
The compound divergence metric treats compounds of any number of words identically; the differences between the MCD splits might therefore be driven by differing distributions of compounds of different complexities. In this section, we show that this is not the case. We first describe how we characterize syntactic constructions using the dependency annotations.

Syntactic Constructions
We explored differences in the distributions of syntactic constructions by looking at a restricted set of the subtrees of each dependency parse, which we will now describe. With respect to any target node in the corpus, we consider a syntactic construction to be any subtree that consists of that target node together with a constituent-contiguous subset of the target node's immediate children. Here constituent-contiguous means a subset of child nodes which head phrases that are adjacent to one another or to the target node in the string. We include only the immediate children in the subtree (excluding their descendants). We also replace words with their category label in CFQ: in addition to traditional parts of speech like verb and adjective, the category labels include the nominal categories role (which occurs in possessive constructions, like "mother" in "Alice's mother"), entity for proper nouns, and noun for common nouns.

Figure 3: A dependency parse and two of its subtrees.
For the analyses in this and the following section, we extract every syntactic construction from every dependency parse in our corpus, and compare their complexity. We define complexity as the number of arcs in the subtree, discounting the dummy ROOT arc. Two of the subtrees for the sentence "Did M1's female actor edit and produce M0?" are shown in Figure 3 (these subtrees have a complexity of two).
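A simplified sketch of this extraction, assuming tokens are (position, category) pairs; the child list below is hypothetical, and the linear-adjacency test is a simplification of the constituent-contiguity test described above.

```python
# Extract the constructions of one target node: the target's category plus
# each contiguous run of its children, where the run must touch the target
# or chain together through adjacent children.
def constructions(target, children):
    nodes = sorted(children + [target])          # order by string position
    t = nodes.index(target)
    out = []
    for lo in range(t + 1):
        for hi in range(t, len(nodes)):
            run = [cat for pos, cat in nodes[lo:hi + 1] if (pos, cat) != target]
            if run:                              # skip the empty subtree
                # complexity = number of arcs = number of children kept
                out.append((target[1], tuple(run), len(run)))
    return out

# Hypothetical children of "edit" in "Did M1's female actor edit ... M0?":
target = (4, "verb")
children = [(0, "verb"), (3, "noun"), (6, "verb")]
for c in constructions(target, children):
    print(c)
```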

Analysis of MCD Splits
One possible source of the differences between the MCD splits may be that they differ in their distributions of subtrees at differing complexities. In this section, we present two analyses showing that this is not the case. In our first analysis, we measured the distance between the test and train distributions for each split. To do this, we calculated the Jensen-Shannon (JS) distance between the test and train histograms of syntactic constructions at differing complexities.7

7 For histograms p and q, the JS distance is defined as $\sqrt{\tfrac{1}{2} D(p \,\|\, m) + \tfrac{1}{2} D(q \,\|\, m)}$, where m is the pointwise mean of p and q, and D is the Kullback-Leibler divergence.

Figure 4: Divergence between test and train for the MCD and random splits. Roughly, divergence increases with subtree length, although much more rapidly for the MCD splits than for the random split. Additionally, there is little difference between the different MCD splits.
The JS distances for constructions of each complexity are plotted in Figure 4. As the figure shows, the distances between test and train are similar for all MCD splits at all subtree complexities. This is true even for MCD1, which closely resembles MCD2 and MCD3 despite behaving more like the random split in parser performance. Thus, differences between the test and train distributions at different complexities cannot explain the MCD splits' differing performance.
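A short sketch of this distance computation, with hypothetical histogram counts; SciPy's jensenshannon returns the JS distance (the square root of the JS divergence) directly.

```python
# JS distance between hypothetical test and train histograms of
# constructions at one complexity; the counts are placeholders.
import numpy as np
from scipy.spatial.distance import jensenshannon

test_counts  = np.array([120.0, 40.0, 8.0, 2.0])
train_counts = np.array([100.0, 55.0, 30.0, 15.0])

# jensenshannon normalizes its inputs and returns sqrt(JS divergence).
print(jensenshannon(test_counts, train_counts, base=2))
```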
In our second analysis, we examined whether the MCD splits differ in the proportion of untrained subtrees at different complexities. The proportions are plotted in Figure 5. The MCD splits pattern together, with far more untrained constructions at each complexity than the random split.
We thus conclude that gross distributional properties of the MCD splits are unlikely to explain the differences in parser performance between them. In the next section, we show that parser mistakes on all splits seem to be driven by a very small number of hard-to-parse subtrees; performance differences are thus likely to depend on idiosyncratic interactions between the specific data splits and models.

Syntactic Analyses of Dependency Parser Outputs

Identifying Difficult Subtrees
To identify syntactic constructions that are predictive of dependency parsing failure, we fit a logistic model predicting Stanza's per-question performance on the test set from each question's syntactic constructions. Because we trained five randomly-initialized versions of Stanza, the logistic model was fit with five instances of each question. To encourage sparse subtree feature weights, we used L1 regularization. We used 90% of the test set to train the logistic model, and the remaining 10% to test it and to select a regularization coefficient of .01.
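A minimal sketch of this model with synthetic data; the feature matrix, labels, and the mapping of the paper's regularization coefficient onto scikit-learn's inverse-strength parameter C are all assumptions.

```python
# Logistic regression with L1 regularization over binary construction
# features; X and y are synthetic stand-ins for the real design matrix
# (one row per test question per run, one column per construction).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 50))   # question contains construction j?
y = rng.integers(0, 2, size=1000)         # 1 = whole sentence parsed correctly

# How the paper's coefficient of .01 maps onto sklearn's C is a guess here.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.01)
clf.fit(X, y)

# Constructions most predictive of failure: coefficients <= -1.
hard_constructions = np.where(clf.coef_[0] <= -1)[0]
print(hard_constructions)
```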
To analyze the subtrees most predictive of parsing failure, we extracted from the model all subtrees with a coefficient less than or equal to −1. To quantify the effect these trees have on the test set, we removed all sentences containing these trees from each split, and calculated Stanza's accuracy on the remaining test sentences. Table 3 shows the number of subtrees found to be predictive of parsing error, together with the accuracy when sentences containing those trees are removed from test. Removing five subtrees from MCD2's test set improves the accuracy to 92.46% (an increase of 21.05%), and removing seven trees from MCD3's test set improves the accuracy to 93.09% (an increase of 36.32%). We can thus conclude that the performance degradation of Stanza on higher-compound-divergence splits is driven by a relatively small number of syntactic constructions.

Table 4 shows the subtrees most predictive of a dependency parsing error, together with their test and train frequencies. To quantify the effect of each subtree on test accuracy, we also report the test set ∆ = WSCLAS(T′) − WSCLAS(T), where T is the original test set and T′ is the set of test sentences that do not include the construction. A positive ∆ means that removing the subtree from the test set improved performance, while a negative ∆ indicates that removing the subtree degraded performance. As can be seen in Table 4, the subtrees predictive of failure for MCD2 are also predictive of failure for MCD3. Four of these five trees are characterized by a copular or auxiliary verb "was" appearing to the left of the syntactic subject, as in questions like "Was Alice the sister of Bob?". This suggests that the MCD2 and MCD3 splits might be targeting a particular syntactic phenomenon (copular or auxiliary fronting) that is more challenging for the parser.
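A sketch of the ∆ computation for one construction; the sentence records, the construction name, and the wsclas_of helper are hypothetical.

```python
# Test-set delta for one construction, assuming each sentence is flagged
# with the constructions it contains and whether all runs parsed it
# correctly; "was<nsubj" is a made-up construction identifier.
def wsclas_of(sentences):
    return 100.0 * sum(s["correct"] for s in sentences) / len(sentences)

def test_set_delta(sentences, construction):
    """Delta = WSCLAS(T') - WSCLAS(T), where T' excludes the construction."""
    kept = [s for s in sentences if construction not in s["constructions"]]
    return wsclas_of(kept) - wsclas_of(sentences)

# Toy example: removing a hard construction raises WSCLAS, so delta > 0.
sentences = [
    {"constructions": {"was<nsubj"}, "correct": 0},
    {"constructions": set(), "correct": 1},
    {"constructions": set(), "correct": 1},
]
print(test_set_delta(sentences, "was<nsubj"))  # 33.33...
```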

Related Work
A growing body of work uses CFQ to investigate better models for compositional generalization in semantic parsing (Herzig and Berant, 2020; Guo et al., 2020). Tsarkov et al. (2020) also recently released an expanded version of CFQ called *-CFQ, which remains challenging for transformers even when they are trained on much more data. Our methodology can easily be applied to *-CFQ at the cost of a straightforward extension of the grammar.
Other datasets focused on compositional generalization include SCAN (Lake and Baroni, 2018), a dataset of English commands and navigation sequences; gSCAN (Ruis et al., 2020), a successor to SCAN with grounded navigation sequences; and COGS (Kim and Linzen, 2020), a semantic parsing dataset whose logical forms are based on lambda calculus and the UDepLambda framework (Reddy et al., 2017). In contrast to CFQ, these datasets challenge models by targeting specific, linguistically-motivated generalizations. For example, COGS includes tests of novel verb argument structures (like training on a verb in active voice and testing it in passive voice) and novel grammatical roles for primitives (like training with a noun in object position and testing it in subject position); similarly, SCAN includes splits which test novel combinations of specific predicates (training on a predicate like "jump" or "turn left" in isolation, and testing it composed with additional predicates from train). Finally, the CLOSURE benchmark for visual question answering tests systematic generalization of familiar words by constructing novel referring expressions, for example "a cube that is the same size as the brown cube" (Bahdanau et al., 2020).

Conclusion
In this paper, we presented a dependency parsing version of the Compositional Freebase Queries (CFQ) dataset. We showed that a state-of-the-art dependency parser's performance degrades with increased compound divergence, but varies across splits of the same compound divergence. Finally, we showed that the majority of the parser's failures on each split can be characterized by a small number (seven or fewer) of specific syntactic structures. To our knowledge, this is the first explicit test of compositional generalization in dependency parsing. We hope that the gold-standard dependency parses we have developed will be a useful resource for future work on compositional generalization. Existing work on syntactic (and in particular dependency) parsing can provide researchers in compositional generalization with ideas and inspiration that can then be empirically validated using our corpus.
Finally, our work represents a step forward in understanding the syntactic structures which drive lower performance on MCD test sets. Predicting parser performance from the syntactic constructions contained in the question provides a new method for understanding the syntactic structures that can cause parser failure; in future work, similar methods can also be used to better understand failures of semantic parsers on the CFQ dataset.

A Correlation of Semantic Parsing and Dependency Parsing Errors
Because syntactic parsing is a necessary sub-task for semantic parsing, we also explored the possibility that dependency parsing errors might be predictive of semantic parsing errors. We extracted the per-sentence predictions from Keysers et al. (2020)'s experiments and compared them against our five dependency parsing runs. The results for MCD1 and MCD3 are shown in Figure 6: for example, the top right-hand corner of the MCD1 matrix means that 29.99% of the test set was correctly parsed in all semantic and dependency parsing experiments, while the top right-hand corner of the MCD3 matrix indicates that only 7.68% of sentences were correctly parsed by both models in all experiments. The semantic parser fails in all five experiments on the majority of sentences. We do note some trends in error patterns between the models: for example, no sentences are correctly parsed by all semantic parsers without also being correctly parsed by the dependency parser at least a few times. However, overall it does not appear that dependency parsing performance is strongly related to semantic parsing performance.

B Proportion of Few-shot Constructions in MCD Splits
We examined whether the MCD splits differ in the proportion of test syntactic constructions which are few-shot, meaning that they appear in train fewer than four times. This analysis is similar to the one described in § 5.2. The proportions are plotted in Figure 5. The MCD splits pattern together, with far more few-shot constructions at each complexity than the random split.