Evaluating Universal Dependency Parser Recovery of Predicate Argument Structure via CompChain Analysis

Accurate recovery of predicate-argument structure from a Universal Dependencies (UD) parse is central to downstream tasks such as the extraction of semantic roles or event representations. This study introduces compchains, a categorization of the hierarchy of predicate dependency relations present within a UD parse. Accuracy of compchain classification serves as a proxy for measuring accurate recovery of predicate-argument structure from sentences with embedding. We analyzed the distribution of compchains in three UD English treebanks, EWT, GUM and LinES, revealing that these treebanks are sparse with respect to sentences whose predicate-argument structure includes predicate-argument embedding. We evaluated the CoNLL 2018 Shared Task UDPipe (v1.2) baseline (dependency parsing) models as compchain classifiers for the EWT, GUM and LinES UD treebanks. Our results indicate that these three baseline models perform worse on sentences whose predicate-argument structure has more than one level of embedding; we used compchains to characterize the errors made by these parsers and present examples of erroneous parses identified using compchains. We also analyzed the distribution of compchains in 58 non-English UD treebanks and then used compchains to evaluate the CoNLL'18 Shared Task baseline model for each of these treebanks. Our analysis shows that performance with respect to compchain classification is only weakly correlated with the official evaluation metrics (LAS, MLAS and BLEX). We identify gaps in the distribution of compchains in several of the UD treebanks, thus providing a roadmap for how these treebanks may be supplemented. We conclude by discussing how compchains provide a new perspective on the sparsity of training data for UD parsers, as well as on the accuracy of the resulting UD parses.


Introduction
The Universal Dependencies (UD) project (De Marneffe et al., 2014; Nivre et al., 2016) is a multilingual annotation scheme for dependency grammars that has gained wide usage (Zeman et al., 2017; Kong et al., 2017; Qi et al., 2020). Given this wide adoption, automatically identifying whether a dependency parse 1 is correct or incorrect, as well as the potential source of any errors, becomes an important part of NLP pipelines. For example, such identification can prevent errors from propagating to downstream applications such as the identification of predicate-argument structure, which is involved in semantic role labeling and sentiment analysis. 2 Furthermore, the embedding of sentences within sentences, and in particular the embedding of predicate-argument structures within one another, is one of the ways in which humans can generate an unbounded number of distinct ⟨sentence, meaning⟩ pairings, so it is important to evaluate whether a UD parser can accurately recover the predicate-argument structure of sentences with embedding. Thus, characterizing the limits of how accurately and consistently UD parsers assign predicate-argument structure, in the context of correct UD annotation, also becomes important (Nivre and Fang, 2017; Oepen et al., 2017; Fares et al., 2018; White et al., 2016; Reddy et al., 2017; Mille et al., 2018). That is the goal of this study.
In this study we introduce compchains, a categorization of the hierarchy of predicate dependency relations present within a Universal Dependency (UD) parse; this categorization serves as a proxy for predicate-argument structure. We use compchains to evaluate the accuracy of three (English) CoNLL 2018 Shared Task baseline models for the UDPipe dependency parser (Zeman et al., 2018). We found that the baseline model for the EWT UD treebank was more accurate than the baseline models for the LinES and GUM UD treebanks. We then use compchains to characterize the errors (relevant to predicate-argument structure) made by these models. We found that the accuracy of all three models dropped significantly when restricting the test set to samples with predicate-argument structure with embedding. Finally, we extended the analysis above to languages other than English, computing the distribution of compchains in 58 UD treebanks and evaluating the performance of the corresponding CoNLL 2018 Shared Task baseline models (for the UDPipe parser) as compchain classifiers. We conclude by discussing deficiencies in the distribution of predicate-argument structure with embedding present in the UD treebanks, as identified by our analysis.

Related Work
This section reviews prior work on the evaluation of (Universal) dependency parsers and the characterization of the errors these parsers make. The CoNLL Shared Task is a well-established benchmark for evaluating the performance of multilingual (Universal) dependency parsers (Buchholz and Marsi, 2006; Nivre et al., 2007; Zeman et al., 2017, 2018). The task uses a number of metrics to evaluate the accuracy of a parser, including: UAS (unlabeled attachment score), LAS (labeled attachment score), CLAS (content-word LAS) (Nivre and Fang, 2017), MLAS (morphologically-aware LAS) and BLEX (bilexical dependency score). However, these metrics rely on the attachment accuracy (of dependency relations) 3 and do not take into account that errors cascade: if the parser incorrectly attaches a dependency relation, it may then be forced to make yet another incorrect attachment (Ng and Curran, 2015), making it difficult to identify the provenance of an error.
In light of this, efforts to further characterize the errors have proceeded in several directions. One direction involves studying whether and how parsing errors result from the design of the dependency parser: McDonald and Nivre (2007) characterize and compare the errors produced by graph-based dependency parsers (e.g. MSTParser; McDonald and Pereira, 2006; see also Kiperwasser and Goldberg, 2016; Cheng et al., 2016; Zhang et al., 2016) and transition-based dependency parsers (e.g. MaltParser; Nivre et al., 2006); Zhang and Clark (2008) show how the two approaches to dependency parsing may be combined and document the resulting improvement in performance.
An alternative direction involves characterizing the errors in the context of linguistic theory. For example, Kummerfeld et al. (2012) introduced a method for classifying erroneous parse trees by repairing each tree with a series of tree transformations, with each transformation having a linguistic interpretation; Mahler et al. (2017) showed that it is possible to systematically break NLP systems for sentiment analysis by editing sentences with linguistically interpretable transformations. In this study we pursue the latter direction, opting to characterize erroneous parse trees by classifying their predicate-argument structure using compchains.

Compchains
Within a UD parse tree, predicate-argument structure (see Hale, 1993; Hale and Keyser, 2002, for further reference on predicate-argument structure) is encoded by core argument dependency relations, along with the special dependency relation root (see universaldependencies.org/u/dep/ for more details). The core argument dependency relations fall into two classes: predicate relations and nominal relations. In this study, we limit our attention to the two predicate dependency relations that encode embedding of clausal complements: (i) ccomp, a dependent clausal complement, and (ii) xcomp, a clausal complement lacking a subject; the subject is determined by an argument external to the xcomp, usually the object (or otherwise the subject) of the next higher clause. (xcomp is often used to model control/raising constructions, in which an argument in the embedded clause establishes a syntactic relation with the predicate in the matrix clause.) We focus on categorizing sequences of these two dependency relations (with POS tagged as VERB) that originate from the root of a dependency tree, intuitively the spine of the predicate-argument structure.

Figure 1: Examples of compchain classifications (left) for eight UD parses (right) produced by the UDv2.2 EWT baseline model using UDPipe 1.2. In each parse, the node with no incoming dependency relations is the root. Sentence 8 is classified as the ∅ compchain because the root is not marked as VERB.

This notion is formalized as follows:

Definition. A compchain is a finite sequence of dependency relations that traces a path starting at
the root node of a dependency parse tree and passing through only xcomp and ccomp dependency relations, subject to the constraints that: (i) every node in a compchain has the POS tag VERB; (ii) no node in a compchain has a child xcomp or ccomp dependency relation with POS VERB that is not also in the compchain. (This constraint ensures that if a UD parse tree has a compchain, it is unique and may be derived deterministically. It also implies that some valid UD parse trees have no compchain, e.g. a parse in which two xcomp dependency relations are both children of the same node; we use the symbol ∅ to denote that a UD parse tree has no compchain.) We denote a compchain by listing its sequence of dependency relations, starting from the root of the tree, using the notation R = root, X = xcomp, C = ccomp; e.g. we would denote the compchain [root → xcomp → ccomp] as RXC. See Figure 1 for examples of UD parses and their compchain classifications.

One way to evaluate (indirectly) how well a UD parser identifies predicate-argument structure for sentences in a UD treebank is to check whether the UD parse assigned by the parser to a sentence has the same compchain as the gold UD parse listed for that sentence in the treebank; we refer to this task as compchain classification. Performance on the compchain classification task is a proxy for performance on the task of classifying predicate-argument structure that includes predicate-argument embedding. If a UD parser performs poorly on the compchain classification task, predicate-argument structure cannot be reliably recovered from an (output) UD parse tree via top-down traversal of the sequence of dependency relations that forms the associated compchain. See Figure 2 for examples of incorrect compchain classifications that reflect the parser recovering incorrect predicate-argument structure.
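The definition above can be made concrete in code. The following is a minimal sketch, not the authors' implementation: the Token class and function name are our own, and a real pipeline would read CoNLL-U output from UDPipe rather than constructing tokens by hand.

```python
# Sketch of compchain extraction from a UD parse, under our own minimal
# token representation (id/head/deprel/upos, as in CoNLL-U columns).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Token:
    id: int        # 1-based token index, as in CoNLL-U
    head: int      # index of the head token (0 = root)
    deprel: str    # dependency relation label
    upos: str      # universal POS tag

def extract_compchain(tokens: List[Token]) -> Optional[str]:
    """Return the compchain (e.g. 'RXC') of a parse, or None for ∅."""
    by_head = {}
    for t in tokens:
        by_head.setdefault(t.head, []).append(t)
    roots = [t for t in tokens if t.deprel == "root"]
    if len(roots) != 1 or roots[0].upos != "VERB":
        return None          # constraint (i): every node must be a VERB
    node, chain = roots[0], "R"
    while True:
        # verbal xcomp/ccomp children of the current node
        kids = [t for t in by_head.get(node.id, [])
                if t.deprel in ("xcomp", "ccomp") and t.upos == "VERB"]
        if not kids:
            return chain
        if len(kids) > 1:
            return None      # constraint (ii): the chain must be unique
        node = kids[0]
        chain += "X" if node.deprel == "xcomp" else "C"
```

For example, a parse whose root verb governs an xcomp verb, which in turn governs a ccomp verb, is classified as RXC; a parse whose root verb has two verbal xcomp children has no compchain and is classified as ∅.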

Evaluation of English UD Treebanks
We evaluated the performance of the CoNLL'18 shared task baseline (parsing) models for English as compchain classifiers using three UD (v2.2) English treebanks: the English Web Treebank (EWT), with a total of 16,622 sentences (Silveira et al., 2014; Schuster and Manning, 2016); the English side of the English-Swedish Parallel Treebank (LinES), with a total of 4,564 sentences (Ahrenberg, 2007); and the GUM treebank, with a total of 4,390 sentences (Zeldes, 2017).

Figure 2: The parses in (1) and (2) are for the sentence "How come no one bothers to ask any questions in this section?" The parses in (3) and (4) are for the sentence "Even the least discriminating diner would know not to eat at Sprecher's." Both sentences were taken from the UDv2.2 English Web Treebank. (1) and (3) are the gold parses from the treebank, whereas (2) and (4) were produced by UDPipe using the CoNLL'18 baseline language model for UDv2.2 EWT. Both (2) and (4) are incorrectly classified, reflecting that these two parses encode misinterpretations (compared to the interpretations in their respective gold parses, i.e. (1) and (3)).

We began by computing the distribution of compchains in each of the sections (train, dev, test) of each treebank (see Table 1). We observed that although the training section of the EWT treebank includes a non-negligible number of UD parse trees that are classified (according to their corresponding gold UD parse) as compchains with three or more dependency relations, the test section of the EWT treebank does not. This suggests that performing well on the task of parsing the test section of the EWT treebank need not indicate competency in parsing sentences with predicate-argument embedding of degree two or more. We also observed that the LinES and GUM treebanks have a negligible number of parse trees (across all sections) that are classified as compchains with three or more dependency relations, i.e. RCC, RCX, RXC and RXX.
Next, we evaluated the CoNLL'18 shared task baseline (parsing) models for the three treebanks as compchain classifiers. Each model was trained on the training section of the UDv2.2 EWT, LinES or GUM treebank respectively, using the pretrained word embeddings supplied with the CoNLL Shared Task for that treebank; these embeddings were produced with word2vec (Mikolov et al., 2013a,b). We used UDPipe (v1.2), a transition-based non-projective dependency parser, along with its tagging and tokenization pipeline, to parse the test section of each of the three treebanks using the corresponding baseline model (Straka and Straková, 2017). We then classified the compchain of each UD parse and compared it to the compchain associated with the corresponding gold parse. We report the F-measures for this classification task in Table 2. We observed that the baseline model for EWT had the best performance as a compchain classifier. We also computed the per-compchain F-measures and observed that, for all three baseline models, the per-compchain F1-score for RX was notably better than for RC. We further observed a steep falloff in per-compchain F1-score as the number of dependency relations in a compchain increases. This suggests that either the parsers were not trained on enough examples of sentences with predicate-argument embedding, or that they did not adequately generalize from the limited number of examples they were trained on.
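The scoring in Table 2 can be reproduced in outline from parallel lists of gold and predicted compchain labels. The following is a minimal sketch in pure Python; the function name is our own, and the paper does not specify the tooling actually used.

```python
# Per-class F1 and support-weighted average F1 over compchain labels,
# treating compchain classification as a multi-way classification task.
from collections import Counter

def f1_scores(gold, pred):
    """Return ({label: F1}, weighted_avg_F1) for two parallel label lists."""
    labels = set(gold) | set(pred)
    tp, gold_n, pred_n = Counter(), Counter(gold), Counter(pred)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
    per_class = {}
    for lab in labels:
        prec = tp[lab] / pred_n[lab] if pred_n[lab] else 0.0
        rec = tp[lab] / gold_n[lab] if gold_n[lab] else 0.0
        per_class[lab] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    # weight each class's F1 by its support in the gold labels
    weighted = sum(per_class[l] * gold_n[l] for l in labels) / len(gold)
    return per_class, weighted
```

The weighted average corresponds to the bottom row of Table 2, and the per-class values to the remaining rows.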
Finally, we computed and analyzed the confusion matrix (i.e. error matrix) for each of the three baseline models, evaluating each model on the test section of its associated treebank (see Figure 3). In each confusion matrix, off-diagonal entries count instances of parses with erroneous predicate-argument structure, as indicated by the predicted compchain differing from the actual compchain (if two parse trees have different compchains, then their predicate-argument structure must differ as well). On-diagonal entries count instances of parses with correctly classified compchains, which indicates that the parse may be correct (though it may well have errors not related to predicate-argument structure). We observed, for all three models, that compchains of length two or less were very rarely misclassified as compchains of length three or more, and that compchains of length two were often misclassified as the R compchain (see Figure 2 for an example of such a misclassification).

Table 1: Distribution of compchains across the train, dev and test sections of the EWT, LinES and GUM (UD) treebanks.

              EWT                  LinES                GUM
        Train   Dev  Test    Train   Dev  Test    Train   Dev  Test
∅        5230   985  1065      591   191   224      879   201   268
R        5500   815   806     1767   608   580     1661   413   419
RC        758    79    79      135    43    43      171    43    33
RX        808   100   104      202    65    50      158    43    41
RCC        47     4     6        1     0     2        8     1     2
RCX        94     7     9       17     1     6       10     2     1
RXC        48     6     3       10     2     6        6     0     2
RXX        39     2     2       12     2     3       13     3     3
Total   12543

Table 2: F-measures for the compchain classification of the parse trees in the EWT, LinES and GUM (UD) treebanks. The leftmost column refers to the true compchain from the appropriate UD treebank. Each row gives the F1-score for the evaluation of the parser (as a compchain classifier) on sentences in the treebank that had the listed compchain, except for the bottommost row, which is the total (weighted) F1-score over all compchains, i.e. performance as a multi-way classifier.
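The confusion matrices behind Figure 3 amount to counting (gold, predicted) compchain pairs. A minimal sketch (pure Python; the function name is our own):

```python
# Confusion counts over compchain labels: off-diagonal cells correspond to
# parses whose predicate-argument structure must be wrong.
from collections import Counter

def confusion(gold, pred):
    """Return a Counter mapping (gold_label, pred_label) -> count."""
    return Counter(zip(gold, pred))
```

For instance, an RC sentence parsed with only the R compchain contributes one count to the ("RC", "R") cell, the kind of misclassification illustrated in Figure 2.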
We also observed that, in the case of the baseline model for LinES, the compchain RC is frequently misclassified as RX, whereas the compchain RX is rarely misclassified as RC; this asymmetry may reflect the difference in the number of training examples in the LinES treebank: 135 in the case of RC versus 202 in the case of RX (see Table 1).

Multilingual Evaluation of UD Treebanks
We also used the compchain classification task to evaluate the CoNLL'18 shared task baseline models (and the respective UD treebanks they were trained on) for languages other than English. This was motivated by the observation that, since the UD treebanks are derived from a variety of textual sources and thus have varying compchain distributions, we can use them collectively to evaluate and characterize the performance of the UDPipe dependency parser under various training conditions. Figure 4 presents the distribution of compchains across 61 UD treebanks, including the three English treebanks analyzed earlier in this study (see Table 4 in the appendix for a complete listing of the distribution of compchains in the training and test sections of each of the 61 treebanks). Our analysis reveals that: (i) the UD treebanks for Hindi, Urdu, Japanese, Korean, Turkish and Uyghur have no instances of the compchain RC in either the training or test sections; (ii) the UD treebanks for Hindi, Japanese, Turkish and Uyghur do not include any instances of compchains of length three or more (i.e. RXX, RCC, RXC or RCX) in either the training or test sections.
We computed the F1-scores for the performance of each baseline model on the compchain classification task; see Table 5 in the appendix for a complete per-treebank listing, including a breakdown of performance per compchain. The F1-score for length-1 compchains is very weakly correlated with the F1-score for length-2 compchains, with R² = 0.265 (see Figure 5), and the F1-scores for the two length-2 compchains (RC and RX) are also very weakly correlated, with R² = 0.177 (see Figure 6). This suggests that performance in recovering predicate-argument structures with differing embedding structures is largely unrelated and should be measured explicitly, just as the compchain classification task does. Additionally, we observe (as we did with the models trained on the English treebanks) a rapid decline in the per-class F1-score as the length of the compchain increases, in particular for compchains of length two or more (see Figure 7). This is revealing because, although the lack of compchains of length three or more in the UD treebanks suggests that we should not necessarily expect a dependency parser trained on a treebank to generalize outside the training domain, there is empirical evidence that humans have the capacity to acquire a grammar from sentences with at most degree-1 embedding (corresponding to compchains of length 2) and then later correctly parse sentences with a degree of embedding of two or more (Wexler and Culicover, 1980; Morgan, 1986; Lightfoot, 1989); thus, the poor performance on compchains of length three or more suggests that the CoNLL 2018 Shared Task baseline models are not able to generalize beyond the distribution of syntactic structures they were trained upon, in contrast to human learners.

Impact of Word Ordering
Word-ordering data (i.e. head-directionality) for each of the 61 languages in the UD treebanks was obtained from the WALS Online database (Dryer, 2013); we retrieved this information because word ordering dictates whether a predicate precedes or follows its complement in the linear order of the words in a sentence, and we wanted to understand whether this had an impact on the parser's performance on the compchain classification task (see Table 5 in the appendix for the word order of each language). The 47 languages with verb-object (VO) ordering had a median and mean weighted-average F1-score of 0.85 and 0.88 respectively; the 18 languages with object-verb (OV) ordering had a median and mean weighted-average F1-score of 0.86 and 0.85 respectively. Word ordering thus does not appear to impact the weighted-average F1-score. The F1-scores associated with compchains of length 2 (i.e. RX and RC) tell a different story: in the case of the RC compchain, the median F1-scores for verb-object and object-verb ordering were 0.68 and 0.55 respectively, and in the case of the RX compchain, the median F1-scores for verb-object and object-verb ordering were 0.72 and 0.42 respectively; thus, for both compchains of length 2, models trained on verb-object ordered languages performed significantly better than models trained on object-verb ordered languages. 12 Given that verb-object (i.e. head-initial) and object-verb (i.e. head-final) orderings control whether a language will be associated with right-branching or left-branching structures respectively, our results suggest that the UDPipe parser has difficulty dealing with left-branching structures.

Impact of Sentence Length
We carried out a regression analysis to investigate the relationship between the correctness of compchain classification and sentence length; this was motivated by the observation that sentences with higher degrees of embedding, and thus longer compchains, tend to be longer. For each test treebank, we fitted a logistic function over its sentences, with the log of the sentence length (i.e. the number of tokens, including punctuation) serving as the independent variable, and the (binary) dependent variable being whether the compchain associated with that sentence was correctly classified. We interpreted a well-fitting logistic function to indicate that compchain accuracy depends on sentence length. To evaluate the fit of the logistic function, we computed the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve for the fitted logistic function. Figure 8 presents the distribution of AUCs for the test corpus of each of: (a) the 43 UD treebanks for languages with verb-object (VO) word ordering, and (b) the 18 UD treebanks for languages with object-verb (OV) word ordering. We observe that the AUC for the majority of the treebanks falls between 0.55 and 0.65, and virtually none of the AUCs surpass 0.7, which is generally considered a minimum threshold for a binary classifier to be considered accurate. Additionally, we observe that the OV languages tend to have a slightly higher AUC than the VO languages. We conclude that the accuracy of compchain classification is weakly correlated with the log of the sentence length, and that this correlation is slightly higher for OV languages than for VO languages. (Similar results were obtained when the analysis was carried out directly on the sentence length.)
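Because ROC AUC is invariant under monotone transforms of the score, the AUC of a logistic function fitted to log sentence length can be computed (up to the orientation of the score) directly from the lengths themselves, via the Mann-Whitney rank statistic. The following sketch is our own simplification, in pure Python, not the authors' pipeline:

```python
# AUC for predicting per-sentence correctness from log sentence length,
# computed as the Mann-Whitney rank statistic (ties count as 0.5).
import math

def roc_auc(correct, lengths):
    """`correct`: bool per sentence; `lengths`: token count per sentence."""
    scores = [math.log(n) for n in lengths]
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        raise ValueError("need both correctly and incorrectly parsed sentences")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC near 0.5 means sentence length barely separates correctly from incorrectly classified sentences, which is what Figure 8 shows for most treebanks.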

Comparison with Other Eval. Metrics
In order to understand whether the compchain metric is simply a proxy for one of the three official evaluation metrics (LAS, MLAS and BLEX), we computed the pairwise linear correlation between the metrics across the 61 UD treebanks; the LAS, MLAS and BLEX scores for the CoNLL Shared Task baseline models were obtained from https://universaldependencies.org/conll18/baseline.html#baseline-results. Table 3 presents the coefficient of determination for each pairing of the metrics. We observe that although LAS, MLAS and BLEX are all highly correlated with one another, they are weakly correlated with the compchain metrics (i.e. the weighted average of the F1-score over all compchains and the per-compchain F1-scores); notably, performance on compchain classification for RX is very weakly correlated with LAS, MLAS and BLEX (R² < 0.1). This suggests that the compchain metric is measuring an aspect of the parser's performance that is not brought to the fore by any of the three official evaluation metrics, and that a baseline model having a good LAS, MLAS or BLEX score does not necessarily indicate that the model will correctly predict the embedding structure of a sentence with even a single level of embedding.
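The coefficient of determination in Table 3 is the squared Pearson correlation between two per-treebank metric vectors. A minimal sketch (pure Python; the function name is our own):

```python
# Coefficient of determination (squared Pearson r) between two
# equal-length lists of per-treebank metric values.
def r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)
```

Applied to, say, the per-treebank LAS values against the per-treebank weighted compchain F1-scores, this yields one cell of Table 3.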

Conclusion
In this study, we defined compchains and used them to evaluate how accurately a UD parser can parse sentences whose predicate-argument structure contains embedded clauses. We also used compchains to classify the errors, relevant to predicate-argument structure with embedding, made by a UD parser. Overall model performance on the compchain classification task (as measured by the weighted F-measure) was found to be dominated by parse trees in the training set with no embedding (compchain R); closer inspection of per-compchain performance revealed that parser accuracy dropped precipitously as the degree of embedding in the predicate-argument structure (i.e. the length of the compchain) increased. Finally, our results indicate that UD treebanks have very few parse trees with a degree of embedding (i.e. length of compchain) greater than two. This presents an opportunity: if the test sets of the UD treebanks were augmented with parses whose predicate-argument structure has a degree of embedding greater than two, then UD parsers could be evaluated in terms of their capacity to generalize from constructions (in the training set) with (mostly) low degrees of embedding, just as a child must in some models of first language acquisition (Wexler and Culicover, 1980; Berwick, 1985; Lightfoot, 1989).

A Appendix
Table 4 presents the distribution of compchains across 61 UD treebanks (including the three English treebanks analyzed earlier in this study). Table 5 presents the F1-scores for the performance of each baseline model on the compchain classification task. The rows of Table 4 and Table 5 were seriated using the Google OR-Tools library so that rows with similar values appear close together: Table 4 is seriated so that languages with similar compchain distributions are clustered together; Table 5 is seriated so that languages with similar F1-scores are clustered together.

Table 5: F1-scores for the compchain classification of each UD 2.2 gold treebank. The test section of each gold treebank was parsed using the corresponding pre-trained UDPipe language model; the compchain classification was computed for each pair of gold and parsed treebanks, and we report: (i) the weighted average F1-score (over all compchains); (ii) the (per-class) F1-score for each compchain. Entries for which the F1-score could not be computed due to a lack of support are marked with a dash ("-").