Comparing Learnability of Two Dependency Schemes: ‘Semantic’ (UD) and ‘Syntactic’ (SUD)

This paper contributes to the thread of research on the learnability of different dependency annotation schemes: one ('semantic') favouring content words as heads of dependency relations and the other ('syntactic') favouring syntactic heads. Several studies have lent support to the idea that choosing syntactic criteria for assigning heads in dependency trees improves the performance of dependency parsers. This may be explained by postulating that syntactic approaches are generally more learnable. In this study, we test this hypothesis by comparing the performance of five parsing systems (both transition- and graph-based) on a selection of 21 treebanks, each in a 'semantic' variant, represented by standard UD (Universal Dependencies), and a 'syntactic' variant, represented by SUD (Surface-syntactic Universal Dependencies). Unlike previously reported experiments, which considered the learnability of 'semantic' and 'syntactic' annotations of particular constructions in vitro, the experiments reported here consider whole annotation schemes in vivo. Additionally, we compare these annotation schemes using a range of quantitative syntactic properties, which may also reflect their learnability. The results of the experiments show that SUD tends to be more learnable than UD, but the advantage of one scheme over the other depends on the parser and the corpus in question.


Introduction and Background
This paper compares the learnability of two approaches to dependency annotation. One, represented by Universal Dependencies (UD; http://universaldependencies.org/), favours content words over function words as dependency heads, as this increases the cross-linguistic uniformity of the resulting scheme; here we will call this approach 'semantic'.1 The other, represented by Surface-Syntactic Universal Dependencies (SUD; https://surfacesyntacticud.github.io; Gerdes et al., 2018, 2019), uses purely syntactic criteria for determining headedness; hence the moniker 'syntactic'. The SUD scheme was designed to be minimally different from ('near-isomorphic to') UD, and many UD treebanks have been converted to SUD, so differences in learnability between the two approaches should be relatively easy to assess and interpret. As is clear from Figure 1, which juxtaposes the UD basic tree (at the top) and the SUD tree (at the bottom), SUD generally adopts the principle that function words such as auxiliaries (e.g., do), subordinating conjunctions (until), copulas ('re), and prepositions (with) are heads of the relevant constructions. The SUD representation of coordination, on the other hand, is similar to that of UD, except that where UD attaches all non-initial conjuncts to the first conjunct, SUD attaches each conjunct to the immediately preceding one; when there are just two conjuncts, the annotations are the same.
Previous results suggest that the syntactic scheme should be more learnable. For example, Schwartz et al. (2012) compare the learnability of alternative annotations of four kinds of constructions which differentiate UD-like and SUD-like schemes:2 preposition-noun (e.g., of Rome), a class which also includes complementiser-clause constructions (e.g., after you go); to-infinitival (e.g., to eat); modal-verb (e.g., can come); and coordination. The experiments involved five different parsers (representing both transition-based and graph-based methodologies) and two different learnability measures (including one based on attachment scores). The results of these experiments favour SUD-like representations in all four cases.
In the case of constructions involving a preposition or a complementiser, having them as heads (as in SUD, but unlike in UD) results in extremely strong ('unanimous') learnability improvements. The effect is weaker in the case of verb groups containing a modal, and weaker still in the case of infinitivals introduced by to, but in both cases having the main lexical verb as the dependent (as in SUD, but unlike in UD) gives generally better results. A similar range of constructions is inspected in Silveira and Manning (2015). For each kind of construction, three different variants of conversion from semantic to syntactic headedness are considered, depending on how many of the dependents of the semantic head are moved to the syntactic head. The best variant gives significant improvements in the learnability of the syntactic scheme in the case of preposition-noun (but not complementiser-clause) and auxiliary-verb constructions (rather than the more general modal-verb constructions considered in Schwartz et al., 2012), and, most clearly, in the case of copula-predicate constructions. Other papers that report better learnability of a more syntactic scheme converted automatically from a more semantic one include: Nilsson et al. (2006, 2007) (auxiliary-verb constructions in Arabic, Czech, Dutch and Slovene, with a small improvement observed for the transition-based MaltParser but not for the graph-based MSTParser), Rosa (2015) (adposition-noun constructions in 30 languages), Kohita et al. (2017) (various constructions involving function and content words in 19 typologically varied languages), and Rehbein et al. (2017) (15 languages, although the extent of the improvements varied considerably, and in the exceptional case of Turkish a regression was observed for all three parsers used in the experiments).3
On the other hand, de Lhoneux and Nivre (2016) report on an experiment involving 24 languages, in which the original UD representation of verb groups (modal-verb constructions) turns out to be more learnable by MaltParser than the converted representation with main verbs acting as dependents of modal verbs. In a similar vein, Wisniewski and Lacroix (2017) report that languages and particular constructions vary drastically in the extent to which the syntactic or the semantic approach to headedness is more learnable by their own transition-based parser. However, out of the seven constructions they consider (similar to those considered in Silveira and Manning, 2015), four differentiate UD and SUD, and of these four, two (copula-predicate and case-noun, but not mark-verb) are more learnable in the syntactic encoding in the majority of languages, copula-predicate constructions by a wide margin (75% of languages). Unfortunately, the paper does not present the full results of the experiments, so it is not clear whether there is a correlation between, say, language family and learnability of particular representations of particular constructions.
The current paper is methodologically closest to Rehbein et al. (2017) and Kohita et al. (2017): it reports the results of experiments performed on multiple corpora of typologically diverse languages, and it compares the learnability of different annotation schemes applied to the same underlying texts. However, the novelty of the current paper lies in comparing the learnability of two comprehensive, linguistically informed annotation schemes rather than a real scheme and an artificial scheme differing from it in the headedness of a single construction or a small number of constructions. That is, unlike the previous experiments reported in the literature cited above, the experiments reported here were performed in vivo rather than in vitro. This matters, as any realistic annotation scheme which employs a more 'syntactic' approach to headedness than UD will also differ from UD in the repertoire and distribution of dependency labels, and will also take into account the intrinsic linguistic interaction between various constructions. The co-existence of large and high-quality treebanks in their UD and SUD variants presents a unique opportunity to compare the learnability of 'semantic' and 'syntactic' annotation schemes in a realistic setup.

Data
Treebanks. Experiments were performed on a subset of UD 2.6 treebanks and the corresponding SUD 2.6 treebanks created by the SUD team. 21 treebanks (in each annotation scheme), representing 18 languages, were selected on the basis of three criteria. First of all, emphasis was put on the quality of treebanks, so only those (mainly Indo-European) with a quality score higher than 70% were used, as evaluated by the official UD script by Dan Zeman: https://github.com/UniversalDependencies/tools/blob/master/evaluate_treebank.pl. Second, in order to obtain robust results, only relatively large corpora, of over 70k tokens, were selected.4 Third, due to limited computational power, an upper bound on treebank size had to be set, at 1000k tokens. Three languages (Italian, Polish and Swedish) are represented by two treebanks each, which may give some insight into how stable certain trends are within one language.

4 The code necessary to perform all of the actions described in this section can be found on our GitHub page: https://github.com/ryszardtuora/ud_vs_sud
Preprocessing. The original UD and SUD treebanks were preprocessed before the experiments were carried out. In particular, the representation of multiword tokens was normalised to the format in which, say, the French form du is represented in the CoNLL-U format by two lines (one corresponding to de, with the information 'SpaceAfter=No', and the other to le) rather than three (one for de, another for le, and another for their contraction du). This was done to remove some inconsistencies between the training and testing subsets of some corpora. Additionally, all tokens with PUNCT as their UPOS tag were removed, unless they had dependents.
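The punctuation-filtering step can be sketched as follows. This is a minimal illustration over a simplified list-of-dicts token representation (the actual preprocessing operates on CoNLL-U files, and the helper name remove_leaf_punct is our own):

```python
def remove_leaf_punct(sentence):
    """Remove PUNCT tokens that have no dependents, reindexing the rest.

    `sentence` is a list of dicts with 1-based 'id', 'head' (0 = root)
    and 'upos' keys, one dict per token of a parsed sentence.
    """
    heads = {tok["head"] for tok in sentence}
    # A PUNCT token is kept only if some other token depends on it.
    keep = [t for t in sentence if t["upos"] != "PUNCT" or t["id"] in heads]
    # Map old 1-based ids to new consecutive ids (virtual root 0 stays 0).
    new_id = {0: 0}
    for i, tok in enumerate(keep, start=1):
        new_id[tok["id"]] = i
    for tok in keep:
        tok["id"] = new_id[tok["id"]]
        tok["head"] = new_id[tok["head"]]
    return keep
```

Since removed tokens have no dependents by construction, no surviving token's head can point at a removed one, so the reindexing is always well-defined.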
Where possible (i.e., in the case of UDPipe, UUParser, and COMBO), pretrained fastText word embeddings (https://fasttext.cc/docs/en/crawl-vectors.html; Grave et al., 2018) were utilised, as opposed to learning embeddings during the training process. The fastText architecture is based on embeddings of character n-grams, but only the resulting word-level vectors were used in the training procedure, as all of the selected systems which offer an option of including external embeddings can work with word embeddings only. Each embedding model was pruned to the 300,000 most frequent forms to ease the computational load.

Parsers
Two transition-based and three graph-based parsers were used in the experiments. Some of these tools offer robust pipelines for NLP, including tokenisation, lemmatisation and tagging, but in the current experiments only the parser component of each tool was trained; in particular, POS tags were extracted from the gold standard and used as features. Below, the training procedure of each parser is described separately, mentioning only those hyperparameters that differ from the default settings.5

UDPipe. Version 1.2.0 (http://ufal.mff.cuni.cz/udpipe; Straka and Straková, 2017) of this transition-based parser was used, but not with the default hyperparameter values, as these were fitted on UD and could thus skew the results against SUD. Instead, 21 models were trained on each treebank (in either annotation scheme): for each available transition system (projective, swap, link2), seven models were trained using random hyperparameter search, a feature provided by UDPipe that randomises some of the training hyperparameters.

Mate.
Version 3.62 (Bohnet, 2010) of this graph-based parser was utilised; it was adapted for our study from version 3.61 (available at http://code.google.com/p/mate-tools/).6 Seven models were trained for each treebank (and each annotation scheme), and in every training run a different non-projective approximation threshold was selected from the following list: 0.75, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1.
UUParser. Both the transition-based and the graph-based versions of UUParser 2.4 (https://github.com/UppsalaNLP/uuparser; de Lhoneux et al., 2017) were applied in the experiment. UUParser is an adaptation of the BIST parser (Kiperwasser and Goldberg, 2016); among other changes, the swap transition system and the Eisner algorithm replace the projective counterparts used by the BIST parser in the transition-based and graph-based versions, respectively. The universal POS tag embedding dimension was set to 20, and the external word embedding dimension was adapted to the size of the embeddings used. Five models with different random seeds were trained, and the one which performed best, as measured by LAS on the dev set, was then selected for testing.
COMBO. The graph-based dependency parsing component of version 1.0.1 (https://gitlab.clarin-pl.eu/syntactic-tools/combo) of the COMBO pipeline (Rybak and Wróblewska, 2018) was utilised, with word embeddings, characters, and gold UPOS tags as features. For each treebank, four models with different combinations of learning rate (0.001 or 0.002) and dropout probability (0.4 or 0.25) were trained, for 100 epochs each.

6 The implemented change forces the parser to produce only one root in each sentence. We thank Bernd Bohnet for adjusting the parser to our needs and for allowing us to share the new Mate 3.62 version on our GitHub page.

Evaluation
In each case, models produced by the parsers on the basis of the training sets were used to parse the test parts of the respective treebanks. Hyperparameter selection (based on LAS) and early stopping were performed on the development set.
The official conll18_ud_eval.py script (http://universaldependencies.org/conll18/evaluation.html) was used to calculate UAS and LAS scores, both during hyperparameter selection and during final testing. Due to differences in the annotation of labels in UD and SUD, the script had to be modified. In UD, syntactic relations are divided by a colon into two parts: the first part refers to the universal dependency taxonomy, while the second part, after the colon, is a relation subtype specific to one language or a group of related languages. For example, advmod is a general UD relation that refers to adverbial dependents, while advmod:arg is specific to Polish and refers to obligatory adverbial arguments, advmod:df is specific to Chinese and Cantonese and refers to durative and frequentative noun phrases, etc. In SUD, on the other hand, some general relation names contain the colon; e.g., comp:pred is used for copulae and comp:aux for auxiliary verbs.
The conll18_ud_eval.py script ignores the part after the colon during evaluation (if a parser predicts advmod:df instead of just advmod, or vice versa, this counts as a match). In the case of SUD, leaving out the part after the colon would result in incomplete labels. Hence, the evaluation script was modified so that labels are processed differently in the case of SUD: if the part of the relation after the colon is aux, pred, obj, or obl, it is not split off, and a full match of the predicted relation is necessary.7 These label manipulations are applied only at the evaluation stage; during the training phase, parsers learn the full spectrum of dependency labels.
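The label processing can be sketched as follows; this is a reconstruction of the modification described above, not the actual modified script, and the function name is ours:

```python
# SUD subtypes that are kept during evaluation, because dropping them
# would leave the label incomplete.
SUD_KEEP = {"aux", "pred", "obj", "obl"}

def eval_label(deprel, scheme):
    """Truncate a dependency label at the colon for evaluation.

    In UD, the subtype after the colon is always dropped; in SUD it is
    kept when it is one of the subtypes in SUD_KEEP.
    """
    if ":" not in deprel:
        return deprel
    base, subtype = deprel.split(":", 1)
    if scheme == "sud" and subtype in SUD_KEEP:
        return deprel
    return base
```

Evaluation then compares eval_label of the gold and predicted relations instead of the raw labels.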

UAS and LAS scores
The results of the experiments are presented in Table 1 (on the next page). Out of 210 comparisons, 58 gave statistically significant results using the strict version of McNemar's test, with α = 0.001.8 Of these, 46 favour SUD and 12 favour UD; this confirms the generally, but not unanimously, higher learnability of the 'syntactic' scheme. (Taking into account all 210 comparisons, the result is 146:64 in favour of SUD.) There is a clear difference between the transition- and graph-based parsers in this respect. The former (UDPipe and transition-based UUParser) have no clear preferences: their SUD:UD scores in statistically significant differences are 4:4 and 6:4, respectively (and 23:19 and 23:19 across all differences). The latter (Mate, COMBO, and graph-based UUParser) strongly prefer SUD, with respective significant scores of 14:1, 8:2, and 14:1 (and all scores of 34:8, 32:10, and 34:8). Particular parsers show similar preferences for SUD or UD in terms of UAS and LAS, apart from Mate, whose preference for SUD is 5:1 in terms of significant UAS differences and 9:0 in terms of LAS.
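The significance test can be sketched as an exact (binomial) McNemar test on the per-token correctness of the two models being compared; note that the paper describes the test only as 'strict' with α = 0.001, so the choice of the exact binomial variant below is an assumption:

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact (binomial) McNemar test.

    b: items where only the first system is correct,
    c: items where only the second system is correct.
    Returns the p-value for the null hypothesis that both systems
    are equally accurate (items where both agree are uninformative
    and are not passed in).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # One-sided tail of Binomial(n, 0.5), doubled and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For instance, with 11 disagreements all favouring one system, the p-value falls just below the 0.001 threshold used in the paper.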
Moreover, the mean differences in UAS and in LAS are similar for each parser. In the case of UAS, the mean differences between UD and SUD are −0.03, −0.61, −0.36, −0.04, and −0.48 for UDPipe, Mate, COMBO, transition-based UUParser, and graph-based UUParser, respectively (i.e., SUD is preferred on average), and in the case of LAS, the differences are 0.01, −0.86, −0.42, −0.14, and −0.60 (that is, apart from UDPipe, parsers tend to prefer SUD). The highest difference between these two metrics concerns Mate's results (δ = 0.20); the difference is minimal in the case of, for instance, UDPipe's results (δ = 0.04).
As to particular corpora, when one scheme is more learnable according to one parser, it tends to be more learnable according to other parsers as well. Only in the case of the Polish PDB treebank do different parsers have significantly different preferences: the two transition-based parsers and COMBO significantly prefer UD (all with respect to the LAS score), while the graph-based UUParser prefers SUD (with respect to UAS).
8 The strict version of McNemar's test was employed here in order to minimise the false discovery rate; as Table 1 reports the results of 210 comparisons, the weaker test, with α = 0.05, would likely produce some false significance claims, while with the stricter version the probability that all statistically significant claims are correct is over 0.94.

Table 1: UAS and LAS results of the five parsing systems on the UD and SUD versions of the corpora. Differences are in green if UD annotation is more learnable and in red if SUD annotation is more learnable; statistically significant (p < 0.001) differences are additionally in bold. See Table 2 for full names of corpora.

Interestingly, this relative stability in preferences for SUD or UD concerns particular corpora (and, to some extent, languages; especially the two Swedish corpora behave similarly on most parsers) but not language families. This is especially clear in the case of Germanic languages. While English and, to a smaller extent, the two Swedish treebanks show a strong preference for the 'syntactic' scheme across all five parsing systems, the German treebank favours the 'semantic' scheme in nearly all cases. The proportion of statistically significant differences (relative to all comparisons) is high in the group of Germanic languages (14 out of 40). Slavic languages, of which nine treebanks were included in the study, appear to follow a similar pattern, although to a smaller extent. Czech shows a preference for the UD scheme (three statistically significant differences in favour of UD and zero in favour of SUD), the learnability of Polish PDB appears to depend on the parser used (as discussed earlier), Polish LFG and Slovenian do not show any significant preferences, while Croatian, Russian (strongly), Slovak and Serbian present higher learnability in the SUD scheme: all observed statistically significant differences are in favour of SUD.
On the other hand, Romance languages do not show strong preferences for either scheme. In total, five statistically significant preferences were found in this language group, of which three are in favour of UD and two in favour of SUD. In the case of Baltic (Latvian and Lithuanian) and Finno-Ugric (Estonian) languages, all of the statistically significant differences (14 in total) favour the SUD scheme. However, since only two Baltic treebanks and one Finno-Ugric treebank were included in the study, we refrain from drawing any conclusions about these language groups; a more comprehensive study is needed.
To what extent do these results reflect the headedness decisions of the two schemes, i.e., the preference for content heads in UD and for functional heads in SUD? It is important to note that, unlike in the previous experiments, the differences between the two annotation schemes do not only concern headedness, but also the repertoire and meanings of dependency labels. The number of basic dependency labels (as defined in §2.3) is consistently smaller in the case of SUD than in the case of UD, which may favourably bias parsers towards SUD. For instance, the English GUM treebank has 48 and 43 different labels in UD and SUD, respectively, and applying the label processing described in §2.3 results in a further reduction to 36 and 25 different labels, respectively, making the task easier for parsers in the case of SUD than in the case of UD. (These predictions are confirmed by the results concerning Label Entropy, reported in §3.2 below.) Hence, the results reported in this section cannot at this stage be interpreted as showing that 'functional headedness' tends to be more learnable than 'content headedness', although they are compatible with that claim; further experiments are needed to confirm or refute it.

Quantitative syntactic properties
In an attempt to find which quantifiable syntactic properties of treebanks may impact the differences in parsing performance, five different metrics were calculated (see Table 2 on the next page). Some of these properties differ substantially between UD and SUD. Two notable examples are Average Dependency Length (ADL) and Average Token Depth (ATD), properties which are inversely related to each other. ADL is calculated so that the length of a dependency between neighbouring tokens is equal to one, and each intervening token increases it by one. ATD is calculated taking only non-root tokens into consideration; immediate children of the root have depth equal to one, and each intervening token on the path from a node to the root adds one to the depth of the token. SUD is characterised by deeper trees (with higher ATD), and UD by flatter trees and longer dependency arcs (i.e., higher ADL). SUD treebanks have, without exception, higher ATD and lower ADL than their UD counterparts.
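Under these definitions, both metrics can be computed directly from the head indices of a sentence. The sketch below assumes a dict from 1-based token ids to head ids (0 for the virtual root) and excludes root arcs from ADL, since they have no linear length:

```python
def adl_atd(heads):
    """Compute average dependency length (ADL) and average token
    depth (ATD) for one sentence.

    `heads` maps 1-based token ids to head ids (0 = root).
    Dependency length: adjacent tokens have length 1, each intervening
    token adds 1 (i.e., the absolute difference of positions).
    Depth: immediate children of the root have depth 1, and each
    further step on the path to the root adds 1.
    """
    lengths = [abs(tok - head) for tok, head in heads.items() if head != 0]
    depths = []
    for tok in heads:
        d, node = 0, tok
        while node != 0:
            node = heads[node]
            d += 1
        depths.append(d)
    return sum(lengths) / len(lengths), sum(depths) / len(depths)
```

Corpus-level ADL and ATD would then average these per-token quantities over all sentences.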
These differences may be important, as there is growing evidence that natural languages tend to minimise dependency lengths (see, e.g., Temperley and Gildea, 2018 and references therein). Nevertheless, as shown in Table 3, differences between UD and SUD in ADL and ATD are not significantly correlated with differences between UD and SUD in terms of UAS or LAS.
In addition, two entropy-based measures were calculated: Arc Direction Entropy (ADE) and Label Entropy (LE). ADE quantifies the rigidity of word order in a given corpus: given two tokens connected by a dependency arc, the label of the relation, and the UPOS tags of the tokens, how much certainty can we have about the linear ordering of these tokens (head-initial vs. head-final)? Arguably, the more consistent the word order, the easier the task of parsing becomes. As expected, the English treebanks have the lowest ADE, whereas free-word-order languages such as Polish show higher entropy. Label Entropy, on the other hand, is the entropy of the frequency distribution of dependency labels across the treebank. It is calculated by iterating over all tokens in the treebank and counting their dependency labels; this frequency distribution is treated as a probability distribution and used for calculating entropy. LE was introduced because the SUD scheme has substantially smaller sets of dependency labels, and LE offers a more informed way of assessing the baseline difficulty of a label scheme than the mere cardinality of the labelset. SUD versions of the same treebanks are in all cases characterised by lower LE. Both measures were calculated using the dependency label transformations defined in §2.3, i.e., with the exception of some SUD labels, all labels were split at the colon.

Table 2: Quantitative syntactic properties of the treebanks used in the experiment: average dependency length ADL, average token depth ATD, arc direction entropy ADE, label entropy LE, percentage of non-projective trees NPROJ. Columns marked with ∆ represent differences between UD and SUD; in green if a given figure is higher for UD, and in red otherwise.
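Label Entropy, for instance, reduces to the Shannon entropy of the label frequency distribution; a minimal sketch (in bits, although the paper does not state which logarithm base was used):

```python
from collections import Counter
from math import log2

def label_entropy(labels):
    """Shannon entropy (in bits) of the dependency-label distribution.

    `labels` is the multiset of (already truncated, per §2.3)
    dependency labels of all tokens in the treebank.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    # Relative frequencies are treated as probabilities.
    return -sum(c / total * log2(c / total) for c in counts.values())
```

A treebank where one label dominates has LE near zero, whereas a uniform distribution over k labels yields log2(k) bits, so LE captures both the size and the skew of the labelset.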
We were not able to confirm the correlation between differences in learnability and differences in ADE reported by Gulordava and Merlo (2016) (on the basis of artificially created data) and by Rehbein et al. (2017). Most probably, this is because the differences in ADE between SUD and UD are very small, much smaller than in the experiments cited in these two papers. In fact, the differences are so small that we would prefer to be cautious in interpreting the one statistically significant, positive correlation, which concerns the COMBO parser.
Following the results from the papers cited above, one would expect a negative correlation between ADE and learnability (i.e., higher ADE leading to lower learnability). The opposite is the case here. This result is puzzling; it is possible that the correlation is in fact spurious.
Another aspect of tree structure which is significant in this context is the proportion of non-projective arcs (and consequently non-projective trees) in the treebanks. As shown in the last column of Table 2, marked NPROJ, SUD is characterised by a consistently larger proportion of non-projective trees. On average, conversion into SUD increases the percentage of non-projective trees in a treebank 3.52 times (up to 10.32 times in the case of Polish LFG). This is consistent with previous experiments in the domain; e.g., Kohita et al. (2017) report that, after applying syntactic-like transformations, the ratio of non-projective arcs in the training sets increased by 10 percentage points on average. Non-projective dependency structures are notoriously hard to parse for humans, and so one might expect a similar effect in computational settings. However, modern parsers are able to handle non-projective trees; they offer particular transition systems (e.g., UDPipe) or hyperparameters (e.g., Mate) that can be manipulated in order to better fit treebanks with a certain degree of non-projectivity. Correlations between the difference in the percentage of non-projective trees between UD and SUD treebanks and learnability scores are presented in the last column of Table 3. Only one statistically significant correlation, −0.50, can be observed: in the case of UDPipe, with respect to the UAS score.

Table 3: Correlations (cor) between 1) the learnability difference between UD and SUD and 2) differences in values of various corpus measures: average dependency length ADL, average token depth ATD, arc direction entropy ADE, label entropy LE, percentage of non-projective trees NPROJ. Statistically significant (p < 0.05) values are in bold.
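Counting non-projective arcs (and hence non-projective trees) can be sketched as follows, using the standard definition: an arc is non-projective if some token linearly between its endpoints is not transitively dominated by the arc's head.

```python
def nonprojective_arcs(heads):
    """Count non-projective arcs in one sentence.

    `heads` maps 1-based token ids to head ids (0 = root).
    """
    def dominated_by(node, ancestor):
        # Walk up the tree from `node`; True if `ancestor` is reached.
        while node != 0:
            node = heads[node]
            if node == ancestor:
                return True
        return False

    count = 0
    for dep, head in heads.items():
        if head == 0:
            continue  # root arcs are projective by convention
        lo, hi = min(dep, head), max(dep, head)
        if any(not dominated_by(t, head) for t in range(lo + 1, hi)):
            count += 1
    return count
```

A tree is then non-projective if it contains at least one such arc, and the NPROJ figure is the percentage of such trees in the treebank.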

Conclusions
While some initial work suggested a clear relation between the learnability of dependency parsing and the 'semantic' or 'syntactic' approach to headedness, with 'syntactic' annotations usually reported as more learnable, the experiments often had a very limited scope: they concerned one language, just one or a very small number of constructions, or just one or two parsers. More extensive experiments, performed on a number of languages and taking into account a handful of constructions and a few parsers, such as those reported in Rehbein et al. (2017), showed that this relation between learnability and different approaches to headedness, even though imperfect, generally favours syntactic-like approaches, but also suggested a more stable correlation between learnability and other corpus characteristics (such as ADE). All these experiments were performed in vitro, on the basis of dependency corpora with one or just a few constructions reanalysed for the purpose of the experiments.
In contrast, the current paper presents the results of comparing two full-fledged annotation schemes in vivo: the 'semantic' UD and the 'syntactic' SUD. The experiments confirm that it cannot be claimed that more 'syntactic' approaches to annotation uniformly lead to better learnability: this depends on particular languages (rather than on language families) and on particular parsers. However, corpora annotated according to the SUD scheme tend to be more learnable, especially by the graph-based parsers utilised in the experiments.
As to correlations between corpus characteristics and learnability, the experiments show a clear correlation between Label Entropy and parsers' performance (especially in terms of UAS), which suggests that SUD may benefit from its smaller set of labels, the lower variability of its labels, or both. A correlation was also found between learnability and the difference in the percentage of non-projective trees between the two schemes, in the case of UDPipe (again, in terms of UAS). This may suggest an inability of this parser to deal effectively with higher degrees of non-projectivity, even though some hyperparameter tuning was performed to address this issue. On the other hand, the results do not confirm the recent hypothesis that the learnability of the two kinds of annotations is negatively correlated with arc direction entropy. In the same vein, we have not found statistically significant correlations between parser performance and average dependency length.
Future work should seek to dissociate the effect of more learnable dependency labels from that of different approaches to headedness; to this end, experiments should be performed on corpora with trees topologically just like those in UD and SUD, but with label schemes modified so that Label Entropy is not correlated with learnability. Also, it would be interesting to relate the learnability of particular schemes by particular parsers to their inherent dependency displacement bias (cf. Anderson and Gómez-Rodríguez, 2020). Clearly, many more experiments of this sort are needed to establish the exact factors influencing the learnability of dependency parsers.