Annotations Matter: Leveraging Multi-task Learning to Parse UD and SUD

Using multiple treebanks to improve parsing performance has shown positive results. However, what role similar, yet competing, annotation decisions play in parser behavior is unclear. We investigate this within a multi-task learning (MTL) dependency parser setup on two parallel treebanks, UD and SUD, which, while possessing similar annotation schemes, differ in specific linguistic annotation preferences. We perform a set of experiments with different MTL architectural choices, comparing performance across various input embeddings. We find that languages tend to pattern in loose typological associations, but performance within an MTL setting is generally lower than that of single model baseline parsers for each annotation scheme. The main contributing factor seems to be the competing syntactic annotation information shared between treebanks in an MTL setting, which is shown in experiments against differently annotated treebanks. This suggests that how the annotation signal is encoded, and its influence on possible negative transfer, matters more than the choice of input embeddings in an MTL setting.

MTL is inherently designed to share information between tasks, which has helped various NLP components (Collobert and Weston, 2008). One active research question, however, is what information in specific tasks should be shared, as well as what indicators can be used to predetermine the cost-benefit trade-offs of MTL for a given application. Findings have shown that label distributions (Martínez Alonso and Plank, 2017), data sizes (Bollmann et al., 2018), and single task loss curves (Bingel and Søgaard, 2017) are all indicators of MTL performance. Different tasks, data sizes, and settings can all show different relative performance gains (Adouane and Bernardy, 2020). Thus, it is still an open question under which circumstances MTL can be used to achieve maximum performance boosts over a single task system.
In syntactic parsing, learning a closely related task (e.g. POS tagging) in a joint paradigm benefits overall performance (Bohnet and Nivre, 2012; Zhang and Weiss, 2016), and work has also exploited MTL by leveraging two or more treebanks against each other (see section 2). We often assume that simply increasing data and sharing syntactic information will inherently benefit all parsers, but this assumes that all syntactic sharing, specifically all annotation sharing, is positive and complementary. However, annotation decisions have been shown to favor parsing preferences (Rosa, 2015; Rehbein et al., 2017; Kohita et al., 2017). This means that it is not necessarily clear whether sharing annotations benefits all parsers equally. This is especially true if two annotation schemes choose drastically different approaches when annotating specific linguistic phenomena.
We look to examine this issue further by utilizing a set of treebanks that are annotated on parallel data, Universal Dependencies (UD; Nivre et al., 2016) and Surface-Syntactic Universal Dependencies (SUD; Gerdes et al., 2018), to examine how two competing syntactic annotation schemes behave when used in an MTL setup. Using parallel treebanks also removes the lexical variation and influences of domain differences that are present in most MTL treebank setups. Whether this is a positive or a negative in an MTL setup is unclear, but a reduction in domain differences tends to benefit single model parsers.
We utilize the graph-based Deep Biaffine Parser of Dozat and Manning (2017) in an MTL architecture, treating each UD and SUD treebank of a selected language as a task, and experiment with sharing different embeddings, layers, and loss functions. Additionally, we look at how different embeddings interact with these annotations, along with their role in encoding the signal utilized by the MTL parsers, and whether results follow any linguistic patterns. Finally, we perform additional experiments with treebanks from the SPMRL shared task (Seddah et al., 2013, 2014) to support our analysis. We look to investigate the following questions:

Multi-treebank setups have also shown gains in the work of Barry et al. (2019), though a single language source model yielded the best results. More directly related work is Johansson (2013), who shares features between two treebanks of the same language that differ in annotation schemes by identifying overlapping features. Using a graph-based parser, he achieved noticeable relative error reduction in UAS for four language pairs, with the largest performance gains on the smaller treebanks. This was followed by Johansson and Adesam (2020), who used a neural transition-based parser and leveraged a mixture of treebanks, three dependency and two constituency, against a single constituency treebank in a multi-treebank setup. They find that in all settings, performance on the target constituency treebank improves, with the highest gain coming from using all five as auxiliary treebanks. Kankanampati et al. (2020) use the Multidimensional Easy First approach introduced by Constant et al. (2016) to parse the Arabic CATiB (Habash and Roth, 2009) and its converted UD representation in a multi-task setup. They note that both treebanks showed error reduction, but that improvements were due to partial dependencies, and not primarily driven through lexical sharing.
Little direct work exists on extensive empirical investigations between UD and SUD with parsers. Recent work by Kulmizev et al. (2020) performed probing experiments across a set of languages to extract dependency graphs from BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018) language models, finding that both models prefer UD, with tree shape directly correlated to preference strength.
One of the advantages of MTL is the ability to share information as well as to alter objective functions between tasks. Early work examined the impact different loss functions have on downstream applications (Hall et al., 2011) and how, in a hierarchy of tasks, the sharing of individual layers benefits other tasks differently, with lower level task sharing being most beneficial (Søgaard and Goldberg, 2016).
Both hard and soft sharing of parameters have proven successful. Duong et al. (2015) exploited soft parameter sharing between different cross-lingual treebanks possessing the same annotation schemes, achieving results on the target language with only half the needed annotated data. Soft sharing of parameters allows for nuances between treebanks of the same language while hard sharing all other parameters (Stymne et al., 2018).
Parameter sharing has proven effective in both monolingual (Guo et al., 2016) and multilingual parsing (Ammar et al., 2016; Kitaev et al., 2019). However, which parameters are optimal to share, and where to share them in the architecture, particularly in cross-lingual setups, is not consistent, as shown in extensive experiments in sharing word and character LSTM parameters.

UD have become a de facto standard as a source of treebanks for dependency parsing. A main annotation choice in UD is the prioritization of content words as heads. While some functional distinctions are kept, such as those between subjects and objects, many others are merged, such as complements and adjuncts. Importantly, function words are dependents of the content words.
SUD were developed as a counter-balance to UD, with the belief that UD are not syntactically motivated enough, and with a particular linguistically argued objection to the prioritization of content words as heads, stemming from the belief that the distributional context of words should drive headedness. While many individual labels are kept, several are collapsed into a single label (e.g. nsubj & csubj → subj). The primary result of function words becoming heads is the inherent reversal of the syntactic relationships of many words. Fig. 1 is an example of how the SUD conversion alters an English sentence from its original UD representation. One of the more noticeable differences is that the projective UD tree is non-projective in the SUD scheme. The main cause, in this example, is that the auxiliary verb can is the root in SUD, rather than the content word do as in UD. Furthermore, the only word to retain the same head between the two representations is what, while all others have new heads. We wish to emphasize, however, that not all trees show such stark contrasts; we simply want to highlight how a single choice in annotation can produce distinctly different trees, and the resultant impact on non-projectivity.
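The non-projectivity contrast above can be checked mechanically: a dependency tree is projective if and only if no two arcs cross. The following is a minimal sketch; the representation (1-indexed tokens with 0 marking the root) is illustrative rather than any particular treebank format:

```python
def is_projective(heads):
    """heads[i] is the head of word i+1 (words are 1-indexed); 0 is the root.
    Returns True iff no two dependency arcs cross."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[i + 1:]:
            # Two arcs cross when exactly one endpoint of one arc
            # lies strictly inside the span of the other.
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

print(is_projective([2, 0, 2]))      # projective 3-word tree
print(is_projective([3, 4, 0, 3]))   # arcs (1,3) and (2,4) cross
```

Counting trees for which this returns False, per treebank, would reproduce the kind of non-projectivity statistics discussed in section 3 below.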
By using UD and SUD, we eliminate one of the variables in many multi-treebank setups, the different distribution of the underlying vocabulary. This effectively eliminates domain differences between the treebanks (see section 3.2), as both parsers will get more similar outputs from the BiLSTM layer, and identical ones in a joint loss setting.
We use UD and SUD version 2.7 and select 12 different languages from 10 language families. This was done in order to capture sufficient linguistic variation in terms of how UD and SUD may impact various linguistic phenomena found in typologically different languages, and subsequently their annotation schemes. 1 Table 1 presents statistics on the treebanks with respect to their variation in training and dev sizes. Additionally, we also note the proportion of non-projective trees found in each annotation scheme. 2 All languages show a higher number of non-projective trees in SUD when compared to their UD counterparts, but for some the difference is much more substantial. A noticeable example is Chinese (zh), which has 40% (absolute) more non-projective trees in its SUD treebank compared to its UD counterpart. Noticeable increases can also be seen in Arabic and German, but most languages show only moderate differences. Hungarian (hu) is interesting, as it is the only language that shows a high proportion of non-projective trees in UD and only a moderate increase for SUD.

MTL Parsing Architecture
We use the PyTorch (Paszke et al., 2019) implementation of the Biaffine parser of Dozat and Manning (2017) provided by , 3 and extend it to an MTL architecture. 4 We modify the base parser by treating the parsing of each annotation scheme as a separate task. Each task shares the BiLSTM layer that is used to encode the concatenation of all input embeddings. These BiLSTM encodings are then passed through dimension reducing MLPs to strip away arc and relation information deemed not relevant. We implement two MLP schemes, one in which we share them across tasks (shared; Figure 2A) and the other in which each task has its own MLP layers (unshared; Figure 2B). Considering the overlap in the annotation schemes, a shared MLP setting allows us to examine the behavior of sharing information between the two annotation schemes when irrelevant information is minimized. Finally, in order for the model to learn task specific information, we apply task specific biaffine attention layers to the MLP outputs to produce scores for both arcs and labels.
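The scoring path described above can be sketched in a few lines. The following is a minimal numpy illustration with toy dimensions and parameter names of our own choosing, not the actual implementation: the shared BiLSTM states are reduced by head/dependent MLPs, and a task-specific biaffine product scores every head-dependent pair.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_lstm, d_mlp = 5, 8, 4          # 5 words; toy dimensions

H = rng.normal(size=(n, d_lstm))    # shared BiLSTM states, one row per word

def mlp(W, b, X):
    # Single dimension-reducing ReLU layer.
    return np.maximum(0.0, X @ W + b)

# Per-task parameters. In the shared-MLP setting, W_head/W_dep would be
# reused across tasks; the biaffine tensors are always task specific.
params = {}
for task in ("UD", "SUD"):
    params[task] = dict(
        W_head=rng.normal(size=(d_lstm, d_mlp)), b_head=np.zeros(d_mlp),
        W_dep=rng.normal(size=(d_lstm, d_mlp)),  b_dep=np.zeros(d_mlp),
        U=rng.normal(size=(d_mlp + 1, d_mlp)),   # extra row adds a head bias
    )

def arc_scores(task):
    """Return an (n, n) matrix where entry [d, h] scores word h
    as the head of word d for the given task."""
    p = params[task]
    H_head = mlp(p["W_head"], p["b_head"], H)
    H_dep = mlp(p["W_dep"], p["b_dep"], H)
    H_dep1 = np.concatenate([H_dep, np.ones((n, 1))], axis=1)
    return H_dep1 @ p["U"] @ H_head.T
```

Label scoring works analogously with a separate biaffine tensor per relation label; both tasks read the same `H`, which is what makes the sharing (and any negative transfer) possible.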
The common practice in MTL is to have separate losses for the different tasks and to optimize each of them separately (alternating loss; Ruder, 2017). This is particularly the case when the different tasks do not share the same input. However, our dataset contains parallel sentences, albeit with different annotations. It thus becomes possible to experiment with using a joint loss for training both tasks, as the parsers receive the same input, and a joint loss has shown improvements when jointly learning POS tags and dependency parsing (Li et al., 2018). We do this by optimizing the sum of the losses of each of the tasks. Since the losses of both tasks are of nearly the same magnitude, we do not have to worry about imbalance and a simple sum suffices. 5 We experiment with both types of losses.
In the alternating loss setting, we randomly choose a task from the given tasks and then randomly choose a batch of sentences along with their annotations from that task before calculating the loss of that batch and backpropagating the errors. In a given epoch we choose sentences without replacement. For joint loss, we randomly choose a batch of the same sentences from both tasks, along with their different annotations. Losses are calculated based on those annotations and summed together before backpropagating the errors. We posit that joint loss should allow for faster convergence, as both tasks affect the parameter updates of the shared layers simultaneously, helping the optimization process move towards the goal more quickly.
The two choices of losses combined with the optional sharing of MLP layers give rise to four different experimental settings: alternating-unshared, alternating-shared, joint-unshared, and joint-shared.
In addition, we experiment with internally randomly initialized word and POS 6 embeddings, external embeddings (FastText; Bojanowski et al.

Results
The overall performances of the four experimental settings, namely alternating vs joint loss and shared vs unshared MLP layers, are very close to each other. The convergence statistics for joint and alternating loss settings are reported in Table 2. 7 It can be noted that despite taking a greater number of epochs to converge when compared to alternating loss, joint loss converges faster in terms of time because it performs the forward propagation through the shared layers only once for both tasks, whereas alternating loss has to perform it separately for each task.
As we are more interested in the MTL parser behavior, Tables 2 and 3 summarize the results across settings. (We use gold POS tags, and all experiments were performed on Nvidia V100 GPUs.)

To analyze the impact of different embedding types on the MTL parsing setup, we change the specificity of information by using different embedding types with the MTL parser, as discussed in section 3.2, the results of which are presented in Fig. 3. We see that adding more information yields higher LAS across languages (moving from left to right on the heatmap), with the concatenation of all embeddings (rightmost columns) performing the best.
However, given that we are more interested in examining whether the parallel UD-SUD treebanks can benefit from an MTL setup, we choose instead to focus on how the MTL parsers compare to the single UD and SUD baseline parsers across the different embedding choices. Fig. 4 shows a heatmap depicting the difference of the mean LAS of all four settings with respect to the corresponding single task baseline parser, for each embedding input. 8

Figure 4: Heatmap illustrating the performance difference between MTL parsers compared to the corresponding single task baseline parser. Each block represents the difference between the mean LAS score of the four MTL settings and the respective single task baseline LAS score.

The mean drop in LAS scores for MTL settings when compared to the baselines across all languages and all the different feature embeddings (432 runs) is reported in Table 3, with lower numbers indicating better performance. No particular setting shows a significant improvement over the others. Keeping this in consideration, we still see that joint loss performs slightly better than alternating loss. Sharing of MLP layers also seems to help a little compared to the setting where we have task specific layers. As mentioned in section 3.2, the role of the dimension reducing MLPs is to remove all the information that is not necessary for performing the task at hand. This would indicate that the two tasks remove similar unnecessary information, thereby sharing the signal necessary for making parsing decisions.
One of the most striking observations is that randomly initialized word embeddings (seen in the two far left columns) are noticeably lighter across all languages. This stands in stark contrast to the subsequent FastText (FT), word+char, and FT+char embeddings. Hungarian shows particularly noticeable improvements, though this may be due to its size. However, given that we also see some moderate improvements for English, Greek, and Russian, the size of the treebanks is not the only contributing factor.
Once inputs include word embeddings initialized with FT embeddings or randomized char embeddings, we see some interesting trends. Finnish, Hungarian, Korean, Turkish, and Russian show consistent degradation in performance with any inclusion of character-based embeddings. Some languages show more stable results regardless of the input embeddings, namely Greek, English, French, and Chinese. Vietnamese clearly performs worse when using FT embeddings to initialize the word embeddings, but is otherwise rather stable in other settings. However, even with BERT (B) embeddings, we do not see any noticeable improvement over the baselines.
When we begin incorporating POS tag embeddings, we see that the drops relative to the baseline for these languages become less pronounced, but few languages ultimately show improvements, with Hungarian being the noticeable exception. The reduction in performance degradation when including POS tag embeddings continues across all settings. The exception is Finnish, which still shows large drops in almost all settings, but shows slightly reduced drops when including BERT embeddings.

UD vs SUD
We see, in general, a systematic decrease in performance when using parallel UD and SUD treebanks in an MTL setup across many languages. When looking for linguistic behaviors, we can clearly see that the agglutinative languages (Hungarian, Finnish, Turkish, and Korean) all suffer severe performance drops when using character-based embeddings, but the concatenation of POS embeddings helps mitigate the degradation. The absolute differences for Hungarian and Finnish are noticeably different from one another. This may be somewhat unexpected, given that they are in the same language family. However, the modern forms are quite different, and treebank sizes may play a role, as the input embeddings pattern similarly overall between the two.
The morphological complexity of other languages in relation to their behavior is not necessarily a good predictor of behavior. However, if we view the other eight remaining languages on a continuum of fusional and analytical properties, we can see some general patterns.
Russian, a fusional language, patterns with the agglutinative languages in its behavior with character embeddings, but is also one of the more morphologically rich languages (MRLs) among the non-agglutinative languages. The other more fusional MRLs, German and Greek, also do not see as much volatility, although German tends to be worse relative to the baseline, while Greek shows some more positive results, though this could again be due to treebank sizes. English and French are more analytical than the other fusional languages and contain far less morphology. While both show rather consistent minimal degradation regardless of the input embedding, English occasionally shows some improvement, while French shows virtually none.
Arabic and Vietnamese, however, are somewhat odd cases. Arabic is both fusional and an MRL, whereas Vietnamese is much more analytical. Vietnamese, though, patterns more with the other MRLs, while Arabic patterns more like the analytical language, particularly with its less overall performance degradation compared to baselines across settings. Vietnamese shows one of the larger performance drops relative to the baseline compared to the other eight languages when using FT embeddings in the input, but this is diminished when combined with additional embeddings.
Chinese, an extremely analytical language, presents an additionally interesting case. The LAS for SUD is on average roughly 10% absolute lower than its UD counterpart, as can be seen in Fig. 3. This is probably a direct result of the SUD treebank having 40% (absolute) more non-projective trees. However, this massive disparity in non-projectivity has seemingly not resulted in additional performance degradation in the MTL setup (as seen in Fig. 4), suggesting that sharing between treebanks that show large differences in non-projectivity is not necessarily detrimental.
Given the general behavior across settings, the performance degradation can most likely be attributed to negative transfer derived from the different annotation preferences UD and SUD encode, which is not seen in the single model baselines. When different embeddings are used, the negative transfer is either accentuated depending upon the language, as seen with character embeddings, or mitigated, as with POS embeddings. Interestingly, although character-based embeddings show significant improvements over word embeddings in the single baseline models, the signals they encode seem to be detrimental in an MTL setting, as performance drops relative to the respective single model baselines. This would seemingly suggest that in an MTL setup, word and POS embeddings encode more beneficial signals that help both annotation schemes, reducing possible negative transfer from each treebank, whereas character embeddings are maximally beneficial when used to train a single model. This is in line with recent work showing that the linguistic information POS tags convey, when highly accurate or gold, still has value for specific use cases, and is beneficial in certain dependency parsing architectures or as auxiliary tasks (Anderson and Gómez-Rodríguez, 2020; Zhou et al., 2020). One specific, often overlooked annotation issue that helps convey this point is punctuation. 9 In both annotation schemes, punctuation attachment is rather straightforward: punctuation is simply attached directly to the root. However, as seen in Fig. 1, it is one of the competing annotation decisions, because in UD the root is a content word, while in SUD the root is a function word.
Thus, while straightforward from an annotation perspective for both schemes, an MTL system is now learning both attachment possibilities simultaneously, and preferences and errors regarding both are now being encoded in the global attachment decisions. When looking at specific attachment errors, across almost all examined experiments there were substantial increases in punctuation attachment errors. This can be seen as a direct result of switching between content and function oriented headedness, which creates systematic, competing attachment decisions for an MTL parser exposed to both attachment possibilities.
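The scale of such competing attachments can be quantified directly on parallel trees by counting how many tokens change heads under the conversion. A minimal sketch, with hypothetical toy head arrays rather than values drawn from the actual treebanks:

```python
def head_divergence(ud_heads, sud_heads):
    """Fraction of tokens whose head differs between the two
    parallel annotations of the same sentence (heads are
    1-indexed token positions; 0 marks the root)."""
    assert len(ud_heads) == len(sud_heads)
    diff = sum(u != s for u, s in zip(ud_heads, sud_heads))
    return diff / len(ud_heads)

# Toy 5-token sentence: only token 4 keeps its head after conversion.
ud  = [3, 3, 0, 3, 3]
sud = [2, 0, 2, 3, 2]
print(head_divergence(ud, sud))   # 0.8
```

Averaged over a parallel treebank pair, this kind of statistic would make the competing-signal argument above directly measurable per language.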

UD and SUD vs SPMRL
To further explore whether the competing annotation decisions between UD and SUD are indeed contributing to the noticeable performance degradation, we choose to compare a subset of languages in an MTL setup but with a differently annotated treebank. Using a different treebank runs the risk of adding additional domain issues into our experiments; however, character-level embeddings have proven effective at handling OOV words (Ballesteros et al., 2015; Vania et al., 2018), thus domain differences should be reduced.

9 The CoNLL 2018 evaluation scores punctuation. We are not, however, making a claim as to whether punctuation is or is not a linguistic issue; rather, we simply highlight that it is an annotation attachment issue that illustrates different possible attachment distributions between the annotation schemes. From a linguistic perspective, punctuation can be argued to be irrelevant; from a parsing perspective, unless removed, it still influences attachment decisions.
We perform experiments where we use the Arabic (Habash and Roth, 2009), German (Brants et al., 2004), and Hungarian (Vincze et al., 2010) treebanks from the SPMRL shared task (Seddah et al., 2013, 2014), each of which was annotated with language specific linguistic phenomena in mind. 10 To mitigate size difference issues, as the smaller treebank tends to benefit more in a multi-treebank setup (Johansson, 2013), we randomly select a train and dev set from the SPMRL data to match the corresponding sizes of the UD-SUD treebanks. 11 Results for the SPMRL experiments using char+FT embeddings are presented in Table 6. 12 All results in the UD-SPMRL and SUD-SPMRL MTL experiments show improved performance over the baseline. Importantly, this includes settings in which the UD-SUD MTL experiments show noticeable decreases relative to the baseline, and specifically we see that character-based embeddings are able to yield benefit in an MTL setup relative to the baseline.
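The size matching described above amounts to a seeded random subsample of the SPMRL sentences. A minimal sketch; the sentence list, target size, and seed here are placeholders, not the actual experimental values:

```python
import random

def match_size(spmrl_sents, target_size, seed=42):
    """Randomly select target_size sentences (without replacement)
    from the larger SPMRL split so that both treebanks in the MTL
    pair are the same size."""
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    return rng.sample(spmrl_sents, target_size)

spmrl_train = [f"sent_{i}" for i in range(5000)]   # placeholder sentences
subset = match_size(spmrl_train, 1800)             # placeholder target size
```

The same call with a second seed would produce the matched dev split; sampling without replacement guarantees no sentence appears twice within a split.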
10 We refer the reader to the cited papers for more detailed information on the individual treebank annotations.

11 We note that both the German and Hungarian UD-SUD treebanks are derived from small sections of the TiGer and Szeged treebanks respectively, which are also the treebanks used for the SPMRL data, thus there is a possibility of sentence overlap in the random selection. The UD-SUD Arabic treebank is derived from the Prague Arabic Dependency Treebank (Hajič et al., 2004) but is also annotated on newswire.

12 Results using word+POS embeddings are provided in the Appendix.

These results suggest that the annotation schemes are indeed contributing to why UD and SUD make poor tasks for each other in an MTL setup, and not strictly the embeddings themselves. Rather, the information conveyed by each individual annotation scheme is important in terms of the possible gains that MTL parsers can make over the baseline parsers. It may simply be that how the annotations are embedded into the architecture and shared is more influential in what signals are encoded in the network than the embeddings themselves, in terms of how they benefit treebanks in an MTL setup. If the annotations themselves encode information that results in negative transfer in the network due to their competing nature, an MTL setup cannot benefit as effectively.

Conclusion
We implemented an MTL architecture leveraging the parsing of UD and SUD as separate tasks to examine how their syntactic annotation overlaps and differences influence parser behavior. We find that models from an MTL setup generally perform worse than their single model baselines, regardless of input embeddings. Interestingly, POS embeddings seemingly help mitigate some of the performance loss caused by negative transfer, as the POS information may help resolve possible linguistic ambiguities with which character embeddings struggle (Vania et al., 2018; Smith et al., 2018b). This stands in contrast to much multi-treebanking research, which has yielded positive performance gains when using multiple treebanks, particularly if they are of the same language, though this is not always the case (Barry et al., 2019).
We then further investigated the possible influence annotations have in an MTL setup by training a subset of SPMRL treebanks against their UD-SUD counterparts, finding increases in performance across the chosen languages and input embeddings not seen when pitting UD and SUD against each other. We argue that this indicates that in an MTL setup, simply adding another treebank is not inherently going to yield better performance; rather, the information that each additional treebank can learn from the other, specifically from their annotation schemes, and how this is subsequently encoded in the network, is a more pivotal factor in yielding performance gains.
We conclude that the syntactic annotation schemes are pertinent when determining performance gains in an MTL parsing setup, as extensively competing annotations provide too many mixed signals in an MTL architecture, hampering the ability of both parsers to benefit from shared information and yielding worse results.
Future research will include incorporating more treebanks with different annotation schemes to examine in which directions and annotations parsers will optimize towards in MTL. We also wish to further explore how constituency parsing and dependency parsing can be leveraged against each other in similar MTL setups.

B Heatmaps for Alternating vs Joint Loss settings
Alternating Loss (UAS) | Joint Loss (UAS)

Figure 7: Heatmaps depicting the alternating and joint loss settings. Each block represents the mean UAS score across shared and unshared MLP settings for the corresponding embedding and loss setting.
Unshared LAS | Shared LAS

Figure 8: Heatmaps depicting the unshared and shared MLP settings. Each block represents the mean LAS score across joint and alternating loss settings for the corresponding embedding and MLP setting.

Table 6: Results for MTL experiments with the SPMRL dataset. All MTL experiments are trained with the alternating batch loss setting to allow for comparison with experiments involving SPMRL. We cannot use joint loss when training SPMRL with either UD or SUD, as they are not parallel treebanks. The UD-SUD experiment shows results for UD and SUD when they are trained together in an MTL setting, whereas the SPMRL experiment shows results for UD and SUD when each of them is trained separately along with the corresponding SPMRL dataset instead of each other (UD-SPMRL & SUD-SPMRL).