Climbing the Tower of Treebanks: Improving Low-Resource Dependency Parsing via Hierarchical Source Selection

Recent work on multilingual dependency parsing has focused on developing highly multilingual parsers that can be applied to a wide range of low-resource languages. In this work, we substantially outperform such "one model to rule them all" approaches with a heuristic selection of languages and treebanks on which to train the parser for a specific target language. Our approach, dubbed TOWER, first hierarchically clusters all Universal Dependencies languages based on their mutual syntactic similarity, computed from human-coded URIEL vectors. For each low-resource target language, we then climb this language hierarchy, starting from the leaf node of that language, and heuristically choose the hierarchy level at which to collect training treebanks. This treebank selection heuristic is based on (i) the aggregate size of all treebanks subsumed by the hierarchy level and (ii) the similarity of the languages in the training sample to the target language. For languages without development treebanks, we additionally use (ii) for model selection (i.e., early stopping) in order to prevent overfitting to the development treebanks of the closest languages. TOWER shows substantial gains for low-resource languages over two state-of-the-art multilingual parsers, with gains of more than 20 LAS points for some of those languages. Parsing models and code are available at: https://github.com/codogogo/towerparse


Introduction
Syntactic parsing - grounded in a wide variety of formalisms (Taylor et al., 2003; De Marneffe et al., 2006; Hockenmaier and Steedman, 2007; Nivre et al., 2016, inter alia) - has been the backbone of natural language processing (NLP) for decades, and an indispensable preprocessing step for tackling higher-level language understanding tasks. A recent major paradigm shift in NLP towards large-scale pretrained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020) and their end-to-end fine-tuning for downstream tasks has reduced the downstream relevance of supervised syntactic parsing. What is more, there is mounting evidence that PLMs implicitly acquire rich syntactic knowledge through large-scale pretraining (Hewitt and Manning, 2019; Chi et al., 2020) and that exposing them to explicit syntax from human-coded treebanks does not offer significant language understanding benefits (Kuncoro et al., 2020; Glavaš and Vulić, 2021). In order to implicitly acquire syntactic competencies, however, PLMs need language-specific corpora at a scale that can be obtained for only a tiny portion of the world's 7,000+ languages. For the remaining vast majority of languages - with limited-size monolingual corpora - explicit syntax still provides a valuable linguistic bias for more sample-efficient learning in downstream NLP tasks.
Reliable syntactic parsing requires annotated treebanks of reasonable size: this prerequisite is, unfortunately, satisfied for even fewer languages. Despite multi-year, well-coordinated annotation efforts such as the Universal Dependencies project (Nivre et al., 2016, 2020), language-specific treebanks are unlikely to appear anytime soon for most world languages. This renders the transfer of syntactic knowledge from high-resource languages with annotated treebanks a necessity. A truly zero-shot transfer for low-resource languages assumes a set of training treebanks from resource-rich source languages and a target language without any syntactic annotations. Effectively, the task is then to identify the subset of source treebanks on which a trained parser would yield the best parsing performance for the target language. An exhaustive search over all possible subsets of source treebanks is not only computationally intractable (there are 2^N − 1 non-empty subsets of a collection of N source treebanks) but also uninformative in true zero-shot scenarios, in which there is no development treebank (i.e., no syntactically annotated data at all) for the target language. Most existing transfer methods therefore either (1) choose one (or a few) best source languages for each target language (Rosa and Zabokrtsky, 2015; Agić, 2017; Lin et al., 2019; Litschko et al., 2020) or (2) train a single multilingual parser on all available treebanks; such parsers, based on pretrained multilingual encoders, currently produce the best results in low-resource parsing (Kondratyuk and Straka, 2019; Üstün et al., 2020). Other transfer approaches, e.g., those based on data augmentation (Şahin and Steedman, 2018; Vania et al., 2019), violate the zero-shot assumption by requiring a small target-language treebank - a requirement unfulfilled for most world languages, for the vast majority of which not a single manually annotated syntactic tree exists. In this work, we propose a simple and effective heuristic for selecting a good set of source treebanks for any given low-resource target language.
In our approach, named TOWER, we first hierarchically cluster all Universal Dependencies (UD) languages. To this end, we compute the syntactic similarity of languages by comparing manually coded vectors of their syntactic properties from the URIEL database (Littell et al., 2017). We then iteratively 'climb' that language hierarchy level by level, starting from the leaf node of the target language. We stop 'climbing' (i.e., select the set of source treebanks subsumed by the current hierarchy level) when the relative decrease in linguistic similarity of the training sample w.r.t. the target language outweighs the increase in the size of the training sample. We additionally exploit the linguistic similarity between the target language and its closest sources with existing development treebanks to inform a model selection (that is, early stopping) heuristic. TOWER substantially outperforms state-of-the-art multilingual parsers - UDPipe (Straka, 2018), UDify (Kondratyuk and Straka, 2019), and UDapter (Üstün et al., 2020) - on low-resource languages, while offering comparable performance for high-resource languages.

Climbing the TOWER of Treebanks
Constructing the TOWER. We start by hierarchically clustering the set of 89 languages from Universal Dependencies (UD version 2.5) based on their syntactic similarity. To this end, we represent each language with its syntax_knn vector from the URIEL database (Littell et al., 2017). Features of these 103-dimensional vectors correspond to individual syntactic properties from manually coded linguistic resources such as WALS (Dryer and Haspelmath, 2013) and SSWL (Collins and Kayne, 2009). URIEL's syntax_knn strategy replaces feature values missing in those resources with kNN-based predictions (cf. Littell et al. (2017) for more details). We then carry out hierarchical agglomerative clustering with Ward's linkage (Anderberg, 2014), with Euclidean distances between URIEL vectors guiding the clustering. Figure 1 shows a dendrogram of one part of the resulting hierarchy; we display the complete hierarchy in the Appendix. The syntax-based clustering largely reflects memberships in language (sub)families, with a few notable exceptions: e.g., Tagalog (tl), from the Austronesian family, appears to be syntactically similar to (and is joined with) Scottish Gaelic (gd), Irish (ga), and Welsh (cy) from the Celtic branch of the Indo-European family.
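As a concrete illustration, the clustering step can be sketched with SciPy's hierarchical clustering routines; the language codes and random vectors below are placeholders for the real URIEL syntax_knn features:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Illustrative subset of UD language codes; in the real setup, each row of
# `vectors` would be the language's 103-dimensional URIEL syntax_knn vector.
langs = ["tl", "gd", "ga", "cy", "en"]
rng = np.random.default_rng(0)
vectors = rng.random((len(langs), 103))  # random stand-in for URIEL features

# Agglomerative clustering with Ward's linkage over Euclidean distances.
# Z encodes the full binary merge tree: each of its len(langs) - 1 rows is
# one join (cluster_i, cluster_j, merge_distance, merged_cluster_size).
Z = linkage(vectors, method="ward", metric="euclidean")
```

The linkage matrix `Z` plays the role of the language "tower": each row is one join of the hierarchy, which the treebank selection heuristic climbs bottom-up.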
Treebank Selection (TBS). For a given test treebank, we start climbing the hierarchy from the leaf node of the treebank's language. Let s_l denote the number of climbing steps taken from the target leaf node l. If the target test treebank also has a corresponding training portion, in-treebank training constitutes the first training configuration (we denote this configuration with s_l = −1). For resource-rich languages with several training treebanks, we create the next training sample by concatenating all of those treebanks (we denote this level with s_l = 0; e.g., for Russian, the training portions of the Russian GSD, PUD, and SynTagRus treebanks). For low-resource target languages without any training treebanks, the first training sample is collected at s_l = 1, where the language is joined with other languages. The training set corresponding to a hierarchy level (i.e., each join in the tree) concatenates all training treebanks of all languages (i.e., leaf nodes) of the respective hierarchy subtree. Note that the number of climbs s_l needed to reach some hierarchy level depends on the language l: e.g., the hierarchy level joining Tagalog (tl) with Scottish Gaelic, Irish, and Welsh ({gd, ga, cy}) is reached in s_l = 1 climbs from Tagalog, s_l = 2 climbs from Scottish Gaelic, and s_l = 3 climbs from Irish and Welsh.

Let {S_n}_{n=0 (or −1)}^{N} be the set of training configurations collected by climbing the hierarchy starting from the target language l, and let S_n = ∪_{k=1}^{K} T_k be the n-th training set, consisting of K training treebanks. As we climb the hierarchy (i.e., as n increases), the training set S_n is bound to grow; at the same time, the sample of training languages becomes increasingly dissimilar w.r.t. the target language l. In other words, as we climb higher up the induced syntactic hierarchy of languages, we train on more data, but from a mixture of (syntactically) more distant languages. Let l_k be the language of the training treebank T_k. We then quantify the syntactic similarity sim(S_n, l) between the training set S_n and the target language l as follows:

sim(S_n, l) = Σ_{k=1}^{K} (|T_k| / |S_n|) · cos(l_k, l),   (1)

with cos(l_k, l) as the cosine similarity between the URIEL vectors of l_k and l, and the relative sizes of individual treebanks, |T_k| / |S_n|, as weights.
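Eq. (1) amounts to a size-weighted average of cosine similarities. A minimal sketch, assuming each training treebank is given as a (size, URIEL-vector) pair (`sim` is an illustrative helper name, not from the released code):

```python
import numpy as np

def sim(treebanks, target_vec):
    """Eq. (1): size-weighted average of cosine similarities between the
    URIEL vector of each training treebank's language and the target's.
    `treebanks` is a list of (treebank_size, language_vector) pairs."""
    total = sum(size for size, _ in treebanks)
    score = 0.0
    for size, vec in treebanks:
        cos = np.dot(vec, target_vec) / (
            np.linalg.norm(vec) * np.linalg.norm(target_vec))
        score += (size / total) * cos
    return score
```

When all training treebanks share the target's URIEL vector, the similarity is exactly 1; adding treebanks of distant languages pulls it down in proportion to their size.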
We then use the following simple heuristic to select the best training set S_n: we stop climbing when the relative growth of the training set becomes smaller than the relative decrease of the similarity with the target language, i.e., we select the smallest n for which the following condition is satisfied:

(|S_{n+1}| − |S_n|) / |S_n| < (sim(S_n, l) − sim(S_{n+1}, l)) / sim(S_n, l).   (2)

Model Selection (MS). Early stopping based on model performance on a development (dev) set is an important mechanism for preventing overfitting in supervised machine learning. In a truly zero-shot transfer setup, on the one hand, we do not have any development data in the target language. Model selection based on the development set of the source language, on the other hand, overfits the model to the source language, which may hurt the effectiveness of the cross-lingual transfer (Keung et al., 2020; Chen and Ritter, 2020). For test treebanks with a respective development portion, TOWER uses that development set for model selection. For low-resource languages l without development treebanks, we compile a proxy development set D_l = ∪_{k=1}^{K} D_k by collecting all development treebanks D_k from the hierarchy level closest to l that encompasses at least one treebank with a development set. Intuitively, the more syntactically similar D_l is to l, the more beneficial model selection based on D_l will be for performance on l: the optimal model checkpoint w.r.t. l should be closer to the checkpoint exhibiting the best performance on D_l. Accordingly, with M as the (index of the) model checkpoint with the best performance on D_l, we select the checkpoint M′ = sim(D_l, l) · M (see Eq. (1)) as the "optimal" checkpoint for the target language l.

Shallow Biaffine Parser. TOWER employs the shallow biaffine parser of Glavaš and Vulić (2021), stacked on top of the pretrained XLM-R (Conneau et al., 2020). Compared to the standard biaffine parser (Dozat and Manning, 2017; Kondratyuk and Straka, 2019; Üstün et al., 2020), this shallow variant forwards word-level representations (aggregated from subword outputs) directly into the biaffine products, bypassing the deep feed-forward transformations that produce dependent- and head-specific vectors (Dozat and Manning, 2017). The shallow variant is reported to perform comparably (Glavaš and Vulić, 2021), while being faster to train.
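The TBS stopping condition and the MS checkpoint heuristic can be sketched as follows; `select_level` and `select_checkpoint` are illustrative names, and rounding the scaled checkpoint index to the nearest epoch is our assumption:

```python
def select_level(levels):
    """TBS stopping heuristic: `levels` holds (total_size, similarity) for
    the candidate training sets S_0..S_N, ordered bottom-up. Stop at the
    smallest n where the relative growth of the training set falls below
    the relative drop in similarity to the target language (Eq. (2))."""
    for n in range(len(levels) - 1):
        size_n, sim_n = levels[n]
        size_next, sim_next = levels[n + 1]
        growth = (size_next - size_n) / size_n
        drop = (sim_n - sim_next) / sim_n
        if growth < drop:
            return n          # climbing further is not worth it
    return len(levels) - 1    # reached the root of the hierarchy

def select_checkpoint(best_epoch, dev_sim):
    """MS heuristic: scale the epoch that performed best on the proxy dev
    set D_l by sim(D_l, l); rounding to an epoch index is our assumption."""
    return max(1, round(dev_sim * best_epoch))
```

For instance, if the proxy dev set is only moderately similar to the target, the selected checkpoint is pulled towards earlier epochs, which counteracts overfitting to the dev treebanks of the closest languages.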

Evaluation and Discussion
Treebanks and Baselines. We evaluate TOWER on 138 (test) treebanks from Universal Dependencies (Nivre et al., 2020). We compare TOWER against two state-of-the-art multilingual parsers: UDify (Kondratyuk and Straka, 2019) and UDapter (Üstün et al., 2020).

Training and Optimization Details. We limit input sequences to 128 subword tokens. We use XLM-R Base with L = 12 layers and hidden size H = 768, and apply dropout (p = 0.1) to its outputs before forwarding them to the shallow parsing head. We train in batches of 32 sentences and optimize parameters with Adam (Kingma and Ba, 2015), with an initial learning rate of 10^−5. We train for 30 epochs, with early stopping based on the dev loss.

Results and Discussion. We show detailed results for all 138 treebanks in the Appendix. In Table 1, we show averages over different treebank subsets: treebanks on which both TOWER and (1) UDify (∩UDify; 111 treebanks) and (2) UDapter (∩UDapt; 39 treebanks) have been evaluated, (3) 12 high-resource languages on which UDapter was trained (HIGH), and (4) 11 low-resource treebanks (LOW) for which all three models have been evaluated. We show LAS scores for the languages from HIGH and LOW in Figure 2; similar trends are observed with UAS scores. TOWER outperforms UDify and UDapter in all setups except HIGH, with especially pronounced gains for LOW. This renders TOWER particularly successful for the intended use case: low-resource languages without any training data. Admittedly, the fact that TOWER is built on XLM-R, whereas UDify and UDapter use mBERT, impedes a direct "apples-to-apples" comparison. Two sets of results, however, strongly suggest that it is TOWER's heuristics (TBS and MS) that drive its performance, rather than the XLM-R (instead of mBERT) encoder. First, UDapter outperforms TOWER on high-resource languages with large training treebanks (i.e., the HIGH setup).
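For reference, the training setup above can be gathered into a single configuration sketch; the dictionary keys are our own naming, not identifiers from the released code:

```python
# Hyperparameters from the training setup described above, in one place.
config = {
    "encoder": "xlm-roberta-base",  # XLM-R Base: L = 12 layers, H = 768
    "max_subword_len": 128,         # input sequences truncated to 128 subwords
    "dropout": 0.1,                 # applied to encoder outputs
    "batch_size": 32,               # sentences per batch
    "optimizer": "Adam",
    "learning_rate": 1e-5,          # initial learning rate
    "epochs": 30,
    "early_stopping": "dev loss",
}
```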
For these languages, however, TOWER effectively does not employ its heuristics: (i) TBS selects the large language-specific treebank(s), as adding any other language prohibitively reduces the perfect similarity sim(S_0, l) = 1 (see Eq. (1)); (ii) MS is not used, because each high-resource treebank has its own dedicated dev set. Second, removing TOWER's heuristics (see -TBS-MS in Table 1) brings its performance slightly below that of UDapter, rendering TBS (primarily) and MS, rather than the XLM-R encoder, crucial for TOWER's gains. Comparing -TBS and -MS reveals that, somewhat expectedly, selecting the "optimal" training sample (TBS) contributes more to the overall performance than the heuristic early stopping (MS).
Looking at individual low-resource languages (Fig. 2), we observe the largest gains for Amharic (am) and Sanskrit (sa). While Sanskrit benefits from TOWER selecting training languages from the same family (Marathi, Urdu, and Hindi), Amharic (Afro-Asiatic family), interestingly, benefits from treebanks of syntactically similar languages from another family (cf. the full TOWER hierarchy in the Appendix) - Tamil and Telugu (Dravidian family). Similarly, Tagalog (an Austronesian language) benefits massively from training on Scottish Gaelic and Irish treebanks (Indo-European, Celtic).

Conclusion
We proposed TOWER, a simple yet effective approach to the crucial problem of source language selection for multilingual and cross-lingual dependency parsing. It leverages a language hierarchy, induced from syntax-based, manually coded URIEL language vectors, together with simple treebank selection heuristics, to inform the source selection. A wide-scale UD evaluation and comparisons to current state-of-the-art multilingual dependency parsers validated the effectiveness of TOWER, especially for low-resource languages. Moreover, while the main experiments in this work were based on one particular state-of-the-art parsing architecture, TOWER is fully independent of the chosen underlying parsing model, and is thus widely applicable.