PPT: Parsimonious Parser Transfer for Unsupervised Cross-Lingual Adaptation

Cross-lingual transfer is a leading technique for parsing low-resource languages in the absence of explicit supervision. Simple ‘direct transfer’ of a learned model based on a multilingual input encoding has provided a strong benchmark. This paper presents a method for unsupervised cross-lingual transfer that improves over direct transfer systems by using their output as implicit supervision as part of self-training on unlabelled text in the target language. The method assumes minimal resources and provides maximal flexibility by (a) accepting any pre-trained arc-factored dependency parser; (b) assuming no access to source language data; (c) supporting both projective and non-projective parsing; and (d) supporting multi-source transfer. With English as the source language, we show significant improvements over state-of-the-art transfer models on both distant and nearby languages, despite our conceptually simpler approach. We provide analyses of the choice of source languages for multi-source transfer, and the advantage of non-projective parsing. Our code is available online.


Introduction
Recent progress in natural language processing (NLP) has been largely driven by the increasing number and size of labelled datasets. The majority of the world's languages, however, are low-resource, with little to no labelled data available (Joshi et al., 2020). Predicting linguistic labels, such as syntactic dependencies, underlies many downstream NLP applications, and the most effective systems rely on labelled data. The lack of such data hinders access to NLP technology in many languages. One solution is cross-lingual model transfer, which adapts models trained on high-resource languages to low-resource ones. This paper presents a flexible framework for cross-lingual transfer of syntactic dependency parsers which can leverage any pre-trained arc-factored dependency parser, and assumes no access to labelled target language data. One straightforward method of cross-lingual parsing is direct transfer. It works by training a parser on the source language labelled data and subsequently using it to parse the target language directly. Direct transfer is attractive as it does not require labelled target language data, rendering the approach fully unsupervised. Recent work has shown that it is possible to outperform direct transfer if unlabelled data, either in the target language or a different auxiliary language, is available (He et al., 2019; Meng et al., 2019; Ahmad et al., 2019b). Here, we focus on the former setting and present flexible methods that can adapt a pre-trained parser given unlabelled target data.
* Work done outside Amazon.
1 https://github.com/kmkurn/ppt-eacl2021
Despite their success in outperforming direct transfer by leveraging unlabelled data, current approaches have several drawbacks. First, they are limited to generative and projective parsers. However, discriminative parsers have proven more effective, and non-projectivity is a prevalent phenomenon across the world's languages (de Lhoneux, 2019). Second, prior methods are restricted to single-source transfer, even though transfer from multiple source languages has been shown to lead to superior results (McDonald et al., 2011; Duong et al., 2015a; Rahimi et al., 2019). Third, they assume access to the source language data, which may not be possible because of privacy or legal reasons. In such source-free transfer, only a pre-trained source parser may be provided.
We address these three shortcomings with an alternative method for unsupervised target language adaptation (Section 2). Our method uses high probability edge predictions of the source parser as a supervision signal in a self-training algorithm, thus enabling unsupervised training on the target language data. The method is feasible for discriminative and non-projective parsing, as well as multi-source and source-free transfer. Building on a framework introduced by Täckström et al. (2013), this paper for the first time demonstrates its effectiveness in the context of state-of-the-art neural dependency parsers, and its generalisability across parsing frameworks. Using English as the source language, we evaluate on eight distant and ten nearby languages (He et al., 2019). The single-source transfer variant (Section 2.1) outperforms previous methods by up to 11 % UAS, averaged over nearby languages. Extending the approach to multi-source transfer (Section 2.2) gives further gains of 2 % UAS and closes the performance gap against the state of the art on distant languages. In short, our contributions are:
1. A conceptually simple and highly flexible framework for unsupervised target language adaptation, which supports multi-source and source-free transfer, and can be employed with any pre-trained state-of-the-art arc-factored parser(s);
2. Generalisation of the method of Täckström et al. (2013) to state-of-the-art, non-projective dependency parsing with neural networks;
3. Up to 13 % UAS improvement over state-of-the-art models, considering nearby languages, and roughly equal performance over distant languages; and
4. Analysis of the impact of choice of source languages on multi-source transfer quality.

Supervision via Transfer
In our scenario of unsupervised cross-lingual parsing, we assume the availability of a pre-trained source parser and unlabelled text in the target language. We aim to leverage this data such that our cross-lingual transfer parsing method outperforms direct transfer. One straightforward method is self-training, where we use the predictions from the source parser as supervision to train the target parser. This method may yield decent performance as direct transfer is fairly good to begin with. However, we may be able to do better if we also consider a set of parse trees that have high probability under the source parser (cf. Fig. 1 for illustration). If we assume that the source parser can produce a set of possible trees instead of a single tree, then it is natural to use all of these trees as a supervision signal for training. Inspired by Täckström et al. (2013), we formalise the method as follows. Given an unlabelled dataset $\{x_i\}_{i=1}^{n}$, the training loss can be expressed as

$$L(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log \sum_{y \in \tilde{Y}(x_i)} P_\theta(y \mid x_i) \qquad (1)$$

where $\theta$ denotes the target parser parameters and $\tilde{Y}(x_i)$ is the set of trees produced by the source parser. Note that $\tilde{Y}(x_i)$ must be smaller than the set of all trees spanning $x_i$ (denoted as $\mathcal{Y}(x_i)$) because $L(\theta) = 0$ otherwise. This training procedure is a form of self-training, and we expect that the target parser can learn the correct tree as it is likely to be included in $\tilde{Y}(x_i)$. Even if this is not the case, as long as the correct arcs occur quite frequently in $\tilde{Y}(x_i)$, we expect the parser to learn a useful signal.
We consider an arc-factored neural dependency parser where the score of a tree is defined as the sum of the scores of its arcs, and the arc scoring function is parameterised by a neural network. The probability of a tree is then proportional to the exponential of its score.
Formally, this formulation can be expressed as

$$P_\theta(y \mid x) = \frac{\exp s_\theta(x, y)}{Z(x)}, \qquad s_\theta(x, y) = \sum_{(h, m) \in A(y)} s_\theta(x, h, m)$$

where $Z(x) = \sum_{y \in \mathcal{Y}(x)} \exp s_\theta(x, y)$ is the partition function, $A(y)$ is the set of head-modifier arcs in $y$, and $s_\theta(x, y)$ and $s_\theta(x, h, m)$ are the tree and arc scoring functions, respectively.
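To make the arc-factored formulation and the per-sentence term of the training loss concrete, the following is a minimal brute-force sketch, not the authors' implementation. It enumerates every tree of a toy sentence, so it is only usable for a handful of words. The function names, the `scores[h][m]` layout (index 0 is an artificial root), and the choice to allow several words to attach to the root are all assumptions made for illustration.

```python
import itertools
import math


def is_tree(heads):
    """heads[m-1] is the head of word m; 0 is the artificial root."""
    for start in range(1, len(heads) + 1):
        seen, node = set(), start
        while node != 0:          # walk up until we reach the root
            if node in seen:      # revisiting a word means there is a cycle
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def all_trees(t):
    """Enumerate all trees over t words (exponential; toy sizes only)."""
    for heads in itertools.product(range(t + 1), repeat=t):
        if is_tree(heads):
            yield heads


def tree_score(scores, heads):
    """Arc-factored score: sum of scores[h][m] over the arcs of the tree."""
    return sum(scores[h][m] for m, h in enumerate(heads, start=1))


def log_sum_exp(values):
    values = list(values)
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def sentence_loss(scores, candidate_trees):
    """Negative log of the probability mass assigned to the candidate trees."""
    t = len(scores) - 1
    log_z = log_sum_exp(tree_score(scores, y) for y in all_trees(t))
    log_mass = log_sum_exp(tree_score(scores, y) for y in candidate_trees)
    return log_z - log_mass


# Two words, all arc scores zero: three trees, each with probability 1/3.
scores = [[0.0] * 3 for _ in range(3)]
print(sentence_loss(scores, [(0, 0)]))              # one candidate tree
print(sentence_loss(scores, list(all_trees(2))))    # all trees: loss vanishes
```

When the candidate set equals the set of all trees, the loss is zero regardless of the scores, which is why the candidate set has to be a strict subset for training to carry any signal.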

Single-Source Transfer
Here, we consider the case where a single pre-trained source parser is provided and describe how the set of trees is constructed. Concretely, for every sentence $x = w_1, w_2, \ldots, w_t$ in the target language data, using the source parser, the set of high probability trees $\tilde{Y}(x)$ is defined as the set of dependency trees that can be assembled from the high probability arc set $\tilde{A}(x) = \bigcup_{m=1}^{t} \tilde{A}(x, m)$, where $\tilde{A}(x, m)$ is the set of high probability arcs whose dependent is $w_m$. Thus, $\tilde{Y}(x)$ can be expressed formally as

$$\tilde{Y}(x) = \{\, y \mid y \in \mathcal{Y}(x) \wedge A(y) \subseteq \tilde{A}(x) \,\}$$

$\tilde{A}(x, m)$ is constructed by adding arcs $(h, m)$ in order of decreasing arc marginal probability until their cumulative probability exceeds a threshold $\sigma$ (Täckström et al., 2013). The predicted tree from the source parser is also included in $\tilde{Y}(x)$ so the chart is never empty. This prediction is simply the highest scoring tree. This procedure is illustrated in Fig. 1.
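The thresholding step described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the released code: the function names are hypothetical, arc marginals are assumed to be given as a dictionary from each modifier to a `{head: probability}` dictionary, and the predicted tree is assumed to be a `{modifier: head}` dictionary.

```python
def high_probability_heads(marginals_for_m, sigma=0.95):
    """Pick heads for one modifier in order of decreasing marginal
    probability until the cumulative probability exceeds sigma."""
    selected, cumulative = [], 0.0
    ranked = sorted(marginals_for_m.items(), key=lambda kv: kv[1], reverse=True)
    for head, prob in ranked:
        selected.append(head)
        cumulative += prob
        if cumulative > sigma:
            break
    return selected


def high_probability_arcs(marginals, predicted_heads, sigma=0.95):
    """Union over modifiers of the selected arcs, plus the arcs of the
    predicted tree so that at least one tree can always be assembled."""
    arcs = set()
    for m, marginals_for_m in marginals.items():
        arcs.update((h, m) for h in high_probability_heads(marginals_for_m, sigma))
    arcs.update((h, m) for m, h in predicted_heads.items())
    return arcs


# Toy marginals for a two-word sentence (head 0 is the artificial root).
marginals = {1: {0: 0.70, 2: 0.30}, 2: {0: 0.02, 1: 0.98}}
predicted = {1: 0, 2: 1}
print(sorted(high_probability_arcs(marginals, predicted)))
```

A confident modifier (word 2 here) keeps a single head, while an uncertain one (word 1) keeps several, so the ambiguity left in the candidate set tracks the source parser's own uncertainty.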
Since $\mathcal{Y}(x)$ contains an exponential number of trees, efficient algorithms are required to compute the partition function $Z(x)$, arc marginal probabilities, and the highest scoring tree. First, arc marginal probabilities can be computed efficiently with dynamic programming for projective trees (Paskin, 2001) and with the Matrix-Tree Theorem for the non-projective counterpart (Koo et al., 2007; McDonald and Satta, 2007; Smith and Smith, 2007). The same algorithms can also be employed to compute $Z(x)$. Next, the highest scoring tree can be obtained efficiently with Eisner's algorithm (Eisner, 1996) or the maximum spanning tree algorithm (McDonald et al., 2005; Chu and Liu, 1965; Edmonds, 1967) for the projective and non-projective cases, respectively.
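For the non-projective case, the Matrix-Tree Theorem computation can be illustrated with a small pure-Python sketch. It assumes the multi-root variant, in which any number of words may attach to the artificial root, and uses a naive Gauss-Jordan inversion without any numerical safeguards against overflow. The function names and the `scores[h][m]` layout (index 0 is the root) are assumptions for illustration; a practical implementation would rely on a batched linear algebra library instead.

```python
import math


def invert_with_det(matrix):
    """Gauss-Jordan elimination with partial pivoting.
    Returns the inverse and the determinant of a square matrix."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("singular matrix")
        if pivot != col:
            aug[col], aug[pivot] = aug[pivot], aug[col]
            det = -det                      # a row swap flips the sign
        scale = aug[col][col]
        det *= scale
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug], det


def matrix_tree(scores):
    """Return (log Z, arc marginals) for arc scores[h][m], with 0 as root."""
    t = len(scores) - 1
    w = [[math.exp(s) for s in row] for row in scores]
    lap = [[0.0] * t for _ in range(t)]     # Laplacian over words 1..t
    for m in range(1, t + 1):
        lap[m - 1][m - 1] = w[0][m] + sum(
            w[h][m] for h in range(1, t + 1) if h != m)
        for h in range(1, t + 1):
            if h != m:
                lap[h - 1][m - 1] = -w[h][m]
    inv, det = invert_with_det(lap)
    marginals = [[0.0] * (t + 1) for _ in range(t + 1)]
    for m in range(1, t + 1):
        marginals[0][m] = w[0][m] * inv[m - 1][m - 1]
        for h in range(1, t + 1):
            if h != m:
                marginals[h][m] = w[h][m] * (inv[m - 1][m - 1] - inv[m - 1][h - 1])
    return math.log(det), marginals


log_z, marginals = matrix_tree([[0.0] * 3 for _ in range(3)])
print(math.exp(log_z), marginals[0][1], marginals[2][1])
```

With all scores set to zero, a two-word sentence has three equally likely trees, so the partition function is 3 and the incoming-arc marginals of each word sum to one.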
The transfer is performed by initialising the target parser with the source parser's parameters and then fine-tuning it with the training loss in Eq. (1) on the target language data. Following previous work (Duong et al., 2015b; He et al., 2019), we also regularise the parameters towards the initial parameters to prevent them from deviating too much, since the source parser is already good to begin with. Thus, the final fine-tuning loss becomes

$$L'(\theta) = L(\theta) + \lambda \lVert \theta - \theta_0 \rVert_2^2$$

where $\theta_0$ denotes the initial parameters and $\lambda$ is a hyperparameter regulating the strength of the $L_2$ regularisation. This single-source transfer strategy was introduced as ambiguity-aware self-training by Täckström et al. (2013). A difference here is that we regularise the target parser's parameters against the source parser's as the initialiser, and apply the technique to modern lexicalised state-of-the-art parsers. We refer to this transfer strategy as PPT hereinafter.
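The regularised objective amounts to adding a penalty on the squared distance from the initial parameters. A minimal sketch, assuming the parameters are flattened into plain lists of floats and using hypothetical function names:

```python
def squared_distance(theta, theta0):
    """Squared L2 distance between current and initial parameters."""
    return sum((p - q) ** 2 for p, q in zip(theta, theta0))


def fine_tuning_loss(base_loss, theta, theta0, lam):
    """Training loss plus lam times the squared distance to theta0."""
    return base_loss + lam * squared_distance(theta, theta0)


# At initialisation the penalty is zero, so only the training loss remains.
theta0 = [0.5, -1.0, 2.0]
print(fine_tuning_loss(1.5, theta0, theta0, lam=0.1))
print(fine_tuning_loss(1.5, [0.5, 1.0, 2.0], theta0, lam=0.1))
```

Unlike ordinary weight decay, which pulls parameters towards zero, this penalty pulls them back towards the source parser, which is what keeps fine-tuning from drifting far from an already strong starting point.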
Note that the whole procedure of PPT can be performed even when the source parser is trained with monolingual embeddings. Specifically, given a source parser trained only on monolingual embeddings, one can align pre-trained target language word embeddings to the source embedding space using an offline cross-lingual alignment method (e.g., that of Smith et al. (2017)), and use the aligned target embeddings with the source model to compute $\tilde{Y}(x)$. Thus, our method can be used with any pre-trained monolingual neural parser.

Multi-Source Transfer
We now consider the case where multiple pre-trained source parsers are available. To extend PPT to this multi-source case, we employ the ensemble training method from Täckström et al. (2013), which we now summarise. We define $\tilde{A}(x, m) = \bigcup_{k} \tilde{A}_k(x, m)$, where $\tilde{A}_k(x, m)$ is the set of high probability arcs obtained with the $k$-th source parser. The rest of the procedure is exactly the same as PPT. Note that we need to select one source parser as the main source to initialise the target parser's parameters with. Henceforth, we refer to this method as PPTX.
Multiple source parsers may help transfer better because each parser will encode different syntactic biases from the language it is trained on. Thus, one of those biases is more likely to match that of the target language than when just a single source parser is used. However, multi-source transfer may also hurt performance if the languages have very different syntax, or the source parsers are of poor quality, which can arise from poor quality cross-lingual word embeddings.
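The multi-source construction is a plain set union over the per-source arc sets. The sketch below, with hypothetical function names and arcs written as `(head, modifier)` pairs, also shows one consequence of the union: the merged set can license trees that mix arcs from different sources and that no single source would license alone.

```python
def merge_source_arcs(per_source_arcs):
    """Union of the high probability arc sets from all source parsers."""
    merged = set()
    for arcs in per_source_arcs:
        merged.update(arcs)
    return merged


def is_licensed(heads, arcs):
    """True if every arc of the tree (heads[m-1] is the head of word m)
    belongs to the given arc set."""
    return all((h, m) in arcs for m, h in enumerate(heads, start=1))


source_a = {(0, 1), (1, 2)}   # prefers word 1 as root, word 2 under it
source_b = {(0, 2), (2, 1)}   # prefers word 2 as root, word 1 under it
merged = merge_source_arcs([source_a, source_b])

tree = (0, 0)                 # both words attached to the root
print(is_licensed(tree, source_a), is_licensed(tree, source_b),
      is_licensed(tree, merged))
```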

Setup
We run our experiments on Universal Dependency Treebanks v2.2 (Nivre et al., 2018). We reimplement the self-attention graph-based parser of Ahmad et al. (2019a) that has been used with success for cross-lingual dependency parsing. Averaged over 5 runs, our reimplementation achieves 88.8 % unlabelled attachment score (UAS) on English Web Treebank using the same hyperparameters, slightly below their reported 90.3 % result. We select the run with the highest labelled attachment score (LAS) as the source parser. We obtain cross-lingual word embeddings with the offline transformation of Smith et al. (2017) applied to fastText pre-trained word vectors (Bojanowski et al., 2017). We include the universal POS tags as inputs by concatenating their embeddings with the word embeddings in the input layer. We acknowledge that the inclusion of gold POS tags does not reflect a realistic low-resource setting where gold tags are not available, which we discuss more in Section 3.3. We evaluate on 18 target languages that are divided into two groups, distant and nearby languages, based on their distance from English as defined by He et al. (2019). During the unsupervised fine-tuning, we compute the training loss over all trees regardless of projectivity (i.e. we use the Matrix-Tree Theorem to compute Eq. (1)) and discard sentences longer than 30 tokens to avoid out-of-memory errors. Following He et al. (2019), we fine-tune on the target language data for 5 epochs, tune the hyperparameters (learning rate and λ) on Arabic and Spanish using LAS, and use these values for the distant and nearby languages, respectively. We set the threshold σ = 0.95 for both PPT and PPTX following Täckström et al. (2013). We keep the rest of the hyperparameters (e.g., batch size) equal to those of Ahmad et al. (2019a). For PPTX, unless otherwise stated, we consider a leave-one-out scenario where we use all languages except the target as the source languages.
We use the same hyperparameters as the English parser to train these non-English source parsers and set the English parser as the main source.

Comparisons
We compare PPT and PPTX against several recent unsupervised transfer systems. First, HE is a neural lexicalised DMV parser with normalising flow that uses a language modelling objective when fine-tuning on the unlabelled target language data (He et al., 2019). Second, AHMAD is an adversarial training method that attempts to learn language-agnostic representations (Ahmad et al., 2019b). Lastly, MENG is a constrained inference method that derives constraints from the target corpus statistics to aid inference (Meng et al., 2019). We also compare against direct transfer (DT) and self-training (ST) as our baseline systems. Table 1 shows the main results. We observe that fine-tuning via self-training already helps DT, and by incorporating multiple high probability trees with PPT, we can push the performance slightly higher on most languages, especially the nearby ones. Although not shown in the table, we also find that PPT has up to 6x lower standard deviation than ST, which makes PPT preferable to ST. Thus, we exclude ST as a baseline from our subsequent experiments. Our results seem to agree with those of Täckström et al. (2013) and suggest that PPT can also be employed for neural parsers. Therefore, it should be considered for target language adaptation if unlabelled target data is available. Compared with HE (He et al., 2019), PPT performs worse on distant languages, but better on nearby languages. This finding means that if the target language has a closely related high-resource language, it may be better to transfer from that language as the source and use PPT for adaptation. Against AHMAD (Ahmad et al., 2019b), PPT performs better on 4 out of 6 distant languages. On nearby languages, the average UAS of PPT is higher, and the average LAS is on par. This result shows that leveraging unlabelled data for cross-lingual parsing without access to the source data is feasible.
PPT also performs better than MENG (Meng et al., 2019) on 4 out of 7 distant languages, and slightly better on average on nearby languages. This finding shows that PPT is competitive with their constrained inference method. Also reported in Table 1 are the ensemble results for PPTX, which are particularly strong. PPTX outperforms PPT, especially on distant languages, with average absolute improvements in UAS and LAS of 7 % and 6 %, respectively. This finding suggests that PPTX is indeed an effective method for multi-source transfer of neural dependency parsers. It also gives further evidence that multi-source transfer is better than the single-source counterpart. PPTX also closes the gap against the state-of-the-art adaptation of He et al. (2019) in terms of average UAS on distant languages. This result suggests that PPTX can be an option for languages that do not have a closely related high-resource language to transfer from.

Results
Treebank Leakage. The success of our cross-lingual transfer can be attributed in part to treebank leakage, which measures the fraction of dependency trees in the test set that are isomorphic to a tree in the training set (with potentially different words); accordingly, these trees are not entirely unseen. Such leakage has been found to be a particularly strong predictor for parsing performance in monolingual parsing (Søgaard, 2020). Fig. 2 shows the relationship between treebank leakage and parsing accuracy, where the leakage is computed between the English training set as source and the target language's test set. Excluding the outliers Korean and Turkish, which have low parsing accuracy despite relatively high leakage, we find that there is a fairly strong positive correlation (r = 0.57) between the amount of leakage and accuracy. The same trend occurs with DT, ST, and PPT. This finding suggests that cross-lingual parsing is also affected by treebank leakage just like monolingual parsing is, which may present an opportunity to find good sources for transfer.
Use of Gold POS Tags. As we explained in Section 3.1, we restrict our experiments to gold POS tags for comparison with prior work. However, the use of gold POS tags does not reflect a realistic low-resource setting where one may have to resort to automatically predicted POS tags. Tiedemann (2015) has shown that cross-lingual delexicalised parsing performance degrades when predicted POS tags are used. The degradation ranges from 2.9 to 8.4 LAS points depending on the target language. Thus, our reported numbers in Table 1 are likely to decrease as well if predicted tags are used, although we expect the decline to be less sharp because our parser is lexicalised.
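As an illustration of how such a leakage measure could be computed, the sketch below counts the test trees whose shape also occurs in the training set. It rests on assumptions that the text does not settle: trees are compared as unlabelled, unordered rooted trees (so both dependency labels and word order are ignored), each tree is given as a tuple of head indices with 0 as the artificial root, and the function names are hypothetical.

```python
def canonical_form(heads):
    """Canonical string for the shape of a dependency tree.
    heads[m-1] is the head of word m; 0 is the artificial root.
    Children are sorted so that isomorphic trees get the same string."""
    children = {node: [] for node in range(len(heads) + 1)}
    for m, h in enumerate(heads, start=1):
        children[h].append(m)

    def encode(node):
        return "(" + "".join(sorted(encode(c) for c in children[node])) + ")"

    return encode(0)


def treebank_leakage(train_trees, test_trees):
    """Fraction of test trees isomorphic to at least one training tree."""
    seen = {canonical_form(tree) for tree in train_trees}
    hits = sum(canonical_form(tree) in seen for tree in test_trees)
    return hits / len(test_trees)


train = [(2, 0, 2)]             # word 2 is the root with two dependents
test = [(0, 1, 1), (0, 1, 2)]   # same shape with another root; a chain
print(treebank_leakage(train, test))
```

The first test tree has its root in a different position but the same shape as the training tree, so it counts as leaked, while the chain does not.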

Parsimonious Selection of Sources for PPTX
In our main experiment, we use all available languages as source for PPTX in a leave-one-out setting. Such a setting may be justified to cover as many syntactic biases as possible; however, training dozens of parsers may be impractical. In this experiment, we consider the case where we can train only a handful of source parsers. We investigate two selections of source languages: (1) a representative selection (PPTX-REPR) which covers as many language families as possible and (2) a pragmatic selection (PPTX-PRAG) containing truly high-resource languages for which quality pre-trained parsers are likely to exist. We restrict the selections to 5 languages each. For PPTX-REPR, we use English, Spanish, Arabic, Indonesian, and Korean as source languages. This selection covers the Indo-European (Germanic and Romance), Afro-Asiatic, Austronesian, and Koreanic language families, respectively. We use English, Spanish, Arabic, French, and German as source languages for PPTX-PRAG. The five languages are classified as exemplary high-resource languages by Joshi et al. (2020). We exclude a language from the source if it is also the target language, in which case there will be only 4 source languages. Other than that, the setup is the same as that of our main experiment (hyperparameters are tuned; values are shown in Table 5). We present the results in Fig. 3, where we also include the results for PPT and for PPTX with the leave-one-out setting (PPTX-LOO). We report only LAS since UAS shows a similar trend. We observe that both PPTX-REPR and PPTX-PRAG outperform PPT overall. Furthermore, on nearby languages except Dutch and German, both PPTX-REPR and PPTX-PRAG outperform PPTX-LOO, and PPTX-PRAG does best overall. In contrast, no systematic difference between the three PPTX variants emerges on distant languages.
This finding suggests that instead of training dozens of source parsers for PPTX, training just a handful of them is sufficient, and a "pragmatic" selection of a small number of high-resource source languages seems to be an efficient strategy. Since pre-trained parsers for these languages are most likely available, it comes with the additional advantage of alleviating the need to train parsers at all, which makes our method even more practical.
Analysis on Dependency Labels. Next, we break down the performance of our methods based on the dependency labels to study their failure and success patterns. Fig. 4 shows the UAS of DT, PPT, and PPTX-PRAG on Indonesian and German for select dependency labels.
Looking at Indonesian, PPT is slightly worse than DT in terms of overall accuracy scores (Table 1), and this is reflected across dependency labels. However, we see in Fig. 4 that PPT outperforms DT on amod. In Indonesian, adjectives follow the noun they modify, while in English the opposite is true in general. Thus, unsupervised target language adaptation seems able to address this kind of discrepancy between the source and target language. We find that PPTX-PRAG outperforms both DT and PPT across dependency labels, especially on flat and compound labels as shown in Fig. 4. Both labels are related to multi-word expressions (MWEs), so PPTX appears to improve parsing MWEs in Indonesian significantly.
For German, we find that both PPT and PPTX-PRAG outperform DT on most dependency labels, with the most notable gain on nmod. Nominal modifiers appear in diverse and often non-local relations in both languages, many of which do not translate structurally, so fine-tuning improves performance as expected. Also, we see that PPTX-PRAG significantly underperforms on compound while PPT is better than DT. German compounds are often merged into a single token, and self-training appears to alleviate over-prediction of such relations. The multi-source case may contain too much diffuse signal on compound and thus the performance is worse than that of DT. We find that PPT and PPTX improve over DT on mark, likely because markers are often used in places where German deviates from English by becoming verb-final (e.g., subordinate clauses). Both PPT and PPTX-PRAG seem able to learn this characteristic as shown by their performance improvements. This analysis suggests that the benefits of self-training depend on the syntactic properties of the target language.

Effect of Projectivity
In this experiment, we study the effect of projectivity on the performance of our methods. We emulate a projective parser by restricting the trees in $\tilde{Y}(x)$ to be projective. In other words, the sum in Eq. (1) is performed only over projective trees. At test time, we search for the highest scoring projective tree. We compare DT, PPT, and PPTX-PRAG, and report LAS on Indonesian (id) and Croatian (hr) as distant languages, and on French (fr) and Dutch (nl) as nearby languages. The trend for UAS and on the other languages is similar. We use the dynamic programming implementation provided by torch-struct for the projective case (Rush, 2020). We find that it consumes more memory than our Matrix-Tree Theorem implementation, so we set the length cutoff to 20 tokens. Table 2 shows the results of our experiment, which suggest that there is no significant performance difference between the projective and non-projective variants of our methods. This result suggests that our methods generalise well to both projective and non-projective parsing. That said, we recommend the non-projective variant as it allows better parsing of languages that are predominantly non-projective. Also, we find that it runs roughly 2x faster than the projective variant in practice.

Table 3: Comparison of LAS on Arabic and Spanish on the development set, averaged over 5 runs. PPTX_EN5 is PPTX with 5 English parsers as source, each trained on 1/5 of the English corpus. PPTX-PRAG_S is PPTX with the pragmatic selection of source languages (PPTX-PRAG), but each source parser is trained on the same amount of data as PPTX_EN5.
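For reference, a tree is projective when no two of its arcs cross once the words are laid out in sentence order. The check below is a simple quadratic-time sketch with a hypothetical function name. It assumes the tree is given as a tuple of head indices, that the artificial root sits at position 0 to the left of the sentence, and that arcs from the root take part in the crossing test.

```python
def is_projective(heads):
    """heads[m-1] is the head of word m; 0 is the artificial root.
    Returns False if any pair of arcs crosses."""
    spans = [(min(h, m), max(h, m)) for m, h in enumerate(heads, start=1)]
    for left_a, right_a in spans:
        for left_b, right_b in spans:
            # Arc b starts strictly inside arc a and ends strictly outside it.
            if left_a < left_b < right_a < right_b:
                return False
    return True


print(is_projective((2, 0, 2)))      # word 2 heads words 1 and 3
print(is_projective((0, 4, 1, 1)))   # arcs 1-3 and 2-4 cross
```

Restricting the candidate set to trees that pass such a check is one way to picture the projective variant; the experiments instead sum over projective trees with dynamic programming, which avoids enumerating trees altogether.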

Disentangling the Effect of Ensembling and Larger Data Size
The effectiveness of PPTX can be attributed to at least three factors: (1) the effect of ensembling source parsers (ensembling), (2) the effect of larger data size used for training the source parsers (data), and (3) the diversity of syntactic biases from multiple source languages (multilinguality). In this experiment, we investigate to what extent each of those factors contributes to the overall performance.
To this end, we design two additional comparisons: PPTX_EN5 and PPTX-PRAG_S. PPTX_EN5 is PPTX with only English source parsers, where each parser is trained on 1/5 of the English training set. That is, we randomly split the English training set into five equal-sized parts, and train a separate parser on each. These parsers then serve as the source parsers for PPTX_EN5. Thus, PPTX_EN5 has the benefit of ensembling but not data and multilinguality compared with PPT.
PPTX-PRAG_S is PPTX whose source language selection is the same as PPTX-PRAG, but each source parser is trained on training data whose size is roughly the same as that of the training data of the PPTX_EN5 source parsers. In other words, the training data size is roughly equal to 1/5 of the English training set. To obtain this data, we randomly sub-sample the training data of each source language to the appropriate number of sentences. Therefore, PPTX-PRAG_S has the benefit of ensembling and multilinguality but not data. Table 3 reports their LAS on the development set of Arabic and Spanish, averaged over five runs. We also include the results of PPTX-PRAG, which enjoys all three benefits. We observe that PPT and PPTX_EN5 perform similarly on Arabic, and PPTX_EN5 has a slightly lower performance on Spanish. This result suggests a negligible effect of ensembling on performance. On the other hand, PPTX-PRAG_S outperforms PPTX_EN5 remarkably, with approximately 6 % and 4 % LAS improvement on Arabic and Spanish respectively, showing that multilinguality has a much larger effect on performance than ensembling. Lastly, we see that PPTX-PRAG performs similarly to PPTX-PRAG_S on Arabic, and about 1.6 % better on Spanish. This result demonstrates that data size has an effect, albeit a smaller one compared to multilinguality. To conclude, the effectiveness of PPTX can be attributed primarily to the diversity contributed through multiple languages, rather than to ensembling or larger source data sets.

Related Work
Cross-lingual dependency parsing has been extensively studied in NLP. The approaches can be grouped into two main categories. On the one hand, there are approaches that operate on the data level. Examples of this category include annotation projection, which aims to project dependency trees from a source language to a target language (Hwa et al., 2005; Li et al., 2014; Lacroix et al., 2016); and source treebank reordering, which manipulates the source language treebank to obtain another treebank whose statistics approximately match those of the target language (Wang and Eisner, 2018; Rasooli and Collins, 2019). Both methods have no restriction on the type of parsers as they are only concerned with the data. Transferring from multiple source languages with annotation projection is also feasible (Agić et al., 2016).
Despite their effectiveness, these data-level methods may require access to the source language data, and hence are unusable when it is inaccessible due to privacy or legal reasons. In such source-free transfer, only a model pre-trained on the source language data is available. By leveraging parallel data, annotation projection is indeed feasible without access to the source language data. That said, parallel data is limited for low-resource languages or may have a poor domain match. Additionally, these methods involve training the parser from scratch for every new target language, which may be prohibitive.
On the other hand, there are methods that operate on the model level. A typical approach is direct transfer (a.k.a. zero-shot transfer), which trains a parser on source language data, and then directly uses it to parse a target language. This approach is enabled by the shared input representation between the source and target language, such as POS tags (Zeman and Resnik, 2008) or cross-lingual embeddings (Guo et al., 2015; Ahmad et al., 2019a). Direct transfer supports source-free transfer and only requires training a parser once on the source language data. In other words, direct transfer is unsupervised as far as target language resources are concerned.
Previous work has shown that unsupervised target language adaptation outperforms direct transfer. Recent work by He et al. (2019) used a neural lexicalised dependency model with valence (DMV) (Klein and Manning, 2004) as the source parser and fine-tuned it in an unsupervised manner on the unlabelled target language data. This adaptation method allows for source-free transfer and performs especially well on distant target languages. A different approach is proposed by Meng et al. (2019), who gathered target language corpus statistics to derive constraints to guide inference using the source parser. Thus, this technique also allows for source-free transfer. A different method is proposed by Ahmad et al. (2019b), who explored the use of unlabelled data from an auxiliary language, which can be different from the target language. They employed adversarial training to learn language-agnostic representations. Unlike the others, this method can be extended to support multi-source transfer. An older method is introduced by Täckström et al. (2013), who leveraged ambiguity-aware training to achieve unsupervised target language adaptation. Their method is usable for both source-free and multi-source transfer. However, to the best of our knowledge, its use for neural dependency parsing has not been investigated. Our work extends theirs by employing it for the said purpose.
The methods of both He et al. (2019) and Ahmad et al. (2019b) have several limitations. The method of He et al. (2019) requires the parser to be generative and projective. Their generative parser is quite impoverished with an accuracy that is 21 points lower than a state-of-the-art discriminative arc-factored parser on English. Thus, their choice of generative parser may constrain its potential performance. Furthermore, their method performs substantially worse than direct transfer on nearby target languages. Because of the availability of resources such as Universal Dependency Treebanks (Nivre et al., 2018), it is likely that a target language has a closely related high-resource language which can serve as the source language. Therefore, performing well on nearby languages is more desirable pragmatically. On top of that, it is unclear how to employ this method for multi-source transfer. The adversarial training method of Ahmad et al. (2019b) does not suffer from the aforementioned limitations but is unusable for source-free transfer. That is, it assumes access to the source language data, which may not always be feasible due to privacy or legal reasons.

Conclusions
This paper presents a set of effective, flexible, and conceptually simple methods for unsupervised cross-lingual dependency parsing, which can leverage the power of state-of-the-art pre-trained neural network parsers. Our methods improve over direct transfer and strong recent unsupervised transfer models, by using source parser uncertainty for implicit supervision, leveraging only unlabelled data in the target language. Our experiments show that the methods are effective for both single-source and multi-source transfer, free from the limitations of recent transfer models, and perform well for non-projective parsing. Our analysis shows that the effectiveness of the multi-source transfer method is attributable to its ability to leverage diverse syntactic signals from source parsers from different languages. Our findings motivate future research into advanced methods for generating informative sets of candidate trees given one or more source parsers.