Genre as Weak Supervision for Cross-lingual Dependency Parsing

Recent work has shown that monolingual masked language models learn to represent data-driven notions of language variation which can be used for domain-targeted training data selection. Dataset genre labels are already frequently available, yet remain largely unexplored in cross-lingual setups. We harness this genre metadata as a weak supervision signal for targeted data selection in zero-shot dependency parsing. Specifically, we project treebank-level genre information to the finer-grained sentence level, with the goal to amplify information implicitly stored in unsupervised contextualized representations. We demonstrate that genre is recoverable from multilingual contextual embeddings and that it provides an effective signal for training data selection in cross-lingual, zero-shot scenarios. For 12 low-resource language treebanks, six of which are test-only, our genre-specific methods significantly outperform competitive baselines as well as recent embedding-based methods for data selection. Moreover, genre-based data selection provides new state-of-the-art results for three of these target languages.


Introduction
Multilingual masked language models (MLMs) trained on immense quantities of heterogeneous texts (Devlin et al., 2019; Brown et al., 2020; Conneau et al., 2020) have recently made applications such as highly cross-lingual dependency parsing a reality (Kondratyuk and Straka, 2019). In parallel, it has been recognized that they capture characteristics relevant for training data selection (Aharoni and Goldberg, 2020) and can be efficiently fine-tuned for higher task-specific performance (Gururangan et al., 2020; Dai et al., 2020; Lauscher et al., 2020; Üstün et al., 2020). These considerations are especially important in computationally restricted environments and when data from the target distribution are unavailable.

Figure 1: Genre-driven Training Data Selection for a zero-shot target treebank (diagram labels: Universal Dependencies, Genre Selection, Targeted Training Data, Target Parser). In absence of annotated in-language data, we propose genre as a weak supervision signal for targeted instance selection from a large pool of out-of-language treebanks.
Universal Dependencies (UD; Nivre et al., 2020) provides an extensive testing ground for such scenarios: Its language diversity is constantly increasing (from 10 in v1.0 to 104 in v2.7) and low-resource languages are often limited to a single test-set-only treebank. As most of the 7,000+ languages in the world similarly lack any annotated training data, effective zero-shot transfer learning is crucial for achieving wider linguistic coverage.
Criteria for selecting training data within such settings vary, and a practitioner may determine relevance by proxy of language relatedness or treebank content. This leads us to the question: If our goal is to develop a parser for a known domain in an unseen language, can a signal such as genre guide our selection of cross-lingual training data from a significantly larger, diverse pool (Figure 1)?
Within the heterogeneity of written and spoken (transcribed) data, genre broadly encompasses variation along the functional role of a text (Kessler et al., 1997). A clear definition is complex if not impossible and communities refer to genre, domain, style or register in different ways (Kessler et al., 1997; Lee, 2001; Webber, 2009; Plank, 2011). In this work, we take a pragmatic approach and use genre as defined by the 18 community-provided categories in UD (Zeman et al., 2020). These genres are assigned at the treebank level, and "are neither mutually exclusive nor based on homogeneous criteria, but [are] currently the best documentation that can be obtained" (Nivre et al., 2020).
Contributions In order to facilitate finer-grained, instance-level data selection for cross-lingual parsing in absence of in-language training data, we provide three contributions: First, we provide an analysis of the genre distribution in UD v2.7 (Zeman et al., 2020) across 104 languages and 177 treebanks (Section 3).
Next, we introduce three targeted data selection strategies which amplify existing genre information in multilingual contextualized embeddings in order to enable sentence-level selection based on UD's treebank-level genre annotations (Section 4).
Finally, we apply the extracted genre information to proxy training data selection for 12 typologically diverse low-resource treebanks. In absence of any in-language training data, our approach outperforms selection using treebank metadata alone as well as purely embedding-based instance selection and surpasses state-of-the-art results on three treebanks (Section 5). 1

Related Work
Despite advances in zero-shot performance (Devlin et al., 2019; Brown et al., 2020) and increasingly cross-lingual parsers (Kondratyuk and Straka, 2019), fine-tuning has remained a crucial step for achieving state-of-the-art performance. Meechan-Maddon and Nivre (2019) demonstrate that this holds true for low-resource languages in particular, with 200 training instances in the target or related languages producing better results on dependency parsing than a model trained on all available data. Lauscher et al. (2020) further show that as few as 10 samples in the target language can double parsing performance. Üstün et al. (2020) propose UDapters, which integrate language and task-specific adaptation modules into the parser to improve cross-lingual, zero-shot performance.
Considering factors complementary to language is equally important: MLMs can for instance be improved for specific domains such as Twitter or medical texts by fine-tuning on the same or related sources (Dai et al., 2020; Gururangan et al., 2020). For dependency parsing, the use of data from matching genres has been explored by Plank and van Noord (2011), who find improvements for English and Dutch. This is further confirmed for German by Rehbein and Bildhauer (2017).
1 Code at https://personads.me/x/emnlp-2021-code.
Automatically inferred topics (Ruder and Plank, 2017) as well as more abstract selection criteria such as overlapping part-of-speech sequences (Søgaard, 2011; Rosa, 2015) have also proven effective at selecting syntactically similar training instances. Vania et al. (2019) further demonstrate that when word embeddings of mutually unintelligible languages align with respect to POS, cross-lingual transfer remains especially effective. With respect to data-driven domain representations, Stymne (2020) shows that treebank embeddings can be used to successfully transfer knowledge from in-domain cross-lingual source treebanks when used in conjunction with in-language, out-of-domain data. In this work, we will rely solely on treebank genre labels as weak supervision and forgo the use of in-language training data as well as instance-level annotations thereof (e.g. POS tags).
Recently, contextualized embeddings have been shown to contain useful information for training data selection. Aharoni and Goldberg (2020) find that clusters formed by embeddings from untuned, monolingual language models correspond well to the genres of their five-domain corpus. Training an English-to-German machine translation model on only the closest embedded sentences to their target 2k-sentence development set outperformed a model trained on the entire dataset.
Although all aforementioned methods assume some degree of in-language training data, our methods will not have access to any annotated target data and will be trained exclusively on out-of-language instances. Building on information stored in pretrained contextual embeddings, we extend genre-based data selection into the massively multilingual, 104-language, 18-genre setting of Universal Dependencies (Zeman et al., 2020). While previous work further assumed sentence-level genre labels (Ruder and Plank, 2017; Aharoni and Goldberg, 2020), our methods will only have access to treebank-level metadata. An instance's genre will therefore have to be inferred using weakly supervised approaches. To the best of our knowledge, this constitutes the first application of UD's instance-level genre distribution to the selection of training data for zero-shot, cross-lingual dependency parsing.
Figure 2: Genre Distribution in UD (axis: Percentage). Ranges indicate upper/lower bounds for sentences per genre inferred from UD metadata. Center marker reflects the distribution under the assumption that genres within treebanks are uniformly distributed. Labels above the bars indicate the number of treebanks which contain each genre.

Genre in Universal Dependencies
Universal Dependencies (Nivre et al., 2016) offers annotations for a broad spectrum of languages, with 104 in version 2.7 (Zeman et al., 2020). The 177 treebanks which we consider contain 1.38 million sentences in total; 64 of these treebanks are test-set only, and many in this latter third constitute the sole treebank of their language. Such data sparsity becomes even more critical when both the language and the domain are highly specialized and under-resourced. As more low-resource languages are added in this manner and as the vast majority of the world's languages remain without annotated data, it becomes important to consider new signals for selecting training data in zero-shot scenarios. If no data in the target language are available, we hypothesize that characteristics of most genres are stable enough across languages to offer a useful guiding criterion for data selection in cross-lingual dependency parsing.
For 26 of the 177 treebanks, their authors have provided sentence-level genre labels. However, these annotations cover only 6% of UD sentences and are typically incompatible across treebanks (with few exceptions such as PUD). At the treebank level, UD fortunately provides 18 approximated genre labels: academic, bible, blog, email, fiction, government, grammar-examples, learner-essays, legal, medical, news, nonfiction, poetry, reviews, social, spoken, web, wiki. Genres such as wiki likely have stronger internal consistency due to cross-lingual creation guidelines. Others such as fiction or web may have higher variance. While these UD-provided labels are far from perfectly defined (Nivre et al., 2020), they nonetheless allow us to operationalize our hypothesis: If genre is globally consistent, it must have a positive effect on cross-lingual transfer performance.
From Figure 2 it is evident that these genres are heavily imbalanced. The lower bound on the number of sentences in a genre is the sum over the number of instances in treebanks containing only that genre. The upper bound is the sum over all treebanks containing the genre, whether alone or among others. As indicated by these distributional bounds, news articles may constitute up to 70% of the whole UD dataset. Even assuming uniform genre distributions within each treebank (center marker), over half of all sentences in UD would fall into either the news or the nonfiction category.
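The bounds described above can be sketched in a few lines. This is a minimal illustration, assuming treebank metadata is available as `(num_sentences, genres)` pairs; the function name `genre_bounds` and the toy data in the usage note are hypothetical and not taken from UD.

```python
from collections import defaultdict


def genre_bounds(treebanks):
    """Return {genre: (lower, uniform, upper)} as fractions of all sentences.

    treebanks: list of (num_sentences, list_of_genres) pairs.
    lower   = sentences in treebanks containing only this genre
    upper   = sentences in all treebanks containing this genre
    uniform = sentences if each treebank is split evenly across its genres
    """
    lower = defaultdict(float)
    uniform = defaultdict(float)
    upper = defaultdict(float)
    total = sum(size for size, _ in treebanks)

    for size, genres in treebanks:
        for genre in genres:
            upper[genre] += size
            uniform[genre] += size / len(genres)
            if len(genres) == 1:
                lower[genre] += size

    return {
        genre: (lower[genre] / total, uniform[genre] / total, upper[genre] / total)
        for genre in upper
    }
```

For example, with three toy treebanks `[(100, ["news"]), (300, ["news", "wiki"]), (100, ["wiki"])]`, news is bounded between 20% and 80% of all sentences, with 50% under the uniform assumption.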
Genres with highly specific lexical and/or structural features such as spoken, social or medical are much more underrepresented. Furthermore, they are often only a small part of larger genre mixtures (117 treebanks include multiple genre-labels). These mixtures, with up to 10 genres in one treebank, may contain related genres (e.g. news, nonfiction, web), but also unrelated ones (e.g. medical, poetry, social, web) depending on what data was available to authors during annotation.
Out-of-the-box, treebank-level genre labels appear to be highly noisy (see also Nivre et al., 2020). Additionally, individual treebanks are labeled with multiple genres while lacking such labels at the sentence level. We hypothesize that it is therefore necessary to predict instance-level genre distributions before targeted data selection can be effective.

Targeted Data Selection
In order to measure the effect of genre on the targeted selection of training data, we depart from previous treebank-level selection (Section 2) and introduce three new types of instance-level selection strategies in the following section. They are evaluated on the task of zero-shot dependency parsing in Sections 5 and 6. All of them build on contextualized embeddings learned by the mBERT (Devlin et al., 2019) masked language model (MLM). While MLMs still lack the full breadth of the languages covered in UD (mBERT covers 56 of the 104 languages), they have proven robust in zero-shot scenarios (Devlin et al., 2019; Brown et al., 2020) and have also been found to contain a certain amount of genre information, at least monolingually (Aharoni and Goldberg, 2020; Section 2). We evaluate whether UD's definition of genre is also recoverable from these data-driven representations and whether these categories hold cross-lingually.

Closest Sentence Selection
SENT Akin to the strategy used by Aharoni and Goldberg (2020), this SENTENCE-based method attempts to find the most relevant training data by computing the mean embedding of n unannotated target data samples and retrieving the top-k closest non-target instances according to their cosine distance in embedding space. Notable differences from their original method are the use of a much smaller target data sample (n = 100 versus n = 2000) as well as the use of mBERT instead of English-only BERT embeddings (Devlin et al., 2019) due to our cross-lingual setting.
While the monolingual BERT embeddings were found to represent genre to some degree, such MLM embeddings likely contain many more dimensions of semantic and syntactic information. The SENT method alone is therefore not guaranteed to represent data selection by genre as stronger factors may override these signals. Additionally, Aharoni and Goldberg (2020)'s setup assumed five clearly-defined genres with instance-level annotations while UD has 18 genres with varying degrees of specificity which are only defined in the treebank-level metadata.
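The SENT procedure can be sketched as follows. This is a minimal illustration in plain Python, assuming sentence embeddings have already been computed and are given as lists of floats; the function names are hypothetical, and zero vectors are not handled.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def select_closest_sentences(target_sample, pool, k):
    """Return indices of the k pool embeddings closest to the mean target embedding.

    target_sample: embeddings of the n unannotated target sentences
    pool:          embeddings of the non-target (out-of-language) sentences
    """
    target_mean = mean_vector(target_sample)
    ranked = sorted(range(len(pool)), key=lambda i: cosine_distance(pool[i], target_mean))
    return ranked[:k]
```

In the setting described above, `target_sample` would hold n = 100 embeddings and `pool` the embeddings of all out-of-language UD sentences.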

Genre Selection
META Separately from MLM embedding-based selection, we evaluate the effectiveness of using the manually assigned genre labels listed in each treebank's metadata. As seen in Section 3, these labels can be noisy and have variable interpretations across treebanks. Furthermore, each treebank is assigned up to 10 genres, making instance-level selection as in the previous method impossible.
BOOT To bridge this gap to sentence-level selection, we introduce a bootstrapping procedure which iteratively learns an instance-level classifier for UD genre. Each sentence is encoded through mBERT's CLS token before passing to a classification layer. The model is initialized using standard mBERT weights and begins by training on single-genre treebanks (i.e. standard supervised learning). It then predicts sentence labels for treebanks containing these initial genres. Above a prediction threshold of 0.99 ∈ [0, 1], these are added as new training data for the next round of training. When only one unclassified genre remains in a treebank, all remaining instances are inferred to be of that last genre. Using this procedure, a single genre label is assigned to each sentence in UD within three steps.
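One pseudo-labelling pass of this bootstrapping procedure over a single multi-genre treebank might look as follows. This is a sketch under stated assumptions, not the authors' implementation: `predict` stands in for the mBERT-based classifier and is assumed to return a genre-to-probability mapping, predictions are restricted to genres listed in the treebank's metadata (an assumption), and "unclassified genre" is read as a metadata genre the classifier has not yet been trained on.

```python
def bootstrap_round(predict, sentences, treebank_genres, known_genres, threshold=0.99):
    """Assign genre labels to sentences of one treebank.

    predict:         callable mapping a sentence to {genre: probability} (hypothetical)
    treebank_genres: genres listed in the treebank's metadata
    known_genres:    genres the classifier has been trained on so far
    Returns (labels, unlabelled): {sentence_index: genre} and remaining indices.
    """
    labels = {}
    unlabelled = []
    # Assumption: only genres in the metadata that the classifier knows are candidates.
    candidates = [g for g in treebank_genres if g in known_genres]

    for index, sentence in enumerate(sentences):
        probabilities = predict(sentence)
        best = max(candidates, key=lambda g: probabilities.get(g, 0.0), default=None)
        if best is not None and probabilities.get(best, 0.0) > threshold:
            labels[index] = best  # confident prediction becomes new training data
        else:
            unlabelled.append(index)

    # Elimination rule: if exactly one metadata genre is still unknown to the
    # classifier, all remaining sentences are inferred to belong to it.
    remaining = [g for g in treebank_genres if g not in known_genres]
    if len(remaining) == 1:
        for index in unlabelled:
            labels[index] = remaining[0]
        unlabelled = []

    return labels, unlabelled
```

In a full loop, the newly labelled sentences would be added to the training set, the classifier retrained, and `known_genres` extended before the next round.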
Compared to closest sentence selection (SENT), both META and BOOT have the added benefit that no target data is required in order to make the final training data selection. The training corpus simply consists of all instances labelled as belonging to a genre (BOOT) or to a treebank containing the genre in question (META).

Closest Cluster Selection
GMM As shown by Aharoni and Goldberg (2020), monolingual BERT embeddings can be clustered into distinct domains using common clustering algorithms such as Gaussian Mixture Models (GMMs). Using mBERT embeddings, we evaluate whether this holds cross-lingually by clustering each treebank into the number of genres which it is said to contain according to the UD-provided metadata. Deviating from previous work, which only uses these clusters for preliminary analyses, we then use them directly for data selection. By computing a mean embedding for each cluster and choosing the closest one to the mean target sample embedding (same as SENT), the most similar data is selected in bulk from each treebank. By only selecting clusters from treebanks for which the metadata states that the target genre is contained, this allows us to identify clusters which most likely correspond to the target genre while avoiding the manual labelling of clusters across 104 languages.
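The cluster selection step can be sketched independently of the clustering algorithm. The following minimal illustration assumes each treebank has already been clustered (for example by a GMM) and is given as a hypothetical dictionary with its metadata genres and per-cluster embeddings; it returns the index of the closest cluster for every treebank whose metadata contains the target genre.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def closest_clusters(target_sample, treebanks, target_genre):
    """Pick one cluster per treebank that lists the target genre.

    treebanks: {name: {"genres": [...], "clusters": [[embedding, ...], ...]}}
    Returns {name: index_of_closest_cluster}.
    """
    target_mean = mean_vector(target_sample)
    selection = {}
    for name, info in treebanks.items():
        if target_genre not in info["genres"]:
            continue  # metadata says the genre is absent, so skip the treebank
        distances = [
            cosine_distance(mean_vector(cluster), target_mean)
            for cluster in info["clusters"]
        ]
        selection[name] = min(range(len(distances)), key=distances.__getitem__)
    return selection
```

All sentences in the selected clusters would then be pooled as training data, which is what makes the selection happen in bulk per treebank rather than per sentence.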
LDA We also evaluate a clustering method based purely on lexical features (i.e. n-grams) instead of pre-trained contextual embeddings. While the selection of the most relevant cluster from each treebank is performed using the same mean embedding distance methodology as for GMM, we use Latent Dirichlet Allocation (Blei et al., 2001; LDA) for the initial clustering step. This decouples the genre-segmentation step from the multitude of non-genre dimensions in the embeddings themselves, while simultaneously not relying on LDA alone for the final data selection (as in Plank and van Noord, 2011; Mukherjee et al., 2017). Furthermore, this setup allows us to extract genres from languages and scripts unknown to mBERT as well as to compare whether the GMM clusters correspond to those found by using surface-level lexical information alone.
Table 1: Target Treebanks with language family (FAMILY), inclusion in mBERT pre-training (MB; included (✓), excluded (×), highly-related languages included (∼)), total number of sentences (SIZE) and UD-provided GENRE.

Target Treebanks
We evaluate the effect of genre on training data selection using a set of 12 target treebanks from the low-resource end of UD. For our purposes, low-resource is defined as treebanks with more than 200 and fewer than 2,000 sentences in total and with fewer than 5,000 in-language sentences in UD.
In order to distinguish the effects of genre specifically, we only use single-genre target treebanks and leave the investigation of genre-mixtures to future work. As seen in Table 1, the final set of targets is diverse with respect to genre, language family and their availability during mBERT pre-training.
Only three of the target languages are included in mBERT pre-training, with seven not being covered at all and two having strongly related languages in mBERT's repertoire: Hindi-English (QHE) → Hindi, English as well as Turkish-German (QTD) → Turkish, German.
The six included genres cover the high-resource news and fiction as well as the medium-resource wiki and the lower-resource spoken, grammar-examples and social.

Data Selection Setup
In order to train parsers for these largely test-only treebanks, we compare seven proxy training data selection strategies for each target (note that only the first strategy uses in-language training data).
TARGET Where available, we use the true target training split as a performance upper bound against which to compare our methods. These are available for the six targets: SWL-SSLC, TA-TTB, GL-TreeGal, TE-MTG, QHE-HIENCS and QTD-SAGT. For three targets without training splits, we make use of proxy in-language data: SA-Vedic
RAND selects a random sample of n_rand sentences from the non-target-language UD. We do not restrict this selection to treebanks containing the target genre such that data from a more diverse pool of languages may be selected. To ensure an equivalent comparison, we set n_rand to the mean of the number of instances selected by BOOT, LDA and GMM (see Appendix C for values of n_rand).
SENT selection (see Section 4.1) is based on the mean embedding of 100 target sentences and retrieves the top-k closest out-of-language sentences from all of UD independently of genre. Since k needs to be chosen manually, we set it to the number of instances selected by GMM, which is equally dependent on mBERT embeddings.
META selects all non-target language treebanks which are denoted to contain the target genre (i.e. both single-genre treebanks as well as mixtures). These data pools make up the largest training corpora in our setup (up to 524k instances for news) and also subsume the other genre-based selection methods BOOT, LDA and GMM. In this way, it acts as an upper bound in terms of data quantity as well as a baseline for whether treebank-level metadata alone can aid data selection.
BOOT selects only the specific instances classified as being in the target genre for use as training data. The classifier is trained according to the bootstrapping method outlined in Section 4.2. In order to avoid the memorization of target data, we exclude all data in the target languages from the classifier training process.
Table 2: Zero-shot Parsing Results. LAS for test splits of target treebanks using training data from target/proxy in-language treebanks (TARGET; where available), random sentence selection (RAND), closest sentence selection (SENT), treebanks containing target genre (META), instances classified as target genre (BOOT) and closest cluster selection (GMM and LDA). Scores marked with † significantly outperform TARGET, RAND, SENT and META.
GMM clusters each treebank into the number of genres denoted by its metadata using mean-pooled mBERT embeddings for each sentence. Training data is then selected according to the closest-cluster procedure outlined in Section 4.3.
LDA works analogously to GMM, but uses LDA to cluster sentences. It uses bags of character 3-6-grams and no language-specific resources (e.g. stop word lists) in order to remain as cross-lingually comparable as possible. Hyperparameters were tuned as outlined in Section 5.3.
All methods relying on unannotated target data for the data selection process use 100 random sentences from the target treebank (the sample changes across random initializations). In practical terms, this corresponds to having access to a small amount of target-like data, without gold dependency structures, and selecting the best possible training data for which we do have annotations.
Alternatively, BOOT (as well as META and RAND implicitly) work in a fully zero-shot manner as we only assume knowledge of the intended target genre, but do not assume access to the target sentences nor their annotations.

Training Setup
We use the biaffine attention parser (Dozat and Manning, 2017) implementation of MaChAmp v0.2 (van der Goot et al., 2021) with default hyperparameters. Each step involving non-deterministic components is rerun using three random seeds.
For efficiency reasons, the seven largest treebanks were subsampled to 20k instances per split.
Performance is measured using the labeled attachment score (LAS) averaged across random initializations. Additionally, we report unlabeled attachment scores (UAS), the number of selected instances as well as the variance across runs in Appendix C. Significance is evaluated at α < 0.05 using a paired bootstrapped sign test with 10k resamples and Bonferroni correction (Bonferroni, 1936) for the multiple comparisons across random initializations. Appendix B lists all additional hyperparameter settings.
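As a rough illustration of this kind of significance testing, the following sketch implements one common paired bootstrap variant with a Bonferroni-adjusted threshold. It is an assumption-laden stand-in and may differ from the exact test used here: per-sentence scores (for instance the fraction of correctly attached tokens) are assumed as input, and the function names are hypothetical.

```python
import random


def paired_bootstrap_p_value(scores_a, scores_b, num_resamples=10_000, seed=0):
    """Estimate how often system A fails to beat system B under resampling.

    scores_a, scores_b: per-sentence scores of the two systems on the same
    test sentences (paired). Sentences are resampled with replacement and
    the fraction of resamples in which A's total is not strictly higher
    than B's is returned.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test requires equally long score lists")
    rng = random.Random(seed)
    n = len(scores_a)
    deltas = [a - b for a, b in zip(scores_a, scores_b)]

    not_better = 0
    for _ in range(num_resamples):
        total = sum(deltas[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / num_resamples


def bonferroni_alpha(alpha, num_comparisons):
    """Per-comparison significance level after Bonferroni correction."""
    return alpha / num_comparisons
```

With three random initializations, a difference would for example be called significant only if the p-value falls below `bonferroni_alpha(0.05, 3)`.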
It is important to note that besides the upper bound in-language setup (TARGET), no parser is trained on in-language data. For the tuning of method-specific hyperparameters (LDA features, BOOT thresholds), development sets of the five treebanks containing such splits were used: SWL-SSLC, TA-TTB, TE-MTG, QHE-HIENCS and QTD-SAGT (details in Appendix B). During parser training, development data for early stopping is based solely on the out-of-language data selected by each method and not on the in-language target data itself (also excluding constituent languages for code-switched targets). Results are reported on each target's test set without any further tuning.

Zero-shot Parsing Results
As expected, Table 2 shows that training the parser on target data (TARGET) results in the best overall performance even though the training corpora for these setups almost never exceed 1k instances. The target treebanks for which in-language data are available reach a final average of 50.28 LAS. This highlights the overall difficulty of parsing these low-resource treebanks. As the parser is initialized using mBERT, the scores on Tamil (TA), Galician (GL) and Telugu (TE), which are included in its pre-training, are highest overall compared to non-included languages or code-switched variants.
It is noteworthy that when a same-language proxy treebank was used for parser training, scores are lower compared to the other methods. In these three cases, namely Sanskrit (SA), Komi Zyrian (KPV) and Faroese (FO), none of the proxy treebanks include the target's genre which may be a strong contributing factor to this discrepancy.
Turning to our zero-shot setups, META data selection based on treebank-level annotations alone performs worst overall at 34.12 LAS despite constituting the largest training corpora in each setup (see Appendix C for details). Compared to the TARGET upper bounds, it shows how training on two orders of magnitude more data can still be insufficient if they do not follow the target distribution.
Both RAND and SENT outperform the META baseline at 36.48 and 36.78 LAS respectively. These aggregated scores also highlight that sentence-based selection alone does not capture cross-lingual characteristics well enough to outperform random selection in most cases.
In contrast, combining latent information in the MLM embeddings with higher-level genre information leads to performance increases not achievable by either method alone. Both GMM and LDA achieve the highest scores across the majority of target treebanks and the highest cross-lingual averages of 38.70 LAS and 38.71 LAS respectively. These scores reflect their similar performance across targets, however we do observe that LDA achieves slightly higher scores on languages which are not included in mBERT pre-training: e.g. Swedish Sign Language (SWL), Sanskrit (SA) and Komi Zyrian (KPV). We hypothesize that this is a result of GMM's dependence on latent information in the mBERT embeddings while LDA constructs clusters independently, based solely on surface-level lexical features (i.e. n-grams).
Finally, amplifying genre information in the mBERT embeddings using our BOOT method also leads to performance increases compared to using untuned embeddings or the coarser grained treebank-level metadata. While it does not entirely reach the performance of the cluster selection methods, its overall average of 37.69 LAS as well as generally similar performance patterns to LDA and GMM lead us to believe that all three methods are picking up on and are amplifying similar latent genre information. As an added benefit, BOOT is able to reach this competitive performance without the need for any target data samples (as opposed to GMM and LDA which use 100 raw samples for cluster selection).
Using our proposed genre-based selection methods we are therefore able to consistently outperform in-language/out-of-genre upper bounds for these low-resource target treebanks. Comparing our results to van der Goot et al. (2021), who train an identical parser architecture on each UD treebank's respective training split, proxy treebank (for test-only) or all of UD, our methods significantly outperform their best models on five of twelve target treebanks. There are significant increases for both SA-UFAL (16.5 → 23.7 LAS) and KPV-Lattice (11.7 → 22.3 LAS). For the targets YUE-HK (32.7 → 49.9 LAS), CKT-HSE (15.3 → 19.8 LAS) and FO-OFT (62.7 → 68.3 LAS), these scores furthermore constitute, to the best of our knowledge, state-of-the-art results without requiring annotated in-language data.

Analysis of Selected Data
Further analyzing the patterns of data selection allows us to identify some of the reasons behind the differences in performance (visualizations can be found in Appendix D).
RAND closely follows the overall data distribution in UD, selecting the most instances from the largest treebanks such as German-HDT (Borges Völker et al., 2019) and selecting none to almost none from low-resource treebanks. SENT follows a similar distribution albeit rarely selecting zero instances from any given language. This behaviour does not change substantially between targets, indicating less targeted data selection.
While the larger language diversity of the aforementioned RAND and SENT does not seem to be enough to outperform genre-selection in most cases, it can be helpful when in-genre data is not as linguistically diverse. For the targets SA-UFAL and MYV-JR (fiction) both methods outperform genre-based selection by around 2% LAS.
A clear example of insufficient in-genre data is the QHE-HIENCS target. It represents a highly specialized variation of the social genre, specifically Twitter data. Although the genre-based selec- , there is a lack of such in-genre data from other languages, leading these parsers to overfit on Italian specifically. This once again highlights the difficulty of selecting proxy training data which covers all desired characteristics, even from a dataset as diverse as UD.
In general, the genre-driven methods make fairly similar selections given their shared baseline pool of treebanks containing the target genre in-mixture (see Appendix D). However, since using all of these data results in the worst overall performance (META) while BOOT, GMM and LDA perform best, the targeted selection of relevant subsets within the larger META pool appears to be key. Frequently, large treebanks such as Polish-LFG (Patejuk and Przepiórkowski, 2018) with 14k instances from fiction, news, nonfiction, social and spoken are subsampled to a much smaller fraction (around 3k instances in this example). The fact that these proportions as well as the selected instances themselves are relatively consistent across same-genre targets corroborates that all our methods are picking up on similar, data-driven notions of genre. Figure 3 further visualizes the presence of latent genre using t-SNE plots of up to 1k randomly sampled sentence embeddings from each of UD's single-genre treebanks. In their untuned state (Figure 3a), some local genre clusters do manifest.
However, these mainly correspond to specialized treebanks such as the aforementioned Italian Twitter treebanks (social). Most other genres occur in language-level mixtures or in a large overall "blob" on the left. By amplifying genre explicitly using the BOOT procedure, each individual genre is much more clearly segmented (Figure 3b).
In conclusion, the presence of similar performance patterns across all our proposed genre-driven methods, despite their separate approaches to treebank segmentation (weakly supervised tuning for BOOT, treebank-internal embedding distances for GMM, n-grams for LDA), confirms our hypothesis that instance-level genre can be identified cross-lingually from contextualized representations and aids zero-shot parsing.

Conclusions
In absence of in-language training data, we have explored UD-specified genre as an alternative signal for data selection. While prior work had indicated the presence of genre information in monolingual contextualized embeddings (Aharoni and Goldberg, 2020), an analogous strategy using mBERT embeddings proved insufficient in the cross-lingual parsing setting (SENT), performing close to the random baseline (RAND). Relying on manual, treebank-level genre labels (META) proved even less performant, producing the lowest scores despite corresponding to a practitioner's typical first choice of selecting the largest number of training instances.
In order to enable finer-grained, instance-level data selection, we proposed three methods for combining latent genre information in the unsupervised contextualized representations with the treebank metadata: weakly supervised BOOT, sentence embedding-based GMM and n-gram-based LDA. Despite their different approaches to treebank segmentation, each method significantly outperformed the purely embedding-based SENT as well as the metadata (META) and random baselines (RAND). Their similar performance patterns and selected data distributions further indicate that each method is identifying a shared, data-driven notion of genre.
For future work, it will be important to extend our proposed approaches beyond single-genre targets towards genre-mixtures and more treebanks overall. As the data selected by these methods is further limited by the number of treebanks in each respective genre, combining a larger set of selection signals will be equally crucial.

For the early stopping of parser training, no such in-language validation data is used (to ensure a pure zero-shot setup). Instead, the data selected by each selection method is split in an 80%/20% fashion and is used as a proxy, out-of-language development split.
Similarly, the training of the bootstrapping classifier (BOOT) uses only the non-target-language portion of UD (i.e. excluding all treebanks of the 12 target languages plus constituent languages for code-switched). For efficiency reasons, this data is further subsampled to 40k total instances. Both the training and validation (used for early stopping) of BOOT are therefore similarly conducted without any target-language data.
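The subsampling and the 80%/20% proxy split described above can be sketched as follows. This is a minimal illustration and not the supplementary code: the function names, the placeholder instance IDs and the use of a seeded random shuffle are assumptions, since the text only specifies the split ratio and the 40k subsample size.

```python
import random

def subsample(instance_ids, limit=40_000, seed=42):
    """Randomly reduce a pool of instance IDs to at most `limit` items."""
    ids = list(instance_ids)
    if len(ids) <= limit:
        return ids
    return random.Random(seed).sample(ids, limit)

def proxy_split(instance_ids, dev_fraction=0.2, seed=42):
    """Shuffle the selected IDs and split them into train and proxy-dev parts.

    Assumption: the split is random; the paper only states the 80%/20% ratio.
    """
    ids = list(instance_ids)
    random.Random(seed).shuffle(ids)
    n_dev = int(len(ids) * dev_fraction)
    return ids[n_dev:], ids[:n_dev]

# Hypothetical instance IDs standing in for out-of-language UD sentences.
selected = [f"sent-{i}" for i in range(1000)]
train_ids, dev_ids = proxy_split(selected)
```

Because both portions come from the selected out-of-language data, early stopping on `dev_ids` never touches target-language sentences.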
Subsets Since data selection is at the core of this research, the exact instance IDs of each subset are available in the supplementary code.

B Model and Training Details
The following describes architecture and training details for all methods. When not further defined, default hyperparameters are used. Implementations are available in the supplementary code.
Infrastructure Neural models are trained on an NVIDIA A100 GPU with 40 GB of VRAM. Since most of our experiments do not require MLM sentence embeddings to be updated, we compute them once and store them on disk to save GPU cycles.
Clustering Methods Both Gaussian Mixture Models (GMM) and Latent Dirichlet Allocation (Blei et al., 2001; LDA) use implementations from scikit-learn v0.23 (Pedregosa et al., 2011). LDA uses bags of character 3-6-grams which occur in at least two and in at most 30% of sentences. The n-gram sizes were initially tuned on target treebanks with available development sets (see Appendix A). We found character 1-5-grams to perform approximately 2.5 LAS worse and word unigrams to perform approximately 2 LAS worse than the final method. GMMs use the mBERT sentence embeddings directly as input. Both methods are CPU-bound and complete the clustering of all treebanks in UD in under 45 minutes.
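To make the LDA input features concrete, the following standard-library sketch extracts character 3-6-grams and keeps only those occurring in at least two and in at most 30% of sentences. It illustrates the filtering rule only; the paper's implementation relies on scikit-learn, and the function names and toy sentences here are assumptions.

```python
from collections import Counter

def char_ngrams(sentence, n_min=3, n_max=6):
    """Return all character n-grams of length n_min..n_max, with repeats."""
    return [
        sentence[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(sentence) - n + 1)
    ]

def build_vocabulary(sentences, min_count=2, max_ratio=0.3):
    """Keep n-grams found in >= min_count and <= max_ratio of all sentences."""
    doc_freq = Counter()
    for sentence in sentences:
        doc_freq.update(set(char_ngrams(sentence)))  # count each sentence once
    max_count = max_ratio * len(sentences)
    return sorted(g for g, df in doc_freq.items() if min_count <= df <= max_count)

def bag_of_ngrams(sentence, vocabulary):
    """Count vector of one sentence over the filtered vocabulary."""
    counts = Counter(char_ngrams(sentence))
    return [counts[g] for g in vocabulary]

# Toy sentences (hypothetical); only "abc" and "xyz" occur in two of them.
toy = ["abcd", "abce", "xyz1", "xyz2", "qrst", "mnop", "efgh", "ijkl", "uvwx", "0123"]
vocab = build_vocabulary(toy)
vector = bag_of_ngrams("abcabc", vocab)
```

The resulting count vectors are the kind of bag-of-n-grams representation a topic model such as LDA takes as input.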

Multilingual Language Model
Bootstrapping (BOOT) builds on the standard mBERT architecture as follows: mBERT → CLS → linear layer (d_emb × 18) → softmax. The training has an epoch limit of 100 with early stopping after 3 iterations without improvements on the development set. No target-language data is used during this process. An alternate bootstrapping threshold of 0.9 was evaluated and found to perform approximately 1 LAS worse on the development subset (see Appendix A) than the final value of 0.99. Backpropagation is performed using AdamW (Loshchilov and Hutter, 2017) with a learning rate of 10^-7 on batches of size 16. The fine-tuning procedure requires GPU hardware which can host mBERT, corresponding to 10 GB of VRAM. Training on the subsampled 40k-instance, non-target-language data takes approximately seven hours.
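The role of the bootstrapping threshold can be illustrated with a small sketch of the final softmax step over the 18 genre classes. This is not the authors' implementation: it assumes that the threshold is compared against the maximum softmax probability of an instance, and the function names and example logits are hypothetical.

```python
import math

NUM_GENRES = 18   # output size of the linear layer described above
THRESHOLD = 0.99  # final bootstrapping threshold reported above

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confident_genre(logits, threshold=THRESHOLD):
    """Return the predicted genre index if its probability reaches the
    threshold, otherwise None (assumed reading of the bootstrapping rule)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None

# Hypothetical classifier outputs for three sentences.
strong = [0.0] * NUM_GENRES
strong[3] = 10.0
medium = [0.0] * NUM_GENRES
medium[3] = 6.0
weak = [0.0] * NUM_GENRES
weak[3] = 5.0
```

Under this reading, lowering the threshold from 0.99 to 0.9 admits more instances at the cost of less certain genre assignments, as with `medium` in the example.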
Dependency Parsers Every parsing experiment in the main paper uses a biaffine attention parser (Dozat and Manning, 2017) implemented in the MaChAmp v0.2 framework (van der Goot et al., 2021) using default hyperparameters. The sentence encoder is initialized with standard mBERT weights. The training duration depends primarily on the quantity of input data. For the largest corpus (META for TA-TTB with 524k instances) this corresponds to 55 hours. Our proposed methods create smaller, targeted training corpora (around 80k instances on average) such that a better-performing parser can be trained in approximately 90 minutes on the same hardware.
Random Initializations Each experiment is run thrice using the seeds 41, 42 and 43. This relates to the random subsampling of data as well as to model initialization (both parsers and selection).

C Additional Results
In addition to the labeled attachment scores (LAS) reported in the main paper, we list LAS standard deviation across random initializations in Table 5, unlabeled attachment scores (UAS) in Table 4 as well as the number of selected training instances per method in Table 3.
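For reference, the two attachment scores and their aggregation across seeds can be sketched as below. UAS counts tokens with the correct head, while LAS additionally requires the correct relation label. The data layout, function name and per-seed scores are placeholders for illustration only, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent.

    gold / predicted: one (head_index, relation_label) pair per token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100 * uas_hits / len(gold), 100 * las_hits / len(gold)

# Hypothetical four-token sentence: one wrong label, one wrong head.
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
pred = [(2, "det"), (3, "obj"), (0, "root"), (1, "obj")]
uas, las = attachment_scores(gold, pred)

# Placeholder LAS values for the seeds 41, 42 and 43 (not results from the paper).
seed_las = {41: 50.0, 42: 52.0, 43: 54.0}
las_mean = mean(seed_las.values())
las_std = stdev(seed_las.values())  # sample standard deviation (assumption)
```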