Revisiting Tri-training of Dependency Parsers

We compare two orthogonal semi-supervised learning techniques, namely tri-training and pretrained word embeddings, in the task of dependency parsing. We explore language-specific FastText and ELMo embeddings and multilingual BERT embeddings. We focus on a low-resource scenario as semi-supervised learning can be expected to have the most impact here. Based on treebank size and available ELMo models, we select Hungarian, Uyghur (a zero-shot language for mBERT) and Vietnamese. Furthermore, we include English in a simulated low-resource setting. We find that pretrained word embeddings make more effective use of unlabelled data than tri-training but that the two approaches can be successfully combined.


Introduction
Pre-trained neural architectures and contextualised word embeddings are state-of-the-art approaches to combining labelled and unlabelled data in natural language processing tasks that take text as input. A large corpus of unlabelled text is processed once and the resulting model is either fine-tuned for a specific task or its hidden states are used as input for a separate model. In the task of dependency parsing, recent work is no exception to the above. However, earlier, pre-neural work explored many other ways to use unlabelled data to enrich a parsing model. Among these, self-, co- and tri-training had the most impact (Charniak, 1997; Steedman et al., 2003; McClosky et al., 2006a,b; Søgaard and Rishøj, 2010; Sagae, 2010).
Self-training augments the labelled training data with automatically labelled parse trees predicted by a baseline model in an iterative process:
1. Select unlabelled sentences to be parsed in this iteration
2. Parse the sentences with the current model
3. Optionally discard some of the parse trees, e. g. based on parser confidence
4. Optionally oversample the original labelled data to give it more weight
5. Train a new model on the concatenation of manually labelled and automatically labelled data
6. Check a stopping criterion
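The following Python sketch makes this generic loop concrete. The callables train_parser, parse and confidence, as well as the default parameter values, are placeholders standing in for an actual parser interface rather than part of any specific toolkit.

```python
import random

def self_training(labelled, unlabelled, train_parser, parse, confidence,
                  threshold=0.9, oversample=1, max_iterations=5):
    """Minimal sketch of the generic self-training loop outlined above."""
    model = train_parser(labelled)
    for _ in range(max_iterations):
        # 1. select unlabelled sentences to be parsed in this iteration
        batch = random.sample(unlabelled, min(1000, len(unlabelled)))
        # 2. parse the sentences with the current model
        predictions = [parse(model, sentence) for sentence in batch]
        # 3. optionally discard parse trees, e.g. based on parser confidence
        kept = [tree for tree in predictions if confidence(tree) >= threshold]
        # 4. optionally oversample the original labelled data
        training_data = labelled * oversample + kept
        # 5. train a new model on manually plus automatically labelled data
        model = train_parser(training_data)
        # 6. stopping criterion: no confident new parses in this iteration
        if not kept:
            break
    return model
```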
Co-training proceeds similarly to self-training but uses two different learners, each teaching the other, i. e. the output of learner A is added to the training data of learner B and vice versa. Tri-training uses three learners and only adds predictions to a learner that the other two learners, the teachers, agree on. As with co-training, the roles of teachers are rotated so that all three learners can receive newly labelled data.
We compare tri-training and contextualised word embeddings in the task of dependency parsing, using the same unlabelled data for both approaches. In this comparison, we try to answer:
1. How does semi-supervised learning with tri-training compare to semi-supervised learning with a combination of context-independent and contextualised word embeddings?
2. Are the above two approaches orthogonal, i. e. do we get an additional boost if we combine them?
3. How do these three approaches compare to the baseline of using only the manually labelled data?
We focus on low-resource languages as (a) maximising the benefits from semi-supervised learning even at high computational costs is most needed for low-resource languages, e. g. to reduce the editing effort in the manual annotation of additional data, and (b) tri-training with high-resource languages comes at much higher computational costs: not only is the manually labelled data much larger, but the automatically labelled data can also be expected to need to grow at least proportionally to have a relevant effect. We select three low-resource languages, namely Hungarian, Uyghur and Vietnamese (see Section 3.3 for selection criteria), and English, simulating a low-resource scenario by sampling a subset of the available data.
The results of our experiments show that 1) both tri-training and pretrained word embeddings offer an obvious improvement over a fully supervised approach, 2) pretrained word embeddings clearly outperform tri-training (between 2 and 5 LAS points, depending on the language), and 3) there is some merit in combining the two approaches since the best performing model for each of the four languages is one in which tri-training is applied with models which use pretrained embeddings.

Tri-training
Tri-training has been used to tackle various natural language processing problems including dependency parsing (Søgaard and Rishøj, 2010), part-of-speech tagging (Søgaard, 2010; Ruder and Plank, 2018), chunking (Chen et al., 2006), authorship attribution (Qian et al., 2014) and sentiment analysis (Ruder and Plank, 2018). Approaches differ not only in the type of task (sequence labelling, classification, structured prediction) but also in the flavour of tri-training applied. These differences take the form of the method used to introduce diversity into the three learners, the number of tri-training iterations and whether a stopping criterion is employed, the balance between manually and automatically labelled data, the selection criteria used to add an automatically labelled instance to the training pool, and whether automatic labels from previous iterations are retained.

Zhou and Li (2005) introduce tri-training. They experiment with 12 binary classification tasks with data sets from the UCI machine learning repository, using bootstrap samples for model diversity. Each pair of learners, the teachers, sends their unanimous predictions to the remaining third learner if (a) the error rate, as measured on the subset of the manually labelled data which the two teachers agree on, is below a threshold and (b) the total number of items that the teachers agree on, and therefore can hand over to the learner, reaches a minimum number that is adjusted in each round for each learner. There is also an upper limit on the size of the newly received data, enforced by downsampling if it is exceeded. A learner's model is updated by training on the concatenation of the full set of manually labelled data (before sampling) and the predictions received from the teachers. If no predictions are received, a learner's model is not updated. Tri-training stops when no model is updated.

Chen et al. (2006) apply tri-training to a sequence labelling task, namely chunking, and discuss sentence-level instance selection as a deviation from vanilla tri-training. They propose a "two agree one disagree" method in which the learner only accepts a prediction from its teachers when it disagrees with them. Søgaard (2010) reinvents this method and coins the term tri-training with disagreement for it.
Li and Zhou (2007) extend tri-training to more than three learners and relax the requirement that all teachers must agree by using their ensemble prediction. They apply this to an ensemble of decision trees, i. e. a random forest, and call the method co-forest. As to the risk of performance deteriorating due to wrong labelling decisions, they point to previous work showing that the effect can be compensated for with a sufficient amount of data if certain conditions are met, and they include these conditions in the co-forest algorithm.

Guo and Li (2012) identify issues with the update criterion of tri-training and with the estimation of error rates on training data and propose two modified methods, one improving performance in 19 of 33 test cases (eleven tasks and three learning algorithms) and the other improving performance in 29 of 33 cases. Fazakis et al. (2016) compare self-, co- and tri-training combined with a selection of machine learning algorithms on 52 datasets and include a setting where self-training is carried out with logistic model trees, a type of decision tree classifier that has logistic regression models at its leaf nodes. Tri-training with C4.5 decision trees comes second in their performance ranking after self-training with logistic model trees. However, logistic model trees are not tested with co- or tri-training.

Chen et al. (2018) adjust tri-training to neural networks by sharing parameters between learners for efficiency. Furthermore, they add random noise to the automatic labels to encourage model diversity and to regularise the models, and, in addition to teacher agreement, they require teacher predictions made with dropout (as in training) to be stable. Ruder and Plank (2018) also propose to share all but the final layers of a neural model between the three learners in tri-training for sentiment analysis and POS tagging. They add an orthogonality constraint on the features used by two of the three learners to encourage diversity. Furthermore, they apply multi-task training in tri-training and they modify the tri-training algorithm to exclude the manually labelled data from the training data of the third learner.

Tri-training in Dependency Parsing
Tri-training was first applied to dependency parsing by Søgaard and Rishøj (2010), who combine tri-training with stacked learning in multilingual graph-based dependency parsing. For each language, 100k sentences are automatically labelled using three different stacks of token-level classifiers for arcs and labels, resulting in state-of-the-art performance on the CoNLL-X Shared Task (Buchholz and Marsi, 2006).
In an uptraining scenario, Weiss et al. (2015) train a neural transition-based dependency parser on the unanimous predictions of two slower, more accurate parsers. This can be seen as tri-training with one iteration and with just one learner's model as the final model. Similarly, Vinyals et al. (2015) use single-iteration, single-direction tri-training in constituency parsing, where the final model is a neural sequence-to-sequence model with attention, which learns linearised trees.

Comparing Cross-view Training and Pretraining in NLP
The only previous work we know of that compares pretrained contextualised word embeddings to another semi-supervised learning approach is that of Bhattacharjee et al. (2020).

Experimental Setup
This section describes the technical details of the experimental setup.

Tri-Training Algorithm
We provide an overview of our tri-training algorithm. 1 Before the first tri-training iteration, three samples of the labelled data are taken and initial models are trained on them. Each tri-training iteration compiles three sets of automatically labelled data, one for each learner, feeding predictions that two learners agree on to the third learner. 2 In case all three learners agree, we randomly pick a receiving learner. 3 At the end of each tri-training iteration, the three models are updated with new models trained on the concatenation of the manually labelled and automatically labelled data selected for the learners.
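As an illustration of the agreement check and the routing of predictions, the sketch below spells out the control flow in Python. The tuple-based tree representation and the function names are our own simplifications for illustration; the actual implementation compares full CoNLL-U annotations (see footnote 2).

```python
import random

def teachers_agree(tree_a, tree_b):
    """True if two teachers produced identical annotations for a sentence.
    A tree is assumed to be a list of per-token tuples
    (lemma, upos, xpos, feats, head, deprel); every column of every token
    must match (footnote 2)."""
    return len(tree_a) == len(tree_b) and all(
        a == b for a, b in zip(tree_a, tree_b))

def route_prediction(trees, rng=random):
    """Given the three learners' trees for one sentence, return a pair
    (index of the receiving learner, agreed tree), or None if no two
    teachers agree."""
    # teachers (a, b) send their agreed tree to the remaining learner
    pairs = [((1, 2), 0), ((0, 2), 1), ((0, 1), 2)]
    agreeing = [(learner, trees[a])
                for (a, b), learner in pairs
                if teachers_agree(trees[a], trees[b])]
    if not agreeing:
        return None
    # if all three models agree, pick the receiving learner at random (footnote 3)
    return agreeing[rng.randrange(len(agreeing))]
```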

Parameter Selection
We explore three tri-training parameters:
• A: the amount of automatically labelled data combined with labelled data when updating a model at the end of a tri-training iteration
• T: the number of tri-training iterations
• d: how much weight is given to data from previous iterations. The current iteration's data is always used in full. No data from previous iterations is added with d = 0. For d = 1, all available data is concatenated. With d < 1, we apply exponential decay to the dataset weights, e. g. for d = 0.5 we take 50% of the data from the previous iteration, 25% from the iteration before the last one, etc.

Footnote 1: Full pseudocode is provided in Appendix A. We share our source code, basic documentation and training log files (including development and test scores of each learner for all iterations) at https://github.com/jowagner/mtb-tri-training.
Footnote 2: We require all predictions (lemmata, universal and treebank-specific POS tags, morphological features, dependency heads and dependency labels including language-specific subtypes) for all tokens of a sentence to agree. The main reason is simplicity: the parser UDPipe-Future expects training data with all predictions as it trains on them jointly (multi-task learning). If we allowed disagreement between teachers on some of the tag columns, we would have to come up with a heuristic to resolve such disagreements, complicating the experiment. Furthermore, we hypothesise that full agreement increases the likelihood of the syntactic prediction being correct. The agreement can be seen as a confidence measure or quality filter.
Footnote 3: Restricting the knowledge transfer to a single learner is a compromise between vanilla tri-training, which lets all three learners learn from unanimous predictions, and tri-training with disagreement (Chen et al., 2006), which lets none of the learners learn from such predictions. Furthermore, this modification (together with rejecting duplicates while sampling the unlabelled data) increases the diversity of the data sets and therefore may help keep the learners' models diverse.
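The effect of the decay parameter d on the composition of a learner's training data can be sketched as follows. Truncating each older dataset by slicing stands in for the sampling actually used, and the function and argument names are ours.

```python
def assemble_training_data(seed_data, data_per_iteration, d):
    """Combine the current iteration's automatically labelled data with
    decayed portions of earlier iterations' data, controlled by d.

    data_per_iteration: one dataset (list of sentences) per tri-training
    iteration so far, oldest first; the last entry is the current iteration.
    """
    training_data = list(seed_data)
    t = len(data_per_iteration)
    for t_prev, data in enumerate(data_per_iteration, start=1):
        weight = d ** (t - t_prev)         # 1.0 for the current iteration
        keep = int(round(weight * len(data)))
        # e.g. d=0.5 keeps 25%, 50%, 100% going from older to newer data
        training_data.extend(data[:keep])
    return training_data
```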
For a fair comparison of tri-training with and without word embeddings, we take care that A, T and d are explored equally well in both settings and that each comparison is based on results for the same set of parameters. Based on the observations in Appendices B.2 to B.4 and balancing accuracy, number of runs and computational costs, we perform twelve runs for each language and parser:
• one run with A = 40k, T = 12 and d = 1
• one run with A = 80k, T = 8 and d = 1
• two runs with A = 80k, T = 8 and d = 0.5
• two runs with A = 160k, T = 4 and d = 0.5
• the above six runs in a variant where the seed data is oversampled to match the size of the unlabelled data for the model updates at the end of each tri-training iteration.

Choice of Languages
Since we focus on low-resource languages, we select the three treebanks with the smallest amount of training data from UD v2.3 meeting the following criteria:
• The treebank has a development set.
• An ELMoForManyLangs model (Section 3.5) is available for the target language. Sign languages and transcribed spoken treebanks are not covered.
• Surface tokens are included in the public UD release.

Unlabelled Data
To match the training data of the word embeddings (Section 3.5), we use the Wikipedia and Common Crawl data of the CoNLL 2017 Shared Task in UD Parsing (Ginter et al., 2017;Zeman et al., 2017) as unlabelled data in tri-training. We downsample the Hungarian data to 12%, the Vietnamese data to 6% and the English data to 2% of sentences to reduce disk storage requirements. All data, including data for Uyghur, is further filtered by removing all sentences with less than five or more than 40 tokens 4 and the order of sentences is randomised. We then further sample the unlabelled data in each tri-training iteration to a subset of fixed size to limit the parsing and training costs. The last two columns of Table 1 show the size of the unlabelled data sets after filtering and sampling. 5
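The preparation of the unlabelled data can be sketched as below; sentences are assumed to be lists of tokens, and the function name and default values are ours rather than part of the released code.

```python
import random

def prepare_unlabelled(sentences, keep_fraction=1.0, min_len=5, max_len=40, seed=42):
    """Downsample, length-filter and shuffle unlabelled sentences:
    keep roughly keep_fraction of the sentences (e.g. 0.12 for Hungarian),
    drop sentences with fewer than min_len or more than max_len tokens,
    and randomise the order of the remaining sentences."""
    rng = random.Random(seed)
    downsampled = [s for s in sentences if rng.random() < keep_fraction]
    filtered = [s for s in downsampled if min_len <= len(s) <= max_len]
    rng.shuffle(filtered)
    return filtered
```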

Parser and Word Embeddings
For the parsing models of the individual learners in tri-training, we use UDPipe-Future (Straka, 2018). This parser jointly predicts the parse tree, lemmata, universal and treebank-specific POS tags and morphological features. Since its input at prediction time is just tokenised text, it can be directly applied to unlabelled data while still exploiting the lemmata and tags annotated in the labelled data to obtain strong models. We use UDPipe-Future in two configurations:
• udpf: UDPipe-Future with internal word and character embeddings only. This parser is for semi-supervised learning via tri-training only, i. e. the unlabelled data only comes into play through tri-training. The parser's word embeddings are trained on the labelled data only.

An unusual feature of UDPipe-Future is that it oversamples the training data to 9,600 sentences in each epoch if the training data is smaller than that. This automatic oversampling enables the parser to perform well on most UD treebanks without tuning the number of training epochs. In our experiments, this behaviour is triggered in settings with a low or medium augmentation size A, except when data from previous iterations is combined (d > 0) and the number of iterations T is not small. We make a small modification to the default learning rate schedule of UDPipe-Future, softening its single large step from 0.001 to 0.0001 into five smaller steps, keeping the initial and the final learning rate. The seed for pseudo-random initialisation of the parser's neural network is derived from the seed that randomises the sampling of data in each tri-training experiment, an identifier of the tri-training parameters, an indicator of whether the run is a repeat run, the learner number i and the tri-training iteration t.

Footnote 4: In preliminary experiments with the English LinES treebank and without a length limit, learners rarely agreed on predictions for longer sentences. This means that long sentences are unlikely to be selected by tri-training as new training data and the increased computational cost of parsing long sentences does not seem justified. We also exclude very short sentences as we do not expect them to feature new syntactic patterns and, if they do, we do not expect them to provide enough context to infer the correct annotation.
Footnote 5: These numbers do not reflect the removal of sentences that contain one or more tokens with over 200 bytes in their UTF-8-encoded form, nor the de-duplication performed before parsing the unlabelled data.

Ensemble Method for Candidate Models
Candidate models for the final model are created at each tri-training iteration by combining the current models of the three learners in an ensemble using linear tree combination (Attardi and Dell'Orletta, 2009) as implemented 11 by Barry et al. (2020). 12 Candidate ensembles are evaluated on development data using the CoNLL'18 evaluation script.

Footnote 11: https://github.com/jowagner/ud-combination
Footnote 12: As we observed that the performance of the ensembles on the development data varies considerably with the random initialisation of the tie breaker in the combiner's greedy search, e. g. obtaining LAS scores from 75.52 to 75.62 for ten repetitions, we run the combiner 21 times with different initialisations and report the average LAS.
At tri-training iteration zero, i. e. before any unlabelled data is used, runs with the same language and parser differ only in the random initialisation of the models. A large number of additional ensembles can therefore be built by picking a model for each learner from different runs. These ensembles behave like ensembles from new runs, allowing us to study the effect of random initialisation using a much more accurate estimate of the LAS distribution than is possible with the ensembles of the individual runs. We obtain 4096 LAS values for each language and parser as follows:
1. For each learner i ∈ {1, 2, 3}, we partition the available models into 16 buckets in order of development LAS.
2. We enumerate all 16^3 = 4096 combinations of buckets and, from each bucket combination, we sample one combination of three models, one model per bucket.
3. The three selected models are combined using the linear tree combiner and evaluated.
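A sketch of this bucket-based enumeration is given below, assuming each learner has at least 16 models and representing each model as a pair of its development LAS and an arbitrary model object; the function name and data layout are ours.

```python
import itertools
import random

def enumerate_bucket_ensembles(models_per_learner, n_buckets=16, seed=0):
    """Partition each learner's models into n_buckets consecutive groups by
    development LAS and build one ensemble per bucket combination by sampling
    one model per bucket (16**3 = 4096 combinations for three learners).

    models_per_learner: three lists of (dev_las, model) pairs.
    """
    rng = random.Random(seed)
    buckets = []
    for models in models_per_learner:
        ranked = sorted(models, key=lambda pair: pair[0])
        # split each learner's ranked models as evenly as possible
        bounds = [round(i * len(ranked) / n_buckets) for i in range(n_buckets + 1)]
        buckets.append([ranked[bounds[i]:bounds[i + 1]] for i in range(n_buckets)])
    ensembles = []
    for combo in itertools.product(range(n_buckets), repeat=3):
        ensembles.append(tuple(rng.choice(buckets[learner][b])[1]
                               for learner, b in enumerate(combo)))
    return ensembles
```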

Development Set Results
We compare semi-supervised learning with tri-training to semi-supervised learning with a combination of context-independent and contextualised word embeddings, and we compare these two approaches to their combination and to training without unlabelled data, i. e. supervised learning, addressing the three questions posed in Section 1. As described in Section 3.2, we obtain twelve LAS scores from tri-training with each parser and language. Tri-training with T iterations can choose an ensemble from T + 1 ensembles (Appendix E). To give the baselines the same number of models to choose from, we sample from the baseline ensembles described in Section 3.6. For each tri-training run with T iterations, we select the best score from T + 1 baseline scores. As the choice of the subset of T + 1 scores is random, we repeat the random sampling 250,000 times to get an accurate estimate of the LAS distribution. In other words, we simulate what would happen if the additional data obtained through tri-training had no effect on the parser.

Figure 1 compares the parsing performance with and without tri-training for the four development languages and for the three types of parsing models udpf, elmo and mBERT. Most distributions for each language are clearly separated and the order of methods is the same: both tri-training and external word embeddings yield clear improvements. External word embeddings have a much stronger effect than tri-training. In combination, the two semi-supervised learning methods yield a small additional improvement, with an average score difference of over half an LAS point.
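The resampling of baseline scores described above can be sketched in a few lines; the function name and the fixed seed are ours, and baseline_scores stands for the development LAS values of the baseline ensembles.

```python
import random

def simulate_best_of_baselines(baseline_scores, t_plus_1, repetitions=250_000, seed=0):
    """Simulate tri-training runs that have no effect: repeatedly draw
    t_plus_1 baseline ensemble scores at random (without replacement) and
    keep the best, mirroring model selection over T + 1 iterations."""
    rng = random.Random(seed)
    return [max(rng.sample(baseline_scores, t_plus_1))
            for _ in range(repetitions)]
```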

Error Analysis
In this section, we probe, using the development sets, how the error distribution changes as we add tri-training and/or ELMo word embeddings trained on unlabelled data to the basic parser. We compare:
1. tokens that are out-of-vocabulary relative to the manually labelled training data versus those that are in the training data ("OOV"/"IV");
2. different sentence length distributions: up to 9, 10 to 19, 20 to 39, and 40 or more tokens;
3. different dependency labels, e. g. is there a marked difference in the effect of tri-training or word embeddings for particular label types, such as nsubj?
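Such a breakdown can be computed along the lines of the following sketch; the per-token tuple format, the vocabulary lookup and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def length_bucket(n_tokens):
    """Sentence-length buckets used in the comparison above."""
    if n_tokens <= 9:
        return "len<=9"
    if n_tokens <= 19:
        return "len10-19"
    if n_tokens <= 39:
        return "len20-39"
    return "len>=40"

def las_breakdown(sentences, train_vocab):
    """LAS per group. Each sentence is assumed to be a list of tuples
    (form, gold_head, gold_label, pred_head, pred_label)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sent in sentences:
        bucket = length_bucket(len(sent))
        for form, gold_head, gold_label, pred_head, pred_label in sent:
            hit = int(gold_head == pred_head and gold_label == pred_label)
            keys = (bucket,
                    "IV" if form in train_vocab else "OOV",
                    gold_label)
            for key in keys:
                correct[key] += hit
                total[key] += 1
    return {key: correct[key] / total[key] for key in total}
```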
How does tri-training help (with no embeddings)? As expected, tri-training brings a clearly greater improvement for OOVs than for IVs in all four languages. The role of sentence length is not consistent across languages. For English, tri-training helps most on longer sentences (> 20 words), for Hungarian on short sentences (< 10 words), and for Uyghur on very long sentences (> 40 words). Sentence length does not appear to be a factor for Vietnamese. Regarding dependency labels, there is no clear pattern across languages.

(Figure 1 shows one panel per language, English, Hungarian, Uyghur and Vietnamese, each with LAS distributions for the parsers "udpf", "elmo" and "mbert".)

How do word embeddings help (with no tri-training)? Analysis of the improvement by sentence length and by OOV status shows similar trends to the tri-training improvements described above. Across languages, the use of pretrained embeddings helps to correctly identify the flat relation (which is used in names and dates).

Test Set Results
In this section, we verify to what extent our main observations on development data carry over to test data and include results for a parser using only FastText as external word embeddings. For each of the LAS distributions using tri-training in Figure 1, and for the fasttext distribution not shown there, we select the ensemble with the highest development LAS for testing. Since we also use model selection based on development LAS to choose the final model of each tri-training run from its T + 1 iterations, the best model is selected from a set of 50 models given the values of T listed in Section 3.2, exceeding the number of baseline models available from iteration 0. For a fair comparison, we therefore leverage the 4096 baseline ensembles described in Section 3.6. As a choice of 50 out of 4096 ensembles would introduce noise, we repeatedly draw samples, for each sample find the best model according to development LAS, obtain its test LAS and report the average LAS over all samples, i. e. the expected value. As was the case for the development results, we run the linear tree combiner 21 times on the three individual predictions of the tri-training learners and take the average LAS over all combiner runs as the score of the ensemble.

Table 2 shows development and test set LAS for the models selected as described above. The test set results confirm the development result that a combination of tri-training and contextualised word embeddings consistently gives the best results and that the individual methods improve performance. In keeping with the development results, contextualised word embeddings yield higher gains than tri-training. The test results also confirm the development observation that multilingual BERT does not work as well as language-specific ELMo for Uyghur, a zero-shot language for multilingual BERT.

Conclusion
We compared two semi-supervised learning methods in the task of dependency parsing for three low-resource languages and English in a simulated low-resource setting. Tri-training was effective but could not come close to the performance gains of contextualised word embeddings. Combined, the two learning methods achieved small additional improvements between 0.2 LAS points for Uyghur and 1.3 LAS points for Vietnamese. Whether these gains can justify the additional costs of tri-training will depend on the application.
We recommend that users of tri-training vary settings and repeat runs to find good models. Future work could therefore explore how to best combine the many models or the large amount of automatically labelled data that such experiments produce. To obtain a fast and strong final model, a combination of ensemble search and model distillation or up-training may be the next step. Integrating cross-view training (Clark et al., 2018) into tri-training may also be fruitful, similarly to the integration of multi-view learning into co-training (Lim et al., 2020). The requirement of tri-training that two teachers must agree changes the sentence length distribution of the selected data and may introduce other biases. Future work could try to counter this effect by re-sampling the predictions, similarly to how Droganova et al. (2018) corrected for such effects in self-training.
While our literature review suggests that tri-training performs better than co- and self-training, it would be interesting to see how these methods compare under a fixed computation budget, as the latter methods train fewer parsing models per iteration.

Ethics and Broader Impact
Tri-training uses much smaller amounts of unlabelled data than the state-of-the-art semi-supervised method of self-supervised pre-training, and we therefore do not expect tri-training to add new risks from undesired biases in the unlabelled data. The use of tri-training may, however, pose new challenges in detecting problematic effects of issues in unlabelled data, as existing inspection methods may not be applicable.
An individual tri-training run with FastText and multilingual BERT word embeddings and A = 80k, T = 8 and d = 0.5 typically takes three days on a single NVIDIA GeForce RTX 2080 Ti GPU. Overall, we estimate that our experiments took 2500 GPU days. This large GPU usage stems from the exploration of tri-training parameters in Appendix B. Future work can build on our observations and thereby reduce computational costs.

A Tri-training Algorithm
Algorithm 1 shows the tri-training algorithm in the form in which we use it in this work. An extended version of the description in Section 3.1 follows.
Algorithm 1: Tri-training in this work.
Input: L: labelled data; U: unlabelled data; A: maximum number of items to add per iteration and learner; T: number of tri-training iterations; d: decay parameter, 0 ≤ d ≤ 1.
Output: models {h 1 , h 2 , h 3 } for the ensemble.

Lines 1-3: Before the first tri-training iteration, three samples B i of the labelled data L are taken and initial models h i are trained on them (i ∈ {1, 2, 3}). We sample without replacement and with a target size of 2.5 times the size of L, repopulating the sampling urn each time it becomes empty.

Lines 5-7: For each tri-training iteration, we further sample a de-duplicated subset U of the unlabelled data, as processing all available unlabelled data would not be practical for most languages in our experiments (Section 3.4). 15,16

Lines 5, 8-18: Each tri-training iteration t compiles three sets of automatically labelled data L t,i , one for each learner i, feeding predictions that two learners agree on 17 to the third learner (lines 10-17). In case all three learners agree, we randomly pick a receiving learner. 18 While Zhou and Li (2005) do not state in their pseudocode that j and k must be different, this is clear from their description.

Lines 21-22: We limit the size of the data sets L t,i to A, downsampling them if needed.
Lines 19, 23-35: While Zhou and Li (2005) update the models h i by directly training on L ∪ L t,i , we experiment with concatenating data from previous tri-training iterations (lines 24-25) and we use B i instead of L (line 26). 19 The parameter d controls how much weight is given to data from previous iterations. The current iteration's data is always used in full. No data from previous iterations is added with d = 0. For d = 1, all available data is concatenated. With d < 1, we apply exponential decay to the dataset weights, e. g. for d = 0.5 we take 50% of the data from the previous iteration, 25% from the iteration before the last one, etc. (line 25). 20 At the end of each tri-training iteration (line 26), the three models h i are updated with new models trained on the concatenation of the manually labelled and automatically labelled data selected for the learners. 21

Footnote 15: In Zhou and Li (2005)'s experiments, all datasets are small, the largest having 3772 items.
Footnote 16: We set the size of U so that we do not expect it to be a limiting factor. In preliminary experiments with the English LinES treebank, we observed that 4A is sufficient to obtain at least A new labelled items. We set the size of U to 16A to account for likely variation in the rate of agreement between models when switching to other treebanks.
Footnote 17: See Footnote 2.
Footnote 18: See Footnote 3.
Footnote 19: Initially, the latter was an error on our side but, given that we ensure that each item in L is included in each B i at least twice, keeping the B i seems the better fit: L no longer provides additional labelled data, the diversity of the learner models is improved, and moderate oversampling of L can be expected to be helpful.
Footnote 20: We do not restrict experiments to a single value of d as tri-training is considerably faster with d ∈ {0, 0.5} than with d = 1, see Section B.2.
Footnote 21: For the reasons described in Section E, we do not use the model update conditions of Zhou and Li (2005), which are based on (a) the estimated label noise in R being lower than in the previous iteration and on (b) R being sufficiently big for the noise not to be harmful under certain assumptions.
Our training.conllu files produced by tri-training for each parsing model start with the manually labelled data, followed by the automatically labelled data. In case B i is oversampled to match the size of R (an option not mentioned in the description above), the oversampling also changes the order of the manually labelled data. Similarly, the L t',i are only re-ordered if |L t',i | > A × d^(t−t').
For clarity, when we use the set union operator in Algorithm 1 we mean concatenation of data sets; duplicates are not removed. It is also clear that vanilla tri-training concatenates sets, as set operations in the mathematical sense would collapse the with-replacement samples of the manually labelled data.
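Since the pseudocode itself is not reproduced here, the following Python sketch summarises the loop under the line numbering used above. The parser interface (train, parse) and the routing function (e.g. the route_prediction sketch in Section 3.1) are passed in as callables; all names, the use of a Python set for sentence-level de-duplication, and the slicing-based decay are our simplifications, not the released implementation.

```python
import random

def tri_training(L, U, train, parse, route, A, T, d, seed=0):
    """High-level sketch of the tri-training loop described above.
    train(data) -> model and parse(model, sentence) -> tree stand in for the
    parser; route(trees, rng) returns (receiving learner index, agreed tree)
    for a sentence on which two teachers agree, or None otherwise."""
    rng = random.Random(seed)
    # lines 1-3: seed samples B_i with 2.5 copies of L (see Appendix B.1.1)
    B = [list(L) * 2 + rng.sample(L, len(L) // 2) for _ in range(3)]
    h = [train(B[i]) for i in range(3)]
    history = [[], [], []]     # newly labelled data per learner and iteration
    for t in range(1, T + 1):
        # lines 5-7: de-duplicated sample of the unlabelled data, 16A items
        pool = list(set(U))
        subset = rng.sample(pool, min(16 * A, len(pool)))
        new = [[], [], []]
        # lines 8-18: hand predictions that two teachers agree on to the third learner
        for sentence in subset:
            routed = route([parse(h[i], sentence) for i in range(3)], rng)
            if routed is not None:
                learner, tree = routed
                new[learner].append(tree)
        for i in range(3):
            # lines 21-22: cap the new data at A items per learner
            rng.shuffle(new[i])
            history[i].append(new[i][:A])
            # lines 19, 23-26: retrain on B_i plus decayed earlier iterations
            data = list(B[i])
            for age, old in enumerate(reversed(history[i])):
                keep = int(round((d ** age) * len(old)))
                data.extend(old[:keep])
            h[i] = train(data)
    return h
```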

B Parameter Search
This section describes our parameter search, analysing four distinct aspects of tri-training: sampling of seed data, reusing data from previous iterations, sample size and oversampling.

B.1.1 Seed Sampling Methods Considered
The seed data B i in tri-training is the labelled data that is used to train the initial models. This data is also included in the training data of the remaining tri-training iterations in our version of tri-training, see Algorithm 1. Each learner usually uses a different sample of the original training data.
In initial experiments with the English side of the LinES Parallel Treebank (en_lines) as seed data, we observed a degradation in the performance of the learners' models when sampling the manually labelled data with replacement, as in vanilla tri-training (Zhou and Li, 2005), compared to models trained directly on the labelled data. Neither combining three models in an ensemble nor additional training data obtained through tri-training in up to two tri-training iterations compensated for the loss of performance.
The reason why vanilla tri-training uses sampling is to ensure variation between the three learners. Neural models, however, naturally vary due to random initialisation of network weights, the order of training data, stochastic kernels and numerical effects when intermediary results computed in parallel are combined in unpredictable order. We therefore tried using the original manually labelled training data in all learners and relying on random initialisation to instil variation. This removed the degradation in performance, but as tri-training proceeded, performance stayed within 0.6 LAS points of the average LAS of ensembles of three initial models. We suspected that more variation is needed. Therefore, we re-introduced sampling but modified it to ensure that all manually labelled data is available to each learner. We change the sampling to pick half of the data twice and the remaining half three times, resulting in a sample size of 250% of the original data. With this sampling, tri-training performance clearly improved in the en_lines experiment and exceeded the range of results due to random initialisation and other sources of variation in neural models.
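This sampling scheme amounts to the following sketch (the helper name is ours):

```python
import random

def seed_sample_250_percent(labelled, rng):
    """Build a seed sample B_i containing every labelled sentence at least
    twice and a random half of the sentences a third time, i.e. 250% of
    the original labelled data."""
    extra = rng.sample(labelled, len(labelled) // 2)
    sample = list(labelled) * 2 + extra
    rng.shuffle(sample)
    return sample
```

Each learner receives its own call with a differently seeded rng, so the random half differs between learners and adds variation beyond random initialisation.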
The results for our four development languages shown in Table 3 mostly confirm these findings. Using 2.5 copies consistently gives the highest LAS, though the improvements over vanilla tri-training, which uses sampling only for the initial models and then continues with the full labelled data, and over a variant using the full data from the start are small.

Table 3: Effect of seed data sampling on tri-training performance (average development LAS over eight tri-training runs, selecting, for each run, the best tri-training iteration according to development LAS): W.R. is sampling with replacement, Vanilla uses sampling with replacement for the initial models and a full copy of the labelled data for t > 0, 100% uses a full copy of the labelled data in all iterations, i. e. the only source of variation is the random seed used in parser training, and 250% uses 2.5 copies of L for each learner, providing additional variation due to the random selection of the last half of the data.

B.2 Effect of Using Data from Previous Iterations
The tri-training parameter d controls how much data from previous tri-training iterations is used in the current iteration. We experiment with d ∈ {0, 0.5, 1} as we expect that training a model on data obtained with different models, initialised with different seeds, may have similar benefits as using ensemble predictions, which Yu et al. (2020) show to improve self-training. Furthermore, data combination may limit negative effects of an iteration with poorly performing models h i . The results are shown in Table 4. For all but the Uyghur parser udpf, i. e. without external word embeddings, we found the best development results when predictions of all tri-training iterations are combined. The difference in LAS to the combination method that exponentially reduces the size of data taken from previous iterations (d = 0.5) is small.

B.3 Effect of Sample Size A
The tri-training parameter A controls how much unlabelled data is combined with labelled data during training. Table 5 presents results for augmentation sizes A from 5k to 160k tokens. 23 We see good improvements for all development languages except Vietnamese as the size of the set of automatically labelled data added in each tri-training round increases. For Vietnamese with parser elmo, the range of scores is small and there is no consistent pattern.

B.4 Effect of Oversampling
Table 6 compares the average LAS with and without oversampling of the manually labelled data B i to match the size of the automatically labelled data R. The results suggest that the effect is negligible, and since oversampling slows down training, we carry out the main experiment without oversampling. 24

Footnote 23: For comparison, the labelled data L has about 20k tokens in our experiments and the samples B i have about 50k tokens for our best seed data sample size of 250%.
Footnote 24: Preliminary results for English with oversampling of the manually labelled data three times in all iterations, including the seed models (Table 3), however, show a positive effect of oversampling. Maybe oversampling is more important in early iterations where the amount of automatically labelled data is relatively small. Future work should investigate the effect of oversampling further.

C Error Analysis: Dependency Labels
Table 7 shows the most frequent LAS improvements by dependency label.

Table 8 compares the learning rate schedule we use with UDPipe-Future and its default schedule,

E Model Selection
We select the tri-training iteration with the best ensemble performance according to development LAS. We do not use Zhou and Li (2005)'s stopping criterion, which is based on conditionally updating the learners' models in line 26 of Algorithm 1, for the following reasons:
• The model update condition is designed for binary classification tasks. It is not clear how the condition would have to be adapted for the joint prediction of dependency trees, lemmata and multiple tags.
• The model update condition uses the training data to estimate label noise. We do not expect such estimates to be useful for neural models that tend to considerably overfit the training data.
• The model update condition rejects models trained on an amount of automatically labelled data that is too small to avoid harm from label noise under certain assumptions. In Zhou and Li (2005)'s experiments, the size of the unlabelled data is quite small. In contrast, we can avail of orders of magnitude more unlabelled data. Hence, we do not expect the issue of insufficient data to arise.
• The inherent performance variation of neural models, e. g. due to random initialisation, can trigger Zhou and Li (2005)'s stopping criterion too early as it requires the error rate to drop in each iteration. When tri-training is run long enough for the improvements due to the additional training data to be smaller than the performance variation due to randomness in model training, we expect that patience is needed to bridge a temporary degradation.

F BERT Layer Selection
We experiment with using different BERT layers and pooling functions for combining BERT's subword vectors into token vectors. We explore 45 settings for each language: nine choices of layers (individual layers and averages of layers, excluding the bottom layers) and five choices of token pooling function. For each language, we choose three different settings, one for each learner in tri-training, starting with the top-performing setting, eliminating all settings with the same choice of layers or the same choice of token pooling, and then repeating the process for the next learner. Table 9 shows the results of this experiment. Our observations confirm that middle layers typically perform best (Rogers et al., 2020). Uyghur, for which multilingual BERT does not perform well with UDPipe-Future, shows a different pattern with no large differences and a preference for the top layer, which is less informative for the other languages. Different languages seem to prefer different pooling functions for combining vectors of subword units into vectors for UD tokens.

Table 9: Development set LAS for training UDPipe-Future with word embeddings taken from different BERT layers. Each inner cell shows the average over 25 runs. Pooling methods are average, first, last, maximum and weighted average with a binomial distribution with p = 0.5 (Z50). Layer A4 (A5) stands for using the average of the top 4 (5) layers. The E suffix means that the 768-dimensional BERT_BASE vectors are expanded to 1024 components so that all (A4E) or most (A5E) final components can be the average of fewer input components. The three settings selected for the three learners of each language are shown in bold.
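The greedy selection of one (layer, pooling) setting per learner described above can be sketched as follows; the dictionary input format and the function name are our assumptions.

```python
def pick_learner_settings(setting_scores, n_learners=3):
    """Greedily pick one (layer, pooling) setting per learner: take the
    best-scoring remaining setting, drop all settings sharing its layer
    choice or its pooling choice, and repeat for the next learner.

    setting_scores: dict mapping (layer, pooling) -> development LAS.
    """
    remaining = dict(setting_scores)
    chosen = []
    for _ in range(n_learners):
        if not remaining:
            break
        layer, pooling = max(remaining, key=remaining.get)
        chosen.append((layer, pooling))
        remaining = {(l, p): score for (l, p), score in remaining.items()
                     if l != layer and p != pooling}
    return chosen
```

With nine layer choices and five pooling functions, the second learner chooses among the remaining 8 × 4 = 32 settings and the third among 7 × 3 = 21, so three mutually diverse settings are always available.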