Zero-Shot Dependency Parsing with Worst-Case Aware Automated Curriculum Learning

Large multilingual pretrained language models such as mBERT and XLM-RoBERTa have been found to be surprisingly effective for cross-lingual transfer of syntactic parsing models (Wu and Dredze, 2019), but only between related languages. However, when parsing truly low-resource languages, the source and target languages are rarely related. To close this gap, we adopt a method from multi-task learning that relies on automated curriculum learning to dynamically optimize for parsing performance on outlier languages. We show that this approach is significantly better than uniform and size-proportional sampling in the zero-shot setting.


Introduction
The field of multilingual NLP is booming (Agirre, 2020). This is due in no small part to large multilingual pretrained language models (PLMs) such as mBERT (Devlin et al., 2019) and XLM-RoBERTa (Conneau et al., 2020), which have been found to have surprising cross-lingual transfer capabilities in spite of receiving no cross-lingual supervision. 1 Wu and Dredze (2019), for example, found mBERT to perform well in a zero-shot setting when fine-tuned for five different NLP tasks in different languages. There is, however, a sharp divide between languages that benefit from this transfer and languages that do not, and there is ample evidence that transfer works best between typologically similar languages (Pires et al., 2019; Lauscher et al., 2020, among others). This means that the majority of world languages that are truly low-resource are still left behind, and inequalities in access to language technology are increasing.

1 In the early days, cross-lingual transfer for dependency parsing relied on projection across word alignments (Spreyer and Kuhn, 2009; Agić et al., 2016) or delexicalized transfer of abstract syntactic features (Zeman and Resnik, 2008; McDonald et al., 2011; Søgaard, 2011; Cohen et al., 2011). Delexicalized transfer was later 're-lexicalized' with word clusters (Täckström et al., 2012) and word embeddings (Duong et al., 2015), but with the introduction of multilingual contextualized language models, transfer models no longer rely on abstract syntactic features, removing an important bottleneck for transfer approaches to scale to truly low-resource languages.
Large multilingual PLMs are typically fine-tuned using training data from a sample of languages that is supposed to be representative of the languages that the models are later applied to. However, this is difficult to achieve in practice, as multilingual datasets are not well balanced for typological diversity and contain a skewed distribution of typological features (Ponti et al., 2021). This problem can be mitigated by using methods that sample from skewed distributions in a way that is robust to outliers. Zhang et al. (2020) recently developed such a method. It uses curriculum learning with a worst-case-aware loss for multi-task learning. They trained their model on a subset of the GLUE benchmark (Wang et al., 2018) and tested on outlier tasks. This led to improved zero-shot performance on these outlier tasks. This method can be applied to multilingual NLP where different languages are considered different tasks. This is what we do in this work, for the case of multilingual dependency parsing. Multilingual dependency parsing is an ideal test case for this method, as the Universal Dependency treebanks (Nivre et al., 2020) are currently the manually annotated dataset that covers the most typological diversity (Ponti et al., 2021).
Our research question can be formulated as such: Can worst-case aware automated curriculum learning improve zero-shot cross-lingual dependency parsing? 2

Worst-Case-Aware Curriculum Learning

In multi-task learning, the total loss is generally the average of the losses of the different tasks:

$\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} l_i$

where $l_i$ is the loss of task $i$. The architecture we use in this paper is adapted from Zhang et al. (2020): an automated curriculum learning (Graves et al., 2017) framework that learns a worst-case-aware loss in a multi-task learning scenario. The architecture consists of a sampler, a buffer, a trainer and a multilingual dependency parsing model. The two main components are the sampler, which adopts a curriculum sampling strategy to dynamically sample data batches, and the trainer, which uses a worst-case-aware strategy to train the model. The framework repeats the following steps: (1) the sampler samples data batches of different languages into the buffer; (2) the trainer uses a worst-case strategy to train the model; (3) the automated curriculum learning strategy of the sampler is updated.
Sampling data batches We view multilingual dependency parsing as multi-task learning where parsing in each individual language is considered a task. This means that the target of the sampler at each step is to choose a data batch from one language. This is a typical multi-armed bandit problem (Even-Dar et al., 2002). The sampler should choose bandits that have higher rewards: in our scenario, data batches that have a higher loss on the model are more likely to be selected by the sampler and therefore, in a later stage, used by the trainer. Automated curriculum learning is adopted to push a batch with its loss into the buffer at each time step. The buffer consists of n first-in-first-out queues, each corresponding to a task (in our case, a language). The procedure repeats k times per round, so that at each round k data batches are pushed into the buffer.
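The buffer and sampler described above can be sketched in a few lines of Python (a minimal illustration under our own simplifications; `BatchBuffer` and `sample_language` are hypothetical names, and in the real framework the sampling weights come from the learned curriculum policy rather than being fixed):

```python
import random
from collections import deque

class BatchBuffer:
    """FIFO buffer with one bounded queue per language (task)."""

    def __init__(self, languages, maxlen=4):
        self.queues = {lang: deque(maxlen=maxlen) for lang in languages}

    def push(self, lang, batch, loss):
        # Once a queue is full, the oldest batch for that language is evicted.
        self.queues[lang].append((batch, loss))

def sample_language(weights):
    """Pick a language with probability proportional to its (positive) weight."""
    langs = [lang for lang in weights if weights[lang] > 0]
    total = sum(weights[lang] for lang in langs)
    r = random.random() * total
    cum = 0.0
    for lang in langs:
        cum += weights[lang]
        if r < cum:
            return lang
    return langs[-1]
```

At each step the sampler would call `sample_language`, fetch a batch from that language's treebank, score it with the current model, and `push` it into the buffer for the trainer.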
Worst-case-aware risk minimization In multilingual and multi-task learning scenarios, in which we jointly minimize our risk across n languages or tasks, we are confronted with the question of how to summarize n losses. In other words, the question is how to compare two loss vectors α and β containing losses for all tasks $l_1, \ldots, l_n$: $\alpha = [l^1_1, \ldots, l^1_n]$ and $\beta = [l^2_1, \ldots, l^2_n]$. The most obvious thing to do is to minimize the mean of the n losses, asking whether $\sum_{l \in \alpha} l < \sum_{l \in \beta} l$. We could also, motivated by robustness (Søgaard, 2013) and fairness (Williamson and Menon, 2019), minimize the maximum (supremum) of the n losses, asking whether $\max_{l \in \alpha} l < \max_{l \in \beta} l$. Mehta et al. (2012) observed that these two loss summarizations are extremes that can be generalized by a family of multi-task loss functions that summarize the loss of n tasks as the $L_p$ norm of the n-dimensional loss vector. Minimizing the average loss then corresponds to computing the $L_1$ norm, i.e., asking whether $|\alpha|_1 < |\beta|_1$, and minimizing the worst-case loss corresponds to computing the $L_\infty$ (supremum) norm, i.e., asking whether $|\alpha|_\infty < |\beta|_\infty$.
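The $L_p$ view can be made concrete with a small worked example (illustrative only, not from the paper): with an outlier task, the mean and the worst-case summarizations rank the same two loss vectors in opposite orders.

```python
def lp_loss(losses, p):
    """Summarize per-task losses as the L_p norm of the loss vector."""
    if p == float("inf"):
        return max(losses)  # worst-case (supremum) loss
    return sum(l ** p for l in losses) ** (1.0 / p)

alpha = [0.5, 0.5, 0.5, 0.5]  # losses spread evenly across tasks
beta = [0.1, 0.1, 0.1, 1.2]   # one badly-served outlier task

# The mean (L1) prefers beta, while the worst-case (L_inf) prefers alpha.
assert lp_loss(alpha, 1) > lp_loss(beta, 1)                        # 2.0 > 1.5
assert lp_loss(alpha, float("inf")) < lp_loss(beta, float("inf"))  # 0.5 < 1.2
```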
Zhang et al. (2020) present a stochastic generalization of the $L_\infty$ loss summarization and a practical approach to minimizing this family of losses through automated curriculum learning (Graves et al., 2017). The core idea behind their generalization is to optimize the worst-case loss with a certain probability, and otherwise optimize the average (loss-proportional) loss with the remaining probability. The hyperparameter φ is introduced by the worst-case-aware risk minimization to trade off the worst-case and loss-proportional losses. The loss family is formally defined as:

$\mathcal{L} = \begin{cases} \max_i \, l_i & \text{if } p < \phi \\ l_i \sim P & \text{otherwise} \end{cases}$

where $p \in [0, 1]$ is a randomly generated number and $P_i = \frac{l_i}{\sum_{j \leq n} l_j}$ is the normalized probability distribution of task losses. If $p < \phi$, the model chooses the maximum loss among all tasks; otherwise, it randomly chooses one loss according to the loss distribution. If the hyperparameter φ equals 1, the trainer always updates the model with respect to the worst-case loss. Conversely, if φ = 0, the trainer loss-proportionally samples one loss.
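This choice between the worst-case and loss-proportional regimes can be sketched as follows (a minimal illustration; `worst_case_aware_pick` is our name, not the paper's):

```python
import random

def worst_case_aware_pick(losses, phi, rng=random):
    """Return the index of the task loss to train on.

    With probability phi, pick the worst (maximum) loss; otherwise sample
    an index with probability proportional to its loss.
    """
    if rng.random() < phi:
        return max(range(len(losses)), key=lambda i: losses[i])
    total = sum(losses)
    r = rng.random() * total
    cum = 0.0
    for i, loss in enumerate(losses):
        cum += loss
        if r < cum:
            return i
    return len(losses) - 1
```

Setting `phi=1.0` recovers pure worst-case training and `phi=0.0` recovers loss-proportional training, matching the two extremes described above.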
Sampling strategy updates The model updates its parameters with respect to the loss chosen by the trainer. After that, the sampler updates its policy according to the behavior of the trainer. At each round, the policy of the task that is selected by the trainer receives positive rewards and the policy of all other tasks that have been selected by the sampler receive negative rewards.
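The reward step can be sketched as follows (our illustrative formulation; Zhang et al. (2020) define the actual rewards and update rule, which may differ, and `update_policy` is a hypothetical name):

```python
def update_policy(weights, chosen_lang, sampled_langs, lr=0.1):
    """Reward the language whose batch the trainer used; penalize the others.

    weights maps each language to a positive sampling weight, and the
    sampler draws languages proportionally to these weights.
    """
    for lang in sampled_langs:
        reward = 1.0 if lang == chosen_lang else -1.0
        # Multiplicative (exponentiated-gradient style) update keeps
        # every weight strictly positive.
        weights[lang] *= (1.0 + lr) ** reward
    return weights
```

Languages whose batches the trainer keeps selecting thus become more likely to be sampled again, closing the curriculum learning loop.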
The multilingual dependency parsing model We use a standard biaffine graph-based dependency parser (Dozat and Manning, 2017). The model takes token representations of words from a contextualized language model (mBERT or XLM-R) as input and classifies head and dependency relations between words in the sentence. The Chu-Liu-Edmonds algorithm (Chu and Liu, 1965;Edmonds, 1967) is then used to decode the score matrix into a tree. All languages share the same encoder and decoder in order to learn features from different languages, and more importantly to perform zero-shot transfer to unseen languages.
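The biaffine scoring step can be sketched in plain Python (a minimal illustration, not the parser's implementation; `biaffine_scores` and `greedy_heads` are our names, the bias terms are omitted, and greedy per-token head selection stands in for the Chu-Liu-Edmonds decoding that the actual parser uses to guarantee a well-formed tree):

```python
def biaffine_scores(heads, deps, U):
    """Arc score for token j attaching to head i: heads[i]^T U deps[j]."""
    n = len(heads)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            scores[i][j] = sum(
                heads[i][a] * U[a][b] * deps[j][b]
                for a in range(len(U))
                for b in range(len(U[0]))
            )
    return scores

def greedy_heads(scores):
    """Pick the best-scoring head per token; the real parser instead
    decodes the score matrix with Chu-Liu-Edmonds into a tree."""
    n = len(scores)
    return [max(range(n), key=lambda i: scores[i][j]) for j in range(n)]
```

Here `heads` and `deps` would be the head and dependent representations produced from the PLM's token embeddings, and `U` the learned biaffine weight matrix shared across all languages.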

Experiments
We base our experimental design on Üstün et al. (2020), a recent paper doing zero-shot dependency parsing with good performance on a large number of languages. They fine-tune mBERT for dependency parsing using training data from a sample of 13 typologically diverse languages from Universal Dependencies (UD; Nivre et al., 2020), listed in Table 1. For testing, they use 30 test sets from treebanks whose language has not been seen at fine-tuning time. We use the same training and test sets and experiment with both mBERT and XLM-R as PLMs. It is important to note that not all of the test languages have been seen by the PLMs. 3 We test worst-case aware learning with different values of φ and compare it to three main baselines: size-proportional, which samples batches proportionally to the data sizes of the training treebanks; uniform, which samples from different treebanks with equal probability, thereby effectively reducing the size of the training data; and smooth-sampling, which uses the smooth sampling method developed in van der Goot et al. (2021), sampling from multiple languages using a multinomial distribution. These baselines are competitive with the state of the art: when using mBERT, they are within 0.2 to 0.4 LAS points of the baseline of Üstün et al. (2020) on the same test sets. When using XLM-R, they are largely above the state of the art.
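Multinomial smooth sampling can be illustrated with exponent-based smoothing of the size-proportional probabilities, a common recipe for multilingual PLM training (the exact formulation in van der Goot et al. (2021) may differ; `alpha` here is an illustrative smoothing exponent):

```python
def smoothed_multinomial(sizes, alpha=0.5):
    """Smooth size-proportional probabilities with an exponent alpha.

    alpha=1 recovers size-proportional sampling; alpha=0 is uniform;
    values in between upsample the smaller treebanks.
    """
    total = sum(sizes)
    probs = [(s / total) ** alpha for s in sizes]
    z = sum(probs)
    return [p / z for p in probs]
```

Intermediate values of `alpha` thus interpolate between the size-proportional and uniform baselines described above.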
We implement all models using MaChAmp (van der Goot et al., 2021), a library for multi-task learning based on AllenNLP (Gardner et al., 2018). The library uses transformers from HuggingFace (Wolf et al., 2020). Our code is publicly available. 4 Our main results are in Table 2, where, for space reasons, we report average scores across test sets. Results broken down by test treebank can be found in Table 4 in Appendix A. We can see that worst-case-aware training outperforms all of our baselines in the zero-shot setting, highlighting the effectiveness of this method. This positively answers our research question: worst-case aware automated curriculum learning can improve zero-shot dependency parsing.
Our results using mBERT are more than 1 LAS point above the corresponding baselines. Our best model is significantly better than the best baseline with p < .01 according to a bootstrap test across test treebanks. Our best model with mBERT comes close to Udapter (36.5 LAS on the same test sets) while being a lot simpler and not using external resources such as typological features, which are not always available for truly low-resource languages.
The results with XLM-R are much higher in general 5 but the trends are similar: all our models outperform all of our baselines albeit with smaller differences. There is only a 0.4 LAS difference between our best model and the best baseline, but it is still significant with p < .05 according to a bootstrap test across test treebanks. This highlights the robustness of the XLM-R model itself. Our results with XLM-R outperform Udapter by close to 7 LAS points.

Varying the homogeneity of training samples
We investigate the interaction between the effectiveness of worst-case learning and the representativeness of the sample of training languages. It is notoriously difficult to construct a sample of treebanks that is representative of the languages in UD (de Lhoneux et al., 2017; Schluter and Agić, 2017; de Lhoneux, 2019). We can, however, easily construct samples that are not representative, for example by taking a sample of related languages. We expect worst-case aware learning to lead to larger improvements in cases where some language types are underrepresented in the sample. We can construct an extreme case of underrepresentation by selecting a sample of training languages that has one or more clear outliers: for example, a sample of related languages plus a single unrelated language, evaluated on other unrelated languages. We also expect that with a typologically diverse set of training languages, worst-case aware learning should lead to larger relative improvements than with a homogeneous sample, but perhaps slightly smaller improvements than with a very skewed sample. We test these hypotheses by constructing seven samples of training languages in addition to the one used so far (13LANG). We construct three homogeneous samples using treebanks from three different genera: GERMANIC, ROMANCE and SLAVIC. We construct four skewed samples by combining the sample of Romance languages with one language from a different family, an outlier language: Basque (eu), Arabic (ar), Turkish (tr) or Chinese (zh). Since we keep the sample of test sets constant, we do not include training data from languages that are in the test sets. The details of which treebanks are used for each of these samples can be found in Table 5 in Appendix B.
Results are in Table 3, where we report the average LAS scores of our best model (out of the ones trained with the three different φ values) compared to the best of the three baselines. We can see first that, as expected, our typologically diverse sample performs best overall, indicating that it is a good sample. We can also see that, as expected, the method works best with a skewed sample: the largest gains from using worst-case learning, both in terms of absolute LAS difference and relative error reduction, are seen for a skewed sample (ROM+EU). However, contrary to expectations, the lowest gains are obtained for another skewed sample (ROM+AR). The gains are also low for ROM+TR, ROM+ZH and GERMANIC. Additionally, there are slightly larger gains from using worst-case aware learning with the SLAVIC sample than with our typologically diverse sample. These results could be due to the different scripts of the languages involved both in training and testing.
Looking at the results of the different models on individual test languages (see Figure 1 in Appendix C), we find no clear pattern of the settings in which this method works best. We do note that the method always hurts Belarusian, which is perhaps unsurprising given that it is the test treebank for which the baseline is highest. Worst-case aware learning hurts Belarusian the least when using the SLAVIC sample, indicating that, when using the other samples, the languages related to Belarusian are likely downsampled in favour of languages unrelated to it. Worst-case learning consistently helps Breton and Swiss German, indicating that the method might work best for languages that are underrepresented within their language family but not necessarily outside of it. For Swiss German, worst-case learning helps least when using the GERMANIC sample, where it is less of an outlier.

Conclusion
In this work, we have adapted a method from multi-task learning that relies on automated curriculum learning to the case of multilingual dependency parsing. This method makes it possible to dynamically optimize for parsing performance on outlier languages. We found this method to improve dependency parsing on a sample of 30 test languages in the zero-shot setting, compared to sampling data uniformly across treebanks from different languages or proportionally to the size of the treebanks. We investigated the impact of varying the homogeneity of the sample of training treebanks on the usefulness of the method and found conflicting evidence with different samples. This leaves open questions about the relationship between the languages used for training and the ones used for testing.

Acknowledgements
We thank Daniel Hershcovich and Ruixiang Cui for comments on a draft of the paper, as well as the members of CoAStaL for discussions about the content of the paper. Miryam de Lhoneux was funded by the Swedish Research Council (grant 2020-00437). Anders Søgaard was funded by the Innovation Fund Denmark and a Google Focused Research Award. We acknowledge the computational resources provided by CSC in Finland through NeIC-NLPL and the EOSC-Nordic NLPL use case (www.nlpl.eu).

A Results by treebank
Results by language of the test treebanks are in Table 4.

B Training samples
The training samples are summarized in Table 5.

C Results by treebank with the different samples
Relative error reduction between our best worst-case aware result and the best baseline for each training sample used, with mBERT, is shown in Figure 1.