Minimax and Neyman–Pearson Meta-Learning for Outlier Languages

Model-agnostic meta-learning (MAML) has been recently put forth as a strategy to learn resource-poor languages in a sample-efficient fashion. Nevertheless, the properties of these languages are often not well represented by those available during training. Hence, we argue that the i.i.d. assumption ingrained in MAML makes it ill-suited for cross-lingual NLP. In fact, under a decision-theoretic framework, MAML can be interpreted as minimising the expected risk across training languages (with a uniform prior), which is known as Bayes criterion. To increase its robustness to outlier languages, we create two variants of MAML based on alternative criteria: Minimax MAML reduces the maximum risk across languages, while Neyman-Pearson MAML constrains the risk in each language to a maximum threshold. Both criteria constitute fully differentiable two-player games. In light of this, we propose a new adaptive optimiser solving for a local approximation to their Nash equilibrium. We evaluate both model variants on two popular NLP tasks, part-of-speech tagging and question answering. We report gains for their average and minimum performance across low-resource languages in zero- and few-shot settings, compared to joint multi-source transfer and vanilla MAML.


Introduction
Knowledge transfer is ubiquitous in machine learning because of the general scarcity of annotated data (Pratt, 1993; Caruana, 1997; Ruder, 2019, inter alia). A prominent example thereof is transfer from resource-rich to resource-poor languages (Wu and Dredze, 2019; Ponti et al., 2019b; Ruder et al., 2019). Recently, Model-Agnostic Meta-Learning (MAML; Finn et al., 2017) has come to the fore as a promising paradigm: it explicitly trains neural models that adapt to new languages quickly by extrapolating from just a few annotated data points (Gu et al., 2018; Nooralahzadeh et al., 2020; Wu et al., 2020; Li et al., 2020).
MAML usually rests on the simplifying assumption that the source 'tasks' and the target 'tasks' are independent and identically distributed (henceforth, i.i.d.). In practice, however, most scenarios of cross-lingual transfer violate this assumption: the training languages documented in mainstream datasets do not reflect cross-lingual variation, as they belong to a clique of few families, geographical areas, and typological features (Bender, 2009; Joshi et al., 2020). As a consequence, the majority of the world's languages lies outside such a clique. Since training and evaluation languages differ in their joint distribution, they are not exchangeable (Ponti, 2021; Orbanz, 2012, ch. 6). Therefore, there is no formal guarantee that MAML generalises to the very languages whose need for transfer is most critical.
In this work, we interpret meta-learning within a decision-theoretic framework (Bickel and Doksum, 2015). MAML, we show, minimises the expected risk across languages found in the training distribution. Hence, it follows a so-called Bayes criterion. What if, instead, we formulated alternative criteria geared towards outlier languages? The first criterion we propose, Minimax MAML, is designed to be robust to worst-case-scenario out-of-distribution transfer: it minimises the maximum risk by learning an adversarial language distribution. The second criterion, Neyman-Pearson MAML, upper-bounds the risk for an arbitrary subset of languages via Lagrange multipliers, such that it does not exceed a predetermined threshold.
Crucially, both of these alternative criteria constitute competitive games between two players: one minimising the loss with respect to the neural parameters, the other maximising it with respect to the language distribution (Minimax MAML) or the Lagrange multipliers (Neyman-Pearson MAML).
Since an absolute Nash equilibrium may not exist for non-convex functions (Jin et al., 2020), such as neural networks, a common solution is to approximate local equilibria instead (Schäfer and Anandkumar, 2019). Therefore, we build on previously proposed optimisers (Balduzzi et al., 2018; Letcher et al., 2019; Gemp and Mahadevan, 2018) in which players follow non-trivial strategies that take into account the opponent's predicted moves. In particular, we enhance them with first-order momentum and an adaptive learning rate, and apply them to our newly proposed criteria.
We run experiments on Universal Dependencies (Zeman et al., 2020) for part-of-speech (POS) tagging and TyDiQA (Clark et al., 2020) for question answering (QA). We perform knowledge transfer to 14 and 8 target languages, respectively, which belong to under-represented and often endangered families (such as Tupian from South America and Pama-Nyungan from Australia). We report modest but consistent gains for the average performance across languages in few-shot and zero-shot learning settings, and mixed results for the minimum performance. In particular, Minimax and Neyman-Pearson MAML often surpass vanilla MAML and multi-source transfer baselines, which are currently considered state-of-the-art in these tasks (Wu and Dredze, 2019; Ponti et al., 2021; Clark et al., 2020).

Skewed Language Distributions
Cross-lingual learning aims at transferring knowledge from resource-rich languages to resource-poor languages, to compensate for their deficiency of annotated data (Tiedemann, 2015; Ruder et al., 2019; Ponti et al., 2019a). The set of target languages ideally encompasses most of the world's languages. However, the source languages available for training are often concentrated around few families, geographic areas, and typological features (Cotterell and Eisner, 2017; Gerz et al., 2018b,a; Ponti et al., 2020; Clark et al., 2020). As a consequence of this discrepancy, a language drawn at random might have no related languages available for training. Even when this is not the case, its relatives might provide only a scarce amount of examples for supervision. To illustrate this point, consider Universal Dependencies (UD; Zeman et al., 2020), hitherto the most comprehensive collection of manually curated multilingual data. First, out of 245 families attested in the world according to Glottolog (Hammarström et al., 2016), UD covers only 18. In fact, some families are chronically over-represented (e.g. Indo-European and Uralic) and others are neglected (e.g. Pama-Nyungan and Uto-Aztecan). Second, as shown in Figure 1, the allocation of labelled examples across families is imbalanced (e.g. note the low counts for Niger-Congo or Dravidian languages). Third, one can measure how representative the linguistic traits of the training languages are in comparison to those encountered around the globe. In Figure 2, we represent UD languages as dots in the space of possible typological features in WALS (Dryer and Haspelmath, 2013), plotted against the density of the distribution based on all languages in existence. Crucially, it emerges that UD languages mostly lie in a low-density region. Therefore, they hardly reflect the variety of possible combinations of typological features.
In general, this demonstrates that the distribution of training languages in existing NLP datasets is heavily skewed compared to the real-world distribution. Indeed, this very argument holds true a fortiori in smaller, less diverse datasets. While this fact is undisputed in the literature, its consequences for modelling, which we expound in the next section, are often under-estimated.

Robust MAML
Model-Agnostic Meta-Learning (MAML; Finn et al., 2017) has recently emerged as an effective approach to cross-lingual transfer (Gu et al., 2018; Nooralahzadeh et al., 2020; Wu et al., 2020; Li et al., 2020). MAML seeks a good initialisation point for the neural weights, such that they can be adapted to new languages with only a few examples. To this end, for each language T_i a neural model f_ϑ is updated according to the loss on a batch of examples, L_{T_i}(f_ϑ, D_train). This inner loop is iterated for k steps, yielding language-specific parameters ϕ_i. Afterwards, the loss incurred by the model on a held-out batch D_val is compounded with those of the other languages as part of an outer loop, as shown in Equation (1):

ϑ ← ϑ − η ∇_ϑ Σ_{T_i} p(T_i) L_{T_i}(f_{ϕ_i}, D_val),    (1)

where η ∈ R_{>0} is the learning rate. Language probabilities are often taken to follow a discrete uniform distribution p(T_i) = 1/|T|. In this case, Equation (1) becomes a simple average.

MAML can also be interpreted as point-estimate inference in a hierarchical Bayesian graphical model (see Figure 4 in the Appendix). In this case, the adapted parameters ϕ_i are equivalent to an intermediate language-specific variable acting as a bridge between the language-agnostic parameters ϑ and the data (Grant et al., 2018; Finn et al., 2018; Yoon et al., 2018). This interpretation allows us to reason about the conditions under which a model is expected to generalise to new languages. Crucially, generalisation rests on the assumption of independence and identical distribution among the examples (including both training and evaluation), which is known as exchangeability (Zabell, 2005). However, as seen in Section 2, most of the world's languages are outliers with respect to the training language distribution. Therefore, there is no solid guarantee that meta-learning fulfils its purpose, i.e. generalising to held-out languages.
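As a concrete illustration, below is a minimal PyTorch-style sketch of the Bayes-criterion outer loop in Equation (1) under a uniform prior. All names (`inner_adapt`, `bayes_outer_loss`, the batch layout) are hypothetical, a recent PyTorch with `torch.func` is assumed, and the snippet uses second-order gradients for clarity, whereas our experiments rely on a first-order variant (Meta-SGD).

```python
import torch


def inner_adapt(model, batch, loss_fn, inner_lr=1e-3, k=4):
    # k inner-loop steps on the language's D_train, returning adapted weights phi_i
    phi = dict(model.named_parameters())
    for _ in range(k):
        preds = torch.func.functional_call(model, phi, batch["x"])
        grads = torch.autograd.grad(loss_fn(preds, batch["y"]),
                                    list(phi.values()), create_graph=True)
        phi = {name: p - inner_lr * g for (name, p), g in zip(phi.items(), grads)}
    return phi


def bayes_outer_loss(model, languages, loss_fn):
    # expected risk across languages with p(T_i) = 1/|T|: a simple average of
    # the held-out (D_val) losses of the adapted models
    outer = 0.0
    for lang in languages:  # each entry holds a D_train and a D_val batch
        phi = inner_adapt(model, lang["train"], loss_fn)
        preds = torch.func.functional_call(model, phi, lang["val"]["x"])
        outer = outer + loss_fn(preds, lang["val"]["y"])
    return outer / len(languages)
```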

Decision-Theoretic Perspective
To remedy the mismatch between assumptions and realistic conditions, in this work we propose objectives that can serve as alternatives to Equation (1) of vanilla MAML. These are rooted in an interpretation of MAML within a decision-theoretic perspective (Bickel and Doksum, 2015, ch. 1.3), which we outline in what follows. The quantity of interest we aim to learn is the set of neural parameters ϑ. Therefore, the action space for a classification task assigning labels y ∈ Y to inputs x ∈ X is A = {f_ϑ : X → Y}. The risk is in turn a function R : F × A → R_{+}, namely the loss incurred by taking an action in A (making a prediction with a specific configuration of neural parameters) when the 'state of nature', the true function, is f ∈ F. In the case of MAML, this is represented by the language-specific inner-loop loss L_{T_i}(·) in Equation (1).
For a neural network, the decision for the optimal action given the sample space, i.e. the function δ : X × Y → A, is usually determined via gradient-descent optimisation. The optimal action, however, may vary depending on the language, which results in multiple possible 'states of nature'. Usually, there is no decision procedure δ whose risk is lower than that of every alternative δ′ in all languages simultaneously. Therefore, decision functions have to be compared based on a global criterion rather than in a pairwise fashion between languages. As previously anticipated, Equation (1) minimises the expected risk across languages, for an arbitrary choice of prior p(T). In decision theory, a decision δ with this property is said to follow the Bayes criterion.

Alternative Criteria
There exist alternative criteria to the Bayes criterion that are more justified in a setting that entails transfer between non-i.i.d. domains. Rather than minimising the Bayes risk, in this work, we propose to adjust MAML to either minimise the maximum risk (minimax criterion) or to enforce constraints on the risk for a subset of languages (Neyman-Pearson criterion). This is likely to yield more robust predictions for languages that are outliers to the training distribution. As demonstrated in Section 2, this definition encompasses most of the world's languages.

Minimax Criterion
Rather than the expected risk, the criterion could instead depend on the worst-case scenario, i.e. the language for which the risk is maximum. This requires selecting such a language with a max operator. As an alternative to reinforcement learning (Zhang et al., 2020), and in order to keep our model fully differentiable, we relax the max operator by treating the choice of language as a categorical distribution T_i ∼ Cat(· | τ).
The parameters τ ∈ [0, 1]^{|T|}, with Σ_i τ_i = 1, consist of language probabilities and are learned in an adversarial fashion:

min_ϑ max_τ Σ_{T_i} τ_i L_{T_i}(f_{ϕ_i}, D_val).    (3)

Equation (3) can be interpreted as a two-player game between us (the scientists) and nature. We pick an action ϑ. Then nature picks the language T_i for which the risk is maximum given our chosen action. Therefore, our goal becomes to minimise such worst-case risk.
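A minimal sketch of this objective follows, assuming one held-out (outer-loop) loss per training language has already been computed; `tau_u` are the unconstrained logits introduced below (see Constrained Parameters), and the remaining names are illustrative.

```python
import torch

num_languages = 16  # |T|, illustrative
tau_u = torch.zeros(num_languages, requires_grad=True)  # unconstrained adversarial logits


def minimax_outer_loss(per_language_losses, tau_u):
    # weighted risk under T_i ~ Cat(tau), with tau = softmax(tau_u) on the simplex
    tau = torch.softmax(tau_u, dim=0)
    return (tau * torch.stack(per_language_losses)).sum()
```

ϑ then takes a descent step on this quantity, while τ_u takes an ascent step on the same quantity, which yields the two-player game of Equation (3).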

Neyman-Pearson Criterion
As an alternative, we might consider minimising the expected risk, but subject to a guarantee that the risk does not exceed a certain threshold for a subset of languages. In practice, we may want to enforce a set of inequality constraints, so that we minimise Equation (1) subject to {L_{T_i} ≤ r ∀ T_i ∈ C}, where r ∈ R_{+} is a hyper-parameter. In general, C ⊆ T can be any subset of the training languages; in practice, here we take C = T. Constrained optimisation is usually implemented through Lagrange multipliers, where we add as many new terms to the objective as we have constraints (Bishop, 2006, ch. 7):

min_ϑ max_λ Σ_{T_i} p(T_i) L_{T_i}(f_{ϕ_i}, D_val) + Σ_{T_i} λ_i (L_{T_i} − r),    (4)

where λ is a vector of non-negative Lagrange multipliers {λ_i ≥ 0 ∀ λ_i ∈ λ} to be learned together with the parameters ϑ, but adversarially. Intuitively, if the risk for the estimated parameters ϑ lies in the permissible range, the constraints should become inactive {λ_i = 0 ∀ λ_i ∈ λ}, i.e. each Lagrange multiplier should go towards 0. Otherwise, the solution should be affected by the constraints, which should keep ϑ from trespassing the boundary {L_{T_i}(ϑ) = r ∀ T_i ∈ T}. In gradient-based optimisation, this unfolds as follows: the gradient of each λ_i depends uniquely on (L_{T_i} − r). Since λ is maximised, the value of each λ_i increases when the corresponding risk is above the threshold, and shrinks otherwise. Incidentally, note that the Lagrange multipliers at the critical point ϑ* are equal to the negative rate of change of the risk with respect to r, as ∂R(ϑ*)/∂r = −λ. In other words, upon convergence λ_i expresses how much we could decrease the risk in T_i if we increased the threshold.
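The corresponding outer objective can be sketched as follows, with λ = softplus(λ_u) kept non-negative as described in the next paragraph; the threshold r = 0.1 matches our experimental setting, and the remaining names are illustrative.

```python
import torch
import torch.nn.functional as F

num_languages = 16   # |T|, illustrative
r = 0.1              # risk threshold
lambda_u = torch.zeros(num_languages, requires_grad=True)  # unconstrained multipliers


def neyman_pearson_outer_loss(per_language_losses, lambda_u, r):
    losses = torch.stack(per_language_losses)
    lam = F.softplus(lambda_u)                     # lambda_i >= 0
    expected_risk = losses.mean()                  # uniform prior p(T_i) = 1/|T|
    constraint_terms = (lam * (losses - r)).sum()  # pushes lambda_i up only when L_i > r
    return expected_risk + constraint_terms
```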

Constrained Parameters
The additional variables τ and λ, contrary to the neural parameters, are constrained in the values they can take. In neural networks, there are two widespread approaches to coerce variables within a certain range, viz. reparametrisation and gradient projection (Beck and Teboulle, 2003). For simplicity's sake, we opt for the former, which just requires us to learn unconstrained variables and scale them with the appropriate functions. Thus, we redefine the above-mentioned variables as τ ≜ softmax(τ_u) and λ ≜ softplus(λ_u).

Optimisation in 2-Player Games
Based on the formulation of Minimax MAML and Neyman-Pearson MAML in Section 3.2, both are evidently instances of two-player games. On the one hand, the first agent minimises the risk with respect to ϑ; on the other, the second agent maximises the risk with respect to τ (for Minimax) or λ (for Neyman-Pearson). In other words, both optimise the same (empirical risk) function in Equation (3) or Equation (4), respectively, but with opposite signs. However, the first term of Equation (4) does not depend on λ. Therefore, Minimax MAML is a zero-sum game, whereas Neyman-Pearson MAML is not.
If the risk function were convex, the solution would be well defined as the Nash equilibrium. However, this is not the case for a non-linear function such as a deep neural network. Therefore, we resort to an approximate solution through optimisation. The simplest approach in this scenario is Gradient Descent Ascent (GDA), where the parameters of both players are optimised simultaneously, through gradient descent for the first player and gradient ascent for the second player. With a slight abuse of notation, let us define R ≜ R(ϑ_t, α_t), where α_t stands for the adversarial parameters (τ_t for Minimax and λ_t for Neyman-Pearson) at time t. Then the update rule is:

ϑ_{t+1} = ϑ_t − η ∇_ϑ R,    (5)
α_{t+1} = α_t + η ∇_α R,    (6)

for a learning rate η ∈ R_{>0}. Equations (5) and (6) are equivalent to allowing each player to ignore the other's move and act as if the opponent will remain stationary. This naïve assumption often leads to divergence or sub-par solutions during optimisation (Schäfer and Anandkumar, 2019).
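A minimal sketch of one GDA step under these update rules, assuming `theta` and `alpha` are single tensors (with `requires_grad=True`) and `risk_fn` returns the scalar risk; this is the naïve strategy that the adjusted optimisers below improve upon.

```python
import torch


def gda_step(theta, alpha, risk_fn, eta=1e-3):
    risk = risk_fn(theta, alpha)
    g_theta, g_alpha = torch.autograd.grad(risk, [theta, alpha])
    with torch.no_grad():
        theta -= eta * g_theta   # player 1 descends (Equation 5)
        alpha += eta * g_alpha   # player 2 ascends (Equation 6)
    return theta, alpha
```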

Symplectic Gradient Adjustment
To overcome the limitations of GDA, several independent works (Balduzzi et al., 2018; Letcher et al., 2019; Gemp and Mahadevan, 2018) proposed to correct Equations (5) and (6) with an additional term. This consists of a matrix-vector product between the mixed second-order derivatives (D²_{ϑα}R and D²_{αϑ}R, respectively, where D²_{wz}R stands for the sub-matrix of the Hessian containing the derivative of the risk taken first with respect to w and then with respect to z) and the gradient of the risk with respect to the opponent's parameters (∇_α R and ∇_ϑ R, respectively). The resulting optimisation algorithm, Symplectic Gradient Adjustment (SGA), updates the parameters as follows:

ϑ_{t+1} = ϑ_t − η (∇_ϑ R + D²_{ϑα}R ∇_α R),    (7)
α_{t+1} = α_t + η (∇_α R − D²_{αϑ}R ∇_ϑ R).    (8)

Intuitively, the mixed second-order derivative represents the interaction between the players, and the adversarial gradient represents the opponent's move if they follow the simple GDA strategy. Schäfer and Anandkumar (2019) cogently demonstrate how Equations (7) and (8) correspond to an approximation of the Nash equilibrium of a local approximation of the game. (A Nash equilibrium is a pair of strategies whose unilateral modification cannot result in loss reductions.)
In practice, estimating the above-mentioned products is tedious because of their space and time complexity. Therefore, we resort to an approximation known as the Hessian-vector product (Pearlmutter, 1994). For the third term of Equation (7):

D²_{ϑα}R ∇_α R = ∇_ϑ ⟨∇_α R, v⟩, with v = ∇_α R treated as a constant,    (9)

and similarly for the matrix-product term in Equation (8), by swapping ϑ and α in Equation (9).
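A minimal PyTorch sketch of this trick: the mixed term D²_{ϑα}R ∇_α R is obtained by differentiating the inner product between ∇_α R and a detached copy of itself with respect to ϑ, so the Hessian block is never materialised. The function name and signature are illustrative.

```python
import torch


def mixed_hvp(risk, theta, alpha):
    # gradient of the risk w.r.t. the adversarial parameters, kept in the graph
    grad_alpha = torch.autograd.grad(risk, alpha, create_graph=True)[0]
    v = grad_alpha.detach()              # treat the opponent's gradient as a constant
    inner = (grad_alpha * v).sum()       # scalar <grad_alpha R, v>
    # differentiating the scalar w.r.t. theta yields the Hessian-vector product
    return torch.autograd.grad(inner, theta, retain_graph=True)[0]
```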

Adaptive Learning Rate and Momentum
While SGA may provide a more appropriate optimisation framework for competitive games, it still lacks several defining features of modern optimisers that accelerate convergence, such as first-order momentum and an adaptive learning rate (second-order momentum). Therefore, we modify the update rules in Equations (7) and (8) to include both of these. Our starting point is Adam (Kingma and Ba, 2015), and the changes we apply are illustrated in Algorithm 1. The exponentially decayed, unbiased estimates of the expectations over mean and standard deviation are computed similarly to Adam. However, note that, in line 14 of Algorithm 1, the update of the adversarial parameters corresponds to an ascent (rather than a descent).
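Since Algorithm 1 is not reproduced here, the sketch below illustrates one update of this kind under stated assumptions: the symplectically adjusted gradients (e.g. obtained with `mixed_hvp` above) are passed through Adam-style bias-corrected moment estimates, and the adversarial parameters ascend rather than descend. Names and bookkeeping details are illustrative and may differ from Algorithm 1.

```python
import torch


def init_asga_state(theta, alpha):
    return {"t": 0,
            "theta_m": torch.zeros_like(theta), "theta_v": torch.zeros_like(theta),
            "alpha_m": torch.zeros_like(alpha), "alpha_v": torch.zeros_like(alpha)}


def asga_step(theta, alpha, adj_g_theta, adj_g_alpha, state,
              eta=1e-3, betas=(0.9, 0.999), eps=1e-8):
    state["t"] += 1
    b1, b2 = betas
    updates = (("theta", theta, adj_g_theta.detach(), -1.0),   # minimising player: descent
               ("alpha", alpha, adj_g_alpha.detach(), +1.0))   # adversarial player: ascent
    for name, p, g, sign in updates:
        state[name + "_m"] = b1 * state[name + "_m"] + (1 - b1) * g
        state[name + "_v"] = b2 * state[name + "_v"] + (1 - b2) * g * g
        m_hat = state[name + "_m"] / (1 - b1 ** state["t"])    # bias-corrected first moment
        v_hat = state[name + "_v"] / (1 - b2 ** state["t"])    # bias-corrected second moment
        with torch.no_grad():
            p += sign * eta * m_hat / (v_hat.sqrt() + eps)
    return theta, alpha
```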
This results in a novel optimiser, Adaptive Symplectic Gradient Adjustment (ASGA). We employ ASGA in our experiments to optimise the objectives of Minimax MAML and Neyman-Pearson MAML, as it enables a fair comparison with Adam-optimised Bayes MAML.

Experiments
We now outline the main experiments of our work on multilingual NLP. We evaluate our methods on part-of-speech (POS) tagging, a sequence labelling task, and question answering (QA), a natural language understanding task. We focus on POS tagging given its ample coverage of languages and its frequent use as a benchmark for resource-poor NLP (Das and Petrov, 2011; Ponti et al., 2021). In fact, cross-lingual transfer in sequence labelling tasks has been demonstrated to be the most challenging, as knowledge of linguistic structure is more language-dependent than semantics (Hu et al., 2020). However, we also include QA to illustrate the generality of our methods for cross-lingual NLP. In this task, given the gold passage and a question, the system has to predict the beginning and end positions of a single contiguous span containing the answer.
Data. POS data are sourced from the Universal Dependencies (UD) treebanks (Zeman et al., 2020) and QA data from the 'gold passage' variant of TyDiQA (Clark et al., 2020). We retain the original training, development, and evaluation sets of UD. In TyDiQA, we use the original development set for evaluation. For meta-learning, D_train and D_val examples are both obtained from disjoint parts of the training set.
We aim to create a partition of languages between training and evaluation that corresponds to the most realistic scenario in deploying NLP technology on resource-poor languages spoken around the world. Therefore, we reserve for evaluation all language isolates and languages with at most 2 family members in each dataset. We use all the remaining languages in the dataset for training. Therefore, for POS, the evaluation set spans 16 treebanks (14 languages, 11 families) and the training set 99 treebanks; QA comprises 9 languages (7 families). We hold out 4 of them in turn for evaluation (except English) and use the rest for training. We provide the full list of languages in Appendix A.
Training. In all tasks, we train a neural network consisting of two stacked modules: an encoder and a classifier. The encoder is a 12-layer, 768-hidden-unit, 12-head Transformer initialised with multilingual BERT Base (mBERT), which was pre-trained on cased text from 104 languages (https://github.com/google-research/bert/blob/master/multilingual.md). The classifier is a single affine layer for TyDiQA and a 2-layer Perceptron (with 1024 hidden units) for POS tagging. The combined parameters of the encoder and classifier correspond to ϑ from Section 3.
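A minimal sketch of the task model for POS tagging, assuming the transformers library; the choice of activation and the exact layout of the head are illustrative.

```python
import torch
from transformers import AutoModel


class PosTagger(torch.nn.Module):
    def __init__(self, num_tags, hidden=1024, dropout=0.2):
        super().__init__()
        # 12-layer, 768-hidden-unit, 12-head multilingual BERT Base encoder
        self.encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
        # 2-layer Perceptron head with 1024 hidden units, as in the text
        self.classifier = torch.nn.Sequential(
            torch.nn.Linear(self.encoder.config.hidden_size, hidden),
            torch.nn.Tanh(),
            torch.nn.Dropout(dropout),
            torch.nn.Linear(hidden, num_tags),
        )

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.classifier(states)  # per-token tag logits
```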
These are meta-learned via Meta-SGD (Li et al., 2017), a first-order MAML variant where each parameter is assigned a separate inner-loop learning rate η, and each η is trained end-to-end based on the outer-loop loss (such as Equation (1) for the Bayes criterion); we implement Meta-SGD with the learn2learn package (Arnold et al., 2020). Similar to Bansal et al. (2020), to avoid an explosion in the number of parameters, we assign a per-layer learning rate (rather than a per-parameter one). To avoid overfitting, we employ both dropout (with a probability of 0.2) and early stopping (with a patience of 10). For the Neyman-Pearson formulation, we set r = 0.1 as the threshold for all language-specific losses; we also experimented with a dynamic threshold corresponding to the average language-specific loss of the last 10 episodes, but this yielded sub-par results. The parameters τ and λ were initialised uniformly as 1/|T|. Complete details of the hyper-parameters for all settings are given in Appendix B.
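A sketch of the per-layer learnable inner-loop learning rates used with Meta-SGD: one scalar per parameter tensor (roughly, per layer), trained end-to-end on the outer-loop loss together with ϑ. Function names are illustrative.

```python
import torch


def make_per_layer_lrs(model, init_lr=5e-5):
    # one learnable scalar learning rate per parameter tensor
    return {name: torch.full((), init_lr, requires_grad=True)
            for name, _ in model.named_parameters()}


def meta_sgd_inner_step(phi, grads, per_layer_lrs):
    # each tensor is updated with its own learned learning rate
    return {name: p - per_layer_lrs[name] * g
            for (name, p), g in zip(phi.items(), grads)}
```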
Methods. To assess the effectiveness of the proposed criteria and optimisers, we compare them with two competitive baselines, while maintaining the same underlying neural architecture: (i) J: a joint multi-source transfer method where a model is trained on the concatenation of the datasets for all languages; (ii) B: the original MAML (Finn et al., 2017) with the Bayes criterion and a uniform prior. Our choice of baselines is justified by the fact that these methods (or variations thereof) are currently considered state-of-the-art for cross-lingual transfer (see Section 1).

Evaluation. For each evaluation language in a given task, we randomly sample k ∈ {0, 5, 10, 20} examples from the evaluation data as the support set (for adaptation) and use the rest of the examples as the query set (for testing). When k > 0, we repeat the evaluation 100 times and report the following average metrics: (i) F1 score for POS tagging, and (ii) exact-match (EM) and F1 scores for QA. Due to lack of space, we only report the average mean and standard deviation across languages for each model described above.

Results and Discussion
We report the results for POS tagging in Table 1 and for QA in Table 2. These include mean and standard deviation across languages. Note that, in this case, the standard deviation is by no means an interval for statistical significance, but rather reflects the heterogeneity among the evaluation languages. In what follows, we address a series of questions in the light of these figures.
Baselines. MAML and joint multi-source transfer are both strong contenders as state-of-the-art methods for cross-lingual transfer, but which one is better? Comparing the J and B rows, no definite answer emerges in our experiments.

Criteria. The minimax and Neyman-Pearson criteria both improve over the Bayes-criterion baseline, although the latter more sporadically. Compared to the B rows, MM+ achieves gains for every k in POS tagging, with a margin of 0.94 points at k = 0 and 0.79 at k = 20. The same holds for MM in QA, with margins that span from 1.73 at k = 0 to 1.47 at k = 20 in the exact-match metric, and from 0.55 at k = 0 to 1.14 at k = 20 in F1 score. Therefore, Minimax MAML is remarkably consistent in outperforming the baselines, although the gains are sometimes substantial and sometimes only marginal. This is also reflected in the language-specific performances, available in Table 5 and Table 6 in the Appendix. For POS tagging, the F1 scores of only 2 languages (Indonesian and Naija) moderately decrease, whereas the remaining 12 languages show improvements.
Incidentally, it may be worth noting that we did not perform any large-scale search over hyper-parameters such as the initialisations of τ and λ, the threshold r, or differential learning rates for the maximised and minimised parameters. Therefore, these early results may improve even further in the future. This lends credence to our proposition that the minimax and Neyman-Pearson criteria are better suited for out-of-distribution transfer to outlier languages.

Figure 3: Unconstrained values of τ_u and λ_u upon convergence in MM+ and NP+ models for POS tagging.
Optimiser. The results for the proposed optimiser ASGA (Algorithm 1) are favourable in comparison to Gradient Descent Ascent via Adam (Kingma and Ba, 2015) for POS tagging; on the other hand, the opposite trend is observed for QA. Therefore, future investigations are required to shed further light on modifications such as the Symplectic Gradient Adjustment. A tentative explanation of this discrepancy could be the disproportionate number of training languages available in the two tasks.
To get insights into the game dynamics of the adversarial criteria, we plot the unconstrained values of τ_u and λ_u upon convergence in Figure 3. Interestingly, both variables appear to follow the same profile of peaks and troughs; therefore, as expected, languages chosen adversarially in MM also have higher Lagrange multipliers in NP. This group includes, for instance, languages with rare scripts (e.g. Coptic) or with no relatives among the training languages (e.g. Vietnamese). As a final note, we remark that the proposed criteria and optimiser are in principle more general than NLP and could facilitate transfer in other fields. While this thread of research transcends the scope of our work, we illustrate an example for regression in Appendix C.
Minimum Scores across Languages. In addition to the average cross-lingual performance, we also report the minimum cross-lingual performance for POS tagging in Table 3 and for QA in Table 4. This corresponds to the lowest score achieved across all evaluation languages. For POS tagging, we observe that NP and NP+ outperform J and B by 7-12 and 2-5 F1 points, respectively. This reveals that worst-case and constrained risk minimisation drastically uplifts the scores for the most disadvantaged language. Nevertheless, the opposite trend is observed for QA: MM(+) and NP(+) do not alter the minimum score with respect to the F 1 metric, and even degrade it with respect to the exact-match metric. Again, we conjecture that these mixed findings may depend on the different amount and distribution of the training languages in the corresponding datasets: UD offers greater language coverage than TyDiQA, which gives better guidance.

Related Work
MAML is a cutting-edge method for cross-lingual transfer in several NLP tasks (Gu et al., 2018; Nooralahzadeh et al., 2020; Wu et al., 2020; Li et al., 2020, inter alia). However, in all these experiments, the model is adopted in its standard formulation, minimising the expected risk. Therefore, its performance is prone to suffer in outlier languages. Moreover, the assumptions underlying our proposed variants differ from those of other instances of robust optimisation in NLP (Globerson and Roweis, 2006; Oren et al., 2019): in particular, the target language distributions are not explicitly treated as subspaces or covariate shifts of the source languages. In separate fields such as vision, previous attempts at worst-case-aware meta-learning include an approach based on a Euclidean version of the robust stochastic mirror-prox algorithm, and Wang et al. (2020), who rely on reinforcement learning. Our formulation is both fully differentiable and broader, as the decision-theoretic interpretation admits alternative criteria for MAML. What is more, to our knowledge we are the first to successfully augment MAML with a minimax criterion in cross-lingual NLP and with a Neyman-Pearson criterion in general.

Conclusions
To perform cross-lingual transfer to low-resource languages, under a decision-theoretic interpretation Model-Agnostic Meta-Learning (MAML) minimises the expected risk across training languages. Generalisation then relies on the evaluation languages being identically distributed. However, this assumption is incongruous with cross-lingual transfer in realistic scenarios. Therefore, we propose more appropriate training objectives that are robust to out-of-distribution transfer: Minimax MAML, where the worst-case risk is minimised by learning an adversarial distribution over languages; and Neyman-Pearson MAML, where constraints are imposed on the language-specific losses, so that they remain below a certain threshold. From a game-theoretic perspective, both of these variants consist of 2-player competitive games. Therefore, we also explore adaptive optimisers that take into account the underlying game dynamics. The experimental results on zero-shot and few-shot learning for part-of-speech tagging and question answering, whose datasets span tens of typologically diverse languages, confirm that in several settings the proposed criteria are superior to both vanilla MAML and transfer from multiple source languages.

A Language Partitions
The languages from the following families in UD are held out for evaluation (16 treebanks, 14 languages in total): Northwest Caucasian (Abaza), Mande (Bambara), Mongolic (Buryat), Basque, Tupian (Mbya Guarani), Creole (Naija), Tai-Kadai (Thai), Pama-Nyungan (Warlpiri), Austronesian (Indonesian, Tagalog), Dravidian (Tamil, Telugu), Niger-Congo (Wolof, Yoruba). As all 8 languages in TyDiQA belong to families with at most 2 members in the dataset, we randomly create two partitions: in the former, Finnish, Korean, Bengali, and Arabic are used for evaluation and the others for training; in the latter, Russian, Indonesian, Telugu, and Swahili are used for evaluation and the others for training.

B Hyperparameter Setting
POS Tagging. For POS tagging: (i) the batch size was 32, (ii) the maximum sequence length was 128, (iii) the number of epochs was 20, with a patience limit of 10, (iv) both the outer and inner learning rates were 5 × 10^-5, (v) the number of episodes per iteration was 32, (vi) the number of inner loops per outer update was 4, (vii) the number of shots (k) during training was 30, and (viii) the hidden-layer dropout probability for the classifier was 0.2.

QA. For QA: (i) the batch size and k were reduced to 12 due to memory constraints, (ii) the maximum context length was 336 and the document stride was 128, (iii) the maximum question length was 64, and (iv) the inner and outer learning rates were 3 × 10^-5.
For all J baselines, we used a uniform language sampler, since proportional sampling performed worse. As an optimiser, we chose Adam with a learning rate of 5 × 10^-5 and a weight decay of 0.1; we clipped the gradient to a maximum norm of 5.0. For all MAML models, we performed 4 updates in the inner loop, both during training and during fast adaptation (few-shot learning). We ran our experiments on a 48GB NVIDIA Quadro RTX 8000 GPU with the Turing micro-architecture. Each run took approximately 2 hours for training and 3 hours for few-shot learning and evaluation.

C Additional Experiments & Results
Additional Results. Table 5 contains POS tagging F 1 scores of all languages, for all models, in both zero and few-shot settings. Table 6 shows the exact match and F 1 scores for QA.
Sinusoidal Regression. Beyond the real-world, large-scale NLP applications above, we additionally illustrate the effect of the alternative criteria on other ML domains. We run a proof-of-concept experiment on a toy task where we can fully control the distribution of the training and evaluation data, viz. regression of a sinusoidal function. For this task, we follow the same experimental setting and hyper-parameters as Finn et al. (2017): combinations of amplitudes a ∈ [0.1, 5] and phases p ∈ [0, π] determine a set of tasks characterised by the function y = sin(x − p) · a. The inputs are sampled at random from the interval x ∈ [−5, 5].
While both training and evaluation tasks in the original version are sampled uniformly from identical ranges, we also construct an alternative setting with skewed distributions sampled from disjoint ranges: during training, a ∈ [2.5, 5] and p ∈ [π/2, π]; during evaluation, a ∈ [0.1, 2.5] and p ∈ [0, π/2]. For Minimax MAML, we aim at learning the distribution over tasks adversarially. In particular, we consider two separate discrete categorical distributions, softmax(τ_u^(a)) for amplitudes and softmax(τ_u^(p)) for phases, over their respective ranges discretised into 1,000 atoms. Hence, the probability of a task with the i-th amplitude value and the j-th phase value is simply τ_i^(a) · τ_j^(p).

The results for sinusoidal regression are shown in Figure 6. Vanilla MAML (Bayes criterion) consistently outperforms the minimax criterion when the task distribution is identical; on the other hand, the reverse occurs when the task distribution is skewed. MM performs much better in this case, with the gap in performance increasing as the number of shots k decreases. This verifies our hypothesis that the minimax criterion should benefit out-of-distribution regression tasks.
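For concreteness, a minimal sketch of the skewed task sampler used in this setting: training tasks are drawn from one half of the amplitude and phase ranges and evaluation tasks from the disjoint other half, so that evaluation tasks are out-of-distribution by construction. Names are illustrative.

```python
import math
import random


def sample_sine_task(split="train"):
    if split == "train":
        a = random.uniform(2.5, 5.0)              # training amplitudes
        p = random.uniform(math.pi / 2, math.pi)  # training phases
    else:
        a = random.uniform(0.1, 2.5)              # evaluation amplitudes (disjoint range)
        p = random.uniform(0.0, math.pi / 2)      # evaluation phases (disjoint range)
    return lambda x: math.sin(x - p) * a          # y = sin(x - p) * a


def sample_batch(task, k=10):
    xs = [random.uniform(-5.0, 5.0) for _ in range(k)]
    return xs, [task(x) for x in xs]
```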