Distributionally Robust Multilingual Machine Translation

Multilingual neural machine translation (MNMT) learns to translate multiple language pairs with a single model, potentially improving both the accuracy and the memory-efficiency of deployed models. However, the heavy data imbalance between languages hinders the model from performing uniformly across language pairs. In this paper, we propose a new learning objective for MNMT based on distributionally robust optimization, which minimizes the worst-case expected loss over the set of language pairs. We further show how to practically optimize this objective for large translation corpora using an iterated best response scheme, which is both effective and incurs negligible additional computational cost compared to standard empirical risk minimization. We perform extensive experiments on three sets of languages from two datasets and show that our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.

However, in multilingual training, the amount and type of training data available vary greatly across languages. Because most models are trained using empirical risk minimization (ERM), which minimizes the average loss on the training set, high-resource languages (HRLs) with large amounts of data contribute the majority of the training objective. When model capacity is limited, this results in trade-offs or decreased performance on some languages, particularly low-resource languages (LRLs) (Arivazhagan et al., 2019; Wang et al., 2020b, 2021). To better control this trade-off, a common practice is to balance the training distribution by heuristic oversampling of LRLs (Johnson et al., 2017; Neubig and Hu, 2018; Arivazhagan et al., 2019).
* Equal contribution. Our code is available at https://github.com/violet-zct/fairseq-dro-mnmt
Although simple data balancing can improve the performance on LRLs significantly, it is far from optimal. First, the sampling hyperparameters need to be adjusted for different datasets. Second, the use of simple heuristics does not consider the inherent level of difficulty in learning each language, the similarity between languages in the multilingual dataset, and other factors that affect cross-lingual transfer. Because of this, previous work has indicated the importance of learning strategies that are explicitly tailored for each multilingual learning scenario (Wang et al., 2020a).
In this paper, we propose a new learning procedure for multilingual translation that automatically adjusts the training distribution of different languages using distributionally robust optimization (DRO) (Ben-Tal et al., 2013a). In contrast to ERM, DRO casts learning as a game between the learner and an adversary, where the learner picks a model while the adversary picks the hardest data distribution for that model within an uncertainty set $\mathcal{Q}$ of potential distributions we wish to perform well on (which typically contains the training distribution $P_0$). We first demonstrate how to apply DRO to multilingual training by letting the adversary choose the relative weights of individual language pairs in the training objective. However, we empirically find that naively applying existing methods to multilingual learning yields inferior results to ERM, mostly because (1) standard DRO objectives tend to be overly conservative and only take into account language pairs with very large losses, and (2) existing optimization algorithms for DRO essentially reweight the gradients of examples in a mini-batch, which implicitly changes the scale of the learning rate. This hurts modern NLP models like Transformers (Vaswani et al., 2017), which are highly sensitive to learning rate schedules.
To remedy this, we propose both a novel training objective and a corresponding optimization algorithm amenable to the multilingual setting. Our objective is a variation on the Group DRO of Sagawa et al. (2020) with a more flexible uncertainty set, parameterized by the χ²-divergence. To efficiently solve the min-max game, we propose an iterated best response scheme that, at each epoch, re-samples the training data according to the worst weighting for the current model parameters, and then runs ERM training on the re-sampled dataset. Our method, which we refer to as χ-IBR, incurs negligible additional computational cost compared to ERM.
While this method applies to essentially any multilingual task, we specifically demonstrate its benefit on three sets of language pairs from two multilingual machine translation datasets. We experimentally test these choices by comparing several objectives and optimization algorithms, and results show that our method consistently outperforms existing DRO procedures and various strong baselines.

Preliminaries
Notation. Throughout this paper, n denotes the training set size and d the number of parameters of the model. For $m \in \mathbb{N}$, $\Delta_m$ denotes the m-dimensional simplex, i.e. $\Delta_m := \{q \in \mathbb{R}^m : q_i \ge 0 \text{ and } \sum_i q_i = 1\}$. The data lies in $\mathcal{X} \times \mathcal{Y}$, where $(x, y) \in \mathcal{X} \times \mathcal{Y}$ is a source and target sentence pair with $x = (x_1, \ldots, x_{L_x})$ and $y = (y_1, \ldots, y_{L_y})$. The function $\ell : (\mathcal{X} \times \mathcal{Y}) \times \mathbb{R}^d \to \mathbb{R}$ refers to the loss. We consider maximum-likelihood estimation, i.e. for a target sentence y with $L_y$ tokens, we define
$$\ell(x, y; \theta) := -\frac{1}{L_y} \sum_{i=1}^{L_y} \log p(y_i \mid x, y_{<i}; \theta).$$

Multilingual Machine Translation
In contrast to bilingual machine translation, which translates from a single source language S to a target language T, multilingual neural MT (MNMT) learns a single model to translate between N language pairs $\{(S_1, T_1), \ldots, (S_N, T_N)\}$. The training data $D^{\text{train}}$ is the concatenation of the N parallel datasets, i.e. $D^{\text{train}} = [D_1; D_2; \cdots; D_N]$. We can then define the probability over each language pair $p^{\text{train}} \in \Delta_N$ as $p^{\text{train}}_i = |D_i| / \sum_j |D_j|$. We now describe two common training objectives for MNMT.
Empirical Risk Minimization (ERM). The simplest and most common approach for MNMT is to minimize the empirical loss over data points, which we will refer to as ERM. More precisely, we define the average loss on a parallel dataset $D_i$ as
$$L(\theta; D_i) := \frac{1}{|D_i|} \sum_{(x, y) \in D_i} \ell(x, y; \theta).$$
ERM for multilingual models then corresponds to simply minimizing the loss over D, i.e. over all the aggregated parallel sentence pairs. This yields
$$\hat{\theta}^{\mathrm{ERM}}_n \in \arg\min_{\theta} \; L(\theta; D) = \sum_{i=1}^{N} p^{\text{train}}_i L(\theta; D_i).$$
Classical results in learning theory guarantee that, under mild assumptions, as n goes to infinity, $\hat{\theta}^{\mathrm{ERM}}_n$ will show good performance on test sets with the same distribution as D. However, this does not guarantee that our model will perform adequately on individual parallel datasets. To remedy this issue, several works propose varying the sampling distribution (or equivalently, the weighting of the parallel datasets in ERM) in order to encourage more uniform performance across language pairs.

Weighted Risk Minimization and Sampling Strategies. The amount of training data can vary significantly across language pairs. As a result, in ERM training (i.e. optimizing for the average loss across sentence pairs), HRLs contribute most of the objective, resulting in poor performance on LRLs. Balancing the objective (or equivalently, the usage of training data) between HRLs and LRLs is important to maintain good performance across all languages (Devlin et al., 2019; Arivazhagan et al., 2019). A commonly adopted approach in multilingual training is temperature-based sampling (Arivazhagan et al., 2019; Conneau et al., 2020), where the probability of sampling data from $D_i$ is proportional to its data size raised to the power 1/τ for a temperature τ, i.e. $p_{\tau,i} = |D_i|^{1/\tau} / \sum_j |D_j|^{1/\tau}$ (referred to as ERM with τ in §4). This is equivalent to optimizing the re-weighted objective
$$L_\tau(\theta; D) := \sum_{i=1}^{N} p_{\tau,i} L(\theta; D_i).$$
As a result, τ = 1 corresponds to ERM, where most of the contribution comes from the HRLs, and τ = ∞ corresponds to sampling language pairs uniformly at random, i.e. with data from LRLs being over-sampled. This approach comes with three major drawbacks: (1) τ is an extra hyper-parameter that requires tuning for each MNMT instance to balance the performance across both HRLs and LRLs; (2) this heuristic sampling method does not consider the training dynamics of each language and how the optimal sampling distribution might evolve during the training process; and (3) the parameterization of $p_\tau$ is not only very constrained (essentially one degree of freedom), it is also only a function of the quantity of training data, which is too rigid to achieve the desired performance.
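The effect of the temperature can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `temperature_sampling_probs` and the corpus sizes are hypothetical.

```python
import numpy as np

def temperature_sampling_probs(sizes, tau):
    """Return p_tau with p_tau[i] proportional to sizes[i] ** (1 / tau)."""
    sizes = np.asarray(sizes, dtype=float)
    # Work in log space so that huge corpora or small tau cannot overflow.
    logits = np.log(sizes) / tau
    logits -= logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

sizes = [1_000_000, 100_000, 10_000]  # hypothetical corpus sizes |D_i|
p_erm = temperature_sampling_probs(sizes, tau=1.0)     # equals p_train
p_flat = temperature_sampling_probs(sizes, tau=100.0)  # close to uniform
```

With τ = 1 the probabilities coincide with the data proportions, while a large τ flattens them towards the uniform distribution, which over-samples the smaller corpora.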
To resolve some of the above issues, the recently proposed MultiDDS (Wang et al., 2020a) uses a gradient-based meta-learning approach to learn the sampling distribution over language pairs to maximize gradient similarity with a multilingual development set. However, due to the necessity to calculate and store extra gradients, their approach comes at an increased computational and memory cost. In contrast, χ-IBR enjoys the same computational complexity as ERM, and as we show in experiments it also largely outperforms MultiDDS.

Distributionally Robust Optimization
In contrast to ERM and related sampling strategies, which optimize for a fixed training distribution, DRO aims to find a model θ that performs well on an entire collection of potential test distributions $\mathcal{Q}$ (the uncertainty set). Formally, we wish to solve
$$\min_{\theta} \; \sup_{Q \in \mathcal{Q}} \; \mathbb{E}_{(x, y) \sim Q}\left[\ell(x, y; \theta)\right]. \qquad (1)$$
Originating from operations research (Delage and Ye, 2010; Ben-Tal et al., 2013a,b; Bertsimas et al., 2018), DRO is a promising way to tackle robustness in a variety of machine learning and NLP problems (Hashimoto et al., 2018; Oren et al., 2019; Levy et al., 2020).
We present here a recent variant, Group DRO, developed by Sagawa et al. (2020), which incorporates additional information about the data distribution to define more meaningful uncertainty sets. Abstractly, this method assumes a collection of distributions over subpopulations $\{P_g\}_{g \in G}$ such that the training distribution is a mixture of these subpopulations. Importantly, it assumes that this group structure is observed. The Group DRO objective then minimizes the worst-case loss over these groups, which is equivalent to (1) with $\mathcal{Q} = \{\sum_{g \in G} q_g P_g : q \in \Delta_{|G|}\}$, i.e. all possible mixtures of the distributions over subpopulations. In MNMT, the N language pairs at our disposal naturally correspond to groups; thus the Group DRO objective can be defined as
$$L_{\mathrm{GDRO}}(\theta; D) := \max_{i \in [N]} L(\theta; D_i). \qquad (2)$$
In other words, Group DRO seeks a model θ that performs well on the worst language pair. Oren et al. (2019) propose a related but less conservative objective, CVaR-Group DRO at level α ∈ [0, 1], which instead considers the average of the αN largest group losses.
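As a rough illustration of the two objectives, the sketch below evaluates them from a vector of per-language-pair losses. It follows the simplified description in the text (a plain average of the largest losses) and makes two assumptions that are not from the paper: αN is rounded up to an integer, and α is restricted to (0, 1] so that at least one group is kept. The function names and loss values are hypothetical.

```python
import math
import numpy as np

def group_dro_objective(group_losses):
    """Worst-case loss over groups (the max over language pairs)."""
    return float(np.max(group_losses))

def cvar_group_dro_objective(group_losses, alpha):
    """Average of the ceil(alpha * N) largest group losses."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    losses = np.sort(np.asarray(group_losses, dtype=float))[::-1]
    k = max(1, math.ceil(alpha * len(losses)))
    return float(losses[:k].mean())

losses = [1.0, 0.5, 0.2, 0.1]  # hypothetical per-language-pair losses
worst = group_dro_objective(losses)
cvar_half = cvar_group_dro_objective(losses, alpha=0.5)
cvar_full = cvar_group_dro_objective(losses, alpha=1.0)
```

Under these assumptions, α = 1 averages all group losses, while small α keeps only the single largest loss and recovers the Group DRO value.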

Methods for Distributionally Robust Multilingual Learning
As we previewed in §2, Group DRO is a natural objective for the multilingual setting. However, in experiments we found that naively applying existing DRO objectives fails to achieve performance on par with strong baselines, often improving results on language pairs with high losses but sacrificing too much performance overall. Our main contribution is showing how to successfully apply DRO to the MNMT setting, and to the best of our knowledge, our work is the first to do so. To that end, our methodological contributions are two-fold: (i) we first describe the shortcomings of the Group DRO objective (2), then propose a related training criterion that addresses these issues, (ii) we describe an optimization algorithm to solve the min-max optimization problem that is amenable to the MNMT setting.

χ²-Group DRO
Shortcomings of Group DRO. A weakness of the objective (2) is that, apart from the language pair with the largest loss, it does not take into account the value of the loss on the other language pairs. To illustrate this, consider an example with N = 3 language pairs and suppose that there exist two parameters $\theta_1$ and $\theta_2$ with the following losses: We have that $L_{\mathrm{GDRO}}(\theta_1; D) = 1.1$ but $L_{\mathrm{GDRO}}(\theta_2; D) = 1.0$. Consequently, the Group DRO objective will prefer $\theta_2$ to $\theta_1$, while clearly one would pick $\theta_1$ over $\theta_2$ in most practical cases. The aforementioned CVaR-Group DRO also exhibits this behavior and ignores the values of the language pairs not in the largest α-fraction.
To address this issue, let us rewrite the objective as
$$L_{\mathrm{GDRO}}(\theta; D) = \max_{i \in [N]} L(\theta; D_i) = \max_{q \in \Delta_N} \sum_{i=1}^{N} q_i L(\theta; D_i),$$
where the equality holds because the optimal weighting puts all the mass on the language pair with the largest loss. A natural way to make the objective less conservative is to instead take the maximum over a subset of the simplex $U \subset \Delta_N$. This leads to the following objective:
$$L_U(\theta; D) := \max_{q \in U} \sum_{i=1}^{N} q_i L(\theta; D_i). \qquad (3)$$
Different choices of U will yield different objectives with different robustness properties. Note that this is a general formulation, as $U = \{p^{\text{train}}\}$ reduces to the ERM objective, while $U_\alpha = \{q \in \Delta_N : \|q / p^{\text{train}}\|_\infty \le 1/\alpha\}$ corresponds to the CVaR-Group DRO of Oren et al. (2019). We would like to choose U such that optimizing this objective results in models with better performance on language pairs with large losses (typically LRLs) without significant degradation of average performance.
To this end, we turn to a common and flexible choice for U: f-divergence balls (Csiszár, 1967) of radius ρ > 0 around $p^{\text{train}}$, namely
$$U^f_\rho := \{ q \in \Delta_N : D_f(q \,\|\, p^{\text{train}}) \le \rho \}, \quad \text{where } D_f(q \,\|\, p) := \sum_{i=1}^{N} p_i f(q_i / p_i).$$
In particular, we propose using the χ²-divergence, which corresponds to $f(t) = \frac{1}{2}(t - 1)^2$, and define $\chi^2(q, p) = \frac{1}{2} \sum_i p_i (q_i / p_i - 1)^2$ with its corresponding uncertainty set $U^{\chi^2}_\rho$. The χ²-divergence has a long history (Ben-Tal et al., 2013b), and previous work shows that minimizing the robust loss with the χ²-uncertainty set enjoys favorable statistical properties such as optimally trading off bias and variance (Duchi and Namkoong, 2019) or guaranteeing robustness and fairness (Hashimoto et al., 2018; Duchi and Namkoong, 2020). We refer to the objective $L_{U^{\chi^2}_\rho}$ as the χ²-Group DRO. Going back to the toy example and setting ρ = 0.1, the objective with the χ²-uncertainty set rightly prefers $\theta_1$ to $\theta_2$, since it takes into account all the losses and not only the largest. We further confirm these intuitions and show in §4 and §5 that this is a suitable choice of uncertainty set for MNMT.
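To make the uncertainty set concrete, the following sketch evaluates the χ²-divergence as defined above and checks membership in the ball of radius ρ. The helper names and the example distributions are hypothetical and chosen only for illustration.

```python
import numpy as np

def chi2_divergence(q, p):
    """chi^2(q, p) = 1/2 * sum_i p_i * (q_i / p_i - 1)^2, for p_i > 0."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return 0.5 * float(np.sum(p * (q / p - 1.0) ** 2))

def in_chi2_ball(q, p, rho):
    """True if q lies in the chi^2 ball of radius rho around p."""
    return chi2_divergence(q, p) <= rho

p_train = np.full(3, 1.0 / 3.0)           # hypothetical: three equally sized corpora
mild_shift = np.array([0.5, 0.25, 0.25])  # moderately up-weights the first pair
point_mass = np.array([1.0, 0.0, 0.0])    # the weighting Group DRO would pick
```

For these inputs the mild shift has divergence 0.0625 and lies inside the ρ = 0.1 ball, whereas the point mass has divergence 1.0 and is excluded. This is the sense in which the χ² ball is less conservative than the full simplex.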

Optimization algorithm
Desiderata of the optimization algorithm. Minimizing the objective (3) effectively corresponds to a min-max optimization problem. Even in the relatively simple convex-concave setting, these are generally harder to solve than convex minimization problems. Recall that we want to solve
$$\min_{\theta} \; \sup_{q \,:\, \chi^2(q, p^{\text{train}}) \le \rho} \; \sum_{i \le N} q_i L(\theta; D_i).$$
Due to the architectures we consider in MNMT, we wish for an algorithm that effectively changes the sampling distribution over mini-batches of data instead of importance-weighting the gradients. Indeed, standard MT architectures such as the Transformer (Vaswani et al., 2017) are extremely sensitive to learning rate schedules, and we empirically observe that importance-weighting the gradients results in poor performance.
The canonical way to solve min-max problems is via primal-dual methods (PD) (Nemirovski et al., 2009) (see background in Appendix B), where at each step t, one keeps two vectors $(\theta_t, q_t)$ and alternates between a gradient descent step on $\theta_t$ and a gradient ascent step on $q_t$. One can perform these updates efficiently, as they only require unbiased stochastic gradient estimates of the loss w.r.t. $\theta_t$ and $q_t$. To obtain an unbiased gradient estimate of the loss w.r.t. $\theta_t$, one either has to, at each step, sample a mini-batch of examples from Multinomial($q_t$) and return the gradient of the loss, or sample a mini-batch from Multinomial($p^{\text{train}}$) and importance-weight the gradient.
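The two estimators agree in expectation by a standard change of measure. The identity below is a sketch that assumes $p^{\text{train}}_i > 0$ for every language pair; the shorthand $g_i$ is introduced here for readability and is not notation from the paper.

```latex
\nabla_\theta \sum_{i=1}^{N} q_{t,i}\, L(\theta_t; D_i)
  = \mathbb{E}_{i \sim q_t}\left[ g_i \right]
  = \mathbb{E}_{i \sim p^{\text{train}}}\left[ \frac{q_{t,i}}{p^{\text{train}}_i}\, g_i \right],
\qquad g_i := \nabla_\theta L(\theta_t; D_i).
```

In the second form the ratio $q_{t,i} / p^{\text{train}}_i$ multiplies the gradient directly, so it acts like a per-example rescaling of the step size. In the first form the step size is untouched and only the composition of the mini-batch changes.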
As previously mentioned, the latter is not suitable for Transformer-type architectures. The former option is not ideal as it is more convenient for an algorithm to decide the sequence of minibatches every epoch rather than every optimizer step as this integrates much more smoothly with data loaders in deep learning frameworks, especially when doing distributed training. As a result, we posit that primal-dual algorithms are not an adequate choice for optimizing DRO-type objectives in our setting. We further discuss this point in §5.
To circumvent this issue, we consider a different optimization algorithm, which we refer to as iterated best response (IBR), where, instead of doing a single gradient descent and ascent step, we iterate between (approximately) solving the maximization (resp. minimization) on q (resp. θ), while keeping $\theta_t$ (resp. $q_t$) fixed. This is similar in spirit to algorithms in the game theory literature where individuals play the optimal strategy (best response) assuming everyone else's strategies remain constant. Under some mild assumptions, this procedure converges to the equilibrium of the game (Roughgarden, 2016). Formally, we alternate between
$$\theta_{t+1} \leftarrow \arg\min_{\theta} \sum_i q_{t,i} L(\theta; D_i), \qquad (6)$$
$$q_{t+1} \leftarrow \arg\max_{q \,:\, \chi^2(q, p^{\text{train}}) \le \rho} \sum_i q_i L(\theta_{t+1}; D_i). \qquad (7)$$
Practical implementation. As we show in Appendix A, given the values of the loss, the q-update (7) is computed to accuracy ε in O(N log(1/ε)) steps. Indeed, by taking the dual (Boyd and Vandenberghe, 2004) of (7), we transform the N-dimensional problem into a one-dimensional root-finding problem over the dual variable, which we efficiently solve by bisection. We provide the details in Appendix A. Note that this computational cost is negligible compared to computing the gradient of the loss. We implement the θ-update (6) by running a training epoch on a re-sampling of the training set D according to $q_{t+1}$.
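One way the q-update could be realised is sketched below. It is not a reproduction of Appendix A: it assumes the standard form of the maximiser over a χ² ball, with $q_i$ proportional to $p_i \max(L_i - \eta, 0)$ for a scalar η, and bisects on η until the divergence constraint is tight. The function names, tolerances and loss values are hypothetical.

```python
import numpy as np

def chi2_divergence(q, p):
    """chi^2(q, p) = 1/2 * sum_i p_i * (q_i / p_i - 1)^2."""
    return 0.5 * float(np.sum(p * (q / p - 1.0) ** 2))

def chi2_best_response(losses, p_train, rho, tol=1e-10, max_iter=200):
    """Approximately maximise sum_i q_i * losses[i] over the chi^2 ball around p_train."""
    p = np.asarray(p_train, dtype=float)
    # Shift so the largest loss is 0; the maximiser is invariant to this shift.
    shifted = np.asarray(losses, dtype=float)
    shifted = shifted - shifted.max()
    spread = -shifted.min()
    if rho <= 0.0 or spread == 0.0:
        return p.copy()  # zero radius or all losses equal: keep p_train

    # If even the most concentrated weighting is inside the ball, it is optimal.
    top = p * (shifted == 0.0)
    top = top / top.sum()
    if chi2_divergence(top, p) <= rho:
        return top

    def q_of(eta):
        # Candidate maximiser for a given eta: q_i proportional to p_i * max(L_i - eta, 0).
        w = p * np.maximum(shifted - eta, 0.0)
        return w / w.sum()

    # Very negative eta gives q close to p_train (divergence near 0);
    # eta near 0 concentrates q on the largest loss (divergence above rho here).
    lo, hi = -2.0 * spread, 0.0
    while chi2_divergence(q_of(lo), p) > rho:
        lo *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if chi2_divergence(q_of(mid), p) > rho:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol * spread:
            break
    return q_of(lo)  # feasible side of the bracket

p_train = np.full(3, 1.0 / 3.0)     # hypothetical data proportions
losses = np.array([1.1, 0.1, 0.1])  # hypothetical per-language-pair losses
q_star = chi2_best_response(losses, p_train, rho=0.1)
```

Each bisection step costs O(N), which matches the O(N log(1/ε)) complexity stated in the text. Returning the lower end of the bracket keeps the result inside the ball even when the iteration stops early.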
We keep track of the token-level loss $L_k$ on each language pair k with an exponential moving average (EMA) (see line 14 in Algorithm 1). We precisely describe our implementation in Algorithm 1. It respects our desiderata and comes at negligible additional computational cost. In §5, we compare primal-dual and iterated best response for various uncertainty sets.
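The two bookkeeping steps described here, tracking per-language-pair losses with an EMA and re-sampling an epoch according to q, might look roughly as follows. This is a sketch and not Algorithm 1: the function names, the decay value, and the choice to sample examples with replacement within each corpus are all assumptions made for illustration.

```python
import numpy as np

def update_ema_losses(ema_losses, token_losses, lang_ids, decay=0.1):
    """Blend each language pair's running loss with its mean loss in the current batch."""
    ema = np.array(ema_losses, dtype=float)
    token_losses = np.asarray(token_losses, dtype=float)
    lang_ids = np.asarray(lang_ids)
    for k in np.unique(lang_ids):
        batch_mean = token_losses[lang_ids == k].mean()
        ema[k] = (1.0 - decay) * ema[k] + decay * batch_mean
    return ema

def resample_epoch(dataset_sizes, q, epoch_size, rng):
    """Draw (language pair, example index) pairs so that pair i appears with probability q[i]."""
    sizes = np.asarray(dataset_sizes)
    lang_ids = rng.choice(len(sizes), size=epoch_size, p=q)
    # Uniform index within the chosen corpus, sampled with replacement.
    example_ids = rng.integers(0, sizes[lang_ids])
    return lang_ids, example_ids

rng = np.random.default_rng(0)
ema = update_ema_losses([2.0, 2.0], token_losses=[1.0, 3.0, 1.0],
                        lang_ids=[0, 0, 0], decay=0.5)
lang_ids, example_ids = resample_epoch([100, 10], q=[0.25, 0.75],
                                       epoch_size=4000, rng=rng)
```

Because the sequence of examples is fixed once per epoch, a scheme like this can be handed to an ordinary data loader, which is the property the text asks of the optimization algorithm.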
Subtracting the baseline. Oren et al. (2019) propose subtracting a per-group scalar, which we refer to as a baseline, from each group loss before taking the maximum over q. They learn this baseline using a generative bi-gram model on each group. Recall that $\hat{\theta}^{\mathrm{ERM}}$ is the parameter we obtain when we minimize the average loss, and define $\hat{\theta}^{\tau}$ as the parameter obtained when optimizing $L_\tau$. In this work, we propose using $b_i = L(\hat{\theta}^{\mathrm{ERM}}; D_i)$ or $b_i = L(\hat{\theta}^{\tau}; D_i)$ as the baseline for language pair i. Intuitively, the baseline corresponds to the minimum performance we wish for our model on the given group, and as such the losses obtained with ERM and its temperature variants are natural candidates. We show in §5 that these yield significant improvements and conveniently make our method more robust to the choice of ρ. We leave different (potentially learned) choices of baseline to future work.
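A small numerical sketch shows why the baseline matters for the adversary. The numbers are hypothetical, as is the helper name `baselined_losses`; the only step taken from the text is that the baseline is subtracted from each group loss before the maximisation over q.

```python
import numpy as np

def baselined_losses(group_losses, baseline_losses):
    """Per-group excess loss L_i - b_i that the adversary sees."""
    return (np.asarray(group_losses, dtype=float)
            - np.asarray(baseline_losses, dtype=float))

current = np.array([2.0, 1.5])   # hypothetical current losses per language pair
baseline = np.array([1.9, 1.0])  # hypothetical losses of a reference ERM model
excess = baselined_losses(current, baseline)

hardest_raw = int(np.argmax(current))      # pair with the largest raw loss
hardest_baselined = int(np.argmax(excess)) # pair furthest above its baseline
```

In this example the first pair has the larger raw loss but is already close to its baseline, so after subtraction the adversary focuses on the second pair, which is furthest from the performance we expect to be achievable.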

Datasets
We evaluate the proposed method on two datasets: the 58-language TED talk corpus (Qi et al., 2018) and WMT. For TED, we consider two sets of languages: (1) related includes 4 LRLs (aze, bel, glg, slk) and their corresponding related HRLs (tur, rus, por, ces); (2) diverse includes 8 languages with varying amounts of data, chosen without considering linguistic similarities (bos, mar, hin, mkd, ell, bul, fra, kor). See Wang et al. (2020a) for the interpretation of the language codes. Both the related and diverse sets have around 760K sentences of training data. For WMT, we consider 2 HRLs (German: deu and French: fra) and 2 LRLs (Tamil: tam and Turkish: tur). We subsample around 5M training sentences from the parallel corpus provided by the WMT shared task. Specifically, the training data of deu-eng and fra-eng is from WMT14, tam-eng is from WMT20 and tur-eng is from WMT18. We use the corresponding test and dev sets from each shared task for evaluation and validation.
We evaluate both en-to-any (translate English to a target language) and any-to-en (translate a source language into English) directions for all language sets. We provide dataset statistics in Appendix C.
Table 1: BLEU scores of the best ERM model (among τ = 1/5/100; τ = 5/100 are significantly worse than τ = 1, thus we omit these results), MultiDDS (Wang et al., 2020a) and our approach on the test sets of the TED dataset. Bold (resp. underlined) values indicate the best (resp. second best) performance for each language pair. Values under the language codes are the proportion of the language in the training data.

Experimental setup
For the translation models, we adopt the encoder-decoder Transformer (Vaswani et al., 2017) architecture with the implementation provided in fairseq (Ott et al., 2019). For both datasets, we use a Transformer-base architecture with 6 encoder and 6 decoder layers, a hidden dimension of 512 and 8 attention heads. The model is trained for 200K and 300K steps for TED and WMT respectively, with a batch size of 65,536 tokens. For both datasets, we learn separate sentencepiece (Kudo and Richardson, 2018) vocabularies for English and for the combined corpus of the other languages. We use beam search with beam size 5 for decoding and report the SacreBLEU score (Post, 2018; Papineni et al., 2002) on test sets for evaluation. For the TED and WMT datasets respectively, the constraint size ρ for the chi-square ball is set to 0.05 and 0.3, and for the baseline losses we use the average token-level loss on each $D_i$ computed from the ERM model with τ = 1 and τ = 100 (see §5 for more analyses of these choices). We provide additional preprocessing and training details in Appendix D.
Table 3: BLEU scores of different DRO objectives and algorithms, primal-dual (PD) and iterated best response (IBR), on the WMT test sets.

Main Results
We present the BLEU scores of en→any and any→en translation directions on TED and WMT data in Tab. 1 and 2 respectively. First, for both TED and WMT datasets, χ-IBR outperforms all the other baseline methods in terms of average BLEU score over all language pairs. Looking more closely at the BLEU scores for each individual language pair, χ-IBR improves over ERM on almost all language pairs for both translation directions. Second, as expected from temperature-based sampling methods, different values of τ achieve different trade-offs between the performance on HRLs and LRLs. As we explained in §2.1, large values of τ favor LRLs, as this results in data being sampled with equal probability from each language pair, while small values of τ approach ERM and benefit HRLs. As a result, τ needs to be carefully tuned to achieve adequate performance on both HRLs and LRLs.
Importantly, χ-IBR achieves a significantly better trade-off than the sampling method for any value of τ. We show in Fig. 2 the quantitative improvements of χ-IBR over the best τ for various datasets and in both translation directions. Surprisingly, while the improvements are larger on LRLs in most cases, we consistently observe improvements on HRLs as well. This indicates that finding the right sampling distribution over language pairs facilitates cross-lingual transfer. We further observe that χ-IBR achieves more significant improvements in the en→any direction than in the any→en direction, which further supports this hypothesis. Indeed, it is attested in previous work on MNMT (Arivazhagan et al., 2019) that it is harder to decode into multiple languages than to encode from multiple languages, and as such the en→any direction is a significantly harder multi-task learning problem. This is where optimal cross-lingual transfer would yield the largest gains, which is what we observe in practice. We also note that our method incurs negligible computational overhead compared to ERM.

Analysis
The importance of the sampling distribution. An advantage of our method over temperature-based sampling methods is that it dynamically adjusts the training distribution as the model evolves and does not compute it solely as a function of the amount of training data. Our hypothesis is that this is important to achieve good performance across language pairs and that different sampling distributions will be adequate at different stages of training. We empirically check our hypothesis by studying how the training distribution q (the so-called best response) changes across training epochs. In Fig. 3 and 4, we plot the best response of χ-IBR across epochs on the TED-diverse dataset for both translation directions. In addition, we also plot the historical losses in Fig. 5 (Appendix). Our first observation is that the optimal q noticeably evolves across epochs, which further showcases the need for dynamically adjusting the sampling distribution. We make the following further observations: (i) χ-IBR demonstrates the desired behavior and, at the early stages of training, always down-weights HRLs and up-weights LRLs; (ii) somewhat counterintuitively, there is no direct correlation between $|D_i|$, the amount of data in language i, and the final value $L(\theta^{(t)}; D_i)$. The latter is further evidence of the limitations of sampling distributions based only on the amounts of training data $|D_i|$. Indeed, while kor is an HRL, it is typologically much farther from English, so there is more inherent uncertainty in the task. Consequently, it has larger losses and is consistently up-weighted throughout training. On the other hand, while hin is an LRL, it achieves low loss after being up-weighted during the early stages of training and is consequently down-weighted after that.
Comparison to DRO variants. We demonstrate the benefits of χ-IBR over other DRO objectives by extensively evaluating a range of robust objectives and associated optimization algorithms. In terms of objectives, we compare against (1) Group DRO (Sagawa et al., 2020), (2) CVaR-Group DRO (Oren et al., 2019) and (3) FastDRO (Levy et al., 2020). In terms of algorithms, we experiment with primal-dual methods and our proposed iterated best response procedure, both of which we described in §3.2. Note that in the case of Group DRO (i.e. $U = \Delta_N$), iterated best response is not a sensible choice, as it would result in each training epoch being spent on a single language pair. In the case of CVaR-Group DRO, we follow the implementation of Oren et al. (2019), which is a hybrid of the two optimization algorithms with a primal update on θ and a best response update on q. We compare the performance of these methods on the WMT dataset. For fairness, we baseline losses in the same way for all the DRO objectives. We first observe that, outside of χ-IBR, none of the DRO objectives are competitive with temperature-weighted ERM. We also observe that for both uncertainty sets, iterated best response convincingly outperforms the same objective trained with primal-dual. We finally note that, for a fixed optimization algorithm, χ²-Group DRO outperforms the CVaR objective on all but one language pair. This validates both our choice of uncertainty set and of optimization procedure.
The effects of baselined losses. We study the effect of the choice of baseline on the performance across languages. In Tab. 4, we empirically evaluate different baseline choices and uncertainty sizes ρ. We observe that in the TED dataset, baselining with $L(\hat{\theta}^{\mathrm{ERM}}; D_i)$ performs significantly better than baselining with $L(\hat{\theta}^{\tau=100}; D_i)$ ((e) versus (g)), while it is reversed for WMT.
We explain this by observing that the LRLs in TED consist of very small amounts of data (on the order of a few thousand sentences), and using τ = 100 results in a severe oversampling of LRLs, which the model then fits perfectly. Recall the intuition that the baseline sets a lower bound on the performance we wish to achieve: because of the small training data and overfitting, the baseline losses on the LRLs are very low, so the model disproportionately up-weights the LRLs, which harms overall performance. This does not occur in WMT, where uniform sampling across language pairs sets a good target performance for DRO methods. Finally, we see that with the right baseline loss, our method is more robust to different choices of ρ (e.g., comparing (c) and (d) versus (e) and (f)). We consistently observe this for other translation directions and datasets.

Conclusion
We showed how to successfully apply DRO to the MNMT setting and automatically adjust the sampling distribution over language pairs resulting in sizeable improvements in performance. We posit that this approach would also be successful in other multilingual scenarios. Our work raises a few questions: (i) what are the right baseline losses? (ii) surprisingly, χ-IBR also improves performance on HRLs; under what circumstances does cross-lingual transfer happen and which languages does it benefit most? We hope our work could inspire better distributionally robust learning methods for multilingual training in the future.