Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation

Sparsely gated Mixture of Experts (MoE) models have been shown to be a compute-efficient way to scale model capacity for multilingual machine translation. However, on low-resource tasks, MoE models over-fit severely. We show that effective regularization strategies, namely dropout techniques for MoE layers (EOM and FOM), Conditional MoE Routing and Curriculum Learning methods, prevent over-fitting and improve the performance of MoE models on low-resource tasks without adversely affecting high-resource tasks. On a massively multilingual machine translation benchmark, our strategies result in about +1 chrF++ improvement on very low-resource language pairs. We perform an extensive analysis of the learned MoE routing to better understand the impact of our regularization methods and how we can improve them.


Introduction
Training massively multitask models such as multilingual machine translation models benefits from transfer learning across different tasks, but these models also suffer from reduced per-task capacity and potential interference between conflicting tasks. Scaling up models has been shown to be a very effective strategy in many natural language processing tasks such as language modeling, massively multilingual translation and natural language understanding (Brown et al., 2020; Kaplan et al., 2020). Most of these advancements have focused on training increasingly larger dense models. However, dense model scaling is computationally expensive; as a result, various sparse model architectures have been proposed to increase model capacity without incurring additional compute costs. The most commonly used one is the Sparsely-Gated Mixture-of-Experts (MoE) layer (Shazeer et al., 2017; Lepikhin et al., 2020; Du et al., 2021; Hwang et al., 2022; Zoph et al., 2022). MoE models are a type of conditional compute models (Bengio et al., 2013; Almahairi et al., 2016) that activate a subset of model parameters per input, as opposed to dense models that activate all model parameters. MoE models unlock significant representational capacity while maintaining the same inference and training efficiency in terms of FLOPs as the core dense architecture. As a result, past work has demonstrated improved performance on multitask models such as multilingual machine translation when using MoE models (Lepikhin et al., 2020; Kim et al., 2021; Fedus et al., 2022; Zoph et al., 2022).

Figure 1: Validation perplexity of dense and MoE (64 experts) models. We show a high-resource direction (eng-fra) that does not suffer from over-fitting, while a low-resource direction sees extreme over-fitting.
We notice, however, that on imbalanced datasets MoE models suffer from over-fitting on low-resource tasks, i.e., tasks with relatively little training data. Figure 1 illustrates this phenomenon on a multilingual translation benchmark. We see that eng-fra, a high-resource translation direction, does not over-fit with either dense or MoE models. On the other hand, eng-kon, a low-resource translation direction, over-fits severely with the MoE model compared to the dense model.
In this work, we introduce four effective strategies to reduce the over-fitting of MoE models on low-resource tasks in a massively multilingual MT benchmark:

1. Dropout techniques for MoE layers: we introduce Expert Output Masking (EOM) and Final Output Masking (FOM), two dropout methods specific to MoE layers that we apply on top of overall dropout.
2. Conditional MoE Routing (CMR): We train an additional gate to decide when to route a token to an MoE layer vs. a shared dense layer.
3. Curriculum Learning (CL): We introduce low-resource pairs that are prone to over-fitting in the later stages of model training.
On a massively multilingual MT benchmark, we experimentally demonstrate the effectiveness of each of these strategies. In particular, we observe close to +1 chrF++ improvement with the EOM, FOM, CMR and CL strategies on very low-resource language directions out of English.

Background
We first describe the multilingual machine translation (MMT) task setup, the dense backbone architecture, and how we augment it with MoE layers.
Multilingual Machine Translation. We model multilingual neural machine translation as a sequence-to-sequence task, where we condition on an input sequence in the source language with an encoder and generate the output sequence in the expected target language with a decoder (Sutskever et al., 2014). We train to maximize the probability of the translation sequence in the target language given the source sequence, the source language ℓ_s and the target language ℓ_t.
Model Architecture. Our sequence-to-sequence multilingual machine translation model is based on the Transformer encoder-decoder architecture (Vaswani et al., 2017).
To prime the model for multilingual translation, we prefix the source sequence with the source language ℓ s and the target sequence with the target language ℓ t .
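As a concrete illustration of this prefixing scheme, the snippet below prepends language tokens to a tokenized example. The token format (e.g., `__eng__`) and the helper name are illustrative assumptions rather than the exact convention used in our codebase.

```python
# Hypothetical sketch of language-token prefixing; the "__xxx__" token
# format and the function name are illustrative, not the exact scheme used here.
def add_language_tokens(src_tokens, tgt_tokens, src_lang, tgt_lang):
    """Prefix the source with the source-language token and the target
    with the target-language token, as described above."""
    src = [f"__{src_lang}__"] + src_tokens
    tgt = [f"__{tgt_lang}__"] + tgt_tokens
    return src, tgt

# Example: an English -> French pair.
src, tgt = add_language_tokens(["Hello", "world"], ["Bonjour", "le", "monde"],
                               src_lang="eng", tgt_lang="fra")
```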
Sparsely Gated Mixture of Experts. In both the Transformer encoder and decoder, we replace every other dense FFN sublayer with an MoE sublayer. The MoE sublayer consists of E feed-forward networks (FFN), denoted (FFN_1, FFN_2, ..., FFN_E). A gating network, consisting of a softmax-normalized linear layer with weights W_g, is attached to each MoE sublayer to decide how to route tokens to experts. Given an input token x_t, the routing vector and the output of the MoE sublayer are evaluated as:

    G_t = softmax(W_g · x_t),
    MoE(x_t) = Σ_{e=1}^{E} G_{t,e} · FFN_e(x_t),    (2)

where G_t ∈ R^E is the routing vector computed by the gating network, i.e., for each expert, G_{t,e} is the contribution of the e-th expert (FFN_e) to the MoE output. We follow the Top-k-Gating algorithm of Lepikhin et al. (2020) and dispatch each token to at most k=2 experts.
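For concreteness, the following is a minimal PyTorch-style sketch of the softmax gating and top-2 expert combination described above (Equation (2)). It omits expert capacity constraints, token dispatch across devices and the load balancing loss, and the module layout is a simplification rather than the exact Fairseq implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Simplified MoE sublayer: softmax gating + top-2 expert combination."""

    def __init__(self, d_model: int, d_ffn: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(),
                           nn.Linear(d_ffn, d_model))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(d_model, num_experts, bias=False)  # W_g

    def forward(self, x):                                # x: (tokens, d_model)
        gates = F.softmax(self.gate(x), dim=-1)          # G_t over the E experts
        top2_vals, top2_idx = torch.topk(gates, k=2, dim=-1)
        out = torch.zeros_like(x)
        # Combine the outputs of the (at most) two selected experts per token.
        for rank in range(2):
            for e, expert in enumerate(self.experts):
                mask = top2_idx[:, rank] == e
                if mask.any():
                    out[mask] += top2_vals[mask, rank:rank + 1] * expert(x[mask])
        return out, gates
```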
The sparse MoE model learns to route input tokens to the corresponding top-2 experts by optimizing a linearly weighted combination of label-smoothed cross-entropy, L_MT (ϵ=0.1, Szegedy et al. (2015)), and an auxiliary load balancing loss, L_MoE (Shazeer et al., 2017):

    L = L_MT + λ_MoE · L_MoE.    (3)

This additional loss term (L_MoE) pushes the tokens to be uniformly distributed across experts. We set λ_MoE to 0.01 in all our experiments. We refer the reader to Lepikhin et al. (2020) for more on the optimization of MoE models.
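The sketch below shows how the auxiliary balancing term and the total loss of Equation (3) can be combined. The specific form of the balancing loss (fraction of routed tokens times mean gate probability per expert, scaled by E) follows the common GShard/Switch-style formulation and is an assumption for illustration, not a verbatim copy of the loss used in our experiments.

```python
import torch

def load_balancing_loss(gates: torch.Tensor, top1_idx: torch.Tensor) -> torch.Tensor:
    """GShard/Switch-style balancing term pushing tokens to be spread
    uniformly across experts (assumed formulation).

    gates:    (tokens, E) softmax routing probabilities G_t.
    top1_idx: (tokens,) index of the first expert chosen per token.
    """
    num_experts = gates.size(-1)
    # Fraction of tokens dispatched to each expert (hard assignment).
    dispatch = torch.nn.functional.one_hot(top1_idx, num_experts).float().mean(dim=0)
    # Average gate probability assigned to each expert (soft assignment).
    importance = gates.mean(dim=0)
    return num_experts * torch.sum(dispatch * importance)

def total_loss(l_mt: torch.Tensor, l_moe: torch.Tensor, lambda_moe: float = 0.01):
    # Equation (3): L = L_MT + lambda_MoE * L_MoE, with lambda_MoE = 0.01.
    return l_mt + lambda_moe * l_moe
```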

Fixing over-fitting on low-resource tasks
The motivation behind MoE models is to allow different parameters to model different aspects of the input space. The added expert capacity should help higher-resource language pairs that might otherwise be constrained to share the same capacity with many other language pairs. Moreover, increasing model capacity should reduce interference, thus benefiting tasks at all resource levels. Although overall dropout is sufficient to regularize dense models, it is not enough for MoE models (see Figure 4). To address the over-fitting of MoE models on low-resource tasks, we propose a series of architectural changes that improve the performance of MoE models on low-resource language pairs in Sections 3.1 to 3.3. In Section 3.4, we devise and study a simple but effective curriculum learning strategy as another approach to reduce over-fitting on low-resource directions.

MoE Expert Output Masking (EOM).
In this proposed regularization strategy, we mask the expert output for a random fraction (p_eom) of the input tokens. For input tokens with dropped expert outputs, the first and/or second expert is effectively skipped, as illustrated in Figure 2c. Note that although this masking zeroes out some combination weights G_{t,e} in Equation (2), it does not affect the weights used in the load balancing loss.
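A minimal sketch of Expert Output Masking is given below, assuming the masks are drawn independently per token and per expert slot (so a token can lose its first, second or both expert outputs); the function signature and the absence of rescaling are our simplifications.

```python
import torch

def expert_output_masking(expert_out_1: torch.Tensor,
                          expert_out_2: torch.Tensor,
                          p_eom: float,
                          training: bool = True) -> torch.Tensor:
    """EOM sketch: randomly zero the contribution of the first and/or second
    selected expert for a fraction p_eom of tokens.

    expert_out_1, expert_out_2: (tokens, d_model) weighted expert outputs
    before they are summed into the MoE sublayer output.
    """
    if not training or p_eom == 0.0:
        return expert_out_1 + expert_out_2
    keep_1 = (torch.rand(expert_out_1.size(0), 1,
                         device=expert_out_1.device) > p_eom).float()
    keep_2 = (torch.rand(expert_out_2.size(0), 1,
                         device=expert_out_2.device) > p_eom).float()
    return keep_1 * expert_out_1 + keep_2 * expert_out_2
```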

Final Output Masking (FOM).
A simpler alternative to EOM would be to mask the combined expert output for a random fraction of tokens, i.e., at the last stage in Figure 2d. We denote by p_fom the fraction of tokens masked with this regularization method. Note that this type of masking is more generic, as it can be applied to dense models as well.
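Final Output Masking admits an even simpler sketch: the combined output of the sublayer is zeroed for a random fraction p_fom of tokens. As above, the exact call site and the absence of rescaling are assumptions on our part.

```python
import torch

def final_output_masking(moe_out: torch.Tensor, p_fom: float,
                         training: bool = True) -> torch.Tensor:
    """FOM sketch: zero the combined sublayer output for a fraction p_fom
    of the tokens in the mini-batch (moe_out: (tokens, d_model))."""
    if not training or p_fom == 0.0:
        return moe_out
    keep = (torch.rand(moe_out.size(0), 1, device=moe_out.device) > p_fom).float()
    return keep * moe_out
```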

Conditional MoE Routing (CMR).
Instead of masking outputs at random, we can learn when a token actually needs the extra capacity of an MoE layer: we augment each MoE layer with a binary gate that decides whether a token is routed through the MoE experts or through a shared dense FFN layer (FFN_shared), as illustrated in Figure 3. For an input token x_t, the output of CMR is evaluated as:

    CMR(x_t) = (1 − g(x_t)) · FFN_shared(x_t) + g(x_t) · MoE(x_t),

where g(x_t) ∈ [0, 1] is the activation of CMR's binary gate with weights W_CMR. W_CMR is trained by optimizing translation accuracy under a budget constraint b: for a mini-batch with T tokens, this amounts to adding to the loss function in Equation (3) an auxiliary loss term L_CMR that penalizes deviations of the average gate activation (1/T) Σ_t g(x_t) from the budget b. We use the budget parameter b to limit the effective capacity of MoE layers, thus providing a regularizing effect; at b=0, the model is dense, practically pushing all tokens through FFN_shared, and at b=1, the model is free to always route tokens through the high-capacity MoE layer.

To reduce over-fitting, we experiment with zeroing out a fraction of the CMR gates g(x_t) in the mini-batch; we denote this fraction by p_cmr. This means that we force p_cmr% of the tokens in the mini-batch to take only the FFN_shared route.
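The sketch below combines the shared dense FFN and the MoE branch with a learned gate and computes a budget-style auxiliary loss. The sigmoid gate and the absolute-deviation form of the budget penalty are assumptions made for illustration; the description above only specifies that W_CMR is trained under a budget constraint b.

```python
import torch
import torch.nn as nn

class ConditionalMoERouting(nn.Module):
    """CMR sketch: a gate g(x_t) arbitrates between a shared dense FFN
    and the MoE layer for every token."""

    def __init__(self, d_model: int, ffn_shared: nn.Module, moe_layer: nn.Module):
        super().__init__()
        self.ffn_shared = ffn_shared
        self.moe_layer = moe_layer
        self.gate = nn.Linear(d_model, 1)      # W_CMR (sigmoid gate assumed)

    def forward(self, x, p_cmr: float = 0.0, training: bool = True):
        g = torch.sigmoid(self.gate(x))        # (tokens, 1), in [0, 1]
        if training and p_cmr > 0.0:
            # Force a fraction p_cmr of tokens onto the FFN_shared route only.
            drop = (torch.rand(x.size(0), 1, device=x.device) < p_cmr).float()
            g = g * (1.0 - drop)
        out = (1.0 - g) * self.ffn_shared(x) + g * self.moe_layer(x)
        return out, g

def cmr_budget_loss(g: torch.Tensor, budget: float) -> torch.Tensor:
    # Assumed form: penalize deviation of the mean gate activation from b.
    return (g.mean() - budget).abs()
```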

Curriculum Learning
We next explore alternative means of regularization through Curriculum Learning (CL). We propose to start training with high-resource pairs first, then introduce low-resource pairs, which are prone to over-fitting, in later phases. To derive the phases of the curriculum, we first train a vanilla MoE model (without CL), then partition the tasks (translation directions) into n bins {b_1, ..., b_n} and introduce the pairs of bin b_i after U − k_i updates, where U is the maximum number of updates and k_i is the characteristic step of bin b_i. We compare two partitioning strategies for when and which directions to add at every phase:

1. Count-based: we empirically partition based on training example counts.

2. Step-based: we partition based on the step at which we observe a task start to over-fit, assigning each task to the bin with the closest characteristic step (Algorithm 1; a sketch is given below).

Algorithm 1 (Partitioning for step-based CL) takes as input the number of bins n, the set of tasks T, the maximum number of updates U and, for each task, the step s_best with the best validation perplexity (taking the max if there are several), and outputs the assignment of every task to the bin whose characteristic step is closest to s_best.
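A sketch of this step-based partitioning is given below, assuming evenly spaced characteristic steps ∆ = U/n apart and nearest-step assignment; tie-breaking and data structures are our own choices rather than the exact implementation.

```python
def partition_step_based(s_best: dict, n_bins: int, max_updates: int):
    """Assign each task (translation direction) to one of n_bins bins based on
    the step at which its validation perplexity was best (i.e., where it
    started to over-fit).

    s_best:      task -> training step with the best validation perplexity
                 (taking the max step if several are tied, as in Algorithm 1).
    n_bins:      number of curriculum bins.
    max_updates: total number of training updates U.
    Returns a dict: characteristic step k_i -> list of tasks introduced
    after U - k_i updates.
    """
    delta = max_updates // n_bins                                  # e.g. 20k for n=5, U=100k
    char_steps = [max_updates - i * delta for i in range(n_bins)]  # 100k, 80k, 60k, ...
    bins = {k: [] for k in char_steps}
    for task, step in s_best.items():
        closest = min(char_steps, key=lambda k: abs(k - step))
        bins[closest].append(task)
    return bins

# Example: with n=5 and U=100k, a task that starts over-fitting around step 35k
# is assigned to the bin with characteristic step 40k, i.e., it is introduced
# after 100k - 40k = 60k updates.
```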
Experimental Setup

MMT dataset
We construct a multilingual machine translation benchmark consisting of 53 languages and a total of 110 translation directions. Our MMT dataset consists of 45 directions out of English (aggregated as eng-xx), 45 directions into English (aggregated as xx-eng) and 20 non-English directions (aggregated as xx-yy). In terms of resource level, there are 40 high-resource and 70 low-resource directions, out of which 22 are very low-resource. The training data is composed of publicly available bitext in all 110 language directions (primary data in NLLB Team et al. (2022)) and large-scale mined data (Heffernan et al., 2022; NLLB Team et al., 2022) in English-centric directions. There are a total of 2×847M examples in this benchmark. For a detailed listing of the directions, see Appendix A.
Segmentation with SentencePiece. To tokenize our text sequences, we train a single SentencePiece model (SPM) (Kudo and Richardson, 2018) for all languages, with a vocabulary of 256,000 tokens (for more on this SPM model, see NLLB Team et al. (2022)). See Appendix B for additional training details.

Evaluation. We use the chrF++ metric (Popović, 2017) to compare model performance. We report averages over each set of directions: eng-xx, xx-eng and xx-yy (all). For eng-xx and xx-eng, and when relevant, we break down the pairs by resource level: high-resource (high), low-resource (low) and very low-resource (v.low).
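As an example of how such scores can be computed, the snippet below evaluates corpus-level chrF++ with the sacrebleu library (chrF with word n-grams of order 2); the library version and anything beyond this basic usage are assumptions on our part.

```python
# Assumes the sacrebleu package (>=2.0) is installed: pip install sacrebleu
from sacrebleu.metrics import CHRF

# word_order=2 turns chrF into chrF++ (character n-grams + word 1/2-grams).
chrf_pp = CHRF(word_order=2)

hypotheses = ["Bonjou tout moun .", "Mwen renmen tradiksyon ."]
references = [["Bonjou tout moun .", "Mwen renmen tradiksyon otomatik ."]]

score = chrf_pp.corpus_score(hypotheses, references)
print(score.score)  # corpus-level chrF++ score
```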

Vanilla (un-regularized) MoE
When looking at un-regularized models (without overall dropout), we see in Table 1 that the MoE model, while computationally similar to its dense counterpart, shows +1.3, +1.5 and +0.4 chrF++ improvements on eng-xx, xx-eng and xx-yy respectively. When focusing on the very low-resource pairs (v.low), however, performance actually drops on eng-xx (-0.1 chrF++), signaling an over-fitting issue. When scaling the backbone to 1.3B, we see even more over-fitting on v.low directions (-1.9 chrF++ in eng-xx and -2.8 chrF++ in xx-eng).

Adding overall dropout significantly improves the performance of MoE models in both the 615M and 1.3B variants. Importantly, when increasing the dropout to 0.1 for the small MoE (615M), the relative decline of -0.1 chrF++ turns into an improvement of +0.9 chrF++ for eng-xx v.low pairs. Once we scale the computational cost per update (1.3B), however, tuned overall dropout alone does not fix the over-fitting on very low-resource pairs.
In Figure 4, we observe that eng-kon, a very low-resource pair, continues to face significant over-fitting when trained for 100k updates. This is unsurprising, as iterating over a small training set with large capacity causes over-fitting. Training for more steps is important for high-resource pairs, but we want to avoid negatively affecting low-resource pairs in the process.

Regularizing MoEs
For the rest of this paper, we use the 1.3B variant as our backbone, to which we add MoE layers with E=64 experts.

Experimental Setup. We use the MoE model with an overall dropout rate of 0.3 (p_drop=0.3), the best-performing value after a sweep over p_drop ∈ {0.1, 0.2, 0.3}, as our baseline (initial experiments separating the dropout rates of shared and MoE blocks showed that the best values align). In each of the sweeps below, we choose the best variant based on the average chrF++ score on the validation set. For EOM and FOM, we sweep over the values of (p_drop, p_eom/fom) ∈ {0.1, 0.2, 0.3}^2. For CMR, and in order to keep the compute equivalent to the baseline MoE, we use top-1 instead of the top-2 gating used in previous experiments; we fix p_drop=0.3 and sweep over the CMR parameters (p_cmr, b). We also train a CMR top-2 model: although not compute-equivalent to the baseline MoE, it provides insight into performance under a larger compute budget. For CMR top-2, we fix p_drop=0.3 and sweep over p_cmr ∈ {0.1, 0.2, 0.3}. We set λ_CMR to 0.1 in all our CMR experiments. We additionally compare our methods to Gating Dropout (Liu et al., 2022), a method in which tokens are routed with probability p_gd to the local experts, thus skipping the All-to-All communication between GPUs; we sweep over (p_drop, p_gd) ∈ {0.1, 0.2, 0.3}^2. To generate translation hypotheses, we use beam search with a width of 4 and a length penalty of 1.0. For each model, we report chrF++ averages on the validation set (FLORES-200 dev, NLLB Team et al. (2022)) in three groups of directions: eng-xx, xx-eng and xx-yy, broken down by resource level (high, low and v.low) for eng-xx and xx-eng.

Results. In terms of alleviating the over-fitting issue, the last column of Figure 4 shows that EOM leads to better regularization and less over-fitting on low-resource tasks compared to overall dropout. In terms of translation quality, we see in Table 2 gains over the baseline MoE of +0.1 chrF++ across eng-xx pairs, +0.6 chrF++ across xx-eng pairs and +0.6 chrF++ across xx-yy pairs; into English, the largest gains are observed on low and very low-resource languages (+0.7 and +1.1 chrF++). Compared to the best EOM model, FOM under-performs slightly on eng-xx (-0.3 chrF++) but outperforms it on xx-eng (+0.2 chrF++); when averaging over all pairs, the two models achieve the same chrF++ score of 48.4.

In Table 2, for CMR top-1, we see +0.4 chrF++ across all pairs into English and +0.4 chrF++ across non-English pairs. Improvements are larger for out-of-English low and very low-resource languages, with +0.5 and +1.0 chrF++ respectively. For CMR top-2, we see +1.9 chrF++ across all pairs out of English and +0.9 chrF++ across non-English pairs. The improvements are largest for low and very low-resource languages, with +2.3 and +3.2 chrF++ out of English, and +0.9 and +1.5 into English. CMR top-2 is, however, computationally more expensive by 23% because of the additional shared FFN layer at the level of each MoE layer in the model.

Looking at Table 3, which varies the budget b and the dropout p_cmr, we observe that p_cmr is a necessary ingredient in CMR top-2: in the last two rows of Table 3, adding p_cmr improves the performance across the board, particularly on eng-xx and xx-eng very low-resource directions (+2.6 and +2.0 chrF++, respectively). With top-1, p_cmr is less critical as it barely affects the overall performance, but it does help on eng-xx and xx-eng very low-resource pairs. In the middle section of Table 3, we note that CMR top-1 is not sensitive to the exact value of b, but at a low budget b (less capacity), model performance drops significantly on eng-xx across all pairs. Pairs in xx-eng, on the other hand, favor a mid-range budget value.

We find that Gating Dropout performs better than the baseline MoE, but is outperformed by all of our proposed methods. Overall, these results demonstrate that the EOM, FOM and CMR strategies all improve on top of vanilla MoE.

CL
Experimental Setup. To derive the phases of the curriculum, we train a vanilla MoE model with p_drop=0.3 (our baseline), then, based on the observed over-fitting patterns, we partition the tasks in our MMT dataset. For both count- and step-based curricula, we introduce pairs in n=3 phases over U=100k updates. For the count-based curriculum, we partition language pairs into bins w.r.t. the number of training examples available for the task (|D_t|). For the step-based curriculum, we follow Algorithm 1 with n=5 and merge the first three buckets, resulting in 3 bins introduced at (k_1, k_2, k_3) = (100k, 40k, 20k). See Appendix C for the exact partitioning.
To combine a stronger dropout regularization with Curriculum Learning methods, we next apply our best CL strategy (step-based) to an MoE model with EOM (p_eom=0.1).
Results. We show the results of our CL experiments in Table 4. For the baseline MoE-64, using step-based CL improves the accuracy on very low-resource directions by 0.8 chrF++ in eng-xx and 0.2 chrF++ in xx-eng. Across all resource levels, we improve the accuracy in eng-xx and xx-eng by 0.4 and 0.2 chrF++, respectively. On non-English directions, step-based CL improves quality by 0.3 chrF++. Count-based CL hurts model performance on all tasks except the very low-resource eng-xx directions.
For MoE EOM, training with step-based CL actually hurts performance across all tasks except xx-eng very low-resource. We hypothesize that over-fitting on our MMT dataset is already reduced by EOM, so adding a curriculum on top of it is unnecessary and does not further improve translation quality.

Related work
Improved routing in MoE models. Recent works have proposed alternatives to the commonly used top-2 gating of Lepikhin et al. (2020): Hash layers (Roller et al., 2021) use random fixed routing, and Lewis et al. (2021) view routing as a linear assignment problem and drop the load balancing loss. Zuo et al. (2022) suggest randomly selecting experts. Fedus et al. (2022) opt for top-1 routing, and Yang et al. (2021) split experts into different groups and apply k top-1 routing in each. In this work, we only use top-2 gating, but our techniques are orthogonal to the routing method.
Regularizing MoE models. Zoph et al. (2022) tried increasing the dropout within the expert (dubbed expert dropout) but saw marginal improvements in quality. They also proposed an additional regularization loss for MoE layers to resolve training instabilities. Kim et al. (2021) randomize the priority of tokens within a mini-batch as a regularization method. Liu et al. (2022) propose gating dropout to reduce cross-machine communication in MoE layers. Xie et al. (2022) propose routing tokens to expert clusters and a cluster-level expert dropout.
Conditional compute. Another line of research in the space of MoE models focuses on designing alternative strategies to learn balanced routing, e.g., Lewis et al. (2021) formulate token-to-expert allocation as a linear assignment problem and Roller et al. (2021) assign tokens to experts using hash functions.
Language-specific parameters. A common solution to relax parameter sharing in MMT models is to use light-weight language-specific adapters (Rebuffi et al., 2017; Bapna and Firat, 2019). Their size, however, scales linearly with the number of languages. Baziotis et al. (2022) introduce hyper-adapters to generate the adapters themselves. To make these language-specific parameters optional, Zhang et al. (2021) propose CLSR to dynamically select language-specific or shared paths. These paths are simple linear projections and do not incorporate routing. Similar to our own CMR budget loss, CLSR optimizes the MMT cross-entropy while constraining the use of the language-specific capacity. Another approach similar to CMR is Residual-MoE (Rajbhandari et al., 2022), a hybrid dense and MoE model that, however, does not learn weights for each component. Rajbhandari et al. (2022) also introduce PR-MoE, a pyramidal MoE in which the number of experts increases in the later layers to make MoE models more parameter-efficient.
Curriculum Learning. Curriculum learning (Bengio et al., 2009; Lu et al., 2020) is motivated by the learning behavior of humans, in which training samples are introduced in increasing order of difficulty. The most common curriculum in MT models consists of pre-training on the more abundant monolingual data before fine-tuning on aligned MT bitexts (Liu et al., 2020; Tang et al., 2020; Xue et al., 2021). In bilingual MT, recent works explored fixed curricula that shard training samples based on some difficulty criterion such as sentence length (Kocmi and Bojar, 2017) or the confidence of a baseline model (Zhang et al., 2018). Platanios et al. (2019) proposed a heuristic that decides which samples are shown to the model based on the estimated sample difficulty and the current model competence. Kumar et al. (2019) use reinforcement learning to learn the curriculum automatically, and Zhou et al. (2020) propose uncertainty-aware curriculum learning. In data sampling, which can be viewed as a form of curriculum learning, Wang et al. (2018) propose dynamic sentence sampling to assign lower weights to well-learned sentences.

Conclusion
In massively multilingual settings with imbalanced datasets, MoE models over-fit significantly more than dense models on low-resource directions. This work introduces multiple effective strategies for regularizing MoE models and achieving better performance across all language pairs, especially low-resource ones. With EOM and FOM, we propose dropout methods that further regularize MoE models. With CMR, we introduce a novel architecture that balances capacity between MoE and shared dense paths. Finally, we design curricula that introduce low-resource languages later during training. These strategies reduce over-fitting on low-resource tasks, leading to improvements in translation quality.

Limitations
The first limitation of this work is that it lacks a study of how the proposed regularization methods behave at other scales; although we looked in Section 5.1 at two variants based on the compute budget (615M and 1.3B), we only tested our methods on the 1.3B variant with a fixed number of experts E=64. These methods can potentially show larger improvements on larger models (a larger backbone or more experts) and marginal impact on smaller models that do not suffer from severe over-fitting. The second limitation is that our methods are only validated on a single multilingual MT benchmark. Some of these techniques proved to generalize to a much larger benchmark (NLLB Team et al., 2022), and we leave testing them on other tasks, such as language modeling, to future work. Another limitation of this work, and of most other works on multilingual machine translation, lies in the evaluation metrics and how to aggregate them. We report chrF++ scores and average across three subsets of directions and three resource levels. This makes it difficult to highlight the impact on some challenging directions where our methods can lead to a ±3 chrF++ differential in quality. We did not report other metrics for the sake of brevity, and since we are not comparing to previously published results, chrF++ is a reliable metric for comparing and contrasting our methods.

A Training data
We list in Table 5 the amount of data (bitexts) used to train our models. Figure 5 shows the data distribution over language pairs sorted by the example count per pair. The highest-resource language pair has 180M examples (English-French), and the lowest-resource language pair has 40K examples (Hindi-Tamil).

B Training details
We use Fairseq (Ott et al., 2019) to train Transformer encoder-decoder models with model dimension 1024, FFN dimension 8192, 16 attention heads, 24 encoder layers and 24 decoder layers. Dense 615M models have 614,918,144 parameters, and MoE models with the dense 615M backbone have 6,961,431,552 parameters. Dense 1.3B models have 1,372,055,552 parameters, and MoE models with the dense 1.3B backbone have 26,753,140,736 parameters. The total compute across all the experiments reported, including sweeps, is 461,631 GPU hours. We train with seed=2 for all experiments. We apply Layer Normalization (Ba et al., 2016) at the beginning of each Transformer sub-layer (Pre-LN), as opposed to after the residual connection (Post-LN), because Pre-LN is more stable in practice than Post-LN (Xiong et al., 2020). All models are trained for 100k updates with an effective batch size of 1M tokens per update. We optimize with Adam (Kingma and Ba, 2015) using (β_1, β_2, ϵ) = (0.9, 0.98, 10^-6). We linearly increase the learning rate up to 0.004 over 8000 warmup updates, then follow the inverse square root learning rate schedule. For Top-2-Gating, we set the expert capacity to 2×T/E, i.e., we enforce that each expert processes at most 2×T/E tokens, where T is the number of tokens in the mini-batch and E is the number of experts. During generation, we set the capacity to T so that all tokens can be routed to whichever expert they choose.

Table 5: List of languages and data counts, split between primary (pre-existing publicly available parallel data) and mined (Heffernan et al., 2022) data, for the 110 directions of our MMT dataset. 45 languages are paired with English for a total of 90 English-centric directions; the remaining 20 directions are non-English-centric. We also list in the rightmost table the sources of the training data in our MMT dataset, following NLLB Team et al. (2022).
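To make the optimization schedule above concrete, the helper below reproduces the warmup-then-inverse-square-root learning rate and the Top-2 expert capacity rule described in this appendix; it is a stand-alone sketch, not the actual Fairseq scheduler code.

```python
def inverse_sqrt_lr(step: int, peak_lr: float = 0.004, warmup: int = 8000) -> float:
    """Linear warmup to peak_lr over `warmup` updates, then inverse
    square-root decay (as used for all our models)."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (warmup / step) ** 0.5

def expert_capacity(num_tokens: int, num_experts: int, train: bool = True) -> int:
    """Top-2-Gating capacity: at most 2*T/E tokens per expert during training;
    at generation time the capacity is relaxed to T."""
    return num_tokens if not train else (2 * num_tokens) // num_experts

# Example: at step 16k, the learning rate is 0.004 * sqrt(8000/16000) ≈ 0.0028.
```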

C Curriculum Learning
Count-based CL. We empirically partition based on training example counts. We first train our baseline model (MoE-64, p_drop=0.3) without CL, then we look at possible correlations between the number of steps before over-fitting and the count of training examples. In Figure 6 we plot these data points with the counts on the y-axis and the start-of-over-fitting step on the x-axis. The horizontal red lines indicate where the count-based curriculum thresholds were set in order to partition language pairs into bins. We list in Table 6 the tasks in each bin for the baseline MoE model.
Step-based CL. We partition based on the step where we observed a task start to over-fit. Following Algorithm 1, we partition the tasks into n bins. In our experiments, we started with n=5, resulting in a ∆ of 20k steps. However, we merged the first three bins with characteristic steps k_1 = 100k, k_2 = 80k and k_3 = 60k to remain comparable with count-based CL.

Figure 3: Illustration of Conditional MoE Routing (CMR), showing a residual block in a Transformer layer with regular MoE (left) vs. CMR (right).

Figure 5: Training data across all language pairs in our MMT dataset.

Figure 6: For the baseline MoE model, we plot the steps corresponding to the best validation perplexity (s_best on the x-axis) against the number of training examples (|D_t| on the y-axis).

Figure 4: Validation perplexities with various dropout strategies for a low-resource direction (eng-kon, top row) and a high-resource direction (eng-fra, bottom row).

Table 1: Validation set chrF++ of vanilla MoE with and without overall dropout. † indicates best of sweep.


Table 2: Comparison of various regularization strategies applied to an MoE-64 baseline. In each column, we bold the best results out of the first six rows (computationally comparable), and we bold results from the last row (CMR top-2) if they outperform the other models. † signals that this model is best of sweep.


Table 4: Results of Curriculum Learning applied to a vanilla MoE model and an MoE model with EOM.

Table 6: Count-based CL bins for the baseline MoE model (p_drop=0.3).