Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model

The recently released NLLB-200 is a set of multilingual Neural Machine Translation models that cover 202 languages. The largest model is based on a Mixture of Experts architecture and achieves SoTA results across many language pairs. It contains 54.5B parameters and requires at least four 32GB GPUs just for inference. In this work, we propose a pruning method that enables the removal of up to 80% of experts without further finetuning and with a negligible loss in translation quality, which makes it feasible to run the model on a single 32GB GPU. Further analysis suggests that our pruning metrics can identify language-specific experts.


Introduction
The Transformer (Vaswani et al., 2017) has become the dominant modeling paradigm in Natural Language Processing tasks. Many subsequent advances in the field came from increasing the computational budget, training data, and model size. Neural Machine Translation was not an exception, where massively multilingual NMT (Aharoni et al., 2019;Fan et al., 2021;Tang et al., 2020;Zhang et al., 2020) demonstrated promising results, while attempting to overcome the curse of multilinguality (Conneau et al., 2019) by scaling up model size.
However, increasing the parameter size exacerbates the cost of training (Yang et al., 2019;Strubell et al., 2019;Patterson et al., 2021) and hurts the memory footprint and inference latency (Dai et al., 2019;Fan et al., 2021;Wang et al., 2022). Sparsely-gated Mixture-of-Experts (MoE) models are an efficient alternative to dense models (Lepikhin et al., 2020;Fedus et al., 2021;Riquelme et al., 2021). For example, Du et al. (2022) demonstrate that an MoE language model 7x larger than GPT-3 requires only 30% of its energy for training and half of its FLOPs at inference.
* Work done during an internship at NAVER LABS Europe
Mixture-of-Experts models are neural networks whose set of parameters is partitioned into experts. Contrary to dense models, where all network parameters are used for every input, an MoE model activates different parts of the network, the experts, depending on the input, which is typically done by a gating mechanism at the token level. MoE models are computationally efficient due to expert parallelism (Fedus et al., 2021) across a large number of GPUs, by having each GPU hold a subset of all experts and communicate with the other GPUs when it needs expert outputs for its local batch.
In NLLB-200 1 (Costa-jussà et al., 2022), a load balancing regularizer in the objective function (Shazeer et al., 2017) promotes equal distribution of the tokens across experts. This encourages the model to use all the experts and ensures that all GPUs are used equally for the sake of computational efficiency. However, considering a large number of experts, it does not guarantee that all experts will be equally activated for a particular pair of languages at inference. It raises a research question: are there language-specific experts in multilingual MoE models? If this is the case, we may be able to prune such models without loss of translation quality for the language pairs of our interest. Reducing memory usage would be useful for a model like NLLB-200, which normally requires at least four 32GB GPUs at inference.
In this work, we define metrics to assess the importance of each expert and prune the least important experts at inference. We aim to avoid finetuning because of its computational cost. In an ideal scenario, we would like to be able to identify the important experts in an MoE model so that practitioners can deploy large models, such as NLLB-200, on a single GPU. We summarize our main contributions as follows:
• We propose a pruning strategy that can remove 80% of experts in the NLLB-200 model without further finetuning and with a negligible loss in translation quality;
• We find that the decoder experts can be pruned more aggressively than the encoder experts;
• We show the emergence of language-specific experts in the NLLB-200 model;
• We demonstrate that the important language-specific experts in the decoder are shared between linguistically related languages;
• We release the ids of the pruned experts, along with other experts' gathered statistics so that anyone with a single 32GB GPU can use NLLB-200 at inference. 2

Related work
The concept of Mixture-of-Experts models in machine learning dates back to the works of Jacobs et al. (1991); Jordan and Jacobs (1994). Most recent versions were inspired by Shazeer et al. (2017), who achieved state-of-the-art language modeling and translation results with the largest model at that time. Combined with the Transformer model, MoE models grew in popularity (Lepikhin et al., 2020;Fedus et al., 2021). Beyond natural language processing, MoE models have shown great success in computer vision (Puigcerver et al., 2020), speech recognition (You et al., 2021), multi-modal learning (Mustafa et al., 2022), and diffusion models (Feng et al., 2022;Balaji et al., 2022), to name a few. For a more detailed survey of MoE models, we refer readers to Yuksel et al. (2012) and Fedus et al. (2022).
Despite the recent successes, large MoE models require a lot of memory and the contribution (or roles) of experts is under-explored. Chen et al. (2022) showed that the contributions of experts of a pre-trained MoE model in different tasks such as MNLI, CoLA, and SQuAD are quite different. Moreover, they converted a large sparse MoE model pre-trained on a general task to a single-expert dense model by fine-tuning the most 'professional' expert and dropping the other experts. This demonstrates that experts do not contribute equally to the performance and that some are more important than others. Zoph et al. (2022) also studied different expert specializations such as sentinel tokens, punctuation, conjunctions and articles, and even languages. They concluded that experts in the encoder exhibit specialization, in contrast to the decoder, but not by language. According to the authors, their mechanism of token routing and load balancing prevents language specialization. Kudugunta et al. (2021) study routing mechanisms at different levels of granularity and show that task-level experts (i.e., per language) can achieve similar performance to token-level experts. However, this work assumes that the model is trained this way, while our own work attempts to prune an existing token-level MoE model at inference without re-training it.
There have been a number of attempts to compress existing massively multilingual NMT models (Costa-jussà et al., 2022;Mohammadshahi et al., 2022b,a). However, to the best of our knowledge, none of them explicitly studied expert pruning and the emergence of language-specific experts in a large MoE model like we do. There has been a related line of work on pruning attention heads in transformer models (Michel et al., 2019;Voita et al., 2019), demonstrating linguistically-interpretable roles of attention heads (Voita et al., 2019;Jo and Myaeng, 2020) and the emergence of language-specific attention heads (Kim et al., 2021b;Held and Yang, 2022). Understanding the role of attention heads helps carefully remove the least important ones without damage to translation quality.
Closest to our work, Kim et al. (2021a) tried to prune a machine translation MoE model by keeping the most activated experts, 3 but did not manage to preserve performance without further fine-tuning.
Even though it has been shown that multilingual NMT models benefit from a larger number of experts (Costa-jussà et al., 2022), to the best of our knowledge, our work is the first to study whether any language-specific experts emerge in a massively multilingual Mixture-of-Experts model for NMT, and how redundant (or non-relevant) experts can be pruned.

Mixture-of-Experts models
Sparsely-gated Mixture-of-Experts (MoE) models activate a subset of their parameters per input token, contrary to dense models, where the entire network is used for each input token. Therefore, the total amount of parameters can be significantly increased because the computation cost per token becomes only proportional to the size of the activated sub-network, not the total model size. An increased number of parameters unlocks significant representational capacity. Allocating different devices for different experts and running them in parallel (i.e., expert parallelism, Fedus et al., 2021), in combination with data parallelism makes MoE computationally efficient and highly scalable (Fedus et al., 2021;Lepikhin et al., 2020).
In the MoE Transformer models proposed by Lepikhin et al. (2020), the FFN sublayers in the dense model are replaced with MoE layers. An MoE layer takes an input token representation $x_t$ and then routes it to the top-$k$ experts selected from a set $\{E_i\}_{i=1}^{N}$ of $N$ experts thanks to a gating network:
$$G_t = \mathrm{softmax}(W_g \cdot x_t)$$
where $W_g \in \mathbb{R}^{N \times d}$ is a learned parameter. The output of the MoE layer is a weighted sum of the outputs of the $k$ selected experts $\mathcal{E}$:
$$y_t = \frac{1}{\sum_{i \in \mathcal{E}} G_{t,i}} \sum_{i \in \mathcal{E}} G_{t,i} \, E_i(x_t)$$
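As an illustration of the routing described above, the following is a minimal pure-Python sketch of a single MoE layer applied to one token. It assumes a softmax gate over the rows of the gating matrix and, as in GShard-style top-2 gating, renormalizes the selected gate values so that they sum to one. The toy experts and the names `moe_layer`, `gate_weights` and `experts` are hypothetical and are not taken from the NLLB-200 code base.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def moe_layer(x, gate_weights, experts, k=2):
    """Route one token vector x to its top-k experts and mix their outputs.

    gate_weights: N rows of length d (one row per expert).
    experts: N callables mapping a vector of length d to a vector of length d.
    """
    # One gate logit per expert: dot product of its gating row with x.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    gates = softmax(logits)
    # Ids of the k experts with the largest gate values.
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    # Assumption: selected gate values are renormalized to sum to one.
    norm = sum(gates[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        out = [o + gates[i] / norm * yi for o, yi in zip(out, y)]
    return out, top


# Toy setup: 3 experts over 2-dimensional inputs (purely illustrative).
toy_experts = [
    lambda v: [2 * a for a in v],   # expert 0 doubles its input
    lambda v: [-a for a in v],      # expert 1 negates its input
    lambda v: list(v),              # expert 2 is the identity
]
toy_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
output, selected = moe_layer([2.0, 1.0], toy_gate, toy_experts)
print(selected, output)
```

Only the selected experts are evaluated, which is why the per-token compute depends on k rather than on the total number of experts.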

NLLB-200
No Language Left Behind (NLLB-200) is a set of massively multilingual NMT models that can translate to and from 202 languages (Costa-jussà et al., 2022), including many very low-resource languages. Models of varying sizes have been released. The largest one is a Mixture-of-Experts model and has 54.5B parameters. A dense model of 3.3B parameters is also available, which has the same architecture as the 54.5B MoE model without the experts. In this work, we will attempt to prune the experts from the 54.5B model while using the 3.3B variant as a lower-bound baseline. 4
In the 54.5B MoE model, every 4th FFN sublayer, in both the encoder and decoder, is replaced by an MoE layer, starting at the 4th layer (this makes 12 layers with experts). Each MoE layer consists of 128 experts (1536 experts in total) with the same architecture as an FFN sublayer, and has its own gating network, following the top-k gating algorithm of Lepikhin et al. (2020) and selecting the top-2 experts per token without any randomization. The model was trained with a linear combination of label-smoothed cross-entropy (Szegedy et al., 2016) and an auxiliary load balancing loss (Shazeer et al., 2017), which encourages tokens to be uniformly distributed across experts.
Memory usage. The 3.3B and 54.5B models are Transformers with an embedding dimension of 2048, an FFN dimension of 8192, 16 attention heads, 24 encoder layers, and 24 decoder layers. When storing their parameters in half precision, the 3.3B dense model and 54.5B MoE model take respectively 6.2GiB and 101.5GiB of memory. Each expert has 33.6M parameters, representing 51.6B parameters in total or 96GiB of memory. While the 3.3B model can easily run on a single GPU, the 54.5B model requires at the very least four 32GB GPUs to run. To maximize efficiency, decoding with the MoE model has to be done with expert parallelism (Fedus et al., 2021), with each GPU holding a full copy of the "dense" parameters (2.9B or 5.5GiB) and 1/N-th of the experts per layer, where N is the number of GPUs. 5 Because of the memory usage of beam search decoding and memory fragmentation, batched decoding actually requires more GPUs in practice (e.g., 6 or 8), or offloading the encoder and decoder to the CPU when they are not used. 6
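The memory figures above can be checked with a short back-of-the-envelope calculation. This sketch assumes that each expert is a two-layer FFN with bias terms (2048 → 8192 → 2048) stored at 2 bytes per parameter, and that 80% pruning keeps 307 of the 1536 experts. The constant and function names are illustrative only.

```python
D_MODEL = 2048            # embedding dimension (from the text)
D_FFN = 8192              # FFN dimension (from the text)
NUM_MOE_LAYERS = 12       # 6 encoder + 6 decoder layers with experts
EXPERTS_PER_LAYER = 128
BYTES_PER_PARAM = 2       # half precision
DENSE_GIB = 5.5           # "dense" (non-expert) parameters, from the text
GIB = 2 ** 30


def expert_params(d_model=D_MODEL, d_ffn=D_FFN):
    """Parameters of one expert, assuming two linear maps with biases."""
    return d_model * d_ffn + d_ffn + d_ffn * d_model + d_model


def experts_gib(num_experts):
    """Half-precision memory (GiB) taken by `num_experts` experts."""
    return num_experts * expert_params() * BYTES_PER_PARAM / GIB


total_experts = NUM_MOE_LAYERS * EXPERTS_PER_LAYER      # 1536
kept_experts = total_experts // 5                       # about 20% kept

print(f"params per expert: {expert_params() / 1e6:.1f}M")
print(f"all experts:       {experts_gib(total_experts):.1f} GiB")
print(f"pruned model:      {experts_gib(kept_experts) + DENSE_GIB:.1f} GiB")
```

Under these assumptions a single expert has about 33.6M parameters, all 1536 experts take about 96GiB, and keeping 20% of them leaves roughly 19GiB of expert weights plus the 5.5GiB of dense parameters, before accounting for activations and beam search.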

Our Approach
We experiment with different experts' pruning metrics and strategies that allow us to select the most relevant experts per language or language pair, and thus significantly reduce the memory usage at inference time of NLLB-200.

Expert pruning metrics
The pruning metric should quantify the contribution of a given expert to the translation. Intuitively, experts that were more involved in translation should be considered more important.
Activity. We define the Top 1 activity, $\mathrm{top}_1(e)$, of an expert $e$ as the fraction of tokens routed to this expert as the first choice (i.e., the frequency at which this expert was ranked first by the gating mechanism). We also consider the Top 2 activity variant, $\mathrm{top}_2(e)$, with the fraction of tokens routed to this expert as their first or second choice.
Using only activity as an importance metric can be sub-optimal as it does not take into account the gating value assigned to this expert by the model.
Load Balancing. We experiment with the load balancing pruning metric, similar to the load balancing loss used by Costa-jussà et al. (2022) to train the MoE model. It is defined as the product of the activity and the average gate value:
$$\mathrm{LB}(e) = \mathrm{top}_1(e) \times \mathrm{mean}(e)$$
Importance. Following the definition of attention head confidence by Voita et al. (2019), we define the confidence of an expert, $\mathrm{conf}(e)$, as its average gate value when it is ranked first. Then, we can define the "vanilla" importance of an expert as the product of its activity and confidence. 7 We define importance as an improved version of vanilla importance with an exponential to smooth the confidence values:
$$\mathrm{imp}(e) = \mathrm{top}_1(e) \times \exp(\mathrm{conf}(e))$$
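The metrics of this section can be computed from logged gate values, as in the following sketch. It makes several assumptions: the input is one softmax gate vector per token for a single MoE layer, the average gate value is taken over all tokens (not only those routed to the expert), and the exponential is applied to the confidence term. The function name `expert_metrics` and the toy data are hypothetical.

```python
import math


def expert_metrics(gate_rows):
    """Compute per-expert pruning metrics for one MoE layer.

    gate_rows: one list of softmax gate values (one per expert) per token.
    """
    n_tokens = len(gate_rows)
    n_experts = len(gate_rows[0])
    first = [0] * n_experts             # times ranked first
    second = [0] * n_experts            # times ranked second
    gate_sum = [0.0] * n_experts        # sum of gate values over all tokens
    first_gate_sum = [0.0] * n_experts  # sum of gate values when ranked first
    for gates in gate_rows:
        ranked = sorted(range(n_experts), key=lambda e: gates[e], reverse=True)
        first[ranked[0]] += 1
        second[ranked[1]] += 1
        first_gate_sum[ranked[0]] += gates[ranked[0]]
        for e in range(n_experts):
            gate_sum[e] += gates[e]
    metrics = []
    for e in range(n_experts):
        top1 = first[e] / n_tokens
        top2 = (first[e] + second[e]) / n_tokens
        mean = gate_sum[e] / n_tokens   # assumption: mean over all tokens
        conf = first_gate_sum[e] / first[e] if first[e] else 0.0
        metrics.append({
            "top1": top1,
            "top2": top2,
            "mean": mean,
            "conf": conf,
            "load_balancing": top1 * mean,
            "vanilla_importance": top1 * conf,
            "importance": top1 * math.exp(conf),
        })
    return metrics


# Toy example: 4 tokens routed over 3 experts.
gate_rows = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.5, 0.4],
    [0.2, 0.1, 0.7],
]
metrics = expert_metrics(gate_rows)
for e, m in enumerate(metrics):
    print(e, round(m["top1"], 2), round(m["conf"], 2), round(m["importance"], 3))
```

An expert that is never ranked first gets a zero score under the top-1 activity and both importance variants, so it is among the first to be pruned.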

Expert statistics granularity
To compute the pruning metrics defined above, for each expert $e \in \{1, \ldots, 1536\}$ 8 we collect the gate statistics, $\mathrm{top}_1(e)$, $\mathrm{top}_2(e)$, $\mathrm{mean}(e)$ and $\mathrm{conf}(e)$, by decoding the validation sets for all language directions. 9 However, these statistics can be aggregated at different granularity levels. Depending on how these statistics are aggregated, we hope to see language-specific experts emerge. In our experiments, we consider three different granularities:
• global: we aggregate the statistics across all language pairs to keep the overall best experts;
• language-pair: we collect gate statistics for each language pair and thus keep a (potentially) different set of experts for each language pair;
• language-specific: we aggregate encoder-side statistics per source language and decoder-side statistics per target language, which will let us keep a single set of encoder/decoder experts per source/target language.
7 Using confidence alone as a pruning metric has demonstrated very poor performance in our preliminary experiments, and therefore was not retained for the follow-up study.
8 12 layers with 128 experts each = 1536 experts
9 We always use beam search with a beam size of 4.
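The three granularities can be illustrated with a small aggregation routine. The sketch below assumes that statistics are stored as raw top-1 routing counts per language direction and that aggregation simply sums these counts. Both the storage layout and the function name `aggregate` are hypothetical.

```python
from collections import Counter, defaultdict


def aggregate(direction_counts, granularity):
    """Aggregate per-direction routing counts.

    direction_counts: {(src, tgt): {"enc": Counter, "dec": Counter}}
    Returns (encoder_stats, decoder_stats), each a dict of Counters.
    """
    enc, dec = defaultdict(Counter), defaultdict(Counter)
    for (src, tgt), counts in direction_counts.items():
        if granularity == "global":
            enc_key = dec_key = "all"
        elif granularity == "language-pair":
            enc_key = dec_key = (src, tgt)
        elif granularity == "language-specific":
            # Encoder statistics per source, decoder statistics per target.
            enc_key, dec_key = src, tgt
        else:
            raise ValueError(f"unknown granularity: {granularity}")
        enc[enc_key].update(counts["enc"])   # Counter.update adds counts
        dec[dec_key].update(counts["dec"])
    return dict(enc), dict(dec)


# Toy counts: expert id -> number of tokens routed to it as first choice.
direction_counts = {
    ("fra", "eng"): {"enc": Counter({0: 5, 1: 1}), "dec": Counter({2: 4})},
    ("fra", "deu"): {"enc": Counter({0: 3, 3: 2}), "dec": Counter({1: 6})},
    ("deu", "eng"): {"enc": Counter({1: 7}), "dec": Counter({2: 1, 0: 2})},
}
enc_stats, dec_stats = aggregate(direction_counts, "language-specific")
print(enc_stats["fra"], dec_stats["eng"])
```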

Expert pruning algorithm
Using the pruning metrics defined in Section 4.1, there are different expert pruning strategies that we can adopt. The pruning metric values are normalized to sum to one in each layer, and experts are sorted from most important to least important.
Fixed per layer. First, the simplest way is to retain a fixed amount of top experts in each layer. For example, 75% pruning retains 384 out of 1536 experts, which corresponds to 32 experts per layer.
In the balanced setting, the number of experts per layer is the same in the encoder and decoder (e.g., 32 per layer). In the unbalanced setting, we keep a different number of experts in the encoder and decoder (e.g., 40 per encoder layer and 24 per decoder layer).
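The fixed per layer strategy can be sketched as follows, assuming six encoder and six decoder MoE layers of 128 experts each and one metric value per expert. The random scores stand in for real gate statistics, and the function names are illustrative.

```python
import random


def keep_top_experts(layer_scores, keep):
    """Return the ids of the `keep` highest-scoring experts of one layer."""
    ranked = sorted(range(len(layer_scores)),
                    key=lambda e: layer_scores[e], reverse=True)
    return sorted(ranked[:keep])


def prune_fixed_per_layer(enc_scores, dec_scores, enc_keep, dec_keep):
    """Keep a fixed number of experts in every encoder and decoder layer."""
    enc = [keep_top_experts(scores, enc_keep) for scores in enc_scores]
    dec = [keep_top_experts(scores, dec_keep) for scores in dec_scores]
    return enc, dec


# Placeholder scores: 6 encoder and 6 decoder MoE layers with 128 experts.
rng = random.Random(0)
enc_scores = [[rng.random() for _ in range(128)] for _ in range(6)]
dec_scores = [[rng.random() for _ in range(128)] for _ in range(6)]

# Unbalanced 75% pruning: 40 experts per encoder layer, 24 per decoder layer.
enc_kept, dec_kept = prune_fixed_per_layer(enc_scores, dec_scores, 40, 24)
total_kept = sum(len(layer) for layer in enc_kept + dec_kept)
print(total_kept)
```

Setting both budgets to 32 gives the balanced setting with the same total of 384 retained experts.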
Global threshold. The pruning metrics we defined let us easily prune experts per layer, but not globally. To select globally best experts (with no a priori on the number of experts per layer) we search for a global threshold $\theta$ such that:
$$\sum_{k=1}^{12} n_k = \mathrm{count}, \quad \text{with } n_k = \min\Big\{ n \;\Big|\; \sum_{i=1}^{n} \phi(e_i^k) \geq \theta \Big\}$$
where $\phi$ is the pruning metric; $k$ the layer id (out of 12 layers with experts); $e_i^k$ the $i$-th expert in the sorted list of experts for that layer; and count the desired total number of experts to retain (e.g., 384 for 75% pruning). Experts $\{e_i^k\}_{i=1}^{n_k}$ are then retained and the rest are pruned. 10 In our experiments, we make sure to keep at least 4 experts per layer. 11
Our intuition behind this pruning method is to define a constant probability mass (or "importance" mass) each layer should have. Keeping only a couple of experts in a layer is fine if they are collectively used a majority of the time. Conversely, some layers may need more experts if expert usage is more uniformly distributed. Figure 1 illustrates how experts are distributed among layers with this approach at 75% pruning and with the $\mathrm{top}_1$ metric. We see that the decoder requires far fewer experts per layer than the encoder to reach the same activity threshold. We also experiment with a variant of this method, which we call Enc/Dec thresholds, with a fixed amount in the encoder and decoder (e.g., 192 and 192) and thresholds that are defined independently in the encoder and decoder.
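One possible way to find such a threshold is a binary search over θ, sketched below. This is an assumption made for illustration, since the search procedure is not spelled out here. The names `experts_needed` and `find_global_threshold` are hypothetical, and the toy example uses two small layers instead of twelve layers of 128 experts.

```python
def experts_needed(sorted_scores, theta, min_keep=4):
    """Smallest number of top experts whose cumulative score reaches theta."""
    cumulative = 0.0
    for n, score in enumerate(sorted_scores, start=1):
        cumulative += score
        if cumulative >= theta:
            return min(len(sorted_scores), max(n, min_keep))
    return len(sorted_scores)  # theta not reached (e.g., rounding): keep all


def find_global_threshold(layers, count, min_keep=4, iters=50):
    """Binary-search a threshold so that about `count` experts are kept.

    layers: one list of metric values per layer (any order, any scale).
    Returns (theta, experts kept per layer). If several layers cross the
    threshold at the same point, slightly more than `count` may be kept.
    """
    normed = []
    for scores in layers:
        z = sum(scores)
        # Normalize to sum to one and sort from most to least important.
        normed.append(sorted((s / z for s in scores), reverse=True))
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        kept = sum(experts_needed(s, mid, min_keep) for s in normed)
        if kept < count:
            lo = mid
        else:
            hi = mid
    return hi, [experts_needed(s, hi, min_keep) for s in normed]


# Toy example: a peaked layer and a uniform layer, 8 experts each.
peaked = [0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]
uniform = [0.125] * 8
theta, per_layer = find_global_threshold([peaked, uniform], count=10, min_keep=2)
print(round(theta, 3), per_layer)
```

In the toy example the peaked layer needs fewer experts than the uniform one to reach the same mass, which mirrors the intuition given above for the encoder and decoder.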

Evaluation settings
In our experiments, we use the FLORES-200 benchmark (Costa-jussà et al., 2022), which consists of translations of 3001 English sentences (from 842 distinct Wikipedia articles) to all other 201 languages. The multi-parallel nature of this dataset makes it possible to evaluate performance in all 40 602 language directions. As our final test benchmark, we take a representative subsample of 53 languages out of 202, which were also used as an ablation dataset by Costa-jussà et al. (2022). In our intermediate experiments, we work with a smaller subset of 30 out of 53 languages, with 10 languages per resource type (high, low, very low) and covering the same fourteen language families as the full subset of 53 languages. More details on the languages considered in our experiments as well as the amount of resources available per category are provided in Tables 8 and 14 in Appendix.
To evaluate translation quality we use two metrics: chrF++ 12 (Popović, 2015) and spBLEU 13 (Costa-jussà et al., 2022). BLEU is tokenization-dependent and its implementations do not include tokenizers for most of the NLLB-200 languages. spBLEU overcomes this issue by tokenizing the references and model outputs with a multilingual SentencePiece tokenizer (SPM-200, Costa-jussà et al., 2022). We report chrF++ results in the main paper and spBLEU results in the Appendix. We use FLORES-200 dev (which we call valid) for collecting MoE gate statistics and comparing different pruning algorithms and rates, and FLORES-200 devtest (which we call test) for reporting final results and comparing with the 3.3B and 54.5B baselines.

Results
In the first set of experiments, we work with a subset of 30 languages. Table 1 compares different expert pruning metrics and strategies under a 75% pruning rate. The experts are selected per language pair, and the scores are averaged per resource type (high, low, very low). The first part of the table reports two baselines: an upper bound corresponding to the full (unpruned) 54.5B MoE model, and a lower bound being the 3.3B dense model (same architecture without experts).
Pruning metric. The second part of Table 1 compares the chrF++ performance of different pruning metrics (spBLEU scores are reported in Appendix Table 9). From these results, we can see that the top-1 activity and importance metrics are the most effective at identifying important experts. Further experiments with global threshold pruning (third part of Table 1) confirm the slightly better performance of the importance metric, which we keep as the default for the next experiments.
Pruning algorithm. Table 1 also compares the pruning algorithms described in Section 4.3 (fixed per layer and global threshold). Note that with fixed per layer, we can either allocate the same expert budget in the encoder and decoder (balanced setting) or have more experts in the encoder (unbalanced setting). First, we see that the global threshold strategy gives the best results overall, with the same average chrF++ as the full unpruned model. However, global threshold is not very practical for several reasons. First, it identifies a different number of experts per layer for each language pair, which leads to variable memory usage across language pairs. It also requires recreating and reloading the model when decoding multiple directions, which is very slow. Finally, we found that it was more sensitive to over-generation and hallucinations (which we elaborate on in Section A in Appendix) at higher pruning rates. The enc/dec thresholds approach does not suffer from all the limitations of global threshold, but it is not better than fixed per layer either. Therefore, for simplicity, we pick the fixed per layer approach for our next experiments.
Balanced versus unbalanced pruning. When retaining 25% of experts (384 out of 12×128), global threshold keeps on average 335 encoder experts and 49 decoder experts. The number of selected experts in the encoder and decoder for different language resource types is shown in Table 16 in Appendix. Following this observation that encoder experts seem more important than decoder ones, we experiment with different encoder/decoder ratios. 1:1 is the balanced setting. 2:1 and 3:1 are unbalanced with respectively twice and three times as many encoder experts as decoder experts. Figure 2 shows that 3:1 performs the best across almost all pruning rates and resource types.
Pruning with global statistics. Figure 2 and Figure 4 in Appendix also show that the same experts can be pruned across all language pairs (with statistics aggregated over all directions) with no loss in performance at 50% pruning. Statistics at the language-direction granularity let us safely prune up to 80% of the experts (in the unbalanced setting), which makes the model small enough to fit on a single GPU.
Test results and language-specific pruning. Finally, we validate our results over the test set on 53 languages (2 756 directions). We use the fixed per layer approach with a 3:1 ratio, which showed the best results on the validation set at 80% (minimum rate for 1-GPU decoding). Tables 2 and 11 report these test scores with three different levels of granularity: global, language-pair-specific or language-specific (as described in Section 4.2). Table 10 in the Appendix reports valid scores with the same settings. Pruning important experts chosen per language pair gives 0.8 chrF++ more on average than the 3.3B dense model, and 0.2 chrF++ less than the full MoE model. Global pruning on the other hand performs worse than both the MoE and dense models, which confirms the importance of having a language-specific pruning strategy.
While choosing important experts for each language pair is effective, it is not very practical: with L languages, this generates L × (L − 1) different configurations. A more practical approach is to prune encoder experts per source language and decoder experts per target language (i.e., language-specific pruning). This pruning strategy performs exactly as well as pruning per language direction and is more convenient. Following this observation, we extract per-language gate statistics on all 202 languages. 14 Then, we apply 80% per-layer pruning with the importance metric (at the language granularity) and decode the test set in all 40 602 directions. Tables 3 and 12 report the chrF++ and spBLEU scores. Table 13 reports average score deltas with the unpruned model (and standard deviation per resource type). To facilitate future research and give the opportunity for anyone with a 32GB GPU to run the NLLB-200 model, we release the detailed gate statistics and the ids of the selected experts. We also share the scores for each direction and the decoding outputs of our best pruning approaches.
14 By decoding 25 random line pairs per language direction,
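To make the practicality argument concrete, the short sketch below counts how many expert selections have to be stored under each strategy. It assumes that language-specific pruning stores one encoder selection per source language and one decoder selection per target language. The function name is illustrative.

```python
def num_expert_selections(num_languages):
    """Return (per-language-pair, per-language) numbers of expert selections."""
    per_pair = num_languages * (num_languages - 1)   # one per direction
    per_language = 2 * num_languages                 # encoder + decoder sets
    return per_pair, per_language


for n in (53, 202):
    pairs, langs = num_expert_selections(n)
    print(f"{n} languages: {pairs} pair configurations vs {langs} language sets")
```

With 202 languages this is 40 602 pair-specific configurations against 404 per-language selections.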

Discussion

Inference speed and compute budget
Table 15 in Appendix gives a breakdown of the number of GPU hours used for this work.

Similarity of selected experts
Section 5.2 shows that only a fraction of all experts is necessary to translate between two given languages. We analyze the experts selected by our pruning method, to verify whether we can claim that there are indeed language-specific experts. In order to do so, we select experts with our proposed importance metric and prune them per language pair at a 75% rate with the Enc/dec thresholds method, so that both the encoder and decoder have the same number of experts. We then compute the Jaccard similarity of selected encoder/decoder experts between different language pairs sharing the same source or target language. The lower and upper triangles of Table 4 show this similarity in the encoder and decoder respectively. We see that the encoder experts are independent of the target language (even though pruning is based on statistics collected at the lang-pair granularity level). This is an expected result, and it is due to the model design, where the target language code is introduced on the decoder side only: the encoder representation is not impacted by the target language. We note that the similarity between different source languages is also quite high (30-50%). The similarity between important decoder experts for the same target language is in the 68-87% range; and in the 13-39% range for different target languages. These observations combined with the results in Section 5.2 suggest the emergence of language-specific experts in the NLLB-200 model.
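The Jaccard similarity used in this analysis is the size of the intersection of two expert sets divided by the size of their union. A minimal sketch follows. The expert sets are made-up examples, and identifying an expert by a (layer, expert id) tuple is an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical decoder experts kept for two directions with the same target.
fra_to_eng = {(3, 5), (3, 17), (7, 2), (7, 90)}
deu_to_eng = {(3, 5), (3, 17), (7, 2), (7, 41)}
similarity = jaccard(fra_to_eng, deu_to_eng)
print(similarity)
```

A similarity of 1 means that the two directions retain exactly the same experts, and 0 means that they share none.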

Similarity of languages based on the importance metric
Finally, we compare expert statistics across different languages, to better understand whether knowledge transfer happens at the expert level between similar languages. We gather importance metrics for each expert in the decoder for each language and concatenate the values of all MoE layers to have one feature vector of dimension 768. Then we do hierarchical clustering and show it as a dendrogram in Figure 3, where we highlight different language subgroupings with different colors. We can see that some clusters contain linguistically related languages, such as Yue Chinese, Korean and Japanese; Russian and Belarusian; or Portuguese, Asturian, and French. We run a similar analysis on the encoder experts and also observe meaningful language clustering, although it is less clear (Appendix Figure 7).

Discrepancy between chrF++ and spBLEU scores
We observed that our pruning method results in a slightly higher performance drop according to spBLEU than according to chrF++. We hypothesize that it is due to a rare but visible phenomenon of over-generation (and sometimes hallucinations). In the majority of cases, the translation is accurate initially but subsequently includes repetitions, paraphrasing, or slight hallucinations. The spBLEU metric penalizes this behavior more than chrF++, which could account for the variation in scores observed. More details on this are in Section A in Appendix.

Conclusion
In this paper, we study expert pruning in the NLLB-200 Mixture-of-Experts MT model. We propose expert pruning metrics based on gate statistics collected while decoding. We study several pruning strategies and demonstrate that it is possible to prune up to 80% of experts with a negligible loss in performance, which makes it possible to decode on a single 32GB GPU. We compare pruning at three levels of granularity: per language direction, per language, or global. Language-specific and language-pair pruning perform the same but the former is the most convenient. Global pruning (i.e., pruning always the same experts regardless of the source and target languages) performs surprisingly well but worse than language-specific pruning, which suggests that there are indeed some language-specific experts. This latter hypothesis is confirmed by our analysis of the selected experts.

Risks and Limitations
In our work, we rely on a single Mixture-of-Experts NMT model which is NLLB-200. There is a risk that our conclusions may only hold for this particular model and are specific to the way this model was trained. We believe that our findings still can be of interest to any person willing to use the NLLB-200 model because: (1) It was the only publicly available MoE NMT model at the time of submission; (2) It is the only model covering 202 languages and reaching SoTA results for most of those languages. Moreover, we did not try to finetune the pruned model, which could potentially improve the results (but requires a large number of GPUs) and therefore change some of our conclusions.
This work has similar risks as the original NLLB-200 models regarding the misuse of potentially wrong translations. Note that, as observed by Mohammadshahi et al. (2022b), pruning could amplify the biases already present in the full model.

A Discrepancy between chrF++ and spBLEU scores
The spBLEU scores (Figure 2 top right, or Figure 4 and Tables 9 and 11) do not show exactly the same trend as chrF++. The gap between the full models and their pruned versions is slightly higher. This is likely caused by a rare but visible phenomenon of over-generation (and sometimes hallucinations). Table 7 shows some examples of such over-generation (with 3:1 fixed per layer lang-pair pruning at 80%). Most of the time, the translation is correct, but then continues with repetitions of itself, paraphrasing, or slight hallucinations. This behavior is more penalized by spBLEU than chrF++, which may explain the difference in scores. For instance, when duplicating the FLORES valid English-French translation output of the 54.5B model (i.e., concatenating each output sentence with itself), we see a spBLEU drop of 47% and a chrF++ drop of only 13%. The global threshold method is more sensitive to this phenomenon. For instance, 80% pruning leads to a 1.75 spBLEU drop (vs 0.53 for the fixed per layer method). We report in Table 6 the difference in length ratio (reported by SacreBLEU, Post, 2018) between the pruned models and the full model. We observe that global threshold at 80% has an average length ratio delta with the full model of 0.16 (meaning it generates longer outputs), while fixed per layer has 0.04. We hypothesize that this over-generation issue may be mitigated by identifying experts that are specialized in generating the end-of-sequence symbol, but this is the subject of future work.
Table 6: Valid length ratio statistics on 30 languages. We compute the length ratio deltas w.r.t. the full unpruned model and report the mean and standard deviation in each language category. Values smaller than zero mean that the model generates longer sequences on average. We prune with the importance metric, using statistics at the language pair granularity. Length ratios are obtained with SacreBLEU and the 'flores200' tokenization.
Reference: However, very few Royal Navy ships were based near the likely invasion routes as the admirals were afraid they would be sunk by German air attack.
Translation: However, the Royal Navy had few ships in the vicinity of the invasion routes, as the admirals feared that they would be sunk by German air attack. The Royal Navy's ships were too small to be able to operate in the vicinity of the invasion routes.
Reference: However, the discovery of his tomb in 1922 made him a celebrity. While many tombs of the past were robbed, this tomb was left virtually undisturbed.
Translation: However, his tomb became famous when it was discovered in 1922. Although many tombs in the past have been excavated, this one has remained largely undisturbed. The tomb was built in the middle of the 19th century. The tomb was built in the middle of the 20th century. The tomb was built in the middle of the 20th century
Reference: The translation engines have improved dramatically, and now often give more or less correct translations (and more seldom gibberish), but some care is due, as they still may have gotten it all wrong.
Translation: Translation engines have improved dramatically, and now often produce more or less accurate translations (as well as more or less uncommon ones), but some errors can still be corrected, so some caution is warranted. For example, in the case of the English language, the use of the word "translate" in the context of a translation of a document, such as a translation of a document into a foreign language, is not recommended. The use of the word "translate" in the context of a translation of a document is discouraged.
Table 13: Test chrF++ deltas (first part) and spBLEU deltas (second part) with the unpruned MoE model on all 202 languages. The pruned version uses the importance metric with 80% pruning at the language granularity. Each column reports the average score for a given language category, as well as the standard deviation. A positive value means that this model is worse than the full 54.5B model. The last column reports the average score and standard deviation over all 202×201 directions.