Efficient Large Scale Language Modeling with Mixtures of Experts

Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full-shot fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the performance of dense models using $\sim$4 times less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies greatly across tasks and domains, suggesting that MoE and dense models generalize differently in ways that are worthy of future study. We make our code and models publicly available for research use.


Introduction
Large Language Models (LMs) achieve remarkable accuracy and generalization ability when fine-tuned for NLP tasks (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020; Raffel et al., 2020). They are also capable zero- and few-shot learners (Brown et al., 2020), with the ability to generalize to tasks not seen during training. A reliable way to improve LM accuracy in all of these settings is by scaling up: increasing the number of parameters and the amount of computation used during training and inference (Raffel et al., 2020; Brown et al., 2020; Fedus et al., 2021). In fact, some generalization properties only emerge in very large models, including much improved zero- and few-shot learning (Brown et al., 2020).

Figure 1: Estimate of how much more efficient MoEs are relative to dense models (panels: in-domain LM, out-of-domain LM, zero-shot priming). A speedup factor of y indicates that an MoE model can match the performance of the corresponding dense model, trained with x ZFLOPs, using y times less compute (i.e., x/y ZFLOPs). We estimate this factor according to validation perplexity for in-domain language modeling, perplexity on the Pile for out-of-domain language modeling, and average accuracy across 6 tasks for zero-shot priming. See §3.3.3 for more details.
Unfortunately, the corresponding growth in the computational resources required to train state-of-the-art language models is a barrier for many in the research community (Schwartz et al., 2019). There is also a concern about the environmental costs associated with training and deploying such models (Strubell et al., 2019; Bender et al., 2021; Patterson et al., 2021), motivating research into more efficient model designs (Lepikhin et al., 2021; Fedus et al., 2021; Lewis et al., 2021).
Sparse models allow for an increased number of learnable parameters without the associated computational cost. For example, sparsely gated mixture of experts (MoE) models (Lepikhin et al., 2021) have been successfully used for language modeling and machine translation (Lepikhin et al., 2021; Lewis et al., 2021; Roller et al., 2021), but have yet to be shown effective for fine-tuning (Fedus et al., 2021) as well as zero- and few-shot learning. We hypothesize that sparse models are comparably accurate to dense models but at a much lower computational footprint. To measure this claim, we train traditional dense and MoE language models ranging in size from several hundred million parameters to more than one trillion parameters, and present a careful empirical comparison of these models on downstream tasks in zero-shot, few-shot and fully supervised settings.
As shown in Figure 1, we find that MoE models can indeed achieve similar downstream task performance as dense models at a fraction of the compute. For models with relatively modest compute budgets, a MoE model can perform on par with a dense model that requires almost four times as much compute. Downstream task performance improves with scale for both MoE and dense models. While we observe that the performance gap narrows as we increase model size, even at larger compute budgets (∼5000 GPU days) our largest MoE model (1.1T parameters) outperforms a dense model with similar computational cost (6.7B parameters). We further compare and contrast the performance of dense and sparse models with similar computational signatures and observe some performance variations across tasks and domains, suggesting that this is an interesting area for future research. In summary, our contributions are:
• We present a comprehensive study of sparse models for zero- and few-shot learning at scale;
• We demonstrate that even at scale sparse MoE models can yield competitive zero- and few-shot performance at a fraction of the computation for model training and inference;
• We observe some differences in how dense and sparse models generalize at scale, suggesting complementary behaviour that could be an interesting future research direction.
2 Background and Related Work

2.1 Large Language Models / GPT-3

Progress in the field of NLP has been driven by increasingly large Language Models (LMs) pretrained on large text datasets. While numerous variations have been proposed, such LMs are predominantly based on the transformer architecture (Vaswani et al., 2017). Models are pretrained by hiding parts of the input: predicting the next word sequentially left-to-right, masking words in the text (Devlin et al., 2019; Liu et al., 2019), or perturbing and/or masking spans (Lewis et al., 2020; Raffel et al., 2020). The resulting models can be quickly adapted to perform new tasks at high accuracy by fine-tuning on supervised data (Devlin et al., 2019; Liu et al., 2019). Recently, GPT-3 (Brown et al., 2020) demonstrated that large LMs can perform zero- and few-shot learning without fine-tuning through in-context learning. Notably, many of these in-context zero- and few-shot learning behaviors emerge or amplify at scale.

Sparse models
One drawback of dense model scaling is that it grows increasingly computationally expensive. To more efficiently increase model capacity, conditional compute strategies have been developed (Bengio et al., 2013; Davis and Arel, 2013; Cho and Bengio, 2014; Bengio et al., 2015), where each input activates a subset of the model. Recent work (Lewis et al., 2021; Lepikhin et al., 2021; Fedus et al., 2021; Fan et al., 2021) has studied different conditional compute strategies that work well with Transformer models for natural language tasks. In this work, we focus on Sparsely Gated Mixture of Expert (MoE) models (Lepikhin et al., 2021). Sparse MoE models replace the dense feed-forward network block in every alternate Transformer layer with an MoE layer. The MoE layer has a routing gate that learns which tokens are to be mapped to which set of experts (we use top-2 experts). To ensure scalability and training efficiency, it is also common to add a weighted gate loss term, as in Lepikhin et al. (2021), to the cross-entropy loss to encourage tokens to be uniformly distributed across the experts.
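As a concrete illustration of this design, the sketch below implements a top-2 MoE feed-forward block with a GShard/Switch-style load-balancing loss in PyTorch. The expert count, loss weighting, and loop-based dispatch are illustrative simplifications, not the FAIRSEQ implementation used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoELayer(nn.Module):
    """Illustrative mixture-of-experts feed-forward block with top-2 routing."""

    def __init__(self, d_model: int, d_ffn: int, num_experts: int, gate_loss_weight: float = 0.01):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # routing gate
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
            for _ in range(num_experts)
        )
        self.gate_loss_weight = gate_loss_weight

    def forward(self, x):
        # x: (tokens, d_model) -- assumes the batch is already flattened over time
        logits = self.gate(x)                          # (tokens, num_experts)
        probs = logits.softmax(dim=-1)
        top2_probs, top2_idx = probs.topk(2, dim=-1)   # route each token to its 2 best experts

        out = torch.zeros_like(x)
        for k in range(2):
            for e, expert in enumerate(self.experts):
                mask = top2_idx[:, k] == e
                if mask.any():
                    out[mask] += top2_probs[mask, k : k + 1] * expert(x[mask])

        # Auxiliary load-balancing loss: encourages a uniform token distribution over experts.
        num_experts = len(self.experts)
        fraction_routed = F.one_hot(top2_idx[:, 0], num_experts).float().mean(0)
        mean_gate_prob = probs.mean(0)
        gate_loss = self.gate_loss_weight * num_experts * (fraction_routed * mean_gate_prob).sum()
        return out, gate_loss
```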

Zero-shot and Few-shot Learning
Few-shot learning with LMs (Brown et al., 2020), by first conditioning on K example input-output demonstrations and then performing a cloze-style prompt completion task (Schick and Schütze, 2021), has emerged as a promising alternative to fine-tuning. Such learning works with only a handful of examples per task and requires no model parameter updates. Along similar lines, zero-shot evaluation queries the LM using only a task-specific prompt without any demonstrations, presenting a more challenging evaluation scenario for LMs across tasks.

Large-scale training
Many of the models we consider in this work are too big to be trained using standard data parallel techniques, since parameter storage would exceed the usable memory of a single GPU. We adopt several techniques to make these models feasible to train, including pure FP16 training, activation checkpointing and fully sharded data parallel training. These techniques are described in more depth in Appendix D.
3 Experimental Setup

Models
We train autoregressive (decoder-only) transformer models that roughly match the sizes and architecture explored in Brown et al. (2020). Model sizes are summarized in Table 1. We use pre-normalization transformer blocks (Baevski and Auli, 2019; Child et al., 2019) and GELU activations (Hendrycks and Gimpel, 2016). We differ from Brown et al. (2020) in two ways: (1) we use only dense attention, while they alternate between dense and locally banded sparse attention; and (2) we train our models with sinusoidal positional embeddings, following Shortformer (Press et al., 2020). We also train MoE models that mirror our dense model configurations (see the third set of columns in Table 1), so that comparisons are approximately matched in terms of the number of floating point operations (FLOPs). Our MoE models follow the design proposed in Lepikhin et al. (2021) with alternating dense and expert layers and top-2 expert selection. We use 512 experts in each expert layer (E = 512). Each expert has a capacity of $\frac{C \cdot B}{E}$ tokens, where C is a capacity factor that we set to 2 and B is the total batch size in tokens. Capacity refers to the maximum number of tokens that are routed to each expert. Once an expert is at capacity for a given batch, additional tokens are considered to be "overflowed", with their representations passed through via the residual connection. Fedus et al. (2021) report instability when training large MoE models and suggest rescaling the initial model weights, which we do not find necessary. We instead observe that expert parameters have an E-times smaller batch size relative to dense (data parallel) parameters and accordingly rescale expert gradients by a factor of $\frac{1}{\sqrt{E}}$. This rescaling aligns with theory suggesting that an E-times increase in batch size should be accompanied by a $\sqrt{E}$ increase in learning rate (Krizhevsky, 2014).
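To make the capacity and gradient-scaling choices above concrete, here is a small sketch; the batch size is a placeholder, and the actual bookkeeping lives inside the training framework.

```python
import math

# Expert capacity: with capacity factor C = 2, batch size B tokens and E = 512 experts,
# each expert processes at most C * B / E tokens; the rest overflow via the residual path.
C, E = 2, 512
B = 2048 * 1024            # illustrative batch size in tokens
capacity = math.ceil(C * B / E)
print(f"per-expert capacity: {capacity} tokens")

# Expert gradient rescaling: expert parameters see an E-times smaller effective batch,
# so their gradients are rescaled by 1/sqrt(E) before the optimizer step.
expert_grad_scale = 1.0 / math.sqrt(E)
print(f"expert gradient scale: {expert_grad_scale:.4f}")
```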
Following Brown et al. (2020), we train our models for 300B tokens with a context size (sequence length) of 2048 tokens. The batch size and learning rate are set according to the model size following Brown et al. (2020). We linearly warm up the learning rate from 0 over the first 375M tokens and linearly decay it back to 0 over the remaining tokens. We use the Adam optimizer (Kingma and Ba, 2015) with $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-8}$, weight decay of 0.01 and dropout of 0.1. We train our models in PyTorch (Paszke et al., 2017) using FAIRSEQ (Ott et al., 2019).
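A minimal sketch of the token-based linear warmup and decay schedule described above; the peak learning rate used in the example is a hypothetical placeholder, since the actual value depends on model size.

```python
def learning_rate(tokens_seen: int, peak_lr: float,
                  warmup_tokens: int = 375_000_000,
                  total_tokens: int = 300_000_000_000) -> float:
    """Linear warmup from 0 over the first 375M tokens, then linear decay back to 0."""
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    remaining = total_tokens - warmup_tokens
    return peak_lr * max(0.0, total_tokens - tokens_seen) / remaining


# Example with a hypothetical peak LR of 3e-4, halfway through warmup and halfway through training.
print(learning_rate(187_500_000, peak_lr=3e-4))
print(learning_rate(150_000_000_000, peak_lr=3e-4))
```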
We encode our data using the same Byte-Pair Encoding (BPE) as GPT-2 (Radford et al., 2019) and RoBERTa (Liu et al., 2019) with a vocabulary of 50K subword units.

Evaluation
We evaluate models in terms of their in-domain and out-of-domain perplexity, as well as downstream task performance.

Perplexity Evaluation
We first evaluate our models on their ability to predict the next token in a sequence as measured by perplexity. Similar to training, we concatenate all documents in a given dataset using empty lines as separators, split the resulting sequence into non-overlapping blocks of 2048 tokens, and score each block independently. One limitation of this approach is that the first tokens in each block have limited context, as they do not condition on tokens from preceding blocks. Although more expensive, better results could be obtained using a sliding window approach. Nevertheless, this form of chunking the input is standard in language model evaluation.
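The following sketch illustrates this block-wise scoring procedure; the model and tokenized input are placeholders for any causal LM that returns per-position logits.

```python
import math
import torch
import torch.nn.functional as F


def blockwise_perplexity(model, token_ids: torch.LongTensor, block_size: int = 2048) -> float:
    """Score a long token stream in non-overlapping blocks of `block_size` tokens.

    `model` is assumed to be an nn.Module causal LM returning logits of shape (1, T, vocab).
    Tokens at the start of each block only condition on context within that block.
    """
    total_nll, total_tokens = 0.0, 0
    model.eval()
    with torch.no_grad():
        for start in range(0, token_ids.numel() - 1, block_size):
            block = token_ids[start : start + block_size + 1]  # +1 token for the shifted targets
            if block.numel() < 2:
                break
            inputs, targets = block[:-1].unsqueeze(0), block[1:].unsqueeze(0)
            logits = model(inputs)
            nll = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), reduction="sum")
            total_nll += nll.item()
            total_tokens += targets.numel()
    return math.exp(total_nll / total_tokens)
```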
We evaluate and report perplexity in both in-domain and out-of-domain settings. In-domain, we sample a held-out subset of the combined pretraining data (§3.2). For out-of-domain evaluation, we use data from The Pile (Gao et al., 2021), a public dataset that combines data from 22 diverse sources (e.g., ArXiv, GitHub, OpenSubtitles). We report perplexities on the official test set of each individual subset, as well as the average across all subsets.

Downstream Evaluation
We target models that can perform downstream tasks well. Recent work shows that good perplexity does not always align with good performance on downstream tasks (Tay et al., 2021). Hence, we also evaluate our models directly on downstream tasks.
Benchmarks. We evaluate our models on a subset of the tasks considered in Brown et al. (2020). As GPT-3 performance varies greatly across tasks and model sizes, we focus on tasks for which GPT-3 either demonstrated consistent gains from scaling, or consistent gains going from zero-shot to few-shot settings.
Few-shot: we use WinoGrande (Sakaguchi et al., 2020), StoryCloze (Mostafazadeh et al., 2016) and OpenBookQA (Mihaylov et al., 2018), the only non-generation tasks for which Brown et al. (2020) reported meaningful gains over zero-shot at our scale. We exclude SuperGLUE, since we were not able to reproduce the results reported in Brown et al. (2020) using the public GPT-3 API.

Zero-shot: in addition to the 3 few-shot tasks, we evaluate on ReCoRD (Zhang et al., 2018), HellaSwag (Zellers et al., 2019) and PIQA (Bisk et al., 2020). Brown et al. (2020) reported strong results and monotonic improvements from scaling on these tasks.

Baselines. We compare to the published GPT-3 numbers (Brown et al., 2020) as our primary baseline. To validate our experimental framework, we also evaluate GPT-3 through the OpenAI API using our own evaluation code and settings. Unfortunately, the correspondence between model sizes and model names in the OpenAI API is not published. We follow other published work (Gao et al., 2021) and guess the correspondence based on our results from the public API as compared to the results in Brown et al. (2020).

Methods. We compare both priming and fine-tuning-based approaches.
• Priming: We use a language model to separately score each label choice using the same templates as Brown et al. (2020), and pick the one with the highest score. For few-shot learning, we use a single newline to separate examples. Our scoring function follows the description in Brown et al. (2020); see the scoring sketch after this list:
  - For WinoGrande, we take the log-likelihood of the common suffix of the different candidates.
  - For OpenBookQA, we normalize by the unconditional probability of each candidate by taking p(completion|context) / p(completion|answer_context), where we use the string "Answer: " as answer_context.
  - For ReCoRD, we take the sum of per-token log-probabilities.
  - For all the other tasks, we take the average of per-token log-probabilities, ignoring the common prefix of the different candidates.
• Fine-tuning: Although supervised fine-tuning of pre-trained LMs on task-specific training data D requires updating and storing all model parameters per task, the process typically produces significant task-specific performance improvements. We contrast the fine-tuning performance of sparse models and their dense counterparts following Radford et al. (2018), which applies an additional task-specific linear layer $W_y$ to the representation from the final transformer block for each input candidate separately, followed by a softmax layer. We fine-tune all model parameters using the entire training set (fully supervised learning). In addition to our zero-shot tasks, we also evaluate on 3 widely-used classification tasks: BoolQ (Clark et al., 2019), MNLI (Williams et al., 2018) and SST-2 (Socher et al., 2013). More details are in Appendix B.
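The scoring sketch referenced in the priming item above might look as follows; the model and tokenizer interfaces are placeholders, and the task-specific variants (suffix-only scoring, unconditional normalization, summing instead of averaging) follow the same pattern.

```python
import torch
import torch.nn.functional as F


def candidate_logprob(model, tokenizer, context: str, completion: str, average: bool = True) -> float:
    """Score `completion` given `context` with a causal LM (placeholder interfaces).

    Assumes context + completion tokenizes to the context tokens followed by the completion tokens.
    Returns the average (or sum) of per-token log-probabilities of the completion tokens.
    """
    ctx_ids = tokenizer.encode(context)
    ids = torch.tensor(tokenizer.encode(context + completion)).unsqueeze(0)
    with torch.no_grad():
        logits = model(ids[:, :-1])                  # position t predicts token t+1
    logprobs = F.log_softmax(logits, dim=-1)
    # Only score the completion tokens, ignoring the shared context/prefix.
    target = ids[:, 1:]
    token_lp = logprobs.gather(-1, target.unsqueeze(-1)).squeeze(-1)[0, len(ctx_ids) - 1 :]
    return token_lp.mean().item() if average else token_lp.sum().item()


def pick_answer(model, tokenizer, context: str, candidates: list) -> int:
    """Return the index of the highest-scoring candidate completion."""
    scores = [candidate_logprob(model, tokenizer, context, c) for c in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])
```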

MoE speedup factor
We hypothesize that sparse models can achieve comparable performance at a smaller compute budget. As such, it is informative to measure how much more efficient MoEs are at achieving a specific performance level relative to dense models. We estimate how many FLOPs $c(t)$ the model needs to achieve performance $t$ in a particular task (as measured by perplexity for language modeling and accuracy for downstream tasks) using either an MoE or a dense model. Given that we only have discrete observations, we estimate missing values by interpolating on a logarithmic scale:

$$\log c(t) = (1 - r)\,\log c_{lo}(t) + r\,\log c_{hi}(t), \qquad r = \frac{t - t_{lo}}{t_{hi} - t_{lo}},$$

where $t_{lo}$ and $t_{hi}$ are the closest performance levels to $t$ from the available models that are lower and higher than $t$, respectively, and $c_{lo}(t)$ and $c_{hi}(t)$ are their corresponding training costs in ZFLOPs.
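As an illustration, the following sketch implements this log-scale interpolation and the speedup factor defined next; the observation points are made up.

```python
import math


def interpolate_cost(perf_lo, cost_lo, perf_hi, cost_hi, target):
    """Log-scale interpolation of the training cost needed to reach `target` performance."""
    r = (target - perf_lo) / (perf_hi - perf_lo)
    return math.exp((1 - r) * math.log(cost_lo) + r * math.log(cost_hi))


# Illustrative observation points only; the printed speedup factor is roughly 4x.
c_dense = interpolate_cost(88.0, 15.0, 91.0, 22.0, target=90.0)   # dense cost at 90% performance
c_moe = interpolate_cost(89.0, 4.0, 92.0, 6.0, target=90.0)       # MoE cost at the same level
print(f"speedup factor: {c_dense / c_moe:.2f}")
```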
The interpolation gives us matching performance levels for dense and MoE models. We use them to compute the MoE speedup factor $c_{dense}(t) / c_{moe}(t)$. For example, if a dense model requiring 20 ZFLOPs achieves a performance of 90% on a given task and an MoE model requiring 5 ZFLOPs achieves the same performance, then the formula produces a speedup factor of 4. We visualize the speedup curve using $c_{dense}(t)$ on the x-axis, which allows us to contrast speedups across different tasks on a comparable scale.

4 Results and Analysis

Language modeling perplexity
We report our perplexity results in Figure 2, and visualize the speedup curves for representative subsets of the Pile (Gao et al., 2021) in Figure 3a. Refer to Appendix A for full results on all 22 subsets of the Pile. We observe that all MoE models outperform their dense counterparts on all datasets, but their advantage varies greatly across domains and models. MoEs are most efficient when evaluated in-domain, where they are able to match the performance of dense models trained with 8-16x more compute (see Figure 1). The improvement is more modest in out-of-domain settings, bringing a speedup of 2-4x on the Pile. This is reflected in Figure 2, where the gap between the MoE and dense curves is substantially smaller in out-of-domain settings. Moreover, the advantage of MoEs over dense models decreases at scale: MoEs need ∼4 times less compute to match the performance of dense models trained with 2-6 ZFLOPs, but the speedup is ∼2x for dense models trained with ∼30 ZFLOPs.
We also observe large differences across the subsets of the Pile, which correspond to different domains. As shown in Figure 3a, MoEs obtain the largest speedups on subsets that are closest to the training corpus (e.g., CommonCrawl). The efficiency gains are more moderate but still remarkable for other domains like ArXiv and OpenSubtitles. Our largest MoE model barely outperforms its dense counterpart on DM Mathematics (7.63 vs. 7.66 perplexity), which is arguably very different from the training domain.

Zero-shot learning
Our dense models perform on par with their GPT-3 counterparts. This is consistent across different tasks, with our models doing marginally better on average. We are thus able to match Brown et al. (2020) despite some notable differences in our setup (e.g., a different training corpus), establishing a solid baseline for evaluating MoE models on downstream tasks. Similarly, when using our own code to evaluate the strongest GPT-3 API backend (davinci), we obtain numbers that replicate those reported in the original paper for their largest model, which reinforces that our evaluation settings are comparable to Brown et al. (2020). (Based on our API evaluation results, we assume that ada corresponds to the 355M model, babbage to the 1.3B model, and curie to the 6.7B model.) As with language modeling, MoEs outperform their dense counterparts for all datasets and model sizes. But, once again, we find that the advantage narrows at scale, as illustrated in Figure 4. Similar to the domain differences in language modeling, we observe differences across downstream tasks. As shown in Figure 3b, MoEs obtain significant speedups on certain tasks like HellaSwag and PIQA, but the improvement is more modest on other tasks such as ReCoRD and WinoGrande.

Few-shot learning
We report our few-shot results in Table 3 and plot the corresponding improvement over zero-shot in Figure 5.
Our dense baselines perform on par with or slightly better than GPT-3. We observe that the improvement over zero-shot is bigger for larger models, further supporting that certain capabilities in language models emerge at scale (Brown et al., 2020). Finally, we find that our larger MoE models also benefit from few-shot learning, outperforming their dense counterparts in all conditions. However, the improvements going from zero-shot to few-shot are smaller for MoE models than for their dense counterparts. For example, the average for the 6.7B dense model improves by 3.6 points to 69.3 going from zero-shot to few-shot, whereas the corresponding 1.1T MoE model improves by 2.3 points, yielding 70.1.

Figure 5: Absolute accuracy improvement going from zero-shot to few-shot, averaged across 3 tasks. Each point corresponds to a different, fully-trained model (see Table 1). GPT-3 (paper) results taken from Brown et al. (2020).

Supervised Fine-Tuning
Table 4 contrasts the full fine-tuning performance of MoE models with their dense counterparts on 8 datasets, using zero-shot accuracy as a baseline for reference. We did not fine-tune the 6.7B and 13B dense models or the 1.1T MoE model, owing to their high resource needs. As expected, supervised fine-tuning yields substantial performance benefits over zero-shot for all dense models across all datasets. In contrast, although fine-tuning of MoE models produces substantial benefits for StoryCloze, BoolQ, SST-2 and MNLI, and some improvements on OpenBookQA, it results in worse performance for HellaSwag, PIQA, and WinoGrande. For the cases where we see improvements, the accuracy of fine-tuned MoE models approaches that of their corresponding dense models. For this comparison, we fine-tune MoE models exactly as we do the dense models. While MoE models may benefit from alternative fine-tuning approaches, for example, selective fine-tuning of the expert or non-expert parameters, we leave such exploration to future work.

Understanding Potential Harms
Previous work (Sheng et al., 2019; Bordia and Bowman, 2019; Nadeem et al., 2021; de Vassimon Manela et al., 2021) has observed that language models absorb the bias and toxicity represented in their training data. We set out to explore whether sparse models behave differently than dense models in this regard. To that end, we evaluate our dense and MoE models on two popular benchmarks: StereoSet (Nadeem et al., 2021) and CrowS-Pairs (Nangia et al., 2020). StereoSet measures bias across four domains: profession, gender, religion, and race. The CrowS-Pairs dataset covers nine bias types: race, gender/gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status. Both benchmarks have limitations, such as a lack of clear articulation of how certain biases are being measured (Blodgett et al., 2021); results should be interpreted accordingly.

Stereotypical bias as a function of scale. Table 5 presents the results on the StereoSet benchmark using three metrics: (1) Language Modeling Score (LMS): the percentage of instances in which a language model prefers meaningful over meaningless associations (higher is better); (2) Stereotype Score (SS): the percentage of instances in which a model prefers a stereotypical association over an anti-stereotypical association (a score close to 50 is better, while a more biased model would score closer to 100); (3) Idealized CAT Score (ICAT): a combination of LMS and SS into a single metric, $ICAT = LMS \cdot \frac{\min(SS, 100 - SS)}{50}$ (higher is better). Table 6 presents the performance of our models on CrowS-Pairs using the Stereotype Score (SS) metric. Similar to StereoSet, we observe that both dense and MoE models get worse with scale, again with a statistically significant (p < 0.05) difference between the best and worst scores based on a bootstrap test (Noreen, 1989; Efron and Tibshirani, 1994).

Stereotypical bias in dense vs MoE models.
We observe that as the model size increases, both dense and MoE models get worse ICAT scores in general; that is, they become more biased, with a statistically significant difference between the best and worst scores. On the StereoSet benchmark, corresponding dense and sparse models (with comparable FLOPs) yield comparable performance. On CrowS-Pairs, MoE models perform slightly better (less biased) than dense models on average, but the difference is not statistically significant (see Table 5 and Table 6).

Broader Impact
It has already been noted that LMs at scale exhibit emergent zero- and few-shot learning capabilities. One of the main objectives of our work is to devise a performant, scalable solution in which these capabilities emerge at a lower compute cost. We believe this study paves the way for such a paradigm by quantitatively illustrating the power and potential of sparse models. Further research and optimization of sparse models could make them a viable alternative to dense models at a significant reduction in expense (e.g., compute, carbon footprint, overall financial cost).

Compute-efficient training
We present compute-efficient training that leverages MoE models and achieves performance comparable to dense models in both zero- and few-shot settings using 1/4th, and in some cases 1/12th, of the compute in terms of floating-point operations (FLOPs). This reduced computational cost represents an opportunity to extend language models that exhibit emergent zero- and few-shot behavior to more languages and domains. It also lowers the resources required for experimenting with training and fine-tuning such models, increasing the number of experiments that researchers on a fixed budget can run.

CO2 Emission Related to Experiments
The carbon emissions of the experiments reported in this work are dominated by the largest models, in particular the 13B-parameter dense and 1.1T-parameter MoE models. We trained our largest models on Azure NDv4 instances with A100 GPUs (TDP of 400W) in the West US 2 region, which has a carbon efficiency of 0.3 kgCO2e/kWh, and assume a Power Usage Effectiveness (PUE) of 1.125. The resulting kgCO2e emissions are 100 percent directly offset by the cloud provider. In Table 7 we report training time and energy usage for our models based on the above estimates, as well as estimates from Patterson et al. (2021) for other large-scale LMs, in particular GShard (Lepikhin et al., 2021), Switch Transformer (Fedus et al., 2021) and GPT-3 (Brown et al., 2020). Training times (GPU days) are computed assuming a throughput of 160 TFLOP/s and 115 TFLOP/s per A100 GPU for our dense and MoE models, respectively, based on observed training speeds for our largest models. We note that these estimates do not account for the costs associated with manufacturing the infrastructure used to train such models, which can be significant. We also note that these estimates do not account for pilot experiments common in the early exploratory stages of a research project. We estimate that pilot experimentation adds a factor of 2 to the total training cost, since most exploration and tuning is performed at small scale where compute costs are small, and the largest models are typically trained at most once or twice. For instance, we trained and discarded a pilot 6.7B dense and 1.1T MoE model in the early stages of this project, but trained the 13B dense model once.
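For reference, the back-of-the-envelope accounting described above can be written as follows; the GPU-day figure in the example is a placeholder, not one of the values in Table 7.

```python
def training_emissions(gpu_days: float, tdp_watts: float = 400.0,
                       pue: float = 1.125, kg_co2e_per_kwh: float = 0.3):
    """Return (energy in MWh, emissions in kgCO2e) for a training run.

    Energy = GPU-days * 24h * TDP * PUE; emissions = energy * regional carbon intensity.
    """
    kwh = gpu_days * 24 * tdp_watts / 1000 * pue
    return kwh / 1000, kwh * kg_co2e_per_kwh


# Hypothetical example: a run using 5,000 A100 GPU-days.
mwh, kg = training_emissions(5000)
print(f"{mwh:.1f} MWh, {kg:.0f} kgCO2e (before any offsets)")
```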

Ethical considerations
As noted in §4.3, when comparing sparse and dense models we find that they propagate comparable levels of bias and stereotyping, especially at scale. Moreover, we generally observe worse performance (more bias and stereotyping) at larger scales. This observation points to the need for further research on mitigating such behavior. Intuitively, however, we believe that sparse models may be inherently more controllable than dense models, e.g., by designing specific experts. We leave this line of investigation for future research.
Devising more compute-efficient yet performant models (MoEs) makes such models more accessible to researchers in the field, despite the higher storage capacity they require compared to their dense counterparts, since the majority of model expense typically lies in compute. We are therefore hopeful that such sparse models, especially in zero/few-shot scenarios, will reduce both compute demand and the need for supervised data, combining the best of both worlds. Moreover, by running empirical investigations of various parameter settings and releasing a set of optimized parameters for such scenarios, we believe we have alleviated some of the exploration burden, and its associated cost, for the community and the environment. In the spirit of transparency, and to allow for maximal replicability and accountability, we include data and model cards together with our code.

Conclusion
We present results for scaling sparse language models up to 1.1T parameters. We observe that, up to this scale, sparse models offer a better performance vs. computation trade-off than their dense counterparts for language modeling and zero- and few-shot learning. While the gap begins to close at scale, our biggest sparse model outperforms its dense counterpart, where the latter requires twice as much computation. These results confirm that sparse MoE models can provide an alternative to widely used dense architectures that saves computation and reduces model energy consumption.

B Fine-tuning Settings
We run fine-tuning for a fixed number of epochs (100 for BoolQ, OpenBookQA, StoryCloze, and PIQA, 25 for HellaSwag, Winogrande, and SST-2, 6 for MNLI) and perform model selection based on validation set accuracy. For datasets either without a validation set or where we evaluate on the validation set, we randomly split the training set and use 80% for fine-tuning and 20% for per-epoch validation.

C Knowledge Distillation
In Section 3.3.3 we show that sparse (MoE) models are significantly more efficient to train than dense models. However, inference with large sparse models can be challenging, since their large number of parameters (most of which are inactive for any given token) introduces significant storage costs compared to dense models.
In this section we explore whether it is possible to blend the benefits of dense and sparse models via knowledge distillation (Hinton et al., 2015). Building on recent work in this area (Shleifer and Rush, 2020;Sanh et al., 2020;Fedus et al., 2021), we train small dense "student" models to mimic the behavior of larger "teacher" models, which may be either large dense or large sparse (MoE) models.
Methods. We train dense student models with 12 layers and hidden dimension 768, matching the 125M dense model architecture in Table 1. We use a weighted training objective that combines the standard cross-entropy loss (25% weight) with a soft distillation loss (75% weight) that encourages the student model to reproduce the logits of the teacher. Additionally, we use a reduced sequence length of 1024 tokens to speed up experimentation.
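A sketch of this weighted objective is shown below; the temperature and the use of a KL-based soft loss are assumptions about one reasonable instantiation rather than the exact implementation.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, targets,
                      ce_weight: float = 0.25, kd_weight: float = 0.75, temperature: float = 1.0):
    """Weighted sum of cross entropy on the data and a soft loss on the teacher's logits.

    student_logits, teacher_logits: (batch, seq, vocab); targets: (batch, seq) token ids.
    """
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(student_logits.view(-1, vocab), targets.view(-1))
    # Soft distillation term: KL divergence between teacher and student token distributions.
    soft_targets = F.softmax(teacher_logits.view(-1, vocab) / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits.view(-1, vocab) / temperature, dim=-1)
    kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    return ce_weight * ce + kd_weight * kd
```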

Results
We report results in Table 9. We find that student models trained with knowledge distillation improve over a well-tuned dense baseline for both dense and sparse teacher models. Furthermore, some of the efficiency advantages of sparse training can be transmitted to a dense student through distillation. For example, student models distilled from a 52B-parameter MoE teacher outperform student models distilled from a 1.3B-parameter dense teacher, even though the dense teacher is twice as costly to train.

Table 9: Distillation Results. Cost refers to the cost to train the teacher model (see Table 1). PPL is the in-domain validation perplexity.

D Techniques for Large-scale Training
We adopt several techniques to train models in this work, including a more memory-efficient recipe for FP16 training, activation checkpointing and Fully Sharded Data Parallel.
FP16 Training: Typical mixed-precision training recipes require storing model weights in both 16-bit (FP16) and 32-bit (FP32), as well as storing optimizer state in 32-bit to preserve accurate weight updates (Micikevicius et al., 2018;Ott et al., 2018). Thus training a model in mixed precision with Adam requires 16 bytes of memory per parameter to maintain 16-bit and 32-bit weights (6 bytes), Adam optimizer state (8 bytes) and gradients (2 bytes), not including any memory required for activations.
In practice, we find we can reduce memory requirements by 50% by maintaining only 16-bit model weights, optimizer state and gradients, with no loss in model accuracy. First, we simply discard the 32-bit model weights, saving 4 bytes per parameter, since pilot experiments showed this to have no impact on model quality when training with large batch sizes. Second, the Adam optimizer state can be stored in 16-bit by dynamically rescaling the values to avoid underflow. Specifically, we compute the standard Adam weight update in 32-bit and then apply a rescaling transformation at the end of each optimization step to keep the optimizer state in 16-bit (Dhariwal et al., 2020), using $\epsilon = 10^{-8}$ and FLOAT16_MAX = 65504.0, the largest finite value expressible in FP16. We apply this transformation separately for the first and second moment estimates in Adam.
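The exact transformation is not reproduced here. As a rough, hypothetical illustration of the idea, the sketch below stores each optimizer-state tensor in FP16 together with a single FP32 scale chosen so that the largest entry maps near FLOAT16_MAX, which keeps small entries from underflowing; this is an assumed scheme, not the paper's implementation.

```python
import torch

FLOAT16_MAX = 65504.0
EPS = 1e-8


def to_scaled_fp16(state_fp32: torch.Tensor):
    """Hypothetical per-tensor dynamic rescaling: keep FP16 values plus one FP32 scale."""
    scale = state_fp32.abs().max().clamp(min=EPS) / FLOAT16_MAX
    return (state_fp32 / scale).to(torch.float16), scale


def from_scaled_fp16(state_fp16: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an FP32 view of the state for the next optimizer step."""
    return state_fp16.to(torch.float32) * scale


# Round-trip example on a tensor with a wide dynamic range.
v = torch.tensor([1e-7, 3e-4, 0.2])
v16, s = to_scaled_fp16(v)
print(from_scaled_fp16(v16, s))
```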
Activation Checkpointing: Activation size grows proportionally to the model and batch size, making it infeasible to store activations for transformer models with more than a couple of billion parameters. We adopt a popular technique called activation checkpointing, which saves memory during training by discarding a subset of activations in the forward pass and recomputing them in the backward pass (Chen et al., 2016). This technique results in a 33% increase in computation, but can often reduce activation memory requirements by a factor of 10 (Rajbhandari et al., 2020). In our experiments we only store activations between transformer layers and recompute intermediate activations within each layer during the backward pass.
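A minimal sketch of this per-layer checkpointing using PyTorch's torch.utils.checkpoint; the layers here are placeholder modules.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedStack(nn.Module):
    """Run each layer under activation checkpointing: activations inside a layer are
    discarded during the forward pass and recomputed during the backward pass."""

    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = checkpoint(layer, x)
        return x


# Illustrative usage with placeholder feed-forward "layers".
layers = nn.ModuleList(nn.Sequential(nn.Linear(16, 16), nn.GELU()) for _ in range(4))
model = CheckpointedStack(layers)
out = model(torch.randn(2, 16, requires_grad=True))
out.sum().backward()
```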
Fully Sharded Data Parallel: In data parallel training, gradients are averaged across multiple workers (GPUs) that process distinct partitions of the data. Standard implementations maintain redundant copies of the model weights and optimizer state on each GPU; however, this wastes GPU memory and makes it challenging to scale the model size beyond what can fit on a single GPU. Recent work has explored sharding model parameters, optimizer state and gradients across workers (Xu et al., 2020; Rajbhandari et al., 2020), enabling training of models with more than one trillion parameters using only data parallelism, without the added complexity introduced by model parallel training approaches like pipeline or tensor parallelism (Huang et al., 2019; Shoeybi et al., 2019; Narayanan et al., 2021). We implement these ideas in Fully Sharded Data Parallel (FSDP), which shards model parameters in-place and gathers the parameters on all workers just-in-time for the forward and backward pass. Training with FSDP is typically faster than standard data parallel implementations for three reasons: (1) sharding reduces the cost of the optimizer step and weight update by distributing it across workers, rather than redundantly updating model replicas on each worker; (2) while FSDP introduces 50% more communication, this extra communication is overlapped with the computation in the forward and backward pass; and (3) FSDP yields significant memory savings, which can be used to increase the batch size and achieve higher GPU utilization.
One important decision when using FSDP is choosing which submodules in the model to "wrap" with FSDP. If the wrapping is too fine-grained, then the parameter shards will be very small which reduces communication efficiency. If the wrapping is too coarse, then this increases the peak resident memory and may pose challenges when scaling to larger model sizes. In this work we wrap every transformer layer with FSDP, which ensures a reasonably large message size for communication while still limiting the peak resident memory to the size of a single layer.
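As a rough sketch of this per-layer wrapping using PyTorch's built-in FSDP API (not the exact implementation used in this work; the transformer layer class and distributed initialization are placeholders):

```python
import functools

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


def shard_model(model: nn.Module, transformer_layer_cls: type) -> FSDP:
    """Wrap every transformer layer in its own FSDP unit, then wrap the root module.

    Per-layer wrapping keeps each communication message reasonably large while
    limiting peak resident memory to roughly one unsharded layer at a time.
    """
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={transformer_layer_cls},
    )
    return FSDP(model, auto_wrap_policy=wrap_policy)


# Usage sketch (requires torch.distributed.init_process_group(...) and one GPU per rank):
# sharded = shard_model(my_transformer, MyTransformerLayer)
```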

E Counting FLOPs
We count the number of floating-point operations (FLOPs) analytically, following Narayanan et al. (2021). We assume that all models are trained with activation checkpointing and thus have an additional forward pass before the backward pass. Thus, the total training FLOPs for our dense models is given by:

$$F_{dense} = 96\,T\,l\,h^2 \left(1 + \frac{s}{6h} + \frac{V}{16\,l\,h}\right),$$

where T is the total number of training tokens, l is the number of layers, h is the hidden dimension, s is the sequence length and V is the vocabulary size. In this work, $T = 300 \times 10^9$, $s = 2048$ and $V = 51200$ for all models. For mixture of expert models, we account for an additional feed-forward network at every other layer for the top-2 routing in GShard (Lepikhin et al., 2021), and ignore the FLOPs of the routing projection, which is negligible. The additional feed-forward network contributes $16h^2$ FLOPs per token at half of the layers, multiplied by 4 to account for the forward, backward and recomputation passes. The resulting training FLOPs for our MoE models is given by:

$$F_{moe} = F_{dense} + 32\,T\,l\,h^2.$$

Notably, this quantity is independent of the number of experts.
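These formulas translate directly into code; the model configuration in the example below is illustrative and not taken from Table 1.

```python
def dense_train_flops(T: float, l: int, h: int, s: int = 2048, V: int = 51200) -> float:
    """Training FLOPs for a dense decoder-only model with activation recomputation
    (Narayanan et al., 2021): 96 * T * l * h^2 * (1 + s/(6h) + V/(16*l*h))."""
    return 96 * T * l * h**2 * (1 + s / (6 * h) + V / (16 * l * h))


def moe_train_flops(T: float, l: int, h: int, s: int = 2048, V: int = 51200) -> float:
    """Adds one extra feed-forward network at every other layer for top-2 routing:
    16*h^2 FLOPs per token at l/2 layers, times 4 for forward, backward and recompute."""
    return dense_train_flops(T, l, h, s, V) + 32 * T * l * h**2


# Illustrative configuration: 24 layers, hidden size 2048, trained on 300B tokens.
T = 300e9
print(f"dense: {dense_train_flops(T, 24, 2048) / 1e21:.2f} ZFLOPs")
print(f"moe:   {moe_train_flops(T, 24, 2048) / 1e21:.2f} ZFLOPs")
```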