Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT

We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with the speed and stability of linear, mixing transformations to design the Sparse Mixer encoder model. Sparse Mixer slightly outperforms (<1%) BERT on GLUE and SuperGLUE, but more importantly trains 65% faster and runs inference 61% faster. We also present a faster variant, prosaically named Fast Sparse Mixer, that marginally underperforms BERT on SuperGLUE, but trains and runs nearly twice as fast. We justify the design of these two models by carefully ablating through various mixing mechanisms, MoE configurations and hyperparameters. Sparse Mixer overcomes many of the latency and stability concerns of MoE models and offers the prospect of serving sparse student models, without resorting to distilling them to dense variants.


Introduction
Sparsely gated Mixture-of-Experts (MoE) models have seen a surge of interest in recent years (Shazeer et al., 2017; Lepikhin et al., 2021; Fedus et al., 2022; Riquelme et al., 2021; Du et al., 2021; Artetxe et al., 2021; Clark et al., 2022; Mustafa et al., 2022). MoE models offer the promise of sublinear compute costs with respect to the number of model parameters. By training "experts" that can independently process different slices of input data, MoE layers increase model capacity with limited increases in FLOPS.
Perhaps because of the favorable capacity-to-compute trade-off, most recent MoE studies, including the aforementioned, have focused on using MoE to scale up large models. Using MoE layers to scale to larger models offers quality and total train efficiency gains over dense models, but not train or inference step latency benefits. Indeed, the task of serving these models in practice is either ignored or relegated to distilling the sparse teacher model to a dense student model (Hinton et al., 2015), often with a significant quality loss relative to the sparse teacher model. For example, Fedus et al. (2022) are only able to distill roughly 30% of the Switch Transformer's quality gains to a dense model.
Orthogonal to MoE, efficient mixing models (Tolstikhin et al., 2021; Liu et al., 2021; Lee-Thorp et al., 2021) replace attention in Transformer-like models with simpler linear transformations or MLP blocks that "mix" input representations. Linear transformations are particularly attractive because they are faster than the combined projection and dot product operations in an attention layer.
In this work, we pull on both the MoE and mixing threads to build low latency, sparse encoder models that we hope can be used in production settings. We focus on encoder models, and BERT-like models in particular, because they are widely used in practice; for example, in dual encoders for retrieval (Bromley et al., 1993; Karpukhin et al., 2020).
Relative to the vanilla Transformer model (Vaswani et al., 2017), we speed up our model in two ways. (1) We use the increased capacity from MoE sublayers to offset parameter reductions in other parts of the model. (2) We use mixing transformations to replace a large fraction of self-attention sublayers with faster, linear transformations. The resulting model, which we name Sparse Mixer, slightly (< 1%) outperforms BERT on GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019), but most importantly trains 65% faster and runs inference 61% faster. We also introduce a simple variant of Sparse Mixer, prosaically named Fast Sparse Mixer, that marginally (< 0.2%) under-performs BERT on SuperGLUE, but runs nearly twice as fast: training 89% faster and running inference 98% faster.
An interesting finding of our work is a training stability synergy between the sparse and mixing model components. As a point of comparison, we find that simply replacing dense feed-forward sublayers in BERT with MoE variants yields highly unstable models; see Section 5. However, these instabilities dissipate as we replace self-attention sublayers with mixing sublayers. We hypothesize that the (token-dependent) relevance-weighted self-attention basis is the source of the instability, and hence that replacing the majority of self-attention sublayers with mixing sublayers renders sparse mixer models highly stable.
In summary, we introduce two models:
• Sparse Mixer, which slightly outperforms BERT on GLUE and SuperGLUE while training 65% faster and running inference 61% faster.
• Fast Sparse Mixer, which marginally under-performs BERT on SuperGLUE, but trains and runs nearly twice as fast.
We justify the design of these models by ablating through model mixing, MoE, and hyperparameter configurations. With Sparse Mixers, we demonstrate that the speed and stability regressions of MoE models may be overcome using mixing mechanisms. This offers the promise of directly serving sparse models, rather than resorting to distilling them to dense variants.

Related work
Sparsely gated Mixture-of-Experts. Mixture-of-Experts (MoE) models were introduced by Jacobs et al. (1991); Jordan and Jacobs (1994) and more recently popularized by Shazeer et al. (2017). Recent work, such as Zoph et al. (2022), has played out the promise of MoE models by achieving state-of-the-art results on a number of NLP benchmarks. As with contemporary MoE studies (Du et al., 2021; Lepikhin et al., 2021), we focus on MoE layers in which the experts are MLP sublayers.

Memory mechanisms are another popular sparse technique for adding capacity to models with limited increases in compute; see, for example, Weston et al. (2015); Sukhbaatar et al. (2015); Lample et al. (2019). While intuitively appealing and empirically promising, suboptimal implementations (look-ups in particular) for accelerator hardware often yield memory models that have favorable theoretical compute properties, but are slow in practice.
Mixing. Several recent works have explored mixing mechanisms, such as matrix multiplications (Tay et al., 2020; Lee-Thorp et al., 2021), MLP blocks (Tolstikhin et al., 2021; Liu et al., 2021), and spectral transforms (Lee-Thorp et al., 2021), as an efficient replacement of attention in Transformer-like models. You et al. (2020); Raganato et al. (2020); Lee-Thorp et al. (2021) find that hybrid attention-mixing models, wherein only a limited number of attention sublayers are retained, are faster than Transformers with only very limited accuracy degradation. Building off these works, we use sparse MLP sublayers to compensate for the remaining accuracy gap.
Model parameter configuration. Scaling up models has proven to be a successful program for increasing model quality (Kaplan et al., 2020; Raffel et al., 2020; Brown et al., 2020). The relationship between the number of model parameters and model quality can be roughly modelled through a power law (Kaplan et al., 2020; Clark et al., 2022; Hoffmann et al., 2022). However, the configuration of these parameters within the model also plays an important role in model quality and efficiency. Consistent with Tay et al. (2021b), we find that making the model thinner (smaller model dimensions) but deeper (more layers) is generally an efficient way to distribute parameters throughout the model.
Distillation. Knowledge distillation (Hinton et al., 2015) is a powerful technique that has been successfully deployed to train efficient "student" BERT models from larger "teacher" models (Sanh et al., 2019; Jiao et al., 2020; Sun et al., 2019; Xu et al., 2020). Although we suspect that Sparse Mixer will offer a promising distillation architecture, we view the distillation techniques themselves as orthogonal to our architecture optimization goal. Indeed, we show, in Figure 2, that speed-ups and quality gains from Sparse Mixer carry over to both larger (teacher) and smaller (student) sizes.

Architecture
Our design space for the Sparse Mixer builds off of the stacked encoder blocks of BERT (Devlin et al., 2019), which we use as our canonical Transformer encoder (Vaswani et al., 2017). Each encoder block contains a mixing or self-attention sublayer and a (dense or MoE) MLP sublayer, connected with residual connections and layer norms. We keep the standard BERT input embedding and output projection layers (Devlin et al., 2019); see also Appendix A.1. We arrive at the Sparse Mixer encoder block stack, shown in Figure 1, by carefully ablating through mixing mechanisms, MoE configurations, and model hyperparameters in Section 4.

MoE
In an MoE layer, we initialize multiple, different instances ("experts") of the layer and perform parallel computations with each instance over separate data shards.The sparsely activated MoE layers therefore have greater capacity than dense layers.
As we increase the number of experts, we typically decrease the expert capacity, i.e. the number of tokens processed by an individual expert. More specifically, with E denoting the number of experts and n the number of tokens, we set expert capacity = cf × n/E, where cf is the scalar capacity factor. For cf ≈ 1, this allows us to increase model parameter count with minimal increases in FLOPS.

Routing. We use a router or gating function to carefully direct data shards between experts. This follows the intuition that expert A may become specialized at processing inputs in one part of the embedding space, while experts B, C, . . . are specialized to other parts of the embedding space. It is the router that ensures sparsity by assigning only a subset of tokens to each expert, thereby ensuring that only a subset of parameters are activated for each token.
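The expert-capacity rule above is simple arithmetic; as a minimal sketch (the function name and the rounding choice are ours, not from the paper):

```python
def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float) -> int:
    """Per-expert token buffer size: cf * n / E, rounded down to an integer."""
    return int(capacity_factor * num_tokens / num_experts)

# With a routing group of 4096 tokens, 16 experts, and cf = 1.0, each expert
# processes at most 4096 / 16 = 256 tokens; halving cf to 0.5 halves the
# buffer to 128 tokens and further sparsifies the model.
```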
Router design is an active research area (Lewis et al., 2021; Roller et al., 2021; Zhou et al., 2022; Clark et al., 2022). We limit ourselves to two router types: traditional "Tokens Choose" and "Experts Choose". We follow the standard practice of routing at the token level: the router assigns each token to a subset of experts. Both assignment algorithms first generate router logits by projecting token representations from the embedding dimension, d_m, to the expert dimension, E. We apply a softmax to normalize the logits to a probability distribution. Finally, tokens are assigned to experts using one of the assignment algorithms.
Tokens Choose routing. For Tokens Choose routing (Shazeer et al., 2017), each token is assigned to its top-k experts. We focus on top-1 ("Switch") routing (Fedus et al., 2022). Because expert capacities are limited, there is no guarantee that a given token can be routed to its top expert, although any token that fails to reach an expert will still propagate into the next encoder block through the residual connection. There is also no guarantee that a given expert receives at least one token. So, to ensure that compute is efficiently distributed among experts, we include a load balancing loss as in (Shazeer et al., 2017; Fedus et al., 2022).
We can increase expert capacity by increasing the capacity factor, cf. This will increase the probability that a given token is routed to its desired experts. Decreasing cf will further sparsify the model and speed up the MoE sublayer. We use Batch Prioritized Routing (Riquelme et al., 2021) to prioritize routing tokens with the highest router probability, rather than simply routing tokens in the left-to-right ordering in the batch.
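The Tokens Choose procedure above can be sketched in NumPy. This is an illustrative top-1 router with Batch Prioritized Routing, not the paper's implementation; all names and shapes are ours:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tokens_choose_top1(tokens, w_router, capacity_factor=1.0):
    """Assign each token to its top-1 expert, subject to expert capacity.

    tokens:   [n, d_m] token representations.
    w_router: [d_m, E] router projection.
    Returns (assignment, gate): an expert index per token (-1 means the
    token was dropped and passes through only via the residual connection)
    and the router probability of each token's chosen expert.
    """
    n = tokens.shape[0]
    probs = softmax(tokens @ w_router)              # [n, E] router probabilities
    num_experts = probs.shape[1]
    capacity = int(capacity_factor * n / num_experts)
    expert = probs.argmax(axis=1)                   # each token's top-1 expert
    gate = probs.max(axis=1)
    # Batch Prioritized Routing: consider high-probability tokens first,
    # rather than in left-to-right batch order.
    assignment = np.full(n, -1)
    load = np.zeros(num_experts, dtype=int)
    for t in np.argsort(-gate):
        if load[expert[t]] < capacity:
            assignment[t] = expert[t]
            load[expert[t]] += 1
    return assignment, gate
```

With four tokens, two experts, and cf = 1.0, each expert's buffer holds two tokens, so when all four tokens prefer the same expert, only the two with the highest router probability reach it.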
Experts Choose routing. For the Experts Choose assignment algorithm (Zhou et al., 2022), experts choose their top tokens, rather than tokens choosing experts. This effectively amounts to a transpose of the router probabilities prior to the top-k operation. Each expert performs its top-k operation with k = expert capacity. An individual token may be processed by multiple experts or by none at all. Because experts have their choice of tokens and always fill their buffer, increasing the capacity factor, cf, will increase both the number of tokens that an expert processes and the number of experts to which a given token is routed. Because each expert always fills its capacity, no auxiliary load balancing loss is required.
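The Experts Choose assignment can be sketched as a transpose followed by a per-expert top-k (again an illustrative sketch with our own names, not the paper's code):

```python
import numpy as np

def experts_choose(probs, capacity):
    """Each expert selects its top-`capacity` tokens.

    probs: [n, E] router probabilities (tokens x experts).
    Returns a boolean dispatch mask of shape [E, n]; a token may be
    selected by several experts, or by none at all.
    """
    n, num_experts = probs.shape
    scores = probs.T                                  # transpose: [E, n]
    top = np.argsort(-scores, axis=1)[:, :capacity]   # per-expert top-k tokens
    mask = np.zeros((num_experts, n), dtype=bool)
    for e in range(num_experts):
        mask[e, top[e]] = True
    return mask
```

Because every expert always fills its buffer, per-expert load is balanced by construction, which is why no auxiliary load balancing loss is needed.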
Token group size. Tokens are subdivided into groups, and expert assignment is performed on a per-group basis. A larger group size will result in slower but more accurate top-k and sorting computations, whereas a smaller group size will result in faster but more approximate routing choices. In practice, we find that imperfect routing choices are tolerable and default to a group size of 4096 tokens.
Parallelization strategies. In this work, we focus on faster, servable architectures using expert and data parallelism. We use data parallelism to shard data across devices, and expert parallelism to partition experts across devices; for example, placing experts 1 and 2 on device 1, experts 3 and 4 on device 2, and so on. Model parallelism is a third axis that shards model weights (matrices) across devices; for example, expert 1 is split across devices 1 and 2, expert 2 is split across devices 3 and 4, and so on. Model parallelism is typically most beneficial for scaling to larger model sizes.

Mixing
We use simple linear mixing transformations as drop-in replacements for a subset of the self-attention sublayers. Mixing transformations offer speed in exchange for reduced capacity and flexibility. Indeed, the attention mechanism contains four parameterized projections and two dot product operations ("QK" and "V"), allowing self-attention sublayers to construct representations in a highly expressive, token-dependent basis. On the other hand, the mixing transformations that we investigate are implemented through two token-independent projections, one along each of the sequence and model dimensions. Fixing the mixing basis, relative to different data inputs, turns out to stabilize the model.
Spectral transformations. We experiment with the Fourier and Hartley transforms (Lee-Thorp et al., 2021). We integrate these transforms through a Fourier sublayer. The Fourier sublayer applies a 1D Discrete Fourier Transform (DFT) along the sequence dimension, F_seq, and a 1D DFT along the hidden dimension, F_h:

y = ℜ[F_seq(F_h(x))],   (1)

where ℜ denotes the real part.
The Hartley sublayer uses Equation (1) with the DFT replaced by the Discrete Hartley Transform, H. We compute the Fourier and Hartley transforms using the Fast Fourier Transform (FFT) (Cooley and Tukey, 1965; Frigo and Johnson, 2005).
In Equation (1), we transform along both the sequence and hidden dimensions. Although the primary purpose of a mixing sublayer is to combine inputs along the sequence dimension, Lee-Thorp et al. (2021) found that also mixing along the hidden dimension improved model quality.
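Since both transforms are token-independent and parameter-free, the spectral sublayers reduce to a few FFT calls. A NumPy sketch (ours, not the reference implementation) of Equation (1) and its Hartley variant:

```python
import numpy as np

def fourier_sublayer(x):
    """y = Re[F_seq(F_h(x))]: DFT along the hidden dimension, then the
    sequence dimension, keeping only the real part."""
    return np.fft.fft(np.fft.fft(x, axis=-1), axis=-2).real

def dht(x, axis):
    """Discrete Hartley Transform of a real array along one axis,
    computed from the DFT via H(x) = Re[F(x)] - Im[F(x)]."""
    f = np.fft.fft(x, axis=axis)
    return f.real - f.imag

def hartley_sublayer(x):
    """Equation (1) with the DFT replaced by the (real-valued) DHT."""
    return dht(dht(x, axis=-1), axis=-2)
```

The Hartley variant avoids complex-valued intermediates entirely; applying the unnormalized DHT twice along an axis of length N recovers N times the input, so the full sublayer is self-inverse up to the factor n_seq · d_h.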
Structured matrix projections. We explore structured matrices under the hypothesis that adding structure to the mixing basis may improve the distribution of output representations. We consider two parameterized, structured matrices: Toeplitz and circulant. A Toeplitz matrix is a matrix in which each diagonal is constant. A circulant matrix is a particular kind of Toeplitz matrix, in which all rows are composed of the same elements, but rotated one element to the right relative to the preceding row. For both matrices, the weights are learned. The corresponding mixing sublayer mixes along the sequence and hidden dimensions. For example, for the Toeplitz case, we perform:

y = T_seq x T_h,   (2)

where T_seq and T_h denote Toeplitz matrices. Circulant matrix multiplication can be computed using the FFT (Davis, 1970). This requires three operations: one FFT to diagonalize the computation, one to apply the diagonalized matrix multiplication, and one inverse FFT to transform back to real space. A Toeplitz matrix may be embedded in a circulant matrix to take advantage of the same FFT computation. In practice, we find that, for standard sequence lengths (512), using the FFT is slower than direct matrix multiplications on both GPU and TPU.

Vanilla matrix projections. We also consider "unstructured", fully dense, parameterized matrix projections. Following Lee-Thorp et al. (2021), we call the mixing sublayer arising from this case the "Linear" sublayer. The Linear sublayer performs the same number of FLOPS as the structured matrix sublayers (provided the FFT is not used), but is more flexible due to the increased number of matrix weights.
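A sketch of Toeplitz mixing under our reading of the sublayer (the parameter layout and names are ours): each n × n Toeplitz matrix is built from 2n − 1 learned weights, one per diagonal, and the sublayer multiplies along both the sequence and hidden dimensions. Swapping the Toeplitz constructor for a full dense matrix would give the more flexible Linear sublayer at the same FLOPS.

```python
import numpy as np

def toeplitz(weights, n):
    """Build an n x n Toeplitz matrix from 2n - 1 diagonal weights:
    entry (i, j) = weights[i - j + n - 1], so each diagonal is constant."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return weights[i - j + n - 1]

def toeplitz_mixing(x, w_seq, w_h):
    """Mix x of shape [n_seq, d_h] along both dimensions: T_seq @ x @ T_h."""
    n_seq, d_h = x.shape
    return toeplitz(w_seq, n_seq) @ x @ toeplitz(w_h, d_h)
```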

Implementation
We train and optimize our model on 8 V100 GPUs. We believe that our results are reasonably robust to differing accelerators (e.g. TPUs), as almost all of our modifications boil down to accelerator-friendly matrix multiplications. In Section 5, we scale our model sizes up and down on TPUs and find that the same favorable efficiency trade-offs persist. We use JAX (Bradbury et al., 2018) in the Flax framework (Heek et al., 2020).

Coordinate Descent
We train in a typical transfer learning setting (Devlin et al., 2019): Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) pretraining, followed by fine-tuning on GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019). When comparing models, we always use the exact same setup for all models and baselines. In particular, we follow the pre-training setup in (Devlin et al., 2019) with a few updates: (1) we pretrain on the much larger C4 dataset (Raffel et al., 2020); (2) we use a 32000 SentencePiece vocabulary model (Kudo and Richardson, 2018) trained on a 100 million sentence subset of C4; and (3) we use a smaller batch size of 64 (Devlin et al. (2019) uses 256). We use a sequence length of 512 throughout pre-training. Sparse Mixers code is available at https://github.com/google-research/google-research/tree/master/sparse_mixers.

In this section, we follow a "coordinate descent" through our model configurations until we arrive at the final Sparse Mixer design. Given the large number of model hyperparameters to explore, we optimize one hyperparameter coordinate at a time. For our coordinate descent study, we only pretrain for 500k steps, which we found to be reasonably indicative of model performance. Models are fine-tuned with the same batch size (64) on the Validation split of each respective GLUE task for 5 epochs, and the best result for each task is selected from across three default base learning rates, adapted from Devlin et al. (2019): {10^-5, 5·10^-5, 10^-4}. Our final model is pretrained for longer and is evaluated on both GLUE and SuperGLUE for a broader set of training configurations in Section 5.
We prioritize efficiency: speed and accuracy. We use pre-training step speed as a proxy for model latency. We rely on downstream average GLUE scores as our primary accuracy metric, but fall back to upstream MLM and NSP accuracies when GLUE scores between model variants are similar. Additional coordinate descent experiments are summarized in Appendix A.2, and full GLUE results for all coordinate descent experiments are provided in Appendix A.3.

Mixing
Mixing mechanisms. We compare the mixing mechanisms discussed in Section 2. For each mixing model, we first replace all self-attention sublayers with the corresponding mixing sublayer. The results are shown in Table 1. The spectral models (Fourier and Hartley) perform the best on GLUE. The Linear model slightly under-performs the spectral models.

Hybrid mixing-attention. We choose two strong representative candidates from Table 1, namely the Hartley and Linear models, and replace a subset of the topmost mixing sublayers with self-attention. The results are summarized in Table 2.
Once we include self-attention, we see that the hybrid Linear model offers larger quality gains than the hybrid Hartley model. Even though the hybrid Hartley models are faster, an iso-speed comparison still suggests that the hybrid Linear models are more efficient. For example, Hartley-6 and Linear-4 have roughly the same speed, but the Linear-4 model is more accurate. Hence, we opt to use the Linear-4 model. In Appendix A.2 (Table 11), we show that we get the best accuracy when the self-attention sublayers are placed in the topmost layers.

Model shape
All model shape experiments are run in parallel and start from the Linear-4 configuration.
Model dimensions. In seeking a more efficient model, we attempt to slim our model down by decreasing both the model dimension, d_m, and the intermediate MLP activation dimension, d_ff. In each case, we find a cutoff below which model quality drops drastically, and we select these cutoffs as our optimal model shape values. It is in decreasing these two hyperparameters that we obtain the biggest speed-up in our model. However, there is a material degradation in quality that must be compensated by the increased capacity from the MoE sublayers in Section 4.3.

Number of layers. We vary the number of layers in Table 13 in Appendix A.2. We opt for 14 layers, beyond which we do not see quality gains.

MoE
Our starting configuration for our MoE ablations is the Linear-4 configuration with every other dense MLP sublayer replaced by an MoE sublayer (6 MoE sublayers) and 16 experts in each MoE sublayer. We performed the MoE experiments in parallel to the model shape optimizations, so all MoE ablations are performed on a default Base sized model with 12 layers, d_ff = 3072 and d_m = 768.
As in (Zoph et al., 2022), we find that we must adjust the fine-tuning protocol to better transfer any MoE MLM pre-training gains downstream. In particular, our MoE encoder models benefit from larger base learning rates ({10^-4, 5·10^-4, 10^-3}) and larger dropout rates (0.2) for experts; see Appendix A.4 for a comparison of learning rates and expert dropout rates. For our final model comparison with BERT in Section 5, we consider a wide range of base learning rates for all models.

Routers. Routing mechanisms are compared in Table 4. We select Experts Choose routing as it obtains slightly higher accuracy results and does not require configuring a load balancing loss.
MoE layers. In Table 5, we vary the number of MoE sublayers and the layout of those layers within the model. As we increase the number of MoE layers, MLM accuracy improves, but these pre-training gains do not always lead to better GLUE performance. This was a general trend that we observed; see also Appendix A.2 and (Zoph et al., 2022). We opt for 4 MoE layers, which performs well on GLUE and better than the 2 MoE sublayer model on the MLM task.
The results of the MoE layout experiments are clearer: we opt to use the MIDDLE layout, placing all MoE sublayers in the middle layers of the model. Nevertheless, it is interesting to note that the TOP layout gives a big boost to MLM accuracy, but does not improve downstream GLUE accuracy.
Number of experts. We can increase the number of experts to increase the capacity of the model. For a large number of experts, the computational cost of the routing assignment is more significant, while the training signal to an individual expert becomes too weak to facilitate effective training, as each expert processes too small a slice of data. Seeking a compromise between quality and speed, we ultimately opt to use 16 experts. Results are summarized in Table 14 in Appendix A.2. (The larger base learning rates used for the MoE models are not beneficial for the dense models.)
Expert size. We can control the number of parameters in each expert by varying its d_ff. In Table 15 (Appendix A.2), we find that: (1) using smaller experts yields a small accuracy drop, but limited speed benefits; and (2) increasing expert size increases MLM accuracy, but not GLUE. So, for simplicity, we opt to keep the expert d_ff the same size as the dense d_ff.

Sparse Mixer
Putting the preceding results together, we arrive at the Sparse Mixer model in Figure 1.

Evaluating Sparse Mixer
Full training comparison with BERT. When comparing Sparse Mixer and BERT, both models are pre-trained on C4 for the full 1M steps, with batch size 64, and then evaluated on both GLUE and SuperGLUE for a larger range of fine-tuning batch sizes (16, 32, and 64) and base learning rates ({10^-5, 5·10^-5, 10^-4, 5·10^-4, 10^-3}). The best results across all learning rates (for each task) and batch sizes (for all tasks) are shown in Tables 6 and 7; see Appendix A.3 for results for all batch sizes. BERT and Sparse Mixer's GLUE scores are very similar, although they diverge a little more on SuperGLUE, where Sparse Mixer performs particularly well on the CB task, but underperforms BERT on the multi-label MultiRC and ReCoRD tasks.
Scaling the Sparse Mixer. Tables 6-8 indicate that the Sparse Mixer is more efficient than BERT in the Base configuration. In Figure 2, we compare BERT and Sparse Mixer across a selection of model sizes. We use MLM accuracy as a proxy for model accuracy and pre-training step speed as a proxy for overall model speed. Pre-training step speed is a good proxy for inference speed (see Table 8), while MLM accuracy is only indicative of downstream accuracy. We construct an analogous speed-accuracy figure for NSP accuracy in Figure 4 in Appendix A.5. These caveats aside, Figure 2 suggests that Sparse Mixer's favorable speed and accuracy extend to other model sizes, as it defines the efficiency frontier across all model sizes considered.

Trading accuracy for more speed. We design an even sparser model by decreasing the expert capacity factor. This decreases the number of tokens that each expert processes and yields significant speed-ups for a limited quality degradation: for a minor (0.2%) accuracy drop on SuperGLUE relative to BERT, a Sparse Mixer with capacity factor of 0.5 trains 89% faster and runs inference 98% faster; see Table 8. We name this variant of the model Fast Sparse Mixer. We also experiment with decreasing the token routing group size, but this leads to larger quality drops.
Stability. Table 10 compares the stability of Sparse Mixer, BERT, and "sparse BERTs", i.e. MoE variants of BERT. Sparse Mixer is very stable, even relative to (dense) BERT. The sparse BERTs are highly unstable, with only one stable run, which ultimately yields a slow model that significantly underperforms BERT. We hypothesize that the Sparse Mixer's improved stability is due to replacing most of the self-attention sublayers with mixing, which constrains the model to a less variable mixing basis.

Conclusions
Mixing transformations and MoE play well together. Utilizing MoE for capacity and mixing for speed and stability, we introduced the Sparse Mixer: a model that slightly outperforms BERT on GLUE and SuperGLUE, but, more importantly, trains 65% faster and runs inference 61% faster. We also presented a faster variant, Fast Sparse Mixer, that marginally under-performs BERT on SuperGLUE, but trains and runs nearly twice as fast: 89% faster training and 98% faster inference. Sparse Mixer overcomes many of the speed and stability concerns of MoE models and offers the prospect of serving sparse student models.

Limitations
Encoder only model. We have focused our work on BERT-like models as they are extremely widely used. However, this limits our focus to encoders, which are not suitable for generative tasks. Sparse Mixer encoder-decoder and decoder-only models are, in principle, straightforward extensions: Linear decoders can be designed by "causally" masking the Linear matrix, and encoder-decoder mixing can also be designed with careful masking. However, we suspect that parts of the coordinate descent program will need to be repeated. For example, evidence suggests that cross-attention may be crucial to the performance of encoder-decoder models (You et al., 2020). Nevertheless, we hope that the current Sparse Mixer recipe acts as a starting point and a roadmap for generalizing to other architectures.
More diverse tasks and learning frameworks. We only evaluated Sparse Mixer on GLUE and SuperGLUE. It would be good to look at a broader set of tasks, including Q&A. We also stuck to the original BERT training setup (Devlin et al., 2019), but there are potential training regime improvements that could be introduced, such as training for much longer, as for RoBERTa (Liu et al., 2019b), or using the ELECTRA generator-discriminator training setup (Clark et al., 2020).
"Manual ML".In designing the Sparse Mixer architecture, we have optimized the model configuration one hyperparameter coordinate at a time.So while our manual gradient descent offers interpretability and pedagogical insight, it is potentially sub-optimal.It would be exciting to see future work expand both the coordinate space and jointly optimize multiple coordinates using Automated Machine Learning (AutoML) (Thornton et al., 2013;Liu et al., 2019a;Peng et al., 2020).
Long input sequences.Because of the presence of attention layers, Sparse Mixer will not scale as well as efficient Transformers to long sequence inputs.This could be compensated by dropping in efficient approximations of the attention mechanism (Tay et al., 2021a).

A Appendices
A.1 Base architecture
Our design space for the Sparse Mixer builds off of the stacked encoder blocks of BERT (Devlin et al., 2019), shown in Figure 3. Each encoder block contains a mixing sublayer and an MLP sublayer, connected with residual connections and layer norms. We keep the standard BERT input embedding and output projection layers (Devlin et al., 2019).

A.2 Exploring more coordinates
Attention sublayer layout. Where should the 4 attention sublayers be placed within the model? We check whether it is best to place the 4 self-attention sublayers at the TOP (final 4 layers), BOTTOM (first 4 layers), MIDDLE (middle 4 layers) or MIXED (every third layer). Table 11 shows that the TOP layout is best.

Mixing dead ends. We tested two other mixing modifications that yielded no quality (or latency) gains: (1) adding a bias term to the mixing transformations, and (2) adding dropout during fine-tuning to the mixing sublayers.
Intermediate activation dimension. Below a cutoff value, the model quality drops significantly. We select this cutoff as our optimal model MLP dimension, d_ff.

Number of layers. We do not see quality gains beyond 14 layers. Because we plan to thin out our model (decrease d_ff and d_m), we opt for slightly increasing the number of layers to 14.
Number of experts. In Table 14, we increase the number of experts to increase the capacity of the model. As discussed in Section 3.2, we simultaneously decrease each expert's capacity (the number of tokens it processes) to prevent FLOPS from growing. We suspect that, for a large number of experts, the training signal to an individual expert becomes too weak to facilitate effective training, as each expert processes too small a slice of data. This is particularly apparent on downstream tasks, where there are fewer training examples. Furthermore, for a large number of experts (≥ 64), the computational cost of the routing assignment starts to become significant. Seeking a compromise between quality and speed, we ultimately opt to use 16 experts.

Expert size. The number of parameters in each expert can be controlled by varying its intermediate activation dimension, d_ff. If we can maintain accuracy while decreasing the size of each expert, that will speed up our model. On the other hand, if we can achieve large accuracy gains by increasing the size of experts, we can potentially use that to offset shrinking the rest of the model; for example, by constructing a "thin", fast model with only a few "heavy", high capacity MoE layers. In Table 15, we see that neither of these scenarios plays out cleanly: (1) using smaller experts (d_ff = 1536) yields only a small accuracy drop, but limited speed benefits; and (2) increasing expert size increases MLM accuracy, but not GLUE.

MoE dead ends. We also tested several MoE modifications that yielded no gains: (1) Changing the expert nonlinearity from GELU to RELU had little effect on downstream performance, while changing the nonlinearity to GEGLU (GELU Gated Linear Units) (Shazeer, 2020) only slowed down the model. (2) Although the router z-loss had little effect on model performance, we included it for potential stability benefits. (3) Fedus et al. (2022) recommend using a smaller scaled weight initialization to provide stability to MoE encoder-decoder models, especially in larger configurations. However, we obtained the best results, for our encoder-only model, with BERT's default kernel initialization.

Reporting conventions. For dense models (Tables 1, 2, 11, 12, 3 and 13), we report the best scores across the {10^-5, 5·10^-5, 10^-4} base learning rates, while for MoE models (Tables 4, 5, 14 and 15), we use {10^-4, 5·10^-4, 10^-3}. We report F1/accuracy scores for QQP and MRPC, Spearman correlations for STS-B, and accuracy scores for all other tasks. The MNLI accuracy metrics are reported for the match/mismatch splits. The top two average scores for each experiment set are boldfaced/underlined. For SuperGLUE, we report macro-F1 scores for CB, micro-F1/exact match scores for MultiRC, F1/exact match scores for ReCoRD, and accuracy scores for all other tasks. For each task, we select the best result across the base learning rates {10^-5, 5·10^-5, 10^-4, 5·10^-4, 10^-3}. The highest average score for each model (across the three batch sizes) is highlighted in boldface.

A.3 Full GLUE and SuperGLUE results
Table 16 contains the full GLUE results for all of the coordinate descent experiments summarized in Section 4. Tables 17 and 18 contain the full results for GLUE and SuperGLUE, respectively, across all fine-tuning batch sizes, for the final model results tabulated in Section 5.
A.4 Optimizing fine-tuning protocols for MoE models
In Table 19, we compare GLUE results for different fine-tuning protocols. Consistent with Zoph et al. (2022), we find that fine-tuning results improve with an increased expert dropout (0.2) and larger base learning rates. Increasing the capacity factor, cf, yields quality gains but slows down the model. We did not find benefits from freezing different parts of the model during fine-tuning. For MoE models in the main text, we always use an expert dropout of 0.2 during fine-tuning. Similarly, we use the larger base learning rates during the coordinate descent program (Section 4). However, when comparing Sparse Mixer with BERT (Section 5), we use a wider range of learning rates for both models.
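Concretely, the protocol above amounts to separate hyperparameter sweeps for dense and MoE models. The snippet below collects them into a small config; the key names are ours, for illustration only:

```python
# Fine-tuning settings, following the protocol described above.
# Key names are illustrative, not taken from the authors' code.
DENSE_FINETUNE = {
    "base_learning_rates": (1e-5, 5e-5, 1e-4),
}
MOE_FINETUNE = {
    "expert_dropout": 0.2,  # larger than the usual dropout
    "base_learning_rates": (1e-4, 5e-4, 1e-3),
}

def best_score(scores_by_lr: dict) -> float:
    """Report the best fine-tuning score across the learning-rate sweep."""
    return max(scores_by_lr.values())
```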

A.5 Speed-accuracy plots
Figure 4 shows the NSP-accuracy equivalent of the MLM-accuracy based efficiency plot in Figure 2.
Table 20 gives the model configurations that were used to construct Figure 2 (main text) and Figure 4 (Appendix A.5). In configuring the Sparse Mixer models, we tried to hew roughly to the proportions of the Base model.

Figure 1 :
Figure 1: Sparse Mixer encoder blocks for the Base configuration. Layer norms, residual connections, embedding layers and output layers are not shown. The top K = 4 blocks contain self-attention and dense MLPs; the middle M = 4 blocks contain mixing and sparse MLPs; and the remaining L = 1 and P = 5 blocks contain mixing and dense MLPs.
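The block layout in the caption can be written out explicitly. The ordering below (the P dense-mixing blocks at the bottom, then the M sparse blocks, then L dense-mixing, with the K attention blocks on top) is our reading of the figure and is illustrative only:

```python
def sparse_mixer_layout(K=4, M=4, L=1, P=5):
    """(mixing sublayer, MLP sublayer) type per block, bottom to top.

    The relative ordering of the L and P dense-mixing blocks is an
    assumption for illustration.
    """
    return ([("mixing", "dense")] * P
            + [("mixing", "sparse")] * M
            + [("mixing", "dense")] * L
            + [("attention", "dense")] * K)

layout = sparse_mixer_layout()  # 14 blocks for the Base configuration
```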
… (2) we use a 32000 SentencePiece vocabulary model (Kudo and Richardson, 2018) trained on a 100 million sentence subset of C4; and (3) we use a smaller batch size of 64 (Devlin et al. (2019) use 256). We use a sequence length of 512 throughout pre-training. Experiments are run on 8 V100 GPU chips, except for the scaling experiments (Section 5), which are run on 32 TPU v3 chips.

Figure 2 :
Figure 2: Pre-training speed-accuracy trade-offs for Sparse Mixer and BERT. The corresponding model configurations are shown in Table 20 in Appendix A.5. The dashed line shows the Pareto efficiency frontier, indicating the best trade-offs. All models are trained on 32 TPU v3 chips. To better utilize the increased number of devices, we use a larger batch size of 256 but train for fewer (250k) steps.

Figure 3 :
Figure 3: Block-based encoder architecture. The model has N encoder blocks, each containing mixing and MLP sublayers. Each MLP sublayer may be sparse or dense. Each mixing sublayer may use self-attention or a mixing transformation.

Figure 4 :
Figure 4: NSP pre-training speed-accuracy trade-offs for Sparse Mixer and BERT. The dashed line shows the Pareto efficiency frontier, indicating the best trade-offs.
… cannot ask a user to wait longer for a more accurate model response. A notable exception is Jaszczur et al. (2021), who sparsify multiple components of the Transformer, primarily by replacing softmaxes with argmaxes, to achieve an over 2x speed-up in unbatched inference speed on CPUs for Base/Large model sizes. In contrast to our work, their speed-ups do not carry over to accelerator hardware or to training.

Table 1 :
Average accuracy metrics and median pre-training step speeds for mixing models. The "Fourier" model is identical to FNet (Lee-Thorp et al., 2021). Speed-ups relative to BERT (see Table 8) are shown in parentheses. The best metrics are highlighted in boldface, while the second best metrics are underlined. Stars indicate the selected configurations.

Table 2 :
Metrics for hybrid attention-mixing models. Hartley-k denotes a model with k self-attention sublayers and 12 − k Hartley sublayers.

… (Table 3) and the intermediate MLP activation dimension (Table 12 in Appendix A.2). For each coordinate, we find that there are cutoffs (d_ff = 2048 and d_m = 512) …

Table 3 :
Varying the model dimension, d_m. As in the Transformer, we set the model and embedding dimensions to be equal. For the self-attention sublayers, we fix the number of self-attention heads to d_m/64.

Table 4 :
Accuracy and speed metrics for Top-1 Tokens Choose (TC) and Experts Choose (EC) routing.
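The two routing schemes in the table differ in who does the choosing. A minimal numpy sketch (our own toy implementation, ignoring capacity overflow handling and load-balancing losses):

```python
import numpy as np

def tokens_choose_top1(router_logits):
    """Top-1 Tokens Choose (TC): each token picks its best expert.

    Experts may be over- or under-loaded, so real implementations cap
    each expert's load at a fixed capacity and drop overflow tokens.
    """
    return np.argmax(router_logits, axis=-1)  # [num_tokens]

def experts_choose(router_logits, capacity):
    """Experts Choose (EC): each expert picks its top-`capacity` tokens.

    Load is perfectly balanced, but a token may be selected by several
    experts, or by none.
    """
    order = np.argsort(-router_logits, axis=0)  # tokens sorted per expert
    return order[:capacity].T                   # [num_experts, capacity]

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 4))         # 8 tokens, 4 experts
tc = tokens_choose_top1(logits)          # one expert id per token
ec = experts_choose(logits, capacity=2)  # two token ids per expert
```

The contrast makes the trade-off in the table concrete: TC lets the data decide expert loads at the cost of imbalance, while EC fixes the loads by construction.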

Table 5 :
Varying the number and layout of MoE sublayers. Layout definitions: 6-BOTTOM (first 6 layers), 6-MIDDLE (middle 6 layers), 6-MIXED (every odd layer), 6-MIXED-odd (every even layer), and 6-TOP (final 6 layers). The number of experts and the expert capacity (the number of tokens processed by each expert) are fixed. Each MoE layer adds some compute and device communication overhead, slowing the model.
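Our reading of the layout names can be made precise with a small helper that marks which of the 12 layers carry an MoE sublayer (illustrative only; layer indices are 0-based, bottom to top):

```python
def moe_layer_mask(layout: str, num_layers: int = 12, num_moe: int = 6):
    """Boolean mask over layers: True where the MLP sublayer is sparse.

    Encodes our reading of the Table 5 layout names; illustrative only.
    """
    if layout == "BOTTOM":
        moe = set(range(num_moe))
    elif layout == "MIDDLE":
        start = (num_layers - num_moe) // 2
        moe = set(range(start, start + num_moe))
    elif layout == "TOP":
        moe = set(range(num_layers - num_moe, num_layers))
    elif layout == "MIXED":      # every odd layer (1st, 3rd, ...)
        moe = set(range(0, num_layers, 2))
    elif layout == "MIXED-odd":  # every even layer (2nd, 4th, ...)
        moe = set(range(1, num_layers, 2))
    else:
        raise ValueError(f"unknown layout: {layout}")
    return [i in moe for i in range(num_layers)]
```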

Table 6 :
GLUE results on the Validation split. We report F1/accuracy scores for QQP and MRPC, Spearman correlations for STS-B, and accuracy scores for all other tasks. The MNLI accuracy metrics are reported for the match/mismatch splits.

Table 7 :
SuperGLUE Validation results. We report macro-F1 scores for CB, micro-F1/exact match scores for MultiRC, F1/exact match scores for ReCoRD, and accuracy scores for all other tasks.

Table 10 :
Stability of BERT, sparse BERTs and Sparse Mixer (SM). BERT-k denotes a BERT model with k …

Table 11 :
Average accuracy metrics and median pre-training step speeds for varying where the self-attention sublayers are placed within the Linear-4 model. Speed-ups relative to BERT (see Table 8) are shown in parentheses. The best metrics are highlighted in boldface. The star indicates the selected configuration.

Table 12 :
Metrics for various intermediate MLP activation dimensions, d_ff.

Table 13 :
Varying the number of model layers. The results are for post-layer normalization, as in BERT. We obtained similar results for pre-layer normalization.

Table 14 :
Increasing the number of experts. These experiments were run in parallel to those of Table 5 and therefore use the default 6-MIXED setup. Speed-ups relative to BERT (see Table 8) are shown in parentheses. Based on the speed slowdown for 64 experts and the relatively weak performance of 32 experts, we did not evaluate 64 experts on GLUE.

Table 15 :
Varying the size of experts by varying each expert's intermediate activation dimension, d_ff. These experiments were run in parallel to those of Table 5 and therefore use the default 6-MIXED setup.

Table 16 :
Full GLUE results (Validation split) for all coordinate descent experiments. See the corresponding table for descriptions of each configuration. For dense models (Tables 1, 2, 11, 12, 3 and 13), we report the best scores across the {10^−5, 5·10^−5, 10^−4} base learning rates, while for MoE models (Tables 4, 5, 14 and 15), we use {10^−4, 5·10^−4, 10^−3}. The top two average scores for each experiment set are boldfaced/underlined.

Table 17 :
GLUE results (Validation split) for the final comparison of BERT, Sparse Mixer (SM), Fast Sparse Mixer (FSM) and other variants, for different batch sizes. See the corresponding table for descriptions of each configuration. For each task, we select the best result across the base learning rates {10^−5, 5·10^−5, 10^−4, 5·10^−4, 10^−3}. The highest average score for each model (across the three batch sizes) is highlighted in boldface.

Table 18 :
SuperGLUE results (Validation split) for the final comparison of BERT, Sparse Mixer (SM), Fast Sparse Mixer (FSM) and other variants, for different batch sizes. See the corresponding table for descriptions of each configuration.

Table 20 :
Model configurations. Following the Base convention for both BERT and Sparse Mixer, we set d_ff = 4·d_h and the number of self-attention heads to d_h/64. Roughly following the Sparse Mixer Base design, we set the number of MoE and attention layers to roughly 25-33% of the layers for the larger models, and to no less than 2 for the smaller models. We increase the number of experts for the larger models.
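The scaling conventions in the caption can be collected into a small helper. The rules (d_ff = 4·d_h, d_h/64 attention heads, roughly a quarter to a third of layers as MoE/attention with a floor of 2) come from the caption, but the function itself and the exact 25% choice within the stated band are ours:

```python
def scaled_config(d_h: int, num_layers: int) -> dict:
    """Derive a model configuration following the Table 20 conventions.

    Illustrative helper: we pick 25% within the stated 25-33% band.
    """
    return {
        "d_ff": 4 * d_h,         # Base convention
        "num_heads": d_h // 64,  # Base convention
        "num_moe": max(2, num_layers // 4),
        "num_attention": max(2, num_layers // 4),
    }

cfg = scaled_config(d_h=768, num_layers=14)
```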