Approximating Two-Layer Feedforward Networks for Efficient Transformers

One promising approach explored by several recent works on extremely large LMs is the sparse mixture of experts (MoE; Shazeer et al. (2017); Lewis et al. (2021); Lepikhin et al. (2021); Fedus et al. (2022); Clark et al. (2022); Chi et al. (2022)). Unlike their dense counterparts, MoEs only compute a subset of their activations (i.e., only a few experts) at each step, offering reduced computation and memory costs. However, MoEs are not yet generally adopted as a generic, go-to approach, perhaps because of certain common beliefs about MoEs: (1) they are hard to train (involving complex engineering tricks to prevent collapsing), (2) they are not competitive against their dense counterparts with the same number of parameters (in fact, prior work focuses on FLOP-equal comparisons, "unfairly" comparing MoEs against dense baselines with many fewer trainable parameters), and finally, (3) they are reserved for extremely large models (they are rarely, if ever, considered for further improving the efficiency of "small" models). Indeed, even prior works on MoE-based Transformer LMs only deploy MoEs in a few feedforward blocks, while, ideally, all such blocks should benefit from replacement by MoEs. Here we challenge these common beliefs and propose novel perspectives on MoEs.
We present MoEs within a unified framework of methods that approximate two-layer feedforward networks, which includes product-key memories (PKMs; Lample et al. (2019)) and top-k sparsification. This principled view not only allows us to conceptually group and compare MoEs with PKMs, it also provides insights on design choices for improving these methods. Our resulting MoE Transformer variant outperforms our improved PKMs, and performs as well as or even outperforms the dense baseline, while using a fraction of its compute for both training and inference. Importantly, unlike prior work, we compare our MoEs with dense baselines that have the same number of total trainable parameters, which is crucial for proper evaluation in language modeling. We conduct experiments on the standard WikiText-103 (at two different model scales) and Enwik8 datasets. We demonstrate that MoEs are not limited to extremely large LMs, but are useful as a generic approach for resource-efficient NNs at any scale, in line with the recent trend of improving "smaller" models (Touvron et al., 2023; Taori et al., 2023; Chiang et al., 2023). Finally, we release a CUDA kernel for our MoE layers which achieves faster wall-clock time and a large memory reduction compared to the dense model.

Background
Transformers (Vaswani et al., 2017) have two main building blocks: the self-attention layer (Parikh et al., 2016; Cheng et al., 2016; Bahdanau et al., 2015), and the two-layer feedforward, i.e., multi-layer perceptron (MLP) block. Acceleration and memory reduction of self-attention are rather well explored (see, e.g., linear attention (Katharopoulos et al., 2020; Choromanski et al., 2021; Schmidhuber, 1991; Schlag et al., 2021)), and very efficient implementations (Dao et al., 2022) are also available. In contrast, resource-efficient MLP blocks are still underexplored. This is our main focus, and it is of particular relevance today, as the proportion of the total parameter count, compute, and memory requirements due to MLP blocks in Transformers is increasing in ever-growing LLMs.
Let d_model, d_ff denote positive integers. Each Transformer MLP block consists of one up-projection layer with a weight matrix W_1 ∈ R^{d_ff × d_model}, where typically d_ff = 4 d_model, and one down-projection layer with parameters W_2 ∈ R^{d_model × d_ff} that projects the hidden vector back to the original size. A non-linearity (typically ReLU) is applied between these two layers. That is, an input x ∈ R^{d_model} is transformed into an output y ∈ R^{d_model} as

u = ReLU(W_1 x)    (1)
y = W_2 u    (2)

where u ∈ R^{d_ff}, and we omit biases (as well as batch and time dimensions) for simplicity.
Alternatively, this layer can be viewed as a key-value memory accessed by attention (Vaswani et al. (2017)^3, Geva et al. (2021)), where keys and values are the rows and columns of the weight matrices W_1 and W_2, respectively:

W_1 = [k_1, ..., k_{d_ff}], with keys k_i ∈ R^{d_model}    (3)
W_2 = [v_1, ..., v_{d_ff}], with values v_i ∈ R^{d_model}    (4)

Then, the output is computed as "attention":

y = Σ_{i=1}^{d_ff} v_i ReLU(k_i^⊺ x) = Σ_{i=1}^{d_ff} α_i v_i = Σ_{i=1}^{d_ff} y_i    (5)

where α_i = ReLU(k_i^⊺ x) ∈ R_{≥0} and y_i = α_i v_i ∈ R^{d_model}. Unlike in standard self-attention, the MLP block uses a ReLU activation function (instead of the softmax) without scaling.
It has been observed that, in practice, only a few of the factors k_i^⊺ x are positive (Li et al., 2023; Shen et al., 2023), making the first layer's output, i.e., u, sparse. Concretely, Shen et al. (2023) report that in a Transformer with d_model = 256 and d_ff = 1024, 10% of the channels account for 90% of the total activation mass. We confirm this trend in our own preliminary study. Fig. 1 shows the average number of non-zero units in u of size d_ff = 2053 in our 47M-parameter dense model trained on WikiText-103 (we refer to App. A.2 for more details). The number is below 200 for all layers. This suggests that the MLP block can be approximated without a significant performance loss.
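To make this measurement concrete, here is a minimal NumPy sketch (not the paper's actual instrumentation; the array sizes and random weights are illustrative) of how one could count the fraction of active ReLU channels in u = ReLU(W_1 x) over a batch:

```python
import numpy as np

def count_active_channels(W1, X):
    """Fraction of nonzero entries in u = ReLU(W1 x), averaged over a batch X."""
    U = np.maximum(X @ W1.T, 0.0)   # (batch, d_ff), ReLU up-projection
    return float((U > 0).mean())

# Illustrative sizes; a trained model would show much sparser activations.
rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
W1 = rng.standard_normal((d_ff, d_model))
X = rng.standard_normal((32, d_model))
frac = count_active_channels(W1, X)
```

For random Gaussian weights the fraction is near 0.5; the point of the measurement is that trained Transformer MLPs exhibit far smaller values.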
^3 See the appendix "Two feedforward Layers = Attention over Parameter" in their paper version "arXiv:1706.03762v3."

Approximating 2-layer MLPs

The core idea is to approximate the sum in Eq. 5, i.e., y = Σ_{i=1}^{d_ff} y_i, by keeping only a subset S ⊂ {1, ..., d_ff} of the key-value pairs, i.e., ŷ = Σ_{i∈S} y_i. The intuition behind this approximation is as follows. We assume that a good approximation ŷ of y is one that minimizes their squared Euclidean distance e = ||ŷ − y||_2^2 ∈ R, which can now be expressed as

e = || Σ_{i∈S̄} α_i v_i ||_2^2

where S̄ denotes the complement of S, i.e., S̄ = {1, ..., d_ff} \ S. This error is small if the contributions α_i for i ∈ S̄ are small. If we further assume that all value vectors v_i have the same norm, the crucial factor for approximation quality reduces to the attention weights α_i. In this context, we also call α_i the contribution of key-value pair i.
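The argument above can be checked numerically. The following NumPy sketch (all sizes are illustrative) builds a toy y = Σ α_i v_i with equal-norm values, keeps the pairs with the largest α_i, and verifies the triangle-inequality bound ||ŷ − y||_2 ≤ Σ_{i∈S̄} α_i that holds for unit-norm values:

```python
import numpy as np

rng = np.random.default_rng(0)
d_ff, d_model, K = 32, 8, 8
alpha = np.maximum(rng.standard_normal(d_ff), 0.0)   # ReLU-style contributions
V = rng.standard_normal((d_ff, d_model))
V /= np.linalg.norm(V, axis=1, keepdims=True)        # equal-norm value vectors

y = alpha @ V                                        # exact output (sum of Eq. 5)
S = np.argsort(alpha)[-K:]                           # keep the K largest contributions
S_bar = np.setdiff1d(np.arange(d_ff), S)             # ignored complement
y_hat = alpha[S] @ V[S]

err = np.linalg.norm(y_hat - y)                      # ||y_hat - y||_2
bound = alpha[S_bar].sum()                           # triangle-inequality bound
```

Because the dropped terms are exactly Σ_{i∈S̄} α_i v_i, the error never exceeds the sum of the ignored contributions.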
Let K be a positive integer. The general idea of all methods discussed in this work is to keep the K pairs (k_i, v_i) whose contribution α_i is the highest, and to ignore the other low-contribution pairs. The goal is to find the best mechanism to select such K pairs. Here we discuss three variants: Top-K activation (Sec. 3.1), Product-Key Memories (PKMs, Sec. 3.2), and Mixture of Experts (MoEs, Sec. 3.3).

Top-K Activation Function
The most straightforward implementation of the approximation described above is the top-K activation function:

S = arg topK(u, K)    (6)
ŷ = Σ_{i∈S} α_i v_i    (7)

Unfortunately, this only saves less than half of the entire computation: while it allows us to reduce the computation of Eq. 2, no computation can be saved in Eq. 1 because the full computation of u = ReLU(W_1 x) is required for Eq. 6. Going beyond this requires also introducing some approximation to Eq. 6, as in PKMs (Sec. 3.2) and MoEs (Sec. 3.3).
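A minimal NumPy sketch of the top-K activation (illustrative, not the paper's implementation): everything but the K largest post-ReLU entries is zeroed, so only K columns of W_2 are needed afterwards.

```python
import numpy as np

def topk_activation(u, K):
    """ReLU followed by keeping only the K largest entries; the rest become 0."""
    u = np.maximum(u, 0.0)                           # ReLU as in Eq. 1
    if K >= u.shape[-1]:
        return u
    idx = np.argpartition(u, -K, axis=-1)[..., -K:]  # indices of the K largest
    out = np.zeros_like(u)
    np.put_along_axis(out, idx, np.take_along_axis(u, idx, axis=-1), axis=-1)
    return out

u_hat = topk_activation(np.array([0.1, -2.0, 3.0, 0.5, 2.0]), K=2)
# only the two largest activations (3.0 and 2.0) survive
```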

Product-Key Memories (PKMs)
Product-key memories (Lample et al., 2019) replace the up-projection matrix W_1 by two smaller matrices W_a, W_b ∈ R^{√d_ff × d_model/2}, operating on the two halves x_a, x_b ∈ R^{d_model/2} of the input, so that x = x_a|x_b, where | denotes concatenation. The matrix multiplication is then performed on these smaller vectors: u_a = W_a x_a and u_b = W_b x_b. Their outputs are combined in all possible ways (i.e., Cartesian products), similarly to the outer product, but using addition instead of multiplication, i.e., for all i ∈ {1, ..., d_ff}:

u[i] = u_a[j] + u_b[l], where (j, l) is the i-th element of {1, ..., √d_ff} × {1, ..., √d_ff}

In addition to applying Top-K at the output, as in Sec. 3.1, here Top-K can also be used to accelerate the operation above. By applying Top-K to u_a and u_b before combining them to compute u, only K² ≪ d_ff components of u have to be calculated, and they are guaranteed to contain the K largest components of the full u.
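The K² candidate trick can be sketched as follows (NumPy; the matrix shapes are made up, and the brute-force comparison in the usage is only for illustration). Only the top-K sub-scores of each half are combined, yet the global top-K sums are guaranteed to be among the K² candidates:

```python
import numpy as np

def pkm_topk(x, Wa, Wb, K):
    """Indices and scores of the K largest sums u_a[j] + u_b[l],
    computing only K*K candidates instead of all sqrt(d_ff)^2 combinations."""
    half = x.shape[0] // 2
    ua, ub = Wa @ x[:half], Wb @ x[half:]
    ia = np.argsort(ua)[-K:]                  # top-K sub-keys of each half
    ib = np.argsort(ub)[-K:]
    cand = ua[ia][:, None] + ub[ib][None, :]  # K x K candidate scores
    flat = np.argsort(cand.ravel())[-K:]
    r, c = np.unravel_index(flat, cand.shape)
    return list(zip(ia[r].tolist(), ib[c].tolist())), cand[r, c]

rng = np.random.default_rng(0)
d_model, n_sub = 8, 6                         # hypothetical sizes
Wa = rng.standard_normal((n_sub, d_model // 2))
Wb = rng.standard_normal((n_sub, d_model // 2))
x = rng.standard_normal(d_model)
pairs, scores = pkm_topk(x, Wa, Wb, K=3)
```

The guarantee follows from the fact that if a sum is in the global top-K, both of its sub-scores must be in the top-K of their respective halves.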
In the original formulation (Lample et al., 2019), PKMs use a softmax activation function, taking inspiration from self-attention (Vaswani et al., 2017). Instead, we show that a non-competing activation function, such as ReLU, is a better choice (see Sec. 6.2).

Mixture of Experts (MoE)
Let N_E, G denote positive integers. MoEs partition the d_ff pairs (k_i, v_i) (see their definition in Sec. 2) into N_E groups (experts) of size G each, such that G · N_E = d_ff. This means that the weight matrices W_1 and W_2 are partitioned into expert weight matrices W_1^e ∈ R^{G × d_model} and W_2^e ∈ R^{d_model × G} for e ∈ {1, ..., N_E}. The output is computed as:

y = Σ_{e∈E_x} s[e] W_2^e ReLU(W_1^e x)    (11)

where s[e] ∈ R is the e-th element of the vector s = sel(x) ∈ R^{N_E} computed by an expert scoring function sel : R^{d_model} → R^{N_E}, and E_x = arg topK(s, K) denotes the set of selected experts. Given the notation above, it is straightforward to see that MoEs can also be viewed as approximating 2-layer MLPs with a trainable component (i.e., the selection function sel producing s). Similarly to Eqs. 5 and 7, Eq. 11 can be expressed as:

y = Σ_{e∈E_x} Σ_{i=1}^{G} α_i^e s[e] v_i^e    (12)

where, compared to Eqs. 5 and 7, the "contribution scores" of key-value pair i (defined in Sec. 3) have an additional factor s[e] for the expert group e to which the key-value pair belongs.
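The expert-wise computation can be sketched in a few lines of NumPy (a sigmoid selection function is assumed here for concreteness; all shapes and scales are illustrative, and a real implementation would batch and fuse these operations):

```python
import numpy as np

def moe_forward(x, W1, W2, W3, K):
    """MoE MLP sketch. Hypothetical shapes: W1 (N_E, G, d_model),
    W2 (N_E, d_model, G), W3 (N_E, d_model); returns y of size d_model."""
    s = 1.0 / (1.0 + np.exp(-(W3 @ x)))   # expert scores s = sel(x)
    E_x = np.argsort(s)[-K:]              # indices of the K selected experts
    y = np.zeros(W2.shape[1])
    for e in E_x:                         # sum over selected experts only
        y += s[e] * (W2[e] @ np.maximum(W1[e] @ x, 0.0))
    return y

rng = np.random.default_rng(0)
N_E, G, d_model = 8, 4, 16
W1 = rng.standard_normal((N_E, G, d_model)) * 0.1
W2 = rng.standard_normal((N_E, d_model, G)) * 0.1
W3 = rng.standard_normal((N_E, d_model)) * 0.1
y = moe_forward(rng.standard_normal(d_model), W1, W2, W3, K=2)
```

With K = N_E this reduces to a (score-weighted) dense MLP; the savings come from evaluating only K of the N_E groups.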
The key challenge of MoEs is to learn an expert selection function sel that assigns high scores to only a few experts (so that we can ignore the others without sacrificing performance), while avoiding a well-known issue called expert collapse, where only a few experts are used and the rest are never selected. To avoid this, some regularization is typically applied to the selection score sel(x), encouraging more uniform routing of tokens across experts within each batch. We provide a comprehensive review of MoE variants and their details in Sec. 4, and present our improved version in Sec. 5.

Existing MoE variants
Several variations of MoEs have been proposed with many different details.Here we briefly review the most popular and representative ones (e.g., we do not cover those that make use of reinforcement learning for expert routing) before describing our improved version in Sec. 5. We'll review their expert selection function and regularization method, and highlight their key characteristics.
Sparsely Gated Mixtures of Experts. Shazeer et al. (2017) revisited MoEs (Jacobs et al., 1991; Ivakhnenko and Lapa, 1965) with the Top-K operation, allowing a reduction in their resource demands. Their method is basically the one described in Sec. 3.3 (with re-normalization after Top-K), except that they use a noisy gating function:

sel(x) = softmax(W_3 x + N(0, 1) ⊙ softplus(W_4 x))

where W_4 ∈ R^{N_E × d_model}, the Gaussian noise term N(0, 1) is element-wise and independent for each channel, and softplus(x) = log(1 + e^x). They use the following auxiliary regularization term for load balancing:

L = CV(Σ_{x∈B} sel(x))

where CV(v) = σ_v / µ_v is the coefficient of variation and B is the set of all tokens in the batch.
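A sketch of the noisy gating and the CV-based penalty (NumPy; not Shazeer et al.'s implementation, and the weight shapes are illustrative):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def noisy_sel(x, W3, W4, rng):
    """Noisy gating: logits W3 x plus channel-wise noise scaled by softplus(W4 x)."""
    logits = W3 @ x + rng.standard_normal(W3.shape[0]) * softplus(W4 @ x)
    e = np.exp(logits - logits.max())
    return e / e.sum()                       # softmax over experts

def cv(v):
    """Coefficient of variation sigma/mu: the load-balancing penalty (0 = uniform)."""
    return float(np.std(v) / np.mean(v))

rng = np.random.default_rng(0)
W3 = rng.standard_normal((4, 8))
W4 = rng.standard_normal((4, 8))
probs = noisy_sel(rng.standard_normal(8), W3, W4, rng)
```

Penalizing the CV of the summed selection scores pushes the per-expert usage toward equality; the noise encourages exploration of rarely selected experts.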
Key characteristics: The scores are normalized after the top-K operation (with K > 1), which is equivalent to applying top-K before the softmax.

Switch Transformer. One of Fedus et al. (2022)'s key claims is that top-1 routing is enough. Their selection function is simply sel(x) = softmax(W_3 x), but they propose a hard load-balancing between experts that run on different hardware accelerators: at most µ|B|/N_E tokens are allowed to be routed to an expert, where µ ∈ R_{>0} is the capacity factor (typically between 1 and 1.5), defining how many times more tokens can be processed by one expert compared to the ideal case of uniform routing. Each expert is forbidden to process more than this number of tokens. For regularization, the fraction of the tokens f ∈ R^{N_E} processed by each expert, and the average selection probability p ∈ R^{N_E} for each expert are calculated (K = 1; top-1 is used) as:

f[e] = (1/|B|) Σ_{x∈B} 1{e = arg max sel(x)}
p[e] = (1/|B|) Σ_{x∈B} sel(x)[e]
L = N_E f · p

where 1 denotes the indicator function (equal to 1 if the argument is true, and 0 otherwise), and · denotes the dot product. Intuitively, this serves as an adaptive regularization that penalizes experts that are used often with high "weights." In addition, they use dropout with a high drop rate (40%) in the experts (but only 10% in the normal layers). Furthermore, Fedus et al. (2022) also propose to initialize the experts with a standard deviation reduced by a factor of √0.1 compared to the standard scheme (which is based on G). As we will see in Sec. 5, we use a modified version of this scheme.
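The Switch-style load-balancing term built from the routing fractions f and the average selection probabilities p can be sketched as follows (NumPy; the batches of router logits are hypothetical):

```python
import numpy as np

def switch_aux_loss(logits):
    """N_E * dot(f, p): f = fraction of tokens routed (top-1) to each expert,
    p = average router probability per expert. logits: (num_tokens, N_E)."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)            # softmax per token
    ne = logits.shape[-1]
    picks = probs.argmax(axis=-1)                        # top-1 routing decision
    f = np.bincount(picks, minlength=ne) / len(picks)    # fraction per expert
    p = probs.mean(axis=0)                               # mean router probability
    return float(ne * (f @ p))

# Balanced routing (each token prefers a different expert) vs. collapse
loss_balanced = switch_aux_loss(np.eye(4) * 10.0)
loss_collapsed = switch_aux_loss(np.tile(np.array([5.0, 0.0, 0.0, 0.0]), (4, 1)))
```

Under perfectly uniform routing the term evaluates to 1; the more the routing concentrates on a few experts, the larger it grows.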
Note that applying Top-K after softmax encourages collapsing: if the score of the selected expert is increased, the scores of all other experts are automatically decreased.This is not the case for Shazeer et al. (2017): In their method, only the selected experts compete with each other, so if their presence is beneficial, their score can be increased.
Key characteristics: Note that Top-1 is applied after the softmax without re-normalization.

BASE layers and S-BASE.
Inspired by the routing strategy and the hard capacity factor of the Switch Transformer, Lewis et al. (2021) propose BASE layers. They use top-1 routing and a sigmoid activation σ in the selection function:

sel(x) = σ(W_3 x)    (18)

Now, instead of using arg topK, they solve the following linear assignment problem to find the index e_x of the expert assigned to each token x ∈ B:

maximize Σ_{x∈B} sel(x)[e_x] subject to |{x ∈ B : e_x = e}| = |B|/N_E for all e ∈ {1, ..., N_E}    (19)

This guarantees uniform assignment of experts, which is efficient for multi-accelerator training.
The output is computed using Eq. 11 with E_x = {e_x} (a set with a single element; "top-1"). However, at inference time, no such balancing is possible, because not all tokens of the sequence are available at each step; E_x = {arg max sel(x)} is used instead. Lewis et al. (2021) show that, while the routing is enforced to be completely uniform during training, at test time the distribution looks exponential (in fact, this is similar to the Switch Transformer, but more balanced for BASE).
The algorithm for solving the linear assignment problem (Eq. 19) is difficult to implement efficiently on modern accelerators. Clark et al. (2022) proposed instead to use the Sinkhorn algorithm (Sinkhorn, 1964; Sinkhorn and Knopp, 1967) to approximate the solution to this problem (resulting in a model called Sinkhorn-BASE or S-BASE; note that similar routing is independently discussed by Kool et al. (2021)). They report that this works well while being simpler to implement. Thus, our reimplementation of BASE is S-BASE, using the Sinkhorn algorithm.
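A minimal sketch of the Sinkhorn normalization used for routing (NumPy; the iteration count and the score matrix are illustrative): alternating row and column normalization drives the assignment toward one that is balanced across experts.

```python
import numpy as np

def sinkhorn(logits, n_iters=20):
    """Approximate balanced assignment by alternating row/column normalization.
    logits: (num_tokens, N_E) routing scores."""
    P = np.exp(logits - logits.max())
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)   # each token's mass sums to 1
        P = P / P.sum(axis=0, keepdims=True)   # each expert's load sums to 1
    return P

rng = np.random.default_rng(0)
P = sinkhorn(rng.standard_normal((6, 3)))
```

After convergence, taking the arg max per row of P yields a routing that is close to uniform across experts without solving the exact assignment problem.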
Key characteristics: During training, Sinkhorn iterations are used on scores to obtain a balanced assignment.The sigmoid activation is always applied to compute the weighting score.
Overall, all load-balancing methods above are rather complex. We propose a simpler but effective approach for MoEs in Sec. 5.

Improving Mixture of Experts
Here we present our improved MoE variant, which we call σ-MoE. We conduct thorough ablation studies on our design choices in Sec. 6.

σ-MoE Expert Selection Function. Our MoE makes use of the top-K operation (unlike BASE). The activation we use in the selection function is a sigmoid (as in Eq. 18 of BASE), instead of the softmax used in the Switch Transformer and Sparsely Gated Mixtures of Experts. This choice is motivated by the view of MoEs as approximate 2-layer MLPs (Sec. 3). In fact, softmax introduces competition between experts. No such competition between channels is used in the regular 2-layer MLP (i.e., there is no constraint on α_i in Eq. 5). This suggests that, in principle, no competition is needed between the terms in the sum of Eq. 12 in the MoE either to induce sparsity. It is also well known to practitioners that softmax as a regular activation function negatively affects the trainability of standard MLPs. Softmax combined with top-K can also encourage expert collapsing: when the selection score of one expert increases, the scores of the others automatically decrease. For all these reasons, we opt for sigmoid instead of softmax; we experimentally confirm that this is indeed a good choice.
Additionally, looking at MoEs in this framework gives us hints for combining them with the Top-K activation (Sec. 3.1) for further acceleration. We can calculate u_e = s[e] ReLU(W_1^e x) (Eq. 11) for the selected experts, perform an additional Top-K to keep the highest units among them, and set the rest to zero. We leave this for future work.
σ-MoE Initialization. Another design choice guided by the MLP-approximation view of MoEs (Sec. 3) is the initialization scheme for the experts. Typically, experts are assumed to be independent, and the standard deviation of the initialization (Glorot and Bengio, 2010; He et al., 2015) of W_2^e is calculated based on G instead of d_ff. Our experiments in Sec. 6.3 show that this is sub-optimal.
In contrast, we initialize all weight matrices identically to the pre-layernorm dense baselines, not taking into account the smaller size of the individual experts, i.e., W_1^e ∼ N(0, √(2/(n_layers · d_model))) and W_2^e ∼ N(0, √(2/(n_layers · d_ff))), where n_layers denotes the number of layers, using d_model and d_ff instead of G.
We also take special care when initializing W_3 of the selection function. We initialize it to a normal distribution with the same standard deviation as W_1^e, but we also ensure that the rows of W_3 have the same norm.^5 This can be easily achieved in practice by initializing the weights as W'_3 ∼ N(0, 1), rescaling its rows to norm 1, and then rescaling the whole matrix again to have the desired standard deviation. Note that each scalar score in s is the dot product of a row of W_3 and x. This initialization method ensures that only the angle between x and the rows of W_3 initially affects the score s, rather than an additional random factor resulting from initialization.
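The initialization procedure can be sketched as follows (NumPy; `target_std` stands in for the desired standard deviation, which in our setup matches that of W_1^e):

```python
import numpy as np

def init_selector(n_experts, d_model, target_std, rng):
    """Rows of W3 get equal norm, then the whole matrix is rescaled to target_std."""
    W = rng.standard_normal((n_experts, d_model))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit-norm rows
    W *= target_std / W.std()                       # match the desired std
    return W

W3 = init_selector(8, 64, 0.02, np.random.default_rng(0))
```

Equal row norms ensure that, at initialization, the score of each expert depends only on its angle to x, not on a random per-row scale.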
σ-MoE Regularization.As already noted in Sec. 4, existing regularization methods for loadbalancing are complex (e.g., Switch Transformers need to deal separately with the actual selection distribution and the scores, Sparsely Gated Mixture of Experts needs noise in the selection function).
In contrast, we propose to simply maximize the entropy of the selection distribution p ∈ R^{N_E} calculated across the entire batch. Intuitively, this is a simple way to encourage equal expert usage within the batch and prevent unnecessary overconfidence in selecting individual experts. Let B be the set of all tokens in the batch (counting through both the batch and time dimensions). We introduce the following regularization term L:

p = (1/|B|) Σ_{x∈B} softmax(W_3 x)
L = Σ_{e=1}^{N_E} p[e] log p[e]    (21)

Furthermore, we propose to randomly drop complete experts during training; we refer to this as expert dropout. Unlike standard dropout on the activation level, we do not apply rescaling, i.e.,

sel(x) = σ(W_3 x) ⊙ m, with m ∈ {0, 1}^{N_E}, m_i ∼ Bernoulli(1 − δ)    (22)

where δ is the dropout rate and ⊙ is the element-wise product. This prevents the dropped experts from being selected while not affecting the others. Intuitively, when expert dropout removes a popular expert, it forces the less popular ones to take over. Thus, the chance of them being trained and improved increases. We experimentally show that our regularization method (Eq. 21) and expert dropout (Eq. 22) are both effective despite their simplicity.

^5 Having rows with different norms would discourage the use of experts corresponding to rows with small norms, as their selection score would be low even if the angle of the selector (row of W_3) fully aligns with x.
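Both pieces are easy to sketch (NumPy; the exact normalization used for the entropy term here is an assumption based on the description above, and the shapes are illustrative):

```python
import numpy as np

def neg_entropy(logits):
    """Negative entropy of the batch-averaged softmax selection distribution.
    Adding this to the loss maximizes entropy, i.e., encourages uniform usage."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    p = probs.mean(axis=0)                 # distribution over experts, whole batch
    return float(np.sum(p * np.log(p)))

def expert_dropout(scores, delta, rng):
    """Drop whole experts with probability delta, with NO rescaling of survivors."""
    mask = (rng.random(scores.shape[-1]) >= delta).astype(scores.dtype)
    return scores * mask
```

The negative entropy is minimized (most negative) when all experts are used equally, so adding it to the training loss pulls the routing toward balance; the mask simply zeroes the scores of dropped experts so they cannot be selected.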

Experiments
Our experimental setup is based on Dai et al. (2019)'s Transformer XL with some modifications: we use pre-layer norm and reduce the number of training steps to 100k to reduce the computational budget. Also, to match the parameter counts between the baseline and MoEs, we slightly modify the hyperparameters of the baselines (Dai et al., 2019). We use subword units (Sennrich et al., 2016) obtained with the SentencePiece tokenizer (Kudo and Richardson, 2018) instead of the word-level vocabulary, to avoid the extra tricks required to reduce the parameter count and compute requirements resulting from the huge vocabulary size. On WikiText-103, we consider two different model sizes: a 47M-parameter one (denoted by "WT-S" for "small") and a 262M-parameter one ("WT-B" for "big"). We refer to Enwik8 as "E8" in certain tables. For more details, see Appendix B. We use each of the considered methods in every MLP block of the model, which is not a common practice in the literature. Typically, MoE (or another approximation method) is used only in every n-th layer, or even in only one layer. This is not satisfactory, since our goal is to find a generally applicable method that can accelerate all layers across the whole model. Moreover, this amplifies the differences between methods, helping to better illustrate the effects of each design choice.

Top-K
We first evaluate the Top-K method (Sec. 3.1). This standalone evaluation is important, as Top-K is the basis of both the PKM and MoE approximations. Tab. 1 shows the results. We observe that Top-K in the MLP blocks not only preserves the performance of Transformers, it even improves it. We hypothesize that these improvements are due to the reduction in feature interference, as described by Elhage et al. (2022). However, we obviously cannot arbitrarily reduce K; there should be a trade-off between the denoising effect and the capacity of the network. Here, the optimal value we find is K = 128 or K = 512.

Product-Key Memories (PKMs)

Our view of Sec. 3 suggests using a non-competitive activation such as ReLU instead of the softmax used in the original PKM (Lample et al., 2019). Our experiments confirm the benefits of this choice (Tab. 2): the performance of the ReLU variants is much closer to the dense baseline (see also related findings in Shen et al. (2023)). But even the best PKM models underperform the dense baselines, indicating a fundamental limitation of PKMs. Note that, as stated above, we conduct a careful comparison between the approximation method (here, PKM) and the dense baseline using the same number of parameters. For more results and details on PKM, we refer to App. A.3.

Mixture of Experts (MoE)

Here we evaluate our σ-MoE models (Sec. 5) on Enwik8 and WikiText-103, as well as on two additional datasets, C4 (Raffel et al., 2020) and the newly proposed peS2o (Soldaini and Lo, 2023). Given the large sizes of C4 and peS2o, we cannot afford to train for a full epoch; we train for 100k steps with the same hyperparameters as for WikiText-103.
Main results. Tab. 3 shows the main results. Our σ-MoE models match the performance of their parameter-equal dense baselines, while achieving significant memory and compute reduction. These models use K = 4 for N_E = 16 or N_E = 32, which is a "moderate" level of sparsity but already offers significant compute reduction, as shown in the column "% FLOPs"; concrete compute and memory reductions are further shown in Fig. 2 (see Appendix A.5 for details). Naturally, there is a limit on the minimum sparsity level that preserves good MoE performance, which is determined by several factors. First, we empirically find that experts with a group size of G < 128 generally degrade performance. Second, our benchmarks with the Top-K operation (Tab. 1) and our ablations (Tab. 10 in the Appendix) show that the minimum number of simultaneously active channels G · K needs to be above a certain critical threshold (usually around 256-512). Finally, we match the number of parameters of the baseline model; this is the last constraint. Under these constraints, we find that the performance of the dense baselines can be matched using 25% of the required FLOPs and memory for activations for our small models, and 12.5% for the big one (note that the FLOPs here do not take into account the linear projection used to select the experts, which is negligible within the range of N_E used here).
Softmax-based selection functions (with and without re-normalization) consistently perform worse than our sigmoid-based one. The same is true for the standard initialization; ours is better. Interestingly, removing all regularization methods degrades performance but does not entail catastrophic collapse, even with N_E = 128. We also examine the best (G, K) combinations, given a constant number (G · K) of active pairs k_i, v_i; we find that a high K = 4 works best within this range. Further analysis of our σ-MoE can be found in App. A.4.

Analyzing expert utilization.
A typical failure mode of MoEs is expert collapse, where only a few experts are used while others are completely ignored or underused. Here we conduct an analysis to evaluate whether various models, including ours, are affected by this issue. For each layer, we compute the proportion of the expert selection weights assigned to each expert (sel(x)) on the entire validation set of WikiText-103. We use the WT-S* models from Tab. 4 with 128 experts. A representative layer is shown in Fig. 3. Models with poor performance (see Tab. 4), i.e., the Switch Transformer (red) and a "bad" variant of σ-MoE with a softmax and renormalization, "softmax (renom.)" (green), can be easily identified: they severely suffer from the expert collapse problem. The statistics are rather similar for all other models; the fine performance differences among these models do not seem to be due to expert collapse. Remarkably, our entropy-regularized models with expert dropout, especially σ-MoE, are capable of matching the expert usage balancing of S-BASE without using the Sinkhorn activation function. Note that, in general, we do not consider uniform expert activation to be optimal: we expect expert specialization, and thus the frequency of expert usage should depend on the occurrence of the task each expert performs.

Figure 3: The total proportion of selection weights assigned to a given expert (x-axis; sorted by their popularity) on the validation set of WikiText-103, using the WT-S* models from Tab. 4. This is for one representative layer ("Layer 5"; similar plots for other layers are shown in Fig. 7 in the appendix). The models with poor performance (Tab. 4), i.e., the Switch Transformer (red) and σ-MoE with softmax and renormalization, "softmax (renom.)" (green), can be easily identified. In contrast, the fine performance differences between the rest of the models do not seem to be due to expert collapse.

Conclusion
Our novel view unifies methods that approximate 2-layer MLPs, such as Top-K, Mixture of Experts (MoE), and product-key memory (PKM) methods. While Top-K by itself provides limited performance improvements and speedups, further speedup requires PKM or MoE. A non-competitive activation function inspired by our unified view improves both PKM and MoE. Further novel enhancements of MoEs yield our σ-MoE, which outperforms existing MoEs. Importantly, our σ-MoE with moderate sparsity matches the performance of parameter-equal dense baselines while being much more resource-efficient. Our new insights improve the training of language models with limited hardware resources, making language modeling research more accessible.

Limitations
Our experiments show that if we naively increase the number of experts, the performance gap between MoE models and their dense counterparts increases.This indicates the need for careful control of sparsity and hyper-parameters, which remains a challenge for MoEs.
Our CUDA kernel is sub-optimal and I/O limited.However, even in its current form, it already yields significant performance boosts and memory reduction.We expect that an expert CUDA programmer could improve the speed of our kernel by at least a factor of 2.
We do not consider load balancing between hardware accelerators as is done in Switch Transformers and S-BASE.Our goal is to make a larger model fit a single accelerator, or multiple accelerators in the standard data-parallel training.Our preliminary experiments suggest that such balancing entails a performance hit.
We could not reproduce the 277M-parameter Enwik8 model of Dai et al. (2019), because we could not fit the baseline model on any of our machines. We tried to use rotary positional encodings with PyTorch 2.0's memory-efficient attention to reduce its memory consumption; however, this resulted in a significant performance degradation (even for the smaller models).
Our study focuses on end-to-end trainable MoEs.Other MoE methods (Irie et al., 2018;Li et al., 2022) that pre-train LMs on disjoint data, to recombine them later into a single model, are out-of-scope.
Our study only considers standard Transformers; however, similar acceleration methods are of utmost importance for shared-layer Transformers, such as Universal Transformers (Dehghani et al., 2019) and NDRs (Csordás et al., 2022). In fact, layer sharing dramatically reduces the number of parameters. Compensating for this by naively increasing d_model or d_ff results in prohibitively high memory overhead and slow execution. In contrast, MoEs allow increasing the number of parameters without such dramatic drawbacks. We leave shared-layer MoEs for future work.

A Further details and analyses
A.1 Definition of normalised Top-K

Using the setting of Sec. 3.3, we define the normalized top-K operation as follows: for x ∈ R^N and S = arg topK(x, K),

norm-topK(x)[i] = exp(x[i]) / Σ_{j∈S} exp(x[j]) if i ∈ S, and 0 otherwise.

That is, the softmax is computed over the K selected logits only, and all other entries are set to zero.
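A sketch of this operation (NumPy; illustrative):

```python
import numpy as np

def norm_topk(x, K):
    """Softmax restricted to the K largest logits; all other entries are exactly 0."""
    S = np.argsort(x)[-K:]
    out = np.zeros_like(x, dtype=float)
    e = np.exp(x[S] - x[S].max())   # stable softmax over the selected logits
    out[S] = e / e.sum()
    return out

w = norm_topk(np.array([1.0, 3.0, 2.0, 0.0]), K=2)
```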

A.2 Measuring the Number of Active Channels in u
In order to explore whether a (k_i, v_i)-sparsity-based approach is feasible, we measure the number of nonzero entries in the up-projected vector u in our baseline models (which, because of the ReLU activation function, equals the number of positive entries). We show the results of our 47M model in Fig. 1. Note that d_ff = 2053 (see Tab. 8) for the same model, which means that on average only 1-10% of the channels are active. We show the same analysis for the 262M model in Fig. 4. Interestingly, the counts remain the same, even though d_ff = 4110 for this model. The 41M-parameter model on Enwik8 shows a stark difference in the distribution of the channels between layers; see Fig. 5. This suggests that the key factor determining the count distribution is the dataset, and the size of the model plays only a secondary role. Fortunately, the sparsity is very high for all models considered.

A.3 Details on PKMs

Our PKM implementation is based on Lample et al. (2019) with the following basic modifications. First, we do not use batch normalization (BN). As Lample et al. (2019) show that BN is only beneficial for models with a very large memory size, we remove it, which also simplifies inference, where the effective batch size varies over time. Also, we directly divide the input vectors into two sub-keys without an additional projection. Finally, unlike Lample et al. (2019), we use the same learning rate for all parts of the network.
In addition to the parameter-equal comparison of Sec.6.2, there is another possibly "fair" way of setting the size of the PKM-based model: match the number of values (this would result in fewer parameters because of the key approximation), even though Elhage et al. (2022) suggest that the keys typically play a vital role, and reducing their capacity will cause a performance loss.See Tab. 6 for the corresponding results.Note that, for Enwik8 and Wikitext-103 small, the parameter-equal setting increases the number of sub-keys from 46 to 62 (2116 vs. 3844 values).This helps significantly.

A.4 Further Analyses of Our σ-MoE
We also examine the best (G, K) combinations given a constant number (G · K) of active pairs k_i, v_i. In this setting, reducing K by a factor of m (K' = K/m) involves increasing G (G' = mG), which, for a constant number of parameters, reduces N_E to N'_E = N_E/m. The results can be seen in the 2nd block of Tab. 10. We find that a higher K is beneficial. Given this, we ask how the selection distribution of the models with K > 1 differs from selecting the same experts together so that they act as a single larger expert. Are these models combining experts in more meaningful ways? To test this, we measure the distribution of experts that are used together on Wikitext-103 with our 47M MoE model with K = 4. The result can be seen in Fig. 6: the network combines experts in a rich way, further supporting the use of K > 1. Note that it remains an open question whether such "compositions" may help the generalization and compositional behavior of the network (Fodor and Pylyshyn, 1988; Pagin and Westerståhl, 2010; Hupkes et al., 2020).

Detailed Usage Count Analysis. We show the relative proportion of experts selected for all layers in Fig. 7. For more details, please refer to Sec. 6.3.

A.5 More on Resource Efficiency
For execution time and memory usage, both the dense MLP and the MoE layers are linear in d_model (Fig. 9), the MLP is linear in d_ff, and the MoE is linear in G (Fig. 8) and K. For the same number of parameters (except for the selection network, which is negligible), d_ff = G · N_E. However, both the memory usage and the execution time of the MoE are almost independent of N_E, except for a small linear factor due to the selection network (see Fig. 2). Figures 2, 8 and 9 show the actual measured execution time and memory usage on an RTX 3090 GPU.

Note that there is no significant difference in terms of speed and memory usage between the different MoE variants given the same d_model, G, and K. This is because they only differ in the selection mechanism and regularization, and not in the way the experts are executed. Since all methods are configured to have the same number of parameters as the dense baselines, and K experts are used in parallel, the factor of reduction in both FLOPs and memory usage is given by K/N_E. We show this factor for all models in Tab. 7.

B Implementation details
We train all of our models for 100k steps with cosine learning rate decay, starting from an initial learning rate of 0.00025 and decaying to 0. We use the Adam optimizer (Kingma and Ba, 2015) with default PyTorch parameters (Paszke et al., 2019) and gradient clipping with a maximum gradient norm of 0.25. We show the other hyperparameters of our dense models in Tab. 8. We train our models with an XL memory of the same size as the context. However, following Dai et al. (2019), we evaluate the models using a longer memory. Unlike the hyperparameter-tuned memory sizes of Transformer XL, we use 4 times the context size, which approximates the memory size used by Dai et al. (2019) while remaining simple.
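The schedule described above is a standard cosine decay without warmup. A minimal sketch of the per-step learning rate (the function name is ours; in practice this corresponds to, e.g., PyTorch's CosineAnnealingLR with T_max set to the total number of steps):

```python
import math

def cosine_lr(step, total_steps=100_000, base_lr=0.00025):
    """Cosine decay from base_lr at step 0 to 0 at total_steps."""
    progress = min(step / total_steps, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# cosine_lr(0) -> 0.00025 (initial), cosine_lr(50_000) -> 0.000125 (midpoint),
# cosine_lr(100_000) -> 0.0 (fully decayed).
```

Gradient clipping is then applied each step before the optimizer update, e.g. via `torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)`.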
The hyperparameters of the MoE models match those of their dense counterparts with the same number of parameters, except for the MoE-specific ones, which are shown in Tab. 9. δ denotes the expert dropout and γ denotes the regularization strength used for the loss L (see Eq. 21). For the non-MoE layers, the same dropout is used as for the baselines. For Switch Transformers, we use

B.1 A Few Words on the CUDA Kernel
We call the key operation for our MoE layers conditional vector-matrix multiplication, or CVMM, and define it as follows. Given a batch of vectors V ∈ R^{N×M}, where N is the batch size and M is the number of channels, a set of K matrices M ∈ R^{K×M×L}, and selection indices S ∈ {0, ..., K − 1}^N, CVMM(V, S, M) ∈ R^{N×L} is defined elementwise as CVMM(V, S, M)[n, l] = Σ_m V[n, m] · M[S[n], m, l], i.e., row n of the input is multiplied by the matrix selected for it. Our CUDA kernel is based on Simon Boehm's blog post on developing a matrix multiplication kernel (https://siboehm.com/articles/22/CUDA-MMM). However, there are major differences: unlike in standard matrix multiplication, in our case different matrices may be used for different batch elements of the input. To be able to reuse matrices fetched from the global memory of the GPU, we first perform a preprocessing step: we sort the selection indices and obtain a reordering vector. This gives an ordering of the input and output batch elements such that consecutive indices are multiplied by the same matrix with high probability. Fortunately, multiple channels have to be fetched/written out at once, so this reordering has minimal overhead. Our kernel has an additional grid dimension compared to standard matrix multiplication, iterating over the matrix index k ∈ {0, ..., K − 1}. We find that skipping matrices that do not have any corresponding inputs has minimal overhead. To avoid checking all elements of the reordering vector, we precompute their offsets.
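The semantics of CVMM can be pinned down by a naive NumPy reference (a sketch of the definition only, not the CUDA kernel): each batch row is multiplied by the matrix its selection index points to.

```python
import numpy as np

def cvmm_reference(V, S, M):
    """Naive conditional vector-matrix multiplication.
    V: (N, M_ch) batch of vectors, S: (N,) integer indices in [0, K),
    M: (K, M_ch, L) stack of matrices. Output row n is V[n] @ M[S[n]]."""
    N, L = V.shape[0], M.shape[2]
    out = np.empty((N, L), dtype=V.dtype)
    for n in range(N):
        out[n] = V[n] @ M[S[n]]
    return out

# Example: 2 rows, K = 2 matrices (identity and a channel swap).
V = np.array([[1.0, 2.0], [3.0, 4.0]])
M = np.stack([np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])])
S = np.array([1, 0])
out = cvmm_reference(V, S, M)
# Row 0 is swapped by matrix 1 -> [2, 1]; row 1 hits the identity -> [3, 4].
```

In the MoE layer, this operation is applied twice per active expert slot: once with the expert's first-layer weights and once, after the activation, with its second-layer weights.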
Our kernel uses shared memory and register caching; however, it does not use asynchronous loads, which makes it I/O bound, and it supports neither tensor cores nor mixed precision. The preprocessing step uses the radix sort from the CUB library. Computing the offsets additionally requires counting the number of vectors assigned to each matrix. This information, as well as the offsets (which are its prefix sums), is freely available as sub-results that the radix sort computes anyway; however, we found no way of extracting it from the CUB implementation. We estimate that by implementing a more efficient preprocessing step, asynchronous loads, and tensor core support, our kernel could be accelerated further by a factor of two.

B.2 Additional Results on MoEs
Additional results for different MoE variants, with more model details, are shown in Tab. 10. We repeat the entries from Tab. 4 for easier comparison.

Figure 4: Number of active channels in u in our dense 262M parameter model on WikiText-103. d_ff = 4110 for this model, so the sparsity is below ∼5%. Standard deviation over all tokens of the test and validation sets.

Figure 5: Number of active channels in u in our dense 41M parameter model on Enwik8. d_ff = 2053 for this model, so the sparsity is below ∼15%. Standard deviation over all tokens of the test and validation sets.

Figure 6: Expert co-occurrence in a σ-MoE model with N_E = 16 experts and K = 4. Each row shows the distribution of experts used together with the one corresponding to that row. Measured on the validation set of WikiText-103 in the 3rd layer of our 47M σ-MoE model. The other layers and models behave qualitatively the same.

Figure 8: Measured execution time and memory usage of a forward-backward pass of a single MLP and MoE layer. |B| = 32768, corresponding to the realistic scenario of batch size 64 and sequence length 512; d_model = 512, K = 4, N_E = 32, and d_ff = G · N_E. Solid lines show the execution time, dashed ones the memory consumption. Because both are linear with similar slopes, they are almost indistinguishable. Even with our suboptimal CUDA kernel, the wall-clock time is faster starting from 16 experts.

Figure 9: Measured execution time and memory usage of a forward-backward pass of a single MLP and MoE layer. |B| = 32768, corresponding to the realistic scenario of batch size 64 and sequence length 512; K = 4, N_E = 32, G = 128, and d_ff = G · N_E. Solid lines show the execution time, dashed ones the memory consumption. Even with our suboptimal CUDA kernel, the wall-clock time is faster starting from 16 experts.
In fact, our MoE CUDA kernel can only work with dimensions divisible by 4. We round the original sizes up to the next suitable number; e.g., we change d_model of our 47M-parameter WikiText-103 model from the original 410 to 412. Furthermore, since MoEs require extra parameters for the expert selection function, we compensate for these by increasing the d_ff of the baseline model to match the number of parameters. Our modified baseline model on Enwik8 still has 41M parameters and performs similarly to the original Transformer XL (see Tab. 1). For WikiText-103, we use subword units

Table 1: Effects of the top-k activation function on the perplexity (WikiText-103) and bits/character (Enwik8).

Table 2: Performance of the parameter-matched PKM models. We provide more results in Appendix, Tab. 6.
N_E and Impact of Sparsity. ... d_ff, which is set to 16480 to match the number of parameters of the N_E = 128 MoE. This baseline achieves a perplexity of 10.03; thus, the gap between the scaled-up MoE and its dense counterpart ... Expert dropout leads to performance degradation in most cases, except for the model with N_E = 128 experts. The softmax-based selection functions

Table 6: The performance of the PKM model variants. Both value-count- and parameter-matched variants are shown. Additionally, we show the effect of the initialization inspired by our unified view, which is marginal for PKMs.

Table 7: The relative amount of FLOPs and memory used by the feedforward block of the MoE Transformer compared to its dense counterpart. The same configurations are shown as in Tab. 10.

Table 9: MoE-specific hyperparameters for the different model variants. γ denotes the scale of the load-balancing term in the loss and δ is the probability of expert dropout. The standard, Transformer-specific hyperparameters are the same as for the baselines; please refer to Tab. 8. "SentencePiece" tokenization is used for WikiText-103, C4, and PES2O, and "Character" for Enwik8.