HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts

By routing input tokens to only a few experts, the Sparse Mixture-of-Experts (SMoE) architecture has enabled efficient training of large language models. Recent findings suggest that fixing the routers can achieve competitive performance by alleviating the collapsing problem, where all experts eventually learn similar representations. However, this strategy has two key limitations: (i) the policy derived from random routers might be sub-optimal, and (ii) it requires extensive resources during training and evaluation, leading to limited efficiency gains. This work introduces HyperRouter, which dynamically generates the router's parameters through a fixed hypernetwork and trainable embeddings to achieve a balance between training the routers and freezing them, thereby learning an improved routing policy. Extensive experiments across a wide range of tasks demonstrate the superior performance and efficiency gains of HyperRouter compared to existing routing methods. Our implementation is publicly available at https://github.com/giangdip2410/HyperRouter.


Introduction
Recent years have witnessed tremendous successes of the Transformer model (Vaswani et al., 2017) and its variants across a wide range of tasks, ranging from natural language and speech processing (Bao et al., 2022b; Gulati et al., 2020), computer vision (Dosovitskiy et al., 2021; Ruiz et al., 2021; Bao et al., 2022a), and reinforcement learning (Chow et al., 2023), to life sciences (Rives et al., 2021). Since then, scaling up to larger models has become the prevailing approach for advancing the state-of-the-art in pre-training and finetuning tasks. However, training such large models comes with a high computational cost (Lin et al., 2022); therefore, there is a growing need to develop efficient strategies that facilitate the training of large language models (LLMs) (Fedus et al., 2022a). One of the most effective strategies thus far is the Sparse Mixture-of-Experts (SMoE) (Shazeer et al., 2017; Fedus et al., 2022b), which utilizes routers to direct each input token to a small subset of network parameters (experts). SMoE improves both efficiency and performance compared to dense training approaches (Lewis et al., 2021; Artetxe et al., 2022; Zhang et al., 2022; Du et al., 2022). Despite these encouraging results, SMoE has been found to encounter the issue of representation collapse, where all experts converge to similar representations (Chi et al., 2022; Chen et al., 2022). This problem arises because the learning process of the router encourages experts to cluster around a centroid (Chi et al., 2022). Therefore, significant efforts have been dedicated to addressing the representation collapse issue while preserving the simplicity and efficiency of SMoE training. To this end, one of the most effective strategies is freezing the router, as demonstrated by SMoE-Dropout (Chen et al., 2023), where a randomly initialized router remains fixed throughout the training process. Additionally, Chen et al.
(2023) found that progressively increasing the number of selected experts can be beneficial. However, we argue that such a naive strategy exhibits two key limitations. First, the random routing policy may be sub-optimal, which hinders the overall training process. Second, we will show in Section 3.3 that a fixed router restricts the model's representation capabilities, necessitating a progressive increase in the number of chosen experts to achieve satisfactory performance. Therefore, SMoE-Dropout inherently suffers from limited representation and does not offer efficiency gains during inference. This work introduces a novel approach called HyperRouter to address the trade-off between fixed and trainable routers in SMoE training. Instead of employing a random but fixed router, HyperRouter utilizes a random but fixed hypernetwork (Ha et al., 2017) to generate the router's parameters based on a trainable router embedding. By doing so, HyperRouter can improve the routing policy throughout training while mitigating representation collapse. To demonstrate its effectiveness, we conduct extensive evaluations on various NLP tasks, comparing HyperRouter to several state-of-the-art SMoE routing strategies. Moreover, HyperRouter achieves the same performance threshold with fewer experts (less compute) during inference, thereby significantly enhancing the efficiency of deploying LLMs in real-world applications. Fig. 1 shows that HyperRouter consistently outperforms other competitors when the same number of experts is used during inference.

Related Work
Existing Routing Mechanisms for SMoE. One of the most important components for training SMoE is the expert routing strategy, which specifies the experts that process each input token. There are two common classes of token-expert assignment algorithms for SMoE training: (1) letting tokens select the top-k experts, and (2) letting experts select the top-k tokens. For the first approach, Switch Transformer (Fedus et al., 2022b), GShard (Lepikhin et al., 2021), THOR (Zuo et al., 2022), BASE (Lewis et al., 2021), S-BASE (Clark et al., 2022), and SMoE-Dropout (Chen et al., 2023) are representative methods. Meanwhile, Zhou et al. (2022) introduced expert choice, which enables selecting different experts for each token and demonstrates the potential of the second approach.
On the Representation Collapse of SMoE. A major research focus is improving the token-expert assignment to avoid the representation collapse issue, where: (i) all inputs are routed to the same expert (Zuo et al., 2022), or (ii) all experts converge to similar representations (Chi et al., 2022; Chen et al., 2022). Such issues result in poor specialization among experts, parameter redundancy, and limited performance gains. Existing works address issue (i) by employing various ad-hoc heuristics, e.g., adding Gaussian noise (Shazeer et al., 2017), limiting the maximum number of inputs that can be routed to an expert (Lepikhin et al., 2021), imposing a load-balancing loss (Fedus et al., 2022b), and using linear assignment (Lewis et al., 2021).

Methodology
This section describes SMoE training and details the proposed HyperRouter method.

SMoE Training
We first describe SMoE training of LLMs, which consists of a router R(•) with parameters W_r and N expert networks {E_i(•)}_{i=1}^N with parameters W_{e_i}. We follow the most common implementation (Fedus et al., 2022b), which uses a linear network as the router and splits the feedforward networks in LLMs into N separate experts.

Switch Transformer. Given an input token x with its representation h ∈ R^d, the SMoE output y is calculated by routing h only to the k most suitable experts determined by the router:

y = Σ_{i=1}^N TopK(σ(W_r h), k)_i · E_i(h),   (1)

where the TopK(•, k) function keeps the largest k values of a given vector and sets the remaining values to zero, and σ(•) is the standard Softmax function.
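As a concrete illustration of the routing described above, the following is a minimal, self-contained sketch of top-k routing: a linear router scores the token, the Softmax scores outside the top-k are zeroed, and the token is dispatched only to the surviving experts. The toy experts, dimensions, and weights here are illustrative stand-ins, not the paper's implementation.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def smoe_forward(h, W_r, experts, k):
    """Route h to the k experts with the largest softmax router scores."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W_r]
    scores = softmax(logits)
    # TopK keeps the k largest scores and zeroes out the rest.
    chosen = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    y = [0.0] * len(h)
    for i in chosen:
        e_out = experts[i](h)  # only k of N experts are ever evaluated
        y = [yi + scores[i] * ei for yi, ei in zip(y, e_out)]
    return y, chosen

# Toy setup: 4 "experts" that just scale the input by different constants.
experts = [lambda h, c=c: [c * x for x in h] for c in (1.0, 2.0, 3.0, 4.0)]
W_r = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y, chosen = smoe_forward([2.0, 1.0], W_r, experts, k=2)
```

Because the toy experts only rescale the input, the output stays proportional to h; the point of the sketch is that only k of the N experts contribute to y.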

HyperRouter
We now describe our proposed HyperRouter strategy, which balances between a random but fixed router and a router fully optimized during training. HyperRouter employs a fixed hypernetwork (Ha et al., 2017) H(•) to dynamically generate the router's parameters conditioned on a trainable router embedding e. Specifically, HyperRouter obtains the output y as:

y = Σ_{i=1}^N TopK(σ(W_r h), k)_i · E_i(h),  with  W_r = H(e),   (2)

where e is a low-dimensional trainable vector associated with the current layer. All parameters except the frozen hypernetwork are jointly optimized to minimize the task loss. Lastly, like SMoE-Dropout, HyperRouter also gradually increases the number of activated experts throughout training, starting at k = 2 and ending at k = N. Fig. 2 visually illustrates our proposed HyperRouter method.
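The parameter-generation step above can be sketched as follows: a frozen two-layer MLP maps the trainable embedding e to a flat vector that is reshaped into the router weight matrix. The hypernetwork sizes and fixed weights below are illustrative assumptions, not the paper's configuration.

```python
import random

D_EMB, D_HID, N_EXPERTS, D_MODEL = 4, 8, 4, 2

rng = random.Random(0)
# Frozen hypernetwork weights: randomly initialized once, never updated.
H1 = [[rng.gauss(0, 0.5) for _ in range(D_EMB)] for _ in range(D_HID)]
H2 = [[rng.gauss(0, 0.5) for _ in range(D_HID)] for _ in range(N_EXPERTS * D_MODEL)]

def hyper_router(e):
    """Generate router weights W_r = H(e) with shape (N_EXPERTS, D_MODEL)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, e))) for row in H1]  # ReLU
    flat = [sum(w * x for w, x in zip(row, hidden)) for row in H2]
    return [flat[i * D_MODEL:(i + 1) * D_MODEL] for i in range(N_EXPERTS)]

e = [0.1, -0.2, 0.3, 0.4]   # trainable router embedding (one per layer)
W_r = hyper_router(e)       # regenerated from e at every forward pass
```

Only e (and the experts) would receive gradients during training; the router weights change across training solely because e does, which is the balance between a fully fixed and a fully trainable router.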

A Deeper Analysis of HyperRouter
We now investigate the representation capabilities of HyperRouter compared to naive SMoE and SMoE-Dropout. Due to space constraints, we only present the main results here and provide the detailed calculations in Appendix B. First, naive SMoE jointly trains the router and expert parameters, leading to entanglement and eventually the representation collapse issue (Zuo et al., 2022). SMoE-Dropout proposed to fix the router parameters W_r, which we will show results in a restricted representation. To do so, we calculate the Jacobian with respect to h, which characterizes how the feature h is involved during training.

Jacobian of SMoE and SMoE-Dropout. Let S_j(h) = σ(W_r h)_j, let 1_{ji} be the indicator function, and let W_{r,i} be the i-th row of W_r. The Jacobians for SMoE and SMoE-Dropout are calculated as:

∇_h L = [J_1 + J_2]^⊤ ∇_y L,  with  J_1 = Σ_{j=1}^k S_j(h) ∂E_j(h)/∂h  and  J_2 = Σ_{j=1}^k Σ_{i=1}^k S_j(h) (1_{ji} − S_i(h)) E_j(h) W_{r,i}^⊤.   (3)

Here, the Jacobian ∇_h L is a combination of two terms related to J_1 and J_2. The J_1 component represents how h contributes to the output y and is the same for all methods. The second component, related to J_2, characterizes learning a better experts' representation of h and is the main difference. Since SMoE-Dropout fixes W_r, its J_2^⊤ ∇_y L term is expressed as a linear combination of the rows of W_r, which lies in a low-dimensional subspace because W_r ∈ R^{k×d} with k ≪ d (the number of experts is usually much smaller than the feature dimension). Thus, SMoE-Dropout alleviates the entanglement between the router and the experts at the cost of restricting the experts' representation capabilities.
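The subspace-restriction argument can be checked numerically on a toy example. Below, the expert outputs are held constant so that only the router (softmax) path contributes to the gradient, and central finite differences confirm that the resulting gradient with respect to h has no component outside the span of the router's rows. All dimensions and values are illustrative, not taken from the paper.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

d = 4
W_r = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # 2 rows span only 2 of 4 dims
E = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]    # frozen expert outputs
v = [0.5, -1.0, 0.25, 2.0]                           # stand-in for the upstream gradient

def loss(h):
    """Router-path-only loss: experts are constants, so J_1 vanishes."""
    s = softmax([sum(w * x for w, x in zip(row, h)) for row in W_r])
    y = [s[0] * a + s[1] * b for a, b in zip(E[0], E[1])]
    return sum(yi * vi for yi, vi in zip(y, v))

h = [0.3, -0.7, 1.1, 0.2]
eps = 1e-6
grad = []
for i in range(d):
    hp, hm = list(h), list(h)
    hp[i] += eps
    hm[i] -= eps
    grad.append((loss(hp) - loss(hm)) / (2 * eps))
# Components orthogonal to the rows of W_r (dims 2 and 3) receive no gradient.
```

With a frozen router, dimensions of h outside the row span of W_r can never be shaped by the router path, which is exactly the restriction the analysis above identifies.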

Jacobian of HyperRouter.
By a similar calculation, the Jacobian of HyperRouter is

∇_h L = [J_1 + J_2]^⊤ ∇_y L,  with  J_2 = Σ_{j=1}^k Σ_{i=1}^k S_j^Hyper(h) (1_{ji} − S_i^Hyper(h)) E_j(h) H(e)_i^⊤,   (4)

where S^Hyper(h) = σ(H(e) h) and H(e)_i denotes the i-th row of the generated router weights. Given that e is trainable, HyperRouter's Jacobian is not expressed as a simple linear combination in a much lower-dimensional subspace. Instead, HyperRouter maps the feature to a subspace whose dimension can be controlled by e, which we can easily set to be greater than k. Furthermore, by fixing the hypernetwork, the Jacobian does not change freely as in SMoE, which helps alleviate the entanglement. Overall, HyperRouter alleviates representation collapse without sacrificing representation capabilities.

Experiments
We conduct experiments to investigate the following hypotheses. First, HyperRouter improves efficiency compared to existing routing strategies, requiring less computation to achieve similar performance thresholds during evaluation. Second, with the same number of experts used during inference, HyperRouter achieves better performance than other competitors.

Finetuning Result
Tab. 3 reports the results of the finetuning experiments on the SST-2, SST-5, IMDB, and BANKING77 datasets. Although we perform dense finetuning, we also report the results of using only half of the experts during evaluation. Overall, we observe consistent accuracy gains from HyperRouter compared to the other baselines on all datasets. Notably, on the SST-2 and IMDB datasets, HyperRouter with only eight experts substantially outperforms the other strategies using all experts.

We now investigate the distributional outputs of the routers (the softmax of the router's output) trained with different strategies. Such outputs determine how tokens are assigned to experts. We hypothesize that high-entropy distributions are not preferred, since they are closer to uniform, indicating the router's low confidence in choosing experts for the current token. In the extreme, the router outputs a uniform distribution and assigns tokens to all experts, eventually causing the collapse issue. Therefore, a router with lower entropy is preferred, since it confidently assigns tokens to a few experts, which can improve the experts' specialization. To this end, we report the entropy of each router in the small TransformerXL trained on the enwik8 dataset in Tab. 4. The entropy is calculated on all samples in the test set using the model obtained after training. The results clearly show that HyperRouter achieves much lower entropy than SMoE and SMoE-Dropout. Due to space constraints, we provide additional visualizations and discussions in Appendix D.
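The entropy measure used above can be made concrete with a small example: a uniform routing distribution over 16 experts attains the maximum possible entropy, log 16, while a distribution concentrated on a few experts scores much lower. The two distributions below are illustrative, not measured values from the paper.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a routing distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [1 / 16] * 16                        # router hedging over all 16 experts
confident = [0.90, 0.05, 0.05] + [0.0] * 13    # mass concentrated on a few experts
```

Under this metric, lower entropy indicates a router that commits to a small set of experts per token, which is the behavior reported for HyperRouter in Tab. 4.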

Conclusion
In this work, we introduced HyperRouter, which dynamically generates the router's parameters via a fixed hypernetwork and trainable router embeddings, balancing between fixed and trainable routers in SMoE training. Extensive experiments on pre-training and finetuning benchmarks showed that HyperRouter consistently outperforms state-of-the-art routing strategies while requiring fewer experts during inference.

A Additional Figures
Fig. 3 illustrates the conceptual difference between the traditional SMoE (Fedus et al., 2022b), SMoE-Dropout (Chen et al., 2023), and our HyperRouter. In summary, SMoE uses a trainable router to send tokens to experts. SMoE-Dropout instead fixes a randomly initialized router and gradually increases the number of chosen experts throughout training. Lastly, HyperRouter improves upon SMoE-Dropout by replacing the router with a fixed hypernetwork and a trainable router embedding.

B Derivation of the Jacobian
This section details the calculations of the Jacobians presented in Section 3.3. The Jacobians of SMoE and SMoE-Dropout follow the same steps as that of HyperRouter, so only the latter is detailed. Recall that HyperRouter obtains the output y as:

y = Σ_{j=1}^N TopK(σ(H(e) h), k)_j · E_j(h).

Here, σ(•) is the standard Softmax function, given by σ(z)_j = exp(z_j) / Σ_i exp(z_i). Without loss of generality and for simplicity, by rearranging the indices of the top-k experts, HyperRouter can be rewritten as:

y = Σ_{j=1}^k S_j^Hyper(h) E_j(h),  with  S_j^Hyper(h) = exp((H(e) h)_j) / Σ_{i=1}^k exp((H(e) h)_i),  ∀j = 1, …, k.

The Jacobian matrix J of the output y with respect to h can be decomposed into two terms as follows:

J = J_1 + J_2,  with  J_1 = Σ_{j=1}^k S_j^Hyper(h) ∂E_j(h)/∂h  and  J_2 = Σ_{j=1}^k E_j(h) (∇_h S_j^Hyper(h))^⊤   (5)

(since e is independent of h). The last term in (5) is obtained by using the chain rule, the logarithmic derivative, and the inner product:

∇_h S_j^Hyper(h) = S_j^Hyper(h) ∇_h log S_j^Hyper(h) = S_j^Hyper(h) Σ_{i=1}^k (1_{ji} − S_i^Hyper(h)) H(e)_i,

where H(e)_i denotes the i-th row of the generated router weights. It is worth mentioning that the first term J_1 serves to produce a better token representation given the current routing scores S_j^Hyper(h), while the second term J_2 represents learning a better gating function for the appropriate routing scores S_j^Hyper(h). After back-propagation, the gradient of the loss function L is obtained from the two paths mentioned above:

∇_h L = J^⊤ ∇_y L = J_1^⊤ ∇_y L + J_2^⊤ ∇_y L.   (6)

Finally, by expanding the second term as follows, we obtain the desired result:

J_2^⊤ ∇_y L = Σ_{i=1}^k [ Σ_{j=1}^k S_j^Hyper(h) (1_{ji} − S_i^Hyper(h)) E_j(h)^⊤ ∇_y L ] H(e)_i.
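The logarithmic-derivative step in the derivation above can be verified numerically: the analytic gradient of a softmax routing score with respect to the input should match central finite differences. The toy router weights and feature vector below are illustrative.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Illustrative 2x3 "router" weights and a feature vector (not from the paper).
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]
h = [0.2, -0.5, 0.9]

def S(x):
    """Softmax routing scores for input x under router W."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in W])

s = S(h)
# Analytic gradient of S_0 w.r.t. h from the identity
# grad_h S_j = S_j * sum_i (1_{ji} - S_i) * W_i   (here j = 0):
analytic = [s[0] * ((1 - s[0]) * W[0][i] - s[1] * W[1][i]) for i in range(3)]
# Central finite differences for comparison.
eps = 1e-6
numeric = []
for i in range(3):
    hp, hm = list(h), list(h)
    hp[i] += eps
    hm[i] -= eps
    numeric.append((S(hp)[0] - S(hm)[0]) / (2 * eps))
```

The two gradients agree to finite-difference precision, confirming the form of ∇_h S_j used in the expansion of J_2.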

C Additional Experiments
This section provides the implementation details of our experiments in Sec. 4.

C.1 General Setting
Our experiments are conducted based on the publicly available SMoE-Dropout (Chen et al., 2023) implementation. The pre-training experiments were conducted on a single A40 or A100 GPU, while the finetuning experiments were conducted on a single GeForce RTX 3090 GPU. We emphasize that parallel training on multiple GPUs might yield different results. It is also worth noting that the parameter overhead is fixed as we scale up to larger transformer models.
For the medium and large variants, we scale the model to eight and twelve layers, respectively.

C.2 Pre-training Experiments
Tab. A1 provides the implementation details for pre-training our TransformerXL small and medium models on enwik8 and WikiText-103. The TransformerXL large network was trained in the same manner, but only for 100K iterations.

Overall, the additional trainable parameters introduced by HyperRouter are negligible. Moreover, the number of frozen parameters (the hypernetworks) is quite small compared to the transformer backbone (5.7%). Investigating sharing the hypernetworks across layers or generating the routers coordinately (Pham et al., 2022) could further reduce this cost while improving performance, which we leave for future work.

D Routers' Entropy and Output Visualization

This section provides the full details of the routers' entropy and visualizes their distributional outputs, supplementing Section 4.4. Tab. A5 reports the mean and standard deviation of the entropy at each router; it is the full version of Tab. 4. All methods have rather low standard deviations, indicating that the differences are significant. We also provide an illustrative example of the routers' outputs using a randomly picked sample from the test set in Fig. 4. Here we can clearly see that the policies from SMoE and SMoE-Dropout are much closer to uniform, while HyperRouter's policies are much sharper and have lower entropy. We emphasize that this example is not cherry-picked, since Tab. A5 already reports the entropy averaged over all samples. Overall, these results offer insight into how HyperRouter can outperform other state-of-the-art SMoE strategies.

E Future Work
Our HyperRouter opens several promising avenues for future research. In particular, we believe that investigating the two HyperRouter components, (i) the fixed hypernetwork and (ii) the trainable embedding, can yield further performance and efficiency gains. Potential directions include incorporating regularization such as an ℓ2-penalty or dropout (Peng et al., 2015), or using better hypernetwork initialization techniques (Chang et al., 2020). Furthermore, the current implementation uses one hypernetwork per transformer layer, which generates all parameters of that layer's router. Sharing hypernetworks among layers or generating the router coordinate-wise can offer knowledge sharing (Yin et al., 2021; Pham et al., 2022), which can further improve the results.

Figure 1:
Figure 1: Perplexity (log-scaled) on the WikiText-103 dataset with varying numbers of experts used for inference. All methods have the same FLOPs.

Figure 2:
Figure 2: An illustration of our HyperRouter, which dynamically generates the router's parameters from a fixed hypernetwork. Yellow modules, including the router embedding and the experts, are trainable, while the gray module, the hypernetwork, is frozen. Best viewed in colors.

The TopK(•, k) function keeps the largest k values of a given vector while setting the remaining values to zero; σ(•) is the standard Softmax function. In practice, only a few experts are activated for the current token (k ≪ N) for efficient training.

SMoE-Dropout. SMoE-Dropout (Chen et al., 2023) is a state-of-the-art strategy that addresses the representation collapse issues (Sec. 2) by simply fixing a randomly initialized router to improve token routing consistency. Furthermore, SMoE-Dropout gradually increases the number of experts chosen throughout training, starting with k = 2 and ending at k = N. During evaluation, SMoE-Dropout proposes using half or all of the experts to balance efficiency and performance.
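The gradual growth of the number of active experts from k = 2 to k = N can be sketched as a simple linear ramp over training steps. The exact schedule shape used in the released code may differ, so this is only an assumed illustration.

```python
def experts_at_step(step, total_steps, n_experts, k_start=2):
    """Linearly grow the number of active experts from k_start to n_experts."""
    frac = min(step / max(total_steps, 1), 1.0)
    return min(n_experts, k_start + int(frac * (n_experts - k_start)))

# Example: a 16-expert model over a 100-step training run starts at k = 2
# and ends at k = 16, never decreasing along the way.
schedule = [experts_at_step(s, 100, 16) for s in range(101)]
```

At evaluation time, k can then be chosen freely (e.g., half or all experts) to trade off compute against performance.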

Figure 3:
Figure 3: An illustrative comparison among SMoE, SMoE-Dropout, and our HyperRouter. The input token is a representation vector (possibly the output from the previous layer). Yellow modules are trainable. Gray modules are frozen, indicated by a lock symbol. Best viewed in colors.
Model architecture. The small TransformerXL variant (Chen et al., 2023) consists of 4 Transformer decoder layers with an input dimension of 256. Each layer consists of a self-attention layer with 8 attention heads, followed by a feedforward network with an inner dimension of 512. The dropout ratio is kept at 0.1. We split the feedforward network into 16 experts of the same dimension. For HyperRouter, we initialize the embeddings with size 256 and employ a 2-layer perceptron with an inner dimension of 256 as the hypernetwork, using ReLU as its activation function. In total, HyperRouter introduces only 1,024 additional trainable parameters to the TransformerXL model (one 256-dimensional embedding per layer × 4 layers).

Figure 4:
Figure 4: Visualization of the distribution of the routers' output.

Table 1:
Bit-per-character and perplexity on the enwik8 and WikiText-103 test sets, respectively. Lower is better. k denotes the number of experts chosen during inference. The best results are in bold.
We also compare with two dense baselines: (i) Dense, the standard training of transformers where no routing mechanisms are implemented; and (ii) Dense+Dropout, similar to Dense but with Dropout (Srivastava et al., 2014) inserted in the fully connected layers. The benefit of SMoE training is efficiency during inference by using fewer experts. Our HyperRouter substantially outperforms both SMoE and SMoE-Dropout on both datasets in this regime. Notably, HyperRouter significantly

Table 2:
Bit-per-character on the enwik8 test set using the TransformerXL medium and large models. Lower is better. k denotes the number of experts chosen during inference, and SD denotes the SMoE-Dropout method. The best results are in bold.

Table 3:
Transfer performance (accuracy) on the SST-2, SST-5, IMDB, and BANKING77 datasets. Higher is better. k denotes the number of experts chosen during inference. The best results are in bold.
outperforms SMoE-Dropout when using only one expert, reducing BPC from 3.02 to 1.48 on enwik8, and perplexity from 560.93 to 65.17 on WikiText-103. Overall, with eight experts or fewer, HyperRouter consistently outperforms the other methods at the same inference cost.

Table 4:
Average entropy of the distribution of the routers' output on the enwik8 dataset. Lower is better.

Table A2:
Implementation details for the finetuning experiments on four different datasets.

Table A3:
Inference FLOPs (×10^10) on the enwik8 dataset. k denotes the number of experts used during inference.
C.5 Parameter Comparison

Tab. A4 provides the number of parameters in different components of SMoE-Dropout and HyperRouter. There are three categories: (i) trainable parameters, i.e., the transformer backbone and the router embeddings; (ii) frozen parameters, i.e., the hypernetworks; and (iii) parameters dynamically generated in each iteration, i.e., the routers.

Table A4:
Number of parameters in different components of SMoE-Dropout and HyperRouter during training. Blue parameters are trainable, red parameters are frozen, and underlined parameters are dynamically generated in each iteration.

Table A5:
Average entropy of the distribution of the routers' output on the enwik8 dataset. Lower is better.