Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity

Mixture-of-experts (MoE) models that employ sparse activation have demonstrated effectiveness in significantly increasing the number of parameters while maintaining low computational requirements per token. However, recent studies have established that MoE models are inherently parameter-inefficient as the improvement in performance diminishes with an increasing number of experts. We hypothesize this parameter inefficiency is a result of all experts having equal capacity, which may not adequately meet the varying complexity requirements of different tokens or tasks. In light of this, we propose Stratified Mixture of Experts (SMoE) models, which feature a stratified structure and can assign dynamic capacity to different tokens. We demonstrate the effectiveness of SMoE on three multilingual machine translation benchmarks, containing 4, 15, and 94 language pairs, respectively. We show that SMoE outperforms multiple state-of-the-art MoE models with the same or fewer parameters.


Introduction
Scaling up the model and data size has shown tremendous success in enhancing model performance across a large number of NLP tasks (Devlin et al., 2019; Conneau et al., 2020; Kaplan et al., 2020; Brown et al., 2020). Sparsely gated mixture of experts (MoE) (Shazeer et al., 2017; Lepikhin et al., 2021) provides an effective way to greatly scale the model size under the same computational cost and achieves state-of-the-art performance on various tasks including natural language understanding (Fedus et al., 2021), machine translation (NLLB Team et al., 2022), and language modeling (Du et al., 2022).
The efficiency comes from sparsely activating a subset of the neural network weights for each incoming sample. However, MoE is reported to be parameter-inefficient (Hoffmann et al., 2022; Zuo et al., 2021; Gao et al., 2022), i.e., there are diminishing returns from adding more experts. For example, Switch Transformer (Fedus et al., 2021) only outperforms T5 (Raffel et al., 2020) by an average of 0.7 points on the GLUE benchmark (Wang et al., 2018) despite being 35× larger. Similarly, on the translation task, an MoE model with 20 times more parameters offers only an average improvement of 0.3 BLEU on its ablation dataset (MoE-64 vs. 1.3B dense) (NLLB Team et al., 2022).
We hypothesize that this parameter inefficiency stems from the equal capacity assignment, where we define 'capacity' as the number of parameters used for an incoming token. In current MoE models, all experts have the same capacity and serve all tokens. However, different tokens may demand varying capacities. For instance, in multilingual machine translation, certain translation directions may necessitate a greater capacity to prevent overfitting, while others require only a smaller one. To address this limitation, we posit that dynamically allocating capacity to tokens results in more efficient utilization of parameters. Thus, we propose Stratified Mixture of Experts (SMoE) models, characterized by a stratified structure, which allows for the dynamic assignment of capacity to incoming tokens.
A high-level comparison of vanilla MoE and SMoE is presented in Figures 1a and 1b. In vanilla MoE, a single routing gate connects to all E experts and sends tokens to the top-k among them; here, we take E=5 as an example. In SMoE, the experts are divided into two strata. Each stratum has its own routing gate that connects to all experts in the current stratum as well as all experts in the subsequent strata. If Gate 1 assigns tokens to Expert 4 or 5, the tokens only need to pass through a single expert (an FFN layer). However, if tokens are sent to experts in the first stratum (Experts 1 to 3), they must go through the next stratum as well, meaning that another expert will be assigned by Gate 2 before they exit the SMoE block. This allows SMoE to dynamically assign capacity to different tokens. In addition, a comprehensive illustration of the architectural design of the SMoE model is provided in Figure 1c; a thorough explanation of the design elements will be provided in Section 3. Our main contributions are summarized as follows:
• We introduce the concept of dynamic capacity for MoE models and propose a mixture-of-experts model with a stratified structure, namely SMoE, which automatically assigns dynamic capacity to different incoming tokens so that experts become more parameter-efficient.
• We focus on the task of multilingual machine translation (MMT) and show that SMoE substantially outperforms numerous strong baselines with the same number of parameters or fewer. For instance, we demonstrate that SMoE needs only half the number of parameters to achieve performance on par with a naive MoE (Lepikhin et al., 2021). Furthermore, we carry out an in-depth analysis to probe the factors that impact dynamic capacity assignment, including the language of tokens and the position of the SMoE block within the model's architecture.

Background and Related Work
Massively multilingual machine translation models have been developed to handle several translation directions simultaneously in a single model (Aharoni et al., 2019). However, the use of shared parameters for different languages often leads to negative transfer and decreased performance (Conneau et al., 2020; Fan et al., 2020). In contrast to dense MMT models, sparsely gated mixture-of-experts (MoE) models, which activate a subset of parameters for each input, have been shown to significantly improve translation performance (Kim et al., 2021; NLLB Team et al., 2022). Shazeer et al. (2017) first demonstrated the benefit of adding MoE layers to scale RNN models for improved translation performance, and Lepikhin et al. (2021) extended this work to transformer architectures (Vaswani et al., 2017). In an MoE layer with E experts, each expert e is an FFN sublayer computing FFN_e(x) = W_e^2 ReLU(W_e^1 x), where W_e^1 and W_e^2 are the weights of FFN_e. A trainable routing gate with weights W_g predicts scores for these experts to use for the input x, in the form of a routing vector G ∈ R^E:

G = softmax(W_g x).

We select the set of top-K experts, denoted E ⊂ {1, ..., E}, and compute the output of the MoE layer as follows:

y = Σ_{e∈E} G_e FFN_e(x).

MoE models suffer from the notorious load imbalance issue, where the gate weights can collapse and send most tokens to the same expert. As a result, recent research has focused on designing better auxiliary load balancing loss functions to encourage tokens to be evenly distributed across experts: e.g., Lewis et al. (2021) formulated token-to-expert allocation as a linear assignment problem, Roller et al. (2021) modified the feed-forward layer to hash to different sets of weights depending on the current token, Zoph et al. (2022) proposed a router z-loss that resolves instability issues, and Zhou et al.
(2022) conversely designed an expert-to-token allocation algorithm. Other lines of investigation in MoE include regularization techniques such as gating dropout (Liu et al., 2022) and output masking (EOM and FOM) (NLLB Team et al., 2022), as well as novel MoE architectures, such as conditional MoE routing (CMR), which adds an extra branch beside the MoE layer (NLLB Team et al., 2022), or Pyramid-Residual MoE (Rajbhandari et al., 2022), a hybrid dense-and-MoE model with more experts in the last layers.
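To make the routing concrete, the standard top-k MoE layer described above can be sketched for a single token as follows; this is a minimal NumPy illustration with assumed shapes and variable names, not any paper's actual implementation.

```python
import numpy as np

def ffn(x, w1, w2):
    """One expert: a two-layer feed-forward network with a ReLU."""
    return np.maximum(x @ w1, 0.0) @ w2

def moe_layer(x, w_g, experts, k=2):
    """Vanilla top-k MoE for a single token x of dimension d.

    w_g: (d, E) gate weights; experts: list of E (w1, w2) weight pairs.
    """
    logits = x @ w_g
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # routing vector G in R^E
    topk = np.argsort(probs)[-k:]         # indices of the top-k experts
    # Output: gate-weighted combination of the selected experts' outputs.
    return sum(probs[e] * ffn(x, *experts[e]) for e in topk)
```

With k=1 this reduces to Switch-style routing; with k=2 it matches the Top-2 gating of Lepikhin et al. (2021).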
However, all previous work defaults to equal capacity for all tokens regardless of their language, frequency, or any other property. In the subsequent sections, we present a Stratified Mixture of Experts (SMoE) model that automatically assigns dynamic capacities to different types of tokens.
Stratified Mixture of Experts

Architectural Design
The guiding design principle for Stratified Mixture of Experts (SMoE) is to assign dynamic capacity to tokens. Given E experts in an MoE block, we partition them into L strata. The i-th stratum has a gate Gate_i which routes tokens to an expert in the current stratum or in the subsequent ones. This means that tokens can never be sent back to previous strata. Tokens keep getting routed within the SMoE block until they reach the final stratum. Different tokens pass through a varying number of experts, resulting in different capacities, according to the assignment of the gates. In vanilla MoE, by contrast, every token goes through the capacity of a single FFN layer. The workflow of how a token passes through an SMoE block is shown in Figure 1c. For example, some tokens in the 1st stratum may be assigned to experts in the 2nd stratum while others are sent to the 3rd stratum. After several rounds of assignments, tokens finally exit the block upon reaching the last stratum.
In SMoE, the successive application of multiple FFN layers to a token can result in training instability. Therefore, following the approach of Vaswani et al. (2017), Wang et al. (2019), and Xiong et al. (2020), we incorporate layer normalization (LayerNorm, Ba et al. (2016)) before dispatching tokens to experts and a residual connection after the tokens have passed through the experts. See Figure 1c for an overview of the design of our stratified experts.
Formally, given T tokens in a mini-batch, we denote the d-dimensional representation of the t-th token in the i-th stratum of the current SMoE block by x_{i,t}. Let E_i be the set of experts visible to the current gate (the current stratum plus all subsequent strata) and let E_i = |E_i| be its cardinality. Before being dispatched to FFN layers, tokens are first normalized with LayerNorm:

x̃_{i,t} = LayerNorm(x_{i,t}).

Then, Gate_i with weights W_i predicts a probability distribution G_t ∈ R^{E_i}, scoring all E_i visible experts at that stratum:

G_t = softmax(W_i x̃_{i,t}).

Following Lepikhin et al. (2021), we dispatch each token to at most k=2 experts. If E is the set of selected top-k experts and the expert with the highest score lies in the j-th stratum (j ≥ i), x_{i,t} is assigned to the j-th stratum and the output computation on the token is

h_{i,t} = Σ_{e∈E} G_{t,e} FFN_e(x̃_{i,t}),

where G_{t,e} is the gate score for expert e. Finally, we employ a residual connection after the FFN layer:

x_{j+1,t} = x_{i,t} + h_{i,t}.

Tokens gradually pass experts in deeper strata until they have passed the final stratum.
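The per-token workflow can be sketched as follows. This is a simplified single-token NumPy illustration: the shapes, and the exact rule for combining top-k experts that span strata, are assumptions on our part rather than the paper's implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ffn(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2

def smoe_block(x, gates, strata, k=2):
    """One SMoE block applied to a single token x (a simplified sketch).

    strata[i]: list of (w1, w2) expert weights in stratum i.
    gates[i]:  (d, E_i) weights scoring every expert visible from stratum i,
               i.e. those in stratum i and in all subsequent strata.
    """
    L = len(strata)
    i, n_passed = 0, 0
    while i < L:
        # Experts visible from stratum i, tagged with their stratum index.
        visible = [(s, e) for s in range(i, L) for e in strata[s]]
        h = layer_norm(x)                     # pre-dispatch LayerNorm
        probs = softmax(h @ gates[i])         # one score per visible expert
        top = np.argsort(probs)[-k:]          # top-k selection
        # Gate-weighted sum of the selected experts, plus a residual.
        x = x + sum(probs[t] * ffn(h, *visible[t][1]) for t in top)
        n_passed += 1
        # Advance past the stratum of the highest-scoring expert.
        i = visible[top[-1]][0] + 1
    return x, n_passed   # n_passed rounds of expert dispatch = dynamic capacity
```

A token routed straight to the last stratum exits after one dispatch; a token routed to an earlier stratum is dispatched again by the next gate, which is how capacity becomes dynamic.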

Load Balancing
Similar to Lepikhin et al. (2021), we encourage tokens to be uniformly distributed across all visible experts. Each gate has a loss term to balance the load. For Gate_i, the loss is

L_i = E_i Σ_{e∈E_i} f_e p_e,

where f_e is the fraction of tokens dispatched to expert e, as their first choice, through top-k gating:

f_e = (1/T) Σ_{t=1}^{T} 1[argmax_{e'} G_{t,e'} = e],

and p_e is the average routing probability to that expert over the T tokens in the mini-batch:

p_e = (1/T) Σ_{t=1}^{T} G_{t,e}.

The auxiliary loss for the current SMoE block is computed by taking the average of the loss over all gates within the block:

L_aux = (α/L) Σ_{i=1}^{L} L_i,
where α is a hyperparameter that controls the strength of the load balancing loss. We average this loss over all SMoE blocks in the architecture to obtain the final auxiliary loss, which is added to the original task loss.
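The quantities f_e and p_e are straightforward to compute from the routing probabilities. Below is a minimal sketch for one gate, in the style of the GShard-type loss of Lepikhin et al. (2021); array shapes are our own assumption.

```python
import numpy as np

def load_balance_loss(probs):
    """Auxiliary load-balancing loss for one gate (a sketch).

    probs: (T, E) routing probabilities for T tokens over the E experts
    visible to this gate.
    """
    T, E = probs.shape
    top1 = probs.argmax(axis=1)               # first-choice expert per token
    f = np.bincount(top1, minlength=E) / T    # f_e: fraction of first choices
    p = probs.mean(axis=0)                    # p_e: mean routing probability
    return E * float(f @ p)                   # equals 1.0 under a uniform spread

# The block-level auxiliary loss averages this over the block's gates, and the
# final loss is scaled by the coefficient alpha (0.01 in this paper).
```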

Experiments
We evaluate the proposed SMoE on a many-to-many multilingual neural machine translation task.

Datasets
In this study, we consider three datasets comprising 4, 15, and 94 languages, respectively. The first two datasets are extracted from the primary bitexts of the NLLB-200 training dataset, and we adopt its resource-level categorization: high-resource (≥ 1M sentence pairs), very low-resource (≤ 100K), and low-resource (the remainder). These two datasets are developed and evaluated using the Flores-200 dataset. The third dataset is OPUS-100 (Zhang et al., 2020). We follow Zhang et al. (2020) and divide directions into high-resource, medium-resource, and low-resource categories.

OPUS-100
In addition to the datasets derived from NLLB, we also utilize OPUS-100 to examine a scenario involving a larger number of languages. OPUS-100 encompasses a total of 100 languages and supports 94 development/test language pairs.
Evaluation. During inference, we use beam search with a beam size of 5 and a length penalty of 1.0. We report BLEU scores (Papineni et al., 2002) for models trained on the NLLB datasets, and sacreBLEU (Post, 2018) with the flores200 tokenizer for models trained on OPUS-100.

Baselines
We use five strong baselines to evaluate the effectiveness of SMoE. All baselines are our own implementations, following the settings from the original papers. Note that the total number of experts, i.e., the full model capacity, is kept constant for all models to ensure a fair comparison (8 experts for M4 and 16 experts for M15).

Switch Transformer.
An MoE model with Top-1 gating (Fedus et al., 2021). Switch Transformer was introduced to mitigate training instabilities and improve the efficiency of Top-2 MoE models. Note that Switch Transformer uses fewer FLOPs per token due to its top-1 gating approach.

MoE + EOM.
A vanilla MoE model regularized with Expert Output Masking (EOM) (NLLB Team et al., 2022). EOM masks the expert output for a random fraction (p_eom) of tokens. We set p_eom = 0.1 following the suggestion of NLLB Team et al. (2022).

Training Details
Following Johnson et al. (2017), we prepend source sentences with a special language token <2xxx> to indicate the target language. We use a data sampling temperature of T=1, as suggested by NLLB Team et al. (2022), to train on the NLLB datasets, and T=5, as suggested by Zhang et al. (2020), to train on OPUS-100.
The dense model architecture, the backbone for all trained models, is a Transformer (Vaswani et al., 2017) with 12 layers (6 in the encoder and 6 in the decoder). We use transformer-base with E=8 experts for M4, and transformer-big with E=16 experts for M15 and OPUS-100 (transformer-base: FFN dimension of 2048, 8 attention heads, and embedding dimension of 512; transformer-big: FFN dimension of 4096, 16 attention heads, and embedding dimension of 1024). In MoE models, every other FFN layer of the encoder and decoder is substituted with an MoE layer. For SMoE, in the i-th stratum, we enforce that each expert processes at most 2 × T_i/E_i tokens, where T_i is the number of tokens in the mini-batch sent to stratum i and E_i is the number of visible experts. For the other MoE baselines, the limit is 2 × T/E, where T is the number of tokens in the mini-batch and E is the total number of experts. The multiplicative coefficient α for the auxiliary load balancing loss is set to 0.01. We build a vocabulary of size 32K for both M4 and M15 and 64K for OPUS-100 with SentencePiece (Kudo and Richardson, 2018). For a fair comparison, all models are trained for the same number of updates. More details can be found in Appendix B.
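The per-stratum capacity limit 2 × T_i/E_i can be computed as below; rounding up is our assumption, since the text does not specify how fractional capacities are handled.

```python
import math

def expert_capacity(num_tokens, num_visible_experts, factor=2):
    """Max tokens one expert may process at a stratum: factor * T_i / E_i.

    num_tokens: tokens in the mini-batch reaching this stratum (T_i);
    num_visible_experts: experts visible to this stratum's gate (E_i).
    Rounding up when the ratio is fractional is an assumption.
    """
    return math.ceil(factor * num_tokens / num_visible_experts)
```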

SMoE configurations
We use a series of numbers separated by hyphens to describe the SMoE configuration. For instance, SMoE-4-4-8 indicates that all MoE blocks have 3 strata, where the 1st stratum has 4 experts, the 2nd has 4, and the 3rd has 8.
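A configuration string of this form can be parsed mechanically; the helper name and return format below are our own illustration.

```python
def parse_smoe_config(name):
    """Parse a configuration string like 'SMoE-4-4-8' into the
    per-stratum expert counts, e.g. [4, 4, 8]."""
    return [int(n) for n in name.split("-")[1:]]
```

For example, `parse_smoe_config("SMoE-4-12")` gives `[4, 12]`: two strata totaling 16 experts.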
It is worth noting that simply stacking MoE layers degrades model performance, which underlines the importance and effectiveness of the specific design of SMoE. Note that for vanilla MoE to achieve performance similar to SMoE-4-12 (32.94 vs. 32.93 BLEU on average), it has to increase the number of experts from 16 to 32, which almost doubles the total number of parameters from 963M to 1.77B (last row of Table 1); our model is thus much more parameter-efficient. SMoE models with more strata, allowing for more depth, do not guarantee better performance: a clear example is SMoE-4-12 vs. SMoE-4-4-4-4 on M15 (32.93 vs. 32.75 averaged BLEU). However, for 'balanced' SMoE configurations (an equal number of experts per stratum), fewer experts per stratum achieves better performance: SMoE-2-2-2-2 outperforms SMoE-8-8 (32.75 vs. 32.50 BLEU) on M15.

M15 Results. We show the results in Table 2.
OPUS-100 Results: A Larger Performance Gap. Table 3 presents the comprehensive results. Notably, the performance disparity becomes more pronounced as we scale our experiments to 94 languages (OPUS-100 does not support the remaining 5 languages).
Overall, SMoE outperforms all baselines with the same number of parameters or fewer.

Computational Cost
As a token may pass through multiple strata in a given SMoE layer, the average computational cost is higher than that of other MoE models. In the last column of Tables 1 and 2, we record FLOPs per token for all models, and Table 3 reports the same information as Table 2. Although SMoE requires more FLOPs per token, the additional cost is only a marginal increase over vanilla MoE models. For example, on M15 and OPUS-100, our best setting SMoE-4-12 uses merely 8% more FLOPs/tok than the other top-2-gating MoE models, yet significantly outperforms all of them.

Analysis
The advantage of SMoE is that it can assign dynamic capacity to different tokens, with some tokens passing through only one expert and others passing through multiple experts. Here, we define the Requested Capacity (RC) as the average number of experts that a token needs to pass in one SMoE block. The RC of a token is dictated by how the SMoE gates route it through the different strata.
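RC is simply a mean over per-token expert counts. A small sketch, assuming a hypothetical log format of per-token pass counts, with an optional grouping by language in the spirit of the analysis that follows:

```python
def requested_capacity(passes_per_token):
    """RC: the average number of experts the given tokens passed in a block."""
    return sum(passes_per_token) / len(passes_per_token)

def rc_by_language(langs, passes_per_token):
    """Average RC per token language, from hypothetical per-token logs."""
    grouped = {}
    for lang, n in zip(langs, passes_per_token):
        grouped.setdefault(lang, []).append(n)
    return {lang: requested_capacity(v) for lang, v in grouped.items()}
```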
To understand what may affect the RC of tokens, we examine three potential influencing factors: the language of the input token, the frequency of the token, and the depth of the SMoE block. All analyses are conducted using the SMoE-4-4-4-4 model trained on the M15 dataset.

The Language of The Input Token
Here, we investigate whether different languages have different RCs. We begin by collecting the average RC over all translation directions for all tokens in the training and development (Flores-200 dev) sets. We investigate SMoE blocks in the encoder and decoder separately, as they process different tokens (source tokens for the encoder and target tokens for the decoder). The average RC is then averaged across all SMoE blocks in either the encoder or the decoder. Figure 2a shows the average RC in the decoder for each translation direction in M15. When translating into English (xxx→eng, blue bars), we observe that the target English tokens have a similar RC on the decoder side (≈1.85 experts) irrespective of the source language. When translating from English (eng→xxx), the RC varies considerably with respect to the target language.
Unlike in the decoder, where only the target language matters, Figure 2b shows variability in RC with respect to both source and target languages, i.e., not only the language of the tokens themselves (source), but also the target language we will be translating into once we move to the decoder. We hypothesize that the special symbol at the beginning of the source sequence (<2xxx>) can affect the capacity assignment. In conclusion, capacity assignment is sensitive to the target language in the decoder and to the translation direction in the encoder.

Token Frequency
As in the previous section, we record the average RC for all tokens in the training and development data, across all translation directions. To avoid looking at all 32K tokens in our vocabulary, we select the top-25 tokens with the highest RC in each SMoE block and each translation direction, totaling 4,500 tokens. We similarly collect the bottom-25 tokens with the lowest RC. After removing tokens repeatedly selected by different directions or different SMoE blocks, we end up with 2,580 unique high-RC tokens and 3,208 unique low-RC tokens. Figure 3 shows a violin plot of the distribution of tokens in these two groups in terms of their frequency in the training data. We rank the frequencies on the y-axis so that a lower rank means a more frequent token, e.g., rank 0 corresponds to the most frequent token in our training data. The results show that there is no strong correlation between frequency and RC for tokens with the highest RC. On the other end of the spectrum, tokens with the lowest RC tend to be high-frequency tokens, as indicated by the right violin plot being wider at the bottom (rank < 10k). Many of these high-frequency tokens are basic subword units (like _li, _el, _pa) or punctuation marks. One can interpret RC as a metric for 'difficulty in processing a given token': the model was overly exposed to these frequent tokens and, as such, does not require a lot of capacity to process them.

Location of The SMoE Block
In this section, we analyze the average RC in relation to the location of the SMoE block in the transformer architecture. As depicted in Figure 4, RC varies depending on the location of the SMoE block. Early encoder layers (the encoder's 2nd layer holds the first SMoE block in the model) request more capacity than the subsequent encoder SMoE blocks. We hypothesize that this first block takes on the task of mapping tokens coming from different languages and different scripts to a shared space.

Conclusion
This work presents Stratified Mixture of Experts (SMoE) models, a novel design for MoEs that is capable of dynamically assigning capacity to input tokens. Through experimental evaluation at three scales (M4, M15, and OPUS-100), we have demonstrated that the proposed SMoE model surpasses the performance of many current state-of-the-art MoE methods. This shows that dynamically assigning capacity to tokens in MoE models is a viable solution to the parameter inefficiency of MoEs. Additionally, we conduct a thorough analysis to investigate the factors that influence dynamic capacity assignment, including the language of the tokens and the location of the SMoE block within the model architecture.

Limitations
Stratified Mixture of Experts (SMoE) aims to improve the performance of Mixture of Experts (MoE) models by assigning dynamic capacity to different tokens. While SMoE has demonstrated performance improvements over many state-of-the-art baselines, it also comes with an additional computational cost compared to traditional MoE models. However, the cost is small, and the benefits of SMoE in terms of improved performance often outweigh it, especially in tasks where performance is critical. For example, on OPUS-100, with only 8% more FLOPs/tok, SMoE-4-12 achieves +1.01 BLEU compared with a traditional MoE (Lepikhin et al., 2021).

Figure 1 :
Figure 1: A high-level illustration of vanilla MoE and SMoE. a) Vanilla MoE: a gate is connected to all experts and sends tokens to the top-k selection. b) SMoE: experts are stratified into L strata (L=2 in this example). Each stratum has a gate that is connected to all subsequent experts. Tokens can be sent directly to the last stratum to experience only one expert, or be sent to both strata and receive more capacity. Hence, the dynamic capacity of a token depends on how many experts it needs to pass. c) A detailed architectural design; a comprehensive explanation of the design components is presented in Section 3.

Figure 2 :
Figure 2: Average requested capacity (RC) of all tokens in each translation direction. The blue bars are for xxx→eng directions and the purple bars are for eng→xxx directions. Directions in each subset are sorted from high-resource to low-resource. On the decoder side, the average RC of eng tokens is similar regardless of the source language, but the average RC has a large variance when the target language differs. On the encoder side, RC always differs even when the source language is the same.

Figure 3:
Figure 3: Violin plots of the token frequency for high-RC (left) and low-RC (right) tokens. Unlike high-RC tokens, low-RC tokens tend to be highly frequent ones.

Figure 4:
Figure 4: Average requested capacity (RC) by location of the SMoE block in the model.
NLLB M15 dataset. Taking into account linguistic diversity and a larger data size, we expand the M4 dataset to cover a set of 15 diverse languages. M15 covers 6 linguistic families and a balanced number of high-resource, low-resource, and very low-resource languages (each category has 5 languages). We show a detailed listing and information on the M15 dataset in Appendix A.
CMR (NLLB Team et al., 2022) augments MoE with a binary gate that sends tokens to one of two branches: (1) a shared FFN layer and (2) a vanilla MoE layer. Note that this method requires extra parameters due to the added shared FFN layer. The CMR budget constraint is set to 0.8.

Table 1 :
Overall BLEU results on the M4 dataset. The best values are in bold and the second-best values are underlined. The number of experts is 8 for all methods. The two SMoE models attain the two best performances across all languages.

Table 2 :
Overall BLEU results on the M15 dataset. The best values are in bold and the second-best values are underlined. Unless otherwise mentioned, the number of experts is 16. All SMoE models outperform the baselines.
The best setting is SMoE-4-12, which outperforms vanilla MoE by +0.93 BLEU; vanilla MoE would need to double its parameters to achieve similar performance to SMoE-4-12.

Table 3 :
Overall BLEU results on the OPUS-100 dataset. The best values are in bold and the second-best values are underlined. The number of experts is 16. We consider the two best settings from the M15 dataset, SMoE-4-12 and SMoE-4-4-4-4; both substantially outperform all baselines. The number of parameters and FLOPs/tok for MoE models are the same as in Table 2.