Adaptive Gating in Mixture-of-Experts based Language Models

Large language models, such as OpenAI's ChatGPT, have demonstrated exceptional language understanding capabilities in various NLP tasks. Sparsely activated mixture-of-experts (MoE) has emerged as a promising solution for scaling models while keeping the number of computational operations constant. Existing MoE models adopt a fixed gating network where each token is computed by the same number of experts. This contradicts our intuition that the tokens in each sequence vary in linguistic complexity and, consequently, require different computational costs. Prior research has paid little attention to the trade-off between computation per token and model performance. This paper introduces adaptive gating in MoE, a flexible training strategy that allows tokens to be processed by a variable number of experts based on the expert probability distribution. The proposed framework preserves sparsity while improving training efficiency. Additionally, curriculum learning is leveraged to further reduce training time. Extensive experiments on diverse NLP tasks show that adaptive gating reduces training time by up to 22.5% while maintaining inference quality. Moreover, we conduct a comprehensive analysis of the routing decisions and present our insights from using adaptive gating.


Introduction
The field of natural language processing (NLP) has undergone a remarkable revolution driven by rapid advancements in language models such as ChatGPT, LLaMA (Touvron et al., 2023), Bard, and PaLM. They exhibit so-called "emergent" capabilities for a wide variety of applications (Wei et al., 2022). However, as demands for these applications continue to grow, the scalability of these models poses an increasingly challenging hurdle due to constraints in computational resources, memory capacity, interconnect bandwidth, etc. (Pope et al., 2023).
Sparsely-activated mixture-of-experts (MoE) is a promising paradigm to address the scalability issue while maintaining a constant number of computation FLOPs (Lepikhin et al., 2020; Fedus et al., 2021). MoE utilizes an ensemble of experts to collectively tackle the learning task. Each input activates a subset of experts, resulting in a dynamically-changing and sparse computation graph. This method effectively distributes the computation among experts, increases model capacity, and improves training efficiency (Du et al., 2022; Rajbhandari et al., 2022). Very recently, there has been considerable work on improving the performance of Transformers using MoE (Rajbhandari et al., 2022; Zoph et al., 2022; Chen et al., 2023a; Gale et al., 2022). Despite MoE's benefit in scalability, it suffers from suboptimal training efficiency. In this work, we focus on the gating mechanism that selects the experts for each token. Existing MoE models adopt fixed top-2 gating in training while employing top-1 gating during inference for shorter response times. Top-2 gating entails twice the computational cost per token and doubles the data transfer size of all-to-all operations compared to top-1 gating. Yet, it remains unclear whether top-2 gating actually leads to performance gains that justify the additional overheads. Therefore, a comprehensive analysis of the trade-off between training efficiency and model performance is increasingly crucial. More practically, how to construct an MoE language model that effectively balances training efficiency and performance is of great interest and immediate value.
Towards this end, we present a first attempt to empirically characterize and improve the efficiency of the gating mechanism in MoE. We observe that across various models and tasks, a large number of tokens display simple linguistic characteristics or a single dominant feature, which allows them to be effectively processed using just the top-1 expert. This observation suggests that the current top-2 gating strategy incurs unnecessary computation cost for a significant number of tokens.
Motivated by this insight, we introduce adaptive gating in MoE, which enables tokens to be processed by a flexible number of experts depending on the gating decision. In contrast to conventional MoE models, our approach preserves the sparsity of MoE while enhancing flexibility in token handling. We incorporate a threshold within the gating network to conduct adaptive token routing based on the distribution of expert probabilities. With adaptive gating, the majority of tokens use simple top-1 gating; top-2 gating is selectively applied only when necessary and beneficial, thus significantly reducing the computation cost. However, training efficiency does not improve as much as the computation cost, because tokens with top-2 gating prolong the training step and become the bottleneck. Therefore, to further enhance training efficiency, we leverage the idea of curriculum learning by strategically adjusting the order of training data samples.
We conduct extensive experiments on six NLP tasks with different encoder and decoder models. The results show that our approach effectively reduces end-to-end training time by up to 22.5%, while achieving inference quality comparable to top-2 gating MoE models. Moreover, we show that the tokens routed to two experts are coupled with the nature of each NLP task. In sentiment analysis, they are the tokens expressing neutral opinions; in translation, they belong to sentences with complex structure; in question answering, they are the key words that connect the question and the context; in summarization, they are pronouns and tokens expressing the central idea; in text completion, the top-2 routing decision changes with the token being generated; and in dialogue response, conversational tokens frequently use top-2 experts. Empirically, we find that a small threshold value (e.g., 0.1 or 0.2) in adaptive gating leads to performance similar to top-2 gating.
Our contributions are as follows: • We propose adaptive gating in the MoE training scheme, which enables tokens to be processed by a flexible number of experts.
• We leverage curriculum learning to alleviate the training bottleneck caused by varying execution times of tokens.
• We conduct extensive experiments on various NLP tasks and datasets and present a thorough analysis of the gating decision of the tokens to prove the effectiveness and efficiency of adaptive gating.

Background
2.1 Mixture-of-Experts

Mixture-of-Experts (MoE) has been adopted in various deep neural network models (Shen et al., 2023; Chen et al., 2023b) and has shown great promise in enhancing the performance of language models. For example, GShard (Lepikhin et al., 2020) and Switch Transformer (Fedus et al., 2021) effectively scale Transformer-based language models with MoE layers.
In particular, these models typically employ an MoE layer to substitute the feed-forward network (FFN) layer. The MoE layer comprises multiple FFNs, each acting as an expert, along with a gating network. Each expert i is a fully-connected two-layer network with ReLU activation and its own set of parameters. For a given token x, the output of expert i can be defined as:

$\mathrm{FFN}_i(x) = W^i_1 \cdot \mathrm{ReLU}(W^i_0 \cdot x)$,   (1)

where $W^i_0$ and $W^i_1$ are the trainable weights of the two linear layers in expert i.
The gating network takes the embedding vector of each token x and multiplies it with its trainable matrix $W_G$. The gate value for a specific token is determined through

$R(x) = \mathrm{softmax}(W_G \cdot x)$.   (2)

This softmax activation R indicates the weight of each expert in processing the token. The gating network then dispatches the token to the top-k experts with the k highest activations. The final output of the MoE layer is

$y = \sum_{e \in \text{top-}k} R(x)_e \cdot \mathrm{FFN}_e(x)$,   (3)

that is, the weighted sum of outputs from the selected expert(s).
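To make the layer structure concrete, the following is a minimal PyTorch sketch of a top-k gated MoE layer following Equations (1)-(3). The class names (Expert, TopKMoELayer) and the per-expert dispatch loop are our own illustrative choices, not the paper's implementation; production systems dispatch tokens with batched all-to-all communication instead of a Python loop.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    # A single expert: two-layer FFN with ReLU, as in Eq. (1).
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w0 = nn.Linear(d_model, d_ff)   # W_0^i
        self.w1 = nn.Linear(d_ff, d_model)   # W_1^i

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w1(F.relu(self.w0(x)))

class TopKMoELayer(nn.Module):
    # Classical top-k gated MoE layer (k=2 in standard training).
    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(Expert(d_model, d_ff) for _ in range(num_experts))
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)  # W_G
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        probs = F.softmax(self.w_gate(x), dim=-1)            # R(x), Eq. (2)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)    # top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):                            # weighted sum, Eq. (3)
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_probs[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out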

Design
We now discuss the design of adaptive gating in MoE for training.

Adaptive Gating in MoE
Observation. We first present our empirical findings from experiments with classical MoE models. Specifically, we extract the softmax activations and analyze the probability distribution of expert selection for each token in the gating network. Figure 1 depicts the normalized activation values of four sampled tokens across 16 experts. For tokens 1 and 4, the activations of the top-1 and top-2 experts are very close, as shown in Figures 1a and 1d, while for tokens 2 and 3 there is a significant bias towards the top-1 expert, as in Figures 1b and 1c. We find that such significantly biased distributions account for at least 55% of all tokens in our evaluation.
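As an illustration of how this statistic can be collected, the snippet below computes the fraction of tokens whose top-1 activation exceeds the top-2 activation by more than a chosen gap; the function name and the 0.1 gap are our own assumptions for the sketch, not values prescribed by the paper.

import torch
import torch.nn.functional as F

def biased_token_fraction(gate_logits: torch.Tensor, gap: float = 0.1) -> float:
    # Fraction of tokens whose top-1 expert probability exceeds the top-2 by more than `gap`.
    probs = F.softmax(gate_logits, dim=-1)
    top2 = probs.topk(2, dim=-1).values            # (num_tokens, 2), sorted descending
    return ((top2[:, 0] - top2[:, 1]) > gap).float().mean().item()

# With random logits the fraction is arbitrary; on real gating activations the
# paper observes such strongly biased tokens make up at least 55% of all tokens.
print(biased_token_fraction(torch.randn(4096, 16)))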
Adaptive gating. Previous work has demonstrated that MoE experts specialize in different linguistic aspects. Building upon our empirical findings, one can see that many tokens can be effectively handled by a single expert during the training stage. To control the number of experts handling each token, we introduce a threshold parameter, denoted as T.
If the difference between the activation values of the top-1 expert, denoted as i, and the top-2 expert, denoted as j, is within the threshold T, we consider the token as requiring both experts i and j for processing. Otherwise, we route the token only to the top-1 expert.

Load balancing loss. Adaptive gating uses a flexible number of experts to process each token. This flexibility, however, adds extra difficulty to the load balancing problem in training, which aims to evenly distribute tokens among all experts. Since it is still important to prevent the gating network from overly concentrating on a small number of experts, adaptive gating imposes the soft load balancing constraint on the top-1 gating decisions, while top-2 gating decisions are trained without any soft constraint. That is, the loss of each MoE layer i becomes

$\mathrm{loss}_i = E_i \cdot \sum_{e=1}^{E_i} f^1_e \cdot p_e$,

where $f^1_e$ is the fraction of tokens dispatched to expert e among those processed by top-1 gating; $p_e$ is the average gating probability to expert e over all tokens in the current batch; and $E_i$ is the number of experts at layer i, just as in classical MoE (Fedus et al., 2021).
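The sketch below illustrates the routing rule and the top-1-only balancing loss in PyTorch. It reflects our reading of the description above (in particular, which tokens count toward f_e^1), so function names and details are illustrative rather than the authors' released code.

import torch
import torch.nn.functional as F

def adaptive_gating(gate_logits: torch.Tensor, threshold: float = 0.2):
    # Per-token decision: always use the top-1 expert, and add the top-2 expert
    # only when its activation is within `threshold` of the top-1 activation.
    probs = F.softmax(gate_logits, dim=-1)               # R(x) from Section 2
    top2_probs, top2_idx = probs.topk(2, dim=-1)
    use_top2 = (top2_probs[:, 0] - top2_probs[:, 1]) <= threshold
    return probs, top2_idx, use_top2

def top1_load_balance_loss(probs, top2_idx, use_top2, num_experts: int):
    # Soft balancing loss over top-1-only decisions: E_i * sum_e f_e^1 * p_e.
    # Top-2 decisions are left unconstrained, as described above.
    top1_only = top2_idx[~use_top2, 0]                   # experts of tokens routed to one expert
    f1 = torch.bincount(top1_only, minlength=num_experts).float()
    f1 = f1 / max(top1_only.numel(), 1)
    p = probs.mean(dim=0)                                 # average gate probability per expert
    return num_experts * torch.sum(f1 * p)

# Example with random logits: 1024 tokens, 16 experts, threshold T = 0.2.
logits = torch.randn(1024, 16)
probs, top2_idx, use_top2 = adaptive_gating(logits, threshold=0.2)
loss = top1_load_balance_loss(probs, top2_idx, use_top2, num_experts=16)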

Batching
Challenge. While adaptive gating provides effective computational savings, the Transformer MoE model architecture poses a significant challenge to training efficiency. Specifically, there is a mismatch between the computation savings and the actual reduction in training time: tokens routed to one expert and tokens routed to two experts are batched together, so the step time is bounded by the tokens that require top-2 computation (Wang et al., 2021).
Adjust training data order. Our intuition is that the number of experts required by each token can be an indicator of the token complexity. We can therefore reorder the training data in a way that prioritizes simpler sequences during model training. Additionally, we can group together training data with similar complexity levels to minimize the bottleneck effect caused by difficult tokens in need of top-2 experts.
To quantify the complexity of a training sample d, we define a complexity vector

$C_d = (r_1, r_2, \ldots, r_L)$,

where L is the number of MoE layers in the model, and $r_i$ represents the ratio of tokens processed by top-2 experts to the sequence length (i.e., the total number of tokens in data sample d) at layer i.
To determine the order of the training data, we identify the data sample with the fewest tokens processed by top-2 experts, and calculate the cosine similarity between its complexity vector and those of the remaining data samples. The training data is then reordered based on this similarity value, starting from the most similar samples. This approach allows the model to gradually learn from simpler sequences and progressively handle more complex ones.
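A minimal sketch of this reordering is given below, assuming the per-layer top-2 ratios (the complexity vectors C_d) have already been collected for each sample, e.g., from gating statistics of a warm-up pass; the function name and the collection step are hypothetical.

import torch
import torch.nn.functional as F

def reorder_by_complexity(complexity: torch.Tensor) -> torch.Tensor:
    # complexity: (num_samples, L) matrix whose row d is C_d = (r_1, ..., r_L),
    # the per-layer ratio of tokens in sample d that used top-2 experts.
    # Returns an index permutation that puts the simplest samples first.
    anchor = complexity[complexity.sum(dim=1).argmin()]    # sample with fewest top-2 tokens
    sims = F.cosine_similarity(complexity, anchor.unsqueeze(0), dim=1)
    return torch.argsort(sims, descending=True)            # most similar (simplest) first

# Hypothetical usage with 8 samples and 4 MoE layers.
order = reorder_by_complexity(torch.rand(8, 4))
print(order)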

Evaluation
We evaluate adaptive gating in MoE on six NLP tasks using various encoder and decoder models.
We then analyze the gating decision to better understand the effectiveness of adaptive gating.

Tasks and Models
Table 2 summarizes the tasks, datasets, and models used in our evaluation.

Training Configurations
We use 8 A100 GPUs, each with 40 GB memory. Data and expert parallelism are used for distributed training. We distribute the experts evenly among all the GPUs. In terms of hyperparameters and model architecture, we adopt the default configurations established in existing models (Wolf et al., 2020; Kwon and Chung, 2023).

Overall Performance
We present the overall training and inference performance in Table 3. Overall, adaptive gating achieves performance comparable to the baselines while significantly reducing training time, even compared to top-1 gating. This is because, although top-1 gating maximizes the computation savings, it makes training more difficult to converge to the same loss value, eventually leading to slightly longer training time than top-2 gating in 4 out of the 6 tasks we run. An in-depth analysis of how adaptive gating works in connection to each task is presented in Section 4.5.

Sentiment analysis. Adaptive gating in MoE outperforms both the Dense model and top-2 gating MoE.

Dialogue response. The sparsity introduced by MoE is advantageous for this task. All three MoE approaches outperform the Dense model. Among all the tasks evaluated, dialogue response exhibits the highest percentage of tokens routed to two experts, 23.4%, indicating the heaviest use of the top-2 gating mechanism. Upon evaluating the tokens, we observe that this task can be viewed as a combination of all the other evaluated tasks.

Analysis and Insights
While it is intuitive to understand that some minor tokens (e.g., "a", "the", "is") only need top-1 expert to process, this does not fully explain how and why adaptive gating works in different NLP tasks.
Thus we analyze how the tokens are processed in training with adaptive gating, and make quite a few interesting observations that can help better answer this question. In a broader sense, we believe our insights are also instrumental towards building better language models.
Note that when a BPE tokenizer is used, we aggregate the results by mapping tokens back to natural language words and perform the analysis on the aggregated statistics.

Sentiment analysis. Sentiment analysis exhibits the lowest percentage of top-2 gating among all tasks, and the percentage is stable across layers (Figure 2a). The top-2 gating mechanism focuses on two main types of input here. First, it frequently selects tokens that express a more neutral opinion, since they are more difficult to classify (Table 4). Second, tokens associated with sarcastic statements, double negatives, or conflicting opinions are also commonly routed to two experts. Adaptive gating identifies these tokens early in the model as they are relatively easy to extract, which explains the stable percentage across layers. A special case is when the input does not explicitly convey any sentiment. Adaptive gating tends to initially route all tokens to either the top-1 or top-2 experts and gradually narrows down to the more informative tokens. A typical instance of this is "as a dentist's waiting room."

Translation. We focus on English-to-German translation only. We examine the top-2 gating results based on our understanding of the source sentences. The distribution of top-2 gating percentages differs between the encoder and decoder layers, exhibiting a gradual decrease in the encoder layers and an increase in the decoder layers (Figure 2b). From sampled tokens and the adjusted data order in adaptive gating, we observe that tokens requiring two experts usually cluster within the same sentence. This observation leads us to infer that the complexity of sentence structure influences the gating results. In Table 4, we present one sentence containing multiple clauses whose tokens are frequently processed by top-2 experts.

Question and Answer. The percentage of top-2 tokens in question-and-answer tasks fluctuates across layers (Figure 2c). First, adaptive gating pays extra attention to the question itself. The words listed in Table 4 are some common examples. These tokens often either specify the scope of the question or pose constraints on the answers. Second, on the context side, tokens routed to two experts are closely related to the question in the input as well. For example, asking a question about numbers and computations results in top-2 gating on the numbers and the objects those numbers refer to.
Summarization. In summarization, the percentage of tokens using two experts decreases in both the encoder and decoder layers (Figure 2d). Based on our analysis of sampled tokens, we identify two patterns for tokens that are likely to be routed to top-2 experts. First, tokens with multiple meanings that rely on both themselves and the surrounding context for their ultimate interpretation are often routed to two experts in the shallow layers. Second, pronoun tokens use two experts in the deeper layers, as understanding their referents is crucial for accurate summarization; this pattern is particularly prevalent in this task. Additionally, certain key tokens (e.g., "in conclusion", "however", "in all") that indicate the beginning of the central idea or the main opinion of the context are often sent to two experts together with the following tokens.
Text completion. Text completion differs from the previous tasks as it is a decoder-only, autoregressive task. The gating results in text completion are influenced by the current prediction being generated. Because the focus of tokens changes dynamically based on the current prediction, it is challenging to identify specific types of tokens that consistently receive two experts. When predicting a pronoun, for example, the focus shifts to the names of individuals. Similar patterns can be observed for numbers and dates. Additionally, we find that the percentage of tokens routed to two experts is linked to the length of the current sequence: longer sequences have a higher percentage of top-2 gating.
Dialogue response. Dialogue response, compared to text completion, requires more understanding of the narrative input and the dialogue history, and we find that much of the effort is spent on processing the dialogue history. First, one key distinction is that tokens with a conversational meaning occur much more frequently. These words lack informative content but serve to express human-like sentiments, such as gratitude and politeness. We infer that routing these tokens to two experts reflects the difference between conversational usage and written text, and that it is also critical to learn where and when these words should be used. Second, given the nature of dialogue, many conversations are based on underlying assumptions and conditions. Related tokens are usually processed by two experts to improve the understanding of the context. For instance, the dialogue example provided in Table 4 is built on a scenario assuming that "Johnathan tells his parents that he is gay" and asks the model to answer questions under this condition.

Ablation Study
Threshold T in adaptive gating. We now conduct an ablation study on the threshold T introduced in adaptive gating. Increasing the threshold value results in a less sparse model, where more tokens are assigned to the top-2 gating mechanism, subsequently increasing the computation FLOPs.
Table 5 shows the inference performance of different tasks when the threshold is increased from 0.05 to 0.5. With a small threshold of 0.05, both the training time and inference performance closely resemble those of top-1 gating MoE. On the other hand, setting the threshold to 0.4 does not always lead to the same performance as top-2 gating.

Adjusting the training data order. Table 6 compares adaptive gating with and without reordering the training data. Without the adjustment, the execution time of each iteration cannot achieve the same level of reduction as the computation FLOPs. Consequently, the end-to-end training time is significantly inflated, with an average increase of 13.7%. Additionally, the curriculum itself contributes to the improvement in inference performance: the maximum drop is 0.21, in the Question and Answer task, when the data is fed and trained in a random order.

Limitation
Choice of k. Adaptive gating in MoE is currently limited to top-k gating, where k can be either 1 or 2. This builds on the common practice established in extensive prior work showing that top-2 gating delivers promising results in MoE. Further evaluation is necessary to validate the performance of a wider range of k values.

Task coverage. Our experiments were conducted on a diverse set of NLP tasks and datasets, but it is essential to note that the effectiveness and efficiency of adaptive gating may vary depending on the specific task characteristics. Different tasks may exhibit distinct patterns and complexities, which can impact the performance and generalizability of the proposed approach. Further investigation and evaluation on a wider range of tasks would provide a more comprehensive understanding of the limitations and applicability of adaptive gating.

Conclusion
This paper demonstrates the effectiveness and flexibility of adaptive gating in MoE models for a wide range of natural language processing tasks. By dynamically adjusting the number of experts based on token characteristics, we achieve improved training efficiency without compromising inference performance. Additionally, the integration of curriculum learning allows us to tackle the challenge of varying execution times, thereby reducing training costs. Our research sheds light on the trade-off between training efficiency and model performance in sparse and dynamic MoE networks, offering valuable insights for the development of more scalable and adaptable language models.

Figure 1 :
Figure 1: Normalized expert probabilities computed by the top-2 gating network for four sampled tokens. Here we use the Sentiment analysis task listed in Table 2.

Figure 2 :
Figure 2: Percentage of tokens computed by top-2 experts over all tokens in each layer when using adaptive gating in MoE.

Table 1 :
We compare the computation savings and running time reduction of the MoE layer for varying degrees of top-1 gating against top-2 gating. The MoE layer running time is measured on our testbed (Section 4.3). Tokens are randomly selected from the data batch. Here we also use the Sentiment analysis task listed in Table 2. We show results averaged over 40 runs.

Table 2 :
Overall performance of adaptive MoE and compared baselines in different NLP tasks. All the models converge to the same loss value.
Table 1 shows the computation reduction as well as the empirical MoE layer running time, both normalized to conventional top-2 gating. We use the PyTorch Profiler to obtain the computation time of the MoE layer. For simplicity, here we force a fixed percentage of tokens to be routed to only the top-1 expert and measure the running time. The reduction in running time is clearly much smaller than the computation savings.

Table 3 :
Overall performance of adaptive gating and compared baselines in different tasks. We normalize the training time with reference to the performance of top-2 gating MoE. All the schemes in the same task converge to the same loss.
anyone who has had the opportunity to visit Algeria during recent months or years can make a better assessment of what this terrible outbreak of terrorism means to the Algerian people and, indeed, I believe that it would be to our credit if we dealt with this issue in an urgent debate.
Question and Answer: Which entity, who else, after what, Up until, who was blamed, in terms of, after, Who's death caused this protest?
Summarization: Japanese actress Rinko Kikuchi walks Anjali Rao through the streets of Tokyo. She stunned global cinema audiences with her controversial and Oscar-nominated performance as a lonely deaf girl in the film "Babel". Rinko Kikuchi is one of Japan's hottest young actresses and models, recently working with Karl Lagerfeld as the new face of Channel. Despite her success, she remains an unconventional figure in Japan, at odds with the traditional demure image of the Japanese woman and forging a career on her own terms...
Text completion: Harris announced he would be stepping down as rabbi in 2011, and the synagogue hired Boris Dolin as his successor. Born and raised in Oregon, Dolin had worked at Temple Beth Israel as a teacher and youth group adviser from 1999 to 2001.
Dialogue response: exactly, definitely, hmm, um, well, I guess, [Narrative] Johnathan plans to tell his parents that he is gay. He feels anxious because he doesn't know how they will react. He is worried that they will be disappointed or even angry with him.

Table 4 :
Examples of tokens using top-2 experts in different tasks. Underlined tokens use top-2 gating in a sequence.

Table 5 :
Overall performance when the threshold T changes. Training time is normalized with reference to top-2 gating MoE. We highlight the best one with the least training time.

Table 6 :
Overall performance comparison of adaptive gating when the training data order is not adjusted.