MoEfication: Transformer Feed-forward Layers are Mixtures of Experts

Recent work has shown that feed-forward networks (FFNs) in pre-trained Transformers are a key component, storing various linguistic and factual knowledge. However, the computational patterns of FFNs are still unclear. In this work, we study the computational patterns of FFNs and observe that most inputs activate only a tiny fraction of FFN neurons. This phenomenon is similar to the sparsity of the human brain, which drives research on functional partitions of the human brain. To verify whether functional partitions also emerge in FFNs, we propose to convert a model into its MoE version with the same parameters, namely MoEfication. Specifically, MoEfication consists of two phases: (1) splitting the parameters of FFNs into multiple functional partitions as experts, and (2) building expert routers to decide which experts will be used for each input. Experimental results show that MoEfication can conditionally use 10% to 30% of FFN parameters while maintaining over 95% of the original performance for different models on various downstream tasks. Besides, MoEfication brings two advantages: (1) it significantly reduces the FLOPS of inference, i.e., 2x speedup with 25% of FFN parameters, and (2) it provides a fine-grained perspective to study the inner mechanism of FFNs. The source code of this paper can be obtained from https://github.com/thunlp/MoEfication.


Introduction
Recent years have witnessed great success of Transformer-based pre-trained language models (PLMs) (Devlin et al., 2019; Brown et al., 2021; Han et al., 2021), attracting many efforts to interpret the inner mechanism of Transformer (Manning et al., 2020; Kovaleva et al., 2019). However, most of these works focus on the attention mechanism but ignore the feed-forward networks (FFNs), which constitute nearly two-thirds of model parameters. Although recent work has shown that FFNs can be viewed as memory networks storing large amounts of knowledge (Geva et al., 2021; Dai et al., 2021), the computational patterns of FFNs are still unclear.
† Corresponding authors. Part of the work was done while Peng Li was working at Tencent.
In this work, we study the activation patterns of FFNs in Transformer models and find a phenomenon of sparse activation, i.e., only a tiny fraction of neurons are activated for a single input. For example, when we perform inference on a fine-tuned T5-Large model (Raffel et al., 2020) with 700 million parameters, 90% of inputs activate less than 5% of neurons. This phenomenon is similar to the sparsity in the human brain (Olshausen and Field, 1996; Gross, 2002), which drives research on functional partitions of the human brain (Garey, 1999). Inspired by this observation, we further raise a question: do functional partitions also emerge in artificial neural models, i.e., FFNs in pre-trained Transformers?
To investigate this problem, we explore whether a Transformer can be converted into an equivalent Mixture-of-Experts (MoE) model (Bengio, 2013), which regards different functional partitions in FFNs as different experts that are conditionally activated. Specifically, we propose MoEfication to discover the functional partitions (experts) in FFNs and build routers for selecting experts. It consists of two phases. (1) Expert Construction: Split a whole feed-forward layer into multiple experts. The goal is to group those neurons that are often activated simultaneously into the same expert network. (2) Expert Selection: Select those experts that contain as many activated neurons as possible for each input to approximate the original results.
In the experiments, we evaluate MoEfication on two typical kinds of downstream tasks, including GLUE and QA benchmarks (Wang et al., 2019; Rajpurkar et al., 2016; Lai et al., 2017), using T5 and BERT (Raffel et al., 2020; Devlin et al., 2019). Experimental results verify that FFNs in Transformers can be converted to mixtures of experts, and thus we can use only 10% to 30% of FFN parameters to maintain over 95% of the original performance, which verifies that pre-trained Transformers also learn functional partitions in FFNs. Besides, MoEfication brings two advantages: (1) It can significantly speed up the inference of Transformers. Using 25% of FFN parameters brings 2x speedup on CPU and 1.2x speedup on GPU. (2) We can study MoEfied models to interpret the inner mechanism of FFNs at a fine-grained level. In this work, we study their routing patterns and hope these findings can help future work on the design and training of MoE models.

Related Work
Interpretation of Large-scale Transformers. Due to the success of Transformer-based PLMs, there are many studies on the interpretation of Transformer, including the functionality of different layers (Tenney et al., 2019; Jawahar et al., 2019; Wang and Tu, 2020; Ramnath et al., 2020), and the mechanisms of both attention networks and FFNs (Manning et al., 2020; Kovaleva et al., 2019; Wallace et al., 2019). Recent work finds that the FFNs of Transformers can be viewed as memory networks storing large amounts of knowledge learned from language modeling (Geva et al., 2021; Dai et al., 2021; Suau et al., 2020). Meanwhile, researchers explore how to modify the knowledge stored in FFNs and achieve promising results (De Cao et al., 2021; Meng et al., 2022). In this work, we show how the knowledge stored in FFNs is used, that is, most FFNs can be viewed as an MoE network where the knowledge is conditionally activated.
Large-scale PLMs with MoE. Jacobs et al. (1991) propose mixture-of-experts to build a system composed of many separate networks, each of which learns to handle a subset of the training examples independently. When deep neural networks achieve great success (Hinton et al., 2012; Krizhevsky et al., 2012; Goodfellow et al., 2013), Bengio (2013) argues that the model size is a key factor and that MoE is an important technique for scaling model computation, and proposes the idea of "conditional computation". The first large-scale MoE language model is proposed by Shazeer et al. (2017), which adds an MoE layer between two LSTM layers and independently assigns tokens to combinations of experts. Recently, GShard (Lepikhin et al., 2021), Switch-Transformer (Fedus et al., 2021), BASELayer (Lewis et al., 2021), and Hash-Layer (Roller et al., 2021) study how to build large-scale Transformer-based models with MoE and optimal training strategies, which can fully utilize the model capacity. Different from them, we utilize the naturally existing sparse activation phenomenon to convert a model into its MoE version for better efficiency during inference.
Model Acceleration for PLMs. Model acceleration aims to reduce the time and space complexity of PLMs. There are several techniques, including knowledge distillation (Sanh et al., 2019; Sun et al., 2019; Jiao et al., 2020), model pruning (Voita et al., 2019; Michel et al., 2019; Zhang et al., 2021), attention approximation (Wang et al., 2020; Kitaev et al., 2020; Zaheer et al., 2020), model quantization (Zafrir et al., 2019; Zhang et al., 2020; Bai et al., 2021), and dynamic inference (Xin et al., 2020; Li et al., 2021; Ye et al., 2021; Hou et al., 2020). Among these techniques, dynamic inference selectively omits unnecessary computation for acceleration, which is similar to the target of MoEfication. Previous work usually focuses on how to dynamically drop layers to accelerate inference (Huang et al., 2018; Wu et al., 2020; Li et al., 2021), which introduces additional training objectives and prediction strategies. In contrast, MoEfication simplifies models at a finer granularity, and does not change the process of training and inference. In summary, MoEfication can be regarded as a novel direction orthogonal to the above-mentioned approaches.

MoEfication
In this section, we will introduce the general idea of MoEfication and divide it into two phases: expert construction and expert selection.

Overall Framework
MoEfication aims to utilize the sparse activation phenomenon in the FFNs of Transformers to reduce the computation cost. We first formally describe the sparse activation phenomenon. The FFNs of Transformers are two-layer fully connected networks, which process an input representation $x \in \mathbb{R}^{d_{model}}$ by

$$h = xW_1 + b_1, \qquad F(x) = \sigma(h)W_2 + b_2,$$

where $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$ are the weight matrices, $b_1 \in \mathbb{R}^{d_{ff}}$ and $b_2 \in \mathbb{R}^{d_{model}}$ are the bias vectors, and $\sigma(\cdot)$ is a non-linear activation function. In this work, we study ReLU (Nair and Hinton, 2010), which is used by the original Transformer (Vaswani et al., 2017) and some widely used Transformer-based PLMs (Sun et al., 2020; Raffel et al., 2020). Since there are many inactive (zero) values in the intermediate output $\sigma(h)$, the computation of these values can be omitted for acceleration. Meanwhile, different inputs will activate different neurons. Hence, instead of pruning the model, we explore how to select the possibly activated neurons of $h$ before the FFN computation.
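The observation that inactive neurons can be skipped can be checked numerically. The following is a minimal sketch with randomly initialized toy weights (the sizes and names such as `ffn_active_only` are illustrative assumptions, not taken from the paper); random weights will not show the strong sparsity reported for trained models, but the equality of the two outputs holds regardless.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                      # toy sizes, not taken from the paper

W1 = rng.normal(size=(d_model, d_ff))
b1 = rng.normal(size=d_ff)
W2 = rng.normal(size=(d_ff, d_model))
b2 = rng.normal(size=d_model)

def ffn(x):
    """Dense FFN: F(x) = ReLU(x W1 + b1) W2 + b2."""
    h = x @ W1 + b1
    return np.maximum(h, 0.0) @ W2 + b2

def ffn_active_only(x):
    """Same output, but the second matmul only touches neurons with h > 0."""
    h = x @ W1 + b1
    active = np.flatnonzero(h > 0)
    return h[active] @ W2[active] + b2, active

x = rng.normal(size=d_model)
dense_out = ffn(x)
sparse_out, active = ffn_active_only(x)
activation_ratio = active.size / d_ff      # fraction of activated neurons
```

Note that this sketch still computes all of `h` to find the active neurons, so it saves nothing in the first layer; predicting which neurons will be active before computing `h` is exactly the role of the router.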
We show an example in Figure 1. In this FFN, $d_{model}$ is 2, $d_{ff}$ is 4, and the bias vectors are omitted for simplification. For a given input representation $x$, there are two positive values in $h$. Hence, we only need to compute part of the FFN, i.e., a $2 \times 2$ submatrix of $W_1$ and a $2 \times 2$ submatrix of $W_2$, to obtain the same output $F(x)$. Correspondingly, we can MoEfy the original FFN to have an MoE layer with two experts and select the one on the right-hand side for this input $x$.
For MoEfication, we first split the FFN into several independent parts, namely expert construction, and then design a router to select suitable experts for each input, namely expert selection.

Expert Construction
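In this subsection, we introduce how to split an FFN into several parts. The core idea is to group together the neurons that are often activated simultaneously. In this way, for each input, we can select a small number of experts to cover all its activated neurons. To achieve better parallel computation performance, we set the size of each expert to be the same. If the number of experts is $k$, the input and output dimensions of the experts are still $d_{model}$ and their intermediate dimension is $d_e = d_{ff}/k$. Then, the parameters of the $i$-th expert are denoted by

$$W_1^i \in \mathbb{R}^{d_{model} \times d_e}, \quad b_1^i \in \mathbb{R}^{d_e}, \quad W_2^i \in \mathbb{R}^{d_e \times d_{model}}.$$

Given the result of splitting, we construct the corresponding permutation of intermediate neurons by

$$\begin{pmatrix} 1 & 2 & \cdots & d_{ff} \\ f(1) & f(2) & \cdots & f(d_{ff}) \end{pmatrix},$$

where $f(n)$ is the mapping function from the original neuron index to the permuted neuron index. We compute $f(n)$ by

$$f(n) = (e(n) - 1)\, d_e + |\{m \mid m \le n,\ e(m) = e(n)\}|,$$

where $e(n)$ is the expert index of the $n$-th neuron, which varies from 1 to $k$, and $|\{m \mid m \le n,\ e(m) = e(n)\}|$ is the index of the $n$-th neuron in the expert. Then, we use its permutation matrix $P \in \mathbb{R}^{d_{ff} \times d_{ff}}$ to permute the rows or columns of parameters and have the following split:

$$[W_1^1, W_1^2, \ldots, W_1^k] = W_1 P, \qquad [b_1^1, b_1^2, \ldots, b_1^k] = b_1 P, \qquad W_2^1 \oplus W_2^2 \oplus \cdots \oplus W_2^k = P^T W_2,$$

where the brackets denote horizontal concatenation and $\oplus$ represents the vertical concatenation. Note that the permutation will not influence the output representation:

$$\sigma(h)W_2 + b_2 = \sigma(h)PP^T W_2 + b_2 = \sigma(hP)P^T W_2 + b_2 = \sigma(xW_1P + b_1P)P^T W_2 + b_2. \tag{5}$$

In this work, we propose two methods to split an FFN into $k$ parts.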
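To make the permutation concrete, here is a small sketch that builds the mapping from original to permuted neuron index and the matrix $P$ from an expert assignment, slices the permuted parameters into experts, and checks that the output is unchanged. It uses 0-based indices (so the offset is `e * d_e` rather than $(e(n)-1)\,d_e$), and the assignment `expert_of` is a made-up example rather than the result of either split method.

```python
import numpy as np

def build_permutation(expert_of, k):
    """Return f (0-indexed) and the permutation matrix P for an assignment e(n)."""
    d_ff = len(expert_of)
    d_e = d_ff // k                       # equal-sized experts
    placed = [0] * k                      # neurons already placed per expert
    f = np.empty(d_ff, dtype=int)
    for n, e in enumerate(expert_of):
        f[n] = e * d_e + placed[e]        # expert offset + rank inside expert
        placed[e] += 1
    P = np.zeros((d_ff, d_ff))
    P[np.arange(d_ff), f] = 1.0           # neuron n moves to position f(n)
    return f, P

rng = np.random.default_rng(0)
d_model, d_ff, k = 4, 8, 2                # toy sizes (assumptions)
d_e = d_ff // k
W1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)

expert_of = [0, 1, 1, 0, 1, 0, 0, 1]      # hypothetical balanced assignment
f, P = build_permutation(expert_of, k)
W1p, b1p, W2p = W1 @ P, b1 @ P, P.T @ W2  # permuted parameters

# Slice the permuted parameters into k experts of width d_e.
experts = [(W1p[:, s:s + d_e], b1p[s:s + d_e], W2p[s:s + d_e])
           for s in range(0, d_ff, d_e)]

x = rng.normal(size=d_model)
original = np.maximum(x @ W1 + b1, 0.0) @ W2 + b2
from_experts = b2 + sum(np.maximum(x @ A + c, 0.0) @ B for A, c, B in experts)
```

Summing the contributions of all experts reproduces the dense output, which is the property that later makes it possible to drop experts whose contribution is zero.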
Parameter Clustering Split. To take the parameter information into consideration, we treat the columns of $W_1$ as a collection of $d_{model}$-dimensional vectors. Based on the intuition that neurons with similar vectors will be activated simultaneously, we apply balanced K-Means (Malinen and Fränti, 2014) to the vector collection to obtain $k$ clusters to construct the mapping function.
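As an illustration of clustering the columns of $W_1$ into equal-sized groups, the sketch below uses a simple greedy capacity-constrained variant of K-Means. This is a stand-in chosen for brevity, not the balanced K-Means algorithm of Malinen and Fränti (2014) and not the implementation used in the paper; the function name `balanced_clusters` and all sizes are assumptions.

```python
import numpy as np

def balanced_clusters(vectors, k, iters=10, seed=0):
    """Greedy equal-size clustering: each cluster receives exactly n // k vectors."""
    n = len(vectors)
    cap = n // k                          # assumes k divides n
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(n, size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every vector to every centroid, shape (n, k).
        dist = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = np.full(n, -1)
        sizes = np.zeros(k, dtype=int)
        # Visit (vector, cluster) pairs from closest to farthest.
        for flat in np.argsort(dist, axis=None):
            i, c = divmod(int(flat), k)
            if assign[i] == -1 and sizes[c] < cap:
                assign[i] = c
                sizes[c] += 1
        centroids = np.stack([vectors[assign == c].mean(axis=0) for c in range(k)])
    return assign

rng = np.random.default_rng(1)
d_model, d_ff, k = 4, 16, 4               # toy sizes (assumptions)
W1 = rng.normal(size=(d_model, d_ff))
expert_of = balanced_clusters(W1.T, k)    # one vector per neuron (column of W1)
```

The resulting `expert_of` array plays the role of $e(n)$ when constructing the mapping function.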
Co-Activation Graph Split. To directly use the information of co-activation, we construct a co-activation graph by counting co-activations of PLMs for the samples of the training set. Each neuron is represented by a node in the graph, and the edge weight between two nodes is their co-activation value. The co-activation value is computed by

$$\text{co-activation}(n, m) = \sum_{x} h_n^{(x)} h_m^{(x)} \, \mathbb{1}_{h_n^{(x)} > 0,\ h_m^{(x)} > 0},$$

where $h_n^{(x)}$ and $h_m^{(x)}$ are the $n$-th and the $m$-th neurons of $h$ for the input $x$, and $\mathbb{1}_{h_n^{(x)} > 0,\ h_m^{(x)} > 0}$ indicates that $h_n^{(x)}$ and $h_m^{(x)}$ are activated simultaneously. Then, we apply graph partitioning algorithms (Karypis and Kumar, 1998) to the co-activation graph to obtain the split, where the internal connections for each group will be strong. It means that the neurons split into the same group are often activated simultaneously for the training samples. Please refer to Appendix F for the details of the partitioning algorithm.
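A co-activation graph of this kind can be accumulated with a single matrix product. The sketch below assumes that the co-activation value of two neurons is the sum, over inputs, of the product of their values whenever both are positive (an assumption about the exact weighting); the matrix `H` of pre-activation values is a hand-made toy example.

```python
import numpy as np

def coactivation_matrix(H):
    """H has shape (num_inputs, d_ff) and holds pre-activation values h.
    Entry (n, m) sums h_n * h_m over the inputs where both are positive."""
    A = np.maximum(H, 0.0)                # inactive neurons contribute zero
    C = A.T @ A                           # sum over inputs of relu(h_n) * relu(h_m)
    np.fill_diagonal(C, 0.0)              # no self-edges in the graph
    return C

H = np.array([[1.0,  2.0, -1.0],
              [0.5, -3.0,  4.0],
              [2.0,  1.0,  0.0]])
C = coactivation_matrix(H)                # edge weights between the 3 neurons
```

The symmetric matrix `C` can then be handed to a graph partitioner as a weighted adjacency matrix.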

Expert Selection
In this subsection, we introduce how to create a router for expert selection. An MoEfied FFN processes an input $x$ by

$$F_m(x) = \sum_{i \in S} \sigma(xW_1^i + b_1^i)\, W_2^i + b_2,$$

where $S$ is the set of the selected experts. If all experts are selected, we have $F_m(x) = F(x)$. Considering that $\sigma(xW_1^i + b_1^i)W_2^i$ equals 0 for most experts, we try to select $n$ experts, where $n < k$, to minimize $\|F_m(x) - F(x)\|_2$. The selection methods assign a score $s_i$ to each expert for the given input $x$ and select the experts with the $n$ highest scores by

$$S = \arg\max_{A \subset \{1, 2, \ldots, k\},\ |A| = n} \sum_{i \in A} s_i. \tag{8}$$

Groundtruth Selection for the intermediate output $\sigma(h)$. We can obtain the groundtruth selection, which minimizes $\|\mathrm{concat}(\{f(\sigma(xW_1^i + b_1^i)) \mid i \in S\}) - \sigma(h)\|_2$, by a greedy algorithm. Here, $f$ is a padding function with zeros to match the dimension between $\sigma(xW_1^i + b_1^i)$ and $\sigma(h)$. We calculate the sum of positive values in each expert as $s_i$ and select experts using Equation 8. This selection should approach the lower bound of $\|F_m(x) - F(x)\|_2$. Correspondingly, its performance will approach the ideal performance of an MoEfied model. Meanwhile, it is intractable to directly optimize $\|F_m(x) - F(x)\|_2$ because there are too many possible combinations of experts.
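The score-then-select procedure can be sketched as follows, using the groundtruth score (sum of positive values in each expert) and top-$n$ selection. The toy experts are random and all names (`expert_hidden`, `groundtruth_scores`, `moe_ffn`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, k, d_e = 4, 4, 2                 # toy sizes (assumptions)
experts = [(rng.normal(size=(d_model, d_e)), rng.normal(size=d_e),
            rng.normal(size=(d_e, d_model))) for _ in range(k)]
b2 = rng.normal(size=d_model)

def expert_hidden(x, expert):
    """ReLU activations of one expert for input x."""
    W1_i, b1_i, _ = expert
    return np.maximum(x @ W1_i + b1_i, 0.0)

def groundtruth_scores(x):
    """s_i: sum of positive activations inside expert i."""
    return np.array([expert_hidden(x, e).sum() for e in experts])

def moe_ffn(x, scores, n):
    """Evaluate only the n experts with the highest scores."""
    selected = np.argsort(scores)[::-1][:n]
    out = b2.copy()
    for i in selected:
        out += expert_hidden(x, experts[i]) @ experts[i][2]
    return out, sorted(int(i) for i in selected)

x = rng.normal(size=d_model)
scores = groundtruth_scores(x)
full_out, all_experts = moe_ffn(x, scores, n=k)     # every expert selected
approx_out, chosen = moe_ffn(x, scores, n=2)        # only the top-2 experts
dense_out = b2 + sum(expert_hidden(x, e) @ e[2] for e in experts)
```

Because the groundtruth score needs the activations of every expert, it acts as an oracle for analysis rather than as a practical router; the cheaper selection methods try to predict it from $x$ alone.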
Similarity Selection. To utilize the parameter information, we average all columns of $W_1^i$ and use the result as the expert representation. Given an input $x$, we calculate the cosine similarity between the expert representation and $x$ as $s_i$.
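A direct sketch of this scoring rule, with two hand-made expert blocks whose columns point along different axes (the matrices and the function name `similarity_scores` are illustrative assumptions):

```python
import numpy as np

def similarity_scores(x, expert_W1s):
    """Cosine similarity between x and the mean column of each expert's W1 block."""
    reps = np.stack([W.mean(axis=1) for W in expert_W1s])      # shape (k, d_model)
    return (reps @ x) / (np.linalg.norm(reps, axis=1) * np.linalg.norm(x))

# Expert 0 has columns along the first axis, expert 1 along the second.
expert_W1s = [np.array([[1.0, 1.0], [0.0, 0.0]]),
              np.array([[0.0, 0.0], [1.0, 1.0]])]
x = np.array([2.0, 0.1])                  # input mostly aligned with the first axis
s = similarity_scores(x, expert_W1s)
```

Here the input is nearly parallel to the representation of expert 0, so that expert receives the higher score.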
MLP Selection. We train a multi-layer perceptron (MLP), which takes $x$ as input and predicts the sum of positive values in each expert. Then, we use the prediction as $s_i$. This method tries to approximate the performance of groundtruth selection.
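The idea of regressing the per-expert sum of positive values from $x$ can be sketched with a tiny one-hidden-layer MLP trained by plain gradient descent on synthetic data. The architecture, the tanh non-linearity, the learning rate, and all sizes are assumptions made for this illustration and are not the router configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, k, d_e, hidden = 8, 4, 4, 16     # toy sizes (assumptions)
W1 = rng.normal(size=(d_model, k * d_e))  # stand-in for a permuted first FFN layer

X = rng.normal(size=(512, d_model))       # synthetic inputs
# Targets: sum of positive activations inside each expert.
T = np.maximum(X @ W1, 0.0).reshape(len(X), k, d_e).sum(axis=2)

A, a = rng.normal(size=(d_model, hidden)) * 0.1, np.zeros(hidden)
B, b = rng.normal(size=(hidden, k)) * 0.1, np.zeros(k)

def router(inputs):
    """Predicted score s_i for every expert."""
    return np.tanh(inputs @ A + a) @ B + b

def mse():
    return float(((router(X) - T) ** 2).mean())

loss_before = mse()
lr = 0.02
for _ in range(500):
    Z = np.tanh(X @ A + a)
    err = (Z @ B + b - T) / len(X)        # residual, averaged over samples
    dZ = (err @ B.T) * (1.0 - Z ** 2)     # backpropagate through tanh
    B -= lr * (Z.T @ err)
    b -= lr * err.sum(axis=0)
    A -= lr * (X.T @ dZ)
    a -= lr * dZ.sum(axis=0)
loss_after = mse()
```

At inference time the predictions of `router` would be used as the scores $s_i$, so that only the selected experts need to be evaluated.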

Experimental Setups
Models and Hyperparameters. We use four variants of T5 (Raffel et al., 2020), which are the 60-million-parameter T5-Small, the 200-million-parameter T5-Base, the 700-million-parameter T5-Large, and the 3-billion-parameter T5-XLarge. The non-linear activation function is ReLU (Nair and Hinton, 2010). We use Adam as the optimizer and a learning rate of $10^{-6}$ for fine-tuning T5 models on downstream tasks. The batch size is set to 64 and the number of epochs is set to 3.
Datasets. We use several natural language understanding datasets to evaluate our models. We use SST-2 (Socher et al., 2013), MNLI-matched (Williams et al., 2018), and RACE (Lai et al., 2017) as the main evaluation datasets, which cover single-sentence classification, sentence-pair classification, and reading comprehension. We report the results on their development sets. We also report the results of MoEfication on other datasets in Appendix A, including the tasks in the GLUE benchmark (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2016).
Expert Construction. For balanced K-Means, we use an open-source implementation. Besides Parameter Clustering Split and Co-activation Graph Split, we also implement Random Split as a naive baseline, which uses an identity matrix as $P$. For the number of neurons in each expert, if the number is small, there will be a lot of experts, making the routing computation cost high. Meanwhile, if the number is large, there will be more inactive neurons in each expert for a given input, which is harmful to the performance with the same number of selected neurons. Hence, selecting the number of neurons in each expert is a trade-off between computation cost and accuracy. According to our pilot experiments, we set the number of neurons in each expert $d_e$ to 32. Correspondingly, the number of experts varies from 64 to 512 ($k = d_{ff}/d_e$). [...] data and split them into the training and development sets with a ratio of 9:1. Note that we only use the activation information as supervision. The training time of each FFN is about several minutes on a single GPU.

MoEfy ReLU-based Models
In this subsection, we evaluate MoEfication on different T5 models.We consider two factors: the model size and whether the model is compressed.
For the model size, we use four variants of T5 (Raffel et al., 2020), from T5-Small to T5-XLarge. For convenience, we directly use the scale names as the abbreviations. To investigate the influence of model compression, we compress T5-Large to T5-Small by classic knowledge distillation (Hinton et al., 2015). Specifically, the teacher model is a fine-tuned T5-Large and the student model is a pre-trained T5-Small. The distilled model is denoted by T5-Small-Distill. The expert construction and selection methods used here are Co-activation Graph Split and MLP Selection, which are shown to be the best combination in Section 4.4. We report the performance of these models on three datasets, SST-2, MNLI, and RACE, in Table 1. They are the representative datasets for single-sentence classification, sentence-pair classification, and reading comprehension, respectively. The original performance of PLMs grows as the model size grows, and knowledge distillation improves the performance of T5-Small.
We first calculate the activation statistics of different models by inputting the training data of each dataset. The results are shown in Figure 2. From the figure, we have three observations. (1) The activations of these models are sparse. Different from the previous study on models trained with smaller datasets, where the activation ratios range from 10% to 50% (Geva et al., 2021), we find that most inputs activate less than 10% of the neurons. (2) The activations of larger models are sparser than those of smaller models. For example, 80% of inputs activate less than 3% of neurons in T5-XLarge, while 40% of inputs activate more than 3% of neurons in T5-Small. (3) The sparsity is less related to distillation than to the model size. The CDF curve of T5-Small-Distill is close to that of T5-Small.
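Statistics of this kind can be computed from recorded pre-activation values as sketched below. The matrix `H` (one row per input, one column per neuron) is a hand-made toy example, and the helper names are assumptions.

```python
import numpy as np

def activation_ratios(H):
    """Fraction of neurons with a positive pre-activation, one value per input."""
    return (H > 0).mean(axis=1)

def cdf_at(ratios, threshold):
    """Empirical CDF: share of inputs whose activation ratio is at most `threshold`."""
    return float((ratios <= threshold).mean())

H = np.array([[ 1.0, -1.0, -2.0, -0.5],
              [ 0.3,  0.2, -1.0, -4.0],
              [-1.0, -1.0, -1.0, -1.0],
              [ 2.0,  1.0,  3.0, -0.1]])
ratios = activation_ratios(H)             # one ratio per input
share_sparse = cdf_at(ratios, 0.25)       # inputs activating at most 25% of neurons
```

Evaluating `cdf_at` over a grid of thresholds gives a CDF curve of the kind plotted for each model.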
Then, we compare the performance of MoEfied models with different sizes and ratios of selected neurons and report the results in Figure 3. To measure the performance of MoEfication, we calculate the relative performance of the MoEfied model to the original model. From the figure, we have four observations. (1) MoEfication works well with all models on all three datasets. MoEfied models use 10% to 30% of FFN parameters while maintaining over 95% of the original performance. (2) The larger models can use fewer neurons to recover the original performance. For example, T5-XLarge achieves nearly 98% relative performance on SST-2 and MNLI with 10% of neurons, while T5-Small achieves the same results with 30% to 40% of neurons. This result is consistent with the activation statistics, that is, larger models are sparser. We can expect that MoEfication can provide better efficiency with super large models. (3) Difficult tasks require models to select more experts to maintain the performance. From Table 1, we can see that the accuracy on RACE is much lower than that on the other two tasks, and hence we think RACE is more difficult. Correspondingly, the relative performance with 10% of neurons on RACE is also lower than that on the other tasks. (4) MoEfication works similarly on T5-Small and T5-Small-Distill, which indicates that MoEfication can work with knowledge distillation for more efficient inference.

MoEfy GeLU-based Models
In addition to ReLU, many PLMs use GeLU (Hendrycks and Gimpel, 2016) as the activation function, including BERT (Devlin et al., 2019) and GPT (Brown et al., 2021). In this subsection, we study whether BERT, which is the most representative GeLU-based model, can be MoEfied. Considering that GeLU gives negative inputs small activations instead of 0, we first transform a GeLU-based BERT into a ReLU-based BERT, and then MoEfy the ReLU-based model. Specifically, we initialize a ReLU-based BERT using the pre-trained parameters of a BERT-Large and train the ReLU-based BERT on the pre-training corpus to adapt it to the change of activation function. In this work, we use the pre-training framework provided by NVIDIA and keep all hyperparameters unchanged. Wikipedia and Bookcorpus are used as the pre-training corpus. In the experiments, after 400 optimization steps, the pre-training loss is close to that of the original model. Hence, the adaptation cost is much smaller than the pre-training cost (about 10000 steps). Meanwhile, the downstream performance of the ReLU-based model is comparable to that of the original model (93.1 vs. 93.5 on SST-2 and 84.8 vs. 85.2 on MNLI). Based on this ReLU-based BERT-Large, we study the sparse activation phenomenon and the effect of MoEfication and report the results in Figure 4.
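The difference between the two activation functions on negative inputs can be seen in a few lines. This sketch uses the exact GeLU definition $x\,\Phi(x)$ with the standard normal CDF $\Phi$; some implementations use a tanh approximation instead, which behaves similarly.

```python
import math

def relu(z):
    """ReLU maps every negative input to exactly zero."""
    return max(z, 0.0)

def gelu(z):
    """Exact GeLU: z * Phi(z), where Phi is the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

relu_neg = relu(-1.0)                     # exactly 0.0, the neuron can be skipped
gelu_neg = gelu(-1.0)                     # small but non-zero negative value
```

Because GeLU leaves small non-zero values for negative inputs, there is no exact notion of an inactive neuron, which motivates converting the model to ReLU before MoEfication.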
From this figure, we have two observations: (1) The sparse activation phenomenon still exists in BERT. For example, more than 80% of inputs activate less than 10% of neurons. It reveals the generality of the sparse activation phenomenon in pre-trained Transformers. It will be an interesting direction to explain this phenomenon empirically or theoretically in the future. (2) MoEfication also achieves good performance on BERT. For example, selecting 30% to 40% of neurons can recover 97% of the performance. Since the activation of BERT is slightly denser than that of T5, it requires more neurons to recover most of the performance.

Comparisons of MoEfication Strategies
To find the most effective MoEfication strategy, we evaluate different combinations of expert construction and selection methods. We use T5-Large and set the ratio of selected neurons to 20%. The results are shown in Table 2. From the table, we have two observations: (1) For expert construction, Co-activation Graph Split is the best method according to the overall performance. Compared to the other two methods, Co-activation Graph Split directly uses the co-activation information to group the neurons activating simultaneously into the same expert.
(2) For expert selection, the performance of Groundtruth Selection is close to that of the original model, which indicates that 20% of the FFN parameters are sufficient to achieve good performance on T5-Large. Meanwhile, MLP Selection is the best expert selection method and can work well with both Parameter Clustering Split and Co-activation Graph Split.

Analysis
In this section, we analyze the efficiency and routing patterns of MoEfied models.

Efficiency Improvement
In this subsection, we show the efficiency improvement brought by MoEfication. We synthesize a batch of sequences with input and output lengths of 64 and evaluate T5-Large on the data. To comprehensively show the efficiency improvement, we report the relative speedup based on FLOPS, CPU, and GPU in Table 3. The FLOPS is estimated according to the statistics provided by Brown et al. (2021). The results of CPU and GPU are tested on an Intel Broadwell CPU and an NVIDIA Tesla V100 GPU, respectively.
From this table, we have three observations: (1) MoEfication can significantly reduce the total FLOPS, e.g., a 2x speedup at a ratio of 25%. Meanwhile, the speedup on CPU is close to that on FLOPS. Considering that CPU is widely used for model inference in real-world scenarios, MoEfication is practical for the acceleration of various NLP applications. (2) The smaller the ratio, the smaller the gain. For example, the gain of halving 25% (to 12.5%) is 1.2x while the gain of halving 50% (to 25%) is 1.3x. Although the FLOPS reduction of feed-forward networks is linear in the ratio, the cost of attention networks is unchanged and becomes the bottleneck. Hence, 20% is a good ratio, which can have a significant speedup (2x) and maintain most of the performance. (3) Since some of the operations of MoE cannot be easily parallelized, the speedup on GPU is smaller than that on CPU. Recently, some packages such as Fast-MoE (He et al., 2021) and Deepspeed-MoE (Rajbhandari et al., 2022) are working on parallelizing the inference of MoE models on distributed computing platforms and already have some promising results. We believe the bottleneck of parallel computing in MoE models will be well solved in the future.
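The diminishing gain in observation (2) follows from a simple Amdahl-style estimate in which only the FFN part of the cost scales with the ratio of selected neurons. The sketch below assumes that FFNs account for two-thirds of the cost (borrowing the two-thirds parameter share mentioned in the introduction) and ignores routing overhead; both are simplifying assumptions rather than the paper's FLOPS accounting.

```python
def estimated_speedup(ratio, ffn_share=2 / 3):
    """Amdahl-style estimate: only the FFN share of the cost scales with `ratio`."""
    return 1.0 / ((1.0 - ffn_share) + ffn_share * ratio)

speedup_50 = estimated_speedup(0.50)
speedup_25 = estimated_speedup(0.25)
speedup_12 = estimated_speedup(0.125)

gain_50_to_25 = speedup_25 / speedup_50   # gain from halving 50% to 25%
gain_25_to_12 = speedup_12 / speedup_25   # gain from halving 25% to 12.5%
```

Under these assumptions the estimate gives about 2x at a ratio of 25%, and gains of about 1.3x and 1.2x for the two halvings, in line with the figures quoted above; the fixed attention cost is what caps the gain as the ratio shrinks.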
Figure 6: Input similarities between experts in the last encoder layer of MoEfied T5-Small. For the most selected experts, both the self-similarities and inter-similarities are low. For the least selected experts, the self-similarities are much higher than the inter-similarities.

Routing Patterns
In this subsection, we investigate the routing patterns of MoEfied models. First, we count the selection frequency of each expert. Previous work introduces training objectives to ensure balanced selection to make full use of model parameters (Lepikhin et al., 2021; Fedus et al., 2021). We report the results of the MoEfied T5-Small with 20% of experts selected on SST-2 in Figure 5. From the figure, we observe that the frequency distribution of expert selection is highly unbalanced. There are some commonly used experts, whose frequencies are higher than 80%. Meanwhile, there are also some long-tail experts whose frequencies are lower than 10%.
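Selection frequencies of this kind can be counted as sketched below. The routing decisions in `selections` are made up for illustration (each set lists the experts chosen for one input), and the function name is an assumption.

```python
from collections import Counter

def selection_frequency(selections, k):
    """Fraction of inputs that selected each of the k experts."""
    counts = Counter(e for chosen in selections for e in chosen)
    return [counts[e] / len(selections) for e in range(k)]

# Hypothetical routing decisions: 2 of 5 experts selected for each of 4 inputs.
selections = [{0, 1}, {0, 2}, {0, 1}, {0, 3}]
freq = selection_frequency(selections, k=5)
balanced = 2 / 5                          # frequency under perfectly balanced routing
```

In this toy example expert 0 is selected for every input and expert 4 is never selected, whereas perfectly balanced routing would give every expert the same frequency (the number of selected experts divided by the total number of experts).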
Then, we calculate the self-similarities and inter-similarities of inputs between experts by sampling 10,000 inputs for each expert. We report the results of the last layer in Figure 6. For the most selected experts, which are selected by most inputs, the self-similarities are close to the inter-similarities. For the least selected experts, the self-similarities are much higher than the inter-similarities, which suggests that the inputs of each expert have an obvious cluster structure.
From these results, we can conclude the routing patterns of MoEfied models: there are some general experts, which can work for most inputs, and some input-specific experts, which are seldom used and may work in specific domains or tasks.This observation may inspire future work on training MoE models from scratch.

Conclusion
In this work, we verify that Transformer FFNs are naturally mixtures of experts and propose MoEfication, which utilizes the sparse activation phenomenon in FFNs to convert a normal model to its MoE version with the same parameters. Experimental results show that MoEfied models can achieve comparable performance to the original models using only 10% to 30% of FFN parameters. Correspondingly, it significantly reduces the FLOPS of inference, e.g., 2x speedup with 20% of FFN parameters. Besides, by studying the routing patterns of MoEfied models, we find that there are general and input-specific experts, which may inspire future work on training MoE models. We hope MoEfication can benefit real-world applications of PLMs with better efficiency and benefit the interpretation of the inner mechanism of FFNs.
We evaluate MoEfication on several downstream natural language understanding tasks with T5-Large. The ratio of selected neurons is set to 20%, which is sufficient for T5-Large as shown in Figure 2. In practice, there is still a gap between the performance of MoEfied models and that of original models because selected experts cannot cover all positive neurons with a limited computation budget. Hence, the outputs of MoEfied models will be slightly different from those of original models. To calibrate MoEfied models, we further fine-tune the models on the training set, namely parameter calibration. Considering that current routers are based on the first layers of FFNs ($W_1$ and $b_1$), we only optimize the second layers of FFNs ($W_2$ and $b_2$) to ensure routers can also work well after fine-tuning. We use a small learning rate of $10^{-7}$ for calibration. The other hyper-parameters remain the same as in fine-tuning. The results are shown in Table 4. MoEfied refers to the combination of Co-activation Graph Split and MLP Selection. MoEfied+GT refers to the combination of Co-activation Graph Split and Groundtruth Selection. MoEfied+Calib is the calibrated version of MoEfied. To calculate the average performance, we also include SST-2, MNLI, and RACE.
We observe that MoEfication introduces a small performance loss (about 1.5% on average) with an 80% reduction of the computation cost in FFNs. Meanwhile, calibration can effectively deal with the precision errors brought by MoEfication. For example, MoEfied+Calib improves

B Activation Statistics before Fine-tuning
We count the activation statistics of PLMs before fine-tuning on the pre-training data containing about 50,000 input tokens. The results are shown in Figure 7. We observe that PLMs before fine-tuning also have the sparse activation phenomenon and fine-tuning brings little change. Then, we compare the activations of pre-trained models and those of fine-tuned models. We use the average ratio of activated neurons as the index. The results are shown in Table 5. We observe that fine-tuning increases the average activation ratio for most models. The reason may be that different neurons start to learn the same task-specific patterns during fine-tuning. Interestingly, the increase on RACE is smaller than that on the other datasets. Since RACE is more difficult than the other datasets, there should be more task-specific patterns in RACE and fewer neurons learn the same patterns. Moreover, the pre-training task MLM requires more patterns than RACE, so the ratios of MLM are the lowest.

C Results of Graph Partition
Co-activation Graph Split achieves good performance in expert construction. Here, we study whether the co-activation graph is suitable for partitioning. We report the results of graph partitioning of T5-Large on SST-2 in Figure 8. Smaller ratios of edgecuts, which straddle partitions, mean that more co-activation pairs are included in experts. We only

D Accuracy of MLP Selection
MLP Selection trains MLPs to fit the groundtruth selection. In this part, we report the accuracy of MLPs in T5-Large fine-tuned on SST-2. The results are shown in Figures 9 and 10. The overall accuracy of the encoder is about 0.8 and the overall accuracy of the decoder is about 0.7.

E Relative Cost of Routing
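In this work, we set the number of neurons in each expert to 32. Then, the number of experts in each layer is $k = d_{ff}/32$. For Similarity Selection, the router computes one similarity per expert, which costs $O(k \cdot d_{model})$ for each input. The computational complexity of FFNs for each input is $O(d_{model} \cdot d_{ff})$. Then, the relative cost of routing to that of FFNs is constant for different models. It is also similar to MLP Selection.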
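As a worked version of this argument, the following derivation counts multiply-accumulate operations under two assumptions that are not spelled out in the text: routing is done by Similarity Selection with one $d_{model}$-dimensional representation per expert, and the FFN cost consists of its two matrix products. The symbols $C_{\text{route}}$ and $C_{\text{FFN}}$ are introduced here only for illustration.

```latex
\begin{aligned}
C_{\text{route}} &= k \cdot d_{model} = \frac{d_{ff}}{d_e}\, d_{model}, \\
C_{\text{FFN}}   &= 2\, d_{model}\, d_{ff}, \\
\frac{C_{\text{route}}}{C_{\text{FFN}}} &= \frac{1}{2\, d_e} = \frac{1}{64}
\quad \text{for } d_e = 32 .
\end{aligned}
```

The ratio depends only on the expert size $d_e$, not on $d_{model}$ or $d_{ff}$, which is why it stays the same across model sizes under these assumptions.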

F Graph Partitioning Algorithm
The goal of graph partitioning is to divide a graph into several sub-graphs where the number of edges crossing sub-graphs is minimized. In this work, we use the graph partitioning algorithm proposed by Karypis and Kumar (1998). The graph partitioning algorithm consists of three phases: coarsening phase, partitioning phase, and refinement phase.
(1) In the coarsening phase, we create new super nodes by grouping nodes that are highly connected together. For example, if the weight of the edge between two nodes is large, these two nodes will be grouped together. In the setting of coarsening co-activation graphs studied in this work, two neurons that often activate simultaneously will be treated as a new super neuron. (2) In the partitioning phase, we start with an initial bipartition of the super node graph and then iteratively search for super nodes from each part of the graph, such that swapping them leads to a partition with a smaller number of crossing edges. To divide a graph into $k$ parts, we need $\log k$ rounds of bipartition. (3) In the refinement phase, we project super nodes to the original nodes and then continue to iteratively swap nodes to reduce the number of crossing edges.
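The coarsening step can be illustrated with a greedy matching that pairs nodes along the heaviest remaining edges. This is a simplified sketch of the idea only: it sorts all edges globally and performs a single coarsening level, which is not the exact procedure of Karypis and Kumar (1998), and the toy edge weights are made up.

```python
def heavy_edge_matching(weights):
    """weights: dict mapping (u, v) node pairs to edge weights.
    Greedily pairs nodes along the heaviest edges whose endpoints are both free."""
    matched = set()
    pairs = []
    for (u, v), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        if u not in matched and v not in matched:
            pairs.append((u, v))          # u and v become one super node
            matched.update((u, v))
    return pairs

# Toy co-activation graph on 4 neurons (hypothetical weights).
weights = {(0, 1): 5.0, (1, 2): 4.0, (2, 3): 3.0, (0, 3): 1.0}
super_nodes = heavy_edge_matching(weights)
```

Neurons 0 and 1 share the heaviest edge and are merged first; neuron 2 can then no longer be merged with neuron 1 and is paired with neuron 3 instead.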

G Comparisons with Model Pruning
Based on the fine-tuned T5-Large on SST-2, we compare MoEfication with model pruning, which omits the weights with small values. The results are shown in Figure 11. We observe that model pruning significantly degrades the performance. However, MoEfication achieves good performance by selectively activating parts of the network according to the input.
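For reference, magnitude-based pruning of a weight matrix can be sketched as follows. The exact pruning setup used for the comparison is not described here, so the function `magnitude_prune`, its `keep_ratio` parameter, and the toy matrix are assumptions.

```python
import numpy as np

def magnitude_prune(W, keep_ratio):
    """Zero out all but the largest-magnitude entries of W (input-independent)."""
    keep = int(round(W.size * keep_ratio))
    if keep == 0:
        return np.zeros_like(W)
    threshold = np.sort(np.abs(W), axis=None)[-keep]   # smallest magnitude kept
    return np.where(np.abs(W) >= threshold, W, 0.0)

W = np.array([[0.1, -2.0],
              [0.5, -0.05]])
W_pruned = magnitude_prune(W, keep_ratio=0.5)
```

The key contrast is that the pruning mask is fixed once and shared by every input, whereas MoEfication keeps all parameters and chooses which part to use separately for each input.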

H MoEfication vs. MoE pre-training
In this subsection, we compare the performance of two kinds of MoE models. The first one is pre-trained from scratch. The second one is transformed from a standard model by MoEfication. For fair comparisons, we pre-train one MoE model and one standard model with the same model size from scratch using WikiText-103 (Merity et al., 2017). The pre-training objective is masked language modeling (MLM). The model architecture is the same as T5-Small. For pre-training, we use a batch size of 4096, a learning rate of 0.01, a maximum sequence length of 512, and the Adam optimizer.
The number of experts is set to 64 and the router will select 32 of them for a single input.
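As a minimal sketch of this setting, the selection step can be written as a top-k choice over per-expert router scores. The scores below are random placeholders and `select_experts` is a hypothetical helper; how the scores are produced by the pre-trained MoE router is not specified here.

```python
import random


def select_experts(scores, k):
    """Return the indices of the k highest-scoring experts, in index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


NUM_EXPERTS, NUM_SELECTED = 64, 32

rng = random.Random(0)
scores = [rng.random() for _ in range(NUM_EXPERTS)]  # placeholder router scores
chosen = select_experts(scores, NUM_SELECTED)
fraction_used = len(chosen) / NUM_EXPERTS
print(len(chosen), fraction_used)
```

With 32 of 64 experts selected, half of the expert parameters are used for each input.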
We report the MLM loss on the validation set in Table 6. From the table, we have two observations. (1) The loss of the standard pre-trained model is lower than that of the pre-trained MoE model. We conjecture that optimizing MoE models is more difficult than optimizing standard models because of the restricted selection of MoE models.

Figure 1: An example of the sparse activation phenomenon and MoEfication. (a) shows the computation process of an FFN for a given input. (b) shows the unused elements and neurons for this input. (c) shows how to construct experts. (d) shows how the MoEfied model handles this input efficiently.

Figure 2: CDF of the ratio of activated neurons for each input with different models on three datasets.

Figure 4: (a) CDF of the ratio of activated neurons in BERT-Large on SST-2, MNLI, and RACE. (b) Relative performance of MoEfied BERT-Large.

Figure 5: Selection frequency of 64 experts in each encoder layer of MoEfied T5-Small. Under ideally balanced selection, the frequency would be 0.2, whereas the observed distribution is highly unbalanced.
(a) The 8 most selected experts. (b) The 8 least selected experts.

Figure 7: CDF of the ratios of activated neurons for each input with different models before fine-tuning.

Figure 8: Ratio of edgecuts in different layers.

Figure 9: Accuracy of MLPs of encoder layers.

Figure 11: Comparison between MoEfication and model pruning.
(2) MoEfied models achieve better performance than the pre-trained MoE model. This indicates that pre-training a standard model and then conducting MoEfication can be a better option than pre-training an MoE model from scratch.

T5 variants. With the same expert size, the relative computation cost of routing for different models is the same as shown in Appendix E.

Expert Selection. Besides Similarity Selection and MLP Selection, we also implement Random Selection, where we treat each expert as a collection of vectors with $d_{model}$ dimensions and randomly select one of them as the expert representation. For Random Selection and Similarity Selection, the computation complexity for routing is $O(k d_{model})$. For MLP Selection, we use a two-layer feed-forward network as the architecture. The input dimension is $d_{model}$, the intermediate dimension is $k$, and the output dimension is $k$. The nonlinear activation function is $\tanh(\cdot)$. Its computation complexity is $O(k d_{model} + k^2)$. Compared to the computation complexity of FFNs of the original model, $O(d_{model} \cdot d_{ff})$, the computation cost of routers is negligible because $k$ is much smaller than $d_{ff}$. For example, $k$ is 128 and $d_{ff}$ is 4096 for T5-Large. For the training of our MLP routers, we adopt cross-entropy as the training objective and use the Adam optimizer with a learning rate of $10^{-2}$. The batch size is set to 512 and the number of epochs is set to 10. We sample nearly 500 thousand input representations from the training

Table 1: Original performance of different models on three downstream tasks. The model architecture is T5.
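To make the MLP router and its cost concrete, the sketch below builds a two-layer router with a tanh activation and counts multiplications for the router and for one FFN projection, following the complexity expressions in the Expert Selection paragraph. It is only an illustration: the weights are random rather than trained, bias terms are omitted, the helper names are hypothetical, and the value of 1024 for the model dimension of T5-Large is an assumption not stated in this section.

```python
import math
import random


def make_router(d_model, k, seed=0):
    """Randomly initialised two-layer router: d_model -> k -> k (no biases)."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(d_model)] for _ in range(k)]
    W2 = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(k)]
    return W1, W2


def route(x, W1, W2):
    """Return one score per expert: W2 . tanh(W1 . x)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def router_mults(d_model, k):
    """Multiplications in the router: k * d_model + k^2."""
    return k * d_model + k * k


def ffn_mults(d_model, d_ff):
    """Multiplications in one FFN projection: d_model * d_ff."""
    return d_model * d_ff


# Tiny router so the example runs instantly.
W1, W2 = make_router(d_model=8, k=4)
scores = route([0.5] * 8, W1, W2)

# Cost comparison with k = 128 and d_ff = 4096; d_model = 1024 is an assumption.
router_cost = router_mults(1024, 128)
ffn_cost = ffn_mults(1024, 4096)
ratio = router_cost / ffn_cost
print(len(scores), router_cost, ffn_cost, ratio)
```

Under these assumptions the router needs about 3.5% of the multiplications of a single FFN projection, which is consistent with the claim that the routing cost is small relative to the FFN.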

Table 2: Comparisons of different combinations of expert construction and selection methods using T5-Large. The first row is the original performance. The best results in each group are underlined and the best results on each dataset are in boldface.

Table 3: Speedup of FLOPS, CPU and GPU with different ratios of selected neurons.

Table 4: Results of T5-Large on GLUE benchmark and two QA datasets. The last row reports the differences between the original model and MoE+Calib. MoEfied models with parameter calibration achieve comparable performance to original models.

Table 5: Average ratio of activated neurons for each input. MLM represents the pre-trained models with masked language modeling. SST-2, MNLI, RACE represent the fine-tuned models on each dataset.
report the results of encoder layers because all ratios of decoder layers are smaller than 0.001. From this figure, we can see that the overall ratio is small and these graphs are suitable for partitioning.

Table 6: Comparisons of MoE models pre-trained from scratch and modified by MoEfication. We report MLM loss on the validation set. Standard pre-training with MoEfication is better than pre-training an MoE model from scratch.