AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation

Mixture-of-Experts (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks. Existing works in MoE mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network. Furthermore, existing MoE works do not consider computational constraints (e.g., FLOPs, latency) to guide their design. To this end, we develop AutoMoE -- a framework for designing heterogeneous MoEs under computational constraints. AutoMoE leverages Neural Architecture Search (NAS) to obtain efficient sparse MoE sub-transformers with 4x inference speedup (CPU) and FLOPs reduction over manually designed Transformers, with parity in BLEU score over the dense Transformer and within 1 BLEU point of the MoE SwitchTransformer, on aggregate over benchmark datasets for NMT. A heterogeneous search space with dense and sparsely activated Transformer modules (e.g., how many experts? where to place them? what should be their sizes?) allows for adaptive compute -- where different amounts of computation are used for different tokens in the input. Adaptivity comes naturally from routing decisions which send tokens to experts of different sizes. AutoMoE code, data, and trained models are available at https://aka.ms/AutoMoE.


Introduction
Sparsely activated models like the Mixture-of-Experts (MoE) (Fedus et al., 2022b) perform conditional computation in which only a subset of the weights of the network are activated per input. Selective compute allows us to design neural networks with a large number of model parameters, without a significant increase in computational cost. With increased capacity, these sparse models have demonstrated state-of-the-art performance in natural language tasks such as neural machine translation (NMT) (Kim et al., 2021; Kudugunta et al., 2021; Zuo et al., 2022).
MoE architectures require several design choices: (a) Expert placement: identifying the Transformer layers in which to introduce expert sub-networks. (b) Number of experts: how many experts to place in different layers? (c) Expert FFN size: what should be the feedforward network (FFN) size for each expert? Given the large search space of potential architectures and the exorbitant computational cost of training and evaluating them, existing approaches manually design MoE architectures from a highly restricted homogeneous space. For instance, they use the same number of experts of the same capacity in different layers and make ad-hoc decisions like introducing experts in every other layer (Fedus et al., 2022b; Kim et al., 2021; Zuo et al., 2022; Du et al., 2022; Artetxe et al., 2021) or every four layers (Zoph et al., 2022).
While these MoE's support conditional computation, homogeneity (specifically, fixed-size experts) results in the same amount (albeit different subsets) of weights being applied to each input. We hypothesize that this is not an optimal solution, and that we can reduce the number of experts (in some layers) to reduce communication cost, and the size (of some experts) to reduce computation cost, resulting in reductions in model size, FLOPs and latency without much quality degradation.
This naturally extends MoEs to be adaptive compute models (similar to work on early exit (Schuster et al., 2022)) where different amounts of computation are used for different inputs. The adaptivity comes naturally from the routing decisions, which send tokens to experts of different sizes.
The above observations are depicted in Table 1, which shows demonstrative examples of manually designed MoE's vs. those designed by our AutoMoE framework. We compare these architectures against various computational metrics (e.g., latency, FLOPs, active MoE parameters), architectural configurations and task performance. For the most efficient configuration (last row in the table), AutoMoE reduces the number of decoder layers, compensating for the capacity with increased experts in the bottom layer, and places most of the experts in the encoder. Overall, AutoMoE introduces the following components and contributions:
• Heterogeneous design with adaptive computation for MoEs, with variable number, size and placement of experts in both encoders and decoders.
• MoE subnetworks with reduced FLOPs and latency over both manually designed Transformers and NAS-searched dense architectures (e.g., HAT (Wang et al., 2020) and Evolved Transformer (So et al., 2019)).

Background
Mixture-of-Experts: MoEs have a rich literature in machine learning dating back to the early 90s (Yuksel et al., 2012). They have received significant attention with works such as (Shazeer et al., 2017), Switch Transformers (Fedus et al., 2022b), GShard (Lepikhin et al., 2020), BASE (Lewis et al., 2021), Hash (Roller et al., 2021), GLaM (Du et al., 2022), Stochastic Experts (Zuo et al., 2022), Gating Dropout (Liu et al., 2022) and ST-MoE (Zoph et al., 2022). Crucial differences among these works include the choice of expert routing function, expert placement technique, stability/performance enhancement techniques, and the nature of the task (pre-training vs. fine-tuning). Challenges in building sparse expert models include: (i) lack of diversity in expert design (expert layer selection, number of experts, expert size, etc.), (ii) training instability, (iii) poor out-of-distribution generalization, (iv) cross-task adaptation of pre-trained models, (v) communication bottlenecks, (vi) high memory, and (vii) expert load balancing, to name a few. A comprehensive review of recent sparse expert models can be found in (Fedus et al., 2022a).
MoE design: Most works in MoE rely on ad-hoc manual choices for expert placement, number of experts and their sizes. Existing approaches mostly use manual design, where they add experts on (i) alternate layers (Fedus et al., 2022b; Kim et al., 2021; Zuo et al., 2022; Du et al., 2022; Artetxe et al., 2021), (ii) every four layers (Zoph et al., 2022), or (iii) the final few layers (Rajbhandari et al., 2022). While these MoE's support conditional computation, they generally do not support adaptive compute, since the same number of expert parameters applies to every input, largely owing to their homogeneous design (e.g., all experts of the same size). Further, MoE design is generally agnostic to the computational constraints (e.g., latency, memory) of the hardware on which the MoE model is to be deployed.
Neural Architecture Search (NAS): Given a search space of architectures and efficiency constraints (e.g., model size, latency), NAS typically aims to identify the optimal architecture that maximizes task performance while satisfying the efficiency constraints. NAS has recently been used for natural language understanding tasks to build efficient BERT (Devlin et al., 2019) and GPT (Brown et al., 2020) based pre-trained language models (Xu et al., 2021; Yin et al., 2021; Xu et al., 2022a,b; Gao et al., 2022; Dong et al., 2021; So et al., 2021; Javaheripi et al., 2022) as well as for machine translation tasks (So et al., 2019; Wang et al., 2020).
Hardware-Aware Transformers (HAT) (Wang et al., 2020) is a state-of-the-art NAS framework with dense Transformers for MT that uses hardware latency as feedback for optimization.
However, all of the above NAS works consider a search space with densely activated Transformers and non-MoE architectures; they primarily search over typical Transformer architectural hyper-parameters like the number of layers, attention heads and hidden size. In contrast, we propose the first NAS framework that searches for efficient sparsely activated Mixture-of-Experts modules in Transformers. Our heterogeneous AutoMoE framework addresses some longstanding design choices for MoE's: how many experts? in which layers to place them? what should be their sizes? and so on.

Designing Heterogeneous Mixture-of-Experts

We now present the components of the AutoMoE framework (illustrated in Figure 1) for designing efficient MoE's under computational constraints.

Heterogeneous MoE Search Space
Existing MoE approaches restrict their design space by considering a uniform distribution of the size and number of experts placed in different Transformer layers. For instance, the standard MoE design (Fedus et al., 2022b) for an L-layer Transformer with M experts placed in alternate layers has only two possible configurations, viz., experts in all odd layers or experts in all even layers. (a) Our design space allows a variable number of experts in each layer, resulting in M^L possible configurations. (b) Furthermore, our design space also allows variable expert size, e.g., by modulating the width of the feedforward (FFN) subnetworks for different experts. Considering N possible FFN dimensions for each expert results in N^{ML} possible configurations for designing the expert space. (c) Finally, given the autoregressive nature of tasks like neural machine translation, the inference cost is dominated by the decoder (Kasai et al., 2021). For instance, for token-based MoE, decoders take 200× the time per step compared to encoders at peak throughput (Kudugunta et al., 2021). Therefore, we further consider a variable number of decoder layers along with the above choices for expert placement and expert capacity.
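The combinatorics above can be made concrete with a small sketch. The helper names, layer count and FFN dimensions below are illustrative values (loosely following Table 2), not AutoMoE's actual implementation: one function samples a heterogeneous configuration with a variable number of experts per layer and a variable FFN width per expert, and another counts the M^L expert-placement choices.

```python
import random

def sample_moe_config(num_layers=6, max_experts=4,
                      ffn_dims=(1024, 2048, 3072), seed=0):
    """Sample one heterogeneous MoE configuration: each layer gets its
    own number of experts, and each expert its own FFN width."""
    rng = random.Random(seed)
    return [{"layer": i,
             "ffn_sizes": [rng.choice(ffn_dims)
                           for _ in range(rng.randint(1, max_experts))]}
            for i in range(num_layers)]

def num_expert_placements(num_layers, max_experts):
    """M^L choices for how many experts to place in each of L layers."""
    return max_experts ** num_layers
```

Even with these toy bounds, the placement space alone has 4^6 = 4096 configurations before expert widths are considered, which is why the search in Section 3.2 relies on a Supernet rather than training each candidate from scratch.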
To the best of our knowledge, our work is the first to study such a flexible and exhaustive design space for MoE architectures.
In addition to heterogeneous experts, we allow a flexible design for non-expert Transformer modules such as the number of attention heads, hidden size and intermediate feedforward dimensions. This heterogeneous design of non-MoE, i.e., dense Transformer modules, has been explored in prior works such as HAT (Wang et al., 2020) for generation tasks like NMT, and AutoDistil (Xu et al., 2022a) for understanding tasks like those in the GLUE benchmark (Wang et al., 2018). Table 2 shows our search space. We demonstrate that our heterogeneous MoE search performs better than both manual and NAS-searched architectures in the dense space.

Searching for Efficient MoE Subnetwork with Computational Constraint
AutoMoE search is based on an evolutionary algorithm that takes the hardware computational constraint (e.g., CPU latency ≤ 600ms) as input and aims to identify the MoE subnetwork from the Supernet which achieves maximum accuracy for the task while satisfying the constraint. The latency estimate for each architecture is obtained by measuring the latency directly on the target device. The standard approach measures gold latency by forward propagation of a batch of examples for a large number (e.g., 300) of passes and then computes the truncated mean (after removing the bottom and top 10% outlier latencies). This latency estimation can be costly given the large space of candidate architectures. To overcome this challenge, AutoMoE uses partially gold latency, obtained by forward propagation of a batch of examples for a small number (e.g., 100) of passes and then computing the truncated mean. After the search is completed, the MoE architecture with the highest performance is selected as the optimal one.
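The truncated-mean latency estimate can be sketched as follows. This is an illustrative helper, not AutoMoE's code; `forward_fn` stands in for a batched forward pass of a candidate subnetwork, and ~100 passes correspond to the partially gold estimate while ~300 correspond to the gold estimate.

```python
import time

def truncated_mean_latency(forward_fn, n_passes=100, trim=0.1):
    """Time `forward_fn` over n_passes forward propagations, discard the
    fastest and slowest `trim` fraction (outliers), and average the rest."""
    times = []
    for _ in range(n_passes):
        start = time.perf_counter()
        forward_fn()
        times.append(time.perf_counter() - start)
    times.sort()
    k = int(len(times) * trim)       # number of outliers to drop on each side
    kept = times[k:len(times) - k]
    return sum(kept) / len(kept)
```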

Training Efficient MoE Sub-Transformer
Once the optimal MoE architecture is identified, we train the model weights for the final architecture to convergence for the same number of training steps as our baseline models for a fair comparison.

Experiments
Datasets and evaluation metrics.
We evaluate AutoMoE on standard machine translation benchmarks: WMT'14 En-De, WMT'14 En-Fr and WMT'19 En-De, with dataset statistics in Table 3. We use pre-processed datasets and the evaluation setup from (Wang et al., 2020). We report BLEU score (Papineni et al., 2002) as the performance metric, with a beam of size 5 and a length penalty of 0.6 (for WMT).
Baselines. We compare AutoMoE against both manually designed and NAS-searched architectures.
For NAS baselines, we consider (c) HAT (Wang et al., 2020), a Supernet-based state-of-the-art NAS framework for identifying efficient dense sub-Transformers for neural machine translation (the same task setting as ours); and (d) Evolved Transformer (So et al., 2019), one of the earlier works on finding efficient dense sub-Transformers with evolution-based architecture search. Note that both NAS baselines apply only to dense non-MoE Transformers; AutoMoE is the first work to leverage NAS to identify efficient sparse MoE sub-Transformers. Finally, we consider (e) AutoMoE with Random Search (typically treated as a strong baseline for NAS), which samples an MoE subnetwork (given latency constraints) randomly from the AutoMoE search space and trains it till convergence.
Training configurations and search space. All the baselines and AutoMoE, including the Supernet and final models, are trained with the same settings for fair comparison. All models are trained for 40K steps, with a warmup of 10K steps from 10^-7 to 10^-3, and use cosine annealing to 10^-7 for the rest of the steps. All models are trained using the fairseq toolkit (Ott et al., 2019) with an effective batch size of 524K tokens on 16 V100 GPUs. All the NAS baselines share the same search space for dense Transformer modules (e.g., number of decoder layers, q-k-v dimension, attention heads, etc.), with AutoMoE further incorporating MoE-relevant aspects (e.g., experts, gating, routing, etc.)
in the search space. The number of encoder layers is kept fixed for all the NAS baselines including AutoMoE, since the latency is primarily determined by the decoder for autoregressive generation (as we discuss in Section 5.2).
Evolutionary search setup. For performance estimation, we monitor the validation loss of subnets on the NMT task. We compute latency by measuring the time taken to translate from a source sentence to a target sentence with the same desired input/output length (30 for WMT) and the original beam settings (see Section 4) on the target device
Figure 3: Architecture analysis for AutoMoE-generated MoEs. We sample several architectures from the Pareto frontier for AutoMoE subnets, and report aggregate statistics in terms of the impact on different computational metrics.
(Intel Xeon CPU). We measure latency 300 times for gold estimates (to report final metrics) and 100 times for partially gold estimates (during evolutionary search); we discard the top and bottom 10% (outlier latencies) and compute the mean of the rest. Hyper-parameter settings for the evolutionary search are: 15 iterations, population size of 125, parents' size of 25, mutation population size of 50 with mutation probability of 0.3, and crossover population size of 50. Unless otherwise stated, the latency constraint for all experiments is set to 600ms.
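The evolutionary loop with the hyper-parameters above can be sketched as follows. This is a simplified, single-objective version, not AutoMoE's actual code: the callables are placeholders for AutoMoE's sampling, mutation, crossover, Supernet scoring (validation loss, lower is better) and partially gold latency routines.

```python
import random

def evolutionary_search(sample_fn, mutate_fn, crossover_fn,
                        fitness_fn, latency_fn, latency_cap=0.6,
                        iterations=15, population=125, parents=25,
                        mutation_pop=50, mutation_prob=0.3,
                        crossover_pop=50, seed=0):
    """Constraint-aware evolutionary search: only candidates whose
    estimated latency satisfies the cap enter the population."""
    rng = random.Random(seed)

    def feasible(candidate_fn):
        # Resample until the latency constraint is satisfied.
        while True:
            cand = candidate_fn()
            if latency_fn(cand) <= latency_cap:
                return cand

    pop = [feasible(lambda: sample_fn(rng)) for _ in range(population)]
    for _ in range(iterations):
        # Keep the top candidates (lowest validation loss) as parents.
        top = sorted(pop, key=fitness_fn)[:parents]
        mutants = [feasible(lambda: mutate_fn(rng.choice(top), rng,
                                              mutation_prob))
                   for _ in range(mutation_pop)]
        children = [feasible(lambda: crossover_fn(rng.choice(top),
                                                  rng.choice(top), rng))
                    for _ in range(crossover_pop)]
        pop = top + mutants + children
    return min(pop, key=fitness_fn)
```

As a toy usage, searching over integers with fitness `abs(c - 42)` and latency `c / 100` under a 0.6 cap converges to a value near 42 while never exceeding the constraint.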

AutoMoE vs. Baseline Performance
Table 4 presents a comparison of AutoMoE with baselines on several computational metrics and task performance. We report the number of parameters without embedding weights, and FLOPs without the last decoding layer, for all models, consistent with (Wang et al., 2020). Compared to all other models (both dense and sparse), we observe that AutoMoE generates networks with high sparsity, resulting in massively reduced active parameters and FLOPs. For the NAS models, we train the top-2 sub-Transformers on the Pareto frontier and report the one with the best trade-off in BLEU vs. FLOPs on the validation set. The maximum number of experts for the best performance varies across tasks, with 6 experts for WMT'14 En-De and 16 experts for WMT'14 En-Fr and WMT'19 En-De -- the latter two datasets being 10× larger than the former.

Analysis
Decoder layers vs. FLOPs. Figure 3 (a) shows the average FLOPs for several AutoMoE architectures with different numbers of decoder layers (varying from 3 to 6) as obtained from the Pareto frontier during our search, and for baseline models. Notice that FLOPs increase with the number of decoder layers, given the auto-regressive nature of NMT tasks, which requires generating tokens sequentially. In contrast to manually designed Transformers with 6 decoder layers (both dense and sparsely activated MoE variants), AutoMoE- and HAT-searched architectures reduce the number of decoder layers, with a resulting decrease in both FLOPs and latency. This is also evident in Figure 3 (e), which shows that decoder latency dominates the total inference latency for all models by more than 90%.
Expert distribution in encoder vs. decoder. Figure 3 (b) plots the number of encoder experts as a ratio of total experts for AutoMoE-generated sub-Transformers. We observe that AutoMoE assigns a significantly larger number of experts to the encoder than to the decoder. As a result, encoders have much higher capacity (i.e., encoder parameters as a proportion of overall parameters) than decoders. This correlates with the earlier observation that models with more encoder layers than decoder layers enjoy a better latency-performance trade-off (Kasai et al., 2021). Our findings from AutoMoE-designed architectures indicate that the number of layers and the number of experts are two knobs that jointly help modulate encoder capacity and decoder latency to design efficient MoE.
Expert distribution in different layers. Figures 3 (c) and (d) show the percentage of experts allocated to different layers for encoders and decoders, averaged over several sampled architectures from the AutoMoE Supernet. Notice that the middle encoder layers (3rd, 5th) are allocated the maximum number of experts, while the first layer receives the least. The trend reverses for the decoder, with the first layer receiving the most experts and a gradual reduction in expert allocation thereafter. This is also consistent with keeping decoders light by dropping layers to reduce latency, while compensating for the reduced capacity with increased experts in the first few layers.
Latency vs. FLOPs as constraint for search. Table 6 presents the impact of latency and FLOPs as computational constraints on the performance-efficiency trade-off. (We use the same hyper-parameters for all models with no tuning, provided in code. Given 40K training steps for each model and no tuning, MoE numbers may not be comparable to SOTA numbers, which typically train for more steps. HAT and Evolved Transformer numbers are reported from (Wang et al., 2020); we follow their evaluation and reporting protocol.) Constraining FLOPs results in models that fully exhaust the FLOPs budget, while leading to higher latency. On the other hand, constraining latency tends to under-utilize the budget, leading to relatively superior FLOPs and latency and providing stricter control. Table 5 presents Pareto-optimal AutoMoE-generated MoE architectures for WMT'14 En-Fr and WMT'19 En-De.
MoE search space variations. Table 7 presents the impact of search space choices on the MoE efficiency and performance trade-off. The first variation is to make '#Encoder Layers' an elastic search dimension. Note that both HAT and AutoMoE consider the number of encoder layers to be fixed (refer to Table 2). We observe that varying encoder layers has a relatively poor trade-off between model performance and efficiency compared to varying decoder layers, reinforcing our prior observations on the importance of encoder capacity and depth.
In the second variation (see the third major row), we fix the expert architecture (with 2 experts manually placed uniformly) in the search space and only search for standard Transformer hyper-parameters. Observe that AutoMoE-designed models have better FLOPs than such manually designed ones.
The last variation introduces identity or dummy experts (i.e., experts with 0 intermediate FFN size, equivalent to an identity operation). This explores the idea that we can skip the computation for some tokens based on context, rather than always forcing them through an FFN. We observe that identity experts marginally hurt performance but significantly reduce FLOPs (see the last major row).
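A back-of-the-envelope sketch shows why identity experts cut FLOPs. All sizes and token counts below are made-up illustrative values; the FLOP formula counts only the two FFN matrix multiplies, ignoring biases and activations.

```python
def ffn_flops(d_model, d_ffn, n_tokens=1):
    """Approximate FLOPs for a two-layer FFN expert: two matrix multiplies
    of size d_model x d_ffn per token, a multiply-accumulate as 2 FLOPs."""
    return n_tokens * 2 * (2 * d_model * d_ffn)

# One heterogeneous expert layer: a full expert, a small expert, and an
# identity expert (intermediate size 0, i.e., a pass-through).
expert_sizes = [3072, 1024, 0]
tokens_per_expert = [10, 10, 10]   # tokens the router sends to each expert
hetero = sum(ffn_flops(512, s, n)
             for s, n in zip(expert_sizes, tokens_per_expert))
# Homogeneous baseline: every token goes through a full-size expert.
homo = ffn_flops(512, 3072, sum(tokens_per_expert))
```

Under these toy numbers, the heterogeneous layer spends FLOPs only on the tokens routed to non-identity experts, so `hetero < homo`; tokens routed to the identity expert cost nothing in the FFN.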

Conclusion
AutoMoE is the first framework to design heterogeneous MoE's under computational constraints. It supports adaptive computation, i.e., variable compute for different inputs with variable-size experts. It leverages NAS to explore a heterogeneous search space with variable number, size and placement choices for experts, alongside other standard Transformer architectural hyper-parameters. AutoMoE-generated MoE subnetworks reduce FLOPs and latency over both manually designed and NAS-searched architectures on benchmark MT tasks.

Limitations
Given our focus on finding efficient MoE models under computational constraints, AutoMoE's search space and evaluation have been restricted in scale to big-sized Transformer models for benchmark MT tasks. A natural extension of this work is to explore the limits of MoE models like SwitchTransformers (Fedus et al., 2022b) and GShard (Lepikhin et al., 2020), which are significantly larger, containing billions to trillions of parameters; as well as designing sparse and transferable efficient expert models (Zoph et al., 2022) for diverse types of tasks like reasoning, summarization and understanding.
The limitations of this work are as follows: 1. Sandwich sampling (Yu et al., 2019), inplace knowledge distillation (Yu et al., 2019), and gradient conflict reduction (Gong et al., 2022) are popular techniques to improve the training procedure of a supernet. It would be interesting to study the impact of these techniques on AutoMoE's Supernet.
2. AutoMoE uses the hidden dimension of the intermediate feedforward network (FFN) to modulate the capacity of each expert. It would be interesting to study other techniques to modulate expert capacity, such as stacking a variable number of hidden layers in the FFN.
3. The backbone of AutoMoE's supernet uses Switch Transformer, which adds FFN based expert layers and routes each token to exactly one expert (top-1 routing).It would be interesting to: (i) search for the number of tokens to route, and (ii) search for the Transformer component (e.g., FFN, self-attention projection layers, LayerNorm) to add expert layers.

4. AutoMoE's search space contains classical Transformer components such as multi-head attention and FFN layers. It would be interesting to add components that are efficient by design, such as convolutional layers, FLASH (Hua et al., 2022), and gMLP (Liu et al., 2021).

liance of Canada. Lakshmanan's research was supported in part by a grant from NSERC (Canada).

Figure 1: AutoMoE Framework. (1) Heterogeneous MoE with variable dimensions for dense Transformer blocks and sparsely activated expert modules. (2) Supernet training by sampling subnetworks from the search space and training them by sharing common weights with the Supernet. (3) Evolutionary search to find efficient architectures by (a) sampling MoE subnetworks from the search space; (b) using latency measured on the target device; and (c) performance estimation from the Supernet as feedback for iterative optimization via crossover and mutation. (4) Efficient MoE subnetwork(s) from the evolutionary search are trained on the downstream task.

Figure 2: Weight sharing in the MoE Supernet for sparsely activated expert modules.

For the second expert, the weight matrices of shape 1024 × 512 (Input) and 512 × 1024 (Output) are extracted from the first 1024 rows, 512 columns (Input) and the first 512 rows, 1024 columns (Output) of the corresponding Supernet weights. This example is illustrated in Figure 2 (b). The subnet extraction technique does not extract weights from the third and fourth experts of the Supernet, as the subnet is designed to have only two experts (not shown in the figure). Such a weight sharing technique allows us to design architectures with varying intermediate FFN sizes for each expert. Additional techniques for improving expert capacity, such as stacking FFNs, and techniques for improving Supernet performance, such as sandwich sampling (Yu et al., 2019), inplace knowledge distillation (Yu et al., 2019) and gradient conflict reduction (Gong et al., 2022), are left for future work.

Table 1: Manual vs. AutoMoE-designed MoE for illustration, with a 6-layer encoder-decoder Transformer. Detailed results in Table 4. We report computational metrics (measured on Intel Xeon CPU) and BLEU scores of MoE's on the WMT'14 En-De MT task. The number of experts per layer is separated by hyphens (-) for encoder and decoder.

Table 2: Search space of AutoMoE compared to manually configured Transformer Base / Big. 'PL' and 'PE' refer to per-layer and per-expert search dimensions. Decoder arbitrary attn. searches the last k encoder layers to attend to for each decoder layer. FFN size varies across layers and experts. M denotes the maximum number of experts per layer.
AutoMoE extends weight-sharing techniques from Neural Architecture Search that were developed for standard non-MoE architectures. We extend Supernet training to the search space for MoE's by incorporating experts, gating and routing protocols. Typically, a Supernet consists of thousands of subnetworks that are all jointly trained via weight sharing. The Supernet for AutoMoE is the largest sparsely activated MoE in the search space: it consists of the maximum number of experts (M) placed in every layer of the Transformer, in both encoder and decoder. Each expert layer consists of (i) a router, which learns to route each incoming example to exactly one expert (out of M experts) for top-1 routing; and (ii) FFN experts, standard Transformer FFN blocks that have unique weights and are learned independently. AutoMoE's expert layers follow the Switch Transformer (Fedus et al., 2022b) specification.

For subnetwork extraction from the Supernet, AutoMoE extracts the front rows and front columns of the Supernet's router weight matrix, corresponding to the subnet design. For example, consider a Supernet router designed for 4 experts and an embedding size of 640, so that the router weight matrix has shape 4 × 640. Consider a sampled subnet during Supernet training with 3 < 4 experts and 512 < 640 embedding size, so that the subnet's router matrix is 3 × 512. To populate this matrix, we extract the first 3 rows and first 512 columns from the Supernet's weight matrix (as illustrated in Figure 2 (a)). Such a weight sharing technique allows us to design heterogeneous MoE architectures with a varying number of experts in each Transformer layer.

AutoMoE also extracts front rows and front columns from the weight matrices of each FFN expert in the Supernet, corresponding to the subnet design. For the previous example, assume the intermediate FFN size of each expert in the Supernet to be 3072 (so the weight matrix of the first FFN layer has shape 3072 × 640 and that of the second 640 × 3072). Assume the sampled subnet is designed for 2 experts, with an intermediate FFN size of 2048 for one expert and 1024 for the other. For the first expert, the subnet weight matrices of shape 2048 × 512 (Input) and 512 × 2048 (Output) are extracted from the first 2048 rows, 512 columns (Input) and the first 512 rows, 2048 columns (Output) of the corresponding Supernet weights.
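The front-rows/front-columns extraction can be sketched in a few lines. This is an illustrative stand-in, not AutoMoE's code: plain nested lists represent the weight tensors, with each entry tagged by its (row, column) index so the sharing is easy to verify.

```python
def extract_subnet_weight(super_weight, sub_rows, sub_cols):
    """Weight sharing by slicing: keep the first sub_rows rows and the
    first sub_cols columns of a Supernet weight matrix (list of lists)."""
    return [row[:sub_cols] for row in super_weight[:sub_rows]]

# Toy Supernet router: 4 experts x 640 embedding dims.
supernet_router = [[(i, j) for j in range(640)] for i in range(4)]
# Sampled subnet: 3 experts, 512 embedding dims -> 3 x 512 router matrix.
subnet_router = extract_subnet_weight(supernet_router, 3, 512)
```

Because every subnet reads its weights from the same front slice, gradients from all sampled subnets update the same Supernet entries, which is what makes joint weight-shared training possible.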

The algorithm works by sampling an initial set of MoE candidate architectures randomly from the Supernet, evolving the top architectures iteratively by mutation, followed by crossover, until the search iterations are exhausted. Candidate MoE architectures are ranked by the Supernet performance estimator, based on the validation score for the task.

Table 3: Machine translation benchmark data.

Table 4: Comparison of AutoMoE vs. baselines, with Pareto-optimal architectures highlighted in blue. We report active model parameters, and sparsity measured as non-active parameters as a percentage of total parameters. We train all baselines and AutoMoE for the same 40K training steps for fair comparison to report BLEU. Training time (with search, if applicable) is reported in hours for one Nvidia V100 GPU. Inference latency is measured on an Intel Xeon CPU. AutoMoE significantly reduces FLOPs and latency with parity in BLEU, on aggregate, over NAS methods in the dense search space (e.g., 1.3× and 2.4× FLOPs reduction and speedup over HAT and Evolved Transformer). AutoMoE with Random Search obtains the best speedup but results in a significant regression in BLEU.

Table 5: AutoMoE-generated Pareto-optimal architectures for different datasets. FFN intermediate sizes for fractional experts (i.e., varying expert sizes within each layer) are enclosed within square brackets.

Table 6: Impact of latency and FLOPs constraints on the WMT'14 En-Fr dataset. Latency is computed on 1 NVIDIA V100 GPU.