Scaling Vision-Language Models with Sparse Mixture of Experts

The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs). These models aim to bridge the gap between text and visual information, enabling a more comprehensive understanding of multimedia data. However, as these models become larger and more complex, they also become more challenging to train and deploy. One approach to addressing this challenge is the use of sparsely-gated mixture-of-experts (MoE) techniques, which divide the model into smaller, specialized sub-models that can jointly solve a task. In this paper, we explore the effectiveness of MoE in scaling vision-language models, demonstrating its potential to achieve state-of-the-art performance on a range of benchmarks compared to dense models of equivalent computational cost. Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute cost and performance when scaling VLMs. We hope our work will inspire further research into the use of MoE for scaling large-scale vision-language models and other multimodal machine learning applications.


Introduction
The ability to understand and generate natural language from visual information is a critical component of many real-world applications, including visual question answering (VQA), visual reasoning, and multimodal information retrieval. In recent years, the success of deep learning in natural language processing (NLP) has led to the development of large-scale vision-language models (VLMs) (Tan and Bansal, 2019; Chen et al., 2020; Li et al., 2021b; Gan et al., 2020; Kim et al., 2021a; Alayrac et al., 2022; Wang et al., 2022c; Shen et al., 2022b; Li et al., 2021a; Shen et al., 2022a; Jia et al., 2021; Li et al., 2022; Yu et al., 2022) that leverage powerful neural network architectures to encode and decode multimodal information. However, state-of-the-art vision-language models like Flamingo-80B (Alayrac et al., 2022), BEIT-3-1.9B (Wang et al., 2022b), and PaLI-17B (Chen et al., 2022) can be computationally expensive and difficult to train, which has motivated researchers to explore ways of improving their efficiency and effectiveness.
Recently, sparsely activated Mixture of Experts (MoE) models have been successfully employed to scale both vision (Riquelme et al., 2021; Lou et al., 2021; Mustafa et al., 2022) and text models (Shazeer et al., 2017; Lepikhin et al., 2020; Zoph et al., 2022; Du et al., 2022). These models are motivated by the need to increase model parameters while controlling compute costs. In addition, these models provide other advantages, including sparsity that can mitigate catastrophic forgetting in continual learning (Collier et al., 2020; Komatsuzaki et al., 2022), and an inductive bias that can enhance performance in multitask learning (Ma et al., 2018; Kudugunta et al., 2021; Kim et al., 2021b). Overall, the use of MoEs has proven to be a promising strategy for scaling deep learning models across various domains.
Building on the success of MoEs in individual domains, and applying the intuition that sparse models may handle different tasks better than their dense counterparts, we investigate the potential of MoEs for vision-language modeling. To this end, we take the first step in this direction and explore models that can process both images and text for vision-language tasks. One similar effort has been studied in LIMOE (Mustafa et al., 2022), where the authors proposed a modal-agnostic CLIP-style (Radford et al., 2021) multimodal MoE architecture, but their focus is mainly on the contrastive pre-training objective and vision-only downstream tasks. There are two limitations in this setting: (1) The increased model capacity of MoEs under the simple contrastive objective can easily lead to over-fitting issues.
(2) The vision-only benchmarking does not reveal the full power of scaling up multimodal models. Alternatively, our goal is to demonstrate the effectiveness of MoEs under generative modeling for vision-language tasks and provide a more comprehensive foundation for future research in this area. Specifically, we propose a novel VLM architecture that employs MoE to scale both the text-based and vision-based feed-forward networks (T-FFN and V-FFN, respectively) in a unified framework. Our approach divides the model into multiple sub-models, each of which is responsible for processing a modal-specific subset of the input data. The text and vision input representations are then aligned via three masked data modeling objectives (Wang et al., 2022b).
We train a range of VL-MoE models and evaluate them on vision-language classification, vision-language retrieval, vision-only, and language-only tasks. Our experiments demonstrate that MoE can significantly improve the efficiency and effectiveness of VLMs, enabling them to handle large-scale, real-world multimedia data. We scale a BASE-size model up to a 1.8B-parameter VL-MoE LARGE/16E, which applies only 560M parameters per token and achieves competitive performance with dense models that use similar or more pre-training image-text pair data and apply 3-4× more parameters per token.
In summary, our contributions are as follows:
• We propose VL-MoE, the first large-scale generative MoE multimodal model for vision-only and language-only tasks, as well as vision-and-language tasks.
• We explore various scaling strategies, including increasing dense model size, increasing expert numbers, and scaling either T-FFN or V-FFN alone, to investigate the trade-offs between model complexity and performance on various downstream tasks.
• We present ablations to understand VL-MoE model's behavior, interpretability, and our design choices.
Related Work
For model architecture, there are two main designs. The first design, utilized by models such as (Radford et al., 2021; Jia et al., 2021; Yuan et al., 2021), separately encodes each modality with different encoders. While this approach performs well for image-text retrieval tasks, it struggles with complex vision-language tasks like visual reasoning. The second design, employed by models like (Tan and Bansal, 2019; Li et al., 2021a; Lu et al., 2019; Li et al., 2019; Kim et al., 2021a; Chen et al., 2022; Alayrac et al., 2022), uses a complex fusion module with cross-modal attention to combine modalities. However, this design sacrifices efficiency for improved performance. Recently, a new design has emerged with the MOME Transformer used in both VLMO and BEIT-3. This design unifies the dual-encoder and fusion-encoder models by introducing a mixture-of-modality-experts technique. With MOME, the various modalities are encoded within a shared Transformer block, allowing for improved scalability and achieving state-of-the-art performance on vision-language tasks. There is increasing interest in growing VL model capacity within an affordable compute budget, including MoE (Mustafa et al., 2022) and the injection of new trainable modules into pre-trained models (Alayrac et al., 2022; Shen et al., 2022a; Liu et al., 2023b; Li et al., 2023d,b; Koh et al., 2023); the former remains less studied.
For pretraining objectives, multiple cross-modal pretraining objectives have been studied. They can be categorized into two classes: (1) Discriminative modeling, including image-text contrastive learning (Radford et al., 2021; Jia et al., 2021), image-text matching (Tan and Bansal, 2019; Kim et al., 2021a; Li et al., 2021a; Bao et al., 2022b), and word-patch/region alignment (Chen et al., 2020; Kim et al., 2021a); (2) Generative modeling, including masked language modeling (Tan and Bansal, 2019; Su et al., 2020; Kim et al., 2021a) or prefix language modeling (Wang et al., 2022c), masked region modeling (Tan and Bansal, 2019), and multimodal prefix language modeling (Wang et al., 2022c). Recently, BEIT-3 shows strong scaling results by unifying the generative multimodal pretraining objective with masked data modeling, which comprises masked image modeling and masked language modeling on the monomodal encoders and masked multimodal modeling on the multimodal encoder. In this paper, we perform an MoE study by adopting the MOME Transformer as the backbone dense network and generative (masked data) modeling as the pretraining objective, given its simplicity and scaling ability.
Sparse Mixture of Experts models. We build upon the concept of deep sparse MoEs, which have been studied independently in both Computer Vision (Riquelme et al., 2021; Lou et al., 2021; Mustafa et al., 2022) and Natural Language Processing (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2021; Du et al., 2022; Zoph et al., 2022; Clark et al., 2022; Zhou et al., 2022; Komatsuzaki et al., 2022; Kudugunta et al., 2021; Shen et al., 2023) in the context of conditional computation. The goal of conditional computation is to increase the number of model parameters without a proportional increase in computational cost, which is achieved by selectively activating only relevant parts of the model based on input-dependent factors (Bengio, 2013; Chen et al., 1999; Davis and Arel, 2013). MoE models use a learned gating mechanism that activates only a subset of k experts out of E ≫ k for a given input, allowing an input to select either all experts (Eigen et al., 2013) or only a sparse mixture thereof, as in recent massive language models (Fedus et al., 2021; Du et al., 2022).
While many works aim to improve the gating mechanism itself (Hazimeh et al., 2021;Lewis et al., 2021;Roller et al., 2021;Zhou et al., 2022), MoE models have also been studied for multitask learning (Hazimeh et al., 2021;Kudugunta et al., 2021) with per-task routers (Ma et al., 2018), although a shared pool of experts is typically used.
MoE models have been explored for multimodal learning as well, with LIMOE (Mustafa et al., 2022) and Uni-MoE (Zhu et al., 2022) being most relevant to our work. However, LIMOE considers the CLIP-style contrastive objective for pre-training and vision/retrieval tasks for downstream evaluation. Uni-MoE focuses on routing decisions with limited experts and evaluates on caption/vision/language/retrieval tasks. To the best of our knowledge, the proposed VL-MoE is the first MoE scaling study to consider the generalized generative modeling objective in VL pre-training, and we evaluate its scaling performance in a more comprehensive manner, including vision-only and language-only, as well as vision-and-language tasks.

Method
We first describe the masked data modeling pretraining objectives. We next discuss MoEs and sparse MoEs, and present how we apply the sparse MoE methodology to vision-language models, before explaining our design choices for the routing algorithm and the implementation of VL-MoE.

Vision-Language Masked Data Modeling
We utilize a unified masked data modeling objective (Wang et al., 2022b) to pretrain VL-MoE on monomodal (i.e., images and texts) and multimodal data (i.e., image-text pairs). This approach has been demonstrated to be scaling-friendly with small batch sizes. Our pretraining process involves masked image modeling on monomodal image data, masked language modeling on monomodal text data, and masked vision-language modeling on multimodal image-text pairs.

Masked Language Modeling
We use masked language modeling (MLM) to learn language representations from large-scale text-only data. For MLM, 15% of tokens in monomodal text data are randomly masked, and the model is trained to recover the masked tokens from the corrupted input text. Masked tokens are replaced by a [MASK] token 80% of the time, by a random token 10% of the time, and kept unchanged 10% of the time, following BERT (Devlin et al., 2019).
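The 80/10/10 corruption scheme above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the `-100` ignore-label convention and the integer token ids are assumptions for the sketch:

```python
import random

def mask_for_mlm(token_ids, vocab_size, mask_id, mask_prob=0.15, rng=None):
    """BERT-style masking: select ~15% of positions; of those, 80% become
    [MASK], 10% become a random token, 10% are kept unchanged. Labels are
    -100 (ignored by the loss) everywhere except at selected positions,
    where they hold the original token id to be predicted."""
    rng = rng or random.Random()
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                           # predict the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id                   # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size) # 10%: random token
            # else: 10% keep the original token unchanged
    return inputs, labels
```

Only positions with a non-ignored label contribute to the recovery loss, so the model must reconstruct the original tokens from the corrupted context.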

Masked Image Modeling
In addition to masked language modeling, VL-MoE uses masked image modeling (MIM) to learn vision representations from large-scale image data. For MIM, block-wise masking is applied to 40% of image patches, and the pretraining objective is to reconstruct the discrete visual tokens of the masked patches, following BEiT (Bao et al., 2022a). The image tokenizer of BEITv2 (Peng et al., 2022) is used to obtain the discrete tokens as the reconstruction targets.
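Block-wise masking samples rectangular blocks of patches rather than independent patches. A rough sketch of this procedure follows; the 14×14 grid, block-area range, and aspect-ratio bounds are illustrative assumptions, not the exact BEiT hyperparameters:

```python
import random

def blockwise_mask(grid=14, target_ratio=0.4, min_block=4, rng=None):
    """Sketch of BEiT-style block-wise masking on a grid x grid patch map.
    Rectangular blocks with random area and aspect ratio are sampled until
    at least `target_ratio` of the patches are masked."""
    rng = rng or random.Random()
    mask = [[False] * grid for _ in range(grid)]
    masked = 0
    while masked < target_ratio * grid * grid:
        area = rng.randint(min_block, int(0.25 * grid * grid))
        aspect = rng.uniform(0.3, 1 / 0.3)
        h = min(grid, max(1, int(round((area * aspect) ** 0.5))))
        w = min(grid, max(1, int(round((area / aspect) ** 0.5))))
        top = rng.randrange(grid - h + 1)
        left = rng.randrange(grid - w + 1)
        for i in range(top, top + h):          # mask the sampled block
            for j in range(left, left + w):
                if not mask[i][j]:
                    mask[i][j] = True
                    masked += 1
    return mask
```

Because whole blocks are hidden, the model cannot trivially interpolate a masked patch from its immediate neighbors, which makes the reconstruction task harder than uniform random masking.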

Masked Vision-Language Modeling
To learn aligned vision-language representations, we use masked vision-language modeling (VLM), which extends masked language modeling and masked image modeling to multimodal data. The task aims at recovering masked image patches and text tokens based on visual and linguistic clues. In VLM, text tokens (with a 50% mask ratio) are randomly masked as in MLM, and the model is trained to recover the masked text tokens based on the joint image-text representations. Image patches are also masked with the same ratio as in MIM, and the corresponding visual tokens are predicted based on the image-text pair. The VLM task further encourages the model to learn alignments between image and text pairs.

VL-MoE Architecture
Input Representation. To obtain text representations, the input text is tokenized and projected onto word embeddings {w_i}_{i=1}^M, where M is the length of the tokenized text sequence. Two special tokens, a start-of-sequence token ([T_CLS]) and a special boundary token ([T_SEP]), are added to the sequence. Text representations are obtained by summing the word embeddings and the text position embeddings, resulting in H_0^w = [w_[T_CLS], w_1, ..., w_M, w_[T_SEP]] + T_pos.
For image representations, the input 2D image v ∈ R^(H×W×C) is split and reshaped into N = HW/P^2 patches v_p ∈ R^(N×(P^2·C)), where C is the number of channels, (H, W) is the height and width of the input image, and P is the patch size. These patches are then flattened into vectors and linearly projected to obtain patch embeddings, following vision Transformers (Dosovitskiy et al., 2020; Touvron et al., 2020; Bao et al., 2022a). We prepend a learnable special token [I_CLS] to the sequence. The resulting image input representations are given by H_0^v = [v_[I_CLS], v_1, ..., v_N] + V_pos. To form image-text input representations, we concatenate the image and text input vectors, resulting in H_0^vl = [H_0^w; H_0^v].
The dense backbone network of VL-MoE is a shared multimodal Transformer, illustrated in Figure 1. To encode different modalities, we utilize a mixture-of-modality-experts (MOME) Transformer (Bao et al., 2022b; Wang et al., 2022b), which takes image and text representations of monomodal data, as well as representations of image-text pairs, as input. The MOME Transformer comprises multiple blocks, each consisting of a multi-head self-attention layer and a feed-forward expert layer. While the self-attention module is shared across modalities, each feed-forward expert layer contains a pool of modality-specific experts (V-FFN, T-FFN, or VL-FFN) that act as a substitute for the feed-forward network in standard Transformers. This allows for hard routing over the pool of feed-forward networks based on the modality of the input tokens.
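The patch-splitting step above can be illustrated with a toy sketch on nested lists. Real pipelines implement this with a strided convolution or a tensor reshape, and the linear projection is omitted here:

```python
def patchify(image, P=16):
    """Split an H x W x C image (nested lists) into N = (H/P)*(W/P)
    flattened patch vectors of length P*P*C, matching the ViT-style
    input pipeline described in the text."""
    H, W, C = len(image), len(image[0]), len(image[0][0])
    assert H % P == 0 and W % P == 0, "image must be divisible into patches"
    patches = []
    for top in range(0, H, P):
        for left in range(0, W, P):
            # flatten one P x P x C patch into a single vector
            vec = [image[i][j][c]
                   for i in range(top, top + P)
                   for j in range(left, left + P)
                   for c in range(C)]
            patches.append(vec)
    return patches
```

With the paper's 224×224 images and P = 16, this yields N = 196 patch vectors of length 768 before the linear projection to the model dimension.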
Conditional Computation with MoEs. The concept of conditional computation involves selectively activating different parts of a neural network based on the input (Bengio, 2013). One specific approach is to use a mixture-of-experts (MoE) model, where different "experts" handle different regions of the input space (Jacobs et al., 1991). In this paper, we adopt the MoE layer proposed in (Shazeer et al., 2017), which consists of E experts and is defined as MoE(x) = Σ_{i=1}^{E} g(x)_i · e_i(x). Here, x is the input to the layer, e_i : R^D → R^D is the function computed by expert i, and g : R^D → R^E is the "routing" function that determines the input-dependent weights for the experts. Both e_i and g are implemented as neural networks. Although this formulation still involves a dense network, it can be made sparse by restricting g to assign only k ≪ E non-zero weights, thereby eliminating the computation of unused experts. This approach allows for super-linear scaling of the number of model parameters in both training and inference.
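A minimal sparse version of this layer can be sketched in plain Python. The list-based linear algebra and the argsort-based top-k selection are illustrative simplifications of what a real framework would vectorize:

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def moe_layer(x, experts, W_g, k=1):
    """Sparse MoE layer: g(x) = softmax(W_g x) over E experts; the output is
    the gate-weighted sum of the top-k experts' outputs. `experts` is a list
    of callables R^D -> R^D, W_g an E x D gating weight matrix."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in W_g]
    gates = softmax(logits)
    topk = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
    out = [0.0] * len(x)
    for i in topk:                       # only the selected experts run
        y = experts[i](x)
        out = [o + gates[i] * yi for o, yi in zip(out, y)]
    return out
```

The compute saving comes from the loop body: only the k selected experts are evaluated, while the parameter count grows linearly with E.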
VL-MoE. We apply sparse MoE to vision-language models in the context of the MOME Transformer. As illustrated in Figure 1, inputs from different modalities are routed to the V-FFN and T-FFN in the first (L − F) layers, and to the V-FFN, T-FFN, or VL-FFN in the last F layers. To avoid instability due to modality input imbalance when applying MoEs to the modal-agnostic VL modules, as observed in V-MOE (Riquelme et al., 2021), we only use MoE for the V-FFN and T-FFN in the first (L − F) layers. The V-FFN and T-FFN have two layers and a GeLU (Hendrycks and Gimpel, 2016) non-linearity: V/T-FFN(x) = W_2 σ_gelu(W_1 x). For VL-MoE, we replace a subset of the V-FFNs and T-FFNs with V-MoE and T-MoE layers, where each expert is an FFN with the same architecture, e_i(x) = FFN_{θ_i}(x), but different weights. This design pattern is similar to that of the GShard (Lepikhin et al., 2020) and V-MOE (Riquelme et al., 2021) models. In the V-MoE and T-MoE layers, each token x ∈ R^D is processed sparsely by k out of the E available experts. To select which ones, a lightweight V/T-Router predicts gating weights per token: g(x) = softmax(W_g x) ∈ R^E, where W_g ∈ R^(E×D) is learned. The outputs of the k activated experts are combined linearly according to the gating weights: MoE(x) = Σ_{i ∈ Top-k} g(x)_i · e_i(x). For computational efficiency and due to implementation constraints, each expert in VL-MoE has a fixed buffer capacity, which determines the maximum number of tokens it can process; the assumption is that tokens are approximately balanced across experts. If the capacity is exceeded, some tokens are not processed by the expert and are dropped, leading to a decrease in the success rate. This rate is a vital indicator of balanced routing and training stability. To mitigate this problem, we employ Batch Priority Routing (BPR) (Riquelme et al., 2021; Mustafa et al., 2022), which selectively skips tokens based on their routing weights. BPR prioritizes tokens with larger routing weights, as they are deemed more informative. Our results show that BPR is crucial for the stable training of VL-MoE. We further analyze token routing decisions in Section 5 and dropped tokens in the Appendix.
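The interaction between expert capacity and Batch Priority Routing can be sketched as follows. This is a simplified top-1 version for illustration; the real implementation operates on batched tensors and also handles the k > 1 case:

```python
def bpr_dispatch(gate_rows, capacity):
    """Batch Priority Routing sketch (top-1 routing): each row of
    `gate_rows` holds one token's gating weights over E experts. Tokens are
    dispatched in order of decreasing top gate weight; once an expert's
    buffer of `capacity` tokens is full, later tokens routed to it are
    dropped. Returns a dict token_index -> expert_index; dropped tokens
    are simply absent."""
    E = len(gate_rows[0])
    # (routing weight, token index, chosen expert), highest weight first
    ranked = sorted(
        ((max(g), t, g.index(max(g))) for t, g in enumerate(gate_rows)),
        reverse=True)
    load = [0] * E
    assignment = {}
    for weight, tok, exp in ranked:
        if load[exp] < capacity:        # expert buffer still has room
            load[exp] += 1
            assignment[tok] = exp
    return assignment
```

Compared with processing tokens in batch order, sorting by routing weight means that when an expert overflows, the tokens sacrificed are those the router was least confident about.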

Pretraining Setup
Pretraining Data. Our pretraining process uses both monomodal and multimodal data. The monomodal data comprises ImageNet-22K for images, and English Wikipedia and BookCorpus (Zhu et al., 2015) for text. The multimodal data combines four datasets of image-text pairs: Conceptual Captions (Sharma et al., 2018), SBU Captions (Ordonez et al., 2011), COCO (Lin et al., 2014), and Visual Genome (Krishna et al., 2017), containing a total of 4 million images and 10 million image-text pairs.
Pretraining Setting. For the large-size model, we employ a 24-layer Transformer network with 1024 hidden size and 24 attention heads, following ViT (Dosovitskiy et al., 2020), BEiT (Bao et al., 2022a), and VLMO (Bao et al., 2022b). The VL-FFN is used starting at the 20th layer. The base/small-size model is a 12/8-layer Transformer network with 768/384 hidden size and 12/6 attention heads, where the VL-FFN is used in the 10th/8th layer. We randomly initialize the model parameters using the method described in BEiT (Bao et al., 2022a). The image resolution is set to 224×224, and the patch size is 16×16. The maximum sequence length for text is 96. We use a batch size of 6,144 and train the model from scratch for 200k steps, which is equivalent to 40 epochs over the image-text pairs. Each batch contains 2,048 images, 2,048 texts, and 2,048 image-text pairs. We perform image augmentation using random resized cropping, horizontal flipping, and color jittering, following the same method as BEiT (Bao et al., 2022a). The text data is tokenized using a SentencePiece (Kudo and Richardson, 2018) tokenizer with a vocabulary size of 64k. We use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9 and β2 = 0.999 to optimize the model. The peak learning rate is 2e-3, with linear warmup for the first 10,000 steps and cosine learning rate decay. The weight decay is 0.05; we disable dropout and use stochastic depth (Huang et al., 2016) with a rate of 0.1. The three pretraining losses are equally weighted as in BEIT-3 (Wang et al., 2022b).
MoE Setting. For the default setting of MoEs in VL-MoE BASE/16E, we use E = 16 experts for the T-FFN and V-FFN, respectively. All VL-MoEs activate k = 1 expert per token, similar to Switch Transformer (Fedus et al., 2021) and LIMoE (Mustafa et al., 2022). We replace every second dense T-FFN or V-FFN sublayer with an MoE sublayer, following GShard (Lepikhin et al., 2020) and Switch Transformer (Fedus et al., 2021). We use BPR (Riquelme et al., 2021) for stability. For the auxiliary loss, we use the loading loss of (Shazeer et al., 2017) for the T-FFN's MoE, and the averaged loading and importance losses of V-MoE (Riquelme et al., 2021) for the V-FFN's MoE. The combination ratio for the auxiliary loss is set to 0.01 in all our experiments. We use 32-way expert parallelism and TUTEL (Hwang et al., 2022) for fast routing and computation. All the models are implemented on DeepSpeed (Rasley et al., 2020). Pretraining experiments are done on 32 Nvidia Tesla V100-32GB GPUs. Following ST-MoE (Zoph et al., 2022), we freeze all the MoE modules (router and expert network) during the finetuning process. The capacity factor C is set to 1.05 during training and 1 during inference, following (Riquelme et al., 2021).
Table 1: Finetuning results of different models on vision-language classification tasks and image-text retrieval tasks. We report the VQA score on the VQA test-dev and test-standard splits, accuracy for the NLVR2 development and public test set (test-P), and top-1 recall for image retrieval (IR) and text retrieval (TR). (* denotes a model that is reproduced by us and trained with the same settings as VL-MoE.)

Vision-and-Language Downstream Tasks
In our study, we explore the performance of VL-MoE on vision-and-language downstream tasks through finetuning experiments on three standard tasks: visual question answering (Goyal et al., 2017), natural language for visual reasoning (Suhr et al., 2019), and image-text retrieval (Plummer et al., 2015; Lin et al., 2014). Following BEIT-3, we use 480×480 image resolution for VQA finetuning and 384×384 for the other tasks.
Visual Question Answering (VQA). For VQA, the task is to generate or choose the correct answer given a natural image and a question. Following previous work (Kim et al., 2021a; Bao et al., 2022b; Wang et al., 2022b), we utilize the VQA 2.0 dataset (Goyal et al., 2017) and formulate it as a classification problem over the 3,129 most frequent answers. We finetune VL-MoE as a fusion network to encode both the image and the question. We use the final encoding vector of the [T_CLS] token as the representation of the image-question pair, and feed that into a classifier layer to predict the label.

Natural Language for Visual Reasoning (NLVR2).
The visual reasoning task aims to predict whether a text description is true about a pair of images. We use the NLVR2 (Suhr et al., 2019) dataset for evaluation. Following OSCAR (Li et al., 2020), VinVL (Zhang et al., 2021), and VLMO (Bao et al., 2022b), we reformulate the triplet input into two image-text pairs, each containing the text description and one image. We use VL-MoE as a fusion network to jointly encode the image and text.
The concatenated final vectors of the [T_CLS] token from the two pairs are then fed into a classification layer to predict the label.
Image-Text Retrieval. Image-text retrieval comprises both image-to-text retrieval and text-to-image retrieval, for the two target modalities. We use the widely used COCO (Lin et al., 2014) and Flickr30K (Plummer et al., 2015) datasets to evaluate the model, and adopt the Karpathy split (Karpathy and Fei-Fei, 2015) following common practice. Note that the architecture of VL-MoE and BEIT-3 (Wang et al., 2022b) does not involve the image-text matching module present in CLIP (Radford et al., 2021).
Table 2: Results of base-size models on image classification (ImageNet-1K) and natural language inference (MNLI-m). We report top-1 accuracy for both.
During inference, VL-MoE is used to encode images and text separately, and the matching scores are computed via the dot product of the image and text vectors to obtain the top-k candidates.
Table 1 presents the results of our vision-language model on classification and retrieval tasks, including VQA, NLVR2, COCO, and Flickr30K. To ensure a fair comparison, we provide details on the amount of pretraining image-text pair data, the number of pretraining steps, and the number of parameters per input token. Following LIMOE (Mustafa et al., 2022), we define the number of parameters per input token as the number of parameters that the model applies to each image-text token pair. Notably, VL-MoE LARGE/16E contains 2 billion parameters in total, but applies only 560 million parameters per token. Additionally, all routers combined account for less than 0.5 million parameters. Our model outperforms previous large/base-size models on VQA, NLVR2, COCO, and Flickr30K by a significant margin, particularly when compared to a reproduced BEIT-3 (Wang et al., 2022b) pretrained using the same settings as VL-MoE. Moreover, to the best of our knowledge, VL-MoE is the first to demonstrate that a mixture-of-experts architecture can successfully scale with a comparably modest architecture size and training count while achieving strong generalization across a range of vision-language tasks. Interestingly, Switch Transformer (Fedus et al., 2021) struggles with generalization for language MoE, while V-MOE (Riquelme et al., 2021) and LIMOE (Mustafa et al., 2022) only evaluate on downstream vision tasks. Additionally, VL-MoE even outperforms VLMO LARGE and ALBEF, which are pretrained with more image-text pair data and initialized from pretrained models, on COCO and Flickr30K, and achieves competitive performance on VQA and NLVR2. We conjecture that this may be because the capacity of the VL-FFN has not been scaled in VL-MoE, as reflected in the pretraining plot in Figure 2 (the difference in VLM loss between VL-MoE and the dense BEIT-3 model is smaller than that of the MLM and MIM losses). We leave scaling the VL-FFN module for future work, considering the increasing instability of modal-agnostic MoE architectures demonstrated in LIMOE (Mustafa et al., 2022).

Vision/Language-Only Downstream Tasks
Image Classification. We use the image classification task to evaluate the model on a vision-only downstream task, where the objective is to categorize an input image into its corresponding class. We employ the ILSVRC-2012 ImageNet dataset (Russakovsky et al., 2015), which consists of 1.3M images with 1k classes. Following BEiT (Bao et al., 2022a) and VLMO (Bao et al., 2022b), we perform average pooling over the final vectors and feed the resulting vector into a linear classifier layer to predict the label.
Natural Language Inference. We use the natural language inference task to evaluate the model on a language-only downstream task. The task involves determining the relationship between two pieces of text: given a premise sentence and a hypothesis sentence, the model needs to determine whether the hypothesis is true, false, or undetermined based on the information provided in the premise. We use the Multi-Genre Natural Language Inference (MNLI) (Williams et al., 2018) dataset, which contains 433k sentence pairs annotated with textual entailment information. We evaluate on the matched (MNLI-m) setting only.
As shown in Table 2, we compare VL-MoE with two base-size vision Transformers and V-MOE-B/16-E16 on image classification. For BEIT, BEIT-3 BASE, and VL-MoE BASE/16E, we perform intermediate finetuning on ImageNet-22k to compare with ViT pretrained on ImageNet-22k. The model performs competitively with previous state-of-the-art supervised and self-supervised models on ImageNet-1k. Besides its dense counterpart BEIT-3 BASE, VL-MoE also outperforms other strong vision-language models (e.g., SIMVLM) pretrained with more data and more steps on MNLI-m.

Discussions
We conduct ablation studies to analyze the contributions of the Mixture-of-Experts modules used in VL-MoE from different perspectives. We evaluate the models on visual reasoning (NLVR2), image-text retrieval (Flickr30k), image classification (ImageNet-1k), and natural language inference (MNLI-m).
Table 3: Ablation studies of scaling strategies (all results are based on VL-MoE SMALL/16E models). All the *-MoE variants use 16 experts (where T/V stands for applying MoE on the T/V-FFN).

Scaling Strategy
In addition to scaling both the T-FFN and V-FFN, we have also explored different scaling strategies by applying Mixture-of-Experts (MoE) modules to either the T-FFN or the V-FFN alone. The results of our experiments are presented in Table 3. Our findings indicate that scaling a single modality can improve downstream performance on the corresponding modality as well as on overall vision-language tasks. However, we observed that scaling both the vision and language modalities leads to the most balanced model, with 70.6% averaged performance. This may be attributed to the fact that we employ three different pretraining objectives across modalities, and scaling each modality contributes to better optimization of the specific modality's pretraining loss as well as the VLM loss. For further evidence, we include the pretraining loss in the Appendix.

Number of Experts.
The optimal number of experts in Mixture-of-Experts (MoE) models is still a topic of debate, as there is no agreement on the ideal number. Previous NLP research has experimented with a wide range of expert numbers, from thousands in early studies (Shazeer et al., 2017; Fedus et al., 2021) to as low as 32 or 64 in more recent research (Zoph et al., 2022; Du et al., 2022; Zhou et al., 2022), which has become the standard for vision models (Riquelme et al., 2021; Mustafa et al., 2022). In Figure 5, we investigate this further with VL-MoE, and our findings suggest that larger expert pools consistently yield performance improvements.
Effects of the Auxiliary Losses. As previously mentioned, experts in MoEs have a fixed buffer capacity, and without intervention, top-k MoEs tend to collapse, leading to poor performance as most tokens are dropped (Shazeer et al., 2017; Zhou et al., 2022). To prevent this, prior research has employed auxiliary losses to promote balanced routing (Riquelme et al., 2021; Zoph et al., 2022; Zhou et al., 2022; Mustafa et al., 2022). However, as shown in LIMOE (Mustafa et al., 2022), new challenges emerge in multimodal settings, such as modality imbalance, where one data type may be more prevalent than the other. We design VL-MoE in a modal-specific fashion to prevent the instability caused by the imbalance of multimodal data, and we experiment with different auxiliary losses for the V-MoE: the loading balance loss (Shazeer et al., 2017), the averaged loading balance and importance loss ("vloss") (Riquelme et al., 2021), and the z-loss (Zoph et al., 2022). We present the results on VL-MoE SMALL/16E in Figure 4, which suggest that the z-loss hurts the vision-and-language pretraining of VL-MoE, and that using the loading balance loss alone leads to unstable training and underperforming models. The "vloss" turns out to yield the most stable training, which is consistent with V-MOE (Riquelme et al., 2021) and LIMOE (Mustafa et al., 2022). BPR also helps in stabilizing training.
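To make the balancing objectives concrete, here is a sketch of a Switch-Transformer-style load-balancing loss for top-1 routing. It is a simplified relative of, not identical to, the loading and "vloss" variants compared above:

```python
def load_balance_loss(gate_rows):
    """Switch-style balancing loss for top-1 routing: E * sum_i f_i * P_i,
    where f_i is the fraction of tokens whose argmax gate is expert i and
    P_i is the mean gate probability assigned to expert i. The minimum
    value 1.0 is attained under a perfectly uniform assignment; skewed
    routing drives the loss above 1."""
    T, E = len(gate_rows), len(gate_rows[0])
    f = [0.0] * E   # dispatch fractions
    P = [0.0] * E   # mean gate probabilities
    for g in gate_rows:
        f[g.index(max(g))] += 1.0 / T
        for i in range(E):
            P[i] += g[i] / T
    return E * sum(fi * pi for fi, pi in zip(f, P))
```

Added to the task loss with a small coefficient (0.01 in this paper's setting), such a term penalizes routers that concentrate traffic on a few experts and thus reduces dropped tokens.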
Token Routing Examples in VL-MoE. In Figure 3, we provide a qualitative analysis of token routing decisions on COCO. For vision tokens, their specialization is clear, as they are routed to specific experts such as food and vegetable experts, eye experts, OCR experts, etc. On the other hand, language tokens show signs of syntactic specialization, with some experts processing mostly padding tokens, while others focus on nouns and adjectives (and some padding), excluding prepositions, determiners, and verbs.
Comparison with LIMOE. In LIMOE (Mustafa et al., 2022), the single-modality MoE architecture and the contrastive loss are the two main building blocks. To directly compare the two components of multimodal LIMOE under our setting, we experimented with optimizing either the single-modality MoE architecture or VL-MoE with the contrastive or masked data modeling (MDM) loss. However, we found that the models fail to converge when optimizing the LIMOE architecture with the MDM loss, likely because the MDM loss comprises three losses aimed at different modalities, which may exacerbate the modality imbalance problem and make the MoEs difficult to optimize even when equipped with the entropy balancing loss of (Mustafa et al., 2022). Therefore, we focused on optimizing VL-MoE and LIMOE with the contrastive loss, as it yielded more stable results. It should be noted, however, that while LIMOE uses 1.8B image-text pairs, our setting only has 4M. We then report the training and validation loss across steps when optimizing VL-MoE or LIMOE with the contrastive loss in Figure 8. The batch size is set to 2k. From the zero-shot validation results, it can be seen that both models quickly overfit to the 4M image-text pairs, but the single-modality MoE architecture in LIMOE inherits more instability.
Furthermore, we use the 4M data to enrich the contrastive-loss experiments with different model settings in Table 5. LIMOE exhibits a trend in which performance improves little, or even decreases, as the number of training steps increases (from 75k to 100k), especially in the 105M-parameter setting. This may be a sign of overfitting: the model fits the training data more closely but does not generalize as well to the validation/test data. Increasing the number of experts for LIMOE does not lead to significant performance gains, especially in the 105M-parameter setting. This might


Conclusion
In this paper, we have explored the use of Mixture-of-Experts (MoE) for scaling vision-language models. Our experiments demonstrate that MoE can be a promising technique for improving the efficiency and effectiveness of vision-language models. Specifically, we have shown that dividing a large vision-language model into smaller, specialized sub-models through MoE can achieve state-of-the-art performance on several benchmarks while reducing computational costs. Our experiments have also shown that larger expert pools yield consistent performance improvements. Furthermore, we have explored the impact of MoE on model interpretability and found that it can improve the interpretability of vision-language models by providing better insights into how the model processes different inputs.
In conclusion, our findings suggest that MoE is a valuable technique for scaling vision-language models, enabling them to handle large-scale, real-world multimedia data. Our work opens up new research directions for exploring the effectiveness of MoEs in other vision-language tasks, such as visual question answering, visual reasoning, and image-text retrieval, and we hope our findings will inspire further investigations into this research area.

A Appendix
A.1 Further Analyses "Dropped" Tokens.In MoE training, the issue of "Dropped Tokens" is inherited (Lepikhin et al., 2020;Shazeer et al., 2017;Mustafa et al., 2022;Riquelme et al., 2021;Zhou et al., 2022) and caused by the limited capacity of each MoE expert, which can lead to instability.To provide a detailed analysis of this issue, we present Figure 6, which illustrates the distribution of dropped tokens in VL-MoE BASE/16E across different pre-training tasks.The figure shows that MLM and MIM tasks exhibit a more balanced distribution of tokens compared to VLM task, which may explain the improved performance of using MoEs in the former two pre-training tasks, as depicted in Figure 2. Additionally, the problem of dropped imag tokens is more severe compared to dropped text tokens, which aligns with the results of different scaling strategies presented in Section 5 and the findings in (Mustafa et al., 2022;Riquelme et al., 2021).
Pretrain Losses for Different Scaling Strategies. We additionally report the effect of the different scaling strategies from Section 5 for VL-MoE SMALL/16E on the masked language modeling (MLM), masked image modeling (MIM), and masked vision-language modeling (VLM) pre-training tasks across training steps in Figure 7. The results support our hypothesis that using three distinct pretraining objectives, one per modality, and scaling each modality leads to improved optimization of both the modality-specific pretraining losses and the VLM loss.

A.2 Additional Results
We conduct experiments on COCO captioning following (Wang et al., 2022b), where VL-MoE achieves 139.2 CIDEr and 23.1 SPICE, outperforming base-size BEIT-3 with 137.5 CIDEr and 22.7 SPICE. We also observe interesting routing specialization in the T-MoE of VL-MoE when generating the final word "cake" in Figure 3: "NN: lady" and "NN: slicing" route to experts 1 and 13, respectively; "DT: A, a" both route to expert 1; "JJ: hairnet, big" route to expert 7. These routings underscore the inherent expert specialization in the VL-MoE model, potentially highlighting its advantages. Natural Language for Visual Reasoning (NLVR2).
For the results in Table 1, the base/large-size models are fine-tuned for 10 epochs with a batch size of 128. The peak learning rate of the base-size models is set to 5e-5. The input image resolution is 384 × 384. For the ablation experiments, we fine-tune the models for 10 epochs with a batch size of 128 and choose the learning rate from {5e-5, 1e-4}. The input image resolution is 224 × 224. All ablation results on NLVR2 are averaged over 3 runs.
COCO. We fine-tune the base/large-size models for 20 epochs with a batch size of 2048. The peak learning rate is 2e-5 and the input image resolution is 384 × 384.
Flickr30K. For the results in Table 1, the base/large-size models are fine-tuned for 40 epochs with a batch size of 2048 and a peak learning rate of 1e-5, using the model fine-tuned on COCO as initialization. The input image resolution is 384 × 384. For all ablation experiments, we fine-tune the models for 10 epochs with a batch size of 1024. The peak learning rate is set to 5e-5, and the input image resolution is 224 × 224.
ImageNet-1k. We fine-tune the base-size VL-MoE with the V-MoE and V-FFN only for 15 epochs with a batch size of 2048. The peak learning rate is 3e-5 and the input image resolution is 384 × 384.
MNLI. We fine-tune the base-size VL-MoE with the T-MoE and T-FFN only for 10 epochs with a batch size of 32. The peak learning rate is 3e-5.
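For reference, the Table 1 fine-tuning settings above can be collected into a single lookup (a dict sketch; the keys and structure are ours, the values are transcribed from the text, and the per-task ablation settings differ as noted above):

```python
# Fine-tuning hyperparameters transcribed from the settings above
# (Table 1 configurations; keys are illustrative, not from the paper).
FINETUNE_CONFIGS = {
    "nlvr2":      {"epochs": 10, "batch_size": 128,  "peak_lr": 5e-5, "image_res": 384},
    "coco":       {"epochs": 20, "batch_size": 2048, "peak_lr": 2e-5, "image_res": 384},
    "flickr30k":  {"epochs": 40, "batch_size": 2048, "peak_lr": 1e-5, "image_res": 384},
    "imagenet1k": {"epochs": 15, "batch_size": 2048, "peak_lr": 3e-5, "image_res": 384},
    "mnli":       {"epochs": 10, "batch_size": 32,   "peak_lr": 3e-5, "image_res": None},
}
```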

A.3 Formula of Auxiliary Loss
Given a token $x \in \mathbb{R}^D$, we denote by $g(x) = \mathrm{softmax}(Wx) \in \mathbb{R}^E$ the gating weights across the $E$ experts, with $W \in \mathbb{R}^{E \times D}$ being the routing parameters.
When we deal with a batch of multiple tokens $\{x_i\}_{i=1}^{n}$, we use the notation $X \in \mathbb{R}^{n \times D}$.
Importance loss. We follow the definition from (Riquelme et al., 2021; Mustafa et al., 2022). The importance loss $\Omega_{\mathrm{imp}}$ ensures that the gating weights are evenly distributed among the experts, maintaining a balanced profile. For any expert $e \in \{1, \dots, E\}$, we have $\mathrm{imp}_e(X) = \sum_{x \in X} g(x)_e$, and the loss $\Omega_{\mathrm{imp}}$ is defined via the squared coefficient of variation of $\mathrm{imp}(X) = \{\mathrm{imp}_e(X)\}_{e=1}^{E}$:
$$\Omega_{\mathrm{imp}}(X) = \left(\frac{\mathrm{std}(\mathrm{imp}(X))}{\mathrm{mean}(\mathrm{imp}(X))}\right)^2.$$
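The gating function and the importance loss can be transcribed directly into code. This is a minimal pure-Python sketch of the formulas (using the population standard deviation; the names are ours, not the paper's implementation):

```python
import math

def gating(x, W):
    """g(x) = softmax(Wx): gating weights over the E experts for token x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def importance_loss(X, W):
    """Squared coefficient of variation of per-expert total gate weight."""
    gates = [gating(x, W) for x in X]
    E = len(W)
    imp = [sum(g[e] for g in gates) for e in range(E)]
    mean = sum(imp) / E
    std = math.sqrt(sum((v - mean) ** 2 for v in imp) / E)
    return (std / mean) ** 2
```

A batch whose tokens spread their gate mass evenly across experts yields a loss of zero; concentrating all gate mass on one expert makes the loss strictly positive, which is what drives the router toward balance.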
Load loss. As before, we follow (Riquelme et al., 2021). We assume the gating weights $g_{\mathrm{noisy}}(x)$ are obtained by perturbing the routing function with noise, i.e., $g_{\mathrm{noisy}}(x) = \mathrm{softmax}(Wx + \varepsilon)$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ and $\sigma = 1/E$. We denote by $\eta_k$ the $k$-th largest entry of $Wx + \varepsilon$. The importance loss $\Omega_{\mathrm{imp}}$ balances the selection probability of the experts, since assigning tokens to experts is a discrete process; the load loss $\Omega_{\mathrm{load}}$ complements it by evening out the number of assignments among the experts. To compute the selection probability, expert $e \in \{1, \dots, E\}$ is assumed to be among the top-$k$ even when only the noise is resampled. The notation "vloss" used in Section 5 is the final loss employed in V-MoE (Riquelme et al., 2021): $\Omega_{\mathrm{vloss}}(X) = 0.5\,\Omega_{\mathrm{imp}}(X) + 0.5\,\Omega_{\mathrm{load}}(X)$.
(a) Encode Image Only (b) Encode Text Only (c) Encode Image-Text Pair (d) V-MoE & T-MoE Figure 1: The encoding process of VL-MoE for different modality inputs; gray and colored blocks indicate non-activated and activated modules, respectively. (a) For image-only input, the encoding process switches to V-MoE or V-FFN. (b) For text-only input, the encoding process switches to T-MoE or T-FFN. (c) For image-text pair input, the encoding process switches between V-MoE & T-MoE and VL-FFN. (d) In the early layers, we scale the V-FFN and T-FFN with sparse Mixture-of-Experts as V-MoE and T-MoE, respectively. VL-MoE uses conditional computation to allocate tokens in a modality-specific fashion. V/T-MoE converts multiple V/T-FFNs into experts, and the image/text input is conditionally routed by the V/T-Router network.
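The modality-specific switching described in the Figure 1 caption can be summarized as a sketch (function and argument names are ours; the callables stand in for the actual V/T-MoE, V/T-FFN, and VL-FFN modules):

```python
def vl_moe_block(tokens, v_branch, t_branch, vl_ffn=None, use_vl_ffn=False):
    """Dispatch tokens by modality, as in the early layers of VL-MoE.

    tokens: list of (modality, vector) pairs, modality in {"image", "text"}.
    Image tokens go through the vision branch (V-MoE / V-FFN), text tokens
    through the text branch (T-MoE / T-FFN); later layers may instead apply
    a shared VL-FFN to the fused image-text input.
    """
    out = []
    for modality, vec in tokens:
        if use_vl_ffn and vl_ffn is not None:
            out.append(vl_ffn(vec))      # shared vision-language FFN
        elif modality == "image":
            out.append(v_branch(vec))    # V-MoE / V-FFN path
        else:
            out.append(t_branch(vec))    # T-MoE / T-FFN path
    return out
```

Because the branch is chosen per token rather than per batch, image-only, text-only, and paired inputs all flow through the same block while activating only the relevant modules.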

Figure 2: Effect of VL-MoE scaling on the masked language modeling (MLM), masked image modeling (MIM), and masked vision-language modeling (VLM) pre-training tasks across training FLOPs.

Figure 3: Token routing decisions on COCO. Examples of vision-token routing decisions and a breakdown of language-token routing decisions at the V/T-MoE layer placed in the 6th encoder block, i.e., the middle of the network, for VL-MoE LARGE/16E.

Figure 4: Effect of auxiliary loss on training stability.

Figure 5: Effect of the number of experts.

Figure 8: Comparison of Dense, VL-MoE, and LIMOE on the contrastive pre-training task across training steps.
$$p_e(x) = 1 - \Phi\!\left(\frac{\eta_k - (Wx)_e}{\sigma}\right),$$
with $\Phi$ the cumulative distribution function of a Gaussian distribution. The load loss $\Omega_{\mathrm{load}}$ is eventually defined by
$$\Omega_{\mathrm{load}}(X) = \left(\frac{\mathrm{std}(\mathrm{load}(X))}{\mathrm{mean}(\mathrm{load}(X))}\right)^2, \quad \text{where } \mathrm{load}(X) = \{\mathrm{load}_e(X)\}_{e=1}^{E}, \quad \mathrm{load}_e(X) = \sum_{x \in X} p_e(x).$$
Z-loss. The z-loss $\Omega_{\mathrm{zloss}}$ introduced in (Zoph et al., 2022) aims at controlling the maximum magnitude of the router activations $A = \{Wx_i\}_{i=1}^{n} \in \mathbb{R}^{n \times E}$ with entries $a_{i,e} = (Wx_i)_e$. The loss is defined by
$$\Omega_{\mathrm{zloss}}(X) = \frac{1}{n} \sum_{i=1}^{n} \left(\log \sum_{e=1}^{E} e^{a_{i,e}}\right)^2.$$
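The load loss and z-loss can likewise be sketched in a few lines of pure Python (an illustrative transcription of the formulas with our own names; the Gaussian CDF is computed via math.erf, and the noise is drawn explicitly rather than reparameterized as in a real training loop):

```python
import math
import random

def _logits(x, W):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def _phi(z):
    """CDF of a standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def load_loss(X, W, k=1):
    """Squared coefficient of variation of the expected per-expert load."""
    E = len(W)
    sigma = 1.0 / E
    load = [0.0] * E
    for x in X:
        clean = _logits(x, W)
        noisy = [v + random.gauss(0.0, sigma) for v in clean]
        eta_k = sorted(noisy, reverse=True)[k - 1]  # k-th largest noisy entry
        for e in range(E):
            # p_e(x): probability that expert e stays in the top-k
            # when only the noise is resampled.
            load[e] += 1.0 - _phi((eta_k - clean[e]) / sigma)
    mean = sum(load) / E
    std = math.sqrt(sum((v - mean) ** 2 for v in load) / E)
    return (std / mean) ** 2

def z_loss(X, W):
    """Mean squared log-sum-exp of the router logits (Zoph et al., 2022)."""
    total = 0.0
    for x in X:
        logits = _logits(x, W)
        m = max(logits)
        lse = m + math.log(sum(math.exp(v - m) for v in logits))
        total += lse ** 2
    return total / len(X)
```

Unlike the importance loss, the load loss is a smooth surrogate for discrete assignment counts, which is what makes it differentiable; the z-loss instead penalizes large router logits for numerical stability rather than balance.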

Table 4: Efficiency results of base-size VL-MoE models with different optimizations.
Efficiency. In Table 4, we use one V100×16 node to benchmark the efficiency of VL-MoE with various optimizations. EP stands for the expert parallelism provided by the DeepSpeed library, and KN denotes the specialized kernel-fusion operations we implemented (expert dispatch as well as bias-GeLU fusion).