mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

Large-scale pre-trained foundation models have become an emerging paradigm for building artificial intelligence (AI) systems that can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from two problems in cross-modal alignment: low computational efficiency, and a linguistic signal that is overwhelmed by long visual sequences. To address both problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability on vision-language and video-language tasks. The code and pre-trained models are available at https://github.com/alibaba/AliceMind


Introduction
Large-scale pre-training of vision-language models has recently achieved tremendous success on a wide range of cross-modal tasks [1,2,3,4,5,6,7]. Such vision-language models learn cross-modal representations from large quantities of image-text pairs by aligning the visual and linguistic modalities. A central challenge in learning vision-language models is to find a good alignment between the two modalities that closes the semantic gap between them.
To discover a cross-modal alignment, prior studies [4,8,9] employ a pre-trained object detector to extract salient regions from images, which are then aligned with language counterparts. Such an architecture, however, is generally limited by the power of the object detector, the pre-defined visual semantics it can represent, and the quantity of annotations available. Besides, it is also computationally expensive to extract region-based visual features from high-resolution (e.g. 600×1000) images. More recent work [3,7,6,10,11], which scales and performs better on many vision-language tasks, drops the requirement of pre-trained object detection and enables a direct alignment between the image and text representations in an end-to-end manner. These models extract finer-grained visual representations with a long sequence of image patches or grids for good vision understanding [11].
However, there exist two significant problems in modeling long visual sequences: 1) efficiency: full self-attention on long visual sequences requires much more computation than that on textual sequences, and 2) information asymmetry: the caption text in widely-used image-text pre-training data is usually short and highly abstract while more detailed and diverse information can be extracted from the image. This asymmetry presents challenges for effective multi-modal fusion between the modalities.
One straightforward way of multi-modal fusion is the connected-attention network shown in Figure 1 (a). It adopts a single Transformer [12] network for early fusion of vision and language by simply taking the concatenation of visual and linguistic features as input [13]. This paradigm allows self-attention to discover alignments between the modalities from the bottom level, but it requires full self-attention on the concatenation of cross-modal sequences, which is rather time-consuming. Besides, this type of method processes information from both modalities equally, and may therefore suffer from the information asymmetry, especially when there is a big difference in information density or sequence length between the modalities.
Another line of work keeps separate Transformer networks for both textual and visual features, and uses techniques such as cross-attention to enable cross-modal interaction [11], as shown in Figure 1 (b). This architecture design conducts multi-modal fusion on both modalities independently, which can help alleviate the information asymmetry problem. However, it still suffers from computation inefficiency for full self-attention on long visual sequences, and it is not that parameter-efficient with two separate Transformer networks.
In this work, we propose mPLUG, a unified Multi-modal Pre-training framework for both vision-Language Understanding and Generation. mPLUG performs effective and efficient vision-language learning with novel cross-modal skip-connections to address the fundamental information asymmetry problem. Instead of fusing visual and linguistic representations at the same levels, the cross-modal skip-connections enable the fusion to occur at disparate levels in the abstraction hierarchy across the modalities. They create inter-layer shortcuts that skip a certain number of layers for visual representations to reflect the semantic richness of language compared to vision. As shown in Figure 1 (c), in each block of our cross-modal skip-connected network, mPLUG first adopts an asymmetric co-attention architecture at the first few layers for efficiency, by removing the co-attention on the vision side. It is then followed by one layer of connected-attention, which takes the concatenation of the original visual representation and the co-attention output on the language side as input. In addition to the modeling efficacy due to the asymmetry, the cross-modal skip-connections ease model training by alleviating vanishing gradients with the inserted shortcuts. Figure 1 shows that the new cross-modal skip-connected network achieves superior performance while running at least four times faster than other cross-modal fusion networks.
Our key contributions can be summarized as follows:
• We propose mPLUG, a unified vision-language pre-trained model for cross-modal understanding and generation that is both effective and efficient in cross-modal learning.
• We introduce a new asymmetric vision-language architecture with novel cross-modal skip-connections, to address two fundamental problems of information asymmetry and computation inefficiency in multi-modal fusion.
• mPLUG achieves state-of-the-art performance on a wide range of vision-language tasks, including image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability when directly transferred to a wide range of vision-language and video-language tasks.
Related Work

Vision-Language Pre-training
Vision-Language pre-training (VLP) has recently achieved tremendous success and state-of-the-art results across a variety of vision-language tasks [14,15,16]. In terms of how information from different modalities is aggregated, typical approaches to VLP [1,2,3,5,6,17,18] can be roughly divided into two categories: dual encoder and fusion encoder. The dual encoder approach utilizes two single-modal encoders to encode images and text separately, and then uses simple functions such as dot product to model the instance-level cross-modal interaction between image and text. The advantage of dual encoder models like CLIP [17] and ALIGN [18] is that image and text representations can be pre-computed and cached, which is computationally efficient and more appropriate for retrieval tasks. However, they tend to fail in handling more complicated VL understanding tasks that require complex reasoning, such as visual question answering [14]. In contrast, the fusion encoder approach uses deep fusion functions such as multi-layer self-attention and cross-attention networks to model the fine-grained cross-modal interaction between image and text sequences. Representative methods of this category include single-stream architectures such as UNITER [2] and OSCAR [4], and two-stream architectures such as LXMERT [1], ALBEF [6] and ERNIE-ViL [5]. Methods of this kind can better capture the underlying association between image and text for vision-language understanding tasks, but they need to jointly encode all possible image-text pairs, which leads to a relatively slow inference speed.
To improve the inference speed, some recent work such as Pixel-BERT [3], E2E-VLP [19] and ViLT [10] removes the complicated object detector in feature extraction and conducts end-to-end VL learning with CNN-based grid features (Pixel-BERT, E2E-VLP) or linearly projected patch embeddings (ViLT). To combine the benefits of both categories of architectures, VLMo [20] further unifies the dual encoder and fusion encoder modules with a shared mixture-of-modality-experts Transformer. In this work, mPLUG introduces a new cross-modal fusion mechanism with cross-modal skip-connections, which enables the fusion to occur at disparate levels in the abstraction hierarchy across the modalities. It achieves superior effectiveness and efficiency across a wide range of VL tasks.

Skip-connection
Skip-connection is a popular technique to mitigate the exploding or vanishing gradient problem when optimizing deep neural networks, and it is widely used in CV and NLP architectures such as ResNet [21] and Transformer [12]. A variety of skip-connection methods have been proposed in recent years [22,21,12,23,24,25]. ResNet [21] introduces summed shortcut connections between different layers using a simple identity mapping, while the highway network [22] designs a transform gating function to control the balance between the input and the transformed input. DenseNet [23] designs new architectures with concatenated skip-connections, allowing subsequent layers to re-use all the intermediate representations of previous layers. Layer normalization and recursive skip-connections are further used in combination with plain skip-connections to stabilize model optimization and better incorporate the transformed input [12,25]. In this work, mPLUG proposes a new cross-modal skip-connection method to address the cross-modal fusion problem. It combines the concatenated skip-connection and the summed skip-connection to choose, at each layer, whether to attend to all the concatenated representations of the different modalities or to focus only on the cross-modal interaction part.

mPLUG
In this section, we will first introduce our new model architecture with the key module of the cross-modal skip-connected network, and then give the details of the pre-training objectives and scalable training infrastructure.

Model Architecture
As shown in Figure 2, mPLUG consists of two unimodal encoders for image and text independently, a cross-modal skip-connected network and a decoder for text generation. To better model the inherent modality bias information, we first use two unimodal encoders to encode image and text separately. Following [11,26], we use a visual transformer [27] directly on the image patches as the visual encoder, which is more computation-friendly than using pre-trained object detectors for visual feature extraction [8,9]. The visual encoder divides an input image into patches and encodes them as a sequence of embeddings $\{v_{cls}, v_1, v_2, \ldots, v_M\}$ with an additional [CLS] token. The input text is fed to the text encoder and represented as a sequence of embeddings $\{l_{cls}, l_1, l_2, \ldots, l_N\}$, where $l_{cls}$ is the embedding of the [CLS] token and is used to summarize the input text. Then, the visual and linguistic representations are fed into a cross-modal skip-connected network, which consists of multiple skip-connected fusion blocks. In each skip-connected fusion block, we apply connected cross-modal fusion after every S asymmetric co-attention layers, where S is a fixed stride value. The aim of this network is to take advantage of the effectiveness of the connected cross-modal fusion and the efficiency of the asymmetric co-attention for enhanced cross-modal fusion in a recursive manner. Finally, the output cross-modal representations are fed into a transformer decoder for sequence-to-sequence learning, which equips mPLUG with both understanding and generation capabilities.
Figure 2: The model architecture and objectives of mPLUG, which consists of two unimodal encoders for images and text separately, a cross-modal skip-connected network and a decoder for text generation. An image-text contrastive loss is first applied to align the unimodal representations from the visual encoder and text encoder. Then, we use a novel cross-modal skip-connected network to fuse the visual and linguistic representations effectively and efficiently. We adopt connected cross-modal fusion after every S asymmetric co-attention layers, where S is a fixed stride value. Based on the connected representation of the image and prefix sub-sequence, the decoder is trained with a prefix language modeling (Prefix LM) loss by generating the remaining caption.
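To make the length of the visual sequence concrete, the following minimal sketch counts the tokens produced by a patch-based visual encoder. It assumes square images whose side is a multiple of the patch size, and it uses the pre-training resolutions and patch sizes reported in the Data & Setup section; the function name is ours and is only illustrative.

```python
def num_visual_tokens(image_size: int, patch_size: int) -> int:
    """Number of ViT tokens for a square image: one per patch plus one [CLS]."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


# Pre-training resolutions and patch sizes from the Data & Setup section.
tokens_base = num_visual_tokens(256, 16)   # ViT-B/16 at 256 x 256
tokens_large = num_visual_tokens(224, 14)  # ViT-L/14 at 224 x 224
```

Under these assumptions both configurations produce the same number of visual tokens (M = 256 patches plus the [CLS] token), which is typically much longer than a short caption.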

Cross-modal Skip-connected Network
The cross-modal skip-connected network consists of N skip-connected fusion blocks. In each skip-connected fusion block, we apply one connected-attention layer after every S asymmetric co-attention layers, where S is a fixed stride value. We first pass the text feature and image feature from the unimodal encoders through the S asymmetric co-attention layers, and then concatenate the output text feature and the image feature as input to one connected-attention layer. We repeat the skip-connected fusion block N times to obtain the final connected image and text representation.
Specifically, the asymmetric co-attention layer is composed of a self-attention (SA) layer, a cross-attention (CA) layer and a feed-forward network (FFN). The input text feature $l^{n-1}$ is first fed to the self-attention layer, and then the visual feature $v^{n-1}$ is injected into the text feature $l^{n}_{SA}$ by the cross-attention layer, which gives $l^{n}_{CA}$. The outputs of self-attention $l^{n}_{SA}$ and cross-attention $l^{n}_{CA}$ are added up and fed to the FFN layer to obtain the visual-aware text representation $l^{n}$:

$$l^{n}_{SA} = LN(SA(l^{n-1}) + l^{n-1})$$
$$l^{n}_{CA} = LN(CA(l^{n}_{SA}, v^{n-1}) + l^{n}_{SA})$$
$$l^{n} = LN(FFN(l^{n}_{CA}) + l^{n}_{CA})$$

where LN is short for layer normalization. The connected-attention layer is composed of a self-attention (SA) layer and a feed-forward network (FFN). We connect the image feature $v^{n-1}$ and the input text feature $l^{n-1}$, where $l^{n-1}$ is the output of the S asymmetric co-attention layers. The connected image and text feature $[v^{n-1}; l^{n-1}]$ is fed to the self-attention layer and the FFN layer:

$$[v^{n}_{SA}; l^{n}_{SA}] = LN(SA([v^{n-1}; l^{n-1}]) + [v^{n-1}; l^{n-1}])$$
$$[v^{n}; l^{n}] = LN(FFN([v^{n}_{SA}; l^{n}_{SA}]) + [v^{n}_{SA}; l^{n}_{SA}])$$

Then $[v^{n}; l^{n}]$ is fed into the next skip-connected fusion block, and this is repeated to get the final connected image and text representation. Finally, the connected output is fed into a Transformer decoder for sequence-to-sequence learning.
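The layer composition described above can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not the released implementation: attention is single-head with identity projections, the FFN is replaced by an element-wise ReLU stand-in, layer normalization has no learned scale or shift, and all function names are hypothetical.

```python
import math


def layer_norm(x):
    """Normalize each token vector to zero mean and unit variance (no learned scale/shift)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((a - mean) ** 2 for a in row) / len(row)
        out.append([(a - mean) / math.sqrt(var + 1e-5) for a in row])
    return out


def add(x, y):
    """Element-wise sum of two token sequences (the residual connection)."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]


def attention(queries, keys_values):
    """Single-head scaled dot-product attention with identity projections (a simplification)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values)) / total
                    for j in range(d)])
    return out


def ffn(x):
    """Stand-in feed-forward network: element-wise ReLU (a real FFN has two linear layers)."""
    return [[max(0.0, a) for a in row] for row in x]


def asymmetric_co_attention(l, v):
    """Update the text only: self-attention, cross-attention to the image, then FFN."""
    l_sa = layer_norm(add(attention(l, l), l))
    l_ca = layer_norm(add(attention(l_sa, v), l_sa))
    return layer_norm(add(ffn(l_ca), l_ca))


def connected_attention(v, l):
    """Self-attention and FFN over the concatenation [v; l]; returns the two parts."""
    x = v + l  # list concatenation along the sequence axis
    x_sa = layer_norm(add(attention(x, x), x))
    x_out = layer_norm(add(ffn(x_sa), x_sa))
    return x_out[:len(v)], x_out[len(v):]


def skip_connected_block(v, l, stride):
    """S asymmetric co-attention layers followed by one connected-attention layer.

    The image feature v bypasses the S co-attention layers unchanged (the
    cross-modal skip-connection) and is re-joined with the text afterwards.
    """
    for _ in range(stride):
        l = asymmetric_co_attention(l, v)
    return connected_attention(v, l)


# Toy input: 5 visual tokens and 3 text tokens of dimension 4.
v = [[0.1 * (i + j) for j in range(4)] for i in range(5)]
l = [[0.2 * (i - j) for j in range(4)] for i in range(3)]
v_out, l_out = skip_connected_block(v, l, stride=3)
```

In this sketch the visual sequence is only processed by self-attention once per block, in the connected-attention layer, while the shorter text sequence is updated in every layer.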

Pre-training Tasks
We perform four pre-training tasks including three understanding tasks (Image-Text Contrastive Learning, Image-Text Matching, Masked Language Modeling) and one generation task (Prefix Language Modeling). These pre-training tasks are optimized jointly.
Image-Text Contrastive (ITC): Following [6], we employ this task to align the image features and the text features from the unimodal encoders. Specifically, we calculate the softmax-normalized image-to-text and text-to-image similarities, and maintain two dynamic memory queues (text, image) to increase the number of negative examples, as in MoCo [28].
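A minimal in-batch version of this objective can be sketched as follows. The sketch assumes L2-normalized features, omits the memory queues, and uses a fixed temperature of 0.07, which is a hypothetical value chosen for illustration rather than one reported here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def itc_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric in-batch contrastive loss; pair i is the positive for row i.

    Features are assumed to be L2-normalized. The temperature is a
    hypothetical value and the MoCo-style queues are left out.
    """
    n = len(image_feats)
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_feats]
           for img in image_feats]
    # Image-to-text: softmax over each row; text-to-image: softmax over each column.
    i2t = sum(-math.log(softmax(sim[i])[i]) for i in range(n)) / n
    t2i = sum(-math.log(softmax([sim[j][i] for j in range(n)])[i]) for i in range(n)) / n
    return 0.5 * (i2t + t2i)


aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each image is most similar to its own caption and large when the pairs are swapped.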
Image-Text Matching (ITM): This task aims to predict, from the cross-modal representation, whether an image and a sentence match each other. We also select hard negative image-text pairs based on the contrastive text-image similarity, as in [6].
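One plausible way to select hard negatives from the contrastive similarities is sketched below: for each image, a non-matching text is sampled with probability proportional to the exponentiated similarity. The sampling rule and the function name are assumptions made for illustration; the exact procedure follows [6].

```python
import math
import random


def sample_hard_negatives(sim, rng):
    """For each image i, draw a non-matching text index j != i with probability
    proportional to exp(sim[i][j]), so more similar texts are picked more often.

    sim is a square image-by-text similarity matrix with at least two columns.
    """
    negatives = []
    for i, row in enumerate(sim):
        candidates = [j for j in range(len(row)) if j != i]
        m = max(row[j] for j in candidates)
        weights = [math.exp(row[j] - m) for j in candidates]
        negatives.append(rng.choices(candidates, weights=weights, k=1)[0])
    return negatives


sim = [[5.0, 4.0, -50.0],
       [0.0, 5.0, 1.0],
       [2.0, -3.0, 5.0]]
negatives = sample_hard_negatives(sim, random.Random(0))
```

For the first image, the second text is far more similar than the third, so it is effectively always chosen as the hard negative.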
Masked Language Modeling (MLM): The task setup is basically the same as in BERT [29], where we randomly mask 15% of tokens in text and the model is asked to predict these masked words with the cross-modal representations.
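The masking step can be illustrated with the following simplified sketch. It replaces every selected token with a mask symbol and leaves out the additional random-replacement and keep-as-is rules of BERT; the mask symbol and function name are illustrative.

```python
import random


def mask_tokens(tokens, rng, mask_prob=0.15, mask_token="[MASK]"):
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere,
    so that the loss is computed only on the masked tokens.
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = ["w%d" % i for i in range(1000)]
masked, labels = mask_tokens(tokens, random.Random(0))
```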
Prefix Language Modeling (PrefixLM): This task aims to generate the caption given an image and to predict the text segment subsequent to the cross-modal context, as in [30]. It optimizes a cross-entropy loss by maximizing the likelihood of the text in an autoregressive manner.
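As a small illustration of this objective, the sketch below computes the loss from per-token log-probabilities that a decoder would assign to a caption. The decoder itself is not modeled; averaging over the target tokens (rather than summing) and the function name are assumptions.

```python
import math


def prefix_lm_loss(token_log_probs, prefix_len):
    """Mean negative log-likelihood over the tokens that follow the prefix.

    token_log_probs[t] stands for log p(y_t | image, y_<t) from a hypothetical
    decoder; the first prefix_len tokens are context and are not scored.
    """
    targets = token_log_probs[prefix_len:]
    if not targets:
        raise ValueError("prefix must be shorter than the caption")
    return -sum(targets) / len(targets)


log_probs = [math.log(p) for p in (0.9, 0.8, 0.5, 0.25)]
loss = prefix_lm_loss(log_probs, prefix_len=2)
```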

Distributed Learning on a Large Scale
Training a big model like mPLUG on large-scale datasets faces many efficiency challenges. We increase the throughput from the perspective of reducing memory usage and computation time, thereby accelerating the training of the model.
The memory usage during model training is mainly composed of two parts: the static memory usage, which consists of parameters, optimizer states and gradients, and the runtime memory usage caused by intermediate variables such as activation values. For the static memory overhead, we use the ZeRO [31] technique to partition parameters, optimizer states and gradients across the entire data-parallel group, so that the static memory overhead of a single GPU is reduced to approximately 1/N, where N denotes the number of GPU cards. For the runtime memory cost, we use gradient checkpointing [32], which greatly reduces the runtime memory usage at the expense of extra forward computation: part of the activation values are recomputed during the backward pass instead of being kept in memory.
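The 1/N scaling of the static memory can be checked with simple arithmetic. The sketch below assumes 16 bytes of static state per parameter, which is the usual accounting for mixed-precision Adam (2 bytes for weights, 2 for gradients, 12 for optimizer states); both this constant and the model size used in the example are assumptions for illustration.

```python
def static_bytes_per_gpu(num_params, num_gpus, bytes_per_param=16):
    """Approximate static memory per GPU when parameters, gradients and
    optimizer states are all partitioned across the data-parallel group.

    bytes_per_param=16 assumes mixed-precision Adam
    (2 weights + 2 gradients + 12 optimizer states).
    """
    return num_params * bytes_per_param / num_gpus


# Hypothetical model with one billion parameters.
unpartitioned = static_bytes_per_gpu(1_000_000_000, 1)
partitioned = static_bytes_per_gpu(1_000_000_000, 16)
```

With 16 GPUs, the static footprint per GPU in this example drops from 16 GB to 1 GB, ignoring communication buffers and activations.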
To reduce the computation time, we use BF16 precision training. BF16 is a data type supported by NVIDIA's Ampere-architecture GPUs such as the A100. Compared with the previously widely used mixed-precision training with FP16 and FP32, BF16 has the same representation range as FP32, which reduces the risk of numerical overflow and helps ensure stable convergence, while offering the same fast computing speed as FP16.
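The difference in representation range follows directly from the bit layouts of the three formats (FP16: 5 exponent and 10 mantissa bits; BF16: 8 and 7; FP32: 8 and 23). The following sketch computes the largest finite value of each format from these bit counts; the helper name is ours.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent code is reserved for infinity and NaN.
    max_exponent = (2 ** exponent_bits - 2) - bias
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent


fp16_max = max_finite(5, 10)
bf16_max = max_finite(8, 7)
fp32_max = max_finite(8, 23)
```

BF16 shares the 8-bit exponent of FP32, so its largest value is of the same order of magnitude as that of FP32, whereas FP16 overflows just above 65504.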

Data & Setup
Following the previous work [6], we use the same pre-training dataset with 14M images with texts, which includes two in-domain datasets (MS COCO [36] and Visual Genome [37]) and three out-of-domain web datasets (Conceptual Captions [38], Conceptual 12M [39] and SBU Captions [40]). We pre-train the model for 30 epochs with a total batch size of 1024 on 16 NVIDIA A100 GPUs. We use a 6-layer Transformer for both the text encoder and the cross-modal skip-connected network, and a 12-layer Transformer for the decoder. The text encoder is initialized using the first 6 layers of the BERT$_{base}$ [29] model and the skip-connected network is initialized using the last 6 layers of BERT$_{base}$. We initialize the visual encoder with CLIP-ViT [17] pre-trained on 400M noisy image-text pairs. The visual transformer with ViT-B/16 is used as our base architecture, and the one with ViT-L/14 as the large architecture. We use the AdamW [41] optimizer with a weight decay of 0.02. The learning rate is warmed up to 1e-5 (ViT-B/16) and 1e-4 (BERT$_{base}$) for mPLUG$_{ViT-B}$, and to 5e-6 (ViT-L/14) and 5e-5 (BERT$_{base}$) for mPLUG$_{ViT-L}$, in the first 1000 iterations, and then decayed to 1e-6 following a cosine schedule. During pre-training, we take random image crops of resolution 256 × 256 (ViT-B/16) or 224 × 224 (ViT-L/14) as input, and also apply RandAugment [42] to improve the generalization of the vision encoders. For VQA and image captioning tasks, we perform additional continued pre-training on 4M image-text pairs. We increase the image resolution during fine-tuning. For image-text contrastive learning, the queue size is set to 65,536 and the momentum coefficient is set to 0.995.
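The learning-rate schedule described above can be sketched as a function of the iteration index. The sketch assumes a linear warm-up (the shape of the warm-up is not specified here) and a hypothetical total of 10,000 iterations in the example; the function name is illustrative.

```python
import math


def learning_rate(step, total_steps, peak_lr, warmup_steps=1000, min_lr=1e-6):
    """Linear warm-up to peak_lr, then cosine decay to min_lr.

    The linear warm-up shape is an assumption; the peak and final values
    and the 1000 warm-up iterations follow the setup described above.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * min(1.0, progress)))


# Text-side learning rate of the base model, with a hypothetical 10,000 total iterations.
lr_end_of_warmup = learning_rate(999, 10_000, peak_lr=1e-4)
lr_midway = learning_rate(5_500, 10_000, peak_lr=1e-4)
lr_final = learning_rate(10_000, 10_000, peak_lr=1e-4)
```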

Evaluation on Vision-Language Tasks
We compare our pre-trained model against other VLP models on the six downstream V+L tasks. We introduce each task and our fine-tuning strategy below. Details of the datasets and fine-tuning hyperparameters are given in the Appendix.

Visual Question Answering
The VQA task [14] requires the model to answer natural language questions given an image. Most methods [1,20,4,7] treat visual question answering as multi-label classification over pre-defined answer sets. This strategy achieves strong performance, but it is not suitable for real-world open scenarios. We treat VQA as an answer generation task and directly use unconstrained open-vocabulary generation during inference, which differs from models that use constrained closed-vocabulary generation [6,35]. Following [4,35], we concatenate the question with the object labels and OCR tokens extracted from the image. As shown in Table 2, mPLUG achieves 81.27 on the Test-std split and outperforms the SOTA models including SimVLM and Florence, which use 100X and 60X more pre-training image-text pairs, respectively. Based on the same 4M pre-training data, mPLUG outperforms CLIP-ViL and METER, which also use CLIP [17] as the visual encoder. Besides, under the same settings, mPLUG always significantly outperforms ALBEF and BLIP, which rely only on co-attention from images to text for cross-modal fusion. The gain can be attributed to the design of the cross-modal skip-connections, which specifically targets the information asymmetry between the two modalities. Neither ALBEF nor BLIP addresses this problem well, and both are biased towards the language modality.

Image Captioning
The image captioning task requires a model to generate an appropriate and fluent caption for a given image. We evaluate image captioning on two datasets, COCO Caption [47] and NoCaps [48]. mPLUG fine-tuned on the COCO Caption training data is tested on both datasets.

Image-Text Retrieval
We conduct experiments for both image-to-text retrieval (TR) and text-to-image retrieval (IR) on the COCO [36] and Flickr30K [53] datasets. Following [6,34], we jointly optimize the ITC loss and the ITM loss during fine-tuning. During inference, we first select the top-k candidates by computing the dot-product similarity between the image and text encoder features, and then re-rank the selected candidates based on their ITM scores. We set k = 256 for COCO and k = 128 for Flickr30K. As shown in Table 3, mPLUG outperforms all existing methods on both datasets. Using 14M images, mPLUG achieves better performance than BLIP with 129M and Florence with 0.9B pre-training data. Using the same 14M pre-training images, mPLUG substantially outperforms the previous best model BLIP, by +2.7% in TR recall@1 on COCO and +1.0% in TR recall@1 on Flickr30K.
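The two-stage inference procedure can be sketched as follows. The ITM scorer is passed in as a callable because it requires the fusion network; here it is replaced by a lookup table of made-up scores, and all names and values are purely illustrative.

```python
def retrieve(query_feat, candidate_feats, itm_score, k):
    """Two-stage retrieval: shortlist the top-k candidates by dot-product
    similarity, then re-rank the shortlist by the more expensive ITM score.

    itm_score is a hypothetical callable mapping a candidate index to a score.
    """
    sims = [sum(a * b for a, b in zip(query_feat, c)) for c in candidate_feats]
    shortlist = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    return sorted(shortlist, key=itm_score, reverse=True)


query = [1.0, 0.0]
candidates = [[0.9, 0.0], [0.8, 0.0], [0.1, 0.0], [0.7, 0.0]]
toy_itm = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.5}  # made-up ITM scores
ranking = retrieve(query, candidates, lambda i: toy_itm[i], k=3)
```

In this toy example, candidate 2 has the highest ITM score but is never scored by the fusion network, because it does not survive the dot-product shortlist. This is the trade-off that the choice of k controls.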

Visual Grounding
Given a query in plain text and an image, visual grounding requires models to localize the referred object in the image. Instead of regressing the bounding boxes directly, we concatenate visual features and attended textual features and feed them into the decoder to predict the coordinates. Table 4 shows that mPLUG outperforms all the SOTA methods. We observe that in RefCOCO testB the images often contain arbitrary objects, and in RefCOCOg test-u the expressions are longer than in the other datasets. Compared with the previous best model OFA, mPLUG achieves a 3.16% absolute improvement on RefCOCO testB and a 1.22% absolute improvement on RefCOCOg test-u. This demonstrates that mPLUG learns better multi-modal interaction from the cross-modal skip-connections and is better at handling complex images and long queries.

Visual Reasoning
We consider two datasets for visual reasoning: NLVR2 [54] and SNLI-VE [55]. The NLVR2 [54] task requires the model to predict whether a sentence describes a pair of images. Following [34], we use two cross-attention layers to process the two input images, and their outputs are merged and fed to the FFN. An MLP classifier is then applied on the output embedding of the language [CLS] token. The SNLI-VE [55] task requires the model to evaluate how the given image and text are semantically related, i.e., entailment, neutral, or contradiction. Following [35], the image premise, text premise and text hypothesis are fed to the encoder. We remove the decoder and use only the encoder modules for three-way classification, which saves nearly half of the total computation cost. We predict the class probabilities using the multimodal encoder's output representation of the language [CLS] token. As shown in Table 5, mPLUG obtains performance competitive with the SOTA models on both visual reasoning tasks, and even outperforms SimVLM [7] and BLIP [34], which use far more pre-training data.

Effectiveness and Efficiency
To validate the effectiveness and efficiency of our proposed cross-modal skip-connected network, we conduct in-depth analysis on different stride values and various cross-modal fusion methods.

Analysis of Stride for Skip
The stride S is the key factor controlling the trade-off between effectiveness and efficiency. Therefore, we further compare the running time and performance of different stride values S in the cross-modal skip-connected network on the VQA and NLVR2 tasks. Specifically, we test four different stride values, each of which divides the total number of cross-modal fusion layers. The model is mPLUG$_{ViT-B}$ and all the other experiment settings are kept the same. As shown in Figure 3, the larger S is, the more efficient the cross-modal fusion is: skipping the vision co-attention layers reduces the running time by 5X from S = 1 to S = 6. The performance of mPLUG on both datasets gradually increases up to S = 3 and slightly decreases afterwards. Compared with S = 3, mPLUG achieves comparable performance at S = 6 while speeding up by nearly 30%. Therefore, we set S = 6 for mPLUG$_{ViT-L}$ for faster pre-training.

Analysis of Cross-modal Fusion
We compare the effectiveness and efficiency of different cross-modal fusion variants in terms of running time and performance on the VQA and NLVR2 tasks. Specifically, we pre-train mPLUG with different cross-modal fusion networks based on the same image encoder and text encoder. All the pre-training settings and the number of fusion layers are kept the same as in the original mPLUG pre-training. As shown in Figure 4, the co-attention and connected-attention fusion methods both require much more running time due to the long visual sequence. Compared with these two fusion methods, our proposed skip-connected network is 4X faster and obtains better performance on both datasets. We also compare it with the asymmetric co-attention used in BLIP [6,34], which relies only on the co-attention layers from images to text. Despite running slightly faster than the skip-connected network, the asymmetric co-attention is less accurate on both datasets. The performance degradation is attributed to the information asymmetry and the bias towards language, as shown in Section 5.2.1.
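A rough way to see why the fusion variants differ in cost is to count the query-key pairs scored by attention in one fusion layer. The sketch below is a back-of-the-envelope model under stated assumptions: it ignores projections and FFN cost, treats a skip-connected block as S asymmetric layers plus one connected layer, and uses a hypothetical text length of 30 tokens. It is not meant to reproduce the measured running times in Figure 4.

```python
def attention_pairs(num_visual, num_text, fusion, stride=3):
    """Average number of query-key pairs scored per fusion layer.

    A simplified cost model: 'connected' runs full self-attention on [v; l],
    'asymmetric' runs text self-attention plus text-to-image cross-attention,
    and 'skip' averages S asymmetric layers with one connected layer.
    """
    connected = (num_visual + num_text) ** 2
    asymmetric = num_text * num_text + num_text * num_visual
    if fusion == "connected":
        return connected
    if fusion == "asymmetric":
        return asymmetric
    if fusion == "skip":
        return (stride * asymmetric + connected) / (stride + 1)
    raise ValueError(f"unknown fusion type: {fusion}")


# 257 visual tokens (256 patches + [CLS]) and a hypothetical 30 text tokens.
cost_connected = attention_pairs(257, 30, "connected")
cost_skip = attention_pairs(257, 30, "skip", stride=3)
cost_asymmetric = attention_pairs(257, 30, "asymmetric")
```

Under this model the ordering of the costs (asymmetric below skip-connected, skip-connected below connected) is consistent with the qualitative comparison above, although the exact ratios depend on the assumptions.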

Large-scale Training
Combining the techniques introduced in Section 4 dramatically increases the training throughput. With the memory-saving and accelerated-training techniques, the throughput of mPLUG improves by more than 3X, from 124 samples per second to 422 samples per second, as shown in Table 6.

Zero-shot Transferability
In this section, we examine the generalization of mPLUG and compare the zero-shot results on two vision-language and three video-language tasks.
As shown in Table 7, the zero-shot performance of mPLUG is competitive with fully supervised baselines such as Oscar and VinVL. With further fine-tuning on the MSCOCO dataset, mPLUG outperforms SimVLM$_{huge}$, which uses more pre-training image-text pairs and has more model parameters.
Image-Text Retrieval: We perform zero-shot retrieval on Flickr30K. The result is shown in Table 8, where zero-shot mPLUG outperforms models (CLIP, ALIGN, Florence) pre-trained with more image-text pairs. Following [34], we also evaluate zero-shot retrieval with the model fine-tuned on the MSCOCO dataset. Table 8 shows that mPLUG achieves better performance than the previous SOTA models.

Zero-shot Transfer to Video-Language Tasks
We evaluate the generalization ability of mPLUG on video-language tasks without any video pre-training or supervision. Table 9 shows that zero-shot mPLUG can outperform the SOTA models pre-trained on far more data (e.g., Florence, BLIP), and, without using temporal information, can even outperform models fine-tuned on the supervised video dataset (e.g., VideoCLIP, VIOLET).
Video Question Answering: Following BLIP [34], we treat Video QA as an answer generation task and perform evaluation based on models fine-tuned on VQA. As shown in Table 10, zero-shot mPLUG outperforms BLIP pre-trained with more image-text pairs.
Video Caption: We use a prefix prompt "A video of" to improve the quality of decoded captions. Table 10 shows that zero-shot mPLUG also achieves better performance than BLIP.

Conclusion
This paper presents mPLUG, an effective and efficient VLP framework for both cross-modal understanding and generation. mPLUG introduces a new asymmetric vision-language architecture with novel cross-modal skip-connections, to address two fundamental problems of information asymmetry and computation inefficiency in cross-modal alignment. Pre-trained on large-scale image-text pairs, mPLUG achieves state-of-the-art performance on a wide range of vision-language tasks. mPLUG also demonstrates strong zero-shot transfer ability when directly applied to multiple video-language tasks. Our work explores cross-modal alignment with a newly designed VLP architecture, and we hope it can help promote future research on image-text foundation models.
Image Captioning. Following [4,35], we first fine-tune mPLUG with cross-entropy loss for 5 epochs with a learning rate of 1e-5 and a batch size of 256. Based on the fine-tuned model, we then fine-tune it with CIDEr optimization [49] for an extra 5 epochs with a smaller learning rate of 8e-7. During inference, we use beam search with a beam size of 10, and set the maximum generation length to 20.
Visual Grounding. We evaluate our method on three referring expression grounding datasets: RefCOCO, RefCOCO+ [16] and RefCOCOg [69]. The RefCOCO and RefCOCO+ datasets share 19K images and contain 142K and 141K queries, respectively. The RefCOCOg dataset contains 25K images and 95K queries. To fully use the training data, we first train the model on a mixed dataset with a learning rate of 2e-5. Then we continue fine-tuning the model on each dataset with a learning rate of 2e-6.