Fusion or Defusion? Flexible Vision-and-Language Pre-Training

Existing approaches in the vision-and-language pre-training (VLP) paradigm mainly deploy either fusion-based encoders or dual encoders, failing to achieve both effectiveness and efficiency in downstream multimodal tasks. In this paper, we build a flexible VLP model by incorporating cross-modal fusions into a dual-encoder architecture, where the introduced fusion modules can be easily decoupled from the dual encoder so as to switch the model to a fusion-free one. To better absorb cross-modal features from the fusion modules, we design a cross-modal knowledge transfer strategy along with other comprehensive pre-training tasks to guide the training process, which can further strengthen both the fusion-based and fusion-free representation learning. Extensive experiments conducted on various downstream vision-language tasks show that our proposed model achieves both effectiveness and efficiency, demonstrating superior performance compared with other strong VLP models.


Introduction
With the great development of self-supervised pre-training in both the natural language processing (Devlin et al., 2019; Raffel et al., 2020) and computer vision (Dosovitskiy et al., 2021; Bao et al., 2022a) communities, recent research has also witnessed the success of Vision-and-Language Pre-training (VLP). VLP learns generic multimodal representations from large-scale image-text pairs and can be further fine-tuned on various downstream Vision-Language (VL) tasks, including image-text retrieval (Lin et al., 2014), visual question answering (Goyal et al., 2017), visual reasoning (Suhr et al., 2019) and visual entailment (Xie et al., 2019).
The core of VLP resides in modeling the interaction between image and text representations. Most mainstream approaches first represent the input image via pre-trained deep feature extractors, then feed the derived visual features along with the text embeddings into multi-layer Transformers (Vaswani et al., 2017), in which cross-modal attention is used to fuse multimodal representations. Despite demonstrating superior performance on downstream VL tasks, these fusion-based methods need to jointly encode image and text representations, significantly degrading efficiency in retrieval tasks with massive candidates of image-text pairs. To make VLP models applicable in real-world scenarios, another line of methods independently encodes text and image with dual encoders, as shown in Fig. 1(a), in which cross-modal fusion is conducted by lightweight modules such as the dot product. Thanks to the dual-encoder architecture, encoded features of image and text can be pre-computed offline for inference efficiency. Nevertheless, independent encoding with shallow interaction fails to fully exploit the cross-modal interaction, making the performance far from satisfactory on VL classification tasks that require a strong ability of multimodal reasoning.
Some recent works attempt to achieve both effectiveness and efficiency in downstream VL tasks. In particular, Wang et al. (2021b) empower a dual-encoder model by distilling knowledge from a fusion-based model. Although the distilled dual encoder learns useful knowledge from cross-modal fusions while keeping its efficiency, this kind of method needs to pre-train a fusion-based model as a teacher, and the performance is severely limited by the ability of the teacher model. VLMo (Bao et al., 2022b) introduces mixture-of-experts to encode various modalities with a modality-agnostic Transformer, which can be used as either a fusion encoder or a dual encoder. However, to fully train such a sparse model with experts for different modalities, not only image-text pairs but also massive image-only and text-only data are required.
In this paper, we propose a unified and flexible VLP model named FOD, which incorporates cross-modal fusions into a dual-encoder architecture to achieve both efficacy and efficiency in multimodal scenarios. Specifically, we adopt a dual architecture with one image encoder and one text encoder, in which cross-modal fusions are placed on the text encoder side. Considering that conventional fusions are based on either concatenation (Kim et al., 2021; Singh et al., 2022) or cascading (Li et al., 2021a; Dou et al., 2022), and thus cannot be directly decoupled from the host encoder, we employ a parallel-style fusion module to model cross-modal interactions, as shown in Fig. 1. In this way, FOD can explicitly capture the complex interaction between modalities during training while switching the fusion-based text encoder to a fusion-free one by removing the fusion module.
In order to retain more cross-modal knowledge in FOD when the fusion modules are removed, we further design a cross-modal knowledge transfer strategy that forces both the unimodal features of image and text to approximate the multimodal representation produced by the fusion-based encoder. Intuitively, since a paired image and text describe the same object from different views, we can naturally associate a set of relevant images when given a caption (and vice versa). Thus, if the text feature learns to "associate" its related images and absorbs them to enhance itself, the enhanced text feature becomes closer to the relevant image candidates (and farther from the unrelated ones) at inference time. A concrete example illustrating this intuition is shown in Fig. 2.
We evaluate our model on both image-text retrieval tasks and vision-language understanding tasks. Experimental results show that our model outperforms other VLP methods on all downstream VL tasks, and even performs competitively with models that use an order of magnitude more data for pre-training. Thanks to the detachable fusion module and the knowledge transfer strategy, our model can be flexibly switched to a fusion-free pattern to enjoy much faster retrieval inference while retaining most of the performance.

Related Work
Without considering the ways of visual feature extraction, the approaches to vision-language pre-training can be divided into two categories based on the interaction form between image and text. The first category, the fusion-based model, explicitly utilizes deep fusion layers with cross-modal attention to model the interaction of images and texts (Tan and Bansal, 2019; Lu et al., 2019; Su et al., 2019; Li et al., 2019; Chen et al., 2020; Li et al., 2020, 2021b; Gan et al., 2020; Zhang et al., 2021; Huang et al., 2020, 2021; Kim et al., 2021; Li et al., 2021a; Wang et al., 2021c; Li et al., 2022; Zeng et al., 2022; Wang et al., 2022). These models perform well on vision-language understanding tasks due to their ability to capture deep cross-modal features. However, for vision-language retrieval tasks, the fusion-based methods need to encode all possible image-text pairs to find the most relevant candidate, resulting in extremely high time cost.
The second category, the dual-based model, utilizes a visual encoder and a text encoder to separately encode images and text, while the interaction between images and text is modeled by cosine similarity or linear projection (Radford et al., 2021; Jia et al., 2021; Yao et al., 2021). Although dual-based models are effective for retrieval tasks since features can be pre-computed and cached offline, the shallow interaction is insufficient to tackle vision-language understanding tasks that require complex VL reasoning. Besides, training a dual-based model often necessitates a large number of image-text pairs (e.g., 300M for FILIP (Yao et al., 2021) and 1.8B for ALIGN (Jia et al., 2021)).
Recently, some researchers have devoted themselves to investigating a unified model that performs well on vision-language understanding tasks while maintaining efficiency on retrieval tasks (Wang et al., 2021b; Liu et al., 2021; Wang et al., 2021a; Bao et al., 2022b; Dou et al., 2022).
To achieve this, one line of work leverages knowledge distillation, in which a fusion-encoder model is pre-trained as a teacher model to guide the training of a dual-encoder model (Wang et al., 2021b), but the performance is inevitably limited by the teacher model. Other efforts attempt to train a modality-agnostic encoder with shared parameters, which can be used as either a fusion encoder or a dual encoder (Wang et al., 2021a; Bao et al., 2022b). Despite the benefits of modeling all the modalities in a single encoder, it is hard to fully train such a huge model, and a large number of training samples in different modalities is required. Different from these methods, we incorporate a detachable cross-modal fusion module into a dual-encoder architecture, which can easily remove the fusion module in inference and switch to a fusion-free model. More importantly, our model does not rely on teacher models or massive data in other modalities.

Model Architecture
As shown in Fig. 3, FOD adopts a transformer-based dual-encoder architecture that includes a visual encoder and a text encoder. The text encoder can be flexibly switched between a fusion-based pattern and a fusion-free pattern. For the fusion-based pattern, cross-modal fusions are incorporated into the text encoder to model multimodal interactions. For the fusion-free pattern, the fusion module is decoupled from the text encoder so as to get rid of the cross-modal calculation. During training, both the fusion-based and fusion-free patterns are involved in the learning process, while in inference, the text encoder is switched to one of the two patterns according to the type of downstream task. In the following sections, we introduce the visual encoder and the two patterns of the text encoder, followed by the pre-training strategies.

Visual Encoder
We utilize Vision Transformer (Dosovitskiy et al., 2021) to build the visual encoder. Given a 2D image $I \in \mathbb{R}^{C \times H \times W}$, we first reshape $I$ into a sequence of 2D image patches $V_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the original image resolution, $C$ is the number of channels, $(P, P)$ is the patch resolution, and $N = HW/P^2$ is the number of patches.
Then we flatten the patches and embed them to $V^e \in \mathbb{R}^{N \times D}$, where $D$ is the hidden size.
We also prepend a learnable embedding $V^e_{cls} \in \mathbb{R}^{D}$ to the patch embeddings $V^e$. Besides, positional information is also important for patch representations. Therefore, the embedded patches $\bar{V}$ are obtained by summing $[V^e_{cls}; V^e]$ and learnable 1D position embeddings $V^{pos} \in \mathbb{R}^{(N+1) \times D}$. Finally, we obtain the visual features $V$ by encoding $\bar{V}$ with the visual encoder $\mathrm{VE}$:
$$V = \mathrm{VE}(\bar{V}). \qquad (3)$$
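To make the computation concrete, the following is a minimal PyTorch sketch of this embedding step; module and variable names such as PatchEmbed are ours, and the base settings P = 16 and the ViT-Base hidden size D = 768 are assumed:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Sketch of the visual embedding step: split the image into P x P
    patches, project them to dimension D, prepend a learnable [CLS]
    embedding, and add learnable 1D position embeddings."""
    def __init__(self, img_size=256, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2      # N = HW / P^2
        # a PxP-stride convolution is equivalent to flattening each patch
        # and applying a shared linear projection
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size,
                              stride=patch_size)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))                # V^e_cls
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # V^pos

    def forward(self, images):                   # images: (B, C, H, W)
        x = self.proj(images)                    # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D): patch embeddings V^e
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)           # [V^e_cls ; V^e]
        return x + self.pos                      # \bar{V}, fed to the ViT encoder (Eq. 3)
```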

Text Encoder
As mentioned before, there are two patterns of the text encoder: the fusion-free text encoder and the fusion-based text encoder. These two patterns are both based on Transformers (Vaswani et al., 2017) and share all the fusion-free parameters except the output linear projection in the last encoding layer. Given the input text $t = \{t_{cls}; w_1; \cdots; w_S\}$, we first embed $t$ to $T^0 \in \mathbb{R}^{S \times D}$ via a word embedding matrix and a position embedding matrix. Then the text embedding $T^0$ can be fed into different patterns of the text encoder to produce different output features.

Fusion-free Text Encoder
In this pattern, the text encoder skips the cross-modal fusions and outputs text-only features. The text encoder is an $L$-layer Transformer, and the output of the $l$-th layer $T^l$ is computed as follows:
$$\hat{T}^l = \mathrm{MSA}(\mathrm{LN}(T^{l-1})) + T^{l-1},$$
$$T^l = \mathrm{MLP}(\mathrm{LN}(\hat{T}^l)) + \hat{T}^l, \qquad (4)$$
where MSA, LN and MLP are short for multi-head self-attention, layer normalization and multi-layer perceptron respectively, and $T = T^L$ is the final feature of the fusion-free text encoder.
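For illustration, below is a minimal PyTorch sketch of one such layer, assuming the pre-LN form written in Eq. 4 (module names are ours):

```python
import torch.nn as nn

class FusionFreeLayer(nn.Module):
    """One pre-LN Transformer layer of the fusion-free text encoder,
    matching the form of Eq. 4 (a sketch; module names are ours)."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, t):                                  # t = T^{l-1}, (B, S, D)
        h = self.ln1(t)
        t = t + self.msa(h, h, h, need_weights=False)[0]   # MSA + residual
        return t + self.mlp(self.ln2(t))                   # MLP + residual -> T^l
```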

Fusion-based Text Encoder
To fully capture vision-and-language interactions, both self-attention and cross-modal attention are considered in the fusion-based encoder. Specifically, in the $l$-th layer, we separately compute the fusion-free self-attention and the image-fused cross-attention, and then sum them up to produce the multimodal features. The detailed process is shown as follows:
$$\hat{M}^l = \mathrm{MSA}(\mathrm{LN}(M^{l-1})) + \mathrm{MCA}(\mathrm{LN}(M^{l-1}), V) + M^{l-1},$$
$$M^l = \mathrm{MLP}(\mathrm{LN}(\hat{M}^l)) + \hat{M}^l, \qquad (5)$$
where MCA is multi-head cross-attention, $M^0 = T^0$, and $V$ is the final visual features produced by the visual encoder (Eq. 3); $M = M^L$ denotes the final multimodal features.
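The following sketch illustrates the parallel fusion layer under the same assumptions as the fusion-free sketch above (names are ours):

```python
import torch.nn as nn

class ParallelFusionLayer(nn.Module):
    """One layer of the fusion-based text encoder: self-attention over
    text and cross-attention into the visual features are computed in
    parallel and summed (a sketch of Eq. 5; module names are ours)."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mca = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, m, v):           # m = M^{l-1}, (B, S, D); v = V, (B, N+1, D)
        h = self.ln1(m)
        sa = self.msa(h, h, h, need_weights=False)[0]  # fusion-free self-attention
        ca = self.mca(h, v, v, need_weights=False)[0]  # image-fused cross-attention
        m = m + sa + ca                                # summed in parallel
        return m + self.mlp(self.ln2(m))
```

Because the self-attention branch is untouched by the visual input, dropping the `mca` term leaves a standard text-only Transformer layer, which is exactly the decoupling exploited at inference time.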

Fusion-free Learning
For this strategy, we utilize image-text contrastive learning to equip the dual architecture with the ability of unimodal encoding, which is not only beneficial to the other cross-modal learning strategies but also the basis for applying the model to downstream retrieval tasks.

Image-Text Contrast
We select $V_{cls}$ and $T_{cls}$ produced by the visual encoder and the fusion-free text encoder to compute the contrastive loss. In order to have more negative examples, we maintain two queues to store the most recent $K$ image and text representations computed by momentum encoders, as in MoCo (He et al., 2020). For convenience, we denote these representations in the queues as $V^k_{cls}$ and $T^k_{cls}$, where $k \in \{1, \cdots, K\}$. The similarity function is defined as
$$s(V_{cls}, T_{cls}) = g(f_v(V_{cls}))^\top g(f_t(T_{cls})), \qquad (6)$$
where $f_v$ and $f_t$ are linear projections and $g$ is L2 normalization. For each image representation $V^j_{cls}$ and text representation $T^j_{cls}$ in the current batch, the image-to-text similarities $p^{i2t}_j$ and text-to-image similarities $p^{t2i}_j$ are computed by:
$$p^{i2t}_{j,k} = \frac{\exp(s(V^j_{cls}, T^k_{cls})/\sigma)}{\sum_{k'=1}^{K} \exp(s(V^j_{cls}, T^{k'}_{cls})/\sigma)}, \quad p^{t2i}_{j,k} = \frac{\exp(s(T^j_{cls}, V^k_{cls})/\sigma)}{\sum_{k'=1}^{K} \exp(s(T^j_{cls}, V^{k'}_{cls})/\sigma)}, \qquad (7)$$
where $\sigma$ is a learnable temperature parameter. Let $y^{i2t}$ and $y^{t2i}$ denote the ground-truth one-hot similarities, where positive pairs have a probability of 1 and negative pairs have a probability of 0. The image-text contrastive loss $\mathcal{L}_{itc}$ is defined as the cross-entropy $H$ between $p$ and $y$:
$$\mathcal{L}_{itc} = \frac{1}{2}\,\mathbb{E}\left[H(y^{i2t}, p^{i2t}) + H(y^{t2i}, p^{t2i})\right]. \qquad (8)$$
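A condensed PyTorch sketch of this objective is given below. It assumes the queues already hold L2-normalized momentum features and that in-batch positives sit on the diagonal; the momentum update and queue maintenance are omitted, and function and variable names are ours:

```python
import torch
import torch.nn.functional as F

def itc_loss(v_cls, t_cls, v_queue, t_queue, temp):
    """Sketch of the queue-based image-text contrastive loss.

    v_cls, t_cls: projected, L2-normalized [CLS] features of the current
    batch, shape (B, D); v_queue, t_queue: normalized momentum features
    of the K most recent samples, shape (K, D); temp: temperature sigma.
    """
    # prepend the in-batch candidates so that, for example j, the
    # positive target sits at index j of the candidate list
    t_cand = torch.cat([t_cls, t_queue], dim=0)        # (B + K, D)
    v_cand = torch.cat([v_cls, v_queue], dim=0)

    sim_i2t = v_cls @ t_cand.t() / temp                # (B, B + K)
    sim_t2i = t_cls @ v_cand.t() / temp
    targets = torch.arange(v_cls.size(0), device=v_cls.device)
    return 0.5 * (F.cross_entropy(sim_i2t, targets) +
                  F.cross_entropy(sim_t2i, targets))
```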

Fusion-based Learning
For this strategy, we apply image-text matching (ITM) and masked language modeling (MLM) to the fusion-based text encoder for learning both coarse-grained and fine-grained cross-modal fusions.

Image-Text Matching
ITM focuses on coarse-grained multimodal learning, which aims to predict whether a pair of image and text is matched or not. Since the image-text pairs in a batch are all positive, we sample global hard negative image-text pairs from all input batches on all the GPUs based on the similarity scores calculated in Eq. 7. Then we feed the final hidden vector of the fusion-based encoder $M_{cls}$ into a binary classifier to predict a two-class probability $p^{itm}$. Given the ground-truth label $y^{itm} \in \{0, 1\}$, the image-text matching loss $\mathcal{L}_{itm}$ is defined as the cross-entropy $H$ between $y^{itm}$ and $p^{itm}$:
$$\mathcal{L}_{itm} = \mathbb{E}\left[H(y^{itm}, p^{itm})\right]. \qquad (9)$$
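The sketch below illustrates this objective under simplifying assumptions: `fuse` is a hypothetical helper returning the fused [CLS] vectors $M_{cls}$, `itm_head` is the binary classifier, and for brevity only text-side negatives are sampled (the image side is symmetric, and the cross-GPU gathering is omitted):

```python
import torch
import torch.nn.functional as F

def itm_loss(itm_head, fuse, images, texts, sim_i2t):
    """Sketch of ITM with hard negatives mined from the ITC similarities.

    fuse(texts, images) -> fused [CLS] vectors M_cls, shape (B, D);
    itm_head: nn.Linear(D, 2); sim_i2t: in-batch (B, B) similarity matrix.
    """
    bsz = sim_i2t.size(0)
    # for each image, sample one non-matching text with probability
    # proportional to its contrastive similarity (a "hard" negative)
    weights = F.softmax(sim_i2t, dim=1) + 1e-5
    weights.fill_diagonal_(0)
    neg_idx = torch.multinomial(weights, 1).squeeze(1)

    pos = fuse(texts, images)            # matched pairs -> label 1
    neg = fuse(texts[neg_idx], images)   # hard negative pairs -> label 0
    logits = itm_head(torch.cat([pos, neg], dim=0))   # (2B, 2)
    labels = torch.cat([torch.ones(bsz), torch.zeros(bsz)]).long().to(logits.device)
    return F.cross_entropy(logits, labels)
```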

Masked Language Modeling
MLM predicts masked tokens based on the image-fused text features, which serves as the fine-grained cross-modal learning. Formally, we randomly mask 15% of the tokens in the text sequence $t$ with a whole-word masking strategy (Cui et al., 2021) and denote the input embedding of the masked text as $\hat{T}^0$.
Then the model is trained to predict the masked tokens based on the final outputs $M$ obtained by feeding $\hat{T}^0$ into the fusion-based encoder. The detailed process is similar to Eq. 5. Let $y^{mask}$ denote the ground-truth labels of the masked tokens, and $p^{mask}$ denote the model's predictions for the masked tokens; the masked language modeling loss is then defined as the cross-entropy $H$ between $y^{mask}$ and $p^{mask}$:
$$\mathcal{L}_{mlm} = \mathbb{E}\left[H(y^{mask}, p^{mask})\right]. \qquad (10)$$
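A simplified sketch of the whole-word masking step on WordPiece tokens is shown below; unlike BERT's 80/10/10 replacement scheme, it always substitutes [MASK] for brevity, and the names are ours:

```python
import random

def whole_word_mask(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Simplified whole-word masking: subwords beginning with '##' are
    grouped with the preceding token, so a word is masked as a unit."""
    words, cur = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and cur:
            cur.append(i)            # continue the current word
        else:
            if cur:
                words.append(cur)
            cur = [i]                # start a new word
    if cur:
        words.append(cur)

    masked = list(tokens)
    for word in words:
        if random.random() < mask_rate:
            for i in word:
                masked[i] = mask_token
    return masked

# e.g. whole_word_mask(["a", "boston", "terr", "##ier", "runs"]) masks
# "terr" and "##ier" together, never one of them alone.
```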

Cross-modal Knowledge Transfer
In our preliminary experiments, we observe that if the ITM loss is removed from the training process, the performance on retrieval tasks degrades dramatically. From the perspective of feature distributions, we believe that ITM helps reduce the spatial distance between the unimodal features of image and text, which encourages us to explicitly utilize ITM to enhance the unimodal representations.
To achieve this, we further design the strategy of cross-modal knowledge transfer (CKT). Given an image-text pair, we can first extract its image representation $V_{cls}$, text representation $T_{cls}$ and multimodal representation $M_{cls}$. Obviously, $M_{cls}$ is the most comprehensive feature describing the image-text pair among the three, but only $V_{cls}$ and $T_{cls}$ are used to compute similarity scores in retrieval tasks. In this case, if we enhance the text feature to actively associate its related images by transferring knowledge from $M_{cls}$ to $T_{cls}$, it will be easier to find the relevant image candidates based on the enhanced text feature in inference (and similarly for $V_{cls}$). Thus, we force both $V_{cls}$ and $T_{cls}$ to approximate $M_{cls}$ via a mean-squared loss in the last layer, calculated as follows:
$$\mathcal{L}_{i2m} = \mathrm{MSE}(f_v(V_{cls}), M_{cls}), \quad \mathcal{L}_{t2m} = \mathrm{MSE}(f_t(T_{cls}), M_{cls}), \qquad (11)$$
where $f_v$ and $f_t$ are the linear projections used in Eq. 6. We do not freeze $M_{cls}$ in knowledge transfer so that multimodal and unimodal features can be jointly trained.
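A minimal sketch of this objective (names are ours; note the deliberate absence of a detach/stop-gradient on $M_{cls}$, matching the joint-training choice above):

```python
import torch.nn.functional as F

def ckt_loss(v_cls, t_cls, m_cls, f_v, f_t):
    """Sketch of Eq. 11: the projected unimodal [CLS] features are pulled
    toward the multimodal feature M_cls. M_cls is NOT detached, so
    unimodal and multimodal features are trained jointly."""
    l_i2m = F.mse_loss(f_v(v_cls), m_cls)   # image-to-multimodal (I2M)
    l_t2m = F.mse_loss(f_t(t_cls), m_cls)   # text-to-multimodal (T2M)
    return l_i2m + l_t2m
```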

Pre-training Data
Following previous work (Kim et al., 2021; Li et al., 2021a), we pre-train on four widely-used datasets; details are given in Appendix A.1. Since a portion of the images are only provided in url format and some of them are inaccessible, we only collected 3.4M images, which is around 600K less than the original settings. In the experiments, we term the setting of 3.4M images as 3M.

Implementation Details
For model settings, the visual encoder adopts the same architecture as ViT-Base (Dosovitskiy et al., 2021) and we initialize it with the pre-trained weights of BEiT (Bao et al., 2022a). The text encoder is modified from BERT-Base (Devlin et al., 2019) by adding multi-head cross-attention, and we initialize it with the pre-trained weights of bert-base-uncased.
For hyper-parameter settings during pre-training, the resolution of input images is 256 × 256 and the patch size is 16 × 16. RandAugment (Cubuk et al., 2020) is applied to the input images. We use the AdamW optimizer (Loshchilov and Hutter, 2017) with a weight decay of 1e-2, and the learning rate is warmed up to 1e-4 over the first 1k steps. We pre-train for 300K steps on 32 NVIDIA A100 GPUs with a batch size of 2048.

Table 5: Ablation study of cross-modal knowledge transfer. For retrieval tasks, we report the average of R@1, R@5 and R@10.

Image-Text Retrieval Tasks
The vision-language retrieval tasks include image-to-text retrieval and text-to-image retrieval. We evaluate our model on the Karpathy and Fei-Fei (2015) splits of MSCOCO (Lin et al., 2014) and Flickr30K. During fine-tuning, we preserve the losses of image-text contrastive learning, image-text matching and cross-modal knowledge transfer. For a better comparison with various methods, we use two settings in the inference phase, namely "Dual" and "Re-Rank".
For the "Dual" setting, we use Eq. 6 to pre-compute image and text representations separately, and compute the similarity scores of all possible image-text pairs by dot product. For the "Re-Rank" setting, we first utilize the similarity scores derived from Eq. 6 to select the top-k candidates, and then predict the final results by calculating their ITM scores ($p^{itm}$).

Visual Question Answering
The VQAv2 (Goyal et al., 2017) task requires the model to predict answers for a given pair of an image and a question. Following Cho et al. (2021) and Li et al. (2021a), we treat VQA as an answer generation problem. In order to compare fairly with other methods, we restrict the answer generation space to the same candidate set (Kim et al., 2021; Bao et al., 2022b) during inference.

Natural Language for Visual Reasoning
The NLVR2 (Suhr et al., 2019) task asks the model to predict whether a text correctly describes a pair of images. We follow previous work (Li et al., 2021a; Zeng et al., 2022) to extend the fusion-based encoder to enable reasoning over image pairs, and feed the encoded vector of the input pair into a classification layer to predict the answer.

Image-Text Retrieval Results
Table 1 and Table 2 show the results of fine-tuned and zero-shot image-text retrieval on MSCOCO and Flickr30K. For a fair comparison, only base-size models pre-trained on the standard 4M data are selected as compared methods. In this setting, our model achieves state-of-the-art performance on both datasets, and even performs competitively with CLIP, ALIGN and BLIP, which are pre-trained on an order of magnitude more data. Furthermore, thanks to the designed parallel-style fusions and the cross-modal knowledge transfer strategy, more cross-modal knowledge is retained when the fusion module is decoupled in inference, narrowing the gap between the "Dual" and "Re-Rank" settings. A detailed analysis of the performance gap between the "Dual" and "Re-Rank" settings is given in the Appendix.

Vision-Language Understanding Results
The VQAv2 and NLVR2 benchmarks are categorized as understanding tasks since they both require the ability of VL reasoning, and the results are shown in Table 3. Our model achieves the best performance on both tasks among all competitors that are also base-size and pre-trained with the standard 4M data, and even outperforms models pre-trained on more data like SimVLM and BLIP, demonstrating the effectiveness and efficiency of our model.

Figure 5: Grad-CAM visualizations of cross-modal attention maps according to different words on VL retrieval.

Different Designs of Fusions
We incorporate cross-modal fusions into a dual architecture to better model vision-language interactions. Conventional fusions are based on two kinds of methods, namely concatenation and cascading.
(1) Concatenation jointly encodes the concatenated image and text sequence, which is quadratic in time complexity with respect to the doubled sequence length and roughly doubles memory consumption. (2) Cascading first uses self-attention to encode the text input and then fuses it with the image via cross-attention, creating a strong dependency between cross-attention and self-attention. Table 4 reports the ablation results of different fusion designs. Our design, which incorporates cross-modal fusions in a parallel manner, outperforms the other methods on retrieval tasks, showing that parallel-style fusion allows our model to switch to the "Dual" setting more flexibly.

Cross-modal Knowledge Transfer
We conduct ablation experiments on the strategy of cross-modal knowledge transfer, shown in Table 5. The objectives of I2M and T2M are defined in Eq. 11. From the results we observe that: (1) I2M specifically improves the performance on image-to-text retrieval (TR) while T2M is beneficial for the text-to-image retrieval (IR) side, which is consistent with their intuitions; (2) I2M and T2M are complementary to each other: adding both during training brings further improvements on retrieval tasks while keeping the performance on VL understanding tasks.

Fusions on Both Sides
Intuitively, in addition to placing cross-modal fusions in the text encoder, we can also add fusion modules to the visual side in a similar way. In this setting, ITM and downstream classifications are based on the concatenation of the multimodal features produced by both the text and visual encoders. Fig. 4 shows the results of placing fusions on different sides, from which we find that when fusions are placed on both sides, the performance unexpectedly drops on all downstream tasks. We believe one possible reason is the difference between text and vision in self-supervised learning. BERT naturally works better under self-supervision than ViT, and thus we can utilize the MLM task from BERT to learn fine-grained cross-modal interaction. On the visual side, self-supervised tasks are much more complex than MLM, inevitably making such a VLP model more difficult to train.

Qualitative Analysis
We further provide a qualitative analysis by using Grad-CAM (Selvaraju et al., 2017) to illustrate per-word visualizations of the cross-modal attention maps of the fusion-based encoder. As shown in Fig. 5, we observe that when conducting image-text matching, our model focuses on specific regions in an image according to different words in each sentence, including objects, actions, attributes and background. More examples are given in the Appendix.

Conclusion
In this work, we propose a flexible VLP model that incorporates cross-modal fusions into a dual-encoder architecture. The fusion module can be easily decoupled in inference, enabling the model to be switched between the fusion-based and fusion-free patterns according to different scenarios. Extensive experiments conducted on both image-text retrieval and vision-language understanding tasks show that our model achieves both effectiveness and efficiency compared with existing VLP models.

Limitations
The findings of this study have to be seen in light of some limitations. (1) It is non-trivial to extend our model to generation tasks. Since the main focus of this work is to improve both the effectiveness and efficiency of dual encoders, a text decoder is not considered in the model design. In the future, autoregressive mechanisms will be considered in the model architecture so that the model can be directly applied to generation tasks like image captioning.
(2) The model may be at a disadvantage in region-level VL tasks such as object detection. The reason is that these tasks require high-resolution images and fine-grained bounding-box annotations, which are non-trivial to obtain in generic VLP settings. To address this problem, exploring different levels of granularity between image-text pairs is a promising direction and will be considered as future work.

A.1 Statistics of Pre-training Datasets
In the experiments, we use four widely-used datasets for pre-training: SBU Captions (Ordonez et al., 2011), Microsoft COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017) and Google Conceptual Captions (Sharma et al., 2018). Because some image urls are no longer accessible, we could only collect 3.4M images, which is around 600K less than the original settings (Kim et al., 2021; Li et al., 2021a; Bao et al., 2022b). Details are shown in Table 6.
Intuitively, if we could access the full 4M data, our model would perform even better.

A.2 Fine-tuning Details
Visual Question Answering. For visual question answering, most methods convert VQAv2 into a classification task by preserving the 3,129 most frequent answers in the dataset. However, this prevents some data from being used for fine-tuning because their answers are not in the candidate set. Thus, we follow previous work (Cho et al., 2021; Li et al., 2021a; Zeng et al., 2022) and treat VQA as an answer generation problem. More specifically, we predict the probability distribution over the vocabulary for the first token, and select the top-k candidates with the highest probability from the distribution. Finally, we use a language-modeling loss to predict the final answer from the top-k candidates. For a fair comparison, we restrict the answer generation space to the same candidate set (Kim et al., 2021; Bao et al., 2022b) during inference.
We fine-tune our model for 8 epochs with a batch size of 256 and a learning rate of 2e-5. The resolution of the input images is set to 576 × 576 (Dou et al., 2022) and k is set to 128.
Natural Language for Visual Reasoning. For NLVR2, we follow previous work (Li et al., 2021a; Zeng et al., 2022) and extend the fusion-based encoder to enable reasoning over image pairs, in which an additional pre-training step is applied to train the model to reason about the relations between text and images. Then, we fine-tune the model for 15 epochs. The batch size is 128, the learning rate is 2e-5, and the resolution of the input images is set to 384 × 384.

A.3 Performance Retaining
For VL retrieval tasks that involve massive candidates of image-text pairs, it is crucial for a VLP model to be able to act as a dual encoder for efficient inference. Table 7 reports the comparison between the "Re-Rank" and "Dual" settings on retrieval tasks. Our model performs best in terms of performance retention when switched from the "Re-Rank" to the "Dual" setting, showing the effectiveness of the designed parallel-style fusions and cross-modal knowledge transfer strategy.

Table 7: Results of different retrieval settings on MSCOCO (5K) and Flickr30k (1K). "R" and "D" are short for the "Re-Rank" and "Dual" settings. We report the average of TR and IR.

A.4 Inference Speed
We further evaluate the inference time of our model and other compared methods on the MSCOCO dataset. All models are evaluated on a single A100 GPU. From the results reported in Table 8, we observe that our model is well-equipped with both efficacy and efficiency in retrieval tasks. Notably, when our model is switched to the fusion-free (dual) pattern, it still achieves comparable performance to other methods while enjoying a much faster inference speed.
"person" "Striped" "holding" "rope" "mountain " The person has a striped shirt on and is holding on to a rope on a mountain.
"going" "into" "building" A man in blue pants is going into a building "wat er" "sunlight" "reflecting" "tree" A large body of water with sunlight reflecting off the water and a tree to the side.
"man " "pants "  ‡ relies on a heavy object detector.† inference is based on the "Re-Rank" method.

A.5 Visualization
We provide more examples of per-word visualizations of our fusion-based encoder fine-tuned on VL retrieval tasks, as shown in Fig. 6. The visualizations suggest that our model focuses on specific regions of the image according to different words in the text when conducting image-text matching.

Figure 1 :
Figure 1: Different designs for vision-language fusions based on multi-head attention; "I" and "T" are short for image and text respectively. (a) Lightweight: only a few or even no parameters are used for VL fusions. (b) Concatenation: multi-head cross-attentions are applied to fuse the concatenation of image and text. (c) Cascading: first uses self-attentions to fully encode the unimodal input, then fuses the encoded features via cross-attentions. (d) Parallel: self-attentions and cross-attentions are independently calculated.

Figure 2 :
Figure 2: An example intuitively illustrating why a fusion-free text encoder can still work when the cross-modal fusions are removed. During training, the cross-modal fusions teach the text feature to "associate" what the related images could be. In inference of image-text retrieval, since a relevant image candidate is naturally closer to other images that are also related to the same text, the text feature along with its "associated" images will be closer to the relevant candidates than the original text feature.

Figure 3 :
Figure 3: Overview of FOD, which consists of an image encoder and a flexible text encoder. The text encoder can be switched between the fusion-based pattern and the fusion-free pattern. We utilize three complementary learning strategies to jointly train FOD: fusion-free learning with Image-Text Contrastive learning (ITC); fusion-based learning with Image-Text Matching (ITM) and Masked Language Modeling (MLM); and Cross-modal Knowledge Transfer with Image-to-Multimodal (I2M) and Text-to-Multimodal (T2M) learning.

Figure 4 :
Figure 4: Ablation study of placing fusions in both the text and visual encoders. FOD-both: fusions are added on both sides.

Figure 6 :
Figure 6: More examples of visualizations of cross-modal attention maps according to different words on VL retrieval.

Table 1 :
Fine-tuned image-text retrieval results on MSCOCO (5K test set) and Flickr30K (1K test set). † inference is based on the "Re-Rank" setting. The bold numbers denote the best results among methods pre-trained with the standard 4M data.

Table 6 :
Comparison of # Images used in Pre-training between official settings and ours.