Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks

Foundation models or pre-trained models have substantially improved the performance of various language, vision, and vision-language understanding tasks. However, existing foundation models can only perform best on one type of task, namely language, vision, or vision-language. It is still an open question whether it is possible to construct a foundation model that performs best for all the understanding tasks, which we call a general foundation model. In this paper, we propose a new general foundation model, X-FM (the X-Foundation Model). X-FM has one language encoder, one vision encoder, and one fusion encoder, as well as a new training method. The training method includes two new techniques for learning X-FM from text, image, and image-text pair data. One is to stop gradients from the vision-language training when learning the language encoder. The other is to leverage the vision-language training to guide the learning of the vision encoder. Extensive experiments on benchmark datasets show that X-FM can significantly outperform existing general foundation models and perform better than or comparably to existing foundation models specifically designed for language, vision, or vision-language understanding. Code and pre-trained models are released at https://github.com/zhangxinsong-nlp/XFM.


Introduction
With the enormous power of foundation models, also known as pre-trained models, remarkable performance gains have recently been achieved in a variety of understanding tasks in natural language processing (NLP), computer vision (CV), and other fields (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2020; Raffel et al., 2020; Brown et al., 2020; Dosovitskiy et al., 2021; He et al., 2022; Bao et al., 2021; Lu et al., 2019; Tan and Bansal, 2019; Chen et al., 2020; Li et al., 2020, 2021a; Zeng et al., 2021, 2022). Foundation models are usually equipped with a Transformer (Vaswani et al., 2017) as the backbone, pre-trained with a tremendous amount of unlabeled data, and then fine-tuned with small amounts of labeled data in downstream tasks. The strong representation ability of the model, the massive amount of data, and the effective means of training make foundation models powerful for successfully solving the tasks of vision, language, and vision-language (Li et al., 2021b,c; Singh et al., 2021; Wang et al., 2021b, 2022b; Diao et al., 2022; Wang et al., 2022a).

* Correspondence to: <xszhang0320@gmail.com>.
The state-of-the-art foundation models usually work best for one type of task: language, vision, or vision-language. For example, RoBERTa (Liu et al., 2019), BEiTv2 (Peng et al., 2022), and X-VLM (Zeng et al., 2021, 2022) are language, vision, and vision-language foundation models respectively, and can achieve state-of-the-art performance on their specific type of task. It is still very challenging, however, to build a general foundation model that can perform the best on all types of tasks. Existing models, such as FLAVA (Singh et al., 2021), OFA (Wang et al., 2022b), DaVinci (Diao et al., 2022), and Uni-Perceiver-MoE (Zhu et al., 2022), try to achieve this goal. Their performance is still not satisfactory, however, when compared with the best-performing foundation models for the individual types of tasks, as shown in Table 1. Previous work (Bingel and Søgaard, 2017; Wang et al., 2020) also shows that it is difficult to train a general foundation model in a multi-task learning setting that can effectively learn and utilize representations for all types of tasks. The reason is that language, vision, and vision-language are very different in nature, and a simple way of jointly training a model from language, vision, and vision-language data can easily lead to a suboptimal solution.
In this paper, we propose a new method for training a general foundation model, and bring in X-FM (the X-Foundation Model). X-FM consists of three modular encoders for language (text) encoding, vision (image) encoding, and fusion encoding, as shown in Fig. 1. The language encoder, the vision encoder, and the entire model can be used in downstream tasks of language, vision, and vision-language understanding, respectively. The language encoder and the vision encoder follow the implementations of BERT (Devlin et al., 2019) and ViT (Dosovitskiy et al., 2021), respectively. Note that X-FM does not include any extra parameters for language and vision tasks. The fusion encoder has the same architecture as BERT, except that there is a cross-attention sub-layer after the self-attention sub-layer in each Transformer layer.
In the learning of X-FM, the language encoder, the vision encoder, and the fusion encoder are jointly trained with text data, image data, and image-text pair data as input. Given text data, we train the language encoder by masked language modeling (MLM). Given image data, we train the vision encoder by masked image modeling (MIM). Given image-text pair data, we train the fusion encoder by image-text matching (ITM), image-conditioned masked language modeling (IMLM), and bounding box prediction (BBP), and also train the vision encoder and the language encoder by image-text contrastive learning (ITC). (See Fig. 1.) The essential idea of our learning method is that language is more abstract than vision, and there is an asymmetric relationship between language and vision. Therefore, we separate the learning of the three encoders. The language encoder is trained mainly on text data and is isolated from the training of the fusion encoder. The vision encoder is simultaneously trained on image data and image-text pair data, guided by the vision-language training. The fusion encoder is trained on image-text pair data.
Our learning method includes two new techniques. One technique is to stop gradients from the vision-language training when learning the language encoder. The gradient flow from the fusion encoder to the language encoder is stopped in training, while the activation flow from the language encoder to the fusion encoder proceeds as usual. As a result, the language encoder is not affected by the training of the fusion encoder with image-text pair data. Moreover, the training of the fusion encoder concentrates on learning the alignments between language and vision features.
The other technique is to leverage the vision-language training to guide the learning of the vision encoder with masked image modeling (MIM). In MIM, the masked image is compared with the original image through the differences between the predicted representations and the target representations at the masked and [CLS] positions. The vision encoder creates both the predicted and target representations; there is gradient flow through the predicted representations but no gradient flow through the target representations. The vision encoder can create the target representations because it is also trained in the vision-language training.
We conduct experiments on twenty-three tasks of language, vision, and vision-language understanding. X-FM can outperform other general foundation models by a large margin and can even achieve performance better than or comparable to SOTA foundation models specifically designed for language, vision, or vision-language understanding tasks, as shown in Table 1.
Our contributions are as follows.
(1) We address the problem of how to build a general foundation model that can perform the best for all the understanding tasks of language, vision, and vision-language.
(2) We propose a general foundation model, X-FM, which can achieve better or competitive performance on both uni-modal and multi-modal understanding tasks through two training techniques.
(3) The stop-gradient technique is useful in maintaining text understanding capability and enhancing multi-modal understanding capability at the same time. We also propose a convenient method for masked image modeling guided by multi-modal learning; this technique can enhance both vision and multi-modal understanding.

Related Work
Recently, the fact that a Transformer can model multi-modal data within a single architecture has inspired research on general foundation models that can solve language, vision, and vision-language tasks at the same time. UNIMO (Li et al., 2021b,c) jointly learns vision representations, language representations, and vision-language alignments in a shared space. FLAVA (Singh et al., 2021) performs pre-training with masked uni-modal and multi-modal modeling objectives. OFA (Wang et al., 2022c) formulates vision-language tasks as sequence-to-sequence (seq2seq) problems and pre-trains a seq2seq model in multi-task learning. SimVLM (Wang et al., 2021c) pre-trains a seq2seq model with a single objective of language generation (prefix language modeling). DaVinci (Diao et al., 2022) combines prefix language modeling and prefix image modeling to learn a general foundation model for a wide range of tasks. Uni-Perceiver (Zhu et al., 2021, 2022) builds a unified perception architecture that processes various modalities and tasks with a single Transformer and shared parameters.
Previous studies on general foundation models have shown that different capabilities can be established within a single model. Still, few studies demonstrate that the best performance can be achieved on all tasks with one model. In this paper, we propose a new method for training a general foundation model and show that it can perform the best for all the understanding tasks of language, vision, and vision-language. We compare our model extensively with recent general foundation models along multiple dimensions, as shown in Appendix A.
Several super-large foundation models (over 1B parameters) have been proposed recently, most of which are trained on super-large in-house datasets (over 900M image-text pairs). Their authors do not report results at the base scale (about 300M parameters) on public datasets, which is the setting we consider in this paper. CoCa (Yu et al., 2022) pre-trains an image-text sequence-to-sequence model with a contrastive loss and a captioning loss. BEiT-3 (Wang et al., 2022d) uses a multi-way Transformer and a unified objective of masked "language" modeling for learning from image, text, and image-text pair data. Flamingo (Alayrac et al., 2022) makes use of a large language model in vision-language pre-training to solve the "in-context learning" problem for vision-language tasks. PaLI (Chen et al., 2022) jointly scales up the vision encoder and the language encoder to cover a variety of language, vision, vision-language, and multilingual tasks.

Method

Model Architecture and Training Process
We propose a new method for training a general foundation model and bring in X-FM, which has a language encoder, a vision encoder, and a fusion encoder, as shown in Fig. 1. The architectures of the language encoder, the vision encoder, and the fusion encoder follow previous work (Devlin et al., 2019; Dosovitskiy et al., 2021; Li et al., 2021a). Text, image, and image-text pair data are used as input to train X-FM. The language encoder is trained by masked language modeling (MLM) and image-text contrastive learning (ITC). The vision encoder is trained by masked image modeling (MIM) and ITC. The fusion encoder is trained by image-text matching (ITM), image-conditioned masked language modeling (IMLM), and bounding box prediction (BBP). There are two new techniques developed for the training.

Stop Gradient. We stop gradients from the vision-language training when learning the language encoder. Specifically, when the fusion encoder is trained with image-text pair data by ITM, IMLM, and BBP, there are forward flows (activations) from the language encoder to the fusion encoder, but there are no backward flows (gradients) from the fusion encoder to the language encoder. In this way, the language encoder is trained only with text data by MLM and with image-text pair data by ITC. The former helps the language encoder learn text representations, and the latter helps align text representations with image representations. Meanwhile, the training of the fusion encoder is performed separately, with a focus on learning cross-modal alignments.
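The stop-gradient mechanism can be sketched in a few lines of PyTorch. This is an illustrative reconstruction, not the released implementation: the class, method, and argument names are ours, and the three encoders are placeholders.

```python
import torch
import torch.nn as nn


class FusionWithStopGrad(nn.Module):
    """Sketch: the fusion encoder consumes detached text features, so the
    gradients of ITM/IMLM/BBP do not flow back into the language encoder.
    Module names are illustrative, not the released implementation."""

    def __init__(self, lang_encoder, vision_encoder, fusion_encoder):
        super().__init__()
        self.lang = lang_encoder      # trained by MLM and ITC
        self.vision = vision_encoder  # trained by MIM and ITC
        self.fusion = fusion_encoder  # trained by ITM, IMLM, BBP

    def forward(self, text_inputs, image_inputs):
        text_feats = self.lang(text_inputs)
        image_feats = self.vision(image_inputs)
        # Forward activations pass through; backward gradients are cut here,
        # so fusion losses never update the language encoder.
        fused = self.fusion(text_feats.detach(), image_feats)
        return text_feats, image_feats, fused
```

ITC, by contrast, would be computed directly on `text_feats` and `image_feats` (without `detach()`), which is how the language encoder still receives an alignment signal from image-text pairs.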
Vision-Language Guided Masked Image Modeling. The training of the vision encoder by MIM is carried out as follows. The image is first masked and then predicted by the vision encoder. The differences between the predicted representations and the 'target' representations at the masked positions and the [CLS] position are then measured with an MSE (mean squared error) loss. The target representations are obtained from the same image (without masking) by the vision encoder. There are no gradients from the target representations in the learning of the vision encoder. The vision encoder can create the target representations because it is also trained with image-text pair data. In this way, the vision encoder is trained by both the cross-modal objectives (ITC, ITM, BBP, IMLM) with image-text pair data and the uni-modal objective (MIM) with image data. The representations obtained from the vision-language training are highly semantic, which is necessary for MIM, as demonstrated in previous work (Bao et al., 2021; Peng et al., 2022; Wei et al., 2022a,b).
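A minimal PyTorch sketch of this objective, under the assumption that the vision encoder returns a [CLS] token followed by patch representations; the function and argument names are illustrative, not the released code.

```python
import torch
import torch.nn.functional as F


def vl_guided_mim_loss(vision_encoder, image, masked_image, mask):
    """Sketch of vision-language-guided MIM (illustrative names).

    vision_encoder : maps an image to [B, 1+N, D]; position 0 is [CLS].
    image          : the original (unmasked) image tensor.
    masked_image   : the same image with some patches masked out.
    mask           : boolean [B, N], True at masked patch positions.

    Targets come from the same encoder applied to the unmasked image and are
    computed without gradient, so only the prediction branch updates the
    encoder through this loss.
    """
    pred = vision_encoder(masked_image)
    with torch.no_grad():
        target = vision_encoder(image)  # no gradient from the target branch
    # MSE at the [CLS] position and at the masked patch positions.
    cls_loss = F.mse_loss(pred[:, 0], target[:, 0])
    patch_loss = F.mse_loss(pred[:, 1:][mask], target[:, 1:][mask])
    return cls_loss + patch_loss
```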
There are two main advantages to the new MIM technique. First, it is convenient to conduct MIM with signals from the vision-language training; note that most previous work on MIM uses an external image tokenizer such as VQ-VAE (Bao et al., 2021; Singh et al., 2021), CLIP (Wei et al., 2022b), or VQ-KD (Peng et al., 2022). Second, the learning of the vision encoder and that of the fusion encoder are mutually enhanced: once the vision encoder is trained, it is also utilized to train the fusion encoder. Fortunately, image data for training the vision encoder is relatively easy to obtain.

Pre-training Objectives
We explain the six objectives used in the learning of X-FM. Here, $\mathcal{T}$ denotes the distribution of text data, $\mathcal{I}$ denotes the distribution of image data, and $\mathcal{D}$ denotes the distribution of image-text pair data.
Masked Language Modeling (MLM). We perform MLM on text data to learn the language encoder of X-FM. Specifically, we recover the masked tokens in a text by minimizing the cross-entropy loss below:

$$\mathcal{L}_{\mathrm{MLM}} = \mathbb{E}_{T \sim \mathcal{T}}\, \mathrm{H}\big(\vec{y},\ \vec{p}(\bar{T})\big),$$

where $T$ denotes a text, $\bar{T}$ denotes the masked text of $T$, $\vec{p}(\bar{T})$ denotes the predicted probability vectors of the masked tokens of $\bar{T}$, $\vec{y}$ denotes the one-hot vectors representing the original tokens of $T$, and $\mathrm{H}$ denotes cross-entropy.
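As a toy illustration of this objective, the cross-entropy restricted to masked positions can be computed as follows (a NumPy sketch with assumed shapes, not the training code):

```python
import numpy as np


def mlm_loss(logits, labels, masked):
    """Toy MLM loss: mean cross-entropy over masked positions only.

    Assumed shapes: logits [L, V] (one row of vocabulary scores per token),
    labels [L] (original token ids), masked boolean [L].
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(labels)), labels]
    return token_nll[masked].mean()
```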
Image-Text Contrastive Learning (ITC). We use a contrastive loss as in CLIP (Radford et al., 2021) to learn the alignments between images and texts in ITC. Given a batch of images and texts, we calculate the cosine similarities between all image-text pairs. For each image, there is one matched text and the rest are unmatched; for each text, there is one matched image and the rest are unmatched. The contrastive loss is defined as follows:

$$\mathcal{L}_{\mathrm{ITC}} = \frac{1}{2}\, \mathbb{E}_{(I,T) \sim \mathcal{D}} \Big[ \mathrm{H}\big(\vec{y}^{\,\mathrm{i2t}}(I),\ \vec{p}^{\,\mathrm{i2t}}(I)\big) + \mathrm{H}\big(\vec{y}^{\,\mathrm{t2i}}(T),\ \vec{p}^{\,\mathrm{t2i}}(T)\big) \Big],$$

where $(I, T)$ denotes an image-text pair, $\vec{p}^{\,\mathrm{i2t}}(I)$ denotes the in-batch image-to-text similarities, $\vec{p}^{\,\mathrm{t2i}}(T)$ denotes the in-batch text-to-image similarities, $\vec{y}^{\,\mathrm{i2t}}(I)$ and $\vec{y}^{\,\mathrm{t2i}}(T)$ denote the one-hot vectors representing the image-to-text and text-to-image matching relations, and $\mathrm{H}$ is cross-entropy.
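A small NumPy sketch of the in-batch contrastive loss. The temperature value (0.07, as in CLIP) and the helper names are our assumptions:

```python
import numpy as np


def _ce_diag(logits):
    """Mean cross-entropy with the diagonal as the matched class."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()


def itc_loss(img_emb, txt_emb, tau=0.07):
    """CLIP-style in-batch contrastive loss (sketch).

    Row i of img_emb and txt_emb is assumed to be a matched image-text pair;
    tau is an assumed temperature.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T / tau  # cosine similarities scaled by temperature
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (_ce_diag(sim) + _ce_diag(sim.T))
```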

Image-Text Matching (ITM). We also learn the alignments between images and texts in ITM, using a loss indicating whether an image-text pair is matched. For each image in a batch there is a matched (positive) text, and we sample an unmatched (negative) text in the batch; for each text there is a matched (positive) image, and we sample an unmatched (negative) image in the batch. The loss is defined as follows:

$$\mathcal{L}_{\mathrm{ITM}} = \mathbb{E}_{(I,T) \sim \mathcal{D}} \Big[ \mathrm{H}\big(p^{\mathrm{match}}(\tilde{I}, T)\big) + \mathrm{H}\big(p^{\mathrm{match}}(I, \tilde{T})\big) + \mathrm{H}\big(p^{\mathrm{match}}(I, T)\big) \Big],$$

where $(I, T)$ denotes a positive image-text pair, $(\tilde{I}, T)$ and $(I, \tilde{T})$ denote negative image-text pairs, $p^{\mathrm{match}}(I, T)$ denotes the predicted matching probability of $(I, T)$, and $\mathrm{H}$ denotes the logistic loss against the corresponding matched/unmatched label.
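The logistic loss over one positive pair and its two sampled negatives can be illustrated as follows (a toy NumPy sketch; the inputs are assumed to be predicted matching probabilities in (0, 1)):

```python
import numpy as np


def itm_loss(p_pos, p_neg_img, p_neg_txt, eps=1e-9):
    """Toy ITM loss: binary cross-entropy over a positive pair (label 1)
    and the two sampled in-batch negatives (label 0).

    p_pos, p_neg_img, p_neg_txt are arrays of predicted match probabilities
    for (I, T), (I~, T), and (I, T~) respectively (names illustrative).
    """
    pos = -np.log(p_pos + eps)                     # matched pair
    neg = -np.log(1 - p_neg_img + eps) - np.log(1 - p_neg_txt + eps)
    return np.mean(pos + neg)
```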
Image-conditioned Masked Language Modeling (IMLM). We conduct IMLM on image-text pair data to learn the fusion encoder. We recover the masked text tokens of an image-text pair given the image, by minimizing the cross-entropy loss below:

$$\mathcal{L}_{\mathrm{IMLM}} = \mathbb{E}_{(I,T) \sim \mathcal{D}}\, \mathrm{H}\big(\vec{y},\ \vec{p}(I, \bar{T})\big),$$

where $(I, T)$ denotes an image-text pair, $\bar{T}$ denotes the masked text of $T$, $\vec{p}(I, \bar{T})$ denotes the predicted probability vectors of the masked tokens of $\bar{T}$ based on $I$, $\vec{y}$ denotes the one-hot vectors representing the original tokens of $T$, and $\mathrm{H}$ denotes cross-entropy.
Bounding Box Prediction (BBP). We adopt BBP from X-VLM (Zeng et al., 2021, 2022), which locates a visual concept in the image by a bounding box given the text. With BBP, we learn the alignments between images and texts at multiple granularities. In BBP, two losses are simultaneously minimized to measure the differences between the predicted bounding box and the ground-truth bounding box: one is the generalized intersection over union, GIoU (Rezatofighi et al., 2019), and the other is the ℓ1 distance.
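The two BBP terms can be illustrated with a small NumPy sketch for boxes in (x1, y1, x2, y2) format. This is a toy reconstruction for intuition; the actual objective operates on the model's normalized box predictions.

```python
import numpy as np


def giou(box_a, box_b):
    """Generalized IoU (Rezatofighi et al., 2019) for (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    # Smallest box enclosing both, used to penalize distant disjoint boxes.
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    hull = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (hull - union) / hull


def bbp_loss(pred, gt):
    """Toy BBP loss: (1 - GIoU) plus the l1 distance on box coordinates."""
    l1 = np.abs(np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)).sum()
    return (1.0 - giou(pred, gt)) + l1
```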
Masked Image Modeling (MIM). We perform MIM on image data and image-text pair data to learn the vision encoder. Specifically, we recover the masked image patches in an image by minimizing the loss below:

$$\mathcal{L}_{\mathrm{MIM}} = \mathbb{E}_{(I,T) \sim \mathcal{D}}\, \big\| \vec{v}(\bar{I}) - \hat{\vec{v}}(I) \big\|_2^2 + \mathbb{E}_{I \sim \mathcal{I}}\, \big\| \vec{v}(\bar{I}) - \hat{\vec{v}}(I) \big\|_2^2,$$

where $(I, T)$ and $I$ denote an image-text pair and a single image respectively, $\bar{I}$ denotes the masked image of $I$, $\vec{v}(\bar{I})$ denotes the predicted representations at the masked positions and [CLS] of $\bar{I}$, $\hat{\vec{v}}(I)$ denotes the target representations at the same positions computed from the unmasked image, and $\|\cdot\|_2^2$ is the MSE loss. We employ block masking following previous work (Bao et al., 2021; Peng et al., 2022). Note that $(I, T)$ and $I$ are independently sampled from $\mathcal{D}$ and $\mathcal{I}$, and the sample sizes are not necessarily equal.
Finally, the pre-training objective of X-FM is defined as the sum of the losses described above.

Experiments
Pre-training Settings. Our model is of base size; the detailed parameters are given in Appendix D. The vision encoder is initialized with BEiTv2, the language encoder is initialized with RoBERTa, and the fusion encoder is trained from scratch. X-FM is pre-trained at an image resolution of 224 × 224 with a patch size of 16 × 16. We pre-train X-FM for 200K steps with a batch size of 3072 image-text pairs, 3072 images, and 8192 sentences on 32 A100 GPUs, which takes about six days. The learning rate is warmed up to 1e-4 in the first 2500 steps and decayed following a linear schedule. We set the maximum number of text tokens to 30 for image-text pairs, while that of the pure text corpus is set to 128. For the "more data" setting, we pre-train X-FM for 400K steps with an 18K batch size on 64 A100 GPUs. In consideration of computational cost, we did not pre-train large or giant models. We apply mixed precision for pre-training. We choose widely used downstream tasks, whose details are given in Appendix C.
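The warm-up-then-linear-decay schedule described above can be written as a small function (a sketch using the stated peak of 1e-4, 2500 warm-up steps, and 200K total steps as defaults):

```python
def lr_at(step, peak=1e-4, warmup=2500, total=200_000):
    """Learning rate at a given step: linear warm-up to `peak` over the
    first `warmup` steps, then linear decay to 0 at `total` steps.
    (Sketch of the schedule described in the text, not the training code.)
    """
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))
```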

Comparison with Foundation Models
We extensively compare the performance of X-FM with state-of-the-art foundation models on vision, language, and multi-modal tasks. We first compare our model with general foundation models, including UNIMO-v2 (Li et al., 2021c), FLAVA (Singh et al., 2021), SimVLM (Wang et al., 2021c), OFA (Wang et al., 2022b), DaVinci (Diao et al., 2022), Uni-Perceiver-MoE (Zhu et al., 2022), OmniVL (Wang et al., 2022a), and mPLUG-2 (Xu et al., 2023). We also include comparisons with SOTA foundation models specifically designed for language, vision, or vision-language tasks: RoBERTa (Liu et al., 2019), BEiTv2 (Peng et al., 2022), and X²-VLM (Zeng et al., 2022). There are several observations in Table 2. First, X-FM base (column 15) outperforms all the previous general foundation models (columns 5-13) across almost all tasks by a large margin, becoming a new and stronger general foundation model. Even when we use less pre-training data, X-FM achieves competitive performance compared with previous general foundation models (columns 5-13 vs. 14). Second, we compare X-FM with state-of-the-art foundation models specifically designed for language, vision, and vision-language tasks: RoBERTa, BEiTv2, and X²-VLM. We observe that X-FM is also better than or comparable with these foundation models (columns 1-4 vs. 15). We further compare our model, X-FM base, with three previous foundation models on 18 image classification tasks in the linear evaluation setting to evaluate generalization performance on vision understanding tasks. The results are shown in Table 4. X-FM base wins 11 of the 18 tasks, versus 7 for CLIP, 2 for FLAVA, and 2 for DaVinci.

Comparison with multi-modal Models
In addition to general foundation models, we also compare X-FM with state-of-the-art vision-language models. The results are shown in Table 3 and Table 6. X-FM demonstrates its superiority on five downstream vision-language tasks: MSCOCO retrieval, Flickr retrieval, VQA, NLVR, and RefCOCO+. Note that X-FM base outperforms CLIP, ALIGN, and Florence on image-text retrieval tasks with fewer parameters and much less training data. Compared to the recently released SOTA vision-language model X²-VLM, X-FM is much better on zero-shot image-text retrieval tasks. When we scale up the pre-training datasets, X-FM base is consistently better than previous vision-language models in most cases.

Ablation Study
To verify the contributions of different modules in our framework, we ablate them and evaluate the performance of X-FM on all downstream tasks. The results are shown in Table 5. We first explain several abbreviations in the table. S-MLM means that we stop the gradient of the language representations only in the IMLM task, while S-ITM means stopping the gradient of the language representations when computing ITM and BBP. "wostop" indicates not stopping the gradients of any language representations. "woMIM" means that we do not learn by MIM, while "wBEiTv2 tokenizer" means that we learn by MIM with the image tokenizer used in BEiTv2. "Multi-task" is a variant that uses straightforward multi-task learning to optimize the three encoders in X-FM. To make a fair comparison, we also train RoBERTa, BEiTv2, and X²-VLM with the same data, denoted RoBERTa†, BEiTv2†, and X²-VLM†. Note that we also increase the number of fusion layers in X²-VLM† to make the parameter sizes comparable to our models. RoBERTa†, BEiTv2†, and X²-VLM† all have slightly better results on average than the official ones. From the results, we have the following observations. First, both designs (stop gradient and vision-language guided MIM) bring improvements, and their combination makes further improvements on all three types of downstream tasks (column 10 vs. others). Second, without separated language representations, models always perform worse on language understanding tasks (column 10 vs. 2, 3, 4). Besides, the separate language representations in the IMLM task on image-text data are helpful for multi-modal tasks (column 2 vs. 4). As we point out in Section 1, the fusion encoder can concentrate on learning the alignments between language and vision features instead of predicting masked tokens with clues from other visible text tokens. Although S-ITM shows slight side effects (column 4 vs. 3), stopping the gradients of the language representations in the fusion encoder is necessary to simultaneously achieve strong language understanding and vision-language understanding capability. Third, the vision-language guided MIM task is useful for both vision-language and vision learning (column 10 vs. 6). Meanwhile, the targets in our MIM task are better than the BEiTv2 tokenizer (column 10 vs. 7). Fourth, X-FM is much better than a naive multi-task learning strategy for a foundation model; compared with it, X-FM base improves by an average of 0.9%, 1.7%, and 1.6% on language, vision, and vision-language tasks, respectively (column 10 vs. 9). Fifth, X-FM is also better than foundation models specifically designed for language, vision, and vision-language tasks with the same training corpus (column 10 vs. 1, 5, 8).

Limitations and Potential Risks
Limitations. Like most existing work on foundation models, the entire project consumed over 5 A100 GPU-years on a computing cluster with high electricity costs, although we only tested base models. There is still potential for efficiency improvement through sparse attention (Zaheer et al., 2020) or the lottery ticket hypothesis (Frankle and Carbin, 2018). We will explore techniques to improve training efficiency and reduce the carbon footprint so that we can adhere to the proposals on "green" deep learning (Schwartz et al., 2020; Xu et al., 2021). Due to considerations of fair comparison and computational resources, we did not try super-large models which use 1.9B or more parameters, such as BEiT-3 (Wang et al., 2022d), CoCa (Yu et al., 2022), and PaLI (Chen et al., 2022). We also did not pre-train a large-size model on large-scale datasets. However, scalability is also an important factor for foundation models. We leave these investigations to future work.

Table 5 :
Ablation studies on vision, language, and vision-language tasks. We use the same settings as in Table 2. "ALL" means we use both of our proposed techniques. To compare fairly, we pre-train all variants with the same data at the same settings for both pre-training and fine-tuning. Avg. means the average score.
Potential Risks. The image-text pairs used for training our model are mostly derived from lexical databases and image queries in English, resulting in source material with a North American or Western European bias.

Conclusion
In this work, we address the problem of how to build a general foundation model that can perform the best for all the understanding tasks of language, vision, and vision-language. We propose a new method for training a general foundation model with two new and effective techniques, bringing in X-FM, to learn rich language, vision, and vision-language representations at the same time. Experimental results demonstrate that X-FM outperforms other general foundation models by a large margin. Moreover, X-FM can even be better than or comparable to the SOTA foundation models specifically designed for language, vision, or vision-language understanding tasks.

A Comparison of Foundation Models
Table 7 shows an extensive comparison of recent foundation models and X-FM along multiple axes. Previous work either (1) performs best on uni-modal tasks (Liu et al., 2019; Peng et al., 2022) or vision-language tasks (Zeng et al., 2021, 2022); (2) targets a specific uni-modal domain along with part of the vision-and-language tasks (Wang et al., 2021a; Radford et al., 2021; Jia et al., 2021; Wang et al., 2021c; Yu et al., 2022; Wang et al., 2022b; Diao et al., 2022); or (3) targets all domains but cannot perform best on all the tasks (Li et al., 2021c; Singh et al., 2021; Zhu et al., 2022). Our model, X-FM, is a general foundation model that can perform the best for all the understanding tasks of language, vision, and vision-language.

Figure 1 :
Figure 1: The architecture and pre-training process of X-FM, a Transformer-based general foundation model. Given a text, we learn the language encoder by MLM. Given an image, we learn the vision encoder by MIM. Given an image-text pair, we learn the fusion encoder by BBP, ITM, IMLM, and ITC, and further learn the vision encoder by MIM. The gradients of BBP, ITM, and IMLM are stopped from the fusion encoder to the language encoder. The vision encoder is trained by MIM with both the image-text pair data and the image data. M, N, and L denote the numbers of encoder layers.

Table 2 :
Experimental results on vision, language, and vision-language tasks. The multi-modal data size used for pre-training is reported under the model name. MNLI results are the average of MNLI-m and MNLI-mm. MRPC results are average accuracies and F1 scores. Matthews correlation coefficient (MCC) is reported for CoLA, and Pearson correlation coefficient (PCC) is reported for STS-B. We report accuracies for all the vision and multi-modal tasks. FT is short for fine-tuning, LE for linear evaluation, ZS for zero-shot, TR for text retrieval, and IR for image retrieval. Results for RoBERTa are from its corresponding paper (Liu et al., 2019); they use mid-training (Phang et al., 2018) on MNLI for RTE, MRPC, and STS-B, while other models (e.g., DaVinci, X-FM) do not use this trick. Note that mPLUG-2 uses more layers and parameters than RoBERTa and X-FM for the language understanding tasks. Language Avg. is the average score of all the language tasks, while Vision Avg. is the average score of the six linear evaluation tasks except ImageNet. Vision-Language Avg. is the average score of all vision-language tasks. † denotes our reproduced results with the officially released models.

Table 4 :
Linear evaluation performance of four foundation models over 18 datasets. B/16-224px means a base-size model with 16×16 patches and 224×224 resolution. The best performance is shown in bold.

Table 6 :
Results on VQA, visual reasoning, and visual grounding. Giant models with over 1B parameters (e.g., CoCa and BEiT-3) are shown in grey because they are not directly comparable with the other models.

Table 9 :
Size variants of X-FM. All modules consist of Transformer layers. Param indicates the number of parameters: Total means the total parameter count, and Trans. indicates the parameter count for the Transformer layers.

D Details of Hyper-parameters

Pre-training. X-FM base is implemented with a 12-layer language encoder, a 12-layer vision encoder, and a 12-layer fusion encoder, with 768 dimensions for hidden states, 3072 for the intermediate size, and 128 for the maximum input length. We initialize the language encoder with RoBERTa and the vision encoder with BEiTv2. The weight decay is set to 0.01 with β1 = 0.9, β2 = 0.98. The learning rate is 1e-4, with a warm-up period for the first 2500 steps, and is then linearly decayed to 0. In each batch, there are 3072 image-text pairs, 3072 images, and 8192 text-only sentences. We use center-crop to resize each image to 224×224. The model sizes and default hyper-parameter settings are shown in Table 9 and Table 10, respectively.

Table 10 :
Pre-training settings.

Fine-tuning. The learning rate is ∈ {1e-5, 2e-5, 5e-5}, and our model is optimized by AdamW. Because the image resolution differs between pre-training and fine-tuning, the position parameters are adapted using linear interpolation. For all downstream tasks, we apply random resized crops and horizontal flip augmentation during training. The default settings for text classification, image classification, and vision-language understanding are shown in Tables 11, 12, 13, and 14, respectively. Note that the resolution for VQA is different, as described in Appendix C.
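The position-parameter adaptation can be sketched as follows, assuming ViT-style position embeddings with a leading [CLS] slot and a square patch grid. Bilinear resampling here stands in for the linear interpolation described above, and all names are illustrative:

```python
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed, new_grid):
    """Adapt ViT position embeddings to a new image resolution (sketch).

    pos_embed : [1, 1 + H*W, D] with a leading [CLS] position (assumed layout).
    new_grid  : (H', W') target patch grid.

    The [CLS] slot is kept as-is; the patch-grid part is reshaped to 2D and
    resampled to the new grid size.
    """
    cls_tok, grid = pos_embed[:, :1], pos_embed[:, 1:]
    n, dim = grid.shape[1], grid.shape[2]
    side = int(n ** 0.5)  # assumes a square patch grid
    grid = grid.reshape(1, side, side, dim).permute(0, 3, 1, 2)  # [1, D, H, W]
    grid = F.interpolate(grid, size=new_grid, mode="bilinear",
                         align_corners=False)
    grid = grid.permute(0, 2, 3, 1).reshape(1, new_grid[0] * new_grid[1], dim)
    return torch.cat([cls_tok, grid], dim=1)
```

For example, going from 224×224 pre-training to a higher fine-tuning resolution with 16×16 patches changes the grid from 14×14 to a larger one, and this function produces position embeddings of the matching length.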