KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performances on a broad range of vision-language tasks after finetuning. Previous mainstream VLP approaches typically adopt a two-step strategy relying on external object detectors to encode images in a multi-modal Transformer framework, which suffer from a restrictive object concept space, limited image context and inefficient computation. In this paper, we propose an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly. More importantly, we propose to perform object knowledge distillation to facilitate learning cross-modal alignment at different semantic levels. To achieve that, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision: 1.) Object-guided masked vision modeling task focuses on enforcing object-aware representation learning in the multi-modal Transformer; 2.) Phrase-region alignment task aims to improve cross-modal alignment by utilizing the similarities between noun phrases and object labels in the linguistic space. Extensive experiments on a wide range of vision-language tasks demonstrate the efficacy of our proposed framework, and we achieve competitive or superior performances compared with existing pretraining strategies.


Introduction
With the success of BERT (Devlin et al., 2018) in language modeling, self-supervised Vision-and-Language Pretraining (VLP) has attracted much interest from the AI community, which aims to learn generalizable multi-modal representations from large-scale image-text data. Combined with a pretrain-then-transfer strategy, it shows great potential in tackling vision and language reasoning tasks, such as image-text retrieval, visual question answering (VQA) and visual entailment (Antol et al., 2015; Xie et al., 2019). A critical step in such representation learning is to jointly model linguistic entities and visual semantic concepts (e.g., attributes, objects, and relations), as well as their alignment. However, this is particularly challenging due to the large discrepancy between visual and language representations (pixels vs. words) and the lack of entity-level cross-modal correspondence in supervision.
To tackle those challenges, most existing approaches (Lu et al., 2019) adopt a two-step pretraining strategy that first utilizes off-the-shelf detectors to parse images into a set of object tokens, and then builds a multi-layer Transformer to learn visual and language embeddings jointly. In order to facilitate the multi-modal learning, those networks are typically trained via a set of carefully designed BERT-like objectives (e.g. Image-Text Matching). Despite its promising performance, the two-step strategy suffers from several limitations: 1) limited visual object concepts, as the external detectors are trained on a predefined set of object categories; 2) lack of context cues outside of the object regions, which are crucial for complex reasoning tasks; 3) sub-optimal visual representation due to stage-wise training; and 4) computational inefficiency caused by additional detection modules. To overcome those limitations, recent works attempt to learn joint visual-linguistic representations in an end-to-end manner (Huang et al., 2021, 2020; Xu et al., 2021; Kim et al., 2021). These methods directly take dense visual features from image grids as inputs to a multi-modal Transformer network, and hence do not rely on external object detectors in either the pretraining or finetuning stage. Such model design significantly simplifies the overall network architecture and allows deeper integration between visual and language features. However, using grid-level features makes it difficult to capture object-level visual concepts, which often results in less expressive multi-modal representations and inferior performances in downstream tasks.
In this work, we propose a novel object-aware end-to-end (E2E) VLP approach that inherits the strengths of both types of pretraining strategies mentioned above. Our core idea, which we name KD-VLP, is to incorporate visual object concepts in the E2E multi-modal learning, which is instantiated by performing Knowledge Distillation from semantic objects (e.g., from off-the-shelf detectors) during the pretraining stage. This allows the network to better capture object representations and hence facilitates learning the alignment of linguistic entities and visual concepts. To achieve this, we introduce two novel pretext tasks to perform object knowledge distillation based on a CNN+Transformer architecture: an object-based masked vision modeling task for enforcing object-aware feature embeddings, and a phrase-region alignment task for building correspondence between object regions and language entities. Specifically, we adopt a typical CNN backbone + multi-modal Transformer model for the pretraining. Given an image-text pair, the visual backbone first computes a set of visual features on the image grid. Then a multi-layer Transformer takes the visual features and the corresponding text tokens as input to generate their multi-modal embeddings. Based on those embeddings, a set of task-specific heads compute the corresponding objectives to train the entire network in an end-to-end fashion. Here, in addition to the commonly-used image-text matching and masked language modeling objectives, we develop two object-aware pretext tasks. The first task, object-guided masked vision modeling (OMVM), aims to reconstruct the RoI features and semantic label of each object (from an external detector) using the surrounding visual context and text description.
To facilitate cross-modal alignment, we also develop a knowledge-guided masking strategy, which samples object candidates for reconstruction according to the similarity scores between the noun phrases in the corresponding text and their semantic labels. The second task, phrase-region alignment (PRA), aims to further improve cross-modal alignment by matching the above-mentioned phrase-label similarity scores of each phrase with the cross-modal similarity scores between the noun phrase embeddings and object region embeddings. After pretraining, we then transfer the learned multi-modal representations to different downstream vision-language tasks.
We perform pretraining on two widely-used in-domain datasets: MSCOCO Caption (Lin et al., 2014) and Visual Genome (Krishna et al., 2016), and validate the learned multi-modal representations on five well-known vision-language tasks: Visual Question Answering (VQA), Image-text retrieval, Natural Language Visual Reasoning (NLVR 2 ), Visual Entailment (VE) and Visual Commonsense Reasoning (VCR). Empirical results show that our method outperforms the state-of-the-art end-to-end approaches by a sizeable margin. To better understand our method, we also provide a detailed ablation study and visualization.
The contributions of our work are three-fold:
• We propose a novel end-to-end pretraining strategy, capable of better encoding visual object concepts and facilitating multi-modal representation learning.
• We design an object-guided masked vision modeling task for distilling knowledge from external object detectors, and a phrase-region alignment task to facilitate learning better phrase-region correspondence.
• Compared with existing methods, we achieve competitive or superior performances without using external detection outputs during the finetuning stage and model testing.

Related Work
The existing self-supervised VLP approaches can be largely categorized into two groups: two-step pretraining and end-to-end pretraining, depending on whether they rely on visual object embeddings as input for the Transformer.
Two-step Pretraining approaches first employ an off-the-shelf object detector to convert an image into a set of object embeddings, and then feed them into a Transformer jointly with text embeddings to generate their multi-modal representations. Hence their visual feature networks are not optimized in either the pretraining or finetuning stage. Most of these methods, such as LXMERT (Tan and Bansal, 2019), ViLBert (Lu et al., 2019), VL-Bert (Su et al., 2020), Unicoder-VL (Li et al., 2020a) and UNITER, adopt BERT-like objectives to train their networks, which include Masked Language Modeling (MLM), Masked Vision Modeling (MVM) and Image-Text Matching (ITM). In addition, VILLA develops an advanced adversarial pretraining and finetuning strategy to improve generalization ability. OSCAR (Li et al., 2020b) and VINVL introduce object labels to bridge different modalities and revisit the importance of visual features. Ernie-ViL exploits structured knowledge in the text and constructs scene graph prediction tasks to learn joint representations. UNIMO proposes a unified model to leverage large-scale free text corpora, image collections, and image-text pairs simultaneously through a contrastive learning task. Despite their strong performances, those methods are limited by the object detector and neglect visual cues outside of object regions, often leading to mistakes in downstream tasks.
End-to-End (E2E) Pretraining methods directly feed dense features on image grids from a visual backbone network into a Transformer network along with text tokens. As such, both the visual and Transformer networks are optimized jointly in an end-to-end manner in the pretraining & finetuning stages. Pixel-Bert and SOHO (Huang et al., 2020, 2021) pioneer the use of the E2E pretraining architecture and propose a novel visual-dictionary masked vision modeling task. E2E-VLP (Xu et al., 2021) presents a pretraining framework supervised with additional object detection and image captioning tasks to enhance visual semantics learning. It is worth noting that their object detection pretext task requires millions of bounding box annotations, and thus cannot generalize to large-scale image-text corpora. ViLT (Kim et al., 2021) is the first to unify vision and language with a pure Transformer network, which has a simpler structure and enjoys faster inference. However, compared to the two-step methods, these approaches are typically less expressive in terms of object-level concepts and thus suffer from weaker performances on challenging visual reasoning tasks. Our method is in line with the E2E pretraining framework. The key difference is that we propose to facilitate learning object-aware multi-modal representations by performing object semantic knowledge distillation.

Problem Definition and Overview
The goal of self-supervised VLP is to learn a generic and transferable visual-linguistic representation from a large amount of image-text data, which can achieve strong generalization performances in downstream vision-language tasks. To this end, the pretraining framework typically develops a variety of carefully-designed cross-modal pretext tasks (e.g. MLM, ITM) to train a deep network that encodes the multi-modal representation. Formally, we denote the image-text corpus for training as C = {(I, D)}, where I represents the image and D is the corresponding language description. In general, we construct a pretraining network consisting of a representation network module M_θ and a set of task-specific network heads {Φ_{θ_s}}_{s=1}^S, where s indicates the pretext task. The overall pretraining objective is defined as follows:

L = Σ_{s=1}^S L_s(Φ_{θ_s} ∘ M_θ(I, D), Y_s),

where Y_s and L_s are the task-specific ground-truth label and loss function respectively, and ∘ is a network compound operator. After pretraining, we remove all the task-specific heads and apply the representation network M_{θ*} with the learned parameters θ* to the downstream tasks, followed by task-specific fine-tuning.
In this work, we aim to design an E2E pretraining strategy for the VLP problem. To this end, we adopt a modular representation network, which takes image grid features from a CNN-based visual network and the corresponding text embeddings into a multi-modal Transformer (Huang et al., 2020, 2021). Our goal is to learn the visual network and the Transformer jointly, and yet to effectively encode object-level visual concepts in the multi-modal representations. This enables us to capture rich cross-modal alignment between linguistic entities and visual semantic concepts for the downstream tasks, and meanwhile to enjoy the benefits of an efficient E2E network design without relying on detectors during fine-tuning and inference. To achieve this, we propose a set of cross-modal pretext tasks that perform object knowledge distillation from external detectors in both semantic and feature space. Specifically, in addition to the image-text matching (ITM) and masked language modeling (MLM) tasks, we introduce two novel pretext tasks, Object-Guided Masked Vision Modeling (OMVM) and Phrase-Region Alignment (PRA), which take the object RoI feature embeddings and semantic labels from external detectors as supervision. The OMVM task masks out the object regions and forces the network to predict the corresponding external RoI feature embeddings and object labels, while the PRA task exploits object labels to encourage the alignment between visual objects and language entities. Fig.1 illustrates an overview of our framework. Below we will first present the details of the model architecture in Sec.3.2, followed by our design of pretext tasks in Sec.3.3.

Model Architecture
Given an image-text pair, our model firstly computes the image embeddings and linguistic embeddings respectively, and then concatenates them into a sequence of tokens with two additional tokens [sep] and [cls] as inputs to a Transformer for generating multi-modal contextualized embeddings.

Visual Embedding We adopt a CNN backbone to extract image features V = {v_i}_{i=1}^L, where v_i ∈ R^{d_v} is the feature vector at each grid location and L is the number of grid positions.
In addition, each feature is further concatenated with its 2-D sine position embedding (Carion et al., 2020). Following SOHO, we use ResNet-101 (He et al., 2016) as the visual backbone, followed by an additional 1×1 convolution and 2×2-stride max-pooling to reduce the memory footprint.
Linguistic Embedding For the language D, we first tokenize the sentence into a sequence of word tokens using WordPiece (Wu et al., 2016), then encode them into word embeddings W = {w_j}_{j=1}^T, where w_j ∈ R^{d_w} is the feature vector. Similarly, an index position embedding (Devlin et al., 2018) is added to each word embedding.
Multi-modal Transformer After obtaining the image and linguistic embeddings, we assemble them into a sequence of tokens {V, [sep], W, [cls]}, and adopt a multi-layer Transformer to compute their contextualized representations, encoded by the final-layer states {H_V, h_sep, H_W, h_cls}, where H_V and H_W represent the states for the visual and language parts respectively. Finally, those representations are sent into each pretext task head to compute the supervision signals.
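As a rough illustration of how the input sequence is assembled, the following sketch flattens an H×W feature grid into L visual tokens and concatenates them with the word tokens and the two special tokens. It uses plain Python with toy string tokens; `assemble_tokens` and all dimensions are illustrative, not the actual implementation.

```python
# Minimal sketch of the token sequence {V, [sep], W, [cls]} (illustrative only).

def assemble_tokens(grid_features, word_embeddings, sep_token, cls_token):
    """Flatten an HxW grid of visual features into L = H*W tokens and
    concatenate them with word tokens as {V, [sep], W, [cls]}."""
    visual_tokens = [feat for row in grid_features for feat in row]  # L tokens
    return visual_tokens + [sep_token] + word_embeddings + [cls_token]

# Toy example: a 2x3 feature grid (L=6) and 4 word tokens.
grid = [[f"v{r}{c}" for c in range(3)] for r in range(2)]
words = ["w0", "w1", "w2", "w3"]
seq = assemble_tokens(grid, words, "[sep]", "[cls]")
print(len(seq))  # 6 visual + 1 sep + 4 words + 1 cls = 12
```

Note that L here equals the number of grid cells, which is also the length of the flattened binary object masks m_n ∈ R^L used by the pretext tasks below.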

Pretext Tasks
We now describe our cross-modal pretext tasks for the E2E pretraining, aiming to learn more effective multi-modal representations. Below we will first introduce the object-aware pretext tasks that take external object features and semantic labels as supervision, followed by the standard MLM and ITM.
Specifically, for each image, we first generate a set of object proposals from an off-the-shelf detector, denoted as {(o_n, c_n, f_n)}_{n=1}^N, where o_n ∈ R^4 is the box location, c_n indicates the object category, and f_n ∈ R^{d_o} is the object RoI feature with dimension d_o. For ease of notation, we also introduce a binary mask on the feature map for each object o_n and denote its flattened version as m_n ∈ R^L. For the corresponding text, we extract a set of noun phrases P = {p_z}_{z=1}^{|P|} with an external language tool and calculate the similarity α_{z,n} between each noun phrase p_z and the object category c_n in the linguistic space:

α_{z,n} = Cos(E_ext(p_z), E_ext(c_n)),

where Cos(·, ·) indicates the cosine similarity function and E_ext represents an off-the-shelf language embedding (e.g. BERT). Using them as supervision, we design two novel pretext tasks to distill object-level knowledge below.
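The phrase-label similarity above can be sketched as follows; the toy vectors stand in for the off-the-shelf language embedding E_ext, and all phrases, labels, and numbers are made up for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical E_ext embeddings for noun phrases p_z and object labels c_n.
phrase_emb = {"a brown dog": [0.9, 0.1, 0.2], "the park": [0.1, 0.8, 0.3]}
label_emb = {"dog": [0.85, 0.15, 0.1], "tree": [0.2, 0.7, 0.4]}

# alpha[z][n] = Cos(E_ext(p_z), E_ext(c_n)), as in the text.
alpha = {p: {c: cosine(pe, ce) for c, ce in label_emb.items()}
         for p, pe in phrase_emb.items()}
assert alpha["a brown dog"]["dog"] > alpha["a brown dog"]["tree"]
```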
Object-guided Masked Vision Modeling (OMVM) The first task aims to learn more explicit object concepts in the E2E pretraining. Specifically, we sample an object each time, mask out its features in the Transformer input, and enforce the network to generate the external object RoI features and semantic labels. To learn better cross-modal alignment, we propose a knowledge-guided masking strategy, which samples noun phrase-related object regions to mask based on the (normalized) similarity score α_{z,n}. The selected object region is denoted by its binary mask, category and RoI features as (m*, c*, f*).
We design two learning objectives, Masked Region Classification (MRC) and Masked Region Feature Regression (MRFR), as below:

L_OMVM = L_MRC + L_MRFR.

To calculate the losses L_MRC and L_MRFR, we first compute the object representation h_{m*} for the masked region at the final layer, which is average-pooled over H_V based on its binary mask m*. For MRC, a multi-layer FC network Φ_MRC is adopted to predict its object category. Thus, L_MRC = CE(Φ_MRC(h_{m*}), c*) is the standard cross-entropy loss. In addition, we take another FC network Φ_MRFR to learn the object concept in feature space directly by minimizing the L2 distance:

L_MRFR = ||Φ_MRFR(h_{m*}) − f*||_2^2.

Phrase-Region Alignment (PRA) The second task, PRA, mainly focuses on learning cross-modal alignment at the object level, which aims to pull positive phrase-region pairs closer and push negative pairs away. Here we utilize the similarity α_{z,n} between the noun phrase and object category in the linguistic space as guidance.
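As a toy illustration of the MRC and MRFR objectives described above, the sketch below pools the final-layer visual states under an object's binary mask and computes a cross-entropy and a squared-L2 loss. The head networks Φ_MRC and Φ_MRFR are replaced by made-up outputs, and all dimensions are illustrative.

```python
import math

def masked_avg_pool(h_v, mask):
    """Average the final-layer visual states h_v (one vector per grid cell)
    over the cells where the object's binary mask m* is 1."""
    sel = [h for h, m in zip(h_v, mask) if m == 1]
    return [sum(vals) / len(sel) for vals in zip(*sel)]

def cross_entropy(logits, target):
    """L_MRC: standard cross-entropy between predicted logits and c*."""
    mx = max(logits)
    log_z = mx + math.log(sum(math.exp(x - mx) for x in logits))
    return log_z - logits[target]

def l2_loss(pred, target):
    """L_MRFR: squared L2 distance to the external RoI feature f*."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

# Toy example: 4 grid cells with 2-dim states; the object covers cells 1 and 2.
h_v = [[0.0, 0.0], [1.0, 3.0], [3.0, 1.0], [0.0, 0.0]]
h_m = masked_avg_pool(h_v, [0, 1, 1, 0])       # -> [2.0, 2.0]
loss_mrc = cross_entropy([2.5, 0.1, -1.0], 0)  # made-up head output, c* = 0
loss_mrfr = l2_loss(h_m, [2.0, 2.5])           # made-up external f*
```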
Concretely, we first compute the object representation h_{m_n} for each proposal and the phrase representation h_{p_z}, both of which are obtained from the final-layer states of the Transformer. Specifically, h_{m_n} is average-pooled over H_V based on the binary mask m_n, while h_{p_z} = (1/|p_z|) Σ_{j∈p_z} h_{w_j} is the average of the states of the word tokens within p_z. We define the cross-modal similarity as α̂_{z,n} = Cos(h_{p_z}, h_{m_n}).
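A minimal sketch of the similarity matching used by PRA, assuming the objective is a KL-divergence between the softmax-normalized phrase-label scores and the softmax-normalized cross-modal scores (the scores below are made up, and the direction of the KL term is an assumption for illustration):

```python
import math

def softmax(xs):
    mx = max(xs)
    es = [math.exp(x - mx) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def kl_div(p, q):
    """KL(p || q) for two discrete distributions over the N object regions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy scores for one noun phrase p_z over N=3 object proposals:
alpha_z = softmax([2.0, 0.5, 0.1])       # phrase-label similarities (guidance)
alpha_hat_z = softmax([1.5, 0.6, 0.2])   # cross-modal similarities (learned)

loss_pra_z = kl_div(alpha_z, alpha_hat_z)
assert loss_pra_z >= 0.0  # KL-divergence is non-negative
```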
The PRA task minimizes the KL-divergence between the cross-modal similarities α̂_z = {Softmax(α̂_{z,n})}_{n=1}^N and the phrase-label similarities α_z = {Softmax(α_{z,n})}_{n=1}^N as below:

L_PRA(p_z) = KL(α_z || α̂_z).

Finally, denoting the mask set M = {m_n}_{n=1}^N, we have the overall PRA loss averaged over all noun phrases:

L_PRA = (1/|P|) Σ_{z=1}^{|P|} KL(α_z || α̂_z).

Masked Language Modeling (MLM) We take the same masking strategy (15% prob. to mask) as in BERT (Devlin et al., 2018) to randomly mask out the input word tokens. Here, MLM aims to predict the original word index in vocabulary space for each masked token based on the whole image and its surrounding language context via the Transformer. Hence a cross-entropy loss is adopted:

L_MLM = CE(Φ_MLM(h_{w_j}), y_j),

where y_j is the vocabulary index of the original token w_j.

Image-Text Matching (ITM) In ITM, the multi-layer Transformer is trained to distinguish whether the input image-text pairs are semantically matched based on the final-layer [cls] token representation h_cls. To construct the training samples, we randomly replace the text for each image-text pair with another text from the dataset with a probability of 0.5. Thus, the output label can be defined as y ∈ {0, 1}, where y = 1 indicates a matched pair. The training objective for the ITM task is to minimize the binary cross-entropy loss:

L_ITM = BCE(Φ_ITM(h_cls), y).

Experiments

Experiment Setup
Pretraining Corpus: Following the E2E pretraining strategy (Huang et al., 2021, 2020; Xu et al., 2021), we take the in-domain datasets MSCOCO (Lin et al., 2014) and VG (Krishna et al., 2016) as pretraining datasets, since they are widely used in the literature. In total, the two datasets comprise about 200K images and 5.6M image-text pairs, where each image is associated with multiple captions.  Implementation Details: We follow BERT to tokenize each caption into word tokens using WordPiece, and resize images to (800, 1333) following prior works. For the model architecture, a widely-used ResNet101 for visual encoding and a 12-layer Transformer for multi-modal fusion are adopted for a fair comparison. The two networks are initialized with ImageNet and BERT pretrained parameters respectively. Besides, following the majority of two-step methods, we apply the widely-used object detector BUTD (Anderson et al., 2018) to generate object proposals as well as their RoI embeddings as our supervision. For model learning, we optimize the entire network using SGD for the CNNs with a learning rate of 1e-2 and AdamW for the Transformer with a learning rate of 1e-4, as suggested in SOHO. Training runs for up to 100K iterations with a batch size of 512. The learning rate is decayed by a factor of 10 at 20K and 40K iterations respectively. All experiments are conducted on 16 NVIDIA V100 GPUs with mixed-precision training to reduce memory cost, taking about 7 days.
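The step schedule described above can be written as a small helper. This is only a sketch of the decay rule stated in the setup; the function name is illustrative.

```python
import math

def learning_rate(step, base_lr):
    """Step schedule from the setup above: the learning rate is divided
    by 10 at 20K iterations and by 10 again at 40K iterations."""
    if step < 20_000:
        return base_lr
    if step < 40_000:
        return base_lr / 10
    return base_lr / 100

# SGD lr for the CNNs starts at 1e-2; AdamW lr for the Transformer at 1e-4.
assert learning_rate(10_000, 1e-2) == 1e-2
assert math.isclose(learning_rate(25_000, 1e-2), 1e-3)
assert math.isclose(learning_rate(90_000, 1e-4), 1e-6)
```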

Downstream Tasks
As in prior works, we evaluate our approach by finetuning it over a set of well-established VL understanding tasks, including image-text retrieval, visual entailment (VE), natural language visual reasoning (NLVR 2 ), VQA, and VCR. During finetuning, we compound a task-specific learnable head with the pretrained visual backbone and Transformer, then finetune the entire network with the downstream task-specific loss in an E2E fashion. In this work, we mainly compare performance with SOHO, Pixel-Bert, E2E-VLP, and ViLT, since they adopt the same E2E pretraining paradigm as ours. Besides, several representative two-step pretraining approaches are also selected for comparison without loss of generality. Next, we present the result analysis for each task and leave the finetuning experiment setups to the Suppl.
Image-Text Retrieval aims to retrieve an image given a specific caption, or vice versa. As in Tab.1&2, we achieve superior performances in all evaluation settings on both datasets, especially outperforming SOHO by 5.65% and 4.90% R@1 in Flickr30k-IR/-TR, 1.71% and 3.52% R@1 in the MSCOCO-IR/-TR 1K test set, as well as 6.04% and 7.88% in the 5K test set. It is worth noting that we outperform the two-step pretraining SOTA approach UNIMO by a moderate margin, despite that it uses additional out-of-domain datasets, text corpora, image collections, and adversarial training.
Visual Entailment (VE) predicts whether an image semantically entails the text and requires fine-grained reasoning ability in a model. In Tab.1, we achieve 78.21% accuracy on the val set and 77.87% on the test set. It is worth noting that SOHO takes the additional text premise as input, which leads to large improvements. For a fair comparison, we also implement that setting and outperform SOHO by a sizeable margin.
NLVR 2 aims to determine whether a natural caption is true about a pair of photographs, which involves rich semantic diversity and compositionality challenges. We outperform SOHO, Pixel-Bert, ViLT and E2E-VLP by a clear margin as in Tab.1, and perform comparably with two-step pretraining methods.
VQA requires a richer multi-modal understanding to solve free-form and open-ended questions. In Tab.1, the results present a clear improvement over E2E pretraining methods, while surprisingly outperforming the strong two-step pretraining methods by a slight margin.
VCR requires higher-order cognition and commonsense reasoning about the world. We achieve superior accuracy, specifically 76.70%/78.63%/60.53% in the three different problem settings. It is worth noting that we set up the first end-to-end benchmark for the challenging VCR task without relying on detection during inference. Besides, we outperform VL-BERT and OSCAR by a clear margin and perform comparably with VILLA, which adopts advanced adversarial training and more out-of-domain corpora.
Overall, our approach outperforms the previous E2E pretraining methods by a sizeable margin, which indicates the superiority of our object-aware E2E multi-modal representation. In addition, we also perform better than or comparably with previous state-of-the-art two-step pretraining methods, like UNIMO, VILLA and Ernie-ViL, which even adopt more out-of-domain corpora and sophisticated adversarial training.

Ablation Study & Visualization Analysis
In this section, we validate the effectiveness of each pretext task and provide qualitative visualization analysis. To save experimental cost, we adopt a lightweight ResNet-18 and 3-layer Transformer network to conduct the ablation study.
Baseline: The baseline takes standard ITM and MLM to train the entire model. In Tab.3, it still achieves decent results over various VL tasks.
Object-guided masked vision modeling: As in Tab.3, compared with baseline, OMVM presents a clearly consistent improvement over all downstream tasks. It suggests that OMVM can enhance the end-to-end multi-modal representations with explicit object concepts learning. In addition, the knowledge-guided masking strategy further helps establish cross-modal correspondence.
To further investigate the OMVM task, we randomly mask a box region with 15% probability rather than sampling a region based on the normalized similarity score α_{z,n}, denoted as RandomMVM. The other pretraining details are the same as in OMVM. We observe a significant performance drop over all downstream tasks, especially in image-text retrieval and NLVR 2 . It indicates that the simple RandomMVM results in inefficient multi-modal representation learning, because there is a high probability that the selected region has no relationship with the associated description.
In addition, we also explore a masked feature regression task similar to that in UNITER, by randomly masking out image grid features as in BERT and then requiring the Transformer to reconstruct the original features rather than the external object RoI embeddings, denoted as StandardMVM. The results show that such StandardMVM fails to facilitate multi-modal representation learning in the E2E framework.
Phrase-region alignment: The OMVM above mainly focuses on instance-level knowledge distillation by absorbing external object RoI features and semantic labels. Different from that, PRA aims to establish positive object-phrase correspondence while suppressing negative ones under the guidance of similarities between noun phrases and object labels in the linguistic space. As in Tab.3, we improve R@1 by 0.78% on MSCOCO-TR and by 1.87% on MSCOCO-IR. In addition, PRA shows slight improvements on the more challenging fine-grained reasoning tasks, like VE, NLVR 2 , and VQA. The results indicate that PRA is beneficial to multi-modal representation learning.
Visualization analysis: In Fig.2(a), our knowledge-guided masking strategy always masks out the phrase-related image regions, which can facilitate multi-modal learning. On the contrary, previous works, such as SOHO and VILLA, mask out background regions or part of an object region with a high probability; such regions have no relationship with the corresponding description and result in inefficient cross-modal alignment. Fig.2(b) demonstrates the word-to-image attention maps. Compared to SOHO, our method can attend more accurately to the image regions for the corresponding word. Surprisingly, even the word "smiling" can locate the baby's face correctly, which suggests that our approach not only learns better noun-region alignment but also helps establish higher-order correspondence, like actions. (See Suppl. for more visualizations.)
Influence of object detector: We adopt the default BUTD detector used in typical two-step pretraining methods for a fair comparison. To investigate the influence of object detectors, we also conduct pretraining with object knowledge extracted from an FRCNN-RN101 pretrained on COCO. In Tab.4, we observe a performance drop compared with the model pretrained with BUTD, which suggests that a larger object concept space facilitates multi-modal pretraining. Besides, even with the COCO detector, we still outperform SOHO by a clear margin, indicating the superiority of object knowledge in the E2E pretraining framework.
Contribution of each pretext task: In Tab.5, we show the individual contributions of our proposed tasks. The MRC, MRFR, and PRA pretext tasks all help facilitate multi-modal representation learning and, as a result, improve performance over the baseline model.
Impact of object knowledge distillation in different model sizes: We take SOHO as a strong baseline and compare against it at different model sizes (ResNet18 + 3-layer Transformer, ResNet101 + 12-layer Transformer) to investigate the impact of object knowledge distillation. Fig.3 demonstrates the performance gains over some representative vision-language tasks. It shows that object concept learning helps multi-modal representation learning regardless of model size. On VE and text-retrieval, the larger model even improves more significantly than the lightweight model, showing a greater capacity to learn external object semantic knowledge.

Conclusion
In this paper, we have proposed a novel self-supervised VLP method that promotes learning object-aware multi-modal representations in an end-to-end framework. Our key idea is to perform object knowledge distillation in both semantic and feature space from external detectors in the pretraining stage. In particular, we develop an object-guided masked vision modeling task for distilling external object knowledge, and a phrase-region alignment task for learning better alignment of linguistic entities and visual concepts. Compared with prior works, we achieve competitive or superior performance without relying on sophisticated object detectors during model finetuning and testing in downstream tasks.

A Limitations
In this paper, we only pretrain our proposed KD-VLP framework on in-domain datasets, namely the MSCOCO and Visual Genome caption datasets. In the future, we plan to scale up pretraining to noisier web image-text pairs so that the model can learn more general knowledge.

B.1 Dataset Statistics
As shown in Tab.6, we summarize the dataset statistics of pretraining and each downstream task, including the number of image-text pairs and the number of images for each dataset split. It is worth mentioning that we select the MSCOCO & Visual Genome image-text data as our pretraining datasets since they are typical in-domain datasets for many downstream tasks and are widely adopted by prior works.

B.2 More Pretraining Details
In the pretraining stage, we also adopt gradient accumulation and gradient checkpointing techniques to further reduce the GPU memory footprint and increase the batch size. In our experiments, the gradient accumulation step size is set to 4.
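The effect of gradient accumulation can be illustrated with a one-parameter toy model: averaging the gradients of 4 micro-batches reproduces the gradient of the full batch, which is why it permits a larger effective batch size under the same memory budget. All names and numbers below are illustrative.

```python
# Toy demonstration: accumulating gradients over 4 micro-batches matches the
# gradient of one large batch for f(x) = w * x with mean squared error loss.

def grad(w, batch):
    """d/dw of mean squared error over one batch: mean(2 * (w*x - y) * x)."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]

# Large-batch gradient in one shot.
g_full = grad(w, data)

# Accumulate over 4 micro-batches of size 1, averaging at the end
# (equivalent to dividing each micro-batch loss by the accumulation steps).
micro = [data[i:i + 1] for i in range(4)]
g_accum = sum(grad(w, b) for b in micro) / 4

assert abs(g_full - g_accum) < 1e-12
```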

B.3 Binary mask for each proposal
As shown in Fig.4, we generate a binary mask of the same size as the feature map for each proposal, where locations within the bounding box are filled with 1 and all others with 0.
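A minimal sketch of this mask construction, assuming box coordinates are already given on the feature-map grid with end-exclusive bounds (the coordinate convention and function name are assumptions for illustration):

```python
def box_to_mask(box, height, width):
    """Build a binary mask the same size as the feature map: cells inside
    the box (x0, y0, x1, y1), end-exclusive, are 1 and the rest are 0."""
    x0, y0, x1, y1 = box
    return [[1 if x0 <= x < x1 and y0 <= y < y1 else 0
             for x in range(width)]
            for y in range(height)]

mask = box_to_mask((1, 0, 3, 2), height=3, width=4)
# The flattened version m_n has length L = height * width = 12.
flat = [m for row in mask for m in row]
assert len(flat) == 12 and sum(flat) == 4  # a 2x2 box covers 4 cells
```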

B.4 Detailed experiment setup for each downstream task
Image-Text Retrieval: Image-text retrieval typically includes two sub-tasks: image-retrieval (IR) aims to retrieve an image given a specific caption, while text-retrieval (TR) is the reverse. We perform experiments on both the Flickr30k (Plummer et al., 2015) and MSCOCO datasets. As in UNITER, for each GPU we construct a mini-batch of one matched image-text pair, t−1 negative images, and t−1 negative texts, where t is set to 32. Besides, we take a fully-connected network on top of h_cls and adopt the binary cross-entropy loss as the supervision signal. Finetuning runs for up to 10K iterations following a linear decay schedule with an initial learning rate of 7e-5 for the Transformer and 1e-4 for the CNNs. Top-K recall (R@K, K ∈ {1, 5, 10}) is the evaluation metric.
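For reference, the R@K metric can be computed as in the following sketch; the rankings and item ids are made up for illustration:

```python
def recall_at_k(ranked_ids, gt_id, k):
    """1 if the ground-truth item appears in the top-k of the ranked list."""
    return 1.0 if gt_id in ranked_ids[:k] else 0.0

# Toy retrieval results for 3 queries: each list is ranked by model score.
rankings = [(["b", "a", "c"], "a"),   # ground truth at rank 2
            (["x", "y", "z"], "x"),   # ground truth at rank 1
            (["q", "r", "s"], "s")]   # ground truth at rank 3

for k in (1, 5, 10):
    r_at_k = sum(recall_at_k(r, g, k) for r, g in rankings) / len(rankings)
    print(f"R@{k} = {r_at_k:.2f}")
# R@1 = 0.33, R@5 = R@10 = 1.00 for this toy example
```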
Visual Entailment (VE): The VE task aims to predict whether an image semantically entails the text and requires fine-grained reasoning ability in a model. The VE dataset is built upon SNLI (Bowman et al., 2015) and Flickr30k. Each image-text pair is assigned one of three classes: entailment, neutral, contradiction. As in UNITER, we formulate it as a 3-way classification problem based on h_cls. The batch size is 32 per GPU, while the other finetuning strategies are the same.
Natural Language Visual Reasoning (NLVR 2 ): NLVR 2 aims to determine whether a natural caption is true about a pair of photographs, which involves rich semantic diversity and compositionality challenges. We follow UNITER to construct two image-text pairs for each sample and concatenate their h_cls features to infer true or false. All finetuning strategies are the same as before, except for a batch size of 12 per GPU.
Visual Question Answering (VQA): VQA requires a richer multi-modal understanding to solve free-form and open-ended questions. The VQA dataset contains 204K images from MSCOCO, 614K free-form natural language questions, and around 6M answers. It is typically formulated as a 3192-way classification problem and supervised by a binary cross-entropy loss as in UNITER. The batch size here is 32 per GPU, while the other finetuning strategies are kept the same.

Visual Commonsense Reasoning (VCR):
Given a question for an image, VCR needs to 1.) correctly answer it (Q→A); 2.) provide a rationale justifying the answer (QA→R); and 3.) reason about both (Q→AR), which requires higher-order cognition and commonsense reasoning about the world. Following UNITER, we introduce a second-stage pretraining over the VCR dataset due to the significant difference in data distribution compared to the in-domain image-text corpus. In addition, we also utilize a similar person grounding (Park et al., 2020) pretext task to tightly align the person tags in text with their visual locations. During the finetuning stage, we concatenate each question with each possible answer to form four text inputs, and feed each of them into the Transformer network with the corresponding image embeddings. Finally, a binary cross-entropy loss is adopted to supervise each pair. Since VCR questions explicitly reference objects at specific locations, we implement coreferencing between text and image by replacing referenced entities in the questions with their corresponding box locations. In the second-stage pretraining for VCR, we reduce the learning rate to a constant 5e-5 and train for an additional 9K steps. Due to the longer sequence lengths in the VCR dataset, a training batch size of 224 is used. We also use a step size of 2 for gradient accumulation. After pretraining, we finetune on the VCR task for 10K steps with a learning rate of 1e-4 for both the Transformer and the CNNs. Linear warmup of the learning rate is applied for the first 1000 steps, followed by a linear decay ending at a total of 10K steps.
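The warmup-then-linear-decay schedule described above can be sketched as follows; the values come from this finetuning setup and the helper name is illustrative:

```python
def warmup_linear_lr(step, base_lr=1e-4, warmup=1000, total=10_000):
    """Linear warmup for the first `warmup` steps, then linear decay
    reaching zero at `total` steps (values from the VCR finetuning above)."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * (total - step) / (total - warmup)

assert warmup_linear_lr(0) == 0.0
assert warmup_linear_lr(10_000) == 0.0
print(warmup_linear_lr(1000))  # peak learning rate 1e-4
```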

B.5 Influence of image size
We adopt a larger image size mainly for fair comparisons with most two-step pretraining methods as well as Pixel-Bert and E2E-VLP, all of which use the size (800, 1333). To investigate this, we pretrain our method with the size (600, 1000) and report the results in Tab.7. We can see that our method has a mild performance drop, but still outperforms SOHO by a decent margin.

B.6 More Visualizations
As in Fig.5a, we observe that our knowledge-guided masking strategy masks out the image regions that are highly related to the corresponding sentences. This design forces the Transformer to infer object features and semantic labels based on the surrounding visual context and the language description. On the contrary, SOHO randomly masks out either background regions (Fig.5a(1) & Fig.5a(2)) or local object parts (Fig.5a(3) & Fig.5a(4)), which with high probability are not related to the corresponding sentences and result in inefficient multi-modal representation learning. As shown in Fig.5b, our object-aware end-to-end multi-modal representations can accurately establish the correspondence between word tokens and visual tokens, which demonstrates the superiority of our approach.