Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation

In this paper, we reconsider the problem of (partial) false negative samples from the perspective of Mutual Information (MI) maximization. The traditional contrastive loss (e.g., the InfoNCE loss) pushes all negative samples equally far away from the anchor, regardless of their possible semantic similarity to it. We theoretically show that the InfoNCE loss not only maximizes the MI between the anchor and positive samples but also minimizes the MI between the anchor and false negative samples, even though they share similar semantics. This provides a possible theoretical explanation for the observation that false negative samples in cross-modal contrastive learning decrease the downstream task performance of VLP models. The above analysis motivates us to propose a VLP model with a novel Semantic-Aware Contrastive Learning framework named SACL, where different negative samples are assigned different contrastive weights according to the semantic similarity between them and the anchor.


Introduction
* Corresponding author.

Figure 1: Conceptual illustration of cross-modal similarity regularization. Sub-figure (a) demonstrates that (partial) false negatives commonly exist in cross-modal training data. In sub-figure (b), negative samples are equally pushed away from the anchor in conventional cross-modal contrastive learning, leading to data deficiency given these false ones. Instead, we take the first step to contrast negatives according to cross-modal similarity (represented by a set of concentric circles in sub-figure (c)), keeping the good while removing the harmful effects of (partial) false negative samples.

Large-scale pre-trained vision-language models have recently achieved tremendous success on a wide range of cross-modal tasks (Tan and Bansal, 2019; Chen et al., 2020c; Huang et al., 2020; Wang et al., 2021b; Li et al., 2022a). Self-supervised learning (SSL) (Jaiswal et al., 2020) has impressively contributed to vision-language pre-training (VLP) due to its capability of leveraging large-scale image-text pairs without annotations. More recently, Self-supervised Multi-modal Contrastive Learning (SMCL) has triggered great progress (Li et al., 2022b; Radford et al., 2021; Yao et al., 2021; Li et al., 2022a) by conducting cross-modal alignment. SMCL consists of image-to-text and text-to-image contrastive learning, e.g.,
with the InfoNCE (Oord et al., 2018) loss. Taking the text-to-image direction as an example, given a text-image pair (T, I), I is treated as the positive sample for the anchor T, and the other images in a mini-batch of text-image pairs are regarded as negatives. The training objective is to attract the positive to the anchor while repelling all the negative samples. However, this contrasting strategy can be problematic given the many-to-many correspondences between images and texts. As shown in Figure 1 (a), a text can be semantically paired with multiple images. In this scenario, though images I4 and I5 are treated as negatives, they are actually semantically consistent (or partially consistent) with the text anchor "A bird in the tree." The (partial) false negatives like I4 and I5 will inevitably hinder the contrasting effect, yielding sub-optimal cross-modal representations.
Some pioneering efforts have addressed the noisy image-text pairing problem in VLP pre-training datasets (e.g., Andonian et al., 2022), by feeding the contrastive loss with soft labels in a self-distillation manner. Though these methods can address the problem of false negatives to some extent, the specific harmful effect of false negatives remains far from being systematically studied. For example, on top of these methods (e.g., ALBEF), we can easily improve the performance of downstream tasks by simply filtering false negatives, as shown in Table 1.
In this paper, we investigate the problem of false negatives from the perspective of Mutual Information (MI) optimization. The InfoNCE loss used in contrastive learning has been proved to maximize the lower bound of MI between anchors and their positives (Oord et al., 2018). We revisit the theoretical proof in the presence of non-negligible false negatives. Defining the MI between anchors and positives as MI-P, and the counterpart between anchors and negatives as MI-N, we derive a more general conclusion (see the appendix A.2) that optimizing InfoNCE is equivalent to maximizing the lower bound of (MI-P − MI-N). The finding suggests that MI-N will be minimized (e.g., as close to zero as possible), even though some negatives may semantically match the anchor. The theoretical analyses explain the deficiency of the vanilla contrasting strategy on the one hand, and inspire us with another derivation (appendix A.3) that guarantees proper MI optimization for negative samples on the other hand.
Guided by these theoretical analyses, we propose a novel contrasting strategy regulated by cross-modal similarity. We hypothesize that the MI between an image and a text positively correlates with their semantic similarity. Therefore, we introduce a contrastive weight, derived from cross-modal similarity and progressively refined with training, for each negative sample as a contrasting regulator. This regulator guides the model to optimize MI-N properly, keeping it from being unexpectedly minimized and thus yielding a more semantically structured representation space. We equip ALBEF with our proposed contrasting strategy and evaluate it on downstream tasks including Visual Question Answering (VQA), Cross-modal Retrieval, Zero-shot Cross-modal Retrieval, and Natural Language for Visual Reasoning (NLVR). The experimental results show that our adjusted contrastive learning significantly improves their performances.

Table 1: We directly remove false negative samples in a heuristic way from a mini-batch (more details in Section 4.3), obtaining a new pre-trained model ALBEF++. We report the performance of Zero-shot Cross-modal Retrieval (Flickr30K) and Visual Question Answering (VQA). Even by simply removing false negatives, ALBEF++ outperforms ALBEF by an evident margin, indicating that existing efforts have not sufficiently addressed the harmful effects of false negatives.
In summary, our contributions are: • We investigate the issue of false negatives in cross-modal contrastive learning from the perspective of Mutual Information (MI) optimization. We deduce a more general form of MI's lower bound for InfoNCE loss in the presence of non-negligible false negatives, revealing that the MI between (partial) false negatives and anchors is improperly minimized.
• Based on a theoretical derivation that guarantees appropriate MI optimization for negative samples, we propose a novel contrasting strategy by attaching each negative sample with a progressively refined contrastive weight based on cross-modal similarity.
• Applying the contrasting strategy to VLP methods yields impressive performance improvements on various downstream tasks, and demonstrates that our contrasting strategy systematically balances the positive and negative impacts of false negatives.

Theoretical Analysis from Mutual Information Perspective
Mutual Information (MI) is designed to measure the relationship between random variables, or to determine the amount of shared information (Becker, 1993, 1996). Oord et al. (2018) have proven that the InfoNCE loss function widely used in contrastive learning can be seen as a lower bound of the MI between anchors and positives. Note that prior work provides a conceptual yet more intuitive discussion of the correspondence between InfoNCE and MI in the VLP scenario. In this paper, we go one step further to revisit the proof of Oord et al. (2018).

Preliminaries
The standard InfoNCE loss in VLP consists of two parts: L_InfoNCE = L^v_InfoNCE + L^t_InfoNCE, where the former corresponds to image-to-text alignment and the latter to text-to-image alignment. For the following discussion, we take L^v_InfoNCE as an example. Suppose we randomly sample N semantically paired image-text tuples {(v_i, t_i)}_{i=1}^{N}; L^v_InfoNCE is defined as:

L^v_InfoNCE = -(1/N) Σ_{i=1}^{N} log [ f(v_i, t_i) / Σ_{j=1}^{N} f(v_i, t_j) ],   (1)

where f(v_i, t_i) measures the distance between v_i and t_i in a semantic space. According to Oord et al. (2018), the function f(v_i, t_i) can be utilized to model the density ratio, which preserves the mutual information between v_i and t_i, so we can rewrite f(v_i, t_i) as P(t_i|v_i) / P(t_i). Then we can derive the well-known lower bound of the MI between t_i and v_i:

I(t_i, v_i) ≥ log(N) − L^v_InfoNCE,   (2)

where I(t_i, v_i) is the mutual information between t_i and v_i. The details of this derivation, transcribed to the VLP setting, can be found in appendix A.1.
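As a concrete illustration, the image-to-text InfoNCE loss above can be sketched in a few lines of NumPy (the function name and the toy scores are ours, not from the paper):

```python
import numpy as np

def info_nce_v(f):
    """Image-to-text InfoNCE loss.

    f[i, j] is the (positive-valued) score f(v_i, t_j); the diagonal
    holds the positive pairs, all other columns act as negatives.
    """
    per_anchor = -np.log(f.diagonal() / f.sum(axis=1))
    return per_anchor.mean()

# Uniform scores carry no information: the loss equals log(N),
# i.e., the MI lower bound log(N) - loss collapses to zero.
N = 4
uniform = np.ones((N, N))
print(np.isclose(info_nce_v(uniform), np.log(N)))  # True

# Sharply peaked diagonal scores drive the loss toward zero.
peaked = np.exp(10 * np.eye(N))
print(info_nce_v(peaked) < info_nce_v(uniform))  # True
```

The second check mirrors Equation 2: the smaller the loss, the larger the guaranteed MI between anchors and positives.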

MI Derivation with False Negatives
The derivation in Appendix A.1 implicitly assumes that t_j (the negative sample) and v_i are independent, which is reasonable given a large enough number of negatives with little noise. The expectation of the density ratio P(t_j|v_i) / P(t_j) is then equal to 1 and can be eliminated (e.g., from Equation 12 to Equation 13). In the presence of non-negligible false negatives, t_j and v_i may not be independent. Therefore, we revisit this derivation and deduce a more general conclusion (see the detailed derivation in appendix A.2):

I(t_i, v_i) − E_{t_j}[ I(t_j, v_i) ] ≥ log(N) − L^v_InfoNCE.   (3)

Equation 3 provides a more general form of the lower bound that the InfoNCE loss optimizes. The first term on the left side is the MI between an anchor and its positive, and the second term is the MI expectation between an anchor and its negatives. Equation 3 reveals that optimizing InfoNCE is equivalent to maximizing the lower bound of the difference between the former and the latter.

Theoretical Guidance for Addressing False Negatives
Combining Equations 2 and 3, we find that in addition to maximizing the MI between an anchor and its positive (say MI-P), the InfoNCE loss will also minimize the MI expectation between an anchor and its negatives (say MI-N), e.g., pushing it as close to zero as possible, despite the existence of (partial) false negative samples. Since these may semantically match the anchor, over-minimizing MI-N can produce a less structured cross-modal representation space.
To optimize MI-N to a proper value, we first need a prior estimation of MI-N as a target. Here we exploit cross-modal similarity to approximate the MI between an image and a text. The second problem is how to integrate this prior estimation into the optimization process. Based on the derivation of Equation 3, we further theoretically prove that assigning a positive weight w_{i,j} to each f(v_i, t_j) can push the MI expectation between an anchor and its negatives to a controllable positive value, given the following two conditions:
• Condition 1. The covariance between w_{i,j} and P(t_j|v_i) / P(t_j) is negative.
• Condition 2. The expectation of w_{i,j} among all negatives is equal to 1.
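Both conditions can be satisfied constructively. The sketch below (our own toy instantiation, not the paper's exact formula) builds weights that decrease with similarity and are normalized to mean 1:

```python
import numpy as np

def regulation_weights(sim, delta=1.0):
    """Weights for one anchor's negatives: decreasing in cross-modal
    similarity (Condition 1) and normalized to mean 1 (Condition 2)."""
    raw = np.exp(-delta * np.asarray(sim, dtype=float))
    return raw / raw.mean()

# Two (partial) false negatives with high similarity get small weights.
sim = np.array([0.9, 0.1, 0.2, 0.85])
w = regulation_weights(sim)
print(np.isclose(w.mean(), 1.0))   # Condition 2: mean weight is 1
print(np.cov(w, sim)[0, 1] < 0)    # Condition 1: negative covariance
```

Any monotonically decreasing mapping followed by mean normalization would do; the exponential form is just one convenient choice.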
With this theoretical guidance (see the complete proof in Appendix A.3), we propose to improve the InfoNCE loss by assigning each negative a contrastive weight that is inversely proportional to its cross-modal similarity with the anchor.

Figure 2: The pipeline of our method. The proposed model consists of two unimodal encoders for images and text separately and a multi-modal encoder for the fusion of the cross-modal information. After feeding the input to the unimodal encoders, we take the representation of the [CLS] token as the global representation and use Similarity-Regulated Contrastive Learning (SRCL) to align the unimodal representations of an image-text pair. We also apply an image-text matching loss and a masked-language-modeling loss to learn multimodal interactions between image and text.

Method
In this section, we first introduce our model architecture, then our Similarity-Regulated Contrastive Learning (SRCL), followed by the details of the other pre-training objectives. Figure 2 shows an overview of our model, which consists of two unimodal encoders for image and text and a multi-modal encoder. To better model the inherent modality-specific information, we first use the two unimodal encoders to encode the image and text separately. Following Dou et al. (2021) and Shen et al. (2021), we apply a visual transformer (Dosovitskiy et al., 2020) directly on the image patches as the visual encoder, which is more computation-friendly than using pre-trained object detectors for visual feature extraction (Anderson et al., 2018; Zhang et al., 2021). The visual encoder divides an input image into patches and encodes them as a sequence of embeddings {v_cls, v_1, v_2, ..., v_m}, where v_cls corresponds to an additional [CLS] token. The input text is fed to the text encoder and represented as a sequence of embeddings {t_cls, t_1, t_2, ..., t_n}, where t_cls is the embedding of the [CLS] token and summarizes the input text. The visual and linguistic representations are then fed into the multi-modal encoder, which consists of multiple transformer layers.

Cross-modal Similarity Regulation
In section 2, we reveal that vanilla InfoNCE loss will treat negative samples equally without considering their semantic similarity with anchors. Thus the MI between the (partial) false negative samples and the anchor is over-reduced, limiting the performance of pre-training models.
We propose a novel contrasting strategy regulated by cross-modal similarity. We hypothesize that the MI between an image and text positively correlates with their semantic similarity. Therefore, we introduce a contrastive weight, which is derived based on cross-modal similarity and progressively refined with training, for each negative sample as a contrasting regulator. This regulator drives the model to optimize MI-N properly rather than simply minimizing it.
Formally, given a batch of N semantically paired image-text tuples {(v_i, t_i)}_{i=1}^{N}, the image-to-text contrastive loss is:

L^v_SRCL = -(1/N) Σ_{i=1}^{N} log [ f(v_i, t_i) / ( f(v_i, t_i) + Σ_{j≠i} w^t_{i,j} f(v_i, t_j) ) ],

where w^t_{i,j} indicates the contrastive weight of the j-th negative text sample in the contrastive framework. Similarly, the contrastive loss from text to image can be written as:

L^t_SRCL = -(1/N) Σ_{i=1}^{N} log [ f(v_i, t_i) / ( f(v_i, t_i) + Σ_{j≠i} w^v_{i,j} f(v_j, t_i) ) ],

where w^v_{i,j} indicates the contrastive weight of the j-th negative image sample in the contrastive framework.
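The weighted image-to-text loss can be sketched as follows (a NumPy toy, with our own function name; f is taken to already be the positive-valued score):

```python
import numpy as np

def srcl_v(f, w):
    """Similarity-Regulated image-to-text contrastive loss.

    f[i, j]: positive-valued score f(v_i, t_j); diagonal = positives.
    w[i, j]: contrastive weight of the j-th negative for anchor i
             (diagonal entries are ignored: positives are not re-weighted).
    """
    N = f.shape[0]
    off = ~np.eye(N, dtype=bool)
    pos = f.diagonal()
    neg = (w * f * off).sum(axis=1)   # weighted negative mass
    return -np.log(pos / (pos + neg)).mean()

# With all weights equal to 1, SRCL reduces to the vanilla InfoNCE loss.
rng = np.random.default_rng(0)
f = np.exp(rng.normal(size=(4, 4)))
vanilla = -np.log(f.diagonal() / f.sum(axis=1)).mean()
print(np.isclose(srcl_v(f, np.ones((4, 4))), vanilla))  # True
```

Down-weighting a suspected false negative shrinks its term in the denominator, which is exactly how the regulator softens the repulsion.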

Implementation of Regulation Weights
In this subsection, we introduce how to calculate the regulation weights of the negative samples in contrastive learning. As the regulation weights are inversely proportional to the semantic similarity between anchors and negatives, we first need to calculate this semantic similarity. Given the capacity of a VLP model to align images and texts, the VLP model itself could be utilized to measure cross-modal semantic similarity. However, the VLP model is unreliable in the earlier training stages, since the semantic structure of its embedding space is still under optimization. Therefore, at the beginning, we use the high-quality human-annotated dataset (Chen et al., 2015) to train another model, denoted H_β, which shares the same structure as our VLP model S_γ. H_β is optimized with the InfoNCE loss and is used to estimate the semantic similarity of image-text pairs in the early pre-training stages.
During the pre-training of our VLP model S_γ, the parameters of H_β are frozen. The final semantic similarity between anchors and negatives is a weighted average of the similarities computed from S_γ and H_β. At the beginning of pre-training, the weight of S_γ in the final similarity is set to 0 and the weight of H_β to 1. As the number of training epochs grows, we progressively increase the weight of S_γ and decrease the weight of H_β.
Formally, given a mini-batch {(T_1, I_1), ..., (T_N, I_N)} containing N image-text pairs, for a text anchor T_i and a negative image sample I_j, the similarity ŝ_{i,j} calculated from H_β is:

ŝ_{i,j} = cos( t̂^i_cls, v̂^j_cls ),

where t̂^i_cls is the [CLS] representation of the text T_i extracted from the text encoder of H_β, and v̂^j_cls is the [CLS] representation of the image I_j extracted from the image encoder of H_β. Similarly, the similarity ṡ_{i,j} calculated from S_γ is:

ṡ_{i,j} = cos( ṫ^i_cls, v̇^j_cls ).

The final semantic similarity between T_i and I_j is then:

s_{i,j} = α ŝ_{i,j} + (1 − α) ṡ_{i,j},

where α is a hyper-parameter that continues to decrease as pre-training proceeds. The contrastive weight w^t_{i,j} is then derived from s_{i,j} with a scaling factor δ and a Norm function. Notably, w^t_{i,j} is inversely proportional to the similarity to meet Condition 1 described in Section 2.3, and the Norm function makes the mean value of all negatives' weights equal to 1 to meet Condition 2.
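Putting the pieces together, a hypothetical end-to-end weight computation could look as follows (cosine over [CLS] embeddings, α-blending, a decreasing exp(−δ·s) mapping, and mean-1 normalization; the exponential mapping is our assumption for the unspecified "inversely proportional" form):

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_weights(t_H, V_H, t_S, V_S, alpha, delta=1.0):
    """Weights of one text anchor's negative images.

    t_H, V_H: [CLS] embeddings from the frozen helper model H_beta.
    t_S, V_S: [CLS] embeddings from the VLP model S_gamma being trained.
    alpha:    blending weight of H_beta; starts at 1 and decays toward 0.
    """
    s_hat = np.array([cos(t_H, v) for v in V_H])   # similarity from H_beta
    s_dot = np.array([cos(t_S, v) for v in V_S])   # similarity from S_gamma
    s = alpha * s_hat + (1 - alpha) * s_dot
    w = np.exp(-delta * s)                         # decreasing in similarity
    return w / w.mean()                            # Norm: mean weight = 1

rng = np.random.default_rng(0)
t_H, t_S = rng.normal(size=8), rng.normal(size=8)
V_H, V_S = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))
w = contrastive_weights(t_H, V_H, t_S, V_S, alpha=1.0)
print(np.isclose(w.mean(), 1.0))  # True
```

With alpha=1.0 only the frozen helper contributes, matching the early-training regime described above.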
Similarly, given an image anchor and its text negative samples, we can also calculate the imageto-text contrastive weight.

Pre-training Datasets
We construct our pre-training data using two web datasets (Conceptual Captions (Sharma et al., 2018), SBU Captions (Ordonez et al., 2011)) and two in-domain datasets (MSCOCO (Chen et al., 2015) and Visual Genome (Krishna et al., 2017)). The total number of unique images is 4.0M, and the number of image-text pairs is 5.1M.

Main Result
We implement SRCL based on the ALBEF framework and evaluate it on four widely used downstream tasks: image-text retrieval, zero-shot image-text retrieval (ZSR), visual question answering (VQA), and natural language for visual reasoning (NLVR).

Image-Text Retrieval
We conduct experiments for both image-to-text retrieval (TR) and text-to-image retrieval (IR) on MSCOCO (Chen et al., 2015) and Flickr30K (Plummer et al., 2015). As shown in Table 2, incorporating SRCL into ALBEF brings evident improvement, achieving competitive performances compared with other VLP baselines.

Visual Question Answering
Most methods (Tan and Bansal, 2019; Wang et al., 2021a; Wang et al., 2021b) treat visual question answering as multi-label classification over pre-defined answer sets. This strategy achieves strong performance, but it is not suitable for real-world open scenarios. We instead treat VQA as an answer generation task and use constrained closed-vocabulary generation models, as in Wang et al. (2022). As shown in Table 3, SRCL outperforms ALBEF on VQA.

Natural Language for Visual Reasoning
The NLVR2 (Suhr et al., 2018) task requires the model to predict whether a sentence describes a pair of images, making it a binary classification task. We follow ALBEF and use two cross-attention layers to process the two input images; their outputs are merged and fed to the Feed-Forward Network (FFN). An MLP classifier is then applied to the output embedding of the text [CLS] token. Similarly, in Table 3, our SRCL outperforms ALBEF and other existing VLP methods.

Zero-shot Image-text Retrieval
To investigate the semantic structure of the learned representation space, we examine SRCL on the zero-shot image-text retrieval task on Flickr30K (Plummer et al., 2015). The results are shown in Table 4, where SRCL outperforms ALBEF, indicating that SRCL yields a representation space with better semantic structure. SRCL also achieves better performance than the previous state-of-the-art models (e.g., CLIP, ALIGN, and Florence) pre-trained with more image-text pairs.

Figure 3: Sub-figure (1) shows the zero-shot cross-modal retrieval (ZCR) results on Flickr30K obtained by explicitly removing negatives based on different contrastive weight thresholds. IR means image retrieval, and TR means text retrieval. The horizontal line represents the performance of SRCL. Sub-figure (2) shows the distribution of contrastive weights (CW) between anchors and negatives, averaged over 10000 mini-batches.

False Negatives v.s. Hard Negatives
An astute reader may notice that (partial) false negatives somewhat overlap with hard negatives. It is non-trivial to accurately define hard or false negatives in vision-language contrasting, since the cross-modal semantic boundary is blurry. But we do face a paradox here: we want to alleviate the contrastive effect of false negatives, which contain a certain number of hard ones, while many works on hard negative mining (HNM) (Xiong et al., 2020; Kalantidis et al., 2020) try to learn with more hard negative samples.
To investigate this problem, we experiment with different proportions of false negatives (or hard negatives, approximately). Specifically, we use the contrastive weights, negatively correlated with cross-modal similarity, to roughly approximate whether a negative sample is false. If the weight is lower than a threshold, the corresponding sample is regarded as false, and true otherwise. We explicitly remove the identified false negatives when contrasting, and then check the performance of the pre-trained ALBEF on zero-shot cross-modal retrieval.
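The removal experiment above can be sketched as a simple mask on the contrast (our own toy reconstruction of the analysis setup, not training code):

```python
import numpy as np

def contrast_without_false_negatives(f, w, threshold=0.2):
    """Vanilla InfoNCE over a batch, but negatives whose contrastive
    weight falls below `threshold` are treated as (partial) false
    negatives and dropped from the denominator."""
    N = f.shape[0]
    keep = (w >= threshold) | np.eye(N, dtype=bool)  # never drop positives
    f_kept = np.where(keep, f, 0.0)
    return -np.log(f_kept.diagonal() / f_kept.sum(axis=1)).mean()

rng = np.random.default_rng(0)
f = np.exp(rng.normal(size=(4, 4)))
w = rng.uniform(size=(4, 4))
# Dropping suspected false negatives can only shrink the denominator,
# so the measured loss never increases.
print(contrast_without_false_negatives(f, w) <=
      contrast_without_false_negatives(f, w, threshold=0.0))  # True
```

Raising the threshold discards more negatives, which mirrors the trade-off explored in Figure 3.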
As shown in Figure 3(a), there is a general trend that the performance of downstream tasks initially improves as the threshold increases and then begins to decline once a certain threshold (e.g., 0.2 or 0.3) is reached. We compute the distribution of contrastive weights by averaging 10000 mini-batches and visualize it in Figure 3(b). We can estimate that with a threshold of 0.7, about 20% of negative samples would be discarded. Combining Figures 3(a) and 3(b), we can explain the paradox: in vanilla contrastive learning, too many false negatives (or hard negatives) bring harmful impacts, so removing some of them delivers performance improvements; but they are indispensable for a promising contrasting effect, so overly removing them also hinders performance. This is also why hard negative mining methods add hard negatives when they are scarce.
From another perspective, the above explanation validates our method's merits. With the crossmodal similarity regulation, we drive the model to optimize the MI between negatives and their anchor more appropriately rather than simply minimizing it, systematically balancing false negatives' beneficial and harmful effects.

The Impact of Pretraining Data Size
To better understand the correlation between pre-training data size and downstream performance, we experiment with pre-training data of 4M, 6M, 8M, 10M, and 12M pairs. Figure 4 plots the zero-shot cross-modal retrieval and VQA results for SRCL and ALBEF. We observe that SRCL consistently maintains higher performance, and the gap becomes more evident as the data size increases. This observation verifies that SRCL promisingly addresses the harmful effect of false negatives and thus enhances data efficiency.

Qualitative Analysis
In this section, we conduct a qualitative analysis by visualizing the zero-shot text-to-image retrieval results of ALBEF and our method. We choose this zero-shot task to directly examine the model's representation capacity without the impact of fine-tuning. In Figure 5, we find that ALBEF tends to focus narrowly on one specific commonality while neglecting others. For example, in the second case, with the query "a woman with blond hair is sitting in a booth with a drink working on her laptop," ALBEF intensely targets "a woman with blond hair" but misses the critical information "working on her laptop." In contrast, our approach successfully captures all the essential aspects of the query. These retrievals suggest that our learned features more comprehensively capture potential similarities between a text caption and an image. Meanwhile, our method's result ranking reflects a trend from full alignment to partial alignment between the retrieved images and the query. These observations verify that our contrasting strategy produces better cross-modal representations for downstream tasks. Note that these two examples are not cherry-picked: the phenomenon is commonly observed among other samples. We demonstrate more cases in Figure 6, and other qualitative analyses can be found in Appendix F.

Figure 5: A visualization of zero-shot text-to-image retrieval results, comparing the baseline ALBEF and our model. The queries include "a security officer with a tiny face and big glasses leans on a metal gate looking into the camera" and "a woman with blond hair is sitting in a booth with a drink working on her laptop."

Contrastive Learning
Recently, self-supervised learning has made significant progress thanks to contrastive learning (Chen et al., 2020a; Oord et al., 2018; He et al., 2019; Chen et al., 2020b; Radford et al., 2021). InfoNCE (Oord et al., 2018) is commonly used in contrastive learning; it maximizes the similarity of positive pairs and minimizes the similarity of negative pairs. In the contrastive learning framework, the negative pairs play a vital role, as they prevent shortcuts and collapsed solutions. However, Chen et al. (2021) show the unfavorable effect of false negatives and propose to incrementally detect and explicitly remove false negative samples from the contrastive learning framework. Compared with Chen et al. (2021), we propose a more principled method that regulates false negative samples rather than directly omitting them.

Vision-Language pre-training
Recent years have seen significant success for large-scale pre-trained vision-language models (Tan and Bansal, 2019; Chen et al., 2020c; Huang et al., 2020; Wang et al., 2021b; Li et al., 2022a) in a variety of cross-modal tasks. Self-supervised Multi-modal Contrastive Learning (SMCL) has lately sparked significant advancements (Li et al., 2022b; Radford et al., 2021; Yao et al., 2021) by conducting cross-modal alignment. SMCL consists of image-to-text and text-to-image contrastive learning, e.g., with the InfoNCE (Oord et al., 2018) loss. However, the traditional cross-modal contrasting strategy can be problematic given the many-to-many correspondences between images and texts, and few works notice this issue. Recently, to address the noisy image-text pairing in VLP pre-training datasets, some pioneering works have fed the contrastive loss with soft labels in a self-distillation manner (Li et al., 2022b; Andonian et al., 2022). While these techniques may help reduce the number of false negatives, their harmful effect has not been carefully explored.

Conclusion
We have presented a cross-modal contrastive learning method that addresses the problem of (partial) false negatives with vision-language semantic similarity guidance. A series of mathematical proofs based on the InfoNCE loss provides a more general lower bound for contrastive optimization and inspires a novel contrasting strategy that theoretically guarantees the mitigation of false negatives. Empirically, our method demonstrates performance superiority on four downstream cross-modal tasks. Meanwhile, by comparing false negatives and hard negatives, we reveal that balancing the beneficial and harmful effects of (partial) false negatives is crucial for learning robust cross-modal representations.

Limitation
We verify our method mainly based on the recent robust VLP model ALBEF. Evaluating it more broadly by incorporating it into other VLP models could further highlight our contribution. Given the solid theoretical foundation of our method, the main conclusion regarding its effectiveness and performance will not be affected, but there may be more inspirational findings in a broader research context. Meanwhile, the comparison between false negatives and hard negatives is worth further exploration. We leave these problems for future work.

A.1 Proof A
We rewrite the proof provided by Oord et al. (2018) in the context of image-to-text contrastive learning, where v_i represents an image anchor, and t_i and t_j are the positive and negative samples, respectively.
Therefore, we have I(t_i, v_i) ≥ log(N) − L^v_InfoNCE, where N is the batch size.

A.2 Proof B
In the presence of non-negligible false negatives, we re-derive the derivation of A.1 as follows. Note that false negatives account for a relatively small proportion of the overall negatives, so the expectation E_{t_j}[ P(t_j|v_i) / P(t_j) ] is smaller than the positive's density ratio P(t_i|v_i) / P(t_i); this lets us safely derive the inequality from Equation 18 to Equation 19. Applying Jensen's inequality then yields the more general lower bound stated in Equation 3.

A.3 Proof C

In this section, we prove that assigning a positive weight w_{i,j} to each f(v_i, t_j) can push the MI expectation between an anchor and its negatives to a controllable positive value, under specific conditions. Using image-to-text contrasting as an example, the weighted loss can be written as in Equation 28. Following Oord et al. (2018), the function f(v_i, t_j) can be seen as a density ratio that preserves the mutual information between v_i and t_j, and can be written as P(t_j|v_i) / P(t_j); this allows us to rewrite Equation 28 accordingly (Equation 29). Here we set the regulation weight w_{i,j} inversely proportional to P(t_j|v_i) / P(t_j) (Condition 1 in Section 2.3), so the covariance between w_{i,j} and this density ratio is less than 0. Combining inequalities 21 and 32, and proceeding as in inequality 26, we observe that when optimizing the loss, the last term on the left side of the inequality is minimized. As w_{i,j} is inversely proportional to the semantic similarity between the anchor v_i and the negative sample t_j, the MI expectation between v_i and t_j will be optimized to a controllable positive value, negatively correlated with the average similarity between v_i and t_j.

B Comparison Methods
LXMERT (Tan and Bansal, 2019): is the first two-stream region-based VLP model, which consists of an object relationship encoder, a language encoder, and a cross-modality encoder.

E2E-VLP: proposes the first end-to-end VLP method for both V+L understanding and generation, with a unified Transformer encoder-decoder architecture.
ViLT: adopts linear projection and word embedding as the visual and textual encoders, and uses the visual transformer as the cross-modal encoder to align and fuse the features of both modalities in an end-to-end manner.
ALIGN (Jia et al., 2021): leverages a noisy dataset of over one billion image alt-text pairs, obtained without the expensive filtering or post-processing steps used for the Conceptual Captions dataset.
OSCAR: proposes to use object tags detected in images as anchor points to ease the learning of cross-modal alignments.
VinVL (Zhang et al., 2021): pre-trains a large-scale object-attribute detection model with much larger amounts of supervised data to extract better region-based visual features.

ALBEF: adopts a contrastive loss to align the image and text representations, then fuses them through cross-modal attention in an end-to-end manner.

UNITER (Chen et al., 2020c): proposes a new word-region alignment pre-training task via the use of optimal transport to help fine-grained alignment between words and image regions.
ViLBERT (Lu et al., 2019): proposes one of the first works that extends the BERT architecture to a multi-modal two-stream region-based VLP model.

C Pre-training Objectives
We pre-train our model with three standard objectives: Image-Text Contrastive learning (ITC), Image-Text Matching (ITM), and Masked Language Modeling (MLM). Since we have introduced ITC in the previous subsections, in the following we only introduce the other two pre-training tasks.

Image-Text Matching (ITM). The goal of image-text matching is to predict whether the input image and text are matched. We follow the design of ALBEF and select hard negative image-text pairs based on the contrastive text-image similarity. We take the text [CLS] embedding of the multi-modal encoder's output as the joint representation, followed by a Multi-Layer Perceptron (MLP) layer for prediction.

Masked Language Modeling (MLM). The task setup is basically the same as in BERT (Devlin et al., 2018): we randomly mask 15% of the tokens in the text, and the model is asked to predict these masked words with the cross-modal representations.
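The MLM masking step can be sketched as follows (a simplified toy: BERT's additional 80/10/10 keep-or-replace rules are omitted, and the function name is ours):

```python
import random

def mask_for_mlm(tokens, ratio=0.15, mask_token="[MASK]", seed=0):
    """Mask `ratio` of the tokens; return the corrupted sequence and a
    position -> original-token map used as the prediction targets."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * ratio))
    positions = rng.sample(range(len(tokens)), n)
    corrupted, targets = list(tokens), {}
    for p in positions:
        targets[p] = corrupted[p]
        corrupted[p] = mask_token
    return corrupted, targets

tokens = "a bird is sitting in the tree near the lake".split()
corrupted, targets = mask_for_mlm(tokens)
print(corrupted.count("[MASK]") == len(targets))  # True
```

During pre-training, the model would predict each target token from the cross-modal representation at the corresponding masked position.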

D Implementation Details
We implement our method based on the ALBEF framework and pre-train SRCL for 30 epochs with a total batch size of 512 on 8 NVIDIA V100 GPUs. We initialize the visual encoder with CLIP (ViT-B/16) (Radford et al., 2021), pre-trained on 400M noisy image-text pairs, and use the AdamW (Loshchilov and Hutter, 2017) optimizer with a weight decay of 1e-2. The learning rate is warmed up to 1e-5 (ViT-B/16) and 1e-4 (BERT_base) in the first 1000 iterations. During pre-training, we take images with a resolution of 256 × 256 as input, and increase the image resolution during fine-tuning. We use a 6-layer Transformer for both the text encoder and the cross-modal fusion network. As in ALBEF, the text encoder is initialized with the first 6 layers of the BERT_base (Devlin et al., 2018) model, and the cross-modal network is initialized with the last 6 layers of BERT_base.
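The warm-up above can be sketched as a step-dependent learning-rate function (a minimal sketch; the actual schedule also decays after warm-up, which we omit):

```python
def warmup_lr(step, peak_lr=1e-5, warmup_steps=1000):
    """Linearly warm the learning rate up to `peak_lr` over the first
    `warmup_steps` iterations, then hold it (post-warm-up decay omitted)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

print(warmup_lr(0))    # tiny initial rate
print(warmup_lr(999))  # reaches the peak of 1e-5
```

The same shape with peak_lr=1e-4 would cover the BERT_base parameter group.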

E Downstream Task Details
We evaluate SRCL on the downstream vision-language tasks. The hyperparameters that we use for fine-tuning on the downstream tasks are listed in Table 5. Following ALBEF, all tasks adopt RandAugment, the AdamW optimizer with a weight decay of 0.05, and a cosine learning rate schedule. Next, we introduce the dataset settings in detail.
NLVR2. The NLVR2 (Suhr et al., 2018) task requires the model to predict whether a sentence describes a pair of images. We conduct experiments following the original train/val/test split of Suhr et al. (2018).

F Visualization of Contrastive Weight In SRCL
In Figure 7, we plot the distribution of text-to-image contrastive weights in mini-batches drawn from the Flickr30K test set. As shown in Figure 7, our method effectively assigns low contrastive weights to false negative samples. For example, in the sixth row and fifth column of the first case, for the text anchor "this is a cute cat.", the false negative sample is the sixth image, which also contains a cat, and its contrastive weight is 0.12. Besides, we observe that most negatives have a high contrastive weight, as the semantic similarity between them and their anchors is low. To further investigate the effectiveness of contrastive weights in regulating (partial) false negative samples in contrastive learning, we visualize the false negative samples together with their contrastive weights. As shown in Figure 8, the false negative samples are all assigned low contrastive weights (no more than 0.2). This also supports the experimental results in subsection 4.3 showing that masking negatives whose contrastive weight is less than 0.2 yields a remarkable improvement.