Conditioned Masked Language and Image Modeling for Image-Text Dense Retrieval



Introduction
Image-text retrieval is an important task in the cross-modal community. Recent years have witnessed the remarkable success of large-scale language-image pre-trained models in this area. Existing works can be divided into single-stream and two-stream models. The former, as illustrated in Figure 1a, relies on heavy transformer layers (Vaswani et al., 2017) to fuse cross-modal information (e.g., UNITER (Chen et al., 2020b) and OSCAR (Li et al., 2020b)). The intolerable drawback of these models is their slow inference speed: all possible query-candidate pairs need to be fed into the model to obtain the retrieval result for a query. For example, the average inference time of UNITER for a text query on MSCOCO (Lin et al., 2014) is more than 30 seconds. Therefore, these models are hard to apply in real-life industrial applications.
To overcome this limitation, recent works turn to two-stream models as shown in Figure 1b (e.g., CLIP (Radford et al., 2021)). These models embed images and texts into instance representations ([CLS]) with two separate encoders, align them at the instance level with contrastive learning (InfoNCE (van den Oord et al., 2018)), and compute retrieval scores with a simple dot product. By decoupling the encoding of images and texts, these models achieve much faster inference. Apart from instance-level alignment, fine-grained token-level tasks (Masked Language Modeling, MLM (Devlin et al., 2019), and Masked Image Modeling, MIM (Xie et al., 2022)) are adopted to further boost performance (Sun et al., 2021; Lu et al., 2022). Nevertheless, the vanilla MLM and MIM illustrated in Figure 1c are sub-optimal: they are designed to aggregate token-level alignment information into the token representations, not the instance representations. For example, the [CLS] representation of the well-known masked language pre-trained model RoBERTa (Liu et al., 2019) performs poorly on semantic textual similarity tasks without fine-tuning (Reimers and Gurevych, 2019). Therefore, it is necessary to come up with token-level pre-training tasks that are better suited to image-text dense retrieval.
In this work, we carefully design two novel conditioned token-level objectives, Conditioned Masked Language and Image Modeling (ConMLM and ConMIM), which reconstruct masked tokens and patches conditioned on the instance representation of the other modality. Beyond this, instance-level interaction is also necessary for cross-modal retrieval. We adopt the momentum contrastive learning objective (He et al., 2020) to align the instance representations of images and texts, decoupling the queue size from the mini-batch size. Combining the instance- and token-level interactions, we propose our Conditioned Language-Image Pre-training (ConLIP) framework for image-text dense retrieval. The experimental results on the popular cross-modal retrieval benchmarks (MSCOCO and Flickr30k (Plummer et al., 2015)) reveal that our token-level objectives (ConMLM and ConMIM) are more effective than the vanilla MLM and MIM. In addition, an analysis of the cross-attention scores in the token-level pre-training heads corroborates our claim that the instance representations play more active roles in ConMLM and ConMIM.
The contributions of our work can be summarized as follows:
• We design two novel token-level pre-training objectives, ConMLM and ConMIM, for image-text dense retrieval.
• Combined with instance-level contrastive learning, we introduce an effective conditioned language-image pre-training framework, ConLIP.
• Evaluated on image-text retrieval tasks, our ConMLM and ConMIM prove more effective than the vanilla MLM and MIM.

Related Work
Large-scale Pre-trained Models. Since 2019, the large-scale pre-training paradigm has become popular in natural language processing (NLP), computer vision (CV), and the cross-modal area. In NLP, BERT-like pre-trained language models (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020; Raffel et al., 2020; He et al., 2021) show remarkable language understanding ability. In CV, pre-trained vision transformers (Dosovitskiy et al., 2021; Xie et al., 2022; Bao et al., 2022) show superior image recognition ability. In the cross-modal area, pre-trained transformer-based models (Tan and Bansal, 2019; Wang et al., 2022; Dou et al., 2022) also succeed in many cross-modal tasks, such as visual question answering (Antol et al., 2015) and visual entailment (Xie et al., 2019). In our work, we align a pre-trained vision transformer (ViT) and a language transformer (BERT) with our token- and instance-level cross-modal pre-training for image-text dense retrieval.
Image-Text Retrieval. The goal of this task is to retrieve the relevant image/text given a query from the other modality (Chen et al., 2020a). Early works mainly focus on two-stream models (Kiros et al., 2014; Faghri et al., 2018; Wang et al., 2019; Bastan et al., 2020), embedding the image and text with two separate encoders. Later, single-stream models with heavy cross-/merge-attention layers encode images and texts within the same model and achieve much better performance (Chen et al., 2020b; Li et al., 2020b; Gan et al., 2020; Kim et al., 2021). As mentioned in the introduction, these models' inference speed is too slow compared with two-stream models. Recent works turn back to the two-stream style (Jia et al., 2021; Sun et al., 2021; Radford et al., 2021; Wen et al., 2021; Lu et al., 2022). Pre-trained with large-scale paired image-text data, these models approach the performance of their single-stream counterparts.
Apart from instance-level alignment, recent works also introduce Masked Language and Image Modeling as token-level tasks to boost performance (Sun et al., 2021; Lu et al., 2022). However, as suggested by single-modal dense text retrieval works (Gao and Callan, 2021, 2022; Chuang et al., 2022), the vanilla MLM and MIM are sub-optimal for dense retrieval. Our work introduces two novel token-level objectives conditioned on the instance representations (ConMLM and ConMIM) to fill this gap.

Overview
Cross-modal dense retrieval aims to learn two separate encoders that embed images and texts into instance representations. If an image and a text share the same semantic meaning, their representations should have a high similarity score (cosine similarity). Following previous works (Radford et al., 2021; Jia et al., 2021), images and texts are encoded by a Vision Transformer (Dosovitskiy et al., 2021) and a Language Transformer (Devlin et al., 2019). Formally, given a text $x = [x_1, x_2, \dots]$ and all patches of an image $y = [y_1, y_2, \dots]$, we can write:
$$[h^{txt}_{cls}, h^{txt}_1, h^{txt}_2, \dots] = \mathrm{TRF}_{txt}([\mathrm{CLS}], x_1, x_2, \dots), \quad [h^{img}_{cls}, h^{img}_1, h^{img}_2, \dots] = \mathrm{TRF}_{img}([\mathrm{CLS}], y_1, y_2, \dots),$$
where TRF is a transformer model. The special token [CLS] is concatenated and encoded with the rest of the text tokens or image patches, and its hidden state in the final layer serves as the instance representation. The similarity score is calculated as:
$$\mathrm{sim}(x, y) = \cos(h^{txt}_{cls}, h^{img}_{cls}).$$
If we add random masks to the input, we can write:
$$[h^{txt*}_{cls}, h^{txt*}_1, \dots] = \mathrm{TRF}_{txt}([\mathrm{CLS}], x^*_1, x^*_2, \dots), \quad [h^{img*}_{cls}, h^{img*}_1, \dots] = \mathrm{TRF}_{img}([\mathrm{CLS}], y^*_1, y^*_2, \dots),$$
where $(\cdot)^*$ indicates that some tokens/patches are masked in the input.
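As a concrete reference, the two-stream scoring scheme above can be sketched as follows (a minimal sketch; the encoder interface, which prepends [CLS] and returns all hidden states, is our assumption rather than the paper's implementation):

```python
import torch.nn.functional as F

def encode(transformer, tokens):
    """Run one encoder; assumes it prepends [CLS] and returns all hidden states."""
    hidden = transformer(tokens)            # (batch, 1 + seq_len, dim)
    return hidden[:, 0], hidden[:, 1:]      # instance rep, token/patch reps

def retrieval_score(text_encoder, image_encoder, text_tokens, image_patches):
    h_txt_cls, _ = encode(text_encoder, text_tokens)
    h_img_cls, _ = encode(image_encoder, image_patches)
    # Cosine similarity between the two instance representations.
    return F.cosine_similarity(h_txt_cls, h_img_cls, dim=-1)
```

Because the two encoders never see each other's inputs, candidate embeddings can be precomputed once and scored with a single similarity per query, which is what enables the fast inference discussed earlier.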
During pre-training, large-scale language-image paired data are required to align the two encoders with instance- and token-level interactions. For the instance-level interaction, most previous works align the instance representations with contrastive learning, $\mathcal{L}_{inst}$, maximizing the similarity scores $\mathrm{sim}(x, y)$ between paired samples.
For the token-level interaction, the models are followed by two pre-training heads (one-layer transformers with cross-attention). In the vanilla MLM, the masked text tokens are reconstructed from the masked text hidden states, which cross-attend to the image patch representations; the vanilla MIM reconstructs the masked image patches analogously by cross-attending to the text token representations. In our work, we argue that the vanilla MLM and MIM are sub-optimal. We design two novel conditioned masked language and image modeling objectives $\mathcal{L}^{Con}_{token}$ to aggregate the token-level alignment information into the instance representations. The overall pre-training objective is as follows:
$$\mathcal{L} = \mathcal{L}_{inst} + \mathcal{L}^{Con}_{token}.$$

Conditioned Token-level Interaction
The vanilla MLM and MIM objectives in cross-modal pre-training are sub-optimal for the image-text retrieval task. The model can easily reconstruct a masked token through token-level alignment alone, which diverges from our goal of learning good instance-level representations. For example, in Figure 2, the model can ignore the instance representations and reconstruct the masked token "man" directly from the image patches of the man.
To fill this gap, we carefully design two novel conditioned token-level objectives, Conditioned Masked Language Modeling (ConMLM) and Conditioned Masked Image Modeling (ConMIM), as illustrated in Figure 3. The motivation for our design is to increase the importance of the instance representations during token-level pre-training, thereby enhancing retrieval performance.
ConMLM. As illustrated in Figure 3a, our method requires two passes. In the first pass, we feed the complete image into the image encoder to extract the image instance representation $h^{img}_{cls}$ as in Equation 1. In the second pass, we feed a randomly masked image to extract the masked image patch representations and a randomly masked text to extract the masked text token representations. Since a part of the image patches and text tokens are masked, only the image instance representation $h^{img}_{cls}$ contains the complete semantic information. The image and text encoders therefore need to aggregate the token-alignment information into the instance representations to reconstruct the masked text tokens. For example, in Figure 3a, both the image patches and the text tokens of "man" and "surfboard" are masked. In order to reconstruct them, the text encoder is forced to decode the corresponding information from the image instance representation $h^{img}_{cls}$. The objective function of ConMLM is similar to the vanilla MLM, but conditioned on the image instance representation:
$$\mathcal{L}_{ConMLM} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid x^*, y^*, h^{img}_{cls}),$$
where $x^*_i$ is a masked token in the randomly masked text $x^*$ and $\mathcal{M}$ is the set of masked positions.

ConMIM. As illustrated in Figure 3b, this objective shares a similar idea with ConMLM. Only the text instance representation $h^{txt}_{cls}$ contains the complete semantic information, and the model is asked to reconstruct the masked image patches conditioned on $h^{txt}_{cls}$. Following the state-of-the-art self-supervised pre-trained Vision Transformer, SimMIM (Xie et al., 2022), we directly require our model to reconstruct the image patches under the $\ell_1$ norm:
$$\mathcal{L}_{ConMIM} = \frac{1}{|\mathcal{M}|}\sum_{i \in \mathcal{M}} \lVert y_i - z_i \rVert_1,$$
where $y_i, z_i \in \mathbb{R}^{3HW \times 1}$ are the true RGB values and the values predicted conditioned on the text information ($x^*$ and $h^{txt}_{cls}$), respectively. The final conditioned token-level objective is as follows:
$$\mathcal{L}^{Con}_{token} = \mathcal{L}_{ConMLM} + \mathcal{L}_{ConMIM}.$$
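A rough PyTorch-style sketch of the two-pass ConMLM computation described above; the module interfaces (encoders returning an instance representation plus token states, an `mlm_head` taking queries and cross-attention memory) are illustrative assumptions rather than the authors' code:

```python
import torch
import torch.nn.functional as F

def conmlm_loss(text_encoder, image_encoder, mlm_head,
                masked_text, mlm_labels, full_image, masked_image):
    # Pass 1: the complete image yields the image instance representation
    # h_img_cls, the only input that keeps the full semantic information.
    h_img_cls, _ = image_encoder(full_image)            # (B, D)

    # Pass 2: encode the randomly masked image and the randomly masked text.
    _, img_patch_states = image_encoder(masked_image)   # (B, P, D)
    _, txt_token_states = text_encoder(masked_text)     # (B, T, D)

    # The one-layer cross-attention head reads from the masked patches plus
    # h_img_cls, so recovering a word whose matching patches are also masked
    # forces the model to decode it from the instance representation.
    memory = torch.cat([h_img_cls.unsqueeze(1), img_patch_states], dim=1)
    logits = mlm_head(txt_token_states, memory)          # (B, T, vocab_size)

    # Cross-entropy only over masked positions (labels set to -100 elsewhere).
    return F.cross_entropy(logits.transpose(1, 2), mlm_labels, ignore_index=-100)
```

ConMIM would follow the same two-pass pattern with the roles of the modalities swapped and an $\ell_1$ reconstruction loss on the masked patches.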

Instance-level Interaction
Apart from our conditioned token-level interaction, instance-level interaction is also necessary for image-text retrieval pre-training. To align the cross-modal information, we adopt momentum contrastive learning, caching the negative samples in an image queue $Q_{img}$ and a text queue $Q_{txt}$ so that the queue size is decoupled from the mini-batch size. To maintain the queues dynamically, we keep two momentum encoders for images and texts that share the same structure as the original ones. Following MoCo (He et al., 2020), the momentum encoders are updated by:
$$\theta_m \leftarrow m\,\theta_m + (1 - m)\,\theta_o,$$
where $\theta_m$ denotes the parameters of the momentum encoders, $\theta_o$ denotes the parameters of the original encoders, and $m$ is the momentum hyperparameter.
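A brief MoCo-style sketch of the momentum update and queue maintenance (the momentum value 0.995 and the exact queue handling are assumptions following He et al. (2020), not hyperparameters reported here):

```python
import torch

@torch.no_grad()
def momentum_update(encoder, momentum_encoder, m=0.995):
    # theta_m <- m * theta_m + (1 - m) * theta_o
    for p_o, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.data.mul_(m).add_(p_o.data, alpha=1.0 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, new_keys):
    # Drop the oldest entries and append the latest momentum-encoded keys,
    # keeping the queue size independent of the mini-batch size.
    return torch.cat([queue[new_keys.size(0):], new_keys.detach()], dim=0)
```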
Traditionally, the momentum contrastive learning loss from text to image representations is calculated as:
$$\mathcal{L}_{t2i} = -\log \frac{\exp(\cos(h^{txt}_{cls}, h^{img}_{cls})/\tau)}{\exp(\cos(h^{txt}_{cls}, h^{img}_{cls})/\tau) + \sum_{j} \exp(\cos(h^{txt}_{cls}, q_j)/\tau)},$$
where $\tau$ is the temperature hyperparameter and $q_j \in Q_{img}$.
In our work, we follow decoupled contrastive learning (DCL (Yeh et al., 2021)) to tackle the negative-positive coupling (NPC) effect by removing the positive sample from the denominator:
$$\mathcal{L}^{DCL}_{t2i} = -\log \frac{\exp(\cos(h^{txt}_{cls}, h^{img}_{cls})/\tau)}{\sum_{j} \exp(\cos(h^{txt}_{cls}, q_j)/\tau)}.$$
Our ablation studies reveal that DCL leads to better performance.
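To make the difference concrete, here is a hedged sketch of the text-to-image loss with and without the decoupling (temperature value and tensor layout are illustrative; the queue holds momentum-encoded image representations):

```python
import torch
import torch.nn.functional as F

def t2i_contrastive_loss(h_txt_cls, h_img_cls, image_queue, tau=0.07, decoupled=True):
    # Cosine similarities for the positive pair and for queue negatives.
    pos = F.cosine_similarity(h_txt_cls, h_img_cls, dim=-1) / tau               # (B,)
    neg = (F.normalize(h_txt_cls, dim=-1)
           @ F.normalize(image_queue, dim=-1).t()) / tau                        # (B, K)

    if decoupled:
        # DCL: the positive term is removed from the denominator.
        denom = torch.logsumexp(neg, dim=-1)
    else:
        # Standard InfoNCE: the denominator also contains the positive pair.
        denom = torch.logsumexp(torch.cat([pos.unsqueeze(-1), neg], dim=-1), dim=-1)
    return (denom - pos).mean()
```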
Similarly, the DCL loss from image to text representations is:
$$\mathcal{L}^{DCL}_{i2t} = -\log \frac{\exp(\cos(h^{img}_{cls}, h^{txt}_{cls})/\tau)}{\sum_{j} \exp(\cos(h^{img}_{cls}, p_j)/\tau)},$$
where $p_j \in Q_{txt}$. The total loss of the instance-level interaction is defined as:
$$\mathcal{L}_{inst} = \mathcal{L}^{DCL}_{t2i} + \mathcal{L}^{DCL}_{i2t}.$$

Experimental Settings

Pre-training Datasets. We use the web-crawled Conceptual Captions-3M (Sharma et al., 2018) and -12M (Changpinyo et al., 2021) datasets. We combine them and randomly sample 200k, 5.3M, and 9.5M image-text pairs for our experiments. Notably, these datasets are harvested from the web and contain much noise. Since obtaining large amounts of high-quality annotated image-text data is hard in real-life applications, this noisy setting brings our pre-training closer to reality.
Implementation Details. We pre-train and fine-tune our models with the AdamW optimizer (Loshchilov and Hutter, 2019), a batch size of 4096, mixed-precision (FP16) training, and 30 epochs. We adopt a linear learning rate schedule with warm-up, where the warm-up stage covers 10% of the total training steps. Each image is resized to 224x224 with center-crop. The masking ratios for images and texts are 50% and 15%, respectively. The other hyper-parameter values and implementation details are listed in Appendix A.
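For reference, the schedule can be written as a simple multiplier on the peak learning rate (a sketch under the assumption of linear decay after warm-up; the text only states a linear schedule with 10% warm-up):

```python
def lr_scale(step, total_steps, warmup_ratio=0.1):
    """Linear warm-up for the first 10% of steps, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```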

Image-Text Retrieval Results
Table 1 compares the performance of our ConLIP with the baseline systems on two popular image-text retrieval benchmarks, MSCOCO and Flickr30k. The experimental results indicate that our ConMLM and ConMIM are more effective than the vanilla MLM and MIM.
ConLIP and Vanilla Baseline. Table 1 shows that our baselines are strong models with performance comparable to the cutting-edge two-stream models. Replacing the vanilla MLM and MIM with our novel ConMLM and ConMIM, ConLIP achieves better image-text retrieval performance on both MSCOCO and Flickr30k. In particular, the zero-shot R@1 performance of ConLIP is around 2 points higher than the vanilla baseline on the Flickr30k test set. These results reveal that ConMLM and ConMIM are more suitable token-level pre-training tasks for image-text dense retrieval.
ConLIP and Cutting-edge Models. Apart from our vanilla baselines, we also compare ConLIP with the cutting-edge single- and two-stream models. Pre-trained with the same amount of image-text pairs, ConLIP achieves performance comparable to the cutting-edge single-stream models. In addition, Figure 4 shows that the inference time of ConLIP is far lower than that of the single-stream model: ConLIP has O(n) inference time complexity, while the single-stream models have O(n^2). Compared with the cutting-edge two-stream models, ConLIP also performs better. Notably, our models are pre-trained only with noisy image-text data from the web, while most of the cutting-edge models also include human-annotated image-text data in their pre-training. These results indicate that ConLIP is an effective framework for image-text dense retrieval.
Qualitative Examples. Figure 5 shows a retrieval example of ConLIP for the text query "a man wears an orange hat and glasses" on the Flickr30k test set. ConLIP successfully retrieves the ground-truth image as the top-1 result. Although the second image shares the same keywords ("man", "orange", "hat" and "glasses") as the ground-truth image, our model still detects the mismatch and assigns it a lower score. For the remaining three images, some keywords are mismatched, so our model assigns them much lower scores.

Cross Attention Analysis
In the introduction and Figure 2, we claim that the instance representations are ignored by the vanilla MLM and MIM in cross-modal pre-training. Our retrieval results in Table 1 also reveal that ConMLM and ConMIM lead to better instance representations. In this section, we conduct an in-depth analysis of the cross-attention patterns in the two token-level pre-training heads to understand the influence of our designs.
We first consider the image instance representation $h^{img}_{cls}$. We use the mean of the cross-attention scores from the text tokens to $h^{img}_{cls}$ as a measure of its importance during token-level interaction (ConMLM or vanilla MLM). Table 2 shows that the mean score is close to zero in the vanilla MLM, revealing that $h^{img}_{cls}$ is almost ignored by the text tokens. For our ConMLM, the score is around three times higher, revealing that $h^{img}_{cls}$ plays a more important role in our token-level interaction. In addition, we analyze the standard deviation of the cross-attention scores; a higher standard deviation indicates a wider range of scores. In Table 2, the scores of our ConLIP have a higher standard deviation, revealing that the cross-attention scores from the text tokens to $h^{img}_{cls}$ spread over a wider range.
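The statistics reported in Table 2 can be computed directly from the head's attention maps; a small sketch follows (the attention tensor layout and the position of $h^{img}_{cls}$ among the keys are our assumptions):

```python
def cls_attention_stats(cross_attn, text_mask):
    # cross_attn: (batch, heads, num_text_tokens, num_keys) attention probabilities
    # from the MLM head; key index 0 is assumed to hold h_img_cls.
    scores = cross_attn[..., 0].mean(dim=1)      # average over heads -> (B, T)
    scores = scores[text_mask.bool()]            # keep real (non-padding) text tokens
    return scores.mean().item(), scores.std().item()
```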
Beyond this, we consider the text instance representation $h^{txt}_{cls}$. The cross-attention scores from the image patch tokens to it share a similar pattern with $h^{img}_{cls}$. This analysis corroborates that our ConMLM and ConMIM are more suitable than the vanilla MLM and MIM for image-text dense retrieval.

Ablation Studies
In this section, we conduct ablation studies to compare different settings for our models. Since large-scale pre-training is time-consuming, we pre-train the models on our CC200k dataset and evaluate their zero-shot retrieval performance on MSCOCO. The experiments cover three perspectives: (1) different pre-training objectives; (2) different numbers of token-level pre-training layers; (3) different masking ratios for images and texts.
Pre-training Objectives. Table 3 compares different pre-training objectives for our models. For the instance-level pre-training, decoupled contrastive learning (DCL) is a more effective loss than the traditional InfoNCE. For the token-level pre-training, both ConMLM and ConMIM lead to better image-text retrieval performance.
Token-level Pre-training Head Designs. In ConLIP, we adopt one-layer transformers as our token-level pre-training heads. We investigate how different numbers of layers affect the models' performance. Table 4 compares four different designs. First, increasing the number of layers boosts the image-to-text retrieval performance but degrades the text-to-image scores. In addition, sharing the parameters of the two token-level layers does not lead to better performance.
Masking Ratios. In ConMLM and ConMIM, the masking ratio is an important hyperparameter during pre-training. He et al. (2022) indicate that a higher masking ratio can lead to a better self-supervised pre-trained Vision Transformer, while Wettig et al. (2022) argue that 15% is not the ideal masking ratio for BERT. We compare several different masking ratios for our models. Table 5 shows that different masking ratios do not have much effect on the retrieval performance. Therefore, we keep the default setting (50% for images and 15% for texts).

Conclusion
In this work, we design two novel token-level pre-training tasks, ConMLM and ConMIM, for image-text dense retrieval. Combined with the instance-level objective, we propose our language-image pre-training framework, ConLIP. The experimental results and the cross-attention analysis reveal the effectiveness of our methods.

Limitations
The major limitation of our work is scalability. In our settings, we pre-train our models with 5.3M or 9.5M image-text pairs, much smaller than the 400M pairs of CLIP (Radford et al., 2021). Therefore, it is unclear how the model performance would change if we scaled up the pre-training datasets. However, such experiments require enormous GPU resources (256-592 V100 GPUs for CLIP), which are unaffordable for us.

Figure 1: Comparing different categories of Language-Image Pre-training frameworks for Cross-Modal Retrieval. (a) Single-stream models; (b) Two-stream models; (c) Two-stream models with vanilla token-level interaction.

Figure 2: From Text to Image Conditioned Token-level Interaction (ConMIM).

Figure 3: An illustration of our conditioned token-level interaction.

Figure 4: Comparing the inference time between single-stream and two-stream models on the MSCOCO test set.

Figure 5: Retrieval examples of our ConLIP for the text query "a man wears an orange hat and glasses" in the Flickr30k test set.

Table 1: The image-text retrieval results on the MSCOCO and Flickr30k test sets. Bold indicates the best results of the two-stream models. #I-T corresponds to the number of image-text pairs during cross-modal pre-training. TC is the time complexity during inference. (CL: only pre-trained with contrastive learning. *: replacing our ConMLM and ConMIM with the vanilla MLM and MIM. zs: zero-shot performance.)

Table 2: The mean and standard deviation of the cross-attention scores from the text tokens to $h^{img}_{cls}$ and from the image patch tokens to $h^{txt}_{cls}$ in the token-level pre-training heads. We average the scores over the samples in the Flickr30k validation set. (*: replacing our ConMLM and ConMIM with the vanilla MLM and MIM.)

Table 3: Ablation study for different pre-training objectives. All models are pre-trained on our CC200k dataset. The scores are the zero-shot retrieval results.

Table 4: Ablation study for different numbers of token-level interaction layers. * indicates that the parameters of the two layers are shared. All models are pre-trained on our CC200k dataset. The scores are the zero-shot retrieval results.

Table 5: Ablation study for different masking ratios for images and texts. All models are pre-trained on our CC200k dataset. The scores are the zero-shot retrieval results.