Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding

Phrase grounding aims to map textual phrases to their associated image regions, which can be a prerequisite for multimodal reasoning and can benefit tasks that require identifying objects based on language. With pre-trained vision-and-language models achieving impressive performance across tasks, it remains unclear whether we can directly utilize their learned embeddings for phrase grounding without fine-tuning. To this end, we propose a method to extract matched phrase-region pairs from pre-trained vision-and-language embeddings, as well as four fine-tuning objectives that improve the model's phrase grounding ability using image-caption data without any supervised grounding signals. Experiments on two representative datasets demonstrate the effectiveness of our objectives, outperforming baseline models in both weakly-supervised and supervised phrase grounding settings. In addition, we evaluate the aligned embeddings on several other downstream tasks and show that we can achieve better phrase grounding without sacrificing representation generality.

However, few existing papers have paid attention to the phrase grounding ability of their pre-trained embeddings, namely the ability to map natural language queries to their corresponding image regions, which can 1) benefit tasks requiring identifying objects based on language (Deng et al., 2018); 2) be a prerequisite for advanced multimodal reasoning (Plummer et al., 2015). Among the prior work, Li et al. (2020a) demonstrate certain grounding abilities of VisualBERT, yet their analysis is limited to attention heads and it is unclear how VisualBERT compares with state-of-the-art grounding models. Cao et al. (2020) provide insights on cross-modal interaction, but their analysis is primarily limited to the coreference relations between phrases and visual tokens.
In this paper, we study the phrase grounding ability of vision-and-language embeddings pre-trained on image-caption datasets. First, we propose a method to extract phrase-region pairs from the pre-trained embeddings without any fine-tuning. We find that while our method uncovers certain grounding abilities of the pre-trained embeddings, there is still much room for improvement. Therefore, we propose to fine-tune models with objectives designed to better align word and region representations on image-caption datasets. The fine-tuning objectives are designed to maximize the symmetry between vision and language for better phrase grounding while maintaining transferability, so that the learned representations remain useful for other downstream tasks. Specifically, we fine-tune models with 1) a masked language modeling objective conditioned on images; 2) an adapted masked region modeling objective conditioned on text that utilizes a dynamically constructed vision vocabulary; 3) a modified object label prediction objective that explicitly bridges the gap between vision and language; 4) a bidirectional attention optimization objective encouraging consistency between vision-to-language and language-to-vision alignments.
We fine-tune pre-trained models on COCO (Chen et al., 2015) and test them on two representative phrase grounding datasets, RefCOCO+ (Kazemzadeh et al., 2014) and Flickr30k Entities (Plummer et al., 2015). We find that our fine-tuning objectives improve the model grounding ability significantly, outperforming baselines in both weakly-supervised and supervised phrase grounding settings. We also evaluate the aligned representations on several downstream tasks and show that our model can achieve better phrase grounding without sacrificing performance on other types of tasks.

Extracting Phrase-Region Pairs from Pre-Trained Embeddings
Formally, the phrase grounding task can be defined as: given an image $v$ consisting of multiple regions $v_1, \cdots, v_n$ and its corresponding caption $l$ segmented into tokens $l_1, \cdots, l_m$, for each noun phrase $p_i = l_{i_x}, \cdots, l_{i_y}$, a model needs to find its associated region $v_j$. We first propose a way to directly extract the matched phrase-region pairs from pre-trained embeddings. Then, we evaluate this method on phrase grounding tasks with several popular pre-trained models, including LXMERT (Tan and Bansal, 2019), UNITER (Chen et al., 2020), ViLBERT, VisualBERT, and VL-BERT (Su et al., 2019).

Extraction Method
We propose to directly extract phrase-region pairs from pre-trained models based on representation similarities. Specifically, given an image $v$ and its caption $l$, we feed them to a pre-trained vision-and-language model and obtain their representations $h(v)$ and $h(l)$. Note that we take the representations from the $k$-th model layer, where $k$ is a hyperparameter selected on the validation set.
Then, given a noun phrase $p_i$, we average its token representations to obtain the phrase representation $h(p_i) = \mathrm{MEAN}(h(l_{i_x}), \cdots, h(l_{i_y}))$. Afterwards, we score each candidate region $v_j$ by computing the dot product between $h(p_i)$ and $h(v_j)$. The regions with the highest scores are selected, and we measure the accuracy of the selected pairs.
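The snippet below sketches this extraction procedure in PyTorch, assuming the layer-$k$ token and region hidden states have already been computed; the function and variable names are illustrative assumptions rather than names from the paper's code.

```python
import torch

def extract_phrase_regions(h_tokens, h_regions, phrase_spans):
    """Pick the best-matching region for each noun phrase.

    h_tokens:     (m, d) hidden states of the caption tokens at layer k
    h_regions:    (n, d) hidden states of the image regions at layer k
    phrase_spans: list of (start, end) token indices (end exclusive),
                  one span per noun phrase
    Names and tensor layout are assumptions for illustration.
    """
    predictions = []
    for start, end in phrase_spans:
        # h(p_i) = MEAN(h(l_ix), ..., h(l_iy))
        phrase_rep = h_tokens[start:end].mean(dim=0)   # (d,)
        # dot-product score against every candidate region
        scores = h_regions @ phrase_rep                # (n,)
        predictions.append(scores.argmax().item())
    return predictions
```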

Experiments
We evaluate the extraction method on RefCOCO+ using pre-trained models in a controlled setting (Bugliarello et al., 2020).

Table 1: Phrase grounding accuracy (%) of pre-trained models investigated with our proposed method. We also include the performance of probing classifiers (numbers in parentheses) and a supervised VisualBERT model ('Supervised') for reference.

Setup
We follow the setting of Bugliarello et al. (2020) in this section. Specifically, all the vision-and-language models are pre-trained on a pruned Conceptual Captions dataset (Sharma et al., 2018), consisting of 2.77M images with weakly-associated captions automatically collected from billions of web pages. The image features are extracted using a Faster R-CNN with a ResNet-101 backbone trained on Visual Genome (details in the Appendix).

Table 1 shows the phrase grounding accuracy of the pre-trained models using our method. To provide upper-bound performance for our extraction method, we train a linear probing classifier with the frozen model embeddings as inputs (numbers in parentheses). We find that our extraction method can better uncover the phrase grounding ability of single-stream models (UNITER, VisualBERT, VL-BERT), which process the vision and language inputs jointly. On the other hand, the grounding information in two-stream models (LXMERT, ViLBERT) can be hard to extract, probably because the parameters of two-stream models are not shared in the top layers and thus they are less likely to learn aligned representations. Also, compared with a supervised VisualBERT model, the pre-trained models underperform their supervised counterparts by a large margin, indicating that there is much room for improvement and additional effort is required to align the pre-trained vision-and-language embeddings.

Aligning Pre-trained Vision-and-Language Embeddings
To improve the model's phrase grounding ability, we then propose four fine-tuning objectives for vision-and-language models. We assume an image-caption dataset $\{(v_k, l_k)\}$ is provided but no fine-grained phrase-region annotations are available. A pre-trained vision model (Anderson et al., 2018) is used to segment images into regions and produce region representations and object labels.

Fine-tuning Objectives
We investigate four objectives that fine-tune pre-trained vision-and-language models for phrase grounding.

Masked Language Modeling (MLM). MLM with images has proven to be useful for representation learning, and here we investigate if it is also helpful for phrase grounding. Specifically, we randomly mask 15% of the tokens in $l$, and the model is trained to reconstruct $l$ given the masked text $l_{\text{mask}}$ and the regions $v$.

Masked Region Modeling (MRM). Inspired by MLM, we propose its counterpart on the vision side to encourage the symmetry between vision and language. While previous work (Tan and Bansal, 2019) regresses the region features, we find this unhelpful in our setting (in Appendix). Instead, imitating MLM, which relies on a text vocabulary, we create a dynamic vision vocabulary on the fly, and the model tries to reconstruct the input regions given the dynamically constructed vocabulary. Concretely, at each training step, we sample a batch of image-caption pairs $\{(v_k, l_k)\}_{k=1}^{B}$, where $B$ is the batch size, and randomly mask 15% of the regions. We treat all the regions in $\{v_k\}_{k=1}^{B}$ as candidate regions, and for each masked region, the model needs to select the original region from this set of candidate regions given the masked inputs. Given the pre-trained vision model representations and our model representations, the output probability at position $i$ for the $k$-th instance is computed as a softmax, over all candidate regions, of the cosine similarity $\cos(\cdot, \cdot)$ between our model's output at that position and the pre-trained vision representation of each candidate region. The model is trained to maximize this probability, similar to noise contrastive estimation (Gutmann and Hyvärinen, 2010; Jozefowicz et al., 2016).
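As a rough illustration of the MRM objective, the sketch below implements the in-batch contrastive loss described above in PyTorch; the function name, the temperature argument, and the exact normalization are assumptions, not details confirmed by the paper.

```python
import torch.nn.functional as F

def mrm_loss(masked_outputs, candidate_feats, target_ids, temperature=1.0):
    """Masked Region Modeling over a dynamically constructed vision vocabulary.

    masked_outputs:  (num_masked, d) model outputs at the masked region positions
    candidate_feats: (num_candidates, d) pre-trained vision features of every
                     region in the current batch (the "vision vocabulary")
    target_ids:      (num_masked,) index of each masked region's original
                     feature inside candidate_feats
    The temperature is an assumed extra knob; 1.0 matches the plain
    cosine-similarity softmax described in the text.
    """
    out = F.normalize(masked_outputs, dim=-1)
    cand = F.normalize(candidate_feats, dim=-1)
    logits = out @ cand.t() / temperature        # cosine similarities
    return F.cross_entropy(logits, target_ids)   # NCE-style selection loss
```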
Object Label Prediction (OLP). The object labels predicted by the pre-trained vision model provide us with good anchor points to bridge the gap between vision and language, and previous work has tried to incorporate this information by predicting the object label of each region (Tan and Bansal, 2019; Chen et al., 2020). In this paper, to better share information between the two modalities, we propose to 1) use simple heuristics to convert object labels into text tokens and train our model to predict the object labels $o_v$ with a multi-class MLM objective; 2) share the classification layer between MLM and OLP. For example, if the object label of $v_i$ is "stop sign", we first tokenize it into "stop" and "sign", and the model is then trained to maximize the joint probability of the two tokens at $v_i$.
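A minimal sketch of how such a shared-head OLP loss could look is given below; treating the joint probability as a sum of per-token log-probabilities at the region position is our assumption, and the argument names are hypothetical.

```python
import torch.nn.functional as F

def olp_loss(region_outputs, label_token_ids, mlm_head):
    """Object Label Prediction sharing the MLM classification layer (sketch).

    region_outputs:  (n_labeled, d) hidden states at regions with object labels
    label_token_ids: list of lists; the tokenized object label of each region,
                     e.g. "stop sign" -> [id("stop"), id("sign")]
    mlm_head:        the same vocabulary classifier used for MLM
    """
    loss, count = 0.0, 0
    for h, token_ids in zip(region_outputs, label_token_ids):
        logits = mlm_head(h)                        # (vocab_size,)
        log_probs = F.log_softmax(logits, dim=-1)
        for tok in token_ids:                       # maximize joint probability
            loss = loss - log_probs[tok]
            count += 1
    return loss / max(count, 1)
```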

Bidirectional Attention Optimization (BAO).
Inspired by work on encouraging the consistency between forward and backward attentions (Cohn et al., 2016; Hu et al., 2020; Dou and Neubig, 2021), we propose an objective to encourage the symmetry of vision-to-language and language-to-vision attentions. Specifically, after obtaining the representations $h(v)$ and $h(l)$, we compute the forward and backward attention matrices $\mathrm{ATT}_{VL}$ and $\mathrm{ATT}_{LV}$ from the dot products between the two, scaled by the feature dimension $d$. We then minimize the distance between them by maximizing the trace of $\mathrm{ATT}_{VL}^{\top}\mathrm{ATT}_{LV}$ (a code sketch is given below).

Combined Objective. Our final objective is a combination of the four objectives, where the weight $\alpha$ is selected from {0.1, 0.25, 0.5, 1.0} and is set to 0.1 based on the validation performance.
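The BAO sketch below assumes both attention matrices are computed from the same region-token score matrix (softmax over tokens for vision-to-language, over regions for language-to-vision) with $\sqrt{d}$ scaling; these normalization choices are our reading of the description rather than details confirmed by the paper.

```python
import torch.nn.functional as F

def bao_loss(h_v, h_l):
    """Bidirectional Attention Optimization (BAO), a sketch.

    h_v: (n, d) region representations; h_l: (m, d) token representations.
    Both attention matrices are stored as (n, m), so tr(ATT_VL^T ATT_LV)
    equals their element-wise (Frobenius) inner product.
    """
    d = h_v.size(-1)
    scores = h_v @ h_l.t() / d ** 0.5    # (n, m) scaled dot products
    att_vl = F.softmax(scores, dim=1)    # vision-to-language attention
    att_lv = F.softmax(scores, dim=0)    # language-to-vision attention
    # maximize tr(ATT_VL^T ATT_LV) = sum_ij ATT_VL[i, j] * ATT_LV[i, j]
    return -(att_vl * att_lv).sum()
```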

Figure 1: Visualization of the learned representations (Baseline vs. Ours) for the caption "the girl is about to kick a soccer ball."

Experiments
We then train our model with the proposed objective and compare with several baselines.

Setup
Model/Datasets. We choose VisualBERT as our base architecture because it performs the best in Section 2 and pre-train it on COCO (Chen et al., 2015). We then further fine-tune models on COCO and evaluate them on RefCOCO+ and Flickr30k in both weakly-supervised and supervised settings. Details of the models and datasets are in Appendix.
Settings. In weakly-supervised settings where only the image-text pairs in COCO are given, we directly extract phrase-region pairs from models using our method in Section 2.1. In supervised settings where phrase-region annotations in RefCOCO+ and Flickr30k are available, we add a linear layer on top of each region representation and fine-tune models with the cross-entropy loss.
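For the supervised setting, a minimal sketch of such a grounding head might look as follows; the hidden size and the way the region outputs are already conditioned on the query phrase are assumptions for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class GroundingHead(nn.Module):
    """One linear score per candidate region, trained with cross-entropy."""

    def __init__(self, hidden_size=768):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, region_reps, gold_region=None):
        # region_reps: (batch, n_regions, hidden) contextualized region outputs
        logits = self.scorer(region_reps).squeeze(-1)  # (batch, n_regions)
        if gold_region is None:
            return logits.argmax(dim=-1)               # predicted region index
        return F.cross_entropy(logits, gold_region)    # supervised loss
```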

Results
We first present the main results of the models and some ablation studies of the training objectives.

Main Results. In the weakly-supervised setting, Table 2 demonstrates that our objectives can improve the model grounding ability significantly, outperforming all the baselines. Moreover, we find that while MAF (Wang et al., 2020) achieves strong performance on Flickr30k, it fails on RefCOCO+. We hypothesize that this is because MAF is based on static word embeddings, and in RefCOCO+ multiple objects of the same type are typically present in one image, making MAF unable to disambiguate the phrases. With the aligned representations, we also achieve better grounding ability than VisualBERT in the supervised settings.
Ablation Studies. We ablate each of our training objectives and test their contributions in Table 3. We can see that all of the objectives are beneficial for phrase grounding, with OLP being the most effective one. BAO brings marginal improvements, yet its contributions are still non-negligible. We also test most existing pre-training objectives in the Appendix and show that our proposed objective works the best.

Analysis
We then perform analysis to provide insights on the fine-tuned model representations.
Transferring to Other Tasks. It is interesting to see if the aligned representations are still useful for other types of tasks. In Table 4, we test our model on image-text retrieval, visual entailment, and visual question answering (details in the Appendix). We find that our model achieves comparable or superior performance compared with VisualBERT, especially on tasks relying more on the model grounding ability, such as image retrieval, which shows that our training paradigm can maintain the representation generality.
Qualitative Examples. We visualize the learned representations in Figure 1. We find it hard to observe clear patterns in the baseline representations. For example, while the token representation of "ball" has high similarity with its associated region embedding, it is also close to the representation of the mascot. By contrast, our model clearly learns more aligned representations. It is interesting to note that our model learns a partial correspondence between the word "kick" and the soccer ball region, indicating that our objectives can also align verb and region representations.

Related Work
We overview two lines of related work in this part.

Conclusion
In this paper, we first propose a method to extract matched phrase-region pairs from pre-trained vision-and-language embeddings and evaluate its performance across models. Then, we propose several fine-tuning objectives for phrase grounding and demonstrate their effectiveness in both weakly-supervised and supervised phrase grounding tasks. We also evaluate our aligned representations on other downstream tasks and show that we can achieve better phrase grounding without sacrificing representation transferability. Future directions include better utilizing the aligned representations and incorporating our objectives into pre-training.

A Implementation Details
In Section 2, we follow Bugliarello et al. (2020) and pre-train the vision-and-language models in a controlled setting. Specifically, all the models are pre-trained on a pruned Conceptual Captions dataset (Sharma et al., 2018), consisting of 2.77M images with weakly-associated captions automatically collected from billions of web pages. The image features are extracted using a Faster R-CNN (Ren et al., 2016) with a ResNet-101 backbone (Anderson et al., 2018) trained on the Visual Genome dataset (Krishna et al., 2017), and the vision-and-language models are trained with 36 extracted regions of interest. For the probing experiments, we use the default hyper-parameters of Bugliarello et al. (2020) for training the probing classifiers.
In Section 3, we pre-train VisualBERT on the COCO dataset (Chen et al., 2015), consisting of 413K captions for 82K images (each image is paired with five different captions). VisualBERT is pre-trained with its original objectives for 11K steps on two RTX 2080 GPUs, taking about 40 hours per experiment. Then, our models are further fine-tuned on two RTX 2080 GPUs for 11K steps, taking about 2 days per experiment. The batch size is set to 480 and the learning rate to 5e-5. The models are trained with 64 extracted regions of interest. The weight α of the combined objective is selected from {0.1, 0.25, 0.5, 1.0} based on the validation performance on Flickr30k. The image features and labels are extracted from a ResNeXt-152 Faster R-CNN model trained on Visual Genome with an attribute loss. For efficiency, we mask both vision and language inputs and perform MLM, MRM, and OLP jointly on the masked inputs instead of training models with these objectives sequentially. We also tried to fine-tune VisualBERT with its original objectives for phrase grounding, but the grounding performance did not improve.
For the phrase grounding datasets we use, the RefCOCO+ dataset is collected in an interactive game interface and we follow its standard split. At test time, RefCOCO+ provides person vs. object splits for evaluation, where images containing multiple people are in "testA" and images containing multiple other objects are in "testB". The Flickr30k Entities dataset contains 224K phrases and 31K images in total, where each image is associated with five captions, and we follow its standard splits. Following previous work, we consider a prediction to be correct if the IoU (Intersection over Union) score between our predicted bounding box and the ground-truth box is larger than 0.5. We fine-tune the models for 20K steps, with the batch size set to 32 and the learning rate set to 2e-5. The models are trained with 100 extracted regions of interest.
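For reference, the IoU criterion above can be computed with a few lines of plain Python; the (x1, y1, x2, y2) box format is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2).

    A prediction counts as correct when iou(predicted, ground_truth) > 0.5.
    """
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```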
For the image-text retrieval task, we evaluate models on Flickr30k (Plummer et al., 2015) with Recall@1 as the evaluation metric. The models are fine-tuned for 20 epochs with the batch size set to 256 and the learning rate set to 1e-4, and are trained with 36 extracted regions of interest. For the visual entailment task, we experiment on the SNLI-VE dataset (Xie et al., 2019) and report accuracy. We fine-tune the models for 60K steps with the batch size set to 480 and the learning rate set to 5e-5, with 100 extracted regions of interest. For the visual question answering task, we choose the VQAv2 dataset (Goyal et al., 2017) and evaluate models with the VQA score on both the test-dev and test-standard splits. The models are fine-tuned for 60K steps with the batch size set to 480 and the learning rate set to 5e-5, with 100 extracted regions of interest.

B Negative Results
In this part, we show some negative results of four fine-tuning objectives that we have tried in our settings.

B.1 Fine-tuning Objectives
In addition to the objectives presented in the main content, we also experiment with the following four objectives.

Masked Region Regression (MRR). Previous work (Tan and Bansal, 2019; Chen et al., 2020) has attempted to regress the region features by minimizing the L2 distance between the predicted and the original image features. An additional feedforward layer is used to transform the hidden representations into the image feature space.
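A minimal sketch of such a regression head is shown below; the hidden and feature dimensions (768 and 2048) are assumptions based on typical BERT-style and Faster R-CNN feature sizes, not values reported here.

```python
import torch.nn as nn
import torch.nn.functional as F

class MRRHead(nn.Module):
    """Project masked-region hidden states back to the visual feature space
    and regress them toward the original features with an L2 (MSE) loss."""

    def __init__(self, hidden_size=768, feat_size=2048):
        super().__init__()
        self.proj = nn.Linear(hidden_size, feat_size)

    def forward(self, masked_outputs, original_feats):
        # masked_outputs: (num_masked, hidden_size) model outputs
        # original_feats: (num_masked, feat_size) features from the vision model
        return F.mse_loss(self.proj(masked_outputs), original_feats)
```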
Masked Region Classification (MRC). Similar to our OLP objective, researchers (Tan and Bansal, 2019; Su et al., 2019; Chen et al., 2020) have also tried to utilize object labels by predicting the object semantic class without sharing the classification layer between the vision and language modalities. The object labels are obtained from a pre-trained vision model. The main difference between MRC and our OLP is that we perform image classification in the text space and share the prediction layer between the two modalities.
Image-Text Matching (ITM). In ITM, a special token ([CLS]) is inserted at the beginning of the input sentence, and the model learns a fused representation of both vision and language at this token. We feed the model with either matched or mismatched image-caption pairs with equal probability. A classifier is added on top of this token, and its output is a binary label indicating whether the sampled image-caption pair is a match.
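The pair construction for ITM is easy to sketch; the snippet below shows one assumed way to build matched/mismatched pairs with equal probability, while the binary classifier over the [CLS] output is standard and omitted.

```python
import random
import torch

def build_itm_batch(images, captions):
    """Pair each image with its own caption (label 1) or a caption sampled
    from another example (label 0), with equal probability.
    Assumes the batch contains at least two image-caption pairs."""
    pairs, labels = [], []
    for i, image in enumerate(images):
        if random.random() < 0.5:
            pairs.append((image, captions[i]))   # matched pair
            labels.append(1)
        else:
            j = random.choice([k for k in range(len(captions)) if k != i])
            pairs.append((image, captions[j]))   # mismatched pair
            labels.append(0)
    return pairs, torch.tensor(labels)
```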
Optimal Transport (OT). Chen et al. (2020) use optimal transport to encourage word-region alignments, which is potentially beneficial for phrase grounding. Therefore, we follow their settings and implement the optimal transport objective. Specifically, for each pair of word $l_i$ and region $v_j$, we first compute their cosine distance $c_{ij} = 1 - \frac{l_i^{\top} v_j}{\|l_i\|_2 \|v_j\|_2}$. The optimal transport objective is then defined as the total cost of transporting between the two modalities under a transport plan $T_{ij}$ between language and vision, which is obtained using the IPOT algorithm (Xie et al., 2020).

Table 5 shows the results of the four fine-tuning objectives on Flickr30k and RefCOCO+ in weakly-supervised settings. We can see that adding these objectives does not improve the model performance in our phrase grounding settings. We think this is possibly because 1) the MRR and MRC objectives differ substantially from the MLM objective on the language side, and thus they can cause the resulting vision-and-language representations to deviate; 2) ITM mainly cares about aligning sentence and image representations, while our phrase grounding tasks require fine-grained phrase-region alignments; 3) there can be multiple complicated many-to-many alignments for an image-caption pair, making it hard to find a reasonable transport plan between the language and vision modalities, and thus the optimal transport technique may not be suitable for phrase grounding. Also, as shown in the last 6 rows of the table, our proposed MRM, OLP, and BAO objectives outperform the MRR, MRC, and OT objectives used in previous work.

C Phrase Grounding Abilities Across Layers
In this part, we plot the grounding performance of each model layer in Figure 2. Contrary to findings in multilingual encoders (Pires et al., 2019), we do not observe coherent patterns across the different models. While most models demonstrate better grounding abilities in the top and bottom layers than in the middle layers, VL-BERT exhibits the opposite behavior.