Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding

Zi-Yi Dou, Nanyun Peng


Abstract
Phrase grounding aims to map textual phrases to their associated image regions, which can be a prerequisite for multimodal reasoning and can benefit tasks that require identifying objects based on language. With pre-trained vision-and-language models achieving impressive performance across tasks, it remains unclear whether we can directly utilize their learned embeddings for phrase grounding without fine-tuning. To this end, we propose a method to extract matched phrase-region pairs from pre-trained vision-and-language embeddings, and propose four fine-tuning objectives that improve the model's phrase-grounding ability using image-caption data without any supervised grounding signals. Experiments on two representative datasets demonstrate the effectiveness of our objectives, outperforming baseline models in both weakly-supervised and supervised phrase grounding settings. In addition, we evaluate the aligned embeddings on several other downstream tasks and show that we can achieve better phrase grounding without sacrificing representation generality.
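The core extraction step the abstract describes, pairing each textual phrase with its best-matching image region in a shared embedding space, can be sketched with cosine similarity. This is a minimal, hypothetical illustration (the function name and toy embeddings are not from the paper), assuming phrase and region embeddings already live in the same space:

```python
import numpy as np

def match_phrases_to_regions(phrase_embs, region_embs):
    """For each phrase embedding, return the index of the region embedding
    with the highest cosine similarity (hypothetical sketch, not the
    paper's exact method)."""
    # L2-normalize rows so dot products become cosine similarities.
    p = phrase_embs / np.linalg.norm(phrase_embs, axis=1, keepdims=True)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    sims = p @ r.T                      # shape: (num_phrases, num_regions)
    return sims.argmax(axis=1)          # best-matching region per phrase

# Toy example: 2 phrase embeddings, 3 region embeddings, 4 dimensions.
phrases = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])
regions = np.array([[0.9, 0.1, 0.0, 0.0],
                    [0.0, 0.2, 0.9, 0.0],
                    [0.1, 0.9, 0.0, 0.0]])
print(match_phrases_to_regions(phrases, regions))  # → [0 2]
```

In practice the phrase and region embeddings would come from a pre-trained vision-and-language model, and the paper's fine-tuning objectives would be applied to sharpen this alignment.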
Anthology ID:
2021.emnlp-main.513
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
6362–6371
URL:
https://aclanthology.org/2021.emnlp-main.513
DOI:
10.18653/v1/2021.emnlp-main.513
PDF:
https://aclanthology.org/2021.emnlp-main.513.pdf
Data
COCO, Visual Question Answering