Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining

Emanuele Bugliarello, Aida Nematzadeh, Lisa Hendricks


Abstract
Recent work in vision-and-language pretraining has investigated supervised signals from object detection data to learn better, fine-grained multimodal representations. In this work, we take a step further and explore how we can tap into supervision from small-scale visual relation data. In particular, we propose two pretraining approaches to contextualise visual entities in a multimodal setup. With verbalised scene graphs, we transform visual relation triplets into structured captions and treat them as additional image descriptions. With masked relation prediction, we further encourage relating entities from image regions with visually masked contexts. When applied to strong baselines pretrained on large amounts of Web data, our methods prove effective in zero-shot evaluations on both coarse-grained and fine-grained tasks, demonstrating that multimodal representations can be learnt from weakly-supervised relation data.
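
To illustrate the verbalised scene graph idea described in the abstract, the sketch below renders relation triplets as simple template-based captions that could serve as additional image descriptions during pretraining. The template, helper names, and example triplets are assumptions for illustration only; the paper's exact verbalisation scheme may differ.

```python
# Minimal sketch of verbalising a scene graph (assumed template, not the
# paper's exact scheme): each (subject, predicate, object) relation triplet
# is rendered as a short structured caption.

from typing import List, Tuple

# A visual relation triplet, e.g. ("man", "riding", "horse").
Triplet = Tuple[str, str, str]


def verbalise_triplet(triplet: Triplet) -> str:
    """Render a single relation triplet as a caption-like string."""
    subj, pred, obj = triplet
    return f"{subj} {pred} {obj}"


def verbalise_scene_graph(triplets: List[Triplet]) -> List[str]:
    """Turn all relation triplets of an image into extra 'captions'."""
    return [verbalise_triplet(t) for t in triplets]


if __name__ == "__main__":
    scene_graph = [("man", "riding", "horse"), ("horse", "standing on", "grass")]
    for caption in verbalise_scene_graph(scene_graph):
        print(caption)
    # -> man riding horse
    # -> horse standing on grass
```

These verbalised captions would then be paired with the corresponding image alongside its ordinary Web captions, so the model sees the relation as natural-language supervision.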
Anthology ID:
2023.emnlp-main.184
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
3052–3071
URL:
https://aclanthology.org/2023.emnlp-main.184
DOI:
10.18653/v1/2023.emnlp-main.184
Cite (ACL):
Emanuele Bugliarello, Aida Nematzadeh, and Lisa Hendricks. 2023. Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3052–3071, Singapore. Association for Computational Linguistics.
Cite (Informal):
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining (Bugliarello et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.184.pdf
Video:
https://aclanthology.org/2023.emnlp-main.184.mp4