CapOnImage: Context-driven Dense-Captioning on Image

Yiqi Gao, Xinglin Hou, Yuanmeng Zhang, Tiezheng Ge, Yuning Jiang, Peng Wang


Abstract
Existing image captioning systems are dedicated to generating narrative captions for images, which are spatially detached from theimage in presentation. However, texts can also be used as decorations on the image to highlight the key points and increase theattractiveness of images. In this work, we introduce a new taskcalled captioning on image (CapOnImage), which aims to generatedense captions at different locations of the image based on contextual information. To fully exploit the surrounding visual context togenerate the most suitable caption for each location, we propose amulti-modal pre-training model with multi-level pre-training tasksthat progressively learn the correspondence between texts and image locations from easy to difficult. Since the model may generateredundant captions for nearby locations, we further enhance thelocation embedding with neighbor locations as context. For thisnew task, we also introduce a large-scale benchmark called CapOnImage2M, which contains 2.1 million product images, each with anaverage of 4.8 spatially localized captions. Compared with other image captioning model variants, our model achieves the best resultsin both captioning accuracy and diversity aspects.
Anthology ID:
2022.emnlp-main.226
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
3449–3465
Language:
URL:
https://aclanthology.org/2022.emnlp-main.226
DOI:
10.18653/v1/2022.emnlp-main.226
Bibkey:
Cite (ACL):
Yiqi Gao, Xinglin Hou, Yuanmeng Zhang, Tiezheng Ge, Yuning Jiang, and Peng Wang. 2022. CapOnImage: Context-driven Dense-Captioning on Image. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3449–3465, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
CapOnImage: Context-driven Dense-Captioning on Image (Gao et al., EMNLP 2022)
Copy Citation:
PDF:
https://aclanthology.org/2022.emnlp-main.226.pdf