Improving Image Captioning via Predicting Structured Concepts

Ting Wang, Weidong Chen, Yuanhe Tian, Yan Song, Zhendong Mao


Abstract
To narrow the semantic gap between images and texts in image captioning, previous studies have treated semantic concepts as a bridge between the two modalities and improved captioning performance accordingly. Although these studies obtained promising results on concept prediction, they normally ignore the relationships among concepts, which depend not only on objects in the image but also on word dependencies in the text, and which therefore offer considerable potential for improving description generation. In this paper, we propose a structured concept predictor (SCP) that predicts concepts and their structures and integrates them into captioning. The concepts enhance the contribution of visual signals in this task, and their relations help distinguish cross-modal semantics for better description generation. In particular, we design weighted graph convolutional networks (W-GCN) to depict concept relations driven by word dependencies and then learn differentiated contributions from these concepts for the subsequent decoding process. Our approach thus captures potential relations among concepts and discriminatively learns different concepts, which effectively facilitates image captioning with inherited information across modalities. Extensive experiments demonstrate the effectiveness of our approach as well as of each proposed module.
Anthology ID:
2023.emnlp-main.25
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
360–370
URL:
https://aclanthology.org/2023.emnlp-main.25
DOI:
10.18653/v1/2023.emnlp-main.25
Cite (ACL):
Ting Wang, Weidong Chen, Yuanhe Tian, Yan Song, and Zhendong Mao. 2023. Improving Image Captioning via Predicting Structured Concepts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 360–370, Singapore. Association for Computational Linguistics.
Cite (Informal):
Improving Image Captioning via Predicting Structured Concepts (Wang et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.25.pdf
Video:
https://aclanthology.org/2023.emnlp-main.25.mp4