Visually-Aware Context Modeling for News Image Captioning

Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens


Abstract
News Image Captioning aims to create captions from news articles and images, emphasizing the connection between textual context and visual elements. Recognizing the significance of human faces in news images and the face-name co-occurrence pattern in existing datasets, we propose a face-naming module for learning better name embeddings. Apart from names, which can be directly linked to an image area (faces), news image captions mostly contain context information that can only be found in the article. We design a retrieval strategy using CLIP to retrieve sentences that are semantically close to the image, mimicking the human thought process of linking articles to images. Furthermore, to tackle the imbalanced proportion of article context and image context in captions, we introduce a simple yet effective method, Contrasting with Language Model backbone (CoLaM), into the training pipeline. We conduct extensive experiments to demonstrate the efficacy of our framework. We outperform the previous state of the art (without external data) by 7.97/5.80 CIDEr points on GoodNews/NYTimes800k. Our code is available at https://github.com/tingyu215/VACNIC.
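The retrieval idea in the abstract (picking article sentences that are semantically close to the image) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the image and each sentence have already been embedded into a shared space by a model such as CLIP, and it ranks sentences by cosine similarity, which is an assumed scoring choice. The names `cosine` and `retrieve_top_k`, the example sentences, and the toy three-dimensional vectors are all hypothetical stand-ins for real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(
    image_emb: Sequence[float],
    sentence_embs: Sequence[Sequence[float]],
    sentences: Sequence[str],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k sentences whose embeddings are closest to the image embedding."""
    scored = [
        (cosine(image_emb, emb), sent)
        for emb, sent in zip(sentence_embs, sentences)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(sent, score) for score, sent in scored[:k]]


# Toy data: placeholder vectors, NOT real CLIP outputs.
image_emb = [1.0, 0.0, 0.0]
sentences = [
    "The mayor spoke at the opening ceremony.",
    "Stock markets closed lower on Friday.",
    "Crowds gathered outside the city hall.",
]
sentence_embs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]

top = retrieve_top_k(image_emb, sentence_embs, sentences, k=2)
for sent, score in top:
    print(f"{score:.3f}  {sent}")
```

In a real pipeline the placeholder vectors would be replaced by the image and text embeddings produced by a CLIP encoder, and the retrieved sentences would then be passed to the captioning model as article context.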
Anthology ID:
2024.naacl-long.162
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
2927–2943
URL:
https://aclanthology.org/2024.naacl-long.162
DOI:
10.18653/v1/2024.naacl-long.162
Cite (ACL):
Tingyu Qu, Tinne Tuytelaars, and Marie-Francine Moens. 2024. Visually-Aware Context Modeling for News Image Captioning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2927–2943, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Visually-Aware Context Modeling for News Image Captioning (Qu et al., NAACL 2024)
PDF:
https://aclanthology.org/2024.naacl-long.162.pdf