Exploring the Impact of Vision Features in News Image Captioning

Junzhe Zhang, Xiaojun Wan


Abstract
The task of news image captioning aims to generate a detailed caption that describes the specific information of an image in a news article. However, we find that recent state-of-the-art models can achieve competitive performance even without vision features. To investigate the impact of vision features on the news image captioning task, we conduct extensive experiments with mainstream models based on the encoder-decoder framework. From our exploration, we find that 1) vision features do contribute to the generation of news image captions; 2) vision features help models generate the entities in captions more accurately when the entity information is sufficient in the input textual context of the given article; and 3) regions of specific objects in images contribute to the generation of the related entities in captions.
Anthology ID:
2023.findings-acl.818
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12923–12936
URL:
https://aclanthology.org/2023.findings-acl.818
DOI:
10.18653/v1/2023.findings-acl.818
Cite (ACL):
Junzhe Zhang and Xiaojun Wan. 2023. Exploring the Impact of Vision Features in News Image Captioning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 12923–12936, Toronto, Canada. Association for Computational Linguistics.
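A BibTeX entry can be assembled from the metadata above; note that the citation key below is an assumption based on the Anthology's usual lastname-year-firstword convention and is not taken from this page:

@inproceedings{zhang-wan-2023-exploring,
    % citation key above is assumed, not confirmed by this page
    title = "Exploring the Impact of Vision Features in News Image Captioning",
    author = "Zhang, Junzhe and Wan, Xiaojun",
    editor = "Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.818",
    doi = "10.18653/v1/2023.findings-acl.818",
    pages = "12923--12936",
}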
Cite (Informal):
Exploring the Impact of Vision Features in News Image Captioning (Zhang & Wan, Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.818.pdf