Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning

Zhihao Fan, Zhongyu Wei, Siyuan Wang, Xuanjing Huang


Abstract
Image captioning aims to generate a short description of an image. Existing work usually employs a CNN-RNN architecture that treats generation as a sequential decision-making process and uses the entire dataset vocabulary as the decoding space. Such models tend to generate high-frequency n-grams containing irrelevant words. To tackle this problem, we propose constructing an image-grounded vocabulary that constrains and guides caption generation. Specifically, a novel hierarchical structure is proposed to construct the vocabulary, incorporating both visual information and relations among words. For generation, we propose a word-aware RNN cell that incorporates vocabulary information directly into the decoding process. The REINFORCE algorithm is employed to train the generator, using the constrained vocabulary as the action space. Experimental results on MS COCO and Flickr30k show the effectiveness of our framework compared with several state-of-the-art models.
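To illustrate the idea of decoding over a constrained vocabulary rather than the full dataset vocabulary, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `constrained_softmax`, the toy vocabulary, the logits, and the choice of allowed indices are all hypothetical, and the abstract does not specify how the constraint is applied inside the model. The sketch simply renormalizes the decoder scores over the words that an image-grounded vocabulary would permit.

```python
import math


def constrained_softmax(logits, allowed):
    """Softmax over `logits`, restricted to the word indices in `allowed`.

    Words outside `allowed` receive probability 0.0, so sampling or
    argmax can only pick words from the constrained vocabulary.
    (Hypothetical helper for illustration only.)
    """
    if not allowed:
        raise ValueError("the allowed vocabulary must not be empty")
    # Subtract the largest allowed logit for numerical stability.
    top = max(logits[i] for i in allowed)
    exps = {i: math.exp(logits[i] - top) for i in allowed}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


# Toy full vocabulary and decoder scores for one time step (made up).
vocab = ["a", "dog", "cat", "runs", "sits", "the"]
logits = [1.0, 2.0, 3.0, 0.5, 0.2, 1.5]

# Assumed image-grounded subset: indices of "a", "dog", "runs".
image_vocab = {0, 1, 3}

probs = constrained_softmax(logits, image_vocab)
best_word = vocab[max(range(len(probs)), key=probs.__getitem__)]
```

In this toy step, "cat" has the highest raw score but lies outside the assumed image-grounded subset, so it gets zero probability and the highest-scoring allowed word is chosen instead.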
Anthology ID:
P19-1652
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Anna Korhonen, David Traum, Lluís Màrquez
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
6514–6524
URL:
https://aclanthology.org/P19-1652/
DOI:
10.18653/v1/P19-1652
Cite (ACL):
Zhihao Fan, Zhongyu Wei, Siyuan Wang, and Xuanjing Huang. 2019. Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6514–6524, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning (Fan et al., ACL 2019)
PDF:
https://aclanthology.org/P19-1652.pdf
Code
 LibertFan/ImageCaption
Data
MS COCO, VQG, Visual Question Answering