Understanding Guided Image Captioning Performance across Domains

Edwin G. Ng, Bo Pang, Piyush Sharma, Radu Soricut


Abstract
Image captioning models generally lack the capability to take into account user interest, and usually default to global descriptions that try to balance readability, informativeness, and information overload. We present a Transformer-based model with the ability to produce captions focused on specific objects, concepts, or actions in an image by providing them as guiding text to the model. Further, we evaluate the quality of these guided captions when trained on Conceptual Captions, which contains 3.3M image-level captions, compared to Visual Genome, which contains 3.6M object-level captions. Counter-intuitively, we find that guided captions produced by the model trained on Conceptual Captions generalize better to out-of-domain data. Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets, and that increased style diversity (even without increasing the number of unique tokens) is a key factor for improved performance.
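The abstract describes conditioning a captioning model on user-supplied guiding text. As an illustrative sketch only (not the authors' code), one plausible way to do this is to prepend the guiding-text tokens to the sequence of image-region placeholders fed to the model; the function name, separator token, and placeholder format below are all hypothetical:

```python
def build_guided_input(guiding_text, image_features, sep="[SEP]"):
    """Hypothetical input format: guiding-text tokens, a separator,
    then one placeholder token per image-region feature vector."""
    guide_tokens = guiding_text.lower().split()
    image_tokens = [f"<img_{i}>" for i in range(len(image_features))]
    return guide_tokens + [sep] + image_tokens

# Example: guide generation toward "red bicycle" given 3 region features.
seq = build_guided_input("red bicycle", [[0.1], [0.4], [0.7]])
# seq == ['red', 'bicycle', '[SEP]', '<img_0>', '<img_1>', '<img_2>']
```

In this sketch the decoder would attend over the combined sequence, so the guiding tokens steer generation toward the named object or action.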
Anthology ID:
2021.conll-1.14
Volume:
Proceedings of the 25th Conference on Computational Natural Language Learning
Month:
November
Year:
2021
Address:
Online
Venues:
CoNLL | EMNLP
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Pages:
183–193
URL:
https://aclanthology.org/2021.conll-1.14
PDF:
https://aclanthology.org/2021.conll-1.14.pdf
Code
google-research-datasets/T2-Guiding
Data
T2 Guiding | Conceptual Captions | Localized Narratives | Visual Genome