Edwin G. Ng
Image captioning models generally lack the capability to take into account user interest, and usually default to global descriptions that try to balance readability, informativeness, and information overload. We present a Transformer-based model with the ability to produce captions focused on specific objects, concepts or actions in an image by providing them as guiding text to the model. Further, we evaluate the quality of these guided captions when trained on Conceptual Captions which contain 3.3M image-level captions compared to Visual Genome which contain 3.6M object-level captions. Counter-intuitively, we find that guided captions produced by the model trained on Conceptual Captions generalize better on out-of-domain data. Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets, and that increased style diversity (even without increasing the number of unique tokens) is a key factor for improved performance.