Compositional Generalization in Image Captioning

Mitja Nikolaus, Mostafa Abdou, Matthew Lamm, Rahul Aralikatte, Desmond Elliott


Abstract
Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models show poor generalization performance on this task. To address this, we propose a multi-task model that combines caption generation and image–sentence ranking, and uses a decoding mechanism that re-ranks the candidate captions according to their similarity to the image. This model is substantially better at generalizing to unseen combinations of concepts compared to state-of-the-art captioning models.
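The re-ranking step described in the abstract can be illustrated with a minimal sketch: score each beam-search candidate by its similarity to the image in a joint embedding space and return the candidates best-first. This is an assumption-laden stand-in, not the authors' implementation; the random vectors below merely stand in for embeddings learned via the paper's image–sentence ranking task, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(image_emb: np.ndarray, candidates):
    """Re-rank beam-search candidates by similarity to the image.

    candidates: list of (caption, caption_embedding) pairs, where the
    embeddings are assumed to live in the same space as image_emb.
    Returns (caption, score) pairs sorted best-first.
    """
    scored = [(cap, cosine_similarity(image_emb, emb)) for cap, emb in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy example: random vectors stand in for learned joint embeddings.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=300)
candidates = [
    ("a dog runs on the grass", rng.normal(size=300)),
    ("a brown dog runs on the grass", rng.normal(size=300)),
    ("a cat sits on a couch", rng.normal(size=300)),
]
for caption, score in rerank(image_emb, candidates):
    print(f"{score:.3f}  {caption}")
```

Under this view, the generator proposes fluent candidates and the ranking model acts as a filter, preferring the caption whose content best matches the image, which is what allows unseen concept combinations to surface over more frequent but less accurate ones.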
Anthology ID:
K19-1009
Volume:
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
Month:
November
Year:
2019
Address:
Hong Kong, China
Venue:
CoNLL
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Pages:
87–98
URL:
https://aclanthology.org/K19-1009
DOI:
10.18653/v1/K19-1009
PDF:
https://aclanthology.org/K19-1009.pdf
Supplementary material:
K19-1009.Supplementary_Material.pdf