Generating Question Relevant Captions to Aid Visual Question Answering

Jialin Wu, Zeyuan Hu, Raymond Mooney


Abstract
Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We present a novel approach to improving VQA performance that exploits this connection by jointly generating captions that are targeted to help answer a specific visual question. The model is trained using an existing caption dataset by automatically determining question-relevant captions with an online gradient-based method. Experimental results on the VQA v2 challenge demonstrate that our approach obtains state-of-the-art VQA performance (e.g., 68.4% on the test-standard set using a single model) by simultaneously generating question-relevant captions.
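The core idea described above, picking out question-relevant captions from an existing caption set with an online gradient-based criterion, can be sketched roughly as follows. This is a minimal, hypothetical PyTorch sketch, not the authors' implementation: it assumes a differentiable VQA model over region features and per-caption attention weights over the same image regions, and the cosine-similarity relevance score is an illustrative choice rather than the paper's exact criterion.

import torch
import torch.nn.functional as F

def select_relevant_caption(vqa_model, image_feats, question, answer_label,
                            caption_attentions):
    """Score each candidate caption by how well its region attention aligns
    with the gradient of the answer loss w.r.t. the image features.

    image_feats:        (num_regions, feat_dim) region features.
    caption_attentions: (num_captions, num_regions) attention over regions
                        induced by each human caption (an assumption here).
    Returns the index of the caption scored as most question-relevant.
    """
    image_feats = image_feats.clone().detach().requires_grad_(True)
    logits = vqa_model(image_feats, question)            # (num_answers,)
    loss = F.cross_entropy(logits.unsqueeze(0),
                           torch.tensor([answer_label]))
    # Online gradient step: which regions most influence the answer?
    grad, = torch.autograd.grad(loss, image_feats)       # (num_regions, feat_dim)
    region_importance = grad.norm(dim=1)
    region_importance = region_importance / (region_importance.sum() + 1e-8)

    # A caption counts as "question-relevant" if its attention concentrates
    # on the same regions the answer gradient highlights.
    scores = F.cosine_similarity(caption_attentions,
                                 region_importance.unsqueeze(0), dim=1)
    return int(scores.argmax())

In training, the caption selected this way would serve as the supervision target for the caption-generation branch while the VQA branch is trained jointly; the shared features are then updated with both losses.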
Anthology ID:
P19-1348
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Anna Korhonen, David Traum, Lluís Màrquez
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
3585–3594
URL:
https://aclanthology.org/P19-1348
DOI:
10.18653/v1/P19-1348
Cite (ACL):
Jialin Wu, Zeyuan Hu, and Raymond Mooney. 2019. Generating Question Relevant Captions to Aid Visual Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3585–3594, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Generating Question Relevant Captions to Aid Visual Question Answering (Wu et al., ACL 2019)
PDF:
https://aclanthology.org/P19-1348.pdf
Video:
https://aclanthology.org/P19-1348.mp4
Data
Visual Genome, Visual Question Answering, Visual Question Answering v2.0