All You May Need for VQA are Image Captions

Soravit Changpinyo, Doron Kukliansy, Idan Szpektor, Xi Chen, Nan Ding, Radu Soricut


Abstract
Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness that is lacking in the same model trained on human-annotated VQA data.
Anthology ID:
2022.naacl-main.142
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
1947–1963
URL:
https://aclanthology.org/2022.naacl-main.142
DOI:
10.18653/v1/2022.naacl-main.142
Cite (ACL):
Soravit Changpinyo, Doron Kukliansy, Idan Szpektor, Xi Chen, Nan Ding, and Radu Soricut. 2022. All You May Need for VQA are Image Captions. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1947–1963, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
All You May Need for VQA are Image Captions (Changpinyo et al., NAACL 2022)
PDF:
https://aclanthology.org/2022.naacl-main.142.pdf
Code
 google-research-datasets/maverics
Data
MAVERICS, COCO, COCO-QA, Conceptual Captions, GQA, OK-VQA, SQuAD, VQG, Visual Question Answering, Visual Question Answering v2.0