CommVQA: Situating Visual Question Answering in Communicative Contexts

Nandita Naik, Christopher Potts, Elisa Kreiss

Abstract
Current visual question answering (VQA) models tend to be trained and evaluated on image-question pairs in isolation. However, the questions people ask are dependent on their informational needs and prior knowledge about the image content. To evaluate how situating images within naturalistic contexts shapes visual questions, we introduce CommVQA, a VQA dataset consisting of images, image descriptions, real-world communicative scenarios where the image might appear (e.g., a travel website), and follow-up questions and answers conditioned on the scenario and description. CommVQA, which contains 1000 images and 8,949 question-answer pairs, poses a challenge for current models. Error analyses and a human-subjects study suggest that generated answers still contain high rates of hallucinations, fail to fittingly address unanswerable questions, and don’t suitably reflect contextual information.
Anthology ID:
2024.emnlp-main.741
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
13362–13377
URL:
https://aclanthology.org/2024.emnlp-main.741
Cite (ACL):
Nandita Naik, Christopher Potts, and Elisa Kreiss. 2024. CommVQA: Situating Visual Question Answering in Communicative Contexts. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13362–13377, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
CommVQA: Situating Visual Question Answering in Communicative Contexts (Naik et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.741.pdf