On the Human-level Performance of Visual Question Answering

Chenlian Zhou, Guanyi Chen, Xin Bai, Ming Dong


Abstract
Visual7W has been widely used for assessing multiple-choice visual question-answering (VQA) systems. This paper reports on a replicated human experiment on Visual7W with the aim of understanding the human-level performance of VQA. The replication was not entirely successful because human participants performed significantly worse when answering “where”, “when”, and “how” questions compared to other question types. An error analysis revealed that this failure was a consequence of the non-deterministic distractors in Visual7W. GPT-4V was then evaluated on Visual7W and compared to the human-level performance. The results suggest that, when evaluating models’ capacity on Visual7W, higher performance is not necessarily better.
Anthology ID:
2025.coling-main.277
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
4109–4113
URL:
https://aclanthology.org/2025.coling-main.277/
Cite (ACL):
Chenlian Zhou, Guanyi Chen, Xin Bai, and Ming Dong. 2025. On the Human-level Performance of Visual Question Answering. In Proceedings of the 31st International Conference on Computational Linguistics, pages 4109–4113, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
On the Human-level Performance of Visual Question Answering (Zhou et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.277.pdf