MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?

Guanzhen Li, Yuxi Xie, Min-Yen Kan


Abstract
Humans perform visual perception at multiple levels, including low-level object recognition and high-level semantic interpretation such as behavior understanding. Subtle differences in low-level details can lead to substantial changes in high-level perception. For example, substituting the shopping bag held by a person with a gun suggests violent behavior, implying criminal activity. Despite significant advancements in various multimodal tasks, Large Vision-Language Models (LVLMs) remain unexplored in their capability to conduct such multi-level visual perception. To investigate the perception gap between LVLMs and humans, we introduce MVP-Bench, the first visual–language benchmark that systematically evaluates both low- and high-level visual perception of LVLMs. We construct MVP-Bench across natural and synthetic images to investigate how manipulated content influences model perception. Using MVP-Bench, we diagnose the visual perception of 10 open-source and 2 closed-source LVLMs, showing that high-level perception tasks significantly challenge existing LVLMs. The state-of-the-art GPT-4o achieves an accuracy of only 56% on Yes/No questions, compared with 74% in low-level scenarios. Furthermore, the performance gap between natural and manipulated images indicates that current LVLMs do not generalize in understanding the visual semantics of synthetic images as humans do.
Anthology ID:
2024.findings-emnlp.789
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
13505–13527
URL:
https://aclanthology.org/2024.findings-emnlp.789
Cite (ACL):
Guanzhen Li, Yuxi Xie, and Min-Yen Kan. 2024. MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13505–13527, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans? (Li et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.789.pdf
Data:
https://aclanthology.org/attachments/2024.findings-emnlp.789.data.zip