Evaluating Vision-Language Models on Bistable Images

Artemis Panagopoulou, Coby Melkin, Chris Callison-Burch


Abstract
Bistable images, also known as ambiguous or reversible images, are visual stimuli that admit two distinct interpretations, though not both simultaneously. In this study, we conduct the most extensive examination to date of vision-language models on bistable images. We manually gathered a dataset of 29 bistable images, along with their associated labels, and subjected them to 121 manipulations in brightness, resolution, tint, and rotation. We evaluated twelve models across six architectures on both classification and generative tasks. Our findings reveal that, with the exception of models from the Idefics family and LLaVA1.5-13b, models show a pronounced preference for one interpretation over the other, with minimal variance under image manipulations apart from a few exceptions involving rotation. Additionally, we compared the models' preferences with those of humans, noting that the models do not exhibit the same continuity biases as humans and often diverge from humans' initial interpretations. We also investigated the influence of prompt variations and synonymous labels, finding that these factors affect model interpretations more strongly than image manipulations do, indicating a greater influence of language priors on bistable image interpretation than of image-text training data. All code and data are open-sourced.
Anthology ID:
2024.cmcl-1.2
Volume:
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Tatsuki Kuribayashi, Giulia Rambelli, Ece Takmaz, Philipp Wicke, Yohei Oseki
Venues:
CMCL | WS
Publisher:
Association for Computational Linguistics
Pages:
8–29
URL:
https://aclanthology.org/2024.cmcl-1.2
Cite (ACL):
Artemis Panagopoulou, Coby Melkin, and Chris Callison-Burch. 2024. Evaluating Vision-Language Models on Bistable Images. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 8–29, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Evaluating Vision-Language Models on Bistable Images (Panagopoulou et al., CMCL-WS 2024)
PDF:
https://aclanthology.org/2024.cmcl-1.2.pdf