Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts

Aditya Sharma, Michael Saxon, William Yang Wang


Abstract
We present LoCoVQA, a dynamic benchmark generator for evaluating long-context reasoning in vision language models (VLMs). LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts composed of both in-distribution and out-of-distribution distractor images. Across these tasks, a diverse set of VLMs rapidly lose performance as the visual context length grows, often exhibiting a striking logarithmic decay trend. This test assesses how well VLMs can ignore irrelevant information when answering queries—a task that is quite easy for language models (LMs) in the text domain—demonstrating that current state-of-the-art VLMs lack this essential capability for many long-context applications.
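To make the augmentation idea concrete, below is a minimal illustrative sketch of how a target ("needle") image might be padded with distractor images into a longer visual context. This is not the authors' released LoCoVQA code; the function name, grid layout, and parameters are assumptions made for this example, and the paper's actual construction may differ (e.g., interleaved image sequences rather than a single composite grid).

```python
# Illustrative sketch only: NOT the authors' released LoCoVQA implementation.
# Shows one plausible way to embed a "needle" image among distractor images
# to lengthen the visual context, using Pillow. Names and layout are assumptions.
import math
import random
from PIL import Image


def compose_visual_context(needle, distractors, num_distractors, tile=224, seed=0):
    """Place the needle image among `num_distractors` distractors in a grid."""
    rng = random.Random(seed)
    chosen = rng.sample(distractors, num_distractors)
    images = chosen + [needle]
    rng.shuffle(images)  # randomize where the needle lands in the grid

    # Build a square-ish grid just large enough to hold all tiles.
    n = len(images)
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    canvas = Image.new("RGB", (cols * tile, rows * tile), "white")

    for i, img in enumerate(images):
        r, c = divmod(i, cols)
        canvas.paste(img.resize((tile, tile)), (c * tile, r * tile))

    # Return the composite image and the needle's tile index for bookkeeping.
    return canvas, images.index(needle)


# Usage (hypothetical file names): the query still refers only to the needle,
# so a model must ignore every distractor tile to answer correctly.
# needle = Image.open("ocr_target.png")
# distractors = [Image.open(p) for p in ["d1.png", "d2.png", "d3.png"]]
# context, needle_idx = compose_visual_context(needle, distractors, num_distractors=3)
```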
Anthology ID:
2024.findings-emnlp.312
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5429–5451
URL:
https://aclanthology.org/2024.findings-emnlp.312
DOI:
10.18653/v1/2024.findings-emnlp.312
Cite (ACL):
Aditya Sharma, Michael Saxon, and William Yang Wang. 2024. Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5429–5451, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts (Sharma et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.312.pdf