Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering

Songtao Jiang; Chenyi Zhou; Yan Zhang; Yeying Jin; Zuozhu Liu

doi:10.18653/v1/2025.acl-short.41

Abstract

Multimodal large language models (MLLMs) still struggle with complex reasoning tasks in Visual Question Answering (VQA). While current methods have advanced by incorporating visual prompts, our study uncovers critical limitations: these approaches indiscriminately annotate all detected objects for every visual question, generating excessive visual markers that degrade task performance. This issue stems primarily from a lack of focus on key visual elements, raising two important questions: Are all objects equally important, and do all questions require visual prompts? Motivated by Dual Process Theory, which distinguishes between instinctive and deliberate cognitive modes in human reasoning, we propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions, combining fast intuitive judgments with deliberate analytical reasoning to enhance the vision-language reasoning capability of the MLLM. For straightforward questions, FOCUS supports efficient zero-shot reasoning. For more complex tasks, it employs a conceptualizing-before-observation strategy to highlight critical elements. Extensive experiments on four benchmarks (ScienceQA, TextQA, VizWiz, and MME) demonstrate that FOCUS consistently improves the performance of both open-source and black-box MLLMs, achieving significant gains across all datasets. Ablation studies further validate the importance of combining diverse cognitive strategies with refined visual information for superior performance. Code will be released.

Anthology ID: 2025.acl-short.41
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 525–534
URL: https://aclanthology.org/2025.acl-short.41/
DOI: 10.18653/v1/2025.acl-short.41
Cite (ACL): Songtao Jiang, Chenyi Zhou, Yan Zhang, Yeying Jin, and Zuozhu Liu. 2025. Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 525–534, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering (Jiang et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-short.41.pdf
