@inproceedings{thawakar-etal-2025-llamav,
title = "{L}lama{V}-o1: Rethinking Step-by-step Visual Reasoning in {LLM}s",
author = "Thawakar, Omkar and
Dissanayake, Dinura and
More, Ketan Pravin and
Thawkar, Ritesh and
Heakl, Ahmed and
Ahsan, Noor and
Li, Yuhao and
Zumri, Ilmuz Zaman Mohammed and
Lahoud, Jean and
Anwer, Rao Muhammad and
Cholakkal, Hisham and
Laptev, Ivan and
Shah, Mubarak and
Khan, Fahad Shahbaz and
Khan, Salman",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.1247/",
doi = "10.18653/v1/2025.findings-acl.1247",
pages = "24290--24315",
ISBN = "979-8-89176-256-5",
abstract = "Step-by-step reasoning is crucial for solving complex visual tasks, yet existing approaches lack a comprehensive framework for evaluating this capability and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing multi-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a Visual Reasoning Chain Benchmark, the most comprehensive benchmark for multi-step visual reasoning, covering eight diverse categories and over 4k reasoning steps. This enables rigorous evaluation of LMMs' ability to reason accurately and interpretably across multiple steps. Second, we propose a fine-grained reasoning metric that evaluates correctness and logical coherence at each step, providing deeper insights beyond traditional accuracy metrics. Third, we introduce LlamaV-o1, a state-of-the-art multimodal reasoning model trained using a multi-step curriculum learning approach. LlamaV-o1 is optimized for structured, step-by-step reasoning and significantly outperforms existing open-source models. It surpasses Llava-CoT with a 3.8{\%} absolute gain across six benchmarks, achieving an average score of 67.3 while being 5x faster during inference scaling. Our benchmark, model, and code is available at https://github.com/mbzuai-oryx/LlamaV-o1."
}<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="thawakar-etal-2025-llamav">
<titleInfo>
<title>LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs</title>
</titleInfo>
<name type="personal">
<namePart type="given">Omkar</namePart>
<namePart type="family">Thawakar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Dinura</namePart>
<namePart type="family">Dissanayake</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ketan</namePart>
<namePart type="given">Pravin</namePart>
<namePart type="family">More</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ritesh</namePart>
<namePart type="family">Thawkar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ahmed</namePart>
<namePart type="family">Heakl</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Noor</namePart>
<namePart type="family">Ahsan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yuhao</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ilmuz</namePart>
<namePart type="given">Zaman</namePart>
<namePart type="given">Mohammed</namePart>
<namePart type="family">Zumri</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jean</namePart>
<namePart type="family">Lahoud</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Rao</namePart>
<namePart type="given">Muhammad</namePart>
<namePart type="family">Anwer</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hisham</namePart>
<namePart type="family">Cholakkal</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ivan</namePart>
<namePart type="family">Laptev</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mubarak</namePart>
<namePart type="family">Shah</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Fahad</namePart>
<namePart type="given">Shahbaz</namePart>
<namePart type="family">Khan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Salman</namePart>
<namePart type="family">Khan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: ACL 2025</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-256-5</identifier>
</relatedItem>
<abstract>Step-by-step reasoning is crucial for solving complex visual tasks, yet existing approaches lack a comprehensive framework for evaluating this capability and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing multi-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a Visual Reasoning Chain Benchmark, the most comprehensive benchmark for multi-step visual reasoning, covering eight diverse categories and over 4k reasoning steps. This enables rigorous evaluation of LMMs’ ability to reason accurately and interpretably across multiple steps. Second, we propose a fine-grained reasoning metric that evaluates correctness and logical coherence at each step, providing deeper insights beyond traditional accuracy metrics. Third, we introduce LlamaV-o1, a state-of-the-art multimodal reasoning model trained using a multi-step curriculum learning approach. LlamaV-o1 is optimized for structured, step-by-step reasoning and significantly outperforms existing open-source models. It surpasses Llava-CoT with a 3.8% absolute gain across six benchmarks, achieving an average score of 67.3 while being 5x faster during inference scaling. Our benchmark, model, and code are available at https://github.com/mbzuai-oryx/LlamaV-o1.</abstract>
<identifier type="citekey">thawakar-etal-2025-llamav</identifier>
<identifier type="doi">10.18653/v1/2025.findings-acl.1247</identifier>
<location>
<url>https://aclanthology.org/2025.findings-acl.1247/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>24290</start>
<end>24315</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
%A Thawakar, Omkar
%A Dissanayake, Dinura
%A More, Ketan Pravin
%A Thawkar, Ritesh
%A Heakl, Ahmed
%A Ahsan, Noor
%A Li, Yuhao
%A Zumri, Ilmuz Zaman Mohammed
%A Lahoud, Jean
%A Anwer, Rao Muhammad
%A Cholakkal, Hisham
%A Laptev, Ivan
%A Shah, Mubarak
%A Khan, Fahad Shahbaz
%A Khan, Salman
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Findings of the Association for Computational Linguistics: ACL 2025
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-256-5
%F thawakar-etal-2025-llamav
%X Step-by-step reasoning is crucial for solving complex visual tasks, yet existing approaches lack a comprehensive framework for evaluating this capability and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing multi-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a Visual Reasoning Chain Benchmark, the most comprehensive benchmark for multi-step visual reasoning, covering eight diverse categories and over 4k reasoning steps. This enables rigorous evaluation of LMMs’ ability to reason accurately and interpretably across multiple steps. Second, we propose a fine-grained reasoning metric that evaluates correctness and logical coherence at each step, providing deeper insights beyond traditional accuracy metrics. Third, we introduce LlamaV-o1, a state-of-the-art multimodal reasoning model trained using a multi-step curriculum learning approach. LlamaV-o1 is optimized for structured, step-by-step reasoning and significantly outperforms existing open-source models. It surpasses Llava-CoT with a 3.8% absolute gain across six benchmarks, achieving an average score of 67.3 while being 5x faster during inference scaling. Our benchmark, model, and code are available at https://github.com/mbzuai-oryx/LlamaV-o1.
%R 10.18653/v1/2025.findings-acl.1247
%U https://aclanthology.org/2025.findings-acl.1247/
%U https://doi.org/10.18653/v1/2025.findings-acl.1247
%P 24290-24315
Markdown (Informal)
[LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs](https://aclanthology.org/2025.findings-acl.1247/) (Thawakar et al., Findings 2025)
ACL
Omkar Thawakar, Dinura Dissanayake, Ketan Pravin More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Ilmuz Zaman Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, and Salman Khan. 2025. LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24290–24315, Vienna, Austria. Association for Computational Linguistics.
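
The abstract describes a fine-grained metric that scores each reasoning step for correctness and logical coherence. The snippet below is only an illustrative sketch of that step-wise evaluation idea, not the paper's actual metric or code from the linked repository; every name (StepScore, score_reasoning_chain) and the similarity function are hypothetical stand-ins.

```python
"""Toy step-wise reasoning scorer, sketched from the abstract's description.

Assumption: predicted and reference reasoning chains are aligned lists of
strings. A real implementation would use a semantic similarity model instead
of character-level matching.
"""
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class StepScore:
    correctness: float  # similarity of a predicted step to its reference step
    coherence: float    # similarity of a step to the step that precedes it


def _similarity(a: str, b: str) -> float:
    # Cheap lexical stand-in for a learned similarity measure.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_reasoning_chain(predicted: list[str], reference: list[str]) -> list[StepScore]:
    """Score each predicted step against its aligned reference step."""
    scores = []
    for i, step in enumerate(predicted):
        ref = reference[i] if i < len(reference) else ""
        correctness = _similarity(step, ref)
        coherence = _similarity(step, predicted[i - 1]) if i > 0 else 1.0
        scores.append(StepScore(correctness, coherence))
    return scores


if __name__ == "__main__":
    pred = ["The image shows three apples.",
            "Three apples at 2 dollars each cost 6 dollars."]
    gold = ["There are three apples in the image.",
            "3 x 2 dollars = 6 dollars in total."]
    for i, s in enumerate(score_reasoning_chain(pred, gold), start=1):
        print(f"step {i}: correctness={s.correctness:.2f} coherence={s.coherence:.2f}")
```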