MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models

Yang Shi; Yifeng Xie; Minzhe Guo; Liangsi Lu; Mingxuan Huang; Jingchao Wang; Zhihong Zhu; Boyan Xu; Zhiqi Huang

Abstract

Recent advances in Vision-Language Models (VLMs) have improved performance in multi-modal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multi-modal benchmark of 1997 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 12 representative VLMs, and even the best model, Gemini-3-Pro-Preview, classifies the error correctly in only 66.65% of cases, underscoring the challenge of identifying erroneous reasoning. Furthermore, the ability to accurately identify errors offers valuable insights into the capabilities of multi-modal models.

Project Page: https://mmerror-benchmark.github.io

Anthology ID: 2026.acl-long.2083
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 44973–44986
URL: https://aclanthology.org/2026.acl-long.2083/
Cite (ACL): Yang Shi, Yifeng Xie, Minzhe Guo, Liangsi Lu, Mingxuan Huang, Jingchao Wang, Zhihong Zhu, Boyan Xu, and Zhiqi Huang. 2026. MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 44973–44986, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models (Shi et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.2083.pdf
Checklist: 2026.acl-long.2083.checklist.pdf
