MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space

Anshul Singh; Chris Biemann; Jan Strich

doi:10.18653/v1/2025.findings-emnlp.1083

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in interpreting visual layouts and text. However, a significant challenge remains in their ability to robustly interpret and reason over multi-tabular data presented as images, a common occurrence in real-world scenarios such as web pages and digital documents. Existing benchmarks typically address single tables or non-visual data (text or structured formats). This leaves a critical gap: they do not assess the ability to parse diverse table images, correlate information across them, and perform multi-hop reasoning on the combined visual data. To bridge this evaluation gap, we introduce MTabVQA, a novel benchmark specifically designed for multi-tabular visual question answering. MTabVQA comprises 3,745 complex question-answer pairs that necessitate multi-hop reasoning across several visually rendered table images. We provide extensive benchmark results for state-of-the-art VLMs on MTabVQA, revealing significant performance limitations. We further investigate post-training techniques to enhance these reasoning abilities and release MTabVQA-Instruct, a large-scale instruction-tuning dataset. Our experiments show that fine-tuning VLMs with MTabVQA-Instruct substantially improves their performance on visual multi-tabular reasoning. Code and dataset are available online.

Anthology ID: 2025.findings-emnlp.1083
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 19866–19891
URL: https://aclanthology.org/2025.findings-emnlp.1083/
DOI: 10.18653/v1/2025.findings-emnlp.1083
Cite (ACL): Anshul Singh, Chris Biemann, and Jan Strich. 2025. MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 19866–19891, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space (Singh et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.1083.pdf
Checklist: 2025.findings-emnlp.1083.checklist.pdf
