Benchmarking Table Extraction: Multimodal LLMs vs Traditional OCR

Guilherme Nunes; Vitor Rolla; Duarte Pereira; Vasco Alves; André Carreiro; Márcia Baptista

doi:10.18653/v1/2025.xllm-1.2

Abstract

This paper compares two approaches to extracting tables from images: deep learning computer vision and Multimodal Large Language Models (MLLMs). Computer vision models such as the Table Transformer (TATR) have improved the extraction of complex table layouts by combining deep learning for structure recognition with traditional Optical Character Recognition (OCR). MLLMs, which process both text and image inputs, offer a different approach that could bypass the limitations of TATR-plus-OCR pipelines altogether. Models such as GPT-4o, Phi-3 Vision, and Granite Vision 3.2 demonstrate that MLLMs can analyze and interpret table images directly, with the potential for more accurate and robust extraction. Both approaches are evaluated with Grid Table Similarity (GriTS), a state-of-the-art metric that gives a nuanced view of how well each one recovers table structure and cell text content. Using the PubTables-1M dataset, a comprehensive and widely used benchmark in the field, this study highlights the strengths and limitations of each approach and sets the stage for future work on table extraction. Deep learning computer vision techniques retain a slight edge in extracting table structural layout, but MLLMs are far better at extracting cell text content.
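To make the idea of scoring cell text content concrete, the following is a minimal sketch of a content-similarity score in the spirit of GriTS. It is not the paper's evaluation code and not the full GriTS metric: it assumes the ground-truth and predicted grids are already aligned position by position (GriTS itself searches for the most similar substructures), and the function names `lcs_length`, `cell_similarity`, and `simplified_content_score` are hypothetical. Cell similarity is taken to be a normalized longest common subsequence, which is an assumption about the content variant of the metric.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def cell_similarity(a: str, b: str) -> float:
    """Normalized LCS similarity in [0, 1]; two empty cells count as identical."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def simplified_content_score(truth: list[list[str]], pred: list[list[str]]) -> float:
    """F-score-style similarity between two grids of cell text.

    Simplification (assumption): cells are compared at identical (row, column)
    positions only; no search for the best-matching substructure is performed.
    """
    total = 0.0
    for row_t, row_p in zip(truth, pred):
        for cell_t, cell_p in zip(row_t, row_p):
            total += cell_similarity(cell_t, cell_p)
    n_truth = sum(len(row) for row in truth)
    n_pred = sum(len(row) for row in pred)
    if n_truth + n_pred == 0:
        return 1.0
    return 2 * total / (n_truth + n_pred)
```

Under these assumptions, an exact prediction scores 1.0, while missing rows or OCR-style character errors lower the score in proportion to the cells and characters affected.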

Anthology ID: 2025.xllm-1.2
Volume: Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)
Month: August
Year: 2025
Address: Vienna, Austria
Editors: Hao Fei, Kewei Tu, Yuhui Zhang, Xiang Hu, Wenjuan Han, Zixia Jia, Zilong Zheng, Yixin Cao, Meishan Zhang, Wei Lu, N. Siddharth, Lilja Øvrelid, Nianwen Xue, Yue Zhang
Venues: XLLM | WS
Publisher: Association for Computational Linguistics
Pages: 8–15
URL: https://aclanthology.org/2025.xllm-1.2/
DOI: 10.18653/v1/2025.xllm-1.2
Cite (ACL): Guilherme Nunes, Vitor Rolla, Duarte Pereira, Vasco Alves, Andre Carreiro, and Márcia Baptista. 2025. Benchmarking Table Extraction: Multimodal LLMs vs Traditional OCR. In Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025), pages 8–15, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Benchmarking Table Extraction: Multimodal LLMs vs Traditional OCR (Nunes et al., XLLM 2025)
PDF: https://aclanthology.org/2025.xllm-1.2.pdf