When Good OCR Is Not Enough: Benchmarking OCR Robustness for Retrieval-Augmented Generation

Lin Sun; Wangdexian; Jingang Huang; Linglin Zhang; Change Jia; Zhengwei Cheng; Xiangzheng Zhang

Abstract

Industrial Retrieval-Augmented Generation (RAG) systems depend on optical character recognition (OCR) to transform visual documents into text. Existing OCR benchmarks rely on character-level metrics, which inadequately measure downstream RAG effectiveness under real-world conditions. We introduce an OCR benchmark for industrial RAG systems covering 11 challenging document types, including extreme layouts, high-resolution pages, complex or watermarked backgrounds, historical documents with non-standard reading orders, visually decorated text, and documents containing tables and mathematical formulas. Evaluating recent SOTA OCR models under a controlled OCR-first RAG pipeline shows clear performance degradation on realistic industrial documents despite strong conventional benchmark scores. We find that high OCR accuracy does not necessarily translate into strong downstream RAG performance: structural and semantic errors can cause substantial retrieval failures even when WER/CER remains low. Further analysis shows that this mismatch is category-dependent, arises through both retrieval-side and downstream generation-side failures, and remains stable across representative OCR-first pipeline choices. The benchmark is publicly available at https://github.com/Qihoo360/InduOCRBench.
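The gap between character-level accuracy and downstream retrieval can be illustrated with a toy case. The sketch below is not taken from the paper or its benchmark: the invoice strings are hypothetical, the `levenshtein`, `cer`, and `wer` helpers are written here for illustration, and an exact substring match stands in for a real retriever. It shows a single misread digit producing a low CER while still making the document unreachable for a query on the invoice number.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical ground truth and OCR output; only one digit differs.
reference = "Invoice 2024-118 total due: 1,250.00 USD by 15 March"
hypothesis = "Invoice 2024-113 total due: 1,250.00 USD by 15 March"

# Stand-in for retrieval: does the OCR text contain the queried identifier?
query_term = "2024-118"
found = query_term in hypothesis

print(f"CER = {cer(reference, hypothesis):.3f}")
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"query term found in OCR output: {found}")
```

A dense or hybrid retriever would behave less abruptly than the substring check used here, so this is only a caricature of the failure mode the abstract describes, in which an error that barely moves WER/CER falls on exactly the token a query depends on.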

Anthology ID: 2026.acl-industry.60
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Yunyao Li, Georg Rehm, Mei Tu
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 884–894
URL: https://aclanthology.org/2026.acl-industry.60/
Cite (ACL): Lin Sun, Wangdexian, Jingang Huang, Linglin Zhang, Change Jia, Zhengwei Cheng, and Xiangzheng Zhang. 2026. When Good OCR Is Not Enough: Benchmarking OCR Robustness for Retrieval-Augmented Generation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 884–894, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): When Good OCR Is Not Enough: Benchmarking OCR Robustness for Retrieval-Augmented Generation (Sun et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-industry.60.pdf
