StruNRAG: Evaluation of OCR-Induced Structural Noise on RAG Robustness

Mengna Gao; Dapeng Yin; Shuyue Zhu; Bingxuan Hou; Zhanpeng Ni; Junli Wang

Abstract

Retrieval-Augmented Generation (RAG) systems rely on Optical Character Recognition (OCR) to ingest knowledge from unstructured documents. However, OCR engines often struggle with complex layouts, introducing Structural Noise, such as line insertion and paragraph interleaving, which disrupts the semantic flow of the text. Existing evaluations largely overlook this dimension, operating on the assumption of structurally perfect input. To bridge this gap, we introduce StruNRAG, a dedicated benchmark for evaluating RAG robustness against OCR-induced structural perturbations. We construct a bilingual dataset of 2,132 question-answer pairs derived from complex Chinese and English documents and systematically inject three categories of real-world structural noise: line insertion, paragraph interleaving, and line interleaving. Our evaluation of mainstream retrievers and Large Language Models (LLMs) reveals a nuanced interaction between noise and pipeline stages: while structural distortions consistently degrade retrieval performance, the generation stage exhibits unexpected robustness. Advanced LLMs demonstrate robustness against local noise (e.g., line insertion), but struggle to maintain reasoning capabilities under severe structural disruption that fragments global context. These findings indicate that while LLMs are capable of compensating for minor parsing errors, future RAG optimizations must take into account the effects of structural noise. Our code and datasets are available at [https://github.com/GaoMengnana/StruNRAG](https://github.com/GaoMengnana/StruNRAG).
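
The three noise categories named in the abstract can be illustrated with a small sketch. This is not the authors' injection code (see the linked repository for that). It assumes simple definitions: line insertion drops stray lines such as page headers into the text, line interleaving alternates lines from two side-by-side columns, and paragraph interleaving alternates whole paragraphs from two reading flows. All function and parameter names here are hypothetical.

```python
import random
from typing import List, Sequence


def alternate(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Alternate items from a and b, keeping leftovers from the longer one."""
    out: List[str] = []
    for i in range(max(len(a), len(b))):
        if i < len(a):
            out.append(a[i])
        if i < len(b):
            out.append(b[i])
    return out


def line_insertion(lines: Sequence[str], stray: Sequence[str],
                   k: int, rng: random.Random) -> List[str]:
    """Assumed definition: insert k stray lines (headers, captions, ...)
    at random positions, leaving the original line order intact."""
    out = list(lines)
    for _ in range(k):
        out.insert(rng.randint(0, len(out)), rng.choice(list(stray)))
    return out


def line_interleaving(left_column: Sequence[str],
                      right_column: Sequence[str]) -> List[str]:
    """Assumed definition: a layout-unaware reader scans straight across
    two columns, so their lines end up alternating."""
    return alternate(left_column, right_column)


def paragraph_interleaving(flow_a: Sequence[str],
                           flow_b: Sequence[str]) -> List[str]:
    """Assumed definition: whole paragraphs from two reading flows
    are emitted in alternation."""
    return alternate(flow_a, flow_b)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    body = ["RAG systems ingest text", "from scanned documents", "via OCR."]
    print(line_insertion(body, ["Page 3", "Figure 2: Overview"], 1, rng))
    print(line_interleaving(["L1", "L2"], ["R1", "R2"]))
    print(paragraph_interleaving(["Para A1", "Para A2"], ["Para B1"]))
```

Under these assumptions, line insertion is a local perturbation (each original sentence survives intact), while the two interleaving variants break up the global reading order, which matches the local versus global distinction the abstract draws.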

Anthology ID: 2026.findings-acl.955
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 19129–19148
URL: https://aclanthology.org/2026.findings-acl.955/
Cite (ACL): Mengna Gao, Dapeng Yin, Shuyue Zhu, Bingxuan Hou, Zhanpeng Ni, and Junli Wang. 2026. StruNRAG: Evaluation of OCR-Induced Structural Noise on RAG Robustness. In Findings of the Association for Computational Linguistics: ACL 2026, pages 19129–19148, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): StruNRAG: Evaluation of OCR-Induced Structural Noise on RAG Robustness (Gao et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.955.pdf
Checklist: 2026.findings-acl.955.checklist.pdf
