How to Efficiently Explore Noisy Historical Data? Leveraging Corpus Pre-Targeting to Enhance Graph-based RAG

Donghan Bian; Marie Puren; Florian Cafiero

Abstract

Graph-based Retrieval-Augmented Generation (RAG) is increasingly used to explore long, heterogeneous, and weakly structured corpora, including historical archives. However, in such settings, naive full-corpus indexing is often computationally costly and sensitive to OCR noise, document redundancy, and topical dispersion. In this paper, we investigate corpus pre-targeting strategies as an intermediate layer to improve the efficiency and effectiveness of graph-based RAG for historical research. We evaluate a set of pre-targeting heuristics tailored to single-hop and multi-hop historical questions on HistoriQA-ThirdRepublic, a French question-answering dataset derived from parliamentary debates and contemporary newspapers. Our results show that appropriate pre-targeting strategies can improve retrieval recall by 3–5% while reducing token consumption by 32–37% compared to full-corpus indexing, without degrading coverage of relevant documents. Beyond performance gains, this work highlights the importance of corpus-level optimization for applying RAG to large-scale historical collections, and provides practical insights for adapting graph-based RAG pipelines to the specific constraints of digitized archives.

Anthology ID: 2026.latechclfl-1.23
Volume: Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Diego Alves, Yuri Bizzoni, Stefania Degaetano-Ortlieb, Anna Kazantseva, Janis Pagel, Stan Szpakowicz
Venues: LaTeCH-CLfL | WS
SIG: SIGHUM
Publisher: Association for Computational Linguistics
Pages: 241–250
URL: https://aclanthology.org/2026.latechclfl-1.23/
Cite (ACL): Donghan Bian, Marie Puren, and Florian Cafiero. 2026. How to Efficiently Explore Noisy Historical Data? Leveraging Corpus Pre-Targeting to Enhance Graph-based RAG. In Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026, pages 241–250, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): How to Efficiently Explore Noisy Historical Data? Leveraging Corpus Pre-Targeting to Enhance Graph-based RAG (Bian et al., LaTeCH-CLfL 2026)
PDF: https://aclanthology.org/2026.latechclfl-1.23.pdf
Supplementary material: 2026.latechclfl-1.23.SupplementaryMaterial.txt
