Marie Puren

2026

How to Efficiently Explore Noisy Historical Data? Leveraging Corpus Pre-Targeting to Enhance Graph-based RAG
Donghan Bian | Marie Puren | Florian Cafiero
Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026

Graph-based Retrieval-Augmented Generation (RAG) is increasingly used to explore long, heterogeneous, and weakly structured corpora, including historical archives. However, in such settings, naive full-corpus indexing is often computationally costly and sensitive to OCR noise, document redundancy, and topical dispersion. In this paper, we investigate corpus pre-targeting strategies as an intermediate layer to improve the efficiency and effectiveness of graph-based RAG for historical research.We evaluate a set of pre-targeting heuristics tailored to single-hop and multi-hop of historical questions on HistoriQA-ThirdRepublic, a French question-answering dataset derived from parliamentary debates and contemporary newspapers. Our results show that appropriate pre-targeting strategies can improve retrieval recall by 3–5% while reducing token consumption by 32–37% compared to full-corpus indexing, without degrading coverage of relevant documents.Beyond performance gains, this work highlights the importance of corpus-level optimization for applying RAG to large-scale historical collections, and provides practical insights for adapting graph-based RAG pipelines to the specific constraints of digitized archives.

2025

pdf bib abs

Évaluation automatique du retour à la source dans un contexte historique long et bruité. Application aux débats parlementaires de la Troisième République française
Julien Perez | Aurélien Pellet | Marie Puren
Actes de l'atelier Évaluation des modèles génératifs (LLM) et challenge 2025 (EvalLLM)

Dans le contexte de l’utilisation croissante des LLM, le besoin d’un retour efficace et automatique aux sources devient essentiel, en particulier pour les documents historiques. La capacité des LLM à identifier les sources pertinentes ne constitue plus seulement un maillon dans une chaîne où l’objectif final est la génération de réponses ; elle représente un enjeu fondamental de l’analyse, justifiant une évaluation à part entière. Quelles stratégies, quels modèles et quels paramètres offrent aux historiens les meilleures capacités d’exploration d’un corpus vaste et bruité ? Cet article propose une première tentative d’évaluation du retriever dans un cadre de RAG appliqué aux débats parlementaires de la Troisième République.

2022

pdf bib abs

Between History and Natural Language Processing: Study, Enrichment and Online Publication of French Parliamentary Debates of the Early Third Republic (1881-1899)
Marie Puren | Aurélien Pellet | Nicolas Bourgeois | Pierre Vernus | Fanny Lebreton
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference

We present the AGODA (Analyse sémantique et Graphes relationnels pour l’Ouverture des Débats à l’Assemblée nationale) project, which aims to create a platform for consulting and exploring digitised French parliamentary debates (1881-1940) available in the digital library of the National Library of France. This project brings together historians and NLP specialists: parliamentary debates are indeed an essential source for French history of the contemporary period, but also for linguistics. This project therefore aims to produce a corpus of texts that can be easily exploited with computational methods, and that respect the TEI standard. Ancient parliamentary debates are also an excellent case study for the development and application of tools for publishing and exploring large historical corpora. In this paper, we present the steps necessary to produce such a corpus. We detail the processing and publication chain of these documents, in particular by mentioning the problems linked to the extraction of texts from digitised images. We also introduce the first analyses that we have carried out on this corpus with “bag-of-words” techniques not too sensitive to OCR quality (namely topic modelling and word embedding).

Co-authors

Julien Perez 1

Pierre Vernus 1

Venues

Fix author