A Pipeline to Bootstrap the Evaluation of Retrieval-Augmented Generation for the Automation of Systematic Reviews in Computer Science

Pierre Achkar; Tim Gollub; Arno Simons; Harrisen Scells; Maik Fröbe; Martin Potthast

Abstract

Automating systematic reviews (SRs), i.e., evidence-driven analyses under explicit protocol constraints, is a natural target for retrieval-augmented generation and deep research agents, yet existing benchmarks evaluate isolated subtasks or assume fixed evidence inputs. We introduce RAG4SR-CS-200, a benchmark of 200 computer science systematic reviews designed for protocol-driven systematic review automation. Each instance comprises review objectives, research questions, eligibility criteria, cleaned full-text review structure, references, and extracted tables. These elements support evaluation across key tasks in systematic review creation such as literature retrieval, eligibility screening, citation-grounded review generation, and structured table generation, in both stage-wise and end-to-end settings. RAG4SR-CS-200 provides a foundation for developing more reliable and diagnosable deep research agents for scientific evidence synthesis. Code and data are publicly available (https://github.com/webis-de/rag4sr-cs-200).

Anthology ID: 2026.rag4reports-1.8
Volume: Proceedings of the 1st Workshop on Multilingual Report Generation via Retrieval Augmented Generation (RAG4Reports 2026)
Month: July
Year: 2026
Address: San Diego, CA, USA
Editors: Eugene Yang, Dawn Lawrie, Sean MacAvaney, James Mayfield, Luca Soldaini, Andrew Yates
Venues: RAG4Reports | WS
Publisher: Association for Computational Linguistics
Pages: 65–70
URL: https://aclanthology.org/2026.rag4reports-1.8/
Cite (ACL): Pierre Achkar, Tim Gollub, Arno Simons, Harrisen Scells, Maik Fröbe, and Martin Potthast. 2026. A Pipeline to Bootstrap the Evaluation of Retrieval-Augmented Generation for the Automation of Systematic Reviews in Computer Science. In Proceedings of the 1st Workshop on Multilingual Report Generation via Retrieval Augmented Generation (RAG4Reports 2026), pages 65–70, San Diego, CA, USA. Association for Computational Linguistics.
Cite (Informal): A Pipeline to Bootstrap the Evaluation of Retrieval-Augmented Generation for the Automation of Systematic Reviews in Computer Science (Achkar et al., RAG4Reports 2026)
PDF: https://aclanthology.org/2026.rag4reports-1.8.pdf
