Data-Centric Perspectives on Agentic Retrieval-Augmented Generation: A Survey

Jingwen Deng; Jihao Huang; Zhen Hao Wong; Hao Liang; Quanqing Xu; Bin Cui; Wentao Zhang

Abstract

Large Language Models (LLMs) excel at natural language understanding and generation, yet their reliance on static pre-training corpora may lead to outdated knowledge, hallucinations, and limited adaptability. Retrieval-Augmented Generation (RAG) mitigates these issues by grounding model outputs in externally retrieved evidence, but conventional RAG remains constrained by a fixed retrieve-then-generate routine and struggles with multi-step reasoning and tool calls. **Agentic RAG** addresses these limitations by enabling LLM agents to actively decompose tasks, issue exploratory queries, and refine evidence through iterative retrieval. Despite growing interest, the development of Agentic RAG is impeded by *data scarcity*: unlike traditional RAG, it needs challenging tasks that demand planning, retrieval, and multiple reasoning decisions, together with correspondingly rich, interactive agent trajectories. This survey presents the first data-centric overview of Agentic RAG, framing its data lifecycle (data collection, data preprocessing and task formulation, task construction, data for evaluation, and data enhancement for training) and cataloging representative training datasets and benchmarks across domains (e.g., question answering, web, software engineering). From a data perspective, we aim to guide the creation of scalable, high-quality datasets for the next generation of adaptive, knowledge-seeking LLM agents. The project page is at https://github.com/fatty-belly/Awesome-AgenticRAG-Data/.

Anthology ID: 2026.findings-acl.78
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1570–1588
URL: https://aclanthology.org/2026.findings-acl.78/
Cite (ACL): Jingwen Deng, Jihao Huang, Zhen Hao Wong, Hao Liang, Quanqing Xu, Bin Cui, and Wentao Zhang. 2026. Data-Centric Perspectives on Agentic Retrieval-Augmented Generation: A Survey. In Findings of the Association for Computational Linguistics: ACL 2026, pages 1570–1588, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Data-Centric Perspectives on Agentic Retrieval-Augmented Generation: A Survey (Deng et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.78.pdf
Checklist: 2026.findings-acl.78.checklist.pdf
