RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

Kunlun Zhu; Yifan Luo; Dingling Xu; Yukun Yan (闫宇坤); Zhenghao Liu (刘正皓); Shi Yu (于是); Ruobing Wang; Shuo Wang; Yishan Li; Nan Zhang; Xu Han (韩旭); Zhiyuan Liu; Maosong Sun

doi:10.18653/v1/2025.acl-long.418

Abstract

Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high costs of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics—Completeness, Hallucination, and Irrelevance—to evaluate LLM-generated responses rigorously. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. Furthermore, the use of LLMs for scoring the proposed metrics demonstrates a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications. The code and dataset are released at https://github.com/OpenBMB/RAGEval.

Anthology ID: 2025.acl-long.418
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 8520–8544
URL: https://aclanthology.org/2025.acl-long.418/
DOI: 10.18653/v1/2025.acl-long.418
Cite (ACL): Kunlun Zhu, Yifan Luo, Dingling Xu, Yukun Yan, Zhenghao Liu, Shi Yu, Ruobing Wang, Shuo Wang, Yishan Li, Nan Zhang, Xu Han, Zhiyuan Liu, and Maosong Sun. 2025. RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8520–8544, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework (Zhu et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.418.pdf
