Yue Fei
2025
AutoEvolve: Automatically Evolving Queries for Applicable and Scalable Retrieval-Augmented Generation Benchmarking
Ding-Chu Zhang
|
Xiaowen Zhang
|
Yue Fei
|
Renjun Hu
|
Xiao-Wen Yang
|
Zhi Zhou
|
Baixuan Li
|
Yu-Feng Li
|
Xing Shi
|
Wei Lin
Findings of the Association for Computational Linguistics: EMNLP 2025
Retrieval-augmented generation (RAG) enables large language models (LLMs) to address queries beyond their internal knowledge by integrating domain knowledge in specialized corpus, which necessitates the generation of benchmarks on specific corpus to evaluate RAG systems. However, existing automated generation methods exhibit Weak Applicability and Weak Scalability. Weak Applicability refers to the reliance on metadata from specific corpora for query generation, constraining applicability to other corpora. Weak Scalability is characterized by fixed query content after generation, unable to dynamically increase difficulty, limiting scalability of the query. To overcome these issues, we propose AutoEvolve, an applicable approach for dynamically evolving queries to construct scalable RAG benchmarks. Our approach is grounded in three key innovations: (i) a corpus-agnostic method for constructing the universal entity-document graph; (ii) a suite of evolution operations designed to dynamically update queries; and (iii) a difficulty-guided metric that directs query evolution process. Through experiments on three generated benchmarks, we demonstrate that AutoEvolve evolves queries that are significantly more challenging, paving the way for more applicable and scalable RAG evaluations.