DRAGON: Domain-specific Robust Automatic Data Generation for RAG Optimization

Haiyang Shen; Hang Yan (航 颜); Zhongshi Xing; Mugeng Liu; Yue Li; Zhiyang Chen; Yuxiang Wang; Jiuzheng Wang; Yun Ma

Abstract

Retrieval-augmented generation (RAG) can substantially enhance the performance of LLMs on knowledge-intensive tasks. Various RAG paradigms, including vanilla, planning-based, and iterative RAG, all depend on a robust retriever, yet existing retrievers rely heavily on public knowledge and often falter on domain-specific queries. To address these limitations, we introduce DRAGON, a framework that combines a data-construction modeling approach with a scalable synthetic data-generation pipeline, designed to optimize domain-specific retrieval performance and improve retriever robustness. To evaluate RAG performance in domain-specific settings, we propose DRAGONBench, a benchmark spanning 8 domain-specific document collections across 4 distinct fields, with queries that vary widely in complexity, answerability, and number of hops. Leveraging DRAGON, we generate a large-scale synthetic dataset of both single-hop and multi-hop queries to enrich retriever training. Extensive experiments demonstrate that retrievers trained on this data yield significant performance gains and exhibit strong cross-domain generalization. Moreover, when our optimized retrievers are integrated into vanilla, planning-based, and iterative RAG paradigms, we observe consistent end-to-end improvements in system accuracy.

Anthology ID: 2026.findings-eacl.56
Volume: Findings of the Association for Computational Linguistics: EACL 2026
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1065–1078
URL: https://aclanthology.org/2026.findings-eacl.56/
Cite (ACL): Haiyang Shen, Hang Yan, Zhongshi Xing, Mugeng Liu, Yue Li, Zhiyang Chen, Yuxiang Wang, Jiuzheng Wang, and Yun Ma. 2026. DRAGON: Domain-specific Robust Automatic Data Generation for RAG Optimization. In Findings of the Association for Computational Linguistics: EACL 2026, pages 1065–1078, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): DRAGON: Domain-specific Robust Automatic Data Generation for RAG Optimization (Shen et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-eacl.56.pdf
Checklist: 2026.findings-eacl.56.checklist.pdf
