@inproceedings{lu-etal-2025-turborag,
title = "{T}urbo{RAG}: Accelerating Retrieval-Augmented Generation with Precomputed {KV} Caches for Chunked Text",
author = "Lu, Songshuo and
Wang, Hua and
Rong, Yutian and
Chen, Zhi and
Tang, Yaohua",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.334/",
pages = "6599--6612",
ISBN = "979-8-89176-332-6",
abstract = "Current Retrieval-Augmented Generation (RAG) systems concatenate and process numerous retrieved document chunks for prefill which requires a large volume of computation, therefore leading to significant latency in time-to-first-token (TTFT). To reduce the computation overhead as well as TTFT, we introduce TurboRAG, a hybrid offline{--}online paradigm that (i) pre{-}computes chunk{-}level key-value (KV) caches, (ii) stitches them together at inference time using independent{--}attention and reordered{-}RoPE techniques, and (iii) preserves answer quality without changing the model architecture. Hence, online computation of KV caches is eliminated during inference. Our approach is applicable to most existing large language models and their applications without any requirement in modification of models and inference systems. Experimental results across a suite of RAG benchmarks demonstrate that TurboRAG reduces TTFT by up to 9.4x compared to the conventional RAG systems (on an average of 8.6x), but reserving comparable performance to the standard RAG systems."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="lu-etal-2025-turborag">
<titleInfo>
<title>TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text</title>
</titleInfo>
<name type="personal">
<namePart type="given">Songshuo</namePart>
<namePart type="family">Lu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hua</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yutian</namePart>
<namePart type="family">Rong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhi</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yaohua</namePart>
<namePart type="family">Tang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing</title>
</titleInfo>
<name type="personal">
<namePart type="given">Christos</namePart>
<namePart type="family">Christodoulopoulos</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tanmoy</namePart>
<namePart type="family">Chakraborty</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Carolyn</namePart>
<namePart type="family">Rose</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Violet</namePart>
<namePart type="family">Peng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou, China</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-332-6</identifier>
</relatedItem>
<abstract>Current Retrieval-Augmented Generation (RAG) systems concatenate and process numerous retrieved document chunks during prefill, which requires a large volume of computation and therefore leads to significant time-to-first-token (TTFT) latency. To reduce both the computation overhead and TTFT, we introduce TurboRAG, a hybrid offline–online paradigm that (i) precomputes chunk-level key-value (KV) caches, (ii) stitches them together at inference time using independent attention and reordered RoPE techniques, and (iii) preserves answer quality without changing the model architecture. Online computation of KV caches is thus eliminated during inference. Our approach is applicable to most existing large language models and their applications without requiring any modification to models or inference systems. Experimental results across a suite of RAG benchmarks demonstrate that TurboRAG reduces TTFT by up to 9.4x (8.6x on average) compared to conventional RAG systems, while preserving performance comparable to standard RAG systems.</abstract>
<identifier type="citekey">lu-etal-2025-turborag</identifier>
<location>
<url>https://aclanthology.org/2025.emnlp-main.334/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>6599</start>
<end>6612</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text
%A Lu, Songshuo
%A Wang, Hua
%A Rong, Yutian
%A Chen, Zhi
%A Tang, Yaohua
%Y Christodoulopoulos, Christos
%Y Chakraborty, Tanmoy
%Y Rose, Carolyn
%Y Peng, Violet
%S Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-332-6
%F lu-etal-2025-turborag
%X Current Retrieval-Augmented Generation (RAG) systems concatenate and process numerous retrieved document chunks during prefill, which requires a large volume of computation and therefore leads to significant time-to-first-token (TTFT) latency. To reduce both the computation overhead and TTFT, we introduce TurboRAG, a hybrid offline–online paradigm that (i) precomputes chunk-level key-value (KV) caches, (ii) stitches them together at inference time using independent attention and reordered RoPE techniques, and (iii) preserves answer quality without changing the model architecture. Online computation of KV caches is thus eliminated during inference. Our approach is applicable to most existing large language models and their applications without requiring any modification to models or inference systems. Experimental results across a suite of RAG benchmarks demonstrate that TurboRAG reduces TTFT by up to 9.4x (8.6x on average) compared to conventional RAG systems, while preserving performance comparable to standard RAG systems.
%U https://aclanthology.org/2025.emnlp-main.334/
%P 6599-6612
Markdown (Informal)
[TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text](https://aclanthology.org/2025.emnlp-main.334/) (Lu et al., EMNLP 2025)
ACL
Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, and Yaohua Tang. 2025. TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6599–6612, Suzhou, China. Association for Computational Linguistics.
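
For readers who want a concrete picture of the stitching idea summarized in the abstract, the following is a minimal sketch under assumed names and shapes, not the authors' implementation: `ChunkCache`, `precompute_kv`, and `stitch` are hypothetical stand-ins for a real attention stack. It only illustrates the bookkeeping the paper describes — chunks are encoded offline independently of one another (independent attention), and at query time their cached keys and values are concatenated while fresh, sequential position IDs are assigned across the combined context, which is the role the reordered-RoPE step plays.

```python
# Minimal sketch of chunk-level KV-cache precomputation and online stitching.
# The cache layout and vector contents are toy placeholders, not TurboRAG's code.

from dataclasses import dataclass
from typing import List


@dataclass
class ChunkCache:
    tokens: List[str]            # tokenized chunk text
    keys: List[List[float]]      # one placeholder key vector per token
    values: List[List[float]]    # one placeholder value vector per token


def precompute_kv(tokens: List[str]) -> ChunkCache:
    """Offline step: each chunk is encoded on its own, with positions starting
    at 0, so no cross-chunk attention is ever computed or stored."""
    dim = 4  # toy hidden size
    keys = [[float(i)] * dim for i in range(len(tokens))]
    values = [[float(i) + 0.5] * dim for i in range(len(tokens))]
    return ChunkCache(tokens, keys, values)


def stitch(chunks: List[ChunkCache]):
    """Online step: concatenate precomputed caches and reassign sequential
    position IDs across the combined context (the reordered-RoPE idea)."""
    keys, values, position_ids = [], [], []
    offset = 0
    for chunk in chunks:
        keys.extend(chunk.keys)
        values.extend(chunk.values)
        position_ids.extend(range(offset, offset + len(chunk.tokens)))
        offset += len(chunk.tokens)
    return keys, values, position_ids


if __name__ == "__main__":
    docs = [["RAG", "systems", "retrieve"], ["chunked", "text"]]
    caches = [precompute_kv(d) for d in docs]   # done offline, per chunk
    k, v, pos = stitch(caches)                  # done online, no prefill over chunks
    print(len(k), pos)                          # 5 [0, 1, 2, 3, 4]
```

In a real serving stack the keys and values would come from the model's attention layers, and the reassigned positions would be applied through RoPE before attention; the point of the sketch is only that the per-chunk caches are reusable across queries, so the chunk portion of prefill disappears from the online path.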