RefreshKV: Updating Small KV Cache During Long-form Generation

Fangyuan Xu; Tanya Goyal; Eunsol Choi

doi:10.18653/v1/2025.acl-long.1211

Abstract

Generating long sequences of tokens given a long-context input is a very compute-intensive inference scenario for large language models (LLMs). One prominent inference speed-up approach is constructing a smaller key-value (KV) cache, relieving LLMs from computing attention over a long sequence of tokens. While such methods work well to generate short sequences, their performance degrades rapidly for long-form generation. Most KV compression happens once, prematurely removing tokens that can be useful later in the generation. We propose a new inference-time method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation. After each full attention step, we update the smaller KV cache based on the attention pattern over the entire input. Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks. Lastly, we show that continued pretraining with our inference setting brings further gains in performance.
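The alternation described in the abstract can be illustrated with a toy, single-head sketch. This is not the authors' implementation: the fixed `stride` schedule, the top-k selection by attention weight, and the names `attend`, `top_k_indices`, and `decode_with_refresh` are assumptions made for illustration, and the sketch ignores the KV entries of newly generated tokens.

```python
import math

def attend(query, keys, values):
    """Single-head softmax attention; returns (output vector, attention weights)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

def top_k_indices(weights, k):
    """Positions of the k largest attention weights, returned in original order."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])

def decode_with_refresh(queries, keys, values, cache_size, stride):
    """Hypothetical schedule: full attention every `stride` steps, small cache otherwise."""
    cache = []  # indices of the input tokens currently held in the small cache
    outputs, log = [], []
    for step, query in enumerate(queries):
        if step % stride == 0:
            # Full attention step: attend to every input token, then refresh the cache
            # from the attention pattern over the entire input.
            out, weights = attend(query, keys, values)
            cache = top_k_indices(weights, cache_size)
            log.append(("full", list(cache)))
        else:
            # Cheap step: attend only to the tokens kept in the small cache.
            out, _ = attend(query, [keys[i] for i in cache], [values[i] for i in cache])
            log.append(("small", list(cache)))
        outputs.append(out)
    return outputs, log

# Toy input: 4 context tokens, 2-dimensional keys, 1-dimensional values.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
queries = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]

outputs, log = decode_with_refresh(queries, keys, values, cache_size=2, stride=2)
for kind, cached in log:
    print(kind, cached)
```

In this toy run the small cache holds tokens 0 and 2 after the first full step and tokens 1 and 2 after the second, so token 1 becomes available again once the query changes. A one-time compression made at step 0 would have dropped it for the rest of the generation.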

Anthology ID: 2025.acl-long.1211
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 24878–24893
URL: https://aclanthology.org/2025.acl-long.1211/
DOI: 10.18653/v1/2025.acl-long.1211
Cite (ACL): Fangyuan Xu, Tanya Goyal, and Eunsol Choi. 2025. RefreshKV: Updating Small KV Cache During Long-form Generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24878–24893, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): RefreshKV: Updating Small KV Cache During Long-form Generation (Xu et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.1211.pdf
