LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation

Yiqun Shen; Song Yuan; Zhengze Zhang; Xiaoliang Wang; Daxin Jiang; Cam-Tu Nguyen

doi:10.18653/v1/2025.findings-emnlp.737

Abstract

KV Cache is commonly used to accelerate LLM inference with long contexts, yet its high memory demand drives the need for cache compression. Existing compression methods, however, are largely heuristic and lack dynamic budget allocation. To address this limitation, we introduce a unified framework for cache compression by minimizing information loss in Transformer residual streams. Building on it, we analyze the layer attention output loss and derive a new metric to compare cache entries across heads, enabling layer-wise compression with dynamic head budgets. Additionally, by contrasting cross-layer information, we also achieve dynamic layer budgets. LAVa is the first unified strategy for cache eviction and dynamic budget allocation that, unlike prior methods, does not rely on training or the combination of multiple strategies. Experiments with four benchmarks (LongBench, Needle-In-A-Haystack, Ruler, and InfiniteBench) demonstrate its superiority over strong baselines. Moreover, our experiments reveal a new insight: dynamic layer budgets are crucial for generation tasks (e.g., code completion), while dynamic head budgets play a key role in extraction tasks (e.g., extractive QA). As a fully dynamic compression method, LAVa consistently maintains top performance across task types.
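To make the idea of layer-wise compression with dynamic head budgets more concrete, here is a minimal sketch in Python. It is not the paper's method: the abstract does not give the metric, so the per-entry scores are treated as hypothetical inputs that are assumed to be comparable across heads. The function name `layerwise_evict` and its parameters are invented for this illustration. The sketch only shows that, once entries from all heads in a layer compete for one shared budget, each head's budget falls out of the selection instead of being fixed in advance.

```python
import heapq
from typing import List, Tuple


def layerwise_evict(
    scores: List[List[float]], layer_budget: int
) -> Tuple[List[List[bool]], List[int]]:
    """Keep the `layer_budget` highest-scoring cache entries in one layer.

    `scores[h][t]` is a hypothetical importance score for the cache entry at
    position `t` of head `h`. The scores are assumed to be comparable across
    heads; how such scores are derived is not covered by this sketch.
    """
    # Flatten to (score, head, position) so entries from different heads
    # compete directly for the same layer-level budget.
    entries = [(s, h, t) for h, row in enumerate(scores) for t, s in enumerate(row)]
    budget = max(0, min(layer_budget, len(entries)))
    kept = heapq.nlargest(budget, entries, key=lambda e: e[0])

    # Build a keep-mask with the same shape as `scores`.
    mask = [[False] * len(row) for row in scores]
    for _, h, t in kept:
        mask[h][t] = True

    # Per-head budgets emerge from the selection; they are not preset.
    head_budgets = [sum(row) for row in mask]
    return mask, head_budgets


# Two heads, four cached positions each, shared budget of three entries.
scores = [[0.9, 0.8, 0.1, 0.2], [0.3, 0.05, 0.7, 0.0]]
mask, head_budgets = layerwise_evict(scores, 3)
print(head_budgets)  # [2, 1]
```

In this toy input the first head retains two entries and the second retains one, whereas a uniform split would have forced the same count on both heads. Allocating budgets across layers, which the abstract also describes, is outside the scope of this sketch.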

Anthology ID: 2025.findings-emnlp.737
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 13672–13692
URL: https://aclanthology.org/2025.findings-emnlp.737/
DOI: 10.18653/v1/2025.findings-emnlp.737
Cite (ACL): Yiqun Shen, Song Yuan, Zhengze Zhang, Xiaoliang Wang, Daxin Jiang, and Cam-Tu Nguyen. 2025. LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 13672–13692, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation (Shen et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.737.pdf
Checklist: 2025.findings-emnlp.737.checklist.pdf
