Question Tells You Where the Answer Is: Intention-aware Long-Context KV Cache Compression

Liang Zhao (赵亮); Xiaocheng Feng (冯骁骋); Weihong Zhong; Lei Huang (黄磊); Kun Zhu (朱坤); Baoxin Wang; Dayong Wu; Guoping Hu; Ting Liu; Bing Qin (秦兵)

Abstract

Larger context windows greatly extend the capabilities of large language models, but they also incur prohibitive memory overhead and computational latency because of the growing Key-Value (KV) cache. Recent KV cache compression methods reduce the cache size by dropping irrelevant KVs. However, these methods often fail to accurately separate the KVs that are crucial for generation from those that can be discarded, resulting in severe information loss. To address this gap, we propose **IntentKV**, an intention-aware KV cache eviction method that identifies and retains crucial KVs according to the attention distribution of the intention, which semantically reflects the user’s goal and determines which part of the context is relevant. The consistency between the semantics and the attention distribution is further substantiated through carefully designed experiments. On this basis, IntentKV first distinguishes intention tokens from ordinary context tokens based on the distances between their attention distributions. Then, block-wise cumulative attention is calculated by aggregating the attention of the intention tokens. Finally, blocks that receive high cumulative attention are selected and stored in the KV cache. We evaluate our method across diverse long-context tasks and models. Results demonstrate that IntentKV effectively maintains model performance while reducing the KV cache size from 128K to 2K, leading to a 6.3x increase in decoding speed and a 7.8x improvement in memory efficiency compared to the default setting.
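The last two steps described in the abstract (block-wise aggregation of intention-token attention, then selection of the highest-scoring blocks) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_blocks`, the parameters `block_size` and `num_blocks_to_keep`, and the tie-breaking rule are all hypothetical, and the intention token rows are passed in directly because the abstract does not specify the distance measure used to identify them.

```python
from typing import List, Sequence


def select_blocks(
    attn: Sequence[Sequence[float]],
    intention_rows: Sequence[int],
    block_size: int,
    num_blocks_to_keep: int,
) -> List[int]:
    """Return the sorted context positions whose KV entries are kept.

    attn[i][j] is the attention weight that token i places on context
    position j. intention_rows lists the rows assumed to be intention
    tokens (an input here; how they are detected is not modeled).
    """
    num_keys = len(attn[0])
    num_blocks = (num_keys + block_size - 1) // block_size

    # Accumulate the attention of the intention tokens per block.
    block_scores = [0.0] * num_blocks
    for row in intention_rows:
        for pos, weight in enumerate(attn[row]):
            block_scores[pos // block_size] += weight

    # Keep the highest-scoring blocks; ties go to the earlier block
    # (an arbitrary choice made for this sketch).
    ranked = sorted(range(num_blocks), key=lambda b: (-block_scores[b], b))
    kept_blocks = sorted(ranked[:num_blocks_to_keep])

    # Expand the kept blocks back into context positions.
    kept_positions: List[int] = []
    for b in kept_blocks:
        start = b * block_size
        kept_positions.extend(range(start, min(start + block_size, num_keys)))
    return kept_positions


# Toy example: rows 0 and 1 are treated as intention tokens.
toy_attn = [
    [0.1, 0.1, 0.1, 0.1, 0.3, 0.3],
    [0.05, 0.05, 0.4, 0.4, 0.05, 0.05],
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0],
]
kept = select_blocks(toy_attn, intention_rows=[0, 1], block_size=2, num_blocks_to_keep=2)
```

In this toy input the row that is not marked as an intention token (row 2) concentrates its attention on the first block, yet that block is evicted because only the intention rows contribute to the block scores.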

Anthology ID: 2026.acl-long.1250
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 27153–27169
URL: https://aclanthology.org/2026.acl-long.1250/
Cite (ACL): Liang Zhao, Xiaocheng Feng, Weihong Zhong, Lei Huang, Kun Zhu, Baoxin Wang, Dayong Wu, Guoping Hu, Ting Liu, and Bing Qin. 2026. Question Tells You Where the Answer Is: Intention-aware Long-Context KV Cache Compression. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27153–27169, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Question Tells You Where the Answer Is: Intention-aware Long-Context KV Cache Compression (Zhao et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1250.pdf
Checklist: 2026.acl-long.1250.checklist.pdf
