@inproceedings{zhang-etal-2025-clusterattn,
title = "{C}luster{A}ttn: {KV} Cache Compression under Intrinsic Attention Clustering",
author = "Zhang, Minwei and
Sun, Haifeng and
Wang, Jingyu and
Li, Shaolong and
Ning, Wanyi and
Qi, Qi and
Zhuang, Zirui and
Liao, Jianxin",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.703/",
doi = "10.18653/v1/2025.acl-long.703",
pages = "14451--14473",
ISBN = "979-8-89176-251-0",
abstract = "Sparse attention can effectively alleviate the significant demands on memory when large language models (LLMs) process long contexts. Existing methods typically apply the same sparse pattern across different attention heads and inputs. However, this uniform approach fails to capture the inherent diversity of attention patterns within LLMs {---} the intrinsic attention clustering. To address this, we propose ClusterAttn, a training-free sparse attention method that provides an efficient prompt cache compression scheme under intrinsic attention clustering for efficient LLM inference.Our findings show that attention heads consistently focus on specific clusters of the prompt during decoding, a pattern detectable from an observation window at the prompt{'}s end. ClusterAttn adaptively fits these clusters utilizing a density-based attention clustering algorithm, thus compressing the KV cache of the prompt. Evaluations on different models across various benchmarks demonstrate ClusterAttn{'}s superior compression rates and efficiency. By utilizing only 1024 tokens, it can reduce memory usage by 10{\%}{--}65{\%}, resulting in a latency reduction of 12{\%}{--}23{\%} and a throughput increase of 2.6{--}4.8 times, all with nearly no accuracy loss. Additionally, ClusterAttn can handle up to 128k context on a single A100-80GB GPU, outperforming existing methods."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="zhang-etal-2025-clusterattn">
<titleInfo>
<title>ClusterAttn: KV Cache Compression under Intrinsic Attention Clustering</title>
</titleInfo>
<name type="personal">
<namePart type="given">Minwei</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Haifeng</namePart>
<namePart type="family">Sun</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jingyu</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Shaolong</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wanyi</namePart>
<namePart type="family">Ning</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Qi</namePart>
<namePart type="family">Qi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zirui</namePart>
<namePart type="family">Zhuang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jianxin</namePart>
<namePart type="family">Liao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-251-0</identifier>
</relatedItem>
<abstract>Sparse attention can effectively alleviate the significant demands on memory when large language models (LLMs) process long contexts. Existing methods typically apply the same sparse pattern across different attention heads and inputs. However, this uniform approach fails to capture the inherent diversity of attention patterns within LLMs — the intrinsic attention clustering. To address this, we propose ClusterAttn, a training-free sparse attention method that provides an efficient prompt cache compression scheme under intrinsic attention clustering for efficient LLM inference. Our findings show that attention heads consistently focus on specific clusters of the prompt during decoding, a pattern detectable from an observation window at the prompt’s end. ClusterAttn adaptively fits these clusters utilizing a density-based attention clustering algorithm, thus compressing the KV cache of the prompt. Evaluations on different models across various benchmarks demonstrate ClusterAttn’s superior compression rates and efficiency. By utilizing only 1024 tokens, it can reduce memory usage by 10%–65%, resulting in a latency reduction of 12%–23% and a throughput increase of 2.6–4.8 times, all with nearly no accuracy loss. Additionally, ClusterAttn can handle up to 128k context on a single A100-80GB GPU, outperforming existing methods.</abstract>
<identifier type="citekey">zhang-etal-2025-clusterattn</identifier>
<identifier type="doi">10.18653/v1/2025.acl-long.703</identifier>
<location>
<url>https://aclanthology.org/2025.acl-long.703/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>14451</start>
<end>14473</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T ClusterAttn: KV Cache Compression under Intrinsic Attention Clustering
%A Zhang, Minwei
%A Sun, Haifeng
%A Wang, Jingyu
%A Li, Shaolong
%A Ning, Wanyi
%A Qi, Qi
%A Zhuang, Zirui
%A Liao, Jianxin
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-251-0
%F zhang-etal-2025-clusterattn
%X Sparse attention can effectively alleviate the significant demands on memory when large language models (LLMs) process long contexts. Existing methods typically apply the same sparse pattern across different attention heads and inputs. However, this uniform approach fails to capture the inherent diversity of attention patterns within LLMs — the intrinsic attention clustering. To address this, we propose ClusterAttn, a training-free sparse attention method that provides an efficient prompt cache compression scheme under intrinsic attention clustering for efficient LLM inference. Our findings show that attention heads consistently focus on specific clusters of the prompt during decoding, a pattern detectable from an observation window at the prompt’s end. ClusterAttn adaptively fits these clusters utilizing a density-based attention clustering algorithm, thus compressing the KV cache of the prompt. Evaluations on different models across various benchmarks demonstrate ClusterAttn’s superior compression rates and efficiency. By utilizing only 1024 tokens, it can reduce memory usage by 10%–65%, resulting in a latency reduction of 12%–23% and a throughput increase of 2.6–4.8 times, all with nearly no accuracy loss. Additionally, ClusterAttn can handle up to 128k context on a single A100-80GB GPU, outperforming existing methods.
%R 10.18653/v1/2025.acl-long.703
%U https://aclanthology.org/2025.acl-long.703/
%U https://doi.org/10.18653/v1/2025.acl-long.703
%P 14451-14473
Markdown (Informal)
[ClusterAttn: KV Cache Compression under Intrinsic Attention Clustering](https://aclanthology.org/2025.acl-long.703/) (Zhang et al., ACL 2025)
ACL
- Minwei Zhang, Haifeng Sun, Jingyu Wang, Shaolong Li, Wanyi Ning, Qi Qi, Zirui Zhuang, and Jianxin Liao. 2025. ClusterAttn: KV Cache Compression under Intrinsic Attention Clustering. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14451–14473, Vienna, Austria. Association for Computational Linguistics.
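
The abstract describes the core idea at a high level: during decoding, each attention head concentrates on dense clusters of prompt positions, these clusters are detectable from an observation window at the end of the prompt, and only the KV-cache entries inside them need to be retained. The sketch below is a minimal illustration of that general idea, not the authors' implementation; the aggregation step, the quantile threshold, the gap-based cluster grouping, and the parameter names (`score_quantile`, `density_gap`, `keep_budget`) are assumptions made for the example.

```python
# Illustrative sketch (not the paper's code): per-head selection of prompt
# KV-cache positions by grouping high-attention positions into density
# clusters, using attention from an observation window at the prompt's end.
import numpy as np

def select_kv_positions(attn, keep_budget=1024, density_gap=8, score_quantile=0.9):
    """
    attn: (window_len, prompt_len) attention weights from the observation
          window (last few prompt tokens) to every prompt position, one head.
    Returns the prompt positions whose KV entries would be kept.
    """
    # Aggregate attention over the observation window.
    scores = attn.mean(axis=0)                      # (prompt_len,)

    # Candidate positions: those clearing a quantile threshold (assumed heuristic).
    thresh = np.quantile(scores, score_quantile)
    cand = np.flatnonzero(scores >= thresh)
    if cand.size == 0:
        return cand

    # Group candidates into density clusters: positions within `density_gap`
    # tokens of the previous candidate belong to the same cluster.
    clusters, cur = [], [cand[0]]
    for p in cand[1:]:
        if p - cur[-1] <= density_gap:
            cur.append(p)
        else:
            clusters.append(cur)
            cur = [p]
    clusters.append(cur)

    # Rank clusters by total attention mass and keep whole clusters
    # until the token budget is exhausted.
    clusters.sort(key=lambda c: scores[c].sum(), reverse=True)
    kept = []
    for c in clusters:
        if len(kept) + len(c) > keep_budget:
            break
        kept.extend(c)
    return np.sort(np.array(kept))

# Toy usage: random attention over a 4096-token prompt, 32-token window.
rng = np.random.default_rng(0)
attn = rng.random((32, 4096))
keep = select_kv_positions(attn, keep_budget=1024)
print(f"kept {keep.size} of 4096 prompt positions")
```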