2025
More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression
Jiebin Zhang | Dawei Zhu | Yifan Song | Wenhao Wu | Chuqiao Kuang | Xiaoguang Li | Lifeng Shang | Qun Liu | Sujian Li
Findings of the Association for Computational Linguistics: EMNLP 2025
As large language models (LLMs) process increasingly long context windows, the memory usage of the KV cache has become a critical bottleneck during inference. Mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension separately. However, these works have left the trade-off between these two orthogonal dimensions largely unexplored. In this paper, we leverage the Information Bottleneck principle to formulate KV cache compression within a unified theoretical framework. We demonstrate that a carefully managed token-precision trade-off can achieve an optimal point within the Information Bottleneck compared to standalone KV pruning or KV quantization. Experiments reveal that storing more tokens in the KV cache at lower precision—a strategy we term quantized pruning—can significantly enhance the long-context performance of LLMs. An in-depth analysis of this token-precision trade-off across key aspects shows that quantized pruning achieves substantial improvements in retrieval-related tasks and consistently performs well across varying input lengths. Furthermore, quantized pruning exhibits notable stability and effectiveness across different KV pruning methods, quantization strategies, and model scales. These findings offer valuable insights into optimizing KV cache compression through balanced token-precision trade-off strategies. Our code is available at https://github.com/zhzihao/QPruningKV.
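To make the token-precision trade-off concrete, below is a minimal PyTorch sketch of the quantized-pruning idea for a single attention head: under a fixed memory budget, lowering the storage precision lets the cache retain proportionally more tokens selected by an importance score. The function names (`quantize_dequantize`, `quantized_pruning`), the per-token importance scores, and the uniform quantizer are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch


def quantize_dequantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Simulate uniform asymmetric quantization along the last dimension (a sketch)."""
    qmax = 2 ** bits - 1
    xmin = x.min(dim=-1, keepdim=True).values
    xmax = x.max(dim=-1, keepdim=True).values
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    q = torch.round((x - xmin) / scale).clamp(0, qmax)
    return q * scale + xmin  # dequantized values at reduced effective precision


def quantized_pruning(keys, values, scores, budget_bits, bits=4):
    """Keep as many important tokens as the memory budget allows at `bits` precision.

    keys, values: [seq_len, head_dim] cache for one head
    scores:       [seq_len] per-token importance (e.g., accumulated attention; assumed given)
    budget_bits:  memory budget in bits for the key cache of this head
    """
    seq_len, head_dim = keys.shape
    # Lower precision => more tokens fit under the same budget.
    n_keep = min(seq_len, budget_bits // (bits * head_dim))
    keep = scores.topk(n_keep).indices.sort().values  # preserve temporal order
    k_kept = quantize_dequantize(keys[keep], bits)
    v_kept = quantize_dequantize(values[keep], bits)
    return k_kept, v_kept, keep


# Example: a budget that fits 256 tokens at 16-bit fits ~1024 tokens at 4-bit.
keys = torch.randn(4096, 128)
values = torch.randn(4096, 128)
scores = torch.rand(4096)
budget = 256 * 16 * 128  # bits
k4, v4, idx4 = quantized_pruning(keys, values, scores, budget, bits=4)
print(k4.shape)  # torch.Size([1024, 128])
```

Under this toy setup, the 4-bit variant retains four times as many tokens as a 16-bit cache of the same size, which is the regime the paper argues is often preferable for long-context retrieval.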