@inproceedings{zhang-etal-2025-tokens,
title = "More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in {KV} Cache Compression",
author = "Zhang, Jiebin and
Zhu, Dawei and
Song, Yifan and
Wu, Wenhao and
Kuang, Chuqiao and
Li, Xiaoguang and
Shang, Lifeng and
Liu, Qun and
Li, Sujian",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.429/",
pages = "8092--8105",
ISBN = "979-8-89176-335-7",
abstract = "As large language models (LLMs) process increasing context windows, the memory usage of KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimensions separately. However, these works have left the trade-off between these two orthogonal dimensions largely unexplored. In this paper, we leverage the Information Bottleneck principle to formulate KV cache compression within a unified theoretical framework. We demonstrate that a carefully managed token-precision trade-off can achieve an optimal point within the Information Bottleneck compared to standalone KV pruning or KV quantization. Experiments reveal that storing more tokens in the KV cache at lower precision{---}a strategy we term quantized pruning{---}can significantly enhance the long-context performance of LLMs. An in-depth analysis of this token-precision trade-off across key aspects shows that quantized pruning achieves substantial improvements in retrieval-related tasks and consistently performs well across varying input lengths. Furthermore, quantized pruning exhibits notable stability and effectiveness across different KV pruning methods, quantization strategies, and model scales. These findings offer valuable insights into optimizing KV cache compression through balanced token-precision trade-off strategies. Our code isavailable at https://github.com/zhzihao/QPruningKV."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="zhang-etal-2025-tokens">
<titleInfo>
<title>More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression</title>
</titleInfo>
<name type="personal">
<namePart type="given">Jiebin</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Dawei</namePart>
<namePart type="family">Zhu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yifan</namePart>
<namePart type="family">Song</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wenhao</namePart>
<namePart type="family">Wu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Chuqiao</namePart>
<namePart type="family">Kuang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xiaoguang</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lifeng</namePart>
<namePart type="family">Shang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Qun</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sujian</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: EMNLP 2025</title>
</titleInfo>
<name type="personal">
<namePart type="given">Christos</namePart>
<namePart type="family">Christodoulopoulos</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tanmoy</namePart>
<namePart type="family">Chakraborty</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Carolyn</namePart>
<namePart type="family">Rose</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Violet</namePart>
<namePart type="family">Peng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou, China</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-335-7</identifier>
</relatedItem>
<abstract>As large language models (LLMs) process increasing context windows, the memory usage of KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimensions separately. However, these works have left the trade-off between these two orthogonal dimensions largely unexplored. In this paper, we leverage the Information Bottleneck principle to formulate KV cache compression within a unified theoretical framework. We demonstrate that a carefully managed token-precision trade-off can achieve an optimal point within the Information Bottleneck compared to standalone KV pruning or KV quantization. Experiments reveal that storing more tokens in the KV cache at lower precision—a strategy we term quantized pruning—can significantly enhance the long-context performance of LLMs. An in-depth analysis of this token-precision trade-off across key aspects shows that quantized pruning achieves substantial improvements in retrieval-related tasks and consistently performs well across varying input lengths. Furthermore, quantized pruning exhibits notable stability and effectiveness across different KV pruning methods, quantization strategies, and model scales. These findings offer valuable insights into optimizing KV cache compression through balanced token-precision trade-off strategies. Our code is available at https://github.com/zhzihao/QPruningKV.</abstract>
<identifier type="citekey">zhang-etal-2025-tokens</identifier>
<location>
<url>https://aclanthology.org/2025.findings-emnlp.429/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>8092</start>
<end>8105</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression
%A Zhang, Jiebin
%A Zhu, Dawei
%A Song, Yifan
%A Wu, Wenhao
%A Kuang, Chuqiao
%A Li, Xiaoguang
%A Shang, Lifeng
%A Liu, Qun
%A Li, Sujian
%Y Christodoulopoulos, Christos
%Y Chakraborty, Tanmoy
%Y Rose, Carolyn
%Y Peng, Violet
%S Findings of the Association for Computational Linguistics: EMNLP 2025
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-335-7
%F zhang-etal-2025-tokens
%X As large language models (LLMs) process increasing context windows, the memory usage of KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimensions separately. However, these works have left the trade-off between these two orthogonal dimensions largely unexplored. In this paper, we leverage the Information Bottleneck principle to formulate KV cache compression within a unified theoretical framework. We demonstrate that a carefully managed token-precision trade-off can achieve an optimal point within the Information Bottleneck compared to standalone KV pruning or KV quantization. Experiments reveal that storing more tokens in the KV cache at lower precision—a strategy we term quantized pruning—can significantly enhance the long-context performance of LLMs. An in-depth analysis of this token-precision trade-off across key aspects shows that quantized pruning achieves substantial improvements in retrieval-related tasks and consistently performs well across varying input lengths. Furthermore, quantized pruning exhibits notable stability and effectiveness across different KV pruning methods, quantization strategies, and model scales. These findings offer valuable insights into optimizing KV cache compression through balanced token-precision trade-off strategies. Our code is available at https://github.com/zhzihao/QPruningKV.
%U https://aclanthology.org/2025.findings-emnlp.429/
%P 8092-8105
Markdown (Informal)
[More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression](https://aclanthology.org/2025.findings-emnlp.429/) (Zhang et al., Findings 2025)
ACL
Jiebin Zhang, Dawei Zhu, Yifan Song, Wenhao Wu, Chuqiao Kuang, Xiaoguang Li, Lifeng Shang, Qun Liu, and Sujian Li. 2025. More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 8092–8105, Suzhou, China. Association for Computational Linguistics.
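
The abstract's central idea, quantized pruning, trades token count against numeric precision under a fixed KV-cache memory budget. The sketch below is a minimal, illustrative toy in Python (NumPy), not the authors' implementation (their official code is at https://github.com/zhzihao/QPruningKV); the key-scoring heuristic, the per-row min-max quantizer, and all function names are assumptions made for this example. It only shows how many more tokens fit in the same budget as precision drops.

```python
# Illustrative sketch only: NOT the paper's implementation (see
# https://github.com/zhzihao/QPruningKV for the official code).
# Under one fixed memory budget, compare keeping few tokens at high
# precision with keeping more tokens at lower precision.
import numpy as np


def quantize_dequantize(x: np.ndarray, n_bits: int) -> np.ndarray:
    """Per-row asymmetric min-max quantization, then dequantization (assumed scheme)."""
    levels = 2 ** n_bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    return np.round((x - lo) / scale) * scale + lo


def score_tokens(keys: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Proxy importance score: attention logit of the latest query (assumed heuristic)."""
    return keys @ query / np.sqrt(keys.shape[-1])


def compress(keys, values, query, budget_bits, n_bits):
    """Keep as many top-scoring tokens as fit in `budget_bits` at `n_bits` per value."""
    bits_per_token = 2 * keys.shape[-1] * n_bits  # key + value vectors
    n_keep = min(len(keys), budget_bits // bits_per_token)
    keep = np.argsort(-score_tokens(keys, query))[:n_keep]
    keep.sort()  # restore original token order
    return (quantize_dequantize(keys[keep], n_bits),
            quantize_dequantize(values[keep], n_bits),
            keep)


rng = np.random.default_rng(0)
n_tokens, head_dim = 1024, 64
K = rng.normal(size=(n_tokens, head_dim))
V = rng.normal(size=(n_tokens, head_dim))
q = rng.normal(size=head_dim)

# Budget sized to hold 256 tokens at the 16-bit baseline precision.
budget = 256 * 2 * head_dim * 16
for bits in (16, 8, 4, 2):
    _, _, kept = compress(K, V, q, budget, bits)
    print(f"{bits:>2}-bit: {len(kept)} tokens retained under the same budget")
```

Running the sketch simply prints that, for the same budget, 8-bit storage retains twice as many tokens as 16-bit and 4-bit retains four times as many; whether keeping those extra low-precision tokens helps downstream accuracy is the empirical question the paper answers in favor of quantized pruning.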