PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference

Dongjie Yang, Xiaodong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao


Abstract
Large Language Models (LLMs) have shown remarkable comprehension abilities but face challenges in GPU memory usage during inference, hindering their scalability for real-time applications like chatbots. To accelerate inference, the computed keys and values (KV cache) are stored in GPU memory. Existing methods study KV cache compression, reducing memory by pruning the pre-computed KV cache. However, they neglect the inter-layer dependencies and the substantial memory consumed by the pre-computation (prefill) phase itself. Investigating these deficiencies, we find that the number of crucial keys and values that influence future generation decreases layer by layer, and that they can be identified by the consistency of their attention weights. Based on these findings, we propose PyramidInfer, a method that compresses the KV cache by retaining crucial context layer by layer. PyramidInfer saves significant memory by computing fewer keys and values without sacrificing performance. Experimental results show that PyramidInfer improves throughput by 2.2x over Accelerate while reducing KV cache GPU memory by over 54%.
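To make the idea concrete, below is a minimal Python/PyTorch sketch of the kind of layer-wise selection the abstract describes: each layer keeps only the context positions that recent queries consistently attend to, and the retention budget shrinks with depth, giving the cache its pyramid shape. This is an illustrative reconstruction, not the authors' implementation; the function name compress_kv_layer, the linear keep_ratio schedule, and the 64-query window are assumptions.

```python
import torch


def compress_kv_layer(
    keys: torch.Tensor,          # (batch, heads, seq_len, head_dim)
    values: torch.Tensor,        # (batch, heads, seq_len, head_dim)
    attn_weights: torch.Tensor,  # (batch, heads, num_recent_queries, seq_len)
    keep_ratio: float,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Keep only the KV pairs that recent queries consistently attend to."""
    # Mean attention weight over heads and recent queries: positions that
    # score high here are "consistently" important to future generation.
    scores = attn_weights.mean(dim=(1, 2))                 # (batch, seq_len)
    num_keep = max(1, int(keys.size(2) * keep_ratio))
    # Select the top positions, then re-sort so token order is preserved.
    idx = scores.topk(num_keep, dim=-1).indices.sort(dim=-1).values
    idx = idx[:, None, :, None].expand(-1, keys.size(1), -1, keys.size(3))
    return keys.gather(2, idx), values.gather(2, idx)


# Pyramid schedule: deeper layers keep a smaller fraction of the context,
# so the retained cache narrows layer by layer (endpoints are made up).
num_layers = 32
keep_ratios = torch.linspace(0.9, 0.3, num_layers).tolist()

# Example: compress one layer's cache for a 4096-token context.
b, h, s, d = 1, 32, 4096, 128
k, v = torch.randn(b, h, s, d), torch.randn(b, h, s, d)
attn = torch.rand(b, h, 64, s)  # attention from the 64 most recent queries
k_small, v_small = compress_kv_layer(k, v, attn, keep_ratios[0])
print(k_small.shape)  # torch.Size([1, 32, 3686, 128]) at keep_ratio=0.9
```

Re-sorting the selected indices preserves the original token order within the retained cache; the per-layer budget schedule is where the "pyramid" arises, since shallow layers keep most of the context while deep layers keep only the most consistently attended positions.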
Anthology ID:
2024.findings-acl.195
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3258–3270
URL:
https://aclanthology.org/2024.findings-acl.195
Cite (ACL):
Dongjie Yang, Xiaodong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. 2024. PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference. In Findings of the Association for Computational Linguistics: ACL 2024, pages 3258–3270, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference (Yang et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.195.pdf