HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

Bowen Zeng; Feiyang Ren; Jun Zhang; Xiaoling Gu; Ke Chen; Lidan Shou; Huan Li

Abstract

Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key–value (KV) caches. Each visual input expands into thousands of tokens, so the cache scales linearly with context length and stays resident in GPU memory throughout decoding, which leads to prohibitive memory overhead and latency even on high-end GPUs. A common solution is to compress the cache under a fixed budget, allocated at one of several granularities: token-level methods uniformly discard less important tokens, layer-level methods vary retention across layers, and head-level methods redistribute budgets across heads. Yet these approaches stop at allocation and overlook the heterogeneous behaviors of attention heads, which call for distinct compression strategies. We propose HybridKV, a hybrid KV cache compression framework that integrates complementary strategies in three stages: first, heads are classified as static or dynamic using text-centric attention; next, a top-down budget allocation scheme hierarchically assigns KV budgets; finally, static heads are compressed by text-prior pruning and dynamic heads by chunk-wise retrieval. Experiments on 11 multimodal benchmarks with Qwen2.5-VL-7B show that HybridKV reduces KV cache memory by up to 7.9× and achieves 1.52× faster decoding, with performance that drops only marginally, or even improves, relative to the full-cache MLLM.
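To make the linear-scaling claim concrete, the following is a minimal back-of-the-envelope sketch, not taken from the paper. It uses the standard KV cache size formula (keys plus values, per layer, per KV head, per token). The default configuration values (28 layers, 4 KV heads, head dimension 128, 16-bit storage, batch size 1) and the 32,000-token context are illustrative assumptions; check them against the actual model configuration before relying on the numbers. The 7.9× factor is the best-case reduction reported in the abstract.

```python
def kv_cache_bytes(seq_len, num_layers=28, num_kv_heads=4,
                   head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for seq_len tokens (batch size 1).

    All default values are assumptions chosen for illustration only.
    """
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    context_tokens = 32_000          # hypothetical long multimodal context
    full = kv_cache_bytes(context_tokens)
    compressed = full / 7.9          # best-case reduction reported in the abstract
    print(f"full cache:          {full / 2**20:.1f} MiB")
    print(f"at 7.9x compression: {compressed / 2**20:.1f} MiB")
```

Because the formula is a product with `seq_len` as one factor, doubling the number of cached tokens doubles the memory, which is why per-image token counts in the thousands dominate the footprint.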

Anthology ID: 2026.acl-long.594
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 13018–13034
URL: https://aclanthology.org/2026.acl-long.594/
Cite (ACL): Bowen Zeng, Feiyang Ren, Jun Zhang, Xiaoling Gu, Ke Chen, Lidan Shou, and Huan Li. 2026. HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13018–13034, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference (Zeng et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.594.pdf
Checklist: 2026.acl-long.594.checklist.pdf
