MM-ShiftKV: Decode-Aware Prefill-Stage KV Selection for Multimodal Large Language Models

Jinsong Shu; Chenyang Wu; Zhongle Xie; Baokun Wang; Lidan Shou

Abstract

Key-Value (KV) caching is essential for efficient inference in multimodal large language models (MLLMs), yet its memory footprint grows linearly with context length and becomes a major bottleneck due to the large number of visual tokens. Recent prefill-stage KV selection methods estimate KV importance from prefilling statistics, implicitly assuming that prefilling-time queries are representative of those encountered during decoding. We show that this assumption breaks down in multimodal inference, where decoding-time queries exhibit substantially larger variance than prefilling-stage representations, leading to unstable KV importance estimation under tight cache budgets. As a result, small ranking errors can disproportionately discard semantically critical visual tokens and degrade grounding and reasoning performance. We propose MM-ShiftKV, a training-free, decode-aware and strictly prefill-only KV selection method. MM-ShiftKV approximates decoding-time query behavior during prefilling by constructing variance-expanded query proxies and estimates prompt KV importance based on their aggregated attention mass. Experiments on multimodal benchmarks demonstrate that MM-ShiftKV consistently outperforms existing methods under strict KV-cache budgets.
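The selection idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not say how the variance-expanded proxies are built, so the sketch assumes they are formed by scaling each prefill query's deviation from the mean query by a hypothetical factor `alpha`. The function names, the single-head plain-softmax scoring, and the top-`budget` rule are likewise assumptions made for illustration.

```python
import math

def expand_queries(queries, alpha=2.0):
    """Assumed proxy construction: scale each query's deviation from the
    mean query by alpha, so alpha > 1 widens the spread of the queries."""
    dim = len(queries[0])
    mean = [sum(q[j] for q in queries) / len(queries) for j in range(dim)]
    return [[mean[j] + alpha * (q[j] - mean[j]) for j in range(dim)]
            for q in queries]

def attention_mass(queries, keys):
    """Sum softmax attention weights over all queries: one score per key."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    mass = [0.0] * len(keys)
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - top) for x in logits]
        total = sum(weights)
        for i, w in enumerate(weights):
            mass[i] += w / total
    return mass

def select_kv(queries, keys, budget, alpha=2.0):
    """Return sorted indices of the `budget` keys with the largest mass
    under the expanded query proxies."""
    mass = attention_mass(expand_queries(queries, alpha), keys)
    ranked = sorted(range(len(keys)), key=lambda i: mass[i], reverse=True)
    return sorted(ranked[:budget])

# Toy data: two prefill queries and four prompt keys in two dimensions.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [0.5, 0.5]]
print(select_kv(queries, keys, budget=2))  # [0, 1]
```

In this toy setting the two keys aligned with the queries receive the most aggregated mass and are retained, while the opposing key is dropped first. A real MLLM would apply such scoring per layer and per attention head over the prompt KV cache, which this sketch does not attempt.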

Anthology ID: 2026.findings-acl.1447
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 28964–28982
URL: https://aclanthology.org/2026.findings-acl.1447/
Cite (ACL): Jinsong Shu, Chenyang Wu, Zhongle Xie, Baokun Wang, and Lidan Shou. 2026. MM-ShiftKV: Decode-Aware Prefill-Stage KV Selection for Multimodal Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 28964–28982, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): MM-ShiftKV: Decode-Aware Prefill-Stage KV Selection for Multimodal Large Language Models (Shu et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1447.pdf
Checklist: 2026.findings-acl.1447.checklist.pdf
