Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects

Jun Zhang; Yicheng Ji; Feiyang Ren; Yihang Li; Bowen Zeng; Zonghao Chen; Ke Chen; Lidan Shou; Gang Chen; Huan Li

Abstract

Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end-to-end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute-bound visual encoding, the intensive prefilling of massive contexts, and the "visual memory wall" in bandwidth-bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing long-context attention, and overcoming memory limits, this work provides a structured analysis of how isolated optimizations compose to navigate the trade-off between visual fidelity and system efficiency. The survey concludes by outlining four future frontiers supported by pilot empirical insights, including hybrid compression based on functional unit sensitivity, modality-aware decoding with relaxed verification, progressive state management for streaming continuity, and stage-disaggregated serving through hardware-algorithm co-design. The submitted software contains a snapshot of our literature repository, which is designed to be maintained as a living resource for the community.

Anthology ID: 2026.findings-acl.1057
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 21036–21066
URL: https://aclanthology.org/2026.findings-acl.1057/
Cite (ACL): Jun Zhang, Yicheng Ji, Feiyang Ren, Yihang Li, Bowen Zeng, Zonghao Chen, Ke Chen, Lidan Shou, Gang Chen, and Huan Li. 2026. Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects. In Findings of the Association for Computational Linguistics: ACL 2026, pages 21036–21066, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects (Zhang et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1057.pdf
Checklist: 2026.findings-acl.1057.checklist.pdf
