Deputy: Accelerating Large Language Model Inference with Dynamic Low-Rank Substitution

Yuhua Zhou; Shichao Weng; Changhai Zhou; Yuhan Wu; Qian Qiao; Jun Gao; Fei Yang; Aimin Pan

Abstract

While the massive scale of modern LLMs enables remarkable performance, their static, input-agnostic computational graph incurs substantial resource wastage and high latency during inference. Existing dynamic schemes, such as early-exit and layer-drop, reduce FLOPs but break batch processing or introduce KV-cache inconsistency. We propose Deputy, a dynamic low-rank substitution framework that employs a lightweight decision module at each layer to determine the execution branch for each token: attention layers choose between full and low-rank computation to mitigate the KV-cache issue, while FFN layers additionally support skipping to further reduce computation. We fine-tune the LLM with LoRA and then derive an additional low-rank matrix C via a least-squares fit BC ≈ W_pre, where B is the shared LoRA matrix, so that only one extra low-rank matrix is introduced, which reduces memory overhead. Moreover, a hybrid KV-cache strategy stores KV values generated by the low-rank branch, achieving a 38% reduction in cache storage. Experiments on Llama models demonstrate that Deputy reduces computation by approximately 40% compared to the original dense model while outperforming existing baseline methods.
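
The least-squares fit BC ≈ W_pre mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the shapes (B as d_out × r, C as r × d_in), the random data, and the function name `fit_low_rank_factor` are assumptions made only for illustration, and the abstract does not say how the fit is actually computed.

```python
import numpy as np


def fit_low_rank_factor(W_pre, B):
    """Return C minimizing the Frobenius norm ||B @ C - W_pre||_F.

    Hypothetical helper: W_pre is assumed to be (d_out, d_in) and
    B is assumed to be (d_out, r), so C comes out as (r, d_in).
    """
    # lstsq solves the problem column by column of W_pre.
    C, *_ = np.linalg.lstsq(B, W_pre, rcond=None)
    return C


# Toy data standing in for a pretrained weight and a LoRA matrix (assumed shapes).
rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 8
W_pre = rng.standard_normal((d_out, d_in))
B = rng.standard_normal((d_out, r))

C = fit_low_rank_factor(W_pre, B)

# Relative error of the rank-r substitute B @ C for W_pre.
rel_error = np.linalg.norm(B @ C - W_pre) / np.linalg.norm(W_pre)
print(C.shape, rel_error)
```

Because B is reused, only C has to be stored in addition to the LoRA weights, which matches the abstract's point that a single extra low-rank matrix is introduced. On random data the relative error is large; any accuracy on real weights depends on how well B spans the column space of W_pre.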

Anthology ID: 2026.findings-acl.991
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 19791–19810
URL: https://aclanthology.org/2026.findings-acl.991/
Cite (ACL): Yuhua Zhou, Shichao Weng, Changhai Zhou, Yuhan Wu, Qian Qiao, Jun Gao, Fei Yang, and Aimin Pan. 2026. Deputy: Accelerating Large Language Model Inference with Dynamic Low-Rank Substitution. In Findings of the Association for Computational Linguistics: ACL 2026, pages 19791–19810, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Deputy: Accelerating Large Language Model Inference with Dynamic Low-Rank Substitution (Zhou et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.991.pdf
Checklist: 2026.findings-acl.991.checklist.pdf
