FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization

Fangxin Liu, Zongwu Wang, Jinhong Xia, Junping Zhao, Shouren Zhao, Jinjin Li, Jian Liu, Li Jiang, Haibing Guan


Abstract
The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck due to the widening gap between model parameter scaling and hardware capabilities. While post-training quantization techniques effectively reduce memory overhead, existing methods predominantly rely on static quantization strategies, which struggle to adapt to dynamic workloads. To address this, we propose FlexQuant, a dynamic precision-switching framework that optimizes the trade-off between inference speed and accuracy. Leveraging model perplexity entropy and Kullback-Leibler divergence, FlexQuant enables fine-grained, layer-wise mixed-precision quantization and dynamically adjusts bit-widths at each token-generation step. FlexQuant provides a comprehensive analysis of quantization strategies, introduces a precision requirement model for optimal switching, and implements efficient fine-grained precision management. Evaluations demonstrate that FlexQuant achieves a 1.3× end-to-end speedup across diverse language tasks with negligible accuracy loss. This framework offers a flexible and adaptive solution for efficient LLM deployment.
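To make the switching mechanism concrete, the sketch below shows one plausible shape of such a controller: it measures the entropy of the next-token distribution (a proxy for perplexity) and the KL divergence between the full-precision and quantized distributions, and escalates or relaxes the bit-width accordingly. This is a minimal illustration of the idea in the abstract, not the paper's implementation; the function names (`select_bit_width`, `token_entropy`), the thresholds, and the doubling/halving rule are all our assumptions.

```python
# Hypothetical sketch of a dynamic precision-switching controller.
# NOT the authors' implementation: thresholds and the switching rule
# are illustrative assumptions.
import math

def token_entropy(probs):
    """Shannon entropy of a next-token distribution (a proxy for perplexity)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between full-precision and quantized token distributions."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def select_bit_width(probs_fp, probs_quant, current_bits,
                     entropy_hi=3.0, kl_hi=0.05):
    """Pick the bit-width for the next decoding step.

    Heuristic: when the model is uncertain (high entropy) or the quantized
    distribution drifts from the full-precision one (high KL), fall back to
    a higher precision; otherwise relax toward a cheaper bit-width.
    Thresholds are illustrative, not taken from the paper.
    """
    if (token_entropy(probs_fp) > entropy_hi
            or kl_divergence(probs_fp, probs_quant) > kl_hi):
        return min(current_bits * 2, 16)   # escalate precision, capped at FP16
    return max(current_bits // 2, 4)       # relax toward 4-bit when safe

# Example: distribution drift under quantization escalates 8-bit -> 16-bit.
p_fp = [0.25, 0.25, 0.25, 0.25]   # full-precision next-token distribution
p_q  = [0.40, 0.30, 0.20, 0.10]   # drifted quantized distribution
print(select_bit_width(p_fp, p_q, current_bits=8))  # -> 16
```

In this framing, the per-step decision is cheap (one entropy and one divergence computation over the output distribution), which is consistent with the abstract's claim that precision management must be fine-grained yet efficient enough to run at every generated token.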
Anthology ID:
2025.findings-emnlp.221
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4152–4161
URL:
https://aclanthology.org/2025.findings-emnlp.221/
Cite (ACL):
Fangxin Liu, Zongwu Wang, Jinhong Xia, Junping Zhao, Shouren Zhao, Jinjin Li, Jian Liu, Li Jiang, and Haibing Guan. 2025. FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 4152–4161, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization (Liu et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.221.pdf
Checklist:
2025.findings-emnlp.221.checklist.pdf