AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations

Qian Tao, Wenyuan Yu, Jingren Zhou


Abstract
Large language models have shown exceptional capabilities in a wide range of tasks, such as text generation and video generation, among others. However, due to their massive parameter count, these models often require substantial storage space, imposing significant constraints on the machines that deploy LLMs. To overcome this limitation, one research direction proposes compressing the models by replacing floating-point numbers with integers, a process known as quantization. Some recent studies suggest quantizing the key and value cache (KV cache) of LLMs and design quantization techniques that treat the key and value matrices equivalently. This work delves deeper into the asymmetric structural roles of the KV cache, a phenomenon whereby the transformer's output loss is more sensitive to the quantization of key matrices than of value matrices. We conduct a systematic examination of the attention output error resulting from key and value quantization. This phenomenon inspires us to propose an asymmetric quantization strategy that enables 1-bit quantization of the KV cache by applying distinct configurations to key and value matrices. We carry out experiments across a variety of datasets, demonstrating that our proposed method allows up to 75% of the decoder layers to be quantized to 1 bit while maintaining performance comparable to that of models with floating-point parameters.
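
The abstract describes assigning different quantization configurations to key and value matrices on a per-layer basis. The sketch below is a minimal illustration of such a layer-wise asymmetric scheme in PyTorch; it is not the authors' implementation. The min/max fake-quantizer, the group size, and the example layer configuration (full-precision keys in early layers, 1-bit values everywhere) are illustrative assumptions.

```python
# Minimal sketch (not the paper's released code) of layer-wise asymmetric
# KV cache quantization. Bit-widths, group size, and quantizer are assumptions.
import torch

def quantize_dequantize(x: torch.Tensor, bits: int, group_size: int = 64) -> torch.Tensor:
    """Uniform min/max fake-quantization over groups along the last dimension."""
    orig_shape = x.shape
    x = x.reshape(-1, group_size)
    xmin = x.min(dim=-1, keepdim=True).values
    xmax = x.max(dim=-1, keepdim=True).values
    scale = (xmax - xmin).clamp(min=1e-8) / (2 ** bits - 1)
    q = torch.round((x - xmin) / scale).clamp(0, 2 ** bits - 1)
    return (q * scale + xmin).reshape(orig_shape)

class AsymmetricKVCache:
    """Holds per-layer (key_bits, value_bits) settings; None means keep the
    tensor in floating point. Keys and values may use different bit-widths."""
    def __init__(self, layer_configs):
        self.layer_configs = layer_configs
        self.cache = [([], []) for _ in layer_configs]  # (keys, values) per layer

    def update(self, layer: int, k: torch.Tensor, v: torch.Tensor):
        key_bits, value_bits = self.layer_configs[layer]
        k = quantize_dequantize(k, key_bits) if key_bits is not None else k
        v = quantize_dequantize(v, value_bits) if value_bits is not None else v
        self.cache[layer][0].append(k)
        self.cache[layer][1].append(v)
        # Return the full cached sequence for this layer (concatenate on seq dim).
        return (torch.cat(self.cache[layer][0], dim=-2),
                torch.cat(self.cache[layer][1], dim=-2))

# Example: a hypothetical 32-layer model where the first 8 layers keep keys
# in full precision while values are quantized to 1 bit in every layer.
configs = [(None, 1) if i < 8 else (1, 1) for i in range(32)]
kv_cache = AsymmetricKVCache(configs)
k = torch.randn(1, 8, 1, 128)  # (batch, heads, seq, head_dim)
v = torch.randn(1, 8, 1, 128)
keys, values = kv_cache.update(layer=0, k=k, v=v)
```

Under this kind of configuration, layers whose keys are more sensitive to quantization error can retain higher key precision while their values are still pushed to 1 bit, which is the asymmetry the abstract refers to.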
Anthology ID:
2025.coling-main.158
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
2316–2328
URL:
https://aclanthology.org/2025.coling-main.158/
Cite (ACL):
Qian Tao, Wenyuan Yu, and Jingren Zhou. 2025. AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations. In Proceedings of the 31st International Conference on Computational Linguistics, pages 2316–2328, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations (Tao et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.158.pdf