ACBQ: Adaptive Cross-Block Quantization of Large Language Models

Hailing Wang; Jianglin Lu; Yitian Zhang; Huimin Zeng; Yun Fu

Abstract

Post-training quantization (PTQ) has emerged as a promising approach for reducing the memory footprint and computational cost of large language models (LLMs), enabling efficient deployment without full model retraining. However, existing PTQ methods struggle to simultaneously support weight–activation joint quantization and extreme low-bit weight quantization. This limitation primarily arises from the depth of LLMs and their strong cross-layer dependencies, which cause quantization errors to propagate and accumulate across layers, ultimately leading to significant performance degradation. In this paper, we present ACBQ, a simple yet effective framework that simultaneously addresses weight–activation joint quantization and extreme weight quantization. We first propose a granular quantization strategy that treats self-attention and FFN as separate quantization units with module-specific optimization objectives. To mitigate the propagation and accumulation of quantization errors across layers, we introduce an adaptive cross-block quantization strategy that explicitly accounts for cross-layer dependencies by encouraging consistency across blocks. Extensive experiments across diverse LLMs, including OPT and the LLaMA family, demonstrate that ACBQ achieves superior performance under both W4A4 and highly aggressive W2 settings, while incurring negligible additional computational overhead.

Anthology ID: 2026.acl-long.1971
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 42578–42592
URL: https://aclanthology.org/2026.acl-long.1971/
Cite (ACL): Hailing Wang, Jianglin Lu, Yitian Zhang, Huimin Zeng, and Yun Fu. 2026. ACBQ: Adaptive Cross-Block Quantization of Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 42578–42592, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): ACBQ: Adaptive Cross-Block Quantization of Large Language Models (Wang et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1971.pdf
Checklist: 2026.acl-long.1971.checklist.pdf