QUIK: Towards End-to-end 4-Bit Inference on Generative Large Language Models

Saleh Ashkboos; Ilia Markov; Elias Frantar; Tingxuan Zhong; Xincheng Wang; Jie Ren; Torsten Hoefler; Dan Alistarh

doi:10.18653/v1/2024.emnlp-main.197

Abstract

Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference costs to allow for efficient local computation. However, the vast majority of existing work focuses on weight-only quantization, which can reduce runtime costs in the memory-bound one-token-at-a-time generative setting, but does not address costs in compute-bound scenarios, such as batched inference or prompt processing. In this paper, we address the general quantization problem, where both weights and activations should be quantized, which leads to computational improvements in general. We show that the majority of inference computations for large generative models can be performed with both weights and activations cast to 4 bits, while maintaining good accuracy. We achieve this via a hybrid quantization strategy called QUIK that compresses most of the weights and activations to 4-bit, while keeping a small fraction of "outlier" weights and activations in higher precision. QUIK is designed with computational efficiency in mind: we provide GPU kernels matching the QUIK format with highly efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x relative to FP16 execution. We provide detailed studies for models from the OPT, LLaMA-2 and Falcon families, as well as a first instance of accurate inference using quantization plus 2:4 sparsity. Anonymized code is available.
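
The hybrid idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation or their GPU kernels. The helper names (`quantize_sym`, `hybrid_matmul`), the symmetric round-to-nearest scheme, the per-row scales, and the rule of picking outlier features by largest activation magnitude are all assumptions made for illustration; the abstract only states that a small fraction of outlier weights and activations stay in higher precision.

```python
import numpy as np

def quantize_sym(x, bits=4, axis=-1):
    """Symmetric round-to-nearest quantization along `axis`.

    Returns integer codes (stored as int8) and the per-row scales.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # avoid division by zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def hybrid_matmul(x, w, n_outliers=2, bits=4):
    """Approximate x @ w.T, quantizing most feature columns to `bits`.

    x: (tokens, in_features) activations; w: (out_features, in_features).
    Assumption: the `n_outliers` features with the largest activation
    magnitude are treated as outliers and kept in full precision.
    """
    n_features = x.shape[1]
    col_mag = np.max(np.abs(x), axis=0)
    outlier_idx = np.argsort(col_mag)[n_features - n_outliers:]
    base_mask = np.ones(n_features, dtype=bool)
    base_mask[outlier_idx] = False

    # Low-precision path: integer matmul with int32 accumulation.
    xq, xs = quantize_sym(x[:, base_mask], bits)    # xs: (tokens, 1)
    wq, ws = quantize_sym(w[:, base_mask], bits)    # ws: (out_features, 1)
    acc = xq.astype(np.int32) @ wq.astype(np.int32).T
    y_base = acc * xs * ws.T                        # dequantize

    # Higher-precision path for the outlier columns.
    y_outlier = x[:, outlier_idx] @ w[:, outlier_idx].T
    return y_base + y_outlier

# Toy data: two feature columns carry much larger activations.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))
x[:, [3, 17]] *= 50.0
w = rng.standard_normal((16, 64))
ref = x @ w.T

y_hybrid = hybrid_matmul(x, w, n_outliers=2)
y_plain = hybrid_matmul(x, w, n_outliers=0)         # everything in 4 bits
err_hybrid = np.linalg.norm(y_hybrid - ref) / np.linalg.norm(ref)
err_plain = np.linalg.norm(y_plain - ref) / np.linalg.norm(ref)
```

On this toy input, comparing `err_hybrid` with `err_plain` shows why the split matters: when the large columns are quantized together with the rest, they set the per-token scale and the remaining features lose most of their resolution.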

Anthology ID: 2024.emnlp-main.197
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 3355–3371
URL: https://aclanthology.org/2024.emnlp-main.197/
DOI: 10.18653/v1/2024.emnlp-main.197
Cite (ACL): Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, and Dan Alistarh. 2024. QUIK: Towards End-to-end 4-Bit Inference on Generative Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3355–3371, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): QUIK: Towards End-to-end 4-Bit Inference on Generative Large Language Models (Ashkboos et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.197.pdf
