EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs

Large language models (LLMs) have proven to be far superior to conventional methods in various tasks. However, their expensive computation and high memory requirements are prohibitive for deployment. Model quantization is an effective method for reducing this overhead. The problem is that in most previous works, the quantized model was calibrated using a few samples from the training data, which might affect the generalization of the quantized LLMs to unknown cases and tasks. Hence in this work, we explore an important question: Can we design a data-independent quantization method for LLMs that guarantees their generalization performance? We propose EasyQuant, a training-free and data-independent weight-only quantization algorithm for LLMs. Our observation indicates that two factors, outliers in the weights and the quantization ranges, are essential for reducing the quantization error. Therefore, in EasyQuant, we leave the outliers (less than 1%) unchanged and optimize the quantization range to reduce the reconstruction error. With these methods, we surprisingly find that EasyQuant achieves performance comparable to the original model. Since EasyQuant does not depend on any training data, the generalization performance of quantized LLMs is safely guaranteed. Moreover, EasyQuant can be implemented in parallel, so the quantized model can be obtained in a few minutes even for LLMs over 100B parameters. To the best of our knowledge, ours is the first work that achieves almost lossless quantization performance for LLMs under a data-independent setting, and our algorithm runs over 10 times faster than data-dependent methods.


Introduction
Recent work has already proved the superior performance of Transformer (Vaswani et al., 2017) based LLMs (Workshop, 2023; Zhang et al., 2022; Touvron et al., 2023; Brown et al., 2020; Rae et al., 2021; Smith et al., 2022; Chowdhery et al., 2022; Zeng et al., 2022) on various tasks over traditional methods, and has attracted massive interest in how to improve and utilize these LLMs. However, the model size also grows dramatically along with the improved performance. Hence the memory footprint and computational cost become the bottleneck for deploying these models. One promising solution to alleviate this overhead is model quantization (Frantar et al., 2023a; Xiao et al., 2023), where we quantize the weights only, or both weights and activations, in order to reduce memory consumption and computational cost.
Although model quantization is a well-studied area for normal-sized models, such as BERT (Devlin et al., 2018) and GPT-2 (Radford et al., 2019), it is still a quite challenging task for LLMs. One major reason is that previous lossless model quantization algorithms require retraining the quantized model, which is too expensive for models with billions of parameters. Beyond this, previous models were usually designed for specific domain tasks, which means the training data were sampled from limited task domains. However, recent LLMs are usually trained on corpora from various domains, and they have been shown to be quite effective on multi-domain zero-shot tasks. In this case, if we only retrain the quantized LLMs on a partial domain corpus, the generalization ability of the LLMs might deteriorate. Therefore, both efficiency and generalization guarantees are very important when designing LLM quantization algorithms. To date, for low-bit weight-only quantization, several post-training algorithms have been proposed (Frantar et al., 2023a; Yao et al., 2022). However, those methods require a small calibration set sampled from the training data, which still takes at least several hours, and the use of calibration data also brings the risk of making the model overfit to the calibration set.
Our Contribution: In this work, we propose a novel data-free model quantization algorithm, namely EasyQuant, that substantially improves the performance of low-bit quantized LLMs. The generalization ability of LLMs is inherently guaranteed since EasyQuant does not need any input data. By running EasyQuant for only a few minutes, we can quantize the publicly available OPT-176B, BLOOM-176B, and LLAMA-65B into lower bits without significant loss on various benchmarks. To the best of our knowledge, this is the first data-free LLM quantization algorithm without notable system overhead.
Moreover, our work reveals the essential factors that cause the performance degradation of quantized LLMs. We show that the outliers in the weights are more critical to the model's performance than the normal elements. Beyond this, we propose a gradient-based method for optimizing the quantization range. These two strategies can also be used in other scenarios, such as weight-activation quantization and quantization-aware training (QAT).
Last but not least, we develop efficient CUDA kernels for outlier isolation in dequantization, and show that keeping 1% of the weights as unquantized outliers brings negligible (less than 0.1%) overhead w.r.t. overall latency. We also implement EasyQuant in parallel across the weights of the model, so a 175B-parameter model can be quantized into 4 bits within 10 minutes.

Background and Motivation
The most widely used quantization method, namely rounding-to-nearest (RTN), quantizes a tensor x into a k-bit representation according to

Q[x] = (s / l_max) · clip(⌊x · l_max / s⌉, l_min, l_max).   (1)

Here s is the quantization range (scale), l_min and l_max are the lower and upper bounds for clipping, and ⌊·⌉ is the rounding operator. Usually we set l_min = −2^(k−1) + 1 and l_max = 2^(k−1), and set s to be the maximum absolute value in x.
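As a concrete illustration, RTN with the conventions above can be sketched in a few lines of numpy (the function names and shapes are ours, not from the paper):

```python
import numpy as np

def rtn_quantize(x, k=4):
    """Round-to-nearest: map x to k-bit integer codes plus a scale s."""
    l_min, l_max = -2 ** (k - 1) + 1, 2 ** (k - 1)
    s = np.abs(x).max()                                   # quantization range
    codes = np.clip(np.round(x * l_max / s), l_min, l_max)
    return codes.astype(np.int8), s

def rtn_dequantize(codes, s, k=4):
    """Map integer codes back to floating point."""
    l_max = 2 ** (k - 1)
    return codes.astype(np.float32) * s / l_max

rng = np.random.default_rng(0)
x = rng.standard_normal(1024).astype(np.float32)
codes, s = rtn_quantize(x)
x_hat = rtn_dequantize(codes, s)
# The per-element error is bounded by one quantization step, s / l_max.
```

For k = 4 the codes fit in the range [−7, 8], so two of them can be packed per byte in an actual deployment.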
There are two major directions for finding the best configuration in weight-only LLM quantization.
The first is to minimize the reconstruction error of the weight parameter (denoted as W), which is defined as

min_s ∥Q[W] − W∥².

Notice that in this case we only need access to the weight itself; therefore it is data-free. Beyond this, recent studies (Frantar et al., 2023a; Yao et al., 2022) propose to optimize the output error, defined as

Σ_{x∈D} ∥f(W; x) − f(Q[W]; x)∥²,

where D is a calibration set sampled from the original training data and f(W; x) denotes the layer output on input x. This regulation mimics the outputs of the original model directly, hence achieving more promising results than reconstruction-based methods.
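To make the two objectives concrete, the following toy numpy sketch computes both for a single linear layer f(W; x) = Wx (the layer size, calibration set, and weight perturbation are illustrative stand-ins, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)                # original weight
W_q = W + 0.01 * rng.standard_normal((64, 64)).astype(np.float32)   # stand-in for Q[W]

# Data-free objective: reconstruction error of the weight itself.
recon_err = float(np.sum((W_q - W) ** 2))

# Data-dependent objective: output error over a calibration set D.
D = [rng.standard_normal(64).astype(np.float32) for _ in range(8)]
output_err = float(sum(np.sum((W_q @ x - W @ x) ** 2) for x in D))
```

The reconstruction error needs only W, while the output error needs the calibration inputs D, which is exactly why the latter is data-dependent.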
Data-dependent calibration might weaken the generalization ability of LLMs However, the performance gain from using calibration data might jeopardize the generalization of the quantized model, because it brings the risk of overfitting to the calibration set. For example, both ZeroQuant and GPTQ change the original weights, by training or by OBS, in order to minimize the output error, so the distribution of the weight parameters might deviate from the original. Since the calibration data is usually sampled from a few specific domains, the performance of the calibrated model on other tasks may not be guaranteed.
Data-free quantization is challenging, but very important Although it is more challenging to use the reconstruction error as a regulation, because it can only optimize the quantized model indirectly, it remains a very important research direction: since data-free quantization uses no training data, the generalization ability of the model is inherently guaranteed. Therefore, in this paper, we aim to answer the following question: How can we efficiently recover the performance of the quantized model without using any input data?
In this work we propose EasyQuant, a fast data-free algorithm that significantly improves the performance of quantized LLMs, and, more importantly, even outperforms data-dependent quantization algorithms. Our experiments reveal that the performance gap of low-bit (e.g., 4-bit) quantized LLMs originates from two factors: 1. Setting the quantization range to the maximum absolute value of the weight induces a large reconstruction error for low-bit quantization.
2. The outliers in the weight matrix, which account for less than 0.1% of the parameters, have a very important influence on the model's performance.
In EasyQuant, we use quantization range optimization and outlier isolation to address these two challenges, and our results show that EasyQuant achieves a significant improvement over RTN.

Insight behind EasyQuant
As mentioned above, the weight outliers and quantization ranges are essential to the quantized model's performance. Below we present the supporting experiments in detail.

The quantization range can be efficiently optimized using gradient
Although the quantization operation itself is non-differentiable, the reconstruction error ∥Q[x] − x∥² is differentiable w.r.t. the quantization range s almost everywhere. We prove that the gradient w.r.t. s admits (see Section 4 for more details)

∂∥Q[x] − x∥² / ∂s = (2/s) Σ_i (Q[x_i] − x_i) · Q[x_i].   (2)

With this gradient, the reconstruction error can be quickly minimized within hundreds of steps (see Figure 2 for more details). This result indicates that by shrinking the quantization range, most of the parameters in the weight can be approximated more precisely. However, as shown in Figure 2, the performance of the quantized weight gets even worse as the reconstruction error decreases. This is a very counter-intuitive result. Through in-depth analysis, we realized that when decreasing the quantization range, more salient parameters outside the quantization range are clipped. Although most of the weights are approximated more precisely, as indicated by the decreased reconstruction error, the salient parameters are poorly represented. As the model performance drops severely in this case, we conclude that those outliers are far more important to the model's performance than the normal elements.
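The gradient in (2) can be checked numerically. The sketch below (our own illustration, assuming the symmetric RTN form of (1)) compares it against a centered finite difference at a point where no rounding boundary is crossed:

```python
import numpy as np

def quantize(x, s, k=4):
    """Symmetric k-bit quantization with range s, as in (1)."""
    l_min, l_max = -2 ** (k - 1) + 1, 2 ** (k - 1)
    return np.clip(np.round(x * l_max / s), l_min, l_max) * s / l_max

def recon_error(x, s):
    return np.sum((quantize(x, s) - x) ** 2)

def grad_s(x, s):
    """Analytic gradient from (2): (2/s) * sum((Q[x] - x) * Q[x])."""
    q = quantize(x, s)
    return 2.0 / s * np.sum((q - x) * q)

x = np.array([0.3, -0.7, 1.1, 2.0])
s = 2.0

# Centered finite difference: valid here because the rounded integer codes
# are locally constant in s, so Q[x] depends linearly on s around this point.
eps = 1e-6
fd = (recon_error(x, s + eps) - recon_error(x, s - eps)) / (2 * eps)
# fd and grad_s(x, s) agree closely.
```

Note that np.round uses round-half-to-even; the example values are chosen away from half-integer rounding boundaries, so this does not matter here.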

Outliers in weight are very important, but not sufficient
Before we further discuss the influence of those outliers, we first provide an (nσ) criterion for defining the outliers in weight. For any weight W, we say its (i, j)-th entry W_{i,j} is an (nσ) outlier if

|W_{i,j} − mean(W)| > n · √(var(W)),   (3)

where mean(W) and var(W) are the mean and variance of W. Now the question is: can we keep those outliers unchanged and straightforwardly compress the normal elements into lower bits? Unfortunately, our results suggest that excluding the outliers from quantization alone is not enough. As shown in Table 1, the performance gap still exists even when we keep 1% of the numbers in fp16. The problem is that if we keep too many numbers in fp16, the overhead of the dequantization kernel increases and results in decreased overall throughput.
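The (nσ) criterion in (3) amounts to a one-line mask. For a Gaussian-looking weight, n = 3 flags roughly 0.3% of the entries (a sketch with synthetic weights, not the paper's models):

```python
import numpy as np

def outlier_mask(W, n=3.0):
    """Boolean mask of (n-sigma) outliers: |W_ij - mean(W)| > n * sqrt(var(W))."""
    return np.abs(W - W.mean()) > n * W.std()

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)   # synthetic weight
mask = outlier_mask(W, n=3.0)
ratio = mask.mean()   # fraction flagged; ~0.27% for Gaussian entries at n = 3
```

Real LLM weight distributions are heavier-tailed than a Gaussian, which is why the paper tunes n per model to keep the outlier fraction below 1%.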

EasyQuant potentially improves the performance
As shown in Section 3.1 and Section 3.2, optimizing the quantization ranges alone reduces the reconstruction error, but the model's performance drops severely because of the clipped outliers. These key observations inspire the design of EasyQuant, in which we first isolate the outliers from quantization and then optimize the quantization range for the remaining elements.
As shown in the right part of Figure 2, with the outliers kept unquantized, the performance of the quantized model improves continuously as the reconstruction error decreases. This clearly shows that we can potentially improve the performance of quantized LLMs with this strategy.

Methodology
4.1 Derivation of the gradient in (2)

Suppose the original range s receives an infinitesimally small variation Δs. For small enough Δs, the rounded integer code r_i = ⌊x_i · l_max / s⌉ of each element stays unchanged, which means

Q_{s+Δs}[x_i] = ((s + Δs) / l_max) · r_i = Q_s[x_i] + (Δs / s) · Q_s[x_i].

(The same relation holds for clipped elements, whose quantized value s · l_min / l_max or s · l_max / l_max also scales linearly in s.) Therefore we get

∥Q_{s+Δs}[x] − x∥² − ∥Q_s[x] − x∥² = Σ_i 2 (Q_s[x_i] − x_i) · (Δs / s) · Q_s[x_i] + O(Δs²);

this leads to

∂∥Q_s[x] − x∥² / ∂s = (2/s) Σ_i (Q_s[x_i] − x_i) · Q_s[x_i].

This gives us the gradient in (2).

Algorithm description
In EasyQuant, for each weight W , we first select all (nσ) outliers (using (3)) and store its index I o (W ).
Afterward, for the normal elements, we optimize the per-channel quantization range using an optimizer (in our case we use Adam for example) with gradients defined in (2).The final quantized weight from EasyQuant can be formulated as where M ask o is a mask tensor defined as The detailed description of EasyQuant is in Algorithm 1.Here n refers to the hyper-parameter in the outlier criterion (nσ) as defined in (3) and baseline is the result from unquantized model.Notice that even with 10%(n = 1) numbers being held unquantized, there is still a large gap to the baseline.This means isolating the outliers is not enough to fully recover the accuracy of quantized models.

4: Optimize the quantization range s using optimizer A with the gradient defined in (2).
5: Quantize W according to (4), where Mask_o(W) is defined in (5).
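Putting the two components together, a simplified per-tensor sketch of the algorithm (our own illustration: plain gradient descent stands in for Adam, a single shared range stands in for the per-channel ranges, and the hyper-parameters are arbitrary) looks like:

```python
import numpy as np

def quantize(x, s, k=4):
    l_min, l_max = -2 ** (k - 1) + 1, 2 ** (k - 1)
    return np.clip(np.round(x * l_max / s), l_min, l_max) * s / l_max

def easyquant_sketch(W, n=3.0, k=4, steps=100, lr=1e-5):
    # Step 1: isolate (n-sigma) outliers; they stay in full precision.
    mask = np.abs(W - W.mean()) > n * W.std()
    normal = W[~mask]

    # Step 2: optimize the range s by gradient descent on the
    # reconstruction error of the normal values, using the gradient in (2).
    s = float(np.abs(normal).max())
    for _ in range(steps):
        q = quantize(normal, s, k)
        s -= lr * float(2.0 / s * np.sum((q - normal) * q))

    # Step 3: quantize the normal values and splice the outliers back, as in (4).
    W_hat = quantize(W, s, k)
    W_hat[mask] = W[mask]
    return W_hat, s, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
W_hat, s, mask = easyquant_sketch(W)

# Compare against plain RTN over the full weight.
err = np.sum((W_hat - W) ** 2)
err_rtn = np.sum((quantize(W, float(np.abs(W).max()), 4) - W) ** 2)
```

Because the optimized range is fitted to the normal values only, the reconstruction error is smaller than RTN's even before the gradient steps; the optimization then shrinks it further.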

Experiment
Baselines: We compare EasyQuant with several baselines in the INT4 quantization setting below: • RTN: The model's weights are naively quantized according to (1).
• ZeroQuant: The algorithm proposed in Yao et al. (2022). The authors treat each layer as a small neural network and use the original layer as the teacher model to distill the quantized one. This is equivalent to minimizing Σ_{x∈D} ∥f(W_T; x) − f(W_S; x)∥², where x are the input activations, W_T is the weight of the original model, and W_S is that of the quantized model.
• GPTQ: The algorithm proposed in Frantar et al. (2023a). The authors use the same objective function Σ_{x∈D} ∥f(W_T; x) − f(W_S; x)∥² as ZeroQuant, but utilize OBS to minimize the loss function instead of a gradient-based optimizer.
Experiment Setup. For all models, we set the outlier threshold n ∈ [2.5, 3] in order to ensure that the outliers account for less than 1% of all numbers. For BLOOM and LLAMA, we use n = 3. When optimizing the quantization ranges, we use Adam as the optimizer and set the learning rate to 1e−3 for BLOOM and 1e−4 for LLAMA. We take the quantization ranges from step 100 for BLOOM and step 500 for LLAMA. We use symmetric quantization, since the normal values are symmetrically distributed once the outliers are excluded. For a fair comparison, we use per-channel weight quantization in all algorithms (which means each column shares one common quantization range).
Implementation. Since each weight can be quantized independently, we run EasyQuant in parallel on 8 A100 GPUs and finish the quantization in 1-10 minutes for all models. We store the indices and values of all outliers together with the quantized normal values. Our dequantization kernel is built with CUDA.
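The storage layout described above can be sketched as follows (a CPU stand-in for the CUDA kernel; names and formats are ours): quantized codes for every position, plus a sparse list of outlier indices and full-precision values that are scattered back during dequantization.

```python
import numpy as np

def pack_weight(W, n=3.0, k=4):
    """Quantized codes for every position + sparse (index, value) outlier list."""
    l_min, l_max = -2 ** (k - 1) + 1, 2 ** (k - 1)
    mask = np.abs(W - W.mean()) > n * W.std()
    s = float(np.abs(W[~mask]).max())            # range fitted to normal values
    codes = np.clip(np.round(W * l_max / s), l_min, l_max).astype(np.int8)
    out_idx = np.flatnonzero(mask)               # flat indices of outliers
    out_val = W.ravel()[out_idx].copy()          # kept in full precision
    return codes, s, out_idx, out_val

def unpack_weight(codes, s, out_idx, out_val, k=4):
    """Dequantize, then overwrite outlier positions with their exact values."""
    l_max = 2 ** (k - 1)
    W_hat = codes.astype(np.float32) * (s / l_max)
    W_hat.ravel()[out_idx] = out_val
    return W_hat

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128)).astype(np.float32)
codes, s, out_idx, out_val = pack_weight(W)
W_hat = unpack_weight(codes, s, out_idx, out_val)
```

For simplicity the int4 codes are stored one per int8 here; a real kernel would pack two codes per byte and fuse the outlier scatter into the dequantization pass.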

Experiment Analysis
We focus our study on LLMs by quantizing the entire BLOOM and LLAMA model families to 4-bit.
Perplexity-based tasks. We first study perplexity-based tasks. On LLaMA models, Table 2 shows that EasyQuant outperforms GPTQ in most cases. For LLaMA-65B, GPTQ drops 4.21 points on PTB, performing worse than the 9× smaller full-precision 7B model, while EasyQuant still performs well on this task. On the other tasks, EasyQuant loses only 0.4-0.7 points. BLOOM shows a similar pattern.

Practical Latency. We evaluate the overhead of EasyQuant by comparing the latency of outlier isolation, int4 dequantization, and matrix multiplication with batch size 1 and sequence length 1024 on a single A100 GPU. The matrix size is 14336 × 57344, the same as the first FFN layer in 176B BLOOM. For outlier isolation, we test the latency at 6 outlier ratios (the fraction of outliers within the weight): 0.01%, 0.10%, 0.50%, 1%, 5%, and 10%. The matrix multiplication takes 83 ms and dequantization takes 5 ms. Therefore, from Table 3 we can see that recovering the outliers in the weight brings almost no overhead to the overall latency.
Ablation study. To understand the effect of unstructured outliers, we show the perplexity results of EasyQuant without outlier isolation or without quantization range optimization. As discussed in Section 3, both strategies have a very important influence on the final model performance.
We further conduct experiments to verify whether the performance gain mainly comes from outlier isolation. Outlier isolation is indeed a very important component of EasyQuant, but it is still not enough to fully recover the performance loss from quantization: keeping even 10% of the weights as fp16 outliers still admits about an 8% ppl increase, while EasyQuant admits only a 1% ppl increase. Below we present the results of 4-bit quantized BLOOM-7B when we just keep 1% outliers in fp16, without quantization range optimization, on various benchmarks.

Outlier influence. Outlier isolation is a key component of EasyQuant, but it only imposes an indirect influence on the model accuracy. The interesting phenomenon we find is that the outliers behave like a gating mechanism: without outlier isolation, the model achieves much worse performance under a small reconstruction error; however, when those outliers are kept in fp16, the quantized LLM attains a continuously decreasing ppl as the reconstruction error shrinks. Moreover, we have also conducted a complementary experiment testing the direct influence of the weight outliers: we prune 1% of the values in the weights (according to their magnitude) to 0 and report the ppl results (as shown in Table 6).

Outlier distribution. We also explore the outlier distribution across different modules and layers. The fraction of outliers shows different patterns in different modules and layers (as shown in Tables 7 and 8). FFN.2 has a significantly higher fraction of outliers. However, there is no clear pattern along the layer index.

Related Work
Model Quantization Traditional model quantization algorithms mainly focus on the case where both parameters and activations of the model are quantized (Lin et al., 2015; Hubara et al., 2016; Tailor et al., 2021; Ni et al., 2020). However, directly quantizing the model greatly decreases its accuracy, and one important technique for improving performance is Quantization-Aware Training (QAT) (Jacob et al., 2018), which simulates the quantization procedure during training to further improve the accuracy of the quantized model. For Transformer-based models, the boundary of the compression level has been continuously advanced, for example by the 8-bit quantized Transformers of FullyQT (Prato et al., 2019).

Activation and weight quantization In this setting, both activations and weights are quantized into lower bits, and the major obstacle is the outliers in the activations. LLM.int8() (Dettmers et al., 2022) addresses this problem by isolating those outliers in fp16/bf16. However, such an implementation leads to large latency overhead and is even slower than fp16 inference. Recent studies (Wei et al., 2023; Xiao et al., 2023) found that the outliers only exist in certain channels, and use the LayerNorm weights (Wei et al., 2023) or calibrated scales (Xiao et al., 2023) to smooth those channels. Xiao et al. (2023) has already shown that we can achieve almost lossless W8A8 quantized LLMs using a few calibration data, without manipulating the original model weights.

Conclusion and Limitations
In this paper, we propose a fast data-free weight-only quantization algorithm for LLMs, namely EasyQuant, that improves the quantized model's performance without using any training data. Our analysis reveals the intrinsic origins of the performance loss when quantizing model weights into lower bits. We show that by isolating the outliers from quantization, the accuracy of the quantized LLM increases accordingly with decreasing reconstruction error. Our experiments show that EasyQuant significantly outperforms RTN in a data-free setting, and also behaves better than data-dependent algorithms. EasyQuant can finish the quantization of a 176B-parameter model within 10 minutes, and the overhead of dequantization in EasyQuant is negligible. However, we also point out some limitations of our work. The outlier recovery functionality in EasyQuant requires extra CUDA kernels to implement. Moreover, weight-only quantization can only reduce the memory footprint, without any reduction in computational cost, hence the latency of our model cannot be minimized. In addition, outlier isolation makes weight-activation quantization more challenging, because the weight contains numbers of different precisions. We have also noticed that EasyQuant cannot outperform the data-dependent methods in all tasks, which motivates us to investigate more effective algorithms in future studies.

Figure 1:
Pipeline of EasyQuant. We first find all the outliers in the weight and keep them in full precision (fp32/fp16/bf16). Afterward, we optimize the quantization range (denoted as q_range) in order to approximate the normal values more precisely. In the end, the normal values are quantized into lower bits (denoted as Q[·]) with the optimized quantization ranges, while the outliers are kept unchanged in the weight.

Figure 2:
A smaller reconstruction error cannot guarantee better model performance. Straightforwardly shrinking the quantization ranges clips most of the outliers to very small values, hence the perplexity increases severely, since those outliers are critical for preserving the model's performance. However, when those outliers are kept unquantized, the quantized model achieves better performance as the reconstruction error decreases continuously. This result clearly suggests that the outliers are more important than the normal values in the weight, and that optimizing the quantization ranges using the gradient defined in (2) can significantly increase the accuracy of quantized models. More details about the experiment can be found in Section 5.

Table 1:
Isolating the weight outliers from quantization increases the model's performance.

Table 3:
Overhead of outlier isolation on an A100.

Table 4:
Using outlier isolation alone is not enough to fully recover the performance loss; EasyQuant consistently outperforms it on all benchmarks.

Table 6:
ppl results after pruning 1% of the weights at different magnitudes.

Table 9:
The quantization range at different optimization steps. Here we take the quantization range of the Att.qkv module in layer 1 as an example.