LLM-FP4: 4-Bit Floating-Point Quantized Transformers

We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values, in a post-training manner. Existing post-training quantization (PTQ) solutions are primarily integer-based and struggle with bit widths below 8 bits. Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions, and it has emerged as a default choice in many hardware platforms. One characteristic of FP quantization is that its performance largely depends on the choice of exponent bits and clipping range. In this regard, we construct a strong FP-PTQ baseline by searching for the optimal quantization parameters. Furthermore, we observe a pattern of high inter-channel variance and low intra-channel variance in activation distributions, which adds to the difficulty of activation quantization. We recognize this pattern to be consistent across a spectrum of transformer models designed for diverse tasks, such as LLMs, BERT, and Vision Transformer models. To tackle this, we propose per-channel activation quantization and show that these additional scaling factors can be reparameterized as exponential biases of weights, incurring a negligible cost. Our method, for the first time, can quantize both weights and activations in LLaMA-13B to only 4 bits and achieves an average score of 63.1 on the common sense zero-shot reasoning tasks, which is only 5.8 lower than the full-precision model, significantly outperforming the previous state-of-the-art by 12.7 points. Code is available at: https://github.com/nbasyl/LLM-FP4.


Introduction
Since the introduction of the transformer architecture (Vaswani et al., 2017), transformers have superseded recurrent neural networks, emerging as the dominant architecture in numerous natural language processing (NLP) tasks (Kenton and Toutanova, 2019; Lewis et al., 2020). The transformative impact of the transformer has been further propelled by the emergence of models like GPT (Brown et al., 2020; OpenAI, 2023), catapulting the popularity of this architecture to new heights. Meanwhile, the versatility of transformers extends beyond NLP, encompassing diverse domains such as vision (Dosovitskiy et al.; Touvron et al., 2021), audio (Akbari et al., 2021), etc. This trend towards a unified architecture for different modalities represents a groundbreaking development within the realm of deep learning.
However, the advancements in transformer performance are accompanied by a corresponding increase in model size and computational costs (Kaplan et al., 2020). This poses significant challenges when attempting to leverage the full potential of transformer models in use cases where memory or computational resources are limited. Despite the extensive research and widespread adoption of transformers, the field of transformer compression remains relatively underexplored. To address this gap, our study focuses on the compression of transformers, especially through floating-point post-training quantization techniques.
Post-training quantization (PTQ) offers the advantage of being simple to use, with minimal fine-tuning requirements (Nagel et al., 2020; Cai et al., 2020). Existing PTQ solutions for transformers primarily focus on integer (INT) quantization (Liu et al., 2021; Yuan et al., 2022), which can be effective in certain scenarios but often breaks down when bit widths are below 8 bits. On the other hand, floating-point (FP) quantization has gained significant traction as a more flexible alternative, capable of better accommodating various activation and weight distributions. In fact, FP8 has emerged as the default choice in various hardware platforms, including the NVIDIA H100.
Different from integer (INT) quantization, a particular challenge in floating-point (FP) quantization is how to select appropriate exponent bits and scale parameters. Improper parameter choices can lead to subpar or divergent quantization results. To tackle this challenge, we introduce a robust recipe for FP quantization, which leverages layer-wise reconstruction to jointly search for optimal exponent bits and maximum values. Compared to previous approaches that utilize gradient updates for exponent bits (Kuzmin et al., 2022), our search-based method proves to be more stable and consistently delivers desirable quantization results, establishing a strong baseline for FP-PTQ.
Furthermore, our investigation uncovers an intriguing pattern in the activation distributions of transformers, characterized by high inter-channel variance and low intra-channel variance. Similar patterns have been observed in previous works (Xiao et al., 2022; Dettmers et al., 2022), but we argue that this pattern is inherent to transformer architectures and not limited to specific tasks, as we observe consistent patterns not only in large language models but also in BERT and even vision transformers. Motivated by these findings, we introduce a novel pre-shifted exponent bias for FP quantization of transformers. Concretely, we leverage the per-channel activation variance computed from calibration data and reparameterize these scales as the exponent biases of the corresponding FP quantized weight vectors. This approach effectively addresses the challenge posed by high inter-channel variance while incurring a negligible computational cost.
In summary, we study floating-point post-training quantization (PTQ) for transformer architectures, and the contributions of this paper include:
• We propose a search-based framework for determining the optimal exponent bias and maximal quantization value. This method outperforms existing techniques in terms of stability and performance, establishing a strong baseline for floating-point post-training quantization.
• We propose a novel technique, pre-shifted exponent bias, which effectively addresses the challenge of high inter-channel variance in transformers with negligible computational overhead.
• Experimental results demonstrate that the proposed method yields the first usable FP4 weight- and activation-quantized LLaMA-13B model, with a mere 5.8-point degradation on zero-shot reasoning tasks against the full-precision model, reducing the gap by ∼70% compared to the previous SoTA.
• We further extend our method to BERT and vision transformers. It surpasses the previous best 4-bit quantized BERT by 7.8 points on the GLUE benchmark and achieves 31.4 points higher accuracy compared to the previous SoTA ViT quantization method for 4-bit DeiT-S on the ImageNet dataset.

Post-Training Quantization
Model quantization can be mainly categorized into quantization-aware training (QAT) and post-training quantization (PTQ), depending on whether it involves additional training for weight fine-tuning or not. Most PTQ studies primarily focus on convolutional neural networks (CNNs) (Nagel et al., 2020; Li et al., 2021; Wu et al., 2020; Cai et al., 2020; Nagel et al., 2019). However, with the growing popularity of transformer-based models, only a limited number of works (Bondarenko et al., 2021; Yuan et al., 2022; Ding et al., 2022) have attempted to realize PTQ on transformers. Moreover, the existing works primarily focus on vision transformer models and exhibit inferior performance when the bit width is below 8 bits. Therefore, in this work, we delve into the challenges of low-bit PTQ for language transformers.

Floating-Point Quantization
Floating-point (FP) quantization has emerged as a promising alternative to integer quantization due to its ability to handle long-tail distributions and its increased flexibility (Kuzmin et al., 2022). Additionally, modern GPUs such as the H100 (Micikevicius et al., 2022) now support FP quantization. Nonetheless, minimal research has been conducted on FP quantization. Only Kuzmin et al. (2022) propose a general FP8 quantization scheme, primarily for vision tasks, and Zhang et al. (2023) adopt a mixture of FP and INT formats for quantizing LLMs. In this work, we propose the FPQ baseline as a general guideline for low-bit floating-point PTQ to compress language transformer models.

Formulation of Floating-Point Variables
A standard floating-point number is represented as:

X_FP = (−1)^s · 2^(p−b) · (1 + d_1/2 + d_2/2^2 + ⋯ + d_m/2^m)   (1)

where s ∈ {0, 1} is the sign bit, d_i ∈ {0, 1} is the i-th mantissa bit, and m denotes the number of mantissa bits. p is an integer in [0, 2^e − 1], where e denotes the number of exponent bits, and b is an integer exponent bias. A floating-point format with j exponent bits and k mantissa bits is denoted EjMk.

Figure 1: An illustration of the floating-point (FP) quantization process using the positive axis of FP5 (E2M2). The real-valued clipped X''_R in Eq. 5 is rescaled by the real-valued scaling factor α̃. Then, the quantization step size v is determined by the range [2^p, 2^(p+1)) in which X''_R/α̃ falls (Eq. 9). Here, p ∈ {0, 1, ..., 2^e − 1} is the exponent value. Lastly, X can be quantized to low-bit floating-point values simply by X_FP = α̃ · v · ⌊X''_R/(α̃ · v)⌉ (Eq. 8).
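To make the representation in Eq. 1 concrete, the short sketch below enumerates every positive value of a small FP format by plugging bit patterns into the formula. The function name `fp_decode` and the E2M2/bias-1 example are illustrative choices, not part of the paper:

```python
def fp_decode(s, p, mantissa_bits, b):
    """Decode a value per Eq. 1: (-1)^s * 2^(p-b) * (1 + sum_i d_i * 2^-i)."""
    mantissa = 1.0 + sum(d / 2 ** (i + 1) for i, d in enumerate(mantissa_bits))
    return (-1) ** s * 2.0 ** (p - b) * mantissa

# Enumerate all positive values of an E2M2 format (e=2, m=2) with exponent bias b=1:
values = sorted(
    fp_decode(0, p, [d1, d2], b=1)
    for p in range(4)          # p in [0, 2^e - 1]
    for d1 in (0, 1)
    for d2 in (0, 1)
)
# largest magnitude: 2^(3-1) * (2 - 2^-2) = 7.0, matching the Q_max formula in Eq. 4
```

Note how the 16 positive code points cluster densely near zero and spread out at larger magnitudes, which is exactly why FP formats suit bell-shaped distributions.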

Floating-Point Quantization Process
In integer quantization, the real-valued variable X_R is quantized to an integer X_INT with the following formula:

X_INT = α · ⌊Clip(X_R / α, Q_min, Q_max)⌉   (2)

where ⌊·⌉ is the rounding function, X_R is the real-valued variable, α represents the full-precision scaling factor, and Q_min, Q_max are the min/max values of the quantization range. Similarly, a real-valued variable X_R can be converted to floating-point X_FP in two steps.
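As a point of reference for Eq. 2, a minimal numpy sketch of the integer fake-quantizer (the function name `int_quantize` is ours):

```python
import numpy as np

def int_quantize(x, alpha, q_min, q_max):
    """Eq. 2: uniform integer fake-quantization with scaling factor alpha."""
    return alpha * np.round(np.clip(x / alpha, q_min, q_max))
```

For a signed 4-bit range [-7, 7] and alpha = 0.1, the value 0.26 maps to level 3 (i.e., 0.3), while -5.0 saturates at level -7 (i.e., -0.7).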
(1) Scale and clip. In FP quantization, we also scale and clip the real-valued variable before quantization:

X'_R = Clip(X_R, Q_min, Q_max)   (3)

where the min/max value range of signed floating-point quantization can be calculated from Eq. 1:

Q_max = −Q_min = 2^(2^e − b − 1) × (2 − 2^(−m))   (4)

Here the integer exponent bias b is another adjustable hyperparameter controlling Q_max and Q_min, with functionality similar to α. Therefore, for simplicity, we reformulate Eq. 3 as:

X''_R = Clip(X_R, Q̃_min, Q̃_max)   (5)

where

Q̃_max = −Q̃_min = α̃ · 2^(2^e − 1) × (2 − 2^(−m))   (6)

Note that we combine the tensor-wise real-valued scaling factor α with the integer exponent bias b to form a new scaling factor α̃ = 2^(−b̃) = 2^(−b) · α. Here b̃ denotes a relaxed tensor-wise real-valued exponent bias, and we can derive b̃ from the desired clipping value Q̃_max via Eq. 6 as:

b̃ = 2^e − log₂(Q̃_max) + log₂(2 − 2^(−m)) − 1   (7)

(2) Compare and quantize. Different from integer quantization, which simply applies the rounding function to convert real-valued variables to quantized ones, floating-point quantization involves an additional step of comparing X''_R with the quantization levels before quantizing:

X_FP = α̃ · v · ⌊X''_R / (α̃ · v)⌉   (8)

where X''_R is the clipped real-valued variable (Eq. 5), α̃ is the tensor-wise floating-point scaling factor, and v is an integer power of 2.
Here we select the quantization level v according to the magnitude of X''_R / α̃ = X''_R · 2^(b̃):

v = 2^(⌊log₂|X''_R| + b̃⌋ − m)  if ⌊log₂|X''_R| + b̃⌋ ≥ 1,  otherwise 2^(1 − m)   (9)

Then the floating-point quantized variables can be derived with Eq. 8. The quantization process is illustrated in Fig. 1; a detailed explanation can also be found in (Micikevicius et al., 2022).
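The two-step procedure (scale-and-clip, then compare-and-quantize) can be sketched as a numpy fake-quantizer. This is a simplified stand-in, not the paper's implementation: the function name `fp_quantize` is ours, and values below the first binade are folded into the lowest step size rather than modeled as IEEE-style subnormals:

```python
import numpy as np

def fp_quantize(x, e, m, b_tilde):
    """Simulated FP quantization in the spirit of Eqs. 5-9.
    b_tilde is the relaxed real-valued exponent bias; alpha~ = 2^(-b_tilde)."""
    alpha = 2.0 ** (-b_tilde)                              # tensor-wise scaling factor
    q_max = alpha * 2 ** (2 ** e - 1) * (2 - 2 ** (-m))    # clipping range (Eq. 6)
    x_c = np.clip(x, -q_max, q_max)                        # scale-and-clip (Eq. 5)
    scaled = np.abs(x_c) / alpha
    # pick the power-of-two step size v from the binade [2^p, 2^(p+1)) (Eq. 9)
    p = np.clip(np.floor(np.log2(np.maximum(scaled, 2.0 ** -126))), 0, 2 ** e - 1)
    v = 2.0 ** (p - m)
    return alpha * v * np.round(x_c / (alpha * v))         # compare-and-quantize (Eq. 8)
```

With e=2, m=2, and b_tilde=0, the representable magnitudes are {2^p · (1 + f/4)} for p in [0, 3], so 6.9 snaps to 7.0 and anything above Q̃_max = 14 saturates; re-quantizing a quantized tensor is a no-op.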

Floating-Point Matrix Multiplication
With the floating-point quantized variables, the matrix multiplication is formulated as:

O_out^(i,k) = X_FP^(i,:) · W_FP^(:,k) = α̃_X · α̃_W^k · (X̂_FP^(i,:) · Ŵ_FP^(:,k))   (10)

Here, in per-tensor activation quantization and per-channel weight quantization, X_FP^(i,:) denotes the i-th row of the activation matrix and W_FP^(:,k) denotes the k-th column of the weight matrix, such that each element O_out^(i,k) of the output matrix is computed as the product of two real-valued scalars, α̃_X and α̃_W^k, times the corresponding quantized activation and weight vectors. We depict all the possible quantization granularity options that support such efficient matrix multiplication in Appendix D.
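The key property of Eq. 10 is that a per-tensor activation scale and a per-output-channel weight scale both factor out of the inner product, so the low-bit matmul runs first and the real-valued scalars are applied afterwards. A toy numpy check (shapes, seeds, and scale values are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
X_hat = rng.integers(-7, 8, size=(4, 8)).astype(np.float64)  # stand-in quantized activations
W_hat = rng.integers(-7, 8, size=(8, 3)).astype(np.float64)  # stand-in quantized weights
alpha_x = 0.05                                               # tensor-wise activation scale
alpha_w = np.array([0.1, 0.2, 0.4])                          # per-output-channel weight scales

# Eq. 10: scales factor out of the inner product
O = (X_hat @ W_hat) * alpha_x * alpha_w[None, :]

# Reference: dequantize first, then multiply
O_ref = (alpha_x * X_hat) @ (W_hat * alpha_w[None, :])
assert np.allclose(O, O_ref)
```

Per-channel scales on the *input* channels of the activations would not factor out this way, which is precisely the obstacle the pre-shifted exponent bias addresses later.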

Method
In this section, we begin by introducing our joint format and max value search, which establishes our strong baseline and already achieves state-of-the-art results at 8-bit and 6-bit quantization. Then we present an efficient pre-shifted exponent bias to tackle the catastrophic high inter-channel activation variance in transformer models and push the quantization limit to 4-bit.

Joint Format and Max Value Search
The objective of post-training quantization is to minimize the perturbation (δX = X_FP − X_R) introduced by quantization to the pre-trained real-valued network:

min E[L(X_R + δX) − L(X_R)]   (11)

In this study, we adopt the setting presented in (Choukroun et al., 2019; Wu et al., 2020), which assumes a positive correlation between the change in the intermediate output of the quantized model and Eq. 11. Therefore, minimizing the distance between the intermediate output of the quantized layer (Ô) and the output of the original layer (O) leads to minimizing Eq. 11. Hence, the objective loss metric is formulated as:

min ‖Ô − O‖²   (12)

which is used to search for the optimal FP quantization function in the proposed framework below.
The challenges in FP quantization arise from its sensitivity to the quantization format and clipping range. An undesirable format selection will result in a catastrophic error rate. In addition, we observe that the optimal clipping range varies depending on the format used. Previous work (Kuzmin et al., 2022) on floating-point (FP) quantization-aware training (QAT) proposed to learn both the FP format and maximum value with gradients. However, we find this method suffers from over-fitting in PTQ, with accuracy even worse than the naïve MinMax method; details can be found in Appendix E. Instead, we propose a search-based algorithm that jointly determines the optimal format and its associated clipping range to address this challenge.
The search is conducted layer by layer with the metric of minimizing Eq. 12. The output of the matrix multiplication corresponding to each sub-module is denoted as O = XY, where Y can be either a weight tensor W or another activation tensor.
The search space of the q-bit FP format includes all formats except the one with zero exponent bits, as the quantization of the format with an exponent bit equal to 0 already degenerates to INT quantization. We search for the real-valued exponent bias b̃, which equals the logarithm of the scaling factor. We initialize b̃_X and b̃_Y from Eq. 7, with Q̃_max equal to the maximum value of |X_R| and |Y_R|, respectively. We then define the search spaces of b̃_X and b̃_Y by linearly dividing [γ₁ · b̃_X^init, γ₂ · b̃_X^init] and [γ₁ · b̃_Y^init, γ₂ · b̃_Y^init] into k intervals, where γ₁ and γ₂ are empirically set to 0.01 and 1.2, and k = 100.
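The construction of the bias search space can be sketched in a few lines; the helper name `bias_search_space` is ours, and the initialization follows Eq. 7 with Q̃_max set to the tensor's absolute maximum:

```python
import numpy as np

def bias_search_space(x_abs_max, e, m, gamma1=0.01, gamma2=1.2, k=100):
    """Candidate relaxed exponent biases: initialize b~ from Eq. 7 with
    Q_max = max|X_R|, then linearly divide [gamma1*b_init, gamma2*b_init]
    into k points."""
    b_init = 2 ** e - np.log2(x_abs_max) + np.log2(2 - 2 ** (-m)) - 1  # Eq. 7
    return np.linspace(gamma1 * b_init, gamma2 * b_init, k)
```

Each candidate bias corresponds to a candidate clipping range, so sweeping this grid is equivalent to sweeping clipping thresholds around (and slightly beyond) the MinMax initialization.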
The search process is outlined in Alg. 1. We search the quantization scheme of all matrix multiplication layers in parallel, following (Yuan et al., 2022; Bai et al., 2022). The algorithm can be divided into two parts. (1) Do forward propagation to store the intermediate raw output of each layer l.
(2) Iteratively update the optimal format and biases for each layer for three rounds by minimizing the reconstruction metric (Eq. 12). We name this search-based framework the Floating-Point Quantization baseline (FPQ baseline); it already achieves state-of-the-art results in both the 8-bit and 6-bit settings.
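The alternating search can be sketched as a coordinate descent over (format, bias) for both operands of one layer. This is a self-contained, simplified illustration, not the paper's implementation: `quantize` is a minimal stand-in for Eqs. 5-9, and `search_layer` mirrors the structure of Alg. 1 for a single layer:

```python
import numpy as np

def quantize(t, e, m, b):
    """Minimal FP fake-quantizer in the spirit of Eqs. 5-9."""
    a = 2.0 ** (-b)
    qmax = a * 2 ** (2 ** e - 1) * (2 - 2 ** (-m))
    tc = np.clip(t, -qmax, qmax)
    p = np.clip(np.floor(np.log2(np.maximum(np.abs(tc) / a, 1e-30))), 0, 2 ** e - 1)
    v = 2.0 ** (p - m)
    return a * v * np.round(tc / (a * v))

def search_layer(x, w, formats, bias_grid, rounds=3):
    """Coordinate-descent sketch of Alg. 1 for one layer O = XW: alternately
    refine each operand's exponent bias and FP format by minimizing the
    reconstruction error (Eq. 12) against the cached raw output."""
    raw = x @ w                                       # part (1): cache raw output
    state = {"fx": formats[0], "bx": bias_grid[0],
             "fw": formats[0], "bw": bias_grid[0]}
    def loss(s):
        xq = quantize(x, *s["fx"], s["bx"])
        wq = quantize(w, *s["fw"], s["bw"])
        return np.sum((xq @ wq - raw) ** 2)           # Eq. 12
    for _ in range(rounds):                           # part (2): alternating updates
        for key, grid in (("bx", bias_grid), ("fx", formats),
                          ("bw", bias_grid), ("fw", formats)):
            state[key] = min(grid, key=lambda c: loss({**state, key: c}))
    return state, loss(state)
```

Because the current value of each coordinate is always in its candidate grid, every update is non-increasing in the reconstruction loss, which is why the search is stable where gradient-based format learning is not.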

Pre-Shifted Exponent Bias
In transformer architectures, we observed an intriguing phenomenon of high inter-channel variance. As shown in Fig. 2, the magnitudes of values within the same channel are close to each other but exhibit significant differences across channels. This phenomenon is not only observed in language models (i.e., LLaMA and BERT) but is also significant in vision transformer models. Since outlier channels are often orders of magnitude larger than the rest, they dominate the quantization precision of the quantized tensor, leaving less representation capacity for channels with smaller magnitudes (Xiao et al., 2022). This makes tensor-wise or token-wise scaling factors insufficient for accurate activation quantization.
Algorithm 1 FPQ baseline
1: Input: calibration dataset, full-precision model M, quantization format search space R_X (e.g., R_X = {E3M0, E2M1, E1M2} for FP4), number of rounds n = 3
2: Output: FP-q quantized model
3: for l in 1st to L-th layer in M do
4:   Forward & collect raw output O^l = X^l Y^l of layer l
5: end for
6: for l in 1st to L-th layer in M do
7:   Initialize the FP format search spaces w.r.t. X^l and Y^l as R_X = {r_X^1, ..., r_X^t} and R_Y = {r_Y^1, ..., r_Y^t}
8:   Generate the search spaces of b̃_X and b̃_Y for each of the t formats
9:   for 0 to n do
10:     Search for b̃_X^i w.r.t. each r_X^i that minimizes Eq. 12
11:     Search for r_X^i ∈ R_X that minimizes Eq. 12
12:     Search for b̃_Y^i w.r.t. each r_Y^i that minimizes Eq. 12
13:     Search for r_Y^i ∈ R_Y that minimizes Eq. 12
14:   end for
15: end for

However, applying per-channel scaling factors to activations poses challenges for efficient matrix multiplication, because the scaling factor is no longer a shared constant along the multiplication direction and cannot be extracted as in Eq. 10. To address this challenge, we introduce the pre-shifted exponent bias, which allows us to calculate per-channel scaling factors from activations. These scaling factors are then re-parameterized as the exponent biases of the corresponding weights. This method effectively handles high inter-channel variance while maintaining nearly identical efficiency to per-tensor quantization.
Recalling Eq. 7, we extracted the tensor-wise integer exponent bias b and multiplied it with the real-valued scaling factor α to form a new scaling factor α̃ = 2^(−b̃) = 2^(−b) · α. The floating-point quantization formula in Eq. 1 then becomes:

X_FP = α̃ · (−1)^s · 2^(p − b_ori) · (1 + d_1/2 + d_2/2^2 + ⋯ + d_m/2^m)   (13)

We note that after the bias is absorbed into the scaling factor, the original bias term (b_ori) in the FP formula is always zero. In dealing with the inter-channel variance, we devise an innovative usage of this integer exponent bias: we set it to be a per-channel variant (b_ori ∈ Z^c).
Then the calculation of the channel-wise integer bias vector (b_ori) is straightforward. We first calculate the initial per-channel real-valued scaling factor (2^(−b̃_j)) from the per-channel maximum values:

b̃_j = 2^e − log₂(max_i |X_R^(i,j)|) + log₂(2 − 2^(−m)) − 1   (14)

and then separate b̃_j into a tensor-wise real-valued exponent bias ρ plus a channel-wise integer exponent bias:

b̃_j = ρ + b_ori^j,  with  b_ori^j = Clip(⌊b̃_j − ρ⌉, 0, 2^e − 1)   (15)

where ρ ∈ R^1 and b_ori ∈ Z^c. Then the formula for an entry in the j-th channel of X can be rewritten as:

X_FP^j = 2^(−b̃_j) · (−1)^s · 2^p · (1 + d_1/2 + ⋯ + d_m/2^m) = 2^(−ρ) · 2^(−b_ori^j) · (−1)^s · 2^p · (1 + d_1/2 + ⋯ + d_m/2^m)   (16)

Note that the bias b_ori is constrained to integers within [0, 2^e − 1], compatible with standard floating-point number calculation. Nevertheless, adding a different bias for each channel during inference may still incur extra hardware operations. Thus, we re-parameterize the per-channel activation bias into the weight tensor and pre-compute the weights using the calibration set. This way, the exponent bias shifting only happens in the calibration stage. Then, an element in the j-th channel of the activation tensor X becomes:

X_FP^j = 2^(−ρ) · (−1)^s · 2^p · (1 + d_1/2 + ⋯ + d_m/2^m)   (17)

and the corresponding weight element in the j-th row of the weight tensor W becomes:

W_FP^j = 2^(−b_ori^j) · α̃_W · (−1)^s · 2^p · (1 + d_1/2 + ⋯ + d_m/2^m)   (18)

As a result, the efficient matrix multiplication in Eq. 10 is reformulated as:

O_out^(i,k) = X_FP^(i,:) · W_FP^(:,k) = α̃_X · α̃_W^k · (X̂_FP^(i,:) · (β ⊙ Ŵ_FP^(:,k)))   (19)

where ⊙ is element-wise multiplication, β = 2^(−b_ori), and (β ⊙ Ŵ_FP^(:,k)) can be pre-calculated and stored in low-bit FP format. We depict the overall pre-shifted exponent bias method in Fig. 3. This method applies to quantizing all the fully-connected layers. During the search process, we initialize ρ_X as min_j(b̃_j). We then fix b̃_X to the bias calculated from Eq. 14 and search for the optimal ρ_X in [γ₁ · ρ_X^init, γ₂ · ρ_X^init]. Combining the pre-shifted exponent bias method with the joint format and max-value search framework (FPQ baseline), we name our method FPQ, short for Floating-Point Quantization.

Experiments
To validate the effectiveness of the proposed method, we conduct experiments on LLaMA (Touvron et al., 2023) and BERT (Devlin et al., 2019) models in Sections 5.2.1 and 5.2.2. Further, in Section 5.2.3 we show that our method also generalizes well to vision transformer architectures. We present ablation studies on the calibration size and search range in Section 5.3, and analyze the hardware costs of implementing FP operators in Section 5.4.

Experiments Details
We adopt per-tensor quantization for activations and per-channel quantization for weights. We employ layer reconstruction following the settings of (Yuan et al., 2022; Nagel et al., 2020), and parallel quantization based on the approach outlined in (Bai et al., 2022; Yuan et al., 2022). A more detailed discussion of our implementation decisions can be found in Appendix F. For LLaMA models, we quantize all the weight and activation tensors in fully-connected layers for a fair comparison with previous work (Xiao et al., 2022; Liu et al., 2023). For BERT and ViT models, both fully-connected layers and activation-activation multiplication tensors in the self-attention module are quantized. Note that for FPQ on BERT (Devlin et al., 2019) and ViT models, the reconstruction metric in Eq. 12 is substituted with a Hessian approximation loss metric, detailed further in Appendix A.

LLaMA-7B
In general, all methods except the naïve MinMax INT quantization produce comparable outcomes in the 8-bit setting on both LLaMA-7B and LLaMA-13B. Additionally, we observe that naïve MinMax FP quantization achieves nearly lossless results and even surpasses the state-of-the-art integer post-training quantization method, SmoothQuant (Xiao et al., 2022), which indicates that floating-point quantization naturally has a strong capability in handling the distributions in transformers. However, both MinMax FP Quant and the FPQ baseline fail when pushing the quantization precision to the ultra-low 4/4/4-bit setting, with 28.9% and 23.8% accuracy degradation on LLaMA-7B, respectively. In this extreme case, the previous state-of-the-art PTQ and QAT methods, SmoothQuant (Xiao et al., 2022) and LLM-QAT (Liu et al., 2023), also suffer severe accuracy degradation. In comparison, FPQ demonstrates a strong capability of handling ultra-low-bit settings and achieves only an 8.2%/5.8% accuracy drop on LLaMA-7B/13B with 4/4/4 bit-width, outperforming SmoothQuant (Xiao et al., 2022) by a large margin, yet with lower bit-width and a smaller calibration size. Moreover, FPQ even achieves a 5.3% accuracy improvement compared to LLM-QAT (Liu et al., 2023) in the 4/4/4 setting, and 1.5% over GPTQ (Frantar et al., 2023) in the 4/4/16 configuration on LLaMA-7B.
¹https://github.com/EleutherAI/lm-evaluation-harness
For practitioners, a crucial consideration is determining the appropriate quantization method for a given bit-width. Based on our findings, we therefore offer two recommendations that balance the trade-off between accuracy and search/optimization efficiency. First, since the difference between MinMax FP Quant and the rest of the methods is marginal in the 8/8/8 setting, we recommend simply using MinMax FP Quant for 8/8/8, as the MinMax method does not involve a search process. However, for more demanding scenarios, especially with activations quantized to 4 bits, we recommend employing FPQ to minimize accuracy degradation with negligible inference overhead.

BERT Model
We evaluate the proposed quantization techniques for the BERT model on the GLUE tasks (Wang et al., 2019). Full-precision BERT-base models fine-tuned on the GLUE datasets are obtained from the Hugging Face public repository. We randomly sample 128 examples from the training set as the calibration set.

Generalizability on Vision Transformer
Based on our finding that vision transformers exhibit the same activation distribution pattern as language transformers, characterized by high inter-channel variance and low intra-channel variance (as detailed in Fig. 2), we extended our proposed methods to ViT and compared FPQ with floating-point PTQ baselines and state-of-the-art PTQ methods for ViT on the ImageNet classification task. Table 3 shows that the findings on ViT are consistent with those on language models: previous state-of-the-art integer-based methods struggle to maintain reasonable accuracy when quantizing the transformer to lower bits. In comparison, the proposed FPQ outperforms both PTQ4ViT and APQ-ViT at 6 bits, and achieves 40.9% and 31.5% absolute accuracy improvements over PTQ4ViT and APQ-ViT, respectively, on DeiT-S in the 4-bit configuration.

Ablation Study
In this section, we first compare the influence of different calibration sizes on FPQ.We vary the calibration size in {32, 64, 128, 256} and test on MNLI, QQP, and CoLA.Table 4 shows that the evaluation on MNLI and QQP is more robust to different settings, and the variance is more significant on CoLA.We observe that FPQ performs well with a calibration set size of 128 data points.However, we also find that it remains robust and maintains competitive accuracy even with limited access to calibration data, such as when using as few as 32 data points.

Hardware Cost
We further examine the hardware utilization of low-bit INT, FP, and mixed-format FP multiplication operators, including adder, multiplier, and multiply-accumulate (MAC) units, in terms of hardware area. Mixed-format FP refers to the multiplication of floating-point numbers with different formats, e.g., E2M1 multiplied with E1M2. We implemented the MAC operator in Verilog HDL and used Cadence Genus to obtain the synthesized area under TSMC 40nm technology at a 0.5 GHz clock frequency.

Conclusion
This paper presents the first successful demonstration of 4-bit floating-point post-training quantization for weights, activations, and embeddings in natural language transformer architectures, including both large language models and the BERT model. We also extend our method to vision transformers and observe its robust generalization ability.
Our approach involves a practical search-based technique that establishes a strong baseline and achieves state-of-the-art results for 6-bit and 8-bit quantization. Furthermore, we address the challenge of high inter-channel variance in transformers by proposing the pre-shifted exponent bias, which proves highly effective in achieving accurate 4-bit quantization.

Limitations
Our experiments were conducted on publicly available datasets with finite sentence lengths, and the generalizability of our method to extremely long sequences or streaming data has not been verified and may require further investigation.In addition, it remains to be seen how our proposed method can generalize to other domains beyond language and vision, such as audio.It would also be interesting to see the applicability of our method to generative tasks and other applications.

A Hessian-Based Loss Metric
The objective of post-training quantization is to minimize the perturbation (δX = X_FP − X_R) introduced by quantization to the pre-trained real-valued network:

min E[L(X_R + δX) − L(X_R)]   (20)

Following the Taylor series expansion, we have:

E[L(X_R + δX)] − E[L(X_R)] ≈ δX^T ḡ(X) + (1/2) δX^T H̄(X) δX   (21)

Here, ḡ(X) is the gradient and H̄(X) is the Hessian matrix. Since the pre-trained model is well-converged, we can assume that ḡ(X) has near-zero values in every element, and thus the term δX^T ḡ(X) can be neglected. The Hessian matrix H̄(X) is computed as:

H̄(X) = J_O(X)^T H̄(O) J_O(X)   (22)

where J_O(X) denotes the Jacobian matrix of the layer output O w.r.t. X, and H̄(O) is the Hessian matrix w.r.t. O. We then substitute this back into Eq. 21:

δX^T H̄(X) δX = (J_O(X) δX)^T H̄(O) (J_O(X) δX) ≈ (Ô − O)^T H̄(O) (Ô − O)   (23)

Here Ô is the intermediate output of the quantized layer and O is the original layer output. Note that under the assumption that δX is relatively small (Li et al., 2021), we can approximate (Ô − O) as J_O(X) δX using a first-order Taylor expansion. Nevertheless, the calculation of H̄(O) is still burdensome; therefore, we use the diagonal entries of the Fisher information matrix of O to substitute H̄(O), following (Li et al., 2021; Yuan et al., 2022), and the new Hessian-based metric becomes:

min E[(Ô − O)^T diag((∂L/∂O_1)², ..., (∂L/∂O_n)²) (Ô − O)]   (24)

Here, each entry of O is assumed to be independent and n denotes the total number of elements in O. In this study, this Hessian-based metric is used as the reconstruction metric to search for the optimal FP quantization function for both weights and activations when performing layer-wise reconstruction on BERT and Vision Transformer models.
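The diagonal-Fisher metric of Eq. 24 amounts to a squared output error weighted element-wise by the squared task-loss gradients. A minimal sketch (the function name `hessian_metric` is ours; gradients would come from one backward pass on the calibration data):

```python
import numpy as np

def hessian_metric(o_quant, o_full, grad_o):
    """Eq. 24 sketch: squared output error weighted by the diagonal Fisher
    approximation, i.e. the squared gradients of the task loss w.r.t. O."""
    diff = (o_quant - o_full).reshape(-1)
    fisher_diag = grad_o.reshape(-1) ** 2
    return float(diff @ (fisher_diag * diff))
```

With unit gradients this reduces to the plain reconstruction loss of Eq. 12; entries of O with larger gradients contribute proportionally more, which is the whole point of the weighting.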

B Quantization Error of Different Floating-Point Formats

D Efficient Matrix Multiplication
Figure 7 displays a comprehensive list of all the granularity options that allow for efficient matrix multiplication. While per-token quantization theoretically provides greater precision in terms of quantization granularity, the accuracy gains achieved through this method are minimal and do not justify the additional computational overhead required. As a result, we opt for per-tensor quantization when quantizing activations.

E Learning Format and Maximum Value
We compare the previous gradient-based method (Kuzmin et al., 2022) with the proposed search-based method for finding the optimal format and maximum value. On DeiT-S, the learnable method only achieves 74.38% accuracy for an 8-bit quantized model on ImageNet; in contrast, FPQ attains an almost lossless result of 79.88%. We analyze the gradients for the number of exponent bits e derived in (Kuzmin et al., 2022) and observe that each time the exponent bits change, the gradients vary exponentially, leading to high instability. Based on this observation, we assert that employing a search-based method to determine the optimal formats is crucial in post-training quantization (PTQ).

F Reconstruction Choices
Previous works on integer post-training quantization involve breaking down the target model into sub-modules and reconstructing them separately (Nagel et al., 2020; Li et al., 2021; Bai et al., 2022; Yuan et al., 2022). This addresses the problem of over-fitting, given that only a limited amount of unlabeled calibration data is available. In this study, we find that layer-wise reconstruction and parallel quantization work best for floating-point PTQ. Layer Reconstruction: Recent research (Li et al., 2021; Bai et al., 2022) suggests increasing the reconstruction granularity from layer reconstruction (Nagel et al., 2020) to block reconstruction (Li et al., 2021) or even larger granularity (Lee et al., 2023). This is achieved by jointly optimizing all the linear layers or matrix multiplication components within each module to prevent the propagation of reconstruction errors among the layers. Despite this, we have observed that increasing the reconstruction granularity does not improve the performance of the FPQ baseline, and sometimes even leads to worse results. Therefore, we choose layer reconstruction.
Parallel Quantization: Sequential quantization is the most commonly used approach (Wu et al., 2020; Nagel et al., 2020; Li et al., 2021), where modules are quantized consecutively in their sequential order, and the input for the module currently being calibrated is generated by all the previously quantized modules. However, some recent works (Yuan et al., 2022; Bai et al., 2022) propose a parallel quantization framework, which uses the raw output of the full-precision modules as input and makes the calibration of each module independent of the others. In this work, we use parallel quantization, as it yields better results than its sequential counterpart.
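The distinction between the two calibration regimes is schematic rather than library-specific; the sketch below (with hypothetical helper names `calibrate_parallel` and `quantize_layer`) shows the defining property of the parallel scheme: every layer is calibrated against the raw full-precision input, independent of earlier quantization decisions:

```python
import numpy as np

def calibrate_parallel(layers, quantize_layer, x0):
    """Parallel quantization sketch: run one full-precision forward pass to
    cache each layer's raw input, then calibrate every layer independently
    against its cached full-precision input."""
    acts, x = [], x0
    for f in layers:                      # single full-precision forward pass
        acts.append(x)
        x = f(x)
    return [quantize_layer(f, a) for f, a in zip(layers, acts)]
```

A sequential variant would instead feed each calibration step the output of the *already quantized* predecessors, coupling the layers' calibration errors.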

Figure 2 :
Figure 2: Magnitude of the output activations of the feed-forward network blocks in LLaMA-7B, BERT, and DeiT.

Figure 3 :
Figure 3: Overview of the pre-shifted exponent bias method. (a) Search phase: the real-valued channel-wise scaling exponent bias for activations (b̃_j) is partitioned into a real-valued tensor-wise exponent bias (ρ) and an integer-based channel-wise exponent bias (b_ori^j). (b) Reparameterization and weight pre-computation: once the optimal values are determined on the calibration set, b_ori^j is re-parameterized into the weight tensor. The weights are pre-computed to apply the bias, so this is a one-time cost. (c) Inference phase: the method leverages efficient matrix multiplication between low-bit floating-point matrices.

Figure 4 :
Figure 4: Quantization error of different formats for BERT layers.

Figure 5 :
Figure 5: Magnitude of the output activations of different modules in BERT (left column) and DeiT-S (right column).

Figure 6 :

Figure 7 :

Table 1 :
Zero-shot performance on common sense reasoning tasks with LLaMA (Touvron et al., 2023) models. We denote E/W/A as the bit-widths of word embeddings, model weights, and activations, respectively.

Table 2 :
Results on the GLUE development set with the BERT (Bai et al., 2022) model. We denote E/W/A as the bit-widths of word embeddings, model weights, and activations, respectively.

Table 3 :
Comparison on the ImageNet dataset with vision transformer structures.

Table 4 :
Ablation studies of different calibration sizes.

Table 5 :
Ablation studies of different search ranges (γ₁, γ₂) on MNLI-M, QQP, and CoLA.

Table 6 :
Area differences of INT, FP, and mixed-format FP operators across different bit-widths. Table 6 illustrates the hardware cost of the INT and FP operators, with the multiplier being the primary cost for INT and the adder for FP. Notably, the disparity between FP4 and INT4 adders is small, while INT has twice the hardware cost for the multiplier. Moreover, the mixed-format FP4 operator has a hardware area comparable to the standard FP4 operator. These findings indicate that the proposed FPQ approach imposes negligible overhead in hardware implementation compared to standard FP operators, and that the hardware cost of FP is comparable with that of INT.