Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?

The inference of Large Language Models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has emerged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work, we explore the statistical and learning properties of the LLM layer and attribute the bottleneck of LLM quantisation to numerical scaling offsets. To address this, we adapt block quantisations for LLMs, a family of methods that share scaling factors across packed numbers. Block quantisations efficiently reduce the numerical scaling offsets solely from an arithmetic perspective, without additional treatments in the computational path. Our nearly-lossless quantised 6-bit LLMs achieve a $19\times$ higher arithmetic density and a $5\times$ higher memory density than the float32 baseline, surpassing the prior art 8-bit quantisation by $2.5\times$ in arithmetic density and $1.2\times$ in memory density, without requiring any data calibration or re-training. We also share our insights into sub-8-bit LLM quantisation, including the mismatch between activation and weight distributions, optimal fine-tuning strategies, and a lower quantisation granularity inherent in the statistical properties of LLMs. The latter two tricks enable nearly-lossless 4-bit LLMs on downstream tasks. Our code is open-sourced.


Introduction
Pre-trained Large Language Models (LLMs) (Brown et al., 2020; Black et al., 2021; Zhang et al., 2022) have demonstrated impressive performance on a range of Natural Language Processing (NLP) tasks. However, their underlying computational and memory costs are a critical bottleneck to their usability. For instance, the larger variants in the GPT family scale up to hundreds of billions of parameters, requiring at least 300GB of memory to store these parameters in a float16 format (Brown et al., 2020). Quantisation serves as a natural solution for reducing the cost of running inference on these LLMs (Yao et al., 2022; Xiao et al., 2022; Dettmers et al., 2022), as a low-precision format enables cost savings across all relevant efficiency metrics: reduced on-chip memory, increased arithmetic intensity for matrix multiplies, and decreased DRAM bandwidth requirement. On the other hand, the growing popularity of running services such as ChatGPT (OpenAI, 2022) provides an impetus for exploring the use of custom silicon to support LLM inference. This raises the question: What would a low-precision number system look like in these near-future LLM hardware accelerators (ASICs)?
LLM quantisation is challenging because of activations with large absolute magnitudes, also known as activation outliers (Bondarenko et al., 2021; Xiao et al., 2022). Previous approaches have proposed various techniques to address such outliers. However, these either require additional treatments in the integer quantisation domain (LLM.int8() and SmoothQuant) or yield unsatisfactory performance (ZeroQuant); moreover, prior work has primarily focused on arithmetics that can be ported to GPUs. We observe that the presence of outliers necessitates different scaling factors at a finer granularity than the per-tensor or per-token level (Yao et al., 2022; Xiao et al., 2022). This insight naturally leads us to revisit arithmetic systems with small exponents, such as MiniFloat (Sun et al., 2019), Block Minifloat (Fox et al., 2021), Block Logarithm (Miyashita et al., 2016), and Block Floating Point (Kalliojarvi and Astola, 1996), as they can effectively represent outliers in Transformer models. To the best of our knowledge, our work is the first to systematically investigate short-exponent arithmetics for LLM quantisation.
Figure 1 illustrates the variance of the tensors joining the GEMMs in OPT-6.7B (Zhang et al., 2022). After feeding 128 samples from Wikitext2 to the model, we make three interesting observations: 1) the variance of most activations in Figure 1 increases with the depth of the layer; 2) certain tensors (e.g. K) consistently have a greater variance than others; 3) all weight variances are smaller than the activation variances.

Figure 1: The algorithm on the left is the forward pass computation of a single Transformer layer (Vaswani et al., 2017) in mainstream LLMs, wherein values in blue (e.g. X_n) represent tensors with predetermined min-max values, such as the outputs of a normalisation layer or softmax. Values in red have unbounded min-max, and are plotted on the upper right for different layers of OPT-6.7B (Zhang et al., 2022). We show that for almost all activation tensors, their variances increase at deeper layers, resulting in scaling offsets in their quantisation, while weight tensors on the lower right have smaller variances. This statistical trend enlightens our LLM quantisation study.
The presence of varying numerical ranges across layers and tensors poses a challenge to the efficacy of a single quantisation configuration for the entire network. From an arithmetic perspective, we refer to this phenomenon as numerical scaling offsets, as it requires different numerical ranges and granularities for quantisation. To ensure optimal performance, these layers should be subjected to fine-grained non-linear quantisation strategies. In this work, we also explore suitable places to perform Training-After-Quantisation (TAQ) and quantisation search within the entire NLP pipeline. We make the following contributions:
• We address the LLM quantisation problem with activation outliers and examine it as a scaling offsets problem from an arithmetic design perspective. We demonstrate the efficacy of a family of arithmetic systems with short exponents shared across a block of numbers.
• We propose a novel quantisation framework based on block arithmetic and demonstrate its effectiveness in performing W6A6 inference for various tasks. Our nearly-lossless W6A6 outperforms prior work in terms of arithmetic density and memory density, without requiring data calibration or fine-tuning.
• We present two methods to achieve 4-bit quantisation on downstream tasks: one is fine-tuning-based, and the other is mixed-precision search. The latter further demonstrates the potential advantage of shifting LLM inference to cost-effective ASICs.

Related Work
While quantisation of earlier Machine Learning (ML) models has been extensively studied, effective quantisation of LLMs remains an open problem. In this section, we review previous work on block-based quantisation and compare it with existing LLM quantisation techniques.

Block-based Quantisation
Block-based quantisation is a technique that quantises a block of values into a compact format, where the elements within each block share common digits. This technique offers a significant memory footprint reduction while introducing only a minor round-off error. A number of previous works rely on this method to quantise Convolutional Neural Networks (CNNs). Lin et al. utilised a linear combination of multiple binary bases, equivalent to assigning each binary matrix a scaling factor (Lin et al., 2017). Subsequently, Zhang et al. introduced LQ-Nets, which rely on a form of block quantisation with a shared scaling factor at the vector level (Zhang et al., 2018). Further investigations explored grouping numbers at various granularities, including layer-wise (Wu et al., 2018b), channel-wise (Krishnamoorthi, 2018), and vector-wise quantisation (Dai et al., 2021).
It is worth noting that sharing a scaling factor is similar to, but not necessarily the same as, sharing the exponent (Darvish Rouhani et al., 2020). This distinction arises because scaling factors can be arbitrary float32 values, whereas exponent values must be integers represented by the assigned number of bits. Our work focuses on sharing the exponent or exponent bias. When the block size of the shared exponent is 1, we fall back to the minifloat representation such as FP8 (Sun et al., 2019). These approaches showed promising results primarily for vision models or relatively small Transformer-based models, while we shift the focus to quantising LLMs with a significantly larger parameter count.
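The distinction can be made concrete on a toy block of values (the numbers below are purely illustrative and not taken from any model):

```python
import math

block = [0.11, -0.07, 0.35, 0.02]
max_abs = max(abs(v) for v in block)

# An arbitrary float32 scaling factor, e.g. for an 8-bit integer grid:
scale = max_abs / 127                          # 0.00275..., any real value is allowed
# A shared exponent must be an integer, so the step is restricted to a power of two:
shared_exp = math.floor(math.log2(max_abs))    # -2
print(scale, 2.0 ** shared_exp)                # 0.002755..., 0.25
```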

LLM Quantisation
Efficient quantisation techniques for language models have been explored in previous works. Zafrir et al. proposed an approach for quantising BERT (Devlin et al., 2019) into 8-bit integers (Zafrir et al., 2019), while Shen et al. (2019) proposed Hessian-based ultra-low precision quantisation for the same model. Zhang et al. (2020) quantised BERT to ternary values leveraging layer-wise knowledge distillation, and Bai et al. (2021) further pushed the quantisation of BERT weights to binary values.
The recent surge of interest in quantising LLMs has presented a unique challenge distinct from the prior art summarised above. This challenge stems from the increased model sizes of LLMs. Yao et al. proposed ZeroQuant, which quantises both weights and activations of large Transformers into small integers with shared scaling factors (Yao et al., 2022). However, as mentioned by Xiao et al. (2022), ZeroQuant suffers from a severe accuracy loss. Dettmers et al. introduced LLM.int8(), a method that computes outlier GEMMs in float16 and the rest in 8-bit integers (Dettmers et al., 2022). SmoothQuant (Xiao et al., 2022) instead migrates the quantisation difficulty from activations to weights by rescaling them offline.

Method
In this section, we outline our quantisation strategy for LLMs. We first define block-based quantisation and then describe the metrics we use for evaluating quantisation methods. Finally, we detail a precision search that lowers the quantisation granularity down to the tensor level, effectively accommodating the statistical distribution inherent in LLMs.

Block-based Arithmetic
Figure 2 illustrates the data representations we explore to address LLM quantisation, as well as the standard float32/float16. We outline the specifications for traditional floating-point numbers and extend them to block-based quantisation. Detailed definitions can be found in Appendix B.
Standard floating-point A standard IEEE floating-point number is defined as a 4-tuple, (s, e, m, b) (Kahan, 1996). s ∈ {0, 1} is the sign bit, e ∈ N is the exponent field, b ∈ N is the exponent bias, and m ∈ N is the mantissa. Let the bit widths of the exponent and the mantissa be E and M, respectively. The IEEE standard float32 (FP32) number has E = 8 and M = 23, with the remaining bit used as the sign bit. Note that the exponent bias depends on E: b = 2^(E−1) − 1, separating the exponent field symmetrically. Similarly, float16 (FP16) has E = 5 and M = 10.
MiniFloat and Denormalised MiniFloat MiniFloat is an efficient floating-point representation that requires fewer bits than traditional floating-point numbers. Traditionally, an 8-bit MiniFloat inherits the definition of FP32 by assigning E = 4 and M = 3. We saturate MiniFloat when e = 2^E − 1, and thus no ±inf is included.
In this paper, we also introduce a Denormalised MiniFloat (DMF) with zero as the implicit leading bit in the mantissa. Similar to MiniFloat, we saturate the infinity to the maximum finite value. DMF provides higher precision than MiniFloat for small values at the expense of narrowing the value range. We investigate this trade-off in the context of quantising LLMs.
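To make the trade-off concrete, the following is a minimal decoder for the two 8-bit formats, written as a sketch under the definitions above and in Appendix B; denormal and saturation handling are omitted for brevity.

```python
def decode(s: int, e: int, m: int, E: int = 4, M: int = 3, denormalised: bool = False) -> float:
    """Decode an (s, e, m) MiniFloat (implicit leading 1) or DMF (implicit leading 0)."""
    bias = 2 ** (E - 1) - 1
    lead = 0.0 if denormalised else 1.0
    return (-1.0) ** s * 2.0 ** (e - bias) * (lead + m / 2 ** M)

print(decode(0, 7, 4))                     # MiniFloat: 2^0 * (1 + 4/8) = 1.5
print(decode(0, 7, 4, denormalised=True))  # DMF:       2^0 * (0 + 4/8) = 0.5
```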
Block MiniFloat, Block Floating-Point and Block Logarithm As shown in Figure 2, block quantisation packs values into blocks in which a common scaling factor is shared across N values, where N is the block size, reducing the computation in vector inner products. This work mainly explores three block quantisation arithmetics for LLMs: BM, BFP and BL.
Block Minifloat (BM) shares a B-bit exponent bias (Fox et al., 2021). This representation achieves high precision and high range at the same time, at the cost of a larger quantisation error at medium values than standard floating point. This is potentially amenable to values in a multimodal distribution, where values close to a peak can be efficiently represented in a block. Block Floating-Point (BFP) shares an E-bit exponent. This shared exponent bounds the range in the block and is amenable to values with small block variances. Block Logarithm (BL) sets the mantissa in BM to 1 and shares a B-bit exponent bias, resulting in values that are powers of two. This contrasts with BFP and is amenable to values with large dynamic ranges.
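To make the shared-exponent idea concrete, the following PyTorch sketch simulates BFP quantisation. It is an illustrative reference rather than our actual kernel; the block size, mantissa width, and the choice to block along the flattened last dimension are assumptions made only for this example.

```python
import torch

def bfp_quantise(x: torch.Tensor, block_size: int = 16, mantissa_bits: int = 5) -> torch.Tensor:
    """Simulate Block Floating-Point: each block shares one exponent taken from
    its largest-magnitude element, and mantissas are rounded to a fixed width."""
    shape = x.shape
    x = x.reshape(-1, block_size)  # assumes numel is divisible by block_size
    max_abs = x.abs().amax(dim=-1, keepdim=True).clamp_min(2.0 ** -126)
    scale = 2.0 ** torch.floor(torch.log2(max_abs))  # shared power-of-two exponent per block
    steps = 2 ** mantissa_bits
    # Round mantissas to `mantissa_bits` fractional bits and saturate to the block range.
    mant = torch.clamp(torch.round(x / scale * steps), -2 * steps, 2 * steps - 1) / steps
    return (mant * scale).reshape(shape)
```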
All these quantisation methods are non-linear and thus can be useful tools to address the scaling offsets phenomenon depicted in Figure 1. Moreover, the block size hyper-parameter allows for flexible quantisation granularity, ranging from layer-wise, tensor-wise, and channel-wise, to slice-wise (a slice along the token/channel vector).

Arithmetic and Memory Densities
Reducing model size is not the only advantage of quantisation; it also simplifies the computation, thereby accelerating inference. We evaluate quantisation arithmetics using memory density and arithmetic density, adopted from Darvish Rouhani et al. (2020). We define memory density as the reciprocal of the size of the activation and weight data in a model, and arithmetic density as the reciprocal of the area (the number of Look-Up-Tables, LUTs) needed to synthesise a multiply-accumulate (MAC) unit, which serves as the basic cell for matrix multiplication in custom inference circuits. An efficient quantisation method should make a good trade-off among task accuracy, memory density, and arithmetic density. We implemented MAC units with the different arithmetics mentioned above on FPGAs to obtain the number of LUTs. A detailed description of this procedure can be found in Appendix C.
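For intuition, memory density can be approximated from the average number of bits stored per element. A minimal sketch follows, assuming a 6-bit element (1 sign bit plus 5 mantissa bits) sharing an 8-bit exponent over a block of 16; the exponent width here is our assumption for the example, not a statement of the exact configuration in Table 2.

```python
def avg_bits_per_element(element_bits: int, shared_bits: int, block_size: int) -> float:
    """Per-element storage of a block format: element bits plus the shared
    field (e.g. a shared exponent) amortised over the block."""
    return element_bits + shared_bits / block_size

bfp6 = avg_bits_per_element(element_bits=6, shared_bits=8, block_size=16)  # 6.5 bits
print(f"memory density vs FP32: {32 / bfp6:.1f}x")                        # ~4.9x relative to FP32
```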

Quantisation Search
Previous works (Dong et al., 2019; Habi et al., 2020) observed that the layers in CNNs exhibit varying tolerance, or "sensitivity", to quantisation; we also notice this phenomenon in LLMs. The crucial aspect is identifying the sensitive layers and determining tailored quantisation configurations. To achieve this, we apply the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011) to conduct a fine-grained search for quantisation precision multiple times and analyse the statistics inherent in the quantised models that recover more accuracy. Our search space is constructed on a per-tensor basis, allowing each input tensor or weight tensor in GEMMs 1-8 (see Algorithm 1) to have its own precision. The search space grows exponentially as the layer count increases. We leverage accuracy and memory density to design the objective function:

O_f = acc + α × mem

Here O_f, acc, and mem represent the objective function, accuracy, and memory density of the searched quantised models, respectively. The constant α is used to balance acc and mem. To determine α for a specific search, we initially set α to 1.0 and perform the search while recording the values of (acc, mem) until convergence. The final value of α is determined as acc_c/mem_c, where (acc_c, mem_c) represents the converged values. Detailed search parameters are in Appendix A.
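A minimal sketch of this per-tensor precision search using Optuna's TPE sampler is shown below. The evaluation helper, candidate bit-widths, and trial count are placeholders for illustration only; the real pipeline quantises the model with each sampled configuration and measures task accuracy and memory density.

```python
import random
import optuna

# One precision decision per weight/activation operand of the eight GEMMs (repeated per layer in practice).
TENSOR_NAMES = [f"gemm{i}_{op}" for i in range(1, 9) for op in ("act", "weight")]
ALPHA = 1.0  # initial value; re-set to acc_c / mem_c once the first search converges

def evaluate_quantised_model(config):
    """Placeholder for the real pipeline: quantise, run the task, and return
    (accuracy, memory density). Here it returns dummy numbers."""
    return random.random(), 32 / (sum(config.values()) / len(config))

def objective(trial: optuna.Trial) -> float:
    config = {name: trial.suggest_categorical(name, [3, 4, 5, 6, 8]) for name in TENSOR_NAMES}
    acc, mem = evaluate_quantised_model(config)
    return acc + ALPHA * mem  # O_f = acc + alpha * mem

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=200)
print(study.best_params)
```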

Evaluation
We conduct a comprehensive set of experiments to identify the key factors influencing the performance of sub-8-bit LLMs. We begin with a language modelling task to eliminate less promising quantisation methods (Section 4.2), and then run the promising ones on downstream tasks. For the tasks that prove challenging even for FP32 models, we resort to fine-tuning. Additionally, we conduct a mixed-precision search on two tasks where the quantised 4-bit models struggle. The results of this search provide insights into how to further refine quantisation at the tensor level.

Experiment setup
Baselines We compare our approach with four baselines: 8-bit plain fixed-point quantisation, LLM.int8() (Dettmers et al., 2022), GPTQ (Frantar et al., 2022), and SmoothQuant (Xiao et al., 2022). We amend SmoothQuant's source code to ensure its consistency with their paper (see Appendix A) and add this amended version (referred to as "SmoothQuant-c") to the result table.

Quantisation configuration Table 2 clarifies the quantisation configuration used in the following sections, where E, M, and B are the bit-widths of the exponent (shared exponent), mantissa, and bias (shared bias), respectively. All these representations include a 1-bit sign bit. The block size of block-based methods is set to [1, 16] for both the weight and activation matrix (a slice along a matrix row in Algorithm 1) unless otherwise specified.

Zero-shot PTQ on Wikitext2 and downstream tasks
In this section we present our results in a setup we call zero-shot Post-Training-Quantisation (PTQ), which was also adopted by prior work on LLM quantisation (Dettmers et al., 2022; Frantar et al., 2022; Xiao et al., 2022). In this approach, we take a pre-trained OPT model from HuggingFace, quantise it, and apply it to Wikitext2 to calculate perplexity, and to the eight downstream tasks shortlisted in Section 4.1 to calculate accuracy. The zero-shot PTQ setup is particularly advantageous in scenarios where LLMs lack prior knowledge of the downstream tasks, as it eliminates the need for downstream task fine-tuning and Training-After-Quantisation (TAQ).

Perplexity on Wikitext2
Table 3 compares our results with the baselines in terms of perplexity, memory density, and arithmetic density. Similar to prior work (Dettmers et al., 2022; Xiao et al., 2022), plain fixed-point quantisation performs poorly. In contrast, non-linear arithmetic, such as MiniFloat, yields a significantly better perplexity at a similar memory density. MiniFloat yields slightly better results than DMF, indicating that the 2× higher range is more important than precision in this context.
Block-based quantisation exhibits inconsistent performance on Wikitext2. A noteworthy result is that our 6-bit BFP achieves higher memory density, higher arithmetic density, and lower perplexity than the prior art GPTQ and SmoothQuant-c without requiring data calibration. BM and BL perform poorly compared to BFP. BM was originally proposed in the context of Quantisation-Aware-Training (QAT), whereas our evaluation is based on PTQ. Without retraining, the 3-bit mantissa of BM and the 1-bit mantissa of BL may be the reason for the poor perplexity.

Table 4: Mean accuracy with zero-shot PTQ on downstream tasks; this means we directly quantise the pre-trained model and benchmark on these downstream tasks using zero-shot prompting. We highlight 6-bit BFP, which also achieves an accuracy close to FP32 on these tasks.

Figure 3: The bit-width distribution of Q in Line 6, Algorithm 1 from 2688 searches. We identify the layers less tolerant to aggressive quantisation in OPT-2.7B. For example, layers 18, 25 and 30 often need more bits than other layers. Keeping these layers in relatively high precision recovers the accuracy from 36.2% to 61.3% without decreasing the memory density, equivalent to a 4.3-bit OPT-2.7B on average.

Accuracy on downstream tasks
We exclude fixed-point, DMF, BM, and BL from the downstream task evaluation due to their poor language modelling performance. Table 4 reports the mean accuracy on ARC (easy), COPA, LAMBADA, PIQA, and SST2. The results of QNLI, MRPC, and COLA are not included in this table, as even FP32 LLMs exhibit accuracy close to random guessing on them. A plot depicting how these methods match FP32 accuracy as the model scales up, together with a complete result table, is in Appendix D.
Overall, we make the following observations:
• Fixed-point representation performs inadequately due to the inability of linear quantisation to address the scaling offset issue caused by varying variances.
• LLMs have different tolerances to block-based quantisations. BM and BL exhibit subpar performance compared to BFP, indicating that non-linear quantisation still needs a sufficient mantissa length to capture the learned weight distribution, or retraining may be required.
• BFP strikes a good balance in the trade-off between range and resolution. Our nearly-lossless 6-bit LLMs, without data calibration or re-training, outperform prior art methods in terms of perplexity (accuracy), memory density, and arithmetic density.
We also observe that sub-6-bit BFP has a severe accuracy drop.To address this problem, we further investigate two approaches for improving the accuracy of 4-bit LLMs.

4-bit LLMs via fine-tuning
Previous studies (Brown et al., 2020; Zhang et al., 2022) reported FP32 LLMs' low accuracy on several downstream tasks in the context of zero-shot prompting. In our experiments, OPTs also exhibit poor accuracy on QNLI, MRPC, and COLA. Fine-tuning language models on downstream tasks has proven helpful for improving accuracy (Devlin et al., 2019). We explore the fine-tuning and quantisation of LLMs on downstream tasks.
There are two stages where quantisation can be applied. LLMs are typically pre-trained in FP32. The first option is to continue fine-tuning the FP32 model on downstream tasks and subsequently quantise this fine-tuned FP32 model. We refer to this setup as PTQ on fine-tuned FP32. The second option is to quantise the pre-trained FP32 model and retrain this quantised model on downstream tasks, which we refer to as TAQ on downstream tasks.
We compare these two cases on four downstream tasks (SST2, QNLI, MRPC, and COLA) that zero-shot prompting struggles to handle. The result table is in Appendix E. We observe that:
• Both options effectively improve accuracy, enabling nearly lossless downstream accuracy even when 4-bit BFP is applied.
• TAQ on downstream tasks reaches slightly better accuracy (a gain of 0.2% on average) than PTQ on fine-tuned FP32 given the same bit-width. However, the former is harder to optimise through backpropagation because of the forward quantisation error and the Straight-Through Estimator (STE) (Bengio et al., 2013) used in backpropagation.

4-bit LLMs via mixed precision
Currently, our block-based quantisation uses a uniform configuration, where the block size and bit-width remain constant across the entire model. What if we push the barrier further? Existing works on CNN compression have explored mixed-precision quantisation (Wu et al., 2018a; Wang et al., 2019), thereby increasing memory density. This subsection lowers the block size granularity and the bit-width granularity to the tensor level to demonstrate uncharted possibilities of aggressive LLM quantisation.
Variance-aware block size By comparing the activation variance and weight variance in Figure 1, we observe that the weight variance remains stable and much smaller, suggesting that we can increase the weight block size while decreasing the activation block size. This approach enhances accuracy while maintaining memory density.
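One way to act on this observation is to pick the block size per tensor from its variance. The sketch below is only an illustration of the idea; the variance thresholds and block sizes are hypothetical values, not the ones used in our experiments.

```python
import torch

def pick_block_size(t: torch.Tensor, low_var: float = 0.05, high_var: float = 1.0) -> int:
    """Larger blocks for low-variance tensors (typically weights), smaller blocks
    for high-variance tensors (activations such as K and Q)."""
    var = t.float().var().item()
    if var < low_var:
        return 64   # stable, low-variance weights tolerate coarse exponent sharing
    if var < high_var:
        return 16
    return 4        # high-variance activations need fine-grained scaling
```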

Mixed-precision
We repeat the quantisation search described in Section 3.3 on downstream tasks and filter out less promising quantisation configurations using an accuracy threshold and a memory density threshold. Each time we start the TPE search with a different random seed, so the distribution of the filtered quantisation configurations exposes the sensitivity of the searched tensors in LLMs. An example of a mixed-precision search result is presented in Figure 3. We find that certain layers are consistently assigned higher precision, while others tend to have lower bit-widths. By preserving high precision for these sensitive layers, we recover the 4-bit LLM accuracy from 36.2% to 61.3% on LAMBADA without compromising memory density. We include more tasks and model sizes in Appendix F. In conclusion, variance-aware block size and mixed precision allow aggressive quantisation beyond 6-bit without fine-tuning.

Conclusion
This study focuses on addressing the scaling offsets issue in LLMs and provides valuable insights into the quantisation of LLMs. Through extensive experimentation, we identify key factors that significantly impact LLM quantisation. When aiming for quantisation at or above 6-bit, BFP surpasses previous methods in terms of accuracy, memory density, and arithmetic density, without requiring data calibration or training. Moreover, we demonstrate that fine-tuning or mixed-precision techniques enable 4-bit LLMs on downstream tasks. Fine-tuning is suitable for GPUs, and mixed precision has the potential to shift the inference platform from GPUs to cost-effective ASICs. Our findings contribute to advancing the field of LLM quantisation and provide practical guidance for achieving good quantisation performance.

Limitations
Different from much prior art in LLM quantisation that focuses on integers, our work puts particular emphasis on minifloat variants. However, the potential gains of our work have not yet manifested on GPU systems due to the lack of a CUDA kernel implementation. Implementing some of the quantisation methods proposed in this paper requires specialised kernels and hardware; however, a major focus of our work is to explore potential designs for next-generation hardware to run LLM inference. Another limitation is that our search algorithm does not include arithmetic density due to a lack of hardware models. We leave this as future work.
A Experiment details

A.1 Setup and Implementation
Hardware resources We run the experiments using four NVIDIA RTX3090s, three A100s, and eight V100s with 64GB, 192GB, and 128GB RAM, respectively. The evaluation of PTQ perplexity on Wikitext2 takes around 64 GPU hours in total; the zero-shot prompting evaluation on downstream tasks takes around 160 GPU hours in total; the fine-tuning of FP32 models on SST2, QNLI, MRPC and COLA takes around 30 GPU hours in total; the fine-tuning of quantised BFP models takes around 70 GPU hours in total; the evaluation of fine-tuned models takes around 6 GPU hours in total; and the mixed-precision search takes around 120 GPU hours in total.
Implementation We download the model code and pre-trained weights from HuggingFace Transformers and implement the quantisation arithmetics in PyTorch. We use Vivado to report arithmetic density and Optuna to perform the mixed-precision search.
Evaluation We follow the code base of GPTQ (Frantar et al., 2022) to estimate LLM perplexity on Wikitext2. We chop Wikitext2's test set into sequences of 2000 tokens, feed the sequences to the LLMs, and normalise the cross-entropy loss by the sequence length and batch size. To evaluate LLM accuracy on downstream tasks, we follow OPT (Zhang et al., 2022) and SmoothQuant (Xiao et al., 2022) and use lm-eval-harness in the zero-shot prompting setup.
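A minimal sketch of this evaluation loop is shown below. The 2000-token chunk length follows the description above; model and tokenizer loading are elided, the model is assumed to expose a HuggingFace-style `.logits` output, and the loss reduction is only illustrative.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def wikitext2_perplexity(model, input_ids: torch.Tensor, seq_len: int = 2000) -> float:
    """Chop the tokenised test set into fixed-length sequences and average the
    per-token cross entropy before exponentiating."""
    nlls, n_tokens = 0.0, 0
    for start in range(0, input_ids.size(1) - seq_len, seq_len):
        ids = input_ids[:, start:start + seq_len]
        logits = model(ids).logits
        loss = F.cross_entropy(logits[:, :-1].flatten(0, 1), ids[:, 1:].flatten())
        nlls += loss.item() * (ids.size(1) - 1)
        n_tokens += ids.size(1) - 1
    return math.exp(nlls / n_tokens)
```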

A.2 Comparison with SmoothQuant
The SmoothQuant paper (Xiao et al., 2022) states that all eight GEMMs (1-8 in Algorithm 1) are quantised. However, their code does not support quantising GEMMs 5 and 6, which take up 19.6% of the floating-point operations (FLOPs) in OPT-6.7B's self-attention. We amend their code and refer to the amended version as "SmoothQuant-c", which should be the same as SmoothQuant-O2 in the paper. We observe that SmoothQuant-c has much higher perplexity and slightly lower accuracy on downstream tasks than SmoothQuant. Besides, the SmoothQuant repository does not include the scaling factor files of OPT-125m and OPT-350m, so the perplexity/accuracy for these two models is missing in our result table.

A.3 Quantisation search
The specific search configuration depends on the model size, task, and FP32 performance. We use an accuracy threshold and a memory density threshold to sort out promising mixed-precision configurations. Given a model and a task, the accuracy threshold is set 2% below the FP32 value. The memory density threshold is set to 7.1× in most search configurations.
Note that to estimate the memory density of candidate quantisation configurations, we need the model architecture information, including the input sizes and weight sizes for all the GEMMs in Algorithm 1 across all layers. We implement a FLOP profiler to collect this information and feed it as input to the search algorithm. The numeric values of these parameters can be found in the bash scripts of our source code.
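The following sketch shows how such a profile can be turned into a memory-density estimate; the tensor names and shapes are placeholders, the shared-exponent overhead is ignored for brevity, and the real profiler walks Algorithm 1 across all layers.

```python
def memory_density(gemm_shapes: dict, config: dict) -> float:
    """Reciprocal of the quantised model's activation+weight footprint relative
    to FP32, given per-tensor bit widths from the search."""
    fp32_bits, quant_bits = 0, 0
    for name, (rows, cols) in gemm_shapes.items():
        n = rows * cols
        fp32_bits += n * 32
        quant_bits += n * config[name]  # shared-exponent overhead omitted for brevity
    return fp32_bits / quant_bits

# Toy example: one weight tensor and one activation tensor, both quantised to 6 bits.
print(memory_density({"w": (4096, 4096), "x": (2048, 4096)}, {"w": 6, "x": 6}))  # ~5.33
```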

B Definition of quantisation arithmetics
FP32, FP16 and MiniFloat A traditional floating-point representation follows the IEEE floating-point standard (Kahan, 1996), which defines a floating-point number as a 4-tuple, (s, e, m, b), where
• s ∈ {0, 1} is the sign bit of the number;
• e ∈ N is the exponent field;
• b ∈ N is the exponent bias; and
• m ∈ N is the mantissa.
Given the bit widths of the exponent and the mantissa as E and M, the value x of a floating-point number can be obtained via:

x = (−1)^s × 2^(1−b) × (m / 2^M),        if e = 0,
x = (−1)^s × 2^(e−b) × (1 + m / 2^M),    otherwise,        (1)

where e is the unsigned integer value represented by the exponent bits, m is the unsigned integer value represented by the mantissa bits, and b = 2^(E−1) − 1. FP32, FP16, and MiniFloat have (E = 8, M = 23), (E = 5, M = 10), and (E = 4, M = 3), respectively. Note that the "1" in the fraction term of Line 2, Equation (1) comes from the implicit leading bit in the mantissa. We additionally saturate MiniFloat when e = 2^E − 1: this largest exponent still encodes finite values via Equation (1), so no ±inf is included.

DMF The definition of DMF is the same as MiniFloat except that there is no implicit leading bit in the mantissa, i.e. x = (−1)^s × 2^(e−b) × (m / 2^M).

BM, BL, and BFP BM (Fox et al., 2021) shares the exponent bias and was proposed in the context of Quantisation-Aware-Training (QAT). When an FP32 value is cast to BM, the exponent bias is determined by the maximum value in the block. BFP (Darvish Rouhani et al., 2020) shares the exponent and was proposed in the context of PTQ. Similar to BM, the shared exponent is also determined by the maximum FP32 value in the block when casting from FP32.
Logarithmic quantisation was proposed by Miyashita et al. (2016) to perform QAT on CNNs. Block Logarithm (BL) was used as a baseline for comparison with BM in Fox et al. (2021). BL shares the exponent bias and has no mantissa bits (the mantissa is always 1).
Block-based quantisation facilitates vector inner products by simplifying the accumulation after multiplication. For example, the inner product between two BFP vectors x and y is

x · y = 2^(e_x + e_y) Σ_{i=1}^{B} m_{x,i} · m_{y,i},

where B is the block size, e_x and e_y are the shared exponents, and m_{x,i}, m_{y,i} are the signed mantissas. Since exponents are shared across vectors, the element products can be accumulated without shifting. The block sizes of the two vectors are not necessarily the same.
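A sketch of this simplification with integer mantissas follows; the helper is hypothetical and assumes both blocks share the same size and that the shared exponents have already been extracted.

```python
def bfp_dot(mant_x: list, exp_x: int, mant_y: list, exp_y: int, frac_bits: int) -> float:
    """Inner product of two BFP blocks: integer multiply-accumulate, then a
    single shift by the sum of the shared exponents."""
    acc = sum(mx * my for mx, my in zip(mant_x, mant_y))  # no per-element alignment needed
    return acc * 2.0 ** (exp_x + exp_y - 2 * frac_bits)   # element value = mant * 2^(exp - frac_bits)

# Toy example with 3 fractional mantissa bits per element:
print(bfp_dot([4, -2, 7], -1, [3, 5, 1], 0, frac_bits=3))  # 0.0703125
```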

C Estimate arithmetic density via logic synthesis
We implemented the hardware designs of the corresponding MAC modules and measured their arithmetic density using hardware synthesis tools, as summarised in Table 5.

D PTQ on downstream tasks
We quantise the pre-trained model and apply it to the downstream tasks in the zero-shot prompting setup. Figure 4 depicts how the performance of quantised models scales with model size. Our 6-bit BFP aligns with FP32 at various model sizes. Table 6 presents the detailed accuracy on each task. Note that QNLI, MRPC, and COLA results are not included in this table because even FP32 LLMs yield an accuracy close to random prediction.

E PTQ on fine-tuned FP32 vs TAQ on downstream tasks
Table 7 compares the two options on the four downstream tasks (QNLI, SST2, COLA, MRPC) that FP32 LLMs cannot handle with zero-shot prompting. We observe that both options align 4-bit BFP LLMs' performance with FP32 on these downstream tasks.

F Searched mixed-precision LLMs
Mixed-precision quantisation is also helpful for recovering downstream task accuracy. Figures 8a and 8b depict the performance of 4-bit LLMs on LAMBADA and ARC (easy) as the model scales up. The searched mixed-precision configuration effectively recovers the accuracy.
Figures 5, 6, and 7 show the searched bit-width distributions after searching on LAMBADA 2688 times. Keeping the layers that are frequently assigned high precision in relatively high precision effectively recovers the accuracy from 36.2% to 61.3% without decreasing the memory density, equivalent to a 4.3-bit OPT-2.7B on average.

G Tensor variance in LLMs
We additionally analyse the trend of increasing activation variance as the model size increases. In Figure 1, for OPT-6.7B, we plotted the variances of all tensors that have unbounded input ranges and that are taken as input operands to matrix multiplications in the Transformer layer. Figure 9 further illustrates the results for OPT-350M and OPT-2.7B. We observe that:
• If we consider V, B_c and B_1 as the main information path, these components have much smaller variances than K and Q.
• Bigger models tend to have small variances at shallow layers and larger variances at deep layers.
These observations explain why linear quantisation, such as integer quantisation, is effective for smaller models but struggles with larger ones. This trend of increasing activation variance can also be factored into variance-aware block sizes: since a higher variance implies a higher possibility of extreme outliers, we can apply larger block sizes to tensors with smaller variance and smaller block sizes to those with higher variance. Limited by time, we leave this exploration, as well as the combination of fine-tuning, variance-aware block size, and mixed precision, to future work.

Figure 9: We demonstrate an analysis similar to Figure 1, with OPT-350M variance vs. layer ID on the left and OPT-2.7B variance vs. layer ID on the right. The trend of increasing activation variance is more obvious in larger models.

Figure 5 :
Figure 5: The searched bit-width distribution of OPT-2.7B. Notably, some layers are frequently assigned relatively high precision, indicating these layers are less tolerant to quantisation.

Figure 6 :
Figure 6: The searched bit-width distribution of OPT-2.7B. Notably, some layers are frequently assigned relatively high precision, indicating these layers are less tolerant to quantisation.

Figure 7 :
Figure 7: The searched bit-width distribution of OPT-2.7B. Notably, some layers are frequently assigned relatively high precision, indicating these layers are less tolerant to quantisation.

Figure 8 :
Figure 8: The accuracy of searched mixed-precision 4-bit LLMs on LAMBADA and ARC (easy) as the model scales up.

Table 1 :
A comparison of different LLM quantisation methods. (QW, QAct) shows whether quantisations are applied to weights or activations; WxAy means x-bit quantisation for weights and y-bit quantisation for activations. PTQ and TAQ represent Post-Training Quantisation and Training-After-Quantisation, respectively. DC means data calibration. There are eight general matrix multiplications (GEMMs) per Transformer layer (1-8 in Algorithm 1). Only ZeroQuant and ours quantise all of them. Other approaches leave GEMMs 4 and 5 in float32/float16 format, which take up 20.6% of the floating-point operations in OPT-6.7B's self-attention. * means outliers in LLM.int8() are computed in float16; this improves arithmetic density but memory density is kept almost identical to canonical float16.

Table 2 :
The quantisation configuration used in the following sections, where E, M, and B are the bit-widths of the exponent (shared exponent), mantissa, and bias (shared bias), respectively.

Table 3 :
Perplexity (↓) values with zero-shot Post-Training-Quantisation (PTQ) on WikiText2; this means we directly quantise the pre-trained model and apply it to WikiText2. Mem and Arith represent memory and arithmetic density, respectively. DMF, BM, BFP and BL represent Denormalised MiniFloat, Block Minifloat, Block Floating Point and Block Logarithm, respectively. SmoothQuant-c is our improved implementation where the two activation matrix multiplications are now also quantised.
† means the inlier matrix multiplications are calculated in 8-bit fixed-point, and outliers are calculated in FP16. * means the weights of GPTQ are kept in FP32. ‡ means the SmoothQuant repository does not include the weight scaling matrices for 125M and 350M. We highlight the best block-based quantisation arithmetic, 6-bit BFP, considering perplexity, memory density, and arithmetic density together.

Table 5 :
The arithmetic density of various quantisation configurations explored in this paper. To calculate the area factor, we convert the Digital Signal Processing units (DSPs) to equivalent LUTs; we then divide the quantisation arithmetic's density (the reciprocal of its area factor) by that of FP32.

Table 7 :
The comparison between PTQ on fine-tuned FP32 and TAQ on SST2, QNLI, COLA, and MRPC. Both cases align 4-bit BFP LLMs with FP32 after fine-tuning. The latter may achieve slightly better accuracy. † means COLA is evaluated using the Matthews Correlation Coefficient (MCC), while the other tasks are evaluated using accuracy.