Zero-Shot Dynamic Quantization for Transformer Inference

We introduce a novel run-time method for significantly reducing the accuracy loss associated with quantizing BERT-like models to 8-bit integers. Existing methods for quantizing models either modify the training procedure or require an additional calibration step that adjusts parameters using a selected held-out dataset. Our method allows quantization to be used without either of these adjustments. We present results on several NLP tasks demonstrating the usefulness of this technique.


Introduction
Transformer-based neural networks (NNs) such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019) and XLM-R (Conneau et al., 2019), pretrained on large amounts of data, have led to state-of-the-art (SOTA) results on many NLP tasks such as machine translation (Zhu et al., 2019), text classification (Wang et al., 2018) and question answering (Kwiatkowski et al., 2019; Clark et al., 2020). However, run-time inference with such large models is costly because of their computational requirements. In addition, deploying these models on smaller-footprint mobile devices (Ravi and Kozareva, 2021) or cost-effective CPU-based machines (Sanh et al., 2019; Jiao et al., 2020) requires aggressive optimization for both speed and network size. One popular speed optimization technique is NN quantization (Gholami et al., 2021; Kim et al., 2021; Zafrir et al., 2019), where network weights and activations are transformed from 32-bit floating-point representations to integers (typically 8-bit). Running inference using integer operations has two key advantages. First, the model size is considerably reduced; e.g., 8-bit quantization shrinks models by a factor of four. Second, inference throughput is significantly increased by using more efficient integer-based "single instruction multiple data" (SIMD) (Hennessy and Patterson, 2012) instructions while improving memory bandwidth utilization, which is typically a bottleneck limiting computational throughput for NNs (Quinn and Ballesteros, 2018).
Fundamentally, quantization loses information because of the lowered numerical precision. As a result, applying integer quantization directly to NN models leads to a considerable drop in accuracy (Zafrir et al., 2019). However, by carefully adjusting quantization parameters such as the clipping thresholds, the accuracy loss can be significantly reduced, if not eliminated.
The majority of quantization research (Gholami et al., 2021) involves a mix of quantization-aware training (QAT) and post-training calibration techniques of varying complexity to close the quantization performance gap. Several works (Kim et al., 2021; Choi et al., 2018; Zhou et al., 2017; Krishnamoorthi, 2018; Louizos et al., 2019; McKinstry et al., 2019) detail techniques for QAT as well as approaches where the quantization parameters are optimized using statistics gathered during training. While these approaches typically close the gap in quantized model accuracy, they require access to the training pipeline as well as the training data. In addition, these methods are not applicable to black-box models where neither the training procedure nor the data is available. These methods may also be affected by training instabilities, which increases the complexity of the training regimes, as described in (Krishnamoorthi, 2018). Post-training approaches such as (Migacz, 2017; Bhandare et al., 2019) require calibration on selected datasets. For example, in (Migacz, 2017) the KL-divergence (Kullback and Leibler, 1951) between the unquantized and quantized activations on each layer was used to tune the quantization clipping thresholds. Special care needs to be taken when selecting a calibration dataset, as it needs to be diverse enough yet task-specific. In certain cases this leads to low accuracy, or even unpredictable behaviour, if the run-time input deviates from the calibration dataset.
Two methods that share our high-level goal of eliminating the need for training datasets are introduced in (Nagel et al., 2019; Cai et al., 2020). These methods are implemented with CNN-based (Gehring et al., 2017) networks and are used for image classification and object detection tasks. (Nagel et al., 2019) reduces the quantization error by rescaling the weights of consecutive CNN layers while taking advantage of the equivariance property of the piece-wise linear ReLU function. (Cai et al., 2020), on the other hand, tunes the quantization parameters using synthetic data generated from the mean and variance statistics of the batch normalization layers of the model itself. While both methods are mainly applicable to CNN-based networks, our algorithm is considerably simpler to implement and targets transformers (Vaswani et al., 2017), particularly SOTA NLP networks with BERT-like (Devlin et al., 2018; Liu et al., 2019) pre-trained representations.
In this work, we present a method that uses the Interquartile Range (IQR) (Tukey et al., 1977; Rousseeuw and Croux, 1993), a measure of statistical dispersion, to clip the activations dynamically at inference time. Our method ensures that at least 75% of the token-wise extreme activations are not modified, while the remaining 25% may be clipped as statistical outliers, leading to robust behaviour while considerably improving quantization accuracy. Our method works for any already-trained transformer-based model and does not require any form of training or calibration. Overall, our contributions can be summarized as follows:
• We propose a novel "ready-to-use" inference-time dynamic quantization method that does not require sophisticated re-training/fine-tuning or additional calibration strategies.
• Empirically, our proposed method demonstrates both effectiveness and robustness on several different NLP benchmark tasks.
• Further, contrary to prior work, experiments suggest that our proposed method works out-of-the-box for both monolingual and multilingual transformer architectures.

Background
Existing approaches to speeding up inference for Transformers mostly focus on General Matrix Multiply (GEMM) operations. Fast GEMM implementations routinely use GPU- and CPU-specific SIMD instructions to execute many multiplications and additions in parallel. They also optimize memory access patterns to make the best use of available memory bandwidth. Integer quantization speeds up GEMM operations by increasing the amount of data transferred with each memory transaction and by taking advantage of denser SIMD instructions. For example, 8-bit quantization packs four times the data per memory transaction compared to 32-bit floating-point values. Many CPUs also support 8-bit SIMD multiplication operations, providing faster as well as more cost-effective computation.

Uniform Quantization
Dynamic quantization for inference quantizes activations at run time. The model weights are typically quantized once ahead of execution. Let M ∈ R^{m×n} be a matrix of either activations or parameter weights. The quantization scale (QS) is obtained as:

QS = \max_{i \in \{1,\dots,m\},\; j \in \{1,\dots,n\}} |M_{i,j}|    (1)

The matrix M is then quantized to \hat{M} ∈ Z^{m×n} as follows:

\hat{M} = \mathrm{int}\left( \frac{2^{b-1} - 1}{QS} \, M \right)    (2)

where b is the number of integerization bits, typically 8, and the function int is the element-wise integer conversion operator, e.g. a floor function.
The reason for the subtraction by 1 in (2) is to ensure that the quantization range is equally spread around zero. In the case of 8 bits, the range becomes ±127. This formulation also results in a symmetric form of uniform quantization, where the quantization is evenly split around zero. This can be modified by adding a zero-shift, resulting in an asymmetric quantization (Krishnamoorthi, 2018), which may be particularly useful for certain activation functions such as ReLU (Nair and Hinton, 2010) and GELU (Hendrycks and Gimpel, 2016). While non-uniform quantization (Gholami et al., 2021) has been explored to better capture weight and activation distributions with variable step sizes, uniform quantization leads to more efficient implementations on current hardware such as GPUs and CPUs with acceptable accuracy. Once matrices are quantized, GEMM operations can be performed using integer arithmetic, allowing the use of fast SIMD instruction sets. Quantization lowers numerical precision, which leads to loss of information. Examining (1) shows how the QS can increase precision errors if it takes extreme values that deviate largely from the majority of activations. Therefore, the activation tensor must be clipped to reduce the quantization error; however, excessive clipping can distort the activations, which also leads to drops in accuracy.
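The effect of an extreme value on the QS can be illustrated with a short NumPy sketch. It assumes b = 8, the scale factor 2^{b-1} - 1 = 127 implied by the ±127 range, and round-to-nearest as the int operator (a floor would also fit the description). The names `quantize_symmetric` and `dequantize` are illustrative and are not taken from any library or from the inference engine used in this work.

```python
import numpy as np

def quantize_symmetric(m, bits=8):
    """Symmetric uniform quantization of a float matrix (sketch of (1) and (2))."""
    assert bits <= 8, "this sketch stores the result as int8"
    qs = float(np.max(np.abs(m)))        # (1): quantization scale = largest magnitude
    levels = 2 ** (bits - 1) - 1         # 127 for 8 bits
    if qs == 0.0:                        # all-zero input: nothing to scale
        return np.zeros(m.shape, dtype=np.int8), 1.0
    scale = levels / qs
    # (2): integer conversion; rounding to nearest is an assumption of this sketch
    q = np.rint(m * scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integers back to floats to inspect the quantization error."""
    return q.astype(np.float32) / scale

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16)).astype(np.float32)

# Mean absolute error on well-behaved activations.
q, s = quantize_symmetric(x)
err_clean = float(np.abs(dequantize(q, s) - x).mean())

# One extreme value inflates QS and coarsens the grid for every other entry.
x_out = x.copy()
x_out[0, 0] = 100.0
q_out, s_out = quantize_symmetric(x_out)
err_outlier = float(np.abs(dequantize(q_out, s_out) - x_out).mean())
```

In this setup `err_outlier` is much larger than `err_clean`, because a single large entry stretches the integer grid over a range that almost no activation occupies.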
In the following section, we will outline a method that chooses better QS values for each activation tensor dynamically during inference, without any modification to the training pipeline or any requirement for calibration procedures.

Interquartile Range Clipping
If we consider the extreme values in the activations as outliers in a distribution, there is a substantial amount of research on identifying outliers (Ben-Gal, 2005; Hodge and Austin, 2004). Our solution makes use of a low-complexity univariate statistical method for outlier detection referred to as the Interquartile Range (IQR) method, originally proposed by Tukey (Tukey et al., 1977).
IQR is also considered a robust statistical measure (Rousseeuw et al., 2011) of the data spread, with the notion of robustness being defined using the concept of a breakdown point (Rousseeuw and Croux, 1993; Rousseeuw et al., 2011). The breakdown point is the smallest fraction of data points whose arbitrary replacement can make the statistical measure unbounded. The sample mean and variance have a breakdown point of 0, meaning that these measures can be changed arbitrarily by even a single outlier; the IQR, on the other hand, has a 25% breakdown point, making it a stable measure even if up to 25% of the data are outliers.
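A small numerical illustration of the breakdown-point argument, as a sketch only: it assumes NumPy, synthetic Gaussian data, and `np.percentile` with its default linear interpolation for the quartiles. The helper name `iqr` is illustrative.

```python
import numpy as np

def iqr(x):
    """Interquartile range: distance between the 25th and 75th percentiles."""
    q1, q3 = np.percentile(x, [25, 75])
    return float(q3 - q1)

rng = np.random.default_rng(0)
clean = rng.normal(size=1000)

# Replace a single sample with an arbitrarily large value.
corrupt = clean.copy()
corrupt[0] = 1e6

# The mean and variance are dominated by the single outlier ...
mean_shift = abs(float(corrupt.mean() - clean.mean()))
var_ratio = float(corrupt.var() / clean.var())

# ... while the IQR moves by at most a neighbouring order statistic.
iqr_shift = abs(iqr(corrupt) - iqr(clean))
```

Replacing one value out of 1000 shifts the mean by roughly 1e6 / 1000 and inflates the variance by many orders of magnitude, whereas the IQR is essentially unchanged.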
We introduce an algorithm that effectively uses IQR to clip outliers from an activation tensor, which consequently improves the selection of the quantization scale as in (1). It is worth noting that a direct implementation of the IQR method is too slow, as it uses a sorting operation to identify the quartiles of the data. The complexity of a naive implementation would be O(N log N), where N is the number of elements of the activation tensor. In the case of BERT-like models, N = L × H, where L is the sequence length and H is the hidden dimension; e.g. for BERT-Large, N = 512 × 1024.
To lower this complexity, we obtain the IQR clipping threshold from a reduced set formed by taking the maximums, in the absolute sense, along the H dimension. We will refer to this algorithm as Token-Maximums IQR (TM-IQR) clipping. The resulting complexity of the IQR clipping becomes O(N + L log L). Our experiments show that adding this form of IQR clipping slows inference by less than 2%, which is negligible considering the resulting accuracy gains.
Algorithm 2: Activation clipping using TM-IQR. Return: A
Algorithm 2 outlines the basic procedure of our TM-IQR clipping. In line 1 we compose the set of token-maximum activations in the absolute sense. Essentially, we are reducing the set of activations to a smaller representative set that contains the top outliers of the larger set. Lines 2 to 5 compute the IQR threshold t, which is then used to clip the entire activation tensor in lines 6 and 7. The value 1.5 in line 5 is commonly referred to as the IQR scale. It was historically proposed by Tukey (Tukey et al., 1977) as a level for detecting outliers. It is possible to fine-tune this value; however, we chose to use the historical value without tuning, in line with the objective of our paper.
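The procedure described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the exact listing: the threshold is taken to be the standard Tukey upper fence q3 + 1.5 × (q3 − q1), the quartiles come from `np.percentile` with its default interpolation, and the function name `tm_iqr_clip` is hypothetical.

```python
import numpy as np

def tm_iqr_clip(a, iqr_scale=1.5):
    """Clip an (L, H) activation tensor using a token-maximums IQR threshold.

    Returns the clipped tensor and the threshold t.
    """
    # Reduce to one value per token: the largest magnitude along H. Cost O(N).
    token_max = np.max(np.abs(a), axis=1)
    # Quartiles of the L token maxima; only L values are sorted, O(L log L).
    q1, q3 = np.percentile(token_max, [25, 75])
    # Assumed form of the threshold: Tukey's upper fence with IQR scale 1.5.
    t = float(q3 + iqr_scale * (q3 - q1))
    # Clip the whole tensor symmetrically to [-t, t].
    return np.clip(a, -t, t), t

rng = np.random.default_rng(0)
a = rng.normal(size=(64, 32)).astype(np.float32)
a[3, 5] = 1000.0                      # one extreme activation in token 3

clipped, t = tm_iqr_clip(a)
# Fraction of tokens (rows) left completely untouched by the clipping.
unchanged = float(np.all(clipped == a, axis=1).mean())
```

Because t is never below the third quartile of the token maxima, the rows whose maximum lies at or below that quartile pass through unchanged, which is where the 75% figure comes from.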
It is important to note that the TM-IQR algorithm assigns a dynamic clip value to each activation tensor, as opposed to using a fixed value for all run-time inference. Unlike fixed clipping tuned on training datasets, we expect TM-IQR clipping to be applicable in a zero-shot manner across multiple tasks while maintaining reasonable empirical accuracy. This is because our clipping strategy guarantees that at least 75% of the row-wise extreme activations are not affected by it, whereas a fixed clipping method offers no such guarantee for all types of input, for example when the input is poorly aligned with the training data. This has the important effect of limiting the distortion error, which occurs when quantizing activations with excessive clipping.

Experimental Setup
Our run-time inference engine, implemented in C++, supports both FP32 and optimized 8-bit integer quantized inference (I8). We quantize model weights at load time and dynamically quantize activations at run time. The TM-IQR technique is a straightforward modification with a negligible impact on inference speed, as shown in Table 1.

TM-IQR
TM-IQR can be applied to the activations before each quantized GEMM operation. However, we found that the second feed-forward GEMM, henceforth referred to as FF2, contributes the majority of the quantization error. The input dimension of FF2 is very wide, 4 × H, providing more of a chance for saturation and integer numerical instability to accumulate. In addition, the input to FF2 consists of the activations of either a ReLU or a GELU nonlinearity. The range of such activation functions is unbounded on the positive side, which further increases the chance of saturation. Therefore, we found it most effective to apply TM-IQR to the input activations of the FF2 GEMM operation.
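To make the placement concrete, the following NumPy sketch applies a TM-IQR threshold to the FF2 input just before a simulated int8 GEMM. It is not the C++ engine used in this work: all function names are hypothetical, the threshold is assumed to be the Tukey upper fence, rounding to nearest stands in for the integer conversion, and the integer GEMM is emulated with int32 accumulation.

```python
import numpy as np

def quantize(x, clip=None):
    """Symmetric int8 quantization, optionally clipping to [-clip, clip] first."""
    if clip is not None:
        x = np.clip(x, -clip, clip)
    qs = float(np.max(np.abs(x)))
    scale = 127.0 / qs if qs > 0 else 1.0
    return np.rint(x * scale).astype(np.int8), scale

def tm_iqr_threshold(a, iqr_scale=1.5):
    """Threshold from the quartiles of the per-token maximum magnitudes."""
    token_max = np.max(np.abs(a), axis=1)
    q1, q3 = np.percentile(token_max, [25, 75])
    return float(q3 + iqr_scale * (q3 - q1))   # assumed: Tukey upper fence

def int8_gemm(a_q, a_scale, w_q, w_scale):
    """Emulated integer GEMM: int32 accumulation, then rescale to float."""
    acc = a_q.astype(np.int32) @ w_q.astype(np.int32)
    return acc.astype(np.float32) / (a_scale * w_scale)

def ff2_quantized(h, w2, use_tm_iqr=True):
    """h: (L, 4H) output of the nonlinearity; w2: (4H, H) FF2 weights."""
    t = tm_iqr_threshold(h) if use_tm_iqr else None
    h_q, h_s = quantize(h, clip=t)     # activations: quantized at run time
    w_q, w_s = quantize(w2)            # weights: would be quantized once at load time
    return int8_gemm(h_q, h_s, w_q, w_s)

rng = np.random.default_rng(0)
h = np.maximum(rng.normal(size=(16, 64)), 0.0).astype(np.float32)  # ReLU-like input
w2 = rng.normal(size=(64, 8)).astype(np.float32)
h[0, 0] = 500.0                        # a single saturating activation in token 0

ref = h @ w2
err_plain = float(np.abs(ff2_quantized(h, w2, use_tm_iqr=False) - ref)[1:].mean())
err_tm_iqr = float(np.abs(ff2_quantized(h, w2, use_tm_iqr=True) - ref)[1:].mean())
```

On the tokens that do not contain the extreme value (rows 1 onward), the error with clipping is far smaller in this synthetic case, since the integer grid is no longer stretched to cover a single saturated activation. The clipped token itself is distorted, which is the trade-off the threshold is meant to balance.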

Tasks
We test our proposed method on GLUE (Wang et al., 2018) and two popular question answering (QA) tasks: Natural Questions (NQ) (Kwiatkowski et al., 2019) and TyDI (Clark et al., 2020). We train models for all tasks using the publicly available PyTorch-Transformers library (Wolf et al., 2019). For GLUE tasks, we run 5 seeds with HuggingFace's default hyper-parameters for BERT, while tuning the learning rate for RoBERTa (refer to Appendix A for more details). For QA tasks, we follow (Alberti et al., 2019; Clark et al., 2020). Our underlying pre-trained language models for GLUE are BERT (cased) (Devlin et al., 2018) and RoBERTa (Liu et al., 2019), while for QA we use XLM-R (Conneau et al., 2019). Note that our method does not need any further fine-tuning once this step is done and the models are obtained.

Results
Since our method does not modify the training pipeline or tune the quantization parameters on training sets, we compare our results directly to the FP32 numbers.We are not expecting our method to outperform FP32 but rather to reduce the negative effect of quantization while keeping its speed as well as simplifying the model deployment process.

Question Answering
On TyDI and NQ (Table 2), TM-IQR clearly recovers most of the performance lost to dynamic quantization and is superior to I8 by 1 point on average. As with GLUE, TM-IQR performs well where the I8 drop is largest.

GLUE
Table 3 shows that TM-IQR is robust, with an overall average score drop relative to FP32 of only 0.2% for BERT-base, 0.5% for BERT-large, 1.2% for RoBERTa-base and 0.4% for RoBERTa-large. For all 4 pretrained models, TM-IQR wins on average. Even when TM-IQR does not outperform I8, the loss is relatively small. Interestingly, TM-IQR does well in cases where the I8 drop is large, e.g. CoLA and RTE for all models and STS-B for RoBERTa-base.

Conclusion
We show that BERT-like models can be quantized to 8-bit integers with good accuracy without the need to modify training procedures or add extra datasets for parameter calibration. We present a robust, statistically based algorithm that dynamically adjusts the quantization clipping to maintain reasonable accuracy. Our empirical results demonstrate the effectiveness of our method on a number of monolingual and multilingual NLP tasks, using both base and large BERT-like models.

A Evaluation on GLUE Task
For GLUE experiments we use the publicly available open-source library PyTorch-Transformers (Wolf et al., 2019). We report the standard metric on each task. Accuracy is used for MNLI, MNLI-MM (mismatch) (Williams et al., 2018), SST-2 (Socher et al., 2013), QNLI (Rajpurkar et al., 2016), and RTE (Dagan et al., 2005). The Matthews correlation coefficient is used for CoLA (Warstadt et al., 2019). F1 is used for MRPC (Dolan and Brockett, 2005) and QQP (Iyer et al., 2017). Finally, the Pearson correlation coefficient is used for STS-B (Cer et al., 2017). For BERT models, we use the default hyper-parameters provided by HuggingFace's library: the learning rate is 2 × 10^{-5}, the batch size is 32 and the number of fine-tuning epochs is 3, except for MRPC, where the number of fine-tuning epochs is 5. For RoBERTa models, we tuned the learning rate in [5e−7, 2e−6] for the best dev-set results on FP32 evaluations. In addition, we increased the number of epochs to 6 for the two large datasets, MNLI and QQP, and to 12 for the rest of the tasks.
Similarly to (Kim et al., 2021) we exclude WNLI (Levesque et al., 2012) since it showed unstable results even on FP32 due to its small dataset.

Table 1: IQR throughput cost in WPS (words per second), averaged over 4 runs. Each input is 512 tokens. 48-core Xeon 8260 and V100 speeds are included for reference.