Understanding and Overcoming the Challenges of Efficient Transformer Quantization

Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges – namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point format. We establish that these activations contain structured outliers in the residual connections that encourage specific attention patterns, such as attending to the special separator token. To combat these challenges, we present three solutions based on post-training quantization and quantization-aware training, each with a different set of compromises for accuracy, model size, and ease of use. In particular, we introduce a novel quantization scheme – per-embedding-group quantization. We demonstrate the effectiveness of our methods on the GLUE benchmark using BERT, establishing state-of-the-art results for post-training quantization. Finally, we show that transformer weights and embeddings can be quantized to ultra-low bit-widths, leading to significant memory savings with a minimum accuracy loss. Our source code is available at https://github.com/qualcomm-ai-research/transformer-quantization.

1 Introduction

Despite cutting-edge results in many applications, pre-trained transformer-based models are extremely large, sometimes exceeding billions of parameters. Hence, efficient deployment of these models on resource-constrained embedded systems, and sometimes even in data centers, has become an important problem due to high latency and a prohibitively large memory footprint and energy consumption.
One effective method to tackle this problem is neural network quantization. Quantization reduces memory consumption by using low-bit precision for weight and activation tensors. It also reduces inference time and improves energy efficiency by employing low-bit fixed-point arithmetic instead of floating-point arithmetic (Horowitz, 2014).
Quantization, however, is not free. It introduces additional noise in the network that can lead to a drop in the model's performance. While prior work has demonstrated the feasibility of integer-only inference for computer vision models (Lin et al., 2016; Jacob et al., 2018; Krishnamoorthi, 2018; Zhang et al., 2018; Choukroun et al., 2019; Dong et al., 2019; Esser et al., 2019; Nagel et al., 2019, 2020), there is relatively little work done on quantizing NLP models (Wang et al., 2018b), and specifically on transformer models.
Understanding the challenges of transformer quantization and designing a robust and easy-to-use quantization pipeline for them constitute the primary goal of this paper. The contributions of our work include:
• We show that standard 8-bit post-training quantization techniques lead to a significant performance degradation for transformer encoder models.
• We conduct a systematic study to identify the underlying reason that precludes efficient transformer quantization. We find that the main bottleneck is a considerable mismatch between the different dynamic ranges of activation tensors in the residual connections. Further analysis shows that these activation tensors contain structured outliers that facilitate specific attention patterns in deeper encoder layers, such as attending to the special [SEP] token. We highlight that this issue is inherent to many architectures and pre-training objectives.
• Based on these findings, we propose a set of solutions with different trade-offs to overcome the dynamic range problem, including techniques based on post-training, mixed precision, and quantization-aware training. In particular, we introduce a new per-embedding-group quantization scheme, which solves the activation quantization issue without a significant compute overhead or increase in complexity.
• Finally, we show that weights and embeddings in BERT-like models can be quantized to ultra-low (2-4) bits, reducing the memory footprint by more than 8× with a minimal accuracy loss.
We evaluate our proposed solutions on eight different NLP tasks from the well-known GLUE benchmark. Our techniques set a new state-of-the-art of post-training quantization and per-tensor quantization-aware training for the BERT model. To the best of our knowledge, this is the first work on BERT-like transformer quantization with a strong focus on post-training quantization. The presented method is not exclusive to BERT and is easily applicable to other pre-trained transformer models.
2 Background and related work

Efficient Transformers Making transformer models more efficient in terms of memory and computation time is an active area of research; a good survey is Tay et al. (2020). Most prior work focuses on architectural changes that speed up self-attention, the most expensive operation, which is crucial for efficient processing of long sequences of tokens or pixels. Notable examples include works that apply fixed (Child et al., 2019; Beltagy et al., 2020) or learned (Kitaev et al., 2020) sparsity patterns to the otherwise dense attention matrix, while others introduce efficient approximations based on low-rank (Wang et al., 2020b) or kernel methods (Katharopoulos et al., 2020; Choromanski et al., 2020). Complementary efforts in this area include compact and fast architectures by design (Sun et al., 2020; Iandola et al., 2020), weight sharing (Dehghani et al., 2018; Lan et al., 2019), parameter reuse across multiple downstream tasks (Houlsby et al., 2019; Stickland and Murray, 2019), knowledge distillation (Sanh et al., 2019; Jiao et al., 2020), neural architecture search (Guo et al., 2019; Wang et al., 2020a), pruning (Sanh et al., 2020; Prasanna et al., 2020), and better pre-training (Liu et al., 2019; Clark et al., 2020).
Quantization One of the most powerful ways to decrease the computational time and memory consumption of neural networks is quantization, which uses low-bit representations for weight and/or activation tensors. When moving from 32 to 8 bits, the memory overhead of storing tensors decreases by a factor of 4, while the computational cost of matrix multiplication reduces quadratically, by a factor of 16. Low-bit fixed-point representations, such as INT8, further reduce the energy consumption since fixed-point operations are more efficient than their floating-point counterparts (Horowitz, 2014). However, exact latency improvements and energy savings are highly dependent on the target hardware. Therefore, we focus in this work on achieving high memory and compute reduction while maintaining acceptable model accuracy, and do not measure actual on-device performance gains. We cover the relevant basics of quantization here; for a more comprehensive overview of neural network quantization, please refer to Nagel et al. (2021).
A commonly used scheme for quantization is uniform affine or asymmetric quantization (Zhou et al., 2016;Hubara et al., 2017;Krishnamoorthi, 2018) because it allows for efficient implementation of fixed-point arithmetic. It is defined by bitwidth b ∈ N, scale factor s ∈ R + , and zero-point z ∈ Z. We simulate the quantization process in floating-point according to Jacob et al. (2018).
Quantizing a real-valued tensor x is performed by first mapping it to an unsigned integer grid:

    x^(Z) = clip( round(x / s) + z, 0, 2^b − 1 ).    (1)

It is possible to approximately recover the real-valued input x through an operation that is often referred to as de-quantization:

    x̂ = s · ( x^(Z) − z ).    (2)

In the case of symmetric quantization, we restrict z = 0 so that the quantization grid is symmetric around zero.
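As an illustration, equations (1) and (2) can be simulated in a few lines of NumPy. This is our own sketch, not code from the paper's released repository; the function and variable names are ours:

```python
import numpy as np

def quantize(x, s, z, b=8):
    # Map real values onto the unsigned integer grid [0, 2^b - 1], eq. (1).
    return np.clip(np.round(x / s) + z, 0, 2**b - 1)

def dequantize(xq, s, z):
    # Approximately recover the real-valued input, eq. (2).
    return s * (xq - z)

def min_max_params(x, b=8):
    # Asymmetric (affine) parameters from the tensor's full dynamic range.
    x_min, x_max = x.min(), x.max()
    s = (x_max - x_min) / (2**b - 1)
    z = np.round(-x_min / s)
    return s, z

x = np.random.default_rng(0).normal(size=1000)
s, z = min_max_params(x)
x_hat = dequantize(quantize(x, s, z), s, z)
# The round-trip error is bounded by half the step size s.
assert np.abs(x - x_hat).max() <= s / 2 + 1e-6
```

The final assertion illustrates the range/precision trade-off discussed later: the quantization error scales with the step size s, which in turn grows with the dynamic range of the tensor.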
It is common to have a single set of quantization parameters per tensor, known as per-tensor quantization. One could also increase the quantization granularity by defining separate quantizers for individual segments of a tensor. This will improve the accuracy of a network, but at the cost of additional compute and memory overhead.
An important class of quantization methods is post-training quantization (PTQ) algorithms, which take a pre-trained FP32 network and convert it directly into a fixed-point network without the need for the original training pipeline (Krishnamoorthi, 2018). A vital step in the PTQ process is finding good quantization ranges for each quantizer. One way of doing this is static range estimation, which determines quantization parameters for the network by passing a few batches of calibration data through the model before inference. It yields more efficient inference since all the quantization parameters are known in advance and fixed. The most common range estimators include: current min-max (or simply min-max), which uses the full dynamic range of the tensor (Zhou et al., 2016; Wu et al., 2018b; Zhu et al., 2020); running min-max, which uses an exponential moving average of the min and max over multiple batches (Krishnamoorthi, 2018); and MSE, which finds quantization parameters that minimize the mean squared error between the quantized and floating-point tensors (Choukroun et al., 2019; Banner et al., 2018).
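The three estimators can be sketched as follows. This is our own simplified illustration; in particular, the MSE variant below uses a plain grid search over symmetric shrinkage of the min-max range, which is one of several possible ways to minimize the objective:

```python
import numpy as np

def current_minmax(x):
    # Use the full dynamic range of a single batch.
    return x.min(), x.max()

def running_minmax(batches, momentum=0.9):
    # Exponential moving average of min/max over multiple batches.
    lo, hi = batches[0].min(), batches[0].max()
    for b in batches[1:]:
        lo = momentum * lo + (1 - momentum) * b.min()
        hi = momentum * hi + (1 - momentum) * b.max()
    return lo, hi

def mse_range(x, b=8, n_grid=50):
    # Pick the clipping range that minimizes the quantization MSE
    # via a grid search over symmetric shrinkage of min/max.
    best, best_err = (x.min(), x.max()), np.inf
    for frac in np.linspace(0.1, 1.0, n_grid):
        lo, hi = frac * x.min(), frac * x.max()
        s = (hi - lo) / (2**b - 1)
        xq = np.clip(np.round((x - lo) / s), 0, 2**b - 1)
        err = np.mean((lo + s * xq - x) ** 2)
        if err < best_err:
            best, best_err = (lo, hi), err
    return best

x = np.random.default_rng(0).normal(size=4096)
lo, hi = mse_range(x)
# The MSE-optimal range never exceeds the full min-max range.
assert x.min() <= lo and hi <= x.max()
```

When the calibration data contains outliers, MSE may pick a tighter range than min-max, trading clipping error on a few values for finer resolution on the bulk of the tensor.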
An alternative to PTQ is to train a neural network with simulated quantization operations in the network, known as quantization-aware training (QAT, Jacob et al. 2018; Gupta et al. 2015; Krishnamoorthi 2018). It allows the model to better adapt to the introduced quantization noise compared to PTQ, at the cost of longer training times, the need for labeled data, and a hyper-parameter search. Gradients through the non-differentiable quantization step are usually approximated using the straight-through estimator (Bengio et al., 2013). Ranges for both weights and activations can be set using PTQ range estimators or learned jointly with the weights during training, as in Esser et al. (2019). Finally, it is possible to assign different bit-widths to different layers or parts of the network, a technique known as mixed precision (Lin et al., 2016; Wu et al., 2018a; Zhou et al., 2018; Dong et al., 2019; Wang et al., 2019; van Baalen et al., 2020). Note that prior approaches to BERT-like transformer quantization, such as Q8BERT (Zafrir et al., 2019) and Q-BERT (Shen et al., 2020), employ some form of QAT and either do not discuss PTQ alternatives or only use them as weak baselines.
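A minimal sketch of the straight-through estimator, with the forward and backward passes written out by hand in NumPy (our own illustration): the forward pass rounds to the grid, while the backward pass passes the incoming gradient through unchanged wherever the input falls inside the representable range and zeroes it where the value was clipped.

```python
import numpy as np

def quantize_forward(x, s, qmin=0, qmax=255, z=0):
    # Simulated quantization: round-to-grid in the forward pass.
    xq = np.clip(np.round(x / s) + z, qmin, qmax)
    return s * (xq - z)

def quantize_backward_ste(x, grad_out, s, qmin=0, qmax=255, z=0):
    # STE: d(quant)/dx is treated as 1 where x lands inside the
    # representable range and 0 where it was clipped.
    on_grid = np.round(x / s) + z
    inside = (on_grid >= qmin) & (on_grid <= qmax)
    return grad_out * inside

x = np.array([-1.0, 0.3, 2.0, 300.0])
g = np.ones_like(x)
gx = quantize_backward_ste(x, g, s=1.0)
# Gradient is zeroed for x = -1.0 (below the grid) and x = 300.0 (above it).
assert gx.tolist() == [0.0, 1.0, 1.0, 0.0]
```

Frameworks implement the same idea by overriding the gradient of the rounding operation; the manual version above only makes the clipping mask explicit.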

3 Problem investigation
First, we investigate what happens when we apply standard 8-bit post-training quantization to the BERT model and evaluate it on eight downstream tasks from the GLUE benchmark. To quantize fine-tuned models, we use uniform affine quantization with static range estimation, as described in Section 2. We quantize all layers' weights and activations (both input and output). We follow a typical setup with symmetric weight and asymmetric activation quantization (Bhalgat et al., 2020). We try several choices of range estimation for both weights and activations and report the best configuration per task, based on its metric (see Appendix B.2 for details).

Table 1: Post-training quantization results on development sets of the GLUE benchmark (except WNLI). The metrics for these tasks can be found in the GLUE paper; in all cases, higher is better. The FP32 baseline is trained by the authors from the pre-trained checkpoint, see Appendix B.1 for details. We report a median over 5 runs with different random seeds.
In Table 1, we present the results for joint (W8A8), activation-only (W32A8), and weight-only quantization (W8A32). We note that there is a significant performance degradation for joint 8-bit quantization. We can also see that weight quantization incurs almost no error on its own and that most degradation is due to activation quantization. Finally, some tasks seem to be more robust to quantization than others.
To find which part of the network is the most problematic, we perform an ablation study in which we do not quantize specific activations. The results are summarized in Table 2. By far, the smallest performance drop is when we do not quantize the residual sum after the feed-forward network (FFN, see Figure 1). Furthermore, the issue seems to be the most pronounced for deeper encoder layers (10 and 11).
To understand why quantizing the residual FFN sum is so detrimental, we look at the activation tensors in the problematic 11th layer. First, from Figure 2a, we note that FFN's input and output have radically different dynamic ranges (note the scales of the y-axes) due to strong outliers in the output tensor. Applying per-tensor quantization to the FFN's residual sum is likely to cause a notable error because of the following trade-off between range and precision. On the one hand, using the full dynamic range of the FFN's output will lead to a very coarse quantization of its input. On the other hand, using higher precision for the input will cause information loss in the output due to aggressive clipping of its range. We also notice a correlation of outliers with special [SEP] tokens. In addition, from Figure 2b, we observe that only a few embedding dimensions are consistently responsible for these outliers across many data points.
In Appendix D, we show that this is the case for all layers of BERT-base and all GLUE tasks. Furthermore, we show that a similar issue is also present in multiple architectures and training objectives, including pre-trained BERT-large, RoBERTa, DistilRoBERTa (Sanh et al., 2019), and MobileBERT (Sun et al., 2020).
Further analysis suggests that structured outliers in the FFN's residual connections lead to structured outliers in query-key multiplications in specific attention heads in the next attention layer, causing most of the tokens to attend to the special [SEP] token. See Appendix A for more details on this.

4 Methodology
In this section, we introduce our proposed techniques for BERT-like model quantization. Motivated by our findings from Section 3, we consider three ways of efficient BERT quantization: post-training mixed precision, a new per-embedding-group activation quantization, and quantization-aware training. Each of these three methods comes with its own set of trade-offs, which is why we present all three, so that readers can pick the solution that best fits their use case. As before, we employ uniform affine quantization and static activation ranges, which are either estimated in PTQ or learned during QAT, as described in Section 2.
Mixed precision PTQ As seen in Section 3, not all parts of BERT are equally sensitive to quantization noise. Thus, selecting a higher bit-width for sensitive tensors can lead to better accuracy while efficiently keeping all the other tensors in 8-bit or lower. First, we consider 16-bit activation quantization for problematic activation tensors, such as the residual sum tensor after the feed-forward network. This provides the model with sufficient precision to represent both FFN's input and output, as well as their sum. Additionally, given the observation from Table 1 that the BERT model seems to be quite resilient to 8-bit weight quantization, we also consider the effect of low-bit (2-4) weight and token embedding quantization, which reduces the model size by more than 8× with a minimal loss in accuracy.

Figure 1: Left: Leave-one-out analysis for activation quantizers on problematic GLUE tasks. We set all weights to FP32 and use the current min-max (with a batch size of 1) range estimator for activations. We report the median score over 5 runs with different random seeds. Right: A schematic illustration of the attention layer in BERT. The hidden activation tensor is denoted by x. ⊕ is an element-wise addition. The problematic residual connection sum after the feed-forward network is highlighted in red.
Per-embedding-group PTQ As discussed in Section 2, another way of improving the performance of the quantized model is to increase the quantization granularity. Based on our observation from Figure 2b, that the most problematic outliers in activation tensors are in few designated embedding dimensions, we consider having distinct quantization parameters for individual embedding dimensions or groups of embedding dimensions, as shown in Figure 3.
We start by describing per-embedding activation quantization. In BERT-like models, an intermediate hidden activation tensor x has a shape (B, T, d), where B is the batch size, T is the sequence length, and d is the number of embedding dimensions (d = 768 for BERT-base, Devlin et al. 2019). Inspired by per-channel weight quantization (Krishnamoorthi, 2018), we can have distinct scaling factors and zero-points per embedding dimension instead of two scalars for the whole tensor. In this case, we collectively denote the quantization parameters by vectors s ∈ R^d and z ∈ Z^d. The rest of the quantization machinery works as before, including range estimation, with the only difference that equations (1) and (2) now broadcast along the last dimension. The proposed scheme should alleviate the activation quantization issue since the outlier embedding dimensions will no longer dominate the ranges of the other embedding dimensions. Note, however, that full per-embedding activation quantization will lead to a more expensive computational graph. To illustrate why, consider a matrix-vector multiplication Wx, which in the case of per-tensor quantization (and assuming z = 0) for both weights and activations can be computed as follows:

    Wx ≈ Ŵ x̂ = s_w s_x ( W^(Z) x^(Z) ),  i.e.,  ŷ_i = s_w s_x Σ_j W_ij^(Z) x_j^(Z).    (3)

A crucial detail here is that we can factor the common scale s_w s_x out of the summation. The sum is then efficiently calculated using integer-only arithmetic.
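The factorization in equation (3) can be checked numerically. This is our own sketch with made-up shapes; the key point is that the accumulation stays in integer arithmetic and the floating-point scale is applied exactly once:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
x = rng.normal(size=8)

# Symmetric (z = 0) per-tensor INT8 quantization of weights and activations.
s_w = np.abs(W).max() / 127
s_x = np.abs(x).max() / 127
W_q = np.round(W / s_w).astype(np.int64)
x_q = np.round(x / s_x).astype(np.int64)

# The accumulation W_q @ x_q runs entirely on integers; the common
# factor s_w * s_x is applied once, after the summation.
acc = W_q @ x_q
assert np.issubdtype(acc.dtype, np.integer)
y = (s_w * s_x) * acc

# The result matches the floating-point product up to quantization error.
assert np.allclose(y, W @ x, atol=0.5)
```

With per-embedding activation scales, the factor s_x,j would differ per summand, which is exactly what breaks this single-rescaling structure.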
In the case of per-embedding activation quantization ( x̂ = s_x ⊙ x^(Z), with s_x ∈ R^d ), the matrix-vector multiplication becomes instead:

    ŷ_i = s_w Σ_j s_x,j W_ij^(Z) x_j^(Z).    (4)

Here it is no longer possible to take the scaling factor out of the summation and perform a single re-scaling of the result. Instead, one has to perform repeated intermediate re-scalings on the accumulator.
To alleviate the overhead of constant re-scaling, we introduce per-embedding-group (PEG) quantization, where we split the activation tensor into K evenly sized groups along the embedding dimension and share quantization parameters among elements in the same group (assuming z = 0 as before):

    x̂ = [ s_1 x_(1)^(Z), s_2 x_(2)^(Z), … , s_K x_(K)^(Z) ],    (5)

where x_(k) denotes the k-th group of d/K consecutive embedding dimensions and [· · ·] denotes concatenation. Thus the required re-scaling operations are significantly reduced from d to K.
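A sketch of PEG quantization for a single activation tensor (our own illustration; shapes, names, and the synthetic outlier data are ours). Setting K = 1 recovers per-tensor quantization, which makes the benefit of grouping easy to measure:

```python
import numpy as np

def peg_quantize(x, K, b=8):
    # Split the last (embedding) dimension into K even groups and use a
    # separate asymmetric quantizer per group; K=1 is per-tensor.
    d = x.shape[-1]
    assert d % K == 0
    out = np.empty_like(x)
    for k in range(K):
        g = slice(k * d // K, (k + 1) * d // K)
        lo, hi = x[..., g].min(), x[..., g].max()
        s = (hi - lo) / (2**b - 1) + 1e-12
        z = np.round(-lo / s)
        xq = np.clip(np.round(x[..., g] / s) + z, 0, 2**b - 1)
        out[..., g] = s * (xq - z)
    return out

# Activations with a few outlier embedding dimensions, mimicking Figure 2b.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16, 768))
x[..., -4:] *= 50.0  # outlier dimensions

err_pt = np.mean((peg_quantize(x, K=1) - x) ** 2)   # per-tensor
err_peg = np.mean((peg_quantize(x, K=6) - x) ** 2)  # 6 groups of 128
assert err_peg < err_pt
```

With grouping, the coarse range forced by the outlier dimensions is confined to their own group, so the remaining groups keep a fine quantization step.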
The proposed scheme might not be natively supported on all target devices, but there is an efficient way to implement it using only per-tensor quantization. Before quantizing the output of the first LayerNorm (Figure 1), we split the output tensor based on the embedding groups into K individual tensors. We accordingly split the columns of the first Linear layer and the rows of the second Linear layer, decomposing each into K smaller Linear layers. The outputs of the first set of layers are element-wise summed, and the outputs of the second set of layers are concatenated before the residual sum. With this functionally equivalent rewriting, all operations can be performed using standard per-tensor quantization.
To ensure all outliers end up in the same group, we employ a deterministic range-based permutation of the embedding dimensions. Similar to range estimation for activation quantization, we pass some calibration data through the unquantized network and record the dynamic range r_j := max(x_{:,:,j}) − min(x_{:,:,j}) for each embedding dimension j. Next, we define K evenly sized groups based on the indices in argsort(r). During the range estimation phase, we determine a separate quantization range for each group. The sorting and grouping need to happen only once, before the range estimation phase and deployment to the target. PEG quantization with permutation can still be simulated on hardware that only supports per-tensor operations. First, we can share the same permutation for FFN's input, output, and sum, since we expect the outliers in the output to dominate those from the input. Second, we use the permutation-equivariance of LayerNorm and linear layers (weights are permuted accordingly before inference). We first permute the output of the first LayerNorm, proceed as described above, and then apply the inverse permutation before the next LayerNorm.
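The range-based permutation itself is a few lines of code. Below is our own sketch (the calibration tensor and outlier positions are synthetic): compute the per-dimension ranges over the calibration data, sort the dimensions by range, and cut the sorted order into K even groups, so that the widest-range dimensions land together in the last group.

```python
import numpy as np

def range_based_permutation(x_cal, K):
    # Dynamic range r_j per embedding dimension over calibration data
    # of shape (batch, sequence, d).
    r = x_cal.max(axis=(0, 1)) - x_cal.min(axis=(0, 1))
    perm = np.argsort(r)                # dimensions sorted by range
    d = x_cal.shape[-1]
    groups = perm.reshape(K, d // K)    # K evenly sized groups
    return perm, groups

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16, 768))
outlier_dims = [7, 123, 500]            # scattered outlier dimensions
x[..., outlier_dims] *= 50.0

perm, groups = range_based_permutation(x, K=6)
# After sorting by range, all outlier dims share the last group.
assert set(outlier_dims) <= set(groups[-1].tolist())
```

At deployment time only the permutation indices and the K sets of quantization parameters need to be stored, matching the small overhead discussed next.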
Note that the PEG quantization has a negligible memory overhead, introducing only d + 2 · 3 · K extra parameters per attention layer (permutation indices and scale & zero points per group for FFN's input, output, and sum), which is less than 0.04% of the total size of BERT-base model.
Quantization-aware training Finally, we consider a variant of QAT with learnable ranges for both weights and activations by adapting the procedure from Esser et al. (2019) and Jain et al. (2019) for BERT-like transformer models. Simulating the quantization process during fine-tuning allows the model to adapt to the quantization noise and often significantly increases performance compared to post-training quantization.

Table 3: Qualitative comparison of the proposed methods (columns: MP-PTQ, PEG-PTQ, QAT; rows: post-training, per-tensor, same bit-width).

Comparison of methods We summarize the different trade-offs of the proposed techniques in Table 3. As discussed in Section 2, PTQ methods are usually preferred over QAT algorithms since they are faster and require either no data or only a small calibration dataset. Additionally, they typically require almost no hyper-parameter tuning, enabling easy and computationally efficient quantization. Allocating a higher bit-width to certain parts of the network reduces the efficiency gains from quantization, because higher bit-width layers are more computationally expensive; it is also not supported by all target hardware. Per-embedding-group quantization has a finer granularity compared to per-tensor quantization. It leads to a minor amount of extra compute (and potential latency) due to the additional summation and re-quantization and might not be supported natively on every fixed-point platform. However, we have shown a way to simulate this scheme on hardware that only supports per-tensor quantization operations.

5 Experiments
In this section, we evaluate the proposed quantization techniques for the BERT model on GLUE downstream tasks.
Experimental setup In all experiments, we use uniform affine quantization (symmetric weights, asymmetric activations) with the static activation range setting, as discussed in Section 2. We quantize all layers' weights and activations. For 8-bit weight quantization, we use the best range settings found in the experiment from Section 3, which can be found in Appendix B.2. However, for low (<8) bit weight and token embedding quantization, we always use the MSE range estimator, as recommended by Choukroun et al. (2019) and Banner et al. (2018). We set activation ranges based on the min and max from a single input sequence. For PTQ experiments, we report the median score over five runs with different random seeds.
For QAT experiments, we initialize all quantization parameters from the PTQ setup described above. Similarly to full-precision fine-tuning, we use Adam (Kingma and Ba, 2014) and a maximum sequence length of 128, with padding using a special [PAD] token for shorter sequences. We use a typical learning rate schedule from the transformer literature (Devlin et al., 2019;Liu et al., 2019;Lan et al., 2019) -a linear warmup for the first 10% of training steps followed by a linear decay to zero. We perform a hyper-parameter search over the maximum learning rate, batch size, number of epochs, and the self-attention dropout rate for every task and report the best median score over three runs with different random seeds. For reproducibility, we included more details on the search space and selected hyper-parameters in Appendix B.3.
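The learning rate schedule described above can be written as a small function. This is a sketch of the standard schedule, not code from the paper; the total step count depends on each task's dataset size, batch size, and number of epochs:

```python
def linear_warmup_decay(step, total_steps, max_lr, warmup_frac=0.1):
    # Linear warmup for the first 10% of training steps,
    # followed by a linear decay to zero.
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * (total_steps - step) / (total_steps - warmup_steps)

# Zero at the start, peak at the end of warmup, zero at the end of training.
assert linear_warmup_decay(0, 1000, 2e-5) == 0.0
assert abs(linear_warmup_decay(100, 1000, 2e-5) - 2e-5) < 1e-12
assert linear_warmup_decay(1000, 1000, 2e-5) == 0.0
```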
Mixed precision PTQ First, we present the results for mixed precision post-training quantization (MP-PTQ) in Table 4, where we start from 8-bit activations and progressively keep more and more operations in 16-bit precision. We see that for classification tasks (MNLI, QNLI, RTE), it is sufficient to keep a few of the most problematic parts in 16-bit to get good performance. For the STS-B regression task, it is also necessary to keep the output in higher precision to close the gap with the FP32 model performance.

Table 5: Per-embedding-group activation quantization PTQ results for BERT-base on development sets of the problematic GLUE tasks. * Per-embedding-group quantization is applied only to FFN's input, output, and residual sum (all the rest per-tensor). "+P" uses range-based permutation.
In conclusion, by keeping only 22% of the activations (36 out of 161 activation quantizers for BERT-base) in 16-bit, we can achieve performance close to FP32, while all other activations and all weights are in 8-bit for efficient inference.

Per-embedding-group PTQ Next, we investigate the effectiveness of the proposed per-embedding-group post-training activation quantization, depending on the number of groups K. The results are summarized in Table 5. Per-embedding activation quantization significantly improves performance, even when only applied to problematic parts of the network. Surprisingly, we can also recover most of the performance degradation with only K = 3 groups (of size 256 each), especially if we apply range-based permutation to ensure all the outliers end up in the same group. A small number of groups is essential since it limits the number of re-scalings required, enabling efficient execution on resource-constrained devices.
Comparison of proposed methods We summarize the results for all of our proposed techniques and compare them to several related methods from the literature in Table 6. We use the same setup as described above. Unless otherwise stated, all results use 8-bit per-tensor quantization for both weights and activations. For mixed precision (MP-PTQ), we use the best setup from the ablation study before. For per-embedding-group quantization (PEG-PTQ), we use K = 6 groups with range-based permutation for all tasks and only apply it to FFN's input, output, and the sum.
To summarize, all the proposed techniques solve the dynamic range problem, enabling efficient transformer quantization with minimal accuracy loss. Our PTQ results strongly outperform results from the literature, while our assumptions in mixed precision are milder than those of Q8BERT, which keeps all non-linearities in FP32. Our per-tensor QAT results are also on par with or outperform results from the literature, which use finer quantization granularity and keep certain parts of the network in FP32.
Low-bit weight and token embeddings Given the robustness of the BERT model to 8-bit weight quantization, we investigate the effect of low-bit weight and token embedding quantization and summarize the results in Table 7.
We see that even in the post-training regime, it is possible to achieve low-bit weight quantization with acceptable performance degradation, especially when combined with AdaRound (Nagel et al., 2020), a technique for learning optimal rounding. QAT recovers most of the performance, even with quantized activations. Furthermore, we can push token embeddings to 2 bits with less than a 0.8% drop in terms of the GLUE score. This reduces the model size by 8.85× compared to the original FP32 checkpoint and can significantly increase inference speed and reduce energy consumption on resource-constrained devices. More detailed results, including per-task scores and a comparison to results from the literature, can be found in Appendix C.

Table 6: Comparison of the proposed methods with results from the literature on development sets of the GLUE benchmark. The metrics for these tasks can be found in the GLUE paper; in all cases, higher is better. We compare against Q8BERT (Zafrir et al., 2019) and Q-BERT (Shen et al., 2020). Note that these papers start from FP32 baselines with slightly different scores. * Uses FP32 Softmax, GELU and LayerNorm. † Uses dynamic activation quantization. ‡ Reports F1 score for MRPC, QQP and Pearson correlation for STS-B, instead of the combined metrics. § A macro-average without a score for the MNLI task. ψ Uses group-wise per-channel weight quantization with 128 groups and keeps the last fully-connected layer in FP32.

6 Conclusions
In this paper, we explored quantization for BERT-like transformers. We showed that these models have unique quantization challenges, namely, high dynamic activation ranges that are difficult to represent with a low-bit fixed-point format. These activations contain structured outliers in the residual connections that encourage specific model behavior, such as attending to the special [SEP] token. Motivated by our findings, we proposed three solutions: one based on mixed precision quantization, a novel per-embedding-group quantization, and quantization-aware training. Each of these methods has its own set of trade-offs in terms of accuracy, ease of use, and model size. Our techniques overcome the dynamic range issues and set a new state-of-the-art for PTQ and per-tensor QAT on GLUE downstream tasks. Finally, we achieved 4-bit weight and 2-bit token embedding quantization with less than a 0.8% drop in terms of the GLUE score, leading to significant memory and compute savings.
To better understand why transformer models learn these peculiar outliers, we look at what happens with these outliers when they proceed to the next attention layer. We visualize the attention mechanism for one of the attention heads in the problematic 11th layer of BERT-base in Figure 4. We can see that most of the tokens in this attention head attend to the special [SEP] tokens. Furthermore, from Figure 4b we see a similar consistent vertical pattern (indicated by black arrows) as we saw in the per-embedding graphs for FFN's input and output activation tensors (see Figure 2b in the paper). This means the attention mechanism generates queries and key vectors such that the decision to attend to the special separator tokens is determined by only a few designated neurons. It suggests that structured outliers in the residual connections lead to structured outliers in the query-key multiplications, causing most tokens to attend to the separator token. Clark et al. (2019) have shown that in BERT-like transformer models, attending to the special [SEP] token is essentially a "no-op" for attention heads that cannot find the patterns they were trained to look for in the specific passage of text. Clark et al. (2019) also showed that such behavior is quite common: often, more than half of a head's attention is on special tokens, specifically in deeper layers.
We hypothesize that such an attention pattern seems to be a useful one to obtain a good predictive performance, while the structured outliers merely help to facilitate this behavior. These outliers causing such a high dynamic range for activations likely emerged as a result of specific architectural choices (e.g., large fully-connected layers) and long pretraining with no explicit activation regularization applied.
B.1 Fine-tuning details

We follow standard fine-tuning practices from Devlin et al. (2019) and https://github.com/huggingface/transformers. Each data point is tokenized and truncated to the maximum sequence length of 128. Shorter sequences are padded to the same length of 128 using a special [PAD] token. We fine-tune for 3 epochs using Adam for all tasks. The learning rate is initially set to its maximum value and is linearly decayed to zero by the end of fine-tuning. We tune the batch size and the maximum learning rate individually per task from the following search space:
• batch size: {32, 64} for bigger tasks (QQP, MNLI, QNLI) and {8, 16, 32, 64} for the rest,
• learning rate: {2, 3, 4, 5}e-5.
We repeat every experiment 5 times with different random seeds and select the configuration with the best median score on the development set for the respective task. These configurations are shown in Table 8. Quantization is always applied to the median checkpoint for the respective task. We exclude the problematic WNLI task (Levesque et al., 2012), as it has a relatively small dataset and shows unstable behaviour (Dodge et al., 2020), in particular due to several issues with the way the dataset was constructed.

B.2 Range setting for 8-bit post-training quantization
We select the best range estimators from the following search space:
• weights: {min-max, MSE};
• activations: {current min-max, running min-max, MSE}.
For activations, we also select the best batch size and number of batches from {1, 4, 16} (except current min-max, for which only a single batch is used). For running min-max, we use a momentum coefficient of 0.9. We repeat every experiment 5 times with different random seeds and select the configuration with the best median score on the development set for the respective task. The best configurations for joint weight and activation 8-bit post-training quantization are listed in Table 9.

B.3 W8A8 QAT hyper-parameters
Hyper-parameters for W8A8 quantization-aware training are listed in Table 10.

B.4 W4A8 QAT hyper-parameters
Hyper-parameters for W4A8 quantization-aware training are listed in Table 11.

D.1 Activation tensors across embedding dimensions

We visualize FFN's input and output activation tensors across the embedding dimension for the development sets of MNLI (Figure 5), STS-B (Figure 6), and MRPC (Figure 7). We see that only a few designated embedding dimensions generate outliers across many data points. This suggests that such behavior is already pre-determined by the weights and embeddings of the pre-trained BERT model.

D.2 Activation tensors for different architectures
We show that the dynamic range issue is present in multiple architectures and training objectives:
• BERT-base in Figure 8,
• BERT-large in Figure 9,
• RoBERTa-base in Figure 10,
• DistilRoBERTa-base in Figure 11,
• MobileBERT-base in Figure 12.
In all cases, we used pre-trained checkpoints from the HuggingFace library (Wolf et al., 2020).

Table 12: Low-bit weight and token embedding quantization results for BERT-base on development sets of the GLUE benchmark. We compare against Q-BERT (Shen et al., 2020). Note that this work starts from FP32 baselines with slightly different scores. * Keeps the last fully-connected layer in full precision. † Uses group-wise per-channel weight quantization with 128 groups (of size 6 each).
Figure 5: Visualization of activation tensor outliers in BERT-base FFN's input and output across the embedding dimension for the first ten data sequences in the MNLI development set. Dark grey color indicates values that exceed six standard deviations from the mean of the activation tensor.
Figure 12: Activation distributions of FFN's input (a) and output (b) in the second-to-last layer for MobileBERT-base, evaluated on the first ten data sequences from development sets of GLUE downstream tasks (full-precision). In each sub-plot, left-to-right, top-to-bottom: CoLA, MNLI, MRPC → QNLI, QQP, RTE → SST-2, STS-B, WNLI. x-axis: index of data sequence. y-axis: the range (note the scales are different for the input and the output).