PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models

While transformer-based pre-trained language models (PLMs) have dominated a number of NLP applications, these models are heavy to deploy and expensive to use. Therefore, effectively compressing large-scale PLMs becomes an increasingly important problem. Quantization, which represents high-precision tensors in a low-bit fixed-point format, is a viable solution. However, most existing quantization methods are task-specific, requiring customized training and quantization with a large number of trainable parameters on each individual task. Inspired by the observation that the over-parameterized nature of PLMs makes it possible to freeze most of the parameters during the fine-tuning stage, in this work we propose a novel ``quantize before fine-tuning'' framework, PreQuant, that differs from both quantization-aware training and post-training quantization. PreQuant is compatible with various quantization strategies, with outlier-aware parameter-efficient fine-tuning incorporated to correct the induced quantization error. We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5. We also provide an empirical investigation into the workflow of PreQuant, which sheds light on its efficacy.


Introduction
Pre-trained language models (PLMs) have shown superior performance in various NLP applications.
Despite their impressive success, these transformer-based models typically contain hundreds of millions of parameters. The massive model scale is becoming an increasing burden, preventing researchers from making full use of large-scale PLMs. According to a recent study, only 0.5% to 4% of research papers published at the five most recent NLP conferences adopt large PLMs (PLMs with over a billion parameters) (Ding et al., 2022). This suggests that the inefficiency of deploying large PLMs is hampering the development of NLP research. Therefore, compressing PLMs becomes an urgent and important problem.
Various model compression methods have been proposed, such as knowledge distillation (Jiao et al., 2020; Sanh et al., 2019; Wang et al., 2021; Passban et al., 2021), weight sharing (Lan et al., 2019), network pruning (Liang et al., 2021; Gordon et al., 2020; Li et al., 2021), and quantization (Tao et al., 2022; Zhang et al., 2020; Bai et al., 2021; Kim et al., 2021). Among these compression methods, quantization is a promising solution. The core idea of quantization is to use low bit precision to store weight and activation tensors, and to use fixed-point arithmetic to speed up inference. There are prior works on quantizing PLMs covering different strategies and granularities. However, these quantization methods generally neglect a key characteristic of PLMs, namely the distinction between fine-tuning a model and training a model from scratch, and treat quantizing PLMs no differently from quantizing regular neural networks. In other words, these methods are task-specific: they design customized quantization for each task. This brings two main limitations. First, task-specific methods need to conduct both quantization and fine-tuning for each downstream task, with the quantization applied either during or after the fine-tuning stage, which is inefficient. Second, the number of trainable parameters remains very large during fine-tuning, which is computationally expensive.
In this work, we consider a quantization pipeline designed specifically for the pre-training scenario. Our motivation starts from the distinction between "training from scratch" and "pre-training then fine-tuning". Unlike weights from random initialization, the weights of a pre-trained model already contain rich information acquired during pre-training. To utilize such information in a more efficient manner, we propose to directly quantize the pre-trained model in a task-agnostic way to obtain a "pre-quantized" model before fine-tuning. We then introduce parameter-efficient fine-tuning and show that fine-tuning can be finished with minimal weight updates. In particular, we freeze most of the quantized weights in the "pre-quantized" model, and only fine-tune a very small subset of its model parameters in the fine-tuning process. Through an extensive set of explorations and experiments, we demonstrate the feasibility and advantages of the "quantizing the PLM first, then fine-tuning" pipeline, which we name PreQuant. The main contributions are summarized as follows:
• We propose a novel quantization framework, PreQuant, tailored for PLMs. We conduct a systematic study to overcome the difficulties of PLM quantization and validate the performance through thorough experiments.
• PreQuant performs task-agnostic quantization, which dramatically reduces the storage requirements for large PLMs and enables efficient deployment of PLMs on different downstream tasks. Moreover, PreQuant only fine-tunes 0.5% of the model parameters, which is more suitable for resource-limited scenarios.
• PreQuant is highly flexible, being compatible with a wide range of quantization strategies and fine-tuning techniques. Within this framework, we evaluate the pros and cons of various quantization strategies and discuss the impact of different quantization settings.
Related Work

Quantization
Quantization, which represents the weights and activations of neural networks with low-bit precision, has been widely studied in the computer vision and natural language processing (NLP) communities (Gholami et al., 2021). Recently, some researchers have attempted to compress PLMs to reduce deployment cost with quantization methods (Zadeh et al., 2020; Wu et al., 2022; Kim et al., 2021; Bondarenko et al., 2021). Quantization-aware training (QAT) (Gupta et al., 2015) is a representative approach to quantize a PLM while retaining most of its performance on downstream tasks. Given a downstream task, QAT performs the quantization during the task-specific training (i.e., fine-tuning) process. For example, Q8BERT (Zafrir et al., 2019) and Q-BERT (Shen et al., 2020) are typical QAT methods to compress BERT-based models. Unlike QAT, post-training quantization (PTQ) disentangles the fine-tuning and quantization: the quantization procedure is conducted after the task-specific fine-tuning. In comparison to QAT, PTQ holds the advantages of flexibility and good compatibility. Yao et al. (2022) combine PTQ with knowledge distillation to achieve efficient compression for large PLMs. In addition to NLP scenarios, PTQ is also utilized to compress vision transformers (Liu et al., 2021). Some very recent studies employ quantization and parameter-efficient fine-tuning jointly; Qadapter (Park et al., 2022) and AlphaTuning (Kwon et al., 2022) are representative examples.

Preliminary
A number of works have employed various quantization techniques in the field of pre-trained language models. Existing quantization methods can be categorized into two prominent branches: quantization-aware training and post-training quantization.

Basic Notations
We consider uniform quantization for both weights and activations. Specifically, for a given tensor x in full precision, we adopt the rounding-to-nearest operation to round x to the nearest unsigned integer grid value x_Z, which can be described as:

x_Z = clamp(round(x/α) + z, 0, 2^b − 1),  (1)

where b ∈ N is the bit-width, α ∈ R is the scaling factor, and z ∈ N is the zero-point. After obtaining the quantized tensor x_Z, one can approximate the full-precision version of the tensor as:

x̂ = α · (x_Z − z).  (2)

Quantization-aware Training (QAT) QAT methods (Fig. 1(b)) learn the scaling factors (quantization) along with the weights during the fine-tuning stage. Since the rounding operation in Eq. 1 is not differentiable, gradients through the non-differentiable operations are usually approximated with the straight-through estimator (STE). As the quantization process of QAT is supervised by the overall training objective, the performance is generally quite promising.
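As a concrete illustration, the rounding and de-quantization steps above can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation; the helper names and the default zero-point choice are ours:

```python
import numpy as np

def quantize(x, alpha, b=4, z=None):
    """Round x to the unsigned integer grid of Eq. 1."""
    if z is None:
        # Assumption: pick the zero-point so the tensor minimum maps near 0.
        z = int(round(-float(x.min()) / alpha))
    x_z = np.clip(np.round(x / alpha) + z, 0, 2 ** b - 1)
    return x_z.astype(np.uint8), z

def dequantize(x_z, alpha, z):
    """Approximate the full-precision tensor from its quantized form (Eq. 2)."""
    return alpha * (x_z.astype(np.float32) - z)
```

With a well-chosen α, the round-trip error per element is bounded by half a quantization step for values inside the representable range.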
Post-training Quantization (PTQ) PTQ methods (Fig. 1(a)) conduct quantization after the fine-tuning. Unlike QAT, which relies on the full training data, PTQ requires very little, sometimes even zero, calibration data to estimate scaling factors. Therefore, the overhead of PTQ is relatively small. However, its ease of use often comes with significant performance penalties. Generally, existing quantization solutions (both QAT and PTQ) for PLMs are task-specific, meaning they quantize either during or after the model fine-tuning stage. However, in PLMs, "pre-training then fine-tuning" replaces conventional "training from scratch", so the pre-trained weights already contain rich information. We wonder whether it is possible to perform task-agnostic quantization. As shown in Fig. 1(c), PreQuant first conducts task-agnostic quantization on the pre-trained model, followed by parameter-efficient fine-tuning.

Overview
In contrast to PTQ and QAT, we propose to quantize PLMs prior to fine-tuning. Specifically, our framework consists of two stages, as shown in Fig. 2. The first stage directly quantizes the pre-trained weights of PLMs without further adaptation; hence, the quantization is task-agnostic. The second stage fine-tunes the "pre-quantized" PLM for downstream tasks. We cannot simply apply the vanilla fine-tuning setting to a "pre-quantized" PLM: vanilla fine-tuning converts the low-precision weights back into high-precision representations, since weight updates are necessarily in high precision (low-precision training is practically impossible). This defeats our purpose of quantizing these values. To address this issue, we propose a parameter-efficient tuning method that freezes most of the quantized weights and only fine-tunes a small subset of model parameters. The details are presented in the following sections.

Task-agnostic Quantization
The goal of the uniform quantization in Eq. 1 is to estimate the optimal scaling factor α for each parameter matrix. This can be formulated as an optimization problem that minimizes a certain loss function, such as the mean squared error (Choukroun et al., 2019). A more convenient solution is to estimate α directly from statistics, such as using the range of the tensor as the scaling factor (Bondarenko et al., 2021).
In our investigation into the weights of PLMs, we have observed an outlier phenomenon: in each parameter matrix of PLMs, a tiny fraction of weights (i.e., outliers) holds abnormally larger values than the other weights. Empirically, most of the weights closely follow a Gaussian distribution, while the outliers fall into the tail of the distribution and can be detected with:

|W_ij − µ| > 3σ,

where µ and σ² are the mean and variance of the parameter matrix W. Outlier values affect the effectiveness of quantization, causing great quantization error (Kovaleva et al., 2021). To address this issue, we adopt an intuitive quantization method: we set the quantization scaling factor α according to a 6σ range, which is big enough to clip all the outlier weights according to Eq. 1.
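A minimal sketch of the 3σ outlier test and the resulting clipped scaling factor might look as follows. This reflects our interpretation of the 6σ rule: with a quantization range of 6σ and a b-bit grid, the per-level step becomes 6σ/(2^b − 1). The function names are hypothetical:

```python
import numpy as np

def outlier_mask(W, k=3.0):
    """Flag weights in the Gaussian tail: |w - mu| > k * sigma."""
    mu, sigma = W.mean(), W.std()
    return np.abs(W - mu) > k * sigma

def outlier_aware_scale(W, b=4):
    """Per-level step for a 6-sigma quantization range (our reading of the
    6-sigma rule): values outside [mu - 3*sigma, mu + 3*sigma] get clipped."""
    return 6.0 * W.std() / (2 ** b - 1)
```

On a roughly Gaussian weight matrix, the mask flags only a tiny fraction of entries, matching the observation that outliers are rare.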
It is worth noting that PreQuant is compatible with other scaling factor estimation methods. In addition to the aforementioned outlier-aware scaling factor, we implement three other methods for comparison.
• Min-max is a basic method that estimates the scaling factor with the minimum and maximum of the tensor (Vanhoucke et al., 2011).
• MSE estimates the scaling factor by minimizing the mean squared error between the quantized tensor and its full-precision counterpart (Choukroun et al., 2019).
• Row-wise quantization adopts a finer granularity that assigns different scaling factors to each dimension of the matrix (Shen et al., 2020; Bondarenko et al., 2021).
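The two statistic-based estimators above can be sketched in a few lines. This is a hedged illustration; the helper names are ours, and the granularity choices follow the descriptions in the bullets:

```python
import numpy as np

def minmax_scale(W, b=4):
    """Min-max: one scale from the full tensor range (sensitive to outliers)."""
    return (W.max() - W.min()) / (2 ** b - 1)

def rowwise_scales(W, b=4):
    """Row-wise: a separate scale per row (dimension) of the weight matrix."""
    return (W.max(axis=1) - W.min(axis=1)) / (2 ** b - 1)
```

Row-wise estimation yields one scale per output dimension, so a row with a wide range does not inflate the step size of the other rows.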
We conduct a thorough comparison of these scaling factor estimation methods and discuss the advantages and disadvantages of each in the experiment section. In comparison to previous quantization methods, ours is data-free and task-agnostic, as the quantization is executed directly, prior to any fine-tuning.

Outlier-aware Parameter-efficient Fine-tuning
After obtaining a "pre-quantized" PLM, the second stage is to fine-tune the model for specific downstream tasks. In this stage, we encounter a dilemma: on one side, fine-tuning requires updating model weights with high-precision representations, while on the other side, casting the low-precision weights back to high precision will nullify the effect of quantization. To address this issue, we propose an outlier-aware parameter-efficient fine-tuning (outlier-aware tuning) strategy that keeps most of the model parameters frozen in low precision. Parameter-efficient fine-tuning aims to adapt PLMs by tuning only a small number of parameters (Houlsby et al., 2019; Gong et al., 2022). Ben Zaken et al. (2022) and Gong et al. (2022) have shown that tuning a small subset of PLM parameters can be comparable with full-parameter fine-tuning in terms of performance. This approach suits our scenario as it does not modify the model structure.
However, parameter-efficient fine-tuning is more challenging in our case, since the quantization step produces a pre-quantized PLM whose weights are rounded to low precision. The induced quantization error correlates with the disturbance of the weights: if the weights do not change much after quantization, the error is minimal, and significant if they do. Intuitively, our goal is to identify which parts of the weights cause the most quantization error. By tuning only these specific weights, we can recover much of the PLM's damaged representation ability.
In our investigation, we find that the majority of parameters exhibit relatively small disturbance, hence freezing them preserves most of the PLM's ability. Some particular weights contribute most of the induced error, and these weights are concentrated in specific dimensions. Moreover, these susceptible-to-quantization weights are exactly the outlier weights mentioned in the previous section. This is because the abnormally large values of outliers are generally clipped according to Eq. 1. We identify the dimensions containing most of the outlier weights, then set them as trainable parameters while freezing the rest. Specifically, in each parameter matrix, we select r outlier dimensions as trainable parameters. Since r is extremely small, we can guarantee that more than 99% of the parameters remain in low precision. By tuning the subnetwork consisting of outlier dimensions, we expect to recover the damaged representation ability and adapt to a specific downstream task with minimal trainable parameters.
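The selection rule could be sketched as follows. This is an assumption-laden illustration: we treat matrix columns as "dimensions" and rank them by their count of 3σ outliers; the authors' exact criterion may differ in detail:

```python
import numpy as np

def trainable_outlier_dims(W, r=5, k=3.0):
    """Return indices of the r dimensions (here: columns) containing the
    most outlier weights; these stay in high precision and are fine-tuned,
    while all other dimensions remain frozen in low precision."""
    mu, sigma = W.mean(), W.std()
    counts = (np.abs(W - mu) > k * sigma).sum(axis=0)  # outliers per column
    return np.argsort(counts)[::-1][:r]
```

For a 1024-dimensional layer, choosing r = 5 leaves well over 99% of the matrix frozen, consistent with the parameter budget stated above.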

Experimental Setup
Settings We evaluate PreQuant on several popular PLMs including BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2020). For RoBERTa, we test both RoBERTa base and RoBERTa large. For T5, we apply PreQuant to the encoder of T5 3b, denoted as T5 Encoder. We use a fixed set of hyper-parameters for all the GLUE tasks. For each layer, we set the bit-width b for weights to 4. In addition, we apply 8-bit min-max uniform quantization to activations and embeddings. Experimental results for more bit-width options are listed in Appendix A.4.

Datasets We evaluate on the GLUE benchmark (Wang et al., 2018), covering natural language inference, paraphrase and semantic similarity tasks (including STS-B), sentiment analysis (SST-2), and linguistic acceptability (CoLA). The evaluation metrics are Matthews correlation for CoLA, Spearman correlation for STS-B, and accuracy for the other tasks. We supplement fine-tuning details in Appendix A.1.
Baselines Classical quantization methods, including PTQ and QAT, are adopted as baselines. For PTQ, we adopt the implementation by Bondarenko et al. (2021), which introduces group-wise granularity to reduce the quantization error. For QAT, we also implement a group-wise granularity variant.
Results of the vanilla QAT, which utilizes the straight-through estimator (STE) (Bengio et al., 2013) to propagate gradients, are listed in Appendix A.4. We also include Qadapter (Park et al., 2022) and AlphaTuning (Kwon et al., 2022), which jointly employ quantization and parameter-efficient fine-tuning, for further comparison.

Main Results
Comparison with Quantization Methods. The main comparison results are reported in Table 1.
Due to the precision reduction, all quantization methods inevitably lead to performance degradation in comparison to the full-precision fine-tuned model (FT). There is a considerable performance gap between 4-bit PTQ and 32-bit FT, although both are tuned with a modest amount of calibration data. QAT outperforms PTQ on all tasks, demonstrating the benefit of a hybrid approach of quantization and task-specific fine-tuning. PreQuant is comparable in performance to QAT, but with far fewer trainable parameters. To evaluate the scalability and robustness of PreQuant, we report results for PLMs of different scales, ranging from 110M to 1.5B parameters. As the model size increases, PreQuant performs consistently and stably. Taking T5 1.5b as an example, PreQuant achieves 99.21% of the FT performance while tuning only 0.10% of the parameters.
Comparison with Parameter-efficient PLM Quantization Methods. Comparisons with Qadapter and AlphaTuning are reported in Table 2.
For Qadapter, we adopt uniform asymmetric 8-bit channel-wise quantization for both activations and weights, as described in the original paper.

Comparison of Quantization Strategies
In this section, we replace the outlier-aware quantization with alternative quantization strategies to see how different strategies affect performance. We evaluate three different strategies (i.e., min-max, MSE, and row-wise quantization in Section 4.2) with 4-bit quantization for RoBERTa large. The differences among these strategies are listed in Table 3. As the disturbance of weights after quantization indicates the induced quantization error, we compute the L2 distance between quantized weights and full-precision weights as the measurement of the quantization error. As the bottom block of Table 3 shows, the quantization error is correlated with the performance on downstream tasks: the less the error, the better the performance. The min-max quantization strategy performs worst due to the negative influence of outlier weights. Meanwhile, the outlier-aware, MSE, and row-wise strategies achieve comparable performance on the four tasks as well as similar quantization error. The MSE quantization strategy achieves slightly better performance since it directly optimizes the L2 distance, though it is more complicated than the statistical strategies. Row-wise quantization performs slightly better than layer-wise strategies at the cost of a more expensive computational graph. Overall, the outlier-aware strategy reaches the best trade-off between performance and complexity.
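The error measurement described above can be reproduced in miniature: quantize, de-quantize, and take the L2 distance. This is a sketch; the zero-point handling is our assumption:

```python
import numpy as np

def quant_error(W, alpha, b=4):
    """L2 distance between W and its quantize/de-quantize round trip,
    used here as a proxy for the induced quantization error."""
    z = int(round(-float(W.min()) / alpha))
    W_q = np.clip(np.round(W / alpha) + z, 0, 2 ** b - 1)
    W_hat = alpha * (W_q - z)
    return float(np.linalg.norm(W - W_hat))
```

As expected, an 8-bit grid over the same range incurs far less round-trip error than a 4-bit grid.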

Analysis of Outlier-aware Fine-tuning
In this section, we discuss the effect of parameter-efficient fine-tuning on PreQuant.
Does outlier-aware tuning really work? PreQuant appoints the trainable subnetwork by detecting outlier dimensions, shorted as Outlier. It is important to show that the outlier dimensions really matter for fine-tuning performance. To this end, we introduce two variants: 1) Random: we randomly choose the same amount of trainable parameters as our method; 2) Ticket: a task-agnostic subnetwork for parameter-efficient fine-tuning proposed in Gong et al. (2022). The experimental results on four datasets are shown in Fig. 3. Random selection of trainable parameters leads to a significant drop in performance, suggesting that outlier information does help in finding suitable trainable subnetworks. Outlier and Ticket achieve comparable performance, and both are very close to the upper-bound performance of FT. This suggests that our outlier-aware fine-tuning is a promising strategy to efficiently adapt PLMs to downstream tasks while reducing quantization errors. Noting that Outlier and Ticket have similar performance, we further calculate the subnetwork overlap ratio of the two methods using the Jaccard similarity coefficient. As expected, Outlier and Ticket have a non-negligible overlap (Jaccard similarity coefficient of 0.57).
What is the optimal size of the trainable subnetwork? As stated in Section 4.3, we use the hyper-parameter r to control the size of the trainable high-precision parameters. We then focus on the effect of r on model performance. We conduct empirical experiments with values of r in {1, 3, 5, 10, 20, 512, 1024}. A smaller value of r indicates fewer trainable parameters, which inevitably leads to performance degradation; we expect more trainable parameters to yield higher performance. The results are reported in Table 4. We find that a relatively small r, e.g., 3 or 5, is good enough to adapt PreQuant to downstream tasks. Note that r = 512 sets half of the model parameters trainable, and r = 1024 means the whole model is trainable. From Table 4, we can see that even setting r to 1024 cannot fully recover the performance, which is reasonable because the induced quantization error between high-precision and low-precision representations cannot be completely eliminated. Setting r to a value larger than 10 brings limited performance improvements while requiring more high-precision computation.
Why not LoRA? LoRA (Hu et al., 2021) reparameterizes linear layers: it updates all parameters in the weight matrix by adding a low-rank matrix. In our scenario, the original weight matrix is in low precision while the update matrix is in high precision. The addition of a high-precision matrix to a low-precision matrix results in a high-precision matrix, thus nullifying the quantization effect.

Extending to Layer-wise Mixed-precision Quantization
Previous work has shown that allocating different bit-widths to different layers leads to a better accuracy-efficiency trade-off, since not all layers are equally sensitive to quantization (Tang et al., 2022). PreQuant can be conveniently extended to a layer-wise mixed-precision variant by assigning customized bit-widths to each transformer layer.
We implement a pilot mixed-precision quantization paradigm that assigns 2 bits to bottom layers and 4 bits to top layers, or vice versa. As can be seen in Table 5, all mixed-precision methods exhibit performance degradation due to the hybrid quantization setting. An overall conclusion is that top layers are less sensitive to quantization than bottom layers. Allocating 2 bits to the top third of layers results in an average loss of less than 3 points, which is very impressive. Meanwhile, assigning 2 bits to the bottom third of the layers suffers from more than 10 points of performance loss. These findings could be beneficial to the development of better mixed-precision quantization techniques.
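Such a layer-wise allocation is easy to express. The following is a toy sketch of the pilot scheme: the one-third split and the 2-bit/4-bit choices mirror the description above, while the function itself is illustrative:

```python
def mixed_precision_bits(num_layers=24, low_bits=2, high_bits=4,
                         low_on_top=True):
    """Assign low_bits to one third of the transformer layers (top or
    bottom) and high_bits to the remaining layers."""
    third = num_layers // 3
    if low_on_top:
        return [high_bits] * (num_layers - third) + [low_bits] * third
    return [low_bits] * third + [high_bits] * (num_layers - third)
```

For a 24-layer model, `low_on_top=True` yields 16 layers at 4 bits followed by 8 layers at 2 bits, the configuration the experiments found most tolerable.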

Conclusions
As the scale of pre-trained language models increases, model compression becomes a prerequisite for model deployment in resource-limited scenarios. Quantization is an effective and promising technique to compress large PLMs. Existing quantization methods, including PTQ and QAT, perform quantization either during or after the task-specific fine-tuning process. Since these approaches are highly task-specific, it is hard to transfer them to different tasks at low cost. In this paper, we propose a "quantizing the PLM first, then fine-tuning" framework, PreQuant, which includes a task-agnostic quantization stage and an outlier-aware parameter-efficient fine-tuning stage. We compress widely used PLMs with PreQuant, including BERT, RoBERTa, and T5 variants. Experimental results on the GLUE benchmark demonstrate the effectiveness of PreQuant. We also show that PreQuant is more flexible and efficient than its competitive counterparts.
An elaborate empirical study is conducted on the workflow of PreQuant; we hope the findings shed some light on quantization research for PLMs.

Limitations
Although the proposed PreQuant achieves promising results, especially in reducing storage and computational resources, we discuss some limitations of our work in this section. In our experiments, we observe that the performance of PreQuant is highly correlated with the data size. When fine-tuning with very limited data, PreQuant may fail to preserve the performance of PLMs. Moreover, our model performance also depends on the number of parameters (i.e., outliers) restored in the fine-tuning stage. This hyper-parameter controls the trade-off between model performance and parameter efficiency, and its optimal choice for different tasks requires further investigation. Additional discussion and experimental results are provided in Appendix A.2.

A Appendix
Figure 4: Visualization of quantization error on BERT base key matrix from several layers.We compare the weights before and after the quantization, and plot the positions with large difference.
A.2 Results in Low-resource Scenarios
During our investigation, we find quantization is more challenging on small datasets. We further explore the effect of data size on quantization and fine-tuning. To this end, we randomly subsample the MNLI training set to {2k, 4k, 6k, 8k} examples and fine-tune T5 Encoder on them. As seen in Table 6, a smaller data size leads to a larger performance gap between the full-precision model and the quantized one.
A.4 Scaling to Other Bit-widths As shown in Fig. 5, when the number of bits for weights is 8, the performance of all quantization methods is close. However, when the bit-width decreases to 4, performance disparities between the various approaches become apparent. PTQ fails to predict reasonable answers with 4-bit quantization, indicating that the quantization error is too strong to be minimized with a modest amount of calibration data. QAT and PreQuant still retain acceptable performance with 4-bit quantization.

Figure 1: An illustrative comparison of different quantization methods for PLMs. PTQ directly performs model quantization after the fine-tuning stage, while QAT jointly optimizes quantization and fine-tuning. In contrast, our PreQuant conducts task-agnostic quantization first, and then performs parameter-efficient fine-tuning.

Figure 2: The illustration of our two-stage quantization framework. Dark green and light green blocks represent weight values in high precision and low precision, respectively. Blue blocks represent fine-tuned weights. In the first stage, all weights are "pre-quantized" to low precision indiscriminately. In the second stage, a very small portion of weights is updated while the others are frozen during fine-tuning.

Figure 3: A comparison of different quantization strategies for PLMs. The pre-trained model is RoBERTa large and the bit-width is 4.

A.3 Visualization of Quantization Error Fig. 4 shows an example of the quantization error induced by uniform quantization. Several outlier dimensions tend to have larger error after quantization due to their large values.

Table 1 :
Results on the development set of the GLUE benchmark. We also report the number of trainable parameters (without embeddings) for each method. FT stands for full-precision full-parameter fine-tuning, which achieves the best performance as expected. All quantization methods are implemented with 4-bit precision representations.

Table 3 :
Comparison of different quantization strategies with 4-bit PreQuant on RoBERTa large.

Table 4 :
Validation results on QNLI and MRPC after applying 4-bit PreQuant to RoBERTa large and T5 Encoder. The FT line shows the results of full-precision full-parameter fine-tuning.

Table 6 :
Results in the low-resource scenario. We randomly sample subsets from MNLI as the training set and test on the standard validation set.

Table 7 :
Full results on the GLUE benchmark of the vanilla QAT and parameter-efficient quantization methods on RoBERTa large.