Quadapter: Adapter for GPT-2 Quantization

Transformer language models such as GPT-2 are difficult to quantize because outliers in their activations lead to a large quantization error. To adapt to the error, one must use quantization-aware training, which entails a fine-tuning process that requires the same dataset and training pipeline as the original model. Pretrained language models, however, often do not grant access to their datasets and training pipelines, forcing us to rely on arbitrary ones for fine-tuning. In that case, we observe that quantization-aware training overfits the model to the fine-tuning data. For quantization without overfitting, we introduce a quantization adapter (Quadapter), a small set of parameters that are learned to make activations quantization-friendly by scaling them channel-wise. It keeps the model parameters unchanged. By applying our method to the challenging task of quantizing GPT-2, we demonstrate that it effectively prevents overfitting and improves the quantization performance.


Introduction
Quantizing a transformer model is not a simple matter due to numerous channel-dependent outliers in activations (Bondarenko et al., 2021). They lead to a large quantization error (Zhao et al., 2019), and we observe that the problem is worse in decoder-only transformers like GPT-2. One solution to the difficulty is quantization-aware training (QAT), an approach that fine-tunes the model parameters in response to the numerical error arising from quantization. Post-training quantization (PTQ), a counterpart of QAT that performs quantization without modifying model parameters, is not powerful enough to cope with the outliers.
While QAT is effective, it requires the dataset and the training pipeline, and both are often inaccessible when one deals with the original pretrained model without any downstream task. One then cannot but use arbitrary fine-tuning data for QAT.
However, the fine-tuning returns worse accuracies for distributions unseen during training (out-of-distribution with regard to fine-tuning; F-OOD) despite improving for the training distribution (in-distribution with regard to fine-tuning; F-ID) (Kumar et al., 2022). This is consistent with our observation that QAT overfits the model to the fine-tuning data, as in Figure 1. The resulting quantized model therefore has its generality impaired. This violates the premise of a general-purpose language model, which must operate well across various texts of the target language.
Our hypothesis is that QAT incurs the overfitting because it changes all the parameters of the model. This difficulty is much like the research topic of continual learning, where it is important that a model not forget its past capability when transferring to a new task (Zhang et al., 2021). An adapter is a strategy to adapt to a new distribution by training only a small number of parameters, and a popular method to lessen catastrophic forgetting. We borrow this concept to propose Quadapter, a lightweight module that adapts to the quantization error on behalf of the intact original model.
Figure 2: Quadapter performs a linear scaling and its inversion before and after Q, the quantizer for the target activation (left). In the transformer block of GPT-2, Quadapters can be installed in two different locations (right).

The contribution of this work is that we successfully quantize GPT-2, overcoming the large inter-channel variance and the QAT overfitting issue with Quadapter. To the best of our knowledge, this is the first work to quantize both weights and activations of GPT-2 without the complete training pipeline.

Related Works
Adapters Extensive research has been conducted on how to steer a large pretrained model with few adapter parameters. The concept of an adapter has proven its usefulness in language models for transfer learning (Houlsby et al., 2019), multi-task learning (Stickland and Murray, 2019), and domain adaptation (Zhang et al., 2021). Several works apply adapters to the visual domain as well (Li and Hoiem, 2016; Perez et al., 2018).
Transformer Quantization In comparison to GPT-2, BERT is easier to quantize. It can be quantized with PTQ under a limited performance drop (Shen et al., 2020). QAT on BERT for a given downstream task recovers full-precision (FP) performance even at ultra-low precision (Zafrir et al., 2019; Bondarenko et al., 2021), or with integer-only operations for non-linear layers (Kim et al., 2021). On the other hand, quantization studies on autoregressive transformers are relatively limited in scope, using weight-only quantization (Chung et al., 2020) or requiring full-fledged training (Prato et al., 2020; Tao et al., 2022). Please note that these works focus on quantizing GPT-2 fine-tuned on a downstream task, whereas ours quantizes the original pretrained GPT-2.
Quantization techniques Directly relevant to our work are cross-layer equalization (CLE) (Nagel et al., 2019) and adaptive rounding (AdaRound) (Nagel et al., 2020). Similarly to CLE, Quadapter rescales the associated model weights to lessen the quantization burden. AdaRound and our proposed method are alike in training foldable helper parameters to minimize the block-wise quantization error.

Methods
Quadapter is simply a set of learnable parameters. The Quadapter block, on the other hand, represents the actual working mechanism of Quadapter, involving two consecutive layers of linear relations, their quantizers, and their associated Quadapter instance. The effectiveness of Quadapter comes from the interaction amongst the involved components and from the two-phase training procedure.

Quadapter Design
Quadapter linearly scales the input channels and reverts the scaling after quantization. This would be an exact identity were it not for the quantizers, making it possible to keep the model parameters intact (Figure 2, left).
The scaling and the inverse scaling of an activation are, in practice, folded into the weight and the bias of the preceding layer and into the weight of the following layer. For example, given a forward pass of two linear layers,

y = W2(W1 x + b1) + b2,  (1)

the Quadapter block output ŷ is as follows:

ŷ = Q_θ2(W2 A⁻¹) Q_θa(Q_θ1(A W1) x + A b1) + b2.  (2)

Here, A = diag(α) is a diagonal matrix with A_ii = α_i, where α ∈ R^d is the learnable Quadapter parameter and d is the intermediate activation dimension. Q_θ1 and Q_θ2 are the weight quantizers, and Q_θa is the activation quantizer. Each quantizer Q_θ quantizes its input values based on the quantization parameters θ = (θ_min, θ_max) (Krishnamoorthi, 2018). Quadapter α is trained during training and fused at inference time:

W1' = A W1,  b1' = A b1,  W2' = W2 A⁻¹.  (3)

As in Equation 2, the forward scaling and the inverse scaling must correspond across three nested quantizers, which are strongly nonlinear operations. Therefore α should be learned rather than set analytically as in (Nagel et al., 2019); a single analytical solution is not sufficient to balance the quantization burden between the two layers.
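A minimal NumPy sketch of the Quadapter block may help fix ideas. All helper names are illustrative, and a simple per-tensor min/max fake quantizer stands in for the paper's Q_θ:

```python
import numpy as np

def fake_quant(x, bits=8):
    """Uniform asymmetric fake quantizer; (theta_min, theta_max) taken from x itself."""
    lo, hi = float(x.min()), float(x.max())
    s = max((hi - lo) / (2 ** bits - 1), 1e-12)
    return np.clip(np.round((x - lo) / s), 0, 2 ** bits - 1) * s + lo

def quadapter_block(x, W1, b1, W2, b2, alpha, bits=8):
    """y_hat = Q(W2 A^-1) Q(Q(A W1) x + A b1) + b2, with A = diag(alpha)."""
    A = np.diag(alpha)
    h = fake_quant(A @ W1, bits) @ x + alpha * b1                 # preceding layer, scaled by A
    h = fake_quant(h, bits)                                       # activation quantizer
    return fake_quant(W2 @ np.diag(1.0 / alpha), bits) @ h + b2   # inverse scaling folded into W2
```

As the bit depth grows, the quantizers approach the identity and the block reduces to the original forward pass, which is the identity property the folding relies on.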

Quadapter Training
The learning of Quadapter comprises two phases: block-wise calibration and end-to-end fine-tuning.

Phase 1: Block-wise Calibration Each Quadapter instance is initialized to ⃗1 and trained on the calibration data, independently per Quadapter block. The local objective for each block is the L2 loss

L(α) = ||ŷ − y||²_2,

which (Nagel et al., 2020) show to be effectively complementary to the task loss. ŷ is computed in the dynamic quantization mode (Zafrir et al., 2019), where the statistics are obtained per batch. The Quadapter resulting from the calibration phase alone is a PTQ method independent of the fine-tuning process; we therefore denote such a Quadapter by Quadapter BC.

Phase 2: End-to-end Fine-tuning The subsequent fine-tuning starts with more accommodating quantization parameters (i.e. the min/max statistics), since they have moved from extreme outliers to moderate values during the first phase. The fine-tuning therefore converges much more quickly.
In the second phase, the statistics for quantization are computed in the fashion of static quantization (Zafrir et al., 2019), based on the same calibration data as in the first phase. Quadapter is then trained to minimize the end-to-end task loss.
During the course, the quantization parameters are jointly learned as in (Bhalgat et al., 2020).
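A toy, self-contained sketch of the phase-1 calibration, under stated simplifications: a per-tensor min/max fake quantizer replaces the paper's quantizers, a numerical gradient stands in for straight-through backpropagation, and all names are illustrative:

```python
import numpy as np

def fake_quant(x, bits=8):
    """Uniform asymmetric fake quantizer; (theta_min, theta_max) taken from x itself."""
    lo, hi = float(x.min()), float(x.max())
    s = max((hi - lo) / (2 ** bits - 1), 1e-12)
    return np.clip(np.round((x - lo) / s), 0, 2 ** bits - 1) * s + lo

def block_out(alpha, x, W1, b1, W2, bits=8):
    """Quantized Quadapter block output for A = diag(alpha)."""
    h = fake_quant(np.diag(alpha) @ W1, bits) @ x + alpha * b1
    return fake_quant(W2 @ np.diag(1.0 / alpha), bits) @ fake_quant(h, bits)

def calibrate(x, W1, b1, W2, steps=100, lr=1e-2, eps=1e-3, bits=8):
    """Phase 1 sketch: minimize ||y_hat - y||^2 over alpha, keeping weights frozen."""
    y = W2 @ (W1 @ x + b1)                 # full-precision block output as the target
    loss = lambda a: np.sum((block_out(a, x, W1, b1, W2, bits) - y) ** 2)
    alpha = np.ones(W1.shape[0])           # initialized to the ones vector
    best, best_loss = alpha.copy(), loss(alpha)
    for _ in range(steps):
        # finite differences stand in for straight-through gradients
        g = np.array([(loss(alpha + eps * e) - loss(alpha - eps * e)) / (2 * eps)
                      for e in np.eye(alpha.size)])
        alpha = alpha - lr * g
        cur = loss(alpha)
        if cur < best_loss:
            best, best_loss = alpha.copy(), cur
    return best
```

Tracking the best iterate guarantees the returned α is never worse than the ⃗1 initialization, mirroring the role of calibration as a safe PTQ starting point for phase 2.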

Experiments
Models We quantize GPT-2 (Radford et al., 2019) and DistilGPT-2 (Sanh et al., 2019) based on their huggingface pretrained models1. Our quantization configuration follows (Siddegowda et al., 2022): uniform asymmetric 8-bit quantization for both activations and weights. LSQ+ quantization-parameter training is applied for all the QAT experiments. We use AI Model Efficiency Toolkit2 to obtain the AdaRound performance. The CLE metrics are computed with an untrained Quadapter, initialized analytically as in (Nagel et al., 2019).

Datasets We employ WikiText-2 (Merity et al., 2016), the English Penn Treebank (PTB) corpus (Marcus et al., 1993), the LAMBADA dataset (Paperno et al., 2016), and the named-entity subset (CBT_NE) as well as the common-noun subset (CBT_CN) of the Children's Book Test (Hill et al., 2016). We follow the datasets' default training/validation/test splits.

Experiment design To test overfitting resiliency, GPT-2 and DistilGPT-2 are quantized with various PTQ and QAT methods on one of the five datasets. The resulting quantized model is evaluated on its F-ID and on the other four datasets (F-OOD). In addition, we expose the models to varying amounts of fine-tuning data during quantization to compare the changing behaviors of QAT and Quadapter.

Results Quadapter outperforms the baseline methods on the F-OOD in both GPT-2 and DistilGPT-2. This observation evinces the general capability of Quadapter to reduce overfitting across different models. The comparison between Quadapter (ours) and Quadapter BC+QAT is the ablation of the end-to-end fine-tuning, and the result proves its importance.
Noteworthy is that Quadapter is a powerful stand-alone PTQ technique. Even without QAT fine-tuning, the F-OOD metrics are better than those of the QAT baselines. In addition, the effectiveness of the calibration phase is shown by the comparison between CLE and Quadapter BC.
Another advantage of Quadapter is that it is a viable quantization option in data-scarce situations. As shown in Figure 3, Quadapter outperforms QAT throughout different amounts of fine-tuning data, and the gap is most evident when only a small amount of data is available.
Aside from the convincing metrics reported above, we further explore whether Quadapter does the intended job of transforming an activation into a more uniform distribution. Figure 4 describes the per-channel statistics before and after the Quadapter training. Values in most activation dimensions except for a few have small magnitudes around 0, and such dimensions lose precision when quantized because of the large magnitude of the total min/max before applying Quadapter. The illustration verifies that the effect of Quadapter indeed aligns with our expectation, reducing the ranges of outlier-ridden channels while enlarging the ranges of the others.

Limitations
One limitation of Quadapter is that it requires two consecutive layers of linear relations.In other words, it can be a mediator only for convolution layers, linear layers, or normalization layers (when followed by a linear or convolution layer), but not if residual connections or nonlinear activation functions intervene.

Conclusions
We identify two challenges in quantizing autoregressive transformer language models: the overfitting issue of QAT and the inter-channel variance in activations. Through experiments, we demonstrate that Quadapter not only mitigates the two problems but also serves as an effective PTQ technique.

Appendix A Additional Experiments
F-ID Expansion In Table 1, we limit the F-ID to one amongst the five available datasets. Here, we perform an additional experiment by expanding the F-ID to four of them, limiting the F-OOD to the one remaining dataset. The results are in Table 2. Comparing the metrics on WikiText-2 when the QAT model is fine-tuned on PTB (Table 1) and when fine-tuned on all but WikiText-2 (Table 2), we observe an improvement in Quadapter's F-OOD performance. On the other hand, QAT still suffers from overfitting despite the expanded fine-tuning data.
Ablation of LSQ+ In (Bhalgat et al., 2020), LSQ+ is a composite method that includes initialization of the weight quantizers, model parameter training, and quantization parameter (θ) training. In our work, however, we isolate the quantization parameter training and denote it by LSQ+. In Table 1, QAT is accompanied by LSQ+, thus training both the model parameters and the quantization parameters.
We ablate LSQ+ in Table 3. The results show that LSQ+ tends to improve the quantization performance in general, particularly in conjunction with our proposed method.
Quadapter BC effectiveness on QAT As discussed in the main text, QAT makes the model overfit to the F-ID and perform poorly on the F-OOD. However, when employed with Quadapter BC (i.e. Quadapter BC+QAT), the QAT training process is stabilized, and so the quantized model reaches near the upper bound of the fine-tuned FP model (Figure 5). This shows that Quadapter fosters QAT.

C Implementation Details
Quantization Implementation The quantizer function Q_θ is defined as follows:

Q_θ(x) = s · (clip(⌊x/s⌉ + o, 0, 2^b − 1) − o),  with  s = (θ_max − θ_min)/(2^b − 1)  and  o = ⌊−θ_min/s⌉,

where s and o are the scale and offset, and b is the target bit depth (8-bit in our case). ⌊·⌉ is a rounding function, and clip(x, n, p) clamps all the values of the input x between n and p.
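One plausible NumPy rendering of Q_θ as defined above; the exact rounding and offset conventions are assumptions, since the extracted text does not pin them down:

```python
import numpy as np

def quantize(x, theta_min, theta_max, bits=8):
    """Uniform asymmetric fake quantizer Q_theta with scale s and integer offset o."""
    s = (theta_max - theta_min) / (2 ** bits - 1)   # scale
    o = round(-theta_min / s)                        # offset (integer zero point)
    q = np.clip(np.round(x / s) + o, 0, 2 ** bits - 1)
    return s * (q - o)
```

Values inside [θ_min, θ_max] round-trip within half a step s, while values outside are clamped to the representable range, which is exactly the behavior the outlier discussion in the main text relies on.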
In the first calibration phase, θ_min and θ_max are obtained from the batch statistics at each inference step (i.e. dynamic quantization). In the second fine-tuning phase, the parameters are initially set based on the calibration data (i.e. static quantization) and trained afterwards. When θ is trained, the gradients are passed through the rounding function via straight-through estimation.
Quadapter Implementation The actual application of Quadapter to GPT-2 differs slightly from the general form of the Quadapter block in Equation 2. In the transformer block of GPT-2, W1 denotes only the affine transformation of the layer norm operation, not the preceding normalization. Defining the layer norm operation as

LN(x_l) = γ ⊙ (x_l − μ(x_l))/σ(x_l) + β,

we have W1 = γ and b1 = β, and the input of the Quadapter block is (x_l − μ(x_l))/σ(x_l). The fused weight is computed as W1' = α ⊙ γ with element-wise multiplication ⊙.
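The fusion can be checked numerically. A small sketch, assuming the bias folds the same way (b1' = α ⊙ β), which the text leaves implicit:

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    """Layer norm with element-wise affine parameters gamma and beta."""
    mu = x.mean()
    sigma = np.sqrt(x.var() + eps)
    return gamma * (x - mu) / sigma + beta

x = np.array([0.5, -1.2, 3.0, 0.1])
gamma = np.array([1.0, 2.0, 3.0, 4.0])
beta = np.array([0.1, 0.2, 0.3, 0.4])
alpha = np.array([2.0, 0.5, 1.5, 3.0])

# Folding alpha into the LN affine parameters: W1' = alpha ⊙ gamma, b1' = alpha ⊙ beta.
fused = layernorm(x, alpha * gamma, alpha * beta)
assert np.allclose(fused, alpha * layernorm(x, gamma, beta))
```

The identity holds exactly because the normalization part of LN does not depend on γ or β; only the trailing affine transformation is rescaled.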

D Constraint of Two Linear Layers
While stating in Section 5 that Quadapter is applicable only to two consecutive layers of linear nature, the main text omits the discussion of piece-wise linear activation functions for brevity. For an operation f that meets the scaling-invariant condition

f(A x) = A f(x)  for any A = diag(α) with α > 0,

the identity relation between the scaling and the inverse scaling of Quadapter still holds. Therefore, it is possible to install Quadapter through piece-wise linear functions (e.g. ReLU, leaky ReLU, PReLU) as well.
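A quick numerical check of the condition, illustrative only: piece-wise linear activations commute with positive channel-wise scaling, while a curved activation such as GeLU does not.

```python
import numpy as np
from math import erf

relu = lambda x: np.maximum(x, 0.0)
leaky_relu = lambda x, k=0.1: np.where(x > 0, x, k * x)
gelu = lambda x: 0.5 * x * (1.0 + np.vectorize(erf)(x / np.sqrt(2.0)))

x = np.array([-2.0, -0.3, 0.0, 1.5])
a = np.array([0.5, 2.0, 3.0, 0.25])     # positive channel-wise scales, A = diag(a)

# Piece-wise linear activations satisfy f(A x) = A f(x) for positive scales ...
assert np.allclose(relu(a * x), a * relu(x))
assert np.allclose(leaky_relu(a * x), a * leaky_relu(x))
# ... but GeLU does not, so Quadapter cannot pass through it unchanged.
assert not np.allclose(gelu(a * x), a * gelu(x))
```

This is why the future-work paragraph below proposes extra learnable variables for the inverse scaling: the exact identity is unavailable through GeLU, so it would have to be approximated.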
Our future goal is to further expand Quadapter's applicability. We thus plan to investigate whether Quadapter could be applied even with an intervening nonlinear activation function (e.g. GeLU, tanh) by enhancing its expressivity (i.e. setting up additional learnable scalar variables for the inverse scaling). In addition, by scaling all the tensors involved in a residual connection, we expect to apply Quadapter even when a residual connection sits between the two target layers.

Figure 1 :
Figure 1: Average perplexity (PPL) of the full-precision (FP) model and the models quantized with PTQ and QAT on 5 datasets (left). We use the PTB dataset as the fine-tuning data (F-ID) for QAT. The FP model and the QAT model are evaluated on the F-ID and the other 4 datasets (F-OOD) (right).

Figure 3 :
Figure 3: GPT-2 quantization performance when fine-tuned on F-ID of varying sizes. Both axes are logarithmic.

Figure 4 :
Figure 4: Visualization of the per-channel (x-axis) min/max (y-axis) values of the final layer norm output activation in GPT-2. The solid/dotted lines represent per-channel/total min and max.

Figure 5 :
Figure 5: Comparison of the fine-tuned FP model (FP finetuned) with other methods. PTB is the F-ID, and the other 4 datasets are the F-OOD.

Figure 6 :
Figure 6: Block-wise calibration learning curves of the two selected Quadapter blocks in GPT-2.

Table 1 :
Performance evaluation of the quantized GPT-2 and DistilGPT-2 on various datasets. The metric is PPL (lower is better). In the case of Quadapter BC+QAT, QAT initiates after the block-wise calibration of Quadapter. For Quadapter (ours), both training phases are completed. Underline indicates the results on F-ID while the model parameters stay fixed. Algorithm 1 details the full flow of the Quadapter training.

Table 2 :
PPL measurements of GPT-2 for the expanded F-ID. WikiText-2 is the F-OOD, and the other 4 datasets are the F-ID. The average PPL is reported in the columns F-ID and F-OOD.

Table 3 :
PPL measurements of GPT-2 trained without LSQ+. The differences from the counterparts with LSQ+ in Table 1 are noted in parentheses (a positive value indicates LSQ+'s performance gain).