Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models

Quantization is a promising approach for reducing memory overhead and accelerating inference, especially in large pre-trained language model (PLM) scenarios. Meanwhile, the lack of access to original training data, due to security and privacy concerns, has created demand for zero-shot quantization. Most cutting-edge zero-shot quantization methods 1) apply primarily to computer vision tasks, and 2) neglect the overfitting problem in the generative adversarial learning process, leading to sub-optimal performance. Motivated by this, we propose a novel zero-shot sharpness-aware quantization (ZSAQ) framework for the zero-shot quantization of various PLMs. The key algorithm for solving ZSAQ is SAM-SGA optimization, which improves quantization accuracy and model generalization by optimizing a minimax problem. We theoretically prove the convergence rate for the minimax optimization problem, and this result can be applied to other nonconvex-PL minimax optimization frameworks. Extensive experiments on 11 tasks demonstrate that our method brings consistent and significant performance gains on both discriminative and generative PLMs, i.e., up to a +6.98 average score. Furthermore, we empirically validate that our method can effectively improve model generalization.

To achieve this goal, various model compression methods have been developed, including knowledge distillation (Liu et al., 2019), pruning (Liu et al., 2018), and others. Among these methods, quantization has attracted great attention (especially in large language model scenarios) (Yao et al., 2022; Frantar et al., 2022), owing to its impressive ability to reduce memory footprint and accelerate inference while maintaining the network structure (Bai et al., 2021a). However, quantization-aware training (QAT) usually requires retraining on the original training data to mitigate performance degradation, while in many application scenarios, access to the original training data is not available due to privacy protection and data-security considerations.
In response to this problem, post-training quantization (PTQ) (Bai et al., 2021a) has been proposed to address training-data access, as it generally requires no retraining. However, most of these methods suffer an accuracy drop in low-precision quantization, since they are training-free and rely primarily on the analysis of weight distributions. More recently, zero-shot quantization (ZSQ) (Nagel et al., 2019; Zhuang et al., 2022; Zhang et al., 2021) has shown promising results in the computer vision community. Specifically, ZSQ performs the quantization process with synthetic data produced by a generator. To alleviate the side effect of inaccurate synthetic data, Choi et al. (2020) further improve ZSQ by adaptively tuning the generator under the supervision of adversarial learning. However, it is non-trivial to adopt such adversarial-based ZSQ in the NLP field, as ❶ backpropagation through discrete words is not feasible, i.e., it is not possible to pass gradients through the generated text back to the generator; and ❷ minimizing the difference between the teacher and (quantized) student models is prone to overfitting, leading to poor model generalization.
To address the above issues, we propose a novel Zero-shot Sharpness-Aware Quantization (ZSAQ) framework to improve the performance and generalization of the quantized model. First, for problem ❶, we design a simple feature adaptation module to convert the output token representations of the generator into the target quantized model's representation space, thus avoiding gradient propagation through discrete words. Second, for problem ❷, inspired by sharpness-aware minimization (SAM) (Foret et al., 2020), we propose an alternating SAM-SGA algorithm to boost ZSAQ and improve model generalization. Specifically, on the one hand, SAM-SGA uses SAM to robustly minimize the divergence between the output distributions of the teacher and quantized student models. On the other hand, it uses stochastic gradient ascent (SGA) to encourage the generator to maximize this divergence at each iteration. By optimizing such a minimax problem, we not only train a well-performing quantized model with generative adversarial learning but also improve its generalization.
Theoretically, we provide a rigorous analysis of the SAM-SGA algorithm for solving the minimax optimization problem (ZSAQ). We show that SAM-SGA achieves an O(1/√T) convergence rate. Furthermore, our theoretical results are not limited to this single case but apply to a wide range of minimax optimization problems under both nonconvex-strongly-concave and nonconvex-PL conditions. Empirically, we adopt our ZSAQ framework to quantize both discriminative and generative PLMs, and evaluate on 8 language understanding tasks and 3 language modeling tasks. Extensive experiments demonstrate the effectiveness and versatility of our approach. More encouragingly, our ZSAQ brings up to +6.98 average performance gains in low-precision quantization compared to the baselines. Extensive analyses show that our framework has the potential to scale to larger language models and confirm that SAM-SGA indeed brings better model generalization.
In summary, our contributions are three-fold: (1) We propose a novel method, zero-shot sharpness-aware quantization (ZSAQ), which realizes zero-shot quantization without much accuracy loss. (2) We provide a theoretical convergence guarantee for the SAM-SGA algorithm, which solves the minimax optimization problem (ZSAQ). (3) Extensive experiments show that SAM-SGA brings consistent and significant performance gains, up to a +6.98 average score, on both discriminative and generative PLMs.

Related Works
Compression methods for language models. Compression is effective in reducing computation cost, memory overhead, and energy consumption, as well as shortening inference time for large neural network models, which is in great need for pre-trained language models deployed on mobile or resource-constrained devices. Quantization works by representing model parameters with lower-bit values. QAT works by minimizing the rounding error on the training dataset. FullyQT (Prato et al., 2020) shows that a fully quantized Transformer can avoid any loss in translation quality. I-BERT (Kim et al., 2021) eliminates floating-point calculation in BERT inference and achieves accuracy similar to the full-precision baseline. BinaryBERT (Bai et al., 2021b) achieves good performance via ternary weight splitting, which avoids a complex and irregular loss landscape. There are other techniques as well. Low-rank factorization (Tahaei et al., 2021; Edalati et al., 2021; Chen et al., 2020; Reid et al., 2021) factorizes the (usually low-rank) weight matrices into several smaller matrices by means of singular value decomposition. Parameter sharing (Rothe et al., 2020; Takase and Kiyono, 2021; Reid et al., 2021) reduces memory overhead by reusing the same parameters in multiple computations. Pruning (Mishra et al., 2021; Guo et al., 2020; Chen et al., 2020; Xia et al., 2022; He et al., 2022; Liu et al., 2023a) deletes parameters that are "useless" to the network while preserving model performance.
Zero-shot quantization. In contrast to QAT, PTQ relies less on the training data and does not require end-to-end training (Nagel et al., 2019; Nahshan et al., 2021; Nagel et al., 2020; Li et al., 2020; Zhao et al., 2019; Hubara et al., 2020), and can thus be zero-shot. Bai et al. (2021a) propose module-wise quantization error minimization and perform close to QAT. SmoothQuant (Xiao et al., 2022) migrates the activation quantization difficulty to the weights to smooth activation outliers, enabling large language models to be implemented efficiently. Tao et al. (2022) propose an adaptive quantization for different modules and perform comparably with full-precision generative pre-trained language models. GPTQ (Frantar et al., 2022), a one-shot weight quantization method using approximate second-order information, can efficiently quantize generative pre-trained transformers with large numbers of parameters to 3 or 4 bits with negligible accuracy degradation. Nahshan et al. (2021) find that aggressive quantization can lead to steep curvature of the loss landscape, and propose a method combining layer-by-layer quantization with multivariate quadratic optimization that improves accuracy over current PTQ. NuQmm (Park et al., 2022) uses a non-uniform quantization method that allows a trade-off between accuracy and compression ratio. Another way to achieve zero-shot quantization is via generative adversarial techniques. Such methods (Cai et al., 2020; Liu et al., 2021b; Choi et al., 2022) have shown promising performance on small-scale computer vision tasks, but have not been applicable to the NLP realm. To the best of our knowledge, we are the first to adopt the adversarial learning approach for quantizing both discriminative and generative PLMs in a zero-shot manner and to provide a rigorous convergence analysis.

Methodology
In this section, we present our zero-shot sharpness-aware quantization method.
Network quantization. For a real-valued vector x representing the parameters of the pre-trained full-precision model, we quantize it into x̂ by signed symmetric quantization:

x̂ = s · clip(⌊x/s⌉, −2^(b−1), 2^(b−1) − 1),    (1)

where s ∈ R₊ denotes the scale factor and b ∈ Z represents the bit-width of the quantized version. The parameters are thus projected onto the integer grid {−2^(b−1), ..., 2^(b−1) − 1}.
Adversarial objective. Let P denote the pre-trained (teacher) model, Q the quantized model, and G a generator producing synthetic samples G(z). The objective for Q is to minimize the discrepancy between P(G(z)) and Q(G(z)), denoted D(P, Q). The generator G, in contrast, tries to maximize this divergence, so that the quantized model Q is trained to withstand even the strictest generator. We balance the trade-off between the generator G and the quantized model Q so that we obtain a well-performing quantized model without access to the original training data. The structure of our method is illustrated in Figure 1, and the objective function of our method is:

min_Q max_G D(P(G(z)), Q(G(z))),    (3)

where D represents the MSE divergence.
ZSAQ: Zero-shot sharpness-aware quantization.
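To make the quantizer concrete, here is a minimal NumPy sketch of the signed symmetric quantization step described above; the function name and the example values are ours, not from the paper.

```python
import numpy as np

def quantize(x, s, b):
    """Signed symmetric quantization: round x/s to the nearest integer,
    clip to the b-bit signed grid, then rescale by s (dequantize)."""
    qmin, qmax = -2 ** (b - 1), 2 ** (b - 1) - 1
    q = np.clip(np.round(x / s), qmin, qmax)   # integer grid point
    return s * q                               # quantized value x_hat

x = np.array([0.034, 10.0, -10.0])
x_hat = quantize(x, s=0.01, b=8)   # values beyond the grid saturate at the boundary
```

Out-of-range values saturate at the grid boundary, which is why the scale factor s must be chosen to match the weight distribution.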
As shown in Liu et al. (2021a), the low-precision model has a sharper loss landscape than the full-precision model. To avoid overfitting caused by the training process, we should consider the generalization ability of our quantized models. Inspired by Foret et al. (2020), we pursue a flatter loss landscape while training the quantized model, which can guarantee better generalization performance. We thus adjust the objective function to:

min_Q max_G [D(P, Q) + S(P, Q)],

where we denote the sharpness function as:

S(P, Q) ≜ max_{∥ϵ∥≤β} [D(P, Q_{ω̃+ϵ}) − D(P, Q_{ω̃})].    (4)

The sharpness function searches for the sharpest point in a neighborhood centered at ω̃ with radius β, which represents the worst case. Our optimization process then minimizes not only the divergence but also the sharpness of the divergence, i.e., it yields a quantized model with both lower divergence and a flatter landscape.
Solving the inner maximization problem (to first order) yields the optimal perturbation:

ϵ*(ω̃) = β · ∇_{ω̃} D(P, Q_{ω̃}) / ∥∇_{ω̃} D(P, Q_{ω̃})∥.    (5)

Remark. Here we use ω̃ + ϵ as the parameter in the sharpness function (4) instead of ω + ϵ, because ϵ may be so small that adding the perturbation to the full-precision parameter would make no difference after quantization. This issue is also discussed in Liu et al. (2021a).
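The first-order solution of this inner maximization is the standard SAM perturbation: the gradient scaled to the radius β. A minimal sketch (the function name and the zero-gradient guard are ours):

```python
import numpy as np

def sam_perturbation(grad, beta, guard=1e-12):
    """Worst-case perturbation of radius beta along the gradient direction,
    i.e., the first-order solution of the inner maximization.
    `guard` avoids division by zero when the gradient vanishes."""
    return beta * grad / (np.linalg.norm(grad) + guard)

g = np.array([3.0, 4.0])
eps_star = sam_perturbation(g, beta=0.1)   # norm ~= 0.1, same direction as g
```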
Algorithm. For the objective function, we train the generator to maximize the discrepancy between the pre-trained model and the quantized model:

max_θ D(P, Q).    (6)

Then we train the quantized model to minimize both the discrepancy and its sharpness, i.e., the sum of D(P, Q) and S(P, Q), on the synthetic data produced by the generator G. This sum comes out to be:

min_ω D(P, Q_{ω̃+ϵ*(ω̃)}),    (7)

where the optimal ϵ*(ω̃) is obtained from Eq. (5).
We obtain our final algorithm by applying stochastic gradient descent on the minimization objective (7), which turns out to be SAM, and stochastic gradient ascent on the maximization objective (6). The generator G is parameterized by θ; the quantized model Q is initialized by quantizing the pre-trained model P, with the quantized parameters denoted ω̃ and the parameters before quantization denoted ω. We alternately update the parameters ω and θ. The pseudo-code is shown in Algorithm 1.
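The alternating updates can be illustrated on a toy objective. The sketch below mirrors the structure of the SAM-SGA loop (SAM descent on the quantized-model parameter, SGA ascent on the generator parameter); the quadratic f(w, t) = (w − 1)² + 2wt − t² and all step sizes are illustrative stand-ins, not the paper's actual divergence or settings.

```python
import numpy as np

# Toy divergence f(w, t) = (w - 1)^2 + 2*w*t - t^2: convex in the
# "quantized model" parameter w, strongly concave in the "generator"
# parameter t, with saddle point (0.5, 0.5).
def grad_w(w, t):
    return 2 * (w - 1.0) + 2 * t

def grad_t(w, t):
    return 2 * w - 2 * t

def sam_sga_step(w, t, eta_w=0.1, eta_t=0.1, beta=0.05):
    # inner maximization: perturb w toward the worst case (radius beta)
    g = grad_w(w, t)
    eps = beta * g / (np.linalg.norm(g) + 1e-12)
    # SAM descent on w: use the gradient taken at the perturbed point w + eps
    w = w - eta_w * grad_w(w + eps, t)
    # SGA ascent on t: the generator maximizes the divergence
    t = t + eta_t * grad_t(w, t)
    return w, t

w, t = np.array([1.0]), np.array([-1.0])
for _ in range(200):
    w, t = sam_sga_step(w, t)
# (w, t) hovers near the saddle point (0.5, 0.5)
```

In the actual framework, grad_w and grad_t would be obtained by backpropagating the divergence D between teacher and quantized student on synthetic batches.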

Optimization
In this part, we analyse the convergence of our algorithm for solving this minimax optimization problem, which provides a theoretical guarantee for our method.
Next, we restate the update rule of Algorithm 1 in this notation, where g_ω and g_θ are the approximated (stochastic) gradients. The first argument of these gradients can be either a floating-point or a quantized (integer-grid) parameter, so expressions such as g_ω(ω̃, θ) are well-defined. Before presenting the theoretical results, we first introduce some necessary assumptions and definitions used in the analysis.
Assumption 4.1. f(ω, θ) is differentiable and l-Lipschitz smooth.
Assumption 4.3. The parameter θ is restricted to a convex and bounded set with diameter D.
Assumption 4.4. The approximated gradient is unbiased and has bounded variance.
Remark. These assumptions are common in the context of minimax optimization problems and can be realized in empirical practice.
Lemma 4.1 (Zhu et al., 2020). The quantization error is bounded in terms of the grain ∆, where d denotes the dimension of the parameter ω: ∥ω̃ − ω∥² ≤ d∆²/4.
Remark. The quantization error follows naturally from Eq. (1), and the grain ∆ is decided by the default quantization setting, which is not discussed further in this paper.
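The quantization-error bound can be checked numerically: with nearest rounding on a grid of step (grain) ∆ and ignoring saturation, each coordinate's error is at most ∆/2, so the squared error over d coordinates is at most d∆²/4. A quick sanity check (variable names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta = 1000, 0.02
w = rng.normal(size=d)
w_q = delta * np.round(w / delta)     # nearest-rounding quantization, grain delta
err = float(np.sum((w_q - w) ** 2))   # squared quantization error
bound = d * delta ** 2 / 4            # per-coordinate error <= delta/2
```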
Remark. By defining the function Φ(ω) ≜ max_θ f(ω, θ), we can reduce our minimax optimization over two opposing variables to a minimization problem over a single variable. When the gradient of Φ(ω) diminishes to zero, we consider our algorithm to have converged.
Theorem 4.1. Under Assumptions 4.1-4.4 and suitable restrictions on the step sizes (stated in Appendix A), we have the convergence bound for our problem.
Remark. Due to space limitations, the proofs and the strongly-concave case are placed in Appendix A. Observing the above result, we evaluate the averaged gradients at the quantized parameter ω̃, and we conclude that the averaged output of the quantized models converges to an ϵ-stationary point within O(1/ϵ²) iterations, up to the quantization error d∆².

Setup
Tasks and Models. In this section, we evaluate our SAM-SGA method on both discriminative (i.e., BERT-style) and generative (i.e., GPT-style) language models. For discriminative models, we follow prior works (Zhong et al., 2023c,d) and test both BERT_base and BERT_large (Devlin et al., 2018) on the GLUE benchmark (Wang et al., 2018); for generative models, we test the OPT family (Zhang et al., 2022) (mainly OPT-350m) on the GLUE benchmark and several language modeling tasks, i.e., WikiText2 (Merity et al.), Penn Treebank (PTB) (Mikolov and Zweig, 2012), and WikiText103 (Merity et al.). For evaluation, we report Accuracy ("Acc.") for most tasks, except the Pearson correlation ("Pear.") for STS-B, the Matthews correlation ("Mcc.") for CoLA, and the perplexity (PPL) score for language modeling tasks. We report results averaged over 5 random seeds to reduce stochasticity. The details of all tasks are shown in Appendix A.1.
Implementation Details. Following previous studies (Tao et al., 2022; Bai et al., 2021b), we first fine-tune a full-precision model for each task using the pre-trained checkpoint from Hugging Face, and use the fine-tuned model as the full-precision teacher. Then, we apply cutting-edge PTQ methods, i.e., ZeroQuant (Yao et al., 2022) for BERT models and GPTQ (Frantar et al., 2022) for OPT models, to quantize the fine-tuned model and use the quantized network to initialize the student model. For the generator, we directly use the OPT-350m (Zhang et al., 2022) model without further tuning; the detailed hyper-parameters can be found in Appendix A.2.
Compared Methods. For reference, we compare our SAM-SGA method with the following cutting-edge counterparts: • Full-precision: we report the results of the full-precision fine-tuned model, i.e., the teacher model in our framework.
• Baseline: we adopt the powerful PTQ methods, i.e., ZeroQuant (Yao et al., 2022) for BERT models and GPTQ (Frantar et al., 2022) for OPT models, to quantize the fine-tuned model, and report the results as the baseline.
• QAT-GT: after obtaining the teacher and student models, we use the original training data to guide the quantization-aware training process of the student model.
• QAT-Rand: we follow Yao et al. (2022) and use the random data (using the random integer number to generate token ids) to perform the QAT process.
Notably, for each method, we carefully grid-search the learning rate over {5e-6, 1e-5, 2e-5, 5e-5}, and report the result of the best single run among those learning rates.

Main Results
We report the results of BERT_base, BERT_large, and OPT-350m in Tables 1, 2, and 3; our method brings gains of up to a +6.98 average score. Specifically, QAT-GT also achieves remarkable performance gains, owing to the supervised information from the original training data. When using random data ("QAT-Rand"), the performance gains from the vanilla QAT process become much smaller. Encouragingly, ZSAQ performs even better in the lower-precision settings, indicating the potential of our method in extreme compression scenarios. These results demonstrate the effectiveness and significance of our data-free method.
SAM-SGA brings consistent performance gains across model sizes. We further apply SAM-SGA to the larger BERT model. Due to space limitations, we report results on only part of the GLUE benchmark in Table 2, and provide the full results in Table 7 of Appendix A.3. We observe a phenomenon similar to the BERT-base experiments: our SAM-SGA brings consistent performance gains on both models.
SAM-SGA works well on both BERT-style and GPT-style models. In addition to the discriminative PLMs, we also verify the effectiveness of SAM-SGA on generative models. The results of language modeling tasks are in Table 3; we can see that, with the help of SAM-SGA, OPT-350m achieves much better quantization performance than the baselines. Moreover, we also evaluate OPT-350m on the GLUE benchmark.
The contrastive results are listed in Table 8 of Appendix A.3, illustrating that our method outperforms the other methods, which further validates its effectiveness for generative-style models.

Ablation Study
In this part, we 1) first evaluate the impact of the important components of our SAM-SGA, and 2) then investigate the effect of different generator models.
Impact of important components of SAM-SGA.
There are several important components in our SAM-SGA framework, i.e., "SGA": tuning the generator model with stochastic gradient ascent, and "SAM": training the quantized model with the SAM optimizer. Here, we conduct an ablation study to verify whether these components are necessary. Specifically, taking the BERT-base model as an example, we compare our full SAM-SGA method with the following variants: i) "-w/o SGA": we remove the SGA process, i.e., keep the generator model fixed; ii) "-w/o SAM": we replace the SAM optimizer with the vanilla AdamW optimizer; iii) "-w/o SGA&SAM": we use a fixed generator model and the base optimizer.
As shown in Table 4, removing any component causes performance degradation, indicating the necessity of each component. More specifically, without SGA, our SAM-SGA performs much worse. One possible reason is that, without the constraint of adversarial training, the quantized model might overfit the synthetic data.
Effect of different generator models. Here, we examine whether our SAM-SGA still works well with different generator models. We conduct contrastive experiments by varying the generator model size from 120M to 1.3B, and illustrate the results in Figure 2. As seen, compared to the baseline, our SAM-SGA consistently brings improvements across all model sizes, indicating that the performance of SAM-SGA is not very sensitive to the generator model size. More specifically, OPT-350m performs best, and we therefore use this setting in our experiments.

Analysis and Discussion
Here, we conduct extensive analyses to discuss: 1) whether our SAM-SGA outperforms other zero-shot quantization counterparts; 2) whether our SAM-SGA can scale to larger language model scenarios; and 3) whether it yields better model generalization.
Comparisons with other zero-shot quantization methods. To better assess the strengths of our method, we further compare it with more zero-shot quantization methods, i.e., PTQ-vanilla (vanilla post-training quantization), PTQ-PEG (Bondarenko et al., 2021), and PTQ-Adaround (Nagel et al., 2020). Specifically, taking BERT-base as an example, we show the contrastive results in the W8A8 and W4A8 settings in Table 5. As seen, our SAM-SGA outperforms the other zero-shot quantization methods by a clear margin, especially in the low-bit setting.
Scaling to billion-level large models. Some readers may wonder whether our method can be used to quantize larger language models. Here, we attempt to extend our method to the quantization of billion-parameter models. Due to our limited computing resources, we extend SAM-SGA to as large a model setting as possible, and ultimately choose OPT-1.3b for these experiments. For reference, we compare our method with the baseline GPTQ method.
As shown in Table 6, compared to the baseline, our SAM-SGA brings significant and consistent performance improvements in all settings, i.e., gains of up to +7.84. These results show that our method also works well in larger language model scenarios.
Does SAM-SGA bring better generalization? One of our main contributions is introducing the SAM optimizer to improve the generalization of the quantized model. Here, we verify this from two perspectives: i) measuring cross-task zero-shot performance, and ii) visualizing the loss landscapes of quantized models.
Task Generalization. As stated in many previous studies (Zhong et al., 2023b; Ding et al., 2022), performance on out-of-domain (OOD) data is widely used to verify model generalization. Hence, we follow Xu et al. (2021) and Zhong et al. (2022a) and evaluate the performance of the quantized model on several OOD datasets. In practice, we first quantize BERT-base models (fine-tuned on the QNLI task) with different methods, including "Baseline", "Ours", and its variant "-w/o SAM" (without SAM). Then, we run inference with the quantized models on other tasks, i.e., MRPC and RTE. The results are illustrated in Figure 3. We observe that "Ours" consistently outperforms the other counterparts. More specifically, compared with the baseline, our SAM-SGA brings a +2.0 average improvement across all tasks, indicating that SAM-SGA boosts the performance of PLMs on OOD data.
Visualization of Landscape. To take a closer look, we visualize the loss landscapes of different quantized OPT-350m models fine-tuned on the PTB task. In practice, we follow He et al. (2021) and Zhong et al. (2022a) and plot the 1D loss curve by linear interpolation between the model weights before (denoted θ₀) and after (denoted θ₁) tuning, i.e., "θ₁ + α · (θ₁ − θ₀)", where α is a scalar ranging from -1 to 1. The 1D visualization results are illustrated in Figure 4; our optimal setting "Ours" shows a flatter and lower-loss curve. These results confirm that SAM-SGA smooths the loss landscape and effectively improves the generalization of PLMs.
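The 1D interpolation itself is straightforward; the sketch below uses a stand-in quadratic loss in place of the model's task loss, and all names are ours.

```python
import numpy as np

def loss(theta):
    # stand-in for the task loss of a model with weights theta
    return float(np.sum((theta - 1.0) ** 2))

theta0 = np.zeros(3)           # weights before tuning
theta1 = 0.9 * np.ones(3)      # weights after tuning
alphas = np.linspace(-1.0, 1.0, 21)
# evaluate the loss along theta1 + alpha * (theta1 - theta0)
curve = [loss(theta1 + a * (theta1 - theta0)) for a in alphas]
```

A flatter curve around α = 0 indicates a flatter minimum, which is the property the SAM objective encourages.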

☞ A Note on More Analyses and Discussions
Notably, in addition to the above results and studies, we conduct more in-depth and systematic analyses and discussions in the Appendix, due to space limitations. Specifically, we provide 1) a parameter analysis of η_θ and η_ω, and 2) a visualization of the loss curves of the adversarial training stages in Appendix A.3. Please refer to the Appendix for more details.

Conclusion
In this paper, we propose a novel zero-shot quantization framework, ZSAQ, which effectively realizes zero-shot quantization without much accuracy drop and shows promising generalization ability. We provide a theoretical analysis of the minimax optimization algorithm under both nonconvex-strongly-concave and nonconvex-PL conditions, which guarantees a convergence rate of O(1/√T) for our ZSAQ method. Our experiments show strong quantization results, up to a +6.98 average score, and validate the improved generalization ability.

Limitations
Our work has several potential limitations. First, given the limited computational budget, we only validate our SAM-SGA method on the Base and Large model sizes. Scaling the experiments up to much larger model sizes, e.g., LLaMA-65b (Touvron et al., 2023) and OPT-66b, would make our work more convincing. Second, although our method brings substantial performance gains over other PTQ methods, it incurs more computational overhead. How to accelerate the process has not been explored in this work.

Ethics Statement
We take ethical considerations very seriously and strictly adhere to the EMNLP Ethics Policy. This paper focuses on zero-shot quantization of currently open-sourced pre-trained language models and does not capture private knowledge. Both the models and the evaluation datasets used in this paper are publicly available and have been widely adopted by researchers. Therefore, we believe that this research will not pose ethical issues.

A.1 Details of Tasks and Datasets
In this work, we conduct extensive experiments on the GLUE benchmark and some widely-used language modeling tasks. Here, we describe the tasks and datasets in detail:
CoLA The Corpus of Linguistic Acceptability (Warstadt et al., 2019) is a binary single-sentence classification task to determine whether a given sentence is linguistically "acceptable".
MRPC Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005) is a task to predict whether two sentences are semantically equivalent.
STS-B Semantic Textual Similarity (Cer et al., 2017) is a task to predict how similar two sentences are on a 1-5 scale in terms of semantic meaning.
RTE Recognizing Textual Entailment (Giampiccolo et al., 2007), given a premise and a hypothesis, is a task to predict whether the premise entails the hypothesis.
MNLI The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018) is a task to predict whether the premise entails the hypothesis, contradicts the hypothesis, or neither, given a premise sentence and a hypothesis sentence.
SST-2 The Stanford Sentiment Treebank (Socher et al., 2013) is a binary classification task to predict the sentiment of a given sentence.
QNLI Question Natural Language Inference is a binary classification task constructed from SQuAD (Rajpurkar et al., 2016), which aims to predict whether a context sentence contains the answer to a question sentence.
QQP The Quora Question Pairs dataset is a collection of question pairs. The task is to determine whether a pair of questions are semantically equivalent.
PTB The English Penn Treebank (PTB) (Mikolov and Zweig, 2012) corpus is one of the best-known and most widely used corpora for evaluating sequence-labelling models, and is also commonly used for character-level and word-level language modeling.
WikiText2 and WikiText103 Compared to the preprocessed version of the Penn Treebank (PTB), WikiText-2 (Merity et al.) is over 2 times larger and WikiText-103 (Merity et al.) is over 110 times larger. The WikiText datasets are also widely used for language modeling. As they are composed of full articles, they are well suited for models that can take advantage of long-term dependencies.

A.2 Details of Hyper-parameters
BERT and OPT models are obtained from Hugging Face, and we follow the instructions (e.g., fine-tuning hyper-parameters) from the Hugging Face Transformers library to fine-tune these models.
For the initial post-training quantization of BERT-based models, we use ZeroQuant (Yao et al., 2022) and set 48 groups for group-wise weight quantization in the W8A8 setting, 32 groups for W4A8, and 16 groups for W2A8. For OPT-based models, we use 128 groups in all settings for the GPTQ (Frantar et al., 2022) method.
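Group-wise weight quantization as configured above can be sketched as follows: rows of a weight matrix are split into groups, each quantized with its own scale. This is a simplified illustration of the idea (absmax scaling, row-wise groups); the actual ZeroQuant and GPTQ implementations differ in detail.

```python
import numpy as np

def groupwise_quantize(W, n_groups, b):
    """Quantize each row-group of W with its own absmax scale on a
    signed symmetric b-bit grid. Layout and scaling rule are illustrative."""
    qmax = 2 ** (b - 1) - 1
    out = np.empty_like(W)
    for rows in np.array_split(np.arange(W.shape[0]), n_groups):
        s = np.abs(W[rows]).max() / qmax            # per-group scale factor
        out[rows] = s * np.clip(np.round(W[rows] / s), -qmax - 1, qmax)
    return out

W = np.random.default_rng(1).normal(size=(48, 64))
W_q = groupwise_quantize(W, n_groups=48, b=8)   # one scale per row here
```

With more groups, each scale adapts to a narrower slice of the weights, typically reducing quantization error at the cost of storing more scale factors.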
For our SAM-SGA method, we use 50 iterations with batch size 32 and sequence length 128 for BERT models, and 100 iterations with batch size 4 and sequence length 2048 for OPT models. We grid-search the learning rate over {1e-6, 5e-6, 1e-5, 2e-5}. All quantized models are trained on a single NVIDIA A100 (40G) GPU.

A.3 More Results and Analyses
In this part, we provide more results and analyses to further investigate the effectiveness of our method. Specifically, we first perform the parameter analyses on η_θ and η_ω, and then visualize the loss curves of the adversarial training stages.
Parameter Analyses of η_θ and η_ω. The factors η_θ and η_ω are two important hyper-parameters in our algorithm. In this study, we analyze their influence by evaluating the performance of BERT-base (W4A8) with different η spanning {5e-6, 1e-5, 2e-5, 5e-5} on the MNLI and QNLI tasks. Figure 5 illustrates the grid-search results. We find that the results do not exhibit significant fluctuations and remain stable within a narrow range, demonstrating the robustness of our method.
Visualization of the loss curves of the adversarial training stages. Since adversarial training often suffers from instability, some readers may be concerned about the training stability of our method. To investigate this, we visualize the loss curves of the adversarial training stages in Figure 6. The student loss decreases and the generator loss increases over time, eventually reaching a relatively stable plateau in the final stage. These results indicate that the adversarial training in our SAM-SGA framework is relatively stable.

A.4 Convergence Analysis
Before our proof, we introduce some notation for brevity: ω_{t+1/2} ≜ ω̃_t + β g_ω(ω̃_t, θ_t), as in Foret et al. (2020), so that the update rule for ω can be rewritten accordingly.
Lemma A.1. Under Assumption 4.4, we can further evaluate the approximate function, where ξ_i is i.i.d.; the expectation of the approximate gradient g therefore matches the true gradient, and we give an estimation of the approximation error.
Proof. We have the following decomposition. For the cross-product term, we evaluate it as follows; for the first term, we bound it directly. Combining the above inequalities yields the claim.
Figure 6: Visualization of the loss curves of the adversarial training stages.

A.5 Strongly Concave Case
Lemma A.3. For the descending relationship of the function Φ, we have the following.
Proof. Φ(ω) is 2κl-smooth according to Lemma 4.3 in Lin et al. (2020). Taking expectation conditioned on (ω_t, θ_t) gives inequality (15); taking expectations again on both sides yields inequality (16). For the second term, we decompose it as in (17) and continue estimating the last term of inequality (17), where the last inequality (i) is due to Young's inequality.
We then continue to decompose the second term according to the update rule and take expectation. Using the µ-strong concavity of f in the variable θ, we obtain inequality (23), so the original inequality (22) can be rewritten as (24). Since f is l-smooth and µ-strongly concave in θ, we have inequality (25); on the other hand, the optimality condition gives inequality (26). Combining inequalities (25) and (26), then (24) and (27), and choosing η_θ = 1/(2l), we obtain the bound (28). Returning to inequality (21), we get (29), where inequality (i) is due to the κ-Lipschitzness of θ*(·) according to Lemma 4.3 in Lin et al. (2020); inequality (ii) follows from the decomposition ω̃_{t+1} − ω̃_t = ω̃_{t+1} − ω_{t+1} + ω_{t+1} − ω̃_t and the update rule for ω, together with Lemmas 4.1 and A.2; and the final inequality uses the property ∇Φ(·) = ∇_ω f(·, θ*(·)), which we refer to Lemma 4.3 in Lin et al. (2020).

A.6 PL Condition
First, we construct a potential function in the same way as Yang et al. (2022), where α > 0 is a preset parameter. Then we evaluate the descending relationship of the potential function V_t.

Figure 2 :
Figure 2: Ablation study of different generator models. We use OPT-family models of different sizes as the generators.

Figure 3 :
Figure 3: Analysis of task generalization. The model is fine-tuned on the QNLI task and transferred to four different tasks. We can see that our method consistently brings better generalization than the baseline.

Figure 4 :
Figure 4: 1D visualization of the loss landscapes of OPT-350m quantized by different methods. PTB (Mikolov and Zweig, 2012) is used for evaluation.

Table 1 :
Results of BERT_base on the development set of the GLUE (Wang et al., 2018) benchmark. "#Bits (W-A)" denotes the bit-width for the weights of Transformer layers and activations. The generator (OPT-350m, used without further tuning) is not trained on any end-task data; all experiments are conducted on a single NVIDIA A100 (40G) GPU.

Table 3 :
Results of OPT-350m on the dev set of four language modeling benchmarks.

SAM-SGA outperforms the other counterparts in most settings. As shown in Table 1, our SAM-SGA outperforms the baseline methods in all quantization settings by a large margin, i.e., up to a +6.98 average score.

Table 4 :
Ablation study of the important components of SAM-SGA. Here, we use BERT-base models quantized to W4A8 bit-width.

Table 5 :
Comparisons with other zero-shot quantization methods on the development set of the GLUE (Wang et al., 2018) benchmark. Full results are shown in Table 9.

Table 6 :
Results of OPT-1.3b on two language modeling benchmarks. We can find that our SAM-SGA can still work well in the billion-level model scenarios.

Table 7 :
Results of BERT_large on the development set of the GLUE (Wang et al., 2018) benchmark.

Table 8 :
Results of OPT-350m on the development set of the GLUE (Wang et al., 2018) benchmark. Here, we only report results in the W4A8 setting; note that results in the other low-bit settings are similar.

Table 9 :
Comparisons with other zero-shot quantization methods on the development set of the GLUE (Wang et al., 2018) benchmark.