G-Tuning: Improving Generalization of Pre-trained Language Models with Generative Adversarial Network



Introduction
Large-scale pre-trained language models (PLMs) have demonstrated substantial achievements in natural language processing (NLP) recently (Qiu et al., 2020; Han et al., 2022). Generally, fine-tuning PLMs with task-specific training data yields significant improvements compared to training a model from scratch (Devlin et al., 2019; Lample and Conneau, 2019; Radford et al., 2018; Ouyang et al., 2022). Fine-tuning aims at transforming the representation from PLMs in the universal space to the target space, thereby enabling the model to generalize to a wider range of samples (Pan and Yang, 2010; Liu et al., 2021; Wei et al., 2021).
However, the generalization capability of PLMs is largely affected by the task-specific data when fine-tuning is used to further train the model (Patel et al., 2022). As noted by Wu et al. (2022), fine-tuning is susceptible to memorizing the training data when the capacity of the PLM exceeds that of the downstream task data. Furthermore, the advantage of PLMs over random initialization is lost when the coverage of the task-specific data is low, resulting in a large gap between training and test data (Zoph et al., 2020; He et al., 2019). For instance, in domain generalization, PLMs fine-tuned with in-domain training sets often fail to perform well on out-of-domain test sets, even when data from the test sets is used for pre-training (Yang et al., 2022). Therefore, a crucial challenge for unlocking the potential of PLMs is how to learn a complete and accurate mapping from the universal space to the target space with limited training data.
Previous studies have presented some promising approaches to this problem. Fang et al. (2020) proposed a self-teaching method that uses a fine-tuned PLM to produce soft labels for unlabeled data and trains another PLM on these synthetic data, which brings considerable improvements in the cross-lingual transfer scenario. Li and Zhang (2021) presented regularized self-labeling to correct mislabeled data points and reweight less confident data points to regularize PLMs. Li et al. (2022) proposed an ensemble learning method for domain generalization, which can dynamically dispatch proper PLMs to predict each test sample. Wu et al. (2022) proposed a noisy tuning method that adds matrix-wise perturbation to different parameter matrices to overcome the overfitting problem of fine-tuning, which indirectly improves the generalization ability of PLMs. Lu et al. (2022) presented stochastic weight averaging to improve generalization by encouraging the model to converge to a flatter minimum. In-context learning, which does not tune the parameters of PLMs, is orthogonal to fine-tuning; we leave it to future work (Li and Liang, 2021; Brown et al., 2020; Vu et al., 2022).
In this paper, we propose a novel fine-tuning framework (named G-Tuning) to preserve the generalization capability of PLMs when training with task-specific data. We adopt parameter efficient fine-tuning (PEFT) as our backbone, which only tunes a lightweight adapter network connected behind the PLM (Houlsby et al., 2019)¹, and incorporate a generative adversarial network (Goodfellow et al., 2020; Arjovsky et al., 2017; Gulrajani et al., 2017) into it to learn the representation mapping. Specifically, we first train a discriminator that aims to identify representations that are not correctly mapped into the target space. Then, besides predicting the ground-truth label, the model is treated as a generator required to produce representations that are hard to discriminate. We conduct experiments on GLUE and two more challenging scenarios, i.e., domain and language generalization. Experimental results show that G-Tuning improves generalization capability by effectively mapping the universal representation to the target space.

Preliminaries
Pre-trained Language Model. Given a large-scale unlabeled data-set M, the widely-used training objective of PLMs is

$\mathcal{L}_P(\theta) = \mathbb{E}_{x^u \sim M}\left[-\log p_\theta\left(x^u \mid m(x^u)\right)\right], \quad (1)$

where $x^u$ is an input sequence and $m(\cdot)$ is a perturbation function that masks tokens in $x^u$ according to a certain rule (Devlin et al., 2019; Radford et al., 2018). $\theta$ is the parameter set of the PLM, which generally adopts the Transformer structure (Vaswani et al., 2017).
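The perturbation function $m(\cdot)$ is only characterized as masking tokens "by a certain rule"; as a hedged illustration, the following is a minimal BERT-style masking sketch (the mask symbol and the 15% rate are illustrative assumptions, not the paper's setting):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """A minimal sketch of a perturbation function m(.): each token is
    independently replaced by a mask symbol with probability mask_prob.
    Returns the corrupted sequence and the indices of masked positions,
    which the PLM is trained to recover (Eq. 1)."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets.append(i)  # model must predict tokens[i] here
        else:
            corrupted.append(tok)
    return corrupted, targets
```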
Parameter Efficient Fine-tuning. Given a training set B from a downstream task, we can summarize the loss function of PEFT (Houlsby et al., 2019; He et al., 2022a) as:

$\mathcal{L}_T(\phi) = \mathbb{E}_{(x^t, y^t) \sim B}\left[-\log p_{\theta,\phi}\left(y^t \mid x^t\right)\right], \quad (2)$

where $x^t$ is the input and $y^t$ is the corresponding label. PEFT only trains an adapter parameterized by $\phi$; the parameters $\theta$ of the PLM are fixed.
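A minimal bottleneck adapter in the spirit of Houlsby et al. (2019) can be sketched as below; the hidden and bottleneck sizes are illustrative assumptions, and the PLM itself (parameters $\theta$) would be kept frozen while only this module ($\phi$) is trained:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Sketch of a bottleneck adapter: down-project, nonlinearity,
    up-project, with a residual connection so the frozen PLM
    representation passes through unchanged at initialization's scale."""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, h):
        # residual connection keeps the PLM representation intact
        return h + self.up(self.act(self.down(h)))
```

Only the adapter's parameters would be passed to the optimizer, which is what makes the method parameter-efficient.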

The Proposed G-Tuning
Before elaborating on G-Tuning, we first define the composition of the PLM and the adapter as f(·).¹ Given an input, the output of f(·) is a vector or the mean pooling of a matrix, depending on the task. Inspired by the Wasserstein GAN (WGAN) (Arjovsky et al., 2017; Gulrajani et al., 2017), we train a discriminator D(·) as follows:

$\mathcal{L}_D = \mathbb{E}_{x^u \sim M}\left[D(f(x^u))\right] - \mathbb{E}_{x^t \sim B}\left[D(f(x^t))\right] + \lambda\, \mathbb{E}_{\hat{z}}\left[\left(\left\|\nabla_{\hat{z}} D(\hat{z})\right\|_2 - 1\right)^2\right]. \quad (3)$

Here, the second term is the gradient penalty, which is used to smooth the weights of D(·) (Gulrajani et al., 2017). The coefficient λ is set to 10, and the latent representation $\hat{z}$ is computed by:

$\hat{z} = \epsilon f(x^t) + (1-\epsilon) f(x^u), \quad \epsilon \sim U[0,1]. \quad (4)$

We consider the representations f(x^t) to obey the real distribution and f(x^u) the generated distribution. The aim of D(·) is to identify representations that are not correctly mapped from the universal space to the target space. We regard f(·) as the generator, which can be optimized by the following loss function:

$\mathcal{L}_G = -\mathbb{E}_{x^u \sim M}\left[D(f(x^u))\right]. \quad (5)$

Finally, different from the original WGAN, we combine the loss functions from Eq. 2 and Eq. 5 in a multi-task learning paradigm (Sener and Koltun, 2018) to optimize the adapter network:

$\mathcal{L} = \alpha \mathcal{L}_T + \beta \mathcal{L}_G, \quad (6)$

where the coefficients α and β control the two loss terms and are set to 1 and 0.5, respectively.² Here, we utilize L_G to learn the representation mapping. To avoid deviation of the target space, we further use L_T to keep it consistent. An illustration of G-Tuning is shown in Figure 1.

¹ Compared to fine-tuning the whole model, PEFT is efficient and stable in our experiments (see Appendix A).

Table 1: Results on the GLUE. "Avg." is the average score of all tasks. "†" means the results come from their paper.
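The WGAN-GP objectives above (Eqs. 3-5) can be sketched in PyTorch as follows. This is a sketch, not the authors' implementation: D is assumed to be any scalar-output critic, and f(x^t), f(x^u) are assumed to arrive as batches of pooled representations.

```python
import torch

def gradient_penalty(D, real, fake, lam=10.0):
    """Gradient penalty term of Eq. 3 (Gulrajani et al., 2017): penalize
    ||grad D(z)|| deviating from 1 at interpolates z defined by Eq. 4."""
    eps = torch.rand(real.size(0), 1)           # epsilon ~ U[0, 1]
    z = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad, = torch.autograd.grad(D(z).sum(), z, create_graph=True)
    return lam * ((grad.norm(2, dim=1) - 1) ** 2).mean()

def discriminator_loss(D, f_xt, f_xu, lam=10.0):
    # Eq. 3: f(x^t) obeys the "real" (target-space) distribution,
    # f(x^u) the generated one.
    return D(f_xu).mean() - D(f_xt).mean() + gradient_penalty(D, f_xt, f_xu, lam)

def generator_loss(D, f_xu):
    # Eq. 5: push universal representations toward the target space.
    return -D(f_xu).mean()
```

The generator loss would then be combined with the task loss as alpha * L_T + beta * L_G per Eq. 6.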

Implementation Detail
Data-set. We first experiment on the GLUE benchmark (Wang et al., 2018). Since the labels of the test sets are not released, we report results on the validation sets. Further, we report the result of MNLI-matched for the MNLI task and do not evaluate the WNLI task due to problems with the data. Then, we conduct experiments in two more challenging scenarios: domain generalization and language generalization. In domain generalization, following Yang et al. (2022), we reorganize the data set of each task in GLUE (named GLUE_ood) and use the same metrics to evaluate the model. In language generalization, we adopt the XTREME benchmark (Hu et al., 2020), in which we use English training data to tune the model and transfer it to other languages. More details of the GLUE and XTREME benchmarks can be found in their papers (Wang et al., 2018; Hu et al., 2020). The evaluation metrics of all tasks and the data statistics of the GLUE_ood data-set are shown in Appendix C.
Setting. Depending on the type of task, we use the large setting of RoBERTa (Liu et al., 2019) or XLM-R (Conneau et al., 2020) as the foundation model. We set the batch size to 32 and the number of gradient accumulation steps to 2. The training epochs of all models are 10. We compose the data-set M by randomly sampling unlabeled data. Similar to previous work (Xu et al., 2022; Zaken et al., 2022), we search the learning rate over {5e-6, 1e-5, 5e-5, 1e-4, 5e-4}. The discriminator is updated {3, 5, 7, 9} times as often as the generator. We use Adam (Kingma and Ba, 2014) as the optimizer. We use a three-layer Transformer structure for the discriminator and the adapter, respectively. We first fine-tune the PLM for 5 epochs; then, we use the fine-tuned model to train the discriminator and employ the proposed method for another 5 epochs. For sentence-pair tasks, we compute the representation of each sentence individually and combine them before the output layer. For structure prediction tasks, we use the average of the outputs of all tokens as input. We report the average score of 3 runs with different seeds. Experiments are performed on 4 NVIDIA A100 GPUs.

Main Results
Results on GLUE. The results on the GLUE benchmark are presented in Table 1. We first report the results of several widely-used PLMs, i.e., XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019), as well as a prompt-based method, HyperPrompt (He et al., 2022b). To ensure a fair comparison, we implement self-teaching (Self-Teach) (Fang et al., 2020), noisy tuning (NoisyTune) (Wu et al., 2022), and the standard adapter-based method (trained by Eq. 2) with the same structure as our model. Our G-Tuning approach gets a 1.0 absolute improvement compared to fine-tuning RoBERTa directly. Additionally, our model consistently outperforms previous work. Subsequently, we evaluate the generalization capability of our approach in domain generalization and language generalization. While G-Tuning gets considerable improvements on the GLUE benchmark, it is important to consider whether it effectively transforms the entire representation space or only a neighborhood surrounding the training data.
Results on Domain Generalization. The results are presented in Table 2. Similar to Yang et al. (2022), we evaluate our approach by exploiting out-of-domain (OOD) data as test sets. Compared to the standard dev sets of GLUE, the performance of all methods exhibits a notable decline on the OOD test sets. Moreover, the results of fine-tuning the whole model are inferior to those of other methods, suggesting that training PLMs with in-domain data leads to a severe decline in generalization ability. In comparison to the strongest baseline, G-Tuning achieves an average improvement of 6.4 in the domain generalization scenario.
Results on Language Generalization. The overall results on the XTREME benchmark are shown in Table 3. Compared to fine-tuning XLM-R, our method achieves an average improvement of 1.4. However, the improvement is lower than that observed in domain generalization. We posit that the characteristics of different languages have an impact on language generalization. Here, we do not utilize any bilingual parallel data, which makes it challenging to learn the alignment of these characteristics, resulting in limited improvements.

Analysis
We sampled 200 sentences each from in-domain and out-of-domain data and used the fine-tuned and G-Tuned models to generate representations.
We then normalized the representations and reduced their dimensionality using t-SNE. The visualization is shown in Figure 2, in which we also include contour lines based on sample density. It is apparent that the density centers of the two domains nearly coincide for our model, whereas fine-tuning results in a significant gap between the domains. Empirically, the representation in a task-specific space should be centralized. The gap produced by fine-tuning leads to incorrect label predictions for some data, e.g., the samples in the upper right corner of the left figure.
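The visualization pipeline described above can be sketched as follows; this is a hedged illustration of the normalize-then-project steps, and the perplexity value and function names are assumptions rather than the paper's settings:

```python
import numpy as np
from sklearn.manifold import TSNE

def project_representations(in_domain, out_domain, perplexity=30, seed=0):
    """Sketch: L2-normalize pooled sentence representations from the two
    domains, then reduce them jointly to 2-D with t-SNE for plotting
    (as in Figure 2). Returns the 2-D points split back per domain."""
    reps = np.vstack([in_domain, out_domain])
    reps = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    points = TSNE(n_components=2, perplexity=perplexity,
                  random_state=seed, init="random").fit_transform(reps)
    return points[: len(in_domain)], points[len(in_domain):]
```

Projecting both domains jointly (rather than separately) is what makes the gap between their density centers visible in a single plot.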

Conclusion
In this work, we elucidate a drawback of the fine-tuning strategy on PLMs: the representation from PLMs is not fully mapped to the target space when the training data is insufficient, which diminishes the generalization ability of PLMs on downstream tasks. To address this issue, we present G-Tuning, a framework that aims to preserve the generalization ability of PLMs on downstream tasks. The proposed G-Tuning utilizes a generative adversarial network to transform representations that are not covered by the training data into the target space. Extensive experiments on the GLUE benchmark and two additional scenarios show that G-Tuning accurately maps the universal representation to the target space, achieving substantial improvements in generalization performance.

Limitations
In this section, we summarize several limitations of our work. First, we only apply the proposed method to natural language understanding (NLU) tasks; it is uncertain how to extend our approach to natural language generation (NLG) tasks and whether it would bring considerable improvements. Second, the training process of a GAN is sensitive to hyper-parameters, so we cannot simply use the default settings when extending to other tasks. Finally, this paper does not include a theoretical explanation or proof of how our method works, which we will study in future work.

C Evaluation Metric and Data Statistic
We summarize the evaluation metrics used in our experiments on the GLUE and XTREME benchmarks in Table 4. Compared to Wang et al. (2018) and Hu et al. (2020), our work has several differences. First, we do not evaluate the MNLI-mismatched set in the MNLI task. Second, we choose the Pearson correlation to evaluate the STSB task and the F1 score to evaluate the XQuAD, MLQA and TyDiQA tasks.
Then, following Yang et al. (2022), we select several data-sets as out-of-domain test sets to evaluate domain generalization ability. The data statistics are shown in Table 5. For MNLI and STSB, we concatenate the training and test sets from SICK as the test set. We randomly sample 6000 samples from HANS and IMDB as test sets for RTE and SST2, respectively. All data-sets mentioned above can be downloaded from the Hugging Face Hub³. Moreover, CoLA_ood is collected by Yang et al. (2022), from which we randomly select 6000 samples as the test set.

Figure 1: An illustration of the training process of the proposed G-Tuning.

Figure 2: The scatter plot of the representations of in-domain (orange square) and out-of-domain (blue circle) data generated by fine-tuning (left) and G-Tuning (right), respectively.

Algorithm 1: The overall process of G-Tuning.
Input: Unlabeled set M; labeled set B; PLM with adapter f(·) = h ∘ g(·); training steps N; discriminator D(·); update frequency T
Output: The fine-tuned model f(·)
1: Randomly initialize the parameters of h(·)
2: Initialize the parameters of g(·) from the PLM
3: for _ = 1 to N/2 do
4:    Sample {x^t, y^t} ∼ B
5:    Optimize f(·) by Eq. 2
6: end for
7: for _ = 1 to N/2 do
8:    Sample x^t ∼ B and x^u ∼ M
9:    Optimize D(·) by Eq. 3
10: end for
11: for _ = 1 to N/2 do
12:    Sample {x^t, y^t} ∼ B and x^u ∼ M
13:    Optimize f(·) by Eq. 6
14:    for _ = 1 to T do
15:       Sample x^t ∼ B and x^u ∼ M
16:       Optimize D(·) by Eq. 3
17:    end for
18: end for
19: return f(·)
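One outer iteration of the adversarial phase of Algorithm 1 can be sketched as below. This is a hedged sketch, not the authors' code: the module names (f, head, D), the cross-entropy task loss, and the default hyper-parameters are illustrative assumptions, and the gradient penalty of Eq. 3 is omitted for brevity.

```python
import torch
import torch.nn as nn

def adversarial_step(f, D, head, x_t, y_t, x_u, opt_f, opt_D,
                     alpha=1.0, beta=0.5, T=3):
    """Sketch of one iteration of the second phase of Algorithm 1:
    f maps inputs to representations, head predicts labels (Eq. 2),
    and D is a scalar-output critic updated T times per f update.
    opt_f must hold the parameters of both f and head."""
    # Optimize f by Eq. 6: alpha * L_T + beta * L_G
    rep_t, rep_u = f(x_t), f(x_u)
    l_t = nn.functional.cross_entropy(head(rep_t), y_t)
    l_g = -D(rep_u).mean()                      # Eq. 5
    opt_f.zero_grad()
    (alpha * l_t + beta * l_g).backward()
    opt_f.step()
    # Optimize D by the Wasserstein term of Eq. 3 (penalty omitted here)
    for _ in range(T):
        with torch.no_grad():
            rep_t, rep_u = f(x_t), f(x_u)       # freeze f for the critic step
        l_d = D(rep_u).mean() - D(rep_t).mean()
        opt_D.zero_grad()
        l_d.backward()
        opt_D.step()
    return float(l_t), float(l_d)
```

In practice the critic frequency T would be searched over {3, 5, 7, 9} as described in the Setting paragraph.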

Table 3: Results on the XTREME benchmark (language generalization). "Sent" is short for sentence and "Struct Pred" for structured prediction. "Avg." is the average score of all tasks. "†" means the results come from their paper.

Table 4: The evaluation metrics used in our experiments on the GLUE and XTREME benchmarks.

Table 5: The data statistics of the GLUE_ood data-set.