ROSE: Robust Selective Fine-tuning for Pre-trained Language Models

Even though large-scale language models have achieved excellent performance, they suffer from various adversarial attacks. A large body of defense methods has been proposed, but they are still limited due to redundant attack search spaces and the inability to defend against various types of attacks. In this work, we present a novel fine-tuning approach called RObust SElective fine-tuning (ROSE) to address this issue. ROSE conducts selective updates when adapting pre-trained models to downstream tasks, filtering out uninformative and non-robust parameter updates. Specifically, we propose two strategies for selecting target robust parameters: first-order and second-order ROSE. The experimental results show that ROSE achieves significant improvements in adversarial robustness on various downstream NLP tasks, and that the ensemble method surpasses both variants above. Furthermore, ROSE can be easily incorporated into existing fine-tuning methods to further improve their adversarial robustness. The empirical analysis confirms that ROSE eliminates non-robust spurious updates during fine-tuning, leading to solutions corresponding to flatter and wider optima than those of the conventional method. Code is available at https://github.com/jiangllan/ROSE.


Introduction
Recently, fine-tuning large-scale pre-trained language models has gained prominence, achieving remarkable performance across various natural language processing benchmarks. However, recent studies (Ribeiro et al., 2020; Jin et al., 2020; Nie et al., 2020; Lin et al., 2021; Jiang et al., 2022) have highlighted the lack of adversarial robustness in models fine-tuned on specific downstream tasks, i.e., adapted models are vulnerable to various types of adversarial attacks. A majority of adversarial examples fool models via character- or word-level perturbations, exploiting either tokens that do not appear in training sets or superficial cues attached to labels (Li et al., 2021b; Le et al., 2022). The vulnerability of adapted models can be attributed to their tendency to capture shallow and spurious patterns when fine-tuned on downstream tasks, instead of utilizing the general linguistic knowledge they have learned during the pre-training stage (Sagawa et al., 2020; Warstadt et al., 2020; Gouk et al., 2021; Dong et al., 2021).
To address this issue, various defense methods have been proposed, including adversarial training (Goodfellow et al., 2015; Zhu et al., 2020; Ivgi and Berant, 2021), adversarial data augmentation (Zhang et al., 2019; Zheng et al., 2020; Si et al., 2021), and so on. Adversarial training and adversarial data augmentation provide the most promising performance among all defense methods. They enhance adversarial robustness by re-training models with additional adversarial data, generated either via human crafting or by conducting projected gradient ascent on benign examples. In essence, these methods prevent models from learning misleading features by covering more diverse training data. However, they are limited in practice, as they often require prohibitively large attack search spaces and are not generally applicable to different types of attacks.
In this work, we present an attack-agnostic and model-agnostic defense method called RObust SElective Fine-Tuning (ROSE) to address these challenges from a learning perspective. ROSE is a novel fine-tuning method that conducts robust updates selectively during the fine-tuning stage. The intuition behind our method is straightforward: only robust and informative parameter updates should be conducted, while the improper ones, which lead the fine-tuned model to capture superficial cues, should be filtered out. Specifically, we propose two strategies in response to the above challenges: first-order ROSE and second-order ROSE. Our first-order ROSE employs adversarial perturbations to identify parameters that are robust against slight perturbation in the hidden space, enabling models to cope with superficial patterns among examples with similar semantics at the current step. Our second-order ROSE allows models to counter superficial patterns across examples with different semantics along the fine-tuning process by smoothing the optimization trajectory. We also propose an ensemble method to aggregate the benefits of the above two strategies. ROSE distinguishes parameters based on robustness criteria at each step in the backward process, then performs robust updates while freezing the remaining parameters.
Figure 1 illustrates the loss landscapes of solutions found by different fine-tuning strategies (pre-trained initial, vanilla fine-tuning, overfitting, and ROSE-tuning) on specific tasks. ROSE leads to solutions corresponding to broader and flatter optima compared to conventional fine-tuning methods, which implies that it achieves better adversarial robustness, as found in Goodfellow and Vinyals (2015). Moreover, our probing experiment illustrates that ROSE prefers deeper linguistic features rather than shallow lexical ones during fine-tuning. The above empirical analysis confirms the inner working of ROSE: it allows for more robust solutions by masking out unreliable and spurious updates when fine-tuning models on downstream tasks.
We conduct extensive experiments to evaluate the effectiveness of ROSE, comparing it to several strong defense methods. The results show that ROSE exhibits superior adversarial robustness on challenging examples, and achieves comparable or even better benign performance on several benchmarks. ROSE is generic and can be incorporated into existing methods to further enhance their adversarial robustness.

Methodology
In this section, we introduce our method in detail. The key motivation of our method is to select parameters that carry stable information for downstream tasks during fine-tuning. Specifically, vanilla fine-tuning updates all the parameters in the backward process, while ROSE only updates the robust and informative ones. To identify robust parameters, we propose two robustness criteria and the corresponding selective fine-tuning strategies: first-order ROSE (Section 2.1) and second-order ROSE (Section 2.2). Furthermore, we explore an ensemble robust selective fine-tuning method (Section 2.3), which aggregates the benefits of the above two strategies. The overall training algorithm of ROSE, applied to AdamW (Loshchilov and Hutter, 2019), is shown in Algorithm 1.

First-order ROSE
Our first-order ROSE aims to select parameters that are insensitive to first-order perturbation in the hidden space. We employ adversarial inputs to distinguish robust parameters. Different from conventional virtual adversarial examples generated via PGD-based methods, ROSE adopts dropout to generate adversarial perturbation with little overhead cost. We follow the method used in Gao et al. (2021); Liang et al. (2021), which passes the same input to the model twice in the forward process with different dropout, and obtains two outputs correspondingly. Then, in the backward process, ROSE only updates parameters that are insensitive to the difference between the two outputs.

Formally, we denote an initial pre-trained model as θ_0, and the two probability distributions produced with different dropout at the t-th step as P_t and P'_t. The Kullback-Leibler (KL) divergence between them is defined as:

L^{KL}_t = D_{KL}(P_t ∥ P'_t).   (1)

In the backward process, first-order ROSE filters out parameters which incline to learn superficial cues between similar examples. The potential risk r^{F,i}_t of the i-th parameter in the model is computed as the Frobenius norm of its gradient with regard to L^{KL}_t:

r^{F,i}_t = ∥∇_{θ_i} L^{KL}_t∥_F.   (2)

Then we sort the magnitudes of the first-order risks in ascending order. Given the upper threshold c^F_h (e.g., 60%) for robust parameters, the mask M^F_t is derived as:

M^{F,i}_t = 1 if r^{F,i}_t ranks within the lowest c^F_h fraction, and 0 otherwise.   (3)

Note that only the classification loss is used to update the weights of the model, while gradients with regard to L^{KL}_t are discarded after the masks are calculated.

[Algorithm 1: ROSE for the AdamW optimizer. Given α = 0.001, β_1, β_2 ∈ [0, 1), ϵ = 10^{−8}, λ ∈ R: initialize time step t ← 0, parameter vector θ_{t=0} ∈ R^n, first moment vector m_{t=0} ← 0, second moment vector v_{t=0} ← 0, and learning rate η ∈ R; then repeat the masked update steps.]
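The per-step selection above can be sketched in plain Python. This is a toy illustration, not the released implementation: the helper names, the toy risk values, and the example distributions are all made up for the sketch.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) between two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def first_order_mask(risks, c_h):
    """Keep (mask = 1) the fraction c_h of parameters with the smallest
    first-order risk, i.e. the smallest gradient norm w.r.t. the KL loss;
    freeze (mask = 0) the rest."""
    n = len(risks)
    keep = int(c_h * n)
    order = sorted(range(n), key=lambda i: risks[i])  # ascending risk
    mask = [0] * n
    for i in order[:keep]:
        mask[i] = 1
    return mask

# two hypothetical output distributions for the same input, two dropouts
p_t, p_t_prime = [0.5, 0.5], [0.25, 0.75]
kl = kl_divergence(p_t, p_t_prime)

# hypothetical per-parameter risks; keep the 60% least sensitive ones
print(first_order_mask([0.9, 0.1, 0.5, 0.3, 0.7], 0.6))  # → [0, 1, 1, 1, 0]
```

Parameters whose updates react strongly to the dropout perturbation fall in the top percentile of risks and are frozen for that step.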

Second-order ROSE
Our second-order ROSE smooths the optimization trajectory to prevent models from learning spurious patterns between different groups of data points along the fine-tuning process. More precisely, our second-order ROSE selects and tunes parameters that are less aggressively updated, to avoid overfitting on spurious patterns. A straightforward idea is to calculate the second derivatives of the classification loss as the second-order risks. Unfortunately, this requires prohibitive computation and storage costs. Thus we employ a stochastic gradient-based optimizer like AdamW to approximate this solution.
Formally, we denote the softmax cross-entropy loss at the t-th step as L^{SCE}_t, and the first momentum of the optimizer as m_{t−1}. The second-order risk r^{S,i}_t of the i-th parameter in the model is defined as the relative magnitude between the current gradient g_t and the exponential moving average m_{t−1}, computed as:

r^{S,i}_t = α · |g^i_t − m^i_{t−1}|,   (4)

where α is a scaling coefficient. In AdamW, α = (1 − β_1), where β_1 is the momentum factor, so r^{S,i}_t equals |m^i_t − m^i_{t−1}|, the shift of the momentum caused by the current step. Similar to our first-order ROSE, we sort the magnitudes of the second-order risks r^S_t in ascending order and calculate the second-order mask M^S_t with Eq. 3.
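Under this reading of Eq. 4, the risk is the per-step momentum shift α·|g_t − m_{t−1}|. The sketch below is an illustrative assumption, not the official code; the function name and toy gradient values are hypothetical.

```python
def second_order_risks(grads, momentum, beta1=0.9):
    """Per-parameter second-order risk: the scaled gap between the
    current gradient g_t and the optimizer's first momentum m_{t-1}.
    With alpha = 1 - beta1 this equals |m_t - m_{t-1}|, i.e. how far
    the current step moves the running average."""
    alpha = 1.0 - beta1
    return [alpha * abs(g - m) for g, m in zip(grads, momentum)]

grads    = [0.2, -1.5, 0.05, 0.8]   # hypothetical current gradients g_t
momentum = [0.1, -0.2, 0.04, 0.9]   # hypothetical running averages m_{t-1}
risks = second_order_risks(grads, momentum)

# parameter 1 moves most aggressively against its running average,
# so it receives the highest risk and is the first to be frozen
print(max(range(len(risks)), key=lambda i: risks[i]))  # → 1
```

Sorting these risks in ascending order and thresholding at c^S_h then yields the second-order mask, exactly as in the first-order case.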

Ensemble ROSE
Since our first-order and second-order ROSE emphasize different kinds of robust parameters, we then propose an ensemble method to aggregate the benefits of the above two mechanisms.
At the t-th step, we first calculate the first-order risks r^{F,1}_t, ..., r^{F,n}_t with Eq. 2 and the second-order risks r^{S,1}_t, ..., r^{S,n}_t with Eq. 4, and sort both in ascending order. Given upper thresholds c^F_h and c^S_h, we compute the first-order and second-order masks M^F_t and M^S_t, respectively. Finally, the ensemble mask M^E_t at the t-th step is computed as:

M^E_t = γ · M^F_t + (1 − γ) · M^S_t,   (5)

where γ ∈ (0, 1) is a scaling coefficient hyper-parameter that controls the weight of the two masks.
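A minimal sketch of the mask combination, assuming the soft weighting M_E = γ·M_F + (1 − γ)·M_S (this combination rule is our reading of Eq. 5; the names and toy mask values below are illustrative):

```python
def ensemble_mask(mask_f, mask_s, gamma=0.5):
    """Soft combination of the first- and second-order masks:
    gamma weights the first-order mask, (1 - gamma) the second-order one.
    A parameter kept by both masks keeps its full update; a parameter
    kept by only one mask gets a down-weighted update."""
    return [gamma * f + (1 - gamma) * s for f, s in zip(mask_f, mask_s)]

m_f = [1, 0, 1, 1, 0]  # hypothetical first-order mask
m_s = [1, 1, 0, 1, 0]  # hypothetical second-order mask
print(ensemble_mask(m_f, m_s, 0.5))  # → [1.0, 0.5, 0.5, 1.0, 0.0]
```

With γ = 0.5 the two criteria are weighted equally, which matches the default setting used in the experiments.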

Datasets
We demonstrate the effectiveness of our method using four tasks from GLUE (Wang et al., 2019) and AdvGLUE (Wang et al., 2021b) benchmarks.
The General Language Understanding Evaluation (GLUE) is a widely-used benchmark including 9 natural language understanding tasks. The Adversarial GLUE (AdvGLUE) is a robustness benchmark based on GLUE, covering 14 prevalent adversarial textual attack methods. AdvGLUE adopts careful systematic annotations to curate high-quality and reliable adversarial examples. We do not employ automatic adversarial attack algorithms to evaluate adversarial robustness, due to their tendency to produce incorrect or puzzling adversarial examples (Li et al., 2021a).

SST-2 (Socher et al., 2013) is a sentiment classification task with single-sentence inputs, collected from movie reviews.
RTE (Bentivogli et al., 2009) is a natural language inference task aggregated from a series of textual entailment challenges, originating from news articles and Wikipedia content.
QNLI (Rajpurkar et al., 2016) is also an inference task. Given a context sentence and a corresponding question, the task is to determine whether the sentence contains the answer to the question.
QQP (Chen et al., 2018) is a widely used benchmark involving the detection of semantic similarity. An annotated binary label indicates whether the two texts of each pair are semantically equivalent or not.

Baselines
We adopt pre-trained checkpoints of RoBERTa BASE and RoBERTa LARGE (Liu et al., 2019) as the basis of our experiments. Besides the vanilla fine-tuning method, we select several suitable baselines for comparison, including:

R-Drop (Liang et al., 2021) is a generic regularization strategy that enforces consistency between two outputs obtained with different dropout. ROSE borrows the idea of applying dropout twice, but does not fine-tune all parameters to constrain the divergence between the two outputs.
CHILD-TUNING D (Xu et al., 2021) is a fine-tuning technique which only updates the most informative subset of parameters of large pre-trained models during the backward process. Although both CHILD-TUNING and ROSE mask out gradients in the backward process, the specific parameters they update are completely different.
SMART (Jiang et al., 2020) is an adversarial training approach that contains a smoothness-inducing regularization module and an optimization module inspired by the Bregman proximal point method.
FreeLB (Zhu et al., 2020) is an adversarial training approach built on top of language models, which promotes invariance in the word embedding space and reduces the adversarial risk surrounding input examples.

Experimental Settings
Our implementation of ROSE is based on the Huggingface library (Wolf et al., 2020). The batch size for RTE is set to 16, and for the other tasks it is set to 32. Dropout rates are all set to 0.1. We carry out a grid search over learning rates. For ROSE-Ensemble, we simply set γ = 0.5 in Eq. 5. The maximum number of epochs is set to 10. For the replication study, we report the average accuracy over 5 random seeds on the GLUE and AdvGLUE development sets after fine-tuning the pre-trained models on the corresponding GLUE training data.
For all the baselines, we either perform grid search or adopt the parameter combination provided by the official codebase to select the best parameters.Similarly, we report the average results on two benchmarks over 5 random seeds using the same evaluation schema.

Main Results
We compare ROSE-First, ROSE-Second, and ROSE-Ensemble to all baselines on the SST-2, RTE, QNLI, and QQP tasks from the GLUE and AdvGLUE benchmarks. The overall results are summarized in Table 1. We observe that: (1) Our proposed ROSE substantially improves the robustness of fine-tuned pre-trained language models, while maintaining benign performance competitive with previous methods. Although the effectiveness of ROSE-First and ROSE-Second varies across tasks and model sizes, both surpass existing methods. ROSE-Ensemble aggregates the advantages of the first-order and second-order strategies, providing the strongest adversarial robustness. In particular, ROSE-Ensemble BASE outperforms the vanilla RoBERTa BASE model by 8.04% average score, and ROSE-Ensemble LARGE beats RoBERTa LARGE by 4.29% on average.
(2) ROSE consistently outperforms CHILD-TUNING D and R-Drop, which both share some similarities with our method. CHILD-TUNING D, which masks out the most inessential parameters in the backward process, shows the worst robustness on most datasets. R-Drop uses dropout to regularize the output of models. Results indicate that R-Drop improves robustness on a number of tasks, but it is not competitive with strong defense methods. We will explore the effectiveness of our robust selection strategies further in Section 3.5.
(3) Our method also surpasses the two strong baselines SMART and FreeLB, which employ the most prevalent adversarial training ideas to improve the robustness of pre-trained models. For instance, ROSE-Ensemble BASE improves the average score by up to 2.75% over SMART BASE, and ROSE-Ensemble gains a 2.15% average-score improvement over FreeLB with RoBERTa LARGE. Furthermore, SMART and FreeLB are both inefficient and heavily tied to the model structure, while ROSE does not suffer from these issues.

Extensions of ROSE to Existing Method
ROSE is a generic method and can be easily incorporated into other well-recognized methods. In this section, we incorporate ROSE into R-Drop and examine whether it remains effective. Since the optimization objective of R-Drop is a weighted sum of the softmax cross-entropy loss and the KL-divergence, we decouple them from the aggregated loss and use them to perform our second-order and first-order mask calculations, respectively. Note that, in the backward process, we still use the gradient calculated with regard to the aggregated loss to update the parameters, which differs from the ROSE process.
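The masked update with the aggregated R-Drop loss can be sketched as follows. This is a toy illustration under our reading of the procedure; the helper name and all values are hypothetical.

```python
def masked_update(params, grads, mask, lr):
    """Apply the aggregated-loss gradient only where the ROSE mask is on;
    masked-out (mask = 0) parameters are frozen for this step."""
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]

params = [1.0, 1.0, 1.0]   # hypothetical parameters
grads  = [0.5, 0.5, 0.5]   # gradient of the aggregated R-Drop loss
mask   = [1, 0, 1]         # hypothetical ROSE mask: middle one frozen
updated = masked_update(params, grads, mask, lr=0.1)
print(updated[1])  # the frozen parameter is unchanged → 1.0
```

The only difference from plain ROSE is the source of the gradient: here it comes from the aggregated R-Drop objective, while the masks are still derived from the decoupled KL and cross-entropy terms.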
We primarily adopt the best parameter combinations from the main experiments in Section 3.4, including the learning rates and upper thresholds.
We follow the settings from R-Drop for other parameters.We conduct experiments using both RoBERTa BASE and RoBERTa LARGE .
Results are shown in Table 2. Generally, ROSE improves the adversarial robustness of R-Drop by a large margin, while maintaining competitive benign performance at the same time. For example, ROSE-First BASE promotes the adversarial robustness of R-Drop BASE on the QNLI task from 28.92% to 39.46%, and R-Drop patched with ROSE-Second improves on the QQP task from 44.80% to 50.26% using RoBERTa LARGE. Notably, our ROSE-Ensemble outperforms R-Drop by roughly 3 points on average for both model sizes. The above results indicate that, when incorporated into existing methods, ROSE can enhance their adversarial robustness even further.

Effect of Scalar γ
Further, we investigate the impact of the scaling coefficient γ in ROSE-Ensemble. We vary γ ∈ {0.1, 0.3, 0.5, 0.7, 0.9} and conduct experiments on four tasks, where γ = 0.5 is the default setting. We adopt the settings from Section 3.4 for the other parameters. The results are presented in Table 3. The best-balanced choice is γ = 0.5, but ROSE-Ensemble stably improves robustness with the other values of γ as well. Furthermore, ROSE achieves larger improvements when applied to pre-trained language models of greater complexity.

Analysis
In this section, we conduct further analysis to reveal the inner working of ROSE.

Two-dimensional Loss Visualization
The loss landscape (Goodfellow and Vinyals, 2015; Li et al., 2018a) is a valid and effective indicator for characterizing the properties of neural networks. It has been empirically demonstrated that flatter and wider optima correlate well with better robustness. We plot and compare the two-dimensional loss landscapes of the solutions found by vanilla fine-tuning and by our ROSE. Visualizations on various tasks show that ROSE generally leads to flatter and wider optima, thus improving the adversarial robustness of adapted models.
Let θ denote the parameters of a model fine-tuned on a downstream task. The two-dimensional loss surface of the model can be plotted with the function:

f(α, β) = L(θ + αδ + βη),

where L is the loss function, α and β are scalar values, and δ, η are direction vectors randomly sampled from a Gaussian distribution, denoting two directions in the parameter space that correspond to the two axes of the loss surface. To remove the scaling effect of neural nets, we follow the filter-wise normalization in Li et al. (2018a), which rescales δ and η to the same norm as the parameters, i.e., δ ← (δ / ∥δ∥) · ∥θ∥, for each axis. Since the parameter space is high-dimensional, the two random directions δ and η are, with high probability, divergent and near-orthogonal to each other, which our experimental results confirm. We plot and compare the loss surfaces of models with vanilla fine-tuning and ROSE on four tasks.
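The surface f(α, β) = L(θ + αδ + βη) with normalized random directions can be sketched as below. This is a self-contained toy: a quadratic loss stands in for the task loss, the normalization is collapsed from filter-wise to whole-vector, and all names are illustrative.

```python
import math
import random

def normalize_direction(direction, theta):
    """Rescale a random direction to the same norm as the parameters
    (a whole-vector simplification of the filter-wise normalization
    in Li et al., 2018a)."""
    d_norm = math.sqrt(sum(d * d for d in direction))
    t_norm = math.sqrt(sum(t * t for t in theta))
    return [d / d_norm * t_norm for d in direction]

def loss_surface(loss_fn, theta, delta, eta, alphas, betas):
    """f(alpha, beta) = L(theta + alpha * delta + beta * eta)."""
    return [[loss_fn([t + a * d + b * e
                      for t, d, e in zip(theta, delta, eta)])
             for b in betas]
            for a in alphas]

# toy 3-parameter "model" with a quadratic loss standing in for L
theta = [1.0, -2.0, 0.5]
loss = lambda p: sum(x * x for x in p)
random.seed(0)
delta = normalize_direction([random.gauss(0, 1) for _ in theta], theta)
eta   = normalize_direction([random.gauss(0, 1) for _ in theta], theta)
grid = loss_surface(loss, theta, delta, eta, [-1, 0, 1], [-1, 0, 1])
print(grid[1][1])  # center of the grid = loss at theta itself → 5.25
```

Plotting `grid` as a contour map over (α, β) gives the kind of landscape shown in Figure 2; wider and sparser contours around the center indicate a flatter optimum.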
The visualizations are shown in Figure 2. We can observe that ROSE has a significant influence on the smoothness of the loss landscapes across all datasets. Models fine-tuned with ROSE exhibit wider and less dense loss contours than vanilla fine-tuning, which shows that they are more robust against noisy perturbations. Specifically, ROSE-First finds solutions with wider bottoms, and ROSE-Second leads to solutions with less dense loss contours. This indicates that ROSE-First and ROSE-Second succeed in defending against local and global perturbations, respectively. Additionally, ROSE-Ensemble is shown to have both of these features, demonstrating that it aggregates the benefits of the two strategies discussed above. Appendix A.1 provides an additional one-dimensional visualization.

Probing Preference for Different Features
We then employ the probing tasks from Warstadt et al. (2020) to test whether models fine-tuned with ROSE prefer linguistic rather than superficial features. In a probing experiment, a model is first trained on ambiguous data which equally supports both linguistic and superficial cues, and then tested on disambiguating data which supports only the linguistic cues. The preference of models for features is measured through Matthews correlation scores between predictions and labels on the test sets.
A score of 1 indicates a systematic preference for linguistic features, and a score of −1 indicates complete reliance on superficial cues; a higher score therefore shows a stronger preference for linguistic features. We select two representative experiments obtained by pairing the linguistic feature Syntactic construction with the two surface features Lexical content and Length. For each probing task, we report the results of adapted models with different fine-tuning strategies over 5 random seeds. Results are plotted in Figure 3. We can observe that, compared to vanilla fine-tuned models, ROSE-tuned models show a stronger preference for linguistic features over superficial features. This indicates that ROSE successfully enables models to extract deeper linguistic knowledge during fine-tuning, instead of adopting spurious cues from the training data of downstream tasks.
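The Matthews correlation used as the probing metric can be computed as follows. This is a standard textbook implementation, not the probing codebase; the toy label lists are illustrative.

```python
import math

def matthews_corr(preds, labels):
    """Matthews correlation coefficient for binary (0/1) predictions."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

labels = [1, 0, 1, 0, 1, 1]
# full agreement with the linguistic feature scores 1 ...
print(matthews_corr(labels, labels))                    # → 1.0
# ... and complete reliance on the (inverted) surface cue scores -1
print(matthews_corr([1 - y for y in labels], labels))   # → -1.0
```

These two extremes correspond exactly to the endpoints of the probing scale described above.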

Related Work
Adversarial training is the most effective and promising strategy for improving the adversarial robustness of models. Existing adversarial training methods usually employ PGD-based attacks to generate adversarial examples, and force models to maintain their output on them (Zhu et al., 2020; Liang et al., 2018; Wang et al., 2021c). Despite the substantial improvements in robustness, adversarial training often requires significant computational and memory costs, and the generated examples may fail to preserve the original labels. Some works focus on constructing reliable adversarial datasets (Gardner et al., 2020; Eger and Benz, 2020), which require substantial human annotation and only work for a single task. By contrast, our proposed ROSE is much more efficient and only employs such perturbations to select robust parameters to tune; therefore, there is no need for reliable adversarial examples.
Besides adversarial training methods, our work also relates to work on regularization and optimization. In regularization, many methods have been proposed, including the L2-penalty (Schwarz et al., 2018; Li et al., 2018b; Chen et al., 2020), weight decay (Kang et al., 2016; Zhang et al., 2021), Mixout regularization (Lee et al., 2020), and so on. The general approach is to augment the vanilla optimizer with terms that directly or indirectly penalize aggressive updates. Although these methods are promising, the regularization is often not scalable and is hard to transfer to other models. Another line of work (Wang et al., 2021a; Dong et al., 2021) attempts to address this issue from an information-theoretic perspective. In optimization, some recent work forces optimization towards wide valleys (Chaudhari et al., 2017; Jiang et al., 2020). Compared to these works, ROSE uses the simplest idea, selecting parameters by second-order robustness in the fine-tuning stage to smooth the optimization trajectory. ROSE is more efficient and can be incorporated into existing methods to further improve their adversarial robustness.
Note that our method does not fall within the realm of model compression. The target of model compression is to obtain an efficient sub-network with competitive performance, typically by discarding parameters at inference time. ROSE, by contrast, aims to improve the adversarial robustness of pre-trained language models, which it achieves by conducting selective parameter updates in the backward process.

Conclusion
In this work, we propose an attack-agnostic and model-agnostic defense approach called ROSE, which selectively updates robust parameters during the fine-tuning stage. We present first-order ROSE, which selects parameters robust against slight perturbations in the hidden space; second-order ROSE, which filters out aggressive updates; and ensemble ROSE, which aggregates the benefits of the above two strategies. Experimental results show that both ROSE-First and ROSE-Second greatly improve robust performance on various NLP benchmarks, while ROSE-Ensemble is even more effective. Besides, existing methods achieve better robustness when incorporated with ROSE. We also demonstrate empirically that the effectiveness of ROSE can be attributed to the wider and flatter solutions it finds compared to conventional fine-tuning methods. We hope ROSE can motivate more defense work for language models.

Limitations
Although ROSE achieves superior adversarial robustness on four datasets, there are still two limitations. First, there are some vital hyper-parameters in ROSE, e.g., the scaling coefficient γ, which have a great influence on performance, as shown in Section 3.6. We adopt grid search to select the best parameters, which requires considerable GPU resources; a more automatic method is still needed. Once we further understand the inner working mechanism of deep neural networks, such hyper-parameters could be derived theoretically. Second, due to the limitation of computational resources, we focus on fine-tuning in this work, leaving the application of ROSE to pre-training for future work. We hope ROSE can provide a new perspective for general defense work towards more robust language models.

A.1 One-dimensional Loss Interpolation

We adopt the one-dimensional linear interpolation described in (Goodfellow and Vinyals, 2015), which plots the value of the loss function along the line connecting two different models. Let θ and θ′ indicate the parameters of these two models, respectively. Then we plot the function:

f(α) = L((1 − α)θ + αθ′),

where α is a scalar value. We compare the weights of models obtained by the vanilla and ROSE-Ensemble fine-tuning methods on four tasks. In particular, for α ∈ [−0.5, 1.5], we uniformly sample 51 points, plot the function f(α), and superimpose the classification accuracy.
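The interpolation f(α) = L((1 − α)θ + αθ′) can be sketched as follows. A toy quadratic loss stands in for the task loss, and the two two-parameter "solutions" are hypothetical.

```python
def interpolate_loss(loss_fn, theta, theta_prime, alphas):
    """f(alpha) = L((1 - alpha) * theta + alpha * theta_prime), so
    alpha = 0 recovers the first solution and alpha = 1 the second."""
    return [loss_fn([(1 - a) * t0 + a * t1
                     for t0, t1 in zip(theta, theta_prime)])
            for a in alphas]

# toy quadratic loss; two hypothetical 2-parameter solutions
loss = lambda p: sum(x * x for x in p)
theta_rose, theta_vanilla = [0.0, 0.0], [2.0, 0.0]

# 51 uniform points over alpha in [-0.5, 1.5], as in the appendix
alphas = [-0.5 + 2.0 * i / 50 for i in range(51)]
f = interpolate_loss(loss, theta_rose, theta_vanilla, alphas)
print(f[0], f[25], f[50])  # → 1.0 1.0 9.0
```

Plotting `f` against `alphas` gives the 1D curves of Figure 4; a flatter curve near one endpoint indicates a wider optimum around that solution.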
Figure 4 shows the visualization of fine-tuned models with different strategies. Compared with vanilla fine-tuning, we can observe that ROSE provides wider and flatter curves. This indicates that solutions obtained by ROSE tend to be more robust.

A.2 Effectiveness of Dropout
To investigate the effectiveness of dropout in ROSE-First, we inspect, for the two outputs produced with different dropout, the ratio (%) of prediction labels that are inconsistent with each other over the first 200, 600, 2000, and 4000 steps.
From Table 4 we can see that dropout generates adversarial examples as expected, at strikingly low cost. The ratio decreases quickly at the beginning and eventually stabilizes, which indicates that ROSE-First succeeds in improving the robustness of models against such perturbations over the training process.
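The inconsistency ratio reported in Table 4 can be computed as follows (a sketch; the function name and the toy prediction lists are illustrative):

```python
def inconsistency_ratio(preds_a, preds_b):
    """Percentage of examples on which the two dropout-perturbed
    forward passes predict different labels."""
    assert len(preds_a) == len(preds_b)
    disagree = sum(a != b for a, b in zip(preds_a, preds_b))
    return 100.0 * disagree / len(preds_a)

# toy predictions from two forward passes with different dropout
print(inconsistency_ratio([1, 0, 1, 1], [1, 1, 1, 0]))  # → 50.0
```

Tracking this ratio over training steps reproduces the trend described above: it starts high while the model is still sensitive to the dropout perturbation, then falls as ROSE-First stabilizes the predictions.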

Figure 1 :
Figure 1: The loss surfaces of fine-tuned models with different fine-tuning strategies, initialized from RoBERTa BASE. ROSE leads to wider and flatter optima, enhancing the adversarial robustness of adapted models.

Figure 2 :
Figure 2: 2D loss contours of models with vanilla fine-tuning and ROSE on four tasks. Lighter contour lines correlate with larger loss, and areas with denser contour lines indicate steeper loss surfaces.

Figure 3 :
Figure 3: Results for probing tasks with different random seeds. Each data point represents one run. Models with a stronger preference for linguistic rather than spurious features achieve higher scores.

Figure 4 :
Figure 4: 1D loss interpolation (left axis corresponds to loss, right axis to accuracy) between solutions found by vanilla fine-tuning and ROSE-Ensemble on four tasks. Models trained with ROSE (α = 0) have flatter and wider curves than vanilla fine-tuning (α = 1), which correlates well with robustness.

Table 1 :
Model performance on the GLUE and AdvGLUE benchmarks. The results are accuracy averaged over 5 random seeds, reported as percentages (%). Besides the per-task scores, we also report a macro-average. The last column is the drop from GLUE to AdvGLUE; the smaller, the better. Bold indicates that ROSE is significantly better (1-tailed t-test, p-value < 0.05) than the baselines.

Table 2 :
Experimental results for R-Drop incorporated with ROSE. ROSE further improves the robustness of R-Drop. Bold indicates that ROSE is significantly better (1-tailed t-test, p-value < 0.05) than R-Drop.

Table 3 :
Results for ROSE-Ensemble with different γ.

Table 4 :
The ratio (%) of inconsistent prediction labels between the two outputs produced with different dropout, over the first 200, 600, 2000, and 4000 steps.