Bi-Drop: Enhancing Fine-tuning Generalization via Synchronous sub-net Estimation and Optimization

Pretrained language models have achieved remarkable success in natural language understanding. However, fine-tuning pretrained models on limited training data tends to overfit and thus diminish performance. This paper presents Bi-Drop, a fine-tuning strategy that selectively updates model parameters using gradients from various sub-nets dynamically generated by dropout. The sub-net estimation of Bi-Drop is performed in an in-batch manner, so it overcomes the hysteresis in sub-net updating that affects previous methods relying on asynchronous sub-net estimation. In addition, Bi-Drop needs only one mini-batch to estimate the sub-net, which leads to a higher utility of training data. Experiments on the GLUE benchmark demonstrate that Bi-Drop consistently outperforms previous fine-tuning methods. Furthermore, empirical results also show that Bi-Drop exhibits excellent generalization ability and robustness for domain transfer, data imbalance, and low-resource scenarios.


Introduction
In recent years, Natural Language Processing (NLP) has achieved significant progress due to the emergence of large-scale Pretrained Language Models (PLMs) (Devlin et al., 2018; Liu et al., 2019; Raffel et al., 2019; Clark et al., 2020). For downstream tasks, compared with training from scratch, fine-tuning pretrained models can usually achieve efficient adaptation and result in better performance. Despite the great success, fine-tuning methods still face challenges in maintaining generalization performance on downstream tasks: they tend to run into the overfitting issue when the training data is limited (Phang et al., 2018; Devlin et al., 2018; Lee et al., 2020).
To improve the generalization ability of fine-tuning methods, many regularization techniques have been proposed (Chen et al., 2020; Wu et al., 2021; Xu et al., 2021; Yuan et al., 2022), such as sub-net optimization strategies like Child-Tuning_D (Xu et al., 2021) and DPS (Zhang et al., 2022). Child-Tuning_D selects a static sub-net for updating based on parameter importance estimated by Fisher Information (FI). As an improved variant of Child-Tuning_D, DPS dynamically decides the sub-net to be updated by estimating FI with multiple mini-batches of data. Although these FI-based methods achieve better generalization ability than vanilla fine-tuning, they still have two limitations: (1) hysteresis in sub-net updating: the sub-net preference is estimated with the model parameters from previous iterations and may be incompatible with the current update step; and (2) insufficient utility of training data: FI estimation requires accumulating gradients over multiple mini-batches, so these methods do not fit situations with data scarcity.
In this paper, we delve deeper into adaptive sub-net optimization strategies and propose Bi-Drop, a FI-free strategy for fine-tuning pretrained language models. Unlike Fisher information estimation, which requires cumulative gradients over mini-batches, Bi-Drop relies only on information from a single mini-batch to select the parameters to update. Specifically, Bi-Drop utilizes gradient information from different sub-nets dynamically generated by dropout in each mini-batch. As illustrated in Figure 1, within a single training step of Bi-Drop, a mini-batch goes through the forward pass multiple times and, due to the randomness introduced by dropout, yields various distinct sub-nets. We then apply a parameter selection algorithm with perturbation and scaling factors to stabilize the gradient updates. With this synchronous parameter selection strategy, Bi-Drop can selectively update model parameters according to the information from only the current mini-batch, and thus mitigate overfitting with a high utility of training data.
Extensive experiments on the GLUE benchmark demonstrate that Bi-Drop shows remarkable superiority over state-of-the-art fine-tuning regularization methods, with a considerable margin of 0.53∼1.50 average score. Moreover, Bi-Drop consistently outperforms vanilla fine-tuning by 0.83∼1.58 average score across various PLMs. Further analysis indicates that Bi-Drop attains superb generalization ability for domain transfer and task transfer, and is robust to data imbalance and low-resource scenarios.
To sum up, our contributions are three-fold:
(1) We propose Bi-Drop, a sub-net optimization strategy for fine-tuning pretrained language models that selects the parameters to update using gradients from multiple sub-nets dynamically generated by dropout.
(2) Bi-Drop estimates the sub-net synchronously from the current mini-batch, avoiding the hysteresis in sub-net updating and the low data utility of prior FI-based methods.
(3) Extensive experiments on the GLUE benchmark show that Bi-Drop consistently outperforms prior fine-tuning methods and exhibits strong generalization and robustness in domain transfer, data imbalance, and low-resource scenarios.

Related Work
Pretrained Language Models In recent years, the field of natural language processing (NLP) has witnessed significant advancements due to the development of large-scale pretrained language models (PLMs). The introduction of BERT (Devlin et al., 2018) sparked a continuous emergence of various pretrained models, including RoBERTa (Liu et al., 2019), ELECTRA (Clark et al., 2020), XLNet (Yang et al., 2019), GPT-2 (Radford et al., 2019), and GPT-3 (Brown et al., 2020), which have brought remarkable improvements in model structures and scales. Until now, fine-tuning remains one of the most popular approaches to adapting large pretrained language models to downstream tasks.
Regularization Methods for Fine-tuning Large-scale PLMs are prone to over-fitting (Phang et al., 2018; Devlin et al., 2018) and exhibit inadequate generalization ability when fine-tuned with limited training data (Aghajanyan et al., 2021; Mahabadi et al., 2021), resulting in degraded performance. To tackle this issue, various regularization techniques have been proposed to enhance the generalization capacity of models, including advanced dropout variants (Wan et al., 2013; Wu et al., 2021), adversarial perturbations (Aghajanyan et al., 2021; Wu et al., 2022; Yuan et al., 2022), and constrained regularization methods (Daumé III, 2007; Chen et al., 2020). In recent years, Child-tuning (Xu et al., 2021) and DPS (Zhang et al., 2022) have proposed to optimize only a subset of parameters (i.e., the sub-net) during fine-tuning to mitigate overfitting.

Background
We first introduce the paradigm of sub-net optimization by giving general formulations of the back-propagation in vanilla fine-tuning and CHILD-TUNING_D. We denote the parameters of the model at the t-th iteration as $\theta_t = \{\theta_{t,i}\}_{i=1}^{n}$, where $\theta_{t,i}$ represents the i-th element of $\theta_t$ at the t-th training iteration, and $\theta_0$ denotes the parameter matrix of the pretrained model. Vanilla fine-tuning applies Stochastic Gradient Descent (SGD) to all the model parameters, formally:
$$\theta_{t+1} = \theta_t - \eta \frac{\partial \mathcal{L}(\theta_t)}{\partial \theta_t},$$
where $\mathcal{L}$ represents the training loss within a batch and $\eta$ is the learning rate. Instead of fine-tuning the entire network, CHILD-TUNING_D proposes to only optimize a subset of parameters (i.e., the sub-net).
It first adopts the Fisher Information (FI) to estimate the relative importance of the parameters for a specific downstream task, which can be formulated as:
$$F(\theta_0) = \frac{1}{|D|} \sum_{(x,y)\in D} \left( \frac{\partial \log p(y \mid x; \theta_0)}{\partial \theta_0} \right)^2,$$
$$M_{CT_D,i} = \begin{cases} 1, & F(\theta_0)_i \geq \mathrm{sort}\big(F(\theta_0)\big)_p \\ 0, & \text{otherwise,} \end{cases}$$
where $D$ is the training data, $F(\theta_0)$ denotes the Fisher information matrix of the pretrained parameters, $\mathrm{sort}(\cdot)_p$ represents the highest value of the p-th percentile in $F(\theta_0)$ after sorting in ascending order, and $M_{CT_D}$ is a mask matrix of the same size as $\theta_0$. During fine-tuning, CHILD-TUNING_D only optimizes the selected sub-net in $M_{CT_D}$:
$$\theta_{t+1} = \theta_t - \eta \, \frac{\partial \mathcal{L}(\theta_t)}{\partial \theta_t} \odot M_{CT_D}.$$
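For concreteness, the following PyTorch-style sketch illustrates how an FI-based mask of the CHILD-TUNING_D type can be computed. The function and argument names (fisher_mask, loader, keep_ratio) and the use of torch.quantile for the percentile threshold are illustrative assumptions, not the authors' implementation.

```python
import torch


def fisher_mask(model, loader, keep_ratio=0.2):
    """Sketch of an FI-based mask: accumulate squared log-likelihood gradients
    over the training data, then keep the top `keep_ratio` fraction of parameters."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for inputs, labels in loader:
        model.zero_grad()
        log_probs = torch.log_softmax(model(inputs), dim=-1)
        loss = torch.nn.functional.nll_loss(log_probs, labels)
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    scores = torch.cat([f.flatten() for f in fisher.values()])
    threshold = torch.quantile(scores, 1.0 - keep_ratio)  # percentile threshold
    return {n: (f >= threshold).float() for n, f in fisher.items()}
```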

Bi-Drop
As introduced in Section 3.1, CHILD-TUNING_D only optimizes a fixed sub-net during fine-tuning and ignores the update of the other parameters, which may degrade the model's performance on downstream tasks. In this section, we offer a detailed introduction to our proposed method, Bi-Drop, which selects the parameters to update adaptively at each fine-tuning step. Specifically, Bi-Drop splits each training step into three sub-steps: (1) multiple forward propagations, (2) sub-net selection, and (3) parameter updating. We provide pseudo-code for Bi-Drop in Algorithm 1.

Multiple Forward Propagations
Unlike prior FI-based methods that require accumulated gradients to measure parameter importance, Bi-Drop leverages the distinct sub-nets generated by dropout to select the sub-net to be updated. Inspired by Wu et al. (2021), given the training data, at each training step we feed the mini-batch $x_i$ to the model multiple times in the forward pass with different dropouts, and obtain the corresponding gradients:
$$g_t^{(j)} = \frac{\partial \mathcal{L}\big(x_i; \theta_t^{(j)}\big)}{\partial \theta_t}, \quad j = 1, \dots, k,$$
where $\theta_t^{(j)}$ and $g_t^{(j)}$ represent the parameters of the j-th forward pass (i.e., the j-th sub-net sampled by dropout) and its corresponding gradients, and $k$ denotes the number of forward passes, i.e., the number of distinct sub-nets with different dropouts.
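As an illustration, the PyTorch-style sketch below shows how the k forward passes and their per-sub-net gradients could be collected. The function and argument names (per_subnet_gradients, loss_fn) are illustrative and not taken from the paper's code.

```python
import torch


def per_subnet_gradients(model, loss_fn, inputs, labels, k=2):
    """Feed the same mini-batch through the model k times with dropout enabled,
    recording one gradient dictionary per sampled sub-net (illustrative sketch)."""
    model.train()  # keep dropout active so each pass samples a different sub-net
    grads = []
    for _ in range(k):
        model.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        grads.append({n: p.grad.detach().clone()
                      for n, p in model.named_parameters() if p.grad is not None})
    return grads
```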

Sub-net Selection
In this subsection, we introduce our sub-net selection strategy, which estimates the relative importance of parameters based on the gradients of the distinct sub-nets generated by different dropouts. Concretely, our strategy is based on two estimation factors: the perturbation factor and the scaling factor.
Perturbation Factor We propose the perturbation factor, which estimates the importance of parameters according to their stability under the different dropouts applied in the forward passes. We point out that the various sub-nets generated by dropout can be viewed as adversarial perturbations to the vanilla model. The perturbation factor $F_{per}$ is computed from the mean $\mu_t$ and the variance of the parameter gradients across the k sub-nets: parameters with consistently larger gradients and smaller variances under these perturbations are favored by this factor.
Scaling Factor We further propose the scaling factor as a regularization term. This factor measures the ratio of the average parameter gradients to the original parameters. Parameters whose gradient scale is much smaller than the original parameters will not be updated, which is similar in spirit to gradient clipping.
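Since the exact formulas of the two factors are not reproduced above, the sketch below shows one plausible instantiation built on the per-sub-net gradients from the previous sketch: the perturbation factor is approximated by the ratio of the gradient mean to its standard deviation across sub-nets, and the scaling factor by a threshold on the gradient-to-parameter ratio. Both concrete formulas, as well as the names and default values, are assumptions for illustration rather than the paper's definitions.

```python
import torch


def select_subnet(grads, params, p=0.4, eps=1e-12, min_scale=1e-6):
    """Combine an assumed perturbation factor (|mean| / std of per-sub-net gradients)
    with an assumed scaling factor (gradient-to-parameter ratio threshold) and keep
    the top fraction `p` of each parameter tensor."""
    masks = {}
    for name, param in params.items():
        g = torch.stack([grad[name] for grad in grads])         # (k, *param.shape)
        mu, sigma = g.mean(dim=0), g.std(dim=0)
        f_per = mu.abs() / (sigma + eps)                          # perturbation factor (assumed form)
        f_scale = (mu.abs() / (param.abs() + eps)) > min_scale    # scaling factor (assumed form)
        score = f_per * f_scale.float()
        threshold = torch.quantile(score.flatten(), 1.0 - p)
        masks[name] = (score >= threshold).float()
    return masks


# The selected sub-net is then updated, e.g. with the gradients averaged over the
# k passes (cf. the gradient averaging strategy mentioned in the ablation study):
#   param <- param - lr * mask * mean_j(g^{(j)})
```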

Parameter Updating
Following prior work (Xu et al., 2021; Zhang et al., 2022), we derive a step-wise mask matrix $M_t$ by selecting the parameters whose scores, measured by the aforementioned two estimation factors, exceed the value of the p-th percentile.
Then, we utilize $M_t$ to update the sub-net, which consists of the important parameters at each training step. The formulation is obtained by simply replacing the mask in the CHILD-TUNING_D update with our step-wise mask matrix $M_t$:
$$\theta_{t+1} = \theta_t - \eta \, \frac{\partial \mathcal{L}(\theta_t)}{\partial \theta_t} \odot M_t.$$

NLI Datasets We also evaluate the generalization ability of Bi-Drop on several Natural Language Inference (NLI) tasks, including SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), MNLI-M (Williams et al., 2018) and SICK (Marelli et al., 2014). We report all results by Accuracy on the development sets, consistent with GLUE.

Baselines
Besides the vanilla fine-tuning method, we mainly compare Bi-Drop with the following baselines: Mixout (Lee et al., 2020) is a fine-tuning technique that stochastically replaces the parameters with their pretrained weights based on a Bernoulli distribution. R3F (Aghajanyan et al., 2021) is a fine-tuning strategy motivated by trust-region theory, which injects noise sampled from either a normal or uniform distribution into the pretrained representations. R-Drop (Wu et al., 2021) minimizes the bidirectional KL-divergence to force the output distributions of two sub-nets sampled by dropout to be consistent with each other. Child-Tuning_D (Xu et al., 2021) selects the task-relevant parameters as the sub-net based on the Fisher information and only updates the sub-net during fine-tuning. DPS (Zhang et al., 2022) is a dynamic sub-net optimization algorithm based on Child-Tuning_D; it estimates Fisher information with multiple mini-batches of data and selects the sub-net adaptively during fine-tuning. For reference, we also show other prior fine-tuning techniques in our main experimental results, such as Weight Decay (Daumé III, 2007), Top-K Tuning (Houlsby et al., 2019) and RecAdam (Chen et al., 2020).
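As a reference point, the R-Drop objective described above can be sketched as follows; the function name and the KL weight alpha are illustrative.

```python
import torch.nn.functional as F


def r_drop_loss(logits1, logits2, labels, alpha=1.0):
    """Cross-entropy on two dropout passes plus a symmetric KL term that pulls
    their output distributions together (illustrative sketch of R-Drop)."""
    ce = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
    kl = F.kl_div(F.log_softmax(logits1, dim=-1), F.softmax(logits2, dim=-1),
                  reduction="batchmean")
    kl = kl + F.kl_div(F.log_softmax(logits2, dim=-1), F.softmax(logits1, dim=-1),
                       reduction="batchmean")
    return ce + alpha * kl / 2
```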

Experiments Setup
We conduct our experiments based on the HuggingFace transformers library (Wolf et al., 2020) and follow the default hyper-parameters and settings unless noted otherwise. We report results averaged over 10 random seeds. Other detailed experimental setups are presented in Appendix B.

Results on GLUE
Comparison with Prior Methods We compare Bi-Drop with various prior fine-tuning methods based on BERT-large and report the mean (and max) scores on the GLUE benchmark in Table 2, following Lee et al. (2020) and Xu et al. (2021). The results indicate that Bi-Drop yields the best average performance across all tasks, showing its effectiveness. Moreover, the average of the maximum scores attained by Bi-Drop is superior to that of other methods, providing further evidence of its effectiveness. We also conduct the same experiment on RoBERTa-large; the details can be found in Appendix E.
Comparison with Vanilla Fine-tuning We show the experimental results of six widely used large-scale PLMs on the GLUE benchmark in Table 3. The results show that Bi-Drop outperforms vanilla fine-tuning consistently and significantly across all tasks and PLMs. For instance, Bi-Drop achieves an improvement of up to 1.58 average score on BERT-base and 1.35 average score on RoBERTa-base. The results highlight the universal effectiveness of Bi-Drop in enhancing the fine-tuning performance of PLMs. Additionally, because Bi-Drop forward-propagates twice, we present an additional study of the baseline with a doubled batch size in Appendix D.

Out-of-Domain Generalization
We further evaluate the generalization ability of Bi-Drop in a widely used experimental setting from prior research (Aghajanyan et al., 2021; Xu et al., 2021; Zhang et al., 2022): the models are trained on MNLI or SNLI and tested on out-of-domain NLI data. As shown in Table 4, Bi-Drop better maintains the out-of-domain generalization ability of the model than the compared methods.

Task Generalization
We also evaluate the generalization ability of fine-tuned models following the experimental setting of Aghajanyan et al. (2021) and Xu et al. (2021), which freezes the representations of the model fine-tuned on one task and only trains a linear classifier on another task. Specifically, we fine-tune BERT-large on one task selected from MRPC, CoLA, and RTE, and then transfer the model to the other two tasks. Figure 3 shows that Bi-Drop consistently outperforms vanilla fine-tuning when the fine-tuned model is transferred to other tasks.
In particular, Bi-Drop improves by 3.50 and 3.28 points when models trained on MRPC and RTE, respectively, are evaluated on CoLA. The results further verify the conclusion that Bi-Drop helps models learn more generalizable representations compared with the vanilla fine-tuning approach.
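A minimal sketch of this transfer protocol is given below; the names are illustrative, and hidden_size=1024 is assumed for BERT-large.

```python
import torch


def linear_probe(encoder, num_labels, hidden_size=1024):
    """Freeze the fine-tuned encoder and train only a fresh linear classifier
    on the target task (illustrative sketch of the task-transfer protocol)."""
    for p in encoder.parameters():
        p.requires_grad = False  # keep the fine-tuned representations fixed
    classifier = torch.nn.Linear(hidden_size, num_labels)
    optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)
    return classifier, optimizer
```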

Stability to Random Seeds
We further investigate the stability of fine-tuned models. Figure 4 shows the output distributions of models under four experimental settings across 10 random seeds. The results demonstrate that Bi-Drop outperforms other strategies in terms of average performance, and also exhibits greater stability by achieving more consistent results across the 10 random seeds with lower variance.

Robustness Analysis
Recent research has shown that the vanilla fine-tuning approach is vulnerable in many respects. In this study, we assess the robustness of Bi-Drop with evaluation tasks that focus on two common scenarios, examining its ability to withstand different forms of data perturbation.
Robustness to Label Noise Due to the inherent limitations of human annotation, widely-used large-scale datasets inevitably contain a certain amount of incorrect labels (Vasudevan et al., 2022). To investigate the robustness of Bi-Drop to label noise, we conduct simple simulation experiments on RTE, MRPC, and CoLA by randomly corrupting a predetermined fraction of labels with erroneous values, and evaluate various fine-tuning methods trained on the noisy data. The results, shown in the left panel of Table 5, indicate that Bi-Drop maintains more robust representations under label noise than the other fine-tuning methods.

Robustness to Data Imbalance The minority class refers to a class with insufficient instances in the training set. In this section, we explore the robustness of diverse fine-tuning approaches for the minority class by carrying out experiments on synthetic RTE, MRPC, and CoLA datasets. The experimental results are illustrated in the right panel of Table 5, which shows that Bi-Drop significantly outperforms other fine-tuning methods. Bi-Drop achieves up to 4.00, 4.23, and 5.29 average score improvements at 30%, 40%, and 50% reduction ratios respectively, outperforming other fine-tuning methods at lower reduction ratios and showcasing its robustness towards the minority class.
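A minimal sketch of the label-corruption simulation described above; the function name, arguments, and resampling scheme are illustrative assumptions rather than the paper's exact setup.

```python
import random


def corrupt_labels(labels, noise_ratio, num_classes, seed=0):
    """Replace a predetermined fraction of labels with a different (wrong) class,
    as in the label-noise simulation (illustrative sketch)."""
    rng = random.Random(seed)
    noisy = list(labels)
    flip_idx = rng.sample(range(len(noisy)), int(noise_ratio * len(noisy)))
    for i in flip_idx:
        noisy[i] = rng.choice([c for c in range(num_classes) if c != noisy[i]])
    return noisy
```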

Performance in Low-Resource Scenarios
As discussed in Sections 1 and 2, compared with prior FI-based sub-net optimization methods that depend strongly on the amount of training data, Bi-Drop uses a step-wise sub-net selection strategy that chooses the parameters to optimize from only the current mini-batch. In this section, we conduct extensive experiments to analyze how this dependency affects model performance. Concretely, we apply various fine-tuning methods to BERT-large with a limited amount of training data.
The results are illustrated in Table 6. As the data amount decreases from 1.0K to 0.5K, the average improvement of Child-Tuning_D over vanilla fine-tuning decreases from 1.57 to 1.15, while its improved variant DPS maintains a relatively stable improvement. In contrast, the average improvement of Bi-Drop grows from 2.77 to 3.28. These results indicate the superiority of Bi-Drop over prior methods in low-resource scenarios.

Ablation Study
To evaluate the effectiveness of our proposed fine-tuning strategy, we conduct an ablation study in Table 7. The results show that both our sub-net selection strategy and our gradient averaging strategy are effective.

Limitations
We propose a novel and effective fine-tuning method, Bi-Drop, which achieves considerable performance improvements on downstream tasks. However, similar to some previous studies (Jiang et al., 2020; Aghajanyan et al., 2021; Wu et al., 2021), Bi-Drop requires multiple forward propagations, which makes it less time-efficient during training than the vanilla fine-tuning method.

D Batch Size Doubled Training
We implement Bi-Drop by repeating the input data twice and forward-propagating twice, which is similar to doubling the batch size at each step. The difference is that half of the data is identical to the other half, whereas with a directly doubled batch size all examples in a mini-batch are different. For a fair comparison, we also experimented with directly doubling the batch size. The experimental results are shown in Table 10: directly doubling the batch size yields essentially no improvement, and Bi-Drop is significantly better.
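A minimal sketch of the batch-repetition trick described above; the tensor names follow the HuggingFace convention and are illustrative.

```python
import torch


def repeat_batch(input_ids, attention_mask):
    """Concatenate the mini-batch with itself so that a single forward pass
    yields two dropout-independent copies (illustrative sketch)."""
    return (torch.cat([input_ids, input_ids], dim=0),
            torch.cat([attention_mask, attention_mask], dim=0))
```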

E Comparison with Prior Methods on Roberta large
We compare Bi-Drop with various prior fine-tuning methods based on RoBERTa-large and report the mean (and max) scores on the GLUE benchmark in Table 11.

Figure 1 :
Figure 1: Comparisons between Bi-Drop and previous methods. Unlike previous methods that require multiple mini-batches of data to asynchronously determine the sub-net to optimize, Bi-Drop has a synchronous sub-net selection strategy with higher data utility.

Figure 2 :
Figure 2: An overall illustration of Bi-Drop. Bi-Drop splits each training step into three sub-steps: (1) Multiple forward propagations: each mini-batch sample goes through the forward pass multiple times (denoted as k) with dropout; (2) Sub-net selection: an advanced strategy is adopted to select the sub-net to be updated based on the gradients of distinct sub-nets generated by dropout; (3) Parameter updating: only the parameters of the selected sub-net are updated to mitigate overfitting.

Figure 4 :
Figure 4: Results of Bi-Drop across four experimental settings. Each method includes a violin plot for 10 random runs. Compared with other methods, the shorter and thicker violin plot of Bi-Drop proves its better stability.

Figure 5 :
Figure 5: The effect of the dropout rate on experimental results.

Table 2 :
Comparison of Bi-Drop with prior fine-tuning methods. We report the mean (max) results of 10 random seeds. The best results are bold. Note that since R3F is not applicable to regression, the result on STS-B (marked with *) remains the same as vanilla. Bi-Drop achieves the best performance compared with other methods.

Table 3 :
Comparison between Bi-Drop and vanilla fine-tuning applied to six widely-used large-scale PLMs. We report the mean results of 10 random seeds. Average scores on all tasks are underlined. The best results are bold. Bi-Drop yields consistent improvements across all tasks among different PLMs.

Table 4 :
Evaluation for out-of-domain generalization. The models are trained on MNLI/SNLI and tested on out-of-domain data. Average scores are computed excluding in-domain results (underlined). The best results are bold. Bi-Drop can better maintain the out-of-domain generalization ability of the model.

Figure 3 :
Evaluation for task generalization. The model is fine-tuned on a specific task among MRPC, CoLA, RTE and transferred to the other two tasks. Bi-Drop is more generalizable.


Table 5 :
Left: Robustness to label noise. The noise ratio is the percentage of training instances whose labels are changed to incorrect labels. Right: Robustness to data imbalance. We reduce the number of instances labeled 1 by 70%/60%/50% in the training set and test the accuracy on instances labeled 1 (as the minority class) in the validation set. Bi-Drop maintains more robust representations compared with other fine-tuning methods.

Table 6 :
Comparison between Bi-Drop and prior sub-net optimization strategies in low-resource scenarios (0.5K, 1K training examples). We report the results of 10 random seeds and the best results are bold. Bi-Drop performs better than other methods in low-resource scenarios.


Table 7 :
Ablation results. ESS represents our Effective Sub-net Selection strategy using both the Perturbation and Scaling factors. RSS stands for Random Sub-net Selection strategy. Both our sub-net selection strategy and gradient averaging strategy are effective.
Bi-Drop applies dropout in each of its two forward passes. To analyze the impact of the dropout rate on the experimental results, we conduct a simple analysis experiment here. For a fair comparison, all hyperparameters except the dropout rate are kept the same, and for simplicity the two dropout rates are set to the same value. The experimental results are shown in Figure 5. Different datasets have different preferences for the dropout rate: CoLA and RTE achieve the best results with a dropout rate of 0.05, MRPC achieves the best results with a dropout rate of 0.1, and STS-B is largely insensitive to the dropout rate until it falls below 0.1.

Table 9 :
Hyperparameter settings for different pretrained models on various tasks. These settings are reported in their official repositories as best practice.

Table 11 :
Comparison of Bi-Drop with prior fine-tuning methods. We report the mean (max) results of 10 random seeds. The best results are bold. Note that since R3F is not applicable to regression, the result on STS-B (marked with *) remains the same as vanilla. Bi-Drop achieves the best performance compared with other methods.