Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning

Recent pretrained language models have grown from millions to billions of parameters, so the need to fine-tune an extremely large pretrained model with a limited training corpus arises in various downstream tasks. In this paper, we propose a straightforward yet effective fine-tuning technique, Child-Tuning, which updates a subset of parameters (called the child network) of large pretrained models by strategically masking out the gradients of the non-child network during the backward pass. Experiments on various downstream tasks in the GLUE benchmark show that Child-Tuning consistently outperforms vanilla fine-tuning by 1.5 to 8.6 points in average score across four different pretrained models, and surpasses prior fine-tuning techniques by 0.6 to 1.3 points. Furthermore, empirical results on domain transfer and task transfer show that Child-Tuning obtains better generalization performance by large margins.


Introduction
Pretrained Language Models (PLMs) have had a remarkable effect on the natural language processing (NLP) landscape recently (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020). Pretraining followed by fine-tuning has become the new paradigm of NLP, dominating a large variety of tasks.
Despite this great success, adapting such large-scale pretrained language models, with millions to billions of parameters, to various scenarios remains challenging, especially when the training data is limited. Due to the extremely large capacity and limited labeled data, conventional transfer learning tends toward aggressive fine-tuning (Jiang et al., 2020), resulting in: 1) degenerated results on the test data due to overfitting (Devlin et al., 2019; Phang et al., 2018; Lee et al., 2020), and 2) poor generalization ability to out-of-domain data and related tasks.

* Equal Contribution. Joint work between Alibaba and Peking University. † Corresponding authors.

Figure 1: Left: Child-Tuning forwards on the whole network while backwarding only on a subset of the network (i.e., the child network). Right: To achieve this, a task-free or task-driven mask is applied to the gradients of the non-child network, resetting them to zero (grey diagonal grids).
Preventing the fine-tuned model from deviating too much from the pretrained weights (i.e., with less knowledge forgetting) has proved effective in mitigating the above challenges (Gouk et al., 2020). For instance, RecAdam (Chen et al., 2020) introduces an L2 distance penalty between the fine-tuned weights and their pretrained values. In addition, Mixout (Lee et al., 2020) randomly replaces part of the model parameters with their pretrained weights during fine-tuning. The core idea behind these methods is to utilize the pretrained weights to regularize the fine-tuned model.
In this paper, we propose to mitigate the aggressive fine-tuning problem from a new perspective. Based on the observation that it is unnecessary to update all the parameters of the large-scale model during fine-tuning, we propose an effective fine-tuning technique, CHILD-TUNING, which straightforwardly updates a subset of parameters (called the child network) by strategically masking out the gradients of the non-child network in the backward pass, as illustrated in Figure 1. Note that this is different from model pruning, since it still forwards on the whole network, thus making full use of the knowledge hidden in the pretrained weights.
In detail, we propose two variants, CHILD-TUNING F and CHILD-TUNING D , which detect the child network in a task-free and a task-driven way, respectively. CHILD-TUNING F selects the child network without any task data, via a Bernoulli distribution. It introduces noise into the full gradients, playing the role of a regularizer, hence preventing overfitting on small datasets and leading to better generalization. CHILD-TUNING D , in contrast, utilizes the downstream task data to detect the most task-related parameters as the child network and freezes the parameters in the non-child network at their pretrained weights. It decreases the hypothesis space of the model via a task-specific mask applied to the full gradients, helping to effectively adapt the large-scale pretrained model to various tasks while largely maintaining its original generalization ability.
Our extensive experiments on the GLUE benchmark show that CHILD-TUNING improves the fine-tuning of different PLMs, with up to 8.60 points of average score improvement on the CoLA/RTE/MRPC/STS-B tasks compared to vanilla fine-tuning (Section 3.3). Moreover, it achieves better generalization when transferring to out-of-domain data and other related tasks (Section 3.4). Experimental results also demonstrate that CHILD-TUNING yields consistently greater improvements than state-of-the-art fine-tuning methods. More importantly, since CHILD-TUNING is orthogonal to these prior methods, integrating it with them can lead to further improvements (Section 4.1).
In summary, our contributions are three-fold: • We propose CHILD-TUNING, a straightforward yet effective fine-tuning technique that only updates the parameters in the child network. We explore detecting the child network in both task-free and task-driven ways.
• CHILD-TUNING can effectively adapt the large-scale pretrained model to various downstream scenarios, from in-domain to out-of-domain, and to cross-task transfer learning.
• Since CHILD-TUNING is orthogonal to prior fine-tuning methods, integrating CHILD-TUNING with them can further boost fine-tuning performance.

Methodology
To better adapt large-scale pretrained language models to various downstream tasks, we propose a simple yet effective fine-tuning technique, CHILD-TUNING. We first introduce a gradient mask in the backward pass to update only a subset of parameters (i.e., the child network), while still utilizing the knowledge of the whole model in the forward pass (Section 2.1). We then explore two ways to detect the child network (i.e., to generate different gradient masks): CHILD-TUNING F , which is task-free (Section 2.2), and CHILD-TUNING D , which is task-driven (Section 2.3).

Overview of CHILD-TUNING
We start the introduction of CHILD-TUNING by giving a general formulation of back propagation during vanilla fine-tuning. We denote the parameters of the model at the t-th iteration as w_t (w_0 refers to the pretrained weights). Vanilla fine-tuning computes the gradient of the loss L(w_t) and then applies gradient descent to all parameters, which can be formulated as:

$$ w_{t+1} = w_t - \eta \frac{\partial \mathcal{L}(w_t)}{\partial w_t} \tag{1} $$

where ∂L(w_t)/∂w_t is the gradient with respect to the model parameters w_t, and η is the learning rate.
CHILD-TUNING also computes the gradients of all trainable parameters in the backward pass, like standard fine-tuning. However, the key difference is that CHILD-TUNING determines a child network C_t at the t-th iteration, and only updates this part of the parameters. To achieve this, we first define a 0-1 mask of the same size as w:

$$ M_t^{(i)} = \begin{cases} 1, & w_t^{(i)} \in \mathcal{C}_t \\ 0, & w_t^{(i)} \notin \mathcal{C}_t \end{cases} \tag{2} $$

where M_t^{(i)} and w_t^{(i)} denote the i-th element of the mask M_t and of the parameters w_t at the t-th training iteration, respectively.
Then, we formally define the CHILD-TUNING technique by simply replacing Eq. 1 with:

$$ w_{t+1} = w_t - \eta \left( \frac{\partial \mathcal{L}(w_t)}{\partial w_t} \odot M_t \right) \tag{3} $$

where ⊙ denotes element-wise multiplication. Algorithm 1 provides the pseudo-code of CHILD-TUNING when applied to the widely used Adam (Kingma and Ba, 2015) optimizer. The main difference is the insertion of lines 5-7.
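As a concrete illustration (a minimal sketch, not the authors' released implementation), the masked update of Eq. 3 can be written in a few lines of NumPy; `w`, `grad`, `mask`, and `lr` are hypothetical stand-ins for w_t, ∂L(w_t)/∂w_t, M_t, and η:

```python
import numpy as np

def child_tuning_step(w, grad, mask, lr):
    """One Child-Tuning update: gradients outside the child network
    are zeroed, so only the masked parameters move."""
    return w - lr * (grad * mask)

w = np.array([1.0, 2.0, 3.0, 4.0])
grad = np.array([0.5, 0.5, 0.5, 0.5])
mask = np.array([1.0, 0.0, 1.0, 0.0])  # child network = parameters 0 and 2

w_new = child_tuning_step(w, grad, mask, lr=0.1)
# parameters 1 and 3 keep their pretrained values
```

Note that the gradient is still computed for every parameter; the mask only zeroes the update, so parameters outside the child network retain their pretrained values while still contributing to the forward pass.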

Algorithm 1 CHILD-TUNING for Adam Optimizer
Require: w_0: initial pretrained weights; L(w): stochastic objective function with parameters w; η: learning rate; β_1, β_2 ∈ [0, 1): exponential decay rates for the moment estimates
1: initialize timestep t ← 0, first moment vector m_0 ← 0, second moment vector v_0 ← 0
2: while not converged do
3:   t ← t + 1
4:   g_t ← ∂L(w_{t−1})/∂w_{t−1}
     // Get the task-free/task-driven child network
5:   C_t ← GetChildNetwork()
     // Generate the corresponding gradient mask
6:   M_t ← Mask(C_t) (Eq. 2)
     // Mask out the gradients of the non-child network
7:   g_t ← g_t ⊙ M_t
8:   m_t ← β_1 m_{t−1} + (1 − β_1) g_t
9:   v_t ← β_2 v_{t−1} + (1 − β_2) g_t²
10:  w_t ← w_{t−1} − η · m̂_t / (√v̂_t + ε), with bias-corrected m̂_t, v̂_t
11: end while

Task-Free Variant: CHILD-TUNING F

In this section, we first explore a choice of the child network that does not require any downstream task data, i.e., a task-free technique called CHILD-TUNING F . Specifically, CHILD-TUNING F generates the 0-1 mask M_t at the t-th iteration from a Bernoulli distribution with probability p_F:

$$ M_t^{(i)} \sim \text{Bernoulli}(p_F) \tag{4} $$

The higher p_F is, the larger the child network, and hence the more parameters are updated. When p_F = 1, CHILD-TUNING F degenerates into vanilla fine-tuning. Note that we also scale the reserved gradients by 1/p_F to maintain the expectation of the gradients.
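For intuition, the task-free mask with its 1/p_F rescaling can be drawn as below (a sketch assuming gradients are plain NumPy arrays); the unbiasedness E[g ⊙ M / p_F] = g shows up empirically in the mean:

```python
import numpy as np

def task_free_mask(shape, p_f, rng):
    """Draw a Bernoulli(p_F) gradient mask; surviving entries are
    scaled by 1/p_F so the masked gradient is unbiased in expectation."""
    return rng.binomial(1, p_f, size=shape) / p_f

rng = np.random.default_rng(0)
grad = np.ones(100_000)
masked = grad * task_free_mask(grad.shape, p_f=0.3, rng=rng)
# every entry is either 0 or 1/0.3, and the empirical mean stays near 1
```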
We theoretically justify the effectiveness of CHILD-TUNING F . We denote by ∆w the update at each iteration:

$$ \Delta w = w_{t+1} - w_t = -\frac{\eta}{p_F} \left( \frac{\partial \mathcal{L}(w_t)}{\partial w_t} \odot M_t \right) $$

Intuitively, Theorem 1 shows that the variance of the update is a strictly decreasing function of p_F. Thus, CHILD-TUNING F increases the variance of the gradients, and the trade-off between exploration and exploitation can be controlled by adjusting p_F. As Theorem 2 illustrates, with higher variance the model can converge to flatter local minima (smaller ρ in Theorem 2). Since flat minima tend to generalize better (Keskar et al., 2017; Sun et al., 2020; Foret et al., 2021), we can further prove that CHILD-TUNING F decreases the generalization error bound.
Theorem 1. Suppose L denotes the loss function on the parameters w, the gradients obey a Gaussian distribution N(∂L/∂w, σ_g² I_k), and SGD with learning rate η is used. For a randomly sampled batch B, if the gradient mask reserves gradients with probability p_F, the mean and covariance of the update ∆w are

$$ \mathbb{E}[\Delta w] = -\eta \frac{\partial \mathcal{L}}{\partial w}, \qquad \Sigma[\Delta w] = \frac{\eta^2 \sigma_g^2}{p_F |B|} I_k + \frac{\eta^2 (1 - p_F)}{p_F} \mathrm{diag}\!\left( \frac{\partial \mathcal{L}}{\partial w} \odot \frac{\partial \mathcal{L}}{\partial w} \right) $$

In particular, when w is a local minimum, E[∆w] = 0_k and Σ[∆w] = σ² I_k with σ² = η²σ_g²/(p_F |B|).

Theorem 2. Suppose w_0 denotes the pretrained parameters; k is the number of parameters; w denotes the local minimum the algorithm converges to; ρ is the greatest eigenvalue of the Hessian matrix at w, which indicates the sharpness. If ∆w ∼ N(0_k, σ² I_k), then when the following bound holds, the algorithm converges to the local minimum w with high probability:

$$ \sigma^2 \le \frac{2\epsilon}{\rho \, F_k^{-1}(1 - \delta)} $$

where F_k is the cumulative distribution function of the χ²(k) distribution. Suppose the prior over parameters after training is P = N(w_0, σ_0² I_k); then a generalization error bound holds with high probability, where R is a term not determined by σ (full statement in Appendix E).
Thus, CHILD-TUNING F can be viewed as a strong regularizer on the optimization process. It enables the model to escape saddle points in the loss landscape and encourages convergence to flatter local minima. Please refer to Appendix E for details of the stated theorems and their proofs.

Task-Driven Variant: CHILD-TUNING D
Taking the downstream labeled data into consideration, we propose CHILD-TUNING D , which detects the most important child network for the target task. Specifically, we adopt Fisher information estimation to find the subset of parameters highly relevant to a specific downstream task. Fisher information provides an estimate of how much information a random variable carries about a parameter of its distribution (Tu et al., 2016a,b). For a pretrained model, Fisher information can be used to measure the relative importance of the parameters toward a downstream task.
Formally, the Fisher Information Matrix (FIM) for the model parameters w is defined as:

$$ F(w) = \mathbb{E}\!\left[ \left( \frac{\partial \log p(y|x; w)}{\partial w} \right) \left( \frac{\partial \log p(y|x; w)}{\partial w} \right)^{\!\top} \right] $$

where x and y denote the input and the output, respectively. It can also be viewed as the covariance of the gradient of the log-likelihood with respect to the parameters w. Following Kirkpatrick et al. (2016), given the task-specific training data D, we use the diagonal elements of the empirical FIM to point-estimate the task-related importance of the parameters. Formally, we derive the Fisher information of the i-th parameter as:

$$ F^{(i)}(w) = \frac{1}{|D|} \sum_{(x, y) \in D} \left( \frac{\partial \log p(y|x; w)}{\partial w^{(i)}} \right)^{2} $$

We assume that the more important a parameter is to the target task, the higher the Fisher information it conveys. Hence the child network C comprises the parameters with the highest Fisher information. The child-network ratio is p_D = |C| / (|C| + |C̄|) ∈ (0, 1], where C̄ denotes the non-child network. As p_D rises, the size of the child network increases, and when p_D = 1 it degenerates into vanilla fine-tuning.
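The selection step can be sketched as follows (a hypothetical helper, assuming the per-example squared log-likelihood gradients have already been collected; a real PLM would do this per tensor over the task data):

```python
import numpy as np

def fisher_child_mask(sq_grads, p_d):
    """Average per-example squared gradients to get the diagonal of the
    empirical FIM, then keep the top p_D fraction as the child network."""
    fisher = np.mean(sq_grads, axis=0)          # average over |D| examples
    k = max(1, int(round(p_d * fisher.size)))   # child-network size
    threshold = np.sort(fisher)[-k]             # k-th largest value
    return (fisher >= threshold).astype(float)

# toy example: 2 "examples", 5 parameters
sq_grads = np.array([[0.1, 0.9, 0.2, 0.8, 0.0],
                     [0.1, 0.7, 0.2, 0.6, 0.0]])
mask = fisher_child_mask(sq_grads, p_d=0.4)
# parameters 1 and 3 carry the most Fisher information
```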
Since the overhead of obtaining the task-driven child network is heavier than that of the task-free one, we simply derive the child network for CHILD-TUNING D at the beginning of fine-tuning and keep it unchanged thereafter, i.e., C_0 = C_1 = · · · = C_T. In this way, CHILD-TUNING D dramatically decreases the hypothesis space of the large-scale model, thus alleviating overfitting. Meanwhile, keeping the non-child network frozen at its pretrained weights substantially maintains the generalization ability.

Experiments
3.1 Datasets

GLUE benchmark. Following previous studies (Lee et al., 2020; Dodge et al., 2020), we conduct experiments on various datasets from the GLUE leaderboard (Wang et al., 2019), including linguistic acceptability (CoLA), natural language inference (RTE, QNLI, MNLI), paraphrase and similarity (MRPC, STS-B, QQP), and sentiment classification (SST-2). CoLA and SST-2 are single-sentence classification tasks; the others involve a pair of sentences. Detailed statistics and metrics are provided in Appendix A. Following most previous work (Phang et al., 2018; Lee et al., 2020; Dodge et al., 2020), we fine-tune the pretrained model on the training set and directly report results on the dev set using the last checkpoint, since the test results are only accessible through the leaderboard, which limits the number of submissions.
NLI datasets. In this paper, we also conduct experiments exploring the generalization ability of the fine-tuned model on several Natural Language Inference (NLI) tasks. Specifically, we additionally introduce three NLI datasets: SICK (Marelli et al., 2014), SNLI (Bowman et al., 2015), and SciTail (Khot et al., 2018). We likewise report results on the dev set, consistent with GLUE.

Experiments Setup
We use the pretrained models and code provided by HuggingFace (Wolf et al., 2020), and follow their default hyperparameter settings unless noted otherwise. Appendix B provides the detailed experimental setups (e.g., batch size, training steps, etc.) for BERT LARGE , XLNet LARGE , RoBERTa LARGE , and ELECTRA LARGE .

Table 1: Comparison between CHILD-TUNING and vanilla fine-tuning applied to four widely used large-scale Pretrained Language Models (PLMs). Average scores on all tasks are underlined. The best results are in bold. CHILD-TUNING yields consistent improvements across all tasks for different PLMs, especially CHILD-TUNING D , which detects the child network in a task-driven way.

Results on GLUE Benchmark
In this section, we show the results of four widely used large PLMs on four GLUE tasks: CoLA, RTE, MRPC, and STS-B, following Lee et al. (2020). Besides vanilla fine-tuning, we also report results for the two variants of CHILD-TUNING: CHILD-TUNING F (p_F = 0.2, 0.3, 0.4) and CHILD-TUNING D (p_D = 0.1, 0.2, 0.3).
As Table 1 illustrates, CHILD-TUNING outperforms vanilla fine-tuning by large margins across all tasks and PLMs. For instance, CHILD-TUNING yields an improvement of up to 2.08 points in average score on XLNet and 8.60 on ELECTRA. Even the straightforward task-free variant, CHILD-TUNING F , provides an improvement of 0.87 average score on BERT and 6.27 on ELECTRA. CHILD-TUNING D , which detects the child network in a task-driven way, is more aware of the characteristics of the downstream task and therefore achieves the best performance, with up to 1.50 and 8.60 points of average score improvement on BERT and ELECTRA, respectively. In summary, we conclude that CHILD-TUNING is model-agnostic and consistently outperforms vanilla fine-tuning on different PLMs.

Probing Generalization Ability of the Fine-tuned Model
To measure the generalization properties of various fine-tuning methods, in this section, we conduct probing experiments from two aspects, that is, domain generalization and task generalization.

Domain Generalization
Besides boosting performance on the target downstream task, we also expect CHILD-TUNING to help the fine-tuned model generalize better to out-of-domain data. We evaluate this on several Natural Language Inference (NLI) tasks. In detail, we fine-tune BERT LARGE with different strategies on 5k-subsampled MNLI and SNLI datasets respectively, and directly test the accuracy of the fine-tuned models on other NLI datasets in different domains, including MNLI, MNLI-mismatched, SNLI, SICK, SciTail, and QQP. As Table 2 illustrates, CHILD-TUNING outperforms vanilla fine-tuning across the different out-of-domain datasets. Specifically, CHILD-TUNING F improves the average score by 1.11/0.35 for models trained on MNLI/SNLI, while CHILD-TUNING D improves it by up to 1.53/0.81. In particular, CHILD-TUNING D achieves a 1.90 score improvement on SICK and 1.56 on SNLI for models trained on MNLI.
The results suggest that CHILD-TUNING encourages the model to learn more general semantic features during fine-tuning, rather than superficial features unique to the training data. Hence, the fine-tuned model generalizes well to different datasets, even when their domains differ substantially from the dataset the model is trained on.

Task Generalization
To justify the generalization ability of the model from another perspective, we follow the probing experiments from Aghajanyan et al. (2021), which first freezes the representations from the model trained on one task and then only trains a linear classifier on top of the model for another task.
In particular, we fine-tune BERT LARGE on MRPC task, and transfer to four other GLUE tasks, i.e., CoLA, STS-B, QNLI, and QQP. As Figure 2 shows, CHILD-TUNING consistently outperforms vanilla fine-tuning on different transferred tasks. Compared with vanilla fine-tuning, CHILD-TUNING F improves 4.58 average score (58.95 → 63.53), while CHILD-TUNING D even gains up to 7.06 average score improvement (58.95 → 66.01).
In summary, fine-tuning with CHILD-TUNING gains better performance when the fine-tuned model is transferred to another task, demonstrating that CHILD-TUNING can maintain more generalizable representations produced by the model than vanilla fine-tuning.

Comparison with Prior Methods
In this section, we review and compare prior approaches to effective fine-tuning: 1) Weight Decay (Daumé III, 2007), which adds a λ‖w − w_0‖² penalty to the loss function, where w_0 denotes the pretrained weights; 2) Top-K Tuning, which fine-tunes only the top K layers of the model with the other layers frozen; Houlsby et al. (2019) use it as a strong baseline; 3) Mixout (Lee et al., 2020), which randomly replaces parameters with their pretrained weights; 4) RecAdam (Chen et al., 2020), which is similar to Weight Decay except that its penalty weight λ changes during fine-tuning; 5) Robust Representations through Regularized Fine-tuning (R3F) (Aghajanyan et al., 2021), which is rooted in trust-region theory. Appendix C shows the detailed hyperparameter settings.
We compare CHILD-TUNING with these methods based on BERT LARGE , and report the mean (max) scores in Table 3, following Lee et al. (2020). While all the fine-tuning methods bring improvements across the four tasks compared with vanilla fine-tuning, CHILD-TUNING achieves the best performance. In detail, among prior fine-tuning methods, Mixout and R3F yield the highest improvements, with 0.84 and 0.88 average score respectively. CHILD-TUNING F performs on par with Mixout and R3F, while CHILD-TUNING D achieves a 1.50 average score improvement in total. More importantly, CHILD-TUNING is flexible and orthogonal to most fine-tuning methods, so integrating it with other methods can further boost performance. For instance, combining CHILD-TUNING D with R3F leads to a 1.84 average score improvement in total.

Table 3: Comparison between CHILD-TUNING and other fine-tuning methods. We report the mean (max) results over 10 random seeds. Results with † are taken from Yang et al. (2019); the others are from our implementation. The task-driven variant, CHILD-TUNING D , achieves the best performance compared with the other methods. Integrating CHILD-TUNING D with other fine-tuning methods such as R3F yields further improvements. Note that since R3F is not applicable to regression tasks, the result on STS-B (marked with *) is the same as CHILD-TUNING D .
In short, compared with prior fine-tuning methods, we find that 1) CHILD-TUNING is more effective in adapting PLMs to various tasks, especially for the task-driven variant CHILD-TUNING D , and 2) CHILD-TUNING has the advantage that it is flexible enough to integrate with other methods to potentially achieve further improvements.

Results in Low-resource Scenarios
Fine-tuning a large pretrained model on extremely small datasets can be very challenging since the risk of overfitting rises (Dodge et al., 2020). Thus, in this section, we explore the effect of CHILD-TUNING with only a few training examples. To this end, we downsample all datasets in GLUE to 1k training examples and fine-tune BERT LARGE on them.
As Table 4 demonstrates, compared with vanilla fine-tuning, CHILD-TUNING F improves the average score by 1.42, and the improvement is even larger for CHILD-TUNING D , at up to 2.24. This suggests that although overfitting is severe in such extremely low-resource scenarios, CHILD-TUNING can still effectively improve model performance, especially CHILD-TUNING D , since it decreases the hypothesis space of the model.

What is the Difference Between CHILD-TUNING and Model Pruning?
CHILD-TUNING D detects the most important child network in a task-driven way and only updates the parameters within the child network during fine-tuning, with the other parameters frozen. It is easy to confuse with model pruning (Li et al., 2017; Zhu and Gupta, 2018; Lin et al., 2020), which also detects a subnetwork within the model (but then removes the other parameters). In fact, CHILD-TUNING and model pruning differ in both objective and method. Regarding objectives, model pruning aims to improve inference efficiency while maintaining performance, whereas CHILD-TUNING is proposed to address overfitting and improve the generalization ability of large-scale language models during fine-tuning. Regarding methods, model pruning discards the unimportant parameters at inference time, whereas in CHILD-TUNING the parameters outside the child network are retained during both training and inference. In this way, the knowledge of the non-child network hidden in the pretrained weights is fully utilized.
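The distinction can be made concrete with a toy linear layer (illustrative numbers only): Child-Tuning keeps the frozen non-child weights in the forward pass, while pruning removes them and therefore changes the function the network computes:

```python
import numpy as np

w_pretrained = np.array([0.2, -1.5, 0.8])
mask = np.array([1.0, 0.0, 0.0])       # child network = parameter 0 only
x = np.array([1.0, 1.0, 1.0])

# Child-Tuning: non-child weights are frozen but still used in the forward pass
child_tuning_out = w_pretrained @ x    # all three weights contribute

# Pruning: non-child weights are removed (zeroed), altering the forward pass
pruned_out = (w_pretrained * mask) @ x
```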
To better illustrate the effectiveness of CHILD-TUNING D compared to model pruning, we set all parameters outside the child network to zero, referred to as Prune in Table 5. It shows that once we discard the parameters outside the child network, the score dramatically decreases, by 33.89 points averaged over four tasks (CoLA/RTE/MRPC/STS-B), and the model even collapses on CoLA. This also suggests that, besides the parameters in the child network, those in the non-child network are necessary, since they provide general knowledge learned in pretraining.

Is the Task-Driven Child Network Really that Important to the Target Task?
CHILD-TUNING D detects the task-specific child network by choosing the parameters with the highest Fisher information for the downstream task data. In this section, we explore whether the detected task-driven child network is really that important to the task.
To this end, we introduce two ablations of CHILD-TUNING D : 1) Random: we randomly choose a child network and keep it unchanged during fine-tuning; 2) Lowest Fisher: we choose the parameters with the lowest Fisher information as the child network, in contrast to the highest Fisher information adopted in CHILD-TUNING D . As shown in Table 5, choosing the child network randomly still outperforms vanilla fine-tuning, with a 0.18 average score improvement. This supports our claim that there is no need to update all parameters of large PLMs, and that decreasing the hypothesis space reduces the risk of overfitting. However, it is still worth finding a proper child network to further boost performance. If we choose the parameters with the lowest Fisher information (Lowest Fisher), the average score drops dramatically, by 6.65, compared with choosing those with the highest Fisher information as in CHILD-TUNING D . Hence, we conclude that the child network detected by CHILD-TUNING D is indeed important to the downstream task.

What is the Relationship among Child Networks for Different Tasks?
As the task-driven child networks are correlated with their tasks, we further explore the relationship among child networks for different tasks. To this end, we visualize the overlapping rate among different task-driven child networks, using the Jaccard similarity coefficient to calculate the overlap between tasks i and j. Figure 3 shows the overlap among GLUE tasks. As expected, similar tasks tend to have higher child-network overlap. For example, the overlap among NLI tasks is remarkably higher than others, such as RTE with QNLI, and QNLI with MNLI. For different kinds of tasks, the overlap is relatively lower, such as CoLA with MRPC. Interestingly, the task-driven child network for SST-2 overlaps little with other tasks except CoLA, even though SST-2 and CoLA are not that similar. The reason may be that both SST-2 and CoLA are single-sentence classification tasks, while the others are sentence-pair classification tasks.
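For reference, the overlap computation is a plain Jaccard coefficient over parameter index sets; the index sets below are invented for illustration:

```python
def jaccard_overlap(mask_i, mask_j):
    """Jaccard similarity |C_i ∩ C_j| / |C_i ∪ C_j| between the
    parameter index sets of two task-driven child networks."""
    a, b = set(mask_i), set(mask_j)
    return len(a & b) / len(a | b)

# hypothetical child-network index sets for two tasks
rte_child = {0, 1, 2, 3}
qnli_child = {2, 3, 4, 5}
# 2 shared indices out of 6 total -> overlap of 1/3
```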
Related Work

Effective and generalizable fine-tuning. With a mass of parameters, fine-tuning large PLMs tends to yield degenerated performance due to overfitting and poor generalization ability, especially on small datasets (Devlin et al., 2019; Phang et al., 2018; Lee et al., 2020). Therefore, different fine-tuning techniques have been proposed. Some utilize the pretrained weights to regularize the deviation of the fine-tuned model (Lee et al., 2020; Daumé III, 2007; Chen et al., 2020), while others compress the output information (Mahabadi et al., 2021) or inject noise into the input (Jiang et al., 2020; Aghajanyan et al., 2021). Moreover, Zhang et al. (2021) and Mosbach et al. (2021) point out that the omission of bias correction in the Adam optimizer used by Devlin et al. (2019) is also responsible for degenerated results.
Orthogonal to these methods, CHILD-TUNING addresses the problem by detecting a child network within the model in a task-free or task-driven way. It only updates parameters within the child network via a gradient mask, which proves effective in adapting large PLMs to various tasks while also improving generalization.
Parameter-efficient fine-tuning. There are also studies focusing on parameter-efficient fine-tuning, for example the adapter-based methods (Houlsby et al., 2019; Pfeiffer et al., 2020; Karimi Mahabadi et al., 2021) and the Diff-Pruning method (Guo et al., 2021). However, CHILD-TUNING differs from this line of work. First, they aim to fine-tune as few parameters as possible while maintaining performance, whereas we target effective and generalizable fine-tuning. Second, Diff-Pruning sparsifies a diff-vector with gradient estimators, and adapter-based methods fine-tune newly added modules during training, whereas we detect the child network inside the model without extra parameters and, for CHILD-TUNING D , only need to calculate the FIM once before training. Finally, we consistently outperform vanilla fine-tuning by a large margin, whereas they achieve performance merely competitive with full-model fine-tuning.

Conclusion
To mitigate the overfitting problem and improve generalization when fine-tuning large-scale PLMs, we propose a straightforward yet effective fine-tuning technique, CHILD-TUNING, which only updates the child network during fine-tuning by strategically masking out the gradients of the non-child network. Two variants are introduced, CHILD-TUNING F and CHILD-TUNING D , which detect the child network in a task-free and a task-driven way, respectively. Extensive experiments on various downstream tasks show that both outperform vanilla fine-tuning and prior works by large margins across four different pretrained language models, and meanwhile largely enhance the generalization ability of the fine-tuned models. Since CHILD-TUNING is orthogonal to most prior fine-tuning techniques, integrating it with them can further boost performance.

B Detailed Experimental Setups

We conduct experiments on BERT LARGE , XLNet LARGE , RoBERTa LARGE , and ELECTRA LARGE 9 . The training epochs/steps, batch size, and warmup steps are listed in Table 7. We use the AdamW (Loshchilov and Hutter, 2019) optimizer, and set β_1 = 0.9, β_2 = 0.999, ε = 1e-6. We clip the gradients with a maximum norm of 1, and the maximum sequence length is set to 128. For CHILD-TUNING F , we use p_F = {0.2, 0.3, 0.4} and re-scale the gradients to ensure they remain unbiased. For CHILD-TUNING D , we use p_D = {0.1, 0.2, 0.3}. We grid-search the learning rate over {1e-5, 2e-5, . . . , 1e-4}. We conduct all experiments on a single GTX-3090 GPU.
These pretrained models are all Transformer-based. XLNet (Yang et al., 2019) is an autoregressive pretrained language model with token permutations: it generates tokens autoregressively while still capturing bidirectional context. RoBERTa (Liu et al., 2019) is a robustly optimized version of BERT: it uses dynamic masking, larger batch sizes, and longer training, and abandons the next-sentence prediction task. ELECTRA (Clark et al., 2020) pretrains the model with a generator and a discriminator, where the discriminator is trained to distinguish whether a token was produced by the generator or is the original token.

9 https://huggingface.co/google/electra-large-discriminator/tree/main

C Settings for Other Fine-tuning Methods
We compare CHILD-TUNING with several other regularization approaches in our paper. In this section, we briefly introduce these approaches and their hyperparameter settings.
Weight Decay. Daumé III (2007) proposes adding a penalty term to the loss function to regularize the L2 distance between the fine-tuned model and the pretrained model. The loss function is:

$$ \mathcal{L}_{\text{WD}}(w) = \mathcal{L}(w) + \lambda_{\text{WD}} \, \| w - w_0 \|_2^2 $$

We grid-search the optimal λ_WD over {10, 1, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}}.
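A sketch of the penalized objective (function and variable names are ours, not from the original implementation):

```python
import numpy as np

def wd_to_pretrained_loss(task_loss, w, w0, lam):
    """Task loss plus lam * ||w - w0||^2: penalizes deviation from the
    pretrained weights rather than from zero."""
    return task_loss + lam * np.sum((w - w0) ** 2)

w0 = np.array([1.0, -2.0])   # pretrained weights
w = np.array([1.5, -1.0])    # fine-tuned weights
total = wd_to_pretrained_loss(0.25, w, w0, lam=0.1)
```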
Top-K Fine-tuning. Top-K fine-tuning is a common method, and Houlsby et al. (2019) use it as a strong baseline. It only updates the top K layers along with the classification layer, while freezing all the lower layers. We grid-search the optimal K over {0, 3, 6, 12}.

Mixout
Lee et al. (2020) randomly replace parameters with their pretrained values with a certain probability p during fine-tuning, which aims to minimize the deviation of the fine-tuned model from the pretrained weights. In our paper, we grid-search the optimal p over {0.1, 0.2, . . . , 0.8}. We use the implementation at https://github.com/bloodwass/mixout.
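The replacement step can be sketched as below (`mixout` is a hypothetical helper; Mixout's dropout-style rescaling correction is omitted for brevity):

```python
import numpy as np

def mixout(w, w0, p, rng):
    """With probability p, swap each fine-tuned parameter back to its
    pretrained value; otherwise keep the current value."""
    swap = rng.binomial(1, p, size=w.shape).astype(bool)
    return np.where(swap, w0, w)

rng = np.random.default_rng(0)
w0 = np.array([0.0, 0.0, 0.0])   # pretrained weights
w = np.array([1.0, 2.0, 3.0])    # fine-tuned weights
mixed = mixout(w, w0, p=0.5, rng=rng)
# each entry of `mixed` is either the fine-tuned or the pretrained value
```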
RecAdam. Chen et al. (2020) propose a new optimizer, RecAdam, for fine-tuning, which can be considered an advanced version of Weight Decay, because the coefficients of the two loss terms change as training progresses:

$$ \mathcal{L}_{\text{RecAdam}}(w) = \lambda(t) \, \mathcal{L}(w) + (1 - \lambda(t)) \, \frac{\gamma}{2} \| w - w_0 \|_2^2, \qquad \lambda(t) = \frac{1}{1 + \exp(-k \cdot (t - t_0))} $$

where k and t_0 are controlling hyperparameters and t is the current training step.

Robust Representations through Regularized Fine-tuning (R3F). Aghajanyan et al. (2021) propose R3F for fine-tuning based on trust-region theory, which adds noise to the input embeddings and minimizes the symmetric KL divergence between the output distributions given the original and the noisy input. The loss function of R3F is:

$$ \mathcal{L}_{\text{R3F}} = \mathcal{L}(f(x)) + \lambda_{\text{R3F}} \, \mathrm{KL}_S\big( f(x) \,\|\, f(x + z) \big) $$

where f(·) denotes the model, z denotes noise sampled from either a normal or a uniform distribution controlled by hyperparameter σ, and KL_S(x‖y) = KL(x‖y) + KL(y‖x). We use both normal and uniform distributions, λ_R3F = 1, and grid-search σ over {0.1, 0.5, 1.0, 5.0}. We use the implementation at https://github.com/pytorch/fairseq/tree/master/examples/rxf.
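The symmetric KL term can be sketched as follows (a toy stand-in for the fairseq implementation, operating on probability vectors rather than logits):

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """KL_S(p || q) = KL(p || q) + KL(q || p) for two categorical
    distributions given as probability vectors."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# the R3F penalty compares model outputs on x and on the noised x + z
```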

D Label Mapping in Task Generalization
MNLI and SNLI contain three labels: entailment, neutral, and contradiction. SciTail has only two labels, entailment and neutral, so we map both neutral and contradiction in the source label space to neutral in the target label space, following Mahabadi et al. (2021). QQP has two labels, duplicate and not duplicate, which Gong et al. (2018) interpret as entailment and neutral, respectively. We follow Gong et al. (2018) and use the same mapping strategy as for SciTail.
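The mapping reduces to two small lookup tables (dictionary names are ours):

```python
# Map three-way NLI labels into the two-label target spaces.
TO_SCITAIL = {
    "entailment": "entailment",
    "neutral": "neutral",
    "contradiction": "neutral",        # merged into neutral
}
TO_QQP = {
    "entailment": "duplicate",
    "neutral": "not_duplicate",
    "contradiction": "not_duplicate",  # same strategy as SciTail
}

mapped = [TO_SCITAIL[label] for label in ["entailment", "contradiction"]]
```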

E Theoretical Details
We theoretically justify the effectiveness of CHILD-TUNING F . Assume CHILD-TUNING F reserves gradients with probability p_F ∈ (0, 1]; we simply write p for p_F in the following. Theorem 1 shows that the variance of the update is a strictly decreasing function of p. When p = 1, it degenerates into normal fine-tuning. Therefore, CHILD-TUNING F increases the variance of the gradients of the model. Next, Theorem 2 shows that with higher variance, the model can converge to flatter local minima (smaller ρ in Theorem 2). Since flat minima tend to generalize better (Keskar et al., 2017; Sun et al., 2020; Foret et al., 2021), we can further prove that CHILD-TUNING F decreases the generalization error bound.
Theorem 1. Suppose L denotes the loss function on the parameters w and, for data instances x ∼ S in the training set, the gradients obey a Gaussian distribution N(∂L/∂w, σ_g² I_k). For a randomly sampled batch B ∼ S, when the learning algorithm is SGD with learning rate η and the reserving probability of CHILD-TUNING F is p, the mean and covariance of the update ∆w are

$$ \mathbb{E}[\Delta w] = -\eta \frac{\partial \mathcal{L}}{\partial w}, \qquad \Sigma[\Delta w] = \frac{\eta^2 \sigma_g^2}{p |B|} I_k + \frac{\eta^2 (1 - p)}{p} \mathrm{diag}\!\left( \frac{\partial \mathcal{L}}{\partial w} \odot \frac{\partial \mathcal{L}}{\partial w} \right) $$

where Σ is the covariance matrix and diag(x) is the diagonal matrix of the vector x. In particular, when w is a local minimum, E[∆w] = 0_k, Σ[∆w] = σ² I_k, and σ² = η²σ_g² / (p|B|) is a strictly decreasing function of p.
Theorem 2. Suppose L denotes the expected error-rate loss function; w_0 denotes the pretrained parameters; k is the number of parameters; w denotes the local minimum the algorithm converges to; H is the Hessian matrix at w and ρ is its greatest eigenvalue; F_k is the cumulative distribution function of the χ²(k) distribution.
If the next update of the algorithm is ∆w ∼ N(0_k, σ² I_k) and the training loss increases by more than ε with probability δ, we assume the algorithm escapes the local minimum w. When the following bound holds, the algorithm converges to the local minimum w, with higher-order terms omitted:

$$ \sigma^2 \le \frac{2\epsilon}{\rho \, F_k^{-1}(1 - \delta)} \tag{13} $$

Suppose the prior over parameters after training is P = N(w_0, σ_0² I_k); then the generalization error bound of Eq. 14 holds with probability 1 − δ over the choice of training set S ∼ D, with higher-order terms omitted.

E.2 Proof of Theorem 2
Proof. We first prove Eq. 13. Apply a Taylor expansion to the training loss L around w, and note that ∇_w L(w) = 0_k since w is a local minimum. With higher-order terms omitted,

$$ \mathcal{L}(w + \Delta w) - \mathcal{L}(w) = \frac{1}{2} \Delta w^{\top} H \Delta w \le \frac{\rho}{2} \| \Delta w \|_2^2 $$

Let P_esc denote the probability of escaping; then

$$ P_{\text{esc}} = P\big( \mathcal{L}(w + \Delta w) - \mathcal{L}(w) \ge \epsilon \big) \le P\!\left( \frac{\rho}{2} \| \Delta w \|_2^2 \ge \epsilon \right) = 1 - P\!\left( \left\| \frac{\Delta w}{\sigma} \right\|_2^2 \le \frac{2\epsilon}{\rho \sigma^2} \right) $$

Since ∆w/σ ∼ N(0_k, I_k), we have ‖∆w/σ‖²₂ ∼ χ²(k); hence, when Eq. 13 holds,

$$ P\!\left( \left\| \frac{\Delta w}{\sigma} \right\|_2^2 \le \frac{2\epsilon}{\rho \sigma^2} \right) = F_k\!\left( \frac{2\epsilon}{\rho \sigma^2} \right) \ge 1 - \delta $$

Therefore P_esc ≤ δ: the algorithm will not escape the local minimum w and can converge to it.
To prove Eq. 14, we introduce Lemma 1, which is Theorem 2 in Foret et al. (2021).
Lemma 1. Suppose d > 0 and the prior over parameters is P = N(w_P, σ_P² I_k) with σ_P² = d² + ‖w − w_P‖², with higher-order terms omitted.