Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization

The Lottery Ticket Hypothesis suggests that an over-parametrized network consists of "lottery tickets", and training a certain collection of them (i.e., a subnetwork) can match the performance of the full model. In this paper, we study such a collection of tickets, which is referred to as "winning tickets", in extremely over-parametrized models, e.g., pre-trained language models. We observe that at certain compression ratios, the generalization performance of the winning tickets can not only match but also exceed that of the full model. In particular, we observe a phase transition phenomenon: As the compression ratio increases, generalization performance of the winning tickets first improves then deteriorates after a certain threshold. We refer to the tickets on the threshold as "super tickets". We further show that the phase transition is task and model dependent: as the model size becomes larger and the training data set becomes smaller, the transition becomes more pronounced. Our experiments on the GLUE benchmark show that the super tickets improve single task fine-tuning by 0.9 points on BERT-base and 1.0 points on BERT-large, in terms of task-average score. We also demonstrate that adaptively sharing the super tickets across tasks benefits multi-task learning.


Introduction
The Lottery Ticket Hypothesis (LTH, Frankle and Carbin (2018)) suggests that an over-parameterized network consists of "lottery tickets", and training a certain collection of them (i.e., a subnetwork) can 1) match the performance of the full model; and 2) outperform randomly sampled subnetworks of the same size (i.e., "random tickets"). The existence of such a collection of tickets, which is usually referred to as "winning tickets", indicates the potential of training a smaller network to achieve the full model's performance. LTH has been widely explored across various fields of deep learning (Frankle et al., 2019; You et al., 2019; Brix et al., 2020; Movva and Zhao, 2020; Girish et al., 2020).
(* Work was done at Microsoft Azure AI. Our code is available at https://github.com/cliang1453/super-structured-lottery-tickets.)
Aside from training from scratch, such winning tickets have demonstrated their abilities to transfer across tasks and datasets (Desai et al., 2019; Chen et al., 2020a). In natural language processing, Chen et al. (2020b); Prasanna et al. (2020) have shown the existence of winning tickets in pre-trained language models. These tickets can be identified when fine-tuning the pre-trained models on downstream tasks. As pre-trained models are usually extremely over-parameterized (e.g., BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), T5 (Raffel et al., 2019)), previous works mainly focus on searching for a highly compressed subnetwork that matches the performance of the full model. However, the behavior of winning tickets in lightly compressed subnetworks has been largely overlooked.
In this paper, we study the behavior of the winning tickets in pre-trained language models, with a particular focus on lightly compressed subnetworks. We observe that the generalization performance of the winning tickets selected at appropriate compression ratios can not only match, but also exceed that of the full model. In particular, we observe a phase transition phenomenon (Figure 1): The test accuracy improves as the compression ratio grows until a certain threshold (Phase I); passing the threshold, the accuracy deteriorates, yet is still better than that of the random tickets (Phase II). In Phase III, where the model is highly compressed, training collapses. We refer to the set of winning tickets selected on that threshold as "super tickets".
We interpret the phase transition in the context of trade-offs between model bias and variance (Friedman et al., 2001, Chapter 7). It is well understood that an expressive model induces a small bias, and a large model induces a large variance. We classify the tickets into three categories: non-expressive tickets, lightly expressive tickets, and highly expressive tickets. The full model has strong expressive power due to over-parameterization, so its bias is small, yet its variance is relatively large. In Phase I, by removing non-expressive tickets, the variance of the selected subnetwork reduces, while the model bias remains unchanged and the expressive power is sustained. Accordingly, generalization performance improves. We enter Phase II by further increasing the compression ratio. Here lightly expressive tickets are pruned. Consequently, model variance continues to decrease. However, model bias increases and overturns the benefit of the reduced variance. Lastly, in Phase III, the highly compressed region, model bias becomes prohibitively large and the reduction in variance pales in comparison. As a result, training breaks down and generalization performance drops significantly.
We conduct systematic experiments and analyses to understand the phase transition. Our experiments on multiple natural language understanding (NLU) tasks in the GLUE (Wang et al., 2018) benchmark show that the super tickets can be used to improve single task fine-tuning by 0.9 points over BERT-base (Devlin et al., 2019) and 1.0 points over BERT-large, in terms of task-average score. Moreover, our experiments show that the phase transition phenomenon is task and model dependent. It becomes more pronounced as a larger model is used to fit a task with less training data. In such a case, the set of super tickets forms a compressed network that exhibits a large performance gain.
The existence of super tickets suggests potential benefits to applications such as Multi-task Learning (MTL). In MTL, different tasks require different capacities to achieve a balance between model bias and variance. However, existing methods do not specifically balance the bias and variance to accommodate each task. In fact, the fine-tuning performance on tasks with a small dataset is very sensitive to randomness. This suggests that model variance in these tasks is high due to over-parameterization. To reduce such variance, we propose a tickets sharing strategy. Specifically, for each task, we select a set of super tickets during single task fine-tuning. Then, we adaptively share these super tickets across tasks.
Our experiments show that tickets sharing improves MTL by 0.9 points over MT-DNN BASE  and 1.0 points over MT-DNN LARGE , in terms of task-average score. Tickets sharing further benefits downstream fine-tuning of the multi-task model, and achieves a gain of 1.0 task-average score. In addition, the multi-task model obtained by such a sharing strategy exhibits lower sensitivity to randomness in downstream fine-tuning tasks, suggesting a reduction in variance.
We summarize our contributions as follows:
• Our result is the first to identify the phase transition phenomenon in pruning large neural language models.
• Our result is the first to show that pruning can improve the generalization when the models are lightly compressed, which has been overlooked by previous works. Our analysis paves the way for understanding the connection between model compression and generalization.
• Motivated by our observed phase transition, we further propose a new pruning approach for multi-task fine-tuning of neural language models.

Background
We briefly introduce the Transformer architecture and the Lottery Ticket Hypothesis.

Transformer Architecture
The Transformer (Vaswani et al., 2017) encoder is composed of a stack of identical Transformer layers. Each layer consists of a multi-head attention module (MHA) followed by a feed-forward module (FFN), with a residual connection around each. The vanilla single-head attention operates as

Attn(Q, K, V) = softmax(QK^T / √d) V,

where Q, K, V ∈ R^{l×d} are d-dimensional vector representations of l words in sequences of queries, keys and values. In MHA, the h-th attention head is parameterized by projection matrices W_h^Q, W_h^K, W_h^V ∈ R^{d×d_h}:

H_h(q, x) = Attn(q W_h^Q, x W_h^K, x W_h^V),

where q ∈ R^{l×d} and x ∈ R^{l×d} are the query and key/value vectors. In MHA, H independently parameterized attention heads are applied in parallel, and the outputs are aggregated by W_h^O ∈ R^{d_h×d}:

MHA(q, x) = Σ_{h=1}^H H_h(q, x) W_h^O.

Each FFN module contains a two-layer fully connected network. Given the input embedding z, we let FFN(z) denote the output of a FFN module.
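For concreteness, the attention equations above can be sketched in NumPy. This is a toy illustration only; the dictionary layout and variable names are our own assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attn(Q, K, V) = softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head_attention(q, x, heads):
    # Each head h applies attention with its own projections W_Q, W_K, W_V
    # (d x d_h); the head outputs are aggregated through W_O (d_h x d) by
    # summation over heads, matching MHA(q, x) = sum_h H_h(q, x) W_h^O.
    out = np.zeros_like(q)
    for h in heads:
        head_out = attention(q @ h["W_Q"], x @ h["W_K"], x @ h["W_V"])
        out += head_out @ h["W_O"]
    return out
```

Note that the per-head output lives in a lower-dimensional space (d_h) and the output projection maps it back to the model dimension d before summation.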

Structured and Unstructured LTHs
LTH (Frankle and Carbin, 2018) has been widely explored in various applications of deep learning (Brix et al., 2020; Movva and Zhao, 2020; Girish et al., 2020). Most existing results focus on finding unstructured winning tickets via iterative magnitude pruning and rewinding in randomly initialized networks (Frankle et al., 2019; Renda et al., 2020), where each ticket is a single parameter. Recent works further investigate the learning dynamics of the tickets (Frankle et al., 2020) and efficient methods to identify them (You et al., 2019; Savarese et al., 2020). Besides training from scratch, researchers also explore the existence of winning tickets under transfer learning regimes for over-parameterized pre-trained models across various tasks and datasets (Desai et al., 2019; Chen et al., 2020a). For example, Chen et al. (2020b); Prasanna et al. (2020) have shown the existence of winning tickets when fine-tuning BERT on downstream tasks. There is also a surge of research exploring whether certain structures, e.g., channels in convolutional layers and attention heads in Transformers, exhibit properties of the lottery tickets. Compared to unstructured tickets, training with structured tickets is memory efficient (Cao et al., 2019). Liu et al. (2018); Prasanna et al. (2020) suggest that there is no clear evidence that structured winning tickets exist in randomly initialized or pre-trained weights. Prasanna et al. (2020) observe that, in highly compressed BERT (e.g., when the percent of weight remaining is around 50%), all tickets perform equally well. However, Prasanna et al. (2020) have not investigated the cases where the percent of weight remaining is over 50%.

Finding Super Tickets
We identify winning tickets in BERT through structured pruning of attention heads and feed-forward layers. Specifically, in each Transformer layer, we associate mask variables ξ_h with each attention head and ν with the FFN (Prasanna et al., 2020):

MHA(q, x) = Σ_{h=1}^H ξ_h H_h(q, x) W_h^O,   FFN_out(z) = ν FFN(z).

Here, we set ξ_h, ν ∈ {0, 1}, and a 0 value indicates that the corresponding structure is pruned. We adopt the importance score (Michel et al., 2019) as a gauge for pruning. In particular, the importance score is defined as the expected sensitivity of the model outputs with respect to the mask variables. Specifically, in each Transformer layer,

I_h = E_{x∼D_x} |∂L(x)/∂ξ_h|,   I_FFN = E_{x∼D_x} |∂L(x)/∂ν|,

where L is a loss function and D_x is the data distribution. In practice, we compute the average over the training set. We apply a layer-wise ℓ2 normalization on the importance scores of the attention heads (Molchanov et al., 2016; Michel et al., 2019).
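The gradient-based importance score can be illustrated on a toy model where the gradient with respect to each head mask is available in closed form. This is a sketch under our own simplifying assumptions; the paper computes the gradient of the fine-tuning loss through BERT, not this toy regression model.

```python
import numpy as np

def head_importance(head_outputs, targets):
    """Toy version of the importance score I_h = E_x |dL/d xi_h|,
    evaluated at all mask variables xi_h = 1.

    Toy model: y_hat(x) = sum_h xi_h * a_h(x), L = 0.5 * (y_hat - y)^2,
    hence dL/d xi_h = (y_hat - y) * a_h(x).
    head_outputs: (n, H) per-example scalar contribution a_h(x) of head h
    targets:      (n,) regression targets
    """
    y_hat = head_outputs.sum(axis=1)               # all masks set to 1
    grads = (y_hat - targets)[:, None] * head_outputs
    scores = np.abs(grads).mean(axis=0)            # expectation over the data
    norm = np.linalg.norm(scores)                  # layer-wise l2 normalization
    return scores / norm if norm > 0 else scores
```

A head whose contribution is identically zero receives importance 0, matching the intuition that a low score marks a non-expressive structure.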
The importance score is closely tied to expressive power. A low importance score indicates that the corresponding structure only has a small contribution towards the output. Such a structure has low expressive power. On the contrary, a large importance score implies high expressive power.
We compute the importance scores for all the mask variables in a single backward pass at the end of fine-tuning. We perform one-shot pruning, removing the same percentage of heads and feed-forward layers with the lowest importance scores. We conduct pruning multiple times to obtain subnetworks, or winning tickets, at different compression ratios.
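The one-shot, importance-based structured pruning step might look as follows. This is a minimal sketch; the function name and the score values are hypothetical.

```python
import numpy as np

def one_shot_prune(scores, prune_fraction):
    """Binary mask keeping the structures with the highest importance scores.
    scores: 1-D array of (normalized) importance scores, one per structure
    prune_fraction: fraction of the lowest-scoring structures to remove
    """
    n_prune = int(round(prune_fraction * len(scores)))
    mask = np.ones(len(scores), dtype=int)
    mask[np.argsort(scores)[:n_prune]] = 0   # drop lowest-importance first
    return mask

# Winning tickets (subnetworks) at several compression ratios:
scores = np.array([0.05, 0.90, 0.40, 0.10, 0.70])
masks = {f: one_shot_prune(scores, f) for f in (0.2, 0.4, 0.6)}
```

Each mask defines one candidate subnetwork; sweeping the fraction yields the winning tickets at different sparsity levels.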
We adopt the weight rewinding technique in Renda et al. (2020): We reset the parameters of the winning tickets to their values in the pre-trained weights, and subsequently fine-tune the subnetwork with the original learning rate schedule. The super tickets are selected as the winning tickets with the best rewinding validation performance.

Multi-task Learning with Tickets Sharing
In multi-task learning, the shared model is highly over-parameterized to ensure a sufficient capacity for fitting individual tasks. Thus, the multi-task model inevitably exhibits task-dependent redundancy when being adapted to individual tasks. Such redundancy induces a large model variance.
We propose to mitigate the aforementioned model redundancy by identifying task-specific super tickets to accommodate each task's need. Specifically, when viewing an individual task in isolation, the super tickets can tailor the multi-task model to strike an appealing balance between the model bias and variance (recall from Section 3 that super tickets retain sufficient expressive power, yet keep the model variance low). Therefore, we expect that deploying super tickets can effectively tame the model redundancy for individual tasks.
Given the super tickets identified by each task, we exploit the multi-task information to reinforce fine-tuning. Specifically, we propose a tickets sharing algorithm to update the parameters of the multi-task model: For a certain network structure (e.g., an attention head), if it is identified as a super ticket by multiple tasks, then its weights are jointly updated by these tasks; if it is only selected by one specific task, then its weights are updated by that task only; otherwise, its weights are completely pruned. See Figure 2 for an illustration. In more detail, we denote the weight parameters in the multi-task model as θ. Suppose there are N tasks. For each task i ∈ {1, ..., N}, we denote Ω_i = {ξ_h^(ℓ), ν^(ℓ)}_{ℓ,h} as the collection of the mask variables, where ℓ is the layer index and h is the head index. Then the parameters to be updated in task i are denoted as θ_i = M(θ, Ω_i), where M(·, Ω_i) masks the pruned parameters according to Ω_i. We use stochastic gradient descent-type algorithms to update θ_i. Note that the task-shared and task-specific parameters are encoded by the mask variable Ω_i. The detailed algorithm is given in Algorithm 1.
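The sharing rule can be sketched as a masked parameter update. This is a toy sketch with hypothetical names; the actual algorithm iterates over mini-batches of each task as in Algorithm 1.

```python
import numpy as np

def tickets_sharing_step(theta, task_masks, task_grads, lr=0.1):
    """One round of a tickets-sharing update (sketch).
    theta:      shared parameter vector of the multi-task model
    task_masks: {task i: binary mask encoding Omega_i over theta}
    task_grads: {task i: gradient of task i's loss w.r.t. theta}
    A parameter selected by several tasks is updated by all of them; a
    parameter selected by one task is updated by that task only; a
    parameter selected by no task is pruned (set to zero).
    """
    union = np.zeros_like(theta)
    for task, mask in task_masks.items():
        theta = theta - lr * mask * task_grads[task]   # masked SGD update
        union = np.maximum(union, mask)
    return theta * union   # prune parameters outside every task's tickets
```

The union of the task masks plays the role of the shared subnetwork: only parameters inside at least one task's super tickets survive.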
Tickets sharing has two major differences compared to Sparse Sharing (Sun et al., 2020): 1) Sun et al. (2020) share winning tickets, while our strategy focuses on super tickets, which generalize better and strike a sensible balance between model bias and variance. 2) In tickets sharing, tickets are structured and chosen from pre-trained weight parameters. It does not require Multi-task Warmup, which is indispensable in Sun et al. (2020) to stabilize the sharing among unstructured tickets selected from randomly initialized weight parameters.

Data
General Language Understanding Evaluation (GLUE, Wang et al. (2018)) is a standard benchmark for evaluating model generalization performance. It contains nine NLU tasks, including question answering, sentiment analysis, text similarity and textual entailment. Details about the benchmark are deferred to Appendix A.1.1.

[Algorithm 1 (Tickets Sharing): in nested loops over training rounds and tasks, initialize the super tickets for each task i and update θ_i using an SGD-type algorithm.]

Models & Training
We fine-tune a pre-trained BERT model with task-specific data to obtain a single task model. We append a task-specific fully-connected layer to BERT as in Devlin et al. (2019).
• ST-DNN BASE/LARGE is initialized with BERT-base/large followed by a task-specific layer.
• SuperT BASE/LARGE is initialized with the chosen set of super tickets in BERT-base/large followed by a task-specific layer. Specifically, we prune BERT-base/large in units of 10% of heads and 10% of feed-forward layers (FFN) at 8 different sparsity levels (10% heads and 10% FFN, 20% heads and 20% FFN, etc.). Among them, the one with the best rewinding validation result is chosen as the set of super tickets. We randomly sample 10% of the GLUE development set for tickets selection.
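Selecting the super tickets from the sparsity sweep reduces to picking the level with the best rewinding validation score. The scores below are made up for illustration and merely mimic the phase-transition shape described in the paper.

```python
def select_super_tickets(rewind_val):
    """Pick the sparsity level whose rewound subnetwork validates best.
    rewind_val: {fraction of heads/FFNs pruned: validation score after
    rewinding} -- all values here are hypothetical.
    """
    return max(rewind_val, key=rewind_val.get)

# A hypothetical sweep over the 8 sparsity levels: light pruning helps
# (Phase I), then performance degrades as compression grows (Phases II/III).
sweep = {0.1: 84.1, 0.2: 84.6, 0.3: 84.3, 0.4: 83.0,
         0.5: 81.2, 0.6: 78.0, 0.7: 70.5, 0.8: 55.0}
```

Under this (made-up) sweep, the 20% level would be chosen as the super tickets.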

Generalization of the Super Tickets
We conduct 5 trials of pruning and rewinding experiments using different random seeds. Tables 1 and 2 show the averaged evaluation results on the GLUE development and test sets, respectively. We remark that the gain of SuperT BASE/LARGE over ST-DNN BASE/LARGE is statistically significant: all the results pass a paired Student's t-test with p-values less than 0.05. More validation statistics are summarized in Appendix A.1.3. (The MT-DNN code base is available at https://github.com/namisan/mt-dnn.)
Our results can be summarized as follows. 1) In all the tasks, SuperT consistently achieves better generalization than ST-DNN. The task-averaged improvement is around 0.9 over ST-DNN BASE and 1.0 over ST-DNN LARGE. 2) Performance gain of the super tickets is more significant in small tasks. For example, in Table 1, we obtain a 3.3-point gain on RTE (2.5k data), but only 0.4/0.3 on QQP (364k data) in the SuperT BASE experiments. Furthermore, from Figure 3, note that the super tickets are more heavily compressed in small tasks, e.g., for SuperT BASE, 83% of weights remain for RTE, but 93% for QQP. These observations suggest that for small tasks, model variance is large, and removing non-expressive tickets reduces variance and improves generalization. For large tasks, model variance is low, and all tickets are expressive to some extent.
3) Performance of the super tickets is related to model size. Switching from SuperT BASE to SuperT LARGE , the percent of weights remaining shrinks uniformly across tasks, yet the generalization gains persist (Figure 3). This suggests that in large models, more non-expressive tickets can be pruned without performance degradation.

Phase Transition
Phase transitions are shown in Figure 4. We plot the evaluation results of the winning, the random, and the losing tickets under 8 sparsity levels using BERT-base and BERT-large. The winning tickets contain structures with the highest importance scores. The losing tickets are selected reversely, i.e., the structures with the lowest importance scores are selected, and high-importance structures are pruned. The random tickets are sampled uniformly across the network. We plot the averaged scores over 5 trials using different random seeds. Phase transitions of all the GLUE tasks are in Appendix A.5. We summarize our observations: 1) The winning tickets are indeed the "winners". In Phase I and early Phase II, the winning tickets perform better than the full model and the random tickets. This demonstrates the existence of structured winning tickets. 2) Phase transition is pronounced over different tasks and models. Accuracy of the winning tickets increases up until a certain compression ratio (Phase I); passing the threshold, the accuracy decreases (Phase II), until its value intersects with that of the random tickets (Phase III). Note that Phase III agrees with the observations in Prasanna et al. (2020). Accuracy of the random tickets decreases in each phase. This suggests that model bias increases steadily, since tickets with both low and high expressive power are discarded. Accuracy of the losing tickets drops significantly even in Phase I, suggesting that model bias increases drastically as highly expressive tickets are pruned.
3) Phase transition is more pronounced in large models and small tasks. For example, in Figure 4, the phase transition is more noticeable in BERT-large than in BERT-base, and is more pronounced in RTE (2.5k) and MRPC (3.7k) than in SST (67k) and MNLI (393k). The phenomenon becomes more significant for the same task when we only use a part of the data, e.g., Figure 5 vs. Figure 4 (bottom left).

Model & Training
We adopt the MT-DNN architecture proposed by Liu et al. (2019). The MT-DNN model consists of a set of task-shared layers followed by a set of task-specific layers. The task-shared layers take in the input sequence embedding, and generate shared semantic representations by optimizing multi-task objectives. Our implementation is based on the MT-DNN code base. We follow the same training settings as Liu et al. (2019) for multi-task learning, and those in Section 5.2 for downstream fine-tuning. More details are summarized in Appendix A.2.
• MT-DNN BASE/LARGE. An MT-DNN model refined through multi-task learning, with task-shared layers initialized by pre-trained BERT-base/large.
• MT-DNN BASE/LARGE + ST Fine-tuning. A single task model obtained by further fine-tuning MT-DNN on an individual downstream task.
• Ticket-Share BASE/LARGE. An MT-DNN model refined through the tickets sharing strategy, with task-shared layers initialized by the union of the super tickets in pre-trained BERT-base/large.
• Ticket-Share BASE/LARGE + ST Fine-tuning. A fine-tuned single-task Ticket-Share model.

Experimental Results

Table 3 summarizes the experimental results. The fine-tuning results are averaged over 5 trials using different random seeds. We have several observations: 1) Ticket-Share BASE and Ticket-Share LARGE achieve 0.9 and 1.0 points gain in task-average score over MT-DNN BASE and MT-DNN LARGE, respectively. In some small tasks (RTE, MRPC), Ticket-Share achieves better or on-par results compared to MT-DNN + Fine-tuning. This suggests that by balancing the bias and variance for different tasks, the multi-task model's variance is reduced. In large tasks (QQP, QNLI and MNLI), Ticket-Share performs on par with the full model. This is because task-shared information is kept during pruning and still benefits multi-task learning.
2) Ticket-Share BASE + Fine-tuning and Ticket-Share LARGE + Fine-tuning achieve 1.0 and 0.7 points gain in task-average score over MT-DNN BASE + Fine-tuning and MT-DNN LARGE + Fine-tuning, respectively. This suggests that reducing the variance in the multi-task model benefits fine-tuning on downstream tasks.

Domain Adaptation
To demonstrate that super tickets can quickly generalize to new tasks/domains, we conduct few-shot domain adaptation on out-of-domain NLI datasets.

Data & Training
We briefly introduce the target domain datasets. SciTail is a textual entailment dataset derived from a science question answering (SciQ) dataset (Khot et al., 2018). The hypotheses are created from science questions, rendering SciTail challenging.

Experimental Results
We consider domain adaptation on both single task and multi-task super tickets. Specifically, we adapt SuperT BASE and ST-DNN BASE from MNLI to SNLI/SciTail, and adapt the shared embeddings generated by Ticket-Share BASE and by MT-DNN BASE to SNLI/SciTail. We adapt these models to 0.1%, 1%, 10% and 100% of the SNLI/SciTail training sets, and evaluate the transferred models on the SNLI/SciTail development sets. Table 4 shows the domain adaptation evaluation results. As we can see, SuperT and Ticket-Share adapt better to SNLI/SciTail than ST-DNN and MT-DNN, especially under the few-shot setting.

Analysis
Sensitivity to Random Seed. To better demonstrate that training with super tickets effectively reduces model variance, we evaluate models' sensitivity to changes in random seeds during single task fine-tuning and multi-task downstream fine-tuning. In particular, we investigate fitting small tasks with highly over-parametrized models (variance is often large in these models, see Sections 5 and 6). As shown in Table 5, models trained with the super tickets exhibit lower sensitivity to random seeds.

Tickets Importance Across Tasks. We analyze the importance score of each ticket computed in different GLUE tasks. For each ticket, we compute the importance score averaged over tasks as the Ticket Importance, and the proportion of the task-specific importance score out of the sum of all tasks' scores as the Task Share, as illustrated in Figure 6. We observe that many tickets exhibit almost equal Task Shares for over 5 out of 8 tasks (Figure 6(a)(b)). While these tickets contribute to knowledge sharing in the majority of tasks, they are considered non-expressive for tasks such as SST-2 (see Figure 6(a)(c)(d)). This explains why SST-2 benefits little from tickets sharing. Furthermore, a small number of tickets are dominated by a single task, e.g., CoLA (Figure 6(c)), or dominated jointly by two tasks, e.g., CoLA and STS-B (Figure 6(d)). This suggests that some tickets only learn task-specific knowledge, and that pairs of tasks may share certain task-specific knowledge.
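The two quantities can be computed from a task-by-ticket importance matrix. This is a sketch; the array layout and function name are our own assumptions.

```python
import numpy as np

def ticket_importance_and_task_share(scores):
    """scores: (n_tasks, n_tickets) importance of each ticket in each task.
    Ticket Importance: the ticket's score averaged over tasks.
    Task Share: each task's proportion of the ticket's summed score,
    so each column of the share matrix sums to 1.
    """
    importance = scores.mean(axis=0)
    share = scores / scores.sum(axis=0, keepdims=True)
    return importance, share
```

A ticket dominated by one task then shows a Task Share close to 1 for that task, while a broadly shared ticket shows roughly uniform shares.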

Discussion
Structured Lottery Tickets. LTH hypothesizes that a subset of unstructured parameters can be trained to match the full model's performance. Instead, we ask whether a subset of structured weight matrices, e.g., FFN layers and attention heads, can also be trained to match the full model's performance. This question is more practically important than the unstructured one: training and inference on structured matrices are better optimized for hardware acceleration. Our results give a positive answer to this question, while previous works show that structured tickets do not exist in highly compressed models (Prasanna et al., 2020).

Searching for Better Generalized Super Tickets. We select winning tickets according to the sensitivity of the model outputs with respect to the mask variables of each structure (Michel et al., 2019; Prasanna et al., 2020), as this measure is closely tied to the structure's expressive power (Section 3). In addition, we conduct a one-shot pruning for computational simplicity. We leave other importance measures and pruning schedules, which may help identify better generalized super tickets, for future work (Voita et al., 2019; Behnke and Heafield, 2020; Fan et al., 2019; Zhou et al., 2020; Sajjad et al., 2020).

Searching for Super Tickets Efficiently. Determining the compression ratio of the super tickets requires rewinding models at multiple sparsity levels. To leverage super tickets in practice, a potential direction of research is to find heuristics to determine this ratio prior to, or early on in, training. We leave this for future work.

Conclusion
We study the behaviors of the structured lottery tickets in pre-trained BERT. We observe that the generalization performance of the winning tickets exhibits a phase transition phenomenon, suggesting pruning can improve generalization when models are lightly compressed. Based on the observation, we further propose a tickets sharing strategy to improve multi-task fine-tuning. Our analysis paves the way for understanding the connection between model compression and generalization.

Broader Impact
This paper studies the behavior of the structured lottery tickets in pre-trained language models. Our investigation neither introduces any social/ethical bias to the model nor amplifies any bias in the data. We do not foresee any direct social consequences or ethical issues. Furthermore, our proposed method improves performance through model compression, rendering it energy efficient.

A.1.2 Training
We use Adamax as the optimizer. A linear learning rate decay schedule with a warm-up ratio of 0.1 is used. We apply gradient norm clipping at 1. We set the dropout rate of all task-specific layers to 0.1, except 0.3 for MNLI and 0.05 for CoLA. All texts are tokenized using wordpieces and chopped into spans no longer than 512 tokens. All experiments are conducted on Nvidia V100 GPUs.

A.1.3 Evaluation Results Statistics
We conduct 5 sets of experiments with different random seeds. Each set of experiments consists of fine-tuning, pruning, and rewinding at 8 sparsity levels. For results on the GLUE dev set (Table 1), we report the average score of super tickets rewinding results over the 5 sets of experiments. The standard deviation of the results is shown in Table 6. The statistics of the percent of weight remaining in the selected super tickets are shown in Table 7.
For results on the GLUE test set (Table 2), as the evaluation server sets a limit on the number of submissions, we only evaluate the test predictions under the single random seed that gives the best task-average validation results.

A.2.1 Multi-task Model Training
We adopt the MT-DNN code base and the exact optimization settings of Liu et al. (2019). We use Adamax as our optimizer with a learning rate of 5 × 10^-5 and a batch size of 32. We train for a maximum of 5 epochs with early stopping. A linear learning rate decay schedule with a warm-up ratio of 0.1 is used. The dropout rate of all task-specific layers is set to 0.1, except 0.3 for MNLI and 0.05 for CoLA. We clip the gradient norm at 1. All texts are tokenized using wordpieces and chopped into spans no longer than 512 tokens. It is worth mentioning that the task-specific super tickets used in Ticket-Share are all selected with a matched learning rate (i.e., 5 × 10^-5) in single task fine-tuning. We empirically find that rewinding the super tickets selected under matched optimization settings usually outperforms those selected under mismatched settings (i.e., using two different learning rates in single-task fine-tuning and in rewinding/multi-task learning). This agrees with previous observations in the Lottery Ticket Hypothesis literature, which show that unstructured winning tickets are tied not only to their weight initialization, but also to the model optimization path.

A.3 Domain Adaptation Experiments
A.3.1 Data
SNLI is one of the most widely used entailment datasets for NLI. SciTail involves assessing whether a given premise entails a given hypothesis. In contrast to other entailment datasets, the hypotheses in SciTail are created from science questions. These sentences are linguistically challenging. The corresponding answer candidates and premises come from relevant web sentences. The lexical similarity of premise and hypothesis is often high, making SciTail particularly challenging. Details of SNLI and SciTail, including tasks, statistics, and evaluation metrics, are summarized in Table 9.

A.3.2 Training
For single task model domain adaptation from MNLI to SNLI/SciTail, we follow the exact optimization settings as in Section 5.2 and Section A.1.2, except that we choose the learning rate from {5 × 10^-5, 1 × 10^-4, 5 × 10^-4}.

Table 7: Statistics of the percent of weight remaining of the selected super tickets over 5 different random seeds.

A.4.1 Randomness Analysis
For single task experiments in Table 5, we vary the random seeds only and keep all other hyper-parameters fixed. We present the standard deviation of the validation results over 5 trials of rewinding experiments. For multi-task downstream fine-tuning experiments, we present the standard deviation of the validation results over 5 trials, each result averaged over learning rates in {5 × 10^-5, 1 × 10^-4, 2 × 10^-4}. This is because the downstream fine-tuning performance is more sensitive to hyper-parameters.

A.4.2 Hyper-parameter Analysis
We further analyze the sensitivity of the Ticket-Share LARGE model to changes in hyper-parameters during downstream fine-tuning on some GLUE tasks. We vary the learning rate in {5 × 10^-5, 1 × 10^-4, 2 × 10^-4} and keep all other hyper-parameters fixed. Table 8 shows the standard deviation of the validation results over different learning rates, each result averaged over 5 different random seeds. As can be seen, Ticket-Share LARGE exhibits stronger robustness to changes in learning rate in downstream fine-tuning.