EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets

Heavily overparameterized language models such as BERT, XLNet and T5 have achieved impressive success in many NLP tasks. However, their high model complexity requires enormous computation resources and extremely long training time for both pre-training and fine-tuning. Many works have studied model compression on large NLP models, but they focus only on reducing inference time and still require an expensive training process. Other works use extremely large batch sizes to shorten the pre-training time, at the expense of higher computational resource demands. In this paper, inspired by the Early-Bird Lottery Tickets recently studied for computer vision tasks, we propose EarlyBERT, a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models. By slimming the self-attention and fully-connected sub-layers inside a transformer, we are the first to identify structured winning tickets in the early stage of BERT training. We apply those tickets towards efficient BERT training, and conduct comprehensive pre-training and fine-tuning experiments on GLUE and SQuAD downstream tasks. Our results show that EarlyBERT achieves comparable performance to standard BERT, with 35∼45% less training time. Code is available at https://github.com/VITA-Group/EarlyBERT.


Introduction
Large-scale pre-trained language models (e.g., BERT (Devlin et al., 2018), XLNet (Yang et al., 2019), T5 (Raffel et al., 2019)) have significantly advanced the state of the art in the NLP field. Despite impressive empirical success, their computational inefficiency has become an acute drawback in practice. As more transformer layers are stacked with larger self-attention blocks, model complexity increases rapidly. For example, compared to the BERT-Large model with 340 million parameters, T5 has more than 10 billion parameters to learn. Such high model complexity calls for expensive computational resources and extremely long training time.
* Work was done when the author interned at Microsoft.
Model compression is one approach to alleviating this issue. Recently, many methods have been proposed to encode large NLP models compactly (Jiao et al., 2019; Sun et al., 2019, 2020b). However, the focus is solely on reducing inference time or resource costs, leaving the process of searching for the right compact model ever more costly. Furthermore, most model compression methods start with a large pre-trained model, which may not be available in practice. Recent work (You et al., 2020b) proposes to use large training batches, which significantly shortens the pre-training time of the BERT-Large model but demands daunting computing resources (1,024 TPUv3 chips).
In contrast, our quest is to find a general resource-efficient training algorithm for large NLP models, which can be applied to both pre-training and fine-tuning stages. Our goal is to trim down the training time without increasing the total training resources (e.g., by resorting to large-batch or distributed training). To meet this challenging demand, we draw inspiration from recent work (You et al., 2020a) that explores the use of the Lottery Ticket Hypothesis (LTH) for efficient training of computer vision models. LTH was first proposed in Frankle and Carbin (2019) as an exploration to understand the training process of deep networks. The original LTH substantiates a trainable sparse sub-network at initialization, but it cannot be directly utilized for efficient training, since the subnetwork itself has to be searched through a tedious iterative process. In addition, most LTH works discuss only unstructured sparsity. The study of You et al. (2020a) shows that structured lottery tickets can emerge in the early stage of training (i.e., Early-Bird Tickets), and therefore a structurally sparse subnetwork can be identified at much lower cost, leading to practical efficient training algorithms.
Inspired by the success of LTH and Early-Bird Tickets, we propose EarlyBERT, a general efficient training algorithm based on structured Early-Bird Tickets. Due to the vast differences between the architectures and building blocks of computer vision models and BERT, the method of You et al. (2020a) cannot be directly extended to our setting. By instead using network slimming (Liu et al., 2017) on the self-attention and fully-connected sub-layers inside a transformer, we are the first to introduce an effective approach that can identify structured winning tickets in the early stage of BERT training, which we successfully apply to efficient language modeling pre-training and fine-tuning. Extensive experiments on BERT demonstrate that EarlyBERT can save 35∼45% training time with minimal performance degradation, when evaluated on GLUE and SQuAD benchmarks.

Related Work
Efficient NLP Models It is widely believed that BERT and other large NLP models are considerably overparameterized (McCarley, 2019; Sun et al., 2019). This explains the emergence of many model compression works, which can be roughly categorized into quantization (Zafrir et al., 2019), knowledge distillation (Sun et al., 2019; Jiao et al., 2019; Sun et al., 2020a,b), dynamic routing (Fan et al., 2019; Xin et al., 2020), and pruning (Li et al., 2020; Wang et al., 2019; McCarley, 2019; Michel et al., 2019). Almost all model compression methods focus on reducing inference time, while their common drawback is the reliance on fully-trained and heavily-engineered dense models before proceeding to their compact, sparse versions, which essentially transplants the resource burden from the inference stage to the training stage.
Pruning is the mainstream approach for compressing BERT so far (Gordon et al., 2020). McCarley (2019) proposed to greedily and iteratively prune away attention heads contributing less to the model. Wang et al. (2019) proposed to structurally prune BERT models using low-rank factorization and augmented Lagrangian $\ell_0$ norm regularization. McCarley (2019) pruned less important self-attention heads and slices of MLP layers by applying $\ell_0$ regularization to the coefficient corresponding to each head/MLP layer. Others aim to reduce the training time of transformer-based models via large-batch training and GPU model parallelism (You et al., 2020b; Shoeybi et al., 2019). Our work is orthogonal to these works, and can be readily combined with them for a further efficiency boost.

Lottery Ticket Hypothesis in Computer Vision
Lottery Ticket Hypothesis (LTH) was first proposed in Frankle and Carbin (2019), which shed light on the existence of sparse sub-networks (i.e., winning tickets) at initialization with non-trivial sparsity ratios that can achieve almost the same performance as the full model when trained alone. The winning tickets are identified by pruning fully trained networks using the so-called Iterative Magnitude-based Pruning (IMP). However, IMP is expensive due to its iterative nature. Moreover, IMP leads to unstructured sparsity, which is known to be insufficient for reducing training cost or accelerating training in practice. These barriers prevent LTH from becoming immediately helpful for efficient training. Follow-up work studies the transferability of winning tickets between datasets and optimizers, and investigates different components in LTH, observing the existence of super-masks in winning tickets. Lately, You et al. (2020a) pioneered the identification of Early-Bird Tickets, which emerge at the early stage of the training process and contain structured sparsity when pruned with Network Slimming (Liu et al., 2017), which adopts channel pruning. Early-Bird Tickets mitigate the two aforementioned limitations of IMP and make it possible to train deep models efficiently, by drawing tickets early in training and then training only this compact subnetwork. Chen et al. (2021) reveal the benefit of LTH in data-efficient training, but their focus is not on saving training resources.
Lottery Ticket Hypothesis in NLP All the above works evaluate their methods on computer vision models. For NLP models, previous work has also found that matching subnetworks exist in transformers and LSTMs (Renda et al., 2020). Evci et al. (2020) derived an algorithm for training sparse neural networks according to LTH and applied it to character-level language modeling on WikiText-103. For BERT models, a recent work (Chen et al., 2020b) found that pre-trained BERT models contain sparse subnetworks, found by unstructured IMP at 40% to 90% sparsity, that are independently trainable and transferable to a range of downstream tasks with no performance degradation. Their follow-up work (Gan et al., 2021) pointed out similar phenomena in pre-trained computer vision and vision-language models. Another work (Prasanna et al., 2020) aims to find structurally sparse lottery tickets for BERT, by pruning entire attention heads and MLP layers. Their experiments show that all subnetworks ("good" and "bad") have "comparable performance" when fine-tuned on downstream tasks, leading to their "all tickets are winning" conclusion.
Nevertheless, both relevant works (Chen et al., 2020b; Prasanna et al., 2020) examine only the pre-trained BERT model, i.e., finding tickets with regard to the fine-tuning stage on downstream tasks. To the best of our knowledge, no existing study analyzes the LTH at the pre-training stage of BERT; nor has any work discussed efficient BERT training using LTH, for either pre-training or fine-tuning. Our work makes the first attempt at introducing LTH to both efficient pre-training and efficient fine-tuning of BERT. Our results also provide positive evidence that LTH and Early-Bird Tickets in NLP models are amenable to structured pruning.

The EarlyBERT Framework
In this section, we first revisit the original Lottery Ticket Hypothesis (LTH) (Frankle and Carbin, 2019) and its variant Early-Bird Ticket (You et al., 2020a), then describe our proposed EarlyBERT.

Revisiting Lottery Ticket Hypothesis
Denote $f(x; \theta)$ as a deep network parameterized by $\theta$ and $x$ as its input. A sub-network of $f$ can be characterized by a binary mask $m$, which has exactly the same dimension as $\theta$. When applying the mask $m$ to the network, we obtain the sub-network $f(x; \theta \odot m)$, where $\odot$ is the Hadamard product operator. LTH states that, for a network initialized with $\theta_0$, an algorithm called Iterative Magnitude Pruning (IMP) can identify a mask $m$ such that the sub-network $f(x; \theta_0 \odot m)$ can be trained to have no worse performance than the full model $f$ following the same training protocol. Such a subnetwork $f(x; \theta_0 \odot m)$, including both the mask $m$ and initial parameters $\theta_0$, is called a winning ticket. The IMP algorithm works as follows: (1) initialize $m$ as an all-one mask; (2) fully train $f(x; \theta_0 \odot m)$ to obtain a well-trained $\theta$; (3) remove a small portion of weights with the smallest magnitudes from $\theta \odot m$ and update $m$; (4) repeat (2)-(3) until a certain sparsity ratio is achieved.
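The four IMP steps can be sketched as follows. This is a minimal illustration, not the procedure of any cited work: the `train` function is a hypothetical stand-in that runs gradient descent on a toy quadratic loss, and the values of `target_sparsity` and `prune_frac` are arbitrary assumptions.

```python
import numpy as np

def train(theta0, mask, steps=50, lr=0.1):
    """Hypothetical stand-in for 'fully train f(x; theta0 * m)':
    gradient descent on a toy quadratic loss, pruned weights held at zero."""
    target = np.linspace(-1.0, 1.0, theta0.size)  # toy optimum (assumption)
    theta = theta0 * mask
    for _ in range(steps):
        grad = theta - target
        theta = (theta - lr * grad) * mask
    return theta

def imp(theta0, target_sparsity=0.5, prune_frac=0.2):
    mask = np.ones_like(theta0)                       # (1) all-one mask
    while 1.0 - mask.mean() < target_sparsity:        # (4) repeat until sparse enough
        theta = train(theta0, mask)                   # (2) always restart from theta0
        alive = np.flatnonzero(mask)
        k = max(1, int(prune_frac * alive.size))
        smallest = np.argsort(np.abs(theta[alive]))[:k]
        mask[alive[smallest]] = 0.0                   # (3) drop smallest magnitudes
    return mask

rng = np.random.default_rng(0)
theta0 = rng.normal(size=20)
mask = imp(theta0)
```

Each pass through the loop calls `train` once, so the total cost grows with the number of pruning rounds, which is the first obstacle discussed next.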
Two obstacles prevent LTH from being directly applied to efficient training. First, the iterative process in IMP is essential to preserve the performance of LTH; however, this is computationally expensive, especially when the number of iterations is high. Second, the original LTH does not pursue any structured sparsity in the winning tickets. In practice, unstructured sparsity is difficult to exploit for computation acceleration even when the sparsity ratio is high (Wen et al., 2016).
To mitigate these gaps, Early-Bird Tickets are proposed by You et al. (2020a), who discover that when using a structured mask $m$ and a properly selected learning rate, the mask $m$ quickly converges and the corresponding sub-network emerges as the winning ticket in the early stage of training. The early emergence of winning tickets and the structured sparsity are both helpful in reducing computational cost in the training that follows. You et al. (2020a) focus on computer vision tasks with convolutional networks such as VGG (Simonyan and Zisserman, 2014) and ResNet (He et al., 2016). Inspired by this, we set out to explore whether there are structured winning tickets in the early stage of BERT training that can significantly accelerate language model pre-training and fine-tuning.

Discovering EarlyBERT
The proposed EarlyBERT training framework consists of three steps: (i) Searching Stage: jointly train BERT and the sparsity-inducing coefficients to be used to draw the winning ticket; (ii) Ticket-drawing Stage: draw the winning ticket using the learned coefficients; and (iii) Efficient-training Stage: train EarlyBERT for pre-training or downstream fine-tuning.
Searching Stage To search for the key substructure in BERT, we follow the main idea of Network Slimming (NS) (Liu et al., 2017). However, pruning in NS is based on the scaling factor $\gamma$ in batch normalization, which is not used in most NLP models such as BERT. Therefore, we make necessary modifications to the original NS so that it can be adapted to pruning BERT. Specifically, we propose to associate attention heads and intermediate layers of the fully-connected sub-layers in a transformer with learnable coefficients, which will be jointly trained with BERT but with an additional $\ell_1$ regularization to promote sparsity.
Some studies (Michel et al., 2019; Voita et al., 2019) find that the multi-head self-attention module of a transformer can be redundant, presenting the possibility of pruning some heads from each layer of BERT without hurting model capacity. A multi-head attention module (Vaswani et al., 2017) is formulated as:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(h_1, \ldots, h_n)\, W^O, \quad h_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V), \tag{1}$$
where $n$ is the number of heads, and the projections $W^O$, $W_i^Q$, $W_i^K$, $W_i^V$ are used for output, query, key and value. Inspired by Liu et al. (2017), we introduce a set of scalar coefficients $c_i^h$ ($i$ is the index of attention heads and $h$ means "head") inside $h_i$:
$$h_i = c_i^h \cdot \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V). \tag{2}$$
After the self-attention sub-layer in each transformer layer, the output $\mathrm{MultiHead}(Q, K, V)$ will be fed into a two-layer fully-connected network, in which the first layer increases the dimension of the embedding by 4 times and the second reduces it back to the hidden size (768 for BERT$_{\text{BASE}}$ and 1,024 for BERT$_{\text{LARGE}}$). We multiply learnable coefficients $c^f$ to the intermediate neurons:
$$\mathrm{FFN}(x) = \big(c^f \odot \max(0, x W_1 + b_1)\big)\, W_2 + b_2. \tag{3}$$
These modifications allow us to jointly train BERT with the coefficients, using the following loss:
$$\mathcal{L}(f(\cdot; \theta), c) = \mathcal{L}_0(f(\cdot; \theta), c) + \lambda \|c\|_1, \tag{4}$$
where $\mathcal{L}_0$ is the original loss function used in pre-training or fine-tuning, $c$ is the concatenation of all the coefficients in the model including those for attention heads and intermediate neurons, and $\lambda$ is the hyper-parameter that controls the strength of regularization.
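As a rough illustration of the coefficient mechanism, the sketch below gates the intermediate neurons of a fully-connected sub-layer and adds an $\ell_1$ penalty on all coefficients. It is written in NumPy under several assumptions: a ReLU activation is used for simplicity (BERT implementations commonly use GELU), the layer sizes are toy values, and the names `gated_ffn` and `regularized_loss` are hypothetical rather than taken from the released code.

```python
import numpy as np

def gated_ffn(x, W1, b1, W2, b2, c_f):
    """Two-layer FFN with one learnable coefficient per intermediate neuron."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU is an assumption
    return (c_f * hidden) @ W2 + b2

def regularized_loss(task_loss, coeffs, lam=1e-4):
    """Task loss plus lam * ||c||_1 over the concatenation of all coefficients."""
    c = np.concatenate([np.ravel(v) for v in coeffs])
    return task_loss + lam * np.abs(c).sum()

rng = np.random.default_rng(0)
hidden_size, inter_size = 8, 32             # toy sizes, not BERT's
x = rng.normal(size=(2, hidden_size))
W1, b1 = rng.normal(size=(hidden_size, inter_size)), np.zeros(inter_size)
W2, b2 = rng.normal(size=(inter_size, hidden_size)), np.zeros(hidden_size)

c_f = np.ones(inter_size)
c_f[:4] = 0.0                               # pretend 4 coefficients were driven to zero
keep = c_f != 0

gated = gated_ffn(x, W1, b1, W2, b2, c_f)
# Physically removing the zeroed neurons gives the same output with smaller matrices.
slim = gated_ffn(x, W1[:, keep], b1[keep], W2[keep, :], b2, c_f[keep])

loss = regularized_loss(task_loss=0.7, coeffs=[np.ones(12), c_f])
```

The comparison between `gated` and `slim` shows why a zero coefficient is equivalent to deleting the corresponding column of `W1` and row of `W2`, which is what makes the resulting sparsity structured.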
Note that in this step, the joint training of BERT and the coefficients is still as expensive as normal BERT training. However, the winning strategy of EarlyBERT is that we only need to perform this joint training for a few steps, before the winning ticket emerges, which is much shorter than the full training process of pre-training or fine-tuning. In other words, we can identify the winning tickets at a very low cost compared to the full training. Then, we draw the ticket (i.e., the EarlyBERT), reset the parameters and train EarlyBERT, which is computationally efficient thanks to its structured sparsity. Next, we introduce how we draw EarlyBERT from the learned coefficients.
Ticket-drawing Stage After training BERT and coefficients c jointly, we draw EarlyBERT using the learned coefficients with a magnitude-based metric. Note that we prune attention heads and intermediate neurons separately, as they play different roles.
We prune the attention heads whose coefficients have the smallest magnitudes, and remove these heads from the computation graph. We also prune the rows in $W^O$ (see Eqn. (1)) that correspond to the removed heads. Note that this presents a design choice: should we prune the heads globally or layer-wise? In this paper, we use layer-wise pruning for attention heads, because the number of heads in each layer is very small (12 for BERT$_{\text{BASE}}$ and 16 for BERT$_{\text{LARGE}}$). We observe empirically that if pruned globally, the attention heads in some layers may be completely removed, making the network un-trainable. Furthermore, Ramsauer et al. (2020) observe that attention heads in different layers exhibit different behaviors. This also motivates us to only compare the importance of attention heads within each layer.
Similar to pruning attention heads, we prune intermediate neurons in the fully-connected sublayers. Pruning neurons is equivalent to reducing the size of intermediate layers, which leads to a reduced size of the weight matrices W 1 and W 2 in Eqn. (3). Between global and layer-wise pruning, empirical analysis shows that global pruning works better. We also observe that our algorithm naturally prunes more neurons for the later layers than earlier ones, which coincides with many pruning works on vision tasks. We leave the analysis of this phenomenon as future work.
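The two pruning scopes described above (layer-wise for attention heads, global for intermediate neurons) can be sketched as follows. The coefficient arrays here are random placeholders rather than learned values, and the function names are hypothetical; the shapes follow a 12-layer model with 12 heads and 3,072 intermediate neurons per layer.

```python
import numpy as np

def prune_heads_layerwise(c_heads, n_prune):
    """c_heads: (num_layers, num_heads). Drop the n_prune smallest |c| in every layer."""
    order = np.argsort(np.abs(c_heads), axis=1)
    mask = np.ones_like(c_heads, dtype=bool)
    np.put_along_axis(mask, order[:, :n_prune], False, axis=1)
    return mask

def prune_neurons_global(c_ffn, ratio):
    """c_ffn: (num_layers, intermediate_size). Drop the smallest |c| across all layers."""
    mags = np.abs(c_ffn).ravel()
    k = int(ratio * mags.size)
    mask = np.ones(mags.size, dtype=bool)
    mask[np.argsort(mags)[:k]] = False
    return mask.reshape(c_ffn.shape)

rng = np.random.default_rng(0)
c_heads = rng.normal(size=(12, 12))      # placeholder coefficients (assumption)
c_ffn = rng.normal(size=(12, 3072))      # placeholder coefficients (assumption)

head_mask = prune_heads_layerwise(c_heads, n_prune=4)
neuron_mask = prune_neurons_global(c_ffn, ratio=0.4)
kept_per_layer = neuron_mask.sum(axis=1)  # may differ from layer to layer
```

With layer-wise pruning every layer keeps the same number of heads, so no layer can lose all of them. With global pruning the per-layer counts in `kept_per_layer` are free to vary, which is how later layers can end up pruned more than earlier ones.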

Efficient-training Stage
We then train the EarlyBERT that we have drawn, for pre-training or fine-tuning depending on the target task. If we apply EarlyBERT to pre-training, the initialization $\theta_0$ of BERT will be a random initialization, the same setting as the original LTH (Frankle and Carbin, 2019) and Early-Bird Tickets (You et al., 2020a).
Figure 1 (caption, partially recovered): Bottom: mask distance observed in fine-tuning. The color represents the normalized mask distance between different training steps. The darker the color, the smaller the mask distance. In both cases, the mask converges quickly, which indicates the early emergence of the tickets.
If we apply EarlyBERT to fine-tuning, then θ 0 can be any pre-trained model. We can also moderately reduce the training steps in this stage without sacrificing performance, which is empirically supported by the findings in Frankle and Carbin (2019); You et al. (2020a) that the winning tickets can be trained more effectively than the full model. In practice, the learning rate can also be increased to speed up training, in addition to reducing training steps. Different from unstructured pruning used in LTH and many other compression works (Frankle and Carbin, 2019;Chen et al., 2020b), structurally pruning attention heads and intermediate neurons in fully-connected layers can directly reduce the number of computations required in the transformer layer, and shrink the matrix size of the corresponding operations, yielding a direct reduction in computation and memory costs.
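To make the reduction in matrix sizes concrete, the following back-of-the-envelope count compares the weight matrices of one transformer layer before and after pruning. It assumes BERT$_{\text{BASE}}$ dimensions (hidden size 768, 12 heads of dimension 64, 3,072 intermediate neurons), ignores biases and layer normalization, and, purely for illustration, assumes the 40% of pruned neurons are spread uniformly over layers even though global pruning generally produces uneven per-layer counts.

```python
def layer_weight_count(hidden, heads, head_dim, intermediate):
    """Weights in the Q, K, V, O projections plus the two FFN matrices."""
    attention = 4 * hidden * heads * head_dim
    ffn = 2 * hidden * intermediate
    return attention + ffn

full = layer_weight_count(hidden=768, heads=12, head_dim=64, intermediate=3072)

kept_neurons = 3072 - int(0.4 * 3072)   # uniform split across layers is an assumption
pruned = layer_weight_count(hidden=768, heads=8, head_dim=64, intermediate=kept_neurons)

ratio = pruned / full
print(full, pruned, round(ratio, 3))
```

Under these assumptions the pruned layer keeps roughly 62% of the original weights. This count is only a proxy: measured training-time savings also depend on sequence length, attention-score computation, and the reduced number of training steps.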

Validation of EarlyBERT
Early Emergence Following You et al. (2020a), we visualize the normalized mask distance between different training steps, to validate the early emergence of winning tickets. In Figure 1, the axes in the plots are the number of training steps finished. We use only one fully-connected sub-layer to plot the fully-connected panels of Figure 1. In both pre-training and fine-tuning, the mask converges at a very early stage of the whole training process. Although we observe an increase of mask distance in fully-connected layers during pre-training (in Figure 1(b)), this can be easily eliminated by early stopping and using mask distance as the exit criterion. An ablation study on how early stopping influences the performance of EarlyBERT is presented in Sec. 4.2.
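A minimal sketch of the mask-distance computation is given below, taking the normalized mask distance to be the fraction of positions at which two binary masks differ (normalized Hamming distance). The exit rule `should_stop`, with its `window` and `eps` parameters, is a hypothetical illustration of using mask distance as an exit criterion and is not a setting reported in this paper.

```python
import numpy as np

def mask_distance(m1, m2):
    """Normalized Hamming distance between two binary masks of equal shape."""
    m1 = np.asarray(m1, dtype=bool)
    m2 = np.asarray(m2, dtype=bool)
    return float(np.mean(m1 != m2))

def pairwise_distances(masks):
    """Matrix of distances between masks drawn at different training steps."""
    n = len(masks)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dist[i, j] = mask_distance(masks[i], masks[j])
    return dist

def should_stop(masks, window=3, eps=0.1):
    """Hypothetical exit rule: stop once the last `window` masks agree within eps."""
    if len(masks) < window:
        return False
    return bool(pairwise_distances(masks[-window:]).max() <= eps)

# Toy masks drawn at four successive checkpoints (illustrative values only).
masks = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]
dist = pairwise_distances(masks)
```

Plotting `dist` as a heat map over checkpoints gives the kind of visualization described above: a block of near-zero entries indicates that the mask has stopped changing.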
Non-trivial Sub-network Here, by non-trivial we mean that, with the same sparsity ratio as in EarlyBERT, a randomly pruned model suffers from a significant performance drop. We test this with fine-tuning experiments on BERT$_{\text{BASE}}$. Specifically, we prune 4 heads from each transformer layer in BERT$_{\text{BASE}}$ and EarlyBERT. We fine-tune BERT$_{\text{BASE}}$ for 3 epochs with an initial learning rate of $2 \times 10^{-5}$. We run the searching stage for 0.2 epochs with $\lambda = 1 \times 10^{-4}$, draw EarlyBERT with pruning ratio $\rho = 1/3$, and then fine-tune EarlyBERT for 2 epochs with a doubled initial learning rate. For the randomly pruned models, we randomly prune 4 heads in each layer and follow the same fine-tuning protocol as EarlyBERT.
The reported results of randomly pruned models are the average of 5 trials with different seeds for pruning. The results on four tasks from the GLUE benchmark (Wang et al., 2018) presented in Table 1 show that the randomly pruned model consistently underperforms EarlyBERT by a significant gap, supporting our claim that EarlyBERT indeed identifies non-trivial sub-structures.
We evaluate on the GLUE benchmark (Wang et al., 2018) and a question-answering dataset, SQuAD v1.1 (Rajpurkar et al., 2016). Note that as our goal is efficient pre-training and fine-tuning, we focus on larger datasets from GLUE (MNLI, QNLI, QQP and SST-2), as it is less meaningful to discuss efficient training on very small datasets. We use the default training settings for pre-training and fine-tuning on both models. To evaluate model performance, we use Matthews correlation score for CoLA, matched accuracy for MNLI, F1-score for SQuAD v1.1, and accuracy in percentage for other tasks on GLUE. We omit % symbols in all the tables on accuracy results.

Implementation Details
For the vanilla BERT, we fine-tune on GLUE datasets for 3 epochs with an initial learning rate of $2 \times 10^{-5}$, and for 2 epochs on SQuAD with an initial learning rate of $3 \times 10^{-5}$; we use the AdamW (Loshchilov and Hutter, 2017) optimizer in both cases. For pre-training, we adopt the LAMB optimization technique (You et al., 2020b), which involves two phases of training: the first 9/10 of the total training steps uses a sequence length of 128, while the last 1/10 uses a sequence length of 512. Pre-training by default has 8,601 training steps and uses 64k/32k batch sizes and $6 \times 10^{-3}$/$4 \times 10^{-3}$ initial learning rates for the two phases, respectively. All experiments are run on 16 NVIDIA V100 GPUs.

Experiments on Fine-tuning
The main results of EarlyBERT in fine-tuning are presented in Table 2. According to the observation of the early emergence of tickets in Sec. 3.3, we run the searching stage for 0.2 epochs (which accounts for less than 7% of the cost of a standard 3-epoch fine-tuning) with $\lambda = 1 \times 10^{-4}$ for all tasks. When drawing EarlyBERT, we prune 4 heads in each layer from BERT$_{\text{BASE}}$ and 6 heads from BERT$_{\text{LARGE}}$, and globally prune 40% of the intermediate neurons in the fully-connected sub-layers in both models, instead of pruning only heads as in Table 1. After this, we re-train the EarlyBERT models for fewer epochs (2 instead of 3) on the GLUE benchmark, with the learning rate scaled up by 2 times to buffer the effect of the reduced epochs. For the SQuAD dataset, we keep the default setting, as we find SQuAD is more sensitive to the number of training steps. The selection of these hyper-parameters is based on the ablation studies that follow the main results.
[...] stage in EarlyBERT$_{\text{BASE}}$, but it induces a much larger accuracy drop. EarlyBERT$_{\text{BASE}}$ can also outperform another strong baseline, LayerDrop (Fan et al., 2019), which drops one third of the layers so that the number of remaining parameters is comparable to ours. Note that LayerDrop models are fine-tuned for three full epochs, yet EarlyBERT is still competitive in most cases. Second, we consistently observe a clear performance advantage of EarlyBERT over randomly pruned models, which provides further strong evidence that EarlyBERT does discover non-trivial key sparse structures. Even though there still exists a margin between EarlyBERT and the baseline (You et al. (2020a) also observed a similar phenomenon in their tasks), the existence of structured winning tickets and their potential for efficient training is highly promising. We leave the discovery of winning tickets with higher sparsity and better quality as future work.
Ablation Studies on Fine-tuning We perform extensive ablation studies to investigate important hyper-parameter settings in EarlyBERT, using EarlyBERT$_{\text{BASE}}$ as our test bed. For all experiments, we use the average accuracy on the larger datasets from the GLUE benchmark (MNLI, QNLI, QQP and SST-2) as the evaluation metric.
• Number of training epochs and learning rate. We first investigate whether we can properly reduce the number of training epochs, and whether scaling the learning rate can help compensate for the negative effect caused by reducing training steps. Results in Figure 2 show that when we fine-tune EarlyBERT for fewer epochs on GLUE, up-scaling the learning rate first helps to recover performance, and then causes it to decrease again. We use two epochs and a learning rate of $4 \times 10^{-5}$ for EarlyBERT in the GLUE experiments.
• Regularization strength $\lambda$. The regularization strength $\lambda$ affects the quality of the winning ticket, and consequently the performance of EarlyBERT after pre-training/fine-tuning. Results in Table 3 show that $\lambda$ has only marginal influence on EarlyBERT performance. We use $\lambda = 10^{-4}$, which achieves the best performance, in the following experiments.
• Pruning ratios $\rho$. We further investigate the effects of different pruning ratios as well as layer-wise/global pruning on the performance of EarlyBERT. As discussed in Sec. 3.2, we only consider layer-wise pruning for self-attention heads. Table 3 shows that the performance monotonically decreases when we prune more self-attention heads from BERT; however, we see a slight increase and then a sharp decrease in accuracy when the pruning ratio is raised for intermediate neurons in fully-connected sub-layers (a 40% pruning ratio seems to be the sweet spot). We also observe consistent superiority of global pruning over layer-wise pruning for intermediate neurons.
• Early-stop strategy for searching. In Figure 1, we show the early emergence of winning tickets in BERT when trained with $\ell_1$ regularization, suggesting that we can stop the searching stage early to save computation while still generating high-quality tickets. Here, we study how the early-stop strategy influences the model performance. We fine-tune EarlyBERT on QNLI following the same setting described earlier in this section, but stop the searching stage at different time points during the first epoch of searching. Results are shown in Figure 3.

Trade-off Between Efficiency and Performance
We vary the pruning ratios for the FC layers and the number of self-attention heads pruned in each layer in EarlyBERT, fine-tune the models on QQP in GLUE, and obtain the corresponding validation accuracies and training time savings following the protocol above. Results are shown in Table 5. We can see a clear correlation between the training time saving and the accuracy: the more FC neurons or self-attention heads are pruned, the larger the training time saving but also the larger the accuracy drop. Moreover, for most combinations of these two hyper-parameters, the accuracy drop is within 1%, which also supports the efficiency of EarlyBERT.

Experiments on Pre-training
We also conduct pre-training experiments and present the main results in Table 4. We run the searching stage for 400 training steps in the first training phase, which uses a sequence length of 128; this accounts for less than 3% of a standard pre-training. We use $\lambda = 1 \times 10^{-4}$. When we draw EarlyBERT, similar to the settings in the fine-tuning experiments, we prune 4 heads in each layer from BERT$_{\text{BASE}}$ and 6 heads from BERT$_{\text{LARGE}}$; however, we prune slightly fewer (30%) intermediate neurons in the fully-connected sub-layers in both models, since we empirically observe that pre-training is more sensitive to aggressive intermediate neuron pruning. In both phases of pre-training, we reduce the training steps to 80% of the default setting when training EarlyBERT (based on the ablation study shown in Figure 4). Other hyper-parameters for pre-training follow the default setting described in Sec. 4.1. All models are fine-tuned and evaluated on GLUE and SQuAD v1.1 with the default setting.
Different from the fine-tuning experiments, the pre-training stage dominates the training time over the downstream fine-tuning, and thus we only consider the training time saving during pre-training. Since the randomly pruned models do not have competitive performance in the fine-tuning experiments, as shown in Sec. 4.2, we focus on comparing EarlyBERT with the full BERT baseline.
From the results presented in Table 4, we can see that on downstream tasks with larger datasets such as QNLI, QQP and SST-2, we can achieve accuracies that are close to the BERT baseline (within 1% accuracy gaps except for EarlyBERT$_{\text{BASE}}$ on MNLI and SQuAD). However, on downstream tasks with smaller datasets, the patterns are not consistent: we observe big drops on CoLA and MRPC but improvement on RTE. Overall, EarlyBERT achieves comparable performance while saving 30∼35% training time thanks to its structured sparsity and reduction in training steps.
Reducing Training Steps in Pre-training We investigate whether EarlyBERT, with non-essential heads and/or intermediate neurons pruned, can train more efficiently, and whether we can reduce the number of training steps in pre-training. This can further help reduce training cost in addition to the efficiency gain from pruning. We use EarlyBERT$_{\text{BASE}}$-Self (only self-attention heads are pruned when drawing the winning ticket) as the test bed. Figure 4 shows that the performance decreases more when we reduce the number of training steps to 60% or less. Reducing it to 80% seems to be a sweet spot with the best balance between performance and efficiency.

Comparison with Previous Lottery Tickets Work in NLP
On one hand, two relevant works (Chen et al., 2020b; Prasanna et al., 2020) only investigate lottery tickets on pre-trained NLP models for fine-tuning on the downstream tasks, while EarlyBERT makes the first attempt at introducing lottery tickets to both fine-tuning and pre-training stages, and provides empirical evidence that NLP models are amenable to structured pruning.
On the other hand, EarlyBERT pursues structured sparsity while Chen et al. (2020b) promote unstructured sparsity, which is hardware-unfriendly and provides almost no acceleration, on top of the high cost of IMP. As an implicit comparison, Chen et al. (2020b) induce a 0.4% accuracy drop on the SQuAD v1.1 dataset compared to the BERT baseline with 40% unstructured sparsity (comparable with our settings in Section 4.2), while EarlyBERT induces a 1.37% accuracy drop. Note that Chen et al. (2020b) use 6× the training time (because IMP reaches 40% sparsity with 6 iterations) and 4.69× the FLOPs, whereas EarlyBERT uses only 0.76× the training time and FLOPs.

Conclusion
In this paper, we present EarlyBERT, an efficient framework for large-scale language model pre-training and fine-tuning. Based on the Lottery Ticket Hypothesis, EarlyBERT identifies structured winning tickets in an early stage, then uses the pruned network for efficient training. Experimental results demonstrate that the proposed method is able to achieve comparable performance to standard BERT with much less training time. Future work includes exploring more data-efficient strategies to enhance the current training pipeline.

A More Comparisons with the BERT Baseline
For a more explicit comparison, we conduct a two-way fine-tuning experiment in addition to the main results in Table 2. All results are averages of 3 runs. We first increase the training cost of EarlyBERT to match BERT performance by extending the searching stage to a full epoch, which, according to our ablation study in Figure 3, helps to improve the performance of EarlyBERT. In this case, EarlyBERT still has 16% time and FLOPs savings, with comparable performance shown in Table 6.
Secondly, we reduce the training steps of BERT to match the FLOPs of EarlyBERT, inducing obvious gaps between BERT and EarlyBERT as presented in Table 7.

B Searching EarlyBERT Using the Masked Language Modeling Task
Chen et al. (2020b) find that selecting a winning ticket for BERT fine-tuning on the masked language modeling (MLM) task, i.e., the pre-training objective, yields tickets that perform better on many of the downstream tasks. Here, we experiment with using the MLM objective during the searching stage. Results are summarized in Table 8. Our main observations include:
• When using the MLM objective for the searching stage, the mask distance for both self-attention heads and FC neurons converged well and quickly, within 100 training steps.
• We first apply the global pruning method to the FC neurons because we observed better performance of EarlyBERT with that method. However, while we previously found in EarlyBERT that the later layers are pruned more, we observed the opposite phenomenon when using the MLM objective: the earlier layers are pruned more instead. In terms of accuracy, we observed significant gaps compared to EarlyBERT.
• Based on the above observations, we also applied layer-wise pruning for the MLM experiments (shown in the last row of Table 8). We did see improved accuracy with layer-wise pruning, but the gaps to EarlyBERT are still large (except on QQP).

C The Effect of Reduced Training Steps during Pre-training
We repeat the analysis of the effect of reduced training steps during pre-training from Figure 4 for both the vanilla BERT and EarlyBERT. We calculate how performance is influenced by the reduced training steps. We use F1 score for SQuAD, Matthews correlation score for CoLA and accuracy for all other tasks on GLUE as the metric. We report the performance reduction (or gain) in percentage, averaged over all tasks and normalized by the performance of the baseline, i.e., BERT or EarlyBERT trained with the default number of training steps. A similar metric is used in DistilBERT. Results are shown in Table 9.
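One plausible reading of this metric is the per-task relative change with respect to the baseline, averaged over tasks and expressed in percent. The sketch below implements that reading; the function name is hypothetical and the scores are made-up placeholders, not numbers from Table 9.

```python
def relative_change(reduced, baseline):
    """Mean over tasks of (reduced - baseline) / baseline, in percent.
    Negative values indicate a performance reduction, positive values a gain."""
    changes = [(reduced[task] - baseline[task]) / baseline[task] for task in baseline]
    return 100.0 * sum(changes) / len(changes)

# Placeholder scores for two hypothetical tasks (not results from the paper).
baseline_scores = {"task_a": 80.0, "task_b": 50.0}
reduced_scores = {"task_a": 76.0, "task_b": 51.0}

change = relative_change(reduced_scores, baseline_scores)
```

Normalizing each task by its own baseline keeps tasks with different score ranges (for example, Matthews correlation versus accuracy) from dominating the average.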
We can see that using only 80% of the training steps actually improves the performance of EarlyBERT on average but, in contrast, hurts the performance of BERT. Similarly, using 60% of the training steps hurts BERT more than EarlyBERT. As expected, cutting more training steps generally hurts more. We see this as one piece of evidence supporting the use of reduced training steps for EarlyBERT.