BERxiT: Early Exiting for BERT with Better Fine-Tuning and Extension to Regression

The slow speed of BERT has motivated much research on accelerating its inference, and the early exiting idea has been proposed to make trade-offs between model quality and efficiency. This paper aims to address two weaknesses of previous work: (1) existing fine-tuning strategies for early exiting models fail to take full advantage of BERT; (2) methods to make exiting decisions are limited to classification tasks. We propose a more advanced fine-tuning strategy and a learning-to-exit module that extends early exiting to tasks other than classification. Experiments demonstrate improved early exiting for BERT, with better trade-offs obtained by the proposed fine-tuning strategy, successful application to regression tasks, and the possibility of combining it with other acceleration methods. Source code can be found at https://github.com/castorini/berxit.


Introduction
Large-scale pre-trained language models such as BERT (Devlin et al., 2019) have brought the natural language processing (NLP) community large performance gains, but at the cost of a heavy computational burden. While pre-trained models are available online and fine-tuning is typically done without a strict time budget, inference comes with a much lower latency tolerance, and the slow inference speed of these models can impede easy deployment. It becomes even more difficult when inference has to be done on edge devices due to limited network capabilities or privacy concerns.
Early exiting (Schwartz et al., 2020; Xin et al., 2020a) has been proposed to accelerate the inference of BERT and models with similar architecture, i.e., those comprising multiple transformer layers (Vaswani et al., 2017) with a classifier at the top. Instead of using only one classifier, additional classifiers are attached to each transformer layer (see Figure 1), and the entire model is fine-tuned together. At inference time, a sample can exit early through one of the intermediate classifiers.

Figure 1: Multi-output structure of early exiting BERT.
While existing early exiting papers provide promising quality-efficiency trade-offs, improvements are necessary for two important components: fine-tuning strategies and exiting decision making. In these papers, fine-tuning strategies are relatively simple and fail to take full advantage of the pre-trained model's effectiveness; we propose a novel fine-tuning strategy, Alternating, for this multi-output model. Moreover, previous work makes exiting decisions based on the confidence of output probability distributions, and is hence only applicable to classification tasks; we extend it to other tasks by proposing the learning-to-exit idea. With carefully designed fine-tuning strategies and methods for making exiting decisions, the model can achieve better quality-efficiency trade-offs and can be extended to regression tasks.
We refer to our proposed ideas collectively as BERxiT (BERT+exit), and apply them to Muppets including BERT, RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2020); we also apply them on top of another BERT acceleration method, DistilBERT (Sanh et al., 2019). We conduct experiments on datasets including classification and regression tasks, and show that our method can save up to 70% of inference time with minimal quality degradation.
Our contributions include the following: (1) an effective fine-tuning method Alternating; (2) the learning-to-exit idea that extends early exiting to tasks other than classification; (3) extensive experiments that show the effectiveness of our ideas and the successful combination of early exiting with other BERT acceleration methods; (4) additional experiments that provide insight into the inner mechanism of pre-trained models.

Related Work
BERT (Devlin et al., 2019) is a pre-trained multilayer transformer (Vaswani et al., 2017) model. RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2020) are variants of BERT with almost identical model architectures but different training methods and parameter sharing strategies. In our paper, we refer to these transformer-based models as Muppets, and apply our method on them.
In the general deep learning context, there are a number of well-explored methods to accelerate model inference. Pruning (Han et al., 2015; Fan et al., 2020; Gordon et al., 2020) removes unimportant parts of the neural model, from individual weights to layers and blocks. Quantization (Lin et al., 2016; Shen et al., 2020) reduces the number of bits needed to operate a neural model and to store its weights. Distillation (Hinton et al., 2015; Jiao et al., 2020) transfers knowledge from large teacher models to small student models. These methods typically require pre-training Muppets from scratch and produce only one small model with a predetermined target size. Early exiting, by contrast, requires only fine-tuning and produces a series of small models, from which the user can choose flexibly. It extends the idea of Adaptive Computation (Graves, 2016) for recurrent neural networks, and is also closely related to BranchyNet (Teerapittayanon et al., 2016), Multi-Scaled DenseNet (Huang et al., 2018), and Slimmable Network (Yu et al., 2019).
Early exiting for Muppets has been explored by RTJ (Schwartz et al., 2020), DeeBERT (Xin et al., 2020a,b), and FastBERT. Despite their promising results, there is still room for improvement regarding the fine-tuning strategies of RTJ and DeeBERT. FastBERT, on the other hand, uses self-distillation (Zhang et al., 2019; Phuong and Lampert, 2019) for fine-tuning, which works well for small Muppets such as BERT BASE . However, our preliminary experiments (Appendix A) show that self-distillation is unstable on larger Muppets such as BERT LARGE , suggesting that future work is necessary to fully understand and robustly apply self-distillation to Muppets.
All three methods make early exiting decisions based on the confidence (or its variants) of the predicted probability distribution, and are therefore limited to classification tasks. Runtime Neural Pruning (Lin et al., 2017), SkipNet (Wang et al., 2018b), and BlockDrop use reinforcement learning (RL) to decide whether to execute a network module. Universal Transformer (Dehghani et al., 2019) and Depth-Adaptive Transformer (DAT; Elbayad et al., 2020) use learned decisions for early exiting in sequence-to-sequence tasks. Concurrently, PABEE proposes patience-based early exiting, which is applicable to regression, but it relies on inter-layer prediction consistency and is therefore not very efficient for exiting at early layers. Inspired by them, we propose a method that extends early exiting for Muppets to regression tasks, using only layer-specific information as in classification. Moreover, our method requires neither RL nor the complicated distribution fitting of DAT, but uses a straightforward layer-wise certainty estimation, and achieves performance comparable with confidence-based early exiting on classification tasks.

Model Structure and Fine-Tuning
We start from a pre-trained Muppet model (the backbone model), attach additional classifiers to it, fine-tune the model, and use it for accelerated inference by early exiting.
Backbone model The backbone model is an n-layer pre-trained Muppet model. We denote the i-th layer hidden state corresponding to the [CLS] token as h_i:

h_i = f_i(x; θ_1, ..., θ_i),    (1)

where x is the input sequence, θ_i denotes the parameters of the i-th transformer layer, and f_i is the mapping from the input to the i-th layer hidden state.
Classifiers In the original BERT paper (Devlin et al., 2019), the way to fine-tune is to attach a classifier to the final transformer layer, and then to jointly update both the backbone model and the classifier. The classifier is a one-layer fully-connected network. It takes as input the final layer hidden state h_n and outputs a prediction: a probability distribution over all classes for classification tasks, and a scalar for regression tasks (for naming consistency, we still refer to this one-layer network as a classifier in the regression case).
To enable early exiting, we instead attach a classifier to every transformer layer, i.e., there are n classifiers in total. Each classifier can make its own prediction, and therefore the model can accelerate inference by exiting earlier.
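As a minimal sketch of this multi-output structure (not the actual implementation; numpy stand-ins replace real transformer layers, and all sizes are toy values), the per-layer classifiers can be attached as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers, n_classes = 8, 4, 2  # toy sizes; BERT-base uses d = 768, 12 layers

# Stand-ins for transformer layers and the per-layer classifiers g_i.
layer_weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
clf_weights = [rng.standard_normal((n_classes, d)) / np.sqrt(d) for _ in range(n_layers)]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_all_exits(x):
    """Return every intermediate classifier's prediction for one sample."""
    h = x  # stand-in for the [CLS] hidden state
    outputs = []
    for W, C in zip(layer_weights, clf_weights):
        h = np.tanh(W @ h)               # toy stand-in for one transformer layer
        outputs.append(softmax(C @ h))   # classifier attached to this layer
    return outputs

preds = forward_all_exits(rng.standard_normal(d))
```

During accelerated inference, forward propagation would simply stop as soon as one of these per-layer outputs triggers an exit, instead of computing all of them.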
Fine-tuning strategies We now discuss how to fine-tune this multi-output network. The loss function for the i-th layer classifier is

L_i = H(g_i(h_i; w_i), y),    (2)

where x and y are the input sequence and the corresponding label, g_i the i-th layer's classifier, w_i the parameters of g_i, and H the task-specific loss function, e.g., cross-entropy for classification tasks and mean squared error (MSE) for regression tasks. The most straightforward fine-tuning strategy is perhaps to minimize the sum of all classifiers' loss functions and jointly update all parameters in the process. We refer to this strategy as Joint; it is also used in RTJ (Schwartz et al., 2020):

min_{θ_1,...,θ_n; w_1,...,w_n} Σ_{i=1}^{n} L_i.    (3)

If we hope to preserve the best model quality for the final layer, the desired fine-tuning strategy is Two-stage, which is also used in DeeBERT (Xin et al., 2020a). The first stage is identical to vanilla BERT fine-tuning: updating the backbone model and only the final classifier. In the second stage, we freeze all parameters updated in the first stage and fine-tune the remaining classifiers. The objectives of the two stages are as follows.
Stage 1: min_{θ_1,...,θ_n; w_n} L_n,    (4)
Stage 2: min_{w_1,...,w_{n-1}} Σ_{i=1}^{n-1} L_i.    (5)
These two fine-tuning strategies are not ideal. Intuitively, in this multi-output network, the loss functions of different classifiers interfere with each other in a negative way: transformer layers have to provide hidden states for two competing purposes, immediate inference at the adjacent classifier and gradual feature extraction for future classifiers. Therefore, achieving a balance between the classifiers is critical. Two-stage produces final classifiers of optimal quality at the price of the earlier layers, since most parameters are optimized solely for the final classifier. Joint treats all classifiers equally, and therefore its final classifier is less effective than that of Two-stage. To combine their advantages, we propose a novel fine-tuning strategy, Alternating. It alternates between two objectives (taken from Equations 3 and 4) for odd-numbered and even-numbered iterations.
Odd:  min_{θ_1,...,θ_n; w_1,...,w_n} Σ_{i=1}^{n} L_i,    (6)
Even: min_{θ_1,...,θ_n; w_n} L_n.    (7)

Combining objectives from Joint and Two-stage, Alternating has the potential to find the most preferable region in the parameter space: the intersection of the optimal regions for different layers.
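The alternating schedule can be sketched as a simple objective selector (a sketch only: which of the two objectives lands on odd vs. even steps is an illustrative choice, and the per-layer losses are hypothetical values):

```python
def alternating_loss(layer_losses, step):
    """Alternating fine-tuning objective (sketch).

    Odd steps use the Joint-style sum over all classifiers; even steps use
    only the final classifier's loss, as in vanilla fine-tuning. The
    odd/even assignment here is illustrative.
    """
    if step % 2 == 1:            # odd iteration: sum of all layer losses
        return sum(layer_losses)
    return layer_losses[-1]      # even iteration: final classifier only

losses = [0.9, 0.7, 0.5, 0.3]    # hypothetical per-layer losses L_1..L_n
```

An optimizer step would then minimize `alternating_loss(losses, step)` at each iteration, so gradients flow to all classifiers on half the steps and only to the final classifier on the other half.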

Exiting Decision Making
After the entire model (including the backbone and all classifiers) is fine-tuned, it can perform early exiting for an inference sample. In this section we discuss two methods to make exiting decisions.

Confidence Threshold
When the model is "certain" enough of its prediction at an intermediate layer, the forward inference can be terminated.
For classification tasks, a straightforward measurement of prediction certainty is the maximum probability of the output prediction, referred to as confidence in previous work (Schwartz et al., 2020). Similarly, Xin et al. (2020a) use entropy as the metric, which is closely related to confidence. Before inference starts, a confidence threshold is chosen. During forward propagation, the confidence of the output at each layer is compared with the threshold; if it exceeds the threshold at a certain layer, the sample exits there and the remaining layers are skipped.
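This exit rule can be sketched in a few lines (the per-layer distributions below are hypothetical values for one sample):

```python
def confidence_exit(layer_probs, threshold):
    """Exit at the first layer whose max probability exceeds the threshold;
    fall back to the final layer otherwise. Returns (exit_layer, prediction)."""
    for i, probs in enumerate(layer_probs, start=1):
        if max(probs) > threshold:
            return i, probs.index(max(probs))
    last = layer_probs[-1]
    return len(layer_probs), last.index(max(last))

# Hypothetical per-layer output distributions for one sample:
dists = [[0.6, 0.4], [0.8, 0.2], [0.95, 0.05]]
```

Raising the threshold trades efficiency for quality: more samples travel deeper before exiting.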

Learning to Exit
While using a confidence or entropy threshold is straightforward and effective, it exploits the fact that the classifier's output is a probability distribution in classification tasks. This is generally not the case for other tasks such as regression. To address this gap, we propose learning-to-exit (LTE) as a substitute when such a distribution is unavailable.
The i th layer hidden state h i is a vector in the embedding space. Intuitively, different regions of the embedding space have different certainty levels. For instance, in binary classification tasks, regions closer to the decision boundary have a lower certainty level, and this is explicitly expressed as a lower confidence of the output probability distribution. But even when certainty cannot be explicitly measured, we can still train an auxiliary LTE module to estimate such a metric.
Concretely, the LTE module is a simple one-layer fully-connected network. It takes as input the hidden state h_i and outputs the certainty level u_i of the sample at the i-th layer:

u_i = σ(c · h_i + b),    (8)

where σ is the sigmoid function, c is the weight vector, and b is the bias term. The loss function for the LTE module is a simple MSE between u_i and the "ground truth" certainty level ũ_i at the i-th layer:

J_i = (u_i − ũ_i)²,    (9)

For classification, the ground truth certainty level is whether the classifier makes the correct prediction:

ũ_i = 1[argmax_j g_i^(j)(h_i; w_i) = y],    (10)

where g_i is the output probability distribution at the i-th layer and g_i^(j) is its j-th entry. For regression, the ground truth certainty level is negatively related to the prediction's absolute error:

ũ_i = 1 − |g_i(h_i; w_i) − y|.    (11)

To apply LTE, we initialize the LTE module together with the classifiers, and it is shared among all layers. We train the LTE module jointly with the rest of the model by substituting L_i in Equations 3-7 with L_i + J_i. At inference time, if the predicted certainty level is higher than the chosen threshold, the inference sample exits early.
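A minimal sketch of the LTE module and its "ground truth" certainty targets (the regression target 1 − |error| is one plausible reading of "negatively related to the absolute error"; clipping it to [0, 1] is our addition for illustration):

```python
import math

def lte_certainty(h, c, b):
    """One-layer LTE module: u_i = sigmoid(c . h_i + b)."""
    score = sum(ci * hi for ci, hi in zip(c, h)) + b
    return 1.0 / (1.0 + math.exp(-score))

def target_certainty_classification(probs, label):
    """Ground truth: 1 if the layer's argmax prediction is correct, else 0."""
    return 1.0 if probs.index(max(probs)) == label else 0.0

def target_certainty_regression(pred, label):
    """Ground truth: negatively related to absolute error (clipped to [0, 1])."""
    return max(0.0, 1.0 - abs(pred - label))
```

The MSE between `lte_certainty(...)` and the corresponding target gives the per-layer loss J_i that is added to L_i during fine-tuning.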

Setup
We conduct experiments on six classification datasets of the GLUE benchmark (Wang et al., 2018a); since GLUE contains only one regression dataset, STS-B (Cer et al., 2017), we additionally use another regression dataset, SICK (Marelli et al., 2014). Statistics of these datasets are listed in Table 1. Our implementation is adapted from the Huggingface Transformer Library (Wolf et al., 2020). We conduct searches over experiment settings such as the optimizer, learning rates, hidden state sizes, and dropout probabilities, and find that it is best to keep the original settings from the library. Random seeds are also left unchanged from the library for fair comparisons. Most results in this paper use the dev split, since the GLUE evaluation server does not permit the large number of evaluations we need. The only exception is Table 2, where we report model quality-efficiency trade-offs on the test split.

Layer-wise Scores Comparison
We discuss three fine-tuning strategies in Section 3: Joint (also used in RTJ), Two-stage (also used in DeeBERT), and Alternating (proposed in this paper). In tables and figures, they are labeled as JOINT, 2STG, and ALT, respectively. Figures 2 and 3 compare the three fine-tuning strategies by showing their layer-wise score curves: each point on a curve shows the output score at a certain exit layer, i.e., all samples are required to exit at this layer for evaluation. More specifically, we report relative scores; the 100% baseline is the original score of the vanilla Muppet without early exiting, which is also the final-layer score of Two-stage because of the parameter freezing in its second stage.
For BERT BASE , we show plots for all six classification datasets, ordered by training set size from smallest to largest. As we will see in later analyses, low-resource datasets show the largest differences. Therefore, for RoBERTa BASE and ALBERT BASE , we only show plots for RTE and MRPC (whose training sets are smaller than 6% of the others') due to space limitations; results for the other datasets are in Appendix C. We observe the following from the figures:
• Two-stage is unsatisfying. While it achieves the best score at the final layer, this comes at a large cost to the other layers, especially for non-low-resource datasets.
• Alternating is better than Joint in later layers, and weaker in earlier layers. However, as we will see in the next section, when we evaluate the quality-efficiency trade-offs of confidence-based early exiting, the weakness of Alternating in earlier layers is no longer substantial, while its advantage is preserved.
• The difference between Joint and Alternating is larger for low-resource datasets, where the training set is insufficient to fine-tune all layers well simultaneously.
• Interestingly, for ALBERT BASE , Alternating's relative scores are higher than 100% in the final layers. We speculate that this is because of the parameter sharing nature of ALBERT and the small sizes of the datasets: better supervision for intermediate layers also helps the final layer.

Early Exiting Trade-offs Comparison
From the previous section, we see that Two-stage is visibly less preferable than the other two. Therefore, in this section, we compare the quality-efficiency trade-offs of Joint and Alternating when a confidence threshold is used for making exiting decisions. Specifically, we use the average exit layer over all inference samples as the efficiency metric, for the following reasons: (1) it is linear w.r.t. the actual amount of computation; (2) according to our experiments, it is proportional to actual wall-clock runtime, and it is also stable across different runs, whereas direct runtime measurements are frequently affected by other processes on the same machine (detailed discussions can be found in Appendix D). We visualize the trade-offs in Figures 4 and 5, and also show detailed numbers in Table 2 using results from the test set. Dots in the figures and ALT rows in the table are generated by varying the confidence threshold; the thresholds are chosen to show trade-offs at different average exit layers. In addition to the comparison between Joint and Alternating, we add another strong baseline, DistilBERT (Sanh et al., 2019). We apply Alternating fine-tuning and early exiting on top of DistilBERT (labeled as DB+ALT); the rightmost point of the curve is DistilBERT itself without early exiting (the green marker). Observations from the table and figures are as follows:
• On the test set, early exiting with Alternating fine-tuning saves a large amount of inference computation with only minimal quality degradation, compared with vanilla Muppets.
• Compared with Joint, Alternating inherits its benefits from the previous section: better trade-offs at higher scores (larger average exit layers). Additionally, its improvements are larger on smaller datasets.
• Alternating's weakness at more aggressive exiting (smaller average exit layer) is minimized.
Taking Figure 4 as an example, we report the area of one curve above the other as a numerical metric: (JOINT over ALT, ALT over JOINT) is (0.4, 13.5) for RTE, (0.9, 8.8) for MRPC, and (0.2, 18.2) for SST-2. The advantage of Alternating indicates that later layers intrinsically contribute more to early exiting performance, partly because the final layer's score is an upper bound for all previous layers (ignoring randomness in training). This shows that the Joint fine-tuning strategy, which treats all layers equally, is not ideal.
• In most cases, Alternating outperforms DistilBERT, which requires distillation during pre-training and is therefore much more resource-demanding.
It also further improves model efficiency on top of DistilBERT, indicating that early exiting is cumulative with other acceleration methods.
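The efficiency metric used above can be sketched directly (assuming, as in the text, that every transformer layer costs the same amount of computation):

```python
def average_exit_layer(exit_layers):
    """Average exit layer over all inference samples (linear in computation)."""
    return sum(exit_layers) / len(exit_layers)

def saved_computation(exit_layers, n_layers):
    """Fraction of transformer-layer computation skipped by early exiting."""
    return 1.0 - average_exit_layer(exit_layers) / n_layers
```

For example, if three samples exit at layers 4, 8, and 12 of a 12-layer model, the average exit layer is 8 and about a third of the layer computation is saved.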

Learning to Exit Performance
To examine the effectiveness of LTE, we apply it on top of models fine-tuned with Alternating. We show the results on four datasets in Figure 6. We use the layer-wise score of Alternating as the baseline: if we want to save x% of inference runtime, a straightforward way is to use the first (100 − x)% of layers for every sample, regardless of its difficulty. LTE is expected to dynamically allocate resources based on a sample's difficulty and therefore to outperform this baseline. From the figures for QNLI and QQP, we observe that the blue curves are substantially above the orange curves, i.e., LTE provides better accuracy-efficiency trade-offs than the layer-wise baseline, achieving the same model quality with less computation. For the regression tasks STS-B and SICK, the layer-wise baseline reaches its maximum score at relatively early layers, leaving little room for LTE to perform. Nevertheless, LTE still outperforms the baseline, especially in earlier layers (note that the y-axis ranges from 0 to 100%).
We also compare LTE with the concurrent patience-based baseline PABEE in Table 3, showing their speedups and average exit layers at the same relative scores. PABEE does not provide exact speedup numbers; we therefore estimate the values from their figures. We can see that Alternating fine-tuning plus LTE is marginally better than PABEE on regression tasks.
We further compare LTE-predicted certainty for each layer with layer-wise scores in Figure 7, where we observe large differences of predicted certainty both within and across layers. Also, predicted certainty is generally positively correlated with scores. This further demonstrates that the LTE module successfully captures certainty information based on the model's hidden state.
LTE extends confidence-based early exiting to tasks other than classification. Furthermore, our LTE module is more straightforward and intuitive than DAT (Elbayad et al., 2020), yet achieves comparable results on classification tasks.

Prediction Confidence as a Probe
So far, we have regarded prediction confidence only as something produced by the black-box model and used it for making early exiting decisions. In this section, we show an example of how confidence relates to a human-interpretable feature, demonstrating its potential to reveal the inner mechanisms of Muppet models. We choose two datasets, MRPC and QQP, where the task is to predict whether two input sequences are semantically equivalent. Intuitively, the BLEU score (Papineni et al., 2002) between the two sequences, which measures n-gram matching, may be related to the prediction. At each output layer, we first divide all dev set samples into two subsets by whether they are predicted as positive or negative; then, we calculate the BLEU-4 score for each sample and the Pearson correlation between BLEU scores and confidence within each subset; finally, we compare the correlations for both subsets at each layer, along with the layer-wise relative scores, in Figure 8.
We notice that BLEU scores and predicted confidence show the strongest correlation in the layers where model quality starts to improve (layers 4-5 in MRPC and 2-3 in QQP). After these layers, the correlation gradually weakens. This suggests that in the early layers, the model relies more on simple features such as n-gram matching for making semantic judgments: the higher the BLEU score, the more certain the model is when making positive predictions and the less certain when making negative ones; with more layers, however, the model acquires the ability to look beyond the BLEU score, reducing its reliance on n-gram matching and achieving better performance. Therefore, with MRPC as an example, analyzing the differences between layers 3 and 4 may reveal how the model detects n-gram matching, and analyzing the differences between layers 4 and 6 may reveal more advanced semantic features learned by the model.
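The core of the probe above is a per-layer Pearson correlation; it can be sketched with a small helper (the BLEU and confidence values below are hypothetical toy data for one subset at one layer, not results from the paper):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation, as used to relate BLEU scores and confidence."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical positive-subset samples at one early layer: higher n-gram
# overlap (BLEU) accompanying higher confidence in a positive prediction.
bleu = [0.1, 0.3, 0.5, 0.7, 0.9]
conf = [0.55, 0.6, 0.7, 0.8, 0.9]
r = pearson(bleu, conf)
```

Running this per subset and per layer, and plotting `r` against the layer index, reproduces the shape of the analysis described above.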

Conclusion and Future Work
To improve early exiting for Muppets, we present BERxiT, including the Alternating fine-tuning strategy which outperforms methods from previous papers and the LTE idea which extends early exiting to a broader range of tasks. Experiments show the effectiveness of Alternating in providing better quality-efficiency trade-offs and the successful application of LTE to regression tasks. They also show that early exiting is cumulative with other acceleration methods such as DistilBERT and has the potential for model interpretation.
Future Work The fundamental question of early exiting for Muppets is how many transformer layers are sufficient for making good predictions. We draw inspiration from the Limit performance of Muppets: the score of Limit at the i-th layer is obtained by taking the first i transformer layers from the pre-trained Muppet model, attaching a classifier to the i-th layer, and fine-tuning this single-output model. Limit estimates the upper bound for any fine-tuning method by removing inter-classifier interference.

Figure 9: Comparison between LIMIT of BERT BASE (brown) and BERT LARGE (blue). Red arrow: difference between the two models when they both use 12 layers.

We compare the Limit performance of BERT BASE and BERT LARGE in Figure 9, and notice that with the same number of layers (and an identical fine-tuning strategy), BERT BASE almost always outperforms BERT LARGE by a large margin. This suggests that most transformer layers' potential to provide information for early exiting is limited by the single-output nature of pre-training. If we want to further improve early exiting Muppets for better trade-offs, adding more exiting paths in pre-training would be a promising direction.

A Negative Results for Self-Distillation
FastBERT does not provide results for BERT LARGE or RoBERTa LARGE . We show BERT LARGE and RoBERTa LARGE layer-wise scores for the different fine-tuning strategies in Figure 10, where SD in the legend stands for self-distillation. We can see that for Two-stage and Alternating, the patterns are similar to those of BERT BASE : Alternating is better in earlier layers while Two-stage is better in later layers.
However, self-distillation's behavior is inconsistent across models and datasets. While it performs as expected for BERT LARGE on SST-2 and MNLI, and for RoBERTa LARGE on MRPC, self-distillation fails to improve after the first few layers for BERT LARGE on MRPC and for RoBERTa LARGE on SST-2 and MNLI, and most layers' quality is considerably worse than with Alternating. We therefore consider self-distillation an unstable and premature fine-tuning strategy.

B Additional Experiment Setting
For pre-trained models, we use the following ones provided by the Huggingface Transformer Library (Wolf et al., 2020) as backbone models:
• DISTILBERT-BASE-UNCASED
For BERT, ALBERT, and DistilBERT, we fine-tune for 3 epochs; for RoBERTa, we fine-tune for 10 epochs; no early stopping or checkpoint selection is performed.
Experiments are done on a single NVIDIA P100 GPU with CUDA 10.1. For inference, we use a batch size of 1 (since we need to perform early exiting based on each individual sample's difficulty). Inference runtime for the entire dev set for all models and datasets is shown in Table 4. RoBERTa has the same model structure as BERT, and therefore its runtime is also very close to that of BERT. Note that this is affected by competing processes, and may vary between different runs.
Numbers of parameters for the BERT and ALBERT backbone models can be found in the paper by Lan et al. (2020). RoBERTa shares the same model structure as BERT and has the same number of parameters. Numbers of parameters for early-exiting-specific modules, such as the additional classifiers and the LTE module, are on the order of thousands, and are therefore negligible compared with those of the backbone models (millions).

C Additional Experiment Results
In the main paper, we report results of RoBERTa BASE and ALBERT BASE only on the two smallest datasets. Results of the other datasets are provided in Figure 11. We can see that while Two-stage is visibly less preferable, Joint and Alternating are close to each other with larger dataset sizes, and this is the reason why we keep only the low-resource datasets in the main paper.

D Analyses of Efficiency Metric
In our experiments, we use average exit layer as the metric of efficiency for the following three reasons.
It is linear w.r.t. the amount of computation. Inference-time computation in our model occurs in the following parts: the embedding layer, transformer layers, classifiers, and the LTE module (if used). If a layer is chosen, i.e., the exit layer is after it, all components of the layer (transformer, classifier, LTE module) are used and incur computation cost. Additionally, embedding look-up (selecting a column in the matrix) is much faster than the above components (which involve matrix-vector multiplication), and can therefore be neglected.
It is stable across different runs. With a finetuned model, an inference sample's exit layer only depends on the confidence (or LTE-predicted certainty) at each layer and the threshold. On the other hand, direct measurement of wall-clock runtime is frequently affected by competing processes and fluctuates between different runs.
The computation overhead of early exiting is negligible. Given the above reasons, there is only one concern left for using average exit layer as the efficiency metric: how do the additional modules in our model (including the additional classifiers and possibly the LTE module) compare with the transformer layers of the original BERT model? We estimate the FLOPs used in one sample's inference as follows.
Since we will eventually end up with orders-of-magnitude differences, we use big-Theta asymptotic notation for estimation. Most computation is incurred by matrix and vector multiplication. Using the naive implementation, the cost of multiplying two vectors in R^d is Θ(d), and the cost of multiplying a matrix in R^{d_1×d_2} by a vector in R^{d_2} is Θ(d_1 d_2). We denote by d the hidden state size of our model (768 for base models and 1024 for large models), c the number of classes (fewer than 4 in our experiments), n the sequence length (typically in the hundreds), and h the number of heads in multi-head attention (12 for base and 16 for large).
The classifier is a one-layer fully-connected network, mapping a vector in R^d to an output in R^c, so its cost is Θ(cd). Similarly, the cost of the LTE module is Θ(2d), since its output is always a vector in R^2.
The transformer layer mainly consists of multi-head self-attention, a fully-connected layer, and two layer-normalization modules. Layer normalization is much faster than the other two, so we neglect it. The fully-connected layer maps n vectors from R^d to R^d, so its cost is Θ(nd^2). The multi-head attention computes h individual single-head attentions. For each head, mapping the original query, key, and value vectors (in R^d) to head-specific ones (in R^{d/h}) incurs a cost of Θ(nd^2/h); calculating the attention outputs over the n positions incurs a cost of Θ(n^2 d/h). Over all h heads, the cost is Θ(nd^2 + n^2 d), which is Θ(nd^2) since d ≥ n in our setting. Finally, the results of all heads are concatenated and passed through one more linear projection, incurring a cost of Θ(nd^2). The total cost of one transformer layer is therefore Θ(nd^2).
Considering the values of n and d, the classifier and the LTE module are several orders of magnitude cheaper than a transformer layer. Even with advanced algorithms and parallel hardware that may accelerate transformer layers, we can safely conclude that the computation overhead of early exiting is negligible.
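This comparison can be made concrete with a quick back-of-the-envelope computation, plugging the sizes from the text into the leading Θ terms (constant factors are ignored, as in the asymptotic estimates; n = 128 is an assumed representative sequence length):

```python
# Per-layer cost comparison using base-model sizes from the text:
# hidden size d = 768, assumed sequence length n = 128, c = 2 classes.
d, n, c = 768, 128, 2

transformer_cost = n * d * d   # Θ(n d^2): attention projections + FFN dominate
classifier_cost = c * d        # Θ(c d): one fully-connected layer
lte_cost = 2 * d               # Θ(2d): LTE output is always in R^2

# Overhead of one exit point (classifier + LTE) relative to one transformer layer.
overhead_ratio = (classifier_cost + lte_cost) / transformer_cost
```

Even with every layer carrying a classifier and a shared LTE module, the added cost is a tiny fraction of a single transformer layer, which supports using the average exit layer as the efficiency metric.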