Accelerating BERT Inference for Sequence Labeling via Early-Exit

Both performance and efficiency are crucial for sequence labeling in many real-world scenarios. Although pre-trained models (PTMs) have significantly improved the performance of various sequence labeling tasks, their computational cost is expensive. To alleviate this problem, we extend the recent successful early-exit mechanism to accelerate the inference of PTMs for sequence labeling. However, existing early-exit mechanisms are specifically designed for sequence-level tasks rather than sequence labeling. In this paper, we first propose a simple extension of sentence-level early-exit for sequence labeling tasks. To further reduce the computational cost, we also propose a token-level early-exit mechanism that allows a subset of tokens to exit early at different layers. Considering the local dependency inherent in sequence labeling, we employ a window-based criterion to decide whether a token should exit. Token-level early-exit introduces a gap between training and inference, so we add an extra self-sampling fine-tuning stage to alleviate it. Extensive experiments on three popular sequence labeling tasks show that our approach can save 66%∼75% of inference cost with minimal performance degradation. Compared with competitive compressed models such as DistilBERT, our approach achieves better performance under the same speed-up ratios of 2×, 3×, and 4×.


Introduction
Sequence labeling plays an important role in natural language processing (NLP). Many NLP tasks can be cast as sequence labeling, such as named entity recognition (NER), part-of-speech (POS) tagging, Chinese word segmentation (CWS), and semantic role labeling. These tasks are usually fundamental and highly time-sensitive; therefore, apart from performance, their inference efficiency is also very important.
The past few years have witnessed the prevalence of pre-trained models (PTMs) on various sequence labeling tasks (Nguyen et al., 2020; Mengge et al., 2020). Despite their significant improvements on sequence labeling, they are notorious for enormous computational cost and slow inference speed, which hinders their use in real-time or mobile-device scenarios.
Recently, early-exit mechanisms (Schwartz et al., 2020) have been introduced to accelerate inference for large-scale PTMs. In these methods, each layer of the PTM is coupled with a classifier to predict the label of a given instance. At inference time, if the prediction at an early layer is confident enough, the instance is allowed to exit without passing through the entire model. Figure 1(a) illustrates the early-exit mechanism for text classification. However, most existing early-exit methods target sequence-level prediction, such as text classification, in which the prediction and its confidence score are computed over the whole sequence. Therefore, these methods cannot be directly applied to sequence labeling tasks, where predictions are token-level and a confidence score is required for each token.
In this paper, we aim to extend the early-exit mechanism to sequence labeling tasks. First, we propose SENTence-level Early-Exit (SENTEE), a simple extension of existing early-exit methods. SENTEE allows a sequence of tokens to exit together once the maximum uncertainty of the tokens falls below a threshold.

[Figure 1: Early-Exit for Text Classification and Sequence Labeling. (a) early-exit for text classification; (b) sentence-level and (c) token-level early-exit for sequence labeling. The darker a hidden-state block, the deeper the layer it comes from. Due to the space limit, the window-based uncertainty is not reflected.]

Despite its effectiveness, we find it redundant for most tokens to update their representations at every layer. Thus, we propose TOKen-level Early-Exit (TOKEE), which allows the subset of tokens with confident predictions to exit earlier. Figures 1(b) and 1(c) illustrate SENTEE and TOKEE. Considering the local dependency inherent in sequence labeling tasks, we decide whether a token can exit based on the uncertainty of a window of its context rather than the token alone. For tokens that have already exited, we do not update their representations but simply copy them to the upper layers. However, this introduces a train-inference discrepancy. To tackle this problem, we introduce an additional fine-tuning stage that samples each token's halting layer based on its uncertainty and copies its representation to the upper layers during training. We conduct extensive experiments on three sequence labeling tasks: NER, POS tagging, and CWS. Experimental results show that our approach can save 66%∼75% of inference cost with minimal performance degradation. Compared with competitive compressed models such as DistilBERT, our approach achieves better performance under speed-up ratios of 2×, 3×, and 4×.

BERT for Sequence Labeling
Recently, PTMs  have become the mainstream backbone model for various sequence labeling tasks. The typical framework consists of a backbone encoder and a task-specific decoder.
Encoder In this paper, we use BERT (Devlin et al., 2019) as our backbone encoder . The architecture of BERT consists of multiple stacked Transformer layers (Vaswani et al., 2017).
Given a sequence of tokens x_1, · · · , x_N, the hidden states of the l-th Transformer layer are denoted by H^(l) = Transformer_l(H^(l−1)), where H^(0) is the output of the embedding layer.
Decoder Usually, we can predict the label for each token according to the hidden states of the top layer. The probability distribution over labels is predicted by P = f(H^(L) W) ∈ R^{N×C}, where N is the sequence length, C is the number of labels, L is the number of BERT layers, W is a learnable matrix, and f(·) is a simple softmax classifier or a conditional random field (CRF) (Lafferty et al., 2001). Since we focus on inference acceleration and PTMs perform well enough on sequence labeling without a CRF (Devlin et al., 2019), we do not consider using such a recurrent structure.
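As a concrete illustration, the decoder above amounts to a per-token linear projection followed by a softmax. Below is a minimal NumPy sketch; all shapes, names, and values are illustrative, not the released implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax over the label dimension
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: N tokens, hidden size d, C labels.
N, d, C = 5, 8, 3
rng = np.random.default_rng(0)
H_top = rng.normal(size=(N, d))   # hidden states of the top BERT layer, H^(L)
W = rng.normal(size=(d, C))       # learnable projection matrix

P = softmax(H_top @ W)            # label distribution per token, shape (N, C)
labels = P.argmax(axis=-1)        # predicted label index for each token
```

Each row of P is a probability distribution over the C labels for one token.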

Early-Exit for Sequence Labeling
The inference speed and computational cost of PTMs are crucial bottlenecks that hinder their application in many real-world scenarios. In many tasks, the representations at an earlier layer of a PTM are usually adequate to make a correct prediction. Therefore, early-exit mechanisms (Schwartz et al., 2020) have been proposed to dynamically stop inference on the backbone model and make predictions with intermediate representations.
However, these existing early-exit mechanisms are built for sentence-level prediction and are unsuitable for the token-level predictions in sequence labeling tasks. In this section, we propose two early-exit mechanisms to accelerate inference for sequence labeling tasks.

Token-Level Off-Ramps
To extend early-exit to sequence labeling, we couple each layer of the PTM with token-level off-ramps, which can be simply implemented as linear classifiers. Once the off-ramps are trained with the gold labels, an instance has a chance to be predicted and exit at an early layer instead of passing through the entire model.
Given a sequence of tokens X = x_1, · · · , x_N, we can make predictions with the injected off-ramps at each layer. For the off-ramp at the l-th layer, the label distribution of the n-th token is predicted by p_n^(l) = f^(l)(h_n^(l) W), where W is a learnable matrix, f^(l) is the token-level off-ramp at the l-th layer, and p_n^(l) ∈ R^C is the predicted label distribution of the n-th token at the l-th off-ramp.
Uncertainty of the Off-Ramp With the prediction for each token at hand, we can calculate the uncertainty of each token as u_n^(l) = − (1/log C) Σ_{c=1}^{C} p_n^(l)(c) log p_n^(l)(c), i.e., the entropy of the predicted distribution normalized into [0, 1], where p_n^(l) is the label probability distribution of the n-th token.
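This uncertainty can be sketched as follows. The exact normalization (dividing the entropy by log C so that values lie in [0, 1]) is our assumption, made so that the thresholds used later are comparable across label sets:

```python
import numpy as np

def uncertainty(p, eps=1e-12):
    """Normalized entropy of a label distribution p with shape (..., C).
    Lies in [0, 1]: 0 for a one-hot (fully confident) prediction,
    1 for a uniform (fully uncertain) one."""
    C = p.shape[-1]
    return -(p * np.log(p + eps)).sum(axis=-1) / np.log(C)

confident = np.array([0.98, 0.01, 0.01])
unsure = np.array([1 / 3, 1 / 3, 1 / 3])
assert uncertainty(confident) < 0.2
assert abs(uncertainty(unsure) - 1.0) < 1e-6
```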

Early-Exit Strategies
In the following sections, we will introduce two early-exit mechanisms for sequence labeling, at sentence-level and token-level.

SENTEE: Sentence-Level Early-Exit
Sentence-Level Early-Exit (SENTEE) is a simple extension of existing early-exit approaches to sequence labeling tasks. SENTEE allows a sequence of tokens to exit together if their uncertainty is low enough. The key is therefore to aggregate the per-token uncertainties into an overall uncertainty for the whole sequence. Here we adopt a straightforward but effective method, max-pooling over the uncertainties of all tokens: u^(l) = max_{1≤n≤N} u_n^(l), where u^(l) represents the uncertainty of the whole sentence. If u^(l) < δ, where δ is a pre-defined threshold, we let the sentence exit at layer l. The intuition is that the whole sequence can exit only when the model is confident in its prediction for the most difficult token.
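The SENTEE exit decision is a one-liner once per-token uncertainties are available. A minimal sketch (the values are illustrative):

```python
import numpy as np

def sentee_exit(token_uncertainties, delta):
    """The sentence exits at this layer iff the *maximum* token
    uncertainty is below the threshold delta (max-pooling)."""
    return float(np.max(token_uncertainties)) < delta

u = np.array([0.05, 0.12, 0.31, 0.08])
assert not sentee_exit(u, delta=0.3)  # hardest token (0.31) blocks the exit
assert sentee_exit(u, delta=0.4)      # with a looser threshold, all exit
```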

TOKEE: Token-Level Early-Exit
Despite the effectiveness of SENTEE (see Table 1), we find it redundant for most simple tokens to be fed into the deep layers. Simple tokens that have been correctly predicted at a shallow layer cannot exit under SENTEE because the uncertainty of a small number of difficult tokens is still above the threshold. Thus, to further accelerate inference for sequence labeling tasks, we propose token-level early-exit (TOKEE), which allows simple tokens with confident predictions to exit early.
Window-Based Uncertainty Note that a prevalent problem in sequence labeling tasks is local dependency (or label dependency): the label of a token heavily depends on the tokens around it. Therefore, the uncertainty of a token should be calculated based not only on the token itself but also on its context. Motivated by this, we propose a window-based uncertainty criterion to decide whether a token exits at the current layer. In particular, the window-based uncertainty of token x_n at the l-th layer is defined as ũ_n^(l) = max_{max(1, n−k) ≤ i ≤ min(N, n+k)} u_i^(l), where k is a pre-defined window size. We then use ũ_n^(l), instead of u_n^(l), to decide whether the n-th token can exit at layer l. Note that window-based uncertainty is equivalent to sentence-level uncertainty when k equals the sentence length.
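The window-based criterion is a max-pooling of uncertainties over a local window, clipped at the sentence boundaries. A small sketch with illustrative values:

```python
import numpy as np

def window_uncertainty(u, k):
    """Max-pool each token's uncertainty over a window of k tokens
    on each side, clipped at the sentence boundaries."""
    N = len(u)
    return np.array([u[max(0, n - k): n + k + 1].max() for n in range(N)])

u = np.array([0.1, 0.05, 0.6, 0.05, 0.1])
uw = window_uncertainty(u, k=1)
# Tokens 1 and 3 inherit the uncertain neighbor's score (0.6), so they
# will not exit even though their own uncertainty is low.
assert uw.tolist() == [0.1, 0.6, 0.6, 0.6, 0.1]
```

With k=0 this degenerates to token-independent uncertainty, matching the ablation discussed later.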
Halt-and-Copy For tokens that have exited, their representations are not updated in the upper layers; instead, the hidden states of exited tokens are directly copied to the upper layers. Such a halt-and-copy mechanism is intuitive in two respects:
• Halt. If the uncertainty of a token is very small, there is little chance that its prediction will change in the following layers, so it is redundant to keep updating its representation.
• Copy. If the representation of a token can be classified into a label with high confidence, then the representation already contains the label information, so we can directly copy it to the upper layers to help predict the labels of other tokens.
Exited tokens no longer attend to other tokens at upper layers but can still be attended to by other tokens, so part of the layer-specific query projections in the upper layers can be omitted. In this way, the computational complexity of self-attention is reduced from O(N²d) to O(MNd), where M is the number of tokens that have not yet exited. The halt-and-copy mechanism is also similar to the multi-pass sequence labeling paradigm, in which tokens are labeled in order of difficulty (easiest first). However, the copy mechanism introduces a train-inference discrepancy: during training, a layer never processes representations copied from its non-adjacent lower layers. To alleviate this discrepancy, we propose an additional fine-tuning stage, discussed in Section 3.3.2.
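Putting the pieces together, a toy TOKEE inference loop might look as follows. This is a simplified sketch, not the authors' implementation: dense matrix multiplies stand in for Transformer blocks, and exited tokens are frozen and copied while their states stay visible to the surviving tokens (analogous to being attendable but not attending):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def uncertainty(p, eps=1e-12):
    # normalized entropy in [0, 1]
    return -(p * np.log(p + eps)).sum(axis=-1) / np.log(p.shape[-1])

def window_max(u, k):
    return np.array([u[max(0, i - k): i + k + 1].max() for i in range(len(u))])

def tokee_infer(H0, layers, W, delta, k):
    """Toy TOKEE loop: at each layer, update only surviving tokens,
    run the off-ramp, and halt tokens whose window-based uncertainty
    drops below delta; halted states are copied unchanged upward."""
    H = H0.copy()
    N = H.shape[0]
    exited = np.zeros(N, dtype=bool)
    labels = np.full(N, -1)
    for layer in layers:
        H[~exited] = layer(H)[~exited]   # halted rows are merely copied
        p = softmax(H @ W)               # off-ramp prediction at this layer
        u = window_max(uncertainty(p), k)
        newly = (~exited) & (u < delta)
        labels[newly] = p[newly].argmax(axis=-1)
        exited |= newly
        if exited.all():
            break                        # the whole sentence exited early
    labels[~exited] = softmax(H[~exited] @ W).argmax(axis=-1)
    return labels

rng = np.random.default_rng(1)
d, C, N = 8, 3, 6
H0 = rng.normal(size=(N, d))
mats = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4)]
layers = [lambda H, M=M: H @ M for M in mats]
W = rng.normal(size=(d, C))
labels = tokee_infer(H0, layers, W, delta=0.9, k=1)
assert labels.shape == (N,) and (labels >= 0).all()
```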

Model Training
In this section, we describe the training process of our proposed early-exit mechanisms.

Fine-Tuning for SENTEE
For sentence-level early-exit, we follow prior early-exit work for text classification and jointly train the added off-ramps. For each off-ramp, the loss function is L^(l) = (1/N) Σ_{n=1}^{N} H(y_n, p_n^(l)), where H is the cross-entropy loss function and N is the sequence length. The total loss for each sample is a weighted sum of the losses of all off-ramps, L = Σ_{l=1}^{L} w_l L^(l), where w_l is the weight of the l-th off-ramp and L is the number of backbone layers. Following prior work, we simply set w_l = l. In this way, the deeper an off-ramp is, the larger the weight of its loss, so all off-ramps can be trained jointly in a relatively balanced way.
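The joint off-ramp objective with w_l = l can be sketched as follows; some implementations additionally normalize by Σ w_l, but here we keep the plain weighted sum stated above (values are illustrative):

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """Mean token-level cross-entropy H(y_n, p_n) over one sequence.
    p: (N, C) predicted distributions; y: (N,) gold label indices."""
    return float(-np.log(p[np.arange(len(y)), y] + eps).mean())

def joint_offramp_loss(probs_per_layer, y):
    """Weighted sum of per-off-ramp losses with w_l = l, so deeper
    off-ramps receive larger weights."""
    L = len(probs_per_layer)
    w = np.arange(1, L + 1, dtype=float)
    losses = np.array([cross_entropy(p, y) for p in probs_per_layer])
    return float((w * losses).sum())

p_shallow = np.array([[0.5, 0.5], [0.5, 0.5]])  # layer-1 off-ramp, unsure
p_deep = np.array([[0.9, 0.1], [0.1, 0.9]])     # layer-2 off-ramp, confident
y = np.array([0, 1])
loss = joint_offramp_loss([p_shallow, p_deep], y)
```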

Fine-Tuning for TOKEE
Since TOKEE is equipped with halt-and-copy, jointly training the off-ramps as above is not enough, because the model never performs halt-and-copy during training but does at inference. We therefore add a fine-tuning stage that trains the model to consume hidden states copied from different earlier layers, not only from the adjacent previous layer, just as at inference.
Random Sampling A direct way is to uniformly sample the halting layer of each token. However, halting layers at inference are not random but depend on the difficulty of each token in the sequence, so randomly sampling halting layers still leaves a gap between training and inference.
Self-Sampling Instead, we use the fine-tuned model itself to sample the halting layers. For every sample in each training epoch, we randomly sample a window size and a threshold, and then run TOKEE on the trained model under that window size and threshold, without halt-and-copy. This yields an exiting layer for each token, which we use to re-forward the sample, halting and copying each token at the corresponding layer. In this way, the exiting layer of a token reflects its difficulty: the deeper a token's exiting layer, the more difficult the token. Because the exiting layers are sampled by the model itself, the gap between training and inference can be further shrunk. To avoid over-fitting during this further training, we prevent the training loss from decreasing further, similar to the flooding mechanism of Ishida et al. (2020). We also employ the sandwich rule to stabilize this training stage (Yu and Huang, 2019). We compare self-sampling with random sampling in Section 4.4.4.

Datasets

We evaluate on three tasks. NER: CoNLL2003, Ontonotes 4.0, Twitter NER, and Weibo NER; POS: ARK Twitter (Gimpel et al., 2011; Owoputi et al., 2013), CTB5 POS (Xue et al., 2005), and UD POS (Nivre et al., 2016); CWS: CTB5 Seg (Xue et al., 2005) and UD Seg (Nivre et al., 2016). Besides standard benchmark datasets like CoNLL2003 and Ontonotes 4.0, we also choose datasets closer to real-world applications, such as Twitter NER and Weibo in the social media domain, to verify the practical utility of our methods. We use the same dataset preprocessing and splits as previous work (Huang et al., 2015; Mengge et al., 2020; Jia et al., 2020; Nguyen et al., 2020).

Baseline
We compare our methods with three baselines:
• BiLSTM-CRF (Huang et al., 2015; Ma and Hovy, 2016): the most widely used sequence labeling model before pre-trained language models prevailed in NLP.
• BERT: the powerful stacked Transformer encoder pre-trained on large-scale corpora, which we use as the backbone of our methods.
• DistilBERT: the most well-known distilled version of BERT. Huggingface released a 6-layer DistilBERT for English (Sanh et al., 2019). For comparison, we distill {3, 4}-layer DistilBERT for English and {3, 4, 6}-layer DistilBERT for Chinese using the same method.

Hyper-Parameters
For all datasets, we use a batch size of 10 and perform a grid search over learning rates in {5e-6, 1e-5, 2e-5}, choosing the learning rate and model on the development set. We use the AdamW optimizer (Loshchilov and Hutter, 2019). The warmup proportion and weight decay are set to 0.05 and 0.01, respectively.

Main Results
For English datasets, we use the 'BERT-base-cased' model released by Google (Devlin et al., 2019) as the backbone. For Chinese datasets, we use 'BERT-wwm' released by Cui et al. (2019). The DistilBERT models are distilled from the backbone BERT.
To compare fairly with the baselines, we tune the speedup ratio of our methods to match the corresponding static baseline. We report the average performance over 5 runs with different random seeds. The overall results are shown in Table 1, where the speedup is measured relative to the backbone. Both SENTEE and TOKEE bring little performance drop and outperform DistilBERT at a speedup ratio of 2×, matching the effect of existing early-exit methods for text classification. Under the higher speedups of 3× and 4×, SENTEE shows its weakness, while TOKEE still maintains a certain level of performance. Across the 2∼4× speedup ratios, TOKEE has a smaller performance drop than DistilBERT. Moreover, on datasets where BERT clearly outperforms BiLSTM-CRF, e.g., Chinese NER, TOKEE (4×) on BERT still outperforms BiLSTM-CRF significantly. This indicates its potential utility in complicated real-world scenarios.
To explore the fine-grained performance change under different speedup ratios, we visualize the speedup-performance trade-off curves on 6 datasets in Figure 2. We observe that:
• Before the speedup ratio reaches a certain turning point, there is almost no performance drop; after that, performance degrades gradually. This shows our methods preserve the advantage of existing early-exit methods.
• As the speedup rises, TOKEE reaches the turning point later than SENTEE, and after both methods pass it, SENTEE's performance degrades more drastically than TOKEE's. Both observations indicate the higher speedup ceiling of TOKEE.
• On some datasets, such as CoNLL2003, we observe a slight performance improvement at low speedup ratios, which we attribute to the regularization brought by early-exit, such as alleviating overthinking (Kaya et al., 2019).
To verify the versatility of our method across different PTMs, we also conduct experiments on two well-known BERT variants, RoBERTa and ALBERT, as shown in Table 2. SENTEE and TOKEE also significantly outperform exiting statically at an internal layer of the backbone on three representative datasets of the corresponding tasks. For RoBERTa and ALBERT, we likewise observe that TOKEE outperforms SENTEE under high speedup ratios.

Analysis
In this section, we conduct a set of detailed analysis on our methods.

The Effect of Window Size
We show the performance change under different k in Figure 3, keeping the speedup ratio consistent. We observe that when k is 0, i.e., using token-independent uncertainty rather than window-based uncertainty, performance is almost the lowest across different speedup ratios, because the local dependency in sequence labeling is ignored. In detail, we verify the window-based uncertainty and its hyperparameter k on CoNLL2003, as shown in Figure 4. For the uncertainty, we take the 4th and 8th off-ramps and calculate their accuracy in each uncertainty interval with k=2. The result in Figure 4(a) indicates that the lower the window-based uncertainty, the higher the accuracy, similar to text classification. For k, we fix the threshold at 0.3 and calculate the accuracy of tokens whose window-based uncertainty is smaller than the threshold under different k, as shown in Figure 4(b). The results show that, as k increases: (1) The accuracy of the screened tokens is higher. This shows that the wider a token's low-uncertainty neighborhood, the more accurate the token's prediction, which verifies the validity of the window-based uncertainty strategy.
(2) The accuracy improvement slows down. This reveals the low relevance between the uncertainties of distant tokens and explains why a large k performs poorly under high speedup ratios: it does not make exiting more accurate but merely slows it down.

Influence of Sequence Length
Transformer-based PTMs, e.g., BERT, face a challenge in processing long text due to the O(N²d) computational complexity of self-attention. We compare the speedup of SENTEE and TOKEE across different sequence lengths in Figure 5. We observe that TOKEE saves a stable amount of computation as the sentence length increases, whereas SENTEE's speedup ratio gradually decreases. An intuitive explanation: a longer sentence has more tokens, so it is harder for the model to produce confident predictions for all of them at the same layer. This comparison reveals the potential of TOKEE for accelerating long-text inference.

Effects of Self-Sampling Fine-Tuning
To verify the effect of self-sampling fine-tuning (Section 3.3.2), we compare it with random sampling and with no extra fine-tuning on CoNLL2003. The performance-speedup trade-off curves of TOKEE are shown in Figure 6: self-sampling is consistently better than random sampling, and the advantage grows as the speedup ratio rises. This shows that self-sampling helps more in closing the gap between training and inference. Without extra fine-tuning, performance deteriorates drastically at high speedup ratios, though it roughly retains a certain capability at low speedup ratios, which we attribute to the residual connections of the PTM; similar results were reported by Veit et al. (2016).

Layer Distribution of Early-Exit
In TOKEE, thanks to the halt-and-copy mechanism, each token goes through a different number of PTM layers according to its difficulty. We show the distribution of tokens' exiting layers, averaged over sentences, under different speedup ratios on CoNLL2003 in Figure 7, along with the average exiting layer of SENTEE under the same speedup ratios. We observe that as the speedup ratio rises, more tokens exit at earlier layers, yet a few tokens still go through the deeper layers even at 4×; meanwhile, SENTEE's average exiting layer drops to 2.5, where the PTM's encoding power is severely cut down. This gives an intuitive explanation of why TOKEE is more effective than SENTEE under high speedup ratios: although both can dynamically adjust the computational cost per sample, TOKEE does so at a finer granularity.

Related Work
PTMs are powerful but computationally expensive, and many attempts have been made to accelerate them. One line of work reduces model size, such as distillation (Sanh et al., 2019; Jiao et al., 2020), structural pruning (Michel et al., 2019; Fan et al., 2020), and quantization (Shen et al., 2020). Another line is early-exit, which dynamically adjusts the number of encoding layers for different samples (Schwartz et al., 2020; Li et al., 2020). While these works introduce early-exit mechanisms for simple classification tasks, our methods target the more complicated scenario of sequence labeling, where there is more than one prediction per instance and the dependency among token exits must be considered. Elbayad et al. (2020) proposed the Depth-Adaptive Transformer to accelerate machine translation. However, their early-exit mechanism is designed for auto-regressive sequence generation, in which tokens must exit in left-to-right order, making it unsuitable for language understanding tasks. Different from their method, our early-exit mechanism considers the exit of all tokens simultaneously.

Conclusion and Future Work
In this work, we propose two early-exit mechanisms for sequence labeling: SENTEE and TOKEE. The former is a simple extension of sequence-level early-exit, while the latter is specially designed for sequence labeling and allocates computational cost at a finer granularity. We equip TOKEE with window-based uncertainty and self-sampling fine-tuning to make it more robust and faster. Detailed analysis verifies their effectiveness: SENTEE and TOKEE achieve 2× and 3∼4× speedup, respectively, with minimal performance drop.
For future work, we wish to explore: (1) leveraging exited tokens' label information to help decide the exit of the remaining tokens; (2) introducing CRF or other global decoding methods into early-exit for sequence labeling.