COST-EFF: Collaborative Optimization of Spatial and Temporal Efficiency with Slenderized Multi-exit Language Models

Transformer-based pre-trained language models (PLMs) mostly suffer from excessive overhead despite their advanced capacity. For resource-constrained devices, there is an urgent need for a spatially and temporally efficient model that retains the major capacity of PLMs. However, existing statically compressed models are unaware of the diverse complexities between input instances, potentially resulting in redundancy for simple inputs and inadequacy for complex ones. Also, miniature models with early exiting encounter challenges in the trade-off between making predictions and serving the deeper layers. Motivated by such considerations, we propose a collaborative optimization for PLMs that integrates static model compression and dynamic inference acceleration. Specifically, the PLM is slenderized in width while the depth remains intact, complementing layer-wise early exiting to speed up inference dynamically. To address the trade-off of early exiting, we propose a joint training approach that calibrates slenderization and preserves contributive structures to each exit instead of only the final layer. Experiments are conducted on the GLUE benchmark and the results verify the Pareto optimality of our approach at a high compression and acceleration rate, with 1/8 parameters and 1/19 FLOPs of BERT.


Introduction
Pre-training generalized language models and fine-tuning them on specific downstream tasks has become the dominant paradigm in natural language processing (NLP) since the advent of Transformers (Vaswani et al., 2017) and BERT (Devlin et al., 2019). However, pre-trained language models (PLMs) are predominantly designed to be vast in the pursuit of model capacity and generalization. As a consequence, the model storage and inference time of PLMs are usually high, making them challenging to deploy on resource-constrained devices (Sun et al., 2020).
Recent studies indicate that Transformer-based PLMs bear redundancy both spatially and temporally, which comes from the excessive width and depth of the model (Michel et al., 2019; Xin et al., 2021). With static compression methods including network pruning (Xia et al., 2022) and knowledge distillation (Jiao et al., 2020), the spatial overheads of PLMs (i.e., model parameters) can be reduced to a fixed setting. From the perspective of input instances rather than the model, early exiting without passing through all the model layers enables dynamic acceleration at inference time and diminishes the temporal overheads (Zhou et al., 2020).
However, static compression can hardly find an optimal setting that is both efficient on simple input instances and accurate on complex ones, while early exiting cannot diminish the redundancy in model width and does not reduce the actual volume of the model. Further, interpretability studies indicate that the attention and semantic features across layers are different in BERT (Clark et al., 2019). Hence, deriving a multi-exit model from a pre-trained single-exit model like BERT incurs an inconsistency in the training objective, where each layer simultaneously makes predictions and serves the deeper layers (Xin et al., 2021). Empirically, we find that the uncompressed BERT is not severely influenced by such inconsistency, whereas small-capacity models are not capable of balancing shallow and deep layers. Plugging in exits after compression thus leads to severe performance degradation, which hinders the complementation of the two optimizations.
To fully exploit the efficiency of PLMs and mitigate the above-mentioned issues, we design a slenderized multi-exit model and propose a Collaborative Optimization approach of Spatial and Temporal EFFiciency (COST-EFF), as depicted in Figure 1. Unlike previous works, e.g., DynaBERT (Hou et al., 2020) and CoFi (Xia et al., 2022), which obtain a squat model, we keep the depth intact while slenderizing the PLM in width. The superiority of slender architectures over squat ones is supported by (Bengio et al., 2007) and (Turc et al., 2019) in generic machine learning and PLM design, respectively. To address the inconsistency in a compressed multi-exit model, we first distill a multi-exit BERT from the original PLM that serves as both the teaching assistant (TA) and the slenderization backbone, which is more effective in balancing the trade-off between layers than a compressed model. Then, we propose a collaborative approach that slenderizes the backbone with the calibration of exits. Such a slenderization removes structures that contribute little to each exit, along with the redundancy in width. After the slenderization, task-specific knowledge distillation is conducted on the hidden representations and predictions of each layer as recovery. Specifically, the contributions of this paper are as follows.
• To comprehensively optimize the spatial and temporal efficiency of PLMs, we leverage both static slenderization and dynamic acceleration from the perspective of model scale and variable computation.
• We propose a collaborative training approach that calibrates the slenderization under the guidance of intermediate exits and mitigates the inconsistency of early exiting.
• Experiments conducted on the GLUE benchmark verify the Pareto optimality of our approach. COST-EFF achieves 96.5% of the performance of fine-tuned BERT Base with approximately 1/8 parameters and 1/19 FLOPs, without any form of data augmentation.

Related Work
The compression and acceleration of PLMs were recently investigated to neutralize the overhead of large models by various means.
The structured pruning objects include, from small to large, hidden dimensions (Wang et al., 2020), attention heads (Michel et al., 2019), multi-head attention (MHA) and feed-forward network (FFN) modules (Xia et al., 2022), and entire Transformer layers (Fan et al., 2020). Considering the benefit of the overall structure, we keep all the modules while reducing their sizes. Besides pruning out structures, a fine-grained alternative is unstructured pruning, which prunes individual weights. Unstructured pruning can achieve sparsity as high as 97% (Xu et al., 2022) but is not yet well supported by general computing platforms and hardware.
During the recovery training of compressed models, knowledge distillation objectives include the predictions of classifiers (Sanh et al., 2020), the features of intermediate representations (Jiao et al., 2020) and the relations between samples (Tung and Mori, 2019). Also, the occasion of distillation varies between general pre-training and task-specific fine-tuning (Turc et al., 2019). Distillation enables training without ground-truth labels, complementing data augmentation. In this paper, data augmentation is not leveraged as it requires a long training time, but our approach is well adaptable to it if better performance is to be pursued.
Dynamic early exiting originates from BranchyNet (Teerapittayanon et al., 2016), which introduces exit branches after specific convolution layers of a CNN model. The idea has been adopted for PLMs as Transformer layer-wise early exiting (Xin et al., 2021; Zhou et al., 2020; Liu et al., 2020). However, early exiting only accelerates inference; it does not reduce the model size or the redundancy in width. Furthermore, owing to the inconsistency between shallow and deep layers, it is hard to achieve a high speedup using early exiting alone.
The prevailing PLMs, e.g., RoBERTa (Liu et al., 2019) and XLNet (Yang et al., 2019), are variants of the Transformer with similar overall structures and are well adaptable to the optimizations that we propose. In contrast to PLMs of ever-increasing size, ALBERT (Lan et al., 2020) is distinctive with a small volume of 18M (million) parameters obtained from weight sharing across Transformer layers. Weight sharing allows the model to store the parameters only once, greatly reducing the storage overhead. However, the shared weights contribute nothing to inference speedup. Instead, the time required for ALBERT to achieve BERT-like accuracy increases.

Methodology
In this section, we analyze the major structures of Transformer-based PLMs and devise corresponding optimizations. The proposed COST-EFF has three key properties, namely static slenderization, dynamic acceleration and collaborative training.

Preliminaries
In this paper, we focus on optimizing Transformer-based PLMs, which mainly consist of embedding, MHA and FFN modules. Specifically, the embedding converts each input token to a tensor of size H (i.e., the hidden dimension). With a common vocabulary size of |V| = 30,522, the word embedding matrix accounts for approximately 22% of BERT Base parameters. Inside the Transformer, MHA has four matrices W_Q, W_K, W_V and W_O, all with input and output size H. FFN has two matrices W_FI and W_FO of size H × F. As the key components of the Transformer, MHA and FFN account for approximately 26% and 52% of BERT Base parameters, respectively.
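These shares can be sanity-checked with a rough count of the large weight matrices. The sketch below ignores biases, LayerNorm, position and token-type embeddings and the pooler, so the fractions are approximate:

```python
# Rough parameter accounting for a BERT-Base-like configuration.
# Only the large weight matrices are counted; biases, LayerNorm,
# position/token-type embeddings and the pooler are ignored.
V, H, F, L = 30522, 768, 3072, 12

emb = V * H            # word embedding matrix
mha = L * 4 * H * H    # W_Q, W_K, W_V, W_O per layer
ffn = L * 2 * H * F    # W_FI (H x F) and W_FO (F x H) per layer
total = emb + mha + ffn

shares = {name: p / total for name, p in
          [("embedding", emb), ("MHA", mha), ("FFN", ffn)]}
print({k: round(v, 3) for k, v in shares.items()})
```

Under this coarse count the embedding takes roughly a fifth of the parameters, MHA about a quarter, and FFN about half, matching the proportions quoted above.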
Based on this analysis, we adopt the following slenderization and acceleration schemes. (1) The word embedding matrix W_t is decomposed into the multiplication of two smaller matrices following (Lan et al., 2020); thus, the vocabulary size |V| and hidden size H are not changed. (2) For the transformation matrices of MHA and FFN, structured pruning is adopted to reduce their input or output dimensions. (3) Inference is accelerated through early exiting, as we retain the pre-trained model depth. To avoid introducing additional parameters, we remove the pre-trained pooler matrix before the classifiers. (4) Knowledge distillation on the prediction logits and hidden states of each layer is leveraged as a substitute for conventional fine-tuning. The overall architecture of COST-EFF is depicted in Figure 2.

Matrix Decomposition of Embedding
As mentioned before, the word embedding takes up more than 1/5 of BERT Base parameters. The output dimension of the word embedding equals the hidden size, which we do not modify; instead, we use truncated singular value decomposition (TSVD) to internally compress the word embedding matrix.
TSVD first decomposes the matrix as A_{m×n} = U_{m×m} Σ_{m×n} V_{n×n}, where Σ_{m×n} is the diagonal matrix of singular values. The three matrices are then truncated to a given rank R. Thus, the decomposition of the word embedding is

W_t ≈ W_{t1} W_{t2}, with W_{t1}^{|V|×R} = Ũ Σ̃ and W_{t2}^{R×H} = Ṽ, (1)

where we multiply the truncated Ũ and Σ̃ matrices to form the first embedding matrix, and W_{t2}^{R×H} = Ṽ is a linear transformation with no bias.
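The factorization can be sketched with NumPy as follows; the shapes and rank here are toy values, not the settings used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, R = 1000, 64, 16             # toy vocabulary, hidden size and rank
W_t = rng.standard_normal((V, H))  # stand-in for the word embedding matrix

# Full SVD, then truncation to rank R.
U, S, Vt = np.linalg.svd(W_t, full_matrices=False)
W_t1 = U[:, :R] * S[:R]            # |V| x R, the product of U~ and Sigma~
W_t2 = Vt[:R, :]                   # R x H, a bias-free linear transformation

approx = W_t1 @ W_t2               # rank-R approximation of W_t
params_before = V * H
params_after = V * R + R * H       # parameter count after decomposition
```

With these toy numbers the embedding shrinks from 64,000 to 17,024 parameters, while the hidden size H seen by the rest of the model is unchanged.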

Structured Pruning of MHA and FFN
To compress the matrices in MHA and FFN, which contribute most of the PLM's parameters, we adopt structured pruning to compress one dimension of each matrix. As depicted in Figure 2, the pruning granularities of MHA and FFN are the attention head and the hidden dimension, respectively.
Following (Molchanov et al., 2017), COST-EFF has the pruning objective of minimizing the difference between the pruned and the original model, which is calculated by a first-order Taylor expansion:

∆(S) = Σ_{h_i ∈ S} (∂L/∂h_i) h_i + R_1, (2)

where S denotes a specific structure, i.e., a set of weights, L(·) is the loss function and ∂L/∂h_i is the gradient of the loss with respect to weight h_i. |∆(S)| is the importance of structure S, measured by the absolute value of the first-order term. For simplicity, we ignore the remainder R_1 of the Taylor expansion.
In each Transformer layer, the structure S of MHA is the attention head, while that of FFN is the hidden dimension, as depicted in the lower part of Figure 2. Specifically, the output dimensions of W_Q, W_K, W_V and W_FI are compressed. Conversely, the input dimensions of W_O and W_FO are compressed. Thus, the dimension of the hidden states remains intact in COST-EFF. Also, as a single but drastic pruning would usually cause damage that is hard to recover from, COST-EFF uses iterative pruning (Tan and Motani, 2020), which gradually prunes out insignificant structures.
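The importance score of Equation 2 can be illustrated on a toy linear layer whose output units are grouped into "structures" standing in for attention heads. The grouping and names are illustrative rather than the paper's implementation, and the gradient is computed analytically instead of by autograd:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: 6 output units grouped into 3 structures of 2 units each.
x = rng.standard_normal((8, 4))    # inputs
W = rng.standard_normal((6, 4))    # weights, rows grouped into structures
y = rng.standard_normal((8, 6))    # regression targets

# Loss L = mean((x W^T - y)^2) and its analytic gradient dL/dW.
err = x @ W.T - y
grad = 2.0 * err.T @ x / err.size

# First-order Taylor importance per structure S: |sum_{w_i in S} (dL/dw_i) w_i|
scores = np.abs((grad * W).reshape(3, -1).sum(axis=1))
prune_idx = int(scores.argmin())   # least contributive structure this round
```

In iterative pruning, a step like this is repeated: the least important structures are removed, the model is briefly trained to recover, and the scores are recomputed.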

Inference with Early Exiting
Unlike static compression, early exiting dynamically determines the computation at inference time, depending on the complexity of the inputs and the perplexity of the model. Specifically, we use layer-wise early exiting, as shown in Figure 1, by plugging in a classifier at each Transformer layer.
Following the experimental results of ElasticBERT (Liu et al., 2022), where entropy-based exiting generally outperforms patience-based exiting, we use the entropy of the classifier output as the exit condition, defined as

H(x) = −Σ_{i=1}^{C} p(i) ln p(i),

where C is the number of classes, p(·) is the probability distribution calculated by the softmax function and H(x) is the entropy of the probability distribution x. If the entropy is greater than a given threshold H_T, the model finds it hard to make a prediction at that state. Conversely, the model tends to make a certain prediction when the entropy is small, i.e., when one class clearly dominates the probability distribution.
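The exit condition can be sketched in a few lines; the threshold value and function names below are illustrative:

```python
import math

def entropy(p):
    """H(x) = -sum_i p(i) ln p(i) for a probability distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def should_exit(logits, threshold):
    """Exit early when the classifier's output entropy is at most the threshold H_T."""
    m = max(logits)                          # stabilized softmax
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return entropy(probs) <= threshold
```

A confident prediction such as logits (10, 0) yields near-zero entropy and exits immediately, while a uniform output has entropy ln C and is forwarded to the next layer.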

Training Multiple Exits
When training a model with multiple exits, the loss function of each exit must be taken into account. DeeBERT (Xin et al., 2020) introduced a two-stage training scheme where the backbone model and the exits are trained separately. However, with only the loss of the final classifier and the gradients that back-propagate from it, the shallow layers of the backbone model are not capable of making confident predictions but rather serve the deep layers. Thus, it is necessary in COST-EFF to introduce the losses of the intermediate classifiers both while training and when calculating the Taylor expansion-based structure importance of Equation 2.
To balance the gradients from multiple classifier losses, we use gradient equilibrium following (Li et al., 2019) and scale the gradient of layer k to

∇′_{w_k} L = Σ_{i=k}^{L} ∇_{w_k} L_i / (L − k + 1), (3)

where L is the model depth, ∇_{w_k} L_i is the gradient that propagates from the loss of exit i down to layer k, and ∇′_{w_k} L is the rescaled gradient.
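A minimal numeric sketch of this rescaling, assuming the averaged form of gradient equilibrium (the exact normalization follows Li et al., 2019); plain floats stand in for gradient tensors:

```python
def rescale_layer_gradient(exit_grads_at_k, k, L):
    """Gradient equilibrium at layer k of an L-layer model.

    exit_grads_at_k holds the gradients of the exit losses L_k .. L_L with
    respect to layer k's weights (only exits at or above layer k reach it).
    Averaging keeps shallow layers, which receive many exit losses, from
    accumulating disproportionately large updates.
    """
    assert len(exit_grads_at_k) == L - k + 1
    return sum(exit_grads_at_k) / (L - k + 1)
```

For a 3-layer model, layer 1 averages the gradients of all three exit losses, while layer 3 receives only the final exit's gradient unchanged.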

Training with Knowledge Distillation
The small size and capacity of the compressed model make it hard to restore performance with fine-tuning alone. Knowledge distillation is therefore used as a complement that transfers knowledge from the original teacher model to the compressed student model. In this paper, we distill the predictions and the intermediate features (i.e., hidden states), as depicted in Figure 2.
As the inconsistency between layers is observed (Xin et al., 2021), simply using ground-truth labels to train a compressed multi-exit model would result in severe contradictions. Given this, we first distill the original model into a multi-exit BERT Base model with the same number of layers, which serves as the TA. Then, each layer output of the TA is used as the soft label of the corresponding layer in COST-EFF:

L_pred = Σ_{i=1}^{L} CE(z_i^{TA} / T, z_i^{CE} / T), (4)

where z_i^{TA} and z_i^{CE} are the prediction outputs of the TA and COST-EFF at the i-th layer, respectively, CE(·) is the soft cross-entropy, and T is the temperature factor, usually set to 1. Besides distilling the predictions, COST-EFF distills hidden states to effectively transfer the representations of the TA to the student model. The hidden state outputs, denoted as H_i (i = 0, 1, · · · , L), including the embedding output H_0 and each Transformer layer output, are optimized as

L_feat = Σ_{i=0}^{L} MSE(H_i^{TA}, H_i^{CE}). (5)
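The two objectives can be sketched in NumPy as follows; the soft cross-entropy formulation, reduction choices and function names are assumptions rather than the paper's exact implementation:

```python
import numpy as np

T = 1.0  # distillation temperature

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pred_distill(z_ta, z_ce):
    """Soft cross-entropy between TA and student logits at one exit."""
    p = softmax(z_ta / T)                      # teacher soft labels
    log_q = np.log(softmax(z_ce / T))          # student log-probabilities
    return float(-(p * log_q).sum(axis=-1).mean())

def feat_distill(h_ta, h_ce):
    """MSE between corresponding hidden states (widths match in COST-EFF)."""
    return float(np.mean((h_ta - h_ce) ** 2))

def total_distill_loss(z_ta, z_ce, h_ta, h_ce):
    """L_pred + L_feat summed over the exits and hidden states of each layer."""
    l_pred = sum(pred_distill(a, b) for a, b in zip(z_ta, z_ce))
    l_feat = sum(feat_distill(a, b) for a, b in zip(h_ta, h_ce))
    return l_pred + l_feat
```

By Gibbs' inequality the prediction term is minimized exactly when the student's distribution matches the teacher's, and the feature term vanishes when the hidden states coincide.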

COST-EFF Procedure
As mentioned in Section 3.4.1, COST-EFF first distills the model into a multi-exit TA model with the same number of layers. Specifically, we distill only the predictions at this stage. Although feature distillation is typically more powerful, the representations of the single-exit model are not aligned with the multi-exit model and would introduce inconsistencies during training. Such distillation abstracts away the trivial implementation differences of the PLMs to be compressed, and preliminarily mitigates the inconsistency between layers with a larger and more robust model. The TA model is then used as both the slenderization backbone and the teacher of further knowledge distillation.
During slenderization, we integrate the losses of the exits into the Taylor expansion-based structure importance calculation. Compared to simply using the loss of the final classifier, the multi-exit loss helps calibrate the slenderization by weighing each structure's contribution to every subsequent exit instead of only the final layer. In this way, the trade-off between layers can be better balanced in the slenderized model. After slenderization, the recovery training is a layer-wise knowledge transfer from the TA to COST-EFF with the objective of minimizing the sum of L_pred and L_feat, which mitigates the contradictions of ground-truth label training on the slenderized multi-exit model.

Overall Results
The results of COST-EFF and the comparative methods are listed in Table 3. When counting parameters, we include the parameters of embeddings and use the default vocabulary size of 30,522. The FLOPs are evaluated by the PyTorch profiler with input sequences padded or truncated to the default length of 128 tokens, and are averaged over tasks.
In the first group, the models are highly compressed and accelerated, while COST-EFF 8× retains approximately 96.5% of the performance, which is much better than the conventional pre-training and fine-tuning of BERT 8L-256H. Specifically, COST-EFF 8× outperforms TinyBERT 4 on all four tasks, suggesting that a slenderized model preserving all the layers is superior to a squat one. The slenderized architecture is more likely to extract hierarchical features for hard instances while expeditiously processing simple instances. For larger models, TinyBERT 6 with general distillation gains a slight advantage over COST-EFF 2×. As COST-EFF 2× has a smaller volume than TinyBERT 6 and does not require general distillation, the performance gap is not significant. Meanwhile, TinyBERT 6 without general distillation is dominated by COST-EFF 2× in both efficiency and effectiveness, indicating the necessity of TinyBERT's general distillation. However, general distillation requires a large effort, as it pre-trains a single model of fixed size and computation. If the computation budget changes, pre-training yet another model can be extremely time-consuming. Compared to TinyBERT, COST-EFF has advantages in both performance and flexible inference.
To demonstrate the effect of dynamic acceleration, we empirically select simple instances from the development set, namely those that are shorter (i.e., below the median non-padding length after tokenization). The results on simple instances exhibit extra improvements attributed to dynamic inference, which are hard to obtain with static models. Notably, a shorter length does not always indicate simplicity. For entailment tasks like QNLI, shorter inputs contain less information, which potentially aggravates the perplexity of language models. Also, we plot performance curves with respect to GLUE scores and FLOPs in Figures 3 and 4. The performance curves are two-dimensional and exhibit the optimality of the different methods. Aiming at models with smaller computation and better performance, we focus on the models in the upper-left part of each figure, which form the Pareto frontier plotted as dashed blue lines.
As depicted in Figures 3 and 4, both COST-EFF 8× and COST-EFF 2× outperform the DistilBERT, DeeBERT, PABEE and BERT baselines. Compared with TinyBERT and ElasticBERT, COST-EFF is generally optimal. We find that early exiting reduces the upper bound of NLI performance, where both COST-EFF 2× and ElasticBERT 6L are inferior to TinyBERT 6. This issue may stem from the inconsistency between layers: given that the complex samples in NLI tasks rely on high-level semantics, the shallow layers should serve the deeper layers rather than solve the task by themselves. However, this issue does not affect global optimality. As shown in Figure 3, COST-EFF 8× has non-dominated performance against TinyBERT 4 on QNLI and MNLI, demonstrating the flexibility of our approach.
The performance of models incorporating early exiting is substantially affected by each exit. In Figure 5, we plot the layer-wise performance of the models with early exiting in the first group, together with the final performance of TinyBERT 4. COST-EFF 8× achieves dominant performance compared to DeeBERT and PABEE. Compared to TinyBERT 4, COST-EFF 8× achieves better performance from the 7th to the 12th layer, further verifying our claim that slender models are superior to squat models, benefiting from the preserved architecture and its ability to extract high-level semantics. Another way to obtain powerful multi-exit models is to alternate the backbone from BERT to the pre-trained ElasticBERT (Liu et al., 2022). For fairness, we uniformly use BERT as the backbone of COST-EFF and the comparative methods. Notably, our approach is well adaptable to ElasticBERT, and the resulting performance is exhibited in Appendix A.

Attributed to the imitation of hidden representations, COST-EFF 8× has an advantage of 1.6% in performance compared to training without feature distillation. Without prediction distillation, the performance drops by more than 3.4%. Previous works on static compression, e.g., TinyBERT (Jiao et al., 2020) and CoFi (Xia et al., 2022), are generally not sensitive to prediction distillation on GLUE tasks, as the output distribution of the single-exit teacher model is generally in accordance with the ground-truth labels.

We further ablate the exit loss before and during slenderization. The layer-wise comparison of the above methods is shown in Figure 6. Intuitively, two-stage training has an advantage at the final layer over collaborative training, as the inconsistency between layers is not introduced. However, the advantage diminishes in the shallow layers, leaving the overall performance unacceptable. Compared to slenderizing without the exit loss, our approach has an advantage of 1.1% to 2.3%. Notably, slenderizing without the calibration of exits can still achieve performance similar to COST-EFF at the shallow layers, suggesting that the distillation-based training is effective in restoring performance. However, the inferior performance of the deep layers indicates that the trade-off between layers is not well balanced, since that slenderization is conducted with the aim of optimizing only the final classifier.

Conclusion
In this paper, we statically slenderize and dynamically accelerate PLMs in pursuit of inference efficiency while preserving capacity. To integrate the two perspectives, we propose a collaborative optimization approach that achieves a mutual gain between static slenderization and dynamic acceleration. Specifically, the size of the PLM is reduced in width, and the inference adapts to the complexity of the inputs without introducing redundancy for simple inputs or inadequacy for hard inputs. Comparative experiments are conducted on the GLUE benchmark and verify the Pareto optimality of our approach at high compression and acceleration rates.

Limitations
COST-EFF currently has the following limitations.
If they are addressed in future works, the potential capabilities of COST-EFF can be unleashed.
(1) During the inference of dynamic early-exiting models, the conventional practice is to set the batch size to 1 to better adjust the computation to individual input samples. However, such a setting is not always efficient, as a larger batch size is likely to reduce inference time, whereas input complexities inside a batch may differ significantly. Thus, it is worthwhile to investigate a pipeline that gathers samples with similar expected complexity into a batch while controlling the priority of batches with different complexities to achieve parallelism. (2) We choose natural language understanding (NLU) tasks to study compression and acceleration, following the strong baselines TinyBERT (Jiao et al., 2020) and ElasticBERT (Liu et al., 2022). However, the extensibility of COST-EFF is yet to be explored on more complex tasks including natural language generation, translation, etc. So far, static model compression has proved effective on complex tasks (Gupta and Agrawal, 2022), and we are seeking to extend dynamic inference acceleration to different tasks using models with an iterative process.

Figure 1 :
Figure 1: An illustration of the COST-EFF model structure and inference procedure. Emb, Tfm and Clf are abbreviations of embedding, Transformer and classifier, respectively. Blue bar charts denote the probability distributions output by the classifiers.

Figure 2 :
Figure 2: Illustration of COST-EFF. The upper part is the general architecture and forward procedure of the model. The lower part shows the slenderization details of the corresponding modules, where grey circles denote the input and output dimensions of matrices and the lines connecting them are weights.

Figure 5 :
Figure 5: Layer-wise performance on the GLUE development set. Horizontal lines indicate the final classifier performance of TinyBERT 4.

Figure 6 :
Figure 6: Layer-wise performance (zoomed) of the collaborative training ablation study on the GLUE development set. Horizontal lines indicate the final classifier performance of TinyBERT 4. COST-EFF 8× (slend w/o exit) is slenderized without the calibration of exits.

Table 1 :
Details of the datasets.

Table 2 :
Settings of compressed models. L is the number of layers and H is the dimension of hidden states. A denotes the MHA size as head_num × head_size, and the intermediate size of FFN is F. Models with a check sign in the EE column adopt early exiting.

Table 3 :
Results on the GLUE development set. BERT Base is used as the baseline to evaluate the average compression and acceleration rates, i.e., Params reduc. and FLOPs reduc., which are the higher the better. TinyBERT is implemented by conducting task-specific distillation without data augmentation on the public general-distilled models, while TinyBERT 6 w/o GD is initialized from pre-trained BERT 6L-768H without general distillation. ElasticBERT 6L is initialized from the first 6 layers of ElasticBERT without the pooler. The best results are in bold and the second best results are underlined.

Table 4 :
Ablation results on the GLUE development set with 8× compression. Feature distillation is ablated in −L_feat, while the ground-truth label replaces prediction distillation in −L_pred. The FLOPs of the two ablated methods are ensured to be more than those of COST-EFF 8×.
Figure 3: Performance curves of models with 8× compression rate on the GLUE development set. The horizontal grey line indicates 95% of BERT Base performance and the vertical line indicates 5% of BERT Base FLOPs.

Figure 4: Performance curves of models with 2× compression rate on the GLUE development set. The horizontal grey line indicates 97% of BERT Base performance and the vertical line indicates 25% of BERT Base FLOPs.