Variator: Accelerating Pre-trained Models with Plug-and-Play Compression Modules

Pre-trained language models (PLMs) have achieved remarkable results on NLP tasks but at the expense of huge parameter sizes and the consequent computational costs. In this paper, we propose Variator, a parameter-efficient acceleration method that enhances computational efficiency through plug-and-play compression plugins. Compression plugins are designed to reduce the sequence length via compressing multiple hidden vectors into one and trained with original PLMs frozen. Different from traditional model acceleration methods, which compress PLMs to smaller sizes, Variator offers two distinct advantages: (1) In real-world applications, the plug-and-play nature of our compression plugins enables dynamic selection of different compression plugins with varying acceleration ratios based on the current workload. (2) The compression plugin comprises a few compact neural network layers with minimal parameters, significantly saving storage and memory overhead, particularly in scenarios with a growing number of tasks. We validate the effectiveness of Variator on seven datasets. Experimental results show that Variator can save 53% computational costs using only 0.9% additional parameters with a performance drop of less than 2%. Moreover, when the model scales to billions of parameters, Variator matches the strong performance of uncompressed PLMs.


Introduction
Large pre-trained language models (PLMs) have made significant advancements in natural language processing tasks (Han et al., 2021; Brown et al., 2020; Qiu et al., 2020; Bommasani et al., 2021; OpenAI, 2023). It is widely observed that amplifying the model scale correlates positively with enhanced downstream performance. Nevertheless, the expansive parameter scale intrinsic to PLMs demands significant computational and storage resources. Such formidable overheads necessitate investigating alternative strategies to maintain performance while reducing costs.
Many efforts have been devoted to improving the training and inference efficiency of PLMs (Sun et al., 2019; Liu et al., 2022; Fan et al., 2020; Xia et al., 2022; Stock et al., 2021). These methods compress PLMs into fixed smaller sizes and cannot fulfill the following requirements: (1) Dynamic Workload. In real-world scenarios, the system workload varies dynamically over time, while the computational resources are fixed. This implies that we can use more resources for higher performance when the workload is low, and ensure response efficiency when the workload is high.
(2) Storage Efficiency. These methods typically depend on a large number of additional parameters to construct compressed models, which require large amounts of memory space for model training and storage across various tasks and acceleration ratios.
To address these issues, we propose a novel plug-and-play acceleration framework named Variator. As shown in Figure 1, different from compressing PLMs into smaller sizes, Variator enables PLM acceleration by devising compression plugins, which can be inserted into PLMs to enhance the inference speed. Different plugins entail different acceleration ratios, and the system can dynamically choose the appropriate one to trade off response speed and model performance depending on the workload. Moreover, Variator only necessitates plugins with minimal parameters and freezes the original parameters of PLMs, which substantially lowers the memory and storage requirements.
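To make the workload-based selection concrete, here is a minimal, hypothetical dispatcher sketch. The plugin ratios and queue-length thresholds are illustrative assumptions, not values from the paper; the point is only that a serving system can map its current load to an acceleration ratio.

```python
# Hypothetical dispatcher for the plug-and-play idea: pick a compression
# ratio k based on the current workload. Thresholds are illustrative.

def select_plugin(pending_requests, thresholds=(8, 32, 128)):
    """Return a compression ratio k; heavier load -> stronger compression."""
    ratios = (1, 2, 4, 8)  # k = 1 means no compression plugin is inserted
    for ratio, limit in zip(ratios, thresholds):
        if pending_requests <= limit:
            return ratio
    return ratios[-1]  # beyond all thresholds: strongest compression

# Light load runs the uncompressed PLM; heavy load trades accuracy for speed.
choice = select_plugin(pending_requests=50)
```
Because all plugins share the same frozen backbone, switching ratios only swaps a few small plugin layers rather than reloading a whole compressed model.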
To achieve plug-and-play acceleration, there are two main challenges: (1) Plugin Architecture: compression plugins do not modify the PLM's scale, so how to devise plugins that reduce inference time is a challenge. (2) Plugin Training: compression plugins contain only limited parameters, so how to effectively train them to enhance model speed while preserving downstream performance is another challenge.
As for plugin architecture, inspired by previous findings about redundancy in hidden vectors (Goyal et al., 2020; Ye et al., 2021), we design compression plugins for data compression rather than parameter compression. Specifically, compression plugins consist of hidden compression layers and hidden decompression layers. The goal of the hidden compression layers is to compress multiple hidden vectors into one, thereby diminishing the sequence length for PLMs and enabling model acceleration. Simultaneously, to preserve token-level information, we also devise decompression layers that recover the processed shorter sequence to the original length. Compression plugins can be applied in any layer of PLMs, enabling various levels of acceleration. As for plugin training, we adopt a two-step training strategy. First, we train compression plugins on pre-trained PLMs with a pre-training corpus. Then the compression plugins trained in the first step are used as initialization for task-specific models. In both steps, we apply knowledge distillation objectives to train the compression plugins not to alter the hidden vectors produced by PLMs.
To verify the effectiveness of Variator, we conduct experiments with a widely-used pre-trained backbone, T5 (Raffel et al., 2020), on seven language understanding benchmarks. The experimental results show that Variator can save 53% of computational costs using only 0.9% additional parameters, with absolute average performance drops of < 2% compared to the original downstream PLMs.
When the model scales to billions of parameters, Variator achieves nearly no performance drop. We also examine the effectiveness of Variator on a decoder-only LLM, LLaMA (Touvron et al., 2023). In addition, we conduct a neuron-level analysis of compression plugins and find that they can effectively store important information in the compressed vectors to achieve satisfactory performance with limited computational costs.

Model Acceleration
Improving the computational efficiency of PLMs has been widely studied in recent years (Gupta and Agrawal, 2022; Zhang et al., 2022a). The related work can be divided into four categories: knowledge distillation, which guides the training of compressed models with the outputs or intermediate states of original PLMs (Hinton et al., 2015; Sanh et al., 2019; Sun et al., 2019, 2020); model pruning, which removes unimportant parameters or layers from PLMs (Fan et al., 2020; Michel et al., 2019; Chen et al., 2020; Xia et al., 2022); model quantization, which converts model parameters into low-bit precision values, thus achieving acceleration on compatible devices (Stock et al., 2021; Xiao et al., 2022); and conditional computation, which selects only parts of the parameters to compute outputs for each input (Zhang et al., 2022b; Xin et al., 2020). Some researchers make preliminary explorations of dynamic acceleration, such as early exit (Xin et al., 2020; Matsubara et al., 2023), which attempts to skip layer computation based on instance complexity. But these works rely heavily on confidence judgments and are thus only applicable to specific tasks and model architectures. Our model, which focuses on dynamic acceleration based on system workload, parallels these works and can be intuitively combined with them to further reduce computational costs.
Besides, within the realm of conditional computation, a line of research has found redundancy in hidden vectors and focuses on discarding tokens at each layer of PLMs to accelerate model inference (Goyal et al., 2020; Ye et al., 2021; Kim and Cho, 2021; Kim et al., 2022; Dai et al., 2020; Murahari et al., 2022), which inspires the design of our compression and decompression layers. But these works require retraining the whole PLM to achieve acceleration, while Variator focuses on the parameter-efficient acceleration setting and thus enables dynamic acceleration-ratio selection with minimal additional memory requirements. In addition to merging tokens, the compression layer can also be designed to dynamically prune parameters, which we leave for future work.

Parameter-Efficient Learning
The huge parameter scale of PLMs imposes substantial costs on model training and storage. To alleviate this problem, parameter-efficient learning, also known as delta tuning, has been proposed to perform task adaptation by tuning a small portion of parameters while keeping the others frozen (Liu et al., 2021; Ding et al., 2022; He et al., 2022). According to the operation on tunable parameters, delta tuning methods can be divided into: addition-based methods, which introduce additional layers into PLMs (Houlsby et al., 2019; Lester et al., 2021); specification-based methods, which specify existing weights of PLMs as tunable (Zaken et al., 2022; Guo et al., 2021); and reparameterization-based methods, which rewrite the computation of specific layers in parameter-efficient manners (Hu et al., 2021). In addition, some researchers construct plug-and-play modules for retrieval augmentation (Shi et al., 2023; Yu et al., 2023), knowledge injection (Wang et al., 2021; Zhang et al., 2023; Xiao et al., 2023), controllable text generation (Pascual et al., 2021; Madotto et al., 2020), and model debiasing (Lauscher et al., 2021). In this paper, we propose a parameter-efficient acceleration method based on hidden-vector compression, which saves memory and storage costs compared to traditional compression methods.

Methodology
In this section, we first describe the paradigm and basic notations for our plug-and-play model acceleration. Then we present the framework and training recipes of Variator, which accelerates model inference with minimal additional parameters. To showcase the efficiency of Variator, we also provide an analysis of computational and storage complexity.

Preliminary
Our primary goal is to design a plug-and-play acceleration framework, which can dynamically improve computational efficiency with multiple compression plugins. Specifically, given a PLM M and the fine-tuned downstream model M_T derived from M, Variator aims to construct a compression plugin P, which can be inserted into M_T to improve computational efficiency. That is, given an input sequence s, the computation cost of (M_T + P)(s) should be lower than that of M_T(s). Variator is designed for dynamic workloads, which means plugins with different acceleration ratios can be applied to the same downstream model M_T. Therefore, the original M_T should be frozen during the training of the compression plugin P.

Overall Framework
Previous research has found redundancy in hidden vectors, which means eliminating hidden vectors is a promising direction for acceleration (Goyal et al., 2020; Ye et al., 2021). Inspired by these works, our compression plugins are designed to compress hidden vectors, so that the sequence length is reduced to speed up inference.
As shown in Figure 2, compression plugins consist of two layers: a hidden compression layer and a hidden decompression layer, which are inserted before and after a vanilla neural layer, respectively. In this way, the compression layer reduces the computational overhead of the following neural layer, and the decompression layer aims to restore token-level information into the output vectors. Next, we introduce these two layers in detail.
Hidden Compression Layer. Hidden compression layers aim to reduce the sequence length. Previous token pruning methods assign importance scores to each hidden vector and discard those with low scores, which may lose useful information when the required compression ratio is high. Different from directly dropping hidden vectors, our hidden compression layer is designed to merge multiple vectors into one.
Specifically, given the input vector sequence with n tokens, H = {h_0, ..., h_{n−1}}, we first split the sequence into groups of k vectors, g_i = {h_{ik}, ..., h_{ik+k−1}}. Then the compressed vector g_i is calculated as a weighted average of the input vectors, with merging weights α_i predicted from the concatenation of the group:

α_i = softmax(W_c [h_{ik}; ...; h_{ik+k−1}] + b_c),   g_i = Σ_{j=0}^{k−1} α_{i,j} h_{ik+j}.

Hidden Decompression Layer. Compression layers merge multiple hidden vectors into a single global vector carrying information for all tokens in the corresponding group. To preserve the ability to solve token-level tasks, we design hidden decompression layers, inserted after the original neural layer, to restore token-level information into the output vectors.
Given the output g_i^o of the original neural layer, generated from the compressed vector g_i, we need to compute output vectors for all k tokens in the i-th group. We first concatenate the original vector h_{ik+j} and the compressed output vector g_i^o to combine token-level and group-level information. Then, instead of applying a linear projection layer with high computational complexity, we adopt an Adapter (Houlsby et al., 2019) layer and a residual connection to project the concatenated vector to the output vector o_{ik+j}:

o_{ik+j} = W_u σ(W_d [h_{ik+j}; g_i^o] + b_d) + b_u + g_i^o.

Here, W_d ∈ R^{r×2d}, W_u ∈ R^{d×r}, b_d ∈ R^r, and b_u ∈ R^d are trainable parameters, and r ≪ d refers to the bottleneck dimension of the adapter layers.
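The compression and decompression layers above can be sketched in NumPy. This is a minimal illustration of the shapes and data flow, not the authors' implementation: the weight shapes follow the parameter counts given in the complexity analysis (W_c ∈ R^{k×kd}, adapter down/up projections of sizes r×2d and d×r), while the random initialization and class structure are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CompressionPlugin:
    """Sketch of the hidden compression / decompression pair."""
    def __init__(self, d, k, r, seed=0):
        rng = np.random.default_rng(seed)
        self.d, self.k, self.r = d, k, r
        self.W_c = rng.normal(0, 0.02, (k, k * d)); self.b_c = np.zeros(k)
        self.W_d = rng.normal(0, 0.02, (r, 2 * d)); self.b_d = np.zeros(r)
        self.W_u = rng.normal(0, 0.02, (d, r));     self.b_u = np.zeros(d)

    def compress(self, H):
        """H: (n, d) with n divisible by k -> (n/k, d) compressed vectors."""
        G = H.reshape(-1, self.k, self.d)                 # (n/k, k, d) groups
        alpha = softmax(G.reshape(len(G), -1) @ self.W_c.T + self.b_c)
        return np.einsum("gk,gkd->gd", alpha, G)          # weighted average

    def decompress(self, H, G_out):
        """Restore one output per token from the group-level outputs G_out."""
        G_rep = np.repeat(G_out, self.k, axis=0)          # broadcast g^o_i
        z = np.concatenate([H, G_rep], axis=-1)           # [h_{ik+j}; g^o_i]
        a = np.maximum(z @ self.W_d.T + self.b_d, 0)      # adapter down + ReLU
        return a @ self.W_u.T + self.b_u + G_rep          # adapter up + residual
```
In the real model, the compressed sequence returned by `compress` is fed through the original (frozen) neural layer before `decompress` restores the full length.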
Both layers involve only minimal additional parameters and computation overhead while significantly reducing the sequence length. Besides, our compression plugins can be flexibly applied to any neural layer, such as self-attention layers and feed-forward layers, allowing for different acceleration ratios. Notably, the compression and decompression layers can also be implemented with other efficient operations, including convolutional neural networks. Due to the high computational requirements of feed-forward layers in Transformers (Zhang et al., 2022b), we apply compression plugins to feed-forward layers in most of our experiments.

Plugin Training
To mitigate information loss during the sequence compression of Variator, we design a two-step training strategy with plugin pre-training and plugin adaptation.
Plugin Pre-training. Plugin pre-training aims to learn a general information compression ability and obtain a good initialization of compression plugins for downstream models. In this step, compression plugins are trained to mitigate redundancy in the original input text. Specifically, we insert the compression plugins into the original PLM M and train them on a pre-training corpus. Notably, the pre-training process is task-agnostic. It is conducted only once and caters to the requirements of all downstream tasks, which makes compression plugins practical even when PLMs scale to billions of parameters.
Plugin Adaptation. Plugin adaptation is designed to drive compression plugins to preserve task-specific information during compression. Different tasks tend to pay attention to different information in the sequence. For example, sentiment analysis tasks usually need to maintain the information contained in emotional words, while reading comprehension tasks usually need to maintain information about the question. Therefore, it is important for compression plugins to learn different task information preferences in plugin adaptation. During plugin adaptation, compression plugins are inserted into the downstream model M_T and trained with task data.
Both steps adopt knowledge distillation as the training objective, guiding the compression plugins not to modify the output distribution. Given the output vectors O′ of the model without compression plugins and the output vectors O of the model with compression plugins, the final training loss is computed as the mean squared error (MSE) between O′ and O:

L = MSE(O′, O) = (1/n) Σ_i ||o′_i − o_i||².   (1)

Complexity Analysis
In this section, we provide a complexity analysis of computational and storage overhead. Here we present the analysis with compression plugins applied in feed-forward networks (FFNs), with the input length as n, the hidden vector dimension as d, and the middle dimension of FFNs as 4d. As mentioned in previous sections, k and r refer to the compression ratio and the bottleneck dimension of decompression layers.
Computational Complexity. Compression and decompression layers involve several linear projections with tiny matrices; therefore, our compression plugins require only minimal computation. For each token, compression plugins contain three linear projection operations and two addition operations. The floating point operations (FLOPs) required by the compression and decompression layers are (kd + 2d + 3)n and (3rd + 2d + r)n, respectively. In contrast, the FLOPs of the feed-forward network are 8nd². The computation cost of the compression plugins is thus only about (4 + k + 3r)/(8d) of that of the FFN, where k, r ≪ d, while the plugins reduce the computation cost of the FFN itself to 1/k of the original. Therefore, compression plugins can achieve significant inference speed-up for PLMs.
Storage Complexity. Different from training entire models to accelerate inference, Variator relies on two projection layers to compress hidden vectors. The compression and decompression layers consist of three linear projection layers in total, with only k²d + k and 3rd + r + d parameters, respectively. In contrast, an FFN layer consists of 8d² parameters.
To demonstrate the effectiveness of our parameter-efficient compression plugins more intuitively, we assume that k = 4, r = 64, and d = 768. In this case, compression plugins can save 71.7% of computational costs with only 3.4% additional parameters for FFNs.
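The 71.7% and 3.4% figures follow directly from the FLOP and parameter counts above; a short script can verify them numerically (this is a check of the stated formulas, not additional methodology):

```python
def ffn_cost_ratio(k, r, d):
    """Compressed-FFN cost plus plugin overhead, relative to the original FFN."""
    plugin = (k * d + 2 * d + 3) + (3 * r * d + 2 * d + r)  # plugin FLOPs per token
    ffn = 8 * d * d                                         # FFN FLOPs per token
    return 1 / k + plugin / ffn

def param_ratio(k, r, d):
    """Plugin parameters relative to an FFN layer's 8d^2 parameters."""
    plugin = (k * k * d + k) + (3 * r * d + r + d)
    return plugin / (8 * d * d)

k, r, d = 4, 64, 768
saved = 1 - ffn_cost_ratio(k, r, d)    # fraction of FLOPs saved
extra = param_ratio(k, r, d)           # fraction of extra parameters
```
Running this for k = 4, r = 64, d = 768 reproduces roughly 71.7% FLOPs saved at roughly 3.4% extra parameters, matching the text.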

Experiments

Datasets
To evaluate the effectiveness of Variator, we use seven typical NLP datasets as evaluation benchmarks, covering text classification and sequence-to-sequence generation tasks. Specifically, we adopt three natural language inference datasets, MNLI-m (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), and RTE (Wang et al., 2019); two sentence similarity datasets, QQP (Wang et al., 2019) and MRPC (Dolan and Brockett, 2005); a sentiment analysis dataset, SST-2 (Socher et al., 2013); and a reading comprehension dataset, SQuAD (Rajpurkar et al., 2016). We use F1 scores for MRPC, F1 and exact match (EM) scores for SQuAD, and accuracy for the other datasets as evaluation metrics. We also report the average score over the seven datasets, using the EM score for SQuAD. Please refer to the Appendix for statistics.

Implementation Details
We adopt the widely-used pre-trained models T5-base and T5-large (Raffel et al., 2020) as our backbones. Please refer to the Appendix for results of compression plugins on the BERT backbone (Devlin et al., 2019). For the main experiments, we only insert the compression plugins around the feed-forward network layers in the encoder, which account for the majority of computational requirements. As for the training objective, we compute the MSE loss with the output vectors from the last layer. The compression ratio k is set to 4 for the main experiments, and the bottleneck dimension r of the adapter layers is set to 64.
For plugin pre-training, we use the widely-used Wikipedia corpus. The learning rate is set to 10⁻³ and the batch size to 256. We pre-train compression plugins for 60k steps. For plugin adaptation, we apply grid search for hyper-parameter selection, choosing the batch size from {16, 32} and the learning rate from {10⁻⁴, 5 × 10⁻⁵}. The total number of training steps for each task is 26k, and we evaluate the models every 1k steps. We train all models with half-precision floating point on NVIDIA A100 GPUs. For both plugin pre-training and adaptation, we use Adam for parameter optimization. Please refer to the Appendix for more details.

Baselines
In this paper, we compare Variator with several competitive baselines, including: (1) the original fine-tuned downstream PLMs without acceleration, which are also used as teacher models to guide the training of the other compressed models.
(3) Our method aims to reduce the sequence length for PLMs and is inspired by previous token pruning models. Therefore, we also compare Variator with a typical token pruning model, LTP (Kim et al., 2022), which adopts attention scores as importance scores and keeps only the highest-scoring tokens at each layer. Notably, original token pruning models directly discard tokens for entire Transformer layers, while our models in the main experiments focus on the acceleration of FFNs. Therefore, for a fair comparison, we implement token pruning models that only skip the FFN computation and keep 25% of the tokens in each layer. We apply knowledge distillation objectives when training all downstream models for a fair comparison.
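The keep-top-25% behavior of the pruning baseline can be sketched as follows. This is a simplified illustration of importance-based token pruning, not LTP's exact algorithm (LTP learns per-layer thresholds); the importance scores here are assumed to be given.

```python
import numpy as np

def prune_tokens(H, importance, keep_ratio=0.25):
    """Keep the highest-importance tokens, preserving their original order.

    H: (n, d) hidden vectors; importance: (n,) scores (e.g. attention-derived).
    """
    n = len(H)
    keep = max(1, int(n * keep_ratio))
    idx = np.sort(np.argsort(importance)[-keep:])  # top-k indices, in order
    return H[idx]

H = np.arange(12, dtype=float).reshape(6, 2)
imp = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.05])
kept = prune_tokens(H, imp)
```
Dropping tokens this way discards their information entirely, which is the contrast the paper draws against Variator's merging-based compression.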

Main Results
The comparison results are shown in Table 1. From the results, we can observe that: (1) Variator achieves performance comparable to the original downstream PLMs while requiring far less computation. In contrast, traditional acceleration methods need to construct compressed models from scratch, which require large amounts of additional parameters. Limited by these parameter counts, switching traditional methods between different compression ratios requires large memory space or repeatedly loading compressed models from disk.
(2) Compared to widely-used model distillation, our parameter-efficient model acceleration method achieves competitive performance with far fewer parameters, which indicates the potential of parameter-efficient model compression.
(3) Compared to the token pruning baselines, our models achieve better performance with only a small fraction of the parameters, which shows that merging tokens preserves sequence information better than directly dropping them.

Ablation Study
To investigate the contribution of each component, we conduct an ablation study, removing plugin pre-training, plugin adaptation, the compression layers, and the decompression layers in turn; the variants without compression or decompression layers are trained without plugin pre-training to speed up the experiments. We select three tasks for the ablation study: sentence classification (SST-2), sentence-pair classification (MNLI-m), and reading comprehension (SQuAD).
The results are shown in Table 2. From the results, we can observe that: (1) Both training steps contribute to the main model: when either step is missing, the model performance drops significantly. (2) Plugin adaptation is important for all tasks. Plugin pre-training guides compression plugins to discard general redundant information contained in the input text. Therefore, for SST-2, which usually focuses only on a few important words, compression plugins without task-specific adaptation can still achieve satisfactory results. In contrast, for SQuAD and MNLI-m, which require models to collect information from entire contexts, plugins without adaptation lead to a large performance drop. (3) Compression and decompression layers play an important role in selecting information for hidden-vector merging and restoring token-level information, as removing either of them causes a significant performance drop. In particular, decompression layers are crucial for preserving token-level information, and training compression plugins without decompression layers leads to large drops on the span extraction task, SQuAD.

Effects of Compression Ratios
Variator applies compression plugins to compress multiple hidden vectors into one, thus achieving inference speedup. In this section, we explore the effect of the compression ratio. We construct compression plugins with compression ratios in {2, 4, 8, 16, 32}. The results are shown in Figure 3.
From the results, we can find that: (1) As the compression ratio increases, the model performance decreases as expected. However, the rate of decline slows, which indicates the potential for Variator to achieve even higher compression ratios.
(2) Variator achieves competitive performance even when the compression ratio reaches 32, maintaining 95.4% and 96.7% of the accuracy of the original PLMs on MNLI-m and SST-2, respectively, while reducing computational costs by 69%. This satisfactory performance ensures the response speed of real-world applications when the system load is high.

Compression for Attention Layers
In our main experiments, we insert the compression plugins around the FFN layers. In this section, we examine the performance of Variator when compression plugins are inserted around the self-attention layers. Here we do not perform plugin pre-training.
The results are shown in Table 3. For comparison, we also present the results of the original models and of Variator with plugins in FFN layers.
From the results, we can observe that Variator with plugins in self-attention layers performs worse than with plugins in FFN layers. That is because self-attention layers are designed to fuse token-level information, and inserting hidden compression layers before self-attention layers leads to the loss of token information: within the self-attention layers, only k-gram information integration is performed, resulting in a significant performance drop. To address this issue, we improve compression plugins for self-attention layers by compressing only the key and value vectors, denoted as Variator (Att-KV). With compression applied only to key and value vectors, Variator (Att-KV) achieves results comparable to Variator (FFN). Compressing key and value vectors can further be adopted in decoder-only models to reduce the sequence length of past key-value vectors, which we leave for future work.
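The key-value-only variant can be illustrated with a minimal attention sketch: queries keep the full sequence length (so every token still gets its own output), while keys and values are merged in groups of k. Mean merging stands in here for the learned weighted average; this is an assumption-laden sketch, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_kv_compressed(Q, K, V, k=4):
    """Single-head attention with keys/values merged in groups of k.

    Q: (n, d) uncompressed queries; K, V: (n, d) with n divisible by k.
    The score matrix shrinks from (n, n) to (n, n/k), cutting its cost by ~1/k.
    """
    n, d = K.shape
    K_c = K.reshape(n // k, k, d).mean(axis=1)   # merged keys   (n/k, d)
    V_c = V.reshape(n // k, k, d).mean(axis=1)   # merged values (n/k, d)
    scores = softmax(Q @ K_c.T / np.sqrt(d))     # (n, n/k) attention map
    return scores @ V_c                          # one output per query token
```
Since each query still attends individually, no decompression layer is needed for the attention output itself, which may explain why this variant avoids the token-information loss of compressing queries.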

Scaling to PLMs with Billions of Parameters
In this section, we attempt to apply our compression plugins to PLMs with billions of parameters.
We adopt four variants of T5 as our backbones, including T5-base (200 million parameters), T5-large (700 million parameters), T5-XLarge (3 billion parameters), and T5-XXLarge (11 billion parameters). Following the main experiments, for each model we conduct the two-step training process with 6k-step plugin pre-training and 26k-step plugin adaptation. We apply a parameter-efficient learning method, LoRA (Hu et al., 2021), to train the task-specific downstream models to speed up the experiments. We show the results on SST-2. As shown in Figure 4, performance continues to improve as the backbone model size increases. Similar to previous parameter-efficient learning methods (Lester et al., 2021; Ding et al., 2022), the performance gap between Variator and original PLMs becomes small when the model scales to billions of parameters. This shows the potential of Variator to be applied to today's general PLMs with more than 100 billion parameters, such as ChatGPT and GPT-4 (OpenAI, 2023). Besides, to demonstrate the effectiveness of Variator on decoder-only LLMs, we evaluate Variator with the recent popular backbone LLaMA (Touvron et al., 2023) with 7 billion parameters. Variator can be used to accelerate input encoding and reduce service latency in real-world applications. We conduct experiments with a compression ratio of 2 on the FFN layers and without plugin pre-training to speed up the experiments. The results suggest that our approach can reduce the computational overhead while maintaining performance comparable to the original model for decoder-only LLMs.

Neuron-Level Analysis
Our compression plugins enable the feed-forward layers to process information from multiple tokens simultaneously to save computational costs. In this section, we explore the computational mechanism of our compression plugins from the perspective of activated neurons. Previous works found that FFNs can be regarded as memory networks (Geva et al., 2021; Dai et al., 2022), and the activated neurons can be used as indicators of what information is preserved in the input hidden vectors. T5 adopts ReLU (Nair and Hinton, 2010) as the activation function, and following Zhang et al. (2022b), we define activated neurons as those with positive (non-zero) activation values. We present the average ratio of activated neurons with different compression ratios, k = {1, 2, 4, 8, 16, 32}, in Figure 5. From the results, we can observe that the ratio of activated neurons drops as the compression ratio increases. When the compression ratio reaches 32, fewer than 2‰ of the neurons are activated to process sequence information. It indicates that compressed hidden vectors contain only the information necessary for the sequences and discard unimportant information. Besides, the low activation ratios also indicate the potential of combining Variator with neuron pruning methods (Zhang et al., 2022b) to further improve computational efficiency.
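The activated-neuron ratio used in this analysis can be computed as below. This is a generic sketch of the measurement (first FFN projection followed by ReLU, then the fraction of strictly positive activations); the weight names and shapes are illustrative, not taken from the T5 codebase.

```python
import numpy as np

def activated_ratio(H, W_in, b_in):
    """Fraction of ReLU FFN neurons with positive activation, over all tokens.

    H: (n, d) hidden vectors entering the FFN; W_in: (4d, d) first projection.
    """
    acts = np.maximum(H @ W_in.T + b_in, 0)  # FFN up-projection + ReLU
    return float((acts > 0).mean())          # "activated" = strictly positive

rng = np.random.default_rng(0)
H = rng.normal(size=(16, 8))                 # 16 (possibly compressed) vectors
W_in = rng.normal(size=(32, 8))              # 4d = 32 FFN neurons
ratio = activated_ratio(H, W_in, np.zeros(32))
```
Averaging this quantity over a dataset, per compression ratio, yields curves like those in Figure 5.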
Then we further explore the relationship between the activated neurons of FFNs with compression plugins and the activated neurons of the original FFNs. Specifically, we denote the intersection set and union set of activated neurons of k hidden vec-

Conclusion
In this paper, we explore the parameter-efficient acceleration setting and propose Variator, which reduces computational costs with compression plugins. Extensive experiments on seven datasets show that we can reduce computational costs by 53% with only 0.9% additional parameters.
In the future, we will explore more effective token-merging frameworks to improve compression plugins. Besides, we will further decouple compression plugins from specific tasks, so that we can construct compression plugins once and for all with transferability across multiple tasks.

B Effects of Training Objectives

We also explore training compression plugins with a combination of distillation and task-specific objectives. Here, we denote the task-specific loss as L_t and the distillation loss as L_d. The final loss is calculated as L = λL_t + L_d. We present the performance of Variator with compression ratio k = 4 on T5-base.
As shown in Table 7, we find that training with task-specific objectives leads to a performance drop on both MNLI-m and SST-2. That is because the task-specific loss is usually easier to optimize than the distillation loss, and adding it makes our compression plugins more likely to fall into a local optimum. Therefore, in all other experiments, we use only the distillation loss to optimize compression plugins.

C Compression Plugins for BERT
Our compression plugins can be applied to Transformer-based pre-trained models. In this section, we explore inserting compression plugins into the widely-used encoder-only pre-trained model BERT (Devlin et al., 2019). We adopt the 100-million-parameter version, BERT-base, as our backbone. Following the main experiments, we set the compression ratio to 4 and the bottleneck dimension to 64. We conduct plugin pre-training for 24k steps. The results are shown in Table 8. From the results, we observe that Variator on BERT also achieves competitive results, and longer plugin pre-training is expected to lead to better performance.

Figure 1 :
Figure 1: Illustration of model acceleration with compression plugins.
Here, W_c ∈ R^{k×kd} and b_c ∈ R^k are trainable parameters, and d is the dimension of the hidden vectors. The compressed vectors are then fed into the original neural layers.

Figure 2 :
Figure 2: Illustration of Variator, which improves the computational efficiency via compressing the hidden vectors.

Figure 3 :
Figure 3: Model performance with different compression ratios.The horizontal lines indicate the performance of original PLMs without compression plugins.

Figure 4 :
Figure 4: Performance with different backbone sizes.

Figure 5 :
Figure 5: The ratio of activated neurons with different compression ratios on two datasets.

Table 1 :
Comparison results between Variator and baseline models. Here Avg. refers to the average scores on seven datasets. Para. and FLOPs refer to the ratio of the number of additional parameters and floating point operations required by the compressed methods to those of the original PLMs.

As shown in Table 1, Variator achieves absolute average performance drops of < 2%. Specifically, Variator can save 53.2% and 45.9% of the computation costs for T5-base and T5-large, using only 0.9% and 0.7% additional parameters.

Table 2 :
The results for ablation study.

Table 3 :
The results for compression plugins inserted around the self-attention layers.

Table 4 :
The results for compression plugins in LLaMA.

Table 5 :
The relationship between activated neurons of Variator and original models.

tors as I and U. The set of activated neurons of the compressed vector is denoted as C. The intersection set I can be regarded as the important global information shared by the k hidden vectors, and U can be regarded as all information contained in the k hidden vectors. Compression layers are used to select important information and feed it into the neural layers. Therefore, we expect I to be approximately a subset of C, and C to be approximately a subset of U. In Table 5, we present what fraction of the neurons in I are in C and what fraction of the neurons in C are in U. From the results, we can observe that when the compression ratios are no more than 8, the relationship between the three sets approximately satisfies the above inclusion assumption. This demonstrates the effectiveness of our compression plugins in preserving global information. When the compression ratios become larger (such as 16 or 32), no more than 70% of the neurons in I are contained in C.

Table 6 :
The statistics and input templates of downstream datasets.In the templates, the task-specific inputs are denoted in monospaced font, and <extra_id_0> refers to the special mask token for T5.

Table 7 :
The performance with different training objectives.

Table 8 :
The performance with compression plugins in BERT.