BMCook: A Task-agnostic Compression Toolkit for Big Models

Recently, pre-trained language models (PLMs) have achieved great success on various NLP tasks and have shown a trend of exponential growth in model size. To alleviate the unaffordable computational costs brought by this growth, model compression has been widely explored. Existing efforts have achieved promising results in compressing medium-sized models for specific tasks, while task-agnostic compression for big models with over billions of parameters is rarely studied. Task-agnostic compression can provide an efficient and versatile big model for both prompting and delta tuning, leading to a more general impact than task-specific compression. Hence, we introduce BMCook, a task-agnostic compression toolkit for big models. In BMCook, we implement four representative compression methods, including quantization, pruning, distillation, and MoEfication. Developers can easily combine these methods towards better efficiency. To evaluate BMCook, we apply it to compress T5-3B (a PLM with 3 billion parameters). We achieve nearly 12x efficiency improvement while maintaining over 97% of the original T5-3B performance on three typical NLP benchmarks. Moreover, the final

models could not do, such as quantitative reasoning (Lewkowycz et al., 2022) and long-form question answering (Nakano et al., 2021). Despite the success of big models, their exponentially growing sizes impose unaffordable computational costs for real-world applications.
To improve the efficiency of PLMs, model compression is an essential solution. There are several compression techniques, including model distillation (Hinton et al., 2015), model quantization (Bai et al., 2021), and model pruning (Liang et al., 2021). Based on these techniques, practitioners can conduct task-specific compression during fine-tuning (Sun et al., 2019) and task-agnostic compression during pre-training (Sanh et al., 2019). Previous studies mainly focus on applying task-specific compression to medium-sized PLMs with around one hundred million parameters, such as BERT-base (Zafrir et al., 2019; Jiao et al., 2020; Hou et al., 2020; Xia et al., 2022), while compressing large-scale PLMs with over billions of parameters is rarely studied.
In this work, we focus on the task-agnostic compression of big models because it enables developers to utilize the powerful intelligence of big models with fewer computation resources for both prompting (Brown et al., 2021) and delta tuning (a.k.a. parameter-efficient tuning) (Houlsby et al., 2019; Ding et al., 2022). Both prompting and delta tuning are the current core approaches to driving big models. There exist two challenges for the task-agnostic compression of big models. First, big models require high compression rates to achieve affordable costs, while existing compression toolkits only support one or two techniques, as shown in Table 1, and thus cannot provide high enough compression rates. Second, existing compression implementations ignore the memory challenge brought by big models. They are usually based on HuggingFace Transformers (Wolf et al., 2020), which cannot well support the training of large-scale PLMs.
In this work, we introduce BMCook, a task-agnostic compression toolkit for big models. BMCook has three main characteristics: (1) Zero-redundancy training. BMCook is developed on an efficient training toolkit, BMTrain, which supports the zero-redundancy optimizer with offloading (Rajbhandari et al., 2020; Ren et al., 2021a) to handle the memory challenge. (2) Flexible combination.
To achieve better efficiency, we make BMCook flexible enough to support arbitrary combinations of different compression techniques. To this end, we implement four popular compression techniques and assign each technique to a different part of one unified training life-cycle. (3) Runtime model modification. Since some compression techniques require access to the inner hidden states of PLMs, developers usually have to modify the model implementation code provided by a third-party package. To make compression easier to operate, BMCook implements runtime modification via monkey patching, which avoids modifying the source code of PLMs.
BMCook is supported by the Open Lab for Big Model Base (OpenBMB). We hope BMCook can help researchers explore better compression methods for large-scale PLMs in the future and help practitioners improve model efficiency in real-world applications.

Design and Implementation
As mentioned in the introduction, we implement three main characteristics in BMCook: zero-redundancy training, flexible combination, and runtime model modification. In this section, we introduce the design and implementation details of these three characteristics.

Zero-redundancy Training
Due to their enormous size, big models require large memory to store their parameters and optimizer states, which cannot be maintained on a single GPU. Recently, the zero-redundancy optimizer has been proposed to solve this problem (Rajbhandari et al., 2020); it distributes the parameters and optimizer states across multiple GPUs instead of storing all of them redundantly on each GPU. If more GPUs are used, each GPU requires less memory, which alleviates the memory challenge brought by big models. Since BMCook targets the compression of big models, training big models is an important part of it. Therefore, we implement BMCook based on an efficient training toolkit, BMTrain, which supports the zero-redundancy optimizer with parameter checkpointing (Chen et al., 2016) and offloading (Ren et al., 2021b).

Flexible Combination
Previous work on model compression usually explores one or two specific techniques. Due to the huge model size, we have to combine different techniques to achieve extreme compression. Hence, BMCook aims to build a unified compression framework that can support different techniques. Specifically, BMCook supports model distillation, model pruning, model quantization, and model MoEfication. To better utilize these techniques, we distribute them over different parts of one unified life-cycle, as shown in Figure 1. With this design, we decouple these techniques in the implementation and support arbitrary combinations. Next, we describe these techniques in more detail.
Model quantization aims to represent parameters with low-bit fixed-precision values, reducing both memory and computational costs. For example, the computation of an 8-bit quantized model is 4 times faster than that of a 32-bit model. Existing toolkits, such as TensorFlow (Abadi et al., 2016), have already supported post-training quantization, which directly quantizes the parameters of a PLM and may bring significant performance degradation. To alleviate this degradation, Stock et al. (2021) propose quantization-aware training (QAT), which simulates quantization during training, i.e., the parameters are quantized during the forward propagation so that they adapt to low-bit fixed-precision computation. Towards better performance, BMCook supports QAT. Specifically, we replace all linear layers in PLMs with quantized linear layers, in which we simulate quantized matrix multiplication. Since the linear layers account for more than 90% of the computation in the Transformer (Han et al., 2022), model quantization brings significant efficiency improvement.
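The idea of simulating quantized matrix multiplication inside a linear layer can be sketched as follows; the class and helper names are illustrative, not BMCook's actual implementation.

```python
# Sketch of QAT-style simulated 8-bit quantization for a linear layer.
import torch
import torch.nn as nn

def fake_quantize(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Round values to a low-bit fixed-point grid, then de-quantize
    back to floats, so training sees the quantization error."""
    scale = x.abs().max() / (2 ** (bits - 1) - 1)
    scale = scale.clamp(min=1e-8)  # avoid division by zero
    return torch.round(x / scale) * scale

class QuantizedLinear(nn.Linear):
    """Drop-in replacement for nn.Linear that simulates 8-bit
    quantized matrix multiplication in the forward pass."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = fake_quantize(self.weight)  # quantize weights
        x_q = fake_quantize(x)            # quantize activations
        return nn.functional.linear(x_q, w_q, self.bias)
```

Because the rounding happens in the forward pass, the parameters gradually adapt to the low-bit grid during training.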
Model distillation aims to guide the training of a compressed model with a larger teacher model. Traditional distillation adds the KL divergence between the outputs of the teacher model and the student model as an extra training objective (Hinton et al., 2015). For PLMs, Sun et al. (2019); Jiao et al. (2020); Liu et al. (2022); Park et al. (2021) find that it is also effective to make the inner computation results of student models close to those of their teachers. For example, they add an MSE loss between the hidden states of student and teacher models. Note that the model distillation module in BMCook only provides additional training loss instead of reducing the size of the model; any compression technique requiring further training can be combined with model distillation to improve the performance of the compressed model.

Model pruning aims to prune the redundant parameters of a model. There are two typical approaches, structured pruning and unstructured pruning. Structured pruning removes complete modules (e.g., layers) from the model (Fan et al., 2020; Wang et al., 2020; Zhang et al., 2021b; Xia et al., 2022). Instead, unstructured pruning removes individual parameters from the model (Han et al., 2015; Chen et al., 2020; Xu et al., 2021). Both of them change the forward and backward processes of the model according to the pruned parameters. To decouple pruning from quantization, we move the pruning operations to the optimization step, where we set the pruned parameters to zero after each parameter update. In this way, the redundant parameters stay pruned during the forward and backward processes without directly affecting those processes.
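The two distillation objectives described above (KL divergence on output distributions and MSE on hidden states) can be sketched as a single loss function; the names and scales below are illustrative.

```python
# Sketch of a combined distillation loss: output KL + hidden-state MSE.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      ce_scale=1.0, mse_scale=1.0, temperature=1.0):
    # KL divergence between output distributions (Hinton et al., 2015)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    )
    # MSE between inner hidden states of student and teacher
    mse = F.mse_loss(student_hidden, teacher_hidden)
    return ce_scale * kl + mse_scale * mse
```

When the teacher's hidden size differs from the student's, a linear projection of the teacher's hidden states would be applied before the MSE term.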
Note that unstructured pruning cannot guarantee efficiency improvement in most cases because parallel processing devices, such as GPUs, usually do not provide optimized sparse computation operations (Zheng et al., 2022). Hence, BMCook implements unstructured pruning with 2:4 sparsity, which is well supported by Sparse Tensor Cores (Zhou et al., 2021). 2:4 sparsity means that every four contiguous parameters contain two zeros. In this way, the sparse computation is guaranteed to be twice as fast as the dense computation. Besides, for structured pruning, we implement CoFi (Xia et al., 2022) in BMCook, which adds L0 regularization to the parameters of the model to learn an optimal sparse mask.
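A 2:4 sparsity mask can be computed with a few tensor operations. The helper below is a minimal magnitude-based sketch, not BMCook's implementation: it keeps the two largest-magnitude parameters in every group of four.

```python
# Minimal sketch of a 2:4 sparsity mask: in every group of four
# consecutive weights, the two smallest-magnitude entries are zeroed.
import torch

def two_four_mask(weight: torch.Tensor) -> torch.Tensor:
    """Return a 0/1 mask with exactly two nonzeros per group of four."""
    groups = weight.reshape(-1, 4)
    # indices of the two largest-magnitude entries in each group
    topk = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return mask.reshape(weight.shape)

w = torch.randn(8, 16)
mask = two_four_mask(w)
```

Hardware such as the Sparse Tensor Core can then skip the zeroed half of each group, giving the guaranteed 2x speedup mentioned above.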
MoEfication aims to transform the feed-forward networks (FFNs) in Transformers into equivalent mixture-of-experts (MoE) versions (Fedus et al., 2021), which significantly reduces the computational costs of FFNs (Zhang et al., 2022b). Since Transformers (Vaswani et al., 2017) adopt ReLU (Nair and Hinton, 2010) as the activation function of FFNs and there exists an obvious sparse-activation phenomenon, we can involve only part of an FFN for a specific input without affecting the model performance. The transformation process does not change the number or values of model parameters. Therefore, we treat MoEfication as a post-processing technique; it can be applied to any compressed model to achieve better efficiency.
To train routers for expert selection, MoEfication requires the hidden states to simulate the computation process of FFNs. The training of routers is localized to specific FFNs and is handled by an external MoEfication package.
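Conceptually, a MoEfied FFN restricts the intermediate activations to the experts picked by a router. The class below is an illustrative sketch, not the MoEfication package's implementation: the router is a plain linear layer, expert sizes are assumed uniform, and for clarity the unselected experts are merely zeroed rather than skipped.

```python
# Conceptual sketch of a MoEfied ReLU FFN with top-k expert selection.
import torch
import torch.nn as nn

class MoefiedFFN(nn.Module):
    def __init__(self, d_model, d_ff, num_experts, k):
        super().__init__()
        assert d_ff % num_experts == 0
        self.num_experts, self.k = num_experts, k
        self.expert_size = d_ff // num_experts
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)
        # router predicts which experts are worth activating per token
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):  # x: (tokens, d_model)
        h = torch.relu(self.w_in(x))       # (tokens, d_ff)
        scores = self.router(x)            # (tokens, num_experts)
        topk = scores.topk(self.k, dim=-1).indices
        keep = torch.zeros_like(scores).scatter_(-1, topk, 1.0)
        # zero out the intermediate activations of unselected experts;
        # a real implementation would skip their computation entirely
        h = h * keep.repeat_interleave(self.expert_size, dim=-1)
        return self.w_out(h)
```

With k set to half of num_experts, each token activates roughly half of the FFN parameters, matching the 50% setting used in the evaluation.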
In summary, BMCook is the first toolkit to integrate such a series of compression techniques. Benefiting from their decoupled implementation, practitioners can design their own compression strategies with arbitrary combinations.

Runtime Model Modification
All of the compression techniques mentioned in the last subsection require modifying the life-cycle of the training process, i.e., the implementation code of PLMs. Taking distillation as an example, to compute the mean squared error between the hidden states of the teacher model and the student model, we have to modify the forward functions so that they return the hidden states. Existing compression toolkits usually ask developers to modify the code themselves (Yang et al., 2020, 2022). For example, in the case of distillation, after developers modify the forward functions, the toolkit provides the implementation of the loss calculation. However, the model implementation is usually provided by a third-party package, such as HuggingFace Transformers, making manual modification inconvenient. Besides, the modification is simple and similar across different PLMs. Hence, BMCook implements runtime modification in a general way to keep the source code clean and make it easy to compress different PLMs.
Specifically, we utilize monkey patching in Python, which modifies the behavior of an object at runtime. As shown in Figure 2, we first rename the original forward function of a module as forward_old, then define a new forward function containing forward_old and a tensor-recording step, and finally assign the new forward function to the module. The inspect function used for recording stores the tensor in a global dictionary, so after the whole forward process is finished, we can access the tensor by its name.
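The pattern in Figure 2 can be sketched in PyTorch as follows; the function and dictionary names are illustrative, not BMCook's exact implementation.

```python
# Sketch of the monkey-patch pattern: rename the original forward,
# wrap it with a tensor-recording step, and reassign it to the module.
import types
import torch
import torch.nn as nn

recorded = {}  # global store for recorded tensors

def inspect(name, tensor):
    recorded[name] = tensor.detach()

def add_recording(module: nn.Module, name: str):
    module.forward_old = module.forward  # keep the original forward

    def forward(self, x):
        out = self.forward_old(x)  # original computation
        inspect(name, out)         # record the output tensor
        return out

    # bind the new forward to this instance only
    module.forward = types.MethodType(forward, module)

ln = nn.LayerNorm(8)
add_recording(ln, "encoder.ln_0")
_ = ln(torch.randn(2, 8))
# recorded["encoder.ln_0"] now holds the hidden states
```

Because only the instance attribute is reassigned, the source code of the model class stays untouched.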
Both knowledge distillation and MoEfication require access to the hidden states of PLMs. Considering that different modules have different forward functions, e.g., attention modules take hidden states and attention masks as input, we choose to access the hidden states through layer normalization modules and provide a general interface to add tensor recording to their forward functions. The inputs of layer normalization modules are only hidden states, and these modules are widely used before or after other modules. Hence, based on layer normalization modules, we can access nearly all hidden states of PLMs.
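A general interface of this kind only needs to enumerate the layer normalization modules, which PyTorch exposes via named_modules; the sketch below is illustrative.

```python
# Collect the names of all LayerNorm modules in a model; their
# inputs/outputs cover nearly all hidden states of a Transformer.
import torch
import torch.nn as nn

def layer_norm_names(model: nn.Module):
    return [name for name, mod in model.named_modules()
            if isinstance(mod, nn.LayerNorm)]

model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8), nn.Linear(8, 8))
print(layer_norm_names(model))  # → ['1']
```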
Similarly, we also modify the linear layers and the optimizer by monkey patching for quantization and pruning. For quantization, we replace the matrix multiplication in the forward functions of linear layers with a quantized one. For pruning, we modify the behavior of the optimizer's step function: we keep the original operation and add a pruning step after the parameter update.
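The optimizer patch can be sketched in the same style: keep the original step and re-apply the pruning mask after every parameter update. Names are illustrative, not BMCook's actual code.

```python
# Sketch of the pruning hook on an optimizer's step function.
import torch

def add_pruning(optimizer, masks):
    """masks: dict mapping parameter -> 0/1 pruning mask."""
    step_old = optimizer.step  # keep the original step

    def step(*args, **kwargs):
        out = step_old(*args, **kwargs)  # normal parameter update
        with torch.no_grad():
            for param, mask in masks.items():
                param.mul_(mask)  # set pruned parameters back to zero
        return out

    optimizer.step = step

w = torch.nn.Parameter(torch.randn(4, 4))
opt = torch.optim.SGD([w], lr=0.1)
mask = (torch.rand(4, 4) > 0.5).float()
add_pruning(opt, {w: mask})
w.sum().backward()
opt.step()
# pruned entries of w are exactly zero after the update
```

Since the mask is re-applied only after each update, the forward and backward passes run unchanged, exactly as the decoupling described above requires.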
In summary, BMCook utilizes runtime modification to keep the source code clean and provides general interfaces to compress different PLMs.

Usage and Configuration
Since different compression modules are decoupled in BMCook, we implement each module independently, where each module is usually a single Python file providing one or two general interfaces. Benefiting from these general interfaces, BMCook can be applied to a PLM with only a few lines of code, as shown in Figure 3. The details of compression are mainly determined by a configuration file, which is used by the different compression modules. In practice, users can easily reuse their pre-training code for compression by adding a few lines to import the compression modules and then setting the configuration file. Note that BMCook supports PLMs implemented based on BMTrain, and ModelCenter (https://github.com/OpenBMB/ModelCenter) has provided BMTrain-based implementations of almost all mainstream PLMs.
As shown in Figure 4, the configuration file is a JSON file. The keys are the names of the compression modules, and the values are their configurations. Note that the module names used in the configuration file correspond to the names provided by PyTorch, so BMCook can access the modules by their names.
The key of knowledge distillation is distillation. Currently, BMCook supports two kinds of distillation objectives: KL divergence between output distributions (turned on when ce_scale>0) and mean squared error (MSE) between hidden states (turned on when mse_hidn_scale>0). Practitioners need to specify the hidden states used for the MSE via mse_hidn_module. Meanwhile, the dimensions of the hidden states may differ between teacher and student models, in which case the hidden states of the teacher model need to be projected to the same dimension as those of the student model; practitioners can turn on mse_hidn_proj for a simple linear projection.
The key of model pruning is pruning. Practitioners can turn on pruning via is_pruning. The pruning mask is stored at pruning_mask_path, and the pruned modules are specified by pruned_module. To simplify this list, practitioners can provide only the suffix of the module names. mask_method selects the algorithm used to compute the pruning mask.
The key of quantization is quantization. Practitioners can turn on quantization via is_quant, which will replace all linear layers with quantized linear layers. BMCook provides the simulation of 8-bit quantization. The key of MoEfication is MoEfication. Practitioners can turn on MoEfication via is_moefy. The hidden states used for router training are specified by first_FFN_module, which is the nearest layer normalization module before each FFN; providing the suffix of the module names is also sufficient.
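Putting the keys described in this section together, a configuration file might look like the following. The concrete module names and values are hypothetical illustrations, not taken from BMCook's documentation; only the key names come from the text above.

```python
# Assemble an illustrative BMCook-style configuration and dump it as JSON.
import json

config = {
    "distillation": {
        "ce_scale": 1.0,           # KL divergence on output distributions
        "mse_hidn_scale": 1.0,     # MSE on hidden states
        "mse_hidn_module": ["encoder.layernorm"],  # hypothetical suffix
        "mse_hidn_proj": False,    # teacher/student dimensions match here
    },
    "pruning": {
        "is_pruning": True,
        "pruning_mask_path": "masks.bin",          # hypothetical path
        "pruned_module": ["self_attention", "ffn"],  # suffixes suffice
        "mask_method": "sparsity_2_4",             # name assumed
    },
    "quantization": {"is_quant": True},
    "MoEfication": {
        "is_moefy": False,
        "first_FFN_module": ["ffn.layernorm"],     # hypothetical suffix
    },
}
print(json.dumps(config, indent=2))
```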

Evaluation
To validate the effectiveness of BMCook, we study task-agnostic compression of T5-3B (Raffel et al., 2020), which has 3 billion parameters. Since task-agnostic compression benefits various downstream tasks, we evaluate the performance of adapter tuning (Houlsby et al., 2019) on T5-3B and its compressed variants. We also study T5-Base and T5-Large, which have 220 million and 770 million parameters, respectively.
Training and evaluation data. We use the Pile dataset (Gao et al., 2020), a large-scale corpus for pre-training language models, for task-agnostic compression training. The training objective is the masked language modeling used by T5. Note that we turn on distillation during compression training in all experiments because we find in pilot experiments that knowledge distillation with the MSE loss improves model performance. Besides, we choose three downstream datasets for evaluation: SST-2 (Socher et al., 2013), a representative single-sentence classification dataset; MNLI (Williams et al., 2018), a representative sentence-pair classification dataset; and SQuAD v1.1 (Rajpurkar et al., 2016), a representative question-answering dataset. For the first two datasets, we use accuracy as the evaluation metric; for the third, we use both exact match and F1 score. We evaluate model performance on their development sets, adopting the same task templates and label words as the original T5 paper (Raffel et al., 2020).
Hyper-parameters. The learning rate of task-agnostic compression training is 1e-4, while that of adapter tuning ranges from 1e-6 to 1e-5. The batch size of both task-agnostic compression training and adapter tuning is 32. We use 4 NVIDIA A100 GPUs in the experiments. The number of training steps for task-agnostic compression ranges from 10K to 50K depending on the compression method, and the number of adapter-tuning epochs ranges from 3 to 5.
To fairly compare the efficiency of T5-3B and its variants, we define a new metric named activated model size, because Brown et al. (2021) mention that the computation of a Transformer is linear in its model size, excluding the embedding layer. Hence, we count the parameters of the self-attention networks and FFNs. For an original model, the activated model size equals its model size. Although it is intuitive to directly compare the speedup of compressed models, there is no inference toolkit supporting all the compression methods. Hence, we focus on the theoretical computational cost in this work.
In the evaluation, we set the pruning sparsity to 50%, i.e., we prune 50% of the parameters and halve the activated model size. Besides, we quantize the parameters to 8 bits, which reduces the activated model size to one-fourth of the floating-point version. For MoEfication, we dynamically involve 50% of the FFN parameters for a specific input, so the activated model size of the modified FFNs is about half of the original FFNs. Note that a Transformer consists of both attention layers and FFNs, and FFNs account for about 70% of the model size; the final activated model size of the modified Transformer is therefore about 66% of the original one. If we combine all three techniques, we can achieve a compressed model with about one-twelfth of the original activated size.
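The combined rate quoted above can be checked with a quick back-of-the-envelope computation, using the 70% FFN share stated in the text:

```python
# Verify the "about one-twelfth" claim for the combined compression rate.
ffn_fraction = 0.70              # FFN share of attention+FFN parameters
quant = 1 / 4                    # 8-bit vs. 32-bit parameters
prune = 1 / 2                    # 50% sparsity
# MoEfication halves only the FFN part: 0.3 + 0.7 * 0.5 = 0.65 ("about 66%")
moe = (1 - ffn_fraction) + ffn_fraction * 0.5
combined = quant * prune * moe   # fraction of activated size remaining
print(round(1 / combined, 1))    # → 12.3, i.e., roughly a 12x reduction
```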
We report the evaluation results in Table 2. From this table, we have three observations: (1) In the single-module experiments, quantization achieves the best efficiency and performance. Unstructured pruning achieves the second-best efficiency and performance and significantly outperforms structured pruning, which suggests that directly removing layers may bring significant performance degradation. Besides, as a post-processing method that does not require further pre-training, MoEfication maintains over 98% of the original performance while reducing the activated model size by 33%. (2) Different compression techniques can be combined to achieve better efficiency while maintaining most of the performance. For example, combining quantization, unstructured pruning, and MoEfication yields a compressed model with about one-twelfth of the original activated size that maintains over 97% of the original performance. (3) Compressing big models yields better small models. For example, Quant+Pruning+MoE is smaller than T5-Base while significantly outperforming it.

Conclusion and Future Work
In this paper, we introduce BMCook, a task-agnostic compression toolkit for big models. The toolkit contains four popular techniques and is designed to be flexible enough to support arbitrary combinations. Users can easily compress a PLM by adding several lines to its pre-training code and specifying the strategy in a configuration file.
In the future, there are three directions to further improve BMCook. First, we will enrich the options of existing compression techniques, such as knowledge distillation on attention matrices (Jiao et al., 2020) and extremely low-bit quantization (Bai et al., 2021). Second, there are other compression techniques not yet covered by BMCook, such as weight sharing (Lan et al., 2020) and low-rank decomposition (Chen et al., 2021). Third, we will explore automatic search for better compression strategies or configurations: given a specific computation budget, we want to find the compression strategy that achieves the best model performance, which is similar to neural architecture search (Elsken et al., 2019).
Meanwhile, we also plan to enrich inference toolkits to support different compression techniques. Although compression techniques have developed rapidly, inference toolkits still lag behind. Recently, there have been some efforts to support compressed models at inference time, such as BMInf (Han et al., 2022) and DeepSpeed-MoE (Rajbhandari et al., 2022), but they are still limited to specific compression techniques.

Figure 1: Life-cycle of the training process, including output computation, loss computation, parameter update, and post-processing. Each compression technique is bound to a specific step: quantization influences the output computation, distillation influences the loss computation, pruning influences the parameter update, and MoEfication influences the post-processing.

Figure 2: Example of monkey patching: adding a tensor-recording step to a forward function.

Figure 4: Example of the configuration file.

Table 2: Evaluation of original models and compressed models. In the combination experiments, we use unstructured pruning due to its superior performance in the single-module experiments. The size of adapters is kept the same for all PLMs. Activated model size is used to measure the compression rate because the computational cost, i.e., FLOPs, is linear in the model size.