TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities

Recently, the success of pre-training in the text domain has been fully extended to vision, audio, and cross-modal scenarios. The proposed pre-training models of different modalities show a rising trend of homogeneity in their model structures, which brings the opportunity to implement different pre-training models within a uniform framework. In this paper, we present TencentPretrain, a toolkit supporting pre-training models of different modalities. The core feature of TencentPretrain is its modular design. The toolkit uniformly divides pre-training models into five components: embedding, encoder, target embedding, decoder, and target. As almost all common modules are provided in each component, users can choose the desired modules from different components to build a complete pre-training model. The modular design enables users to efficiently reproduce existing pre-training models or build brand-new ones. We test the toolkit on text, vision, and audio benchmarks and show that it can match the performance of the original implementations.


Introduction
Pre-training on large-scale data and then fine-tuning on downstream tasks has become a paradigm for text, vision, and audio tasks (Devlin et al., 2019; Bao et al., 2021; Baevski et al., 2020). In addition to the similarity in the pipeline paradigm, these pre-training models also have similar model structures. On one hand, most of them consist of the following components: embedding, encoder, target embedding, decoder, and target. On the other hand, many modules in the above components are shared among models of different modalities. For example, the transformer module (in the encoder component) (Vaswani et al., 2017), which is successful in the field of text, is increasingly being applied to the vision and audio modalities (Dosovitskiy et al., 2020; Gulati et al., 2020). Table 1 lists commonly used pre-training models and their modules.
The trend towards homogeneity in pre-training models is becoming more apparent, which makes it possible to integrate them into a uniform framework. A representative work in this direction is Huggingface Transformers (Wolf et al., 2020), which exploits a non-modular design mode. For each pre-training model in Huggingface Transformers, several separate classes are created, and the code is not refactored with additional abstractions. Users can develop their pre-training models independently, which is useful for collaborative development in the community. However, in this design mode, users need to implement a model from scratch when adding a new pre-training model, requiring considerable code work. In addition, as the number of pre-training models grows, the number of classes and lines of code also increases linearly, and code with the same function may be written many times, which degrades the readability and maintainability of the project.
In response to the shortcomings of the non-modular design mode, we introduce TencentPretrain, a modular toolkit specially designed for pre-training models of different modalities. As shown in Figure 1, TencentPretrain has five components, namely embedding, encoder, target embedding, decoder, and target. Among them, the target embedding and decoder components are optional, since the targets of many pre-training models do not involve sequence decoding (Lewis et al., 2020). TencentPretrain is hierarchically modular, with two degrees of freedom. At the component level, users are free to combine modules within a component, for example, combining multiple modules in the target component to perform multi-task pre-training (Lan et al., 2019; Sun et al., 2020). At the model level, users can combine modules from different components to constitute a complete pre-training model. Modularity in design makes TencentPretrain scalable as the number of newly proposed pre-training models increases. Users are allowed to reuse existing modules with little effort, avoiding repeated implementation of core functions. At the same time, TencentPretrain provides a robust and clear interface among different components. It brings flexibility, allowing users to build custom model structures through a configuration file without any code work.

Table 1: Commonly used pre-training models and their modules.

Pre-training model | Modality | Embedding | Encoder | Target embedding | Decoder | Target
ELMo (Peters et al., 2018) | Text | word | bi-lstm | - | - | bilm
Infersent (Conneau et al., 2017) | Text | word | gru | - | - | cls
CoVe (McCann et al., 2017) | Text | word | lstm | word | lstm | lm
BERT (Devlin et al., 2019) | Text | word, pos, seg | transformer | - | - | mlm, sp
GPT-2 (Radford et al., 2019) | Text | word, pos | transformer | - | - | lm
T5 (Raffel et al., 2020) | Text | word | transformer | word | transformer | lm
ViT (Dosovitskiy et al., 2020) | Vision | patch, pos | transformer | - | - | cls
BEiT (Bao et al., 2021) | Vision | patch, pos | transformer | - | - | mlm
S2T | Audio | speech, pos | transformer | word, pos | transformer | lm
ViLT | Text-vision | word_patch, pos, seg | transformer | - | - | mlm, cls
TencentPretrain is implemented with PyTorch (Paszke et al., 2019), and it supports distributed training and the DeepSpeed optimization library (Rasley et al., 2020). TencentPretrain is fully connected with Huggingface Transformers, providing comprehensive conversion scripts of pre-training models between the two frameworks. Users can switch between the two frameworks at low cost. TencentPretrain is tested on text, vision, and audio benchmarks and is able to reproduce the results of SOTA pre-training models. The TencentPretrain toolkit is publicly available at https://github.com/Tencent/TencentPretrain.

Pre-training models
Pre-training models have been widely applied in the text scenario. The success of pre-training is largely due to powerful encoders for feature extraction (e.g., LSTM and Transformer), as well as progress on pre-training targets for learning knowledge from unsupervised corpora (Lewis et al., 2020; Lan et al., 2019). More recently, the text pre-training paradigm has been replicated in other modalities. For example, the Transformer encoder (and its variants) has been widely used in vision (Dosovitskiy et al., 2020), audio (Gulati et al., 2020), and vision-language tasks (Radford et al., 2021). Regarding the pre-training target component, text models have inspired models of other modalities. Mirroring the idea of masked language modeling (MLM), MAE (He et al., 2022), BEiT (Bao et al., 2021), and SimMIM use masked image modeling (MIM) for self-supervised vision pre-training. The speech model Wav2vec2.0 (Baevski et al., 2020) exploits negative sampling in its pre-training target, which was previously used in word embedding (Mikolov et al., 2013) and sentence prediction models (Logeswaran and Lee, 2018; Devlin et al., 2019; Lan et al., 2019).
In addition to the sharing of modules, several works have recently shown the feasibility of using the same pre-trained weights to handle different modalities simultaneously. For example, ERNIE-ViLG and Talk2Face exploit the prefix language model to achieve bi-directional text-and-image generation.
PolyViT uses a single transformer model for image, video and audio classification (Likhosherstov et al., 2021).
It can be seen that the trend towards homogeneity of pre-training models is becoming obvious, from sharing modules to using the same network and parameters. This inspires us to build a unified framework that can implement various pre-training models efficiently.

Toolkits with modular design
Modular design regards a complex system as the combination of multiple modules, each of which can be independently modified and replaced. In the field of artificial intelligence, a typical work with modular design is Keras (Chollet et al., 2015). The core data structure of Keras is the layer, and Keras allows building arbitrary graphs of layers to construct NN models. In the NLP field, modular toolkits are prevailing, and they decompose models from different perspectives at different abstraction levels. For example, OpenNMT (Klein et al., 2017) is a modular toolkit designed for NMT. It builds an NMT model through the combination of encoder and decoder modules. Related NLP modular toolkits include OpenAttack (designed for text attack) (Zeng et al., 2021), Ngram2vec (designed for word embedding) (Zhao et al., 2017), TextFlint (designed for robustness evaluation), NeuralClassifier (designed for text classification) (Liu et al., 2019a), etc. Inspired by the above-mentioned works, this paper proposes TencentPretrain, a modularly designed toolkit for pre-training models of different modalities. Compared with Huggingface Transformers (Wolf et al., 2020), the most well-known pre-training toolkit, TencentPretrain provides additional abstractions over pre-training model implementations, splitting a complete model into multiple modules hierarchically. Pre-training weights can be switched between the two toolkits easily. In fact, TencentPretrain can be regarded as a high-level encapsulation of Huggingface Transformers.
It is worth mentioning that TencentPretrain reuses part of the code of UER (Zhao et al., 2019), which was published in 2019 and supports several text pre-training models. Compared with UER, TencentPretrain is improved in three aspects: 1) it supports modular design within components, providing a more scalable manner to build pre-training models; 2) the target embedding and decoder components are introduced to support sequence generation; 3) in addition to text, TencentPretrain supports vision, audio, and cross-modal pre-training models. Currently, TencentPretrain supports around 30 pre-training models.

Framework
The current mainstream pre-training models are basically similar in structure. In the embedding component, the data is mapped into an embedding matrix. The matrix is then passed through the encoder. Finally, the target layer performs pre-training tasks according to the output of the encoder. If the pre-training task requires sequence generation, the decoder is inserted between the encoder and the target. Figure 1 demonstrates the overall framework of TencentPretrain. It divides a pre-training model into five components, and various modules are provided in each component. In practice, a user first selects one or multiple modules from each component (modularization within a component), and then combines modules from different components to build a pre-training model (modularization across components). In the rest of this section, we respectively introduce the above five components and the modules included in them.
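To make this data flow concrete, the following minimal PyTorch sketch shows one way the five components could be composed in a forward pass. The class name, the (src, tgt, seg) signature, and the module call conventions are illustrative assumptions of ours, not the actual TencentPretrain interface.

import torch.nn as nn

class PretrainModel(nn.Module):
    # Illustrative composition of the five components; names are assumptions.
    def __init__(self, embedding, encoder, target, tgt_embedding=None, decoder=None):
        super().__init__()
        self.embedding = embedding          # maps raw input to an embedding matrix
        self.encoder = encoder              # extracts high-level features
        self.tgt_embedding = tgt_embedding  # optional, only for sequence generation
        self.decoder = decoder              # optional, only for sequence generation
        self.target = target                # computes the pre-training loss

    def forward(self, src, tgt, seg):
        emb = self.embedding(src, seg)
        memory = self.encoder(emb, seg)
        # Seq2seq models such as T5 insert a decoder between encoder and target.
        if self.decoder is not None:
            tgt_emb = self.tgt_embedding(tgt, seg)
            memory = self.decoder(memory, tgt_emb)
        return self.target(memory, tgt)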

Embedding
In the embedding component, TencentPretrain converts text, image, and audio data into an embedding matrix. The matrix holds the low-level features that serve as the input to the encoder.
TencentPretrain also contains auxiliary embedding modules, e.g., position embedding and segment embedding. The embedding of a pre-training model is usually obtained by adding multiple modules together. As shown in Table 1 (Embedding column), the addition of word, position, and segment embeddings constitutes the embedding layer of BERT; the addition of patch and position embeddings constitutes the embedding layer of ViT. TencentPretrain supports hierarchical modular design, enabling users to freely combine modules within the embedding component to construct the desired embedding layer, as shown in the sketch below. This design greatly reduces code redundancy since different models often use similar, rather than identical, combinations.
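As a sketch of this additive combination (under an assumed (src, seg) module interface, not the toolkit's actual code), an embedding layer could be assembled by summing the outputs of the selected modules:

import torch.nn as nn

class ComposedEmbedding(nn.Module):
    # Sketch: sum the selected embedding modules, e.g., word + pos + seg
    # for BERT or patch + pos for ViT. The interface is an assumption.
    def __init__(self, selected_modules):
        super().__init__()
        self.embeddings = nn.ModuleList(selected_modules)

    def forward(self, src, seg):
        emb = None
        for module in self.embeddings:
            out = module(src, seg)
            emb = out if emb is None else emb + out
        return emb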

Encoder

TencentPretrain provides a range of encoder modules, e.g., LSTM, GRU, bi-LSTM, and Transformer (see the Encoder column in Table 1), which extract high-level features from the output of the embedding component.
In addition, TencentPretrain supports dual-stream encoders, with which users specify two encoder modules separately. A dual-stream encoder is usually used by models related to semantic search, such as the text pair model SBERT (Reimers and Gurevych, 2019) and the text-image pair model CLIP (Radford et al., 2021).
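A minimal sketch of the dual-stream setup (again with assumed names and interfaces) might look as follows, with each stream owning its own embedding and encoder, and a shared target consuming both pooled features:

import torch.nn as nn

class DualStreamModel(nn.Module):
    # Sketch: two independent streams, e.g., text (stream 0) and image
    # (stream 1) in CLIP, feeding a contrastive target. Names are assumptions.
    def __init__(self, embedding_0, encoder_0, embedding_1, encoder_1, target):
        super().__init__()
        self.embedding_0, self.encoder_0 = embedding_0, encoder_0
        self.embedding_1, self.encoder_1 = embedding_1, encoder_1
        self.target = target

    def forward(self, src_0, seg_0, src_1, seg_1):
        # "first" pooling: take the feature of the first token / patch.
        feat_0 = self.encoder_0(self.embedding_0(src_0, seg_0), seg_0)[:, 0, :]
        feat_1 = self.encoder_1(self.embedding_1(src_1, seg_1), seg_1)[:, 0, :]
        return self.target(feat_0, feat_1)   # e.g., contrastive (clr) loss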

Target embedding and decoder (optional)
The pre-training tasks of some models involve sequence generation. These models require modules in the target embedding and decoder components. The modules in these two components are identical to the modules in the embedding and encoder components, respectively.

Target
The module in the target component receives the high-level features obtained from the encoder (or decoder) and then uses these features to perform pre-training tasks. Specifically, the target computes the loss according to the pre-training objectives, from which gradients are estimated to update the network weights. The target is of vital importance to performance and has been extensively investigated in the pre-training field (Devlin et al., 2019; Lan et al., 2019; Sun et al., 2020). TencentPretrain supports comprehensive target modules, including language model (Radford et al., 2019), classification (Conneau et al., 2017), contrastive learning (Radford et al., 2021), etc.
Sometimes pre-training models use multiple tasks, e.g., predicting word and sentence relationship simultaneously in BERT and ALBERT. Multi-task pre-training is especially common in the cross-modal scenario (Qi et al., 2020), since pre-training models have to deal with supervision signals from different modalities. The model can learn knowledge from different perspectives through multiple tasks. Owing to its hierarchical modular design, TencentPretrain facilitates the implementation of multi-task pre-training models. One can introduce multiple tasks by combining different modules in the target component, as sketched below. Pre-training tasks can be easily added, modified, and replaced.
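As a sketch (with assumed interfaces, not the toolkit's actual code), a multi-task target could simply hold several target modules and sum their losses:

import torch.nn as nn

class MultiTaskTarget(nn.Module):
    # Sketch: combine several target modules, e.g., mlm + sp for BERT,
    # by summing their losses. The interface is an assumption of ours.
    def __init__(self, targets):
        super().__init__()
        self.targets = nn.ModuleList(targets)

    def forward(self, memory, tgt):
        # Each task computes its own loss from the shared encoder output.
        return sum(target(memory, tgt) for target in self.targets)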

Downstream task fine-tuning
TencentPretrain supports comprehensive downstream tasks, including classification, regression, sequence labeling, reading comprehension, question answering, automatic speech recognition, etc. As shown in Figure 1, a downstream task model can be constructed by replacing the pre-training target with the specific task. In the evaluation section, we show the performance of TencentPretrain on a range of benchmarks.
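For instance, a classification model could be sketched as follows: the pre-trained embedding and encoder are reused, and the target is replaced by a task-specific output layer (names and interfaces are, again, illustrative assumptions):

import torch.nn as nn

class Classifier(nn.Module):
    # Sketch: reuse the pre-trained embedding and encoder; replace the
    # pre-training target with a task-specific head. Names are assumptions.
    def __init__(self, embedding, encoder, hidden_size, labels_num):
        super().__init__()
        self.embedding = embedding
        self.encoder = encoder
        self.output_layer = nn.Linear(hidden_size, labels_num)

    def forward(self, src, seg):
        emb = self.embedding(src, seg)
        memory = self.encoder(emb, seg)
        # "first" pooling: classify from the feature of the first token.
        return self.output_layer(memory[:, 0, :])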

Usage
This section provides examples of building pre-training models with TencentPretrain. The modular design enables users to quickly build a pre-training model through the combination of modules. Modules used in pre-training models are specified in configuration files, and examples are shown as follows 1 :

# BERT implementation
{
    "embedding": ["word", "pos", "seg"],
    "encoder": "transformer",
    "target": ["mlm", "sp"]
}

# T5 implementation
{
    "embedding": ["word"],
    "encoder": "transformer",
    "tgt_embedding": ["word"],
    "decoder": "transformer",
    "target": ["lm"]
}

# ViLT implementation
{
    "embedding": ["patch_word", "pos", "seg"],
    "encoder": "transformer",
    "pooling": "first",
    "target": ["cls", "mlm"]
}

# CLIP implementation
{
    "stream_0": {
        "embedding": ["word", "pos"],
        "encoder": "transformer",
        "pooling": "first"
    },
    "stream_1": {
        "embedding": ["patch", "pos"],
        "encoder": "transformer",
        "pooling": "first"
    },
    "target": ["clr"]
}

• The BERT configuration file specifies modules in the embedding, encoder, and target components. Since BERT has two pre-training tasks, its target is the combination of masked language model (mlm) and sentence prediction (sp).
• T5 involves text generation. Its configuration file additionally specifies the modules used in the target embedding and decoder components.
• ViLT, an image-text pre-training model, is basically similar to the text pre-training model BERT. The main difference is that an image-text embedding module is used in the embedding component.
• CLIP is a dual-stream model. The modules in stream_0 process text and the modules in stream_1 process images. The contrastive learning (clr) module is used in the target component.
If the desired pre-training model cannot be built by the combination of existing modules, TencentPretrain encourages users to develop a new module and combine it with existing modules. We take the implementation of the ASR model S2T as an example. Most modules required by S2T are available, and we only need to implement a new module, speech embedding, which greatly speeds up the implementation process.
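A minimal sketch of such a speech embedding module is shown below. Following common practice in speech Transformers such as S2T, it downsamples acoustic features with strided 1-D convolutions; the hyper-parameters and the (src, seg) interface are illustrative assumptions, not the actual TencentPretrain code.

import torch.nn as nn

class SpeechEmbedding(nn.Module):
    # Sketch: map acoustic features to encoder inputs with two strided
    # 1-D convolutions (4x downsampling). Hyper-parameters are assumptions.
    def __init__(self, feat_dim, hidden_size):
        super().__init__()
        self.conv1 = nn.Conv1d(feat_dim, hidden_size, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv1d(hidden_size, hidden_size, kernel_size=3, stride=2, padding=1)
        self.act = nn.GELU()

    def forward(self, src, seg):
        # src: (batch, frames, feat_dim); seg is kept only for interface
        # compatibility with other embedding modules and is unused here.
        x = self.act(self.conv1(src.transpose(1, 2)))
        x = self.act(self.conv2(x))
        return x.transpose(1, 2)  # (batch, ~frames / 4, hidden_size)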
TencentPretrain and Huggingface Transformers are interoperable. The conversion scripts are publicly available 2 , and the weights of different pre-training models can be converted between the two frameworks. In practical use, users are free to switch between these two frameworks.
With TencentPretrain, we have built a model zoo of pre-trained weights. Each pre-trained weight has two versions, which can be loaded by either TencentPretrain or Huggingface Transformers. Currently, the TencentPretrain model zoo includes over 50 pre-trained weights. We provide pre-training data as well as training details, allowing users to reproduce results with less effort. The weights (pre-trained by TencentPretrain) are currently downloaded over 500 thousand times per month 3 .

Evaluation
This section evaluates the TencentPretrain framework quantitatively. First, we compare TencentPretrain with non-modular frameworks in terms of implementation cost. Then we show that TencentPretrain can reproduce the results of SOTA models on a range of benchmarks.
Table 3: The comparison of TencentPretrain with other implementations on the GLUE benchmark (scores on MNLI, QNLI, QQP, RTE, SST, MRPC, CoLA, and STS, with averages). We pre-train from scratch and then fine-tune on a range of datasets.

Implementation cost
The number of code lines is used to estimate the implementation cost. We only count the code lines in classes inheriting from nn.Module. We compare three frameworks: Huggingface Transformers (HF), UER, and TencentPretrain (TP). Huggingface Transformers exploits a non-modular design. UER exploits a semi-modular design, which does not support modularization within components. As we continue to add new pre-training models (as shown in Table 2, from top to bottom), the number of code lines required by TencentPretrain is smaller than that of the other two toolkits. Take RoBERTa as an example: TencentPretrain does not require any code work since it reuses the modules of BERT. UER needs to add a word_pos module in the embedding component and an mlm module in the target component. Huggingface Transformers builds a series of classes specific to RoBERTa, such as RoBERTaModel, RoBERTaEmbeddings, RoBERTaEncoder, and RoBERTaPooler, which greatly increases the number of code lines. For other pre-training models, the conclusions are similar. The homogeneity among pre-training models makes the modular design much more advantageous.
In general, the code styles of Huggingface Transformers and TencentPretrain differ. Huggingface Transformers creates separate classes for each pre-training model, while TencentPretrain establishes generic modules that are independent of the specific model. Therefore, for most pre-training models, no additional code implementation is required in TencentPretrain.

Reproducibility
In this section, we follow the experimental settings of the original papers. The scripts for running models on benchmarks are organized here 4 , and users can easily reproduce the results in Tables 3 and 4. For the text modality, we use the GLUE benchmark to test TencentPretrain's performance. BERT-base and RoBERTa-large are used as test models. The results of BERT-base are listed in the first five rows of Table 3. As shown in the AVG column, our result is 82.6, which falls into the range of 82.2-83.5 (the lowest and highest results reported by other papers). The average scores reported by DynaBERT and Metadistil are slightly higher than our result. One of the reasons is that the development set of RTE only includes 277 instances, which leads to large fluctuations. The RTE results reported by DynaBERT and Metadistil are 3 points higher than our implementation. For RoBERTa-large, we can observe that our implementation results are close to the results reported in the original RoBERTa paper. Table 4 provides the results on vision and audio tasks. ViT (Dosovitskiy et al., 2020) and BEiT (Bao et al., 2021) are used as test models for the vision datasets, and Top-1 accuracy is reported. The original paper of BEiT only reports results on ImageNet. For the audio dataset, we report the automatic speech recognition (ASR) results on LibriSpeech with S2T. Word Error Rate (WER) is shown in Table 4.

Conclusion
This paper presents TencentPretrain, a pre-training toolkit characterized by modular design and multi-modal support. In TencentPretrain, pre-training models of different modalities are regarded as combinations of multiple modules, which is easy to configure and extend. Furthermore, we quantitatively demonstrate that TencentPretrain helps users reuse existing modules and decreases the cost of model development. Finally, we test TencentPretrain on a range of datasets and show that it can reproduce the SOTA results.

Limitations
Although the TencentPretrain pre-training framework has integrated optimization libraries like DeepSpeed and Apex, it still lacks support for other components such as Megatron. In the future, we will provide more parallelism modes to achieve efficient training of large language models (LLMs).