Multi-stage Pre-training over Simplified Multimodal Pre-training Models

Multimodal pre-training models, such as LXMERT, have achieved excellent results in downstream tasks. However, current pre-trained models require large amounts of training data and have huge model sizes, which makes them impossible to apply in low-resource situations. How to obtain similar or even better performance than a larger model with less pre-training data and a smaller model size has become an important problem. In this paper, we propose a new Multi-stage Pre-training (MSP) method, which uses information at different granularities, from word and phrase to sentence, in both texts and images to pre-train a model in stages. We also design several pre-training tasks suited to the information granularity of each stage in order to efficiently capture diverse knowledge from a limited corpus. We take a Simplified LXMERT (LXMERT-S), which has 45.9% of the parameters of the original LXMERT model and uses only 11.44% of the original pre-training data, as the testbed of our MSP method. Experimental results show that our method achieves performance comparable to the original LXMERT model in all downstream tasks, and even outperforms the original model in the Image-Text Retrieval task.


Introduction
The self-attention based Transformer (Vaswani et al., 2017) effectively overcomes the difficulty of running RNNs in parallel, and has greatly promoted the development of large-scale pre-training models. Pre-trained language models, such as BERT (Devlin et al., 2019), have achieved excellent performance in many natural language processing tasks. Following their success, researchers have also developed pre-training models for multimodal tasks. A series of multimodal pre-training models have been proposed, such as ViLBERT (Lu et al., 2019), LXMERT (Tan and Bansal, 2019) and UNITER (Chen et al., 2019), and have achieved excellent results in language-vision multimodal tasks.
However, current pre-training models normally have a large number of parameters, require huge amounts of pre-training data and place very high demands on computational resources. For example, the GPT model (Radford et al., 2018) has 110 million parameters, GPT-2 (Radford et al., 2019) has 1.5 billion parameters, and GPT-3 (Brown et al., 2020) has a staggering 175 billion parameters. The same is true for multimodal pre-trained models. For example, LXMERT (Tan and Bansal, 2019) has 183.5 million parameters and requires 816 TitanX GPU hours for training on 9.18 million text-image pairs. These models are too large to be deployed in many real-world scenarios. Therefore, the study of lightweight pre-training models, which can achieve performance similar to large-scale models with fewer parameters and lower training costs, is highly valuable.
There are several lines of work on developing lightweight pre-trained models, including model structure design, quantization, pruning and distillation. For example, ALBERT (Lan et al., 2020) is a lightweight model obtained through structural design such as parameter sharing and parameter decomposition, and achieves better performance than the original models; Q8BERT (Zafrir et al., 2019) compresses the model to 1/4 of its original size with no more than 1% performance loss by quantizing 32-bit floating point values into 8-bit; Michel et al. (2019) used BERT weight pruning to compress the model and found that removing a large number of attention heads does not have a major impact on model performance; TinyBERT (Jiao et al., 2020) reduces the model size by 7.5 times with no more than 4% performance loss by designing a teacher-student distillation model. All of the above works target language pre-training models, and most of them focus on the scale of model parameters. Few works address reducing training data or making multimodal pre-training models lightweight. In fact, compared with language models, multimodal pre-training models must deal with data from both the language and the visual modality, which demands larger amounts of data and more computational resources. Meanwhile, training data are more difficult to collect. Taking the size of the text-image pair datasets used for multimodal pre-training as an example, the frequently used MS COCO (Lin et al., 2014) is a high-quality dataset with only 0.82M pairs, while LAIT (Qi et al., 2020) is a large dataset with 10M pairs but of only average quality. Therefore, it is highly valuable to develop lightweight multimodal pre-training models that can make efficient use of limited data.
Existing research on curriculum learning (Bengio et al., 2009) has shown that imitating the process of human learning by gradually increasing the difficulty of a task from simple to complex in stages helps to make better use of different types of data and effectively improves learning performance. Many models (Qi et al., 2020) use as much data as is available, but little work has been done on how to arrange the tasks to make better use of limited data. We therefore borrow the idea of curriculum learning for training pre-training models. We construct a pre-training process that makes use of data from smaller units to bigger units in stages, and design appropriate pre-training tasks for each corresponding stage.
Specifically, we propose a new Multi-stage Pre-training (MSP) method. The first pre-training stage works on token units, where the text input is the category labels of the objects in the images, and the image input is the object features. An Image Features Random Shuffle (IFRS) task is designed as a pre-training task for this stage. IFRS randomly shuffles the object features, and the model predicts the original object order based on the text information. The second stage focuses on phrase units. Phrase-level descriptions of the image are input on the text side and image features are input on the image side. A Topic of Image and Text for Phrase (TITP) task is designed for it. The third stage is sentence-based pre-training. Sentence-level captions are input on the text side, and image features are input on the image side. A Topic of Image and Text for Sentence (TITS) task is designed for it. We take a Simplified LXMERT (LXMERT-S), which has fewer parameters and uses less pre-training data, as the testbed of our MSP method. Experimental results show that our method achieves performance comparable to the original LXMERT model in downstream tasks.
The main contributions of our work are as follows: (1) We propose a new MSP method that allows the model to learn different granularities of text-image correspondence information at different stages; (2) For each stage, we design pre-training tasks suitable for that stage: the IFRS task for token-based pre-training, the TITP task for phrase-based pre-training, and the TITS task for sentence-based pre-training; (3) With less pre-training data (11.76%), fewer model parameters (45.9%), less resource consumption (25%) and less training time (46.57%), the performance on downstream tasks is comparable to or even exceeds that of the original model.

Related Works
Multimodal Pre-training Models Multimodal pre-training models are mainly divided into two categories: single-stream models and two-stream models. Single-stream models, such as B2T2 (Alberti et al., 2019) and OSCAR, fuse image and text information at the beginning of the input; two-stream models, such as ViLBERT (Lu et al., 2019) and LXMERT (Tan and Bansal, 2019), first encode the image and text information separately and then fuse them. Generally, two-stream models have more parameters than single-stream models, but whether single-stream or two-stream models perform better, or whether this depends on the specific task, requires more rigorous experimental proof. We conduct our experiments based on the two-stream model LXMERT, removing the coding layers of the individual modalities and keeping only the fusion coding layer, so that the simplified LXMERT model is closer to a single-stream model.
Multimodal Pre-training Data There are several different considerations in making use of data. VisualBERT holds that pre-training on the target dataset can improve the performance of the model, so VisualBERT first pre-trains on COCO Caption and then continues pre-training on the target dataset (e.g. VQA). ImageBERT (Qi et al., 2020), on the other hand, is trained on the out-of-domain LAIT dataset and then on the in-domain datasets, such as Conceptual Captions (CC) (Sharma et al., 2018) and SBU Captions (Ordonez et al., 2011). In both cases, the dataset most similar to the downstream task is used last, and the general data is used first. Clearly, this way of using data is directly tied to the downstream tasks, and different downstream tasks might lead to a different order of data usage. In this paper, we design a staged pre-training from word level to phrase level to sentence level, which depends on the size of the information units. We also design suitable pre-training tasks for the different stages to fully exploit the text-image correspondence of different units in each stage, which is consistently effective across different downstream tasks.
Figure 1: Overview of our proposed MSP method, including three stages from token, phrase to sentence-based pre-training, with appropriate pre-training tasks for each stage of pre-training.
Multimodal Pre-training Tasks The most commonly used language pre-training task is Masked Language Modeling (MLM) (Chen et al., 2019), where tokens are masked with a given probability and the masked tokens are predicted by the model. Masked Region Feature Regression (MRFR) (Chen et al., 2019), which is similar to the MLM task, is a popular image pre-training task. The Masked Object Classification (MOC) (Qi et al., 2020) task can be regarded as a multimodal pre-training task, which predicts the category label of each masked object feature. Another popular multimodal pre-training task, Image-Text Matching (ITM) (Chen et al., 2019), is similar to the Next Sentence Prediction (NSP) task in BERT (Devlin et al., 2019): the image corresponding to a text is randomly replaced with a probability of 50%, and the task is to discriminate whether the image matches the text. The existing pre-training tasks for multimodal data are limited. We design new pre-training tasks with the aim of making full use of the existing training dataset at different granularities.

Method
The overall structure of our MSP method is shown in Figure 1. The pre-training process is divided into three stages based on different granularities of text-image correspondence from token, phrase to sentence. We design corresponding pre-training tasks for the three stages.
We perform the above three-stage pre-training on a simplified model of LXMERT (LXMERT-S). The simplification of the LXMERT model is shown in Figure 2. We obtain LXMERT-S by removing the Object-Relationship Encoder and the Language Encoder, while the Cross-Modality Encoder of LXMERT-S is identical to that of LXMERT. The image features and text features are input directly to the Cross-Modality Encoder in LXMERT-S.
By removing the single-modality coding layers in LXMERT, the 12-layer LXMERT is simplified to a 5-layer LXMERT-S. LXMERT-S has only 45.9% of the parameters of the original model, and the whole experiment can be completed on a single GPU. The three-stage pre-training method is also fully applicable to other pre-training models.

Stage 1: Word-based Pre-training
The first stage of pre-training focuses on learning the correspondence between text token units and image objects to help the model mine fine-grained information. To this end, we design the appropriate pre-training tasks and corresponding dataset for this phase of pre-training.

Pre-training Tasks
We design an Image Features Random Shuffle (IFRS) pre-training task to enhance the pre-training of the token layer, based on the existing Masked Language Modeling (MLM) (Chen et al., 2019), Masked Region Feature Regression (MRFR) (Chen et al., 2019) and Masked Object Classification (MOC) (Qi et al., 2020).
Image Features Random Shuffle (IFRS): Given a set of image regions $R = \{r_1, r_2, r_3, \ldots, r_m\}$, which are obtained by adding a fully-connected (FC) layer to the regions of interest (ROIs) and projecting them to the hidden size, a feature triplet is three consecutive features in $R$, e.g. $t_j = (r_i, r_{i+1}, r_{i+2})$. A shuffle on a triplet randomly changes the order of the features in the triplet with a probability of 5%. The shuffled triplet $t_j^{[S]}$ is used as input for the network, and the corresponding output is converted to the dimensionality of the ROIs to obtain $h_\theta(t_j^{[S]})$. We use the L2 loss to calculate the distance between the network output $h_\theta(t_j^{[S]})$ and $f_\theta(t_j)$, as in the following equation:
$$\mathcal{L}_{\mathrm{IFRS}} = \sum_{j=1}^{K} \left\| h_\theta(t_j^{[S]}) - f_\theta(t_j) \right\|_2^2$$
where $K$ is the number of shuffled triplets.
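The IFRS procedure can be illustrated with a minimal plain-Python sketch. It is not the authors' implementation: the function names, the use of non-overlapping triplets, the sum-of-squares form of the L2 loss, and the identity stand-in for the network are all assumptions made for illustration.

```python
import random

# The five orderings of a triplet that differ from the original order.
NON_IDENTITY_PERMS = [(0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]


def shuffle_triplets(regions, p=0.05, rng=None):
    """Reorder consecutive triplets of region features, each with probability p.

    Returns the new region list and the start index of every shuffled triplet.
    Non-overlapping triplets are an assumption of this sketch.
    """
    rng = rng or random.Random()
    out = list(regions)
    shuffled_starts = []
    for i in range(0, len(regions) - 2, 3):
        if rng.random() < p:
            perm = rng.choice(NON_IDENTITY_PERMS)
            out[i:i + 3] = [regions[i + k] for k in perm]
            shuffled_starts.append(i)
    return out, shuffled_starts


def ifrs_l2_loss(outputs, targets, shuffled_starts):
    """Sum of squared differences over the K shuffled triplets (assumed form)."""
    loss = 0.0
    for i in shuffled_starts:
        for out_vec, tgt_vec in zip(outputs[i:i + 3], targets[i:i + 3]):
            loss += sum((a - b) ** 2 for a, b in zip(out_vec, tgt_vec))
    return loss


# Toy example: six 2-d "region features", every triplet shuffled (p=1.0).
regions = [[float(i), float(i) + 0.5] for i in range(6)]
shuffled, starts = shuffle_triplets(regions, p=1.0, rng=random.Random(0))
# Stand-in for the network: it returns its input unchanged, so the loss
# measures how far the shuffled order is from the original order.
loss = ifrs_l2_loss(shuffled, regions, starts)
```

With the 5% probability from the text, most triplets are left untouched, so only a few of the triplets of a 36-region image contribute to the loss in any one step.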
Other pre-training tasks: We add the existing MLM, MRFR and MOC tasks to the token-based pre-training. MLM masks the token-level category labels of objects with a certain probability P, and the model predicts the masked category label based on the corresponding object feature on the image side. MRFR masks the object features, and the model predicts the original object-level features based on the text-side category label and information around the object. MOC predicts the category and attribute labels of the masked object features.
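As a rough illustration of the masking step shared by MLM and MRFR, the sketch below masks each input position independently with probability P. The function name, the `[MASK]` placeholder and the independent per-position draw are assumptions; the default of 0.15 is the masking rate reported in the implementation details.

```python
import random

MASK = "[MASK]"  # placeholder symbol, assumed for this sketch


def mask_sequence(items, p=0.15, rng=None):
    """Mask each item with probability p.

    Returns the masked sequence and a parallel list of prediction targets,
    where unmasked positions carry None (nothing to predict there).
    """
    rng = rng or random.Random()
    masked, targets = [], []
    for item in items:
        if rng.random() < p:
            masked.append(MASK)
            targets.append(item)
        else:
            masked.append(item)
            targets.append(None)
    return masked, targets


# In the token stage the text side is a list of object category labels.
labels = ["horse", "man", "rock", "grass"]
masked_labels, label_targets = mask_sequence(labels, p=0.15, rng=random.Random(1))
```

The same routine applies to object features on the image side, with the masked feature replaced by whatever placeholder vector the model uses.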
Training Data We extract the training data for the IFRS task directly from caption-image pairs. For each image, 36 object features and their corresponding 36 category labels are provided by Faster-RCNN. These category labels have been unified with the text vocabulary, so they are all included in the text vocabulary. During training, the image side inputs the image features in sequence, and the text side inputs the category labels in the corresponding order. In the IFRS task, when the image side is shuffled, the order of the text side remains unchanged.

Stage 2: Phrase-based Pre-training
The previous stage explores the correspondence between the image objects and their categories. This stage mines the correspondence between the image object and the phrase describing the object. Since the phrase description usually contains richer information about the attributes of the object, such as "green old car", building a pre-training task based on the correspondence between the phrase and the object allows the model to obtain rich information about the attributes.
Pre-training Tasks We define a Topic of Image and Text for Phrase (TITP) pre-training task that more directly supports phrase-based information mining.
Topic of Image and Text for Phrase (TITP): Given a token sequence of the phrase-level image description $W = \{w_1, w_2, w_3, \ldots, w_n\}$, an object feature sequence $R = \{r_1, r_2, r_3, \ldots, r_m\}$, and the corresponding category label sequence $L = \{l_1, l_2, l_3, \ldots, l_m\}$ extracted by Faster-RCNN, let the topic set be $topic = W \cap L = \{p_1, p_2, \ldots, p_q\}$ and the label set be $Y = \{y_1, y_2, \ldots, y_v\}$, where $v$ is the size of the vocabulary. If the $i$-th vocabulary word belongs to $topic$, then $y_i$ is 1; otherwise $y_i$ is 0. We add an FC layer to the multimodal representation to get $s_\theta(W, R)$, predict the correct topic from the $v$ vocabulary categories, and use BCELoss to calculate the gap between the model output $s_\theta(W, R)$ and the label $Y$.
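The topic label construction and the loss can be sketched as follows. This is an illustrative sketch only: the function names, the toy vocabulary, the treatment of each category label as a single vocabulary word, and the mean reduction of the binary cross-entropy are assumptions, not details confirmed by the text.

```python
import math


def topic_labels(text_tokens, object_labels, vocab):
    """Multi-hot vector Y over the vocabulary: y_i = 1 iff word i is in W ∩ L."""
    topic = set(text_tokens) & set(object_labels)
    return [1.0 if word in topic else 0.0 for word in vocab]


def bce_loss(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(targets)


# Toy vocabulary and inputs (hypothetical values).
vocab = ["car", "green", "old", "tree", "dog"]
W = ["green", "old", "car"]   # tokens of the phrase-level description
L = ["car", "tree"]           # category labels of the detected objects
Y = topic_labels(W, L, vocab)

# Stand-in for the model output s_theta(W, R) after a sigmoid.
uninformed = [0.5] * len(vocab)
loss = bce_loss(uninformed, Y)
```

Here only "car" appears both in the description and among the detected object labels, so it is the single positive entry of Y.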
Other pre-training tasks: We add the MLM, MRFR and MOC tasks to the phrase-based pre-training. MLM masks the attribute or category information of the phrase with a certain probability P, and the model predicts the masked information based on the corresponding object features. MRFR masks the object features of the image, and the model predicts the original object based on the phrase-level description on the text side and the surrounding object information. MOC predicts the category and attribute of the masked object based on the surrounding image features and the phrase-level description on the text side.
Training Data: We obtain the corresponding training data based on the Visual Genome (VG) (Krishna et al., 2017) dataset, which contains a large number of phrases. We eliminate the phrases containing verbs. The remaining phrases are concatenated with commas to obtain a phrase-level description of the image. During training, the spliced VG phrase is used as input on the text side and 36 object features extracted by Faster-RCNN are input on the image side.
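The construction of the phrase-level description can be sketched as below. The text does not say how verbs are detected, so the part-of-speech check is abstracted into a caller-supplied predicate; `build_phrase_description`, `is_verb` and the toy verb list are hypothetical names introduced for illustration.

```python
def build_phrase_description(phrases, is_verb):
    """Drop every phrase that contains a verb, then join the rest with commas."""
    kept = [
        phrase for phrase in phrases
        if not any(is_verb(token) for token in phrase.lower().split())
    ]
    return ", ".join(kept)


# Toy stand-in for a part-of-speech tagger (hypothetical verb list).
TOY_VERBS = {"riding", "sitting", "holding"}

phrases = ["green old car", "man riding horse", "white blanket"]
description = build_phrase_description(phrases, lambda tok: tok in TOY_VERBS)
print(description)  # green old car, white blanket
```

In practice the predicate would wrap a real part-of-speech tagger; the filtering and joining logic stays the same.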

Stage 3: Sentence-based Pre-training
Building on the above token-based and phrase-based training, this stage uses the overall sentence-image correspondence for pre-training, in order to mine text-image information at the level of larger units.
Pre-training Tasks We design two sentence-level pre-training tasks, Image-Text Matching Based on Hard Sample (ITM_HS) and Topic of Image and Text for Sentence (TITS), described as follows.
Image-Text Matching Based on Hard Sample (ITM_HS): The purpose of this task is to reduce the noise brought to the model when the text-image pair does not match. We retrieve the top M most similar images for each image from the difficult samples file as its hard sample set. In the ITM_HS task, each image is replaced with a randomly selected hard sample with a probability of 50% if its hard sample set is not empty. If the hard sample set of the current sample is empty, an image in the training set is randomly selected instead. Given the token sequence $W = \{w_1, w_2, w_3, \ldots, w_n\}$ and the image feature sequence $R = \{r_1, r_2, r_3, \ldots, r_m\}$, the label $y \in \{0, 1\}$ indicates whether the input image-text pair matches. We apply an FC layer on top of the multimodal representation to get $s_\theta(W, R)$, which is the matching score of the image and text.
Topic of Image and Text for Sentence (TITS): The purpose of this task is to jointly predict the content described by both the image and the sentence. Given a token sequence $W = \{w_1, w_2, w_3, \ldots, w_n\}$, an image feature sequence $R = \{r_1, r_2, r_3, \ldots, r_m\}$, and category labels for the object features $L = \{l_1, l_2, l_3, \ldots, l_m\}$, let $topic = W \cap L = \{p_1, p_2, \ldots, p_q\}$ and the label $Y = \{y_1, y_2, \ldots, y_v\}$, where $v$ is the size of the vocabulary. If the $i$-th vocabulary word belongs to $topic$, then $y_i$ is 1; otherwise $y_i$ is 0. We apply an FC layer on top of the multimodal representation, convert its dimension to the vocabulary size $v$ to get $s_\theta(W, R)$, and use BCELoss to calculate the gap between the model output $s_\theta(W, R)$ and the label $Y$.
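The hard-sample replacement rule of the image-text matching task described above can be sketched as follows. The function name, the dictionary layout of the hard sample sets, and the choice to exclude the original image from the random fallback are assumptions made for illustration.

```python
import random


def sample_itm_pair(image_id, hard_samples, all_image_ids, rng, p_replace=0.5):
    """Pick the image to pair with a caption and the matching label.

    Returns (image_id_to_use, label): label 1 means the original image was
    kept (match), label 0 means it was replaced (mismatch).
    """
    if rng.random() >= p_replace:
        return image_id, 1
    candidates = hard_samples.get(image_id, [])
    if candidates:
        # Prefer a visually similar "hard" image as the negative.
        return rng.choice(candidates), 0
    # Empty hard sample set: fall back to a random training image.
    others = [i for i in all_image_ids if i != image_id]
    return rng.choice(others), 0


# Toy data (hypothetical identifiers).
all_ids = ["img0", "img1", "img2", "img3"]
hard = {"img0": ["img1", "img2"], "img3": []}
rng = random.Random(0)
picked, label = sample_itm_pair("img0", hard, all_ids, rng, p_replace=1.0)
```

Replacing with a similar image instead of an arbitrary one makes the negative pair harder to reject, which is the motivation the text gives for the task.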
Other pre-training tasks: We add the existing MLM, MRFR and MOC tasks to the sentence-based pre-training. MLM masks the information in the sentence, and the model predicts the masked information based on all the information on the image side. MRFR masks the object features of the image, and the model predicts the original object based on the overall sentence-level information on the text side and the surrounding object information. MOC predicts the category and attribute of the masked object based on the image features and the sentence-level description on the text side.
Training Data In this stage, the image and its corresponding caption are used directly as input: the sentence-level caption is input on the text side, and the 36 object features provided by Faster-RCNN are input on the image side.

Pre-training Dataset
In this paper, the model is pre-trained using the COCO dataset and part of the VG dataset. Only 1.08M text-image pairs are used: 0.12M in the token-based pre-training stage, 0.34M in the phrase-based pre-training stage, and 0.62M in the sentence-based pre-training stage. All datasets we use are also used by the original LXMERT.
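The data budget can be checked with a few lines of arithmetic. The per-stage counts are the ones given above, and the 9.18M figure is the number of text-image pairs reported for the original LXMERT in the introduction; the variable names are illustrative.

```python
# Text-image pairs per stage, in millions (figures quoted in the text).
stage_pairs_m = {"token": 0.12, "phrase": 0.34, "sentence": 0.62}
lxmert_pairs_m = 9.18  # pairs used by the original LXMERT, in millions

total_m = sum(stage_pairs_m.values())
used_pct = 100 * total_m / lxmert_pairs_m
reduced_pct = 100 - used_pct

print(f"total pairs: {total_m:.2f}M")
print(f"share of LXMERT data: {used_pct:.2f}%")
print(f"reduction: {reduced_pct:.2f}%")
```

This gives 1.08M pairs in total, 11.76% of the LXMERT pre-training data, which corresponds to the 88.24% reduction quoted in the results.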

Downstream Tasks and Data Sets
Visual Question Answering (VQA): There are multiple datasets for VQA. We use three commonly used datasets: VQA V2.0 (Goyal et al., 2017), GQA (Hudson and Manning, 2019), and NLVR2 (Suhr et al., 2019). Accuracy is used to measure model performance.
Cross-modal Retrieval task: We choose the Flickr30K (Young et al., 2014) dataset for the retrieval task, and evaluate the performance of the model in Image Retrieval (IR), Text Retrieval (TR), Zero Shot Image Retrieval (ZS-IR), and Zero Shot Text Retrieval (ZS-TR). The performance metric is the matching score of text and image pairs. Zero-shot evaluation measures the performance of the pre-trained model directly on the test set without fine-tuning, and is used to assess the effect of pre-training. Therefore, ZS-IR and ZS-TR load the pre-trained model parameters directly to perform the IR and TR tasks without fine-tuning.
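The retrieval results are later reported as R@1. As a reminder of how such a recall-at-K figure is computed from matching scores, here is a small sketch; the function name, the one-gold-item-per-query simplification and the toy score matrix are assumptions made for illustration.

```python
def recall_at_k(score_matrix, gold_indices, k):
    """Fraction of queries whose gold candidate is among the k best-scored.

    score_matrix[q][c] is the matching score of query q and candidate c;
    gold_indices[q] is the index of the correct candidate for query q.
    """
    hits = 0
    for scores, gold in zip(score_matrix, gold_indices):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(gold_indices)


# Toy scores for 3 text queries against 3 candidate images.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
gold = [0, 1, 2]
r_at_1 = recall_at_k(scores, gold, k=1)
r_at_2 = recall_at_k(scores, gold, k=2)
```

In the zero-shot setting the scores come from the pre-trained matching head directly; after fine-tuning they come from the fine-tuned model, and the metric itself is unchanged.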
In the fine-tuning stage, the multimodal representation of the model is passed through an FC layer as a joint representation of image and text to solve downstream tasks. For VQA tasks, we project the multimodal representation into the answer category dimension through the FC layer to predict the answer to each question. For the Image-Text Retrieval task on Flickr30K (Young et al., 2014), we randomly replace the image or text to construct three negative examples for each image-text pair, including two random negative examples and one hard sample, and use BCELoss to calculate the difference between the matching score and the text-image matching label.
Unified VLP Unified VLP uses a shared 12-layer Transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented as separate models. It is pre-trained on Conceptual Captions (CC) (Sharma et al., 2018), which has around 3.3 million image-text pairs, and requires 150 hours of training on 8 V100 GPUs. Unified VLP includes only the MLM task when processing comprehension tasks.
VisualBERT VisualBERT contains 12 Transformer layers with 85.05M parameters. It first pre-trains on COCO Caption (Lin et al., 2014) with the MLM and ITM tasks and then continues pre-training on the target dataset with the MLM task. The pre-training data sizes for VisualBERT on the VQA V2.0 task are shown in Table 1. For different downstream tasks, the second stage of pre-training needs to be repeated.
VL-BERT VL-BERT contains 12 Transformer layers with 134.8M parameters. It pre-trains on both visual-linguistic and text-only datasets. Samples are randomly drawn from both CC and BooksCorpus (Zhu et al., 2015) & English Wikipedia (at a ratio of 1:1) in each mini-batch. VL-BERT considers ITM to be harmful to downstream tasks and therefore only includes the MLM and MOC tasks.
ViLBERT ViLBERT extends the popular BERT architecture to a multimodal two-stream model, processing visual and textual inputs in separate streams that interact through co-attentional Transformer layers. It trains on CC with the MLM, MOC and ITM tasks.
LXMERT LXMERT is a large-scale Transformer model that consists of three encoders and is pre-trained on large-scale data, including MS COCO, Visual Genome, VQA v2.0, GQA and VG-QA (Zhu et al., 2016). Pre-training requires 8.5 days on 4 TitanX GPUs. It also has many pre-training tasks, including MLM, MRFR, MOC, ITM and Image Question Answering (QA) (Tan and Bansal, 2019), and has achieved good results in downstream tasks, especially VQA tasks.

Implementation Details
Our Transformer backbone is the same as that of LXMERT, where each Transformer block has 768 hidden units and 12 attention heads. Image features are extracted by a Faster-RCNN (Ren et al., 2015) model (with a ResNet-101 (He et al., 2016) backbone) trained on Visual Genome (VG). During pre-training, our model is trained for about 95 hours on 1 TitanX GPU, using Adam (Kingma and Ba, 2015) as the optimizer with a learning rate of 1e-5. We train the token-based model for 10 epochs with a batch size of 64, the phrase-based model for 20 epochs with a batch size of 128, and the sentence-based model for 20 epochs with a batch size of 128.
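The three-stage schedule can be written down as a small configuration plus a driver loop. This is a sketch under stated assumptions: `StageConfig`, `run_schedule` and the `train_one_epoch` callback are hypothetical names, and only the task lists, epoch counts, batch sizes and learning rate are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    name: str
    tasks: tuple
    epochs: int
    batch_size: int


LEARNING_RATE = 1e-5  # Adam learning rate used during pre-training

# Stages run in order of increasing granularity: token -> phrase -> sentence.
SCHEDULE = (
    StageConfig("token", ("IFRS", "MLM", "MRFR", "MOC"), 10, 64),
    StageConfig("phrase", ("TITP", "MLM", "MRFR", "MOC"), 20, 128),
    StageConfig("sentence", ("ITM_HS", "TITS", "MLM", "MRFR", "MOC"), 20, 128),
)


def run_schedule(schedule, train_one_epoch):
    """Run the stages in order, carrying the model state across stages.

    train_one_epoch(state, stage, epoch) is a caller-supplied function
    (hypothetical here) that trains for one epoch and returns the new state.
    """
    state = None
    for stage in schedule:
        for epoch in range(stage.epochs):
            state = train_one_epoch(state, stage, epoch)
    return state


total_epochs = sum(stage.epochs for stage in SCHEDULE)
```

Keeping the schedule as data makes variants such as dropping the phrase stage or reordering the stages a one-line change.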
During fine-tuning, the learning rate for all downstream tasks is 5e-5, and the batch size is 32. We fine-tune for 6 epochs on VQA V2.0, 5 epochs on GQA, and 8 epochs on NLVR2 and the Image-Text Retrieval tasks.
For hard samples in the ITM_HS task, we retrieve the top 100 most similar images from the difficult samples file. For the masking strategies, we randomly mask 15% of the tokens and 15% of the object features. The code of our models is available at https://github.com/lttsmn/LXMERT-S. Table 2 gives the results of the model on the three VQA datasets, and Table 3 gives the results of the model on the Flickr30K Image-Text Retrieval dataset.

Experimental Results
It can be seen from Tables 2 and 3 that the pre-training model proposed in this paper achieves performance comparable to the existing large models with less training data, fewer parameters and less computing resource usage. In some cases, our small model even outperforms the large one. For example, our NLVR2 accuracy is 0.22 higher than LXMERT on Test-P, and our ZS-IR is 18.42 higher than LXMERT in R@1, while the model parameters are reduced by 54.1% and the training data set is reduced by 88.24%.
In Table 4, the first column gives the number of stage(s) in pre-training. The second column gives the stage(s) used, where S stands for the sentence stage, P for the phrase stage, and T for the token stage. T→S means there are two stages, token-based pre-training first and then sentence-based pre-training. T→P→S means there are three stages, token-based pre-training first, then phrase-based pre-training, and sentence-based pre-training last. T+P+S means that all stages are trained together. The third column gives the pre-training tasks used. We first give all the pre-training tasks used in the training stages, then verify the validity of each pre-training task by removing it from the full set; "-" indicates that a pre-training task is removed.

Ablation Study
From Table 4, we can find: (1) As training stages are added in order, the performance of the model on downstream tasks gradually improves; (2) Training from small granularity to large granularity is the most effective training sequence; (3) The pre-training tasks we propose for each stage improve the performance of the model on downstream tasks. For example, TITP improves VQA performance by 0.09, GQA performance by 0.4, NLVR2 performance by 0.24, IR performance by 0.48, and ZS-TR by 0.83.
Figure 4: Image-Text Retrieval examples; for each input image, the highest-scoring caption predicted by the model is shown.

Qualitative Analysis
We visualize the impact of different pre-training stages on the VQA and Image-Text Retrieval tasks. For VQA, we show the answer probability distributions: for each example in Figure 3, the left side is the input image of the model, and the right side is the probability distribution of the top-3 scoring answers in different pre-training stages. For the Image-Text Retrieval task, we select the top-1 caption for visualization: for each sample in Figure 4, the left side is the input image and the right side is the highest-scoring caption predicted by the model.
From both Figures 3 and 4, we can find: (1) Token-based pre-training (S vs T→S) helps the model to learn object information in the images. For example, in the left samples in Figures 3 and 4, adding token-based pre-training makes the model focus on object information such as horses, man and rocks in the images, which improves its performance on downstream tasks; (2) Phrase-based pre-training (T→S vs T→P→S) helps the model to learn information about the attributes of the objects. As shown in the right-hand images in Figures 3 and 4, the model pays attention to attribute information, e.g. the blanket is white and the clothes are pink.

Conclusion
In this paper, inspired by the idea of curriculum learning, we propose an MSP method, which uses information at different granularities, from word and phrase to sentence, in both texts and images to pre-train a model in stages. We also design pre-training tasks suitable for each stage of pre-training: the IFRS task for word-based pre-training, the TITP task for phrase-based pre-training, and the TITS task for sentence-based pre-training. Experimental results on several VQA datasets as well as one cross-modal retrieval dataset show that our method achieves similar or even better performance than a larger model in terms of accuracy in all downstream tasks, while the model parameters are reduced by 54.1% and the training data set is reduced by 88.24%. In future work, we will apply the above training method to other simplified pre-trained models to further explore the effectiveness of the MSP method.