UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

Existing pre-training methods either focus on single-modal tasks or multi-modal tasks, and cannot effectively adapt to each other. They can only utilize single-modal data (i.e., text or image) or limited multi-modal data (i.e., image-text pairs). In this work, we propose a UNIfied-MOdal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large-scale free text corpora and image collections are utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information into a unified semantic space over a corpus of image-text pairs augmented with related images and texts. With the help of rich non-paired single-modal data, our model is able to learn more generalizable representations by allowing textual knowledge and visual knowledge to enhance each other in the unified semantic space. The experimental results show that UNIMO greatly improves the performance of several single-modal and multi-modal downstream tasks. Our code and pre-trained models are public at https://github.com/PaddlePaddle/Research/tree/master/NLP/UNIMO.


Introduction
Large-scale pre-training has drawn much attention in both the Computer Vision (CV) and Natural Language Processing (NLP) communities due to its strong capability of generalization and efficient usage of large-scale data.

[Figure 1: An illustrative example of the necessity of unified-modal learning. Question: "Who is standing behind the baseball player?" Background (from Wikipedia): "Any baseball game involves one or more umpires, who make rulings on the outcome of each play. At a minimum, one umpire will stand behind the catcher, to have a good view of the strike zone, and call balls and strikes. Additional umpires may be stationed near the other bases." We can only determine the correct answer to the visual question based on the textual background information.]

In CV, a series of models were first designed and pre-trained on the large-scale ImageNet dataset, such as AlexNet (Krizhevsky et al., 2017), VGG (Simonyan and Zisserman, 2014) and ResNet (He et al., 2016), which effectively improved the capability of image recognition for numerous tasks. Recent years have witnessed a burst of pre-training in NLP, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019) and UniLM (Dong et al., 2019), which greatly improve the capabilities of language understanding and generation. However, the above research focuses on single-modal learning and can only be effectively used in single-modal (i.e., text-only or image-only) scenarios. In order to adapt to multi-modal scenarios, a series of multi-modal pre-training methods were proposed and pre-trained on corpora of image-text pairs, such as ViLBERT, VisualBERT (Li et al., 2019b) and UNITER (Chen et al., 2020b), which greatly improve the ability to process multi-modal information. However, these models can only utilize the limited corpus of image-text pairs and cannot be effectively adapted to single-modal scenarios (Lin et al., 2020b).

* These authors contributed equally to this study and are listed in random order.
A smarter AI system should be able to process different modalities of information effectively. There are large volumes of data in different modalities on the Web, mainly textual and visual information. The textual knowledge and the visual knowledge usually can enhance and complement each other.

[Figure 2: The Unified-Modal Transformer, which processes textual input, visual input, and image-text pairs (e.g., an image paired with the caption "The baseball player readies to swing at the pitch while the umpire behind him looks on.").]
As shown by the example in Figure 1, it is difficult to answer the question correctly with only the visual information in the image. However, if we connect the visual information to the textual information which describes the background of a baseball game, it is very easy to determine the correct answer. Likewise, the visual information can make it easier to understand the scene described by the text. The research in neuroscience by Van Ackeren et al. (2018) reveals that the parts of the human brain responsible for vision can learn to process other kinds of information, including touch and sound. Inspired by this research, we propose to design a unified-modal architecture, UNIMO, which aims to process multi-scene and multi-modal data input with one model, including textual, visual and vision-and-language data, as shown in Figure 2.
The greatest challenge in unifying different modalities is to align and unify them into a semantic space that generalizes to different modalities of data. Existing cross-modal pre-training methods try to learn cross-modal representations based on only limited image-text pairs by simple image-text matching and masked language modeling (Chen et al., 2020b). They can only learn specific representations for image-text pairs and thus fail to generalize to single-modal scenarios, so their performance drops dramatically when applied to language tasks (Lin et al., 2020b), which is also revealed by our experiments (see Section 4.2). In this work, UNIMO learns visual representations and textual representations simultaneously, and unifies them into the same semantic space via cross-modal contrastive learning (CMCL) based on a large-scale corpus of image collections, text corpora and image-text pairs. UNIMO effectively utilizes the large-scale text corpora and image collections to learn general textual and visual representations. The CMCL aligns the visual and textual representations and unifies them into the same semantic space based on image-text pairs. As shown in Figure 3, to facilitate different levels of semantic alignment between vision and language, we propose to utilize a series of text rewriting techniques to improve the diversity of cross-modal information. Specifically, for an image-text pair, various positive examples and hard negative examples can be obtained by rewriting the original caption at different levels. Moreover, to incorporate more background information from the single-modal data, text and image retrieval are also applied to augment each image-text pair with various related texts and images. The positive pairs, negative pairs, related images and texts are learned jointly by CMCL.
In this way, our model can effectively unify different levels of visual and textual representations into the same semantic space, and incorporate more single-modal knowledge so that the modalities enhance each other.
The unified-modal architecture has the following main advantages compared with previous methods:
• We can utilize large-scale non-paired text corpora and image collections on the Web to learn more generalizable textual and visual representations, and improve the capability of vision and language understanding and generation.
• Our model can be effectively fine-tuned for both single-modal and multi-modal understanding and generation downstream tasks.
• The visual knowledge and textual knowledge can enhance each other, achieving better performance on several single-modal and multi-modal tasks than previous methods.

UNIMO
[Figure 3: Illustration of the CMCL. A series of text rewriting techniques are utilized to create positive image-text pairs X+ and hard negative image-text pairs X− (e.g., rewriting the caption "The baseball player readies to swing at the pitch while the umpire behind him looks on." into negatives such as "The man rides a horse in the court while the coach watches him."). Image and text retrieval are also utilized to obtain related images X_I and texts X_T from single-modal data, which are treated as single-modal positive samples during cross-modal learning. All of them are encoded by the same unified-modal Transformer, in pairs or individually, and the representations of images and texts are extracted to compute the contrastive loss.]

Humans perceive the world through many modalities, such as sound, vision and language. Even
though any individual modality might be incomplete or noisy, important information is still perceivable since the modalities tend to share or enhance each other. With this motivation, we propose a unified-modal pre-training method, UNIMO, to learn representations that capture modality-invariant information at the semantic level. Different from previous methods, UNIMO learns from different modalities of data, including images, texts and image-text pairs, thus achieving more robust and generalizable representations for both textual and visual input.
As shown in Figure 2, UNIMO employs multi-layer self-attention Transformers to learn unified semantic representations for both textual and visual data. A textual input W is first split into a sequence of subwords W = {[CLS], w_1, ..., w_n, [SEP]} by Byte-Pair Encoding (BPE) (Sennrich et al., 2016), and the self-attention mechanism is leveraged to learn contextual token representations {h_[CLS], h_{w_1}, ..., h_{w_n}, h_[SEP]}. Similarly, a visual input V is represented as a sequence of region features {[IMG], v_1, ..., v_t}, extracted by an object detector (Faster R-CNN in our implementation). The sequence is then fed into the multi-layer Transformer network to learn cross-modal contextual representations for both the textual tokens and the image regions. We extract the representations h_[IMG] and h_[CLS] as the semantic representations of image V and text W, respectively.
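The joint input layout described above can be sketched as follows. This is a toy illustration with hypothetical dimensions and random embeddings; the real model uses BPE subword embeddings and Faster R-CNN region features, and details such as position embeddings are omitted.

```python
import numpy as np

D = 8  # toy hidden size (UNIMO-base uses a much larger dimension)

def build_input(region_feats, token_ids, embed):
    """Lay out [IMG] + region features followed by [CLS] + tokens + [SEP]
    as one joint sequence for the unified-modal Transformer."""
    img_seq = np.vstack([embed["[IMG]"], region_feats])   # image part
    txt_seq = np.vstack([embed["[CLS]"]] +
                        [embed[t] for t in token_ids] +
                        [embed["[SEP]"]])                  # text part
    return np.vstack([img_seq, txt_seq])

rng = np.random.default_rng(0)
embed = {k: rng.normal(size=D) for k in ["[IMG]", "[CLS]", "[SEP]", "base", "ball"]}
regions = rng.normal(size=(3, D))  # 3 detected regions (toy)
seq = build_input(regions, ["base", "ball"], embed)
print(seq.shape)  # (8, 8): 4 image positions + 4 text positions
```

The final hidden states at the `[IMG]` and `[CLS]` positions would then serve as the image and text semantic representations.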
Based on large volumes of image collections {V}, text corpora {W} and image-text pairs {(V, W)}, UNIMO learns generalizable visual and textual representations in similar ways via masked prediction, and unifies them into the same semantic space via CMCL. Joint visual learning on image collections, language learning on text corpora and cross-modal learning on image-text pairs not only improves the capability of visual and language understanding and generation, but also enables the textual knowledge and visual knowledge to enhance each other in the unified semantic space.

Cross-Modal Contrastive Learning
The greatest challenge in unifying different modalities is to align and unify their representations at different levels. For the example shown in Figure 2, the model not only needs to connect the scene shown in the whole image to an article describing a baseball game, but also needs to align the two men and their spatial relationship in the image with "baseball player", "umpire" and "behind" in the text, respectively. Several existing cross-modal pre-training methods try to align visual and textual representations by simple image-text matching (Li et al., 2019a; Chen et al., 2020b) based on a limited corpus of image-text pairs. They randomly sample a negative image or text from the same training batch for each image-text pair, and utilize a classifier to determine whether the image and text match. As the randomly sampled negative text or image is usually very different from the original one, they can only learn a very coarse alignment between textual and visual representations. In this work, we propose a novel CMCL method to align and unify different levels of textual and visual representations into the same semantic space.
The main idea is to keep the representations of paired images and texts close in the representation space while pushing non-paired ones far apart. The representations of image V and text W are used to compute a similarity that measures their distance d(V, W). As shown in Figure 3, to facilitate semantic alignment between vision and language at different levels, we design several novel text rewriting techniques that rewrite the original caption of an image at the word, phrase or sentence level. In this way, we can create large volumes of positive examples X+ and negative examples X− for each image-text pair (V, W). Moreover, to augment cross-modal learning with single-modal information, text and image retrieval are applied to obtain various related texts X_T and images X_I for each image-text pair (V, W). Different from the positive and negative image-text pairs, the retrieved images and texts are encoded individually, as they mainly carry weak correlations, as shown in the right part of Figure 3. Based on these positive and negative examples, the following contrastive loss L_CMCL is utilized to learn detailed semantic alignments across vision and language:

    L_CMCL(V, W) = −log [ Σ_{(V+, W+) ∈ X+ ∪ X_I ∪ X_T} exp(d(V+, W+)/τ) / ( Σ_{(V+, W+) ∈ X+ ∪ X_I ∪ X_T} exp(d(V+, W+)/τ) + Σ_{(V−, W−) ∈ X− ∪ Y_I ∪ Y_T} exp(d(V−, W−)/τ) ) ]    (1)

where τ denotes the temperature parameter, and Y_I and Y_T denote the negative images and texts sampled from the same training batch. Note that, for the single-modal images X_I and texts X_T, the original text W and image V are used to compute the cross-modal relevance, respectively. To the best of our knowledge, this is the first work that explores CMCL to unify the visual and textual semantic space.
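Numerically, the contrastive objective puts the positive scores in the numerator and positives plus negatives in the denominator. The following toy sketch illustrates this behavior with made-up relevance scores; it is not the released implementation.

```python
import numpy as np

def cmcl_loss(pos_scores, neg_scores, tau=0.1):
    """Toy cross-modal contrastive loss: pos_scores holds the relevance
    d(V, W) of positive pairs, retrieved images and retrieved texts;
    neg_scores holds rewritten hard negatives and in-batch negatives."""
    pos = np.exp(np.asarray(pos_scores, dtype=float) / tau)
    neg = np.exp(np.asarray(neg_scores, dtype=float) / tau)
    return -np.log(pos.sum() / (pos.sum() + neg.sum()))

# Positives scoring higher than negatives yields a small loss; the
# reversed ordering yields a larger loss.
low = cmcl_loss([0.9, 0.8], [0.1, 0.2])
high = cmcl_loss([0.1, 0.2], [0.9, 0.8])
print(low < high)  # True
```

Training thus pushes the model to score aligned image-text pairs (and their weakly related retrieved samples) above the hard negatives.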
Text Rewriting. To enhance multi-granularity semantic alignment between image and text, we rewrite the caption of an image at different levels, including the sentence level, phrase level and word level. For sentence-level rewriting, we utilize back-translation techniques (Edunov et al., 2018) to obtain several positive samples for each image-text pair. Specifically, each caption of an image is translated into another language and then translated back to the original language. In this way, several similar captions can be obtained for an image. Furthermore, for each image-text pair, the most similar captions of other images are retrieved based on TF-IDF similarity. The retrieved results are very similar to the original caption but do not accurately describe the corresponding image, so they can be used as hard negative samples to enhance sentence-level alignment between image and text. For phrase-level and word-level rewriting, we first parse the image caption into a scene graph (Wang et al., 2018), and then randomly replace the object, attribute or relation nodes of the scene graph with a different object, attribute or relation from the corresponding vocabulary. Instead of randomly sampling negative samples as previous methods do, text rewriting can generate large volumes of hard negative samples. In this way, we help the model learn more detailed semantic alignment between image and text at different levels.
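A minimal sketch of word-level rewriting is shown below. The paper parses captions into scene graphs and swaps object/attribute/relation nodes; here a hypothetical flat object vocabulary stands in for the scene-graph vocabularies, which is enough to show how a near-identical hard negative is produced.

```python
import random

# Hypothetical object vocabulary (the paper derives these from scene graphs).
OBJECTS = {"umpire", "pitcher", "coach", "catcher"}

def word_level_negative(caption, rng):
    """Swap one object word in the caption for a different object,
    producing a hard negative that differs in a single token."""
    tokens = caption.split()
    idxs = [i for i, t in enumerate(tokens) if t in OBJECTS]
    if not idxs:
        return None  # no object word to rewrite
    i = rng.choice(idxs)
    tokens[i] = rng.choice(sorted(OBJECTS - {tokens[i]}))
    return " ".join(tokens)

rng = random.Random(0)
neg = word_level_negative("the umpire watches the player swing", rng)
print(neg)
```

Because the negative differs from the true caption by only one word, matching it against the image forces the model to ground individual words rather than rely on coarse sentence overlap.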
Image/Text Retrieval. In order to incorporate more single-modal information during cross-modal learning, each image-text pair is further augmented with various related images and texts retrieved from the single-modal data. Specifically, for an image, the other images in the image collections are ranked by their visual similarity. The images that have highly overlapping objects with the original image are extracted to provide relevant visual information. Similarly, sentences that are semantically related to the original caption are extracted based on semantic similarity to provide background language information. The retrieved images and texts are encoded individually by the unified-modal Transformer as shown in Figure 3, and their representations are extracted to compute the cross-modal contrastive loss in Equation 1. This retrieved single-modal information provides rich background knowledge for better cross-modal learning.
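The text-side retrieval can be sketched with a small self-contained TF-IDF ranker (the paper uses semantic similarity for related-text retrieval and TF-IDF for hard-negative caption retrieval; this toy version simply illustrates ranking a corpus against a caption).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of documents."""
    df = Counter(t for d in docs for t in set(d.split()))
    n = len(docs)
    vecs = []
    for d in docs:
        tf = Counter(d.split())
        vecs.append({t: c * math.log((1 + n) / (1 + df[t]))
                     for t, c in tf.items()})
    return vecs

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = [
    "an umpire stands behind the catcher to call balls and strikes",
    "the stock market closed higher on friday",
]
caption = "the umpire behind the baseball player looks on"
vecs = tfidf_vectors(corpus + [caption])
scores = [cosine(vecs[-1], v) for v in vecs[:-1]]
best = corpus[max(range(len(scores)), key=scores.__getitem__)]
print(best)  # the baseball-related sentence ranks first
```

The top-ranked sentences would then be attached to the image-text pair as weakly related single-modal positives.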

Visual Learning
Similar to masked language modeling in BERT, we sample image regions and mask their visual features with a probability of 15%. The visual features of the masked regions are replaced by zeros.
As the regions of an image are usually highly overlapped with each other, we mask all regions that have a high proportion of mutual intersection to avoid information leakage. Similar to Lin et al. (2020b), we randomly choose regions as masking anchors and mask the regions whose overlapping ratios with the anchors are larger than 0.3. For an image V, the model is trained to reconstruct the masked regions v_m given the remaining regions v_\m:

    L_V = E_V f_θ(v_m | v_\m)

Similarly, for an image-text pair (V, W), the model is trained to reconstruct the masked regions v_m given the text W and the remaining regions v_\m:

    L_V = E_{V,W} f_θ(v_m | v_\m, W)

As the visual features are high-dimensional and continuous, we utilize both a feature regression and a region classification objective to learn better visual representations. The feature regression learns to regress the contextualized visual representation h_{v_i} to its visual feature v_i, which can be formulated as:

    f_θ(v_m | v_\m) = Σ_{i=1}^{M} || r(h_{v_i}) − v_i ||²

where r indicates an FC layer that converts h_{v_i} into a vector of the same dimension as v_i. The region classification learns to recognize the object semantic class of each masked region based on its contextualized visual representation h_{v_i}. An FC layer is utilized to compute the scores s(h_{v_i}) for K object classes, which further go through a softmax function to obtain a normalized distribution. The final objective minimizes the cross-entropy (CE) loss between the predicted distribution and the object-detection output c(v_i):

    f_θ(v_m | v_\m) = Σ_{i=1}^{M} CE(softmax(s(h_{v_i})), c(v_i))

The score function f_θ(v_m | v_\m, W) is formulated similarly.
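The overlap-aware masking step can be sketched as follows: pick anchor regions, then additionally mask every region whose IoU with an anchor exceeds 0.3, so the model cannot recover a masked region from a near-duplicate neighbour. Boxes and the anchor choice are toy values.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def mask_regions(boxes, anchor_idx, thresh=0.3):
    """Return indices to mask: the anchors plus every region whose
    IoU with any anchor exceeds thresh."""
    anchors = [boxes[i] for i in anchor_idx]
    return [i for i, b in enumerate(boxes)
            if i in anchor_idx or any(iou(b, a) > thresh for a in anchors)]

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 60, 60)]
print(mask_regions(boxes, anchor_idx={0}))  # [0, 1]: box 1 overlaps the anchor
```

Masked indices would then have their visual features zeroed before being fed to the Transformer.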

Language Learning
To learn general language representations for both language understanding and generation tasks, our model is trained as a unified encoder-decoder with two types of language modeling tasks: bidirectional prediction and sequence-to-sequence (Seq2Seq) generation. The unified modeling is achieved by utilizing specific self-attention masks to control what context the prediction conditions on, inspired by Dong et al. (2019). To improve the language learning process, we first detect semantically complete phrases in the text, such as named entities, by syntactic parsing, and then treat each of them as a whole in the following masking strategies. Different from previous work, we always sample a sequence of complete words or phrases instead of subword tokens, for both bidirectional prediction and Seq2Seq generation.
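The UniLM-style self-attention masks that unify the two objectives can be sketched as below: bidirectional prediction allows every token to attend to every other token, while Seq2Seq generation lets source tokens attend only within the source and target tokens attend to the source plus earlier targets. This is an illustrative construction, not the released code; 1 means "may attend", 0 means "blocked".

```python
import numpy as np

def bidirectional_mask(n):
    """Full attention: every position sees every other position."""
    return np.ones((n, n), dtype=int)

def seq2seq_mask(n_src, n_tgt):
    """Source attends within the source; targets attend to the source
    and to earlier target positions (causal over the target part)."""
    n = n_src + n_tgt
    m = np.zeros((n, n), dtype=int)
    m[:, :n_src] = 1              # every position sees the full source
    for i in range(n_src, n):
        m[i, n_src:i + 1] = 1     # causal attention over the target part
    return m

m = seq2seq_mask(2, 3)
print(m)
```

Switching between the two masks per batch lets one set of Transformer weights serve both understanding (bidirectional) and generation (Seq2Seq) objectives.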
Bidirectional prediction. Given a sequence of tokens W = {[CLS], w_1, ..., w_n, [SEP]}, we iteratively sample spans of text until 15% of the tokens have been selected. The span length is sampled from a geometric distribution l ~ Geo(p), where p is set to 0.2, similar to SpanBERT (Joshi et al., 2020). All tokens in the selected spans are replaced with a special [MASK] token, a random token, or the original token with probability 80%, 10% and 10%, respectively. The goal is to predict the masked tokens w_m based on their surrounding context w_\m by minimizing the negative log-likelihood:

    L_Bidirectional = −E_W log P_θ(w_m | w_\m)

Seq2Seq generation. For the Seq2Seq generation task, we iteratively sample fragments from the token sequence until a 25% budget has been spent, inspired by Xiao et al. (2020). In each iteration, we first sample a fragment length from a uniform distribution l ~ U(4, 32), and then sample a fragment of the specified length. Every selected fragment {w_i, ..., w_j} is further wrapped with two special tokens [CLS] and [SEP] (i.e., {[CLS], w_i, ..., w_j, [SEP]}), which denote the beginning and end of the fragment. All selected fragments are removed from the text and concatenated as the target sequence T, while the remaining parts are concatenated as the source sequence S. The model is trained to generate the target sequence auto-regressively conditioned on the source sequence:

    L_Seq2Seq = −E_{(S,T)} log P_θ(T | S), where P_θ(T | S) = Π_{t=1}^{|T|} P_θ(T_t | T_{<t}, S)

During pre-training, we alternate between the bidirectional prediction objective and the Seq2Seq generation objective uniformly. For image-text pairs, the two objectives are applied to the captions in the same way to learn cross-modal understanding and generation.
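The span sampling for bidirectional prediction can be sketched as follows: draw span lengths from a clipped geometric distribution and keep selecting spans until roughly 15% of the positions are masked. Token indices are toy; whole-word/phrase grouping is omitted for brevity.

```python
import random

def sample_mask_spans(n_tokens, budget=0.15, p=0.2, max_len=10, seed=0):
    """Select token positions to mask by repeatedly sampling spans whose
    lengths follow a geometric distribution l ~ Geo(p), clipped at max_len,
    until the masking budget is met."""
    rng = random.Random(seed)
    masked = set()
    while len(masked) < budget * n_tokens:
        length = 1
        while rng.random() > p and length < max_len:  # geometric draw
            length += 1
        start = rng.randrange(n_tokens - length + 1)
        masked.update(range(start, start + length))
    return sorted(masked)

spans = sample_mask_spans(100)
print(len(spans) / 100)  # roughly 0.15 (may overshoot by one span)
```

Each selected position would then be replaced by `[MASK]`, a random token, or kept, with the 80/10/10 split described above.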

Experimental Settings
In this section, we introduce the pre-training and fine-tuning experimental settings.

Pre-training Dataset
Our pre-training datasets consist of three types: text corpora, image collections and image-text pairs. The text corpora include two large-scale corpora: BookWiki and OpenWebText, which are part of the training data of RoBERTa. BookWiki is composed of English Wikipedia and BookCorpus (Zhu et al., 2015), and OpenWebText is an open re-creation of the WebText corpus. The image collections are images without textual descriptions, including a subset of OpenImages (Krasin et al., 2017) and the unlabeled images of COCO. The image-text pairs are composed of four existing multi-modal datasets: COCO (Lin et al., 2014), Visual Genome (VG) (Krishna et al., 2017), Conceptual Captions (CC) (Sharma et al., 2018) and SBU Captions (Ordonez et al., 2011), which have also been widely used in previous multi-modal pre-training models. Their statistics are shown in Appendix A.

Implementation Detail
We evaluate UNIMO at two model sizes: UNIMO-base with 12 Transformer layers and UNIMO-large with 24 Transformer layers. The maximum sequence lengths of text tokens and image-region features are set to 512 and 100, respectively. We pre-train UNIMO-base by initializing from RoBERTa-base, and UNIMO-large by initializing from RoBERTa-large. Both UNIMO-base and UNIMO-large are trained for at least 500K steps. An Adam optimizer with an initial learning rate of 5e-5 and a linear learning-rate decay schedule is utilized. By virtue of float16 mixed-precision training, it takes about 7 days to train UNIMO-base on 32 Nvidia Tesla V100 32GB GPUs, and 10 days to train UNIMO-large on 64 Nvidia Tesla V100 32GB GPUs.
For visual learning, we adopt a Faster R-CNN (Ren et al., 2016) pre-trained on the Visual Genome dataset to select salient image regions and extract region features from images. The regions whose class detection probability exceeds a confidence threshold of 0.2 are selected, and 100 boxes are kept. For CMCL, we utilize back-translation to create 3 positive samples and apply rewriting to obtain 100 hard negative samples for each image-text pair. The 100 most similar images and the 100 most similar sentences are retrieved from the single-modal image collections and text corpora for each image-text pair, respectively. More details are described in Appendix A.

Finetuning Tasks
We fine-tune our model on two categories of downstream tasks: (1) single-modal language understanding and generation tasks; (2) multi-modal vision-language understanding and generation tasks. The single-modal generation tasks include: generative conversational question answering on the CoQA dataset (Reddy et al., 2019), question generation on the SQuAD 1.1 dataset (Rajpurkar et al., 2016), abstractive summarization on the CNN/DailyMail (CNNDM) dataset (Hermann et al., 2015), and sentence compression on the Gigaword dataset (Rush et al., 2015). The single-modal understanding tasks include: sentiment classification on the SST-2 dataset (Socher et al., 2013), natural language inference on the MNLI dataset (Williams et al., 2017), linguistic acceptability analysis on the CoLA dataset (Warstadt et al., 2019) and semantic similarity analysis on the STS-B dataset (Cer et al., 2017). The multi-modal tasks include: visual question answering (VQA) on the VQA v2.0 dataset (Goyal et al., 2017), image captioning on the Microsoft COCO Captions dataset (Chen et al., 2015), visual entailment on the SNLI-VE dataset (Xie et al., 2019) and image-text retrieval on the Flickr30k dataset (Young et al., 2014). The detailed statistics of the datasets and the hyper-parameter settings for the above tasks are described in Appendix B.

Results and Analysis
In this section, we report the evaluation results on both the multi-modal and single-modal tasks to show the adaptability and generalizability of UNIMO in different scenarios. We further conduct several ablation studies to validate that textual knowledge and visual knowledge can enhance each other in the unified semantic space. The visualization and case analysis of the model results are provided in Appendix C.

Multi-Modal tasks
The evaluation results on the multi-modal tasks are shown in Table 1. We compare with most of the existing multi-modal pre-training models, including ViLBERT, VLP (Zhou et al., 2020), UNITER (Chen et al., 2020b), Oscar, Villa (Gan et al., 2020) and ERNIE-ViL. The results show that UNIMO achieves the best results on almost all benchmarks under both the base and large model sizes. Particularly, UNIMO-large outperforms the previous best-performing model ERNIE-ViL-large by 1.34 R@1 on image retrieval and 1.3 R@1 on text retrieval, which are large improvements for the image-text retrieval tasks. On the image captioning task, UNIMO outperforms the best-performing model Oscar by more than 2 BLEU-4 points. UNIMO achieves better performance on both the multi-modal understanding and generation tasks, while previous methods usually focus on either understanding or generation. The above results demonstrate the effectiveness of the unified-modal learning architecture, which takes advantage of large-scale single-modal images and texts for cross-modal learning.

Single-Modal tasks
Previous multi-modal pre-training models usually cannot effectively adapt to single-modal scenarios. To further validate this, we remove the single-modal learning processes on the text corpora and image collections from UNIMO (i.e., "w/o single-modal") and replace the CMCL with an image-text matching objective. The model "w/o single-modal" is then just a multi-modal pre-training method similar to UNITER (Chen et al., 2020b). As shown in Table 2, the performance of this model on all the language understanding and generation tasks drops dramatically compared to UNIMO, which demonstrates that multi-modal pre-training only on image-text pairs cannot effectively adapt to single-modal tasks.
To show the effectiveness of UNIMO on the language understanding and generation tasks, we further compare with existing pre-trained language models (PLMs), including BERT (Devlin et al., 2019). The results in Table 2 demonstrate that UNIMO achieves better or comparable performance than existing PLMs on both the language understanding and generation tasks. Specifically, UniLM (Dong et al., 2019) is designed for both natural language understanding and generation. UNIMO outperforms UniLM on most of the tasks by a large margin, which demonstrates the effectiveness of UNIMO in single-modal scenarios.
In all, UNIMO not only achieves the best performance on the multi-modal tasks, but also performs very well on the single-modal tasks, which demonstrates the superiority of our unified-modal learning architecture.

Mutual Enhancement of Text and Vision
We further conduct several ablation studies to show that the unified-modal architecture helps textual knowledge and visual knowledge mutually enhance each other in the unified semantic space.
Text Enhances Vision. To explore whether the textual knowledge in the text corpora facilitates cross-modal learning, we remove the language learning process on the text corpora from UNIMO (i.e., "w/o texts") and compare the performance on the multi-modal tasks. Table 3 summarizes the comparison results, which show that the performance of the model "w/o texts" declines consistently on both the multi-modal understanding and generation tasks. The results demonstrate that the textual knowledge in the text corpora benefits the vision-language tasks by enhancing cross-modal learning with more textual information.
Vision Enhances Text. To further validate that the visual knowledge in the image collections and image-text pairs facilitates language learning, we remove the images and image-text pairs from the pre-training dataset (i.e., "w/o pairs&images") and compare the performance on the single-modal language tasks. After removing the images and image-text pairs, our model is trained with only the language learning objectives, similar to the previous pre-trained language models BERT and UniLM. Table 4 summarizes the comparison results, which demonstrate that after removing the visual data, the performance of the model "w/o pairs&images" drops noticeably on most of the language understanding tasks and on all the language generation tasks. The results reveal that visual knowledge can enhance the language tasks by enabling the model to learn more robust and generalizable representations in a unified semantic space.

Related Work
Existing research on pre-training can be mainly classified into two categories: single-modal pre-training and multi-modal pre-training. The single-modal pre-training methods only focus on single-modal tasks, while the multi-modal pre-training methods only focus on multi-modal tasks.
Single-Modal Pre-training. The single-modal pre-training methods mainly consist of visual pre-training and language pre-training. Most visual pre-training methods are based on multi-layer CNN architectures such as VGG (Simonyan and Zisserman, 2014) and ResNet (He et al., 2016), and are trained on the ImageNet dataset. Recently, contrastive self-supervised learning methods like SimCLR (Chen et al., 2020a) and MoCo (He et al., 2020) have also greatly improved the performance of visual representation learning. These pre-trained models only focus on visual tasks (e.g., image classification); however, they cannot be used in textual or multi-modal (i.e., with both text and image) tasks. Language pre-training methods based on the Transformer architecture are also very popular in NLP, such as GPT (Radford et al., 2018), BERT (Devlin et al., 2019), XLNet (Yang et al., 2019) and BART (Lewis et al., 2020). However, they mainly focus on textual tasks and cannot effectively deal with multi-modal tasks, such as image-text retrieval, image captioning, multi-modal machine translation (Lin et al., 2020a; Su et al., 2021) and visual dialog (Murahari et al., 2020).

Multi-Modal Pre-training
Recently, multi-modal pre-training methods have become more and more popular for solving multi-modal tasks. All of them are trained on a corpus of image-text pairs, such as ViLBERT, VisualBERT (Li et al., 2019b), VL-BERT (Su et al., 2019), Unicoder-VL (Li et al., 2019a) and UNITER (Chen et al., 2020b). Based on the multi-layer Transformer network, they all employ BERT-like objectives to learn multi-modal representations from a concatenated sequence of vision features and language embeddings. Their architectures can be mainly classified into two categories: single-stream and two-stream. The two-stream methods, such as ViLBERT, utilize two single-modal Transformers to process visual features and language embeddings respectively, and then learn their interactions with a cross-modal Transformer. The single-stream methods directly utilize a single Transformer network to model both the visual features and the language embeddings. VisualBERT, VL-BERT, Unicoder-VL and UNITER all utilize the single-stream architecture, showing that fusing cross-modal information early and freely with a single-stream network can achieve better performance.
Recently, several contrastive-learning-based multi-modal pre-training methods have also been proposed. OpenAI CLIP (Radford et al., 2021) leverages large-scale image-text pairs to learn transferable visual representations by image-text matching, which enables zero-shot transfer of the model to various visual classification tasks. WenLan (Huo et al., 2021) further proposes a similar two-tower Chinese multi-modal pre-training model and adapts MoCo (He et al., 2020) to improve the contrastive cross-modal learning process. Instead of extracting salient image regions with pre-trained object detection models like Faster R-CNN (Ren et al., 2016), the end-to-end vision-language pre-training architecture SOHO (Huang et al., 2021) proposes to jointly learn a Convolutional Neural Network (CNN) and a Transformer for cross-modal alignment from millions of image-text pairs. All existing multi-modal pre-training methods only focus on multi-modal tasks with both vision and language inputs. However, they cannot be effectively adapted to single-modal tasks. Moreover, they can only utilize the limited corpus of image-text pairs. By contrast, our unified-modal pre-training method UNIMO can employ large volumes of text corpora and image collections so that the modalities enhance each other, and it can be effectively adapted to both textual and multi-modal scenarios. UNIMO also achieves the best performance on multi-modal tasks including image-text retrieval, visual entailment, VQA and image captioning.

Conclusion
In this work, we propose UNIMO, a unified-modal pre-training architecture that leverages large-scale non-paired text corpora and image collections for cross-modal learning. We verify that UNIMO provides an effective way for textual knowledge and visual knowledge to mutually enhance each other in a unified semantic space, and that it successfully adapts to both single-modal and multi-modal understanding and generation tasks. As a result, UNIMO outperforms previous methods on both multi-modal and single-modal downstream tasks. In future work, we will focus on end-to-end visual and language unified learning, as well as much larger model sizes and data volumes.

... [0]" will be utilized as the visual input, where "[0]" denotes a zero-value feature embedding. During language learning, the pseudo image-region sequence will be masked out. Based on the above techniques, both images and texts are represented in the same format as image-text pairs. For image-text pairs, both visual learning and language learning are applied to the images and captions simultaneously to learn cross-modal representations.
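The paired-format trick described above, in which a text-only sample receives the zero-value pseudo region "[0]" as its visual input, can be sketched as follows; the function name and the region dimensionality are illustrative assumptions, not UNIMO's actual preprocessing code.

```python
import numpy as np

def textify_to_pair(token_ids, d_region=2048):
    """Represent a text-only sample in the same (regions, tokens) format as
    an image-text pair: the visual input is a single zero-valued pseudo
    region embedding, denoted "[0]" in the paper."""
    pseudo_regions = np.zeros((1, d_region))  # the "[0]" placeholder region
    return pseudo_regions, token_ids
```

Image-only samples are handled symmetrically, so that images, texts and image-text pairs all flow through the model in one uniform format.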
Training Details During pre-training, the samples of image collections, text corpus and image-text pairs are randomly mixed together with a ratio of 1:1:5. The objectives of language learning, visual learning and cross-modal contrastive learning (CMCL) are trained jointly. The hyper-parameters for both UNIMO-Base and UNIMO-Large are shown in Table 6. For CMCL, each positive image-text pair is augmented with several hard negative samples by text rewriting, as well as several positive images and texts by image/text retrieval. All samples from other image-text pairs in the training batch are also treated as negative samples (including negative images and negative texts), which amount to more than 6K for UNIMO-Base and 3K for UNIMO-Large. For an image-text pair $(V, W)$, the detailed formula of the CMCL loss is:

$$\mathcal{L}_{CMCL}(V, W) = -\log \frac{pos_P + pos_I + pos_T}{(neg_P + neg_I + neg_T) + (pos_P + pos_I + pos_T)}$$

where $pos_P$, $pos_I$ and $pos_T$ denote the scores of the positive image-text pairs $X^+$, related images $X_I$ and related texts $X_T$, respectively, and $neg_P$, $neg_I$ and $neg_T$ denote the scores of the negative image-text pairs $X^-$, negative images $Y_I$ and negative texts $Y_T$, respectively. The objective is to maximize the positive score $pos_P + pos_I + pos_T$ while minimizing the negative score $neg_P + neg_I + neg_T$, which helps align and unify the visual and textual representation spaces. The pre-training process of UNIMO is described in pseudocode in Algorithm 1.

... train/val/test splits, respectively. (3) Visual Entailment (SNLI-VE) is evaluated on the SNLI-VE dataset, which was derived from Flickr30K images and the Stanford Natural Language Inference (SNLI) dataset. The task is to determine the logical relationship (i.e., "Entailment", "Neutral" or "Contradiction") between a natural language statement and an image.
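The CMCL loss above can be computed directly from the six aggregated score terms. The sketch below follows the formula term by term; the concrete score values in the test are illustrative, not taken from the paper.

```python
import math

def cmcl_loss(pos_p, pos_i, pos_t, neg_p, neg_i, neg_t):
    """CMCL loss for one image-text pair (V, W), following the formula above:
    the arguments are the aggregated similarity scores of the positive pairs
    X+, related images X_I, related texts X_T, and the corresponding
    negatives X-, Y_I, Y_T (non-negative scores assumed)."""
    pos = pos_p + pos_i + pos_t
    neg = neg_p + neg_i + neg_t
    return -math.log(pos / (neg + pos))
```

The loss is zero when all negative scores vanish, and grows as the negative scores dominate the positives, which is exactly the push-pull behavior that aligns the two representation spaces.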
(4) Image-Text Retrieval is evaluated on the Flickr30K dataset, which contains two subtasks: image retrieval (Flickr30K-IR) and text retrieval (Flickr30K-TR), depending on which modality is used as the retrieval target. We report the top-K retrieval results on the test sets, including R@1, R@5 and R@10 (R denotes Recall). The statistics of the datasets for the above multi-modal tasks are described in Table 7. The hyper-parameters for all the downstream tasks, including both the multi-modal and single-modal tasks, are shown in Tables 8 and 9.

C Visualization and Analysis
To intuitively show the effectiveness of unified-modal learning on the corpus of images, texts and image-text pairs, we visualize the embeddings in two dimensions by Principal Component Analysis (PCA). The nearest neighbors of the center word are shown in the embedding space. UNIMO is compared with the two ablation models described in Section 4.3. The figure shows that the model "UNIMO-w/o texts" finds more visually relevant words than "UNIMO-w/o image&pairs", which demonstrates the effectiveness of visual learning on images. Moreover, UNIMO not only finds many visually relevant words, but also finds some semantically relevant background words. For example, UNIMO finds "lunch" and "airplanes" for the center word "hamburger", indicating that people usually eat hamburgers at lunch and often eat them while flying. Likewise, for the second example, UNIMO finds the relevant concepts "meter", "steps" and "soccer" for "foot", which enrich the concept and connect it with rich related information.
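The visualization procedure above (PCA projection plus nearest-neighbor lookup around a center word) can be sketched in plain NumPy; the word list and embedding values in the test are made up for illustration.

```python
import numpy as np

def pca_2d(embeddings):
    """Project embeddings to 2-D via PCA, as in the visualization
    described above (a plain-NumPy sketch)."""
    X = embeddings - embeddings.mean(axis=0)
    # Top-2 principal directions = leading right-singular vectors.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T

def nearest_neighbors(center_vec, words, embeddings, k=3):
    """Return the k words whose embeddings have the highest cosine
    similarity to center_vec (the nearest neighbors of the center word)."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    c = center_vec / np.linalg.norm(center_vec)
    order = np.argsort(-(E @ c))
    return [words[i] for i in order[:k]]
```

Plotting the 2-D projections of the center word and its neighbors reproduces the kind of figure discussed in this section.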
To further intuitively show the advantages of unified-modal learning with rich single-modal data, we compare UNIMO with the multi-modal pre-training model "w/o single modal" (described in Section 4.2) on both the text retrieval and image retrieval tasks. The examples of text retrieval results in Figure 5 show that the captions retrieved by UNIMO describe the images more accurately by covering different levels of information, including objects, attributes and relations in the images. The examples of image retrieval results in Figure 6 likewise show that the images retrieved by UNIMO better match the captions, with more detailed semantic alignments.

[Figure 5: Text retrieval examples comparing the baseline and UNIMO. E.g., Baseline: "Two men are standing at telephone booths outside." vs. UNIMO: "A child dressed in blue jeans with rolled cuffs and a pink hoodie waits outdoors at the foot of the stairs with an axe."; Baseline: "A group of young men are running a race." vs. UNIMO: "Two bicyclists are racing each other on a dirt track."]

[Figure 6: Image retrieval examples for text queries such as "A group of men are loading cotton onto a truck", "A woman in a red shirt playing the cello", "Children enjoying themselves on an amusement park ride", "A man and a little boy beating drums" and "Two men are smiling and riding bicycles", comparing the images retrieved by the baseline and by UNIMO.]