GEM: A General Evaluation Benchmark for Multimodal Tasks

In this paper, we present GEM, a General Evaluation benchmark for Multimodal tasks. Unlike existing benchmarks such as GLUE, SuperGLUE, XGLUE and XTREME, which mainly focus on natural language tasks, GEM is a large-scale vision-language benchmark consisting of GEM-I for image-language tasks and GEM-V for video-language tasks. Compared with existing multimodal datasets such as MSCOCO and Flickr30K for image-language tasks, and YouCook2 and MSR-VTT for video-language tasks, GEM is not only the largest vision-language dataset covering image-language and video-language tasks at the same time, but is also labeled in multiple languages. We also provide two baseline models for this benchmark. We will release the dataset, code and baseline models, aiming to advance the development of multilingual multimodal research.


Introduction
In recent years, large-scale pre-training has become the new paradigm in the natural language processing (NLP) field. These models have demonstrated surprisingly good generalization abilities and can be applied to different downstream tasks via simple fine-tuning. Several comprehensive benchmarks have been constructed to evaluate such powerful models, including GLUE (Wang et al., 2018) and SuperGLUE for evaluating monolingual natural language understanding systems, and XGLUE (Liang et al., 2020) and XTREME (Hu et al., 2020) for evaluating multilingual natural language understanding and generation systems. Such pre-trained models have also been extended to vision-language scenarios (Lu et al., 2019; Chen et al., 2019; Li et al., 2020a,b; Ni et al., 2021; Sun et al., 2019b,a; Luo et al., 2020) to handle multimodal tasks such as image (or video)-text retrieval and image (or video) captioning. However, there is still no comprehensive benchmark dataset for evaluating such multimodal pre-trained models. Besides, most existing vision-language datasets are labeled in English only and thus cannot be used to evaluate the quality of such models on other languages. Our benchmark will be released at https://github.com/microsoft/GEM.
Motivated by this, we present GEM, a General Evaluation benchmark for Multimodal tasks. Compared with GLUE, SuperGLUE, XGLUE and XTREME, GEM is designed to evaluate the generalization capabilities of vision-language models and consists of two subsets: GEM-I, which evaluates text-to-image retrieval and image captioning capabilities, and GEM-V, which evaluates text-to-video retrieval and video captioning capabilities. GEM is also a multilingual dataset, whose natural language contexts are collected from a commercial search engine. We describe two vision-language pre-trained models, M3P (Ni et al., 2021) and m-UniVL, as the baselines for GEM-I and GEM-V, respectively, where M3P is an existing multilingual image-language pre-trained model and m-UniVL is a multilingual extension of UniVL (Luo et al., 2020) for multilingual video-language tasks. The evaluation results of these two models on GEM are reported in the experiment section.
The key contributions of this paper are twofold: (1) we build GEM as the first large-scale multilingual multimodal benchmark, which can be used to evaluate the generalization capabilities of vision-language pre-trained models on a set of diversified multimodal tasks; (2) we provide two multilingual multimodal pre-trained models, M3P and m-UniVL, as the baselines of GEM for image-language and video-language tasks, respectively. We hope GEM can further advance research in the multimodal community, just as its predecessors did in the NLP community.

Dataset Construction
To the best of our knowledge, the GEM dataset is the first multilingual vision-language dataset constructed for image-language and video-language tasks at the same time. GEM-I contains 1.2 million {Query, Image, Title} triplets in 20 different languages for text-to-image retrieval and image captioning tasks. GEM-V contains 99K {Query, Video, Title} triplets in 30 languages for text-to-video retrieval and video captioning tasks. In both GEM-I and GEM-V, Title denotes the title of the web page from which each image (or video) is extracted. This signal can be used as auxiliary information in all GEM tasks, as it is usually highly relevant to the corresponding image (or video). Next, we describe how GEM-I and GEM-V are collected from a commercial search engine.

GEM-I Construction
First, we collect several billion images with Creative Commons licenses from the Internet and discard images that contain pornographic or racy content. We also discard images with human faces, to avoid revealing private information or introducing bias into our data. Besides, we only keep images larger than 300×300 pixels to guarantee high image quality. The pornographic classifier, racy classifier, and human face classifier are trained and evaluated on human-labeled data.

Then, we collect user queries from a commercial search engine for each image based on historical user clicks. We also collect the title of the web page that contains the image as additional context, forming {Query, Image, Title} triplets. Some text cleanup is done to keep only high-quality queries and contexts, including removing pornographic words and meaningless strings, and discarding very short queries or titles, as they are less likely to depict the image content. We also apply an in-house GBDT model, trained on a small amount of human-labeled data to predict the similarity between each Query and its {Image, Title} pair, to filter out potentially highly irrelevant triplets. Finally, we keep only the top 20 languages that have more than 1,000 images, and sample 1.2 million {Query, Image, Title} triplets in total. The average query length in GEM-I is 5.5 terms, which is shorter than 10.6 in MSCOCO (Chen et al., 2015) and 12.3 in Flickr30K (Vinyals et al., 2015); the average title length is 10.1 terms. This makes GEM-I a more practical benchmark, since all data in GEM-I come from the real world, where the language configuration truly differs from the queries in existing datasets: GEM queries are shorter and more concise, without perfect grammar or syntactic structure, which makes them more "natural".
Therefore, our benchmark evaluates models on data closer to real-world scenarios, so the measured performance is more indicative of how the models would behave in real-world applications. Based on human assessment of sampled query-image pairs, 83% of them are well-matched pairs, in the sense that the query is a plausible caption of its paired image. We randomly split the data into train, dev and test sets within each language. The data statistics and language distribution of GEM-I can be found in Table 1. Figure 1 gives some examples.
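The per-language train/dev/test split described above can be sketched as follows; the split fractions and field names are illustrative assumptions, not values taken from the paper.

```python
import random

def split_per_language(triplets, seed=0, dev_frac=0.1, test_frac=0.1):
    """Randomly split triplets into train/dev/test within each language.

    Each triplet is a dict with at least a "lang" key; the dev/test
    fractions here are hypothetical placeholders."""
    by_lang = {}
    for t in triplets:
        by_lang.setdefault(t["lang"], []).append(t)
    splits = {"train": [], "dev": [], "test": []}
    rng = random.Random(seed)
    for lang in sorted(by_lang):
        items = by_lang[lang]
        rng.shuffle(items)  # shuffle within the language before slicing
        n_dev = int(len(items) * dev_frac)
        n_test = int(len(items) * test_frac)
        splits["dev"] += items[:n_dev]
        splits["test"] += items[n_dev:n_dev + n_test]
        splits["train"] += items[n_dev + n_test:]
    return splits
```

Splitting within each language, rather than globally, keeps every language represented in all three sets.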

GEM-V Construction
We collect several billion videos from the Internet and discard videos with pornographic or racy content. We also discard very long videos to save storage and transfer costs. For each video, its query and title are collected from a commercial search engine and cleaned up following a process similar to the one described for GEM-I, where another in-house model is trained to filter out potentially irrelevant {Query, Video, Title} triplets.
Finally, we keep only the top 30 languages that have more than 700 videos, and sample 99K {Query, Video, Title} triplets in total. The total video length of GEM-V is 2,049 hours, and the average video length is 1.2 minutes. The average query length in GEM-V is 5.3 terms, and the average title length is 8.5 terms. We also conduct human evaluation on sampled query-video pairs and find that 70% of them are plausible matched pairs. We randomly split the data into train, dev and test sets within each language. The data statistics and language distribution of GEM-V can be found in Table 2. Figure 2 gives some examples.

Baseline Models
This section introduces two baseline models for GEM: M3P, a multilingual multimodal pre-trained model for image-language tasks, and m-UniVL, a multilingual extension of UniVL (Luo et al., 2020) for video-language tasks.

M3P as Baseline of GEM-I
We select M3P (Ni et al., 2021) as the baseline model for the tasks in GEM-I, as it is the state-of-the-art multilingual image-language pre-trained model for both image-language understanding and generation tasks.
The M3P model uses the model architecture of BERT for understanding tasks and a BERT-based encoder-decoder architecture for generation tasks. For understanding tasks, multilingual masked language modeling, multimodal masked language modeling, masked region modeling and visual-linguistic matching are used as pre-training tasks to train a Transformer-based encoder. For generation tasks, multilingual denoising auto-encoding, image captioning and denoising image captioning are used as pre-training tasks to train a Transformer-based encoder-decoder. By training the encoder and the encoder-decoder within a multitask learning framework, universal representations are learned that map objects occurring in different modalities, or expressed in different languages, to vectors in a common semantic space.
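To illustrate the masked-language-modeling style of objective mentioned above, a minimal token-corruption routine might look like the following. The 15% masking rate and [MASK] symbol follow common BERT conventions; this is a sketch, not M3P's actual implementation.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace a fraction of tokens with [MASK]; the model is
    trained to recover the original token at each masked position."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            targets.append(tok)   # prediction target at this position
        else:
            corrupted.append(tok)
            targets.append(None)  # no loss at unmasked positions
    return corrupted, targets
```

The multimodal variant applies the same idea to text tokens conditioned on image regions.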
Fine-tune Tasks Based on the pre-trained M3P model, we further fine-tune it on our GEM-I data.
For the text-to-image retrieval task, we adopt the BCE loss and the NCE loss (Gutmann and Hyvärinen, 2010), with equal weights, to learn instance-level alignment between texts and images. Negative samples are generated by randomly forming text-image pairs from different training samples in the same batch. For the image captioning task, we directly optimize the captioning loss on GEM-I data.
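An in-batch contrastive loss of this kind can be sketched as follows. This is a NumPy illustration of the NCE-style term only; the exact formulation in M3P's fine-tuning, and the accompanying BCE term, may differ.

```python
import numpy as np

def in_batch_nce_loss(text_emb, image_emb, temperature=0.07):
    """For each query, its paired image is the positive and the other
    images in the same batch serve as negatives (softmax cross-entropy
    over the batch similarity matrix)."""
    # L2-normalize so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature               # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # positives on the diagonal
```

The diagonal of the similarity matrix holds the true pairs, so no explicit negative mining is needed beyond batching.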
Side-Information Since the title is considered side-information for the image, we concatenate it with the image and feed them into the model together. During negative sampling in the retrieval task, we treat the title and image as a whole, i.e., for a given query, the titles and images from other samples are considered negatives.

m-UniVL as Baseline of GEM-V
We adopt the same model structure as the unified video and language pre-trained model UniVL (Luo et al., 2020), which can perform both multimodal understanding and generation tasks. Specifically, we extend the pre-trained UniVL model from monolingual to multilingual by replacing the BERT-based module with XLM-R (Conneau et al., 2019), and call the new model m-UniVL. m-UniVL adopts an encoder-decoder architecture, including two single-modal encoders to encode the multilingual text and the visual features respectively, one cross-modal encoder to learn the interactions between the two modalities, and finally an optional decoder for generation tasks. To better leverage pre-trained models, we initialize each module with different pre-trained weights: the multilingual text encoder is initialized directly with the pre-trained XLM-R (Conneau et al., 2019), and the other modules, including the visual encoder, the cross encoder and the decoder, are initialized with the weights of the pre-trained UniVL.
Fine-tune Tasks Based on the pre-trained m-UniVL, we further fine-tune it using GEM-V data. For the text-to-video retrieval task, we employ only the encoders in m-UniVL during fine-tuning and use them to predict the matching score between text and video. There are two baseline models for the retrieval task: 1) m-UniVL (loose), the loosely coupled model that uses only the single-modal encoders; 2) m-UniVL (tight), the loosely plus tightly coupled model that includes both the single-modal encoders and the cross-modal encoder. We adopt the NCE loss (Gutmann and Hyvärinen, 2010) to learn to discriminate positive video-text pairs from negative ones. Negative video-text samples are created by replacing the text or video in a positive sample with randomly selected text or video from other samples. For the video captioning task, we employ all modules, including all encoders and the decoder, to learn caption generation.
Side-Information We use titles as side-information for the videos to enable efficient text-video retrieval. In detail, we first map the title embedding to the same dimension as the video embedding and concatenate them. We then encode them with the visual encoder to generate enhanced video features for further processing.
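The title-fusion step can be sketched as follows; the learned projection matrix and the prepending scheme are illustrative assumptions about how the dimensions are matched and the features combined.

```python
import numpy as np

def fuse_title_with_video(video_feats, title_emb, proj):
    """Map the title embedding into the video feature space and prepend
    it to the video feature sequence, which then goes to the visual encoder.

    video_feats: (T, d_video) sequence of video features
    title_emb:   (d_title,) pooled title embedding
    proj:        (d_title, d_video) learned projection (random here)"""
    title_token = title_emb @ proj
    return np.concatenate([title_token[None, :], video_feats], axis=0)
```

Treating the projected title as an extra "frame" lets the visual encoder attend over title and video jointly without architectural changes.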
Related Work

Natural Language Benchmarks
GLUE (Wang et al., 2018) and SuperGLUE are two comprehensive datasets that can be used to train and evaluate natural language understanding systems. GLGE (Liu et al., 2020) is another comprehensive dataset for natural language generation evaluation. XGLUE (Liang et al., 2020) and XTREME (Hu et al., 2020) are two recent benchmark efforts that extend the evaluation scenarios from monolingual to multilingual. Recent pre-trained language models have benefited greatly from these datasets, which allow their effectiveness to be evaluated in a relatively fair environment.

Vision-Language Benchmarks
A number of vision-language datasets have been widely used in the multimodal research.
MSCOCO (Chen et al., 2015) and Flickr30K (Vinyals et al., 2015) are two datasets for image-text retrieval and image captioning tasks. These two benchmarks have been extended to multilingual tasks as well (Elliott et al., 2016, 2017; Miyazaki and Shimizu, 2016). VQA (Antol et al., 2015) and GQA (Hudson and Manning, 2019) are two datasets for visual question answering. VCR (Zellers et al., 2018) is another dataset for visual commonsense reasoning. Compared with all these existing datasets, GEM-I has unique characteristics. First, it is a large-scale multilingual image-text dataset covering 20 different languages. Second, the query-image pairs in GEM-I come from a commercial search engine, so it has great practical value. Third, for each query-image pair, the title of the web page that contains the image is also included as additional context, which distinguishes GEM-I from all existing datasets.
HowTo100M (Miech et al., 2019b), YouCook2 (Zhou et al., 2018), and MSR-VTT (Xu et al., 2016) are three typical benchmarks for video-text retrieval and video captioning tasks. TVQA (Lei et al., 2018) and ActivityNet-QA are two typical benchmarks for video question answering. Compared with all these existing datasets, GEM-V is the first video-language benchmark supporting multilingual scenarios. Similar to GEM-I, it also has great practical value, as all data in GEM-V come from a real-world search engine with massive numbers of users.

Experiments
In this section, we evaluate the two baseline pre-trained models (described in Section 3) on GEM. Specifically, M3P is evaluated on GEM-I for multilingual image-language tasks and m-UniVL is evaluated on GEM-V for multilingual video-language tasks. Both baseline models are fine-tuned directly on the downstream tasks.

Experimental Settings
We select the open-source version of M3P (Ni et al., 2021) for the image-language evaluation on GEM-I. It uses 101G of sentences (in 100 languages) extracted from Wikipedia as the multilingual pre-training corpus, and 3.3 million English image-caption pairs from Conceptual Captions (Sharma et al., 2018) as the multimodal pre-training corpus.
For text-to-image retrieval, the hyper-parameters of the encoder are set as follows: 768 hidden units, 12 heads, GELU activation, a dropout rate of 0.1, a maximum input length of 128, and 12 encoder layers. For image captioning, the hyper-parameters of the encoder-decoder are set as follows: 768 hidden units, 8 heads, GELU activation, a dropout rate of 0.1, a maximum input length of 128, and 12 layers in both the encoder and the decoder. The Transformer parameters are shared between the encoder and decoder, including the embedding modules and self-attention modules.
We fine-tune M3P on the text-to-image retrieval and image captioning tasks. For retrieval, we use the Adam optimizer with β1 = 0.9, β2 = 0.98, an initial learning rate of 5e-5, a weight decay of 1e-4 and a batch size of 64 to fine-tune M3P for 30 epochs. For captioning, a learning rate of 1e-4 and a batch size of 16 are used to fine-tune M3P for 20 epochs. All of the above computations are carried out on 4 NVIDIA Tesla P100 GPUs.

Table 3: Evaluation results of M3P on the GEM-I test set for text-to-image retrieval, where Mean-Recall is the metric. Q→I denotes the setting where only the image (I) is used to compute its similarity with the query (Q); Q→I+T denotes the setting where both the image (I) and title (T) are used to compute the similarity with the query (Q). The average score is computed over all 20 languages.

Text-to-Image Retrieval Results
We follow the same evaluation metric as M3P, mean-Recall (the average of R@1, R@5 and R@10), to report its performance on the text-to-image retrieval task on the GEM-I dataset.
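Concretely, mean-Recall can be computed from a query-candidate similarity matrix as follows (assuming candidate i is the gold match for query i):

```python
import numpy as np

def mean_recall(sim, ks=(1, 5, 10)):
    """Average of R@K for K in {1, 5, 10}. sim[i, j] is the similarity
    between query i and candidate j; the gold candidate for query i is i."""
    gold = np.diag(sim)[:, None]
    # 0-based rank of the gold candidate = number of higher-scored candidates
    ranks = (sim > gold).sum(axis=1)
    return float(np.mean([(ranks < k).mean() for k in ks]))
```

Averaging over several K values rewards both exact top-1 hits and near-misses in the top 10.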
From the results reported in Table 3, we make several observations: 1) When applying M3P to GEM-I without fine-tuning (i.e., the zero-shot setting), the general performance is poor. The major reason is that M3P is pre-trained on a monolingual multimodal corpus and a multilingual monomodal corpus, both of which have very different data distributions from GEM-I.
2) By fine-tuning M3P using all labeled data in all languages (i.e., the fine-tune-on-all setting), better performance is obtained. This shows the strong transfer ability of M3P when a moderate amount of labeled data is available for fine-tuning.
3) By further considering the title signal in this retrieval task, the general performance can be improved significantly, which indicates a strong correlation between the query and the title. However, when adding the title signal in the zero-shot setting, we observe a performance drop. This is because M3P is pre-trained with the Q→I input paradigm, which makes it unsuitable for evaluating the Q→I+T paradigm directly.

Image Captioning Results
In Table 4, we report the performance on the image captioning task on the GEM-I test set with the M3P model, where ROUGE-L (Lin and Och, 2004), METEOR (Banerjee and Lavie, 2005) and CIDEr are the evaluation metrics. To study the image captioning ability of M3P, we use only images (without titles) to generate queries in the GEM-I dataset. In general, the M3P model performs relatively poorly on this task, because most search queries are short keywords rather than complete sentences, and they differ considerably from our pre-training data.

From the above results on the text-to-image retrieval and image captioning tasks, we conclude that our proposed GEM-I dataset can measure a model's image understanding and generation abilities in 20 different languages.

Table 5: Evaluation results of m-UniVL on GEM-V for text-to-video retrieval, where Mean-Recall is used as the metric. Q→V denotes the setting where only the video (V) is used to compute its similarity with the query (Q); Q→V+T denotes the setting where both the video (V) and title (T) are used to compute their similarity with the query (Q). The average score is computed over all 30 languages.

Experimental Settings
We select the open-source version of UniVL (Luo et al., 2020), available at https://github.com/microsoft/UniVL, and replace the original text encoder with XLM-R (Conneau et al., 2020) to support multilingual video-language evaluation on GEM-V. The original UniVL is pre-trained on 1.2 million instructional videos with ASR transcripts from HowTo100M (Miech et al., 2019b).
For the text-to-video retrieval task, m-UniVL extracts video features using the off-the-shelf pre-trained S3D (Miech et al., 2019a) model. The 3D feature extractor runs at 16 FPS and the feature dimension is 1,024. The hyper-parameters of the video encoder are set as follows: 768 hidden units, 12 heads, and 6 layers of Transformer blocks to capture the sequential information in the 3D features. The hyper-parameters of the text encoder are identical to those of XLM-R: 768 hidden units, 12 heads, and 12 layers of Transformer blocks. The cross encoder on top of the text and video encoders has 2 layers, 768 hidden units and 12 heads. For video captioning, the decoder has 3 layers, 768 hidden units and 12 heads.
We fine-tune m-UniVL on the text-to-video retrieval and video captioning tasks. For retrieval, a learning rate of 1e-4 and a batch size of 128 are used to fine-tune m-UniVL for 50 epochs. For captioning, a learning rate of 3e-5 and a batch size of 16 are used to fine-tune m-UniVL for 5 epochs. All of the above computations are carried out on 4 NVIDIA Tesla V100 GPUs.

Text-to-Video Retrieval Results
Following the official UniVL setup for retrieval, we evaluate the text-to-video retrieval task on GEM-V using two variants. One is m-UniVL (loose), which encodes the input text query and candidate video clips (and optional titles) through the text encoder and video encoder respectively, and computes the matching score as a dot product. The other is m-UniVL (tight), which additionally concatenates the encoded features and feeds them into the cross encoder to obtain a unified representation, predicting the matching score from the first token '<s>'. The evaluation metric is mean-Recall (the arithmetic mean of Recall@K for K ∈ {1, 5, 10}).
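The two scoring schemes can be sketched as follows; `cross_encoder` and `score_head` are stand-ins for the actual pre-trained modules, not m-UniVL's real implementation.

```python
import numpy as np

def loose_score(text_vec, video_vec):
    """m-UniVL (loose): the single-modal encoders produce one vector per
    modality; the match score is their dot product."""
    return float(text_vec @ video_vec)

def tight_score(text_seq, video_seq, cross_encoder, score_head):
    """m-UniVL (tight): concatenate the encoded sequences, pass them
    through the cross encoder, and predict the score from the first
    token of the unified representation."""
    unified = cross_encoder(np.concatenate([text_seq, video_seq], axis=0))
    return float(score_head(unified[0]))
```

The loose variant allows candidate video vectors to be pre-computed and indexed, while the tight variant must run the cross encoder per query-candidate pair, trading speed for accuracy.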
Table 5 lists the retrieval results. The results fall into two groups: one from fine-tuning on the combined training sets of all languages (Fine-tune on All), and the other from fine-tuning on the individual training set of each language (Fine-tune on Each). The goal of this division is to explore whether one language can benefit from the others. Besides, 9 languages have no training set; we keep them as a zero-shot evaluation to explore the transfer ability of the proposed model.
We can draw three conclusions from the results: 1) m-UniVL (tight) outperforms m-UniVL (loose) under the same retrieval settings. This suggests that the cross encoder of UniVL enables the multimodal features to interact fully with each other and capture better alignment.
2) The video title introduces a large performance gain and is a good semantic feature for the video. This metadata is especially useful in the zero-shot setting, where it brings a significant improvement. These results demonstrate that improving retrieval performance on pure videos, without titles, remains a challenge, and our proposed GEM provides an opportunity to push on this multimodal challenge.
3) Fine-tune on All achieves better results than Fine-tune on Each. The reason is that the former can effectively leverage the data from all languages to benefit the task. Besides, Fine-tune on All is also very effective for the zero-shot languages. This demonstrates that our proposed GEM can also be used for multilingual research in addition to multimodal research.

Video Captioning Results
The captioning task in our setting aims to generate a caption given a video clip (and an optional title). This generation task comes from our practical application. We employ the whole m-UniVL model, including the encoders and decoder, for this task. The evaluation metrics are ROUGE-L (Lin and Och, 2004), METEOR (Banerjee and Lavie, 2005) and CIDEr, whose values are obtained from an open-source tool (https://github.com/Maluuba/nlg-eval). Table 6 lists the experimental results. Conclusions similar to those for the retrieval task can be drawn, with two further observations: 1) For the captioning task, the performance of generation on pure videos is low. The reason is that search queries are sometimes keywords rather than whole sentences, which makes the V→Q task relatively hard.
2) The title is especially important, due to the characteristics of the data collection process.
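As a reference for the metrics above, sentence-level ROUGE-L is an F-measure over the longest common subsequence of candidate and reference; a minimal implementation (with the commonly used β = 1.2, an assumption here) looks like this:

```python
def rouge_l(candidate, reference, beta=1.2):
    """Sentence-level ROUGE-L: F-measure over the longest common
    subsequence (LCS) of the candidate and reference token sequences."""
    c, r = candidate.split(), reference.split()
    # dynamic-programming LCS table
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c):
        for j, rw in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if cw == rw else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```

Because LCS only requires in-order (not contiguous) matches, ROUGE-L is more forgiving of short keyword-like references than n-gram metrics.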
From the above results on the text-to-video retrieval and video captioning tasks, we conclude that our proposed GEM-V can measure video understanding and generation abilities from a multilingual, multimodal perspective.

Conclusion
This paper presents GEM, a benchmark for evaluating the generalization capabilities of vision-language models on image-language and video-language tasks. GEM is also the first large-scale multilingual multimodal dataset, in which the natural language contexts are collected from a commercial search engine in 20 and 30 languages for image-related and video-related tasks, respectively. We describe two vision-language pre-trained models as baselines for GEM, and hope these efforts can advance the development of multilingual multimodal research.

Ethical Considerations
We have reviewed our data release process and it has been approved by our institutional review board. Specifically: (a) In GEM-I, all of the images carry appropriate Creative Commons licenses, so they are safe to distribute without violating any policies or intellectual property rights. Also, we discarded images with human faces to avoid revealing private information. (b) In GEM-V, all of the videos originate from YouTube, and we will provide only YouTube URLs to researchers. We have confirmed with our institutional review board that distributing URLs does not violate any policies or intellectual property rights. We did not apply any special processing to human faces in the videos, since we only distribute video URLs, and modifying the original videos (for example, blurring faces) might violate their copyright. When releasing GEM to the public, we will indicate the data source, emphasize that the dataset is for research purposes only, and provide an email address so that people can contact us to delete any data in case of infringement. During data collection, we did not collect, store, or distribute any private information about users.
To measure the quality of our dataset, we employed crowd-sourcing judges in the United States and provided labeling guidelines for them. The compensation given to the workers is 15 USD per hour for GEM-I and 25 USD per hour for GEM-V. The level of compensation is determined by: (a) the market price of similar labeling tasks in the US; (b) the difficulty and labeling speed of the task. The task involves judging whether a query is related to an image or video, so it is considered relatively easy. The labeling speed is about 300 query-image pairs per hour and 180 query-video pairs per hour.