Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset

Research in massively multilingual image captioning has been severely hampered by a lack of high-quality evaluation datasets. In this paper we present the Crossmodal-3600 dataset (XM3600 in short), a geographically diverse set of 3600 images annotated with human-generated reference captions in 36 languages. The images were selected from across the world, covering regions where the 36 languages are spoken, and annotated with captions that achieve consistency in terms of style across all languages, while avoiding annotation artifacts due to direct translation. We apply this benchmark to model selection for massively multilingual image captioning models, and show superior correlation results with human evaluations when using XM3600 as golden references for automatic metrics.


Introduction
Image captioning is the task of automatically generating a fluent natural language description for a given image. This task is important for enabling accessibility for visually impaired users, and is a core task in multimodal research encompassing both vision and language modeling. However, datasets for this task are primarily available in English (Young et al., 2014; Chen et al., 2015; Krishna et al., 2017; Sharma et al., 2018; Pont-Tuset et al., 2020). Beyond English, there are a few datasets such as Multi30K, with captions in German (Elliott et al., 2016), French (Elliott et al., 2017), and Czech (Barrault et al., 2018), but they are limited to only a few languages that cover a small fraction of the world's population, while featuring images that severely under-represent the richness and diversity of cultures from across the globe. These aspects have hindered research on image captioning for a wide variety of languages, and directly hamper the deployment of accessibility solutions for a large potential audience around the world.
Creating large training and evaluation datasets in multiple languages is a resource-intensive endeavor. Recent work (Thapliyal and Soricut, 2020) has shown that it is feasible to build multilingual image captioning models trained on machine-translated data (with English captions as the starting point). This work also shows that the effectiveness of some of the most reliable automatic metrics for image captioning, such as CIDEr (Vedantam et al., 2015), is severely diminished when applied to translated evaluation sets, resulting in poorer agreement with human evaluations compared to the English case. As such, the current situation is that trustworthy model evaluation can only be based on extensive and expensive human evaluations. However, such evaluations cannot usually be replicated across different research efforts, and therefore do not offer a fast and robust mechanism for model hill-climbing and comparison of multiple lines of research.
The proposed XM3600 image captioning evaluation dataset provides a robust benchmark for multilingual image captioning, and can be reliably used to compare research contributions in this emerging field. Our contributions are as follows: (i) for human caption annotations, we have devised a protocol that allows annotators for a specific target language to produce image captions in a style that is consistent across languages; this protocol results in image-caption annotations that are free of direct-translation artefacts, an issue that has plagued Machine Translation research for many years and is now well understood (Freitag et al., 2020); (ii) for image selection, we have devised an algorithmic approach to sample a set of 3600 geographically-diverse images from the Open Images Dataset (Kuznetsova et al., 2020), aimed at creating a representative set of images from across the world; (iii) for the resulting XM3600 benchmark, we empirically measure its ability to rank image captioning model variations, and show that it provides high levels of agreement with human judgements, therefore validating its usefulness as a benchmark and alleviating the need for human judgement in the future.
Fig. 1 shows a few sample captions for an image in XM3600 that exemplify point (i) above, and Fig. 2 shows the variety of cultural aspects captured by the image sampling approach from point (ii). We provide detailed explanations and results for each of the points above in the rest of the paper. We have released XM3600 under a CC-BY 4.0 license at https://google.github.io/crossmodal-3600/.

The XM3600 Dataset
In this section, we describe the heuristics used for language and image selection, the design of the caption annotation process, caption statistics including quality, and annotator details.
Language Selection

We select 30 languages based on their internet presence (referred to as L30), plus five low-resource languages (L5) chosen because they either have many native speakers, or are major native languages from continents that would not be covered otherwise. The protocol for caption annotation (Sec. 2.3) has been applied to the resulting union of languages plus English, for a total of 36 languages.

Image Selection
In this section, we consider the heuristics used for selecting a geographically diverse set of images.
For each of the 36 languages, we select 100 images that, as far as it is possible for us to identify, are taken in an area where the given language is spoken. The images are selected among those in the Open Images Dataset (Kuznetsova et al., 2020) that have GPS coordinates stored in their EXIF metadata. Since there are many regions where more than one language is spoken, and given that some areas are not well covered by Open Images, we design an algorithm that maximizes the percentage of selected images taken in an area in which the assigned language is spoken. This is a greedy algorithm that starts the selection of images with the languages for which we have the smallest candidate pool (e.g. Persian) and processes them in increasing order of their candidate image pool size. Whenever there are not enough images in the area where a language is spoken, we have several back-off levels: (i) selecting from a country where the language is spoken; (ii) from a continent where the language is spoken; and, as a last resort, (iii) from anywhere in the world.
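The greedy assignment can be sketched as follows. This is our own illustration of the procedure described above, not the authors' implementation; the function name and data layout are assumptions.

```python
def assign_images(candidates, n_per_lang=100):
    """Greedily assign images to languages, scarcest candidate pool first.

    candidates: dict mapping language -> list of candidate image ids,
    ordered by back-off preference (in-area first, then country,
    then continent, and finally anywhere in the world).
    """
    taken = set()
    assignment = {}
    # Languages with the smallest pools (e.g. Persian) pick first, so
    # better-covered languages do not exhaust their candidates.
    for lang in sorted(candidates, key=lambda l: len(candidates[l])):
        chosen = []
        for img in candidates[lang]:
            if img not in taken:
                chosen.append(img)
                taken.add(img)
                if len(chosen) == n_per_lang:
                    break
        assignment[lang] = chosen
    return assignment
```

Because each language consumes candidates in back-off order, a language only falls back to country-, continent-, or world-level images once its in-area pool is exhausted or claimed by scarcer languages.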
This strategy succeeds in providing our target number of 100 images from an appropriate region for all of the 36 languages except Persian (for which 14 continent-level images are used) and Hindi (for which all 100 images are selected at the global level, because the in-region images are assigned to Bengali and Telugu). We keep the region each image is selected from as part of our data annotation, so that future evaluations can choose to either evaluate on images relevant to particular regions of interest or on the entire dataset.

Caption Annotation
In this section we detail the design of the caption annotation process. For a massively multilingual benchmark such as XM3600, consistency in the style of the description language is critical, since language can serve multiple communication goals. For a more in-depth discussion of these issues as they relate to image captions, we refer the reader to Alikhani et al. (2020). We borrow from their terminology, as it identifies coherence relations between image and captions such as VISIBLE, META, SUBJECTIVE, and STORY. The goal for our caption annotation is to generate VISIBLE image captions, i.e., use the target language to formulate a sentence that is intended to recognizably characterize what is visually depicted in the image.
One possible approach to generating such captions is to first generate them in English, and have them translated (automatically, semi-automatically, or manually) into all the other languages. However, this approach results in an English-language bias, as well as other problems that have already been identified in the literature. For instance, translations are often less fluent than natural target sentences, due to word order and lexical choices influenced by the source language. The impact of this phenomenon on metrics and modeling has recently received increased attention in the evaluation literature (Toral et al., 2018; Zhang and Toral, 2019; Freitag et al., 2020), and references created in this style are thought to cause overlap-based metrics to favor model outputs that use such unnatural language.
We have designed our caption annotation process to achieve two main goals: (i) produce caption annotations that are in a VISIBLE relation with respect to the image content and, just as importantly, consistent in description style across languages; (ii) be free of translation artefacts. To achieve this, we use bilingual annotators who are required to be reading-proficient in English and fluent or native in the target language. As a preliminary step, we train an image captioning model on English-annotated data, which results in captions in the VISIBLE style of COCO-CAP (Chen et al., 2015).
The annotation process proceeds as follows. Each annotation session is done over batches of N = 15 images, using the images selected as described in Sec. 2.2. The first screen shows the N images with their captions in English as generated by the captioning model, and asks the annotators whether the captions are EXCELLENT, GOOD, MEDIUM, BAD, or there is NOT-ENOUGH-INFO. We refer to this rating scale as the 5-level quality scale in the subsequent text. We provide the annotators with clear guidelines about what constitutes an EXCELLENT caption, and how to evaluate degradations from that quality. This step forces the annotators to carefully assess caption quality, and it primes them into internalizing the style of the captions without the need for complicated and lengthy annotation instructions.
The second round shows the same N images again, but one image at a time and without the English captions, and the annotators are asked to produce descriptive captions in the target language for each image. In the absence of the English captions, the annotators rely on the internalized caption style, and generate their annotations mostly based on the image content, with no support from the text modality other than potentially from memory. Note, however, that we have designed the system to support N annotations simultaneously, and we have empirically selected the value of N to be large enough to "overwrite" the memory of the annotators with respect to the exact textual formulation of the English captions. As a result, we observe that the produced annotations are free of translation artefacts: see the example in Fig. 1, where the Spanish caption mentions "number 42" and the Thai caption mentions "convertibles".
We also provide the annotators with an annotation protocol to use when creating the captions, which provides useful guidance in achieving consistent annotations across all the targeted languages. We provide the annotation guidelines in Appendices B and C. For each language, we annotate all 3600 images with captions using replication 2 (two different annotators working independently), except Bengali (bn) with replication 1 and Maori (mi) with replication 1 for 2/3 and 2 for 1/3 of the images; see Table 1.

Caption Statistics
In this section, we take a look at the basic statistics of the captions in the dataset.
For languages with natural space tokenization, the number of words per caption can be as low as 5 or 6 for some agglutinative languages like Cusco Quechua (quz) and Czech (cs), and as high as 18 for an analytic language like Vietnamese (vi). The number of characters per caption also varies drastically, from the mid-20s for Korean (ko) to the mid-90s for Indonesian (id), depending on the alphabet and the script of the language.
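Such per-language statistics can be computed with a short script; the sketch below is our own illustration, and its simple whitespace tokenization is only meaningful for languages with natural space tokenization.

```python
from statistics import mean

def caption_stats(captions_by_lang):
    """Average number of words (space-tokenized) and characters
    per caption, for each language."""
    stats = {}
    for lang, caps in captions_by_lang.items():
        stats[lang] = {
            "avg_words": mean(len(c.split()) for c in caps),
            "avg_chars": mean(len(c) for c in caps),
        }
    return stats
```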

Caption Quality
In this section, we describe the process for ensuring the creation of high-quality annotations, and present quality statistics of the annotations produced.

Table 2: Caption quality statistics for the 36 languages. We use the median of three ratings as the aggregated rating for an image-caption pair.
In order to ensure quality, the annotation process starts with pilot runs on 150 images. The caption ratings are spot-checked by the authors to verify that the raters have a good understanding of the rating scale. Further, the generated captions go through a verification round where they are rated by the human annotators on the 5-level quality scale described in Sec. 2.3. If the annotations are below the desired quality, we clarify the guidelines and add more examples to provide feedback to the human annotators, and then conduct another pilot. This process is repeated until very few low-quality captions are being produced (we started the process with a set of six languages, for which 4-5 pilots were needed per language; for subsequent languages, only 1-2 pilots were needed because of these accumulated clarifications). After this, for every language, we run the main annotation and finally a verification round where we select one caption for 600 randomly selected images and have the annotator pool (per language) rate them on the 5-level quality scale mentioned in Sec. 2.3. The quality scores are presented in Table 2.
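The median-of-three aggregation used for Table 2 can be illustrated as follows. The numeric mapping of the scale is our own assumption, and NOT-ENOUGH-INFO ratings are left out of this sketch.

```python
from statistics import median

# Hypothetical numeric mapping for the 5-level quality scale;
# NOT-ENOUGH-INFO ratings would be handled separately.
SCALE = {"BAD": 1, "MEDIUM": 2, "GOOD": 3, "EXCELLENT": 4}
LEVELS = sorted(SCALE, key=SCALE.get)

def aggregate_rating(ratings):
    """Median of three independent ratings for an image-caption pair."""
    return LEVELS[int(median(SCALE[r] for r in ratings)) - 1]
```

Using the median (rather than the mean) makes the aggregate robust to a single outlier rater per image-caption pair.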

Annotator Details
We use an in-house annotation platform with professional (paid) annotators and quality assurance. Annotators are chosen to be native in the target language whenever possible, and fluent otherwise (for low-resource languages, they are usually linguists that have advanced-level knowledge of that language). All annotators are required to be proficient in English, since the instructions and guidelines are given in English.

Model Comparison using XM3600
In this section, we detail our experiments for comparing several models using human evaluations, and also using XM3600 annotations as gold references for automated metrics.
For model comparison, we train several multilingual image captioning models with different sizes over different datasets, and compare them on XM3600. As our main result, we show a high level of correlation between model rankings based on human-evaluation scores and the scores obtained using CIDEr (Vedantam et al., 2015) with XM3600 annotations as gold references.

Datasets
We build two multilingual datasets for training, CC3M-35L and COCO-35L, by translating Conceptual Captions 3M (Sharma et al., 2018) and COCO Captions (Chen et al., 2015) into the other 34 languages using Google's machine translation API. The remaining language, Cusco Quechua (quz), is not supported by the API. We use the standard train and validation splits for CC3M. For COCO, we use the Karpathy split (Karpathy and Fei-Fei, 2014).

Models
In this section we detail the model architecture we used for the experiments. On the vision side, the features extracted by a ViT encoder from the image are pooled into a single dense feature vector. On the text side, a Language Identifier (LangId) string is used to specify the language. The LangId string is tokenized and embedded into dense token embeddings, which are merged with the dense visual embeddings as the input to a multi-layer Transformer Image and Text Encoder, followed by a multi-layer Transformer Image and Text Decoder that generates the predicted captions.
We train these three models on COCO-35L. In addition, we consider a fourth model based on mT5-base + ViT-B/16 and trained on CC3M-35L. The models are trained on a 4x4x4 TPU-v4 architecture using an Adafactor (Shazeer and Stern, 2018) optimizer with a constant learning rate period of between {1k, 10k} steps, followed by a reversed square-root decay with the number of steps. The batch size is 2048 in all the experiments. The initial learning rate is between {1e-4, 3e-4}. We use the same vocabulary (size 250k) as mT5 (Xue et al., 2021). The model trained with CC3M-35L is subsequently finetuned on COCO-35L with a constant learning rate of 3e-5 for 1 epoch.
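One plausible reading of this learning-rate schedule (a constant period followed by a reversed square-root decay) can be sketched as follows; the exact parameterization, in particular matching the decay to the initial rate at the boundary, is our assumption.

```python
import math

def learning_rate(step, init_lr=3e-4, constant_steps=10_000):
    # Constant learning rate for the first `constant_steps` steps,
    # then decay as the reciprocal square root of the step count,
    # scaled so the schedule is continuous at the boundary.
    if step <= constant_steps:
        return init_lr
    return init_lr * math.sqrt(constant_steps / step)
```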

Human Evaluation
In this section, we detail the process used for human evaluations comparing the performance of two models.
Our main goal in creating XM3600 is to automate the evaluation of massively multilingual image captioning models, by eliminating expensive and time-consuming human evaluations. Our results indicate that they can be substituted by using the XM3600 annotations as gold references for automated metrics such as CIDEr (Vedantam et al., 2015). To quantify the correlation between the two methods, we train four different models (Tab. 3) and conduct side-by-side human evaluations using the outputs of these models in several languages. We observe strong correlations (Sec. 3.4) between the human evaluations and the CIDEr scores using the XM3600 references.
Specifically, we use a randomly selected subset of 600 images from XM3600 for human evaluations, which we call XM600. Image captions generated by a given pairing of models (m1 vs m2, where m1 is considered the base condition and m2 the test condition) are compared and rated side-by-side, using a similar pool of annotators as described in Sec. 2.6. Each side-by-side pair (shown in a random per-example left-vs-right order) is rated using a 7-point scale: MUCH-BETTER, BETTER, SLIGHTLY-BETTER, SIMILAR, SLIGHTLY-WORSE, WORSE, MUCH-WORSE, with a replication factor of 3 (three annotators rate each pair). We denote by WINS the percentage of images where the majority of raters (i.e. 2 out of 3) mark m2's captions as better, and by LOSSES the percentage of images where the majority of raters mark m2's captions as worse. We then define the overall side-by-side gain of m2 over m1 as ∆S×S = WINS - LOSSES.
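The ∆S×S computation can be made concrete as below. Grouping the three "better" labels together (and likewise the three "worse" labels) when determining majorities is our interpretation of the protocol.

```python
BETTER = {"MUCH-BETTER", "BETTER", "SLIGHTLY-BETTER"}
WORSE = {"MUCH-WORSE", "WORSE", "SLIGHTLY-WORSE"}

def sxs_gain(ratings_per_image):
    """Compute WINS - LOSSES (in percentage points) for m2 vs m1.

    ratings_per_image: one tuple of three rater labels per image,
    each label rating m2's caption against m1's.
    """
    wins = losses = 0
    for ratings in ratings_per_image:
        if sum(r in BETTER for r in ratings) >= 2:   # majority: m2 better
            wins += 1
        elif sum(r in WORSE for r in ratings) >= 2:  # majority: m2 worse
            losses += 1
    return 100.0 * (wins - losses) / len(ratings_per_image)
```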
Conducting the full set of six side-by-side evaluations for each pair of models over the 35 languages would require 210 human evaluation sessions. This is prohibitively expensive and time-consuming. Thus, we conduct the full set of six side-by-side evaluations of the pairs of models on a core set of four languages called LCORE: Chinese-Simplified (zh), English (en), Hindi (hi), and Spanish (es). We call this set of 24 evaluation sessions OCORE. Furthermore, we also conduct a sparser set of side-by-side evaluations over languages where the CIDEr differences on XM3600 and on COCO-DEV (the COCO validation split with machine-translated references) indicate disagreement or ambiguity (e.g., opposite signs of the CIDEr differences, and/or small CIDEr differences); this gives us a set of 28 languages called LEXT. We call the resulting set of 41 evaluation sessions OEXT. The set of all evaluations is called OALL = OCORE + OEXT, and is conducted over the languages LALL = LCORE + LEXT.
The choice of which model is called m1 and which is called m2 is arbitrary in the side-by-side evaluations, since we randomly flip left vs. right before presenting the captions to the raters. Hence a single side-by-side evaluation gives two points for the correlation calculations: one with m1 and m2 assigned as per the actual evaluation conducted, and one more with the m1 and m2 assignment flipped and the sign of ∆S×S flipped correspondingly.
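The doubling of correlation points described above amounts to the following; the tuple layout is our own illustration.

```python
def correlation_points(evals):
    """Duplicate each (delta_sxs, delta_cider) comparison with both
    signs flipped, since the m1/m2 assignment is arbitrary."""
    points = []
    for delta_sxs, delta_cider in evals:
        points.append((delta_sxs, delta_cider))
        points.append((-delta_sxs, -delta_cider))
    return points
```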

Results
We present results showing that it is feasible to use the XM3600 annotations as gold references with automated metrics such as CIDEr to compare models in lieu of human evaluations, and that this option is superior to using silver references created via automated translation. Table 5 presents the results for the OCORE set of evaluations on XM600 over the LCORE languages, while Table 7 in the appendix shows the results on the LEXT languages. The reference for the relative strength of each pairing is given by ∆S×S, with positive numbers indicating the superiority of m2, and negative numbers the superiority of m1. As can be seen from the table, the model comparisons span a range of model differences, from low to high ∆S×S. ∆CIDEr XM600 and ∆CIDEr XM3600 capture similar information, except that these numbers are based on CIDEr scores using XM600 and XM3600 as references, respectively, while ∆CIDEr COCO-DEV is based on machine-translated references from the validation split of COCO.
We use the results from Table 5 (and Table 7) to compute the correlation between human judgements of the relative quality of the captioning models and the ability of the CIDEr metric (or, rather, of the underlying references used by the metric) to perform an equivalent task. We use the reference CIDEr implementation with default parameters (github.com/vrama91/cider), and remove punctuation and lowercase the captions and references before computing automated metrics. Table 6 presents the correlation results using three correlation metrics: Pearson, Spearman, and Kendall. The first section shows the correlations over all the side-by-side evaluations (i.e., OCORE and OEXT), covering the LCORE and LEXT languages. The second section shows the correlations for the OEXT evaluations, covering the LEXT languages. The third section shows the correlations for the OCORE evaluations, covering the LCORE languages.
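These three correlation metrics are available in standard statistics packages; for concreteness, a dependency-free sketch (which ignores tie handling) is:

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def _ranks(xs):
    # Rank values from 1..n; ties are not handled in this sketch.
    order = sorted(range(len(xs)), key=xs.__getitem__)
    ranks = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = float(rank)
    return ranks

def spearman(xs, ys):
    # Spearman is Pearson computed on the ranks.
    return pearson(_ranks(xs), _ranks(ys))

def kendall(xs, ys):
    # (concordant pairs - discordant pairs) / total pairs
    n = len(xs)
    s = sum(
        ((xs[i] > xs[j]) - (xs[i] < xs[j]))
        * ((ys[i] > ys[j]) - (ys[i] < ys[j]))
        for i, j in combinations(range(n), 2)
    )
    return s / (n * (n - 1) / 2)
```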
We observe that ∆CIDEr XM3600 is highly correlated with human judgement according to all the correlation metrics (Bonett and Wright, 2000): over all the evaluations OALL, over the OCORE evaluations, and also over the OEXT evaluations. Furthermore, for the OEXT evaluations, where most of the instances have opposite signs for ∆CIDEr COCO-DEV and ∆CIDEr XM3600, we find that the former is strongly anti-correlated with the human evaluation results, while the latter is highly correlated with them. Overall, these results indicate that: (i) we can reliably substitute ∆CIDEr XM3600 for human evaluations on XM600 when comparing models similar to the ones we used; (ii) the gold XM3600 references are preferable to the silver references obtained by translating COCO captions, in terms of approximating the judgements of the human evaluators.
Based on the results from Table 6, we recommend the use of the XM3600 references as a means to achieve high-quality automatic comparisons between multilingual image captioning models. We have provided the CIDEr scores on XM3600 in 35 languages for all the models in Tables 8-11 in the Appendix; these can be used as baselines in future work.

However, it is unclear whether machine-translated references (from one particular language in XM3600 translated into all the others) are worse than the human-generated references. In particular, we studied the correlations with the human evaluations of CIDEr computed using XM3600-en-MT (i.e., the XM3600 English references, machine-translated into all the other languages). We found that even though the translations have artifacts and disfluencies, CIDEr differences calculated using them show comparable correlations with the human judgements. We also studied such correlations for machine-translated references from German, Greek, Hebrew, Hungarian, and Swahili, and found that the correlations are similar and sometimes even slightly higher than when using the human-generated references. We believe this happens because the rater guidelines weigh informativeness over fluency, and the CIDEr metric is also not very sensitive to fluency. Further work is needed to understand the use of translated references as compared to human-generated references. We believe that using the human-generated references together with the sets of machine-translated references from all the other languages may provide even stronger correlations and show greater diversity in the coverage of the image constituents.

Conclusions
We introduce the XM3600 dataset as a benchmark for evaluating the performance of multilingual image captioning models. The images in the dataset are geographically diverse, covering all inhabited continents and a large fraction of the world population. We believe this benchmark has the potential to positively impact both the research on and the applications of this technology, and to enable (among other things) better accessibility for visually-impaired users across the world, including speakers of low-resource languages.
The main appeal of this benchmark is that it alleviates the need for extensive human evaluation, which is difficult to achieve across multiple languages and hinders direct comparison between different research ideas and results.We show significant improvements in correlation with human judgements when using the XM3600 dataset as references for automatic metrics, and therefore hope that the adoption of this dataset as a standard benchmark will facilitate faster progress and better comparisons among competing ideas.
Our empirical observations are primarily based on the full set of side-by-side comparisons over English and three other languages (Spanish, Hindi, Chinese). Due to the similarity in the data collection and quality control process, we expect similar results to hold for all the other languages as well; we validated this expectation with additional empirical observations covering 28 more languages.

Limitations
Due to the high volume of work required and the cost associated with it, we have only targeted 36 languages for our annotation effort; while this number is significantly higher than what is available with previous annotations, it still falls short of including many other languages spoken and written around the world. Additionally, since the L30 languages were selected based on their internet presence, one unintended consequence is that the dataset over-represents European languages. While this is somewhat mitigated by including the L5 low-resource languages, building and sharing this dataset may still perpetuate the tendency of computational linguistics and AI work to be Eurocentric.
Due to cost and logistical constraints, we have sampled only 100 images for each of the targeted languages, which limits the amount of natural and cultural phenomena that these images capture. While the resulting 3600 images have significantly more variety compared to previous datasets, they may still fall short of including important aspects of natural and cultural life from around the globe. Further, there is the possibility of bias in the dataset due to uneven access to photographic equipment and internet connectivity. For example, several of the images in Fig. 4 appear to have been shared by people with non-native names in the context of their locales, so these images may have been taken by tourists rather than locals; further exploration into this aspect of the dataset is important as well.
Another limitation is around verifying the absence of translation artifacts in the annotations. We primarily rely on the caption generation process outlined in Sec. 2.3 and on rater quality controls to avoid translation artifacts. Further, we have performed spot checks on captions in several languages and have not found indications of translation artifacts. Additionally, we have compared machine translations of annotations from another language, such as English, with the generated annotations, and verified that the translations show peculiar artifacts and disfluencies that are not seen in the generated annotations.
We would also like to emphasize that, while this dataset aims to reduce the need for human evaluations in multilingual image captioning, automated evaluation may be less sensitive to small changes, e.g. when comparing highly tuned methods submitted to competitions. This was one of our motivations for comparing models that range from very different (CC+Bg vs Bg/Lg/BB) to moderately different (BB vs Bg/Lg) and quite similar (Bg vs Lg), and the results from Table 5 show that our approach works well over this range of model differences over LCORE. We also stress-tested our approach by focusing the OEXT evaluations on cases where ∆CIDEr XM3600 or ∆CIDEr COCO-DEV were quite small or of opposite signs, and the results from Table 7 in the appendix show that ∆CIDEr XM3600 correlated well with human evaluations even for this harder set of evaluations. However, we caution the reader that there will be cases where human judgement will still be needed. Further, automated evaluations may be biased toward methods that explicitly optimize the evaluated metric, e.g. via approaches such as Self-Critical Sequence Training (Rennie et al., 2017).
We also note that the model outputs and human judgements data used for calculating the correlations would be useful for constructing new automated metrics and validating existing automated metrics for model comparisons.Releasing this data would also allow independent calculation of CIDEr and ∆S×S shown in Table 5 and Table 7.However, due to the timelines involved and approvals required, we are not able to release this data with the paper.This may hamper the reproducibility of these computations.
The approach to data collection and annotation of COCO-CAP (Chen et al., 2015) and CC3M (Sharma et al., 2018) upholds rigorous privacy and ethics standards, such as the avoidance of offensive content and of exposure of personal identification data. This significantly mitigates, but does not completely eliminate, the risk that the captioning models we train would produce such information. Similarly, the XM3600 dataset mitigates such risks by adopting a defense-in-depth approach: 1) the annotations have been produced in-house and quality controlled, while the images used have been vetted as appropriate for the intended use; 2) the machine translations of the annotations have been scanned with an automated tool to detect personally identifiable information; 3) the machine translations of the annotations have been spot-checked by the authors.
Overall, in spite of the above limitations, we believe that this dataset is a significant step toward ameliorating language and geographic bias, and that it should be used for advancing image captioning research over a wider variety of images and languages.

B Instructions for Rating Captions
The following instructions are provided to annotators for rating captions: This task involves rating captions. To guide your ratings, imagine that you are describing the image to a visually impaired friend, then consider: how well does the caption describe the image to this friend?
Use the following scale for judging the quality of the captions (for borderline cases, use the lower rating): • BAD: The caption has one or more of the following issues: a)

C Instructions for Generating Captions
The following instructions are provided to annotators for generating captions: To guide your caption generation, imagine that you are describing the image to a visually impaired friend. The caption should explain the whole image, including all the main objects, activities, and their relationships. The objects should be named as specifically as practical: for example, when describing a young boy in a picture, "young boy" is preferred over "young child", which in turn is preferred over "person".
Note: the goal is to generate captions that would be labeled as "Excellent" under the rating guidelines above, but raters should not copy captions from the first phase; we want the raters to generate the captions on their own. We outline here a procedure that you should try to follow when writing your image caption. Note that not all these steps may be applicable to all images, but they should give you a good idea of how to organize your caption. We will make use of the first image in the table below (the one with the young girl smiling).

Note: it is acceptable to make assumptions that are reasonable, as long as they don't contradict the information in the image (e.g., in the second image below, we use "families" in captions 1 and 3 because there seems to be a mix of children and adults, though it is not perfectly clear; this is a reasonable assumption to make and nothing in the image contradicts it; however, it is also fine to use "people").

1. Identify the most salient object(s)/person(s) in the image; use the most informative level to refer to something (i.e., "girl" rather than "child" or "person"); in the example image: "girl".
2. Identify the most salient relation between the main objects; for example: "girl standing in front of the whiteboard".
3. Identify the main activity depicted; in the example image: "smiling" as an activity (note that this can also be an attribute of the girl), or "standing" as an activity.
4. Identify the most salient attributes of the main object(s)/person(s)/activity(ies); in the example image: "smiling" and "young" as attributes for the girl.
5. Identify the background/context/environment in which the scene is placed; in the example image: "classroom".
6. Put everything together from steps 1-5 above; for the example image: "a smiling girl standing in a classroom", or "a young girl smiling in a classroom".

Figure 1 :
Figure 1: Sample captions in three different languages (out of 36; see the full list of captions in Appendix A), showcasing the creation of annotations that are consistent in style across languages, while being free of direct-translation artefacts (e.g., the Spanish "number 42" or the Thai "convertibles" would not be possible when directly translating from the English versions).

Figure 2 :
Figure 2: A sample of images in the XM3600 dataset, together with the language for which they have been selected. Overall, the images span regions covering 36 different languages and 6 different continents.
Figure 3: The architecture for the family of multilingual image captioning models used in the experiments.
Table 3 describes the best hyperparameters we found for different training setups and used in our quantitative experiments. In terms of model sizes, mT5-base has about 680 million parameters, mT5-large about 1230 million, ViT-B/16 86 million, and ViT-g/14 1011 million parameters. Together, all the experiments took around 5000 TPU hours to train.

Figure 4
Figure 4 displays the captions in the 36 languages covered in XM3600 for the same image as in Figure 1.

Figure 4 :
Figure 4: Example captions in the 36 languages covered in XM3600

Table 1: Caption statistics: a total of 261,375 captions across 36 languages. We provide the replication stats per language, as well as the average number of words (where applicable) and characters per caption.

Table 1 provides detailed caption statistics, including the number of captions per image and the average number of words and characters per caption. There are a total of 261,375 captions across 36 languages, with each image having at least 2 captions per language in the vast majority of cases.

Table 3 :
Model details for all model variants used in our experiments: lr denotes the learning rate; cp denotes the number of steps in the constant-learning-rate period.

Table 4 :
CIDEr on XM3600 and COCO-DEV for the models over the four LCORE languages (COCO-DEV computed using machine-translated references). Tables 8-11 in the appendix show all the CIDEr values for all the models.

Table 3 describes the best hyperparameters we found for the different training setups used in our quantitative experiments.

Table 5 :
Model comparisons over LCORE languages (m2 vs m1). L denotes the target language; ∆CIDEr XM600 is CIDEr(m2) - CIDEr(m1) on the XM600 dataset, ∆CIDEr XM3600 the same on the XM3600 dataset, and ∆CIDEr COCO-DEV the same on the COCO validation split with machine-translated references. Table 7 in the appendix shows model comparisons over the LEXT languages.
Caption misses the main topic of the image. b) Caption has major grammatical errors (such as being incomplete, words in wrong order, etc.); please ignore capitalization of words and punctuation. c)