Multilingual Multimodal Learning with Machine Translated Text



Introduction
Vision-and-language (V&L) pretraining is the process of learning deep contextualised cross-modal representations from large collections of image-sentence pairs (Li et al., 2019; Tan and Bansal, 2019; Chen et al., 2020, inter alia). These pretrained models are an excellent backbone for transfer learning to a wide range of downstream tasks, such as visual question answering (Antol et al., 2015; Gurari et al., 2018; Agrawal et al., 2022), referring expression alignment (Kazemzadeh et al., 2014; Mao et al., 2016), and image-sentence retrieval (Young et al., 2014; Lin et al., 2014). Thus far, downstream evaluations have mostly focused on English tasks due to the availability of datasets, but the recent IGLUE benchmark (Bugliarello et al., 2022) now makes it possible to evaluate models on several downstream tasks across 20 languages.

Figure 1: Multilingual multimodal data is a scarce resource compared to English multimodal data. Given an English multimodal dataset, we generate a multilingual dataset using a black-box translation system. We explore the utility of this approach to creating multilingual text for both downstream task fine-tuning and pretraining.
Success in multilingual multimodal tasks, such as those in IGLUE, is expected to depend on models with grounded representations that transfer across languages (Bugliarello et al., 2022). For example, in the MaRVL dataset (Liu et al., 2021), models need to deal with a linguistic and cultural domain shift compared to English data. Therefore, an open problem is to define pretraining strategies that induce high-quality multilingual multimodal representations. Existing work has tackled this problem by either jointly training on English multimodal data and multilingual text-only data (Liu et al., 2021; Ni et al., 2021), pretraining with a private dataset of multilingual captioned images (Jain et al., 2021), or machine translating multimodal pretraining data (Zhou et al., 2021).
In this paper, we further investigate the potential of machine translated text for both fine-tuning and pretraining across four diverse V&L tasks. The overarching motivation is that machine translation is an inexpensive approach to producing large amounts of multilingual text compared to collecting data from humans, or scraping high-quality image-caption pairs from the web. Having access to thousands of data points in a target language might indeed be necessary to improve cross-lingual performance in downstream tasks (Bugliarello et al., 2022). As such, translating fine-tuning data into multiple languages may be a compelling approach towards downstream task success. Moreover, if this can be achieved through machine translated text, it raises the question of whether we can also pretrain on many millions of multilingual translated examples. Motivated by the initial experiments of Zhou et al. (2021), we test this hypothesis further, on more languages and more tasks, reporting more nuanced results from large-scale translated text.
Overall, we show that machine translation can provide inexpensive and impressive improvements when fine-tuning models for multilingual multimodal tasks. Moreover, translation-based pretraining leads to significant gains in zero-shot cross-lingual transfer over existing approaches. However, we find mixed results when combining this with multilingual fine-tuning. There are still opportunities to realise further benefits from machine translated text, which may be found through more compute-intensive pretraining.
Contributions. 1) We present the TD-MML framework to narrow the gap between English and non-English languages in multimodal research. 2) In the process of translation-based pretraining, we present a reliable strategy to filter out bad translations. 3) We conduct systematic evaluations in zero-shot and machine translated scenarios, and show the benefits that can be gained from simply having more data in the target languages.

Related Work
Inspired by the success of self-supervised language model pretraining (Devlin et al., 2019, inter alia), researchers have also explored this paradigm with multimodal models (Gella et al., 2017; Kádár et al., 2018). The first wave of such models (Li et al., 2019; Tan and Bansal, 2019; Li et al., 2020; Chen et al., 2020) was initialised from BERT and pretrained on English image-text datasets like Conceptual Captions (Sharma et al., 2018) and COCO (Lin et al., 2014), where the visual modality was represented using feature vectors extracted from 10-100 automatically detected object proposals (Anderson et al., 2018). More recent models (Kim et al., 2021; Li et al., 2021; Singh et al., 2022) represent the visual modality using a Vision Transformer (Dosovitskiy et al., 2021), which can be fine-tuned end-to-end during pretraining, as opposed to working with pre-extracted object proposals.
More related to our work are the multilingual variants of these models (Liu et al., 2021; Zhou et al., 2021; Ni et al., 2021; Jain et al., 2021). The lack of large-scale multilingual multimodal datasets has resulted in different strategies to train such models. Liu et al. (2021) simply augment English caption data with text-only multilingual Wikipedia data. In addition to this, Ni et al. (2021) further create code-switched multimodal data by randomly swapping English words in Conceptual Captions with the corresponding translation in one of 50 other languages, obtained through the PanLex dictionary. On the other hand, Zhou et al. (2021) machine translate the Conceptual Captions dataset into German, French, Czech, Japanese, and Chinese, for a total of 19.8M pretraining data points. Finally, Jain et al. (2021) pretrain on 3.6B multilingual captions by extending the Conceptual Captions collection pipeline to multiple languages.

In this paper, we further explore the potential of machine translation for pretraining and fine-tuning. Zhou et al. (2021) first pretrained a model on machine translations of the Conceptual Captions pretraining data in five high-resource languages (Mandarin Chinese, Czech, French, German, and Japanese), which then resulted in overall better multilingual representations across a number of diverse languages (Bugliarello et al., 2022). Here, we explore the potential of training multimodal models on a much larger and more diverse set of languages, including low-resource ones. Effectively doing so requires tackling issues and limitations of machine translation systems, which do not produce high-quality translations across all languages. This is especially relevant when translating a large corpus, which might include a large number of data points with low-quality text.

The IGLUE Benchmark
The impetus of our work is the recent creation of the Image-Grounded Language Understanding Evaluation (IGLUE; Bugliarello et al. 2022) benchmark for evaluating multimodal models across twenty languages and four tasks, using five different datasets. Specifically, the benchmark focuses on zero- and few-shot transfer, where models are fine-tuned on English data and then tested on their ability to generalise cross-lingually with no or few samples in the target language for the target downstream task. The following datasets are included in IGLUE:

XVNLI is a cross-lingual Visual Natural Language Inference task (Bugliarello et al., 2022), which requires models to predict the relation (entailment, contradiction, or neutral) between a premise in the form of an image, and a hypothesis in the form of a sentence.
xGQA is a cross-lingual Grounded Question Answering task (Pfeiffer et al., 2022), using images from Visual Genome (Krishna et al., 2017) and translations of the English questions from GQA (Hudson and Manning, 2019). The questions in GQA are automatically generated from the image scene graphs.
MaRVL focuses on multicultural reasoning over images (Liu et al., 2021). The task is in the same format as the English NLVR2 (Suhr et al., 2019) data, namely to judge whether a sentence is true or false for a pair of images. However, the images and the descriptions are sourced directly in the target languages.
xFlickr&CO is an evaluation dataset for image-text retrieval in eight high-resource languages (Bugliarello et al., 2022). The images are collected from Flickr30K (Young et al., 2014) and COCO (Lin et al., 2014), while the captions are new descriptions sourced from native speakers in the target languages.
WIT is a second image-text retrieval dataset, based on the Wikipedia-based Image Text dataset (Srinivasan et al., 2021). WIT is scraped directly from Wikipedia and contains a much more diverse set of image types than the other datasets, as well as more complex and entity-centric descriptions.
Each of the tasks has a natural English training counterpart: SNLI-VE (Xie et al., 2019) for XVNLI; GQA for xGQA; NLVR2 for MaRVL; and the English training splits of Flickr30K and WIT. Bugliarello et al. (2022) found that current multilingual V&L models show a large gap in performance on each of these tasks when evaluated on non-English data. Moreover, further training these models on a few examples in a target language only slightly improved their cross-lingual capabilities.

Fine-Tuning with Translated Data
As an initial experiment, we investigate the extent to which performance can be improved by fine-tuning on multilingual machine-translated data instead of only English data. We conduct this experiment on the MaRVL and xGQA datasets; the results can be seen in Tables 1 and 2, respectively. We use the M2M-100-large model (Fan et al., 2021) to translate the NLVR2 training data into the 5 MaRVL languages, and the GQA training data into the 7 xGQA languages. For the model, we use the xUNITER (Liu et al., 2021) implementation from VOLTA (Bugliarello et al., 2021). xUNITER extends the UNITER architecture (Chen et al., 2020) multilingually, by initialising the model from XLM-RoBERTa (Conneau et al., 2020) and pretraining on English image captions and text-only multilingual Wikipedia paragraphs. Starting from the publicly released xUNITER checkpoint, we fine-tune on the machine translated training sets for each task. For a fair comparison to English-only fine-tuning, we ensure that the multilingual fine-tuning is based on the same number of parameter updates. In effect, this reduces the number of training epochs from 20 to 3 for MaRVL, and from 5 to 1 for xGQA; we round the number of epochs so that the total number of updates is close to the English-only setup. This means that, in our setup, all the images are seen the same number of times, but each unique caption will be seen fewer times in each of the target languages.
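The update-matching described above amounts to dividing the English epoch budget by the number of languages in the translated training set. The helper below is an illustrative sketch of this arithmetic (the function name and the rounding rule are our own, not from the released code):

```python
def matched_epochs(english_epochs: int, num_languages: int) -> int:
    """Scale down the epoch count so that fine-tuning on data replicated
    across num_languages languages performs roughly the same number of
    parameter updates as the English-only run."""
    return max(1, round(english_epochs / num_languages))

# NLVR2 -> MaRVL: 20 English epochs, 6 languages (English + 5 targets) -> 3
# GQA   -> xGQA:   5 English epochs, 8 languages (English + 7 targets) -> 1
```

This reproduces the 20 to 3 and 5 to 1 reductions quoted above.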
Using machine translated data for fine-tuning brings large improvements in performance for both MaRVL and xGQA. Table 1 shows the results for MaRVL, where accuracy on each non-English language increases by between 4.5 and 8.1 points. Table 2 shows the results for xGQA, where the performance for the non-English languages increases by 11.7-32.7 points. We also observe small decreases in performance on English for each task, but this is expected: the models were fine-tuned for the same number of steps, which means the model fine-tuned on translations has been exposed to less English text in order to process multilingual text. We conclude that using machine translated fine-tuning data is an inexpensive and viable path to better task-specific performance.

On Pretraining with Translated Data
The previous section showed the benefits of using machine translated data for multilingual fine-tuning. We now turn our attention to whether further improvements can be realised by adding multilinguality via machine translated data at the pretraining stage. This requires two components: (i) a large-scale translation pipeline and the means to deal with potential data quality issues, and (ii) a model that can exploit the machine translated training data, which we dub TD-MML for Translated Data for Multilingual Multimodal Learning.

Translation and Data Preparation
A commonly used dataset for multimodal pretraining is Conceptual Captions (Sharma et al., 2018), gathered from alt-text on the Internet and post-processed to remove proper names. We translate 2.77M English sentences from the Conceptual Captions training split into the twenty target languages in IGLUE. Once again, we use the M2M-100-large model (Fan et al., 2021), with 1.2B parameters.
We notice that the quality of the translations varies across languages, presumably due to the amount of data used to train M2M-100. Moreover, captions in this dataset often consist of sentence fragments, which may be harder to translate well.
In order to prevent bad data from corrupting the model, we apply a filtering step to the translated data. The two most frequent types of errors are single words being repeated multiple times and English words being copied into the translation. We discard sentences that exhibit these characteristics based on the following two "badness" scores:

• Complement of the token-to-type ratio. The token-to-type ratio (TTR) measures the fraction of unique tokens in a given text. We use its complement (1 − TTR), such that a large score (close to one) indicates repetition.
• BLEU score between the source sentence and its translation. We measure the similarity between the English source and the (non-English) target by computing the BLEU score using the NLTK toolkit (Bird, 2006). A large score (close to one) indicates that English text has been copied into the translation.
We estimate thresholds for the two scores by manually inspecting a subset of 2,000 sentences from each of the twenty target languages. We use the same TTR threshold (0.5) for all languages, since repetition is language-independent. We observe different patterns of English copying, so we set different thresholds for three language groups (Figure 2): Indo-European languages with a Latin script, all languages with a non-Latin script, and non-Indo-European languages using a Latin script. We discard all sentences with scores above either threshold from the multilingual pretraining process. Table 3 shows examples of translations that are filtered out by this procedure: the first two are rejected due to repeated words, the third because English words appear in the output.
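The filtering step can be sketched as follows. This is an illustrative reimplementation, not the authors' code: we compute 1 − TTR on whitespace tokens, substitute a small self-contained smoothed BLEU for the NLTK implementation, and use a placeholder BLEU threshold of 0.3, since the paper sets per-language-group thresholds that are read off Figure 2.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(ref, hyp, max_n=4):
    # Smoothed modified n-gram precisions combined by a geometric mean,
    # with the usual brevity penalty; close to one when hyp copies ref.
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = math.exp(min(0.0, 1 - len(ref) / max(len(hyp), 1)))
    return brevity * geo_mean

def keep_translation(src, tgt, ttr_threshold=0.5, bleu_threshold=0.3):
    tokens = tgt.split()
    repetition = 1 - len(set(tokens)) / max(len(tokens), 1)  # 1 - TTR
    copying = bleu(src.split(), tokens)                      # src-tgt BLEU
    return repetition <= ttr_threshold and copying <= bleu_threshold
```

A plausible translation passes both checks, while a translation that merely repeats one word, or copies the English source verbatim, is discarded.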
The cumulative distribution of the scores and the corresponding thresholds are shown in Figure 2. For most languages, over 95% of the translated sentences are kept, the most notable exceptions being Tamil, Japanese and Korean, for which only 54.6%, 76.6%, and 85.2% of the initial sentences, respectively, are kept. Figure 3 shows the final distribution of training data across languages. The total number of sentences in the translated dataset for pretraining (including English) is 52M.

Model
The model we implement within our Translated Data for Multilingual Multimodal Learning (TD-MML) framework follows the single-stream cross-modal design of models such as UNITER (Chen et al., 2020) and xUNITER (Liu et al., 2021). It can be seen as a translate-train version of xUNITER, to which we add an additional pretraining task: visual translation language modelling (Zhou et al., 2021).
The TD-MML architecture consists of a series of Transformer blocks: the visual and language embeddings are first concatenated to form the input, which is then passed through a multi-layer Transformer to encode contextualised representations across image-text modalities and languages.
The input sequences are image-text pairs (V, X), where V is the sequence of visual features and X is the embedding sequence of the corresponding caption. The image features in V = {v_1, v_2, ..., v_N} correspond to N = 36 object proposals extracted with Faster R-CNN (Ren et al., 2015). We extend the English pretraining text to also include the machine translated captions in m languages: x^{l_1}, x^{l_2}, ..., x^{l_m}. The captions are processed by the same SentencePiece tokeniser regardless of their language.

Pretraining Tasks
To learn the contextualised representation across modalities and languages, we pretrain our model with three types of pretraining tasks, introduced below. The goal of these pretraining tasks is to learn both the cross-modal alignment between each language and the visual modality, as well as alignments between the different languages.
Masked language and region modelling. In the V&L pretraining literature, Masked Language Modelling (MLM) and Masked Region Modelling (MRM) are two mainstream pretraining tasks, which have been demonstrated to be effective in UNITER. Given the visual features V and the corresponding text x^l in language l, MLM randomly masks tokens in the text with 15% probability and predicts the identity of each masked token using the remaining textual and visual features. Analogously to MLM, MRM samples and masks image regions with 15% probability and replaces the region input with zeros. The MRM task is to classify the top-1 object label of the masked visual feature region. Please refer to Chen et al. (2020) for more details.
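The MLM corruption step described above can be sketched as below. This is a minimal illustration of the 15% masking, with plain whitespace tokens standing in for SentencePiece pieces and "[MASK]" for the tokeniser's mask symbol; the BERT-style 80/10/10 replacement mix is omitted for clarity.

```python
import random

MASK = "[MASK]"

def mask_for_mlm(tokens, p=0.15, seed=None):
    """Mask each token independently with probability p. Returns the
    corrupted sequence and per-position labels: the original token at
    masked positions, None elsewhere (only masked positions are scored)."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < p:
            corrupted.append(MASK)
            labels.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels
```

The labels let the training objective score only the masked positions, as in standard MLM.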

Image-Text Matching (ITM). This task attempts to determine whether an image-text pair is matched or not. As such, it enables the model to learn the alignment of the language and vision modalities. The matching score s_θ(x^l, v) is computed from the special [CLS] token, which is passed through a fully-connected layer with a sigmoid activation function. To train the ITM objective, we sample a positive or negative caption with equal probability for a given input image from the dataset D; the selected caption is sampled uniformly from one of the twenty languages in the pretraining dataset. The loss is the binary cross-entropy, where the binary label c denotes whether the sampled pair is a match:

L_ITM = −E_{(x^l, v)∼D} [ c log s_θ(x^l, v) + (1 − c) log (1 − s_θ(x^l, v)) ]

Visual Translation Language Modelling. VTLM is a training objective adopted from UC2 (Zhou et al., 2021) that combines both cross-language and cross-modal alignment learning. It takes a triple of an image v, an English caption x^ENG, and a corresponding caption x^l in a different language l. The task is to predict the masked caption tokens, using the multilingual textual input as well as the visual input. During pretraining, we use the same masking strategy as in MLM to randomly mask 15% of tokens from the English caption and 15% of tokens from the other-language caption. The loss is the cross-entropy of the masked tokens x_m given the observed tokens x_{¬m} in both languages and the visual features:

L_VTLM = −E_{(x^ENG, x^l, v)∼D} [ log P_θ(x^ENG_m, x^l_m | x^ENG_{¬m}, x^l_{¬m}, v) ]
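For a single sampled pair, the ITM objective reduces to a binary cross-entropy on the sigmoid matching score. The sketch below is illustrative (the helper names and the dictionary-based negative sampling are our own, not the paper's code):

```python
import math
import random

def itm_loss(score, label):
    """Binary cross-entropy for one image-text pair.
    score is s_theta(x^l, v) in (0, 1); label is 1 for a matched
    pair and 0 for a sampled negative caption."""
    return -(label * math.log(score) + (1 - label) * math.log(1 - score))

def sample_itm_caption(image_id, captions_by_image, rng=random):
    """Pick the image's own caption or another image's caption with
    equal probability; returns (caption, label)."""
    if rng.random() < 0.5:
        return captions_by_image[image_id], 1
    other = rng.choice([i for i in captions_by_image if i != image_id])
    return captions_by_image[other], 0
```

A well-trained model drives the score towards 1 for positives and 0 for negatives, minimising the loss in both cases.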

Experimental setup
The implementation of TD-MML is built in the VOLTA framework (Bugliarello et al., 2021). TD-MML uses the same model configuration as the xUNITER architecture (Liu et al., 2021), which is initialised from the XLM-R cross-lingual language model (Conneau et al., 2020).
Pretraining. The dataset used for pretraining is Conceptual Captions (Sharma et al., 2018), which consists of 3.3M images with their English alt-text descriptions, of which we only have access to 2.77M instances due to link rot. The images are represented using ResNet-101 features extracted from N = 36 object proposals from the Faster R-CNN (Ren et al., 2015) object detector trained on the Visual Genome dataset (Anderson et al., 2018). As described in Section 5.1, we translate the captions into the 19 non-English IGLUE languages with the 1.2B-parameter M2M-100-large model (Fan et al., 2021); the translations are then filtered to eliminate likely translation errors. During training, we iterate over the images and sample the corresponding caption uniformly at random from one of the twenty languages, i.e. a batch contains samples from multiple languages. We reuse the training hyperparameters of Bugliarello et al. (2021). Specifically, xUNITER is trained over 2.77M image-English caption pairs, while TD-MML is pretrained on 52M image-multilingual caption pairs on 3×24GB TITAN RTX GPUs for 250,000 steps, which takes 5 days.
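The per-image language sampling during pretraining can be sketched as follows (an illustrative helper, not the released code; language codes and captions are hypothetical):

```python
import random

def sample_training_caption(captions_by_lang, rng=random):
    """Given all translated captions for one image (keyed by language
    code), uniformly sample the language used for this training step,
    so that each batch mixes captions from multiple languages."""
    lang = rng.choice(sorted(captions_by_lang))
    return lang, captions_by_lang[lang]
```

Because the language is resampled every time an image is visited, each caption translation is seen roughly 1/20th as often as the image itself.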
Figure 4 shows the Recall@1 curves for image retrieval and text retrieval on 5K validation samples from the Conceptual Captions dataset as a function of the number of pretraining steps. The samples are chosen from languages that represent the original MaRVL languages, i.e. ENG, IND, JPN, SWA and CMN, where the non-English data comes from the machine translation process. Model performance continues to increase on both metrics until 220,000 updates; we therefore use this checkpoint to fine-tune on the downstream tasks.
Fine-tuning on downstream tasks. We employ a translated-data procedure at fine-tuning time as well, using the same data filtering steps. Similar to the initial experiments in Section 4, we match the number of parameter updates between the experiments with English-only data and the ones with translated multilingual data. We use the same setup as in IGLUE when fine-tuning on downstream tasks.

Table 5: MaRVL accuracy results for zero-shot cross-lingual evaluation, i.e. English-only NLVR2 fine-tuning, and multilingual fine-tuning using machine translated NLVR2 data (either with the full or the filtered translated data). All of the models are fine-tuned for a similar number of updates. The average results exclude ENG accuracy.

TD Pretraining and English Fine-tuning
Here, we evaluate the zero-shot language understanding abilities of the TD-MML model that has been pretrained on multiple languages, but fine-tuned on English task-specific data only (e.g. NLVR2 for MaRVL, GQA for xGQA, etc.). The averaged zero-shot results across languages are shown in Table 4; the full zero-shot per-language results on each task are detailed in Appendix C. We see a substantial improvement for TD-MML across all tasks compared to xUNITER and the state-of-the-art UC2. The improvement between our TD-MML and the best baseline model for each task reaches 4.23 points for XVNLI, 6.6 points for xGQA, 2.39 points for MaRVL, 0.99 (IR) and 8.46 points (TR) for xFlickr&CO, and 0.6 (IR) and 0.13 points (TR) for WIT. These results show the clear benefit of pretraining on multilingual multimodal data for a diverse array of multimodal tasks across many languages. Table 4 also shows the effectiveness of pretraining with the additional visual translation language modelling (VTLM) objective: adding VTLM boosts performance on five out of the seven tasks. Improving cross-lingual alignment during pretraining thus manifests itself in better multilingual understanding ability.

MT Fine-tuning TD-MML
We now ask whether combining the machine translated pretraining strategy of TD-MML with additional machine translated fine-tuning can provide further gains, compared to both English-only fine-tuning (zero-shot) and the MT fine-tuning strategy applied to xUNITER in Section 4. The results for MaRVL and xGQA are shown in Tables 5 and 6, respectively. Perhaps surprisingly, the performance of xUNITER and TD-MML is very similar after multilingual MT fine-tuning, with TD-MML slightly outperforming xUNITER on the MaRVL dataset and xUNITER generally yielding better performance on xGQA. That is, in this setting, multilingual multimodal pretraining in TD-MML does not convey any added benefit. A potential explanation is that fine-tuning on enough multilingual data (as is the case with machine translated data) compensates for the lack of multilingual multimodal pretraining. The performance of TD-MML on the English splits after English-only fine-tuning supports this hypothesis. The results in Tables 5 and 6 further indicate that the filtering strategies offer mixed results when applied to fine-tuning. We believe this happens because the corresponding datasets, GQA for xGQA and NLVR2 for MaRVL, are typically much cleaner than the Conceptual Captions dataset, resulting in less filtering and, consequently, less representative results.

Few-Shot vs Machine Translated Data
We also ask whether it is better for downstream performance to train on a limited number of clean language-specific and task-specific samples (few-shot learning) or to simply machine translate the task-specific English data, which is likely to result in noisier data. So, in addition to the previously introduced fine-tuning setups (fine-tuning on task-specific English data for a zero-shot evaluation setting, and fine-tuning on machine translated task-specific English data), we also consider the few-shot learning setup, in which we continue fine-tuning the zero-shot (English fine-tuned) model using a few human-authored language-specific and task-specific samples from the IGLUE benchmark (i.e., 1, 5, 10, 20, 25, or 48 shots).
Few-shot results for xGQA are presented in Figure 5, broken down by question type. We observe that, as expected, performance generally improves with the quantity of training data (number of shots). More interestingly, machine translated fine-tuning upper-bounds the performance of the few-shot approach for both of our multimodal multilingual models. Across the five question categories, the variation in performance can be largely explained by the cardinality of the set of plausible answers: it is easier to correctly answer a verification or logical question, whose answer is usually either "yes" or "no", than a querying question, whose answer usually involves a broader set of words.

Cross-Language Correlation Analysis
Are the same questions easy (answered correctly) or difficult (answered incorrectly) across languages? We use Cohen's kappa coefficient κ to measure agreement between languages on the xGQA dataset (see Appendix B for details). The results are presented in Figure 7. We show results for two variants of our pretrained TD-MML model: either fine-tuned on English-only data (ENG) or on machine translated data (MT). We see that the MT fine-tuned results show much higher agreement across languages, compared to English-only fine-tuning. On the one hand, this could be considered counter-intuitive, since in the MT fine-tuned setting there is more language-specific data. However, the MT fine-tuned results have higher accuracy overall, suggesting that the high agreement across languages is due to the model confronting inherent item difficulty (as judged by all languages), rather than language-specific issues.
Examples of increased cross-language agreement in the MT fine-tuned model are presented in Figure 6. Across the eight languages, we find the predictions of the MT fine-tuned model are more consistent than those of the ENG fine-tuned model (Q1, Q2, Q3). However, for more ambiguous and difficult samples, the model fine-tuned with translated data still gives varied, but arguably more plausible, predictions across languages (Q4).

Conclusion
In this paper, we investigate the role of machine translated (MT) data in multilingual multimodal learning in a controlled setup. We consider two applications of MT data, namely augmenting the pretraining and the fine-tuning data. We find that both convey a clear immediate benefit on downstream performance for nearly all tasks in the IGLUE benchmark; however, we do not find an additive benefit of combining MT data in both pretraining and fine-tuning together. When using machine translated text, filtering out bad translations in a quick and reliable way is crucial, and we develop a simple and effective strategy for doing this. Our results shed light on the importance of explicitly grounding multilingual text in the visual modality in both the pretraining and fine-tuning stages.

Limitations
Our paper investigates the benefits and limitations of machine translated data for multilingual multimodal learning. In doing so, we rely solely on the M2M-100 model (Fan et al., 2021). This is a large, many-to-many translation system, which proved to be easy to use. Our analyses and results are based on the performance of this model. It would be instructive to investigate how the expected performance of translation systems affects (i) the proportion of sentences with high 'badness' scores, and (ii) the resulting performance of the multilingual multimodal systems. Moreover, while machine translating a large corpus is cheaper than manually translating the data or scraping it from the web, there is still a one-time effort required to translate the data before using it to train new models. Therefore, we release our multilingual pretraining and fine-tuning datasets.
From an experimental angle, although the proposed framework can be applied to any existing architecture, we only evaluate a single model due to computational constraints.
We would also like to stress the importance of using evaluation data that originates in the target languages in multimodal setups, rather than translated data. Fitting to translationese is a risk when using translated data at training time, and can only be identified if the evaluation data does not also contain translations, especially automatically generated ones.
Finally, a core limitation of the overall translated-data framework is that it centres English as the source language. For example, this means only concepts mentioned in English captions can be grounded across languages (Liu et al., 2021), and hence some non-English concepts might never be modelled. However, we show that machine translating data provides a strong starting point that can effortlessly be integrated into a pipeline, upon which language-specific annotations can be added.

A Language information
The machine translated fine-tuning data and pretraining data cover 20 languages, spanning 11 language families and 9 scripts. The scripts are Arabic, Bengali-Assamese, Chinese, Cyrillic, Greek, Hangul, Kanji, Latin, and Tamil. Table 7 summarises this information, together with a listing of the language code abbreviations and details regarding the use of the languages in each of the five tasks from the IGLUE benchmark.

B Kappa Setting
Cohen's kappa corrects for random agreement, but has an upper bound determined by the difference in marginal probabilities (here, equivalent to accuracy): if one system assigns more correct answers than the other, then the maximum achievable κ is less than one. Since we compare across languages with different accuracies, we normalise the usual coefficient by the maximum achievable value, resulting in the κ-ratio: the proportion of agreement given the systems' accuracies (label rates).
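Under these definitions, the κ-ratio can be computed as below, given per-item binary correctness vectors for two languages. This is a sketch of the normalisation described above; the maximum achievable observed agreement is the sum, over the two classes, of the smaller marginal, and the sketch assumes neither language is at exactly 0% or 100% accuracy (which would make chance agreement equal one).

```python
def kappa_ratio(a, b):
    """Cohen's kappa between two binary correctness vectors, normalised
    by the maximum kappa achievable given the two accuracies."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    p_exp = pa * pb + (1 - pa) * (1 - pb)        # chance agreement
    p_max = min(pa, pb) + min(1 - pa, 1 - pb)    # best possible agreement
    kappa = (p_obs - p_exp) / (1 - p_exp)
    kappa_max = (p_max - p_exp) / (1 - p_exp)
    return kappa / kappa_max
```

Two languages that answer exactly the same items correctly obtain a κ-ratio of 1, regardless of their (shared) accuracy.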

C Performance per Target Language
Tables 8 to 12 provide language-specific performance for zero-shot evaluation (i.e., English-only fine-tuning) on the five IGLUE tasks (MaRVL, XVNLI, xGQA, xFlickr&CO, WIT) for two variants of our TD-MML approach (with and without the VTLM loss) against four state-of-the-art approaches (mUNITER, xUNITER, UC2, M3P).
The experimental results show that our TD-MML usually achieves better performance than the competing models on the non-English languages. The closest competitor is UC2, which is also pretrained on machine translated data, but only in five languages (CMN, DEU, JPN, FRA, CZE). This partially explains UC2's strong performance in some of these instances: for example, on FRA in XVNLI (Table 9) or on DEU in xGQA (Table 10).
On the WIT retrieval tasks, we notice that even though TD-MML still performs better than or on par with previous approaches, the results are poor across all languages and methods. A possible explanation is that the distribution of images and captions on Wikipedia is distinct from that of the other datasets.
Among the two variants of TD-MML, we generally observe a benefit from incorporating the VTLM loss, with the largest gains manifesting for the BEN, CMN, KOR, and RUS languages on the xGQA task.

D Analysis over Question Types in xGQA
Figure 8 shows the accuracy results for three evaluation setups (zero-shot, few-shot learning, and machine translation fine-tuning) on xGQA over the five different question types. The language-specific sub-figures show the gap between xUNITER and TD-MML. The largest differences between them are for the KOR, POR, and RUS languages, and for the compare, logical, and verify question types.

Figure 2: Cumulative distributions of the two badness scores (1 − TTR, the complement of the token-to-type ratio, and BLEU src-tgt, the BLEU score between the source and target sentence) for the nineteen non-English languages in IGLUE. The languages are grouped in three categories, and the vertical lines denote the filtering thresholds for each of the categories and the two scores.

Figure 4: Average pretraining accuracy, image retrieval (left) and text retrieval (right), as a function of the number of pretraining steps. Accuracy is calculated on 5K validation samples across five languages in the machine translated Conceptual Captions dataset.

Figure 5: xGQA average accuracy (with confidence intervals) across languages for the five question types. xUNITER and TD-MML are evaluated in the zero-shot, few-shot, and machine translation fine-tuning settings. The error bars represent 95% confidence intervals, obtained by bootstrapping (1,000 repeats).

Figure 6: Qualitative results on the xGQA dataset. Given an image and a question (denoted by Q), we show the corresponding ground-truth answer (denoted by A), together with the predictions in each of the eight languages for the model fine-tuned on either English-only sentences (left column of answers) or machine translated sentences (right column of answers).

Figure 7: Cross-language prediction correlations on the test split of xGQA for two of the proposed models: TD-MML fine-tuned on English-only data (left) and TD-MML fine-tuned on translated data (right).

Figure 8: xGQA accuracy in the zero-shot, few-shot learning, and machine translation fine-tuning evaluation settings for each of the five question types (compare, choose, logical, query, verify) and eight languages (Bengali, German, English, Indonesian, Korean, Portuguese, Russian, Mandarin).

Table 3: Examples of translations that are filtered out by the proposed procedure.

Table 4: Average zero-shot performance on non-English languages of multimodal models for the V&L evaluation tasks in IGLUE. Best results are marked in bold. The performance measure is accuracy for all tasks except the cross-modal retrieval tasks, which use Recall@1.

Table 6: xGQA accuracy results for English-only fine-tuning (zero-shot evaluation) and multilingual fine-tuning using machine translated GQA data (either with the full or the filtered translated data). All of the models are fine-tuned for a similar number of updates. The average results exclude ENG accuracy.

Table 8: Accuracy on the MaRVL task for zero-shot evaluation (i.e. English-only fine-tuning).

Table 9: Accuracy on the XVNLI task for zero-shot evaluation (i.e. English-only fine-tuning).

Table 12: Experimental results on the WIT tasks (image retrieval, IR, top, and text retrieval, TR, bottom) for zero-shot evaluation (i.e. English-only fine-tuning).