MaXM: Towards Multilingual Visual Question Answering

Visual Question Answering (VQA) has been primarily studied through the lens of the English language. Yet, tackling VQA in other languages in the same manner would require a considerable amount of resources. In this paper, we propose scalable solutions to multilingual visual question answering (mVQA), on both the data and modeling fronts. We first propose a translation-based framework for mVQA data generation that requires much less human annotation effort than the conventional approach of directly collecting questions and answers. Then, we apply our framework to the multilingual captions in the Crossmodal-3600 dataset and develop an efficient annotation protocol to create MaXM, a test-only VQA benchmark in 7 diverse languages. Finally, we develop a simple, lightweight, and effective approach as well as benchmark state-of-the-art English and multilingual VQA models. We hope that our benchmark encourages further research on mVQA.


Introduction
Visual Question Answering (VQA), the task of answering visual questions grounded in images, is key to human-machine interaction in the visual world. In particular, the natural language interface in VQA makes it easy for lay people to express their needs and benefit from its applications, including accessibility, education, and search. Yet, VQA advances have mostly focused on English, and have therefore only applied to a privileged subset of human populations.
VQA benchmarks range from assisting the blind and the visually-impaired (Gurari et al., 2018) and scene-text understanding (Singh et al., 2019; Biten et al., 2019), to VQA that requires external, commonsense, or world knowledge (Marino et al., 2019; Zellers et al., 2019; Schwenk et al., 2022). These benchmarks require a considerable amount of resources to create, mostly by employing human annotators to laboriously collect and verify the questions and the answers for each image.
To extend VQA to all languages in the world, we must make data creation more automatic. Building on recent work on automatic data creation for English VQA from captions (Changpinyo et al., 2022), in this paper we propose a translation-based framework for multilingual visual question answering (mVQA) data creation. Our framework automates much of the task of generating questions and answers, thus providing a scalable path to mVQA.
We apply our framework to the generation of question-answer pairs from the multilingual captions of the recently-proposed Crossmodal-3600 dataset (XM3600) (Thapliyal et al., 2022). Combined with an efficient human annotation protocol, we construct MAVERICS-XM3600 (MaXM), a test benchmark for mVQA in 7 languages (see examples in Fig. 1).
Finally, we use this novel benchmark to drive progress in mVQA modeling and measure where we stand. We leverage advances in image modeling and multilingual modeling, ViT (Dosovitskiy et al., 2021) and mT5 (Xue et al., 2021), and propose a unified, extensible, open-ended mVQA model, called Simple MPT, which is competitive with state-of-the-art English VQA models that we adapt to the mVQA setting (OFA (Wang et al., 2022b) and BLIP2 (Li et al., 2023)). Overall, there remains large room for improvement.
Beyond mVQA, training and evaluation data for multilingual multimodal models is limited. For a review of previous work, we refer the reader to the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark (Bugliarello et al., 2022), of which xGQA is a part. In general, early attempts often focus on Chinese (Li et al., 2019; Wang et al., 2019), Japanese (Yoshikawa et al., 2017; Aggarwal and Kale, 2020) and several Indo-European languages (e.g., German, French, and Czech) (Elliott et al., 2016, 2017; Barrault et al., 2018). However, there is a recent effort toward a wider variety of both languages and tasks. Examples include image retrieval (Aggarwal and Kale, 2020) (also Russian, Korean, Turkish), visual natural language inference (Bugliarello et al., 2022) (also Arabic), multilingual visual reasoning (Liu et al., 2021) (also Indonesian, Swahili, Tamil, Turkish), and vision-and-language navigation (Ku et al., 2020) (also Hindi, Telugu). Notably, Wikipedia Image Text (WIT) (Srinivasan et al., 2021) provides a large-scale image-text dataset in 108 languages, automatically collected from Wikipedia, and Crossmodal-3600 (XM3600) (Thapliyal et al., 2022) provides human-curated test-only image captions in 36 languages. Our work builds on top of XM3600, and the 7 languages that we consider are typologically, genealogically, and geographically diverse.

VQA Data Creation
Previous work on VQA data creation relies heavily on humans to create questions and answers (Zhu et al., 2016; Krishna et al., 2017; Goyal et al., 2017; Gurari et al., 2018; Marino et al., 2019). Some works attempt to automate this process. CLEVR (Johnson et al., 2017a) uses a template-based approach, but it is based on synthetic images for which ground-truth annotations are available. GQA (Hudson and Manning, 2019) follows a similar approach but instead starts from Visual Genome scene graphs (Krishna et al., 2017), which themselves require large annotation efforts.
More relevant are works that rewrite image captions or video transcripts as question-answer pairs. COCOQA (Ren et al., 2015) uses a template-based approach that can only generate questions with one-word answers. WeaQA (Banerjee et al., 2021) improves upon this with semantic role labeling, paraphrasing, and backtranslation. Recently, Changpinyo et al. (2022) and Yang et al. (2021) leverage T5 (Raffel et al., 2020) fine-tuned on question answering datasets, generating large-scale VQA datasets for images and videos, respectively. Our approach to mVQA data creation leverages VQ²A, the approach in (Changpinyo et al., 2022) (Sect. 3.1). To the best of our knowledge, besides xGQA, no other prior work on VQA data generation considered languages beyond English.

Multilingual VQA Data Creation
As in many other machine learning tasks, the main bottleneck to mVQA is obtaining high-quality labeled data. The most popular data collection framework for English VQA is to ask one set of human annotators to come up with visual questions, and another set of annotators to answer them (Sect. 2.2). To scale VQA to all languages, we argue that mVQA data creation must significantly reduce its use of human annotation. To this end, we study the extension of an automatic English VQA data creation method, Visual Question Generation with Question Answering validation, or VQ²A (Changpinyo et al., 2022), for the purpose of mVQA data creation.

Background: VQ²A
The VQ²A approach leverages aligned image-text data sources that are available at scale (Ordonez et al., 2011; Chen et al., 2015; Sharma et al., 2018; Pont-Tuset et al., 2020; Changpinyo et al., 2021; Desai et al., 2021; Schuhmann et al., 2021) and beyond English (Srinivasan et al., 2021; Gu et al., 2022). It rewrites a declarative image caption into multiple interrogative question-answer pairs via three steps: (i) Candidate Answer Extraction extracts candidate answers based on syntactic and semantic analysis of an input caption; (ii) Question Generation generates candidate questions for each candidate answer; (iii) Answer Validation filters out candidate questions that do not pass a consistency check, which automatically answers each question from the caption and compares this answer to the original extracted answer (Alberti et al., 2019; Honovich et al., 2021).
Inspired by VQ²A, our goal is to generate mVQA data at scale, leveraging multilingual image captions. Multilingualizing each step in VQ²A can be non-trivial and resource-intensive due to the heavy reliance on English tools, models, and data (Sect. 3.1). To alleviate this, we propose a translation-based extension of VQ²A.
Given an input caption c in any language and a target language ⟨lang⟩, we want to generate question-answer pairs in ⟨lang⟩. We propose Translation-based VQ²A (TransVQ²A), as follows:

Step 1 (Caption Translation): Automatically translate a non-English caption c to English c_e.
Step 2 (Apply VQ²A): Generate a set of English question-answer pairs {(q_e, a_e)} from c_e.
Step 3 (Question-Answer Translation): Automatically translate all (q_e, a_e) pairs to ⟨lang⟩, yielding (q, a).
Step 4 (Validation): Filter out (q, a) pairs in which a does not appear in the original caption c, back-translating a to c's language if necessary.

The upper part of Fig. 2 exemplifies TransVQ²A using a Chinese caption from Crossmodal-3600 (Thapliyal et al., 2022).
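The four steps above can be sketched as control flow. In this sketch, `translate` and `vq2a_generate` are hypothetical stand-ins for a machine translation system and the English VQ²A pipeline; the sketch illustrates the pipeline's structure, not the actual implementation.

```python
from typing import Callable, List, Tuple

def trans_vq2a(
    caption: str,
    src_lang: str,
    tgt_lang: str,
    translate: Callable[[str, str, str], str],          # (text, from, to) -> text
    vq2a_generate: Callable[[str], List[Tuple[str, str]]],  # caption -> [(q, a)]
) -> List[Tuple[str, str]]:
    # Step 1: translate the caption into English.
    caption_en = translate(caption, src_lang, "en")
    # Step 2: generate English question-answer pairs with VQ2A.
    qa_en = vq2a_generate(caption_en)
    validated = []
    for q_en, a_en in qa_en:
        # Step 3: translate each pair into the target language.
        q = translate(q_en, "en", tgt_lang)
        a = translate(a_en, "en", tgt_lang)
        # Step 4: keep the pair only if the answer is grounded in the
        # original caption, back-translating when the languages differ.
        a_src = a if tgt_lang == src_lang else translate(a, tgt_lang, src_lang)
        if a_src in caption:
            validated.append((q, a))
    return validated
```

The Step 4 substring check mirrors the caption-answer consistency filter described above; a production system would use the same MT service for the back-translation.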
We highlight that the approach we have described so far is fully automatic and applicable to the huge set of languages that are supported by automatic translation. We note that the final validation is important due to errors that could pile up during the translation steps. This is especially acute in Step 3, since translating answers is harder due to the lack of disambiguating context in the short answers. We also note that TransVQ²A can generate question-answer pairs in the target ⟨lang⟩ from any caption. The output quality depends on the translation quality, e.g., the back-translation in Step 4 from ⟨lang⟩ to c's language. We use out-of-the-box translation tools in this work, and leave the exploration of better translation tailored for TransVQ²A to future work.

Figure 2: Our approach to multilingual VQA data generation, which is easy to scale and highly automatic, only requiring humans to modify "Almost Correct" questions or correct/expand answers (left) or filter out "Incorrect" questions (right). MT is short for automatic machine translation.
In Sect. 4 we employ human annotators to further clean and expand the generated data to create a high-quality test benchmark.

Direct Question Generation (DirectQG)
One drawback of TransVQ²A is the low coverage of particular types of answers, such as "no". This is because captions generally do not indicate the absence of objects or properties (e.g., "There is no dog", "The dog is not white"). To mitigate this bias, we train a multilingual question generator that takes in an answer and a caption in a target language and generates relevant questions in the same language. We use the model to generate questions with "yes", "no", or "none" as answers in each target language, as a complement to TransVQ²A.
Concretely, we fine-tuned mT5-XXL (Xue et al., 2021) on large-scale translated COCO Captions (Chen et al., 2015) and its corresponding VQA data VQ²A-COCO (Changpinyo et al., 2022). For validation, we used the subset of generated multilingual VQA data in Sect. 3.2, with ∼300 golden examples for each language. The best checkpoint was selected based on ROUGE-L scores.
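To make the fine-tuning setup concrete, a DirectQG training example pairs a (target-language answer, caption) input with a question target. The exact prompt template and field names below are illustrative assumptions, not the ones used in the paper.

```python
from typing import Dict

def directqg_example(caption: str, answer: str, question: str) -> Dict[str, str]:
    """Format one hypothetical seq2seq training example for a DirectQG-style
    question generator (e.g., fine-tuned mT5)."""
    return {
        # Model input: the desired answer plus the grounding caption,
        # both in the target language.
        "inputs": f"answer: {answer} caption: {caption}",
        # Model target: a question in the same language whose answer is `answer`.
        "targets": question,
    }

ex = directqg_example(
    "Un chien brun dort sur le canapé.",  # "A brown dog sleeps on the couch."
    "non",                                 # "no"
    "Le chien est-il blanc ?",             # "Is the dog white?"
)
```

At generation time, the same input format with "yes"/"no"/"none" answers yields the boolean-style questions that TransVQ²A under-produces.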

MaXM: Multilingual VQA Benchmark
In this section, we leverage the approach presented in Sect. 3 to create a multilingual VQA test-only benchmark. We next describe our data sources, how candidate data was generated, our human annotation protocol, and an analysis and discussion of our benchmark. Following the naming convention in (Changpinyo et al., 2022), we call our benchmark MAVERICS-XM3600, or MaXM in short. We will release MaXM to foster research on mVQA.

Image and Caption Selection.
We chose a subset of the images in Crossmodal-3600 (XM3600) (Thapliyal et al., 2022), for which high-quality multilingual image captions are available.
For each language, we selected 100 validation and test images from Open Images (Krasin et al., 2017; Kuznetsova et al., 2020) that were taken in the region(s) where that language is spoken.
Our image selection criteria cover a wide range of visual concepts in different cultural contexts, making the constructed VQA examples diverse and specific to the languages of the captions related to each image. For example, in Fig. 3, unlike French and Romanian speakers, Hebrew and Thai speakers are less likely to know what a snow cannon is. On the other hand, Thai and Chinese speakers are more likely to understand what xiao long bao is, whereas in French or Hindi it could be referred to as dimsum ravioli or Chinese dim sum.

Large-Scale mVQA Data Creation
We apply the approach described in Sect. 3 to the XM3600 captions to generate a large number of question-answer pairs for each language.

TransVQ²A. Table 1 reports the number of question-answer pairs at different stages in our pipeline. Overall, we are able to generate a large number of question-answer pairs in all languages. We found that, across languages, approximately 30% of (translated) English question-answer pairs are filtered out by VQ²A validation. In contrast, different percentages of translated answers across languages are filtered out by the caption-answer consistency validation. A main reason for this is the quality of question-answer translation. For instance, 68% of questions with "alb" (masculine "white" in Romanian) are filtered out because they are not translated to the correct feminine form "albă" w.r.t. the corresponding object in the question.
DirectQG. We augment the TransVQ²A questions with additional candidate questions generated by DirectQG (Sect. 3.3), using the XM3600 captions paired with "yes", "no", or "none" in their corresponding language as input.

Human Annotation
We employed native speakers of each of the 7 selected languages to annotate and create our benchmark. We designed an annotation protocol to balance efficiency and accuracy. In particular, we keep the human in the loop brief and involve them only when an automated model struggles with a task, e.g., correcting translation artifacts, expanding answers, or identifying sensitive questions. Furthermore, our protocol promotes quickly discarding examples when the question does not make sense. We provide more details next and in Appendix B.
Question and Answer Validation. We define a 3-way rating system of Correct, Almost Correct, and Incorrect for both questions and answers (see Table 2). Correct questions are kept unchanged, Almost Correct questions are manually rewritten, and Incorrect questions are discarded. Given Correct and Almost Correct questions, an annotator rates the answer and corrects it when it is Almost Correct or Incorrect. Table 3 reports the label distribution for questions and answers randomly sampled from those generated by TransVQ²A. Across languages, we observe at least 75% Correct or Almost Correct questions and, given these questions, at least 90% Correct or Almost Correct answers. This highlights the effectiveness of our approach.
Answer Expansion and Standardization. We split the generated questions into 4 categories: boolean, numeric, color, and others. We then asked the annotators to standardize the answers to boolean, numeric, and color questions based on each language's guideline. For the rest of the questions, we tasked another set of at least 2 annotators per language with expanding the answers with as many additionally correct (but not overly lengthy) answers as they could.
Additional Filtering. Our raters performed another round of verification, filtering out examples with "ambiguous" or "responsible-AI-sensitive" questions and/or inappropriate image content.
The raters also labeled, without filtering them out, "Collection" questions that are likely to lead to long answers that are difficult to evaluate, such as "What is on the table?" when there are multiple items.

Table 4 shows a breakdown of question types in MaXM. Since question prefixes in some languages are not indicative of question types (e.g., Thai does not always begin "What" questions with the Thai "What"), we estimate a question's type using the prefix of its English version before translation. We observe diverse types and a high degree of linguistic variation. Fig. 4 presents word clouds of answers for selected question types: "What", "What color", Boolean, and "How many", further illustrating the diverse answers within MaXM for each question type.

Comparison to xGQA. In terms of settings, one difference is the language of the answers; xGQA operates in the "cross-lingual" setting, where the input question can be non-English but the output answer is always English. While this simplifies the evaluation process, we argue that the "multilingual" setting with non-English answers considered in MaXM is more practical.
Another difference is the definition of the zero-shot setting; xGQA refers to unseen languages (not images), whereas our setting is more general, referring to both unseen images and languages. Finally, the type of translated data and how it is used for training differ; we only consider the zero-shot setting and always use machine-translated questions for training, while xGQA considers both zero-shot and few-shot settings, with human-translated questions involved only in the few-shot case.
In terms of the datasets, xGQA inherits the characteristics of GQA, whose questions are restricted in style (e.g., generated by a probabilistic template-based question engine) and in the skills required (e.g., reasoning-based with multi-step inference of object attributes and relationships) (Hudson and Manning, 2019). In contrast, MaXM's questions are more general. Additionally, xGQA considers the same set of questions for all languages, whereas MaXM considers different sets of questions guided by the captions in each language.
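The English-prefix heuristic used above to estimate question types can be illustrated as a small classifier; the exact prefix list and type labels here are our assumptions for illustration, not the ones behind Table 4.

```python
def question_type(english_question: str) -> str:
    """Estimate a question's type from the prefix of its English version
    (a sketch of the heuristic described in the text)."""
    q = english_question.lower().strip()
    if q.startswith("what color"):
        return "What color"
    if q.startswith("how many"):
        return "How many"
    # Boolean questions start with an auxiliary verb.
    if q.startswith(("is ", "are ", "was ", "were ", "do ", "does ", "did ")):
        return "Boolean"
    if q.startswith("what"):
        return "What"
    if q.startswith("where"):
        return "Where"
    return "Other"
```

Because the check runs on the English question before translation, it sidesteps languages like Thai whose surface forms do not signal the question type.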

Evaluation Protocol
Evaluation Metrics. We use Exact Match Accuracy as the main evaluation measure for MaXM, following previous work on VQA (Antol et al., 2015; Goyal et al., 2017; Gurari et al., 2018). We deem an answer correct if it matches any of the ground-truth answers. To assess the degree of strictness of this measure, we also consider the soft text similarity metrics CIDEr (Vedantam et al., 2015) and ROUGE-L (Lin, 2004) in our experiments, where we treat each of the ground-truth answers equally as one of the references (as if each of them was answered by an annotator).
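A minimal sketch of this metric follows; the lowercase/whitespace normalization is our assumption, since the paper does not specify its normalization.

```python
from typing import List

def exact_match_accuracy(
    predictions: List[str], references: List[List[str]]
) -> float:
    """A prediction is correct if it matches any ground-truth answer
    for its question; returns the fraction of correct predictions."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())

    correct = sum(
        1
        for pred, refs in zip(predictions, references)
        if norm(pred) in {norm(r) for r in refs}
    )
    return correct / len(predictions)
```

Because every expanded answer counts as a valid reference, the answer-expansion step of the annotation protocol directly softens this otherwise strict metric.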
Training Data. MaXM is a test-only benchmark; it cannot be used for training. We designate VQA2.0 (Goyal et al., 2017) and its translations as the default training data source for our benchmark, due to its popularity and quality, similarly to the use of COCO Captions (Chen et al., 2015) for the nocaps benchmark (Agrawal et al., 2019) in image captioning. Nevertheless, we allow free use of existing VQA resources for training as long as the corresponding training images do not overlap with MaXM images. In our experiments, we also consider VQ²A-COCO and VQ²A-CC3M (Changpinyo et al., 2022) to assess the effect of text domain gap.

Models for Multilingual VQA
Inspired by approaches in multilingual NLP research, we consider two main families of models for mVQA that adapt existing English VQA datasets to target languages: Translate-Test and Translate-Train. Translate-Test leaves the training data and the model as-is, but translates the test VQA data to the source language (English), applies the model, and then translates the predicted answer back to the target language. In contrast, Translate-Train translates the English VQA data to a target language, trains a model on this pseudo VQA data (i.e., the translations), and directly applies the trained model to the test data.
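The Translate-Test inference path can be written as a thin wrapper around two hypothetical callables, a translator and an English-only VQA model (both assumed interfaces, not real APIs):

```python
from typing import Callable

def translate_test(
    question: str,
    lang: str,
    translate: Callable[[str, str, str], str],  # (text, from, to) -> text
    vqa_model_en: Callable[[str], str],         # English question -> English answer
) -> str:
    """Translate the question to English, answer with the unchanged
    English model, then translate the answer back to the target language."""
    q_en = translate(question, lang, "en")
    a_en = vqa_model_en(q_en)
    return translate(a_en, "en", lang)
```

Translate-Train, by contrast, needs no inference-time wrapper: the translation cost is paid once at training time, and the resulting model answers directly in the target language.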

Translate-Test.
We consider two open-source state-of-the-art VQA models: OFA-Large (Wang et al., 2022b) and BLIP2 (Li et al., 2023). Neither of them is designed for mVQA.
Translate-Train. We include results from the state-of-the-art multilingual vision-and-language model PaLI-17B (Chen et al., 2023), which pre-trains on diverse VQA datasets in 35 languages (Thapliyal et al., 2022), among other datasets, and is then fine-tuned on VQA2.0 in 13 languages: en, bn, de, fr, hi, id, iw, ko, pt, ro, ru, th, zh. Further, we implement a lightweight version of PaLI, called Simple Multi-Language Prompted Training (Simple MPT), with a much smaller model and without vision-and-language pre-training. Simple MPT is trained on the data in 13 languages in a multi-task fashion. Details can be found in Appendix C.

Main Results. Table 5 benchmarks our proposed Simple MPT and state-of-the-art VQA models on MaXM. We observe that PaLI-17B performs best on all languages. This can be attributed both to the fact that PaLI is the strongest English VQA model and to the fact that it was designed to be multilingual, leveraging a pre-training image-text corpus in 105 languages. This result suggests that it can be beneficial to design and develop multilingual VQA models from day one. Surprisingly, our proposed Simple MPT model is a strong baseline even though it is much smaller than PaLI and does not leverage multilingual pre-training data. While its English performance is on par with OFA and much worse than BLIP2, its multilingual performance excels, outperforming OFA in all languages and underperforming BLIP2 only on Hindi and Hebrew.
Overall, our results suggest that Translate-Train may be a superior approach to mVQA than Translate-Test. We note, however, that in our early experiments we found Translate-Train inferior to Translate-Test as an adaptation approach for English VQA models. For instance, the answer of fine-tuned BLIP2 to the French question "Outre les fleurs roses, quelle autre couleur y avait-il dans le jardin?" ("Besides pink flowers, what other color was there in the garden?") is "pink", while the correct answer is "blanc" ("white"): wrong both in terms of language and semantics. It is not immediately obvious how to adapt English VQA models whose vocabularies and tokenizers, for example, overfit the English language. This again suggests that the design of these multimodal models would benefit from having multilinguality in mind from the start.
Single-Language vs. Multi-Language Training, Different Training Datasets. In Table 6, our Simple MPT model performs similarly to or better than each of the Single-Language baselines. This suggests that modern models are capable of learning from related languages. We also find that translated COCO is overall the best training data source. We attribute this to (i) the fact that VQ²A was used to generate VQ²A-COCO, and (ii) VQ²A-COCO being generally more robust in the cross-dataset setting (Changpinyo et al., 2022). However, VQ²A-CC3M is unable to outperform VQA2.0 despite (i); applying VQ²A to the noisy alt-texts in CC3M (Sharma et al., 2018) is prone to errors that would only be exacerbated by automatic MT.
Less Strict Metrics. In Table 7, we observe generally consistent results when using CIDEr and ROUGE-L instead of the stricter Accuracy, except for Thai and Chinese, where the gaps in Accuracy are small to begin with.
No Adaptation via Translate-Test. Can existing English VQA models work out of the box? In Table 8, we find that the answer is no. As expected, the models perform well on French, which is closer to English than the other languages are.
Simple MPT on xGQA. Can our modeling approach be extended to the cross-lingual setting of xGQA (Pfeiffer et al., 2022)? We report this result in Appendix D.

Conclusions
We take initial steps toward multilingual VQA by proposing scalable solutions on both the data creation and modeling fronts. We create a multilingual VQA benchmark in 7 diverse languages to drive modeling progress on multilingual VQA. We establish strong unified and open-ended VQA models that work well on 13 languages, as well as benchmark state-of-the-art models. For future work, we would like to expand native-language question generation, which is currently done in a limited scope, to a single model that covers all target answers.

A Considerations and Limitations
Our dataset is intended to be used for research-only purposes.
Our pipeline takes an image caption as input. Image captions may contain mistakes and biases, which could be further amplified by the machine learning models used in our approach. In particular, we use generative models for automatic question generation and machine translation that may create outputs with incorrect or nonfactual content, or outputs with translationese artifacts. We have mitigated this manually via human-in-the-loop annotation and automatically via the caption-answer consistency check (cf. Sect. 3.2). Note that the English VQ²A (Changpinyo et al., 2022) that we leverage in our pipeline also has similar filtering using the round-trip consistency check via question answering. Together these significantly improve the correctness and fluency of our pipeline. Another type of bias is the low coverage of particular types of answers, resulting from the image captions not mentioning the absence of objects or properties. We have also taken a step toward mitigating this; see Sect. 3.3.
Finally, we select a diverse set of languages, alleviating the typological, genealogical, and geographical language biases present in the VQA research community.
We mainly use Crossmodal-3600 (XM3600) (Thapliyal et al., 2022). Open Images (Krasin et al., 2017; Kuznetsova et al., 2020) and the multilingual captions in XM3600 are human-curated and cleaned, which mitigates the risk that MaXM contains information that names or uniquely identifies individual people, or offensive content.

B.1 Annotation Guideline
We provide our general instructions and detailed instructions on question annotation in Fig. 5, where we explicitly ask the annotators to be wary of Responsible-AI-sensitive questions. Fig. 6 provides detailed instructions on answer annotation and on answer expansion and standardization.

B.2 Additional Examples
Additional Examples. Fig. 7 provides additional examples to those in Fig. 1. Again, we highlight the richness and diversity of our questions. For instance, answering them requires recognizing a cross under occlusion (French), a type of vegetable (Hindi), the Arabic language (Hebrew), and a type of flower (Romanian). Some of these examples are specific to particular languages; it would be difficult for speakers of other languages to answer the Hebrew example (or the Chinese example in Fig. 1, which requires OCR).
We also highlight the richness of our candidate answers. For the "where" question in Thai, 10 answers count as correct. Similarly, the Romanian example in Fig. 1 provides multiple diverse surface forms for "coffee with cream." Fig. 8 provides additional examples to the Chinese one in Fig. 2. These examples showcase the efficiency of our annotation process. They also provide concrete examples of "Almost Correct." For instance, in the middle example, the Thai translation of "What leaves are in the photo?" is not neutral because it contains an honorific particle; it ends with "khá", which signifies respect to the addressee and indicates that the speaker is female. Finally, these examples provide a glimpse of the sources of errors. For instance, it is VQ²A that hallucinates "in the video" in the Hindi example on the right.

Collection Examples. Fig. 10 provides examples of "Collection" questions. We keep these questions, as we believe they are useful in practice and a way to encourage the community to work on better automatic evaluation metrics for this type of question.

Ambiguous Examples. Fig. 9 provides examples of "Ambiguous" questions that we filter out. Reasons include the object being too small (English) or irregular (Chinese), determining sizes being subjective (French), and not enough context (Hebrew, Romanian). "What kind/What type" questions are particularly difficult to answer and tend to be ambiguous.
Responsible-AI-sensitive Examples. Fig. 11 provides examples of Responsible-AI-sensitive questions that we filter out. These cases are often associated with directly asking for information about, or descriptions of, a particular gender or race, or with an incorrect assumption about such protected attributes (e.g., girl vs. woman in the Hebrew example).

C Simple MPT
In this section, we describe Simple MPT, a lightweight model for mVQA in detail.
Design. Much of the previous work on VQA is built for English. Further, VQA is often formulated as vocab-based VQA, a classification task into a pre-defined space of top (English) answer vocabulary; see, e.g., (Antol et al., 2015; Goyal et al., 2017). The main drawback of this approach is its inability to deal with rare answers through language compositionality. Recent work considers VQA as generation (Cho et al., 2021; Wang et al., 2022c; Alayrac et al., 2022; Wang et al., 2022a), capable of open-ended VQA. We adopt this as a scalable and flexible modeling approach to mVQA as language coverage increases. In particular, we propose a single open-ended VQA model for multiple languages. Our proposed formulation is more desirable than existing ones since it takes advantage of both compositionality within individual languages and the relationships among related languages. To this end, we first describe an encoder-decoder architecture for VQA in the open-ended generation setting. Then, we describe how we train this model for multiple languages. This is summarized in Fig. 12.

Figure 12: Our Simple MPT model used in our experiments. We leverage ViT (Dosovitskiy et al., 2021) and mT5 (Xue et al., 2021) and train them together end-to-end.
Open-Ended VQA. Our starting architecture is mT5 (Xue et al., 2021), a multilingual variant of T5 (Raffel et al., 2020). mT5 is an encoder-decoder transformer-based architecture, pre-trained on a Common Crawl-based dataset covering 101 languages. This allows us to leverage multilingual language understanding (for the questions) and generation (for the answers) from the get-go. To adapt mT5 to the VQA task, we prepend patch embeddings from the image to the question tokens. In particular, we encode the image pixels using Vision Transformers (ViT) (Dosovitskiy et al., 2021). We use ViT-L16 and mT5-Large in all of our experiments. Both mT5 and ViT are trained together in an end-to-end fashion to predict the target answer for each image-question pair, using the standard cross-entropy loss.
Multi-Language Prompted Training. We resort to multi-task prompted/instruction training (Sanh et al., 2022; Wei et al., 2022), where a task corresponds to VQA for a particular language. For the input question ⟨question⟩ in language ⟨lang⟩, we construct the prompt "Answer in ⟨lang⟩: ⟨question⟩" and use it as the text input to our model, similar to the input modification in Google's Multilingual Neural Machine Translation System (Johnson et al., 2017b). Such a design for multi-task learning makes extending VQA to multiple languages simple; as data for additional languages becomes available, one can simply add it to the pool without the need for architecture changes.
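The prompt construction itself is a one-liner; whether ⟨lang⟩ is rendered as a language code or a language name is our assumption here, as the paper does not say.

```python
def mpt_prompt(question: str, lang: str) -> str:
    """Build the language-tagged text input described above, in the
    spirit of the task tags used in multilingual NMT."""
    return f"Answer in {lang}: {question}"
```

Because the task identity lives entirely in the prompt, adding a 14th language only requires adding its data with the corresponding tag, with no change to the model.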
Implementation Details. We use a Flax implementation (Bradbury et al., 2018). For training both our 2⟨lang⟩ and 2en models, we use Adafactor (Shazeer and Stern, 2018) with a β1 of 0 and a second-moment exponential decay of 0.8. We use a linear warmup of 1K steps with a peak learning rate of 1e-3 and inverse square-root decay. We set the ViT dropout rate to 0 and the mT5 dropout rate to 0.1. We train each model with data parallelism using 16 Cloud TPU pods, each with a batch size of 512, for 100K steps. We use

Figure 3: The diversity of multilingual captions in XM3600. We show the captions (their English translations) from 4 languages for the images of a snow cannon (left) and xiao long bao (right).

Figure 4: Top answer cloud is for "What" questions (excluding "What color"). Bottom answer clouds, from left to right, are for "What color", Boolean ("Is/Are/Was/Were/Do/Does/Did"), and "How many" questions, respectively.

Analysis and Discussion

Size and Question Type and Answer Distributions. MaXM v1 includes 2,142 questions in 7 languages: English (298), French (293), Hindi (294), Hebrew (315), Romanian (333), Thai (302), and Chinese (307).

Figure 5: Detailed instructions on question annotation.

Figure 6: Detailed instructions on answer annotation, as well as answer expansion and standardization.

Figure 8: Additional examples of our approach to multilingual VQA data generation. Green, yellow, and red texts correspond to "Correct", "Almost Correct", and "Incorrect", respectively.
In addition, we explicitly mark examples that can be considered Responsible-AI-sensitive, but not necessarily incorrect; see Sect. B for details and examples.

Figure 10: Examples of Ambiguous questions that we flagged and filtered out.

Figure 11: Examples of Responsible-AI-sensitive questions that we flagged and filtered out. Faces are hidden.

Table 1: Number of instances (% of previous stage) of automatically-generated question-answer (QA) pairs based on Crossmodal-3600 captions. Validated English pairs are w.r.t. the QG-QA consistency filter. Validated multilingual pairs are w.r.t. the caption-answer consistency filter.

Table 2: Definition of the Correct, Almost Correct (correct but its surface form can be improved, e.g., syntactic errors or awkward/uncommon usage), and Incorrect labels for questions and answers in our annotation protocol.

Table 3: Human evaluation of the generated questions and answers.

Table 4: The distribution of question types in MaXM across languages, approximated by their corresponding English question prefixes.

Table 6: Effect of training data sources. Accuracy (%) of Single-Language baselines (MPT architecture) and of MPT models trained on different training datasets.