CrossVQA: Scalably Generating Benchmarks for Systematically Testing VQA Generalization

One challenge in evaluating visual question answering (VQA) models in the cross-dataset adaptation setting is that the distribution shifts are multi-modal, making it difficult to identify whether shifts in visual features or shifts in language features play the key role. In this paper, we propose a semi-automatic framework for generating disentangled shifts by introducing a controllable visual question-answer generation (VQAG) module that is capable of generating highly relevant and diverse question-answer pairs with the desired dataset style. We use it to create CrossVQA, a collection of test splits for assessing VQA generalization based on the VQA2, VizWiz, and Open Images datasets. We provide an analysis of our generated datasets and demonstrate their utility by using them to evaluate several state-of-the-art VQA systems. One important finding is that the visual shifts in cross-dataset VQA matter more than the language shifts. More broadly, we present a scalable framework for systematically evaluating models with little human intervention.


Introduction
Multiple datasets have been proposed to measure the progress on visual question answering (VQA) (Antol et al., 2015; Zhu et al., 2016b; Goyal et al., 2017; Gurari et al., 2018; Hudson and Manning, 2019; Tu et al., 2014; Qi et al., 2015). However, these datasets often possess biases introduced in the data collection process and by the human annotators. It has been shown that existing VQA models leverage these spurious biases and take shortcuts (Goyal et al., 2017; Agrawal et al., 2018; Chao et al., 2018a; Akula et al., 2020a). As a result, the performance of those models on a specific VQA dataset can only serve as a rough proxy for the true learning of the VQA task (Bras et al., 2020). One common remedy to this is to go beyond in-domain evaluation and use a test set that exhibits some form of "distribution shift" from the training set (Agrawal et al., 2018; Chao et al., 2018b).
* Work done in part while AA was an intern at Google.
The key idea is that a generalizable VQA model should be able to extrapolate, for example, from one dataset to another. One challenge that is quite unique to VQA in this setting is that the distribution shift is multi-modal. When a model trained on one dataset transfers poorly to another, it is difficult to identify how much of this is due to vision or language distribution mismatches. To complicate things even more, the frequency of objects occurring in natural images follows a long-tail distribution (Salakhutdinov et al., 2011; Zhu et al., 2016a). The lack of sufficient instances of minority classes in the test sets further complicates the estimation of generalization capabilities from one dataset to another. A possible solution is to use an iterative, human-in-the-loop approach for dataset collection where human annotators carefully devise new test samples by incorporating visual and language distribution shifts (Nie et al., 2020; Bartolo et al., 2020; Kaushik et al., 2020; Gardner et al., 2020). However, this approach is not scalable, and training the human annotators, be they seasoned AI experts or non-experts, would incur huge annotation time and cost.
In this work, we propose to make the process of creating distribution shifts more systematic and automatic. Inspired by recent work on dynamic benchmarks that co-evolve with strong models (Zellers et al., 2019b), we propose to bring a visual question-answer generation (VQAG) module into the evaluation process. More specifically, we first build a strong, controllable VQAG engine that is capable of creating question-answer pairs in a particular dataset's style. Then, we use it to generate novel ⟨image, question, answer⟩ test splits, while controlling distribution shifts in vision and language features. This is summarized in Table 1 and exemplified with the VQA2 and VizWiz datasets in Figure 1. Collectively, we refer to the resulting VQA test sets as CrossVQA.
There are at least two advantages in using a VQAG model to construct our CrossVQA test sets: (1) We can evaluate the adaptation skills of VQA models on non-VQA datasets such as Open Images (OID) (Kuznetsova et al., 2018), which contains various image annotations but no question/answer pairs, i.e., $\langle I_{\text{oid}}, Q_{\text{vqa2}} \rangle$ and $\langle I_{\text{oid}}, Q_{\text{vzwz}} \rangle$ (see Table 1); (2) Collecting human-annotated test sets is resource-intensive and scales poorly, while the VQAG approach can be massively scaled and applied in a never-ending learning scenario for generating dynamic benchmarks (Nie et al., 2020).
We conduct extensive experiments to evaluate the utility of our proposed framework. First, we validate that our VQAG module is capable of generating relevant questions and correct answers with the desired distribution shifts, which we achieve through a combination of transformer-based architectures, vision-and-language pre-training, and multiple types of control signals. We also find that, when evaluated against state-of-the-art generative models for visual question generation, our VQAG substantially outperforms them in terms of accuracy, diversity, and novelty.
Additionally, we perform analysis and human evaluation of our CrossVQA test sets that are built on VQA2, VizWiz, and Open Images datasets. We show that they are effective at finding and quantifying weaknesses of cross-dataset generalization abilities in the state-of-the-art VQA models. For instance, our experimental results show that VQA models drop up to 40% in absolute accuracy if there is a mismatch in image distribution. On the other hand, VQA models are found to be relatively less sensitive to a mismatch in language distribution. Finally, inspired by the success of contrastive learning and multi-task learning techniques in improving generalization and robustness of multimodal tasks (Akula et al., 2020a), we investigate whether these techniques improve the performance of VQA models on our CrossVQA test sets. Interestingly, we find that contrastive losses and multitask regularization do not lead to significant generalization gains on CrossVQA.
In summary, our key contributions are three-fold. First, we introduce the CrossVQA benchmark for systematically assessing the generalization skills of VQA models, and provide analysis and experiments to support its utility. Second, we describe a scalable data collection and benchmarking framework for semi-automatically constructing the proposed benchmarks using a strong and controllable visual question-answer generation (VQAG) module. Finally, we empirically demonstrate the superiority of our VQAG module by achieving new state-of-the-art results in visual question generation.

Related Work
Cross-Dataset Distribution Shifts. There is a large body of work analyzing the generalization skills of neural networks from a labeled source domain to a target domain where there is no or limited labeled data (Ganin and Lempitsky, 2015; Gong et al., 2012; Guo and Xiao, 2012; Tzeng et al., 2015; Akula et al., 2020b). However, these works focus either on language modeling or visual recognition tasks. Here, we investigate adaptation skills using the multi-modal VQA task, for which distribution mismatches can occur in both language and visual features.
There are a few works that study systematic compositional skills in multi-modal tasks. For example, Lampert et al. (2009) study the use of attributes in transferring information between object classes. Jabri et al. (2016) explore several variants of the VQA task and show that VQA models struggle with transferring knowledge across datasets. Agrawal et al. (2018) study the extent to which a model is visually grounded, by evaluating its ability to generalize to a different answer distribution for each question type. Chao et al. (2018b) investigate the issue of cross-dataset generalization, using a specific setting where the source domain contains a large amount of training data and the target domain contains insufficient data to train a VQA system from scratch. Unlike these works, our work performs a more fine-grained analysis by disentangling the distribution mismatches in language and vision, achieved by generating out-of-distribution shifts using a learned VQAG module.
Visual Question Generation (VQG). The goal of VQG is to generate natural questions for an image. This task has drawn much attention due to its ability to test a model's understanding of natural language in the context of visual grounding and its application in downstream tasks such as image retrieval and question answering (Antol et al., 2015; Zhu et al., 2016b; Akula, 2015; Palakurthi et al., 2015).
While the task of generating questions automatically is well studied in the language domain, it has been under-explored for image-related natural questions (Mostafazadeh et al., 2016). Prior works explored VQG using autoencoder-based architectures (Jain et al., 2017; Yang et al., 2018; Alberti et al., 2019; Krishna et al., 2019). Jain et al. (2017) employ a variational autoencoder paradigm where they first learn to embed a given question and image into a low-dimensional latent space. The latent codes are subsequently mapped to a high-dimensional representation using RNNs during inference to generate the question. Krishna et al. (2019) model question generation as a process that maximizes mutual information between the image and the expected answer's category. They incorporate the fine-grained answer type as guidance to generate goal-driven questions. Xu et al. (2020) propose an answer-centric approach where they model the complex relationship between an answer and its relevant image regions. Unlike these works, our approach uses a simple encoder-decoder framework, but we enhance it using a transformer-based architecture, vision-and-language pre-training, and various control signals, which together lead to a stronger VQG model. Furthermore, our work not only improves VQG performance, but also takes a step further by exploring the use of VQG in the context of VQA evaluation.

Overview
Figure 2 overviews our approach to systematically generating cross-dataset distribution shifts. During training, we train a visual question-and-answer generation (VQAG) engine using multiple sources of VQA data (denoted by A and B). This VQAG module uses a dataset indicator to learn and generate question-answer pairs of a particular dataset's style.
During inference, we apply the trained VQAG model to multiple image sources (denoted by A, B, and C), while varying the dataset indicator. For example, we turn on the dataset B indicator for the images of A, which generates B-style questions/answers for the images in A. Furthermore, VQAG can also be applied to images from a different dataset C, for which no VQA annotations are available, yet we can still control the style of annotations generated. In the post-processing step, the resulting VQA datasets are validated by human annotators.
We first provide more details on our VQAG engine (Sec. 3.2) and then describe how it is used to generate CrossVQA benchmarks (Sec. 3.3).

Visual Question-Answer Generation
We start from a transformer-based encoder-decoder model that learns to generate question-answer pairs from images. We then enhance this model in two ways. First, we perform image-text pre-training using the recently introduced Conceptual 12M (CC12M) dataset (Changpinyo et al., 2021). Second, we experiment with multiple control signals. As we will show in our experimental results, these signals help improve the accuracy and the diversity of the generated outputs when applied to diverse sources of images.
Base VQAG Model and Input-Output Format. We adopt a transformer-based encoder-decoder framework (Vaswani et al., 2017) for image-to-text generation as our base model, following recent work on large-scale image captioning (Sharma et al., 2018; Changpinyo et al., 2019). In particular, we represent each input image as a sequence of feature vectors, and the model learns to produce relevant questions and their corresponding correct answers.
Each input image is represented by multiple types of feature vectors: global and regional image features (see the appendix on our base model) and top semantic object label vectors, where labels (e.g., "river", "man", "football") are produced by Google's Vision API.
Our target is a question-answer pair in the format $q$ <sep> $a$, where $q$ is the sequence of question tokens, $a$ is the sequence of answer tokens, and <sep> is the chosen delimiter. Furthermore, since $a$ is not limited to a single answer (Bhattacharya et al., 2019), $a$ is represented as $a_1$ <dsep> $a_2$ <dsep> $\ldots$ <dsep> $a_K$, where $a_1, a_2, \ldots, a_K$ are possible answers for $q$. We use beam search to generate the target question and answer(s) during the decoding stage.
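To make the target format concrete, the following minimal sketch serializes a question with several answers into a single string and parses a generated string back. The literal token spellings `<sep>` and `<dsep>` and the function names are assumptions made for illustration; the actual tokenization used by the model is not specified here.

```python
SEP = "<sep>"    # delimiter between question and answers (spelling is an assumption)
DSEP = "<dsep>"  # delimiter between alternative answers (spelling is an assumption)

def serialize_target(question, answers):
    """Build the target string: q <sep> a_1 <dsep> ... <dsep> a_K."""
    if not answers:
        raise ValueError("at least one answer is required")
    return f"{question} {SEP} " + f" {DSEP} ".join(answers)

def parse_target(text):
    """Split a generated string back into (question, list of answers)."""
    question, _, answer_part = text.partition(SEP)
    answers = [a.strip() for a in answer_part.split(DSEP) if a.strip()]
    return question.strip(), answers

target = serialize_target("How many men are in the picture?", ["2", "3"])
question, answers = parse_target(target)
```

Keeping the question and all answers in one sequence lets a single decoder pass produce both, at the cost of needing a robust parser for malformed generations.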
Next we incorporate two enhancements into this base model to (a) maximize the relevance between image, question and expected answer in the generated test sets; (b) improve generalization capability of the model to out-of-domain images; and (c) increase the diversity and novelty of the questions.
Enhancement 1: Image-To-Text Pre-Training. We pre-train our base VQAG model on Conceptual 12M (Changpinyo et al., 2021), a large-scale dataset specifically designed for vision-and-language pre-training. It consists of 12.4 million image-Alt-text pairs harvested from the Web. We use the standard image captioning objective for pre-training (Changpinyo et al., 2021). Despite this task mismatch (i.e., image captioning vs. visual question/answer generation), we observe the utility of pre-training in addressing the long-tail distribution of objects (see Sec. 4.2).
Enhancement 2: Dataset-Agnostic Control Signals. In addition to the image features, we also condition our model on up to three control knobs more directly related to visual question generation and answering. In particular, we explore three main types of dataset-agnostic control signals, summarized in Table 2: the expected first two words of the question (i.e., question prefix), the expected answer category, and the expected answer(s). See Appendix A for further discussion.
Table 2: Examples of control signals (P: question prefix, C: answer category, Ã: expected answers, A: answer).
Question 1: Is the screen's background blue? P: Is the; C: Color; Ã: yes <dsep> true <dsep> blue screen <dsep> yes; A: yes
Question 2: How many men are in the picture? P: How many; C: Counting; Ã: 2 <dsep> 2 <dsep> 3 <dsep> 5; A: 2
To condition the VQAG model on these control signals, the embeddings for the control signals are fed to the encoder together with the image embeddings. The visual and language features from the image embeddings and the control signals are allowed to attend to all other features through the self-attention mechanism.
Dataset indicator as additional control signal. As the main focus of this paper is cross-dataset shifts, we consider the dataset indicator control signal as an additional input. This signal informs the model of the desired domain or style of visual questions. As with the dataset-agnostic control signals above, the one-hot embedding for the dataset indicator is concatenated to the image and other control signal embeddings and fed to the encoder.
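As a rough illustration of how the encoder input could be assembled, the sketch below appends control-signal vectors and a one-hot dataset indicator to the image feature sequence. The zero-padding of the indicator to the model dimension, the dataset names, and the function names are assumptions for this example only; in practice a learned projection would likely replace the padding.

```python
DATASETS = ["vqa2", "vizwiz"]  # training sources with VQA annotations

def one_hot(name, vocabulary):
    """Return a one-hot list marking the position of `name` in `vocabulary`."""
    if name not in vocabulary:
        raise ValueError(f"unknown dataset: {name}")
    return [1.0 if name == item else 0.0 for item in vocabulary]

def build_encoder_input(image_feats, control_embs, dataset, dim):
    """Concatenate image features, control signals, and the dataset indicator
    along the sequence axis; every vector must have length `dim`."""
    indicator = one_hot(dataset, DATASETS)
    if dim < len(indicator):
        raise ValueError("dim is too small to hold the dataset indicator")
    # Assumption: pad the one-hot vector with zeros up to the model dimension.
    indicator = indicator + [0.0] * (dim - len(indicator))
    sequence = list(image_feats) + list(control_embs) + [indicator]
    if any(len(vec) != dim for vec in sequence):
        raise ValueError("all vectors must have length dim")
    return sequence

encoder_input = build_encoder_input(
    image_feats=[[0.1, 0.2, 0.3]],
    control_embs=[[0.4, 0.5, 0.6]],
    dataset="vizwiz",
    dim=3,
)
```

Because all tokens sit in one sequence, self-attention in the encoder can relate image regions, control signals, and the dataset indicator without any extra machinery.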

Generating CrossVQA Benchmarks
We now describe how to use the enhanced VQAG model, together with the dataset indicator described in the previous section, to generate the CrossVQA benchmarks.

Datasets.
We consider two VQA datasets: VQA2 (Goyal et al., 2017) and VizWiz (Gurari et al., 2018). The two datasets are drastically different visually and textually. VQA2 is built on top of high-quality COCO images (Lin et al., 2014) with visual questions intended to fool a "smart robot" but not humans. VizWiz, on the other hand, is collected in the wild from visually impaired users, often with lower image quality and with simpler, more conversational questions whose answers are intended to be useful to the asker.
Additionally, we consider the images from Open Images (OID) (Kuznetsova et al., 2018), which is known to have more diverse objects than COCO (Agrawal et al., 2019).

Training
We mix the training splits of VQA2 and VizWiz and use the result for training our VQAG. We experiment with pre-training and different combinations of dataset-agnostic control signals (Sec. 4). We leverage ground-truth control signals in the training set whenever available; question prefixes and answers are available for both datasets, while the answer categories are available on a subset of VQA2, as provided by Krishna et al. (2019).
By varying the dataset-indicator control knob of our best-performing VQAG models, we generate our desired disentangled shifts. More specifically, denote by $\langle I_A, QA_B \rangle$ a dataset with A-style images and B-style questions. We generate the following four VQA splits: VQA2-style question-answer pairs on a subset of VizWiz validation images, $\langle I_{\text{vzwz}}, QA_{\text{vqa2}} \rangle$; VizWiz-style question-answer pairs on a subset of VQA2 validation images, $\langle I_{\text{vqa2}}, QA_{\text{vzwz}} \rangle$; and both VQA2-style and VizWiz-style pairs on a subset of OID validation images, $\langle I_{\text{oid}}, QA_{\text{vqa2}} \rangle$ and $\langle I_{\text{oid}}, QA_{\text{vzwz}} \rangle$. In addition, we generate $\langle I_{\text{vqa2}}, QA_{\text{vqa2}} \rangle$ and $\langle I_{\text{vzwz}}, QA_{\text{vzwz}} \rangle$ as a sanity check to verify that our model learns the styles of VQA2 and VizWiz.
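The six splits described above are simply the cross product of three image sources and two question-answer styles. A minimal sketch of that enumeration follows; the short dataset names and dictionary keys are illustrative assumptions, not the naming used in the released data.

```python
from itertools import product

IMAGE_SOURCES = ["vqa2", "vzwz", "oid"]  # images to run the VQAG model on
QA_STYLES = ["vqa2", "vzwz"]             # values of the dataset indicator

def crossvqa_splits():
    """Enumerate every (image source, QA style) pair as a split description."""
    splits = []
    for image_source, style in product(IMAGE_SOURCES, QA_STYLES):
        splits.append({
            "name": f"I_{image_source}-QA_{style}",
            "image_source": image_source,
            "dataset_indicator": style,
            # Matching source and style gives the in-domain sanity-check splits.
            "sanity_check": image_source == style,
        })
    return splits

splits = crossvqa_splits()
shifted = [s["name"] for s in splits if not s["sanity_check"]]
```

Here `shifted` holds the four splits with a vision or language mismatch, while the remaining two serve as the in-domain sanity check.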
Dataset-agnostic control signals. There are no ground-truth control signals for the images during inference. Thus, we train an image tagger with the multi-label sigmoid cross-entropy loss to predict the top-k most relevant question prefixes (i.e., first two words), answer categories, and answers from the input image and the target dataset indicator. This is more flexible than the approach of Krishna et al. (2019), where all the pre-annotated answer categories are used during inference for all images.
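The tagger objective and the top-k selection can be sketched as follows. This is only an illustration of a multi-label sigmoid cross-entropy loss and a top-k readout over raw scores; the tagger architecture, the label vocabulary, and the value of k are not specified above, so all names and values here are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_cross_entropy(logits, labels):
    """Mean multi-label loss; each label is 0 or 1 and is scored independently."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(logits)

def top_k(logits, vocabulary, k):
    """Return the k vocabulary entries with the highest scores."""
    scored = sorted(zip(vocabulary, logits), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:k]]

prefixes = ["how many", "is the", "what color"]  # hypothetical prefix vocabulary
predicted = top_k([2.0, -1.0, 0.5], prefixes, k=2)
```

Since each label has its own sigmoid, several prefixes, categories, or answers can be active for the same image, which is what makes a top-k readout meaningful.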

Postprocessing
We further clean CrossVQA by using human annotators to assess question relevance and answer correctness (Sec. 4.2).

Experiments
In this section, we first evaluate the performance of our VQAG model against existing state-of-the-art baselines (Krishna et al., 2019; Wang et al., 2017; Jain et al., 2017). We then demonstrate the importance of conditioning our VQAG model on the proposed control signals by performing several ablation studies. Next, we present CrossVQA examples and several statistics based on the generated data. We finally show that CrossVQA is effective at identifying the limitations of state-of-the-art VQA models, and examine the extent to which existing adaptation techniques help in improving the performance of VQA models as measured by CrossVQA.

In-Domain Evaluation of VQAG
We first benchmark the in-domain performance of our VQAG model by training and testing on VQA2 (Goyal et al., 2017) against existing models for visual question generation (VQG). Note that, unlike those models which focus on generating only questions, our model also generates answers; we discard the generated answers when evaluating the generated questions against existing work.
Metrics. We consider two sets of evaluation metrics. The first set measures question relevance. It consists of multiple automatic text similarity metrics widely used for image captioning and VQG, including BLEU (Papineni et al., 2002), ROUGE-L (Lin, 2004), METEOR, CIDEr, and oracle CIDEr, the maximum CIDEr value over the list of all references. The second set measures the diversity and novelty of the generated outputs through generative strength and inventiveness. Note that, although not considered by previous work, both generative strength and inventiveness for questions (QS and QI, respectively) can be extended to measure the diversity and novelty of generated answers as well (AS and AI, respectively).
Notation. We use X2Y to denote the model with X as input and Y as output. We use I, Q, A, C to refer to image, question, answer, and answer category, respectively. Furthermore, we use Ã to refer to multiple answers and P to question prefix. See Table 2 for examples of our control signals.
Baselines. We compare the performance of our VQAG model against the following baselines: IA2Q (Wang et al., 2017), a non-variational model that takes an image and answer as input and generates a question; V-IA2Q (Wang et al., 2017), a variational-autoencoder based approach that embeds the input image and question to a latent space before generating a question; IC2Q and V-IC2Q, extensions to the IA2Q and V-IA2Q models, respectively, where the models are conditioned on answer categories (Krishna et al., 2019) instead of ground-truth answers; MI-IA2Q (Krishna et al., 2019) and MI-IC2Q, also variational models posing the question generation as a process that maximizes mutual information between the image, the expected answer and the answer category.
Results. Results are reported in Table 3. Our models (IÃC2QÃ, IÃP2QÃ) outperform all the baselines on standard automatic metrics by large margins, improving the BLEU-4, METEOR and CIDEr scores by +29.5%, +23.17% and +0.62, respectively, compared to the current state-of-the-art methods MI-IC2Q and MI-IA2Q. In addition, our best model IÃP2QÃ outperforms the state-of-the-art MI-IC2Q by +7.06% in QS, suggesting that we generate a diverse pool of questions. Moreover, for question inventiveness, a +30.39% QI improvement paired with a high oracle CIDEr score indicates that our model also generates novel and appropriate questions by using new combinations of objects and question patterns. We also find a +20% improvement in AS and AI with the enhancements discussed in Sec. 3.2.

Analysis of Generated Data
Now that we establish the superiority of our VQAG engine to existing approaches, we analyze the outputs of our best model (IÃP2QÃ with pre-training) when used to generate CrossVQA benchmarks (Sec. 3.3.2).
Statistics and Examples of CrossVQA. Table 4 presents basic statistics of the six CrossVQA test splits generated by our VQAG model. Figure 3 provides examples.
Human Evaluation. We first conduct a human study to verify question relevance and answer correctness of 3000 samples from the generated splits. More concretely, we present each ⟨image, question, answer⟩ triplet to three crowd workers and ask them to verify if the generated question is relevant to the image. Questions that are annotated as not relevant by at least two workers are discarded. For each of the relevant questions, we also ask the workers to verify if the generated answer is correct and, if it is incorrect, to write a correct answer (see Appendix C).
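The relevance filter described above amounts to a simple vote count per sample. The sketch below shows one way to implement it, assuming each sample carries a list of boolean relevance votes; the field names and the example records are hypothetical.

```python
def filter_relevant(samples, min_irrelevant_votes=2):
    """Discard samples that at least `min_irrelevant_votes` workers
    marked as not relevant; keep everything else."""
    kept = []
    for sample in samples:
        irrelevant = sum(1 for vote in sample["relevance_votes"] if not vote)
        if irrelevant < min_irrelevant_votes:
            kept.append(sample)
    return kept

samples = [
    {"id": "a", "relevance_votes": [True, True, False]},   # kept: one negative vote
    {"id": "b", "relevance_votes": [False, True, False]},  # discarded: two negative votes
    {"id": "c", "relevance_votes": [True, True, True]},    # kept: unanimous
]
kept_ids = [s["id"] for s in filter_relevant(samples)]
```

With three workers per sample, the threshold of two negative votes is equivalent to a majority vote on relevance.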
As shown in Table 5, workers annotate a large portion of the questions generated by our VQAG model as relevant (QR percentages between 77.4% and 97.8%), showcasing the effectiveness of the proposed VQAG model. Answer correctness is relatively lower (AC percentages between 51.6% and 74.8%), which indicates that CrossVQA is a challenging new benchmark for visual question answering. We find that our model has relatively more difficulty generating correct answers for questions in the count, time, spatial, food and attribute categories.
Further Analysis. We first assess the controllability of our VQAG model in the generation of VQA2-style or VizWiz-style questions. In Table 6, we use the Jensen-Shannon divergence (JSD) between the unigram and bigram distributions of the questions in each data pair to measure their "style" distance. Regardless of the image sources, the generated VQA2-style (VizWiz-style) questions are much more similar to VQA2 (VizWiz) than the original VizWiz (VQA2) questions are. We then focus on the generated questions/answers on OID and assess the benefits of pre-training and control signals on out-of-domain images. Figure 4 shows a qualitative comparison of questions generated without (red) and with pre-training (green). We observe that the pre-trained model generates more accurate and informative questions (e.g., fire hydrant vs. fire extinguisher, fish vs. shark). In Figure 5, the sunburst plots (shown at the top) of the first three words of the questions exhibit much higher diversity with control signals. Further, in Figure 6, the distribution of answer categories demonstrates that control signals increase the entropy of the answer category distribution, which benefits the long-tail categories.
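The style distance above can be illustrated with a small, self-contained computation of the Jensen-Shannon divergence between n-gram distributions of two question sets. Whitespace tokenization, lowercasing, and the base-2 logarithm are assumptions of this sketch; the exact preprocessing behind Table 6 is not specified here.

```python
import math
from collections import Counter

def ngram_distribution(questions, n=1):
    """Relative frequencies of word n-grams over a list of questions."""
    counts = Counter()
    for question in questions:
        tokens = question.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

def jensen_shannon_divergence(p, q):
    """JSD with log base 2, so the value lies in [0, 1]."""
    support = set(p) | set(q)
    mixture = {g: 0.5 * (p.get(g, 0.0) + q.get(g, 0.0)) for g in support}

    def kl(a):
        # KL(a || mixture); the mixture is positive wherever a is positive.
        return sum(prob * math.log2(prob / mixture[g]) for g, prob in a.items() if prob > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

vqa_like = ngram_distribution(["what color is the cat", "how many cats"])
vizwiz_like = ngram_distribution(["what is this", "can you read this label"])
distance = jensen_shannon_divergence(vqa_like, vizwiz_like)
```

Unlike the KL divergence, JSD is symmetric and stays finite when one question set contains n-grams the other never uses, which is the common case when comparing datasets.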

Cross-Dataset VQA Experiments
Performance of State-of-the-Art VQA Models. We evaluate three state-of-the-art VQA models: ViLBERT (VB) (Lu et al., 2019a), LXMERT (Tan and Bansal, 2019), and VisualBERT (Li et al., 2019), all trained on VQA2. In Table 7 (top three rows), we find that ViLBERT outperforms the other baselines on CrossVQA splits with VQA2 images or VQA2-style questions, so we provide a detailed analysis of ViLBERT. Figure 7 compares the CrossVQA performance of (a) ViLBERT trained on the VQA2 dataset, and (b) ViLBERT trained on VQA2 and fine-tuned on VizWiz. We find that both VQA models show accuracy drops on all six splits, compared to the SOTA accuracy of 71.0% on the VQA2 test set and 54.7% on the VizWiz test set (left-most column). This indicates that the questions in CrossVQA are harder for SOTA models to get right. Moreover, the model trained on VQA2 drops by up to 40% on VizWiz and OID input images, a rather unexpected (and never-before quantified) result. Similarly, the model fine-tuned on VizWiz underperforms on splits with VQA2 and OID images by similarly large margins. This suggests that VQA models struggle to generalize when there is a mismatch in image distribution. In contrast, the drop in accuracy is relatively low for mismatches in language distribution, indicating that these models are less robust to shifts in visual features than to shifts in language features. We believe that the rich object-level features and interactions available in the visual space could be causing the models to overfit to the training image distribution, so that they struggle to generalize to a new image distribution.

Adaptation Techniques with Auxiliary Losses.
We also examine whether the contrastive and multi-task learning (MTL) losses (Akula et al., 2020a) improve the adaptation performance of ViLBERT on CrossVQA in Table 7. In contrastive learning, negatives are sampled randomly from the mini-batch. The last five rows of Table 7 show the performance of ViLBERT (VB) using these contrastive and MTL losses. Although the losses slightly improve the accuracy on the in-domain CrossVQA split $\langle I_{\text{vqa2}}, QA_{\text{vqa2}} \rangle$, they fail to improve generalization on the cross-domain splits $\langle I_{\text{vqa2}}, QA_{\text{vzwz}} \rangle$, $\langle I_{\text{vzwz}}, QA_{\text{vqa2}} \rangle$ and $\langle I_{\text{oid}}, QA_{\text{vqa2}} \rangle$, suggesting that there is ample room for improvement (see Appendix E).

Conclusion
We present a step toward scalable and systematic evaluation of VQA systems. Key to our approach is an accurate and controllable VQAG module that is capable of generating disentangled distribution shifts. We generate CrossVQA benchmarks, a collection of test splits based on VQA2, VizWiz, and Open Images datasets. We validate their utility by showing that existing VQA models struggle to perform well in this evaluation scenario and identifying the image distribution mismatch as the main factor.

A Appendix
In this supplementary material, we begin by providing more details on our VQAG implementation. We then provide additional results and a detailed analysis comparing the diversity and novelty of questions generated using our VQAG model and the baselines. Next, we present the interfaces used for conducting our human studies and show additional results. We then present details of the contrastive learning and multi-task learning models used in our adaptation analysis. Finally, we present more statistics from CrossVQA.

B Implementation Details
The models are optimized with Adam (Kingma and Ba, 2015) with an initial learning rate of 0.000032. We use a linear decay learning rate schedule with warm-up and employ early stopping based on validation set accuracy. If not pre-trained, we train our VQAG model for a maximum of 2M iterations. With pre-trained initialization, we train our VQAG model for a maximum of 500,000 iterations. Both the encoder and the decoder of the transformer have 6 layers, each with 8 heads for multi-headed attention. The vocabulary embedding size is 512, and the hidden embedding size is 1024. We train our models with a global batch size of 4096 on Google Cloud 32-core TPUs. The average training time for pre-training on the Conceptual Captions dataset is 52 hours, and training on VQA2.0 and VizWiz takes up to 21 hours.
We condition our VQAG model on the expected answer category (C) of the output answer as one of the control signals, in order to maximize the relevance between image, question and expected answer in the generated test sets. These answer categories can be objects, attributes, colors, materials, time, etc. Specifically, we use 16 categories (similar to Krishna et al. (2019)), covering more than 80 objects, 40 attributes, 17 colors, and 8 materials. Table 8 presents the list of all 16 categories and provides examples of answers for each category.
The decoder generates the question and the answer(s) separated by delimiters, for example, question <sep> answer1 <dsep> answer2. We use beam search (width = 5, alpha = 0.6) to generate the target question and answer(s) during decoding. Table 9 presents examples of the invented/unseen questions and answers that are not seen by our VQAG model during training. In the next section, we verify the question relevance and answer correctness of these out-of-distribution (o.o.d.) questions.
[Figure: panels titled Question Generative Strength, Question Inventiveness, Answer Generative Strength, and Answer Inventiveness.]

Table 9: Examples of unseen questions and answers invented by our VQAG model.
Examples of Invented Questions:
Q1: What hand is the man using to write with?
Q2: Are most of the lights on or off in the living room?
Q3: Will this woman be drinking beer?
Q4: What is the number on the front side of the bike?
Q5: In this scene how many sheep can be clearly seen?
Q6: What is the purpose of the number on the yellow board?
Q7: Which sheep is the older in the picture?
Q8: Is the fire hydrant old or new?
Q9: What is the first letter of the word on the blue sign?
Q10: What is the name of the logo on top of the keyboard?
Examples of Invented Answers: {at least 10 years, above doorway, inside the baggage, behind red car, towards bottom left side, dirt bikes, fishing boats, fork and sharp knife, riding big elephants, right side of road}

D Additional Human Evaluation Results
In the main paper, we verify question relevance and answer correctness of the samples in the CrossVQA splits, where the VQAG model is trained on the combined train sets of VQA2.0 and VizWiz. In this section, we present additional human evaluation results for a VQAG model that is trained on only the VQA2.0 train split. We generate questions and answers for the VQA2.0 val split (in-domain) and the VizWiz and OID val splits (o.o.d.). Figure 9 shows the interface used for conducting this study. Questions that are annotated as not relevant by at least two workers are considered irrelevant. For each of the relevant questions, we ask the workers to verify if the generated answer is correct and, if it is incorrect, to write the correct answer. [...] Figure 9) are indeed relevant and not due to random noise. Furthermore, in Table 10, we also show the QR and AC percentages across seen and unseen questions generated by the VQAG model. We see a larger drop in AC percentage than in QR percentage on unseen questions, indicating that it is relatively harder for the model to generate correct answers for unseen questions.

E More Details on our Base Model
Both the encoder and the decoder contain a stack of L layers, with each layer consisting of a multi-head self-attention layer followed by a feed-forward layer. For a given token embedding, the self-attention layer produces a weighted representation of all other tokens in the input. This weighted representation is then combined with the input representation of the given token and passed to the next layer. Specifically, each attention head first calculates the queries Q, keys K and values V as follows:
$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V,$$
where X contains all the input features stacked into a matrix, and $W_Q$, $W_K$, and $W_V$ are learned projection matrices. The output of the attention head is then computed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$
where $d_k$ and $d_v$ are the dimensions of the keys K and values V, respectively. Intuitively, with the above attention, the encoder jointly attends to information from different representation subspaces at different positions in the input image. The point-wise feed-forward network (FFN) is applied to each output of the attention layer and consists of two linear transformations with a ReLU activation in between:
$$\mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2,$$
where $W_1$, $b_1$ and $W_2$, $b_2$ are the weights and biases of the two fully connected layers.
[Figure 9: Interface used for the human evaluation study. Task: assess the quality of the question and the answer presented for the image. Example shown: Question: What are the bricked letters on the surface? Answer: Can't tell. Workers answer: 1. Does the question apply to the image? 2. Is the answer correct? 3. What is the correct answer?]
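The attention head and feed-forward computations described in this section follow the standard transformer formulation (Vaswani et al., 2017). The sketch below is a minimal single-head, pure-Python illustration of that formulation rather than the implementation used in our experiments; the helper names and the toy identity weights are assumptions chosen so the example is easy to check by hand.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, w_q, w_k, w_v):
    """softmax(Q K^T / sqrt(d_k)) V with Q = X W_Q, K = X W_K, V = X W_V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def ffn(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between: max(0, x W1 + b1) W2 + b2."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(x, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]

identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]  # two toy input tokens
attended = attention_head(tokens, identity, identity, identity)
```

With identity projections, each row of `attended` is just the attention weights of that token, so the rows sum to one and each token attends most strongly to itself.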
Embedding Regional Image Features We extract image objects and their features using a Faster R-CNN (Ren et al., 2015) object detector trained on Visual Genome (Krishna et al., 2017). We extract 100 object regions per image. The resulting bounding boxes are treated as visual tokens. Similar to the positional encoding in language models (Vaswani et al., 2017), we also encode the spatial position of the bounding box for each visual token. We use a 5-d vector, $p_{spatial}$, to encode the top-left and bottom-right corners and the bounding-box area, all relative to the image, i.e., $p_{spatial} = \left[\frac{x_{tl}}{W}, \frac{y_{tl}}{H}, \frac{x_{br}}{W}, \frac{y_{br}}{H}, \frac{w \cdot h}{W \cdot H}\right]$, where $W$ and $H$ are the image width and height and $w$ and $h$ are the box width and height.

Embedding Global Image Features Similar to (Thapliyal and Soricut, 2020
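The 5-d spatial encoding of a region described above can be sketched as follows. This is an illustrative helper, not the authors' code: the function name and argument names are hypothetical, and the fourth component is taken to be the bottom-right y coordinate divided by the image height.

```python
def spatial_encoding(x_tl, y_tl, x_br, y_br, img_w, img_h):
    """Return [x_tl/W, y_tl/H, x_br/W, y_br/H, (w*h)/(W*H)] for one bounding box."""
    box_w = x_br - x_tl  # box width w
    box_h = y_br - y_tl  # box height h
    return [
        x_tl / img_w,
        y_tl / img_h,
        x_br / img_w,
        y_br / img_h,
        (box_w * box_h) / (img_w * img_h),  # fraction of the image covered by the box
    ]
```

All five entries lie in [0, 1] for a box inside the image, which keeps the encoding independent of the image resolution.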

F Models for Adaptation Analysis
ViLBERT Training: As discussed in Section 4 of the main paper, we use ViLBERT (Lu et al., 2019b) for our adaptation experiments.
ViLBERT uses a pretrain-then-transfer learning approach to jointly learn visual and textual representations from large-scale data and uses them to answer VQA questions. Specifically, we use the 8-layer ViLBERT implementation available at https://github.com/jiasenlu/vilbert_beta. On the VQA train splits, we train the model for a maximum of 25 epochs and use early stopping based on validation performance. We use an initial learning rate of 3e-5 with a linear-decay learning rate schedule with warm-up. We train on 8 Tesla V100 GPUs with a total batch size of 512.
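As an illustration of a linear-decay schedule with warm-up, the sketch below computes the learning rate at a given optimizer step. Only the peak rate of 3e-5 comes from the text; the number of warm-up steps, the total number of steps, and the decay to zero at the end of training are assumptions made for the example.

```python
def linear_warmup_decay_lr(step, total_steps, warmup_steps, base_lr=3e-5):
    """Learning rate at `step`: linear ramp up to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        # Warm-up phase: grow linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: shrink linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

With `total_steps=1000` and `warmup_steps=100` (hypothetical values), the rate rises for the first 100 steps, peaks at `base_lr`, and then falls linearly until step 1000.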
Contrastive Learning using ViLBERT: In implementing the contrastive loss functions, we randomly sample negatives from the mini-batch for computational efficiency, similar to Akula et al. (2020a). We sample 64 negatives from each batch for both the Sum-H and Max-H losses and tune the margin parameters on the development split.
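The sketch below shows one common way to compute sum-of-hinges and max-of-hinges ranking losses from a positive score and a set of sampled negative scores. It is an assumption that the Sum-H and Max-H losses take this standard form; the exact definitions are given in the main paper, and the margin value and function names here are hypothetical.

```python
import random


def sample_negatives(batch_indices, positive_index, k, rng=None):
    """Randomly pick up to k negative indices from the mini-batch, excluding the positive."""
    rng = rng or random.Random()
    candidates = [i for i in batch_indices if i != positive_index]
    return rng.sample(candidates, min(k, len(candidates)))


def hinge_losses(pos_score, neg_scores, margin=0.2):
    """Return (sum_h, max_h) over the hinge terms max(0, margin - s_pos + s_neg)."""
    hinges = [max(0.0, margin - pos_score + s) for s in neg_scores]
    sum_h = sum(hinges)  # penalizes every violating negative
    max_h = max(hinges) if hinges else 0.0  # penalizes only the hardest negative
    return sum_h, max_h
```

The sum variant spreads the gradient over all negatives that violate the margin, whereas the max variant focuses on the single hardest negative in the sample.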
Multi-Task Learning using ViLBERT: We present our multi-task learning (MTL) architecture in Figure 10. The shared layers of ViLBERT consist of transformer blocks (TRM) and co-attentional transformer layers (Co-TRM) (Lu et al., 2019b). The weights of the task-specific layers are randomly initialized, whereas the shared layers are initialized with weights pre-trained on 3.3 million image-caption pairs from the Conceptual Captions dataset (Sharma et al., 2018). We use a binary cross-entropy loss for all the auxiliary tasks: GQA (Hudson and Manning, 2019), visual commonsense reasoning (VCR) (Zellers et al., 2019a), and referring expression recognition (REF) (Cirik et al., 2018). We use the RefCOCOg (Mao et al., 2016) dataset for the REF task. We optimize the tasks alternately in mini-batches based on a mixing ratio and employ early stopping based on validation performance. In all our contrastive learning and multi-task learning experiments, we use an initial learning rate of 4e-5 with a linear-decay learning rate schedule with warm-up. We train on 4 RTX 2080 GPUs with a total batch size of 256.
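One simple way to alternate tasks across mini-batches according to a mixing ratio is to sample the task for each step with probability proportional to its ratio, as sketched below. The ratio values, the random (rather than round-robin) selection, and the function name are assumptions for illustration; the text does not specify them.

```python
import random


def task_schedule(mixing_ratio, num_steps, seed=0):
    """Pick one task per mini-batch step, with probability proportional to its ratio."""
    rng = random.Random(seed)  # fixed seed so the schedule is reproducible
    tasks = list(mixing_ratio)
    weights = [mixing_ratio[t] for t in tasks]
    return [rng.choices(tasks, weights=weights, k=1)[0] for _ in range(num_steps)]


# Hypothetical ratio: the main VQA task is visited more often than each auxiliary task.
schedule = task_schedule({"VQA": 4, "GQA": 1, "VCR": 1, "REF": 1}, num_steps=1000)
```

In a training loop, each entry of `schedule` would select which task's data loader and task-specific head to use for that step, while the shared layers receive updates from every task.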
Transfer Learning using ViLBERT: In addition to the contrastive learning and MTL-based adaptation results presented in Section 4 of the main paper, we also explore transfer learning (TL) based models. Specifically, we first pre-train ViLBERT on the auxiliary tasks, in contrast to the joint training used in MTL, and then fine-tune it on the VQA train split. As shown in Table 11, we did not find any significant improvement in the model's performance on CrossVQA.

G More Details on CrossVQA
In addition to the statistics presented in Section 4 of the main paper, we present additional details of our CrossVQA splits. Figure 12a and Figure 12b show word clouds for the majority of questions and answers across all six splits. A variety of objects and answers can be seen in the plots, suggesting that our splits are diverse. Moreover, the relative frequency of the most frequent spatial relationships across all six splits in Figure 13 shows that CrossVQA comprises rich and diverse spatial relationships. Figure 11 shows the question length distribution of all six splits. As expected, we find that splits with VizWiz-style questions, i.e., $(I_{vqa2}, QA_{vzwz})$, $(I_{vzwz}, QA_{vzwz})$, and $(I_{oid}, QA_{vzwz})$, contain more words per question on average than the other splits in CrossVQA.