WeaQA: Weak Supervision via Captions for Visual Question Answering

Methodologies for training visual question answering (VQA) models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets. This has led to heavy reliance on datasets and a lack of generalization to new types of questions and scenes. Linguistic priors, along with biases and errors due to annotator subjectivity, have been shown to percolate into VQA models trained on such samples. We study whether models can be trained without any human-annotated Q-A pairs, using only images and their associated textual descriptions or captions. We present a method to train models with synthetic Q-A pairs generated procedurally from captions. Additionally, we demonstrate the efficacy of spatial-pyramid image patches as a simple but effective alternative to dense and costly object bounding-box annotations used in existing VQA models. Our experiments on three VQA benchmarks demonstrate the efficacy of this weakly-supervised approach, especially on the VQA-CP challenge, which tests performance under changing linguistic priors.


Introduction
Since Visual Question Answering (VQA) was first proposed as a Turing test (Malinowski and Fritz, 2014), several human-annotated datasets (Mogadala et al., 2019) have been used to train and evaluate VQA models. Unfortunately, heavy reliance on these datasets for training has the unwanted side-effects of bias towards answer styles and question types (Chao et al., 2018), and spurious correlations with language priors (Agrawal et al., 2018). Similar findings have been reported for natural language tasks (Gururangan et al., 2018; Niven and Kao, 2019; Kaushik et al., 2020). Evaluating VQA models on test sets that are very similar to the training sets is therefore deceptive and inadequate: it is not an accurate measure of robustness.
To address this, one line of work has focused on balancing, de-biasing, and diversifying samples (Goyal et al., 2017; Zhang et al., 2016). However, crowd-sourcing "unbiased" labels is difficult and costly; it requires a well-designed annotation interface and a large-scale annotation effort with dedicated and able annotators (Sakaguchi et al., 2020). The alternative (with which this paper aligns itself) is to avoid the use of explicit human annotations and instead to train models in an unsupervised manner by synthesizing training data. These techniques, coined unsupervised 1 , come with many advantages: human bias and subjectivity are reduced, and the techniques are largely domain-agnostic and can be transferred from one language to another (e.g., low-resource languages) or from one visual domain to another. For instance, template-based Q-A generation developed for synthetic blocks-world images in CLEVR can also be used to generate Q-A pairs for natural complex scenes in GQA (Hudson and Manning, 2019) or for the referring-expressions task.
In this work, we train VQA models without using human-annotated Q-A pairs. Instead, we rely on weak supervision from image-captioning datasets, which provide multi-perspective, concise, and less subjective descriptions of visible objects in an image. We procedurally generate Q-A pairs from these captions, train models using this synthetic data, and evaluate them only on established human-annotated VQA benchmarks.
Why Captions? Image captioning, like VQA, has been a central area of vision-and-language research. Datasets such as MS-COCO (Lin et al., 2014;Chen et al., 2015) contain captions that describe objects and actions in images of everyday scenes. During the construction of MS-COCO, human captioners were instructed to refrain from describing past and future events or "what a person might say". On the other hand, annotators of VQA (Antol et al., 2015) were instructed to ask questions that "a smart robot cannot answer, but a human can" and "interesting" questions that may require "commonsense". Different sets of annotators provided answers to these questions and were allowed to speculate or even guess an answer that most people would agree on. It has also been shown that multiple answers may exist for questions in common VQA datasets (Bhattacharya et al., 2019).
In Figure 2, the first VQA-v2 question asks how many doors the car has. Although commonsense (and linguistic priors) would suggest that "Most cars have four doors", only two doors can be seen in the image. What should the model predict, two or four? The second question is subjective and has multiple contradicting answers from different annotators (where one should draw the line between opaque, transparent, or reflective is not very clear). Similarly, the first GQA question is ambiguous and could refer to either the skier or the photographer.
Thus the very nature of the data-collection procedure and instructions for VQA brings in human subjectivity and linguistic bias, compared to caption annotations, which are designed to be simple, precise, and non-speculative. Motivated by this, we study the benefits of using captions to synthesize Q-A pairs, using three types of methods: (1) template-based methods similar to those of Ren et al. (2015a) and Gokhale et al. (2020b); (2) paraphrasing and back-translation (Sennrich et al., 2016), which provide linguistic variation; and (3) synthesis of questions about image semantics using the QA-SRL (He et al., 2015) approach. Since our Q-A pairs are created synthetically, there does exist a domain shift as well as a label (answer) shift from evaluation datasets such as VQA-v2 and GQA, as shown in Figure 2, posing challenges to this weakly-supervised method.
We evaluate two models, UpDown (Anderson et al., 2018) and a transformer-encoder (Vaswani et al., 2017) based model, pre-trained on synthetic Q-A pairs and an image-caption matching task. To remove the dependence on the object bounding-boxes and labels needed to extract object features, we propose spatial pyramids of image patches as a simple and effective alternative.
To the best of our knowledge, this is the first work on unsupervised 1 visual question answering, with the following contributions:
• We introduce a framework for synthesizing (Question, Answer) pairs from captions.
• Since synthetic samples (unlike popular benchmarks) include multi-word answer phrases, we propose a sub-phrase weighted-answer loss to mitigate bias towards such multi-word answers.
• We propose pre-training tasks that use spatial pyramids of image-patches instead of object bounding-boxes, further removing the dependence on human annotations.
• Extensive experiments and analyses under zero-shot transfer and fully-supervised settings on VQA-v2, VQA-CP, and GQA show our model's efficacy and establish a strong baseline for future work on unsupervised visual question answering.

1 Adhering to the usage of this term in Lewis et al. (2019a).
Related Work

Figure 2: Examples of images and human-annotated Q-A pairs from VQA and GQA and our synthetic Q-A pairs.

Label shift or Prior Probability Shift (Storkey, 2009) has been implicitly explored in VQA-CP (Agrawal et al., 2018), where the conditional probabilities of answers given the question type deviate at test time. Teney et al. (2020c) have identified several pitfalls associated with the models and evaluation criteria for VQA-CP.
Unsupervised Extractive QA in which aligned (context, question, answer) triplets are not available, has been studied (Lewis et al., 2019b;Banerjee and Baral, 2020;Rennie et al., 2020;Fabbri et al., 2020;Li et al., 2020;Banerjee et al., 2021) by training models on procedurally generated Q-A pairs. Captions have been used to generate Q-A pairs for logical understanding (Gokhale et al., 2020b) and commonsense video understanding (Fang et al., 2020a). Li et al. (2018); Krishna et al. (2019) have explored Visual Question Generation from an input image and answer.
Weak supervision is an active area of research, for instance in action/object localization (Song et al., 2014; Zhou et al., 2016) and semantic segmentation (Khoreva et al., 2017; Zhang et al., 2017) without pixel-level annotations, using only class labels. There is also growing interest in leveraging natural language captions or textual queries as weak supervision for visual grounding tasks (Hendricks et al., 2017; Mithun et al., 2019; Fang et al., 2020b).
Visual Feature Extractors such as VGG (Simonyan and Zisserman, 2015) and ResNet (He et al., 2016) have been widely used for many computer vision tasks. Object-based features such as RCNN (Girshick et al., 2014) and Faster-RCNN (Ren et al., 2015b) have become the standard for V&L tasks (Anderson et al., 2018).

Framework for Synthesizing Q-A Pairs
Problem Statement: Consider a dataset containing images and associated captions, as shown in Figure 2. Our work deals with learning VQA from these image-caption data, without any labeled Q-A pairs, and answering questions about unseen images.

Question Generation
Several studies (Du et al., 2017; Lewis et al., 2019a) have been dedicated to the complex domain of question generation. We approach it conservatively, using template-based methods and semantic role labeling, with paraphrasing and back-translation for improving the linguistic diversity of template-based questions. We begin by extracting object words from the caption using simple heuristics, such as extracting noun phrases and using numerical quantifiers in the caption as soft approximations of objects' cardinality. If object words are available explicitly, we use them as-is. Questions are categorized based on answer types: Yes-No, Number, Color, Location, Object, and Phrases.
Template-based: To create Yes-No questions, modal verbs are removed from the caption, and a randomly chosen question prefix such as "is there" or "is this" is attached. For instance, the caption "A man is wearing a hat and sitting" is converted to "Is there a man wearing a hat and sitting", with the answer "Yes". To create the corresponding question with the answer "No", we either use negation or replace the object word with an adversarial word or antonym, obtaining "Is there a dog wearing a hat and sitting", for which the answer is "No". An adversarial word refers to an object absent in the image but similar to objects in the image. To compute similarity, we use GloVe (Pennington et al., 2014) word vectors.
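The Yes-No rule above can be sketched minimally as follows. This is an illustrative approximation: the hard-coded adversarial-word table stands in for the GloVe nearest-neighbour lookup, and the prefix is fixed rather than randomly chosen.

```python
# Hypothetical lookup standing in for GloVe-similarity adversarial words.
ADVERSARIAL = {"man": "dog", "cat": "dog", "car": "truck"}

def yes_no_pair(caption, object_word, prefix="Is there"):
    # Drop the first auxiliary "is" ("A man is wearing" -> "A man wearing").
    base = caption.replace(" is ", " ", 1)
    base = base[0].lower() + base[1:]
    yes_q = "%s %s?" % (prefix, base)
    # Swap the object for an absent-but-similar one to flip the answer.
    no_q = yes_q.replace(object_word, ADVERSARIAL[object_word], 1)
    return (yes_q, "Yes"), (no_q, "No")
```

Applied to "A man is wearing a hat and sitting" with object word "man", this yields the Yes question about the man and the adversarial No question about a dog, mirroring the example in the text.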
For Object, Number, Location, and Color questions, we follow a procedure similar to Ren et al. (2015a). To create "what" questions for the Object type, we extract objects and noun phrases from captions as potential answers and replace them with "what". The question is rephrased by splitting long sentences into shorter ones and converting indefinite determiners to definite ones. A similar procedure is used for Number questions: numeric quantifiers of noun phrases are extracted and replaced by "how many" or "what is the count" to form the question. Color questions are generated by locating the color adjective and the corresponding noun phrase and inserting them into a templated question: "What is the color of the object?". Location questions are similar to Object questions, but we extract phrases with "in" or "within" to identify locations, with places, scenes, and containers as answers.
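A toy sketch of the Number and Color templates follows. The color/number word lists and the regular expressions are simplifying assumptions standing in for full noun-phrase extraction.

```python
import re

COLORS = ("red", "blue", "black", "white", "green", "yellow", "brown")
NUMBERS = ("two", "three", "four", "five", "six")

def number_question(caption):
    # Numeric quantifier + (short) noun phrase -> "how many" question.
    m = re.search(r"\b(%s|\d+)\s+([a-z]+(?: [a-z]+)?)" % "|".join(NUMBERS),
                  caption, re.I)
    if m is None:
        return None
    return "How many %s are there?" % m.group(2), m.group(1)

def color_question(caption):
    # Color adjective + noun -> templated color question.
    m = re.search(r"\b(%s)\s+([a-z]+)" % "|".join(COLORS), caption, re.I)
    if m is None:
        return None
    return "What is the color of the %s?" % m.group(2), m.group(1)
```

For "Two black cars parked on the street" this produces ("How many black cars are there?", "Two"); a real implementation would additionally rephrase determiners as described above.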
Semantic Role Labeling: QA-SRL (He et al., 2015) was proposed as a paradigm to use natural language to annotate data, using Q-A pairs to specify textual arguments and their roles. Consider the caption "A girl in a red shirt holding an apple sitting in an empty open field". Using QA-SRL with B-I-O span detection and sequence-to-sequence models (FitzGerald et al., 2018), for the "when", "what", "where", and "who" questions, we obtain Q-A pairs belonging to the Phrases category such as:
(what is someone holding?, an apple)
(who is sitting?, girl in a red shirt holding an apple)
(where is someone sitting?, an empty open field)
These examples illustrate that QA-SRL questions are short and use generic descriptors such as something and someone instead of elaborate references, while the expected answer phrases are longer and descriptive. Thus to answer these, better semantic image understanding is required.

Paraphrasing and Back-Translation (P&B):
We apply two natural language data augmentation techniques, paraphrasing and back-translation, to increase the linguistic variation in the questions. To paraphrase questions, we train a T5 (Raffel et al., 2019) text generation model on the Quora Question Pairs corpus. For back-translation, we train another T5 text generation model on the Opus corpus (2012), translate the question to an intermediate language (French, German, or Spanish), and translate it back to English. For example:
English: Is the girl who is to the left of the sailboats wearing a backpack?
→ Spanish: La chica que está a la izquierda de los veleros lleva mochila?
→ English: Does the girl to the left of the sailboats carry a backpack?

Domain Shift w.r.t. VQA-v2 and GQA
Compared to current VQA benchmarks (which typically contain one-word answers), answers to QA-SRL questions are more descriptive and contain adjectives, adverbs, determiners, and quantifiers, as seen in Figure 2. On the other hand, synthetic questions have less descriptive subjects due to the use of pronouns. Our synthetic data contains 90k unique answer phrases, compared to 3.2k in VQA and 3k in GQA. Around 200 answers from VQA are not present in our answer phrases, such as times (11:00) and proper nouns (LA Clippers), neither of which appears in caption descriptions.
Moreover, our training data contains Q-A pairs such as ("Where is the man standing?", "to the left of the table"), generated by QA-SRL with long phrases as answers. However, the test set contains questions such as ("Which side of the car is the tree?", "left"), which expect only "left" as the answer. So although the word "left" is seen as a sub-phrase of our training answers, it is not explicitly seen as the sole correct answer.
Some of our synthetic template-based questions about counting and object presence are similar in style to those in VQA and GQA. However, QA-SRL questions require a semantic understanding of the actions depicted in the image, which is rare in VQA and GQA. We quantify this by plotting the t-SNE components of document-vector embeddings of the questions from VQA, GQA, and our synthetic data in Figure 3, and observe that our synthetic questions form a distinct cluster, while VQA and GQA overlap with each other. As such, a linguistic domain shift exists between these synthetic source questions and human-annotated target questions. In this paper, we address the challenge of learning VQA from a synthetically generated dataset while evaluating models on conventional benchmarks whose questions and answers deviate linguistically from the synthetic training samples.

Method
Recently, multiple deep transformer-based architectures have been proposed (Tan and Bansal, 2019; Chen et al., 2019) that are pretrained on a combination of multiple VQA and image captioning datasets such as Conceptual Captions (Sharma et al., 2018), SBU Captions (Ordonez et al., 2011), Visual Genome (Krishna et al., 2017), and MS-COCO (Lin et al., 2014). These models are resource-intensive, as they are trained on a huge collection of data with 3 million images. We train our models only on MS-COCO captions and images (∼204k), without access to any human-authored Q-A pairs or object bounding boxes.

Spatial Pyramid Patches
"Bottom-Up" object features (Anderson et al., 2018) extracted from Faster R-CNN (Ren et al., 2015b) have become the de-facto features used in state-of-the-art VQA models. These VQA models thus only use features of detected objects as input and ignore the rest of the image. Although object features are discriminative, dense annotations are required for training, and additional large deep networks are needed for extraction. Object detection can be imperfect for small and rare objects (Wang et al., 2019); for instance, if an object detection model detects only four out of six bananas in an image, features of the other two bananas will not be used by VQA models. This creates a performance bottleneck for questions about counting or rare objects.
We take a step back and postulate that using features of the entire image in context could reduce this bottleneck. Image features extracted from a ResNet (He et al., 2016) trained for the ImageNet (Russakovsky et al., 2015) classification task have previously been used in VQA models (Goyal et al., 2017). Unfortunately, since ImageNet contains iconic (single-object) images, using these features for non-iconic VQA images is restrictive, since many questions refer to multiple objects and backgrounds in the image. Inspired by Spatial Pyramid Matching (Lazebnik et al., 2006) for image classification, we propose spatial pyramid patch features to represent the input VQA image as a sequence of features at different scales.
We divide each image I into a set of image patches {I_{k_1}, . . . , I_{k_n}}, each I_{k_i} being a k_i × k_i grid of patches, and extract ResNet features for each patch. Larger patches encode global features and relations, while smaller patches encode local and low-level features.
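The patch-pyramid construction can be sketched as below. The grid sizes are illustrative assumptions (the ablations later find 7 × 7 to be the optimum scale), the patches here are non-overlapping for simplicity (the experiments use 50% overlap), and the ResNet feature extractor is omitted.

```python
import numpy as np

def pyramid_patches(image, grid_sizes=(1, 3, 5, 7)):
    """Split an H x W x C image into a k x k grid of patches per scale k.

    Each patch would then be embedded independently by a feature
    extractor (a ResNet-50 in the paper, not included here); larger
    patches carry global context, smaller ones local detail.
    """
    h, w = image.shape[:2]
    patches = []
    for k in grid_sizes:
        ys = np.linspace(0, h, k + 1, dtype=int)
        xs = np.linspace(0, w, k + 1, dtype=int)
        for i in range(k):
            for j in range(k):
                patches.append(image[ys[i]:ys[i + 1], xs[j]:xs[j + 1]])
    return patches

# 1x1 + 3x3 + 5x5 + 7x7 grids -> 1 + 9 + 25 + 49 = 84 patches per image.
```

The resulting list of 84 patch features per image forms the visual input sequence described in the Encoder paragraph below.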
Encoder: Our Encoder model is similar to the UNITER single-stream transformer, where the sequence of word tokens w = {w_1, ..., w_T} and the sequence of image patch features v = {v_1, ..., v_K} are taken as input. We tokenize the text using a WordPiece (Wu et al., 2016) tokenizer similar to BERT (Devlin et al., 2019) and embed the text tokens through a text embedder. The visual features are projected to a shared embedding space using a fully-connected layer. A projected visual position encoding, indicating the patch region (top-right, bottom-left), is added to the visual features. We concatenate both sequences of features and feed them to L cross-modality attention layers. Parameters between the cross-modality attention layers are shared to reduce the parameter count and increase training stability (Lan et al., 2020), and a residual connection and layer normalization are added after each cross-modal attention layer, similar to Vaswani et al. (2017).

Pre-training Tasks and Loss Functions
We train the Encoder model using three pre-training tasks: Masked Language Modeling, Masked Question Answering, and Image-Text Matching.
Masked Language Modeling (MLM): We randomly mask 15% of the word tokens from the caption and ask the model to predict them. For the caption "There is a man wearing a hat", the model gets the input "There is [MASK] wearing a hat". Without the image, there can be multiple plausible choices for the [MASK] token, such as "woman", "man", or "girl", but given the image the model should predict "man". This task has been shown to effectively learn cross-modal features (2019).

Figure 4: Our model architecture uses spatial pyramids of image patches as inputs to the Encoder, which is trained on the three pre-training tasks shown.
Masked Question Answering (MQA): In this task, the answer tokens are masked, and the model is trained to predict them. For example, in Figure 2, for the input "When is someone competing? [MASK] [MASK]", the model should predict "at night". To answer such questions, the model needs to interpret the image.
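The two masking schemes can be sketched as follows; this is a simplification in which plain word lists stand in for WordPiece tokenizer output, and the 15% sampling is seeded for reproducibility.

```python
import random

def mask_tokens(tokens, rate=0.15, rng=None):
    # MLM: hide `rate` of the caption tokens; targets are the hidden words.
    rng = rng or random.Random(0)
    n = max(1, round(rate * len(tokens)))
    masked = sorted(rng.sample(range(len(tokens)), n))
    inputs = ["[MASK]" if i in masked else t for i, t in enumerate(tokens)]
    return inputs, {i: tokens[i] for i in masked}

def mask_answer(question_tokens, answer_tokens):
    # MQA: append one [MASK] per answer token; the model must recover them.
    return question_tokens + ["[MASK]"] * len(answer_tokens), answer_tokens
```

For the MQA example above, masking the answer "at night" appends two [MASK] slots after the question tokens.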

Image-Text Matching (ITM):
We use the five captions provided by MS-COCO as positive samples for each image. To obtain negative samples, we randomly sample captions from other images that contain a different set of objects. We train the model on a binary classification task (matching / not matching) for each image-caption pair.
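The negative-sampling rule for ITM can be sketched as follows; the dictionary formats (image id to captions, image id to object-word set) are assumptions of this sketch.

```python
import random

def negative_caption(image_id, captions, objects, rng=None):
    """ITM negative: a caption of another image whose object set differs.

    `captions` maps image id -> list of captions; `objects` maps
    image id -> set of object words extracted from those captions.
    """
    rng = rng or random.Random(0)
    pool = [c for other, caps in captions.items()
            for c in caps
            if other != image_id and objects[other] != objects[image_id]]
    return rng.choice(pool)
```

Each image's five MS-COCO captions serve as positives, while a sampled caption from the pool above serves as a mismatched negative for the binary classification task.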
For VQA and ITM, we use the final-layer representation z_[CLS] of the [CLS] token, followed by a feed-forward and softmax layer. For MLM and MQA, we feed the corresponding token representations to a different feed-forward layer. We train the model using cross-entropy loss for all three tasks.
Sub-phrase Weighted Answer Loss: As observed before, the questions generated by QA-SRL have long answer phrases. For instance, "What is parked?" has the answer "two black cars". We extract all possible sub-phrases that can be alternate answers, but assign them a lower weight than the complete phrase, computed as W_sub = WordCount(sub)/WordCount(ans). Thus "two black cars" has a weight of 1.0, while the extracted sub-phrases and weights are: (two, 0.33), (2, 0.33), (black, 0.33), (cars, 0.33), (two cars, 0.66), (2 cars, 0.66), (black cars, 0.66), (car, 0.33). This enforces a distribution over the probable answer space instead of a strict "single true answer" training. We train the model with this additional binary cross-entropy loss, where the model predicts a weighted distribution y_wa over the answer vocabulary. The vocabulary is defined from the synthetic QA answer space.
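The weight computation W_sub = WordCount(sub)/WordCount(ans) over contiguous sub-phrases can be sketched as below; the paper's additional aliases (e.g. the numeral "2" for "two", or the singular "car") are omitted for brevity.

```python
def swa_weights(answer):
    """Weight each contiguous sub-phrase of the answer by its length ratio.

    The full phrase gets weight 1.0; shorter sub-phrases get
    proportionally smaller soft-target weights for the SWA loss.
    """
    words = answer.split()
    n = len(words)
    weights = {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            weights[" ".join(words[i:j])] = (j - i) / n
    return weights
```

For "two black cars" this reproduces the weights listed above: 1.0 for the full phrase, 0.66 for two-word spans such as "black cars", and 0.33 for single words.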

Experimental Setup
Datasets: We evaluate our methods on three popular visual question answering benchmarks: VQA-v2, VQA-CP-v2, and GQA. Answering questions in VQA-v2 and VQA-CP-v2 requires image and question understanding, whereas GQA further requires spatial understanding such as compositionality and relations between objects. We evaluate our methods under zero-shot transfer (trained only on procedurally generated samples) and fully-supervised (where we finetune our model using the associated train annotations) settings. We use exact-match accuracy for GQA and the VQA metric (Agrawal et al., 2017) for VQA.
Training: Our Encoder has 8 cross-modal layers with a hidden dimension of 768. The weights are initialized using the standard definition provided in the HuggingFace repository. Our models are pre-trained for 40 epochs with a learning rate of 1e−5 and a batch size of 256, using the Adam optimizer. For finetuning, we use a learning rate of 1e−5 or 5e−5 and a batch size of 32 for 10 epochs. We use a ResNet-50 pretrained on ImageNet to extract features from image patches with 50% overlap, and a Faster R-CNN pretrained on Visual Genome to extract object features. We evaluate both frozen and finetuned ResNet variants, and observe that finetuning the feature extractor performs better. All our models are trained using 4 Nvidia V100 16 GB GPUs. All results in the fully supervised setting are reported with from-scratch trained final classification layers.

Notation for the results tables: ZSL refers to the zero-shot transfer setting, and FSL refers to our models further finetuned on the respective train split. Underline ⇒ unsupervised best; bold ⇒ overall best. Baselines are trained on the train split, our models on synthetic data.

Unsupervised Question Answering: Tables 1, 2, and 3 summarize our results on the three benchmark datasets. We observe that our method outperforms specially designed supervised methods for bias removal in VQA-CP; our model with UpDown is 1.1% better than the supervised UpDown. Under the ZSL setting for VQA-CP, our Encoder model is 6.1% better than UpDown with patches and 6.5% better than UpDown with object features; for VQA-v2: 6.2% and 5.4% respectively; and for GQA: 2.2% and 3.0% respectively. For VQA-CP, our procedurally generated Q-A pairs and patch features, when used with either UpDown or Encoder, are better than the baseline supervised UpDown model, showing that the improvements are model-agnostic. This also shows the merits of our Q-A generation methods when train and test sets deviate linguistically.
Most GQA questions require understanding spatial relationships between objects. Such questions are infrequent in our synthetic training data since captions do not contain detailed spatial relationships among objects. Thus, the ZSL performance is not as competitive for GQA when compared to our performance on VQA and VQA-CP. Improving spatial and compositional question-answering with weak supervision is an interesting future pursuit.
Fully Supervised Question Answering: In the FSL setting, our methods' performance is not far from SOTA methods, even though our method uses significantly fewer annotations (no access to object bounding boxes). On GQA, the Encoder model performs on par with MAC (2018) and BAN (2018), which, unlike us, use object relationship annotations. This suggests that cross-modal transformer layers can learn spatial relations from spatial pyramid features.

Impact of each question-generation technique:
In Table 4 we can observe the effect of the different question generation techniques. All models use spatial image patch features. QA-SRL-based questions and the SWA loss contribute the most towards gains in performance, and the paraphrased questions provide larger linguistic variation.

Effect of Spatial Pyramids:
We study the effect of progressively increasing the number of spatial image patches (i.e., decreasing the patch size). Table 5 shows that an optimum exists at a grid size of 7 × 7, after which the addition of smaller patches is detrimental. Similarly, using only large patches does not allow models to focus on specific image regions. Thus a trade-off exists between global context and region-specific features. Changing the feature extractor from ResNet-50 to ResNet-101 only results in a minor improvement of 0.01% to 0.30%. Removing visual position embeddings has a significant effect on performance, with a drop of 4.60% to 8.00% in both the ZSL and FSL settings.
Impact of Pre-training Tasks: Table 6 shows the effect of the different pretraining tasks on the downstream zero-shot transfer VQA task. The SWA task is required, as it is used to perform the zero-shot QA task. The combination of MLM, MQA, and ITM, all of which need image understanding, shows improved performance on the downstream task, indicating better cross-modal representations.
Effect of size of synthetic training set: Figure 1 shows our Encoder model's learning curve for the zero-shot transfer setting when trained on our synthetic Q-A pairs. The performance stagnates after a critical threshold of 10^6 samples is reached. Our experiments also suggest that randomly sampling a set of questions for each image per epoch leads to a 4% gain compared to training on the entire set.
Error Analysis: Our ZSL method is pretrained on longer phrases and hence tends to generate more detailed answers, such as "red car" instead of "car". Although the SWA loss is designed to encourage a distribution over the shorter phrases, the bias is not entirely removed. On automated evaluation, we observe that for 42% of questions, the target answer is a sub-phrase of our predicted answer. Manual evaluation of 100 such samples shows that 87% of such detailed predicted answers are plausible. This shows the relevance of learning from captions and quantifies the bias towards short "true" answers in human-annotated benchmarks, calling for better evaluation metrics that do not penalize VQA systems for producing descriptive or alternative accurate answers.
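The automated sub-phrase check used in the error analysis above can be approximated as follows (a sketch; whitespace tokenization and lowercasing are simplifying assumptions):

```python
def target_is_subphrase(prediction, target):
    # True when the benchmark's short answer occurs as a contiguous
    # word span inside our (often longer) predicted answer.
    pred = prediction.lower().split()
    tgt = target.lower().split()
    m = len(tgt)
    return any(pred[i:i + m] == tgt for i in range(len(pred) - m + 1))
```

Under this check, a detailed prediction such as "red car" against the target "car", or "to the left of the table" against "left", counts as a sub-phrase match rather than an outright error.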
In the FSL setting, we either finetune our pre-trained QA classifier with the SWA loss or train a separate feed-forward layer from scratch for the task. The pre-trained QA classifier predicts longer phrases as answers, leading to a drop in accuracy. The feed-forward layer performs better (+6%), indicating that our Encoder captures the relevant features necessary to generalize to the benchmark answer space. Note that we do not use object annotations during training, unlike existing methods. Our error analysis and Figure 3 show the shift in question-space and answer-space between synthetic and human-authored Q-A pairs. These (along with inadequate evaluation metrics) are the primary sources of the performance gap between weakly-supervised methods and the fully-supervised setting. It remains to be seen whether more sophisticated question generation can be developed to further reduce this gap and mitigate the heavy reliance on human annotations.

Discussion and Conclusion
Prior work (Chen et al., 2019; Jiang et al., 2020) has demonstrated that the use of object bounding-boxes and region features leads to significant improvements on downstream tasks such as captioning and VQA. However, little effort has been dedicated to developing alternative methods that can approach similar performance without relying on dense annotations. We argue that weakly supervised learning coupled with data synthesis strategies could be the pathway for the V&L community towards a "post-dataset era" 2 . In this work, we take a step towards that goal. We address the problem of weakly-supervised VQA with a framework for the procedural synthesis of Q-A pairs from captions for training VQA models, where benchmark datasets are used only for evaluation. We use spatial pyramids of patch features to increase the annotation efficiency of our methods. Our experiments and analyses show the potential of patch features and procedural data synthesis, and reveal problems with existing evaluation metrics.

Ethical Considerations
Captions and Question-Answer pairs are both annotated by humans in existing image captioning and visual question answering datasets. However, captions arguably contain a lesser degree of subjectivity, ambiguity, and linguistic biases than VQA annotations, due to the design of annotation prompts that limit the introduction of these biases. Our work points to the potential of procedurally generated annotations in providing robustness improvements under changing linguistic priors in VQA test sets (Table 1). Hendricks et al. find that gender bias exists in image-captioning datasets and is amplified by models; further research in self-supervised data synthesis could potentially help alleviate such social biases.

2 A. Efros, Imagining a post-dataset era, ICML'20 Talk.

B Dataset Analysis
In Table 9, we compare the per-answer-type distribution of our synthetically generated samples with the distribution in VQA-CP-v2 (Agrawal et al., 2018). We further analyze this shift by computing the t-SNE projections of questions using mean-pooled GloVe (Pennington et al., 2014) embeddings for our generated questions, and observe the overlap with human-authored questions in VQA and GQA (Hudson and Manning, 2019) in Figure 6. We observe a marked shift between the question clusters for our procedurally generated questions and the human-annotated questions from VQA and GQA.
Similarly, we also show the distribution of answers in our dataset in Figure 7. It can be seen that our dataset has a slight imbalance in the proportion of questions with the answers "yes" and "no". Numeric answers 0, 1, 2, and 3 are the most frequent. Answers about people, such as man, woman, people, person, and group of people, are also common in the dataset. The remaining answers have a long-tailed distribution, since there are ∼90k unique answers in our dataset compared to ∼3.5k in VQA and ∼2k in GQA.

C Training Details
We use the HuggingFace and PyTorch (Paszke et al., 2019) frameworks. Hyperparameters and other training settings are given in Table 10.

Table 11: Illustration of using paraphrasing to improve the linguistic variation of our questions and answers.