Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering

Visual question answering (VQA) is challenging not only because the model has to handle multi-modal information, but also because it is just so hard to collect sufficient training examples — there are too many questions one can ask about an image. As a result, a VQA model trained solely on human-annotated examples could easily over-fit specific question styles or image contents that are being asked, leaving the model largely ignorant about the sheer diversity of questions. Existing methods address this issue primarily by introducing an auxiliary task such as visual grounding, cycle consistency, or debiasing. In this paper, we take a drastically different approach. We found that many of the “unknowns” to the learned VQA model are indeed “known” in the dataset implicitly. For instance, questions asking about the same object in different images are likely paraphrases; the number of detected or annotated objects in an image already provides the answer to the “how many” question, even if the question has not been annotated for that image. Building upon these insights, we present a simple data augmentation pipeline SimpleAug to turn this “known” knowledge into training examples for VQA. We show that these augmented examples can notably improve the learned VQA models’ performance, not only on the VQA-CP dataset with language prior shifts but also on the VQA v2 dataset without such shifts. Our method further opens up the door to leverage weakly-labeled or unlabeled images in a principled way to enhance VQA models. Our code and data are publicly available at https://github.com/heendung/simpleAUG.


Introduction
"A picture is worth a thousand words," which tells how expressive an image can be, but also how challenging it is to teach a machine to understand an image like we humans do. Visual question answering (VQA) (Antol et al., 2015;Goyal et al., 2017;Zhu Figure 1: Illustration of our approach SIMPLEAUG. We show a training image and its corresponding question-answer pairs in VQA v2 (Goyal et al., 2017), and our generated pairs. A VQA model (Anderson et al., 2018) trained on the original dataset just cannot answer these new questions on the training image correctly, and we use them to improve model training. et al., 2016) is a principled way to measure such an ability of a machine, in which given an image, a machine has to answer the image-related questions in natural language by natural language. While after years of effort, the state-of-the-art machine's performance is still behind what we expect (Hu et al., 2018;Yu et al., 2018;Anderson et al., 2018;Lu et al., 2019;Hudson and Manning, 2019).
Several key bottlenecks have been identified. In particular, a machine (i.e., VQA model) learned in the conventional supervised manner using humanannotated image-question-answer (IQA) triplets is shown to overlook the image or language contents (Agrawal et al., 2016;Goyal et al., 2017;Chao et al., 2018a), over-fit the language bias (Agrawal et al., 2018), or struggle in capturing the diversity of human language (Shah et al., 2019;Chao et al., 2018b). Many recent works thus propose to augment the original VQA task with auxiliary tasks or losses such as visual grounding (Selvaraju et al., 2019;, de-biasing (Cadene et al., 2019;Clark et al., 2019;Ramakrishnan et al., 2018), or (cycle-)consistency (Shah et al., 2019;Gokhale et al., 2020a) to address these issues.
Intrigued by these findings and solutions, we investigate the bottlenecks further and argue that they may result from a more fundamental issue: there are simply not enough training examples (i.e., IQA triplets). Concretely, most existing VQA datasets annotate each image with around ten questions, far fewer than what we humans can ask about an image. Take the popular VQA v2 dataset (Goyal et al., 2017), for instance: the trained VQA model can answer most of the training examples (on average, six questions per image) correctly. However, if we ask some more questions about the training images, e.g., by borrowing relevant questions from other training images, the same VQA model fails drastically, even though the model has indeed seen these images and questions during training (see Figure 1). Namely, the VQA model just has not learned enough through the human-annotated examples, leaving it unaware of the huge amount of visual information in an image and of how that information can be asked about in natural language.
At first glance, this seems to paint a grim picture for VQA. However, in this paper we propose to take advantage of this weakness to strengthen the VQA model: we turn implicit information already in the dataset, such as unique questions and the rich contents in the training images, into explicit IQA triplets which can be directly used by VQA models via conventional supervised learning.
We propose a simple data augmentation method, SIMPLEAUG, which relies on (i) the original image-question-answer triplets in the dataset, (ii) mid-level semantic annotations available on the training images (e.g., object bounding boxes), and (iii) pre-trained object detectors (Ren et al., 2016)¹. Concretely, we build upon the aforementioned observation that questions annotated for one image can be valuable add-ons to other relevant images, and design a series of mechanisms to "propagate" questions from one image to the others. More specifically, we search for images that contain the objects mentioned in a question and identify the answers using the information provided by (ii) and (iii), such as the number of objects, their attributes, and their presence. SIMPLEAUG requires no question generation step via templates or language models (Kafle et al., 2017), bypassing the problems of limited diversity or artifacts. Besides, SIMPLEAUG is completely detached from the training phase of a VQA model and is therefore model-agnostic, making it simple to apply to improve VQA models.
We validate SIMPLEAUG on two datasets, VQA v2 (Goyal et al., 2017) and VQA-CP (Agrawal et al., 2018). The latter is designed to evaluate VQA models' generalizability under language bias shifts. With SIMPLEAUG, we can not only achieve comparable gains to other existing methods on VQA-CP, but also boost the accuracy on VQA v2, demonstrating the applicability of our method. We note that many of the prior works designed for VQA-CP indeed degrade the accuracy on VQA v2, which does not have language bias shifts between training and test data. SIMPLEAUG further justifies that mid-level vision tasks like object detection can effectively benefit high-level vision tasks like VQA.
In summary, our contributions are as follows:
• We propose SIMPLEAUG, a simple and model-agnostic data augmentation method that turns information already in the datasets into explicit IQA triplets for training VQA models.
• We provide comprehensive analyses of SIMPLEAUG, including its applicability to weakly-labeled and unlabeled images.
Related Work

VQA datasets. More than a dozen VQA datasets have been released (Lin et al., 2014; Antol et al., 2015; Zhu et al., 2016; Goyal et al., 2017; Krishna et al., 2017; Hudson and Manning, 2019; Gurari et al., 2019). Most of them use natural images from large-scale image databases, e.g., MSCOCO (Lin et al., 2014). For each image, human annotators are asked to generate questions (Q) and provide the corresponding answers (A). Such annotations, however, can hardly cover all the knowledge in the visual contents.
Leveraging side information for VQA. A variety of side information beyond the IQA triplets has been used to improve VQA models. For example, human attention is used to enhance the explainability and visual grounding of VQA models (Patro and Namboodiri, 2018; Selvaraju et al., 2019; Das et al., 2017). Image captions contain substantial visual information and can be used in an auxiliary task (i.e., visual captioning) to strengthen VQA models' visual and language understanding (Kim and Bansal, 2019; Wang et al., 2021; Banerjee et al., 2020; Karpathy and Fei-Fei, 2015). Several papers leveraged scene graphs and visual relationships as auxiliary knowledge for VQA (Johnson et al., 2017; Hudson and Manning, 2019; Shi et al., 2019). A few works utilized mid-level vision tasks (e.g., object detection and segmentation) to benefit VQA (Gan et al., 2017; Kafle et al., 2017). Most of these works use side information by adding auxiliary learning tasks or losses to the original VQA task. In contrast, we directly turn the information into IQA triplets for training.
Data augmentation for VQA. Several existing works investigate data augmentation. One stream of works creates new triplets by manipulating images or questions (Chen et al., 2020; Agarwal et al., 2020; Tang et al., 2020; Gokhale et al., 2020a); see § 4.1.4 for more details. The other creates more questions by using a learned language model to paraphrase sentences (Ray et al., 2019; Shah et al., 2019; Whitehead et al., 2020; Kant et al., 2021; Banerjee et al., 2020) or by learning a visual question generation model (Kafle et al., 2017; Li et al., 2018; Krishna et al., 2019). The closest to ours is the pioneering work on data augmentation by Kafle et al. (2017), which also creates new questions by using mid-level semantic information annotated by humans (e.g., object bounding boxes). Their question generation relies on either pre-defined templates or a learned language model, which may suffer from limited diversity or labeling noise. In contrast, we directly reuse questions already in the dataset and show that they are sufficient to augment high-quality questions for other images. Besides, we further explore machine-generated annotations (e.g., via an object detector (Ren et al., 2016)), opening the door to augmenting triplets using extra unlabeled images. Furthermore, we benchmark our method on the popular VQA v2 (Goyal et al., 2017) and the challenging VQA-CP (Agrawal et al., 2018) datasets, which were released after the publication of Kafle et al. (2017). Overall, we view our paper as an attempt to revisit simple data augmentation like Kafle et al. (2017) for VQA and to show that it is indeed quite effective.
Robustness of VQA models. Goyal et al. (2017); Chao et al. (2018a); Agrawal et al. (2018) pointed out the existence of superficial correlations (e.g., language bias) in the datasets and showed that a VQA model can simply exploit them to answer questions. Existing works addressing this can be categorized into three groups. The first group attempts to reduce the language bias by designing new VQA models or learning strategies (Agrawal et al., 2018; Ramakrishnan et al., 2018; Cadene et al., 2019; Clark et al., 2019; Grand and Belinkov, 2019; Jing et al., 2020; Niu et al., 2021; Gat et al., 2020). For example, RUBi and Ensemble (Cadene et al., 2019; Clark et al., 2019) explicitly modeled the question-answer correlations to encourage VQA models to explore other patterns in the data that are more likely to generalize. The second group leverages side information to facilitate visual grounding (Selvaraju et al., 2019; Teney and Hengel, 2019). For example, Wu and Mooney (2019) used extra visual or textual annotations to determine the important regions on which a VQA model should focus. The third group implicitly or explicitly augments the VQA datasets, e.g., via self-supervised learning, counterfactual sampling, adversarial training, or image and question manipulation (Abbasnejad et al., 2020; Teney et al., 2020; Chen et al., 2020; Gokhale et al., 2020a; Liang et al., 2020; Gokhale et al., 2020b; Li et al., 2020; Ribeiro et al., 2019; Selvaraju et al., 2020). SIMPLEAUG belongs to the third group but is simpler in methodology. Besides, SIMPLEAUG is completely detached from VQA model training and thus model-agnostic. Moreover, SIMPLEAUG improves on both VQA-CP (Agrawal et al., 2018) and VQA v2 (Goyal et al., 2017).

SIMPLEAUG for Data Augmentation
Implicit information

SIMPLEAUG leverages three sources of information that implicitly suggest extra IQA triplets beyond those provided in a VQA dataset. The first is the original IQA triplets in the dataset: we find that for two similar images that locally share common objects or globally share common layouts, their annotated questions can be treated as paraphrases or extra questions for each other. The second is the object instance labels, such as object bounding boxes, annotated on images, e.g., MSCOCO images (Lin et al., 2014); these labels provide accurate answers to "how many" questions and some of the "what" questions, and many VQA datasets are built upon MSCOCO images. The third is an object detector pre-trained on an external densely-annotated dataset like Visual Genome (VG) (Krishna et al., 2017); this detector can provide information not commonly annotated on images, such as attributes or fine-grained class labels.

The SIMPLEAUG pipeline
SIMPLEAUG processes each annotated IQA triplet (i, q, a) in turn and propagates q to other relevant images. To begin with, SIMPLEAUG extracts meaningful words from the question, similar to (Chen et al., 2020; Wu and Mooney, 2019). We leverage the spaCy part-of-speech (POS) tagger (Honnibal and Montani, 2017) to extract nouns and consider both their singular and plural forms. We remove words such as "picture" or "photo", which appear in many questions but are not informative for VQA².
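The extraction step above can be sketched as follows, assuming the question has already been POS-tagged (spaCy would supply the tags in practice); the stop-noun list and the pluralization heuristic are simplified assumptions, not the paper's exact rules:

```python
# Sketch of SimpleAug's noun extraction (Sec. 3.2), operating on
# (token, POS) pairs as a POS tagger like spaCy would produce them.
# STOP_NOUNS and inflections() are simplified assumptions.

STOP_NOUNS = {"picture", "photo", "image"}  # usually refer to the image itself

def inflections(noun):
    """Return a small set of singular/plural surface forms (naive heuristic)."""
    forms = {noun}
    if noun.endswith("s"):
        forms.add(noun[:-1])
    else:
        forms.add(noun + "s")
    return forms

def extract_nouns(tagged_question):
    """Keep informative nouns from (token, POS) pairs and add their inflections."""
    nouns = set()
    for token, pos in tagged_question:
        word = token.lower()
        if pos == "NOUN" and word not in STOP_NOUNS:
            nouns |= inflections(word)
    return nouns

# Example: "Is there a cat on the pillow?"
tagged = [("Is", "AUX"), ("there", "PRON"), ("a", "DET"), ("cat", "NOUN"),
          ("on", "ADP"), ("the", "DET"), ("pillow", "NOUN"), ("?", "PUNCT")]
print(sorted(extract_nouns(tagged)))  # ['cat', 'cats', 'pillow', 'pillows']
```

The resulting noun set is what the propagation rules below match against image labels.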
Given the meaningful "nouns" of a question, we then retrieve relevant images and derive the answers, using MSCOCO annotations or Faster R-CNN detection. Concretely, we split questions into four categories and develop specific question propagation rules. Figure 2 illustrates the pipeline.
Yes/No questions. We apply the Faster R-CNN detector trained on VG to each image i′ besides image i. The detector returns a set of bounding boxes and their labels. We ignore images whose object labels have no overlap with the nouns of question q, and assign the answer "yes" or "no" to the remaining images as follows.
• "Yes": if the labels of image i cover all nouns 2 For example, in a question, "What is the person doing in the picture?", the word "picture" refers to the image itself, not an object within it. We found 8% of the questions like this in VQA v2 (Goyal et al., 2017). Some questions really refer to "pictures" or "photos" within an image (e.g., "How many pictures on the wall?"), but there are <1% such questions. of question q, we create (i , q, yes).
• "No": if the labels of image i only cover some of the nouns of q, we create (i , q, no). For instance, if the question is "Is there a cat on the pillow?" but the image only contains "pillow" but no "cat", then the answer is "no". We develop two verification strategies at the end of this subsection to filter out outlier cases. Color questions. To prevent ambiguous cases, we only consider questions with a single noun (besides the word "color"). We again apply the Faster R-CNN detector, which returns for each image a set of object labels that may also contain attributes like colors. We keep images whose labels cover the noun of question q. For each such image i , we create a triplet (i , q,â), whereâ is the color attribute provided by the detector. As there are likely some other object-color pairs in i , we investigate replacing the noun in q by each detected object name and create some more IQA triplets about colors. Number questions. We again focus on questions with a single noun (besides the word "number"). We use MSCOCO annotations, which give each image a set of object bounding boxes and labels. We find all the images whose labels cover the noun of question q. For each such image i , we derive the answer by counting annotated instances of that noun and create a triplet (i , q,â), whereâ is the count. Some of the nouns (e.g., "animal") are supercategories of sub-category objects (e.g., "dog" or "cat"). Thus, if the noun of q is a super-category (e.g., q is "How many animals are there?"), we follow the category hierarchy provided by MSCOCO and count all its sub-category instances. Other questions. We focus on "what" questions with a single noun and use MSCOCO annotations. We find all the images whose labels cover the noun of question q. (We also take the super-category cases into account.) For each such image i , we check whether its MSCOCO labels contain the answer a of question q, i.e., according to the original (i, q, a) triplet. 
For instance, if q is "What animal is this?" and a is "sheep", then we check if image i 's labels contain "sheep". If yes, we create a triplet (i , q,â = a). This process essentially discovers "what" questions that can indeed be asked about i .
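A minimal sketch of the Yes/No and Number rules described above, assuming each candidate image has been reduced to a list of object labels (from MSCOCO annotations or a detector); the exact matching logic in the paper also handles inflections and super-categories:

```python
# Sketch of two SimpleAug propagation rules (Sec. 3.2): Yes/No answers
# from label coverage, and Number answers from instance counts.
from collections import Counter

def propagate_yes_no(question_nouns, image_labels):
    """Return 'yes', 'no', or None (skip) for a Yes/No question on a new image."""
    labels = set(image_labels)
    if not (question_nouns & labels):
        return None            # no overlap with the question's nouns: skip image
    if question_nouns <= labels:
        return "yes"           # all mentioned objects are present
    return "no"                # only some mentioned objects are present

def propagate_number(noun, image_labels):
    """Answer a 'how many <noun>' question by counting annotated instances."""
    count = Counter(image_labels)[noun]
    return str(count) if count > 0 else None   # skip images without the noun

labels = ["pillow", "pillow", "sofa"]
print(propagate_yes_no({"cat", "pillow"}, labels))  # 'no' (pillow but no cat)
print(propagate_number("pillow", labels))           # '2'
```

The Color and Other rules follow the same retrieve-then-derive pattern, reading the answer from detector attributes or from the original triplet's answer instead of a count.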
Verification. The above rules simplify a question by only looking at its nouns, so they may lead to triplets whose answers are incorrect. To mitigate this issue, we develop two verification strategies.
The first strategy performs self-verification on the original (i, q, a) triplet, checking whether our rules can reproduce it. That is, it applies the aforementioned rules to image i to derive a new triplet (i, q, â). If â does not match a, i.e., the rules produce a different answer, we skip this question q.
The second strategy verifies our rules using IQA triplets annotated on the retrieved image i′. For example, if image i′ has an annotated triplet (i′, q′, a′) whose q′ has the same category and nouns as q, then we compare it to the created triplet (i′, q, â). If â does not match a′, we disregard (i′, q, â).
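The two strategies can be sketched as follows; the lookup-table `rule` is a hypothetical stand-in for the propagation rules applied to a single image:

```python
# Sketch of SimpleAug's two verification strategies (Sec. 3.2).

def self_verify(rule, image, question, annotated_answer):
    """Strategy 1: the rule must reproduce the original triplet (i, q, a)."""
    return rule(image, question) == annotated_answer

def cross_verify(rule, image2, question, annotated_triplet):
    """Strategy 2: the created triplet (i', q, a-hat) must agree with an
    annotated triplet (i', q', a') whose q' matches q's category and nouns."""
    _, _, a2 = annotated_triplet
    return rule(image2, question) == a2

# Hypothetical counting rule backed by a lookup table (for illustration only).
counts = {"img1": "2", "img2": "3"}
rule = lambda img, q: counts[img]

print(self_verify(rule, "img1", "How many pillows?", "2"))   # True: keep q
print(cross_verify(rule, "img2", "How many pillows?",
                   ("img2", "How many pillows are there?", "4")))  # False: drop
```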

Paraphrasing by similar questions
Besides the four question propagation rules that look at the image contents, we also investigate a simple paraphrasing rule that searches for similar questions in the dataset. Concretely, we encode each question with the averaged word features from BERT (Devlin et al., 2018), as these better capture the object-level semantics needed to find questions mentioning the same objects. Two questions are similar if their cosine similarity is above a certain threshold (0.98 in our experiments). If two IQA triplets (i, q, a) and (i′, q′, a′) have similar questions, we create two extra triplets (i, q′, a) and (i′, q, a′) by switching their questions as paraphrases.
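A minimal sketch of this rule, with tiny toy word vectors standing in for the BERT features used in the paper:

```python
# Sketch of the paraphrasing rule (Sec. 3.3): each question is encoded by
# averaging per-word features (BERT features in the paper; toy vectors
# here), and two questions count as paraphrases when their cosine
# similarity exceeds the 0.98 threshold.
import numpy as np

def encode(question, word_vecs):
    """Average the word vectors of a question (unknown words are skipped)."""
    vecs = [word_vecs[w] for w in question.lower().split() if w in word_vecs]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def are_paraphrases(q1, q2, word_vecs, threshold=0.98):
    return cosine(encode(q1, word_vecs), encode(q2, word_vecs)) >= threshold

# Toy embeddings: "cat" and "kitten" are nearly identical vectors.
wv = {w: np.array(v, dtype=float) for w, v in {
    "what": [1, 0, 0], "color": [0, 1, 0], "is": [.5, .5, 0],
    "the": [.4, .6, 0], "cat": [0, 0, 1], "kitten": [0, .05, 1],
    "how": [0, 1, 1], "many": [1, 1, 0], "cats": [0, 0, 1],
    "are": [.5, .5, .5], "there": [.3, .3, .3]}.items()}

print(are_paraphrases("what color is the cat", "what color is the kitten", wv))  # True
print(are_paraphrases("what color is the cat", "how many cats are there", wv))   # False
```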
We choose a high threshold (0.98) to avoid false positives. On average, each question finds 11.4 similar questions, and we only pick the top-3. We found that with this design, an extra verification step, such as checking whether image i contains the nouns of the paraphrased question q′, does not further improve the overall VQA accuracy. Thus, we do not include an extra verification step for paraphrasing.


Experiments

Datasets. VQA v2 (Goyal et al., 2017) is built upon MSCOCO images (Lin et al., 2014) and uses the same training/validation/testing splits. On average, six questions are annotated for each image. In total, VQA v2 has 444K/214K/448K training/validation/test IQA triplets. VQA-CP v2 (Agrawal et al., 2018) is a challenging adversarial split of VQA v2 designed to evaluate a model's capability of handling language bias/prior shifts between training and testing. For instance, "white" is the most frequent answer for questions that start with "What color..." in the training set, whereas "black" is the most common one in the test set. Such prior changes also appear in individual questions, e.g., the most common answer for "What color is the banana?" changes from "yellow" during training to "green" during testing. VQA-CP v2 has 438K/220K training/test IQA triplets.

Evaluation metrics. We follow the standard evaluation protocol (Antol et al., 2015; Goyal et al., 2017). For each test triplet, the predicted answer is compared with the answers provided by ten human annotators in a leave-one-annotator-out fashion for robust evaluation. We report the average score over all test triplets, as well as over the test triplets of the Yes/No, number, and other answer types.
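The leave-one-annotator-out protocol can be sketched as follows: the predicted answer is scored against each subset of nine annotators with min(#matches / 3, 1), and the ten subset scores are averaged:

```python
# Sketch of the standard VQA accuracy (Antol et al., 2015): a prediction
# is scored against 10 human answers by averaging, over the ten
# leave-one-annotator-out subsets, min(#matches among the other 9 / 3, 1).

def vqa_accuracy(pred, human_answers):
    assert len(human_answers) == 10
    scores = []
    for left_out in range(10):
        rest = human_answers[:left_out] + human_answers[left_out + 1:]
        scores.append(min(rest.count(pred) / 3.0, 1.0))
    return sum(scores) / 10.0

humans = ["yellow"] * 2 + ["green"] * 8
print(vqa_accuracy("yellow", humans))  # ~0.6: partial credit for a minority answer
print(vqa_accuracy("green", humans))   # 1.0: at least 3 of 9 annotators agree
```

An answer matching at least three of the remaining nine annotators thus gets full credit, which makes the metric robust to annotator disagreement.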

Implicit knowledge sources
MSCOCO annotations (Lin et al., 2014). MSCOCO is the most popular benchmark nowadays for object detection and instance segmentation; it contains 80 categories (e.g., "cat") as well as the corresponding super-categories (e.g., "animal"). Object instances of all 80 categories are exhaustively annotated in all images, leading to approximately 1.2 million instance annotations.

Faster R-CNN detector pre-trained on Visual Genome (Krishna et al., 2017). This pre-trained detector can provide object attributes (e.g., color and material), whereas MSCOCO annotations only contain object names (e.g., "person" and "bicycle"). We use the detector provided by Anderson et al. (2018), which detects 36 objects per image.

Base VQA models
SIMPLEAUG is model-agnostic, and we evaluate it by using its generated data to augment the training set for training three base VQA models.
Bottom-Up Top-Down (UpDn) (Anderson et al., 2018). UpDn is a widely used VQA model. It first detects objects from an image and encodes them into visual feature vectors. Given a question, UpDn uses a question encoder to produce a set of word features. Both visual and language features are then fed into a multi-modal attention network to predict the answer.
Learned-Mixin+H (LMH) (Clark et al., 2019). LMH is a learning strategy to de-bias a VQA model, e.g., UpDn. During training, LMH uses an auxiliary question-only model to encourage the VQA model to explore visual-question related information. During testing, only the VQA model is used. LMH is shown to largely improve the performance on VQA-CP v2 but can hurt that on VQA v2.
LXMERT (Tan and Bansal, 2019). We also study SIMPLEAUG with a stronger, transformer-based VQA model named LXMERT. LXMERT leverages multi-modal transformers to extract multi-modal features, and exploits a masking mechanism to better (pre-)train the model. While such a masking mechanism can be viewed as a way of data augmentation, SIMPLEAUG is fundamentally different from it in two aspects. First, SIMPLEAUG generates new triplets while masking manipulates existing triplets. Second, SIMPLEAUG is detached from model training and is therefore compatible with masking. As will be shown in the experimental results, SIMPLEAUG can provide solid gains to LXMERT on both VQA v2 and VQA-CP.

Compared data augmentation methods
We compare SIMPLEAUG with three existing data augmentation methods for VQA.
Template-based augmentation proposed by Kafle et al. (2017) generates new question-answer pairs using MSCOCO annotations (cf. § 2). We reimplement the method following the paper.
Counterfactual Samples Synthesizing (CSS) (Chen et al., 2020) generates counterfactual triplets by masking critical objects in images or words in questions and assigning different answers. These new training examples force the VQA model to focus on those critical objects and words, improving both visual explainability and question sensitivity.

Implementation details
Data augmentation by SIMPLEAUG. We speed up the implementation by grouping IQA triplets of the same unique question and only propagating the question once. We remove redundant triplets if the retrieved image already has the same question. To prevent creating too many triplets from paraphrasing (§ 3.3), for each question q we only search for its top-3 similar questions q′ and only create (i′, q, a′) if a′ is a rare answer to q; we define a′ to be a rare answer if there are fewer than five (q, a′) pairs in the dataset. We emphasize that we only apply SIMPLEAUG to IQA triplets in the training set and only search images in the training set.

VQA models. For the base VQA models, we use the released code from the corresponding papers. Please see the Appendix for more details. The rationale of training with multiple stages is to prevent the augmented data from dominating the training.

Main results on VQA v2 and VQA-CP v2
Table 2 summarizes the main results on VQA v2 val and VQA-CP v2 test. We experiment with SIMPLEAUG on different base VQA models and compare it to state-of-the-art methods. SIMPLEAUG achieves consistent gains over the base models on all answer types (columns). When paired with LXMERT, SIMPLEAUG obtains the highest accuracy on both datasets, except against MUTANT (loss), which applies extra losses besides data augmentation.
SIMPLEAUG is model-agnostic. SIMPLEAUG can be directly applied to other VQA models. Besides UpDn, in Table 2 we show that SIMPLEAUG leads to consistent gains for two additional VQA models. LMH is a de-biasing method for UpDn, which however hurts the accuracy on VQA v2; with SIMPLEAUG, LMH improves largely on VQA v2. LXMERT is a strong transformer-based VQA model, and SIMPLEAUG can also improve it.

Question propagation vs. paraphrasing. The propagated questions in Figure 1 and Figure 3 ask about image contents different from the original questions. In contrast, paraphrasing only rephrases the original questions of an image. As shown in Table 2, question propagation generally leads to better performance, especially on "Num" and "Other" answers, suggesting the importance of creating additional questions to cover image contents more exhaustively.
On verification for question propagation. Table 3 compares SIMPLEAUG (propagation) with and without the verification strategies (cf. § 3.2). Verification improves accuracy in nearly all cases. We report results on VQA-CP v2, using the UpDn model.

Multiple-stage training. In Table 4, we compare the three training strategies with original triplets (O) and augmented triplets (A). We also train on O for multiple stages (i.e., more epochs) for a fair comparison. O → A → O in general outperforms the others, and we attribute this to the clear separation of clean and noisy data: the last training stage may correct noisy information learned in earlier stages.
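The O → A → O schedule can be sketched as follows; `train_stage` is a hypothetical stand-in for one stage of the model's usual training loop, and the key point is which data pool each stage sees:

```python
# Sketch of multi-stage training: 'O' stages use the original clean
# triplets, 'A' stages the SimpleAug-generated ones. Ending on 'O' lets
# clean data have the final say, correcting noise picked up from the
# augmented triplets.

def multi_stage_training(model, original, augmented, train_stage,
                         schedule=("O", "A", "O")):
    for stage in schedule:
        data = original if stage == "O" else augmented
        model = train_stage(model, data)
    return model

# Toy run that just records the size of the pool each stage consumed.
original = ["orig_triplet"] * 3
augmented = ["aug_triplet"] * 5
sizes = []
def toy_train(model, data):
    sizes.append(len(data))
    return model

multi_stage_training(None, original, augmented, toy_train)
print(sizes)  # [3, 5, 3]
```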
Training with SIMPLEAUG triplets alone. We further investigate training the UpDn model with the augmented triplets alone (A). On VQA v2, we get 39.62% overall accuracy, worse than the baseline trained with the original data (63.48%). This is likely due to the noise in the augmented data. On VQA-CP, we get 51.60%, much better than the baseline (39.74%) but worse than training with both augmented and original triplets (52.65%). We surmise that SIMPLEAUG triplets help mitigate the language bias shifts in VQA-CP. Please see the Appendix for a human study assessing the quality of the triplets augmented by SIMPLEAUG.
Effects of augmentation types. We experiment with propagating each question type alone on VQA-CP v2, using UpDn as the base model. In Table 5, we show the separate results of SIMPLEAUG with different question types. The augmented questions notably improve the corresponding answer type.

SIMPLEAUG in additional scenarios
We explore SIMPLEAUG in scenarios where there are (i) limited questions per image, and (ii) weakly-labeled or unlabeled images.

Learning with limited triplets. We randomly keep a fraction of the annotated QA pairs for each training image on VQA-CP v2. Table 6 shows that even under this annotation-scarce setting (e.g., only 10% of QA pairs kept), SIMPLEAUG is already effective, outperforming the baseline UpDn model trained with all the data. This demonstrates the robustness of SIMPLEAUG in the challenging setting with limited triplets.
Learning with weakly-labeled or unlabeled images. We simulate these scenarios by keeping the QA pairs for a fraction of images (i.e., labeled data) and removing the QA pairs entirely for the other images. Conventionally, a VQA model cannot benefit from images without QA pairs, but SIMPLEAUG can leverage them by propagating questions to them. Specifically, for images without QA pairs, we consider two cases: we either keep their MSCOCO object instance annotations (i.e., weakly-labeled data) or rely completely on object detectors (i.e., unlabeled data). Table 7 shows the results, in which we only apply SIMPLEAUG to the weakly-labeled and unlabeled images. SIMPLEAUG yields consistent improvements, opening up the possibility of leveraging additional images to improve VQA.

Qualitative results
We show a training image and its augmented QA pairs by SIMPLEAUG in Figure 3. A VQA model trained on the original IQA triplets cannot answer many of the newly generated questions, even if the image is in the training set, showing the necessity to include them for training a stronger model. More qualitative results can be found in Appendix.

Conclusion
We proposed SIMPLEAUG, a data augmentation method for VQA that can turn information already in the datasets into explicit IQA triplets for training. SIMPLEAUG is simple but by no means trivial. First, it justifies that mid-level vision tasks like object detection can effectively benefit VQA. Second, we will probably never be comprehensive enough in annotating data, and SIMPLEAUG can effectively turn what we have at hand (i.e., "knowns") into examples a VQA model would not otherwise have known (i.e., "unknowns").

Figure 3: Qualitative results. We show a training image and its QA pairs from VQA-CP, and the QA pairs generated by SIMPLEAUG. ✓/✗ indicates whether the baseline VQA model (trained without SIMPLEAUG) answers correctly/incorrectly. Among the augmented QA pairs, the first three are from question propagation and the last one is from paraphrasing.

A.3 Additional details of SIMPLEAUG
As mentioned in the main paper, for each annotated IQA triplet (i, q, a) in the dataset, SIMPLEAUG propagates q to other relevant images. To begin with, we find unique questions by filtering out duplicate sentences. We then extract meaningful words from the unique questions, in line with (Chen et al., 2020; Wu and Mooney, 2019). Concretely, we remove the question type from q and then apply the spaCy part-of-speech (POS) tagger (Honnibal and Montani, 2017) to extract nouns. To handle synonyms, we further consider the singular/plural forms and super-categories of the nouns⁶. Moreover, we remove non-informative words (e.g., "picture" or "photo") from the sentence. For example, in the question "What is the man doing in the picture?", the word "picture" refers to the image itself, not any specific object. Around 8% of the triplets in VQA v2 (Goyal et al., 2017) are like this. While it is possible that both the sentence and the image contain such contents (e.g., "How many pictures on the wall?"), there are <1% such questions.

B Results on GQA Dataset
We further conduct a preliminary study of SIMPLEAUG on the popular GQA dataset (Hudson and Manning, 2019), which focuses on compositional VQA and consists of 22M questions about various day-to-day images. Each image in GQA is associated with a scene graph (Johnson et al., 2015) that describes the objects, attributes, and relationships. We focus on binary questions (35% of all questions) and propagate a question q to an image i according to the image's scene graph. In particular, we leverage the semantic type of the question (e.g., "attribute", "relation") together with the scene graph to generate the answer. For example, if q asks whether an object has a certain attribute, we check the scene graph node of that object to determine the answer. SIMPLEAUG improves the accuracy of UpDn from 56.06% to 56.52%, supporting its generalizability and applicability.
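A minimal sketch of this rule for binary attribute questions, assuming a scene graph that maps object names to attribute sets (the schema here is illustrative, not GQA's exact format):

```python
# Sketch of the GQA propagation rule for binary "attribute" questions:
# the answer is read directly off the image's scene graph.

def answer_attribute_question(scene_graph, obj, attribute):
    """Answer 'Is the <obj> <attribute>?' from the image's scene graph."""
    if obj not in scene_graph:
        return None                      # object absent: cannot propagate q
    return "yes" if attribute in scene_graph[obj] else "no"

# Toy scene graph: object name -> set of attributes.
graph = {"apple": {"red", "shiny"}, "table": {"wooden"}}
print(answer_attribute_question(graph, "apple", "red"))    # 'yes'
print(answer_attribute_question(graph, "table", "metal"))  # 'no'
print(answer_attribute_question(graph, "dog", "brown"))    # None
```

"Relation"-type questions would analogously check the scene graph's edges instead of the node's attribute set.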

C Human Evaluation on SIMPLEAUG Triplets
SIMPLEAUG requires no sentence/image generation steps, and thus all examples are natural annotations from humans, largely alleviating the artificial noise that previous methods may introduce. To further evaluate the quality of the augmented triplets, we randomly select 500 images and pick 5 augmented QA pairs per image, one from each type (4 by propagation: Y/N, Num, Other, Color; and 1 by paraphrasing). For these 2,500 triplets, we ask 5 different crowd workers to evaluate the "relatedness (1/0)" of the augmented questions and the "correctness (1/0)" of the answer given the question and image. That is, if the question makes sense for the corresponding image, rate 1, otherwise 0; if the answer is correct, rate 1, otherwise 0 (see Figure 4). Table 8 shows the human study results. The average relatedness/correctness is 86.75%/67.60% for propagation and 80.80%/64.40% for paraphrasing. We note that the generated data are based on human-annotated questions in the dataset; therefore, there are no artifacts in the questions.
Figure 4: Examples in the human study. For each image, we pick 5 triplets created by SIMPLEAUG (4 by propagation: Y/N, Num, Other, Color; and 1 by paraphrasing) and ask crowd workers to evaluate each IQA triplet by the question's relatedness (1/0) to the image and the answer's correctness (1/0) given the image and question.

Figure 5: Additional qualitative results on VQA-CP. We show the original image, the QA pairs generated by SIMPLEAUG, and the original QA pairs. ✓/✗ indicates whether the baseline VQA model (trained without SIMPLEAUG) predicts correctly/incorrectly.