COVR: A Test-Bed for Visually Grounded Compositional Generalization with Real Images

While interest in models that generalize at test time to new compositions has risen in recent years, benchmarks in the visually-grounded domain have thus far been restricted to synthetic images. In this work, we propose COVR, a new test-bed for visually-grounded compositional generalization with real images. To create COVR, we use real images annotated with scene graphs, and propose an almost fully automatic procedure for generating question-answer pairs along with a set of context images. COVR focuses on questions that require complex reasoning, including higher-order operations such as quantification and aggregation. Due to the automatic generation process, COVR facilitates the creation of compositional splits, where models at test time need to generalize to new concepts and compositions in a zero- or few-shot setting. We construct compositional splits using COVR and demonstrate a myriad of cases where state-of-the-art pre-trained language-and-vision models struggle to compositionally generalize.


Introduction
Models for natural language understanding (NLU) have exhibited remarkable generalization abilities on many tasks, when the training and test data are sampled from the same distribution. But such models still lag far behind humans when asked to generalize to an unseen combination of known concepts, and struggle to learn concepts for which only a few examples are provided (Finegan-Dollak et al., 2018; Bahdanau et al., 2019a). Humans, conversely, do this effortlessly: for example, once humans learn the meaning of the quantifier "all", they can easily understand the utterance "all cheetahs have spots" if they know what "cheetahs" and "spots" mean (Chomsky, 1957; Montague, 1970; Fodor and Pylyshyn, 1988). This ability, termed compositional generalization, is crucial for building models that generalize to new settings (Lake et al., 2018). In recent years, multiple benchmarks have been created, illustrating that current NLU models fail to generalize to new compositions. However, these benchmarks focused on semantic parsing, the task of mapping natural language utterances to logical forms (Lake and Baroni, 2018; Kim and Linzen, 2020; Keysers et al., 2020). Visual question answering (VQA) is arguably a harder task from the perspective of compositional generalization, since the model needs to learn to compositionally "execute" the meaning of the question over images, without being exposed to an explicit meaning representation. For instance, in Fig. 1, a model should learn the meaning of the quantifier "all" from the first example, and the meaning of a conjunction of clauses from the second example, and then execute both operations compositionally at test time.
Existing VQA datasets for testing compositional generalization typically use synthetic images and contain a limited number of visual concepts and reasoning operators (Bahdanau et al., 2019a,b; Ruis et al., 2020), or focus on generalization to unseen lexical constructions rather than unseen reasoning skills (Hudson and Manning, 2019; Agrawal et al., 2017). Other datasets such as GQA (Hudson and Manning, 2018) use real images with synthetic questions, but lack logical operators available in natural language datasets, such as quantifiers and aggregations, and contain "reasoning shortcuts", due to a lack of challenging image distractors (Chen et al., 2020b).
In this work, we present COVR (COmpositional Visual Reasoning), a test-bed for visually-grounded compositional generalization with real images. We propose a process for automatically generating complex questions over sets of images (Fig. 2), where each example is annotated with the program corresponding to its meaning. We take images annotated with scene graphs (Fig. 2a) from GQA, Visual Genome (Krishna et al., 2016), and imSitu (Yatskar et al., 2016), automatically collect both similar and distracting images for each example, and filter incorrect examples due to errors in the source scene graphs (Fig. 2b,c). We then use a template-based grammar to generate a rich set of complex questions that contain multi-step reasoning and higher-order operations on multiple images (Fig. 2d). To further enhance the quality of the dataset, we manually validate the correctness of development and test examples through crowdsourcing and paraphrase the automatically generated questions into fluent English for a subset of the automatically generated dataset (Fig. 2e). COVR contains 262k examples based on ∼89k images, with 13.9k of the questions manually validated and paraphrased.
Our automatic generation process allows for the easy construction of compositional data splits, where models must generalize to new compositions, and is easily extendable with new templates and splits. We explore both the zero-shot setting, where models must generalize to new compositions, and the few-shot setting, where models need to learn new constructs from a small number of examples.
We evaluate state-of-the-art pre-trained models on a wide range of compositional splits, and expose generalization weaknesses in 10 out of the 21 setups, where the generalization score we define is low (0%-70%). Moreover, results show it is not trivial to characterize the conditions under which generalization occurs, and we conjecture generalization is harder when it requires that the model learn to combine complex/large structures. We encourage the community to use COVR to further explore compositional splits and investigate visually-grounded compositional generalization.

Related Work
Prior work on grounded compositional generalization has typically tested generalization in terms of lexical and syntactic constructions, while we focus on compositions in the space of meaning by testing unseen program compositions. For example, the "structure split" in Hudson and Manning (2019) splits examples based on their surface form. As a result, identical programs are found in both the training and the test splits. The "content split" from the same work tests generalization to unseen lexical concepts. This is a different kind of skill than the one we address, where we assume the model sees all required concepts during training. C-VQA (Agrawal et al., 2017) splits are based only on question-answer pairs and not on the question meaning. CLEVR-CoGenT (Johnson et al., 2017) tests for unseen combinations of attributes, and thus focuses more on visual generalization. Other datasets that do split samples based on their programs (Bahdanau et al., 2019a,b; Ruis et al., 2020) use synthetic images with a small set of entities and relations (≤ 20). GQA (Hudson and Manning, 2018) uses real images with synthetic questions, which in theory could be used to create compositional splits. However, our work uses multiple images, which allows testing reasoning operations over sets and not only over entities, such as counting, quantification, and comparisons. Thus, the space of possible compositional splits in COVR is much larger than in GQA. If we anonymize lexical items in GQA programs (i.e., replace them with a single placeholder), the number of distinct programs (79) is too low to create a rich set of splits where all operators appear in the training set. In contrast, COVR contains 640 anonymized programs, allowing us to create a large number of splits. Moreover, our process for finding high-quality distracting images mitigates issues with reasoning shortcuts, where models solve questions due to a lack of challenging image distractors (Chen et al., 2020b).
Finally, other VQA datasets with real images and questions that were generated by humans (Suhr et al., 2019;Antol et al., 2015) do not include a meaning representation for questions and thus cannot be easily used to create compositional splits.

Dataset Creation
The goal of COVR is to facilitate the creation of VQA compositional splits, with questions that require a high degree of compositionality on both the textual and visual input.
Task definition Examples in COVR are (q, I, a) triples, where q is a complex question, I is a set of images, and a is the expected answer. Unlike most visually-grounded datasets, which contain 1-2 images, each example in COVR contains up to 5 images. This allows us to (a) generate questions with higher-order operators, and (b) detect good distracting images. Also, questions are annotated with programs corresponding to their meaning, which enables creating compositional splits.

Figure 3: An example subgraph that shows all supported nodes, referring to "standing man uncorking a wine bottle with a corkscrew". The types of the nodes are inside the circle, and their name above it.
High-level overview Fig. 2 provides an overview of the data generation process. Given an image and its annotated scene graph describing the image objects and their relations, we iterate through a set of subgraphs. For example, in Fig. 2a the subgraph corresponds to "a man is catching a frisbee and wearing jeans". Next, we sample images with related subgraphs: (a) images that contain the same subgraph (Fig. 2b), and (b) images that contain a similar subgraph, to act as distracting images (e.g., images with a man catching a ball or a woman catching a frisbee, Fig. 2c). To ensure the quality of the distracting images, we propose models for automatic filtering and validation (§3.2). Next (Fig. 2d), we instantiate questions from a template-based grammar by filling slots with values computed from the selected subgraphs, and automatically obtain a program for each question. Last, we balance the answers and question types, and use crowdsourcing to manually validate and provide fluent paraphrases for the evaluation set and a subset of the training set (Fig. 2e).

Extracting Subgraphs
Table 1: Template types and example questions.
Type | Example
VERIFYATTR | Is the sink that is below a towel white?
CHOOSEATTR | Is the man that is wearing a jersey running or looking up?
QUERYATTR | What is the color of the cat that is on a black floor?
COMPARECOUNT | There are more coffee tables that are in living room than couches that are in living room
COUNT | How many people are erasing a mark from a paper?
VERIFYCOUNT | There is at most 1 cup that is behind a man that is wearing a jacket
COUNTGROUPBY | How many images contain exactly 2 men that are in water?
VERIFYCOUNTGROUPBY | There is at least 1 image that contain exactly 2 women that are carrying surfboard
VERIFYLOGIC | Are there both people that are packing a tea into a mason jar and people that are packing a salt into a bag?
VERIFYQUANT | No boys with large trees behind them are wearing jeans
VERIFYQUANTATTR | Do all dogs that are on a bed have the same color?
CHOOSEOBJECT | The woman that is wearing dress is carrying a bottle or a purse?
QUERYOBJECT | What is the woman that is wearing glasses holding?
VERIFYSAMEATTR | Does the pillow that is on a bed and the pillow that is on a couch have the same color?
CHOOSEREL | Is the sitting man holding a hat or wearing it?

A scene graph describes objects in an image, their attributes and relations. We use existing datasets with annotated scene graphs, specifically imSitu (Yatskar et al., 2016) and GQA (Hudson and Manning, 2018), which contains a clean version of Visual Genome's human-annotated scene graphs (Krishna et al., 2016). While Visual Genome's scenes have more detailed annotations, imSitu has a more diverse set of relations, such as "person measuring the length of the wall with a tape", which also introduces ternary relations ("uncorking", in Fig. 3). Given a scene graph, we extract all of its subgraphs using the rules described next. A subgraph is a directed graph with the following node types: object, attribute, relation and rel-mod (relation modifier), where every node has a name which describes it (e.g. "wine"). Fig. 3 shows an example subgraph. A valid subgraph has a root object node, and has the following structure: Every object has an outgoing edge to ≤ 2 relation nodes and ≤ 1 attribute node. Every relation node has an outgoing edge to exactly one object node and optionally multiple rel-mod nodes. Every rel-mod node has an outgoing edge to exactly one object node. The depth of the subgraph is constrained such that every path from the root to a leaf goes through at most two relation nodes. For brevity, we refer to subgraphs with a textual description ("standing man uncorking a wine bottle with a corkscrew").
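The structural rules for a valid subgraph (root object, limits on outgoing edges, and depth) can be checked mechanically. The following is a minimal sketch, not the extraction code used for COVR: the `Node` class, the tree representation, and the function names are assumptions made for illustration, and only `relation` nodes are counted toward the depth limit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    type: str                      # "object", "attribute", "relation" or "rel-mod"
    name: str                      # e.g. "man", "uncorking"
    children: List["Node"] = field(default_factory=list)


def max_relations_on_path(node: Node) -> int:
    """Largest number of relation nodes on any path from `node` to a leaf."""
    here = 1 if node.type == "relation" else 0
    return here + max((max_relations_on_path(c) for c in node.children), default=0)


def is_valid_subgraph(root: Node) -> bool:
    """Check the structural rules stated in the text; False on any violation."""
    if root.type != "object" or max_relations_on_path(root) > 2:
        return False
    stack = [root]
    while stack:
        node = stack.pop()
        kinds = [c.type for c in node.children]
        if node.type == "object":
            ok = (set(kinds) <= {"relation", "attribute"}
                  and kinds.count("relation") <= 2
                  and kinds.count("attribute") <= 1)
        elif node.type == "relation":
            ok = set(kinds) <= {"object", "rel-mod"} and kinds.count("object") == 1
        elif node.type == "rel-mod":
            ok = kinds == ["object"]
        else:  # attribute nodes must be leaves; unknown types are rejected
            ok = node.type == "attribute" and not kinds
        if not ok:
            return False
        stack.extend(node.children)
    return True


# "standing man uncorking a wine bottle with a corkscrew" (the Fig. 3 example)
fig3 = Node("object", "man", [
    Node("attribute", "standing"),
    Node("relation", "uncorking", [
        Node("object", "wine bottle"),
        Node("rel-mod", "with", [Node("object", "corkscrew")]),
    ]),
])
print(is_valid_subgraph(fig3))
```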

Finding Related Images
Given a subgraph g, we pick candidate context images that will be part of the example images I. We want to only pick related images, i.e., images that share sub-structure with g. Images that contain g in their scene graph will be used to generate counting or quantification questions. Images with different but similar subgraphs to g will be useful as image distractors, which are important to avoid "reasoning shortcuts" (Cirik et al., 2018; Agrawal et al., 2018; Chen et al., 2020a,b; Bitton et al., 2021; Kervadec et al., 2021). For example, for the question in Fig. 2e, it is not necessary to perform all reasoning steps if there is only one man in all images in I. To find images with similar subgraphs, we define the edit distance between two graphs to be the minimum number of operations required to transform one graph to another, where the only valid operation is to substitute a node with another node that has the same type. A good distracting image will (a) contain a subgraph at edit distance 1 or 2, and (b) not contain the subgraph g itself. For the question in Fig. 2e, a good distracting image will contain, for example, "a man holding a frisbee wearing shorts" or "a woman catching a frisbee wearing jeans". We extract related images by querying all scene graphs that exhibit the mentioned requirements using a graph-based database.
Filtering overlapping distractors A drawback of the above method is that the edit distance between two subgraphs could be 1, but the two subgraphs might still semantically overlap. For example, a subgraph "woman using phone" is not a good distractor for "woman holding phone" since "holding" and "using" are not mutually exclusive. A similar issue arises with objects and attributes (e.g. "man" and "person", "resting" and "sitting").
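The substitution-only edit distance and the two conditions for a good distracting image can be illustrated with a short sketch. This is not the graph-database query used for COVR: the nested-tuple representation, the in-order matching of children, and all function names are simplifying assumptions.

```python
from typing import Optional, Tuple

# A subgraph node as (type, name, children); children is a tuple of nodes.
Graph = Tuple[str, str, tuple]


def edit_distance(a: Graph, b: Graph) -> Optional[int]:
    """Number of same-type node substitutions that turn `a` into `b`.

    Returns None when the graphs differ in shape or node types, since
    substitution is the only allowed operation. Children are compared in
    order, which is a simplification of general graph matching.
    """
    (type_a, name_a, kids_a), (type_b, name_b, kids_b) = a, b
    if type_a != type_b or len(kids_a) != len(kids_b):
        return None
    total = int(name_a != name_b)
    for kid_a, kid_b in zip(kids_a, kids_b):
        d = edit_distance(kid_a, kid_b)
        if d is None:
            return None
        total += d
    return total


def is_distractor(g: Graph, image_subgraphs: list) -> bool:
    """True if the image has a subgraph at distance 1 or 2 from g, but not g itself."""
    dists = [edit_distance(g, s) for s in image_subgraphs]
    return 0 not in dists and any(d in (1, 2) for d in dists)


def obj(name, *rels):
    return ("object", name, tuple(rels))


def rel(name, target):
    return ("relation", name, (obj(target),))


g = obj("man", rel("catching", "frisbee"), rel("wearing", "jeans"))
d1 = obj("man", rel("holding", "frisbee"), rel("wearing", "shorts"))
d2 = obj("woman", rel("catching", "frisbee"), rel("wearing", "jeans"))
print(edit_distance(g, d1), edit_distance(g, d2))
print(is_distractor(g, [d1]), is_distractor(g, [d1, g]))
```

An image whose scene graph also contains g is rejected even if it contains a near-miss subgraph, matching condition (b) in the text.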
We thus add a step when sampling distracting images to filter such cases. We define m(x_1, x_2) to be the probability that nodes x_1 and x_2 are mutually exclusive. For example, for an object x, m(x, ·) should return high probabilities for all objects except x, its synonyms, hypernyms, and hyponyms. When performing node substitutions (to compute the edit distance) we only consider nodes x_1, x_2 such that m(x_1, x_2) > 0.5. To learn m(x, ·), we fine-tune RoBERTa (Liu et al., 2019) separately on nouns, attributes and relations, with a total of 6,366 manually-annotated word pairs, reaching accuracies of 94.4%, 95.7% and 82.5% respectively. See App. A for details.
Incomplete scene graphs A good distracting image should not contain the source subgraph g. However, because scene graphs are often incomplete, naïve selection from scene graphs can yield a high error rate. To mitigate this issue, we fine-tune LXMERT (Tan and Bansal, 2019), a pre-trained multi-modal classifier, to recognize if a given subgraph is in an image or not. We do not assume that this classifier will be able to comprehend complex subgraphs, and only use it to recognize simple subgraphs that contain ≤ 2 object nodes, and up to a single relation and attribute. We train the model on binary-labeled image-subgraph pairs, where we sample a simple subgraph g_s and its image I from the set of all subgraphs, and use I as a positive example, and an image I' that contains a subgraph g'_s as a negative example, where I' is a distracting image for I, according to the procedure described above. For example, for the subgraph "man wearing jeans" the model will be trained to predict 'True' for an image that contains it, and 'False' for an image with "man wearing shorts", but not "man with jeans". After training, we filter out candidate distracting images for the subgraph g if the model outputs a score above a certain threshold τ for all of the simple graphs in g.
We adjust τ such that the probability of correctly identifying missing subgraphs is above 95% according to a manually-annotated set of 441 examples. We also use our trained model to filter out cases where an object is annotated in the scene graph, but the image contains other instances of that object (e.g., if in an image with a crowd full of people, only a few are annotated). See App. B for details.
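One way to read the threshold adjustment is as a recall constraint: among annotated examples where the subgraph is present in the image, at least 95% should score high enough to be filtered. The sketch below follows that reading; it is an assumption about the procedure, not the authors' code, and it uses `>=` rather than a strict comparison so that the returned value itself satisfies the constraint.

```python
import math
from typing import Sequence


def calibrate_threshold(present_scores: Sequence[float], target: float = 0.95) -> float:
    """Largest tau such that at least `target` of the scores are >= tau.

    `present_scores` are classifier scores on annotated examples in which
    the subgraph really is in the image (hypothetical input format).
    """
    if not present_scores:
        raise ValueError("need at least one annotated example")
    ranked = sorted(present_scores, reverse=True)
    k = math.ceil(target * len(ranked))   # number of examples that must be caught
    return ranked[max(k, 1) - 1]


def filter_candidate(simple_subgraph_scores: Sequence[float], tau: float) -> bool:
    """Drop a candidate distractor if every simple subgraph of g scores >= tau."""
    return all(score >= tau for score in simple_subgraph_scores)


tau = calibrate_threshold([0.9, 0.2, 0.7, 0.4], target=0.5)
print(tau, filter_candidate([0.8, 0.75], tau), filter_candidate([0.8, 0.3], tau))
```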

Template-based Question Generation
Once we have a subgraph, an image, and a set of related images, we can generate questions. For a subgraph g and its related images, we generate questions from a set of 15 manually-written templates (Table 1) that include operators such as quantifiers, counting, and "group by". Further extending this list is trivial in our framework. Each template contains slots, filled with values conditioned on the subgraph g as described below. Since we have 15 templates with multiple slots, and each slot can be filled with different types of subgraphs, we get a large set of possible question types and programs. Specifically, if we anonymize lexical terms (i.e. replace nouns, relations and attributes with a single placeholder) there are 640 different programs in total.
We define a set of preconditions for each template (Table 2) which specifies (a) the types of subgraphs that can instantiate each slot, and (b) the images that can be used as distractors. The first precondition type ensures that the template is supported by the subgraph: e.g., to instantiate a question that verifies an attribute (VERIFYATTR), the root object must have an outgoing edge to an attribute. The second precondition type ensures that we have relevant distractors. For example, for VERIFYATTR, we only consider distracting subgraphs if the attribute connected to their root is different from the attribute of g's root. In addition, the distracting subgraphs must have at least one more different node. Otherwise, we cannot refer to the object unambiguously: if we have a "white dog" as a distracting image for a "black dog", the question "Is the dog black?" will be ambiguous.
When a template, subgraph, and set of images satisfy the preconditions, we instantiate a question by filling the template slots. There are three types of slots: First, slots with the description of g or a subset of its nodes. E.g., in VERIFYATTR (Table 2) we fill the slot G-NOATTRIBUTE with the description of g without the attribute node ("sink that is below a towel"), and the slot ATTRIBUTE with the name of that attribute ("white"). In CHOOSEOBJECT (second row), we fill the slots G-SUBJECT, REL and OBJ with different subsets of g: "woman lighting a cigar on fire", "using" and "candle".
The second slot type fills the description of a different subgraph than g, or a subset of its nodes. In COMPARECOUNT (third row), we fill G2 with the description of another subgraph, sampled from the distracting images. Similarly, in CHOOSEOBJECT, we fill DECOYOBJ with the node "lighter". Last, some slots are filled from a closed set of words that describe reasoning operations: In COMPARECOUNT, we fill COMPARATIVE with one of "less", "more" or "same number".
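Slot filling can be sketched as simple string substitution. The slot names below (G-NOATTRIBUTE, ATTRIBUTE, G-SUBJECT, REL, OBJ, DECOYOBJ) come from the text, but the template strings are inferred from the example questions and are assumptions, as are the bracket syntax and the `instantiate` helper.

```python
import re

# Hypothetical template strings; wording inferred from the example questions.
TEMPLATES = {
    "VERIFYATTR": "Is the [G-NOATTRIBUTE] [ATTRIBUTE]?",
    "CHOOSEOBJECT": "The [G-SUBJECT] is [REL] a [OBJ] or a [DECOYOBJ]?",
}


def instantiate(template_name: str, slots: dict) -> str:
    """Replace every [SLOT] in the template with its value from `slots`."""
    def fill(match):
        return slots[match.group(1)]   # raises KeyError if a slot is missing
    return re.sub(r"\[([A-Z0-9-]+)\]", fill, TEMPLATES[template_name])


print(instantiate("VERIFYATTR", {
    "G-NOATTRIBUTE": "sink that is below a towel",
    "ATTRIBUTE": "white",
}))
print(instantiate("CHOOSEOBJECT", {
    "G-SUBJECT": "woman lighting a cigar on fire",
    "REL": "using",
    "OBJ": "candle",
    "DECOYOBJ": "lighter",
}))
```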
Once slots are filled, we compute the corresponding program for the question. Each row in the program is an operator (such as Find, Filter, All) with a given set of arguments (e.g. "black") or dependencies (input from other operator). The list of program operators and a sample program are in App. G.
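To make the row-based program format concrete, the sketch below executes a three-row program for a question like "Are all dogs black?" over a toy scene. The operator names Find, Filter and All appear in the text; their exact semantics, the row layout (`op`, `args`, `deps`), and the scene format are assumptions for illustration, and the real operator list is the one in App. G.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Toy scene: each object has a name and a set of attributes (hypothetical format).
SCENE = [
    {"name": "dog", "attrs": {"black"}},
    {"name": "dog", "attrs": {"black"}},
    {"name": "cat", "attrs": {"white"}},
]

# Each operator receives the scene, its literal arguments, and the outputs
# of the rows it depends on.
OPS = {
    "Find": lambda scene, args, inputs: [o for o in scene if o["name"] == args[0]],
    "Filter": lambda scene, args, inputs: [o for o in inputs[0] if args[0] in o["attrs"]],
    # inputs[1] is a filtered subset of inputs[0], so equal sizes means "all".
    "All": lambda scene, args, inputs: len(inputs[0]) == len(inputs[1]),
}

PROGRAM: List[Row] = [
    {"op": "Find", "args": ["dog"], "deps": []},
    {"op": "Filter", "args": ["black"], "deps": [0]},
    {"op": "All", "args": [], "deps": [0, 1]},
]


def execute(program: List[Row], scene: List[dict]) -> Any:
    """Run rows in order; each row may read the results of earlier rows."""
    results: List[Any] = []
    for row in program:
        inputs = [results[i] for i in row["deps"]]
        results.append(OPS[row["op"]](scene, row["args"], inputs))
    return results[-1]


print(execute(PROGRAM, SCENE))
```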

Quality Assurance and Analysis
We perform the generation process separately on the training and validation sets of the GQA and imSitu scene graphs, which yields 13 million and 1 million questions respectively. We split the latter set into development and test sets of equal size, making sure that no two examples with the same question text appear in both of them.
Balancing the dataset Since we generate a question for every valid combination of subgraph and template, the resulting set of questions possibly contains correlations between the language of the question and the answer, and has skewed distributions over answers, templates, and subgraph structures. To overcome the first issue, whenever possible, we create two examples with the same question, but a different set of images and answers (186,841 questions, or 75.3%, appear with at least two different answers in the training set). Then, to balance our dataset, we use a heuristic procedure that leads to a nearly uniform distribution over templates, and balances the distribution over answers and over the size and depth of subgraphs as much as possible (App. C). We provide statistics about our dataset in Table 3.

Compositional Splits
Our generation process lets us create questions and corresponding programs with a variety of reasoning operators. We now show how COVR can be used to generate challenging compositional splits. We propose two setups: (1) Zero-shot, where training questions provide examples for all required reasoning steps, but the test questions require a new composition of these reasoning steps, and (2) Few-shot, where the model only sees a small number of examples for a specific reasoning type (Bahdanau et al., 2019a; Yin et al., 2021).
Because each question is annotated with its program, we can define binary properties over the program and answer, where a property is a binary predicate that typically defines a reasoning type in the program. For example, the property HAS-QUANT is true iff the program contains a quantifier, and HAS-QUANT-NONE iff it contains the quantifier NONE. We can create any compositional split that is a function of such properties. We list the types of properties used in this work in Table 5.
All compositional splits are based on the original training/validation splits from Visual Genome and imSitu to guarantee that no image appears in both the training set and any of the test sets. Splits are created simply by removing certain questions from the train and test splits. If we do not remove any question, we get an i.i.d. split.
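Properties as binary predicates, and splits obtained by removing questions, can be sketched as follows. The example format (a program reduced to a list of operator names), the operator names other than All, and the `make_split` helper are assumptions; `m=0` corresponds to the zero-shot setup and a small positive `m` to the few-shot setup.

```python
import random
from typing import Any, Callable, Dict, List, Tuple

Example = Dict[str, Any]
Property = Callable[[Example], bool]

QUANTIFIERS = {"All", "Some", "None"}   # operator names assumed for illustration


def has_quant(ex: Example) -> bool:
    """HAS-QUANT: the program contains a quantifier."""
    return any(op in QUANTIFIERS for op in ex["program"])


def has_quant_none(ex: Example) -> bool:
    """HAS-QUANT-NONE: the program contains the quantifier NONE."""
    return "None" in ex["program"]


def both(p: Property, q: Property) -> Property:
    """Conjunction of two properties, e.g. for holding out a composition."""
    return lambda ex: p(ex) and q(ex)


def make_split(train: List[Example], test: List[Example], held_out: Property,
               m: int = 0, seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Keep at most m held-out examples in training; test only on held-out ones."""
    rng = random.Random(seed)
    hits = [ex for ex in train if held_out(ex)]
    kept = rng.sample(hits, min(m, len(hits)))
    new_train = [ex for ex in train if not held_out(ex)] + kept
    new_test = [ex for ex in test if held_out(ex)]
    return new_train, new_test


data = [
    {"program": ["Find", "All"]},
    {"program": ["Find", "Count"]},
    {"program": ["Find", "None"]},
]
zero_train, zero_test = make_split(data, data, has_quant_none)
print(len(zero_train), len(zero_test))
```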
Zero-shot We test if a model can answer ques-

[Table: example questions illustrating compositional splits; only the following cells are recoverable.]
(unlabeled) The horse is pulling people with a rope or a leash? / There are two people next to a tree. / There are two children cleaning the path with a broom / All men wear jeans. / No women are standing. / No men wearing jeans are standing.
HAS-SAMEATTR-COLOR: What is the color of X? / Do all X have the same material? / Do all X have the same color?
TPL-CHOOSEOBJECT: What is the man carrying? / Is the man X or Y? / The man is carrying a X or Y?

Table 5: Types of properties used in this work.
Property | Definition
HAS-QUANT-COMPSCOPE | True if the quantifier's scope is "complex", i.e., includes an attribute or a relation.
RM/V/C | True if g's structure contains either a Rel-Mod node, an object node with two outgoing edges to relation nodes (V-shape) or chain of more than a single relation (C).
TPL-X | True if the question originated from the template X.
ANS-X | True if answer is of type X ∈ {NUM, ATTR, NOUN}.
LEXICAL-X | True if g contains a node with the name X.

Another popular zero-shot test is the program split. In a similar fashion to the compositional generalization tests in Finegan-Dollak et al. (2018), we randomly split programs after anonymizing the names of nodes, and hold out 20% of the programs to test how models perform on program structures that were not seen during training. We also perform a lexical split, where we hold out randomly selected pairs of node names (i.e., names of objects, relations or attributes) such that the model never sees any pair together in the same program during training. We create 3 random splits where we hold out 20% of all pairs.

Few

Experiments
Experimental Setup We consider the following baselines: (a) MAJ, the majority answer in the training set, and (b) MAJTEMPL, an oracle-based baseline that assumes perfect knowledge of the template from which the question was generated, and predicts the majority answer for that template. For templates that include two possible answers ("candle or lighter"), it randomly picks one. We use the Volta framework (Bugliarello et al., 2021) to train and evaluate different pre-trained vision-and-language models. We use Volta's controlled setup models (that have a similar number of parameters and pre-training data) of VisualBERT (Li et al., 2019) and ViLBERT (Lu et al., 2019). In this section we show results only for VisualBERT; results for ViLBERT, which are mostly similar, can be found in App. H.
A vision-and-language model provides a representation for question-image pairs, (q, I). We modify the implementation to accept N images by running the model with (q, I) as input for each image I ∈ I, and then passing the N computed representations through two transformer layers with a concatenated [CLS] token. We pass the [CLS] token representation through a classifier layer to predict the answer. The classifier layer and the added transformer layers are randomly initialized, and all parameters are fine-tuned during training.
To estimate a lower bound on performance without any reasoning on images, we evaluate a text-only baseline VB_TEXT that only sees the input text (image representations are zeroed out).
For the compositional splits, we evaluate VisualBERT trained on the entire data (VB_iid), the text baseline (VB_TEXT), and the compositionally-trained models VB_250 and VB_0 for the few-shot (M = 250) and zero-shot setups, respectively. To control the training size, we also evaluate VB_iid-size, a model trained with a similar data size as the compositionally-trained model, by uniformly downsampling the training set. All models are evaluated on the same subset of the development compositional split. To focus on the generalization gap, we define a generalization score ("Gen. Score") that measures the proportion of the gap between VB_TEXT and our upper bound, VB_iid-size, that is closed by a model. In all compositional splits, we train the models for 8 epochs and early-stop using the subset of the development set that does not contain any of the compositional properties we test on (Teney et al., 2020).
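Read literally, the generalization score is the fraction of the gap between the text-only baseline and the size-matched i.i.d. model that the compositionally-trained model closes. A minimal sketch of that reading follows; the function name is hypothetical, the accuracies in the example are made-up values rather than results from the paper, and reporting the score as a percentage is an assumption based on how the scores are quoted in the text.

```python
def generalization_score(acc_model: float, acc_text: float, acc_upper: float) -> float:
    """Percentage of the gap between the text baseline and the upper bound
    that is closed by the model. All inputs are accuracies on the same
    evaluation subset."""
    gap = acc_upper - acc_text
    if gap <= 0:
        raise ValueError("upper bound must exceed the text-only baseline")
    return 100.0 * (acc_model - acc_text) / gap


# Illustrative numbers only: baseline 40, upper bound 80, model 60.
print(generalization_score(acc_model=60.0, acc_text=40.0, acc_upper=80.0))
```

A model that matches the text-only baseline scores 0, and one that matches the size-matched i.i.d. model scores 100.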
Results First, we show how models perform on paraphrased and automatically-generated questions in the i.i.d. setup in Table 6. The difference between VB_TEXT and MAJTEMPL is small (3.3%), suggesting that the answer to most questions cannot be inferred without looking at the images. We also show that when the model is trained with random images instead of distracting ones (VB_EASYDISTRACTORS), accuracy drops by 10.7%, showing the importance of training on good distracting images. In addition, there is still a large gap from human performance, at 91.9%, which we estimate by evaluating human answers on 160 questions. Finally, we observe a 9.7% performance drop when training on the automatically-generated examples and testing on the paraphrased examples. Accuracy per template is shown in App. H.
Next, we report results on the compositional splits. We show results on automatically-generated questions (not paraphrased), to disentangle the effect of compositional generalization from transfer to natural language. App. H reports results for the paraphrased test set, where generalization scores are lower, showing that transfer to natural language makes compositional generalization even harder. Table 7 shows results in the few-shot setup, where in 5 out of 11 setups the generalization score is ≤ 70%. VB_250 generalizes better in cases where the withheld operator is similar to an operator that appears in the training set. For instance, HAS-QUANT-ALL has a higher generalization score than HAS-QUANT, since the model sees many examples with the quantifiers "some" and "none"; HAS-COMPAR-MORE has a higher score than HAS-COMPAR; and HAS-LOGIC-AND has a perfect generalization score. This suggests that when the model has some representation for a reasoning type it can generalize better to new instances of it.
The large gap between the nearly-perfect score of HAS-NUM-3 (92%), and the low score of HAS-NUM-3-ANS-3 (0%), where in both the number 3 is rarely seen in the question, and in the latter it is also rare as an answer, suggests that the model learns good number representations just from seeing numbers in the answers. Other cases where the generalization scores are low include HAS-QUANT.
Table 8 shows results for the zero-shot setup. A model that sees examples where the quantifier scope is complex, but never in the context of the quantifier ALL, fails to generalize (HAS-QUANT-COMP & HAS-QUANT-ALL, 26%). The model also fails to generalize to the template CHOOSEOBJECT, although it saw at training time the necessary parts in the templates CHOOSEATTR and VERIFYOBJECT. Similarly, the model fails to generalize to the template VERIFYATTR, and to TPL-VERIFYCOUNT ∪ TPL-VERIFYCOUNTGROUPBY, where we hold out all verification questions with counting, even though the model sees verification questions and counting in other templates. Last, the model struggles to generalize in the program split. Conversely, the model generalizes well to questions with the Count operator where the subgraph contains a complex sub-graph (HAS-COUNT & RM/V/C) or an attribute (HAS-COUNT & HAS-ATTR), and in the lexical split, where the model is tested on unseen combinations of names of nodes.
A possible explanation for the above is that compositional generalization is harder when the model needs to learn to combine large/complex structures, and performs better when composing more atomic constructs. However, further characterizing the conditions under which compositional generalization occurs is an important question for future work.

Conclusion
We present COVR, a test-bed for visually-grounded compositional generalization with real images. COVR is created automatically except for manual validation and paraphrasing, and allows us to create a suite of compositional splits. COVR can be easily extended with new templates and splits to encourage the community to further understand compositional generalization. Through COVR, we expose a wide range of cases where models struggle to compositionally generalize.

A Filtering Overlapping Distractors
We take the published RoBERTa (Large, Liu et al. 2019) model that is already fine-tuned on MNLI (Williams et al., 2018), and further fine-tune it separately on pairs of nouns, attributes and relations to predict whether a pair of words or phrases are mutually exclusive. To leverage the knowledge learned during pre-training, we use the same setup as the training on MNLI, where the model is given two phrases and predicts one of three classes: "contradiction", "entailment" and "neutral".
To collect the list of pairs to annotate that will be used for fine-tuning, we fetched all pairs that have been used in Visual Genome within the same context. For attributes, we took all pairs of attributes that have appeared within the context of the same object (this way, we are likely to collect "red" and "green" since they appear within the context of objects such as "apple", but not "red" and "grilled"). For nouns, we consider all pairs of nouns that have been used with the same relation. For relations, we consider all pairs of relations that have been used with the same pair of nouns. While there are other resources that could have been useful for fine-tuning (e.g. WordNet, Fellbaum 1998), we did not use any such external knowledge base, since annotating the pairs ourselves allowed us to have exact control over the subtleties of the data in our training context.
We train all models for 50 epochs with a learning rate of 3e-5. For the noun model, we use 2,366 manually annotated pairs of nouns for training and validation. The model is trained to predict "contradiction" whenever nouns are mutually exclusive, i.e. when neither word is a synonym, hypernym, or hyponym of the other, and "neutral" otherwise (we do not use the entailment class). We randomly shuffle the internal order of each pair for regularization. We get an accuracy score of 94.4% on 20% of the pairs which were held out for validation. Similarly we train a model that predicts mutual exclusiveness of attributes over 3,053 pairs, and get an accuracy of 95.7%.
Unlike the other two models, for the relations model we do not require complete mutualexclusiveness, and we do not assume symmetrical annotations, i.e., that if m(x 1 , x 2 ) then m(x 2 , x 1 ), to increase the probability of finding pairs where m returns a score higher than 0.5 for a relation x. For example, we annotate pairs such that m("riding on", "near") = 1 but m("near", "'riding on') = 0, since most often, if some object is hanging on another object, the annotation of the relations between the two objects in Visual Genmoe will be specific, i.e. "riding on" or "on" and not "near". This way, for a question such as "Is the man riding a motorcycle" we might get distracting images with a man "standing near" a motorcycle, but for a question such as "Is the man near a motorcycle" we will not get distracting images with a man "riding" a motorcycle, as then the question will be ambiguous. Note that while this can potentially introduce some noise (i.e., in some rare cases "a man riding a motorcycle" might be annotated as if the man is "near" a motorcycle), such mistakes will hopefully be overridden with the second validation that we use (incomplete scene graphs, App B). We annotate 917 pairs of relations, where every pair is annotated in both directions. We get an accuracy of 82.5% on the held-out set.

B Incomplete Scene Graphs
We use LXMERT (Tan and Bansal, 2019) to train a classifier that predicts whether a simple subgraph exists in an image. See §3.2 for details on the data we train on. We extract image and object features with the bottom-up top-down attention method of Anderson et al. (2018), as done in the LXMERT paper, and fine-tune the pre-trained model. To extract the training data, we use all subgraphs from all images for which we have at least one valid negative image (from both the training and test sets). This results in 6,520,367 positive and negative examples. Since we need the model to predict results not only on the test set, but also on the training set, we split all examples (training and test) into 5 splits based on their image, and train 5 different models, where each model does not see a different fifth of the images during training. Then, to predict whether a simple subgraph exists in an image, we use the model that was not trained on that image.
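The five-way scheme above, where each image is scored only by a model that never saw it, can be sketched as follows. This is a hedged illustration: the text does not say how images are assigned to splits, so the hash-based assignment and the names `held_out_fold` and `training_sets_by_fold` are assumptions.

```python
import hashlib

NUM_FOLDS = 5


def held_out_fold(image_id, num_folds=NUM_FOLDS):
    """Map an image id to the fold (and model index) that never trains on it.

    Hashing the id is an assumed assignment rule chosen to be deterministic.
    """
    digest = hashlib.sha256(image_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_folds


def training_sets_by_fold(examples, num_folds=NUM_FOLDS):
    """Build the training set of each model: all examples whose image is
    outside that model's held-out fold."""
    train_sets = {k: [] for k in range(num_folds)}
    for example in examples:
        skipped = held_out_fold(example["image_id"], num_folds)
        for k in range(num_folds):
            if k != skipped:
                train_sets[k].append(example)
    return train_sets


# Toy examples keyed by image id (the dict layout is hypothetical).
toy_examples = [{"image_id": f"img{i}", "subgraph": "apple"} for i in range(50)]
toy_train_sets = training_sets_by_fold(toy_examples)
# At prediction time, model number held_out_fold(image_id) is the one to use.
```

Splitting by image rather than by example keeps every subgraph of a given image out of the training data of the model that later scores that image.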
We manually annotate 441 examples where we determine if a simple subgraph exists in an image, and use these annotations for early stopping and to adjust a threshold τ. We use this threshold to filter out candidate distracting images for a subgraph g if the model outputs a score above τ for all of the simple graphs in g. Note that each negative example is a candidate distracting image for some subgraph g. We use g to further adjust τ in the following way. By definition, a candidate simple graph of a distracting image has a non-empty set d of nodes that differ from those of g. Based on our annotated examples, we found that the model should have a different threshold τ for different types of nodes in d. Specifically, we found that the model performed best when d contained nodes of type object, then relation, and finally attribute. Thus, we use a different τ for each type: τ = 0.05 for object, τ = 0.1 for relation and τ = 0.5 for attribute. If d contains more than one type of node, we take the one that gives the maximal τ.
The described procedure can be used to detect unannotated objects. However, it is not useful in the common case where an object is annotated in the scene graph, but the image contains additional similar objects that are not annotated (e.g. there is a crowd of people in the image, but only a few of the people are annotated). We thus add another verification step for each simple subgraph g. First, we take the annotated positions of all instances of g in the scene. For example, if there are three annotated "apples", we take the positions (bounding-box annotations from Visual Genome) of all three. Then, we use our trained LXMERT model with the textual description of g (e.g. "apple") and the image, but this time we zero out the image parts that contain the apples according to their annotated positions. Essentially, we are querying the model whether there are any "apples" other than those that are annotated. We use a similar procedure as before to find the best threshold, 0.5. Since the LXMERT model is never trained with zeroed-out parts, during the described fine-tuning procedure we also zero out 15% of the bounding boxes.
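The type-dependent threshold can be illustrated with a short sketch. The three τ values are those stated above; the function names, the representation of d as a set of node-type strings, and the use of a single d per candidate image are simplifying assumptions for the example.

```python
# Thresholds per node type, as reported in the text.
TAU_BY_NODE_TYPE = {"object": 0.05, "relation": 0.1, "attribute": 0.5}


def threshold_for(differing_node_types):
    """Return tau for a set d of differing node types (maximal tau wins)."""
    if not differing_node_types:
        raise ValueError("d must be non-empty for a distracting image")
    return max(TAU_BY_NODE_TYPE[node_type] for node_type in differing_node_types)


def is_filtered_out(simple_graph_scores, differing_node_types):
    """Filter a candidate distracting image when the classifier scores every
    simple graph of g above tau, i.e. g probably exists in the image."""
    if not simple_graph_scores:
        raise ValueError("expected at least one simple-graph score")
    tau = threshold_for(differing_node_types)
    return all(score > tau for score in simple_graph_scores)
```

For instance, with d = {"object", "attribute"} the sketch uses τ = 0.5, so a candidate whose simple graphs score 0.2 and 0.3 would be kept, while the same scores with d = {"relation"} (τ = 0.1) would lead to filtering.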

C Downsampling & Balancing
We use the following downsampling method to balance the dataset and reduce bias as much as possible, separately for the training, development and test sets. At a high level, we start with a total of N questions and group them by their templates, such that we have T groups. We then use a heuristic ordering method that prioritizes or balances different desired features, described next, and finally we take the top S = N/T questions from each group, such that we get an equal number of questions per template. The ordering method is defined as follows, starting with an empty list L_t for a template t. Each question is automatically annotated with the following three features: (1) whether this question appears at least twice with different answers, (2) the answer to the question and (3) the structure of the source subgraph for that question, specifically a tuple with its size and its depth. We first add to L_t all questions where the first feature is positive (in all cases this was less than S). Then, to balance between the different question answers, at each step until |L_t| = S, we count the appearances of all answers and sample an answer a from the answers that appeared least in L_t. Then, we count the appearances of all subgraph structures, and sample a question with answer a, such that its subgraph structure appeared least. We stop once |L_t| = S.
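The ordering heuristic for a single template can be sketched as below. This is an illustrative reading of the procedure, not the authors' implementation: the dictionary keys (`multi_answer`, `answer`, `structure`), the fixed random seed, and the restriction of sampling to answers that still have unused questions are assumptions.

```python
import random
from collections import Counter


def select_for_template(questions, S, rng=None):
    """Pick up to S questions for one template, balancing answers and then
    subgraph structures.

    Each question is a dict with (hypothetical) keys:
      "multi_answer": True if the question appears with different answers,
      "answer": the answer, "structure": a (size, depth) tuple.
    """
    rng = rng or random.Random(0)
    # Step 1: questions that appear with different answers come first.
    selected = [q for q in questions if q["multi_answer"]][:S]
    remaining = [q for q in questions if not q["multi_answer"]]
    answer_counts = Counter(q["answer"] for q in selected)
    structure_counts = Counter(q["structure"] for q in selected)

    while len(selected) < S and remaining:
        # Step 2: sample an answer among those that appear least so far.
        available = sorted({q["answer"] for q in remaining}, key=str)
        fewest = min(answer_counts[a] for a in available)
        answer = rng.choice([a for a in available if answer_counts[a] == fewest])
        # Step 3: among questions with that answer, prefer rare structures.
        candidates = [q for q in remaining if q["answer"] == answer]
        fewest = min(structure_counts[q["structure"]] for q in candidates)
        chosen = rng.choice(
            [q for q in candidates if structure_counts[q["structure"]] == fewest]
        )
        selected.append(chosen)
        remaining.remove(chosen)
        answer_counts[answer] += 1
        structure_counts[chosen["structure"]] += 1
    return selected


# Toy group: 8 "True" questions and 2 "False" questions, none multi-answer.
toy_group = [
    {"multi_answer": False, "answer": i < 8, "structure": (i % 3 + 1, 1), "id": i}
    for i in range(10)
]
toy_selection = select_for_template(toy_group, S=4)
```

On the toy group, the skewed 8-to-2 answer distribution becomes an even split among the four selected questions, which is the balancing effect the heuristic is meant to achieve.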

D Additional Statistics
Answers distribution We show the distribution over the top 30 answers of the training set in Fig. 4, excluding true/false answers. As can be seen, the most common answers are the numbers 0-5, followed by common colors, attributes and relations.
Fig. 5 shows the distribution over the number of images each question in the training set contains. As can be seen, most questions contain exactly 5 images.
Occurrences of operators in questions We show in Fig. 6, for a selected set of operators (find, filter, and with_relation), the distribution over the number of occurrences of that operator in a single question (e.g. the program for the question "There is 1 green banana on a tree that is next to a man" contains three find operators, one filter and two with_relation). The graphs show that find appears between one and eight times in a single question, and filter and with_relation between zero and six times. Note that a question that contains six with_relation operators does not imply that a single reference to an object contains six relations, since a question can contain more than one object reference (e.g. in COMPARECOUNT).

E Crowdsourcing Details
We use Amazon Mechanical Turk (AMT) for two different tasks: validation of questions and paraphrasing them.

E.1 Validation
We wanted to make sure that our validation and test examples are of high quality by manually validating that the question is logically valid, there are no ambiguous object references, and the answer is correct. To maintain high-quality work in AMT, we first created a qualification task by annotating 100 examples ourselves, finding that the percentage of valid questions among the automatically generated samples was 83%. Workers were asked to choose one of the following options: "Answer is CORRECT", "I cannot understand the question", "I cannot determine if the answer is correct" or "Answer is WRONG". We filtered workers by their performance: workers who reached over 85% accuracy were given feedback and were approved for the main task, which contained the rest of the questions. During their work, we repeatedly sampled the workers' annotations and gave feedback where necessary, and also measured the accuracy of their submissions: all workers reached an accuracy between 95% and 98%. Workers were paid $0.5 for a batch of 5 questions. Screenshots of the instructions and the HIT can be seen in Figures 7 and 8.
Analysis We sample 40 examples that were filtered out by the annotators to analyze the different causes for invalid generated questions. We find that most errors (70%) were due to problematic scene graph annotations: either because of missing annotations (53%) or wrong annotations (17%). The former type makes the answers to questions that require counting incorrect, as well as questions that ask about a specific object (e.g. a question about "a man wearing a hat" will be invalid if there is more than one such man), and in general means that our automatic validation mechanism failed to recognize that object. The latter type (wrong annotations, in contrast to missing annotations) can cause any question to be incorrect, and could not have been detected by our automatic validation methods. Other errors were questions about color (6%) that were not accurate (e.g. a question about whether two benches are brown will be given the answer 'true' since they are both annotated brown, but in practice, they could have significantly different shades of brown, which might lead to the correct answer 'false'). The rest of the errors (16%) were due to various issues that make the answer unclear, such as questions that require counting feathers or meat.

| Generated question | Paraphrase |
| Does the trees that are behind a zebra and the trees that are behind a fire hydrant have the same color? | Are the trees behind a zebra the same color as those behind a fire hydrant? |
| There is 1 bottle that is on bench that is in front of tree | Is there a bottle on a bench in front of a tree? |
| No forks that are on a white plate are silver | None of the forks on a white plate are silver. |
| Do all boats that are in a harbor have the same color? | Are all the boats in the harbor the same color? |
| Is the person that is wearing a yellow jacket skiing? | Is the person in the yellow jacket skiing? |
| How many images with mushrooms that are on a pizza that is on a table? | The pizza on the table - how many mushrooms are on it? |
| Is there either a girl that is holding a bouquet and is wearing dress or a girl that is holding a book and is wearing a hat? | Is there either a girl holding a bouquet and wearing a dress, or a girl holding a book and wearing a hat? |
| There are less boats that are on water than surfboards that are on water | There are fewer boats on water than surfboards. |
| There are at least 4 people that are buttering pan | There are four or more people buttering a pan. |
| What is the material of the table that is under a coffee mug? | What material is the table under the coffee mug? |
Table 9: Examples for crowd-sourced paraphrasing.

E.2 Paraphrasing
For the question paraphrasing task, we again conducted a qualification task in addition to the final task. All potential workers were first added to the qualification task and asked to paraphrase 10 questions each. The paraphrases were then manually analyzed for meaning preservation and fluency, and only the workers with very good performance were added to the final task, which was used to paraphrase the bulk of the questions. In both tasks, we shared feedback with the workers via Google spreadsheets (one for each worker). Additionally, we regularly sampled and analyzed the workers' paraphrases in the final task and used the same spreadsheets to share any necessary feedback. The workers were asked to periodically check their feedback spreadsheets, and workers who ignored the feedback were disqualified from the final task. We qualified 14 workers for the final task, most of whom wrote good paraphrases. We only had to disqualify one worker for not taking note of their feedback.
Both the qualification and final tasks had the same instructions, examples and HIT interface. Screenshots can be seen in Figures 9 and 10. Workers were paid $0.7 for every task completed in both AMT tasks, with 5 questions per task. Additionally, as shown in Figure 10, workers were provided with a comment box to leave comments in case they could not understand the question. Comments were left for a very small fraction of questions (less than 2%), mostly to indicate questions that were invalid or unclear. We removed all questions with comments from the final datasets. Some examples of the crowd-sourced paraphrases are shown in Table 9.

G Programs
We list all program operators in Table 11, together with their input arguments/dependencies and output. A sample program can be found in Table 10. Tables 12 and 13 show the accuracy for each template for both the non-paraphrased and the paraphrased versions, for models that were trained on all data. The results of

| Index | Operator | Arguments |
| 1 | Find | "table" |
| 2 | Filter | 1, "wood" |
| 3 | | |
We report results on the compositional splits when they are evaluated on the paraphrased questions in Tables 14 and 15. The generalization scores are lower than the results for the non-paraphrased data, showing that transfer to natural language makes compositional generalization harder.
Effect of M We show how M , the number of examples the model sees from the compositional subset, affects the accuracy in Figure 13. The graph shows that using 50 examples barely has an effect, and that most of the improvement is achieved when increasing the number of examples from 125 to 2500. Increasing it further shows diminishing improvements.

VILBERT Results
To assess whether the results we get are specific to the model that we used (VisualBERT), we run additional compositional tests on a different model, VILBERT, using Volta's framework (Bugliarello et al., 2021). The model has the same number of parameters and was trained on the same pre-training data. Results in Tables 16 and 17 show that for most of the compositional splits, both of our tested models get similar generalization scores.

Question: What is being held by the man in the jacket? Answer: umbrella
Question: No men are surfing on a white surfboard. Answer: True
Question: At least one image depicts a girl hitting a ball while wearing gloves. Answer: False
Question: There are precisely two men wearing white shorts in how many images? Answer: 1
Question: There is at least 1 image with exactly 2 dark bottles on a counter. Answer: True
Figure 11: Selected examples from COVR validation set.
Figure 12: Selected examples from COVR validation set.

| Operator | Arguments | Output |
| QueryName | (1) object | Returns the name of 'object'. |
| Find | (1) name | Returns all objects from all scenes that are named 'name'. |
| UniqueImages | (1) objects | Returns a set (without duplicates) of all images of the given 'objects'. |
| GroupByImages | (1) objects | Returns (image, objects_in_image) tuples where all objects in 'objects' that are in the same image are grouped together and coupled with that image. |
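To make the semantics of the operators listed above concrete, the following is a minimal sketch of how four of them might behave on a toy scene representation. The `SceneObject` class, the `scenes` dictionary layout and the snake_case function names are assumptions for illustration; they are not taken from the COVR code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """A hypothetical, minimal stand-in for a scene-graph object."""
    image_id: str
    object_id: int
    name: str


def find(scenes, name):
    """Find: all objects from all scenes that are named `name`."""
    return [obj for objs in scenes.values() for obj in objs if obj.name == name]


def query_name(obj):
    """QueryName: the name of `obj`."""
    return obj.name


def unique_images(objects):
    """UniqueImages: the set of images that contain the given objects."""
    return {obj.image_id for obj in objects}


def group_by_images(objects):
    """GroupByImages: (image, objects_in_image) tuples, one per image."""
    grouped = {}
    for obj in objects:
        grouped.setdefault(obj.image_id, []).append(obj)
    return list(grouped.items())


# Toy scenes: two images, three tables in total.
toy_scenes = {
    "img1": [SceneObject("img1", 0, "table"), SceneObject("img1", 1, "mug")],
    "img2": [SceneObject("img2", 0, "table"), SceneObject("img2", 1, "table")],
}
toy_tables = find(toy_scenes, "table")
toy_groups = group_by_images(toy_tables)
```

In this toy setting, counting the tables per image reduces to taking the length of each group returned by `group_by_images`.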