QLEVR: A Diagnostic Dataset for Quantificational Language and Elementary Visual Reasoning

Synthetic datasets have successfully been used to probe visual question-answering models for their reasoning abilities. CLEVR (Johnson et al., 2017), for example, tests a range of visual reasoning abilities. The questions in CLEVR focus on comparisons of shapes, colors, and sizes, numerical reasoning, and existence claims. This paper introduces a minimally biased, diagnostic visual question-answering dataset, QLEVR, that goes beyond existential and numerical quantification and focuses on more complex quantifiers and their combinations, e.g., asking whether there are more than two red balls that are smaller than at least three blue balls in an image. We describe how the dataset was created and present a first evaluation of state-of-the-art visual question-answering models, showing that QLEVR presents a formidable challenge to current models. Code and dataset are available at https://github.com/zechenli03/QLEVR


Introduction
Visual question answering lies at the intersection of computer vision and natural language processing; its objective is to develop computer vision systems that can answer arbitrary natural language questions about images (Lu et al., 2016; Schwartz et al., 2017; Ramakrishnan et al., 2018; Gat et al., 2020). This is useful across a range of applications, including medical image analysis, accessibility for the visually impaired, video surveillance, art, and advertisement (Barra et al., 2021).
The complexity of visual question answering naturally depends on the complexity of the images and the complexity of the natural language questions. For very simple questions, the task reduces to object recognition:

(1) Is there a triangle in this image?
Figure 1: A sample image and questions from QLEVR. Tasks involve attribute recognition, counting, comparing numbers, spatial relationships, and understanding of quantifiers. Example questions: "Are all the cyan metallic triangular prisms on the brown plane?" (Answer: True); "On the non-white planes on the left rear side of the black wood rectangular plane, all the cyan metallic cubes but at least 2 are larger than at most 7 cubes; is it right?" (Answer: False).
Object recognition can of course be a very complex task on its own, depending on the types of objects, the number of possible objects to be recognized, the amount of supervision available for inducing a good model, general image quality, etc. However, more complex queries such as (2) make visual question answering much harder:

(2) Is there a triangle inside a circle in this image?
Answering such a question in the presence of an image requires a computer vision system that not only recognizes objects, but also relations between them. CLEVR (Johnson et al., 2017) probes computer vision systems' ability to answer even more complex queries, such as:

(3) Is there a cyan cube to the right of the yellow sphere?
Question (3) involves reasoning about the relation between two objects, as well as the compositional semantics of color adjectives. In addition to shapes and colors, CLEVR also includes questions about sizes and quantities.
In this paper, we present a novel visual question-answering dataset that goes beyond CLEVR in focusing specifically on quantificational language, e.g.:

(4) Are most of the cyan cubes to the right of the yellow sphere?
Given the complexity of quantificational language, the rich typology of expressions of quantification across different languages, and the interest from philosophy, it is perhaps surprising that quantificational language has received relatively little attention in the NLP community (see §2), but we believe addressing it is a crucial step in pushing the research horizons in (visual) question answering.
Contributions Based on a comprehensive typology of English quantifiers, we build a dataset of 100,000 synthetic images and 999,446 unique questions about these images, roughly the same size as CLEVR (Johnson et al., 2017), or slightly larger. Our questions are on average longer than those of previous datasets. We evaluate three baselines from Johnson et al. (2017), a text-only baseline based on BERT (Devlin et al., 2019), and MAC (Hudson and Manning, 2018) on QLEVR and analyze performance across quantifier types.

Related Work
Visual Question Answering Challenge Datasets Several synthetic challenge datasets for visual question answering exist: Andreas et al. (2016) present SHAPES, a predecessor to CLEVR and QLEVR, also relying on synthetic constellations of colored geometric shapes and template-driven question generation. Pezzelle and Fernández (2019) create a similar dataset to probe visual question answering models for knowledge of adjectival semantics. A portion of the visual question answering dataset (Agrawal et al., 2017) likewise consists of synthetic abstract scenes. Synthetic visual question answering datasets have several advantages over ones based on real images and questions, which tend to suffer from selection biases (Liu et al., 2021), but they are of course limited in what can be induced from them. They are therefore mostly useful for probing the limitations of visual question answering architectures and off-the-shelf models. Showing results only on synthetic data is often seen as a weakness in the literature (Hassantabar, 2018), but synthetic data is useful for diagnosing the errors of visual question answering systems, in our case highlighting the challenges posed by quantifiers.
Quantifiers Quantifiers have been largely ignored in the NLP community. Question-answering datasets have been developed for numerical reasoning in English (Dua et al., 2019), and quantifier words have been identified as important sources of errors for textual entailment systems (Joshi et al., 2020). Fang and Lou (2021) recently focused on the two quantifier words "part" and "whole" in an error analysis for named entity recognition.

QLEVR
We design a challenge dataset called QLEVR (for Quantificational Language and Elementary Visual Reasoning) that requires more complex reasoning than previous visual question-answering datasets. QLEVR is designed to probe the visual reasoning capabilities of visual question-answering systems with respect to quantificational language, including detecting members of sets, quantifying sets, and reasoning about the relationships between sets. To this end, we automatically construct scene graphs (Johnson et al., 2015) and use these to generate synthetic images with ground-truth locations, attributes, and relationships for planes and objects. Each scene graph can be queried in a number of ways, and we design query templates to render natural language questions involving complex reasoning about sets of such planes and objects. We describe each of these steps in detail below.

Image Generation All images in QLEVR show objects organized in a particular way on a desk-like surface. Figure 2 shows how the images are generated. We construct a scene graph for a two-dimensional image containing areas and objects of different sizes and shapes. Scene graphs record the ground-truth locations, bounding boxes, attributes, and relationships of the planes and objects in the form of a graph or tree structure. Nodes are planes or objects annotated with attributes, and each node is connected to its spatially related nodes.
Each image contains one to five areas or geometric planes. These can be either triangular, rectangular or circular. The rest of the desk area we refer to as the white non-geometric plane. Geometric planes come in two materials (marble and wood), three colors (black, gray, and brown), and random sizes.
Each geometric plane contains one to ten (1-10) objects of different sizes and shapes, and the non-geometric plane contains one to twelve (1-12) objects of different sizes and shapes. Objects come in seven shapes (cone, cube, cylinder, pentahedron, sphere, triangular prism, and tetrahedron), two absolute sizes (small and large), five materials (metal, rubber, leather, marble, and wood), and eight colors (blue, brown, cyan, gray, green, purple, red, and yellow). The spatial relationships between planes and objects include front, back, left, and right, as well as right front, right rear, left front, and left rear.
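To make the scene-graph structure concrete, the following is a minimal sketch in Python; the class layout and field names are our own illustration rather than the exact schema shipped with QLEVR, but the attribute vocabularies match the ones listed above.

# A minimal sketch of the scene-graph structure described above.
# The SceneGraph layout and field names are illustrative assumptions,
# not the exact schema released with QLEVR.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

PLANE_SHAPES = ["triangular", "rectangular", "circular"]
PLANE_MATERIALS = ["marble", "wood"]
PLANE_COLORS = ["black", "gray", "brown"]

OBJECT_SHAPES = ["cone", "cube", "cylinder", "pentahedron",
                 "sphere", "triangular prism", "tetrahedron"]
OBJECT_SIZES = ["small", "large"]
OBJECT_MATERIALS = ["metal", "rubber", "leather", "marble", "wood"]
OBJECT_COLORS = ["blue", "brown", "cyan", "gray",
                 "green", "purple", "red", "yellow"]
RELATIONS = ["front", "back", "left", "right",
             "right front", "right rear", "left front", "left rear"]

@dataclass
class Node:
    kind: str                    # "plane" or "object"
    attributes: Dict[str, str]   # e.g. {"shape": "cube", "color": "red"}
    bbox: Tuple[float, float, float, float]  # ground-truth bounding box

@dataclass
class SceneGraph:
    nodes: List[Node] = field(default_factory=list)
    # edges[(i, j)] = spatial relation holding from node i to node j
    edges: Dict[Tuple[int, int], str] = field(default_factory=dict)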
We render three-dimensional images of the scene graphs with Blender (Community, 2018). Light settings and one of three preset camera positions were chosen at random, after validating that all objects were at least partially visible. Since the depth of the scene can affect judgments of spatial relationships in the three-dimensional image, the desk boundary is always visible as a reference for determining scene depth. Minimum distances between objects and planes were maintained to reduce the ambiguity of spatial relationships. See Appendix B for more details.
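As a rough illustration of the rendering step, the following sketch would run inside Blender's bundled Python interpreter (the bpy module exists only there); the object names and the light-jitter range are assumptions for illustration, and scene construction is elided.

# A hedged sketch of the rendering step described above.
# CAMERA_PRESETS and "Light" are hypothetical object names.
import random
import bpy

CAMERA_PRESETS = ["Camera.front", "Camera.left", "Camera.right"]

def render_scene(output_path: str) -> None:
    scene = bpy.context.scene
    # Pick one of the three preset camera positions at random.
    scene.camera = bpy.data.objects[random.choice(CAMERA_PRESETS)]
    # Jitter the key light's energy to vary the lighting.
    light = bpy.data.objects["Light"].data
    light.energy *= random.uniform(0.8, 1.2)
    scene.render.filepath = output_path
    bpy.ops.render.render(write_still=True)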
Question Generation Quantifiers are often said to be among the most important and complex constructs of natural languages (Hintikka, 1977; Barwise and Cooper, 1981). As pointed out by Bernardi and Pezzelle (2021), visual question-answering models need to master a wide range of linguistic phenomena, including negation, entailment, mutual exclusivity, and so on. We add (generalized) quantifiers to this list and design a dataset to probe the ability of visual question-answering systems to handle quantifiers in combination with other linguistic phenomena. See Table 1 for the quantifiers included in QLEVR.

See Figure 2 for how questions are formed from scene graphs. In brief, we treat the scene graph as a model and evaluate various combinations of logical operators, including quantifiers, on the scene graph, i.e., we perform a model-checking procedure (Clarke et al., 2009).
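The sketch below illustrates the model-checking idea under the scene-graph structure sketched above: generalized quantifiers are evaluated as relations between a restrictor set and a scope set of object indices, in the spirit of Barwise and Cooper (1981). The function names and the exact operator inventory are our own illustration.

# A minimal sketch of model checking quantified questions against a
# scene graph. Uses the SceneGraph/Node sketch from Image Generation.
from typing import Callable, Set

def filter_objects(scene, **attrs) -> Set[int]:
    """Return indices of object nodes matching all given attributes."""
    return {i for i, n in enumerate(scene.nodes)
            if n.kind == "object"
            and all(n.attributes.get(k) == v for k, v in attrs.items())}

# Generalized quantifiers as relations between a restrictor set A and
# a scope set B; the inventory shown here is a small illustrative subset.
QUANTIFIERS: Dict[str, Callable[[Set[int], Set[int]], bool]] = {
    "all":        lambda A, B: A <= B,
    "some":       lambda A, B: bool(A & B),
    "no":         lambda A, B: not (A & B),
    "most":       lambda A, B: len(A & B) > len(A - B),
    "at least 2": lambda A, B: len(A & B) >= 2,
    "exactly 3":  lambda A, B: len(A & B) == 3,
}

def check(scene, quant: str, restrictor: Set[int], scope: Set[int]) -> bool:
    # e.g. "Are most of the cyan cubes to the right of the yellow sphere?"
    # restrictor = filter_objects(scene, color="cyan", shape="cube")
    # scope      = objects standing in the 'right' relation to the sphere
    return QUANTIFIERS[quant](restrictor, scope)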
We introduce the notion of a question family, defined by a set of operators and a scene graph. Each question family is associated with 2-6 text templates and a set of synonyms (for shapes, colors, materials, and spatial relationships). The templates were written by hand, and each question template can generate multiple questions. For example, a template whose upper-cased variables stand for words and whose lower-cased variables stand for suffixes can generate the question:

(6) Are there exactly 2 small red rubber objects on the black wooden triangular plane?
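A minimal sketch of template instantiation follows. The plane-template placeholders <PC>, <PM>, <PS>, and <ps> appear in Appendix A; the object-side placeholders <N>, <Z>, <C>, and <M> are our assumption about how such a template might look.

# An illustrative sketch of template instantiation. The object-side
# placeholder names are hypothetical; the plane placeholders mirror
# the plane template "on the <PC> <PM> <PS> plane<ps>" in Appendix A.
TEMPLATE = ("Are there exactly <N> <Z> <C> <M> objects "
            "on the <PC> <PM> <PS> plane<ps>?")

def instantiate(template: str, values: dict) -> str:
    question = template
    for key, word in values.items():
        question = question.replace(f"<{key}>", word)
    return question

values = {"N": "2", "Z": "small", "C": "red", "M": "rubber",
          "PC": "black", "PM": "wooden", "PS": "triangular",
          "ps": ""}  # lower-cased <ps> is a plural suffix, empty here
print(instantiate(TEMPLATE, values))
# -> Are there exactly 2 small red rubber objects
#    on the black wooden triangular plane?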
We construct a total of 671 different templates by randomly pairing 11 plane templates with 61 object templates. Our questions involve attribute recognition, counting, comparing numbers or attributes, spatial relationships, and understanding of quantifiers. Figure 2 shows the operators that make up a given question family, such as filter, relate, and at least. Note that many (generalized) quantifiers are related by entailment. The question

(7) Are all the red cubes on the marble planes?
is, assuming an image with red cubes, semantically equivalent to

(8) Are no red cubes not on the marble planes?
The semantics of combinations of quantifiers can be derived using squares of opposition (Westerståhl, 2012). We exploit these entailment relations in creating QLEVR.

Some combinations of key values may generate unreasonable questions. We therefore define restrictions for each question family to avoid generating pragmatically odd, ill-posed, or trivial questions. For example, the phrase on the marble plane where there are at least 5 red objects would be pragmatically odd if there were only one marble plane in the scene. The sentence

(9) On the marble plane, do between 2 and 4 cubes have the same size as most of the cylinders?
is ill-posed if there are no cubes on the marble plane. Finally, questions like Are there more red cubes than cubes? are trivial, because they can be answered in the absence of the image: since every red cube is a cube, the assertion is always false. The same holds for

(10) On the plane with 8 balls, are there exactly 3 balls?
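The sketch below illustrates, under simple set semantics, both the square-of-opposition equivalence behind paraphrases like (7)/(8) and a presupposition check of the kind that rules out ill-posed questions like (9); the function names are ours.

# Illustrative checks, not the exact QLEVR generation code.
def all_q(A, B):
    return A <= B            # "all A are B"

def no_q(A, B):
    return not (A & B)       # "no A are B"

def square_equivalent(A, B, universe) -> bool:
    # "All A are B" <=> "No A are not B"; the set-theoretic equivalence
    # holds generally, while the natural-language paraphrase additionally
    # presupposes a non-empty A (cf. "assuming an image with red cubes").
    return all_q(A, B) == no_q(A, universe - B)

def well_posed(restrictor: set) -> bool:
    # (9) presupposes that there are cubes on the marble plane; an empty
    # restrictor makes the question ill-posed, so it would be resampled.
    return len(restrictor) > 0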
We present many examples of images and questions in the Appendix, but see also Figure 3 for a complex question with embedded quantifiers.
Dataset Characteristics QLEVR has 1,000,000 questions for 100,000 images, with each image having 10 questions generated from different question families. The dataset is balanced, preventing questions from being answered in the absence of the images. In addition, the answer distribution across question families is constrained by acceptance-rejection sampling. The data is randomly split, with 70% for training, 15% for validation, and 15% for held-out evaluation (the test set). As shown in Figure 4, QLEVR includes 27 different quantifiers, and questions contain 1-4 quantifiers. Table 2 shows the diversity and complexity of the QLEVR questions. Almost all questions are unique; very few questions appear in several splits, and always in conjunction with new scene graphs.
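A hedged sketch of how acceptance-rejection sampling can keep the answer distribution balanced within each question family is shown below; the bookkeeping and the tolerance threshold are illustrative assumptions, not the exact procedure used to build QLEVR.

# Illustrative acceptance-rejection step for answer balancing.
from collections import Counter

def accept(question: dict, answer: bool, counts: Counter,
           tolerance: int = 50) -> bool:
    """Reject a candidate question if its answer is already
    over-represented within its question family."""
    family = question["family"]
    n_true = counts[(family, True)]
    n_false = counts[(family, False)]
    surplus = (n_true - n_false) if answer else (n_false - n_true)
    if surplus >= tolerance:
        return False  # this answer is over-represented; resample
    counts[(family, answer)] += 1
    return True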

Experiments
In this section, we evaluate the performance of baselines and near-state-of-the-art models on the QLEVR dataset and perform a detailed error analysis. We ran each method three times with different random seeds and report the test-set performance of the model that achieved the best performance on the validation data.

Models
We first present three purely text-based models, Q-type (Antol et al., 2015), LSTM (Hochreiter and Schmidhuber, 1997), and BERT (Devlin et al., 2019), to evaluate the level of visual reasoning needed for QLEVR. If these perform at random (0.5), we have successfully constructed a dataset in which questions cannot be answered in the absence of images. It is important to include text-only models as baselines in visual question answering to control for spurious correlations (Gat et al., 2020). We shall see in §4.2 that while Q-type performs at chance level, the BERT and LSTM baselines are able to pick up on some spurious correlations. We also evaluate two standard visual question answering architectures: one based on a combination of convolutional and recurrent neural networks (CNN+LSTM), and one attention-based model, MAC (Hudson and Manning, 2018).

• Q-type (Antol et al., 2015): Similar to the "per Q-type prior" method in Antol et al. (2015), this baseline predicts the most popular answer for each question type.
• LSTM (Hochreiter and Schmidhuber, 1997): Question words are embedded as 300-dimensional vectors and fed into an LSTM network. The last hidden state representation is passed into a multi-layer perceptron (MLP) to predict the final answer.
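As an illustration of this baseline, a PyTorch sketch follows; the hidden size, the MLP depth, and the tokenization are assumptions, since the description above only fixes the 300-dimensional embeddings, the LSTM encoder, and the MLP classifier.

# A sketch of the text-only LSTM baseline. Hyperparameters other than
# the 300-d embeddings are illustrative assumptions.
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    def __init__(self, vocab_size: int, n_answers: int = 2,
                 embed_dim: int = 300, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_answers))

    def forward(self, question_tokens: torch.Tensor) -> torch.Tensor:
        # question_tokens: (batch, seq_len) integer token ids
        embedded = self.embed(question_tokens)
        _, (h_n, _) = self.lstm(embedded)
        return self.mlp(h_n[-1])  # logits over the answer set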

Results

Table 3 shows the results of the five methods described in §4.1 on the QLEVR test set. We make the following observations.

Analysis by Quantifier Type
1. Q-type exhibits performance levels around 50% for every quantifier type, showing that the answer distribution of QLEVR is uniform.
2. The text-only LSTM and BERT achieve average accuracies of 64.6% and 65.8%, respectively. These results suggest that even if the answers within each question family are distributed uniformly, there may still be spurious correlations: objects with more detailed attribute descriptions may be more likely to get a false answer. For example, the question "Are there more than 3 small blue cubes on the black planes?" is more likely to get a false answer than "Are there more than 3 blue objects on the black planes?".

Figure 4: The distribution of the number of quantifiers in each QLEVR question.
4. [...] higher than for quantifiers that require the number of objects to match exact values (e.g., exactly N).
5. Quantifiers without numerals (e.g., all, most, not all, some, and some but not all) lead to lower accuracies than other quantifiers, showing that reasoning with these quantifiers is harder. This highlights the need to include such quantifiers in challenge datasets to push advancements in visual question answering.

Figure 5 shows how accuracy varies as the number of quantifiers in a question increases. The more quantifiers a question contains, the more complex its semantics.

Number of Planes
We also test the visual reasoning abilities of these models by examining errors across the number of planes involved in answering the question. Appendix A introduces all the plane templates in our question families. We use the plane template "on the <PC> <PM> <PS> plane<ps>" for our analysis, because this template targets planes without the influence of quantifiers or spatial relationships. The QLEVR test set contains 13,612 questions with this plane template. The left graph in Figure 6 shows how accuracy varies with the number of target planes that need to be reasoned about. Among the 13,612 questions, 10,288 involve a single plane and 3,324 involve multiple planes. For the language-only models Q-type, BERT, and LSTM, the number of target planes does not significantly affect accuracy. However, for CNN+LSTM and MAC, questions involving just a single plane are harder to answer than those involving multiple planes. This is because, for visual models, planes enable disambiguation and thereby reduce the required reasoning. The right graph in Figure 6 compares accuracy on questions that do not refer to specific planes (no attribute) to questions that refer to specific planes (with attributes). Among the 13,612 questions, 1,340 do not refer to specific planes, whereas 12,272 do. This distinction has little impact on the performance of our text-only models. For CNN+LSTM and MAC, however, examples in the no attribute class exhibit higher accuracies than those in the with attributes class. This, again, shows performance is better when less visual reasoning is required.

Figure 6: Results for questions containing "on the <PC> <PM> <PS> plane<ps>". Left: the effect of the number of target planes on answer accuracy; single means the reasoning process essentially only needs to consider one plane in the image, while multiple means several planes need to be considered. Right: the effect of whether the plane has an attribute description; no attribute ("on the planes") means the reasoning process does not need to pick out specific planes and can treat the image as a whole, while with attributes (e.g., "on the wooden plane") means specific plane(s) need to be considered.

Discussion
In this paper, we proposed a dataset, QLEVR, for Quantificational Language and Elementary Visual Reasoning. QLEVR probes the ability of visual question-answering systems to reason with quantificational language, including 27 different quantifiers and combinations thereof. It requires complex visual reasoning to locate specific planes and understand the various relationships between objects. We increase the semantic diversity of the questions by negating quantifiers and by using different templates for semantically equivalent questions. Our analysis highlights how challenging such examples are for visual question-answering systems, and we hope that QLEVR will help push the research horizons in visual question answering by zooming in on the challenges posed by quantificational language.

One fundamental limitation is that QLEVR only considers English questions; we plan to extend it to other, typologically unrelated languages. Moreover, QLEVR can easily be extended by adding new question families, as well as questions whose answers are not limited to true or false, e.g., with numbers or attributes as answer types. In addition to the three-dimensional images, we also provide two-dimensional images and scene graphs recording the ground-truth information (see Figure 2). It is also possible to generate questions about the 2D images by simply modifying our question families. We hope these two datasets can be used for transfer learning for visual question answering in the future.

Supplementary Material

A Question Templates
As described in Section 3, QLEVR question templates are composed of 11 plane templates and 61 object templates, randomly paired. In this section we detail the differences between these templates.
Plane Templates. The role of the plane templates is to direct the question at specific planes (regions) in the image through restrictions (attributes, spatial relations, and explicitly restricted quantifier phrases). The plane templates can generate questions of the following types:

• On the white non-geometric planes.
• On the geometric plane with a different shape (color/material) from other planes.

To avoid pragmatically odd questions, we ensure that the number of planes picked out by plane templates with restrictions of spatial relations or explicitly restricted quantifier phrases (e.g., On the brown planes behind the gray plane, or On the brown plane where there are exactly 3 balls) is less than the number of planes picked out by the templates without these restrictions (e.g., On the brown planes) for the same scene graph.
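This constraint can be checked mechanically; the following sketch (with hypothetical filter functions) illustrates the idea.

# Illustrative check: a restricted plane phrase must pick out a
# non-empty, strictly smaller set of planes than its unrestricted
# counterpart, otherwise the restriction is redundant or vacuous.
def restriction_informative(scene, base_filter, restricted_filter) -> bool:
    base = base_filter(scene)              # e.g. "the brown planes"
    restricted = restricted_filter(scene)  # e.g. "...behind the gray plane"
    return 0 < len(restricted) < len(base)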
Object Templates. We can use the operator representation of the question templates to analyze model performance on the following forms of reasoning (see Figure 7 for results per question type):
• Existence type 1: Questions ask whether a certain type of quantifier-restricted object exists on one or some specific planes (e.g., "Whether all the cyan cubes [Plane Template]?").
• Existence type 2: Questions ask whether a certain type of quantifier-restricted object exists in a certain direction of a unique object (e.g., "[Plane Template], are there fewer than 3 balls behind the cyan cube?").
• Comparing attributes: Questions ask whether two types of quantifier-restricted objects have the same value for some attributes (e.g., "[Plane Template], is there any small cylinders that has the same color as most leathery tetrahedrons?").
• Quantity comparison: Questions compare the size of two sets of objects (e.g., "[Plane Template], are there more big blocks than rubber balls?").
• Size comparison: Questions ask which of two quantifier-restricted objects has a larger size (e.g., "[Plane Template], some red cones are larger than some but not all of the metal cones; is it right?").
• Spatial relations: Questions involve the spatial relationship between objects (e.g., "[Plane Template], are there more big blocks in front of the yellow cylinder than rubber balls to the left rear of the small block?").

Figure 7: Accuracy per question type on the QLEVR dataset.

Figure 7 shows the performance on the above question types. As can be seen, MAC outperforms the other models on most question types, suggesting that it has better reasoning ability in complex scenes. The only exception is the quantity comparison task, where BERT performs slightly better than MAC. Questions of Existence type 1 obtain better results than Existence type 2 for the vision-language models CNN+LSTM and MAC, suggesting that the positional relationship between object and plane is easier for the models to infer than the spatial relationship between objects. For questions of Quantity comparison, MAC and CNN+LSTM perform on par with LSTM, suggesting that the image features extracted by ResNet-101 may contain little information related to counting in complex scenes.

B Rendering Details

Figure 8 shows the materials and object models made with Blender (Community, 2018), as well as the appearance of the different colors on these materials. Two variants each of the leather, marble, and wood materials were created to further enrich the diversity of objects in the dataset. The images of the plane materials were made by modifying images released under CC0 1.0 Universal. Note that after the overall scene is rendered, objects of certain materials look different depending on the color and material of the plane they are in contact with, as well as the positions of the camera and lights.

C Example Images and Questions
The remaining pages show images and questions generated by the combination of our different plane templates and object templates. Each question is annotated with its answer and the quantifiers it contains, where N stands for Number, F stands for Fraction, and O stands for Object.

Figure 8: From left to right, the object shapes in (a) are cone, cube, cylinder, pentahedron, sphere, triangular prism, and tetrahedron; the plane attributes in (b) are black marble, black wood, brown marble, brown wood, gray marble and gray wood; the colors in (c)~(g) are blue, brown, cyan, gray, green, purple, red and yellow.
Question: Whether all the large brown objects are on the white plane?
Answer: False
Quantifiers: all

Question: Some large rubber tetrahedron is not on the gray marble plane; is it right?
Answer: True
Quantifiers: not all (some ¬)

Question: It is not the case that all the big blue rubbery spheres are not on the gray rectangular plane; is it right?
Answer: False
Quantifiers: some (¬ all ¬)

Question: It's not the case that some large purple metallic triangular prism is on the planes where there are 9 items in total; is it right?
Answer: True
Quantifiers: total, no (¬ some)

Question: Whether some but not all of the large purple rubber objects are on the marble planes where there are 4 blue objects in total?
Answer: False
Quantifiers: total, some but not all

Question: Are there at most 3 small blue objects on the dappled planes where there are 5 big triangular prisms in total?
Answer: True
Quantifiers: total, at most N

Question: All the big wooden blocks but at least 2 are not on the planes where there are exactly 2 cylinders on each plane; is it right?
Answer: True
Quantifiers: each, exactly N, at least N (all but at least N ¬)

Question: It is not the case that at most 2 wood cylinders are on the quadrilateral plane where there are exactly 2 purple wood cylinders; is it right?
Answer: False
Quantifiers: each, exactly N, more than N (¬ at most N)

Question: Are there fewer than 2 small purple dappled cylinders on the planes where there is exactly 1 purple block on each plane?
Answer: True
Quantifiers: each, exactly N, fewer than N

Question: All the tiny red wood objects but 1 are not on the non-white plane where the shape of the plane is different from that of other planes; is it right?
Answer: True
Quantifiers: exactly N (all but N ¬)

Question: Are there between 1 and 3 small red leathery cubes on the circular plane to the left rear of the brown quadrilateral plane?
Answer: True
Quantifiers: between

Question: All the red leather objects but at most 3 are on the geometric plane where the material of the plane is different from that of other planes; is it right?
Answer: False
Quantifiers: all but at most N