Language-Mediated, Object-Centric Representation Learning

We present Language-mediated, Object-centric Representation Learning (LORL), a paradigm for learning disentangled, object-centric scene representations from vision and language. LORL builds upon recent advances in unsupervised object discovery and segmentation, notably MONet and Slot Attention. While these algorithms learn an object-centric representation just by reconstructing the input image, LORL enables them to further associate the learned representations with concepts, i.e., words for object categories, properties, and spatial relationships, drawn from language input. These object-centric concepts derived from language in turn facilitate the learning of object-centric representations. LORL can be integrated with various language-agnostic unsupervised object discovery algorithms. Experiments show that integrating LORL consistently improves the performance of unsupervised object discovery methods on two datasets with the help of language. We also show that concepts learned by LORL, in conjunction with object discovery methods, aid downstream tasks such as referring expression comprehension.


Introduction
Cognitive studies show that human infants develop object individuation skills from diverse sources of information: spatial-temporal information, object property information, and language (Xu, 1999, 2007; Westermann and Mareschal, 2014). Specifically, young infants develop object-based attention that disentangles the motion and location of objects from their visual appearance features. Later on, they can leverage the knowledge acquired through word learning to solve the problem of object individuation: words provide clues about object identity and type. The general picture from cognitive science is that object perception and language codevelop in support of one another (Bloom, 2002).

Figure 1: Two illustrative cases of Language-mediated, Object-centric Representation Learning. Different colors in the segmentation masks indicate individual objects recognized by the model. LORL can learn from visual and language inputs to associate various concepts (black, pan, leg) with the visual appearance of individual objects. Furthermore, language provides cues about how an input scene should be segmented into individual objects: (a) segmenting the frying pan and its handle into two parts (Segmentation II) yields an incorrect answer to the question; (b) an incorrect parsing of the chair image makes the counting result wrong.
Our long-term goal is to endow machines with similar abilities. In this paper, we focus on how language may support object discovery and segmentation. Recent work has studied the problem of unsupervised object representation learning, though without language. As an example, factorized, object-centric scene representations have been used in various kinds of prediction (Goel et al., 2018), reasoning (Yi et al., 2018), and planning tasks (Veerapaneni et al., 2020), but they have not considered the role of language and how it may help object representation learning.
As a concrete example, consider the input images shown in Fig. 1 and the paired questions. From language, we can learn to associate concepts, such as black, pan, and leg, with the referred object's visual appearance. Further, language provides cues about how an input scene should be segmented into individual objects: a wrong parsing of the input scene will lead to an incorrect answer to the question. We can learn from such failure that the handle belongs to the frying pan (Fig. 1a) and the chair has four legs (Fig. 1b).
Motivated by these observations, we propose a computational learning paradigm, Language-mediated, Object-centric Representation Learning (LORL), which associates learned object-centric representations with their visual appearance (masks) in images and with concepts, i.e., words for object properties such as color, shape, and material, as provided in language. Here the language input can be either descriptive sentences or question-answer pairs. LORL requires no annotations on object masks, categories, or properties during the learning process.
In LORL, four modules are jointly trained. The first is an image encoder, learning to encode an image into factorized, object-centric representations. The second is an image decoder, learning to reconstruct masks for individual objects from the learned representations by reconstructing the input. These two modules share the same formulation as recent unsupervised object discovery research: learning to decompose the image into a series of slot profiles, comprised of pixel masks and latent embeddings. Each slot profile is expected to represent a single object in the image.
The third module in LORL is a pre-trained semantic parser that translates the input sentence into a semantic, executable program, where each concept (i.e., words for object properties such as red) is associated with a vector space embedding. Finally, the last module, a neural-symbolic program executor, takes the object-centric representation from Module 1, intermediate representations from Module 2, and concept embeddings and the semantic program from Module 3 as input, and outputs an answer if the language input is a question, or TRUE/FALSE if it's a descriptive sentence. The correctness of the executor's output and the quality of reconstructed images (as output of Module 2) are the two supervisory signals we use to jointly train Modules 1, 2, and 4.
We integrate the proposed LORL with state-of-the-art unsupervised discovery methods, MONet (Burgess et al., 2019) and Slot Attention (Locatello et al., 2020). The evaluation is based on two datasets: Shop-VRB (Nazarczuk and Mikolajczyk, 2020), which contains images of daily objects and question-answer pairs, and PartNet (Mo et al., 2019), which contains images of furniture with hierarchical structure, supplemented by descriptive sentences we collected ourselves. We show that LORL consistently improves existing methods on unsupervised object discovery, making them much more likely to group different parts of a single object into a single mask.
We further analyze the object-centric representations learned by LORL. In LORL, conceptually similar objects (e.g. objects of similar shapes) appear to be clustered in the embedding space. Moreover, experiments demonstrate that the learned concepts can be used in new tasks, such as visual grounding of referring expressions, without any additional fine-tuning.

Related Work
Unsupervised object representation learning. Given an input image, unsupervised object representation learning methods segment objects in the scene and build an object-centric representation for them. A mainstream approach has focused on compositional generative scene models that decompose the scene into a mixture of component images (Greff et al., 2016; Eslami et al., 2016; Greff et al., 2017; Burgess et al., 2019; Engelcke et al., 2020; Greff et al., 2019; Locatello et al., 2020; Goyal et al., 2020). In general, these models use an encoder-decoder architecture: the image encoder encodes the input image into a set of latent object representations, which are fed into the image decoder to reconstruct the image. Specifically, Greff et al. (2019), Burgess et al. (2019), and Engelcke et al. (2020) use recurrent encoders that iteratively localize and encode objects in the scene. Another line of research (Eslami et al., 2016; Crawford and Pineau, 2019; Kosiorek et al., 2018; Stelzner et al., 2019; Lin et al., 2020) leverages object locality to attend to different local patches of the image. These models often use a pixel-level reconstruction loss. In contrast, we propose to explore how language, in addition to visual observations, may contribute to object-centric representation learning. There has also been work that uses other types of supervision, such as dynamics prediction (Kipf et al., 2020; Bear et al., 2020) and multi-view consistency (Prabhudesai et al., 2020). In this paper, we focus on unsupervised learning of object-centric representations from static images and language.
Visual concept learning. Learning visual concepts from language and other forms of supervision provides useful representations for various downstream tasks, such as image captioning (Yin and Ordonez, 2017; Wang et al., 2018), visual question answering (Yi et al., 2018; Huang et al., 2019), shape differentiation (Achlioptas et al., 2019), image classification (Mu et al., 2020), and scene manipulation (Prabhudesai et al., 2020). Previous work has focused on various types of representations (Ren et al., 2016; Wu et al., 2017), training algorithms (Faghri et al., 2018; Morgado et al., 2020), and supervision (Johnson et al., 2016; Yang et al., 2018). In this paper, we focus on learning visual concepts that can be grounded in object-centric representations. Recent work on object-centric grounding of visual concepts (Wu et al., 2017; Mao et al., 2019; Hudson and Manning, 2019; Prabhudesai et al., 2020) has shown great success in achieving high performance in downstream tasks and strong generalization from a small amount of data. However, these methods assume pre-trained object detectors to generate object proposals in the scene. In contrast, our LORL learns to individuate objects and associates concepts with the learned object-centric representations without any annotations on object segmentation masks or properties.

Preliminaries
Before delving into our language-mediated, object-centric representation learning paradigm, we first discuss a general formulation that unifies several recent unsupervised object representation learning methods, and a neuro-symbolic framework for learning visual concepts from language.

Unsupervised Object-Centric Representation Learning
Given an image I, a typical unsupervised object representation learning model will decompose the scene into a series of slot profiles {(z_1, x_1, m_1), ..., (z_K, x_K, m_K)}, where each slot profile is expected to represent an object (or nothing, as the number of slots may be greater than the actual number of objects in the scene). Here z_i is the object feature, x_i is the object image, and m_i is the object mask specifying its location in the scene.
In our paper, we focus on two recent models, MONet (Burgess et al., 2019) and Slot Attention (Locatello et al., 2020). MONet uses a recurrent spatial attention network (Ronneberger et al., 2015) to segment out objects in the scene, and adopts a variational autoencoder (Kingma and Welling, 2014) to encode objects as well as reconstruct object images for self-supervision. At a very high level, its objective function is

L_MONet = Σ_{k=1}^{K} ||m_k ⊙ (I − x_k)||_2^2 + β Σ_{k=1}^{K} D_KL(q(z_k | I) || p(z)),   (1)

where the first term is a pixel-wise L_2 reconstruction loss and the second term computes the KL divergence between the distribution of the z_k's and a prior Gaussian distribution.
Slot Attention uses a transformer-like attention network (Vaswani et al., 2017) to extract object features, and decodes them with convolutional neural networks into component images and object masks. The model is trained with a reconstruction loss in the form of the L_2-norm:

L_SlotAttn = ||I − Σ_{k=1}^{K} m_k ⊙ x_k||_2^2.   (2)
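The shared structure of these objectives, i.e., composing per-slot masks and component images into a full reconstruction and penalizing the pixel-wise error, can be sketched in a few lines of NumPy. The array shapes and the assumption that masks are normalized over slots are ours; `reconstruction_loss` is a hypothetical helper, not code from either model:

```python
import numpy as np

def reconstruction_loss(image, masks, components):
    """Pixel-wise L2 reconstruction loss in the spirit of Eq. (2).

    image:      (H, W, 3) input image I.
    masks:      (K, H, W) per-slot masks m_k, normalized over K.
    components: (K, H, W, 3) per-slot component images x_k.
    """
    # Mixture reconstruction: sum_k m_k * x_k
    recon = (masks[..., None] * components).sum(axis=0)
    return float(((image - recon) ** 2).sum())

# Toy check: if one slot reproduces the image exactly and owns every
# pixel, the loss is zero.
H, W = 4, 4
image = np.ones((H, W, 3))
masks = np.stack([np.ones((H, W)), np.zeros((H, W))])
components = np.stack([np.ones((H, W, 3)), np.zeros((H, W, 3))])
print(reconstruction_loss(image, masks, components))  # 0.0
```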

Neuro-Symbolic Concept Learning
The neuro-symbolic concept learner (NS-CL; Mao et al., 2019) learns visual concepts by looking at images and reading paired questions and answers. NS-CL takes a set of segmented objects in a given image as its input, extracts their visual features with a ResNet (He et al., 2015), translates the input question into an executable program by a semantic parser, and executes the program based on the object-centric representation to answer the question. The key idea of NS-CL is to explicitly represent individual concepts in natural language (colors, shapes, spatial relationships, etc.) as vector space embeddings, and associate them with the object embeddings. NS-CL answers the input question by executing the program based on the object-centric representation. For example, in order to query the name of the white object in Fig. 2, NS-CL first filters out the object by computing the cosine similarity between the concept white and individual object representations, which produces a "mask" vector where each entry denotes the probability that an object has been selected. The output "mask" on the objects is fed into the next module and the execution will continue. The last query operation produces the answer to the question. The vector embeddings of individual objects and the concepts are jointly trained based on language supervision.
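A minimal sketch of the filter operation described above. As an illustration only (not NS-CL's exact formulation), we assume the per-object probabilities come from a sigmoid over temperature-scaled cosine similarities:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def filter_op(object_embeddings, concept_embedding, tau=0.1):
    """Soft filter: probability that each object carries the concept.

    The sigmoid squashing and temperature tau are our assumptions for
    turning similarities into a "mask" vector over objects.
    """
    sims = np.array([cosine(z, concept_embedding) for z in object_embeddings])
    return 1.0 / (1.0 + np.exp(-sims / tau))

objects = np.array([[1.0, 0.0], [0.0, 1.0]])   # two object embeddings
white = np.array([0.9, 0.1])                   # concept embedding
mask = filter_op(objects, white)
print(mask)  # the first object matches "white" far more strongly
```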

Language-mediated, Object-centric Representation Learning
Marrying the ideas of unsupervised object-centric representation learning and neuro-symbolic concept learning, we are able to learn an object-centric representation using both visual and language supervision. Fig. 2 shows an overview of Languagemediated, Object-centric Representation Learning (LORL). In LORL, four modules are optimized jointly: an image encoder, an image decoder, a semantic parser, and a neuro-symbolic program executor.
Image encoder. Given an input image, we first use the image encoder (Fig. 2a) to individuate objects in the scene and extract an object-centric scene representation: it takes the image as input and produces a collection of latent slot embeddings {z_i}.
Image decoder. The decoder (Fig. 2b) takes the object-centric representation produced by the image encoder and produces a 3-tuple (x_k, m_k, s_k) for each individual slot, where x_k reconstructs the RGB image of the slot, m_k reconstructs the mask, and s_k ∈ [0, 1] is a scalar indicating the objectness of the slot, i.e., whether the k-th slot corresponds to a single object in the scene. Here, we have extended the general pipeline we described in Section 3.1 with an objectness indicator. It serves dual purposes. First, it weights each reconstructed component image while generating the reconstructed image; mathematically, the reconstructed image Î is computed as Î = Σ_{k=1}^{K} s_k · (m_k ⊙ x_k). Second, it mediates the output of all filter operations in the program executor.
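The objectness-weighted reconstruction Î = Σ_k s_k · (m_k ⊙ x_k) can be sketched as follows (the array shapes are our assumptions):

```python
import numpy as np

def reconstruct(slots):
    """Objectness-weighted reconstruction: I_hat = sum_k s_k * (m_k ⊙ x_k).

    slots is a list of (x_k, m_k, s_k) tuples with x_k of shape (H, W, 3),
    m_k of shape (H, W), and s_k a scalar objectness score in [0, 1].
    """
    x0, _, _ = slots[0]
    out = np.zeros_like(x0)
    for x, m, s in slots:
        out += s * (m[..., None] * x)
    return out

# A slot with objectness 0 contributes nothing to the reconstruction.
H, W = 2, 2
full = (np.ones((H, W, 3)), np.ones((H, W)), 1.0)
empty = (np.full((H, W, 3), 5.0), np.ones((H, W)), 0.0)
print(reconstruct([full, empty])[0, 0])  # [1. 1. 1.]
```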
In this paper, we will experiment with two image encoder-decoder options: MONet (Burgess et al., 2019) and Slot Attention (Locatello et al., 2020). They are both compatible with the learning paradigm described above. For both models, we use a single linear layer to predict the objectness score for each slot on top of the second-last layer of their image decoders.
Semantic parser. A pre-trained semantic parser (Fig. 2c) translates the input question into an executable program composed of primitive operations, such as filter, which selects objects matching certain concepts, and query, which queries an attribute of the input object. We use roughly the same domain-specific language (DSL) for representing programs as CLEVR (Johnson et al., 2017a; see the appendix for details). All concepts that appear in the program, such as white, are associated with distinct, learnable concept embedding vectors.

Neuro-symbolic program executor. The program executor (Fig. 2d) takes as input the object-centric representation {z_k} from the image encoder, the objectness scores {s_k} from the image decoder, the concept embeddings, and the program generated by the semantic parser. It executes the program based on the visual and concept representations to answer the question. The original program executor in NS-CL (Section 3.2) assumes a pre-trained object detector for generating object proposals. In LORL, we instead associate each object representation with an objectness score s_k. Recall that a filter operation in NS-CL produces a mask vector indicating whether each object has been selected. Here, we mediate the output of a filter(c) operation as min(s_k, filter(c)). Intuitively, a slot will be selected only if, first, it matches concept c and, second, it corresponds to a single object in the scene.
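The mediation of filter outputs by objectness scores amounts to an elementwise minimum; a small sketch with made-up numbers:

```python
import numpy as np

def mediated_filter(filter_probs, objectness):
    """Combine concept match with objectness: min(s_k, filter(c)_k).

    A slot survives the filter only if it both matches the concept and
    is judged to correspond to a single object.
    """
    return np.minimum(objectness, filter_probs)

filter_probs = np.array([0.9, 0.8, 0.7])   # per-slot concept match
objectness   = np.array([1.0, 0.1, 0.9])   # per-slot objectness s_k
print(mediated_filter(filter_probs, objectness))  # [0.9 0.1 0.7]
```

The second slot is suppressed despite a strong concept match, because its low objectness score suggests it does not correspond to a single object.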
Training paradigm. During training, we jointly optimize the image encoder, the image decoder, and the concept embeddings by minimizing the loss

L = α L_perception + β L_reasoning,

where α and β are loss weights (see the appendix). For the MONet-based image encoder-decoder, we use Equation 1 as the perception loss L_perception; for the Slot Attention-based encoder-decoder, we use Equation 2. The neuro-symbolic program executor produces a distribution over candidate answers to the input question; we use the cross-entropy loss between the predicted answer and the ground-truth answer as L_reasoning.

We use a three-stage training paradigm in LORL. First, we train the model with only visual inputs, using L_perception, for N_1 epochs. Next, we fix the image encoder and the image decoder, and optimize the concept embeddings with the loss term L_reasoning for N_2 epochs; by this second stage, the image encoder and the decoder can already produce decent object segmentation results. Finally, we jointly optimize all three modules for N_3 epochs. We provide detailed information about the hyperparameters for different models in the appendix.
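The three-stage schedule can be summarized in pseudocode-style Python. `StubModel` and its method names are hypothetical stand-ins for the actual modules, and the exact weighting of the two losses is our assumption (the appendix lists loss weights α and β):

```python
class StubModel:
    """Hypothetical stand-in for the trainable LORL modules."""
    def __init__(self):
        self.calls = []
        self.vision_frozen = False
    def perception_loss(self, batch): return 1.0
    def reasoning_loss(self, batch): return 1.0
    def step(self, loss): self.calls.append(loss)   # one optimizer step
    def freeze_vision(self): self.vision_frozen = True
    def unfreeze_vision(self): self.vision_frozen = False

def train_lorl(model, data, n1, n2, n3, alpha=1.0, beta=1.0):
    # Stage 1: vision only.
    for _ in range(n1):
        for batch in data:
            model.step(model.perception_loss(batch))
    # Stage 2: freeze encoder/decoder, learn concept embeddings.
    model.freeze_vision()
    for _ in range(n2):
        for batch in data:
            model.step(model.reasoning_loss(batch))
    # Stage 3: joint fine-tuning of all trainable modules.
    model.unfreeze_vision()
    for _ in range(n3):
        for batch in data:
            model.step(alpha * model.perception_loss(batch)
                       + beta * model.reasoning_loss(batch))

m = StubModel()
train_lorl(m, [None], n1=1, n2=1, n3=1, alpha=1.0, beta=0.5)
print(len(m.calls))  # 3 optimizer steps, one per stage
```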

Experiments
We first evaluate whether the representations learned by LORL lead to better image segmentation with the help of language. We then evaluate how these representations may be used for instance retrieval, visual reasoning, and referring expression comprehension.

Image Segmentation
Data. We use two datasets for image segmentation evaluation. The first, Shop-VRB-Simple, is based on Shop-VRB (Nazarczuk and Mikolajczyk, 2020), a dataset of complex household objects and question-answer pairs. The second is based on chairs in PartNet (Mo et al., 2019), a dataset where the objects are different parts of a chair. Fig. 3 shows some examples from the two datasets.
Shop-VRB is a visual reasoning dataset, similar to CLEVR (Johnson et al., 2017a), but with complex household objects of different sizes, weights, materials, colors, shapes, and mobility. Because the original Shop-VRB dataset includes very small and highly transparent objects and complex backgrounds, which current unsupervised representation learning models cannot handle, we generate 10K images with a clean background ourselves, using large objects from the dataset. We also pair every image with 9 questions, resulting in 90K questions in total. The test split has 960 images and 8.6K questions. We name this variant Shop-VRB-Simple.
While the previous literature on unsupervised object segmentation mainly focuses on settings where objects are spatially disentangled, we also explore how language may help when the objects of interest are different parts of a global shape. To this end, we collect a new dataset, PartNet-Chairs, using chair shapes from PartNet. Every image shows a chair, where each part of the chair (legs, seat, back, arms) is randomly assigned a color. We select six different chair shapes with one or four legs and zero or two arms. We generate 5K images for training. Each image is paired with 8 descriptive sentences generated from human-written templates, resulting in 40K examples in total. The test split has 960 images. Each sentence describes the name and color of parts; we provide all templates in the supplementary material. We are interested in whether object-centric representation learning models can separate these parts, and whether and how language may help in this scenario.
Metrics. We use three metrics for evaluation. Following Greff et al. (2019), we first use the Adjusted Rand Index (ARI; Rand, 1971; Hubert and Arabie, 1985). It treats segmentation as a clustering problem: each mask is the cluster index that the pixels within it belong to. ARI is computed as the similarity between the predicted and ground-truth clusters, and ranges from 0 (random) to 1 (perfect match).
In practice, we found this pixel-wise metric to be sensitive to the size of objects: a model that infrequently makes mistakes on large objects will have a lower ARI than one that frequently mis-segments small objects. Thus, we additionally design two object-centric metrics:
• Ground Truth Split Ratio (GT Split) measures the ratio of objects (GT masks) that are covered by more than one prediction mask.
• Prediction Split Ratio (Pred Split) measures the ratio of prediction masks that cover more than one object (GT mask).
Concretely, we first assign each pixel to the prediction mask with the maximum value at that pixel. We say a prediction mask covers an object if it covers at least 20% of the object's pixels. The GT and Pred Split ratios are defined as:

GT Split = (# of objects covered by > 1 prediction mask) / (# of objects covered by > 0 prediction masks),
Pred Split = (# of prediction masks covering > 1 object) / (# of prediction masks covering > 0 objects).
Ideally, there is a one-to-one correspondence between objects and predicted masks, with both GT Split and Pred Split being 0. Please refer to the appendix for detailed comparison of the proposed metrics with other metrics.
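A minimal sketch of the two split ratios, assuming each pixel has already been assigned to its argmax prediction mask so that both segmentations are integer label maps (0 = background):

```python
import numpy as np

def split_ratios(gt, pred, thresh=0.2):
    """GT and Pred split ratios over integer label maps gt, pred (H, W).

    A predicted mask "covers" a GT object if it overlaps at least
    `thresh` of the object's pixels (20% in the paper).
    """
    gt_ids = [g for g in np.unique(gt) if g != 0]
    pred_ids = [p for p in np.unique(pred) if p != 0]
    covers = set()  # (pred_id, gt_id) pairs where pred covers the object
    for g in gt_ids:
        g_pix = gt == g
        for p in pred_ids:
            overlap = np.logical_and(g_pix, pred == p).sum()
            if overlap >= thresh * g_pix.sum():
                covers.add((p, g))
    gt_cov = {g: sum(1 for (_, gg) in covers if gg == g) for g in gt_ids}
    pr_cov = {p: sum(1 for (pp, _) in covers if pp == p) for p in pred_ids}
    gt_split = sum(1 for n in gt_cov.values() if n > 1) / max(
        sum(1 for n in gt_cov.values() if n > 0), 1)
    pred_split = sum(1 for n in pr_cov.values() if n > 1) / max(
        sum(1 for n in pr_cov.values() if n > 0), 1)
    return gt_split, pred_split

# One GT object split across two predicted masks: GT Split = 1, Pred Split = 0.
gt = np.array([[1, 1], [1, 1]])
pred = np.array([[1, 1], [2, 2]])
print(split_ratios(gt, pred))  # (1.0, 0.0)
```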
Results. The quantitative results on Shop-VRB-Simple are summarized in Table 1. We show the mean and standard error on each metric over 3 runs. Since our semantic parsing module is trained on question-program pairs, it achieves nearly perfect accuracy (>99.9%) on test questions. Thus, in later sections, we will focus on evaluating object segmentation, concept grounding, and downstream task performance. LORL helps Slot Attention achieve better segmentation results on all three metrics. From the visualizations in Fig. 4, we find that the original Slot Attention model struggles with metallic objects; with LORL, it performs much better in those cases.
To further explore how LORL helps Slot Attention on failure cases, we calculate the Ground Truth Split Ratio for each object category, and find that Slot Attention most often fails to segment coffee makers, blenders, and toasters as wholes. These objects have complex sub-parts, and their appearance changes quickly when the viewpoint changes. With the help of language, Slot Attention improves consistently over its ablative variants across all three metrics, reducing the GT Split by 50% on average (Table 1b). Furthermore, we include ablation studies in the appendix on how different types of questions and different modules (the objectness score module and the concept learning module) contribute to the performance improvement.
On PartNet-Chairs, LORL also helps both MONet and Slot Attention improve by a large margin, as shown in Table 2 (results averaged over 4 runs). MONet in general performs well on this dataset, though it still sometimes merges different parts with the same color into a single mask. An example can be found in Fig. 5, column 3, where the blue arm and the blue bottom in the input image are put into the same mask by MONet. This issue is alleviated in LORL + MONet. Fig. 5 also includes examples of how LORL helps Slot Attention. As shown in Table 2, the improvement on Slot Attention is larger and more consistent than that on MONet. We hypothesize that this is because the two models adopt different approaches to aligning object features and masks: while MONet uses separate modules for segmentation and object representation learning, Slot Attention obtains masks by directly decoding object representations. Having a shared representation might allow Slot Attention to gain more from language supervision.

Instance Retrieval
We now analyze the learned object representations on Shop-VRB-Simple. We first use them for instance retrieval: for each model, we randomly select a segmented object and use its learned representation to search for its k nearest neighbors in the feature space. Then, for each selected object, we compute how many of the k nearest neighbors belong to the same category. During the search, we only consider object representations whose corresponding mask, after decoding, has an Intersection over Union (IoU) of at least 0.75 with a ground-truth object mask. We sample 1,000 object features from each model for evaluation. Table 3 includes results with k = 1, 3, 5, suggesting that the object representations learned by LORL + Slot Attention are better for retrieval than features learned by Slot Attention alone without language. This is because Slot Attention often confuses categories that are visually similar but conceptually different, such as baking tray and chopping board.
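The retrieval protocol can be sketched as follows; the distance metric (plain Euclidean here) is our assumption:

```python
import numpy as np

def retrieval_accuracy(features, labels, k):
    """Fraction of the k nearest neighbors (excluding the query itself)
    that share the query's category, averaged over all queries."""
    hits, total = 0, 0
    for i, f in enumerate(features):
        d = np.linalg.norm(features - f, axis=1)
        d[i] = np.inf  # exclude the query itself
        nn = np.argsort(d)[:k]
        hits += sum(labels[j] == labels[i] for j in nn)
        total += k
    return hits / total

# Two well-separated clusters: every nearest neighbor shares its category.
feats = np.array([[0.0], [0.1], [5.0], [5.1]])
labels = ["mug", "mug", "tray", "tray"]
print(retrieval_accuracy(feats, labels, k=1))  # 1.0
```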

Visual Reasoning
As another analysis, we also evaluate how the learned representation of LORL + Slot Attention performs on visual question answering on the Shop-VRB-Simple dataset. Here we compare with an ablated version of LORL, trained only for the first two stages described at the end of Section 4, i.e., without the third stage that jointly fine-tunes all three trainable modules. We name this ablation LORL + SA (No FT). Through this analysis, we hope to understand the importance of jointly training the vision modules (Modules 1 and 2) and the reasoning module (Module 4). Table 4 shows that joint training is crucial for visual reasoning. This resonates with the earlier retrieval result, where visually similar but conceptually different objects were clustered together in the latent space, limiting the usefulness of the encoded information.

Referring Expression Comprehension
Finally, we evaluate the representations learned by LORL on referring expression comprehension: given an expression referring to a set of objects in the scene, such as "the white plates", the model is expected to return the corresponding object masks. After learning all the needed concepts from question-answer pairs, LORL can naturally handle referring expressions without any further training, assuming a pre-trained semantic parser.
The relatively comparable results are strong evidence that the representations learned by LORL also transfer to a new task.

Conclusion
We have proposed Language-mediated, Object-centric Representation Learning (LORL), a paradigm for learning object-centric representations from vision and language. Experiments on Shop-VRB-Simple and PartNet-Chairs show that language significantly contributes to learning better representations. This behavior is consistent across two unsupervised image segmentation models.
Through systematic studies, we have also shown how LORL helps models to learn object representations that encode conceptual information, and are useful for downstream tasks such as retrieval, visual reasoning, and referring expression comprehension.

A Domain-Specific Language (DSL)
LORL extends the domain-specific language of the CLEVR dataset (Johnson et al., 2017a) to accommodate descriptive sentences. Specifically, we add an extra primitive operation: Equal(X, y). It takes two inputs. In our case, the first argument X is the output of a Query, Exist, or Count operation. All three operations output a distribution over possible answers. The second argument y is either a word or number, such as TRUE, white, or 4. The Equal operation computes the probability of X=y. In LORL, models are trained to maximize the output probability.
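A sketch of Equal with an answer distribution represented as a dictionary (this representation is our assumption; in LORL the distribution is the output of a Query, Exist, or Count operation):

```python
def equal_op(dist, y):
    """Equal(X, y): probability that the answer X equals y, where dist
    maps candidate answers to probabilities summing to 1."""
    return dist.get(y, 0.0)

# "The chair has four legs" -> Equal(Count(filter(leg)), 4)
count_dist = {3: 0.1, 4: 0.85, 5: 0.05}
print(equal_op(count_dist, 4))  # 0.85
```

Training to maximize this probability pushes the count distribution toward the number stated in the descriptive sentence.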

B Hyperparameters
For optimization hyperparameters, we largely adopt the original settings in Burgess et al. (2019) and Locatello et al. (2020). Table 5 summarizes the hyperparameters for the loss weights (α and β), the number of training epochs of the different stages (N_1, N_2, N_3), and the batch size. We early-stop the training when QA performance converges. We skip the second training phase on PartNet-Chairs, because the first phase (vision-only) yields poor segmentation performance on this dataset, and establishing a meaningful grounding of concepts could be hard in that case. If we keep the second training phase for LORL + Slot Attention on PartNet-Chairs, the model converges more slowly in the third training phase (15 more epochs in our experiments), but the final performance remains the same.

Figure 6: (a) There are 4 objects in the scene; one of them is split into 3 masks. Thus, GT Split = 1/4 = 0.25. (b) There are 5 masks in the scene; one of them covers two objects. Thus, Pred Split = 1/5 = 0.2.

Learning rate scheduling. For Slot Attention models, during the first training stage (perception-only), we use the learning rate schedule described in the original paper on both datasets. Initially, the learning rate is linearly increased from zero to 4 × 10^-4 over the first 10K iterations. After that, we decay the learning rate by 0.5 every 100K iterations. On PartNet-Chairs, after the first stage, Slot Attention models continue to use the same learning rate schedule. On Shop-VRB-Simple, we switch to a fixed learning rate of 0.001 during the N_2 phase, which takes 20 epochs. After 20 epochs, we decrease the learning rate to 2 × 10^-4, and to 2 × 10^-5 after another 65 epochs. We use the Adam optimizer (Kingma and Ba, 2015) for Slot Attention models. For MONet models, we use RMSProp with a learning rate of 0.01 during the first stage, and 0.001 for the second and third stages.
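The warmup-and-decay schedule for the first stage can be written as a small helper; whether the decay counter starts at the end of warmup is our assumption:

```python
def slot_attention_lr(step, base=4e-4, warmup=10_000, decay_every=100_000):
    """Linear warmup to `base` over the first 10K iterations, then halve
    the learning rate every 100K iterations."""
    if step < warmup:
        return base * step / warmup
    return base * 0.5 ** ((step - warmup) // decay_every)

print(slot_attention_lr(5_000))    # 0.0002 (halfway through warmup)
print(slot_attention_lr(10_000))   # 0.0004 (warmup complete)
print(slot_attention_lr(110_000))  # 0.0002 (after one decay)
```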
Meanwhile, we also follow NS-CL (Mao et al., 2019) to use curriculum learning. Specifically, in the second training stage, we limit the number of objects in the scene to be 3. In the third training stage, we gradually increase the number of objects in the scene and the complexity of the questions.

C Implementation Details
In this section, we provide additional implementation details of our experimental setups and metrics.
GT and Pred split ratios. In this paper, we have introduced two new metrics for evaluating the performance of unsupervised object segmentation: namely, the GT split ratio and the Pred split ratio.
A simple example of how we compute the GT/Pred split ratios is shown in Fig. 6. At a high level, the GT split ratio computes the percentage of objects that are split into multiple parts in the model's segmentation, while the Pred split ratio computes the percentage of predicted masks that merge multiple objects into one. We introduce these two new metrics because the ARI score is evaluated at the pixel level and does not account for the variance in object sizes; by contrast, the GT split and Pred split metrics are computed at the object level. This difference is illustrated in Fig. 7.

Figure 7: Comparison of ARI and GT/Pred Split Ratios on example images. Pixels with the same color represent a mask produced by the models. We find that ARI is very sensitive to the size of objects, while the split ratios capture object-level failures where an object is split or multiple objects are merged.
Concretely, in Fig. 6 (a.1), two objects, the chopping board and the thermos, are wrongly segmented, while in Fig. 6 (a.2), only one object is mis-segmented. However, the ARI score of the first image is much higher because the coffee maker is large; the GT split ratio is evaluated at the object level and thus favors the second result. Similarly, in Fig. 6 (b.1), the four legs are merged into two masks, while in Fig. 6 (b.2), the seat and the back of the chair are merged into a single object. However, the first segmentation result has a significantly higher ARI score because the chair legs contribute only a small area of the image. We therefore propose to use the ARI score jointly with the proposed GT/Pred split ratios to evaluate segmentation masks.
Throughout the paper, we have been using IoU = 0.2 as the threshold while computing the GT/Pred split ratios. Table 6 summarizes the results with different IoU thresholds. LORL consistently improves the baseline.
Referring expression comprehension. In this experiment, the data is generated using code adapted from Liu et al. (2019). It contains two types of expressions: the first directly refers to an object by its properties, for example, "the white plate"; the second refers to an object by relating it to another object, for example, "the object that is in front of the mug." The output of the model is the masks of all referred objects. The dataset is composed of the same set of concepts and the same DSL as Shop-VRB-Simple.
For all methods, including ours and the baselines, we assume a pretrained semantic parser. Since the neuro-symbolic program executor outputs a distribution over all objects indicating whether each is selected, we directly multiply its output with the object segmentation masks to obtain the final output.
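This combination step amounts to a weighted sum of per-slot masks. The sketch below illustrates it; the function name and tensor shapes are illustrative assumptions, not the paper's exact interfaces.

```python
import numpy as np

def referred_mask(slot_masks: np.ndarray, select_probs: np.ndarray) -> np.ndarray:
    """Combine per-slot masks with the executor's selection distribution.

    slot_masks:   (num_slots, H, W) soft segmentation masks, one per slot.
    select_probs: (num_slots,) probability that each slot is referred to.
    Returns an (H, W) soft mask of the referred object(s).
    """
    # Broadcast the per-slot probabilities over the spatial dimensions
    # and sum the weighted masks into a single output mask.
    return (select_probs[:, None, None] * slot_masks).sum(axis=0)
```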

D Additional Results
The following section presents a collection of ablation studies on different modules of LORL, as well as a few extensions.
Objectness score. To validate the effectiveness of the proposed objectness score module, we compare two models: the original Slot Attention model and the Slot Attention model augmented with the proposed objectness score module. Both models are trained using images only (there is no language in the loop) on the Shop-VRB-Simple dataset. Table 7 summarizes the results. The objectness module alone does not improve segmentation performance.
Question type. In this section, we investigate how different types of questions affect LORL. We use the Shop-VRB-Simple dataset for evaluation. There are three types of questions in the dataset: counting (e.g., how many plates are there?), existence (e.g., is there a toaster?), and query (e.g., what is the color of the mug?). We train LORL+SA with only a single type of question (keeping the total number of questions the same). Results are summarized in Table 8. In general, training on all three types of questions improves the segmentation accuracy, with the largest gain coming from query questions. Interestingly, the best result is achieved when trained on the original dataset, where the ratio of counting, existence, and query questions is 1:1:7. Note that all these models are trained with the same number of questions and are thus directly comparable with each other.
Data efficiency. In addition, we provide another analysis by comparing models trained with different numbers of question-answer pairs. The results are shown in Table 9. Adding more language data consistently improves the results. All results are based on the LORL+SA model trained on the Shop-VRB-Simple dataset.
Comparison with NS-CL and IEP. Table 12 compares LORL+SA with NS-CL and IEP. All models are trained with the same set of question-answer pairs. Note that NS-CL has access to a pretrained object detection module, while LORL+SA and IEP do not. LORL+SA outperforms IEP, which is trained with exactly the same amount of supervision as ours, and achieves a result comparable to NS-CL.
Integration with SPACE. SPACE (Lin et al., 2020) is another popular method for unsupervised object-centric representation learning. SPACE uses parallel spatial attention to decompose the input scene into a collection of objects, and it is also compatible with the proposed learning paradigm LORL. We include additional results of LORL+SPACE on the CLEVR dataset. As shown in Table 11, LORL+SPACE shows a significant advantage over the vanilla SPACE model. Additionally, we find that SPACE shows poor segmentation results on Shop-VRB-Simple and PartNet-Chairs, regardless of whether it is integrated with LORL. For example, it frequently segments complex objects into too many fragments on Shop-VRB-Simple. We conjecture that this is because SPACE was designed for segmenting objects of similar sizes.
Baseline using language supervision. We also evaluate an additional baseline model that uses language supervision in a different way. Specifically, based on the Slot Attention model, we use a GRU to directly encode the question and answer, and concatenate the encoding with the image feature to obtain the object representation. On Shop-VRB-Simple, this
Concept quantification. Although LORL without the objectness score can achieve a comparable result in terms of QA accuracy, the objectness score is crucial when we evaluate how models discover objects in images. Here, we show that, on the Shop-VRB-Simple dataset, LORL+SA shows a significant improvement in recovering a holistic scene representation. Specifically, we extract a scene graph for each scene, where each node corresponds to a detected object. We represent each node i as the set of concepts C_i associated with the object (e.g., {large, brown, wooden, chopping board}). We associate a concept with a detected object if its cosine similarity with the object representation is greater than 0. We heuristically remove nodes that are not associated with any concepts (treating them as "background" objects) or whose objectness scores are smaller than 0.5. The remaining nodes form the detected scene graph, which we compare against the groundtruth scene graph in the following.
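The node-extraction step can be sketched as follows; the function signature and argument names are our assumptions, but the two thresholds (cosine similarity > 0 for concept association, objectness >= 0.5 for keeping a node) follow the description above.

```python
import numpy as np

def extract_scene_graph(obj_embs, objectness, concept_embs, concept_names):
    """Build scene-graph nodes by associating concepts with detected objects.

    obj_embs:      (num_objects, d) object representations.
    objectness:    (num_objects,) objectness scores in [0, 1].
    concept_embs:  (num_concepts, d) concept embeddings.
    concept_names: list of num_concepts concept names.
    Returns a list of concept sets, one per retained node.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # Cosine similarity between every object and every concept.
    sims = normalize(obj_embs) @ normalize(concept_embs).T
    nodes = []
    for i in range(len(obj_embs)):
        # Associate a concept when its cosine similarity is positive.
        concepts = {concept_names[j] for j in np.flatnonzero(sims[i] > 0)}
        # Drop "background" nodes (no concepts) and low-objectness nodes.
        if concepts and objectness[i] >= 0.5:
            nodes.append(concepts)
    return nodes
```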
For each pair of groundtruth node i and detected node j, we compute the concept IoU score based on their associated concept sets C_i and C_j as IoU(i, j) = |C_i ∩ C_j| / |C_i ∪ C_j|. Next, we perform a maximum-weight matching between the detected scene graph and the groundtruth scene graph with the Hungarian algorithm, using the IoU score as the weight for every edge and removing edges whose IoU score is smaller than a given threshold. Finally, based on the matching, we compute the precision and recall of the detected scene graph. We report the average precision and recall over the entire test set in Table 10. The results suggest that the objectness score significantly improves the precision of the extracted concepts.
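A minimal sketch of this evaluation using SciPy's Hungarian solver is shown below; the function names and the default threshold value are illustrative assumptions (the paper's exact threshold is not restated here).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def concept_iou(ci: set, cj: set) -> float:
    """IoU between two concept sets: |intersection| / |union|."""
    return len(ci & cj) / len(ci | cj) if ci | cj else 0.0

def scene_graph_pr(gt_nodes, pred_nodes, thresh=0.5):
    """Precision/recall of a detected scene graph against the ground truth.

    Nodes are matched with the Hungarian algorithm using concept IoU as
    the edge weight; matched pairs below `thresh` are discarded.
    """
    iou = np.array([[concept_iou(g, p) for p in pred_nodes] for g in gt_nodes])
    # linear_sum_assignment minimizes cost, so negate to maximize total IoU.
    rows, cols = linear_sum_assignment(-iou)
    matches = sum(iou[r, c] >= thresh for r, c in zip(rows, cols))
    precision = matches / len(pred_nodes)
    recall = matches / len(gt_nodes)
    return precision, recall
```

For instance, if the detector recovers only one of two groundtruth objects but labels it perfectly, precision is 1.0 and recall is 0.5.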