Image Manipulation via Multi-Hop Instructions - A New Dataset and Weakly-Supervised Neuro-Symbolic Approach

We are interested in image manipulation via natural language text, a task that is useful for multiple AI applications but requires complex reasoning over multi-modal spaces. We extend the recently proposed Neuro Symbolic Concept Learning (NSCL) (Mao et al., 2019), which has been quite effective for the task of Visual Question Answering (VQA), to the task of image manipulation. Our system, referred to as NEUROSIM, can perform complex multi-hop reasoning over multi-object scenes and only requires weak supervision in the form of annotated data for VQA. NEUROSIM parses an instruction into a symbolic program, based on a Domain Specific Language (DSL) comprising object attributes and manipulation operations, that guides its execution. We create a new dataset for the task, and extensive experiments demonstrate that NEUROSIM is highly competitive with or beats SOTA baselines that make use of supervised data for manipulation.


Introduction
The last decade has seen significant growth in the application of neural models to a variety of tasks including those in computer vision (Chen et al., 2017; Krizhevsky et al., 2012), NLP (Wu et al., 2016), robotics and speech (Yu and Deng, 2016). It has been observed that these models often lack interpretability (Fan et al., 2021), and may not always be well suited to handle complex reasoning tasks (Dai et al., 2019). On the other hand, classical AI systems can seamlessly perform complex reasoning in an interpretable manner due to their symbolic representation (Pham et al., 2007; Cai and Su, 2012). But these models are often weak at handling low-level representations and are not robust to noise. Neuro-Symbolic models (Dong et al., 2019; Mao et al., 2019; Han et al., 2019) overcome these limitations by combining the power of (purely) neural with (purely) symbolic representations. Studies (Andreas et al., 2016; Hu et al., 2017; Johnson et al., 2017a; Mao et al., 2019) have shown that neuro-symbolic models have several desirable properties such as modularity, interpretability, and improved generalizability.
Our aim in this work is to build neuro-symbolic models for the task of weakly supervised manipulation of images comprising multiple objects, via complex multi-hop natural language instructions. Specifically, we are interested in weak supervision that only uses the data annotated for VQA tasks, avoiding the high cost of getting supervised annotations in the form of target manipulated images.
Our key intuition here is that this task can be solved simply by querying the manipulated representation without ever explicitly looking at the target image. Prior work includes weakly supervised approaches (Nam et al., 2018; Li et al., 2020) that require textual descriptions of images during training and are limited to very simple scenes (or instructions). (See Section 2 for a survey.)
Our solution builds on the Neuro-Symbolic Concept Learner (NSCL) proposed by Mao et al. (2019) for solving VQA. We extend this work to incorporate the notion of manipulation operations such as change, add, and remove objects in a given image. As one of our main contributions, we design novel neural modules and a training strategy that uses only VQA annotations as weakly supervised data for the task of image manipulation. The neural modules are trained with the help of novel loss functions that measure the faithfulness of the manipulated scene and object representations by accessing a separate set of query networks, interchangeably referred to as quantization networks, trained using only VQA data. The manipulation takes place through interpretable programs created using primitive neural and symbolic operations from a Domain Specific Language (DSL). Separately, a network is trained to render the image from a scene graph representation using a combination of $L_1$ and adversarial losses, as done by Johnson et al. (2018). The entire pipeline is trained without any intermediate supervision. We refer to our system as Neuro-Symbolic Image Manipulator (NEUROSIM). Figure 1 shows an example of an I/O pair for our approach. The contributions of our work are as follows:
1. We create NEUROSIM, the first neuro-symbolic, weakly supervised, and interpretable model for the task of text-guided image manipulation that does not require output images for training.
2. We extend CLEVR (Johnson et al., 2017b), a benchmark dataset for VQA, to incorporate manipulation instructions and create a new dataset called Complex Image Manipulation via Natural Language Instructions (CIM-NLI). We also create the CIM-NLI-LARGE dataset to test zero-shot generalization.
3. We provide extensive quantitative experiments on the newly created CIM-NLI and CIM-NLI-LARGE datasets, along with qualitative experiments on Minecraft (Yi et al., 2018). Despite being weakly supervised, NEUROSIM is highly competitive with supervised SOTA approaches, including a recently proposed diffusion-based model (Brooks et al., 2023). NEUROSIM also performs well on instructions requiring multi-hop reasoning, all while being interpretable. We publicly release our code and data (footnote 1).

Related Work
Table 1 categorizes the related work across three broad dimensions: problem setting, task complexity, and approach. The problem setting comprises two sub-dimensions: i) supervision type (self, direct, or weak), and ii) instruction format (text or UI-based). The task complexity comprises the following sub-dimensions: i) scene complexity (single or multiple objects), ii) instruction complexity (zero or multi-hop instructions), and iii) kinds of manipulations allowed (add, remove, or change). Finally, the approach consists of the following sub-dimensions: i) model (neural or neuro-symbolic), and ii) whether a symbolic program is generated on the way or not.
Dong et al. (2017), TAGAN (Nam et al., 2018), and ManiGAN (Li et al., 2020) are close to us in terms of the problem setting. These manipulate the source image using a GAN-based encoder-decoder architecture. Their weak supervision differs from ours: we need VQA annotations, whereas they need captions or textual descriptions. The complexity of their natural language instructions is restricted to 0-hop. Most of their experimentation is limited to single (salient) object scenes.
In terms of task complexity, the closest to us are approaches such as TIM-GAN (Zhang et al., 2021) and GeNeVA (El-Nouby et al., 2019), which build an encoder-decoder architecture and work with a latent representation of the image as well as the manipulation instruction. They require a large number of manipulated images as explicit annotations for training.
In terms of technique, the closest to our work are neuro-symbolic approaches for VQA such as NSVQA (Yi et al., 2018), NSCL (Mao et al., 2019), Neural Module Networks (Andreas et al., 2016) and its extensions (Hu et al., 2017; Johnson et al., 2017a). While the modeling approach is similar and consists of constructing latent programs, the desired tasks are different in the two cases. Our work extends the NSCL approach for the task of automated image manipulation. Jiang et al. (2021) and Shi et al. (2021) deal with editing global features, such as brightness, contrast, etc., instead of object-level manipulations like in our case. Recent models such as InstructPix2Pix (Brooks et al., 2023), DALL-E (Ramesh et al., 2022) and Imagen (Saharia et al., 2022) on text-to-image generation using diffusion models are capable of editing images but require captions for input images; preliminary studies (Marcus et al., 2022) highlight their shortcomings in compositional reasoning and handling relations.

1 https://github.com/dair-iitd/NeuroSIM

Motivation and Architecture Overview
The key motivation behind our approach comes from the following hypothesis: consider a learner
Figure 3 captures a high-level architecture of the proposed NEUROSIM pipeline. NEUROSIM allows manipulating images containing multiple objects, via complex natural language instructions. Similar to Mao et al. (2019), NEUROSIM assumes the availability of a domain-specific language (DSL) for parsing the instruction text T into an executable program P. NEUROSIM is capable of handling addition, removal, and change operations over image objects. It reasons over the image to locate where the manipulation needs to take place, and then carries out the manipulation operation. The first three modules, namely i) visual representation network, ii) semantic parser, and iii) concept quantization network, are suitably customized from NSCL and trained as required for our purpose. In what follows, we describe the design and training mechanism of NEUROSIM.

Modules Inherited from NSCL
1] Visual Representation Network: Given input image I, this network converts it into a scene graph $G_I = (N, E)$. The nodes N of this scene graph are object embeddings and the edges E are embeddings capturing the relationship between pairs of objects (nodes). Node embeddings are obtained by passing the bounding box of each object (along with the full image) through a ResNet-34 (He et al., 2016). Edge embeddings are obtained by concatenating the corresponding object embeddings.
2] Semantic Parsing Module: The input to this module is a manipulation instruction text T in natural language. The output is a symbolic program P generated by parsing the input text. The symbolic programs are made of operators, that are part of
is the embedding for the object o pertaining to the color attribute. Each symbolic concept $s \in S_a$ for a particular attribute $a$ (e.g., different colors) is also assigned a respective embedding in the same continuous space $\mathbb{R}^{d_{attr}}$. Such an embedding is denoted by $c_s$. These concept embeddings are initialized at random and later fine-tuned during training. An attribute embedding (e.g., $v_{color}$) can be compared with the embeddings of all the concepts (e.g., $c_{red}$, $c_{blue}$, etc.) using cosine similarity, for the purpose of concept quantization of objects.
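The scene-graph structure produced by the Visual Representation Network above can be illustrated with a minimal sketch. It assumes that node embeddings are already available (the vectors below are made-up placeholders standing in for ResNet-34 features) and that one edge is kept per ordered object pair; `build_scene_graph` is a hypothetical helper, not a function from the NEUROSIM code release.

```python
from itertools import permutations


def build_scene_graph(node_embeddings):
    """Return (nodes, edges) for a list of object embeddings.

    edges[(i, j)] is the concatenation of the embeddings of objects i and j.
    Keeping one edge per ordered pair is an assumption of this sketch.
    """
    nodes = [list(v) for v in node_embeddings]
    edges = {
        (i, j): nodes[i] + nodes[j]  # list concatenation
        for i, j in permutations(range(len(nodes)), 2)
    }
    return nodes, edges


# Three objects with placeholder 2-dimensional embeddings.
nodes, edges = build_scene_graph([[0.1, 0.9], [0.7, 0.3], [0.5, 0.5]])
print(len(nodes), len(edges))
print(edges[(0, 1)])
```

With three objects this yields six directed edges, each twice as long as a node embedding.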
Training for VQA: As a first step, we train the above three modules via a curriculum learning process (Mao et al., 2019). The semantic parser is trained jointly with the concept quantization networks for generating programs for the question texts coming from the VQA dataset. The corresponding output programs are composed of primitive operations coming from the DSL (e.g., filter, count, etc.) and do not include constructs related to manipulation operations. This trains the first three modules to high accuracy on the VQA task.

Novel Modules and Training NEUROSIM
NEUROSIM training starts with three sub-modules trained on the VQA task as described in Section 3.
, where $s^*_a$ is the desired changed value for the attribute $a$. This network is trained using the following losses (equation 1), where $h_a(x)$ gives the concept value of the attribute $a$ (in symbolic form $s \in S_a$) for the object $x$. The quantity $p(h_a(x) = s)$ denotes the probability that the concept value of the attribute $a$ for the object $x$ is equal to $s$, and is given by

$$p(h_a(x) = s) = \frac{\exp\left(\mathrm{dist}(f_a(x), c_s)\right)}{\sum_{\tilde{s} \in S_a} \exp\left(\mathrm{dist}(f_a(x), c_{\tilde{s}})\right)}$$

where $\mathrm{dist}(a, b) = (a^\top b - t_2)/t_1$ is the shifted and scaled cosine similarity, $t_1, t_2$ being constants. The first loss term $\ell_a$ penalizes the model if the (symbolic) value of the attribute $a$ for the manipulated object is different from the desired value $s^*_a$ in terms of probabilities. The second term $\ell_{\bar{a}}$, on the other hand, penalizes the model if the values of any of the other attributes $a'$ deviate from their original values. Apart from these losses, we also include the following additional losses.
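As a minimal sketch of the concept probability $p(h_a(x) = s)$ defined above, the snippet below applies a softmax over the shifted and scaled similarities. The values of `t1` and `t2`, the toy embeddings, and the helper names are placeholders chosen for illustration; the explicit normalization reflects the reading of $a^\top b$ as a cosine similarity and is an assumption of this sketch.

```python
import math


def normalize(v):
    """Scale a vector to unit length so that a dot product is a cosine."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dist(a, b, t1, t2):
    """Shifted and scaled similarity (a^T b - t2) / t1."""
    return (sum(x * y for x, y in zip(a, b)) - t2) / t1


def concept_probabilities(attr_embedding, concept_embeddings, t1=0.2, t2=0.15):
    """Softmax over dist(f_a(x), c_s) for every concept s of one attribute.

    t1 and t2 are placeholder constants, not values taken from the paper.
    """
    v = normalize(attr_embedding)
    logits = {
        s: dist(v, normalize(c), t1, t2) for s, c in concept_embeddings.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {s: math.exp(l - top) for s, l in logits.items()}
    total = sum(exps.values())
    return {s: e / total for s, e in exps.items()}


# Toy color attribute with two concepts.
probs = concept_probabilities([0.9, 0.1], {"red": [1.0, 0.0], "blue": [0.0, 1.0]})
quantized = max(probs, key=probs.get)  # symbolic value with highest probability
print(quantized)
```

Taking the most probable concept, as in the last lines, corresponds to the concept quantization step used to read off symbolic attribute values.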
The add network is trained in a self-supervised manner. For this, we pick a training image and create its scene graph. Next, we randomly select an object $o$ from this image and quantize its concepts, along with a relation with any other object $o_i$ in the same image. We then use our remove network to remove this object $o$ from the scene. Finally, we use the quantized concepts and the relation gathered above to add this object $o$ back into the scene graph using $g_{addObj}(\cdot)$ and $g_{addEdge}(\cdot)$. Let the embedding of the object after adding it back be $o_{new}$. The training losses are as follows:
where $s_{a_j}$ is the required (symbolic) value of the attribute $a_j$ for the original object $o$, and $r$ is the required relational concept. $O$ is the set of objects in the image, and $e_{old,i}$ is the edge embedding for the edge between the original object $o$ and its neighboring object $o_i$. Similarly, $e_{new,i}$ is the embedding of the same edge after we have removed and added back the original object. The loss terms $\ell_{concepts}$ and $\ell_{relation}$ ensure that the added object has the desired values of attributes and relation, respectively. Since we first removed and then added the object back, we already have the original edge and object representations, and hence we use them in the loss terms given in equation 9. We use the adversarial loss in equation 10 for generating real (object, edge, object) triples and also a loss similar to equation 5 for generating real objects.

Image Rendering from Scene Graph
5] Rendering Network: Following Johnson et al. (2018), the scene graph for an image is first generated using the visual representation network. It is then processed by a GCN and passed through a mask regression network followed by a box regression network to generate a coarse 2-dimensional structure (scene layout). A Cascaded Refinement Network (Chen and Koltun, 2017) is then employed to generate an image from the scene layout. A min-max adversarial training procedure is used to generate realistic images, using a patch-based and an object-based discriminator.

Experiments
Datasets: Among the existing datasets, CSS (Vo et al., 2019) contains simple 0-hop instructions and is primarily designed for the text-guided image retrieval task. Other datasets such as i-CLEVR (El-Nouby et al., 2019) and CoDraw are designed for iterative image editing. i-CLEVR contains only "add" instructions and CoDraw doesn't contain multi-hop instructions. Hence we created our own multi-object, multi-hop instruction-based image manipulation dataset, referred to as CIM-NLI. This dataset was generated with the help of the CLEVR toolkit (Johnson et al., 2017b). CIM-NLI consists of (source image I, instruction text T, target image I*) triplets. The dataset contains a total of 18K, 5K, 5K unique images and 54K, 14K, 14K instructions in the train, validation and test splits respectively. Refer to Appendix B for more details about the dataset generation and dataset splits.
Baselines: We compare our model with purely supervised approaches such as TIM-GAN (Zhang et al., 2021), GeNeVA (El-Nouby et al., 2019) and InstructPix2Pix (Brooks et al., 2023). In order to make a fair and meaningful comparison between the two kinds of approaches (supervised, and our weakly supervised one), we carve out the following set-up. Let the cost required to create one annotated example for the image manipulation task be α_m, and let the corresponding cost for the VQA task be α_v. Let α = α_m/α_v. Let β_m be the number of annotated examples required by a supervised baseline to reach a performance level of η_m on the image manipulation task. Similarly, let β_v be the number of annotated VQA examples required to train NEUROSIM to reach the performance level of η_v. Let β = β_m/β_v. We are interested in figuring out the range of β for which the performance of our system (η_v) is at least as good as the baseline (η_m). Correspondingly, we can compute the ratio of the labeling effort required, i.e., α · β, to reach these performance levels. If α · β > 1, our system achieves the same or better performance with lower annotation cost.
Weakly supervised models (Li et al., 2020; Nam et al., 2018) are designed for a problem setting different from ours: single salient object scenes and simple 0-hop instructions (refer to Section 2 for details). Further, they require paired images and their textual descriptions as annotations. We, therefore, do not compare with them in our experiments. See Appendices G and H for computational resources and hyperparameters respectively.
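The labeling-effort comparison above reduces to a short calculation. The sketch below uses the sample counts reported in the experiments (5.4K manipulation examples for the baseline in the 10% setting, 100K VQA examples for NEUROSIM), while the per-example cost ratio `alpha` is a purely hypothetical value chosen for illustration.

```python
def labeling_effort_ratio(alpha, beta_m, beta_v):
    """Return alpha * beta, where beta = beta_m / beta_v.

    alpha  : cost of one manipulation annotation relative to one VQA annotation
    beta_m : number of annotated manipulation examples used by the baseline
    beta_v : number of annotated VQA examples used by the weakly supervised model
    """
    beta = beta_m / beta_v
    return alpha * beta


beta = 5_400 / 100_000      # 0.054, the low-data setting
break_even_alpha = 1 / beta  # alpha above which alpha * beta exceeds 1

# Hypothetical assumption: a target image costs 20x as much as a VQA answer.
ratio = labeling_effort_ratio(alpha=20.0, beta_m=5_400, beta_v=100_000)
print(round(beta, 3), round(break_even_alpha, 2), round(ratio, 2))
```

Under these numbers, the weakly supervised route is cheaper whenever annotating one manipulated image costs more than roughly 18.5 times as much as annotating one VQA example.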
Evaluation Metrics: For evaluation on the image manipulation task, we use three metrics: i) FID, ii) Recall@k, and iii) relational similarity (rsim). FID (Heusel et al., 2017) measures the realism of the generated images. We use the implementation proposed in Parmar et al. (2022) to compute FID. Recall@k measures the semantic similarity of the gold manipulated image I* and the system-produced manipulated image I. For computing Recall@k, we follow Zhang et al. (2021), i.e., we use I as a query and retrieve images from a corpus comprising the entire test set. rsim measures how many of the ground-truth relations between the objects are present in the generated image. We follow El-Nouby et al. (2019) to implement the rsim metric, which uses predictions from a trained object detector (Faster-RCNN) to perform relation matching between the scene graphs of the ground-truth and generated images.
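A minimal sketch of the Recall@k computation described above follows. It assumes every image has already been mapped to a feature vector by some encoder and that retrieval ranks the corpus by squared Euclidean distance; both choices, as well as the toy vectors, are assumptions of this sketch rather than details taken from Zhang et al. (2021).

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recall_at_k(query_feats, corpus_feats, gold_indices, k):
    """Fraction of queries whose gold image is among the k nearest corpus items.

    query_feats  : features of the generated (manipulated) images
    corpus_feats : features of all gold images in the test set
    gold_indices : gold_indices[q] is the corpus index of the gold image for q
    """
    hits = 0
    for q, gold in zip(query_feats, gold_indices):
        ranked = sorted(
            range(len(corpus_feats)),
            key=lambda j: squared_distance(q, corpus_feats[j]),
        )
        hits += gold in ranked[:k]
    return hits / len(query_feats)


corpus = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]       # gold images
queries = [[0.1, 0.0], [0.6, 0.1], [0.4, 0.45]]     # generated images
gold = [0, 1, 2]

r1 = recall_at_k(queries, corpus, gold, k=1)
r3 = recall_at_k(queries, corpus, gold, k=3)
print(r1, r3)
```

In this toy example the third query lands closer to the wrong gold image, so it counts as a miss at k = 1 but as a hit once k is large enough.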

Performance with varying Dataset Size
Table 2 compares the performance of NEUROSIM with other SoTA methods at two levels of β, 0.054 and 0.54, representing the use of 10% and 100% of the samples from CIM-NLI. Despite being weakly supervised, NEUROSIM performs significantly better than the baselines with just 10k data samples (especially TIM-GAN) and is not too far from the diffusion-model-based IP2P in the full data setting, using the R@1 performance metric. This clearly demonstrates the strength of our approach in learning to manipulate while only making use of VQA annotations. We hypothesize that, in most cases, NEUROSIM will be preferable since we expect the cost of annotating an output image for manipulation to be significantly higher than the cost of annotating a VQA example. To reach the performance of NEUROSIM in a low data regime, TIM-GAN requires a larger number of expensive annotated examples (ref. Table 13 in the Appendix). The FID metric shows a similar trend across dataset sizes and across models. The FID scores for NEUROSIM could potentially be improved by jointly training the VQA module along with the image decoder, which is a future direction.
Table 2: Performance comparison of NEUROSIM with TIM-GAN, GeNeVA, and InstructPix2Pix (IP2P) with 10% data (β = 0.054) and full data (β = 0.54). We always use 100K VQA examples (5K images, 20 questions per image) for our weakly supervised training. R1, R3 correspond to Recall@1,3 respectively. FID: lower is better; Recall/rsim: higher is better. See Section 4.1 for more details.
We evaluate InstructPix2Pix (IP2P) (Brooks et al., 2023), a state-of-the-art pre-trained diffusion model for image editing, in a zero-shot manner on the CIM-NLI dataset. Considering its extensive pre-training, we expect IP2P to have learned the concepts present in the CIM-NLI dataset. In this setting, IP2P achieves an FID score of 33.07 and an R@1 score of 7.48, illustrating the limitations of large-scale models in effectively executing complex instruction-based editing tasks without full dataset fine-tuning.

Performance versus Reasoning Hops
Table 3 (right) compares baselines with NEUROSIM for performance over instructions requiring zero-hop (ZH) versus multi-hop (1-3 hops) (MH) reasoning. Since there are no Add instructions with ZH, we exclude them from this experiment for the comparison to be meaningful. GeNeVA performs abysmally on both ZH and MH. We see a significant drop in the performance of both TIM-GAN and IP2P when going from ZH to MH instructions, both when training on 5.4K and on 54K datapoints. In contrast, NEUROSIM trained on 10% data sees a performance drop of only 1.5 points, showing its robustness for complex reasoning tasks.

Zero-shot Generalization to Larger Scenes
We developed another dataset called CIM-NLI-LARGE, consisting of scenes having 10-13 objects (see Appendix B for details). We study the combinatorial generalization ability of NEUROSIM and the baselines when the models are trained on CIM-NLI containing scenes with 3-8 objects only and evaluated on CIM-NLI-LARGE.
Table 3 captures such a comparison. NEUROSIM does significantly better than TIM-GAN, by 33 points (R1), and is competitive with IP2P when the baselines are trained on 10% (5.4K data points) of CIM-NLI. We do see a drop in performance relative to the baselines when they are trained on the full (54K) data, but this is expected as the effect of supervision takes over, and ours is a weakly supervised model. Nevertheless, this experiment demonstrates the effectiveness of our model for zero-shot generalization, despite being weakly supervised.

Qualitative Analysis and Interpretability
Figure 4 shows anecdotal examples for visually comparing NEUROSIM with baselines. Note that GeNeVA either performs the wrong operation on the image (rows #1, 2, 3) or simply copies the input image to the output without any modifications. TIM-GAN often makes semantic errors which show its lack of reasoning (row #3) or makes partial edits (row #1). IP2P suffers from the same problem and edits the incorrect object (rows #1, 2). Compared to the baselines, NEUROSIM produces semantically more meaningful image manipulation. NEUROSIM can also easily recover occluded objects (row #4). For more results, see Appendices I and J. NEUROSIM produces interpretable output programs, showing the steps taken by the model to edit the images, which also helps in detecting errors (ref. Appendix L).

Evaluating Manipulated Scene Graph
We believe that the image rendering module of the NEUROSIM pipeline and the encoder modules used for computing Recall@k introduce some inefficiencies, resulting in lower R1 and R3 scores for us. Therefore, we also assess the quality of the manipulated scene graph $G_I$.

For this, we consider the text-guided image retrieval task proposed by Vo et al. (2019). In this task, an image has to be retrieved from the database that is the closest match to the desired manipulated image. Therefore, we use our manipulated scene graph $G_I$ as the latent representation of the input instruction and image for image retrieval. We retrieve images from the database based on a novel graph edit distance between the NEUROSIM-generated $G_I$ of the desired manipulated images and the scene graphs of the images in the database. This distance is defined using the Hungarian algorithm (Kuhn, 1955) with a simple cost defined between any 2 nodes of the graph (ref. Appendix D for details). Table 4 captures the performance of NEUROSIM and other popular baselines for the image retrieval task. NEUROSIM significantly outperforms supervised learning baselines by a margin of ~50% without using output image supervision, demonstrating that NEUROSIM meaningfully edits the scene graph. Refer to Section 4.7 for human evaluation results and to Appendix Sections D-E and K for more results, including results on the Minecraft dataset and ablations.
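The following sketch illustrates how a Hungarian-matching distance between two scene graphs can be computed. It is not the distance of Appendix D: nodes are represented here by symbolic attribute dictionaries, the node cost simply counts differing attributes, and unmatched nodes are charged a fixed `unmatched_cost`, all of which are assumptions made for illustration. The optimal matching is obtained with `scipy.optimize.linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def node_cost(a, b):
    """Hypothetical node cost: number of attributes on which two nodes differ."""
    keys = set(a) | set(b)
    return sum(a.get(k) != b.get(k) for k in keys)


def graph_distance(nodes_a, nodes_b, unmatched_cost=4.0):
    """Minimum total matching cost between two node sets.

    The cost matrix is padded to a square so that surplus nodes are matched
    to a dummy node at `unmatched_cost` (an assumption of this sketch).
    """
    n = max(len(nodes_a), len(nodes_b))
    cost = np.full((n, n), unmatched_cost)
    for i, a in enumerate(nodes_a):
        for j, b in enumerate(nodes_b):
            cost[i, j] = node_cost(a, b)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return float(cost[rows, cols].sum())


predicted = [
    {"color": "red", "shape": "cube"},
    {"color": "blue", "shape": "sphere"},
]
candidate = [
    {"color": "blue", "shape": "sphere"},
    {"color": "red", "shape": "cylinder"},
]
d = graph_distance(predicted, candidate)
print(d)
```

For retrieval, the database image whose scene graph has the smallest distance to the manipulated scene graph would be returned first.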

A Hybrid Approach using NEUROSIM
From Table 3, we observe that both TIM-GAN and IP2P suffer a significant drop in performance when moving from ZH to MH instructions, whereas NEUROSIM is fairly robust to this change.Further, we note that the manipulation instructions in our dataset are multi-hop in terms of reasoning, but once an object of interest is identified, the actual manipulation operation can be seen as single hop.We use this observation to design a hybrid supervised baseline that utilizes the superior reasoning capability of NEUROSIM and high quality editing and generation capabilities of IP2P.
We take the CIM-NLI test set and parse the text instructions through our trained semantic parser to obtain the object embeddings over which the manipulation operation is to be performed. We utilize our trained query networks to obtain the symbolic attributes such as color, shape, size and material of the identified object. Using these attributes, we simplify a complex multi-hop instruction into a simple instruction with 0 or 1 hops using a simple template-based approach (see Appendix Section N for details). These simplified instructions are fed to the fine-tuned IP2P model to generate the edited images. We refer to our hybrid approach as IP2P-NS, where NS refers to Neuro-Symbolic. Table 5 presents the results. We find that there is a clear advantage in using a hybrid neuro-symbolic model integrating NEUROSIM with IP2P. We see a significant gain on FID, recall, and rsim when we use the hybrid approach, especially in the low resource setting (β = 0.054). Compared to IP2P, the hybrid neuro-symbolic approach results in better FID, recall and rsim scores, except for a small drop in R1 in the β = 0.54 setting. This opens up the possibility of further exploring such hybrid models in the future for improved performance (in the supervised setting).
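The template-based simplification step can be sketched as follows. The templates, the attribute ordering, and the function names are hypothetical and are not the ones listed in Appendix N; the sketch only assumes that the target object's symbolic attributes have already been read off by the query networks.

```python
ATTRIBUTE_ORDER = ("size", "color", "material", "shape")


def describe(attrs):
    """Render symbolic attributes as a noun phrase, e.g. 'large red metal cube'."""
    return " ".join(attrs[k] for k in ATTRIBUTE_ORDER if k in attrs)


def simplify_instruction(operation, target, change=None, relation=None, reference=None):
    """Build a 0-hop (remove/change) or 1-hop (add) instruction from a template.

    All templates below are illustrative assumptions.
    """
    if operation == "remove":
        return f"Remove the {describe(target)}."
    if operation == "change":
        attribute, value = change
        return f"Change the {attribute} of the {describe(target)} to {value}."
    if operation == "add":
        return f"Add a {describe(target)} {relation} the {describe(reference)}."
    raise ValueError(f"unknown operation: {operation}")


# Attributes of the object located by multi-hop reasoning (toy values).
located = {"size": "large", "color": "red", "material": "metal", "shape": "cube"}
print(simplify_instruction("remove", located))
print(simplify_instruction("change", located, change=("color", "blue")))
```

The multi-hop reasoning is thus resolved symbolically, and the image editor only sees an instruction that names the target object directly.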

Human Evaluation
For the human evaluation study, we presented 10 evaluators with a set of five images each: the input image, the ground-truth image, and the manipulated images generated by NEUROSIM 5.4K, TIM-GAN 54K, and IP2P 54K. Images generated by the candidate models were randomly shuffled to prevent any bias. Evaluators were asked two binary questions, each requiring a 'yes' (1) or 'no' (0) response, to assess the models: (Q1) Does the model perform the desired change mentioned in the input instruction? (Q2) Does the model not introduce any undesired change elsewhere in the image? Refer to Appendix Section M for more details about the exact questions and the human evaluation process.
The average scores from the evaluators across different questions can be found in Table 6. The study achieved a high average Fleiss' kappa score (Fleiss et al., 2013) of 0.646, indicating strong inter-evaluator agreement. Notably, NEUROSIM (5.4K) outperforms TIM-GAN and IP2P (54K) in Q1, suggesting its superior ability to reason, identify the relevant object, and effect the desired change. In contrast, TIM-GAN and IP2P score significantly better in Q2, demonstrating their ability not to introduce unwanted changes elsewhere in the image, possibly due to better generation quality compared to NEUROSIM.
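For reference, Fleiss' kappa for a binary question can be computed as in the sketch below. The rating counts are invented for illustration and are not the study's data; each row records how many of the 10 evaluators answered 'yes' and 'no' for one image.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters putting item i in category j.

    Assumes every item was rated by the same number of raters and that the
    ratings are not all in a single category (otherwise the denominator is 0).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    total = n_items * n_raters

    # Overall proportion of ratings falling in each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Per-item agreement: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items       # observed agreement
    p_exp = sum(p * p for p in p_cat)   # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)


# Invented example: 5 images, 10 evaluators, columns are ('yes', 'no').
ratings = [[10, 0], [9, 1], [2, 8], [0, 10], [8, 2]]
kappa = fleiss_kappa(ratings)
print(round(kappa, 3))
```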

Conclusion
We present a neuro-symbolic, interpretable approach, NEUROSIM, to solve the image manipulation task using weak supervision in the form of VQA annotations. Our approach can handle multi-object scenes with complex instructions requiring multi-hop reasoning, and solve the task without any output image supervision. We also curate a dataset of image manipulation and demonstrate the potential of our approach compared to supervised baselines. Future work includes understanding the nature of errors made by NEUROSIM, having a human in the loop to provide feedback to the system for correction, and experimenting with real image datasets.

Ethics Statement
All the datasets used in this paper were synthetically generated and do not contain any personally identifiable information or offensive content. The ideas and techniques proposed in this paper are useful in designing interpretable natural language-guided tools for image editing, computer-aided design, and video games. One of the possible adverse impacts of AI-based image manipulation is the creation of deepfakes (Vaccari and Chadwick, 2020) (using deep learning to create fake images). To counter deepfakes, several researchers (Dolhansky et al., 2020; Mirsky and Lee, 2021) have also looked into the problem of detecting real vs. fake images.

Limitations
A limitation of our approach is that transferring to a new domain with different visual concepts requires not only learning the new visual concepts but also redefining the DSL. Automatic learning of a DSL from data has been explored in some prior works (Ellis et al., 2021, 2018), and improving our model using these techniques is future work for us. We can also use more powerful graph decoders for image generation, for improved image quality, which would naturally result in stronger results on image manipulation.

Appendix
A Domain Specific Language (DSL)
Mao et al. (2019). The last 3 operations (Change, Add, and Remove) were added by us to allow for the manipulation operations. Table 8 shows the type system used by the DSL in this work. The first 5 types are inherited from Mao et al. (2019), while the last one is an extension of the type system for handling the inputs to the Add operator.

B Dataset Details
We use CLEVR dataset and CLEVR toolkit (code to generate the dataset).These are public and are under CC and BSD licenses respectively, and are used by many works, including ours, for research purposes.We now give details of the datasets we create, building upon CLEVR.

B.1 CIM-NLI Dataset
This dataset was generated with the help of the CLEVR toolkit (Johnson et al., 2017b) using the following recipe.
1. First, we create a source image I and the corresponding scene data by using Blender (Community, 2018) software.
2. For each source image I created above, we generate multiple instruction texts T using its scene data. These are generated using templates, similar to the question templates proposed by Johnson et al. (2017b).
3. For each such (I, T) pair, we attach a corresponding symbolic program P (not used by NEUROSIM though) as well as scene data for the corresponding changed image.
4. Finally, for each (I, T) pair, we generate the target gold image I* using Blender software and its scene data from the previous step.
Below are some of the important characteristics of the CIM-NLI dataset.
• Each source image I comprises several objects and each object comprises four visual attributes: color, shape, size, and material.
• Each instruction text T comprises one of the following three kinds of manipulation operations: add, remove, and change.
• An add instruction specifies the color, shape, size, and material of the object that needs to be added. It also specifies a direct (or indirect) relation with one or more existing objects (called reference object(s)). The number of relations that need to be traversed to nail down the target object is referred to as the number of reasoning hops, and we have allowed instructions with up to 3-hop reasoning. We do not generate any 0-hop instruction for add due to the ambiguity of where to place the object inside the scene.
• A change instruction first specifies zero or more attributes to uniquely identify the object that needs to be changed. It may also specify a direct (or indirect) relation with one or more existing reference objects. Lastly, it specifies the target values of an attribute for the identified object which needs to be changed.
• A remove instruction specifies zero or more attributes of the object(s) to be removed. Additionally, it may specify a direct (or indirect) relation with one or more existing reference objects.
Table 9 captures the fine-grained statistics about the CIM-NLI dataset. Specifically, it further splits each of the train, validation, and test sets across the instruction types: add, remove, and change.

B.2 CIM-NLI-LARGE Dataset
We created another dataset called CIM-NLI-LARGE to test the generalization ability of NEUROSIM on images containing more objects than the training images. CIM-NLI-LARGE tests the zero-shot transfer ability of both NEUROSIM and the baselines on scenes containing more objects.
Each image in the CIM-NLI-LARGE dataset comprises 10-13 objects, as opposed to 3-8 objects in the CIM-NLI dataset, which was used to train NEUROSIM. The CIM-NLI-LARGE dataset consists of 1K unique input images. We have created 3 instructions for each image, resulting in a total of 3K instructions. The number of add instructions is significantly smaller since there is very little free space available in the scene to add new objects. To create scenes with 12 and 13 objects, we made all objects small, and the minimum distance between objects was reduced so that all objects could fit in the scene. Table 10 captures the statistics about this dataset.

Operation | Signature [Output ← Input] | Description
Scene | ObjSet ← () | Returns all objects in the scene.
Filter | ObjSet ← (ObjSet, ObjConcept) | Filter out a set of objects from ObjSet that have a concept (e.g., red) specified in ObjConcept.
Relate | ObjSet ← (ObjSet, RelConcept, Obj) | Filter out a set of objects from ObjSet that have the specified relation concept (e.g., RightOf) with object Obj.
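To make the signatures of the Scene, Filter and Relate operators concrete, here is a purely symbolic sketch of how they compose into a 1-hop program. In NEUROSIM these operators act on learned embeddings rather than on exact symbol sets; the set-based versions, the toy scene, and the function names below are simplifications assumed for illustration.

```python
def scene(objects):
    """Scene: ObjSet <- (). Returns (indices of) all objects in the scene."""
    return set(range(len(objects)))


def filter_objects(objects, obj_set, concept):
    """Filter: ObjSet <- (ObjSet, ObjConcept). Keeps objects having the concept."""
    return {i for i in obj_set if concept in objects[i]}


def relate(obj_set, relations, rel_concept, obj):
    """Relate: ObjSet <- (ObjSet, RelConcept, Obj).

    Keeps objects i for which (i, rel_concept, obj) holds.
    """
    return {i for i in obj_set if (i, rel_concept, obj) in relations}


# Toy symbolic scene: each object is a set of concepts.
objects = [
    {"tiny", "green", "rubber", "sphere"},
    {"large", "red", "metal", "cube"},
    {"small", "blue", "rubber", "cylinder"},
]
relations = {(1, "RightOf", 0), (2, "RightOf", 0), (0, "LeftOf", 1), (0, "LeftOf", 2)}

# "There is a block right of the tiny green rubber ball, remove it." (1-hop)
ball = scene(objects)
for concept in ("tiny", "green", "rubber", "sphere"):
    ball = filter_objects(objects, ball, concept)
(ball_idx,) = ball  # the reference object must be unique
target = filter_objects(
    objects, relate(scene(objects), relations, "RightOf", ball_idx), "cube"
)
print(target)
```

The resulting object set is what a manipulation operator such as Remove would then act upon.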

B.3 Multi-hop Instructions
In what follows, we have given examples of the instructions that require multi-hop reasoning to nail down the location/object to be manipulated in the image.
• Remove the tiny green rubber ball. (0-hop)
• There is a block right of the tiny green rubber ball, remove it. (1-hop)
• Remove the shiny cube left of the block in front of the gray thing. (2-hop)
• Remove the small thing that is left of the brown matte object behind the tiny cylinder that is behind the big yellow metal block. (3-hop)

We begin by extending the type system of Mao et al. (2019) with ConceptSet, because our add operation takes as input a set of concepts depicting the attribute values of the new object being added (refer to Table 8 for details). Next, in a manner similar to Mao et al. (2019), we use a rule-based system for extracting concept words from the input text. We, however, add an extra rule for extracting a ConceptSet from the input sentence. The rest of the semantic parsing methodology remains the same as in Mao et al. (2019), with the difference that our training is weakly supervised (refer to Section 3.3 of the main paper).

C.1.2 Training
As explained in Section 3.3 of the main paper, for training with a weaker form of supervision, we use an off-policy program search based REINFORCE (Williams, 1992) algorithm for calculating the exact gradient. For this, we define a set of all possible program templates P_t. For a given input instruc-

Type: Remarks
ObjConcept: Concepts for any given object, such as blue, cylinder, etc.
Attribute: Attributes for any given object, such as color, shape, etc.
RelConcept: Relational concepts for any given object pair, such as RightOf, LeftOf, etc.
ObjectSet: Depicts multiple objects.
ConceptSet: A set of elements of ObjConcept type.

C.2 Manipulation Network
In what follows, we provide finer details of manipulation network components.
Change Network: The loss used to train the weights of the change network is a weighted sum of the losses in equation 1 to equation 5 of the main paper. This leads to the overall loss function given below.
The object discriminator is a neural network with input dimension 256 and a single 300-dimensional hidden layer with ReLU activation. This discriminator is trained using the standard GAN objective ℓ_objGAN.

Add Network: The neural operation in the add operator comprises predicting the object representation for the newly added object using a function g_addObj(•). This function is modeled as a single-layer neural network without any activation. The input to this network is a concatenated vector whose first part is the concatenation of all the concept vectors of the desired new object, followed by o_rel and c_r. The vector o_rel is the representation of the object with respect to which the relation (i.e., position) of the new object has been specified, and c_r is the concept vector for that relationship. The input dimension of g_addObj(•) is (k * 64 + 256 + 64) and the output dimension is 256. For predicting the representations of newly added edges in the scene graph, we use an edge predictor g_addEdge(•). The input to this edge predictor is the concatenated representation of the objects linked by the edge. The input dimension of g_addEdge(•) is (256 + 256) and the output dimension is 256.
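The dimensions quoted for the add network can be checked with a short shape-level sketch. It uses randomly initialised, bias-free linear maps in pure Python as stand-ins for g_addObj and g_addEdge; the helper names, the absence of a bias term, and the initialisation are assumptions made for illustration and are not taken from the paper.

```python
import random

CONCEPT_DIM, OBJ_DIM, EDGE_DIM = 64, 256, 256


def make_linear(in_dim, out_dim, seed):
    """Single linear layer without activation (bias omitted as a simplification)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(out_dim)]

    def forward(x):
        if len(x) != in_dim:
            raise ValueError(f"expected input of size {in_dim}, got {len(x)}")
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return forward


def add_object(nodes, concept_vecs, o_rel, c_r, g_add_obj, g_add_edge):
    """Predict the new node, then one edge between it and every existing node."""
    # Input layout: k concept vectors, then o_rel, then c_r.
    x = [v for c in concept_vecs for v in c] + list(o_rel) + list(c_r)
    o_new = g_add_obj(x)
    new_edges = [g_add_edge(list(o_new) + list(o_i)) for o_i in nodes]
    return o_new, new_edges


k = 4  # number of concept vectors describing the new object (illustrative value)
g_add_obj = make_linear(k * CONCEPT_DIM + OBJ_DIM + CONCEPT_DIM, OBJ_DIM, seed=0)
g_add_edge = make_linear(OBJ_DIM + OBJ_DIM, EDGE_DIM, seed=1)

nodes = [[0.1] * OBJ_DIM for _ in range(3)]
o_new, new_edges = add_object(
    nodes, [[0.5] * CONCEPT_DIM] * k, nodes[0], [0.2] * CONCEPT_DIM,
    g_add_obj, g_add_edge,
)
print(len(o_new), len(new_edges), len(new_edges[0]))
```

With k concept vectors of size 64, an object embedding of size 256, and a relation concept vector of size 64, the concatenated input has exactly k * 64 + 256 + 64 entries, which is what the size check in `forward` enforces.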
The loss used to train the add network weights is a weighted sum of the losses in equation 6 to equation 10, along with an object discriminator loss. The overall loss is given by the following expression.
The object discriminator is a neural network with input dimension 256 and a single 300-dimensional hidden layer with ReLU activation. This discriminator is trained using the standard GAN objective ℓ_objGAN. Note that ℓ_objGAN has 2 parts: i) the loss for the generated (fake) object embedding produced by the add network, and ii) the loss for the real objects (all the unchanged object embeddings of the image). The former is unscaled, but the latter is scaled by a factor of 1/(num_objects).
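The two-part structure of the object discriminator loss can be sketched as follows. The text only says that a standard GAN objective is used, so the binary cross-entropy form below is an assumption, as is the reading that num_objects equals the number of real (unchanged) object embeddings; the function name is hypothetical.

```python
import math


def obj_gan_discriminator_loss(real_scores, fake_score, eps=1e-12):
    """Discriminator loss with an unscaled fake term and a 1/num_objects-scaled real term.

    real_scores: discriminator outputs in (0, 1) for the unchanged object embeddings.
    fake_score:  discriminator output in (0, 1) for the generated object embedding.
    """
    num_objects = len(real_scores)
    # Part i): single generated embedding, not rescaled.
    fake_term = -math.log(1.0 - fake_score + eps)
    # Part ii): sum over real embeddings, scaled by 1 / num_objects.
    real_term = -sum(math.log(s + eps) for s in real_scores) / num_objects
    return fake_term + real_term


loss = obj_gan_discriminator_loss(real_scores=[0.5, 0.5], fake_score=0.5)
print(round(loss, 4))
```

The scaling keeps the real term from dominating the single fake term as the number of objects in the scene grows.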
The edge discriminator is a neural network with input dimension (256 * 3) and a single 300-dimensional hidden layer with ReLU activation. As input to this discriminator network, we pass the concatenation of the two objects and the edge connecting them. This discriminator is trained using the standard GAN objective ℓ_edgeGAN.

D.2 Image Retrieval Task
A task closely related to image manipulation is Text Guided Image Retrieval, proposed by Vo et al. (2019). Through this experiment, our goal is to demonstrate that NEUROSIM is highly effective in solving this task as well. In what follows, we provide details about the task, the baselines, the evaluation metric, and how we adapted NEUROSIM for it, and finally report performance results in Table 12. This table is a detailed version of Table 4 in the main paper.
Task Definition: Given an Image I, a text instruction T , and a database of images D, the task is to retrieve an image from the database that is semantically as close to the ground truth manipulated image as possible.
Note that for each such (I, T) pair, some image from the database, say I ∈ D, is assumed to be the image that should ideally be retrieved at rank 1. This so-called desired gold retrieval image might be the ideal manipulated version of the original image I, in the sense of satisfying the instruction T perfectly. Alternatively, it may not be such an ideal manipulated image, but still be the image in the whole corpus D that comes closest to the ideal manipulated image.
In practice, while measuring the performance of any such system for this task, the gold manipulated image for (I, T ) pair is typically inserted into the database D and such an image then serves as the desired gold retrieval image I.
Baselines: Our baselines include popular supervised learning systems designed for this task. The first baseline is TIRG, proposed by Vo et al. (2019), which combines image and text to get a joint embedding and is trained in a supervised manner using the embedding of the desired retrieved image as supervision. For completeness, we also include a comparison with the other baselines introduced by Vo et al. (2019): Concat, Image-Only, and Text-Only.
A recent model proposed by Chen et al. (2020) uses symbolic scene graphs (instead of embeddings) to retrieve images from the database. Motivated by this, we also retrieve images via the scene graph that is generated by the manipulation module of NEUROSIM. However, unlike Chen et al. (2020), the nodes and edges in our scene graph have associated vectors, and we make novel use of them while retrieving. We do not compare our performance with Chen et al. (2020) since its code is unavailable and we have not been able to reproduce their numbers on the datasets used in their paper. Moreover, Chen et al. (2020) use full supervision of the desired output image (which is converted to a symbolic scene graph), while we do not.
Evaluation Metric: We use Recall@k (and report results for k = 1, 3) for evaluating the performance of text guided image retrieval algorithms which is standard in the literature.
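As a reminder of how this metric is computed, here is a minimal sketch of Recall@k over a set of queries. The data layout (one ranked list of database image identifiers per query, plus one gold identifier per query) and the function name are assumptions made for illustration.

```python
def recall_at_k(ranked_lists, gold_ids, k):
    """Fraction of queries whose gold image appears among the top-k retrieved images."""
    if len(ranked_lists) != len(gold_ids):
        raise ValueError("need exactly one gold id per query")
    if not gold_ids:
        return 0.0
    hits = sum(1 for ranked, gold in zip(ranked_lists, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


# Three toy queries whose gold image is "a", ranked first, second, and third.
ranked_lists = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold_ids = ["a", "a", "a"]
print(recall_at_k(ranked_lists, gold_ids, 1), recall_at_k(ranked_lists, gold_ids, 3))
```

Recall@3 is never smaller than Recall@1 on the same rankings, since the top-3 set contains the top-1 result.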
Retrieval using Scene Graphs: We use the scene graph generated by NEUROSIM as the latent representation to retrieve images from the database. We introduce a novel yet simple method to retrieve images using the scene graph representation. For converting an image into the scene graph, we use the vi-

β is the ratio of the number of annotated (with output image supervision) image manipulation examples required by the supervised baselines to the number of annotated VQA examples required to train NEUROSIM. In Table 13, we show a detailed split of the performance for the add, change, and remove operators, across the same values of β as taken before. We find that for the change operator, NEUROSIM performs better than TIM-GAN by a margin of ∼8% (considering Recall@1) for β ≤ 0.1. For the remove operator, NEUROSIM performs better than TIM-GAN by a margin of ∼4% (considering Recall@1) for β ≤ 0.2. Overall, NEUROSIM performs similarly to TIM-GAN at β = 0.2 for the remove and change operators. All models perform poorly on the add operator compared to the change and remove operators. We find that having full output image supervision allows TIM-GAN to reconstruct (copy) the unchanged objects from the input to the output for all the operators. This results in a higher recall in general, but the effect is most pronounced in Recall@3. NEUROSIM, on the other hand, suffers from rendering errors, which lower its overall recall score (especially Recall@3). We believe that improving image rendering quality would significantly improve the performance of NEUROSIM, and we leave this as future work.

(Yi et al., 2018). Specifically, we create zero-hop and one-hop remove instructions and one-hop add instructions, similar to the creation of CIM-NLI. This dataset contains scenes and objects from the Minecraft video game and is used in prior works for testing Neuro-Symbolic VQA systems like NSCL (Mao et al., 2019) and NS-VQA (Yi et al., 2018). The setting of the Minecraft worlds dataset is significantly different from CLEVR in terms of the concepts and attributes of objects and their visual appearance.

Experiment: We use the above dataset for testing the addition and removal of objects using NEUROSIM (see Fig 6). We train NEUROSIM's decoder to generate images from scene graphs of the Minecraft dataset. We assume access to a parser that gives us the program for an instruction. For removal, we use the same remove network as described above, while for addition, we assume access to the features of the object to be added, which is added to the scene graph of the image before the decoder decodes the final image. See Figure 6 for a set of successful examples on the Minecraft dataset. We see that using our method, one can add and remove objects from the scene successfully, without using any output image as supervision during training. Though we have assumed the availability of a parser in the above setup, training it jointly with the other modules should be straightforward, and can be achieved using our general approach described in Section 3 of the main paper.

E End-to-end Training
The main objective of this work is to make use of weakly supervised VQA data for the image manipulation task without using output image supervision. But a natural extension of our work is to use output image supervision as well, to improve the performance of NEUROSIM. We devised an experiment to measure how much performance boost can be obtained by utilizing ground truth output (manipulated) images as supervision for different modules of NEUROSIM. This experiment demonstrates the value of end-to-end training for NEUROSIM and how it can exploit supervised data. We refer to this variant as NEUROSIM(e2e). We begin with a pre-trained NEUROSIM model trained with VQA annotations and then fine-tune it using supervised manipulation data. The detailed results are given in Table 15. This experiment demonstrates that with a small amount of supervised data, the performance of NEUROSIM can be significantly improved (e.g., an increase of more than 9 points for the change instruction with only 5.4K supervision examples). Given the significant increase in performance of NEUROSIM when using supervised data, we also test its generalization capability (analogous to Sections 4.2 and 4.3) and the quality of scene graph retrieval (analogous to Section 4.5).

(Brooks et al., 2023) with varying β levels, split across add, remove and change instructions. The '-' entries for GeNeVA and IP2P were not computed due to excessive training time (and inference time as well in the case of IP2P); GeNeVA's performance is abysmal even when using full data. TIM-GAN does the best among baselines in terms of its recall score at β = 0.54. We always use 100K VQA examples (5K images, 20 questions per image) for our weakly supervised training. R1 and R3 correspond to Recall@1 and Recall@3, respectively. For Recall, a higher score is better.
From Table 16, we see that NEUROSIM(e2e) shows improved zero-shot generalization to larger scenes.Even when trained on just 5.4k CIM-NLI data, NEUROSIM(e2e) improves over TIM-GAN-54k by 3.9 R@1 points.A 5.3 point improvement over TIM-GAN is observed when full CIM-NLI data is used.
Next, we measure drop in performance with increasing reasoning hops.From Table 17, we see that NEUROSIM(e2e) achieves the lowest drop when compared to TIM-GAN.NEUROSIM(e2e) improves over weakly supervised NEUROSIM baseline by 6.6 R@1 points.
Finally, we measure the quality of scene graphs via retrieval. From

Table 14: Performance scores (Recall@1) comparing NEUROSIM with TIM-GAN, GeNeVA and IP2P with increase in reasoning hops, for add, remove, and change instructions. Along with each method, the number of data points from CIM-NLI used for training is written.

F LLMs as few-shot parser
We also tested the semantic parsing ability of Large Language Models (LLMs), specifically GPT-4, for our task. The task of semantic parsing is as follows: given a manipulation instruction in natural language, generate the symbolic program by parsing the input text. To provide GPT-4 with context, we designed an extensive prompt that begins with our DSL, followed by six different in-context examples representing various instruction types for few-shot learning. This prompt is then followed by the instruction text that we want to parse. We tested GPT-4 on a randomly sampled subset of our test dataset. For evaluation, we measured the accuracy of semantic parsing using an exact match between the generated symbolic program and the ground-truth symbolic program.
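The exact-match criterion can be sketched as below. Programs are represented as plain strings, and whitespace is collapsed before comparison; both choices, as well as the program syntax in the example, are hypothetical and only serve to illustrate the metric.

```python
def normalize(program):
    """Collapse runs of whitespace so that formatting differences are ignored."""
    return " ".join(program.split())


def exact_match_accuracy(predicted, gold):
    """Fraction of predicted programs identical to the ground-truth program."""
    if len(predicted) != len(gold):
        raise ValueError("need exactly one prediction per ground-truth program")
    if not gold:
        return 0.0
    matches = sum(normalize(p) == normalize(g) for p, g in zip(predicted, gold))
    return matches / len(gold)


# Hypothetical program strings; the DSL syntax shown here is illustrative only.
gold = [
    "remove(filter(scene(), green))",
    "remove(filter(scene(), red))",
    "change(filter(scene(), cube), color, blue)",
]
predicted = [
    "remove(filter(scene(),   green))",            # same program, extra spaces
    "remove(filter(scene(), blue))",               # wrong concept
    "change(filter(scene(), cube), color, blue)",  # identical
]
print(exact_match_accuracy(predicted, gold))
```

Exact match is a strict criterion: a program that selects the same object through a different but equivalent sequence of operations is still counted as wrong.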
The detailed results are given in Table 19. Interestingly, we observed that GPT-4 performed poorly on Add instructions, achieving less than 10% parsing accuracy. To address this, we prompted GPT-4 separately with additional few-shot examples for Add instructions, which led to the results displayed in the

G Computational Resources
We trained all our models and baselines on 1 Nvidia Volta V100 GPU with 32GB memory and 512GB system RAM except IP2P which was trained on 8-A100

H Hyperparameters and Validation Accuracies

H.1 Training for VQA Task
The hyperparameters for the VQA task are kept at the default values from prior work (Mao et al., 2019). We refer readers to Mao et al. (2019) for more details. We obtained a question answering accuracy of 99.3% after training on the VQA task.

H.2 Training Semantic Parser
The semantic parser is trained to parse instructions. This module is learned using the REINFORCE algorithm as described in Section C of this appendix. For REINFORCE, we search over positive rewards from the set {7, 8, 10} and negative rewards from the set {0, 2, 3}. We finally choose a positive reward of 8 and a negative reward of 2. To make this decision, we first train the semantic parser for 20 epochs and then calculate its accuracy by running it on the quantized scenes from the validation set. We say that a particular output program is correct if it leads to an object being selected (see Section C of the appendix for more information), and this is how the accuracy of the semantic parser is calculated. This accuracy is a proxy for the real accuracy. An alternative is to use annotated ground truth programs for calculating accuracy and then selecting hyperparameters. However, we do not use ground truth programs. All other hyperparameters are kept the same as those used by Mao et al. (2019) to train the parser on the VQA task. We obtain a validation accuracy of 95.64% after training the semantic parser for manipulation instructions.
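To make the role of the two reward values concrete, the sketch below computes the expected reward and its exact gradient when the candidate programs for one instruction can be enumerated. It is a simplification under stated assumptions: the policy is a softmax over one logit per candidate program (the actual parser is a neural sequence model), the negative reward of 2 is applied as −2, and all names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward_and_grad(logits, succeeded, pos_reward=8.0, neg_reward=-2.0):
    """Exact expected reward J and dJ/dlogits over an enumerated set of programs.

    succeeded[i] is True when executing program i selects an object.
    """
    probs = softmax(logits)
    rewards = [pos_reward if ok else neg_reward for ok in succeeded]
    j = sum(p * r for p, r in zip(probs, rewards))
    # For a softmax policy: dJ/dz_i = p_i * (R_i - J).
    grad = [p * (r - j) for p, r in zip(probs, rewards)]
    return j, grad


j, grad = expected_reward_and_grad(logits=[0.0, 0.0], succeeded=[True, False])
print(j, grad)
```

Because the sum runs over every candidate program rather than over sampled ones, the gradient has no sampling variance; gradient ascent on J shifts probability mass towards programs that select an object.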

H.3 Training Manipulation Networks
The architecture details of the manipulation network are given in Section C of this appendix. We use a batch size of 32 and a learning rate of 10^−3, and optimize using AdamW (Loshchilov and Hutter, 2019) with a weight decay of 10^−4. The rest of the hyperparameters are kept the same as in Mao et al. (2019). During training, every 5th epoch, we calculate the manipulation accuracy using the query networks that were trained while training NEUROSIM on VQA data. This serves as a proxy for the validation accuracy.
• For the change network training, we use the query accuracy of whether the attribute that was supposed to change for a particular object has changed correctly, and also whether any other attribute has changed.
• For the add network training, we use the query accuracy of whether the attributes of the added object are correct, and also whether the added object is in the correct relation with the reference object.
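The two query-based checks in the list above can be sketched as simple predicates over queried attribute values. The dictionaries stand in for the outputs of the query networks; the function names and the representation are assumptions made for illustration.

```python
def change_is_correct(before, after, attribute, target_value):
    """True if `attribute` now equals `target_value` and no other attribute changed."""
    if after.get(attribute) != target_value:
        return False
    return all(after.get(a) == v for a, v in before.items() if a != attribute)


def add_is_correct(added, expected_attrs, relation_holds):
    """True if the added object has the requested attributes and relation."""
    return relation_holds and all(added.get(a) == v for a, v in expected_attrs.items())


before = {"color": "red", "shape": "cube", "size": "large", "material": "rubber"}
good = dict(before, size="small")
bad = dict(before, size="small", color="blue")  # an unrelated attribute also changed

print(change_is_correct(before, good, "size", "small"))
print(change_is_correct(before, bad, "size", "small"))
```

Averaging these boolean outcomes over a validation set gives a proxy accuracy of the kind described above, without needing ground truth output images.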
We obtained a validation accuracy (based on querying) of 95.9% for the add network and an accuracy of 99.1% for the change network.

H.4 Image Decoder Training
The architecture of the image decoder is similar to that of Johnson et al. (2018), but our input scene graph (having embeddings for nodes and edges) is directly processed by the graph neural network. We use a batch size of 16 and a learning rate of 10^−5, and optimize using the Adam (Kingma and Ba, 2015) optimizer. The rest of the hyperparameters are the same as in Johnson et al. (2018). We train the image decoder for a fixed 1000K iterations.

I Qualitative Analysis
Figures 7, 8, and 9 compare the images generated by NEUROSIM, TIM-GAN, and GeNeVA on add, change, and remove instructions, respectively. NEUROSIM's advantage lies in the semantic correctness of the manipulated images. For example, see Figure 7 rows #3 and #4, Figure 8 row #2, and all images in Figure 9. In these images, NEUROSIM was able to achieve semantically correct changes, while TIM-GAN and GeNeVA faced problems such as blurry or smudged objects when adding them to the scene, removing incorrect objects from the scene, or not changing (or only partially changing) the object to be changed. Images generated by TIM-GAN are better in quality than those generated by NEUROSIM. We believe the reason is that TIM-GAN, being fully supervised, only changes a small portion of the image and has learned to copy a significant portion of the input image directly to the output. However, this does not ensure the semantic correctness of TIM-GAN's manipulation, as described above with examples where it makes errors. The images generated by NEUROSIM look slightly worse since the entire image is generated from object-based embeddings in the scene graph. Improving neural image rendering from scene graphs is a promising step towards improving NEUROSIM.

J Errors
Figure 10 captures the images generated by our model where it has made errors. The kinds of errors that NEUROSIM makes can be broadly classified into three categories.
• [Rendering Errors] This set includes images generated by our model which are semantically correct but suffer from rendering errors. The common rendering errors include malformed cubes, partial cubes, change in position of objects, and different lighting.
• [Logical Errors] This set includes images generated by our model which have logical errors. That is, the manipulation instruction has been interpreted incorrectly and a different manipulation has been performed. This happens mainly due to an incorrect parse of the input instruction into the program, or the manipulation network not being trained to perfection, for example, the change network changing attributes which were supposed to remain unchanged.
• [VQA Errors] The query networks are not ideal and have errors after they are trained on the VQA task. This in turn causes errors in the supervision (obtained from query networks) while training the manipulation networks and leads to a less than optimally trained manipulation network. Also, during inference, object embeddings may not be perfect due to imperfections in the visual representation network, and that leads to incorrect rendering.

Figure panel columns: Input Image, Instruction, NEUROSIM, TIM-GAN, GeNeVA, Ground Truth. Add instructions shown:
• There is a shiny thing that is on the right side of the shiny block, add a big gray metallic ball in front of it.
• There is a rubber thing behind the matte thing in front of the tiny rubber object, add a tiny blue shiny sphere behind it.
• Add a small gray rubber cylinder that is in front of the big cube.
• Add a large gray metallic cylinder that is in front of the small rubber object behind the tiny green matte cylinder.
• There is a purple shiny object in front of the purple metal ball, add a large red matte ball to the left of it.

Figure panel columns: Input Image, Instruction, NEUROSIM, TIM-GAN, GeNeVA, Ground Truth. Change instructions shown:
• There is a rubber thing in front of the red matte ball; change the shape of it to cylinder.
• Change material of the rubber object in front of the small rubber thing that is left of the tiny gray matte sphere that is in front of the yellow block to shiny.
• There is a small matte thing; change the color of it to purple.
• There is a cylinder that is behind the small metallic cylinder; change the size of it to tiny.
• There is a tiny cylinder that is to the left of the small blue thing to the left of the big green metallic cylinder; change the material of it to matte.

K Ablations
Table 20 shows the performance of NEUROSIM when certain loss terms are removed while training the networks. This depicts the importance of the loss terms that we have considered. In particular we test the performance of the network by removing

Table 20: Ablations conducted by removing some loss terms. ℓ is the total loss before any ablation. For each loss term being removed, the superscript denotes which network it belongs to (add or change). Ablations are conducted for the setting where β = 0.054 (see main paper Section 4 for the definition of β).

failure cases of NEUROSIM also means that it can be selectively trained to improve certain parts of the network (e.g., individually training on change instructions to improve the change command, if the model is performing poorly on change instructions). We now assess the correctness of intermediate programs using randomly selected qualitative examples present in Figure 11. Since no wrong program was obtained in the randomly selected set, we find 2 more data points manually to show some wrong examples.

M Human Evaluation Details
See Table 21 for the questions (paraphrased) asked to the evaluators. Detailed instructions and an example of the questions provided to the evaluators can be found in Figure 12. A total of 10 evaluators, a mix of undergraduate and post-graduate students, were involved in the study. The same set of 30 random images was given to each evaluator. They were compensated at a rate three times the average hourly salary in the country of origin. Each evaluator was given up to 24 hours to complete the task.

N Simplifying Multi-Hop Instructions using NeuroSIM Modules
In this section, we provide details on our method of utilizing the trained semantic parser to convert a complex multi-hop (MH) instruction into a simplified 0-hop or 1-hop instruction. We generate three simplified

For example, if the MH instruction is "Change the size of the big thing that is behind the metallic cylinder behind the purple object that is to the right of the big brown shiny object to tiny", we find the placeholder attributes to be operation=change, attribute=size, color=yellow, shape=cube, size=large, material=rubber, attribute'=tiny. Hence the simplified instruction becomes "Change the size of the large yellow rubber cube to tiny". Add and Remove instructions follow similarly.
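The template-filling step of this section can be sketched with ordinary string formatting. The slot names (`new_value`, `ref_shape`) and the `simplify` helper are hypothetical; the change template below includes "the" before the object description so that it reproduces the worked example, and in the actual pipeline the slot values come from the trained query networks rather than being passed in by hand.

```python
TEMPLATES = {
    "change": "Change the {attribute} of the {size} {color} {material} {shape} to {new_value}",
    "remove": "Remove the {size} {color} {material} {shape}",
    "add": "Add a {size} {color} {material} {shape} to the {relation} of {ref_shape}",
}


def simplify(operation, **slots):
    """Fill the template for `operation` with predicted symbolic values."""
    return TEMPLATES[operation].format(**slots)


simplified = simplify(
    "change",
    attribute="size",
    size="large",
    color="yellow",
    material="rubber",
    shape="cube",
    new_value="tiny",
)
print(simplified)
```

The simplified instruction no longer mentions any reference objects, since the multi-hop reasoning has already been resolved by the parser before the slots are filled.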

Figure panel columns: Input Image, Instruction, Generated Program, Correctness. Instructions shown:
• Change the shape of the big gray thing to cube.
• Remove the gray rubber thing in front of the gray matte sphere behind the large gray matte sphere.
• Remove the brown metal object that is left of the blue matte block that is left of the brown thing on the right side of the large cyan metal cube.
• There is a matte block that is in front of the big gray rubber object; change the material of it to shiny.
• Add a tiny purple metal ball that is in front of the blue object that is behind the matte ball.
• Remove the cylinder that is to the right of the matte cylinder that is in front of the small red matte thing.
• Add a large purple shiny sphere behind the shiny cube behind the tiny gray object.
• Change the material of the small metallic block in front of the brown metal block to rubber.
• Change the shape of the large object in front of the tiny yellow sphere to ball.

Question 1:
[Change] Are all the attributes (color, shape, size, material, and relative position) of the changed object mentioned in the instructions identical between the ground truth image and the system-generated image?
[Add] Are all the attributes (color, shape, size, material, and relative position) of the added object mentioned in the instructions identical between the ground truth image and the system-generated image?
[Remove] Are the same objects removed in the ground truth image and the system-generated image?
Question 2:
[Change] Are all the attributes (color, shape, size, material, and relative position) of the remaining objects identical between the ground truth image and the system-generated image?
[Add] Are all the attributes (color, shape, size, material, and relative position) of the remaining objects identical between the ground truth image and the system-generated image?
[Remove] Are all the attributes (color, shape, size, material, and relative position) of the remaining objects identical between the ground truth image and the system-generated image?
Table 21: Questions asked to human evaluators for evaluating NEUROSIM and TIM-GAN. Note that there are some variations in the questions for Change, Add, and Remove instructions due to the different semantic nature of the instructions.
Figure 1: The problem setup. See Section 1 for more details.

Figure 2: Motivating example NEUROSIM. Best viewed under magnification. See Section 3.1 for more details.

Figure 4: Visual comparison of NEUROSIM with various baselines. See Section 4.4 for more details.
Method     FID    R1    R3    rsim   FID    R1    R3    rsim
IP2P       3.4    40.6  77.0  88.8   2.2    49.2  84.8  94.5
NEUROSIM   35.0   45.3  65.5  91.3   35.1   45.5  66.7  91.5
IP2P-NS    1.96   45.5  83.2  94.0   1.8    48.0  85.5  95.6

See Fig 5a for an overview of the change operator.

Remove Network: The remove network is a symbolic operation as described in Section 3.3 of the main paper. That is, given an input set of objects, the remove operation deletes the subgraph of the scene graph that contains the nodes corresponding to the removed objects and the edges incident on those nodes. See Fig 5c for an overview of the remove operator.
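The symbolic remove operation can be sketched on a dictionary-based scene graph. The representation (node ids mapped to embeddings, ordered node-id pairs mapped to edge embeddings) and the function name are assumptions made for illustration.

```python
def remove_objects(nodes, edges, to_remove):
    """Delete the selected nodes and every edge incident on them.

    nodes: dict mapping node id -> node embedding
    edges: dict mapping (source id, target id) -> edge embedding
    """
    to_remove = set(to_remove)
    kept_nodes = {n: emb for n, emb in nodes.items() if n not in to_remove}
    kept_edges = {
        (u, v): emb
        for (u, v), emb in edges.items()
        if u not in to_remove and v not in to_remove
    }
    return kept_nodes, kept_edges


# Fully connected toy graph over three objects; embeddings are placeholders.
nodes = {"a": [0.1], "b": [0.2], "c": [0.3]}
edges = {(u, v): [0.0] for u in nodes for v in nodes if u != v}

kept_nodes, kept_edges = remove_objects(nodes, edges, ["b"])
print(sorted(kept_nodes), sorted(kept_edges))
```

No learned parameters are involved, which is why this operator needs no training of its own.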
(a) Change operator overview. (b) Add operator overview. (c) Remove operator overview.
Figure 5: Overview of new operators (change, add and remove) added to the DSL.

See Fig 5b for an overview of the add operator.

D Additional Results

D.1 Detailed Performance for Zero-Shot Generalization on Larger Scenes

D.5 Results on Datasets from different domains

D.5.1 Minecraft Dataset
Dataset Creation: We create a new dataset having (Image, instruction) by building over the Minecraft dataset used in

Figure 6: Results for addition and removal of objects from images of the Minecraft dataset.

Figure 7: Visual comparison of NEUROSIM with TIM-GAN and GeNeVA for the add operator. The red bounding boxes in the ground truth output image indicate the objects required to be added to the input image.

Figure 8: Visual comparison of NEUROSIM with TIM-GAN and GeNeVA for the change operator. The red bounding boxes in the input and ground truth output image indicate the objects required to be changed.
templates, one for each edit operation:
1. Change the [attribute] of [size] [color] [material] [shape] to [attribute']
2. Remove the [size] [color] [material] [shape]
3. Add a [size] [color] [material] [shape] to the [relation] of [shape']
Next, given a multi-hop instruction, we parse it using our semantic parser, which gives us the embedding of the object on which an operation is to be executed (in the case of change and remove operations) or in relation to which a new object has to be inserted (in the case of the add operation). The trained query networks predict the symbolic values of the concepts in the placeholders.

Figure 11: Qualitative examples of generated programs by NEUROSIM.

$o_{new} = g_{addObj}(\{c_{s_{a_1}}, c_{s_{a_2}}, \cdots, c_{s_{a_k}}\}, o_{rel}, c_r)$, where $o_{rel}$ is the object embedding of an existing object, relative to which the new object's position $r$ is specified. For each existing object $o_i$ in the scene, an edge $e_{new,i}$ is predicted between the newly added object $o_{new}$ and the existing object $o_i$ in the following manner: $e_{new,i} = g_{addEdge}(o_{new}, o_i)$. The functions $g_{addObj}(\cdot)$ and $g_{addEdge}(\cdot)$ are quasi-symbolic operations. The symbolic operations in the add network comprise adding the above node and the incident edges into the scene graph.
Table 2 contains the results obtained by IP2P after fine-tuning for 16k iterations on CIM-NLI dataset.

Table 3: (Left) Performance on generalization to larger scenes. (Right) R1 results for 0-hop (ZH) vs multi-hop (MH) instruction-guided image manipulation. See Sections 4.2 and 4.3 for more details.

Table 4: G I Quality via image retrieval.

Table 6: Human evaluation comparing various models.

Table 7 captures the DSL used by our NEUROSIM pipeline. The first 5 constructs in this table are common with the DSL used in

Table 8: Extended type system for the DSL used by NEUROSIM.

Table 9: Statistics of CIM-NLI dataset introduced in this paper.
tion text T, we create a set of all possible programs {P_T} from P_t. For e.g. add, or remove; and reasoning() either selects objects for change or remove, or it selects a reference object for adding another object in relation to it. After a hyperparameter search for the reward (refer

Table 11 below is a detailed version of Table 3 in the main paper. This table compares the performance of NEUROSIM with the baseline methods TIM-GAN, GeNeVA and IP2P for zero-shot generalization to larger scenes (with ≥ 10 objects), while the models were trained on images with 3−8 objects. Relative to Table 3 of the main paper, this table offers separate performance numbers for each of the add, remove and change instructions.

Table 12: Performance scores (Recall@1) on the Image Retrieval task, comparing NEUROSIM with TIM-GAN and GeNeVA with increase in reasoning hops, for add, remove, and change instructions. Along with each method, the number of data points from CIM-NLI used for training is written.
Table 18, we see that super-

Table 16: Zero-shot generalization to larger scenes (Extension of Table 3 of main paper).

table. Even with the additional

Table 17: Performance with increasing reasoning hops (Extension of Table 3 of main paper).

Table 18: Quality of scene graph measured via retrieval (Extension of Table 4 of main paper).

80 GB GPUs. Our image decoder training takes about 4 days. Training on the VQA task takes 5−7 days, and training the manipulation networks takes 4−5 hours.