Mind the Context: The Impact of Contextualization in Neural Module Networks for Grounding Visual Referring Expressions

Neural module networks (NMN) are a popular approach for grounding visual referring expressions. Prior implementations of NMN use pre-defined and fixed textual inputs in their module instantiation. This necessitates a large number of modules, as they lack the ability to share weights and exploit associations between similar textual contexts (e.g. “dark cube on the left” vs. “black cube on the left”). In this work, we address these limitations and evaluate the impact of contextual clues in improving the performance of NMN models. First, we address the problem of fixed textual inputs by parameterizing the module arguments. This substantially reduces the number of modules in NMN by up to 75% without any loss in performance. Next, we propose a method to contextualize our parameterized model to enhance the module’s capacity for exploiting visiolinguistic associations. Our model outperforms the state-of-the-art NMN model on the CLEVR-Ref+ dataset with a +8.1% improvement in accuracy on the single-referent test set and +4.3% on the full test set. Additionally, we demonstrate that contextualization provides +11.2% and +1.7% improvements in accuracy over prior NMN models on CLOSURE and NLVR2. We further evaluate the impact of our contextualization by constructing a contrast set for CLEVR-Ref+, which we call CC-Ref+. We significantly outperform the baselines by as much as +10.4% absolute accuracy on CC-Ref+, illustrating the generalization skills of our approach.


Introduction
Figure 1: An example from the CLEVR-Ref+ dataset. In addition to passing textual inputs (arguments) cubical, large and metallic to neural modules, we also provide them with the relevant neighborhood of arguments as context (highlighted in blue).

Visual referring expression recognition is the task of identifying the object in an image that is referred to by a natural language expression (Kazemzadeh et al., 2014; Mao et al., 2016). It is a fundamental language-to-vision matching problem and has several downstream applications such as question answering, robot navigation, and image retrieval (Qi et al., 2020; Young et al., 2014; Tu et al., 2014; Qi et al., 2015; Akula and Zhu, 2019; Akula, 2015; Palakurthi et al., 2015). Recently, neural module networks (NMN; Andreas et al. 2016b; Hu et al. 2017b; Liu et al. 2019) have been gaining popularity as a promising approach for solving this task. Briefly, NMN models use an explicit modular reasoning process where a program generator first analyzes the input referring expression and predicts a sequence of learnable neural modules (e.g. count, filter, compare). Next, an execution engine dynamically assembles these modules to predict the target object in the image. Such a module-based hierarchical reasoning process gives NMNs high model interpretability and thereby helps improve overall trust in the model (Andreas et al., 2016b; Akula et al., 2020b).
Although achieving promising results, existing NMN models have primarily focused on designing module architectures with textual inputs directly hard-coded in the module instantiation (Johnson et al., 2017b; Liu et al., 2019). For example, processing the textual inputs 'red' and 'blue' requires the instantiation of two different modules, filter_color[red] and filter_color[blue]. However, such a design demands a large number of learnable modules (and network parameters), and the modules cannot share weights for similar contextual textual inputs (e.g. 'dark cube' vs. 'black cube', 'shiny cylinder' vs. 'metallic cylinder'). Lack of these contextual signals leads to poor generalization performance on unseen but known language contexts (Lake and Baroni, 2018; Bahdanau et al., 2019).
Moreover, in prior implementations of NMN such as IEP-Ref (Johnson et al., 2017b; Liu et al., 2019), the modules in the execution engine are not conditioned on the surrounding context of their textual input in the expression. This is problematic, as the modules are not given the opportunity to attend to the neighborhood of the textual input, which helps in extracting informative visiolinguistic context from the module's visual input. For example, the module filter_color[dark] needs to pick a black-colored cube or a red-colored cube depending on the neighborhood context in the expression (e.g. "the dark thing that is hardly visible" vs. "the dark thing among the red cubes") and the type of cubes available in its visual input. A few implementations of NMN, such as FiLM (Perez et al., 2018) and N2NMN (Hu et al., 2017a), parametrize the surrounding context of their textual input. However, the visiolinguistic context in these modules is rather shallow, as they cannot jointly co-attend over potential objects of interest directly from the visual and textual inputs.
In this work, we address the aforementioned issues and evaluate the impact of contextual signals in improving the performance of NMN models. First, we address the problem of hard-coded language inputs by parameterizing the module arguments (Figure 1); for example, we treat the "filter_size" module as parameterized by the textual input "large" instead of as a standalone function "filter_size[large]" ( §3). We show that module parametrization reduces the total number of learnable modules by 75% without affecting the performance of NMNs.
Second, we use the ground-truth annotations in CLEVR-Ref+ (Liu et al., 2019), a challenging synthetic referring expression dataset, to show that providing the relevant neighborhood context of the textual input to the neural module (see Figure 1) is beneficial for improving the model's grounding performance ( §4.1). We then propose a contextualization method that learns to select the most relevant neighborhood context by jointly co-attending on visual and textual inputs, eliminating the need for ground-truth contextual information ( §4.2).
Our experimental results show that our approach is effective in capturing visiolinguistic relations and contextual dependencies, especially when the textual inputs are long and have complex linguistic structures. We demonstrate that our proposed method significantly improves the performance of NMN ( §5.4) in grounding visual referring expressions. Specifically, on the CLEVR-Ref+ benchmark, we outperform competing NMN approaches such as IEP-Ref, FiLM and N2NMN by as much as +8.1% accuracy on the single-referent split (S-Ref) and +4.3% on the full-referent split (F-Ref). Additionally, we also test our approach on the CLOSURE (Bahdanau et al., 2019) and NLVR2 (Suhr et al., 2019) benchmarks. CLOSURE is a VQA benchmark consisting of CLEVR-like questions with an emphasis on simple and complex referring expressions. NLVR2 is a language grounding task where the goal is to determine whether an expression is true based on two paired real images. Our approach significantly outperforms existing NMN approaches with +11.2% and +1.7% improvements in accuracy on CLOSURE and NLVR2, respectively.
We further evaluate the impact of our contextualization by constructing a set of contrasting perturbations around CLEVR-Ref+ test instances (Gardner et al., 2020), and call our new dataset CC-Ref+ ( §5.6). We significantly outperform the state-of-the-art models by as much as +10.4% absolute accuracy on CC-Ref+.

Related Work
Referring Expression Recognition. Visual referring expression recognition (REF) is the task of identifying the object in an image that is referred to by a natural language expression (Mao et al., 2016; Kazemzadeh et al., 2014). Datasets containing real images and expressions, such as RefCOCO+ (Kazemzadeh et al., 2014) and RefCOCOg (Mao et al., 2016), have been proposed to evaluate progress on this task. Multi-modal transformers (Lu et al., 2019; Li et al., 2019; Tan and Bansal, 2019), using a pretrain-then-transfer approach, have shown superior performance on these datasets. However, these models fail to learn robust visiolinguistic contextual representations and have been shown to exploit the imbalanced distribution in the train and test splits (Akula et al., 2020a; Cirik et al., 2018). Recently, CLEVR-Ref+ (Liu et al., 2019) has been introduced as a synthetic diagnostic benchmark that allows control over dataset bias. There are nearly 0.8M referring expressions, of which 32% refer to only a single object (Single-referent) and 68% refer to more than one object (Multi-referent). In this paper, we refer to the full dataset as F-Ref and the single-referent subset as S-Ref. Module-network-based architectures (Liu et al., 2019; Johnson et al., 2017a; Andreas et al., 2016b) achieved new state-of-the-art performance on this dataset.
Neural Module Networks. Neural module networks (NMNs) learn to parse textual expressions as executable programs composed of learnable neural modules (Andreas et al., 2016b; Johnson et al., 2017a,b; Hu et al., 2017a). Each of these modules is specialized to compute a basic reasoning task, and the modules can be assembled to perform complex and compositional reasoning. Andreas et al. (2016b) used dependency trees (Zhu et al., 2013) to generate the execution layouts. Andreas et al. (2016a) proposed dynamic NMNs that learn and adapt the structure of the execution layouts to the question. Johnson et al. (2017b) proposed homogeneous (IEP) and generic neural modules, unlike fixed and hand-crafted neural modules, in which the semantics of each neural module are learnt during training. The IEP model achieves promising performance on the CLEVR dataset. Liu et al. (2019) proposed IEP-Ref by extending the IEP model to the CLEVR-Ref+ dataset and outperformed all prior works. Although compositional by design, these modules capture only shallow visiolinguistic context and fail to ground novel combinations of known linguistic constructs (Bahdanau et al., 2019). The major difference between our work and these prior NMN works is that we explicitly parametrize and contextualize the neural modules by jointly attending over the visual and textual inputs.

Module Parameterization in NMN
We propose parametrization as the first step towards enabling weight sharing and exploiting associations between similar textual contexts. Specifically, we evaluate the effectiveness of parameterizing module textual inputs using IEP-Ref (Liu et al., 2019) as the baseline NMN implementation. IEP-Ref, an NMN solution based on IEP (Johnson et al., 2017b), is the current state-of-the-art model on the CLEVR-Ref+ dataset. As shown in Figure 2(a), the neural modules in IEP-Ref are represented using a standard Residual Convolution Block (RCB). Formally, each RCB module f_n of arity n receives n feature maps F_i of shape 128 × 20 × 20 and outputs a same-sized tensor f_o = f_n(F_1, F_2, ..., F_n).
We parameterize each RCB module m as follows: (a) we feed all the words in the textual input e_m into an LSTM; (b) the last hidden state of the LSTM, h_t, is then used to perform an element-wise multiplication with the output of the first convolution layer in the RCB block, producing a joint representation c_m of the module's textual input (e_m) and visual input (v_m), which is then passed through a ReLU (see Appendix Figure 1b):

c_m = ReLU(h_t ⊙ Conv(v_m))    (1)
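The parameterized module described above can be sketched in PyTorch as follows. This is a minimal sketch: the 128 × 20 × 20 feature-map shape and the LSTM gating follow the paper's description, but the class name and layer details are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ParameterizedRCB(nn.Module):
    """Illustrative parameterized residual conv block (hypothetical name).
    The module's textual argument e_m is encoded by an LSTM; the last hidden
    state gates the first conv layer's output element-wise before the ReLU."""

    def __init__(self, channels=128, embed_dim=300):
        super().__init__()
        # hidden size matches the channel count so the gate broadcasts per channel
        self.lstm = nn.LSTM(embed_dim, channels, batch_first=True)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, v_m, e_m):
        # v_m: visual input (B, 128, 20, 20); e_m: word embeddings (B, T, 300)
        _, (h_t, _) = self.lstm(e_m)                # h_t: (1, B, 128)
        gate = h_t[-1].unsqueeze(-1).unsqueeze(-1)  # (B, 128, 1, 1)
        c_m = self.relu(self.conv1(v_m) * gate)     # joint text-visual representation
        return v_m + self.conv2(c_m)                # residual connection

v = torch.randn(2, 128, 20, 20)
words = torch.randn(2, 5, 300)   # e.g. GloVe embeddings of the textual input
out = ParameterizedRCB()(v, words)
print(out.shape)  # torch.Size([2, 128, 20, 20])
```

Because the textual argument is now an input rather than part of the module's identity, a single such block can serve "filter_color[red]" and "filter_color[blue]" alike, which is what enables the reduction in module count.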

Using Ground-Truth Annotations
We extend our parameterized model by contextualizing it with the neighborhood context of the textual input in the referring expression. Figure 1 shows an example. We leverage the ground-truth annotations available in CLEVR-Ref+ to provide neighborhood context for the modules as follows. Let us denote the ground-truth neural modules as m_1, m_2, m_3, ..., m_n for a given input referring expression q. Suppose the modules m_j and m_k are children of the parent module m_i in the ground-truth execution tree. We modify the architecture of each neural module such that we concatenate the ground-truth arguments of all the children modules m_j and m_k and pass them as the neighborhood context to the parent module m_i (see Appendix Figure 1c). We then test whether this contextualization helps.
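Schematically, the annotation-based contextualization amounts to the following. This is a toy sketch: the dictionary layout and the helper name are hypothetical, not the authors' data format.

```python
def with_neighborhood_context(module_args, children):
    """Build the textual input for a parent module by appending its
    children's ground-truth arguments as neighborhood context
    (illustrative helper; names and structure are assumptions)."""
    context = [arg for child in children for arg in child["args"]]
    return module_args + context

# Hypothetical execution-tree fragment: filter_color[dark] with two children.
parent = {"name": "filter_color", "args": ["dark"]}
children = [{"name": "filter_shape", "args": ["cube"]},
            {"name": "relate", "args": ["left"]}]
print(with_neighborhood_context(parent["args"], children))
# ['dark', 'cube', 'left']
```

The concatenated argument list is then what the parent module's LSTM encodes in place of the single fixed argument.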
As an ablation, we also test the model's performance when the entire expression q is provided as the neighborhood context for the modules instead of the relevant neighborhood. Table 3 shows the results. Using the entire expression as the neighborhood context did not show any improvements in model performance, perhaps due to the difficulty of searching for and extracting relevant context from long CLEVR-like expressions. On the other hand, providing the ground-truth neighborhood context shows significant improvements in performance (1.71% on F-test and 3.19% on S-Test), indicating that the model is able to extract informative visiolinguistic clues. Since ground-truth human annotations are costly and difficult to obtain, we next propose a contextualization method that enables the modules to learn to select the most relevant neighborhood context without requiring ground-truth annotations.

Using Memory-augmented Block
We incorporate a memory-augmented LSTM block (Graves et al., 2014) in the neural module to guide the attention towards the relevant and informative neighborhood words in the input expression (q). Figure 2(b) shows our contextualized module architecture. Our design enhances the module's capacity to exploit the visiolinguistic context between the visual input v m and the selective set of words that are stored in the memory over multiple timesteps.
The memory M consists of a set of row vectors that act as memory slots. The LSTM (i.e., the controller) has read and write heads into M, which retrieve representations from M or place them into M. In the first time step (t_0) we feed the visual input; in later time steps the textual input is fed. More formally, given an input referring expression q, at each time step t the LSTM produces a key k_{i,t}, which is used either to retrieve a particular location l from M_t or to store into M_t. We feed the referring expression q into the LSTM as:

h_t = LSTM(q_t, h_{t-1})    (2)

We then compute the cosine similarity between h_t and each individual row j of M:

s_t(j) = (h_t · M_t(j)) / (‖h_t‖ ‖M_t(j)‖)

A read weight vector w_t is computed using a softmax over the cosine similarities, and a memory row m_t is then retrieved. The vectors m_t and h_t are concatenated with the textual input (e_m), and an element-wise multiplication is performed with the output of the convolution layer before passing through the ReLU (see Appendix A).
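The content-based read step can be sketched as follows. Notation follows the paper (the memory has 128 rows and 80 columns, per the implementation details); the function itself is a minimal sketch, and the controller and write head are omitted.

```python
import torch
import torch.nn.functional as F

def memory_read(h_t, M):
    """Content-based read from memory M (illustrative sketch).
    h_t: controller hidden state (B, d); M: memory matrix (R, d) with R slots.
    Returns the retrieved row m_t and the read weights w_t."""
    # cosine similarity between h_t and each memory row
    sim = F.cosine_similarity(h_t.unsqueeze(1), M.unsqueeze(0), dim=-1)  # (B, R)
    # softmax over rows gives the read weight vector w_t
    w_t = F.softmax(sim, dim=-1)
    # weighted combination of rows acts as the retrieved memory row m_t
    m_t = w_t @ M                                                        # (B, d)
    return m_t, w_t

h = torch.randn(2, 80)     # controller state for a batch of 2
M = torch.randn(128, 80)   # 128 memory slots of width 80
m_t, w_t = memory_read(h, M)
print(m_t.shape, w_t.shape)  # torch.Size([2, 80]) torch.Size([2, 128])
```

The retrieved m_t is then concatenated with h_t and the module's textual input before the element-wise gating described above.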

Datasets
We evaluate our approach on F-Ref and S-Ref splits of CLEVR-Ref+ (Liu et al., 2019). In addition, we also test our approach on CLOSURE (Bahdanau et al., 2019) and NLVR2 (Suhr et al., 2019) benchmarks. CLOSURE is a VQA benchmark, consisting of synthetically generated image and question pairs with emphasis on grounding simple and complex referring expressions. NLVR2 is a language grounding task where the goal is to determine whether an expression is true based on two paired real images. While reporting results on CLOSURE, we train our NMN model using CLEVR (Johnson et al., 2017a) train and val splits.

Baselines
We compare the performance of our approach against the following baselines: (1) IEP-Ref (Liu et al., 2019) is the current state-of-the-art NMN model for the CLEVR-Ref+ benchmark, which uses an explicit program generator and execution engine (PG+EE) to predict the answer; (2) FiLM (Feature-wise Linear Modulation) (Perez et al., 2018) is an NMN model which introduces new layers in the RCB block that learn parameters γ_{i,c} and β_{i,c} for scaling the CNN activations F_{i,c} up or down by conditioning on the input referring expression; (3) MAC (Hudson and Manning, 2019) is an end-to-end differentiable architecture designed to perform an explicit multi-step reasoning process by decomposing it into a series of attention-based reasoning steps; (4) VectorNMN (Bahdanau et al., 2019) is a direct extension of FiLM that uses vector-valued inputs and outputs for the modules instead of high-capacity 3D tensors; (5) NS-VQA (Yi et al., 2018) uses a structural scene representation of the input image in addition to the PG+EE components of IEP-Ref; (6) N2NMN (Hu et al., 2017a) uses hand-crafted and parameterized neural modules; (7) LCGN (Hu et al., 2019) uses a graph network where each node represents an object and is described by a context-aware representation from related objects conditioned on the textual input.
To gain better insight into the relative contribution of our design choices, we experiment with the following ablated models: (8) P-Ref+LSTM+Attn uses attention instead of an external memory block for selecting the neighborhood words in the expression; (9) P-Ref+Curriculum Learning employs a curriculum training regime (Platanios et al., 2019) to train the P-Ref model in order to improve its performance without contextualization (see Appendix A.3).
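For reference, the feature-wise modulation used by the FiLM baseline can be sketched as follows. This is a minimal sketch: the per-channel γ and β are assumed to be predicted from the referring expression by a separate conditioning network, which is omitted here.

```python
import torch

def film(feat, gamma, beta):
    """Feature-wise linear modulation: per-channel scale gamma and shift beta,
    conditioned on the expression, modulate activations feat of shape (B, C, H, W)."""
    return gamma.unsqueeze(-1).unsqueeze(-1) * feat + beta.unsqueeze(-1).unsqueeze(-1)

# Toy example: 4 channels of ones, with hand-picked gamma/beta per channel.
feat = torch.ones(1, 4, 2, 2)
gamma = torch.tensor([[2., 0., 1., 1.]])
beta = torch.tensor([[0., 1., 0., -1.]])
out = film(feat, gamma, beta)
print(out[0, 0, 0, 0].item(), out[0, 1, 0, 0].item())  # 2.0 1.0
```

Contrasting this with our approach: FiLM conditions each layer on a global summary of the expression, whereas our memory block lets each module select which neighborhood words to attend to jointly with the visual input.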

Implementation Details
The memory matrix in our model, discussed in Section 4.2, consists of 128 rows and 80 columns. The controller is a single-layer LSTM network. We use GloVe to obtain the word embedding (dimension = 300) of each word in the textual input. When training, we first train our program generator (PG) and use it as a fixed module for training the execution engine (EE). We use 18K ground-truth programs to train the program generator (PG). We train PG and EE using Adam (Kingma and Ba, 2015) with learning rates 0.0005 and 0.0001, respectively. Note that PG is trained for a maximum of 32,000 iterations, while EE is trained for a maximum of 450,000 iterations. We employ early stopping based on validation set accuracy. We do not find any significant improvements with joint optimization of PG and EE. We train on one RTX 2080ti GPU with a batch size of 8.

Evaluation

Table 4 shows results in comparison with the baselines. We find that our contextual NMN model (P-Ref+LSTM+Mem) significantly outperforms all prior work by large margins. In addition to outperforming NMN baselines such as FiLM, N2NMN, and IEP-Ref, we also outperform non-NMN baselines such as LCGN, demonstrating the effectiveness of the introduced memory module in capturing visiolinguistic relations and contextual dependencies from the longer CLEVR-like expressions. Specifically, we achieve +4.3% on F-test and +8.1% on S-Test, compared with the current state-of-the-art NMN model IEP-Ref. Most significant gains on S-Test also suggest the superior generalization skills of our model in learning from fewer training samples.

[Figure: qualitative examples. For r1 "The gray object that is the second one of the thing(s) from right or that is same size as the first one of the big metallic sphere(s) from front" and r2 "Find the object that is behind the yellow metallic sphere and in front of a rubber cylinder", both IEP-Ref and P-Ref+LSTM+Mem predict filter_material(metallic), matching the ground truth.]
The ablation results are shown in Table 5. As we can see, all the ablative baselines underperform, confirming the importance of our proposed contextualization approach. Specifically the improvements obtained with module contextualization in both IEP-Ref and FiLM demonstrate that our approach can generalize across diverse NMN architectures.
Performance on the CLOSURE and NLVR2 benchmarks is shown in Table 6. We achieve a +11.2% improvement in accuracy on the CLOSURE test split compared to the best prior model Vector-NMN, indicating that our model generalizes well to unseen compositions. We also surpass all existing NMN-based models on the NLVR2 dataset, which has real images, unlike the synthetic images in CLEVR-Ref+ and CLOSURE.

Model Parameters
Our proposed model has 3 times fewer parameters in total than the baseline model IEP-Ref (see Table 7) and is smaller than FiLM.

The CC-Ref+ Dataset
We further examine the robustness of the models by creating contrast sets (similar to Gardner et al. 2020) that help expose model brittleness by probing a model's decision boundary local to examples in the test set. Specifically, we follow a three-stage approach to collect our contrast set: Stage 1: First, we randomly sample 100 single-referent expressions from the test split containing only a single spatial relation (e.g. The first one of the tiny rubber thing from left). We then sample another 100 expressions containing two spatial relations (e.g. The first one of the thing from left that is behind the big yellow matte object). Similarly, we sample a third subset of 200 expressions containing 3 or more relations. Finally, we sample 100 expressions containing at least one compare relation (e.g. Any other tiny object as the same color as the big yellow metallic cube). This constitutes a total of 500 expressions. Stage 2: We then manually perturb the semantics of various parts of these 500 referring expressions such that the ground-truth referent object changes. For example, we modify the expression first one of the tiny rubber thing from left to first one of the tiny metallic thing from right. We call this perturbed test split CC-Ref+. We show a random selection of CC-Ref+ examples in Table 8. Stage 3: Finally, we verify and validate the correctness of the new ground-truth annotations using two human annotators. Annotations that are not consistent between the two annotators are removed, and we re-iterate the above three steps until we collect a validated set of 500 contrast samples. In Figure 4, we summarize the size and complexity of our CC-Ref+ split.

Evaluation on CC-Ref+ Dataset
As shown in Figure 5, the performance of the baseline models drops by >10% on CC-Ref+, and the models struggle to correctly ground perturbed samples containing compare relations (e.g. same_color) or those containing more than 2 spatial relations (e.g. front, left) in the expression. Our method shows the least drop (<5%) in performance, indicating its superiority in grounding expressions with complex linguistic constructs (see Appendix B for more detailed analysis). In Figure 6 and Figure 7, we further analyze the models' performance when one of the object attributes, namely color, size, shape, material, ordinality, and visibility, is perturbed in the contrast sets. We found that both IEP-Ref and our model are robust to perturbations in color, indicating that this is a relatively easier concept to ground in the images. In contrast to the findings in (Liu et al., 2019), we see a significant drop of up to 15% in the performance of IEP-Ref on all the other attributes such as shape and visibility. Our proposed approach P-Ref+LSTM+Mem shows relatively low drops on the logical, material and ordinal perturbations, insignificant drops (<3%) on the color and visibility perturbations, and a slight improvement (+2%) on the shape perturbations. This clearly suggests that our approach generalizes well and is robust to contrastive perturbations in the input. The performance gap of P-Ref+LSTM+Mem on logical, ordinal and material perturbations shows that these are relatively difficult concepts for the model to learn. We hope that the CC-Ref+ dataset will foster more research in this area.

Conclusion
Neural module networks (NMNs) are widely used in language and vision tasks. We show that contextualizing these modules dramatically reduces the number of modules required and improves their grounding abilities, achieving new state-of-the-art results on the CLEVR-Ref+ visual referring expression task. Our analysis on CLEVR-Ref+, CLOSURE, NLVR2 and a new contrast set, CC-Ref+, demonstrates that our proposed method enhances NMNs' ability to exploit visiolinguistic relationships.

A Appendix
In this supplementary material, we begin by providing more details on CLEVR-Ref+ F-Ref / S-Ref splits and the neural modules in IEP-Ref to supplement Section 2 and Section 3 of the main paper, respectively. We then provide the details of our models (e.g., initialization & training, hyperparameters). Finally, we provide CC-Ref+ dataset annotation details, statistics, random examples, and more analysis to supplement Section 4 of the main paper.

A.1 F-Ref and S-Ref splits in CLEVR-Ref+
Visual referring expression recognition is the task of identifying the object in an image that is referred to by a natural language expression (Kazemzadeh et al., 2014; Mao et al., 2016). It is a fundamental language-to-vision matching problem and has several downstream applications such as question answering. CLEVR-Ref+ (Liu et al., 2019) is a recently proposed dataset for the visual referring expression recognition (RefExp) task, which consists of synthetic images and referring expressions. Specifically, it contains ground-truth functional program representations that describe the intermediate visual reasoning as a chain of logical operations (i.e., neural modules) that need to be executed to find the target referent object (e.g., filter color, compare, filter size, and relate). There are nearly 0.8M referring expressions, of which 32% refer to only a single object (Single-referent) and 68% refer to more than one object (Multi-referent). In this paper, we refer to the full dataset as F-Ref and the single-referent subset as S-Ref. Detailed statistics of the splits are presented in Table 9.

Figure 9: Overview of our curriculum learning baseline.
We use 18K ground-truth programs to train the program generator (PG). We train PG and EE using Adam (Kingma and Ba, 2015) with learning rates 0.0005 and 0.0001, respectively. Note that PG is trained for a maximum of 32,000 iterations, while EE is trained for a maximum of 450,000 iterations. We employ early stopping based on validation set accuracy. We do not find any significant improvements with the joint optimization of PG and EE. We train on one RTX 2080ti GPU with a batch size of 8.

Curriculum Learning Baseline: Prior literature shows that curriculum learning (CL) may greatly facilitate the learning of complex tasks in neural architectures (Platanios et al., 2019). Therefore, we employ a CL training regime as an additional baseline to train the P-Ref model in order to improve its performance without contextualization. An overview of the CL model is shown in Figure 9. To estimate the difficulty of the expressions, we define a scoring function inspired by what we, as humans, intuitively may consider difficult when grounding expressions:
• Longer expressions are difficult to ground.
• Expressions with a large number of spatial relationships such as "left", "front", "right", "behind" are more likely to have difficult linguistic structures.
• Expressions requiring a large number of neural modules are difficult to ground.
• Expressions involving comparison modules are difficult to ground.
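A minimal sketch of how such heuristics could drive a competency-based curriculum follows. The scoring weights, thresholds, and helper names are illustrative assumptions, not the paper's exact implementation.

```python
def difficulty(expr, n_relations, n_modules, has_compare):
    """Heuristic difficulty score on a 1-10 scale (weights are illustrative)."""
    score = 1
    score += min(3, len(expr.split()) // 8)  # longer expressions are harder
    score += min(3, n_relations)             # spatial relations add difficulty
    score += min(2, n_modules // 4)          # more neural modules => harder
    score += 1 if has_compare else 0         # comparison modules are hardest
    return min(10, score)

def curriculum_batches(samples, validation_saturated):
    """Train only on samples at or below the current competency; raise the
    competency when validation accuracy saturates; stop past level 10."""
    competency = 1
    while competency <= 10:
        batch = [s for s in samples if s["difficulty"] <= competency]
        yield competency, batch
        if validation_saturated():
            competency += 1

samples = [{"difficulty": d} for d in (1, 3, 7)]
levels = [(c, len(b)) for c, b in curriculum_batches(samples, lambda: True)]
print(levels[0], levels[-1])  # (1, 1) (10, 3)
```

As competency rises, progressively harder expressions enter the training pool until, at level 10, the full training set is used.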
Using the above heuristics, we evaluate the difficulty of all expressions in the training set on a scale of 1 to 10. During training, we initialize the model competency to 1. All training expressions with difficulty level less than or equal to the current model competency are used for training the model. We use a validation set of expressions for each of these difficulty levels. As the model's performance on the validation set starts to saturate, we increment the competency level of the model. We stop training once the model's competency exceeds 10. All of our CL experiments were conducted on one RTX 2080ti GPU with a batch size of 8; the remaining training details (GloVe embeddings, PG/EE optimization with Adam, early stopping) are the same as reported above.

We follow a three-stage approach to collect our contrast set: Stage 1: First, we randomly sample 100 single-referent expressions from the test split containing only a single spatial relation (e.g. The first one of the tiny rubber thing from left). We then sample another 100 expressions containing two spatial relations (e.g. The first one of the thing from left that is behind the big yellow matte object). Similarly, we sample a third subset of 200 expressions containing 3 or more relations. Finally, we sample 100 expressions containing at least one compare relation (e.g. Any other tiny object as the same color as the big yellow metallic cube).
This constitutes a total of 500 expressions. Stage 2: We then manually perturb the semantics of various parts of these 500 referring expressions so that the ground-truth referent object changes. For example, we modify the expression "first one of the tiny rubber thing from left" to "first one of the tiny metallic thing from right". A random selection of CC-Ref+ examples is shown in Table 13. Stage 3: Finally, we verify and validate the correctness of the new ground-truth annotations using two human annotators. Annotations on which the two annotators disagree are removed, and we re-iterate the above three steps until we collect a validated set of 500 contrast samples.
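The competency-based curriculum schedule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `score_difficulty`, `train_epoch`, and `validate` are hypothetical stand-ins, and the relation-counting heuristic is only a placeholder for the paper's actual difficulty heuristics.

```python
def score_difficulty(expression):
    """Assign a 1-10 difficulty score to an expression.
    Hypothetical placeholder heuristic: count spatial/comparison
    relation keywords, since expressions with more relations (and any
    comparison modules) are treated as harder to ground."""
    relations = ("left", "right", "front", "behind", "same")
    return min(10, 1 + sum(expression.count(r) for r in relations))

def curriculum_train(train_set, val_sets, train_epoch, validate,
                     patience=3, max_competency=10):
    """Train on progressively harder subsets of the data.

    train_set:  list of training expressions
    val_sets:   mapping difficulty level -> held-out validation split
    train_epoch, validate: caller-supplied training/evaluation hooks
    """
    competency = 1
    while competency <= max_competency:
        # Only expressions at or below the current competency are used.
        subset = [ex for ex in train_set
                  if score_difficulty(ex) <= competency]
        best, stalls = 0.0, 0
        # Train on the current subset until validation accuracy saturates.
        while stalls < patience:
            train_epoch(subset)
            acc = validate(val_sets[competency])
            if acc > best:
                best, stalls = acc, 0
            else:
                stalls += 1
        competency += 1  # admit harder expressions
    return competency
```

Training stops once the competency counter passes `max_competency`, mirroring the stopping criterion above.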

B CC-Ref+ Annotation, Statistics, and Visualization
In Table 12, we summarize the size and complexity of our CC-Ref+ split.

B.1 Detailed Analysis of Models on CC-Ref+
In Section 4.2 of the main paper, we compared the performance of the baseline models and our proposed method on CC-Ref+ in terms of the number of relations (e.g., in the front, to the left, of the same shape as) present in the expressions. In this section, we present further analysis in terms of object attributes. In CLEVR-Ref+, there are six types of object attributes: color, size, shape, material, ordinality, and visibility. We analyze the models' performance when one of these attributes is perturbed in the contrast set. Additionally, we compare performance on contrast examples that involve logical AND/OR modifications. An example of a CC-Ref+ contrast sample involving a logical AND/OR perturbation is as follows: Original: "The objects that are either the first one of the small metal object(s) from right or the first one of the metallic cube(s) from left." CC-Ref+: "The objects that are the first one of the small rubber object(s) from right and the first one of the metallic object from front."  Figures 12 and 13 show the performance of the models P-Ref+LSTM+Attn, P-Ref+CL, and P-Ref+LSTM+Mem. We find that all four models are robust to perturbations in color, indicating that color is a relatively easy concept to ground in the images. In contrast to the findings in (Liu et al., 2019), we observe a significant drop of up to 15% in the performance of the baseline models on all the other attributes, such as shape and visibility. P-Ref+CL also experiences significant drops in accuracy on CC-Ref+; however, it is relatively more robust to the perturbations than the other baselines, indicating that curriculum learning helps in adapting to contrast sets. Our proposed approach, P-Ref+LSTM+Mem, shows relatively small drops on the logical, material, and ordinal perturbations, insignificant drops (< 3%) on the color and visibility perturbations, and a slight improvement (+2%) on the shape perturbations.
This clearly suggests that our approach generalizes well and is robust to perturbations in the input. The performance gap of P-Ref+LSTM+Mem on the logical, ordinal, and material perturbations shows that these are relatively difficult concepts for the model to learn. We hope that the CC-Ref+ dataset will foster more research in this area.
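The per-attribute breakdown above amounts to grouping contrast examples by the attribute category that was perturbed and computing accuracy within each group. A minimal sketch, assuming a hypothetical record format of (perturbed-attribute, is-correct) pairs:

```python
from collections import defaultdict

def accuracy_by_perturbation(results):
    """Compute per-category accuracy on a contrast set.

    results: iterable of (perturbed_attribute, is_correct) pairs, where
    perturbed_attribute is one of e.g. "color", "size", "shape",
    "material", "ordinal", "visible", or "logical".
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, count]
    for category, correct in results:
        totals[category][0] += int(correct)
        totals[category][1] += 1
    return {cat: correct / count
            for cat, (correct, count) in totals.items()}

results = [("color", True), ("color", True), ("shape", False),
           ("shape", True), ("ordinal", False)]
# accuracy_by_perturbation(results)
#   -> {"color": 1.0, "shape": 0.5, "ordinal": 0.0}
```

Comparing these per-category accuracies against the corresponding accuracies on the unperturbed originals gives the per-attribute drops reported above.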
Original: The big objects that are the first one of the block(s) from right or metallic object(s)
CC-Ref+: The big objects that are the first one of the block(s) from left and rubber object(s)

Original: The brown things that are big object(s) or the second one of the small metal thing(s) from left
CC-Ref+: The cyan things that are big object(s) or the first one of the small metal thing(s) from left

Original: The small objects that are the third one of the object(s) from left or purple shiny ball(s)
CC-Ref+: The large objects that are the third one of the object(s) from left or purple shiny ball(s)

Original: The tiny things that are the first one of the sphere(s) from left or the fourth one of the object(s) from front
CC-Ref+: The tiny things that are the first one of the cylinder(s) from left or the fourth one of the object(s) from front

Original: The matte things that are either the sixth one of the tiny thing(s) from right or the fifth one of the thing(s) from front
CC-Ref+: The matte things that are tiny thing(s) and the second one of the thing(s) from front

Original: The things that are either object(s) that are behind the tiny brown rubber thing(s) or the first one of the tiny brown thing(s) from left
CC-Ref+: The things that are either object(s) that is in front of the tiny brown metallic thing(s) or the second one of the tiny brown thing(s) from left