Region under Discussion for visual dialog

Visual Dialog is assumed to require the dialog history to generate correct responses. However, it is not clear from previous work how dialog history is actually needed for visual dialog. In this paper we define what it means for a visual question to require dialog history, and we release a subset of the GuessWhat?! questions whose dialog history completely changes their responses. We propose a novel interpretable representation that visually grounds dialog history: the Region under Discussion. It constrains the image's spatial features according to a semantic representation of the history inspired by the information structure notion of Question under Discussion. We evaluate the architecture on task-specific multimodal models and on the visual transformer model LXMERT.


Introduction
Visual Dialog (VD) is a task that combines natural language understanding grounded in vision with dialog. Being visual, VD is closely related to the area of Visual Question Answering (VQA). On VQA, important progress has been obtained recently with models that connect vision and language and are pre-trained on a variety of tasks (Tan and Bansal, 2019). Arguably, less progress has been made on the dialog part of VD, which is the topic of this paper. Currently, the two most popular datasets for visual dialog are VisDial (Das et al., 2017) and GuessWhat?! (de Vries et al., 2017). The former contains chit-chat conversations about an image whereas the latter contains dialogs about a visual game whose goal is reference resolution; hence its dialogs are task-oriented. Reference resolution is a fundamental task in situated dialog (Clark and Wilkes-Gibbs, 1986; Clark, 1996; Foster et al., 2009). Questions in reference resolution can be classified as intrinsic to the target ("Is it a car?") or relative to the context ("On the left?") (Clark and Marshall, 1981).
Visual Dialog is assumed to require the dialog history to generate correct responses. However, it is not clear from previous work how dialog history is used for VD (Agarwal et al., 2020). In this paper we define history dependence in terms of a representation that is interpretable as a region of the visual common ground shared between dialog participants (Traum, 1994; Clark, 1996). This representation, which we call Region under Discussion (RuD), is inspired by the pragmatic theory of Question under Discussion (QuD). QuD (Roberts, 2012; Ginzburg, 2012; Velleman and Beaver, 2016) is a somewhat overlooked but conceptually fruitful theory for spelling out the connection between the information structure of a sentence or question and the discourse or dialog in which the utterance occurs. In this paper we define RuD and use it to connect a question to its visual dialog history; we make the following contributions:
• We define what it means for a visual question to require dialog history, considering intrinsic and relative visual properties.
• We design a methodology for annotating a subset of the GuessWhat?! questions for which dialog history is required because it completely changes their responses.
• We propose an interpretable representation of history based on the Question under Discussion (QuD) theory; we call our representation Region under Discussion (RuD).
• We extend the Oracle model by de Vries et al. (2017) and the LXMERT-based model of Testoni et al. (2020) with our RuD.
• We find that RuD summarizes dialog history in an interpretable visual way which is linguistically well founded and improves responses for history dependent questions.

Region under Discussion (RuD)
Following Clark (1996) we define dialog common ground to be the commitments that the dialog partners have agreed upon during the dialog. An important part of the common ground is the Question under Discussion (QuD) (Ginzburg, 2012; De Kuthy et al., 2020). QuD is an analytic tool that has become popular among linguists and language philosophers as a way to characterize how a sentence fits in its context (Velleman and Beaver, 2016). The idea is that each sentence in discourse is interpreted with respect to a QuD. The QuD is defined by the dialog or discourse history. The linguistic form and the interpretation of an utterance, in turn, may depend on the QuD that provides the constraints that define the utterance's context. Similarly, we define a Region under Discussion (RuD) for visual dialog as a representation of the constraints that the dialog history establishes. The interpretation of a question depends on its RuD. Figure 1 shows a dialog from the GuessWhat?! visual dialog dataset (de Vries et al., 2017). GuessWhat?! is a cooperative reference resolution game: two players attempt to identify an object in an image. The Questioner does not know the target object and has to find it by asking questions; the Oracle knows the target and provides yes/no answers. For each question in the dialog, its dialog history is defined as the previous questions together with their answers (DeVault et al., 2009). In the figure, the target is highlighted in green. The baseline Oracle model proposed by de Vries et al. correctly answers the first four questions, failing only on question number 5 with a no answer. This question does not look particularly difficult. So, why did it create a problem? Because question 5 is the only question for which the dialog history modifies the response. All the other questions can be answered correctly just by looking at the image and ignoring what was said before. That is, questions 1 to 4 are VQA turns because they do not need the dialog history.
If we answer question 5, is it on the left?, ignoring the dialog history, the correct answer is no, because the target is clearly to the right of the picture, not to the left. The RuD for this question, depicted in blue in the figure, modifies the response.
In this work, we model in the RuD the constraints that are related to intrinsic properties of the target that have been previously agreed upon between the dialog participants. An intrinsic property is one that is inherent to and inseparable from the target, and does not depend on the visual context the target is placed in. In this example, one such intrinsic property is the fact that the target is a car, which is established in question 2. Another intrinsic property may be that the target is a vehicle, but not the fact that the target is together with another car. We say that such a property is not intrinsic to the target but relative to the position of the car. We decide to represent only intrinsic history in the RuD, motivated by literature from robot dialog, where intrinsic properties are plentiful and stable constraints (Tan et al., 2020). Using intrinsic properties appears to be the most common strategy for recovering from ambiguous dialog situations, as they reduce the cognitive effort (Marge and Rudnicky, 2015). We believe that restricting the RuD to intrinsic properties allows us to focus on the phenomena we are interested in while keeping the model simple and easily interpretable. Summing up, most questions in this dialog can be correctly answered independently of the dialog: they do not need the history. In effect, except for one turn, Figure 1 is just visual question answering.
In this paper we model dialog history as constraints that represent the part of the image which the dialog partners agree is the RuD and over which the rest of the questions are to be interpreted. For our example, with respect to the blue box, the correct answer to Is it on the left? is yes, since the car is on the left of the agreed RuD.

Methodology
In this section we describe the dataset and we show how we annotate a subset of questions whose dialog history completely changes their responses. We then explain how we build a semantic history for each dialog in order to construct a RuD, and how we extend Oracle models with RuDs.

Dataset and annotation
The GuessWhat?! dataset (de Vries et al., 2017) contains around 135k successful human-human dialogs with an average of 5 questions in natural language, created by crowdworkers playing the reference game on MS COCO images (Lin et al., 2014). The set contains around 672K questions which are grounded on about 63K unique images. Following Shekhar et al. (2019), we classify the questions into different types. In Table 1, we show the test set support for each type as well as a sample question. The table shows that the most frequent types of questions in the dataset are object and spatial questions. They constitute about 40% of the total questions. Object questions are intrinsic and do not depend on the RuD to be interpreted. In contrast, spatial, color and size questions are relative and can have their meaning changed by the RuD, as defined and illustrated in Sections 2 and 4.
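The type labels in Table 1 can be approximated with simple keyword heuristics; the keyword sets below are our own illustrative stand-ins, not the actual classification rules of Shekhar et al. (2019):

```python
# Hypothetical keyword-based question typing, loosely following the
# object/spatial/color/size taxonomy described in the text.
SPATIAL = {"left", "right", "top", "bottom", "front", "back", "near", "behind"}
COLOR = {"red", "blue", "green", "white", "black", "brown", "grey", "gray"}
SIZE = {"big", "small", "large", "tiny", "biggest", "smallest"}

def question_type(question: str) -> str:
    tokens = set(question.lower().rstrip("?").split())
    if tokens & SPATIAL:
        return "spatial"
    if tokens & COLOR:
        return "color"
    if tokens & SIZE:
        return "size"
    return "object"  # default: assume the question asks about the category

print(question_type("is it on the left?"))  # -> spatial
print(question_type("is it a car?"))        # -> object
```

A real classifier would need POS tags and larger lexicons; this sketch only illustrates why object and spatial questions are easy to separate at the surface level.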
To spot history dependent questions, we first sample a set of relative questions that follow a positively answered object question in a dialog. Then, two annotators identify questions such that the polarity of the answer changes when the question is asked considering its history. The annotation procedure is as follows: (1) Look at the picture and the candidate question without looking at the dialog history. (2) Answer the question with "yes", "maybe yes", "maybe no", "no" or "I don't know". (3) Compare to the answer in the corpus that the person gave to that question considering the dialog history. (4) If the answers do not coincide, mark the question as history dependent. In this setting, disagreements between annotators mostly arise from different views on vague properties of objects.
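Step (4) amounts to a polarity comparison between the history-blind annotation and the corpus answer; a minimal sketch (the function names and the coarse polarity mapping are our own):

```python
def polarity(answer: str):
    """Map the graded annotations to a coarse yes/no polarity (None = unknown)."""
    if answer in ("yes", "maybe yes"):
        return "yes"
    if answer in ("no", "maybe no"):
        return "no"
    return None  # "I don't know"

def is_history_dependent(blind_answer: str, corpus_answer: str) -> bool:
    """Mark the question if answering without history flips the polarity."""
    p = polarity(blind_answer)
    return p is not None and p != corpus_answer

print(is_history_dependent("no", "yes"))         # True: history flips the answer
print(is_history_dependent("maybe yes", "yes"))  # False: polarities coincide
```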
Surprisingly, and in contrast to what is usually assumed in previous work (Agarwal et al., 2020), visual questions that depend on dialog history do not contain more pronouns and ellipses than history independent visual questions. From the 1658 questions analyzed, the two annotators agreed that 204 questions are history dependent; that is, 12.3% of the questions in the sample were marked as history dependent. We call these 204 questions our GWHist test set.

Semantic history
To build the RuDs, we parse and match the questions in each dialog history to build a semantic history, that is, a representation of the known intrinsic properties of the target object. Then, we use this information to filter the objects in the image and obtain a set of candidate objects that will be part of the RuD.
Parsing. We parse questions that establish relations of types "is a" and "is the" between a noun phrase (NP) and the target object. The answers to these questions usually convey information about the category of the object, as in "Is it a person?". A positive answer to a category question implies that the candidate objects include only objects of that category, while a negative answer implies that these objects are not candidates.
We define regular expressions for the most common syntactic patterns. We tokenize and POS tag the questions using NLTK and Stanza (Bird et al., 2009; Qi et al., 2020). Table 2 shows some of the main patterns we use.
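A minimal version of such patterns can be written with Python's re module; the patterns below are illustrative stand-ins for those in Table 2, not the exact expressions we use:

```python
import re

# Illustrative "is a"/"is the" patterns capturing the NP after the article.
PATTERNS = [
    re.compile(r"^is (?:it|this|that) an? (?P<np>[\w ]+?)\??$"),
    re.compile(r"^is it the (?P<np>[\w ]+?)\??$"),
    re.compile(r"^an? (?P<np>[\w ]+?)\??$"),  # elliptical form: "a person?"
]

def parse_category_question(question: str):
    """Return the NP of a category question, or None if no pattern matches."""
    q = question.lower().strip()
    for pattern in PATTERNS:
        m = pattern.match(q)
        if m:
            return m.group("np")
    return None

print(parse_category_question("Is it a person?"))   # -> person
print(parse_category_question("is it on the left?"))  # -> None (not a category question)
```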
Matching. After parsing, the obtained NPs are lemmatized using NLTK and matched to the 80 categories from the COCO dataset. Lemmatization is particularly useful to match questions using plural nouns (as "boats" in example 4, Table 2). Matching is done using exact string comparison. Two complementary matching strategies are discussed in the following two paragraphs.
In the case of some category questions with two-token NPs, only the second token refers to the category, while the first one refers to another intrinsic property (such as color in example 2, Table 2). In this case, we match only the second token to a category, and only if the answer is positive. A negative answer is not informative about the category (the target may be a green car).
Some NPs refer not to categories present in COCO but to supercategories, i.e., nouns that cover several COCO categories (e.g. "food", covering "apple", "banana", "broccoli", etc.). We match these nouns using a pre-computed list of known supercategories. The supercategories, and their mapping to categories, are obtained from WordNet (Fellbaum, 1998) by extracting hypernym relations.
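The two matching strategies might be sketched as follows; the category subset, the supercategory table and the naive plural-stripping lemmatizer are illustrative stand-ins (the paper uses NLTK lemmatization and a WordNet-derived supercategory list):

```python
COCO_CATEGORIES = {"car", "person", "boat", "apple", "banana", "broccoli"}  # subset
# Illustrative supercategory map; the paper derives it from WordNet hypernyms.
SUPERCATEGORIES = {"food": {"apple", "banana", "broccoli"}, "vehicle": {"car", "boat"}}

def naive_lemma(noun: str) -> str:
    """Stand-in for NLTK lemmatization: strip a plural 's' ("boats" -> "boat")."""
    return noun[:-1] if noun.endswith("s") and noun[:-1] in COCO_CATEGORIES else noun

def match_np(np: str, answer_is_positive: bool):
    """Match an NP to a (super)category. For two-token NPs, only the head
    (second) token is matched, and only when the answer is positive."""
    tokens = [naive_lemma(t) for t in np.split()]
    head = tokens[-1]
    if len(tokens) == 2 and not answer_is_positive:
        return None  # "no" to "a green car" says nothing about the category
    if head in COCO_CATEGORIES:
        return ("category", head)
    if head in SUPERCATEGORIES:
        return ("supercategory", head)
    return None

print(match_np("boats", True))       # -> ('category', 'boat')
print(match_np("food", True))        # -> ('supercategory', 'food')
print(match_np("green car", False))  # -> None
```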
Filtering. The parsing and matching processes result in a semantic history that is available for each question in a game. The semantic history is the ordered list of positive and negative relations to (super)categories found in the previous turns (e.g. [(pos, "vehicle"), (neg, "car")] means that the target is a vehicle but it is not a car). The objects in the image are filtered using the history to obtain a set of candidate objects. Next, we describe our approaches for the positive and negative elements of the history separately.
For the positive history we use only the last element, assuming that it is the most specific one. We select the objects that are consistent with the (super)category of this element. For the negative history, our policy is to remove all the objects in the negated (super)categories from the candidates. For example, in Figure 1, once question 1 is answered with no, the RuD removes the boy on the skateboard from the candidates. Here, we assume that all the negative elements identify objects that can be removed from the RuD, regardless of the order in which they appear. After processing the semantic history, we check the candidate objects set for well-formedness. We say that the set is ill-formed if it does not include the target object. In this case, we force the inclusion of the target object as an ad-hoc policy.
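The filtering policy just described (keep objects matching the last positive element, drop all negated categories, force the target back in if the set becomes ill-formed) can be sketched as follows; the object records and the small supercategory table are illustrative:

```python
# Each object is (object_id, coco_category); the semantic history is an
# ordered list of (polarity, name) pairs as in the text.
SUPERCATEGORIES = {"vehicle": {"car", "boat"}}

def expand(name):
    """A (super)category name -> the set of COCO categories it covers."""
    return SUPERCATEGORIES.get(name, {name})

def filter_candidates(objects, history, target_id):
    last_pos = [n for pol, n in history if pol == "pos"]
    candidates = list(objects)
    if last_pos:  # keep only objects consistent with the last positive element
        allowed = expand(last_pos[-1])
        candidates = [o for o in candidates if o[1] in allowed]
    for pol, name in history:  # remove every negated (super)category
        if pol == "neg":
            candidates = [o for o in candidates if o[1] not in expand(name)]
    if all(o[0] != target_id for o in candidates):  # ill-formed: force target back
        candidates += [o for o in objects if o[0] == target_id]
    return candidates

objs = [(1, "car"), (2, "car"), (3, "boat"), (4, "person")]
print(filter_candidates(objs, [("pos", "vehicle"), ("neg", "car")], 3))
# -> [(3, 'boat')]
```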
Coverage. To evaluate the coverage of the semantic history, we apply it to the validation set of the GuessWhat?! dataset. In addition to the full-featured process, we try three feature ablations by removing either supercategory matching (-super), second-token matching (-2nd) or negative history (-neg). This way, we are able to assess the individual contribution of each of these features.
A summary of the coverage is shown in Table 3. We report here the total number of questions with non-empty semantic histories, and the counts for different types of candidate objects sets: ill-formed sets such as empty ones (empty) and those that exclude the target (w/o tgt), and well-formed sets such as those that only include the target (only tgt) and those that include the target and some other distractor objects (tgt+dist).
Despite the simplicity of our approach, the coverage is substantial, with more than 60% of the questions having a semantic history. We also see a low rate of ill-formed candidate sets, around 3%. Ablations show that, as expected, negative history almost doubles the coverage. Also, WordNet-based supercategories make an important contribution to coverage, at the expense of a significant increase in ill-formed candidate sets.

Extending oracle models with RuD
In this section we extend two popular models for the Oracle in visual dialog, namely the Question+Category+Spatial (QCS) baseline proposed by de Vries et al. (2017) and the LXMERT-based Oracle of Testoni et al. (2020), which we refer to as CMO. In what follows, we build upon these models and propose two simple extensions to encode the RuD. We name our models QCS+RuD and CMO+RuD, respectively.
For both models, we define the RuD as the smallest bounding box that encloses all the objects in the set of candidates. The candidate objects are computed from the dialog history as described in Section 3.2. If no history is available we set the RuD to match the whole image.
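Assuming boxes are given as top-left and bottom-right corners, the RuD is the coordinate-wise min/max over the candidate boxes:

```python
def rud_box(candidate_boxes, image_box):
    """Smallest box enclosing all candidates; the full image if no history."""
    if not candidate_boxes:
        return image_box
    x1 = min(b[0] for b in candidate_boxes)
    y1 = min(b[1] for b in candidate_boxes)
    x2 = max(b[2] for b in candidate_boxes)
    y2 = max(b[3] for b in candidate_boxes)
    return (x1, y1, x2, y2)

print(rud_box([(10, 20, 50, 60), (30, 5, 80, 40)], (0, 0, 640, 480)))
# -> (10, 5, 80, 60)
```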
QCS takes as input a question encoded by an LSTM as well as category and spatial feature embeddings of the target. An MLP on top of these features classifies the question into three possible answers: no, yes and n/a (not answerable). The spatial embedding in QCS corresponds to an 8-dimensional vector that encodes the coordinates of the top-left and bottom-right corners, the center and the size of the target bounding box, normalized such that the image width and height coordinates range from -1 to 1. We extend this encoding by adding the same 8-dimensional vector but shifted and scaled according to the RuD position and scale. Concretely, let (x1, y1, x2, y2) be the top-left and bottom-right coordinates of the target bounding box and (X1, Y1, X2, Y2) those of the RuD. Let us define x0 = (x1 + x2)/2, y0 = (y1 + y2)/2 and let (w, h) and (W, H) denote the width and height of the target box and the RuD, respectively. We add to the QCS input embedding the same eight features computed in the coordinate frame of the RuD, i.e., shifted by its position (X1, Y1) and scaled by its size (W, H). The proposed architecture is shown in Figure 2a.
For questions without history, the RuD spatial embedding is defined to be the same as the spatial embedding w.r.t. the entire image.
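Since the feature equation is not reproduced here, the following is only a plausible reconstruction of the shifted-and-scaled embedding: the target coordinates are re-expressed in the RuD frame and normalized so the RuD spans [-1, 1], matching the image-relative normalization of the original embedding.

```python
def spatial_features(box, frame):
    """8-d spatial vector of `box` expressed in the coordinate frame of
    `frame`, normalized so that frame coordinates span [-1, 1]. With
    frame = the whole image this mimics the original QCS embedding; with
    frame = the RuD it mimics the added embedding. The exact normalization
    in the paper may differ; this is an assumed reconstruction."""
    x1, y1, x2, y2 = box
    X1, Y1, X2, Y2 = frame
    W, H = X2 - X1, Y2 - Y1
    nx = lambda x: 2 * (x - X1) / W - 1
    ny = lambda y: 2 * (y - Y1) / H - 1
    x0, y0 = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = x2 - x1, y2 - y1
    return [nx(x1), ny(y1), nx(x2), ny(y2), nx(x0), ny(y0), 2 * w / W, 2 * h / H]

# Left half of a 100x100 image, expressed in the full-image frame:
print(spatial_features((0, 0, 50, 100), (0, 0, 100, 100)))
# -> [-1.0, -1.0, 0.0, 1.0, -0.5, 0.0, 1.0, 2.0]
```

When the frame is the RuD instead of the image, the same function yields the additional eight features, which collapse to the original embedding for questions without history (RuD = whole image).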
For CMO, the model expects as inputs not only word and region embeddings but also their locations with respect to the query and the reference image, respectively. For the visual modality, this information is encoded as the bounding box coordinates produced by the object detection module; in our case, these are the coordinates of the top-left and bottom-right corners of each object bounding box. Using the same notation as before, we encode the spatial coordinates of each box relative to the RuD instead of the image. In Figure 2b we show how we implement RuD for CMO. Note that, in this case, coordinates lying outside the RuD will be negative or have a value greater than one. This does not happen for the QCS+RuD model because only the coordinates of the target are modified, and these always fall inside the RuD.
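The CMO box normalization might be sketched as follows, assuming LXMERT-style coordinates scaled so that the RuD edges map to 0 and 1 (the exact convention is our assumption); boxes outside the RuD then yield values below 0 or above 1, as noted above:

```python
def box_in_rud(box, rud):
    """Normalize a box's corners to the RuD frame: 0/1 at the RuD edges.
    Objects outside the RuD get coordinates < 0 or > 1."""
    x1, y1, x2, y2 = box
    X1, Y1, X2, Y2 = rud
    W, H = X2 - X1, Y2 - Y1
    return ((x1 - X1) / W, (y1 - Y1) / H, (x2 - X1) / W, (y2 - Y1) / H)

rud = (100, 0, 300, 200)
print(box_in_rud((100, 0, 200, 100), rud))  # inside: (0.0, 0.0, 0.5, 0.5)
print(box_in_rud((0, 0, 100, 100), rud))    # left of the RuD: negative x's
```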

Results and discussion
In this section we first report the empirical results of our experiments, then we argue that RuD summarizes history in a visually interpretable way through a qualitative analysis. Finally we discuss the limitations of our implementation of RuD.
We performed our experiments with the previously proposed models for the Oracle task. We implement both of our models as three-way classifiers using MLPs and a cross-entropy loss, in line with the relevant literature. For the QCS baseline, we follow the setup of de Vries et al. (2017).

Empirical results
We report empirical results for the Oracle task of the GuessWhat?! benchmark (de Vries et al., 2017) and for the history dependent subset GWHist described in Section 3. We evaluate the RuD-augmented models and compare them with their respective RuD-less baselines.
In Table 4 we show the accuracy in the test set of each of our models for the questions that were augmented with a semantic history. We use Oracle response accuracy as an evaluation metric because it compares the model response to the human ground truth answer. In addition to the GuessWhat?! test set and our GWHist subset, we report the results on the two most frequent types of questions: object and spatial. The table shows that the RuD-augmented models do not outperform the RuD-less models on the object subset. This is to be expected, since object questions are not history dependent, as we anticipated in Section 2. For example, a car will always be a car no matter what was said about it before.
The accuracy for spatial questions and for the whole GuessWhat?! (GW) dataset is slightly higher for the models that add RuD, but the difference is small. This is due to the fact that most questions in GW, including spatial questions, are not history dependent, as we argued in Section 3. (The analysis of the accuracy across all types of questions is included in Appendix A.) However, the effect of adding the RuD on accuracy is clear in the history-dependent GWHist, where QCS+RuD and CMO+RuD show an increment of 41% and 46%, respectively. The initial accuracy for GWHist is very low for both the QCS and CMO models; in fact, it is close to one minus the accuracy on GW. These are hard questions that are wrongly answered without the dialog history, as we explained in Section 3. The fact that both models consistently improve shows that the RuD is capturing the region of the image on which the history dependent question is being interpreted. With a maximum accuracy of 0.416 for history dependent questions, there are still many phenomena that our models are not able to handle. Below we discuss the kinds of history dependent questions that our models are able to handle and illustrate those that they cannot.

Table 4: Test response accuracy for the Oracle models discussed in Section 3, with and without Region under Discussion (RuD). Results are shown for the question types object and spatial. The last two rows show the accuracy on the whole test set (GW) and on the history dependent subset (GWHist).

Qualitative analysis
In this subsection we argue that RuD summarizes history in an interpretable visual way for different types of questions. Size, color and spatial questions can have a meaning which is relative to their RuDs.
We also discuss details about the GWHist and we show examples of the phenomena we found during the annotation.
In the first example we see a size question, the big one near the white plate? in position 8, that is correctly answered by the RuD-augmented CMO. In this picture, the target, marked in green, is the biggest bottle visible. The model can use the RuD to determine which is the biggest bottle relative to the other bottles present in the scene.
The second image shows an example of a color question that improves when answered within the RuD. The model is able to take advantage of the RuD to answer the question it is brown? in position 4. Although the car is some shade of gray under the illumination conditions of the scene, given the answer "no" to the earlier question it is grey?, one could argue that the target is the brownest object in the region.
In the third example, question 8 is it the first one? is interpreted with respect to a thin and long RuD which establishes an order in the traffic lights.
In Figure 4 we show an example of a history dependent question that is not improved by the RuD-augmented models. In this case, the question is most first?. This example shows one of the limitations of our approach. A model that correctly answers these sorts of questions would need to take into account the second question, in right?, to infer the direction of the search and arrange the candidate objects in a row indexed from right to left.
During annotation we also found a variety of examples of questions that ask about objects other than the target. These questions change their semantics completely when isolated from the dialog history. We found that many of these history dependent questions come after an object question that has already identified the category of the target object, and are now looking for another salient object that univocally identifies it. We show examples of this and other history dependent questions that our models are not able to handle in Appendix B.
Additionally, the GuessWhat?! dataset was generated by crowdworkers and some of the questions exhibit English errors. An example of this can be seen in the third question in Figure 4.

Limitations
In this work, we relied on the annotations of the COCO dataset to compute the RuDs. However, dialogs may contain questions that refer to objects not present in the annotations; those objects are invisible to our RuD computation. Depending on the COCO annotations makes it easy to compute RuDs with intrinsic history. The same cannot be said about histories regarding attributes such as color, size, shape, etc.; dialogs contain questions that rely on these attributes to build common ground. Lastly, many Questioners further constrain the RuD multiple times (by grouping, filtering by attributes, delimiting the area with respect to another object, etc.). This process requires more history management than we perform to compute the RuD for a given question. Most of these constraints require common sense reasoning, spatial understanding and a deep connection to the visual modality. As we explained in Section 2, in this paper we only consider intrinsic properties (that is, object questions) to constrain the RuD. This approach is not enough: for example, if question 5 in Figure 1 had been "Is it on the right?", the RuD would be too large.

Figure 4: Example of a GWHist question that does not improve with our approach (human responses: 1. is a banana? yes; 2. in right? yes; 3. most first? yes). Such an example is hard, as it requires further dialog management to infer that the Questioner's point of attention is at the right of the bananas and that a row of them would be indexed from right to left.

Previous work
Visual Dialog played a prominent role in early work on natural language understanding (Winograd, 1972) and is now the focus of an active community investigating the interplay between computer vision and computational linguistics (Baldridge et al., 2018; Shekhar et al., 2019). On the GuessWhat?! task, most previous research has focused on the Questioner (Strub et al., 2017; Shekhar et al., 2019; Pang and Wang, 2020). Recent work suggests that the performance of the Oracle agent used by most work (de Vries et al., 2017) differs considerably across types of questions (Mazuecos et al., 2020). Questioners that rely on the Oracle learn to prefer asking only those questions that the Oracle can answer reliably. This has an impact on the type and linguistic variety of the generated questions, reducing the GuessWhat?! task to a simpler linguistic task (Shukla et al., 2019; Pang and Wang, 2020). Clark and Wilkes-Gibbs (1986) model the process of finding referring expressions as a collaborative process in which the speakers repair, expand on, or replace the noun phrase in an iterative process until they reach a version they mutually accept. This process is explicitly performed in a GuessWhat?! dialog, although the role of the Oracle is simplified.
The Oracle model proposed by de Vries et al. (2017) is implemented with an MLP (as we described in Section 3). They showed that their best performing model was the one that takes the question, the target's category and its location as inputs. This has a major limitation: the model is blind and cannot see the image. Still, this model is widely used as the Oracle agent in subsequent research on the Questioner. Testoni et al. (2020) proposed an adaptation of LXMERT (Tan and Bansal, 2019) to improve on the previous Oracle, achieving a new SOTA for the GuessWhat?! Oracle without using dialog history as an input. This work showed various improvements on different types of questions, mainly on questions regarding location and other attributes, and a slight decrease in performance on object or supercategory questions due to not receiving the gold standard object category as input from the dataset. Their qualitative error analysis suggests that spatial questions are harder because they require history in order to be answered correctly in context. Agarwal et al. (2020) argue that although complex models that encode history for visual dialog have been proposed (Yang et al., 2019), such work has not demonstrated that history indeed matters for visual dialog. Agarwal et al. propose and apply a new methodology for evaluating history dependence of questions in visual dialog. They show crowdworkers a question with its image, without the dialog history, and ask "would you be able to answer this question by looking at the image only, or do you need more information from the previous conversation?". But saying I can confidently tell the correct answer just by looking at the image is not the same as giving the same answer that one would give after looking at the previous conversation (remember the example in Section 2). Most questions are answerable no matter where they appear in a dialog because the answerer accommodates.
Our method differs in that it captures history dependent questions that are not evident at first glance (such as "is it on the left?" in Figure 1). We found a similar percentage of history dependent questions in the GW dataset to what Agarwal et al. found on VisDial (12% vs. 11%). This may explain why current dialog models do not learn history dependence: current mainstream vision and dialog datasets lack a significant amount of history dependency.
Dialog history has two characteristics that make it difficult for current machine learning methods: not only does it introduce variability, with different histories for the same question, but history dependence may also not be lexicalized, as in is it on the left? in Figure 1. History dependency is easier to spot when it is lexicalized with explicit pronouns (e.g. him in 'is it close to him?') or through noticeable ellipsis (e.g. a missing noun such as cars in 'are there two together?'). However, as we see in Figure 3, pronouns in task-oriented VD frequently are not anaphoric to the dialog history but to the image (e.g. the pronoun it in is it a person? is anaphoric to the target). Information structure theory (Roberts, 2012) and, in particular, QuD (Purver et al., 2003; Ginzburg, 2012; De Kuthy et al., 2020) provide a framework for defining context dependence beyond pronouns and syntactic ellipsis.

Conclusions
We proposed a novel interpretable representation for visual dialog history: Region under Discussion (RuD). It constrains the image's spatial features according to a semantic representation of the history inspired by the information structure notion of QuD. We evaluated our method on models for the Oracle task in the GuessWhat?! dataset. Our results show that our implementation of RuD leads to improvements in performance on history dependent questions. We release a manually annotated subset of such questions. Our experiments confirm that intrinsic properties do not benefit from dialog management, whereas questions that ask for properties relative to the context see an improvement with it.
Interestingly, only a low percentage of questions (12%) in the GuessWhat?! dataset are indeed history dependent. However, a single error in a 10-turn GW dialog may cause the identification of the wrong referent, rendering the task unsuccessful. We agree with de Vries et al. (2020) that the simplified yes-no nature of this task provides an interesting playground for working on conceptual advances in representation methods for dialog history. The GuessWhat?! task is ill-suited for incremental research, as it is unclear how small improvements will find their way to real applications. Our contribution is not incremental. Our paper makes a theoretical contribution by defining the new concept of Region under Discussion and linking it with the concept of Question under Discussion in dialog. Based on this theoretical contribution, it proposes an interpretable, simple and extensible method for representing dialog history.
This work only adjusts the RuD to reflect the intrinsic properties of the target entity, not other attributes (color, shape, etc.) or spatial restrictions ("is it among the four in the back?"). Including other types of relations in the generation of the RuDs is a promising avenue for future research. In this regard, we are considering the following approaches: 1) RuD generation from scene graphs (an SG is a graphical representation of an image that encodes objects as nodes and pairwise relations as edges), and 2) learning RuD predictors from dialog data end-to-end. In both cases, we need a large and representative training set (an SG/RuD annotated for each turn of each dialog), and such data is hard and expensive to gather. A possible solution is to explore weakly supervised strategies, where the SG/RuD is treated as a latent variable.
We think that these contributions can be of use for the Questioner model, potentially helping Questioners learn dialog strategies instead of solving dialog tasks through Visual Question Answering.

Ethical considerations
In this paper we trained simple and complex deep learning models. We consumed approximately 16.5 Wh for each experiment with QCS and 266.67 Wh for each one with CMO. We generated approximately 0.02 kgCO2eq and 0.77 kgCO2eq for each QCS and CMO experiment, respectively. Each QCS experiment took approximately 9 minutes to train its 4.3M trainable parameters. This rises to around 6.7 hours to train the 207.94M parameters of the CMO models. We did not collect a new dataset, so we did not use crowdsourcing. The annotation of the GWHist corpus was done by two of the authors, who were not economically rewarded. However, this work builds upon work for which the carbon footprint and the ethical considerations of crowdsourcing are important. We discuss these ethical considerations below.
First, the dataset that we use in this paper is described in (de Vries et al., 2017) and was crowdsourced. Crowdsourcing raises ethical concerns, including paying a fair wage to crowdworkers and limiting the number of HITs they complete in a day so that they are not exhausted and overworked. de Vries et al. (2017) do not provide this information in their paper. Last but not least, machine learning models trained on long multimodal dialog histories may get very big very fast (Agarwal et al., 2020). For the sake of the environment and the budget of low-resource researchers, we need models that learn to summarize dialog histories, as we do with RuDs.

A.2 Extended Empirical results
In this subsection we show the extended results, broken down by games with and without history and by question type. QCS and QCS+RuD. Table 5 shows results for QCS and QCS+RuD. Results are shown for the full featured RuD, as well as for the three feature ablations discussed in Section 3. The RuD-augmented models outperform the QCS baseline for questions that can be considered relative: spatial, size, and (arguably) color, texture and action. No improvement is observed for intrinsic questions: object and shape. Spatial questions, the most frequent relative question type, benefit the most from the use of the RuD, with improvements in accuracy ranging from 1.4% to 1.8%. Ablations show that all the proposed features contribute to the overall performance, with the WordNet-based supercategory contributing the most. We also experimented with word embeddings to retrieve the semantic histories instead of the semantic parser we proposed in Section 3. The relatedness between the content of a sentence and the generated histories was computed using cosine distance, trying different thresholds. A manual analysis showed that higher thresholds let too many errors in, while lower thresholds yielded lower coverage than the proposed method. We therefore decided to keep our semantic parser and leave the exploration of word embeddings for future work.
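The embedding-based alternative we explored can be sketched as follows. This is an illustrative toy version, not the configuration we ran: the bag-of-words vectors and the threshold value are assumptions, and in practice pretrained word embeddings would replace the toy vectors.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two vectors (1 - cosine similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv) if nu and nv else 1.0


def related_turns(question_vec, history_vecs, threshold=0.5):
    # Keep history turns whose cosine distance to the question falls below
    # the threshold; a higher threshold admits more (and noisier) matches,
    # a lower one is stricter and reduces coverage.
    return [i for i, h in enumerate(history_vecs)
            if cosine_distance(question_vec, h) < threshold]


question = [1, 1, 0, 0]             # toy vector, e.g. "on the left?"
history = [[1, 1, 1, 0],            # turn 0: overlaps with the question
           [0, 0, 1, 1]]            # turn 1: disjoint vocabulary
turns = related_turns(question, history)
```

The threshold trade-off in this sketch mirrors the manual analysis reported above: loosening it lets unrelated turns through, tightening it drops genuinely related ones.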
These empirical results suggest that our RuD captures a fact about language: relative questions tend to depend on dialog history while intrinsic questions do not. However, the improvement for relative questions is small. We believe that more elaborate semantic-history and RuD construction schemes can lead to further significant improvements in Oracle performance.
Final results for the test set are shown in Table 6. Accuracies are reported for the different question types, and also for questions with and without RuD ("w" and "w/o" resp.).
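The per-type breakdown used in these tables can be sketched as follows; this is a hypothetical illustration of the bookkeeping, with a toy example set (the question-type names and the "w"/"w/o" labels follow the tables, the data is made up).

```python
from collections import defaultdict


def accuracy_breakdown(examples):
    """Accuracy per question type, split by RuD availability.

    examples: iterable of (question_type, has_rud, correct) triples.
    Returns a dict keyed by (question_type, group), where group is
    "all", "w" (question has a RuD) or "w/o" (it does not).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for qtype, has_rud, correct in examples:
        for key in ((qtype, "all"), (qtype, "w" if has_rud else "w/o")):
            totals[key] += 1
            hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


table = accuracy_breakdown([
    ("spatial", True, True),    # spatial question with a RuD, answered right
    ("spatial", False, False),  # spatial question without a RuD, answered wrong
    ("object", False, True),    # intrinsic question, no RuD needed
])
```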
CMO and CMO+RuD. Results are shown in the last two groups of columns in Table 6. Compared to their QCS counterparts, we see an increase in performance for all question types, consistent with the results of Testoni et al. (2020). Accuracy improves by more than 5 absolute points for CMO and CMO+RuD compared to QCS and QCS+RuD, respectively. However, when comparing CMO with CMO+RuD we observe only marginal improvements on spatial, color, size and action questions. Consistent with what was found for the QCS+RuD model, the intrinsic question types object and shape show no improvement for CMO+RuD. The only significant improvement is observed on the spatial subset of history dependent questions (GWHist), where we observe a large gap in favor of the CMO+RuD model on questions with RuD.

Table 6: Test classification accuracy for the Oracle models discussed in Section 3. Results are shown for different question types and for questions with and without history information (RuD). The last two rows show the accuracy on the whole test set (GW) and on the history dependent subset (GWHist).
We also consider a control configuration in which the spatial information associated with the visual input modality is zeroed out. This model obtains an overall accuracy of 0.750 on the full test set, below that of the QCS baseline, which shows the importance of spatial information for these types of models. On the GWHist subset, its performance is around 50% (0.515, 0.566 and 0.300 for "all", "w" and "w/o", respectively). This is to be expected, since the GWHist subset was designed such that the absence of history information changes the polarity of the answer; the observed 50% is close to a history-less majority class predictor. The performance of CMO+RuD of 0.301 on the history dependent set GWHist leaves much room for improvement. Below we illustrate the performance of CMO+RuD for relative questions, and then turn to the limitations of our approach.
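The control configuration above can be sketched as a simple preprocessing step on the visual input; this is an illustration of the idea, not our implementation, and the assumption that spatial coordinates occupy the trailing dimensions of each object's feature vector is ours.

```python
def zero_spatial(features, spatial_dims=2):
    """Zero out the trailing `spatial_dims` entries of each object's feature
    vector, removing location information while keeping appearance features.

    features: list of per-object feature vectors (appearance + spatial tail).
    """
    return [vec[:-spatial_dims] + [0.0] * spatial_dims for vec in features]


# One object: two appearance features followed by two spatial coordinates.
masked = zero_spatial([[0.3, 0.9, 10.0, 20.0]], spatial_dims=2)
```

Feeding `masked` instead of the original features to the Oracle yields the control condition: the model can still see what objects look like, but not where they are.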

B More Qualitative Analysis
As explained in the qualitative analysis in Section 4, some questions asked about objects other than the target. An example can be seen in the first image in Figure 5. In isolation, the final question "guy in red?" would be interpreted as just another object question, but in context the players are trying to identify another object in the image that is related to the target. This relation is usually established before such questions appear.
A further limitation of the data is that referenced objects are sometimes not present in the annotation. The second example in Figure 5 shows this phenomenon: the fourth question, "is it small?", has the correct RuD, but the model fails to compare the orange to the little blueberries when answering the question.
Coming back to spatial questions, questions that ask for the absolute spatial location of objects tend to contain more ellipsis than other questions. Non-history-dependent absolute spatial questions do not differ syntactically from their history-dependent counterparts. Only when analyzed in context can one distinguish between the two, but this goes beyond the form of the sentences and requires knowledge that the RuD-less models do not have access to.
Exophora In the Guesswhat?! dataset we find that visual questions dependent on dialog history do not contain more pronouns and ellipses than history-independent visual questions, as noted in Section 3. This is because most questions in the corpus are exophoric, relying heavily on the common visual context. Such exophoric pronouns are grounded in the task and the image, not in the previous dialogue. Exophoric references point not only to the target but also to other salient objects that can be referred to with a pronoun without being linguistically introduced. For example, in a picture with two salient people, a question such as "is it behind them?" is possible even when the people have not been referred to before.

C Annotation Tool
We used a web interface, shown in Figure 6, for the annotation of the data. Each annotator was prompted with a question and asked to answer it with one of the 5 options shown in the