GoG: Relation-aware Graph-over-Graph Network for Visual Dialog

Visual dialog, which aims to hold a meaningful conversation with humans about a given image, is a challenging task that requires models to reason over the complex dependencies among visual content, dialog history, and current questions. Graph neural networks have recently been applied to model the implicit relations between objects in an image or dialog. However, they neglect the importance of 1) coreference relations among dialog history and dependency relations between words for the question representation; and 2) the representation of the image based on the fully represented question. Therefore, we propose a novel relation-aware graph-over-graph network (GoG) for visual dialog. Specifically, GoG consists of three sequential graphs: 1) H-Graph, which aims to capture coreference relations among dialog history; 2) History-aware Q-Graph, which aims to fully understand the question by capturing dependency relations between words based on coreference resolution over the dialog history; and 3) Question-aware I-Graph, which aims to capture the relations between objects in an image based on the fully represented question. As an additional feature representation module, we add GoG to an existing visual dialog model. Experimental results show that our model outperforms the strong baseline in both generative and discriminative settings by a significant margin.


Introduction
Vision-language tasks have drawn increasing attention with the development of multi-modal natural language processing (Baltrušaitis et al., 2018; Chen et al., 2020b, 2019), such as image captioning (Xu et al., 2015; Anderson et al., 2016, 2018; Cornia et al., 2020; Ghanimifard and Dobnik, 2019), visual question answering (Ren et al., 2015a; Gao et al., 2015; Lu et al., 2016; Anderson et al., 2018; Li et al., 2019; Huang et al., 2020) and visual dialog (Das et al., 2017; Kottur et al., 2018; Agarwal et al., 2020; Wang et al., 2020; Qi et al., 2020; Chen et al., 2021). Relations in these tasks are significant for reasoning over and understanding textual and visual information. Specifically, visual dialog, which aims to hold a meaningful conversation with a human about a given image, is a challenging task that requires models to reason over complex relations among visual content, dialog history, and current questions.

[Figure 1 shows an example dialog: H0: "a man in a suit and tie standing next to a woman with glasses"; H1: "what color is the man's suit?" "grey with a blue and grey tie"; H2: "what color is his hair?" "white"; H3: "is it styled nicely?" "yes".]
Various attention mechanisms serve as the backbone of previous mainstream approaches (Lu et al., 2017; Wu et al., 2018; Kottur et al., 2018; Gan et al., 2019; Guo et al., 2019b), following Das et al. (2017). HCAIE (Lu et al., 2017) provides a history-conditioned image attentive encoder to represent the question, the question-attended history, and the attended image. CoAtt (Wu et al., 2018) provides a sequential co-attention encoder in which each input feature is co-attended by the other two features in a sequential fashion. ReDAN (Gan et al., 2019) and DMAM (Chen et al., 2020a) use multi-step reasoning based on dual attention to answer a series of questions about an image. DAN (Guo et al., 2019b), MCAN (Agarwal et al., 2020) and LTMI (Nguyen et al., 2020) utilize multi-head attention mechanisms to manage multi-modal interaction. However, these approaches tend to capture only the most discriminative information, ignoring other rich complementary clues, such as relations between objects in an image.
Recent visual dialog studies (Zheng et al., 2019; Schwartz et al., 2019; Jiang et al., 2020b; Guo et al., 2020; Jiang et al., 2020a) explore higher-level semantic representations of images or dialog history, notably with graph-based structures for modeling the image or dialog history. Although graph-based structures have been considered, these models do not explicitly capture the complex relations within visual content or textual contexts, or the relations between them. As shown in Figure 1, there are complex relations such as coreference relations among dialog history, dependency relations between words in the question, and spatial relations between objects in the image. For example, to answer the question Q4 "is he looking at the woman ?", we first need to reason over the dialog history to know who "he" is, then understand the intention of the question with the help of the resolved history and the syntax of the question, and finally determine the spatial locations of, and the relation between, "the man" and "the woman" in the image based on the fully understood question. How to 1) understand the coreference among history, 2) understand the intention of the question with its syntax and history, and 3) understand the image with the fully understood question are worth exploring.

Therefore, in this paper, we propose a novel relation-aware graph-over-graph network (GoG) for visual dialog. Specifically, GoG consists of three sequential graphs: 1) H-Graph, which aims to capture coreference relations among dialog history; 2) History-aware Q-Graph, which aims to fully represent the question by capturing dependency relations between words based on coreference resolution over the dialog history; and 3) Question-aware I-Graph, which aims to capture the relations between objects in an image on the basis of the fully represented question. As an additional feature representation module, we add GoG to the strong visual dialog model LTMI (Nguyen et al., 2020). We test the effectiveness of our proposed model on two large-scale datasets: VisDial v0.9 and v1.0 (Das et al., 2017). Both automatic and human evaluations show that our approach can be used to improve prior strong models.

The contributions of this work are summarized as follows:

• We explore how to construct complex explicit relations in visual dialog, i.e., coreference relations among dialog history, dependency relations between words in the question, and spatial relations between objects in the image.
• We propose a novel relation-aware graph-over-graph network to reason over relations within and among different graphs, obtain a high-level representation of the multi-modal information, and use it to generate a visually and contextually coherent response.
• We conduct extensive experiments and ablation studies on two large-scale datasets, VisDial v0.9 and v1.0. Experimental results show that our GoG model can be used to improve the previous strong visual dialog model in both generative and discriminative settings.

Preliminary
Following Das et al. (2017), a visual dialog agent is given three inputs, i.e., an image $I$, the dialog history (the caption and question-answer pairs) up to round $t-1$, $H = \{Cap, H_1, \ldots, H_{t-1}\}$, and the current question $Q_t$ at round $t$, where $Cap$ is the caption describing the image, taken as $H_0$, and $H_1, \ldots, H_{t-1}$ are concatenations of question-answer pairs. The goal of the visual dialog agent is to generate a response $A_t$ to the question $Q_t$.

As shown in Figure 2, our relation-aware graph-over-graph network (GoG) first takes the image, the dialog history, and the question as inputs and represents them using Faster R-CNN (Ren et al., 2015b) and LSTM (Hochreiter and Schmidhuber, 1997) encoders. Second, GoG constructs the history graph, the history-aware question graph, and the question-aware image graph. Third, GoG utilizes the attention alignment module to fuse the three graphs. Finally, GoG uses the fused multi-modal information to give the corresponding answer.

In the following, we first briefly describe the feature representation of the three inputs. We then introduce our graph attention and describe how we apply it to the history graph, question graph, and image graph to construct our graph-over-graph network. Finally, we describe how we apply our graph-over-graph network to strong visual dialog models.
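For concreteness, the following is a minimal PyTorch-style sketch of the GoG forward pass outlined above. The graph and fusion modules are passed in as callables; the stand-ins used here (identity functions and a mean-pooling fusion) are purely illustrative and are not the implementation described in this paper.

```python
import torch

# A minimal sketch of the GoG forward pass described above.
# The three graph modules and the fusion module are hypothetical
# placeholders (identity-like stand-ins), not the authors' code.

def gog_forward(v, q, h, h_graph, q_graph, i_graph, fuse):
    """v: (mu, d_v) image regions, q: (lam, d_q) question words,
    h: (t, d_h) per-turn history states."""
    h_star = h_graph(h)            # 1) coreference-aware history graph
    q_star = q_graph(q, h_star)    # 2) history-aware question graph
    v_star = i_graph(v, q_star)    # 3) question-aware image graph
    return fuse(q_star, h_star, v_star)  # joint representation J

# Toy usage with identity stand-ins for the graph modules.
ident = lambda x, *rest: x
v, q, h = torch.randn(100, 2048), torch.randn(20, 512), torch.randn(10, 512)
J = gog_forward(v, q, h, ident, ident, ident,
                lambda q_, h_, v_: torch.cat([q_.mean(0), h_.mean(0), v_.mean(0)]))
print(J.shape)
```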

Feature Representation
Similar to Anderson et al. (2018), we extract image features using a pretrained Faster R-CNN (Ren et al., 2015b). We select $\mu$ object proposals for each image, where each object proposal is represented by a 2048-dimensional feature vector. The obtained visual region features are denoted as $v = \{v_i\}_{i=1}^{\mu} \in \mathbb{R}^{\mu \times d_v}$.

To extract the question features, each word is embedded into a 300-dimensional vector initialized with GloVe vectors (Pennington et al., 2014). The word embeddings are fed into an LSTM encoder (Hochreiter and Schmidhuber, 1997), which produces the initial question representation $q \in \mathbb{R}^{\lambda \times d_q}$. The features of each history sentence are obtained in the same way as the question features. We concatenate the last hidden state $h_{last} \in \mathbb{R}^{d_h}$ of each history turn to obtain the initial history representation $h \in \mathbb{R}^{t \times d_h}$. Here $\lambda$ denotes the length of the question, $t$ denotes the number of dialog history turns, $d_q$ denotes the dimension of each word representation in the question, $d_h$ denotes the dimension of each history turn representation, and $[\cdot, \cdot]$ denotes the concatenation operation.
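The text-side encoding can be sketched as follows; the vocabulary size and hidden dimension follow the implementation details reported later, while the GloVe initialization is only indicated in a comment. This is an illustration of the tensor shapes, not the exact implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the text-side feature extraction described above.
# Dimensions follow the text (300-d word embeddings, LSTM encoder);
# sequence lengths and turn counts below are illustrative.

vocab_size, d_emb, d_hid = 11322, 300, 512
embed = nn.Embedding(vocab_size, d_emb)          # would be initialized with GloVe
lstm = nn.LSTM(d_emb, d_hid, batch_first=True)

# Question: one padded sequence of lambda tokens -> (lambda, d_q) word states.
q_tokens = torch.randint(0, vocab_size, (1, 20))
q_feat, _ = lstm(embed(q_tokens))                # (1, 20, d_hid)

# History: encode each turn separately, keep the last hidden state per turn,
# then stack the t last states into the initial history representation.
turns = [torch.randint(0, vocab_size, (1, 40)) for _ in range(4)]
h_last = [lstm(embed(s))[1][0].squeeze(0) for s in turns]   # each (1, d_hid)
h_feat = torch.cat(h_last, dim=0)                # (t, d_hid)
print(q_feat.shape, h_feat.shape)
```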

Graph Attention
Given a target node $i$ and a neighboring node $j \in N(i)$ with a $k \times k$ adjacency matrix $R$, where $N(i)$ is the set of $k$ nodes neighboring node $i$, let the representations of node $i$ and node $j$ be $u_i$ and $u_j$, respectively. To obtain the correlation score $s_{ij}$ between nodes $i$ and $j$, self-attention (Vaswani et al., 2017) is performed on the vertices, which generates a relation score $s_{ij}$ between node features $u_i$ and $u_j$:

$$s_{ij} = (U u_i)^{\top} (V u_j), \qquad (1)$$

where $U$ and $V$ are trainable parameters. We apply a softmax function over the correlation scores $s_{ij}$ to obtain the weights $\alpha_{ij}$:

$$\alpha_{ij} = \mathrm{softmax}_{j}\big(s_{ij} + c_{lab(i,j)}\big), \qquad (2)$$

where $c_{lab(i,j)} = W_{lab} A_{ij}$ is a bias term, $lab(i,j)$ represents the label of each edge, and $W_{lab}$ is a learned parameter. The representations of the neighboring nodes $u_j$ are first transformed via a learned linear transformation $W_u$. The transformed representations are then gathered with the weights $\alpha_{ij}$, followed by a non-linear function $\sigma$. This propagation can be denoted as:

$$u^{*}_{i} = \sigma\Big(\sum_{j \in N(i)} \alpha_{ij} W_u u_j\Big). \qquad (3)$$

We utilize GraphAtt(·) to denote the operation defined by Eqs. (1)–(3).

Figure 3: The history is processed by coreference resolution. Boxes with the same color and the same number indicate coreference relations between different expressions of the same entity.
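A minimal PyTorch sketch of the GraphAtt(·) operator following Eqs. (1)–(3) is given below. The exact parameterization used in the model (e.g., scaling, number of heads) may differ; this should be read as an illustration rather than reference code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAtt(nn.Module):
    """Sketch of the relation-aware graph attention described above:
    bilinear-style scores, an edge-label bias, masked softmax over
    neighbors, then a non-linear aggregation of transformed neighbors."""

    def __init__(self, d, num_labels):
        super().__init__()
        self.proj_i = nn.Linear(d, d)                  # U in Eq. (1)
        self.proj_j = nn.Linear(d, d)                  # V in Eq. (1)
        self.proj_u = nn.Linear(d, d)                  # W_u in Eq. (3)
        self.label_bias = nn.Embedding(num_labels, 1)  # W_lab in Eq. (2)

    def forward(self, u, adj, labels):
        # u: (n, d) node features, adj: (n, n) 0/1 adjacency,
        # labels: (n, n) long tensor of edge-label ids.
        d = u.size(-1)
        s = self.proj_i(u) @ self.proj_j(u).T / math.sqrt(d)   # (n, n) scores
        s = s + self.label_bias(labels).squeeze(-1)            # add c_{lab(i,j)}
        s = s.masked_fill(adj == 0, float('-inf'))             # keep graph edges only
        alpha = F.softmax(s, dim=-1)
        return torch.relu(alpha @ self.proj_u(u))              # sigma(sum alpha_ij W_u u_j)

# Toy usage on a 5-node graph with self-loops.
n, d = 5, 512
adj = torch.eye(n, dtype=torch.long)
adj[0, 1] = adj[1, 0] = 1
labels = torch.zeros(n, n, dtype=torch.long)
out = GraphAtt(d, num_labels=12)(torch.randn(n, d), adj, labels)
print(out.shape)  # (5, 512)
```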

History Graph Construction
In practice, we observe that coreference relations exist in dialog history. To fully understand the coreference among dialog history, we utilize the coreference resolution tool of Lee et al. (2017) to identify coreference relations. We use the caption and questions to identify the relations instead of QA pairs because there are no ground-truth answers in the test split. As shown in Figure 3, we provide a four-turn dialog to show coreference resolution. Boxes with the same color and the same number indicate coreference relations. For example, the blue boxes with number 0 indicate that the expressions refer to "a man" with its attribute "in a suit and tie".
Pruned History Graph with Coreference Relations. We treat each history turn as a node. By analyzing the coreference relations in the history, we obtain the relations between history turns as shown in Figure 3. According to the coreference relations, we construct a sparse graph, as shown in the history graph of Figure 2.
History Graph Attention. Given a graph with $t$ nodes, i.e., a $t$-turn dialog, each turn representation in the history is a node. We represent the graph structure with a $t \times t$ adjacency matrix $A$, where $A_{ij} = 1$ if there is a coreference relation between node $i$ and node $j$, and $A_{ij} = 0$ otherwise.
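Building the coreference adjacency matrix $A$ can be sketched as follows; here a cluster is assumed to be given as the set of turn indices that mention the same entity, which simplifies the output format of the coreference resolver for illustration.

```python
import torch

# Sketch of building the t x t history adjacency matrix A from coreference
# clusters. A cluster is assumed to be the set of dialog turns whose
# sentences mention the same entity; the exact preprocessing may differ.

def history_adjacency(num_turns, clusters):
    A = torch.eye(num_turns)                 # keep self-loops
    for turns in clusters:
        for i in turns:
            for j in turns:
                A[i, j] = 1.0                # coreference relation between turns
    return A

# Toy example for a 5-turn dialog: one entity mentioned in turns 0, 1 and 4,
# another in turns 0 and 4 (indices are illustrative).
A = history_adjacency(5, clusters=[[0, 1, 4], [0, 4]])
print(A)
```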
The relation-aware graph-based history representation $h^*$ is computed as:

$$h^{*} = \mathrm{GraphAtt}(h, A).$$

Figure 4: The question is processed by dependency parsing. The word in pink is the root node. The direction of the green arrows indicates the dependency relation between two words, and the blue words (e.g., det, dobj) are relation types.

Question Graph Construction
In practice, we observe that two words in a sentence usually hold a certain relation. Such relations can be identified by the neural dependency parser of Dozat and Manning (2017).
Pruned Question Graph with Dependency Relations. We treat each word in a question as a vertex. By parsing the dependency relations of a question, we obtain the relations between words as shown in Figure 4. According to dependency relations, we obtain our sparse question graph, as shown in the question graph of Figure 2.
History-aware Question Graph Attention. Given a graph with $\lambda$ nodes, each word in the question is a node. We represent the graph structure with a $\lambda \times \lambda$ adjacency matrix $B$, where $B_{ij} = 1$ if there is a dependency relation between node $i$ and node $j$, and $B_{ij} = 0$ otherwise. In order to utilize the history to help understand the question, we use a history-aware attention mechanism to inject semantic information from the history into the question graph. The adaptive history representation is calculated as follows:

$$\beta_i = \mathrm{softmax}_{i}\big(W^{h}_{2} \tanh(W^{h}_{1} h^{*}_{i})\big), \qquad \hat{h} = \sum_{i=1}^{t} \beta_i\, h^{*}_{i},$$

where $W^{h}_{1}$ and $W^{h}_{2}$ are learned parameters. The history-aware question features are obtained by concatenating the adaptive history representation $\hat{h}$ with each of the question features $q_i$:

$$\hat{q}_i = [q_i, \hat{h}].$$

The history-aware and relation-aware graph-based question representation $q^*$ is computed as:

$$q^{*} = \mathrm{GraphAtt}(\hat{q}, B).$$

Image Graph Construction

Pruned Image Graph with Spatial Relations. By treating each object region in an image as a vertex, we can construct a fully-connected undirected graph, as shown in the image graph of Figure 2. Each edge represents a relation between two object regions. Spatial relations describe an object's position in the image, which corresponds to a 4-dimensional spatial coordinate $[x_1, y_1, x_2, y_2]$, where $(x_1, y_1)$ is the coordinate of the top-left point of the bounding box and $(x_2, y_2)$ is the coordinate of the bottom-right point. Following Yao et al. (2018), we classify spatial relations into 11 different categories, such as inside, cover, and overlap. We utilize the overlapping region between two object regions to judge whether there is an edge between them: if two object regions overlap, there is a strong correlation between the two objects; if two object regions are too far away from each other, there is no relation between them. According to the spatial relations, we prune irrelevant relations between objects and obtain a sparse graph, as shown in the image graph of Figure 2.
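The overlap-based pruning of the image graph can be sketched as follows; the 11-way classification of spatial relations (Yao et al., 2018) is omitted, and only the overlap test that decides whether an edge is kept is shown.

```python
import torch

# Sketch of pruning the image graph with spatial relations: two regions
# are connected only if their bounding boxes overlap.

def boxes_overlap(a, b):
    # a, b: [x1, y1, x2, y2]
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    return ix2 > ix1 and iy2 > iy1

def spatial_adjacency(boxes):
    mu = len(boxes)
    D = torch.eye(mu)
    for i in range(mu):
        for j in range(i + 1, mu):
            if boxes_overlap(boxes[i], boxes[j]):
                D[i, j] = D[j, i] = 1.0
    return D

# Toy example: the first two boxes overlap, the third is isolated.
boxes = [[0, 0, 50, 50], [40, 40, 90, 90], [200, 200, 250, 250]]
print(spatial_adjacency(boxes))
```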
Question-aware Image Graph Attention. Given a graph with $\mu$ nodes, each object in the image is a node. We represent the graph structure with a $\mu \times \mu$ adjacency matrix $D$, where $D_{ij} = 1$ if there is a spatial relation between node $i$ and node $j$, and $D_{ij} = 0$ otherwise. Based on the fully represented question, we use a question-aware attention mechanism to inject semantic information from the question into the image graph. The adaptive question representation is calculated as follows:

$$\gamma_i = \mathrm{softmax}_{i}\big(W^{q}_{2} \tanh(W^{q}_{1} q^{*}_{i})\big), \qquad \hat{q} = \sum_{i=1}^{\lambda} \gamma_i\, q^{*}_{i},$$

where $W^{q}_{1}$ and $W^{q}_{2}$ are learned parameters. The question-aware image features are obtained by concatenating the adaptive question representation $\hat{q}$ with each of the $\mu$ image features $v_i$:

$$\hat{v}_i = [v_i, \hat{q}].$$

The question-aware and relation-aware graph-based image representation $v^*$ is computed as:

$$v^{*} = \mathrm{GraphAtt}(\hat{v}, D).$$
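The question-aware image graph attention can be sketched as follows; the pooling with $W^q_1$ and $W^q_2$ and the dimensionality-reducing projection after concatenation are illustrative choices, and the graph attention operator is passed in as a callable (e.g., the GraphAtt sketch above).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the question-aware image graph attention: pool the question
# into one adaptive vector, concatenate it with every region feature,
# then apply a graph attention operator over the pruned image graph.

class QuestionAwareImageGraph(nn.Module):
    def __init__(self, d, graph_att):
        super().__init__()
        self.w1 = nn.Linear(d, d)
        self.w2 = nn.Linear(d, 1)
        self.reduce = nn.Linear(2 * d, d)   # back to d after concatenation
        self.graph_att = graph_att          # e.g. the GraphAtt sketch above

    def forward(self, v, q_star, adj, labels):
        # v: (mu, d) regions, q_star: (lam, d) question node features
        gamma = F.softmax(self.w2(torch.tanh(self.w1(q_star))), dim=0)  # (lam, 1)
        q_hat = (gamma * q_star).sum(dim=0, keepdim=True)               # (1, d)
        v_hat = torch.cat([v, q_hat.expand(v.size(0), -1)], dim=-1)     # (mu, 2d)
        return self.graph_att(self.reduce(v_hat), adj, labels)

# Toy usage with a trivial graph-attention stand-in.
mu, lam, d = 6, 8, 512
module = QuestionAwareImageGraph(d, graph_att=lambda x, adj, labels: torch.relu(x))
out = module(torch.randn(mu, d), torch.randn(lam, d),
             torch.eye(mu, dtype=torch.long), torch.zeros(mu, mu, dtype=torch.long))
print(out.shape)  # (6, 512)
```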

Multi-modal Fusion
After obtaining the relation-aware representations, we fuse the question representation $q^*$, the history representation $h^*$, and the visual representation $v^*$ through a multi-modal fusion strategy. We can use any existing visual dialog model to learn a joint representation $\mathcal{J}$:

$$\mathcal{J} = \mathcal{F}(q^{*}, h^{*}, v^{*}; \Theta),$$

where $\mathcal{F}$ is a visual dialog model and $\Theta$ are the trainable parameters of the fusion module.

Implementation Details

Each object region is represented with features from a Faster R-CNN pretrained on Visual Genome (Krishna et al., 2017), thus obtaining a 2048-dimensional feature vector for each region. Following (Nguyen et al., 2020), we detect K = 100 objects from each image. For the question and history features, we first build a vocabulary composed of the 11,322 words that appear at least five times in the training split. The captions, questions, and answers are truncated or padded to 40, 20, and 20 words, respectively. We employ multi-head attention with 4 heads for all three graph attention networks. The dimension of hidden features is set to 512.
Our model is implemented based on PyTorch (Paszke et al., 2017). In experiments, we use the Adam (Kingma and Ba, 2014) optimizer for training, with a mini-batch size of 32. For the learning rate, we employ the warm-up strategy (Goyal et al., 2017).

Human Evaluation. We randomly extract 100 samples for human evaluation (Wu et al., 2018) and then ask 3 human subjects to guess whether the last response in the dialog is human-generated or machine-generated. If at least 2 of them agree that it was generated by a human, we consider that it passes the Turing Test (M1). We also record the percentage of responses that are evaluated as better than or equal to human responses (M2), according to the human subjects' evaluation.

Main Results
Baseline methods. In our experiments, the compared methods can be grouped into four types: (1) fusion-based models; (2) attention-based models; (3) graph-based models; and (4) pretraining-based models. GoG denotes our relation-aware graph-over-graph network.
We use the strong model LTMI (Nguyen et al., 2020) as our multi-modal fusion module. LTMI is a very strong model that achieves state-of-the-art results on several metrics. "Multi" indicates that the model uses multi-task learning at training but utilizes the generative or discriminative decoder at inference, respectively. "Multi*" indicates that the model uses multi-task learning and utilizes the discriminative decoder to improve the generative decoder. In general, our model outperforms the strong baseline by a significant margin. We use a t-test to compare our model and LTMI (Nguyen et al., 2020). The p-value is less than 0.01, indicating that the results are significantly different.
Generative Results. As shown in the right half of Table 1, we compare generative performance on the val v1.0 split. Our method improves significantly (about 1% on all metrics) over the strong baseline LTMI (Nguyen et al., 2020) and outperforms all the compared methods on all metrics by large margins, which demonstrates the effectiveness of GoG.

Discriminative Results. As shown in the left half of Table 2, we compare discriminative performance on the val v0.9 split. Our method improves considerably over LTMI (Nguyen et al., 2020). As shown in Table 3, our approach outperforms VDBERT (Wang et al., 2020) ‡, which is trained from scratch without extra datasets. All these comparisons show that our approach is effective due to explicit relation modeling.

Ablation Study
As shown in Table 4, we first remove the I-Graph, Q-Graph, and H-Graph to validate the effect of each graph, respectively. Second, we validate the importance of the concatenation operation. Finally, we replace the relations in each graph with full connections to validate the importance of each relation. First, the comparison between line 0 and lines 1/2/3 shows that all three graphs are crucial for visual dialog, leading to higher performance, and that the I-Graph is the most important. Second, the comparison between line 0 and lines 4/5 shows that adding the adaptive features gives a gain of approximately +0.2. Third, the comparison between line 1 and lines 6/7/8 shows that building graphs with pruned relations yields larger gains than simply using fully-connected graphs. The spatial relation contributes the most, because fully connecting 100 objects in an image introduces substantial noise.

Human Study
As shown in Table 5, we conduct a human study to further demonstrate the effectiveness of our model. Our model achieves the highest scores on both the M1 and M2 metrics compared with the previous model, LTMI (Nguyen et al., 2020). These results show that our model can generate more contextually and visually coherent responses.

Qualitative Results
As shown in Figure 5, we visualize the learned attention maps. For the image, colored regions indicate higher attention weights, and we draw the bounding boxes with the three highest scores. For the question, darker words have higher attention weights. For the dialog history, darker QA pairs have a higher coreference score with the question. Figure 6 shows examples of dialogs generated and retrieved by our model and the LTMI baseline.

Related Work

Several graph-based models have been proposed for visual dialog. Zheng et al. (2019) treat dialog history question-answer (QA) pairs as observed nodes, and the current answer is deemed an unobserved node inferred using EM algorithms (Moon, 1996) over the textual contexts. FGA (Schwartz et al., 2019) realizes a factor graph attention mechanism, which constructs a graph over all the multi-modal features and estimates their interactions. DualVD (Jiang et al., 2020b) constructs a scene graph to represent the image, embedding both the relationships provided by (Zhang et al., 2019b) and the original object detection features (Anderson et al., 2018). CAG (Guo et al., 2020) focuses on an iterative question-conditioned context-aware graph, including both fine-grained visual-object and textual-history semantics. In this paper, we model explicit complex relations within and among the visual content, the dialog history, and the current question, and design a graph-over-graph structure, which differs from the graph-based models mentioned above.

Graph Neural Network
Graph neural networks (Kipf and Welling, 2016; Veličković et al., 2017; Xinyi and Chen, 2018; Zhang et al., 2019a) have attracted attention in various tasks (Wang et al., 2019; Liu et al., 2018; Gu et al., 2019). The core idea is to combine graphical structural representations with neural networks, which is suitable for reasoning-style tasks. For visual question answering, Teney et al. (2017) propose the first GNN-based approach, which builds a scene graph of the image, parses the sentence structure of the question, and calculates their similarity weights. Li et al. (2019) propose to encode each image into a graph and model multi-type inter-object relations via a graph attention mechanism, such as spatial, semantic, and implicit relations. Huang et al. (2020) propose a novel dual-channel graph convolutional network to better combine visual and textual advantages. These approaches are limited to building independent graphs. The coreference among dialog history and the relations between graphs are not explored in the approaches mentioned above.

Conclusion
In this paper, we present a relation-aware graph-over-graph network (GoG), a novel framework for visual dialog, which models and reasons over the explicit complex relations among visual content, dialog history, and the current question. GoG exploits the graph-over-graph structure to obtain three relation-aware multi-modal representations that can be added to prior visual dialog models. Experimental results on two large-scale datasets show that our approach improves the previous models by a significant margin.

Figure 6: Examples of dialogs generated and retrieved by our model and the LTMI baseline. Our model provides answers that are more accurate than LTMI (green denotes correct answers, and red denotes wrong answers). Results from our model are also more natural and comprehensive (highlighted in blue).

A Relation-aware Graph-over-Graph Network

A.1 Question Graph Construction
In practice, we observe that two words in a sentence usually hold certain relations. Such relations can be identified by the neural dependency parser of Dozat and Manning (2017). As shown in Table 6, we list some commonly used dependency relations.

A.2 Attention Alignment Module
After obtaining the relation-aware features, we fuse the question representation $q^*$, the history representation $h^*$, and the visual representation $v^*$ through a multi-modal fusion strategy. We can use any existing multi-modal fusion method to learn a joint representation $\mathcal{J}$:

$$\mathcal{J} = \mathcal{F}(q^{*}, h^{*}, v^{*}; \Theta),$$

where $\mathcal{F}$ is a multi-modal fusion method and $\Theta$ are the trainable parameters of the fusion module.
Here we utilize the efficient attention mechanism of Nguyen et al. (2020), the state-of-the-art model in visual dialog, to fuse the multi-modal information. Let $\mathcal{A}_X(Y)$ denote the efficient attention (Nguyen et al., 2020) from information $X$ to information $Y$; for example, $\mathcal{A}_{v^*}(v^*)$ denotes efficient self-attention. The fused visual representation is obtained by aggregating such attended features with the learned parameters $W_{v^*}$, $W_{V_1}$, and $W_{V_2}$; the fused question and history representations are obtained similarly. The joint representation $\mathcal{J}$ is then obtained from the three fused representations with a learned parameter $W_J$.

A.3 Generative and Discriminative Decoders
Following Das et al. (2017), we consider both generative and discriminative decoders, which score the candidate answers using log-likelihood scores and likelihood scores, respectively.
Generative Decoder. Following Das et al. (2017), we design the generative decoder to score the candidate answers using log-likelihood scores. Specifically, the generative decoder utilizes a two-layer LSTM (Hochreiter and Schmidhuber, 1997) to generate an answer using the context vector $\mathcal{J}$ as the initial hidden state. In the training phase, the generative decoder generates the next token based on the current token from the ground-truth answer. In detail, we first append the special token "SOS" at the beginning of the ground-truth answer and "EOS" at the end. We use GloVe (Pennington et al., 2014) to initialize the embeddings and obtain the embedding vectors $a_{gt} = [w_0, w_1, \ldots, w_N]$, where $w_0$ is the embedding of "SOS" and $w_N$ is the embedding of "EOS". The hidden state $h_n$ at timestep $n$ is computed as $h_n = \mathrm{LSTM}(w_{n-1}, h_{n-1})$, where $h_0$ is initialized by $\mathcal{J}$. We then obtain the log-likelihood of the $n$-th word as:

$$\log p(w_n) = \mathrm{logsoftmax}(W h_n + b),$$

where $W$ and $b$ are learned parameters. In the training phase, we minimize the sum of the negative log-likelihoods, $\mathcal{L}_G$, defined by:

$$\mathcal{L}_G = -\sum_{n=1}^{N} \log p(w_n).$$

In the validation and test phases, we compute the sum $s_i$ of the log-likelihoods for each candidate answer $\hat{a}_i$:

$$s_i = \sum_{n=1}^{N_i} \log p(\hat{w}^{i}_{n}).$$

The rankings of the candidate answers are then derived as $\mathrm{softmax}(s_1, \ldots, s_{100})$.

Table 7: Main comparisons on both VisDial v0.9 and v1.0 datasets using the generative decoder. † denotes that we re-implemented the model. Underline indicates the highest performance among previous approaches except pretraining-based models.
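Candidate scoring with the generative decoder can be sketched as follows; a single-layer LSTMCell and random candidate tokens are used for illustration, whereas the model described above uses a two-layer LSTM over the actual answer candidates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of how the generative decoder scores candidate answers at test
# time: sum the per-token log-likelihoods of each candidate under the
# LSTM language model conditioned on J, then rank candidates by a softmax
# over the summed scores. Sizes are illustrative.

vocab, d = 11322, 512
lstm = nn.LSTMCell(300, d)
embed = nn.Embedding(vocab, 300)
out_proj = nn.Linear(d, vocab)            # W, b in the text

def score_candidate(tokens, J):
    h, c = J, torch.zeros_like(J)         # h_0 initialized by the context vector J
    s = 0.0
    for n in range(1, len(tokens)):       # predict token n from token n-1
        h, c = lstm(embed(tokens[n - 1]).unsqueeze(0), (h, c))
        logp = F.log_softmax(out_proj(h), dim=-1)
        s = s + logp[0, tokens[n]]
    return s

J = torch.randn(1, d)
candidates = [torch.randint(0, vocab, (6,)) for _ in range(100)]
scores = torch.stack([score_candidate(a, J) for a in candidates])
ranking = F.softmax(scores, dim=0)        # softmax(s_1, ..., s_100)
print(ranking.shape)
```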
Discriminative Decoder. A discriminative decoder outputs a likelihood score for each of the 100 candidate answers to the current question. Similar to the generative decoder, we use an LSTM to obtain the hidden state $h_n$ for the $n$-th word, but we do not use the context vector $\mathcal{J}$ to initialize $h_0$. The representation of each candidate answer is $a_i = h_N$. The score $p_i$ for the $i$-th candidate answer is computed by:

$$p = \mathrm{logsoftmax}(a_1^{\top}\mathcal{J}, \ldots, a_{100}^{\top}\mathcal{J}).$$

In the test phase, we sort the candidate answers using these scores. In the training phase, we minimize the cross-entropy loss $\mathcal{L}_D$ over the ground-truth answer.
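The discriminative scoring and its cross-entropy loss can be sketched as follows; the candidate encodings are random stand-ins for the LSTM-encoded answers $a_i$.

```python
import torch
import torch.nn.functional as F

# Sketch of the discriminative decoder: each candidate encoding a_i is
# scored by a dot product with the joint representation J, trained with a
# cross-entropy loss over the 100 candidates, and sorted at test time.

d = 512
J = torch.randn(d)
a = torch.randn(100, d)                       # candidate encodings a_1..a_100
logp = F.log_softmax(a @ J, dim=0)            # logsoftmax(a_1^T J, ..., a_100^T J)
gt_index = torch.tensor(3)                    # index of the ground-truth answer
loss_D = F.nll_loss(logp.unsqueeze(0), gt_index.unsqueeze(0))
ranking = logp.argsort(descending=True)       # rank candidates at test time
print(loss_D.item(), ranking[:5])
```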

A.4 Multi-Task Learning
Following (Nguyen et al., 2020), we apply our GoG to the state-of-the-art model (Nguyen et al., 2020) in a multi-task learning setting, in which accuracy is improved by training the entire network using the two decoders simultaneously. This is done simply by minimizing the sum of the losses, $\mathcal{L}_D$ for the discriminative decoder and $\mathcal{L}_G$ for the generative decoder:

$$\mathcal{L} = \mathcal{L}_D + \mathcal{L}_G.$$

The increase in performance may be attributable to the synergy of learning two tasks while sharing the same encoder.

B.1 Main Results
Comparison with previous approaches using generative decoders. As shown in Table 7, we provide the full comparison with all the previous generative approaches.

B.2 Qualitative Results
More examples of responses generated and retrieved by our GoG are provided in Figure 7. Due to space limitations, we provide only one additional example.