Learning to Ground Visual Objects for Visual Dialog

Visual dialog is challenging since it needs to answer a series of coherent questions based on understanding the visual environment. How to ground related visual objects is one of the key problems. Previous studies utilize the question and history to attend to the image and achieve satisfactory performance; however, these methods are not sufficient to locate related visual objects without any guidance. The inappropriate grounding of visual objects limits the performance of visual dialog models. In this paper, we propose a novel approach to Learn to Ground visual objects for visual dialog, which employs a novel visual object grounding mechanism where both prior and posterior distributions over visual objects are used to facilitate grounding. Specifically, a posterior distribution over visual objects is inferred from both context (history and questions) and answers, and it ensures the appropriate grounding of visual objects during the training process. Meanwhile, a prior distribution, which is inferred from context only, is used to approximate the posterior distribution so that appropriate visual objects can be grounded even without answers during the inference process. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate that our approach improves previous strong models in both generative and discriminative settings by a significant margin.


Introduction
With the development of deep learning, various vision-language tasks have been introduced and have attracted widespread attention, such as image captioning (Anderson et al., 2016, 2018; Cornia et al., 2020; Ghanimifard and Dobnik, 2019), visual question answering (Ren et al., 2015a; Gao et al., 2015; Lu et al., 2016; Anderson et al., 2018; Li et al., 2019) and visual dialog (Das et al., 2017; Chen et al., 2021a; Agarwal et al., 2020; Chen et al., 2021c; Qi et al., 2020).

Figure 1: Comparison between different responses when focusing on different visual objects. We see that when the model focuses on wrong visual objects, it makes mistakes. (Only the response A3-2 is right.)

Specifically, visual dialog, which aims to hold a meaningful conversation with a human about a given image (Chen et al., 2021e, 2020b), is a challenging task that requires models to locate related visual objects in an image and answer the current question based on the history and the located visual objects. In order to answer the question correctly, we need to accurately locate the question-related visual objects. Most existing methods utilize various attention mechanisms (Lu et al., 2017; Wu et al., 2018; Kottur et al., 2018; Guo et al., 2019b) to capture the target visual objects. ReDAN and DMAM (Chen et al., 2020a) use multi-step reasoning based on dual attention to iteratively update related visual objects. DAN (Guo et al., 2019b), MCAN (Agarwal et al., 2020) and LTMI (Nguyen et al., 2020) utilize multi-head attention mechanisms to manage multimodal interaction and obtain weight distributions. Moreover, some approaches (Zheng et al., 2019; Schwartz et al., 2019; Jiang et al., 2020b; Guo et al., 2020; Jiang et al., 2020a) use graph-based structures to capture related visual objects. FGA (Schwartz et al., 2019) realizes a factor graph attention mechanism, which constructs a graph over all the multi-modal features and estimates their interactions to ground visual objects.
CAG (Guo et al., 2020) focuses on an iterative question-conditioned context-aware graph to locate related visual objects. However, all the methods mentioned above obtain a prior distribution over visual objects through various interactions of questions, history and images, and then use this prior distribution to obtain the final representation of the image. The prior distribution alone is not sufficient to ground the correct visual objects, and thus can yield a wrong representation of the image.
In this paper, we propose a method to learn to ground visual objects in visual dialog. Specifically, we obtain a posterior distribution over visual objects by utilizing both contexts and answers, while a prior distribution works without knowing answers in advance. During the training process, our model minimizes the KL divergence between the prior distribution and the posterior distribution so that it learns to approximate the posterior distribution accurately using the prior distribution. During the inference process, the model then grounds visual objects merely based on the prior distribution (i.e., without any posterior information). We show that through this process, the model can effectively learn to ground visual objects accurately and give informative, accurate responses by utilizing appropriate visual objects. We test the effectiveness of our proposed model on two large-scale datasets: VisDial v0.9 and v1.0 (Das et al., 2017). The contributions of this work are summarized as follows:
• We explore the importance of answers in grounding visual objects related to questions in visual dialog.
• We propose a novel approach to realize learning to ground visual objects in visual dialog via bridging the gap between the prior and posterior distribution over visual objects.
• We conduct extensive experiments and ablation studies on two large-scale datasets, VisDial v0.9 and v1.0. Experimental results show that our approach can be used to improve previous visual dialog models in both generative and discriminative settings.

Methodology
According to Das et al. (2017), a visual dialog agent takes as inputs an image i, the dialog history h = {h_0, h_1, . . . , h_{t−1}} up to round t − 1, and the current question q_t at round t. Note that h_0 is the caption Cap describing the image, and h_1, . . . , h_{t−1} are concatenations of question-answer pairs. The visual dialog agent aims to predict an answer a_t to the question q_t.
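To make the input format concrete, a single training example at round t can be represented as follows. The field names here are purely illustrative (the VisDial datasets use their own JSON schema):

```python
# One visual-dialog example at round t (field names are hypothetical;
# the actual VisDial files use a different schema).
example = {
    "image": "image.jpg",                        # the image i
    "history": [
        "a man playing tennis on a court",       # h_0: the caption Cap
        "Q: is the photo in color ? A: yes",     # h_1 ... h_{t-1}: QA pairs
    ],
    "question": "is he standing ?",              # the current question q_t
    "answer": "yes he is",                       # a_t, available only in training
}
```

At inference time the `answer` field is withheld, which is exactly why the prior module must operate without it.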

Model Architecture
In this paper, we focus on training a neural visual dialog model with an effective visual object grounding mechanism. As shown in Figure 2, we simplify existing visual dialog models into five major components:
• The context encoder encodes the dialog history h and the current question q_t into a context vector x with an attention mechanism, and feeds it into the visual object manager and the decoder.
• The visual encoder takes the image i as input and extracts the image features v = {v_1, v_2, . . . , v_µ}, where µ denotes the number of object proposals for each image. Each object proposal is represented by a d_v-dimensional feature vector.
• The answer encoder encodes the ground-truth answer a_t into a response vector y, and feeds it into the visual object manager.
• The visual object manager consists of two sub-modules: a prior module and a posterior module. Given the previously encoded x and {v_i}_{i=1}^{µ} (and y if available), the visual object manager is responsible for deciding an appropriate distribution over visual objects, and feeds the weighted visual object feature v* (together with the attention-based context vector x) into the decoder.
• The decoder generates or retrieves responses based on the weighted visual object feature v* and the attention-based context vector x.
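The train/inference asymmetry of the visual object manager can be sketched as follows. This is a minimal illustration, with f_cv instantiated as a dot product and the context-answer fusion as vector addition, not the model's actual implementation:

```python
import math

def _softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def ground_objects(x, objects, y=None):
    """Return the weighted visual object feature v*.

    Training (y given): the query fuses context and answer (posterior case).
    Inference (y is None): the query is the context alone (prior case).
    """
    query = [a + b for a, b in zip(x, y)] if y is not None else x
    g = _softmax([_dot(query, v) for v in objects])
    dim = len(objects[0])
    # v* = sum_i g_i * v_i
    return [sum(w * v[j] for w, v in zip(g, objects)) for j in range(dim)]
```

The same scoring machinery serves both modules; only the query changes depending on whether the answer is available.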

Our Approach
Given the context vector x, the visual object features v = {v_1, v_2, . . . , v_µ}, and the response vector y, the goal of the visual object manager is to decide an appropriate distribution D over visual objects and to obtain the weighted visual object representation v* based on D.
The visual object manager consists of two submodules: a prior module and a posterior module.
The Prior Module. The prior module calculates a conditional probability distribution over the µ visual objects, denoted by p(v|x):

p(v_i|x) = exp(f_cv(x, v_i)) / Σ_{j=1}^{µ} exp(f_cv(x, v_j)),   (1)

where f_cv(·, ·) denotes the interaction function of the context vector x and the visual object feature v_i. For example, f_cv(·, ·) can be the dot product, self-attention or another mechanism that measures the association between v_i and the context vector x. A high association means that v_i is relevant to x, and thus v_i receives a larger weight. Note that p(v|x) is conditioned only on x; it is therefore a prior distribution over visual objects, since it works without knowing the response. However, different visual objects can all be relevant to the context, and thus it is difficult to select visual objects based on the prior distribution alone during training.
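As a concrete illustration, with f_cv instantiated as a plain dot product (one of the options mentioned above), the prior distribution p(v|x) can be sketched as:

```python
import math

def prior_distribution(x, objects):
    # p(v_i|x) is the softmax of f_cv(x, v_i) over all objects;
    # here f_cv is instantiated as a dot product for illustration.
    scores = [sum(a * b for a, b in zip(x, v)) for v in objects]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```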
The Posterior Module. Motivated by this, in the posterior module we define a posterior distribution over visual objects, denoted by p(v|x, y), by considering both contexts and responses:

p(v_i|x, y) = exp(f_cv(f_cy(x, y), v_i)) / Σ_{j=1}^{µ} exp(f_cv(f_cy(x, y), v_j)),   (2)

where f_cy(·, ·) denotes the interaction function of x and y. For example, f_cy(·, ·) can be an add operation, a fully connected layer or other methods. Compared with the prior distribution, the posterior distribution is sharper, since the actual visual objects referred to by the true response a_t can be captured.
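An illustrative sketch of the posterior distribution, assuming f_cy is the add operation and f_cv a dot product (both are examples from the text, not the model's exact choices):

```python
import math

def posterior_distribution(x, y, objects):
    # p(v_i|x,y) is the softmax of f_cv(f_cy(x, y), v_i) over all objects;
    # f_cy is instantiated here as element-wise addition, f_cv as a dot product.
    fused = [a + b for a, b in zip(x, y)]
    scores = [sum(a * b for a, b in zip(fused, v)) for v in objects]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```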
Bridging the Gap. Clearly, the discrepancy between the prior and posterior distributions introduces great challenges in training the model: it is desirable to ground visual objects based on the posterior distribution, which, however, is unknown during inference. In this paper, we propose to approximate the posterior distribution with the prior distribution, so that our model is capable of selecting appropriate visual objects even without posterior information. For this purpose, we introduce an auxiliary loss, the Kullback-Leibler divergence loss (KLDivLoss), to bridge the gap between the prior distribution and the posterior distribution:

L_KL = Σ_{i=1}^{µ} p(v_i|x, y) log( p(v_i|x, y) / p(v_i|x) ).   (3)

When minimizing L_KL, the posterior distribution p(v|x, y) can be regarded as the label, and the model is instructed to approximate it accurately with the prior distribution p(v|x). As a consequence, even though the posterior distribution is unknown during inference (since the actual response a_t is unknown), the prior distribution p(v|x) can be effectively utilized to ground appropriate visual objects and thus to generate and retrieve proper responses. To the best of our knowledge, this is the first neural visual dialog model that incorporates the posterior distribution as guidance, enabling accurate visual object grounding and high-quality response generation and retrieval.
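A minimal sketch of this divergence between the two distributions, treating both as plain probability vectors:

```python
import math

def kl_divergence(posterior, prior, eps=1e-12):
    # KL(p(v|x,y) || p(v|x)) = sum_i post_i * log(post_i / prior_i);
    # eps guards against log(0) for near-zero probability entries.
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(posterior, prior))
```

The loss is zero when the prior matches the posterior exactly and grows as the prior drifts away, which is what drives the prior module toward answer-aware grounding during training.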

Application of Our Approach
We take the strong baseline LTMI (Nguyen et al., 2020) as the base model to introduce our approach, which mainly consists of the following components:

Context Encoder and Answer Encoder: We use two bi-directional LSTM encoders to extract token-level representations Q ∈ R^{λ×d_q} and y ∈ R^{λ×d_q} of the question q_t and the answer a_t, and another bi-directional LSTM encoder to extract sentence-level representations H ∈ R^{T×d_q} of the history h, where λ is the padded length of questions and answers, T is the number of dialog turns, and d_q is the feature dimension. Q and H are fused into a context representation x with multi-head attention (Vaswani et al., 2017).
Visual Encoder: Similar to Anderson et al. (2018), we extract image features using a pretrained Faster R-CNN (Ren et al., 2015b). We select µ object proposals for each image, where each object proposal is represented by a 2048-dimensional feature vector. We transform the obtained visual region features with a multi-layer perceptron and obtain the image features I = {I_i}_{i=1}^{µ} ∈ R^{µ×d_q}.

Prior Module: We use multi-head attention (Vaswani et al., 2017) as f_cv(·, ·) to manage the multi-modal interaction. A cross-attention layer is first applied to the outputs of the textual and visual encoders to obtain an attended image representation Î (Eq. (5)), where the softmax conducts the normalization over each column of the attention matrix. We then convert the representation Î into d_q-dimensional vectors V, and compute the attention weights over them with a simple self-attention computation conditioned on the context vector x (Eqs. (6)-(8)); the resulting weight vector g is regarded as the prior distribution over visual objects.
Posterior Module: We simply utilize the add operation as f_cy(·, ·) to manage the interaction of x and y, i.e., x_y = x + y. We then replace x in Eqs. (6)-(8) with x_y, and thus obtain the posterior distribution G and the posterior visual feature v_post.

Generative and Discriminative Decoder: Following previous studies (Das et al., 2017; Nguyen et al., 2020), we utilize LSTM-based discriminative and generative decoders. Receiving the representations of the context, the image and the candidate answers, the two decoders compute the score of each candidate answer in different ways. The objective of the base model is to minimize the negative log-likelihood L_G of the generated answer for the generative decoder, or the cross-entropy loss L_D for the discriminative decoder. On top of this, we utilize the Kullback-Leibler (KL) divergence loss of Eq. (3) to narrow the gap between the prior and posterior distributions, and train the student with the combined objectives.

Human Evaluation. Following Wu et al. (2018) and Chen et al. (2021b), we randomly extract 100 predicted samples for human evaluation. We ask 3 human subjects to guess whether the last response in the dialog is human-generated or machine-generated. If at least 2 of them agree that it is generated by a human, we consider it to pass the Turing Test (M1). We also record the percentage of responses that are evaluated as better than or equal to human responses (M2), according to the human subjects' evaluation.

Main Results
Baseline methods. In our experiments, the compared methods can be grouped into four types: (1) Fusion-based models: LF (Das et al., 2017) and HREA (Das et al., 2017).
(2) Attention-based models. (3) Graph-based models, e.g., KBGN (Jiang et al., 2020a). (4) Pretraining-based models. We implement our model LTMI-LG on top of the strong baseline LTMI (Nguyen et al., 2020) 1 , which achieves state-of-the-art results on several metrics. In general, our approach brings a large improvement to the strong baseline LTMI, which shows the effectiveness of our answer-aware knowledge distillation. We use a t-test and analysis of variance (ANOVA) to compare our model with LTMI; the p-values of both analyses are less than 0.01, indicating that the results are significantly different.
As shown in Table 3, we report the accuracy of grounding visual objects: our answer-aware knowledge distillation improves the accuracy from 68.6% (LTMI) to 82.1% (LTMI-LG), a 13.5% absolute gain.

1 We reproduce the results for LTMI with their official GitHub repo (https://github.com/davidnvq/visdial), using the same default hyper-parameters.

Table 3: Accuracy of visual grounding with and without knowing the answer. We randomly sample 1000 samples and ask human annotators to ground the three most likely objects from the image.

As shown in Figure 5, we provide answers predicted by LTMI and by our LTMI-LG. Due to the improvement in visual grounding, our approach improves the generative and retrieval results of LTMI and locates visual objects more accurately, as shown in Table 4. In Table 4, "Mean" denotes that we set the distribution over visual objects to uniform so that all objects have the same weight; "Random" denotes that we randomize the distribution in-batch; "Human" denotes that we annotate 100 images and use the annotated distribution to generate responses. The more appropriate the grounded visual objects, the better the model performance.
Generative Results. As shown in Table 1, we compare the generative performance of different methods on the VisDial v1.0 val and VisDial v0.9 val sets. With the guidance of the teacher, our LTMI-LG is trained with the ability to accurately ground visual objects. As a result, our approach improves significantly (by nearly 1% on all metrics) over LTMI (Nguyen et al., 2020) on all metrics except Mean. We believe data is an important factor in deep learning (LeCun et al., 2015): VDBERT works because it uses a large amount of extra data for training, whereas our method is effective because the teacher teaches the student visual grounding, which can be regarded as a kind of data annotation.
Discriminative Results. As shown in Table 2, our approach improves over LTMI on the VisDial v1.0 test. Our approach also brings a large improvement on VisDial v0.9 and achieves the best results on MRR, R@1 and R@5 among non-pre-trained models. In the discriminative setting, our approach performs worse than the pre-training models VisualBERT (Murahari et al., 2020) and VDBERT, but these models utilize extra large-scale datasets for training, which makes a direct comparison with other models unfair. As shown in Table 5, VDBERT ‡ , which is trained from scratch, performs worse than our LTMI-LG.

Ablation Study
In order to transfer knowledge, we need a metric loss to measure the gap between the teacher and the student. In our main experiments, we utilize the Kullback-Leibler (KL) divergence loss to diminish the gap between the weight distributions of the student model and the teacher model, and the mean squared error (MSE) loss to diminish the gap between their image representations. To compare different losses, we also try the MSE loss for the attention maps and the KL loss for the image representations, as shown in Table 6. We find that the KL loss is more suitable for the attention distribution (better NDCG, MRR and R@1) and the MSE loss for the representation (better R@5, R@10 and Mean). In addition, we distill both the attention via the KL loss and the representation via the MSE loss at the same time; the result is not satisfactory, and we conjecture that the two signals are partially redundant.
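The two metric losses compared in this ablation can be sketched as follows. This is a minimal illustration on plain vectors; the actual losses operate on attention maps and image representations inside the network:

```python
import math

def mse_loss(a, b):
    # Mean squared error between two representation vectors.
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def kl_loss(p, q, eps=1e-12):
    # KL divergence between two weight distributions (teacher p, student q).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

The intuition matches the ablation result: KL compares normalized distributions, so it suits attention weights, while MSE compares raw coordinates, so it suits unnormalized feature vectors.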

Case Study
As shown in Figure 4, we visualize the learned attention maps to understand the model; colored regions indicate higher attention weights, and we draw the bounding boxes of the three highest-scoring objects. In the top image of Figure 4, the question "is he standing ?" concerns the man's overall posture rather than a local region; LTMI grounds the wrong visual objects while our model grounds the right ones. In the bottom image of Figure 4, the question "is his tennis racket up in the air swinging ?" concerns the racket rather than the tennis balls; our model grounds it accurately while LTMI makes mistakes. These examples show that our LTMI-LG has learned to ground visual objects via our answer-aware knowledge distillation.

Human Study
As shown in Table 7, we conduct a human study to further demonstrate the effectiveness of our model. Our model achieves the highest scores on both the M1 and M2 metrics compared with LTMI.

Related Work
Several recent works (Shuster et al., 2018; Liang et al., 2021; Yang et al., 2020) explore leveraging visual information to enhance dialogue models. Visual dialog models, in contrast, focus on the interaction of questions, history and images, where locating the related visual objects is quite important. MN (Das et al., 2017), HCIAE (Lu et al., 2017), CorefNMN (Kottur et al., 2018), CoAtt (Wu et al., 2018), RvA (Niu et al., 2019) and DVAN (Guo et al., 2019b) utilize various attention mechanisms as the backbone to locate the related visual objects. VisualBERT (Murahari et al., 2020) and VDBERT exploit large extra datasets for visual dialog via pretrained language models. GNN-EM (Zheng et al., 2019), FGA (Schwartz et al., 2019), DualVD (Jiang et al., 2020b), CAG (Guo et al., 2020) and KBGN (Jiang et al., 2020a) utilize graph neural networks to obtain the representation of visual objects. However, most existing visual dialog models condition visual objects simply on the history and questions, which we regard as a prior distribution over visual objects.
In this paper, we propose an approach to learn to ground visual objects via bridging the gap between the prior distribution and the posterior distribution.

Conclusion
In this paper, we propose a novel approach to learn to ground visual objects for visual dialog, which employs a novel visual object grounding mechanism where both prior and posterior distributions over visual objects are used to facilitate grounding. Experimental results on two large-scale datasets show that our approach improves previous models by a significant margin.