Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation

Visual dialogue is a challenging task since it requires answering a series of coherent questions on the basis of understanding the visual environment. Previous studies focus on implicit exploration of multimodal co-reference by implicitly attending to spatial or object-level image features, but neglect the importance of explicitly locating the objects in the visual content that are associated with entities in the textual content. Therefore, in this paper we propose a Multimodal Incremental Transformer with Visual Grounding, named MITVG, which consists of two key parts: visual grounding and a multimodal incremental transformer. Visual grounding aims to explicitly locate related objects in the image guided by textual entities, which helps the model exclude visual content that does not need attention. On the basis of visual grounding, the multimodal incremental transformer encodes the multi-turn dialogue history combined with the visual scene step by step in the order of the dialogue, and then generates a contextually and visually coherent response. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the effectiveness of the proposed model, which achieves comparable performance.


Introduction
Recently, there is increasing interest in vision-language tasks, such as image captioning (Anderson et al., 2016, 2018; Cornia et al., 2020) and visual question answering (Ren et al., 2015a; Gao et al., 2015; Lu et al., 2016; Anderson et al., 2018). In the real world, our conversations (Chen et al., 2019, 2020b) usually have multiple turns. As an extension of conventional single-turn visual question answering, Das et al. (2017) introduce a multi-turn visual question answering task named visual dialogue, which aims to explore the ability of an AI agent to hold a meaningful multi-turn dialogue with humans in natural language about visual content.

Figure 1: An example of visual dialogue. Caption: there is a frisbee team with their coach taking a team photo. Q1: how many people ? A1: 7 people. Q2: is anyone holding a frisbee ? A2: yes. Q3: is the coach on the right ? A3: yes, on the far right. Q4: are they wearing matching uniforms ? A4: all except the coach. The color in the text background corresponds to the same color box in the image, which indicates the same entity. Our model first associates textual entities with objects explicitly and then gives contextually and visually coherent answers to contextual questions.
Visual dialogue (Agarwal et al., 2020; Wang et al., 2020; Qi et al., 2020; Murahari et al., 2020) requires agents to give a response on the basis of understanding both visual and textual content. One of the key challenges in visual dialogue is how to resolve multimodal co-reference (Das et al., 2017; Kottur et al., 2018). Fusion-based models (Das et al., 2017) were first proposed to fuse spatial image features and textual features in order to obtain a joint representation. Attention-based models (Lu et al., 2017; Wu et al., 2018; Kottur et al., 2018) were then proposed to dynamically attend to spatial image features in order to find related visual content. Furthermore, models based on object-level image features (Gan et al., 2019; Chen et al., 2020a; Jiang et al., 2020a; Nguyen et al., 2020; Jiang et al., 2020b) were proposed to effectively leverage the visual content for multimodal co-reference. However, as implicit explorations of multimodal co-reference, these methods implicitly attend to spatial or object-level image features with attention that is trained jointly with the whole model and is inevitably distracted by unnecessary visual content. Intuitively, a specific mapping between objects and textual entities can reduce the noise of attention. As shown in Figure 1, the related objects can help the agent understand the entities (e.g., Q1: "people", Q2: "frisbee", Q3: "coach") for the generation of correct answers. When it answers the question Q4 "are they wearing matching uniforms ?", the agent has already comprehended "people" and "coach" from the previous conversation. On this basis, it can associate the entity "uniforms" with the corresponding object in the image, and generate the answer "all except the coach".
To this end, we need to 1) explicitly locate related objects guided by textual entities to exclude undesired visual content, and 2) incrementally model the multi-turn structure of the dialogue to develop a unified representation combining multi-turn utterances with the corresponding related objects. However, previous work overlooks these two important aspects.
In this paper, we thus propose a novel and effective Multimodal Incremental Transformer with Visual Grounding, named MITVG, which contains two key parts: visual grounding and multimodal incremental transformer. Visual grounding aims to establish specific mapping of objects and textual entities by explicitly locating related objects in the image with the textual entities. By doing so, our model can exclude undesired visual content and reduce attention noise. On the basis of visual grounding, the multimodal incremental transformer is used to model the multi-turn dialogue history combined with the specific visual content to generate visually and contextually coherent responses. As an encoder-decoder framework, MITVG contains a Multimodal Incremental Transformer Encoder (MITE) and a Gated Cross-Attention Decoder (GCAD).
We test the effectiveness of our proposed model on the large-scale VisDial v0.9 and v1.0 datasets (Das et al., 2017). Both automatic and human evaluations show that our model substantially outperforms competitive baselines and achieves new state-of-the-art results on most metrics. Our main contributions are as follows:
• To the best of our knowledge, we are the first to leverage visual grounding to explicitly locate related objects in the image guided by textual entities for visual dialogue.
• We propose a novel multimodal incremental transformer to encode the multi-turn dialogue history step by step combined with the visual content and then generate a contextually and visually coherent response.

Overview
In this section, we formally describe the visual dialogue task and then present our proposed Multimodal Incremental Transformer with Visual Grounding (MITVG). Following Das et al. (2017), a visual dialogue agent is given three inputs, i.e., an image I, a dialogue history (the caption and question-answer pairs) till round t − 1, H = {H_0, H_1, . . . , H_{t−1}}, and the current question Q_t at round t, where the caption Cap describing the image is taken as H_0 and H_1, . . . , H_{t−1} are concatenations of question-answer pairs. The goal of the visual dialogue agent is to generate a response (or answer) A_t to the question Q_t. Cap, Q_* and A_* are sentences. Figure 2 shows the framework of MITVG, which aims to explicitly model the multi-turn dialogue history step by step based on explicitly modeling the relationship between the modalities. MITVG first locates related objects in the image guided by the textual entities via visual grounding, then encodes the multi-turn dialogue history in the order of the dialogue utterances based on visual grounding via the Multimodal Incremental Transformer Encoder (MITE), and finally utilizes the outputs of both the encoder and visual grounding to generate the response word by word via the Gated Cross-Attention Decoder (GCAD).
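This round-by-round dataflow can be sketched in a few lines of pure Python. Here `vgm`, `mite`, and `gcad` are hypothetical stand-ins for the three components (grounding, incremental encoder, decoder), not names from a released implementation:

```python
# Illustrative sketch of the MITVG round-by-round dataflow.
# vgm, mite, and gcad are placeholder callables for the three components.
def mitvg_answer(image, history, question, vgm, mite, gcad):
    """history = [H_0 (caption), H_1, ..., H_{t-1}]; returns the answer A_t."""
    c = history[0]                     # c_0 is the caption representation H_0
    for u_i in history[1:]:
        v_i = vgm(image, u_i)          # ground the i-th utterance's entities
        c = mite(v_i, u_i, c)          # c_i = MITE(v_i^g, u_i, c_{i-1})
    v_t = vgm(image, question)         # ground the current question Q_t
    c_t = mite(v_t, question, c)       # final context state c_t
    return gcad(c_t, v_t)              # decode the answer from c_t and v_t^g
```

With toy stand-ins (e.g., string concatenation for `mite`), the function threads the context state through the turns exactly as the recursion over c_0, . . . , c_t prescribes.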

Input Representation
Before describing our method, we introduce the input representation.
Image Features. We use a pre-trained Faster R-CNN model (Ren et al., 2015b) to extract object-level image features v ∈ R^{K×V}, where K denotes the total number of detected objects per image and V denotes the dimension of the features for each object.
Language Features. The current (at the t-th round) L-word question features are a sequence of M-dimension word embeddings with positional encoding added (Vaswani et al., 2017):

q_t = {w_1 + PE(1), w_2 + PE(2), . . . , w_L + PE(L)},

where w_j is the word embedding of the j-th word in the question Q_t, and PE(·) denotes the positional encoding function (Vaswani et al., 2017). For the dialogue history H = {H_0, H_1, . . . , H_{t−1}} and the answer A_t, the dialogue history features u = {u_0, u_1, . . . , u_{t−1}} and the answer features a_t are obtained in the same way as for the question Q_t.
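As a concrete sketch of these language features, the standard sinusoidal positional encoding of Vaswani et al. (2017) can be added to an L × M embedding matrix as follows (NumPy; the embedding values themselves are assumed given, and positions are indexed from 0 here):

```python
import numpy as np

def positional_encoding(L, M):
    """Sinusoidal PE: sin for even dimensions, cos for odd (Vaswani et al., 2017)."""
    pos = np.arange(L)[:, None]                      # positions 0..L-1
    i = np.arange(M)[None, :]                        # feature dimensions 0..M-1
    angles = pos / np.power(10000.0, (2 * (i // 2)) / M)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def question_features(word_embeddings):
    """q_t = {w_1 + PE(1), ..., w_L + PE(L)} for an L x M embedding matrix."""
    L, M = word_embeddings.shape
    return word_embeddings + positional_encoding(L, M)
```

Dialogue history and answer features would be produced by the same routine applied to their respective token embeddings.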

Visual Grounding
To exclude needless visual content, we introduce visual grounding, which is defined as grounding a natural language query (a phrase or sentence) about an image onto the correct region of the image. First of all, we use NeuralCoref for reference resolution. For example, when it processes the question Q4 "are they wearing matching uniforms ?" shown in Figure 1, NeuralCoref takes the question Q4 and its history as inputs, and then generates a new question "are the people wearing matching uniforms ?" as the new Q4. As shown in Figure 3 (a), the visual grounding model (Yang et al., 2019b) takes the i-th question Q_i and the image I as inputs and generates the initial visual grounding features:

v̂_i = VGM(Q_i, I),

where VGM(·) denotes the visual grounding model.
Then v̂_i is fed into a multi-head self-attention layer followed by a position-wise feed-forward network (FFN) layer (stacked N_v times) to generate the i-th visual grounding features:

v_i^{(n)} = FFN(MultiHead(v_i^{(n−1)}, v_i^{(n−1)}, v_i^{(n−1)})),

where n = 1, . . . , N_v, v_i^{(0)} = v̂_i, MultiHead(·) denotes the multi-head self-attention layer (Vaswani et al., 2017), and FFN(·) denotes the position-wise feed-forward network (Vaswani et al., 2017). After N_v layers of computation, we obtain the final visual grounding features v_i^g = v_i^{(N_v)}. In practice, there are some questions in visual dialogue that do not contain any entities, such as "anything else ?". For such questions, we use the features of the whole image instead, i.e., v_i^g = v.
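The N_v-layer refinement can be sketched as follows. This is a single-head NumPy illustration with the FFN reduced to a ReLU stand-in; the real model uses learned multi-head projections and FFN weights, which are omitted here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention (single head, for brevity)."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def refine_grounding(v_hat, n_layers=2):
    """Stack N_v layers of self-attention + FFN over the grounded object
    features v_hat to produce the final visual grounding features v_i^g."""
    v = v_hat
    for _ in range(n_layers):
        v = attention(v, v, v)      # self-attention over grounded objects
        v = np.maximum(v, 0.0)      # FFN stand-in: ReLU (learned weights omitted)
    return v
```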

Multimodal Incremental Transformer
Inspired by the idea of the incremental transformer, which was originally designed for single-modal dialogue tasks, we extend it and propose a multimodal incremental transformer, which is composed of a Multimodal Incremental Transformer Encoder (MITE) and a Gated Cross-Attention Decoder (GCAD). The MITE uses an incremental encoding scheme to encode the multi-turn dialogue history with an understanding of the image. The GCAD leverages the outputs from both the encoder and visual grounding via the gated cross-attention layer to fuse the information from the two modalities and generate a contextually and visually coherent response word by word.

MITE
To effectively encode multi-turn utterances grounded in visual content, we design the Multimodal Incremental Transformer Encoder (MITE). As shown in Figure 3 (b), at the i-th round, where i = 1, 2, . . . , t−1, the MITE takes the visual grounding features v_i^g, the dialogue history features u_i and the context state c_{i−1} as inputs, uses attention mechanisms to incrementally build up the representation of the relevant dialogue history and the associated image regions, and outputs the new context state c_i. This process can be stated recursively as:

c_i = MITE(v_i^g, u_i, c_{i−1}),

where MITE(·) denotes the encoding function, c_i denotes the context state after the dialogue history features u_i and the visual grounding features v_i^g have been encoded, and c_0 is the dialogue history features u_0.

As shown in Figure 3 (b), we use a stack of N_h identical layers to encode v_i^g, u_i and c_{i−1} and to generate c_i. Each layer consists of four sub-layers. The first sub-layer is a multi-head self-attention over the dialogue history:

C_1^{(n)} = MultiHead(C^{(n−1)}, C^{(n−1)}, C^{(n−1)}),

where n = 1, . . . , N_h, C^{(n−1)} is the output of the previous layer, and C^{(0)} is the dialogue history features u_i. The second sub-layer is a multi-head cross-modal attention:

C_2^{(n)} = MultiHead(C_1^{(n)}, v_i^g, v_i^g),

where v_i^g is the visual grounding features. The third sub-layer is a multi-head history attention:

C_3^{(n)} = MultiHead(C_2^{(n)}, c_{i−1}, c_{i−1}),

where c_{i−1} is the context state after the previous dialogue history features u_{i−1} have been encoded. This is why we call this encoder a "Multimodal Incremental Transformer". The fourth sub-layer is a position-wise feed-forward network (FFN):

C^{(n)} = FFN(C_3^{(n)}).

We use c_i to denote the final representation at the N_h-th layer:

c_i = C^{(N_h)}.

The multimodal incremental transformer encoder at the current turn t, i.e., the bottom one in Figure 2, has the same structure as all the other MITEs but takes the visual grounding features v_t^g, the current question features q_t and the context state c_{t−1} as inputs and generates the final context state c_t.
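The four sub-layers of one MITE layer, stacked N_h times, can be sketched in NumPy as below. As before, this is a single-head illustration with the FFN reduced to a ReLU stand-in; intermediate names mirror the sub-layer order, not the released code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def mite_layer(C, v_g, c_prev):
    """One MITE layer: self-attention -> cross-modal attention over v_i^g
    -> history attention over c_{i-1} -> FFN (ReLU stand-in)."""
    C1 = attn(C, C, C)              # 1) self-attention over dialogue history
    C2 = attn(C1, v_g, v_g)         # 2) cross-modal attention on grounding
    C3 = attn(C2, c_prev, c_prev)   # 3) history attention on context state
    return np.maximum(C3, 0.0)      # 4) position-wise FFN (weights omitted)

def mite(v_g, u_i, c_prev, n_layers=2):
    """c_i = MITE(v_i^g, u_i, c_{i-1}): stack N_h layers from C^(0) = u_i."""
    C = u_i
    for _ in range(n_layers):
        C = mite_layer(C, v_g, c_prev)
    return C
```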

GCAD
Motivated by the real-world human cognitive process, we design a Gated Cross-Attention Decoder (GCAD), shown in Figure 2, which takes the masked answer features a_{<z} (where z = 1, 2, . . . , Z and Z is the length of the answer), the encoder output c_t and the visual grounding features v_t^g as inputs, and generates contextually and visually coherent responses grounded in the image. The GCAD is composed of a stack of N_y identical layers, each of which has three sub-layers.
The first sub-layer is a multi-head self-attention:

R_1^{(n)} = MultiHead(R^{(n−1)}, R^{(n−1)}, R^{(n−1)}),

where n = 1, . . . , N_y, R^{(n−1)} is the output of the previous layer, and R^{(0)} is the masked answer features a_{<z}.

The second sub-layer is a multi-head gated cross-modal attention layer (GCA), as shown in Figure 4, calculated as:

GCA^{(n)} = α^{(n)} ◦ E^{(n)} + β^{(n)} ◦ G^{(n)},

where n = 1, . . . , N_y, ◦ denotes the Hadamard product, and E^{(n)} and G^{(n)} denote the outputs of two cross-attention functions, computed as follows:

E^{(n)} = MultiHead(R_1^{(n)}, c_t, c_t),   G^{(n)} = MultiHead(R_1^{(n)}, v_t^g, v_t^g),

and α^{(n)}, β^{(n)} are two gates (our inspiration comes from Cornia et al. (2020)):

α^{(n)} = σ([R_1^{(n)}, E^{(n)}]W_E + b_E),   β^{(n)} = σ([R_1^{(n)}, G^{(n)}]W_G + b_G),

where σ denotes the sigmoid function, W_E, W_G, b_E and b_G are learnable parameters, and [·, ·] indicates concatenation. The third sub-layer is a position-wise feed-forward network (FFN):

R^{(n)} = FFN(GCA^{(n)}).

We use r_z to denote the final representation at the N_y-th layer:

r_z = R^{(N_y)}.

Finally, we use softmax to get the word probabilities â_z: â_z = softmax(r_z).
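The gated fusion of the textual and visual cross-attention streams can be sketched as follows (single-head NumPy illustration; W_E, b_E, W_G, b_G would be learned in the real model and are passed in here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attn(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def gated_cross_attention(R, c_t, v_g, W_E, b_E, W_G, b_G):
    """GCA = alpha ∘ E + beta ∘ G: E attends to the context state c_t,
    G attends to the visual grounding features v_t^g, and the sigmoid
    gates decide how much of each modality to pass through."""
    E = attn(R, c_t, c_t)                             # textual cross-attention
    G = attn(R, v_g, v_g)                             # visual cross-attention
    alpha = sigmoid(np.concatenate([R, E], -1) @ W_E + b_E)
    beta = sigmoid(np.concatenate([R, G], -1) @ W_G + b_G)
    return alpha * E + beta * G                       # Hadamard-gated fusion
```

With zero gate weights, both gates sit at 0.5 and the layer averages the two streams, which makes the gating behavior easy to verify.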

Implementation and Evaluation
Implementation Details. Following previous work (Das et al., 2017), to represent words we first lowercase all the texts and convert digits to words, and then remove contractions before tokenization. The captions, questions and answers are further truncated to ensure that they are no longer than 40, 20 and 20 tokens, respectively. We construct the vocabulary from the tokens that appear at least a minimum number of times in the training set. Before training our model, we use three external tools for image feature extraction, reference resolution and visual grounding.
Reference Resolution. We use NeuralCoref v4.0, developed by Hugging Face, for reference resolution. Introduction and code are available at https://github.com/huggingface/neuralcoref.
Visual Grounding. We use the One-Stage Visual Grounding model (Yang et al., 2019b) to obtain the visual grounding features. Introduction and code are available at https://github.com/zyang-ur/onestage_grounding.
Automatic Evaluation. We use a retrieval setting to evaluate individual responses at each round of a dialogue, following Das et al. (2017). Specifically, at test time, apart from the image, the ground-truth dialogue history and the question, a list of 100 candidate answers is also given. The model is evaluated on retrieval metrics: (1) rank of the human response (Mean, the lower the better), (2) existence of the human response in the top-k ranked responses, i.e., R@k (the higher the better), (3) mean reciprocal rank (MRR) of the human response (the higher the better), and (4) normalized discounted cumulative gain (NDCG) for VisDial v1.0 (the higher the better). During evaluation, we use the log-likelihood scores to rank the candidate answers.
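These rank-based metrics are straightforward to compute from the 1-based rank of the human response among the 100 candidates; NDCG additionally requires the dense relevance annotations of VisDial v1.0 and is omitted from this sketch:

```python
def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Mean rank, MRR and R@k from the 1-based rank of the human response
    for each test question."""
    n = len(ranks)
    mean_rank = sum(ranks) / n                      # lower is better
    mrr = sum(1.0 / r for r in ranks) / n           # higher is better
    recall = {f"R@{k}": sum(r <= k for r in ranks) / n for k in ks}
    return {"Mean": mean_rank, "MRR": mrr, **recall}
```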
Human Evaluation. Following Wu et al. (2018), we randomly extract 100 samples for human evaluation and ask 3 human subjects to guess whether the last response in the dialogue is human-generated or machine-generated. If at least 2 of them agree that it was generated by a human, we consider it to pass the Turing test (M1). In addition, we record the percentage of responses that are evaluated as better than or equal to human responses (M2), according to the human subjects' evaluation.
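The M1 decision rule is a simple 2-of-3 majority vote over the subjects' judgments; a minimal sketch:

```python
def m1_pass(votes):
    """A response passes M1 if at least 2 of the 3 subjects judged it
    human-generated (votes are booleans, one per subject)."""
    return sum(votes) >= 2

def m1_score(all_votes):
    """Fraction of evaluated samples that pass the 2-of-3 vote."""
    return sum(m1_pass(v) for v in all_votes) / len(all_votes)
```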

Main Results
We compare our proposed model with the state-of-the-art generative models developed in previous work. Current encoder-decoder based generative models can be divided into three facets.
As shown in Table 1 and Table 2, our MITVG, which explicitly locates related objects guided by the textual entities and implements a multimodal incremental transformer to incrementally build the representation of the dialogue history and the image, achieves comparable performance on the VisDial v0.9 and v1.0 datasets. Specifically, our model outperforms previous work by a significant margin both on the VisDial v0.9 dataset (0.87 on MRR, 0.31 on R@1, 1.17 on R@5, 0.75 on R@10) and on the VisDial v1.0 dataset (0.98 on MRR, 0.76 on R@1, 1.23 on R@5, 1.28 on R@10, 0.82 on Mean, and 1.00 on NDCG). The improvement on R@10 is the largest, and our method also gains a large increase on MRR and R@1 due to the explicit modeling of multiple modalities (see Sec 3.5 for further quantitative analysis).
As shown in Table 3, we conduct a human study to further prove the effectiveness of our model. Our model achieves the highest scores on both metric M1 (0.76) and metric M2 (0.70) compared with the previous model, DMRM (Chen et al., 2020a). These results show that our model can generate better contextually and visually coherent responses.

Ablation Study
We also conduct an ablation study to illustrate the validity of our proposed Multimodal Incremental Transformer with Visual Grounding. The results are shown in Table 4.
We implement the Multimodal Incremental Transformer without Visual Grounding ('MITVG w/o VG') to verify the validity of visual grounding. As shown in Table 4, comparing 'MITVG w/o VG' with MITVG, we find that the metrics drop noticeably (0.46 on MRR, 0.60 on R@1, 0.68 on R@5, 0.46 on R@10 and 0.59 on Mean) when visual grounding is removed from MITVG. This observation demonstrates the validity of visual grounding.
To verify the effectiveness of the incremental transformer architecture, we implement a Multimodal Incremental LSTM without Visual Grounding ('MI-LSTM w/o VG'). A 3-layer bidirectional LSTM (Schuster and Paliwal, 1997) with multi-head attention and a 1-layer LSTM with GCA are used as the encoder and decoder, respectively. All LSTM hidden states have size 512. Results in Table 4 demonstrate the effectiveness of our incremental transformer architecture (compare 'MITVG w/o VG' with 'MI-LSTM w/o VG'). The comparison between 'MITVG w/o VG' and DMRM (Chen et al., 2020a) also shows the validity of our incremental transformer to some extent.

Case Study
As shown in Table 5, we calculate the average number of objects associated with entities in each question to assist the analysis. As shown in Figure 5 (a), owing to the explicit understanding of visual content via visual grounding and the multimodal incremental transformer architecture, our MITVG generates responses in keeping with human answers. For example, when answering the questions Q1 "how tall is the stack ?" and Q2 "what color are they ?", our model grounds the three suitcases accurately via visual grounding, thus giving the accurate responses "3 suitcases" and "blue and 2 red". However, as shown in Figure 5 (b), ...

Related Work

Visual Dialogue. ... (2020) refine history information from both topic aggregation and context matching. Different from these approaches, we explicitly establish a specific mapping between objects and textual entities to exclude undesired visual content via visual grounding, and model the multi-turn structure of the dialogue based on visual grounding to develop a unified representation combining multi-turn utterances with the relevant objects.
Incremental Structures. There have been some successes in introducing incremental structures into tasks related to dialog systems (Zilka and Jurcicek, 2015; Coman et al., 2019; Das et al., 2017). In particular, Coman et al. (2019) propose an incremental dialog state tracker which is updated on a token basis from incremental transcriptions. An incremental transformer has also been devised to encode multi-turn utterances along with knowledge in related documents for document-grounded conversations. Das et al. (2017) propose a dialog-RNN to produce an encoding for the current round and a state for the next round. Our model differs from these approaches mainly in two aspects: 1) we explicitly model the relationship between modalities, i.e., textual utterances and image objects, in visual dialogue through visual grounding; 2) based on the explicit association between modalities, our model incrementally encodes the dialogue history and the image with a well-designed incremental multimodal architecture to sufficiently understand the dialogue content, thus generating better responses.

Conclusion
We propose a novel Multimodal Incremental Transformer with Visual Grounding for visual dialogue, named MITVG, which consists of two key parts: visual grounding and a multimodal incremental transformer. Visual grounding aims to explicitly model the relationship between the modalities. Based on visual grounding, the multimodal incremental transformer explicitly models the multi-turn dialogue history in the order of the dialogue. Experiments on the VisDial v0.9 and v1.0 datasets show that our model achieves comparable performance.