SeqDialN: Sequential Visual Dialog Network in Joint Visual-Linguistic Representation Space

Abstract

The key challenge of the visual dialog task is how to fuse features from multimodal sources and extract relevant information from the dialog history to answer the current query. In this work, we formulate a visual dialog as an information flow in which each piece of information is encoded with the joint visual-linguistic representation of a single dialog round. Based on this formulation, we consider the visual dialog task as a sequence problem consisting of ordered visual-linguistic vectors. For featurization, we use a Dense Symmetric Co-Attention network (Nguyen and Okatani, 2018) as a lightweight vision-language joint representation generator to fuse multimodal features (i.e., image and text), yielding better computation and data efficiency. For inference, we propose two Sequential Dialog Networks (SeqDialN): the first uses an LSTM (Hochreiter and Schmidhuber, 1997) for information propagation (IP) and the second uses a modified Transformer (Vaswani et al., 2017) for multi-step reasoning (MR). Our architecture separates the complexity of multimodal feature fusion from that of inference, which allows a simpler design of the inference engine. The IP-based SeqDialN is our baseline, with a simple 2-layer LSTM design that achieves decent performance. The MR-based SeqDialN, on the other hand, recurrently refines the semantic question/history representations through the Transformer's self-attention stack and produces promising results on the visual dialog task. On the VisDial v1.0 test-std dataset, our best single generative SeqDialN achieves 62.54% NDCG and 48.63% MRR; our ensemble generative SeqDialN achieves 63.78% NDCG and 49.98% MRR, setting a new state of the art among generative visual dialog models. We fine-tune the discriminative SeqDialN with dense annotations and boost its performance up to 72.41% NDCG and 55.11% MRR. We discuss the extensive experiments we conducted to demonstrate the effectiveness of our model components, provide visualizations of the reasoning process over the relevant conversation rounds, and discuss our fine-tuning methods. The code is available at https://github.com/xiaoxiaoheimei/SeqDialN.

danielwu@alumni.stanford.edu, vhying@stanford.edu

Introduction
Visual Dialog has attracted increasing research interest as an emerging field, bringing together aspects of computer vision, natural language processing, and dialog systems. In this task, an AI agent is required to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a query about the image, the agent has to ground the query in the image, infer context from the history, and answer the query accurately [3]. Visual Dialog is disentangled enough from any specific downstream task to serve as a general test of machine intelligence, while being grounded in vision enough to allow objective evaluation of individual responses and to benchmark progress.
Previous research has tackled the visual dialog task from various theoretical perspectives. Attention-based models generally consist of multiple attention layers, each of which fuses the vision and language features in a finer semantic space to retrieve and augment query-related information [36,5]. Recursive Visual Attention (RvA) [19] utilizes elaborately designed supervisory modules to recursively retrieve query-related information from the dialog history. GNN [39] models the visual dialog task as a partially observed Markov Random Field, inferring the answer representation for the query by maximizing the likelihood of the probabilistic graphical model. These models broaden the insights and achieve impressive results. However, they all share a common challenge: the semantic gap caused by vastly different feature semantics learned separately and independently in the vision and language domains. This semantic gap brings considerable complexity to feature fusion, where visual features are combined with textual features.
Recent work [13,30,2,12] on learning joint representations of visual content and natural language text aims to bridge the semantic gap between the two domains. These models are pretrained with large-scale visual-text datasets and significantly boost the performance of downstream tasks with a jointly learned representation. For example, building on ViLBERT [13], VisDial-BERT [17] sets a new state-of-the-art baseline for the visual dialog task. However, when trained with a smaller dataset such as VisDial v1.0-train, the performance of VisDial-BERT [17] drops significantly, which indicates that its performance gain can be primarily attributed to the larger training dataset. In addition, ViLBERT [13] requires much higher computation and memory resources, and significantly increases training complexity.
Our work is inspired by the use of visual-linguistic joint representations to erase the semantic gap: we embed the visual signals into the text snippets of each dialog round. In this way, we convert a visual dialog into an ordered vector sequence, where each vector is the joint visual-linguistic representation of a specific dialog round. Rather than using ViLBERT [13], we choose Dense Symmetric Co-Attention [18] as a lightweight joint visual-linguistic representation generator. In contrast to VisDial-BERT [17], which concatenates all rounds of the dialog history into a single textual input for ViLBERT [13], we keep each dialog round separate. Keeping this inherent sequential structure of the visual dialog allows us to reason across the dialog history to find the most query-relevant dialog rounds. We propose two sequential networks that tackle the visual dialog task by viewing a visual dialog as a vector sequence in the joint visual-linguistic representation space.
Fig. 1 illustrates a conceptual overview of the proposed method. The visual features and language embeddings are learned in two independent domains. They are fed into the Dense Symmetric Co-Attention Network [18] to produce a visual-linguistic vector sequence in the joint visual-linguistic feature space. Our baseline model, the Information Propagation Network (SeqIPN), uses an LSTM [8] to summarize the visual-linguistic sequence with its last hidden state. Despite its simple structure, SeqIPN significantly outperforms other well-known baselines, such as Memory Network [3] and CoAtt [14], on the NDCG metric. The Multi-step Reasoning Network (SeqMRN) is based on the Transformer [31]. We expect the multi-head attention mechanism in the Transformer to better capture the relationships within the visual-linguistic sequence. By stacking several Transformer blocks to refine attention in a higher semantic space, we achieve multi-step reasoning. SeqMRN outperforms VisDial-BERT [17] by more than 1.5% NDCG when trained with a comparable amount of data, while using only roughly 70% of the parameters of the latter. Note that the pipeline in Fig. 1 facilitates combining different word embeddings with the SeqDialN models. In this work, we compare two kinds of pre-trained word representations: GloVe [20] and DistilBERT [24]. Our ablation test shows that SeqMRN with DistilBERT embeddings is the best combination. Further experiments show that SeqDialN sets a new state of the art among generative visual dialog models.
VLDialog and NDCGFinetune [17,22] fine-tune with dense annotations. P1P2 [21] identifies two causal principles and uses the relevance scores of the dense annotations as supervision signals. Training on the dense annotations makes these models perform very well on the NDCG metric but poorly on the others, because the dense-annotation dataset does not correlate well with the original ground-truth answers to the questions [17]. In this work, we propose a reweighting method to lessen the decrease in performance measured by non-NDCG metrics. Our best model achieves slightly lower NDCG than [17,22,21] but outperforms them significantly on MRR.
The main contributions of this paper are fourfold. (1) We formulate the visual dialog task as reasoning over a sequence in the joint visual-linguistic representation space. (2) We propose two sequential networks to tackle the visual dialog task in that space. (3) We set a new state of the art for generative visual dialog models. (4) We propose a reweighting method to maintain overall performance when fine-tuning with dense annotations.

Related Work
Attention Mechanism has been widely used in image captioning and visual question answering (VQA) tasks. In image captioning, the attention mechanism is generally used to weigh how relevant image regions and previously generated words are to the next word to be generated [34].
VQA focuses on providing a natural language answer given an image and a free-form, open-ended question. Image features come either from region proposals generated by object detection networks [6,23] or from the grid features of convolution layers [27]. Attention mechanisms have been deeply explored in VQA-related work. Earlier attention mechanisms involve question-guided attention over image regions to retrieve the most informative regions conditioned on the question [33,26]. In deep networks, the attention mechanism helps refine semantic meaning at different levels. SANs [37] create stacked attention networks, producing multiple attention maps in a sequential manner to imitate multi-step reasoning. Afterward, image-guided attention started gaining interest. [15] introduces co-attention between image regions and words in the question. [38] utilizes image-guided attention to extract the language concept of an image and then combines this with a novel multi-modal feature fusion of image and question.
Recently, the Dense Co-Attention Network (DCN) [18] proposed a symmetric co-attention layer to address VQA tasks. DCN is "dense symmetric" because it makes each visual region aware of the existence of each question word and vice versa. This fine-grained co-attention enables DCN to discriminate subtle differences or similarities between vision and language features. In this work, we use DCN as the generator of the joint visual-linguistic representation.
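To make "dense symmetric" concrete, the toy sketch below builds one affinity matrix between every region and every word, then softmax-normalizes it in both directions: row-wise for region-to-word attention and column-wise for word-to-region attention. This is a simplified illustration of the idea only (plain dot products stand in for DCN's learned projections; `dense_coattention` is a hypothetical helper, not the paper's implementation).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dense_coattention(regions, words):
    """Toy sketch of dense symmetric co-attention: every region attends to
    every word and vice versa through a single shared affinity matrix."""
    # affinity[i][j] = similarity between region i and word j
    affinity = [[sum(r * w for r, w in zip(reg, wd)) for wd in words]
                for reg in regions]
    # region -> word attention: softmax over words for each region
    attn_r2w = [softmax(row) for row in affinity]
    # word -> region attention: softmax over regions for each word
    attn_w2r = [softmax(list(col)) for col in zip(*affinity)]
    return attn_r2w, attn_w2r
```

Because both directions come from the same affinity matrix, each region's view of the words and each word's view of the regions stay mutually consistent, which is what lets DCN pick up fine-grained cross-modal correspondences.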
Pretrained Joint Visual-Linguistic Representation Models. Inspired by BERT [4], the models of [13,30,29,12,2] stack Transformers [31] in BERT style to learn general joint visual-linguistic representations. These models can be divided into two categories. The 2-stream architecture, which includes [13,30], separates the vision stream from the language stream and aligns the two modalities by iteratively making the two streams attend to each other. The 1-stream architecture, which includes [29,12,2], packs visual regions and words into a single stream and aligns the two modalities by self-attention over the entire stream. These models are pretrained with large-scale image/video and text pairs. For example, ViLBERT [13] and VL-BERT [29] are pretrained with Conceptual Captions (CC) [25], comprised of 3.3 million image-caption pairs. Downstream tasks can leverage large-scale vision-language datasets by directly fine-tuning these models with task-specific data.
Visual Dialog is a natural extension of VQA. Given an image and a dialog history comprised of multiple rounds of conversation, a visual dialog agent is required to respond to a query. [3] proposes the visual dialog task and a general encoder/decoder framework to tackle it. Early baselines include Late Fusion, the Hierarchical Recurrent Encoder, and Memory Networks [3]. [7] proposes a two-stage method that filters out the obviously irrelevant answers in a primary stage and then re-ranks the remaining answers in a synergistic stage; it won the Visual Dialog challenge in 2018. Several models try to leverage the dialog structure to conduct explicit reasoning. GNN [39] abstracts a visual dialog as a fully connected graph where each node represents a single dialog round and each edge represents the semantic dependency between the two connected nodes. Conditioned on the observed dialog-history nodes, the representation of the unobserved query-answer node is inferred by maximizing the likelihood of the probabilistic graph. Recursive Visual Attention (RvA) [19] designs sub-networks to infer the stopping condition while recursively traversing the dialog stack to resolve visual co-reference relationships; it won the Visual Dialog challenge in 2019 by fine-tuning with dense annotations. ReDAN [5] develops a recurrent dual-attention network to progressively update the semantic representations of the query, vision, and history, making them co-aware through multiple steps to achieve multi-step reasoning. ReDAN's 64.47% NDCG on the VisDial v1.0 test-std set is still the highest score among all published work trained without dense annotations.
Building on ViLBERT [13], the recent VisDial-BERT [17] leverages the joint visual-linguistic representation to tackle the visual dialog task. By fine-tuning with dense annotations, VisDial-BERT [17] achieves a state-of-the-art NDCG (74.47%) using a discriminative model. However, its non-NDCG performance is significantly lower. Furthermore, it is not easy to deploy a discriminative model in real applications. A similar performance degradation occurs with P1P2 [21], which is also trained with dense annotations.

Approach
The visual dialog task [3] is formulated as follows: at time t, given an image I, a query Q_t grounded in I, and a dialog history H_t = (C, (Q_1, A_1), ..., (Q_{t-1}, A_{t-1})) that includes the image caption C, the agent must predict the ground-truth answer from a list of candidate answers and rank the other feasible answers as high as possible.
As illustrated in Fig. 1, we rely on Faster R-CNN [23] to extract features corresponding to salient image regions [1]. The vision feature of image I is represented as F_I ∈ R^{n_v × d_v}, where n_v = 36 is the number of object-like region proposals in the image and d_v = 2048 is the dimension of each feature vector. Q_t and each item in H_t are padded or truncated to the same length d_l. Thus, each sentence S is represented as F_S ∈ R^{d_l × d_e}, where d_e is the dimension of the word embedding. To facilitate further discussion, we denote by d_h the dimension of the hidden state throughout this section.
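The fixed-length requirement for Q_t and the history items can be sketched as a small helper (hypothetical; the paper does not specify its tokenizer or padding token, so `pad_id=0` is an assumed convention):

```python
def pad_or_truncate(token_ids, d_l, pad_id=0):
    """Bring a tokenized sentence to a fixed length d_l: truncate long
    sentences, right-pad short ones with pad_id."""
    if len(token_ids) >= d_l:
        return token_ids[:d_l]
    return token_ids + [pad_id] * (d_l - len(token_ids))
```

After this step every sentence maps to a d_l × d_e embedding matrix F_S, which is what makes batching the dialog rounds straightforward.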
This section is organized into five sub-sections. Section 3.1 illustrates how we convert a visual dialog into a visual-linguistic vector sequence. Section 3.2 introduces the LSTM-based [8] information propagation network. Section 3.3 introduces the Transformer-based [31] multi-step reasoning network. Section 3.4 discusses decoders. Section 3.5 illustrates the reweighting method used when fine-tuning with dense annotations.

Convert Visual Dialog into Visual-Linguistic Vector Sequence
The Dense Co-Attention Network (DCN) [18] proposes using the contents of sub-grids of a convolutional neural network as visual region features. We instead use Faster R-CNN proposals [23,1], because people usually talk about objects in their conversations and Faster R-CNN proposals align better with this purpose of object identification. Given an image I with vision feature F_I ∈ R^{n_v × d_v} and a sentence S with embedding F_S ∈ R^{d_l × d_e}, we define DCN(I, S) ∈ R^{d_h} as the Dense Co-Attention [18] representation of I and S. We define an instance of a t-round visual dialog by a tuple D = (I, H_t, Q_t). Using DCN, we convert the dialog history H_t into the visual-linguistic vector sequence H̃_t = (DCN(I, h_0), ..., DCN(I, h_{t-1})), where h_0 is the caption C and h_i (1 ≤ i ≤ t-1) is the concatenated question-answer pair of round i. Letting Q̃_t = DCN(I, Q_t), the original visual dialog turns into a new tuple D̃ = (H̃_t, Q̃_t) in the joint visual-linguistic representation space. Note that the sequential structure of H̃_t is exactly the same as that of H_t, and the image I no longer exists in D̃ as an explicit domain.
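The conversion above can be sketched as a short routine. Here `dcn` is a stand-in for the Dense Symmetric Co-Attention generator (any callable mapping an image and a text to a vector); the function name and signature are illustrative, not the paper's API:

```python
def to_vl_sequence(dcn, image, caption, qa_rounds, query):
    """Sketch of Section 3.1's conversion: each history round (the caption
    first, then each question-answer pair) becomes one joint
    visual-linguistic vector, and the query becomes its own vector."""
    history_texts = [caption] + [q + " " + a for q, a in qa_rounds]
    h_seq = [dcn(image, text) for text in history_texts]   # the sequence H~_t
    q_vec = dcn(image, query)                              # the vector  Q~_t
    return h_seq, q_vec
```

Note that the image enters every element of the output, so it no longer needs to be carried as a separate input by the downstream sequence models.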
To facilitate the discussion in Section 3.2, we define the question history Q̄_t = (DCN(I, Q_1), ..., DCN(I, Q_t)). Note that Q̄_t includes the visual-linguistic vector of the current query Q_t.

SeqIPN: Information Propagation Network
As illustrated in Fig. 2, the Information Propagation Network is a 2-layer LSTM.
After converting the visual dialog into the tuple D̃ = (H̃_t, Q̃_t) in the joint visual-linguistic representation space, we apply an LSTM to the visual-linguistic vector sequence H̃_t and use the hidden state at time t as the summary of the visual-linguistic history: R_L = LSTM(H̃_t). We apply the same LSTM to the question history Q̄_t and use its last hidden state R_Q as the context-aware query. Propagating through the question history is a trade-off between NDCG and MRR. The propagation can blend the previous questions with the real query Q_t, which fools the discriminator and results in a drop in MRR (< 1%). However, the impact is insignificant because the forget gate of the LSTM causes the influence of older questions to gradually fade away. On the other hand, R_Q collects more semantic information that helps discriminate among answers sharing similar semantic meanings, which results in an increase in NDCG (> 1.5%).
We treat the concatenated representation [R_L, R_Q] ∈ R^{2 d_h} as the information summary of D̃, and linearly project it to R_QL ∈ R^{d_h} as the final representation of D̃. R_QL is fed into the decoder to rank the candidate answers.
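SeqIPN's read-out step can be sketched as follows. The LSTM itself is abstracted as a callable returning its last hidden state (`lstm_last_hidden` is a hypothetical name), and the projection is the plain linear map R_QL = W [R_L, R_Q] + b with W ∈ R^{d_h × 2 d_h}:

```python
def summarize_dialog(lstm_last_hidden, history_seq, question_seq, W, b):
    """Sketch of SeqIPN's read-out: run the same LSTM over the
    visual-linguistic history and the question history, take the last
    hidden states R_L and R_Q, concatenate, and project back to d_h."""
    R_L = lstm_last_hidden(history_seq)   # summary of the history sequence
    R_Q = lstm_last_hidden(question_seq)  # context-aware query
    concat = R_L + R_Q                    # [R_L, R_Q] in R^{2 d_h}
    # linear projection row by row: R_QL[i] = W[i] . concat + b[i]
    return [sum(w * x for w, x in zip(row, concat)) + bb
            for row, bb in zip(W, b)]
```

Sharing one LSTM between the two sequences keeps the parameter count low, which matches the paper's emphasis on a simple baseline design.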

SeqMRN: Multi-step Reasoning Network
The Transformer [31] was originally developed for sequence-to-sequence tasks using an encoder-decoder architecture. It introduces a position-aware self-attention mechanism by combining each member of the input sequence with a position-sensitive feature. The Transformer's encoder applies self-attention, which makes each member aware of the existence of all other members in order to learn an informative representation of the input sequence. The Transformer's decoder applies masked self-attention, which makes each member attend only to preceding members in order to align with the temporal structure of the output sequence.
We cannot directly apply the Transformer's encoder to encode the visual-linguistic vector sequence H̃_t, because the encoder's self-attention would make a specific dialog round attend to future rounds, resulting in a temporal paradox. Meanwhile, we observe that H̃_t has a temporal structure similar to that handled by the masked self-attention of the Transformer's decoder; that is, only predecessors are observable at a specific time.
In this work, we modify the Transformer's encoder by replacing its self-attention with the decoder's masked self-attention, while keeping the other modules unchanged. We focus on the modifications that enable multi-step reasoning via the Transformer. For simplicity, we define three functions Query(), Key(), and Value(). Given a vector v ∈ R^{d_h}, Query(v), Key(v), and Value(v) are vectors in R^{d_h} representing v's query, key, and value as described in [31]. Fig. 3 shows the conceptual architecture of the proposed Multi-step Reasoning Network (SeqMRN). {P_0, ..., P_{t-1}} are position features. Following [31], for 0 ≤ i ≤ t-1, P_i is a d_h-dimensional vector with P_i[2k] = sin(i / 10000^{2k/d_h}) and P_i[2k+1] = cos(i / 10000^{2k/d_h}). Given the dialog tuple D̃ = (H̃_t, Q̃_t), the position-aware visual-linguistic sequence U_t is defined by U_i = H̃_t[i] + P_i for 0 ≤ i ≤ t-1. In this work, we freeze the position features after initialization, since in our experiments the VisDial v1.0 dataset is not sufficient to train them well from scratch.
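The frozen sinusoidal position feature from [31] can be computed directly; no learned parameters are involved, which is exactly why freezing it costs nothing:

```python
import math

def position_feature(i, d_h):
    """Sinusoidal position feature of Vaswani et al. (2017): even
    dimensions use sin, odd dimensions cos, with a geometrically
    increasing wavelength across dimension pairs."""
    p = []
    for k in range(d_h):
        angle = i / (10000 ** (2 * (k // 2) / d_h))
        p.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return p
```

Because each position maps to a distinct, deterministic pattern, adding P_i to H̃_t[i] lets the masked attention layers distinguish dialog rounds by their order without any extra training data.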
History Backward Self-Attention Layer. As illustrated in Fig. 3, this layer applies masked self-attention within the position-aware sequence U_t. It allows a single dialog round to gather relevant information from previous conversation rounds and embed that information into its own representation.
Specifically, for U_i with 0 ≤ i ≤ t-1, its attention logits with respect to the other rounds of dialog are defined by τ_i[j] = Query(U_i)^T Key(U_j) / sqrt(d_h) for 0 ≤ j ≤ i, with τ_i[j] = -∞ for j > i, where τ_i ∈ R^t. The context-aware visual-linguistic sequence V_t is then defined by V_i = Σ_j softmax(τ_i)[j] · Value(U_j).

Query Correction Layer. In this layer, the query Q̃_t renews its knowledge about the context based on V_t. The attention weights reflect how Q̃_t distributes its focus over V_t, which enables reasoning across the dialog history. Specifically, the query's attention logits with respect to V_t are defined by ω[j] = Query(Q̃_t)^T Key(V_j) / sqrt(d_h). However, we do not want the history information in V_t to overpower the query's own semantic meaning, so we augment the logits with the query's self-attention weight u_q = Query(Q̃_t)^T Key(Q̃_t) / sqrt(d_h). The query's correction ΔQ̃_t is then the attention-weighted sum of the values of V_t and Q̃_t under softmax([ω; u_q]). Note that the Query Correction Layer keeps V_t unchanged. Contrary to SeqIPN, we do not use the question history in SeqMRN, because the attention mechanism could make Q̃_t indistinguishable from the other questions.
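The backward (masked) self-attention over U_t can be sketched as below. For brevity the learned Query/Key/Value maps are replaced by identity (plain dot-product attention), so this shows the masking pattern rather than the full layer:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def backward_self_attention(U):
    """Sketch of the History Backward Self-Attention layer: round i may
    only attend to rounds j <= i, mirroring the decoder-style mask."""
    d_h = len(U[0])
    V = []
    for i, u in enumerate(U):
        # logits only over visible predecessors (j <= i); later rounds
        # are masked out entirely, i.e. their logit is effectively -inf
        logits = [sum(a * b for a, b in zip(u, U[j])) / math.sqrt(d_h)
                  for j in range(i + 1)]
        weights = softmax(logits)
        V.append([sum(w * U[j][k] for j, w in enumerate(weights))
                  for k in range(d_h)])
    return V
```

The first round can only attend to itself, so its output equals its input; this is the property that removes the "temporal paradox" of the unmasked encoder.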
Multi-step Reasoning. The History Backward Self-Attention Layer and the Query Correction Layer form the building block of our proposed Multi-step Reasoning Network. As illustrated in Fig. 3, there is a residual connection between these two layers. The block maps D̃ = (H̃_t, Q̃_t) to a refined tuple D̃' = (H̃'_t, Q̃'_t) whose members are more context-aware than their counterparts in D̃. We achieve multi-step reasoning by stacking several such building blocks to progressively refine D̃. We take L'_{t-1} of the last block as the summary of the dialog history and Q̃'_t of the last block as the context-aware query, and project the concatenation [L'_{t-1}, Q̃'_t] to R_QL ∈ R^{d_h} as the final representation.

Decoder Module
The dialog representation R_QL is used to rank the 100 candidate answers in A_t. In this work, we report performance with both discriminative and generative decoders.
Discriminative Decoder. For each candidate answer A^j_t ∈ A_t, an LSTM is applied to A^j_t to obtain its representation R_j ∈ R^{d_h}. The score of A^j_t is defined by s_j = R_j^T R_QL. Like [7], we optimize the N-pair loss [28]: L_D = -log( exp(s_gt / τ) / Σ_j exp(s_j / τ) ), where s_gt is the score of the ground-truth answer and we set τ = 0.25.
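The temperature-scaled N-pair loss reduces to a softmax cross-entropy over the candidate scores, which makes it a few lines to compute:

```python
import math

def n_pair_loss(scores, gt_index, tau=0.25):
    """N-pair loss of the discriminative decoder:
    -log( exp(s_gt / tau) / sum_j exp(s_j / tau) ),
    computed in log-space for numerical stability."""
    scaled = [s / tau for s in scores]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -(scaled[gt_index] - log_z)
```

A small τ such as 0.25 sharpens the softmax, penalizing the model more heavily whenever any non-ground-truth candidate scores close to the ground truth.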
Generative Decoder. Inspired by attention-based NMT [16], we develop an attention-based decoder. The decoder is an LSTM initialized with R_QL. At each decoding step, instead of directly using the current hidden state to generate the distribution over the vocabulary, we compute similarity weights between the current hidden state and the hidden states of previous timestamps, and generate the distribution from the weighted sum of hidden states.
During training, we maximize the log-likelihood of the ground-truth answers. During evaluation, we use the log-likelihood scores to rank the answer candidates, dividing each score by the square root of the answer length to discourage short answers.
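The length-normalized ranking at evaluation time is simple enough to state in code; the helper name is illustrative:

```python
import math

def rank_candidates(log_likelihoods, lengths):
    """Rank answer candidates by log-likelihood divided by the square
    root of the answer length (in tokens), which keeps trivially short
    answers from dominating the ranking."""
    scores = [ll / math.sqrt(n) for ll, n in zip(log_likelihoods, lengths)]
    # indices of candidates, best (highest score) first
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    return order, scores
```

Since log-likelihoods are negative and roughly additive in length, dividing by sqrt(length) is a middle ground between no normalization (favors short answers) and dividing by the full length (can over-favor long ones).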

Reweighting Method in Fine-tuning with Dense Annotations
The VisDial v1.0 training dataset provides a subset, called dense annotations, that contains 2K dialog instances. For each instance, two human annotators assign each candidate answer a relevance score based on the ground-truth answer. [22] fine-tunes with dense annotations using a generalized cross-entropy loss: L_CE = -Σ_{j=1}^{100} y_j log(softmax(s)[j]), where s is the score vector of the candidate answers and y_j is the relevance-score label of the j-th candidate answer. However, blindly optimizing this objective significantly hurts the non-NDCG metrics. To alleviate this issue, we propose a reweighting method that makes the fine-tuning process aware of the importance of the ground-truth answer. Specifically, we update the relevance label y by lifting the ground-truth entry: y'_j = y_j for j ≠ index_gt and y'_{index_gt} = max(y_{index_gt}, max_j y_j), where index_gt is the index of the ground-truth answer. The objective of our fine-tuning is then L_F = -Σ_{j=1}^{100} y'_j log(softmax(s)[j]).
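One plausible form of the label update is sketched below (a reconstruction under the assumption stated above, not a verified transcription of the paper's exact rule): the ground-truth answer's relevance is lifted to at least the maximum relevance among all candidates, and every other label is left untouched.

```python
def reweight_relevance(y, gt_index):
    """Reweighting sketch (assumed form): ensure the ground-truth answer
    carries at least the highest relevance among the candidates, so the
    fine-tuning loss never learns to rank it below another candidate."""
    y2 = list(y)
    y2[gt_index] = max(y2[gt_index], max(y))
    return y2
```

Feeding y' instead of y into the generalized cross-entropy keeps the NDCG signal from the dense annotations while protecting the MRR/recall metrics that depend on the ground-truth answer's rank.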

Quantitative Results
Evaluation Metrics. We use NDCG, MRR, recall (R@1, 5, 10), and mean rank to evaluate model performance. The Visual Dialog challenges in 2018 and 2019 picked the winners based solely on NDCG; in 2020, the challenge added MRR as another primary metric.
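For reference, the two primary metrics for a single dialog instance can be computed as follows. NDCG uses the dense relevance scores of the candidates in the order the model ranked them (in VisDial the cutoff k is tied to the number of relevant candidates; here it is a plain parameter), and MRR is the reciprocal of the ground-truth answer's 1-based rank:

```python
import math

def mrr(gt_rank):
    """Reciprocal rank of the ground-truth answer (1-based rank)."""
    return 1.0 / gt_rank

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k: DCG of the predicted ranking divided by the DCG of the
    ideal ranking (candidates sorted by descending relevance)."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0
```

The dataset-level scores reported in the tables are the averages of these per-instance values over the evaluation set.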
Ablation Study. We compare SeqDialN models of different configurations. We use Memory Network (MN) [3], the History-Conditioned Image Attentive Encoder (HCIAE) [14], the Sequential Co-Attention Model (CoAtt) [32], and ReDAN [5] as baselines, because published work [5] reports the performance of these models with both discriminative and generative decoders. In Table 1, "-D" stands for a discriminative model and "-G" for a generative model. SeqMRN-DE-D and SeqMRN-DE-G outperform all baselines and the other SeqDialN models on NDCG in the discriminative and generative cases, respectively. In the generative case in particular, SeqMRN-DE-G outperforms the second-place ReDAN-G (T=3) by more than 3.6% NDCG. The discriminative SeqDialN models have worse MRR than the discriminative baselines; SeqMRN-GE-D achieves the best MRR among discriminative SeqDialN models, but it is still much lower than that of ReDAN-D (T=3). However, the MRR difference between ReDAN-G (T=3) and SeqMRN-DE-G is merely 0.3, so SeqMRN-DE-G still outperforms ReDAN-G (T=3) on average. We conclude that SeqMRN-DE-G is a new state-of-the-art generative visual dialog model.
SeqIPN with GloVe embeddings is the simplest SeqDialN. Nevertheless, SeqIPN-GE-D achieves better NDCG than well-known discriminative baselines such as MN-D, HCIAE-D, and CoAtt-D. In addition, SeqIPN-GE-G even outperforms all generative baselines on NDCG. The model's simplicity and performance gain together validate the merit of treating a visual dialog as a visual-linguistic vector sequence.
As shown in Table 1, generative SeqDialN's performance measured by NDCG is positively correlated with its performance measured by MRR. However, we identified a performance trade-off between NDCG and MRR for discriminative SeqDialN. For example, SeqMRN-GE-D performs slightly worse than SeqMRN-DE-D on NDCG but significantly better on MRR (> 3%); the same can be observed between SeqIPN-GE-D and SeqIPN-DE-D. Such complementary behavior among the discriminative SeqDialN models can boost the performance of an ensemble.

Ensemble SeqDialN Analysis
In this section, we add VisDial-BERT [17] as a baseline. At this stage, the comparison is conducted among models trained without dense annotations.
"SeqDialN: 4 Dis." enhances the performance of discriminative SeqDialN on both NDCG and MRR by a significant margin.In contrast, the performance gain of "SeqDialN: 4 Gen." is very small compared to single generative SeqDialN.This observation indicates that the performance gain of the discriminative ensemble does not come from the architecture diversity between SeqIPN and SeqMRN, but from the complementary performance between discriminative SeqDialN models.
We also evaluate SeqDialN on the VisDial v1.0 test-std set. Table 3 compares our model with state-of-the-art visual dialog models trained without dense annotations. SeqDialN achieves state-of-the-art performance on NDCG; even a single generative SeqDialN can outperform most previous work on that metric. At present, SeqDialN does not perform well on MRR, partly because it is hard for generative models to produce exactly the same answer as the ground truth, even when conditioned on the same semantic scenario.
Table 3. Comparison of SeqDialN with state-of-the-art visual dialog models on the VisDial v1.0 test-std set. ↑ indicates higher is better; ↓ indicates lower is better. † denotes ensembles. All models were trained without dense annotations.
Fine-tuning with Dense Annotations. We fine-tune discriminative SeqDialN with dense annotations. Table 4 shows the effectiveness of our proposed reweighting method (Section 3.5). We list the fine-tuning statistics for one SeqIPN and one SeqMRN; for statistics on the other configurations (including the generative cases), please refer to Appendix B in our supplementary materials. The non-NDCG metrics suffer a significant drop if we blindly fine-tune with dense annotations. In contrast, the reweighting method greatly lessens this drop.

[Table 4 fragment: Model | NDCG↑ | MRR↑ | R@1↑ | R@5↑ | R@10↑ | Mean↓; SeqMRN-DE-D reaches 70.23 NDCG.]

In the last block, the attention switches to the caption. Most likely, in the deeper stack, the model makes an inference like: only a color image can make a banana look "yellow".

Conclusion
We presented the Sequential Visual Dialog Network (SeqDialN), based on the novel idea of treating dialog rounds as a visual-linguistic vector sequence. We studied both discriminative and generative models and set a new state of the art for generative visual dialog models. Even though our model is trained only on the VisDial v1.0 dataset, it achieves competitive performance against models trained on much larger vision-language datasets. Our next step is to leverage pretrained joint visual-linguistic representations to enhance generative SeqDialN's performance on the MRR metric.

Table 2. Comparison of SeqDialN with state-of-the-art visual dialog models on the VisDial v1.0 validation set.