ReSee: Responding through Seeing Fine-grained Visual Knowledge in Open-domain Dialogue

Incorporating visual knowledge into text-only dialogue systems has become a potential direction to imitate the way humans think, imagine, and communicate. However, existing multimodal dialogue systems are either confined by the scale and quality of available datasets or by a coarse concept of visual knowledge. To address these issues, we provide a new paradigm for constructing multimodal dialogues, as well as two datasets extended from text-only dialogues under such a paradigm (ReSee-WoW, ReSee-DD). We propose to explicitly split the visual knowledge into finer granularities ("turn-level" and "entity-level"). To further boost the accuracy and diversity of the augmented visual information, we retrieve it from the Internet or a large image dataset. To demonstrate the superiority and universality of the provided visual knowledge, we propose a simple but effective framework, ReSee, that adds visual representations into vanilla dialogue models via modality concatenation. We also conduct extensive experiments and ablations w.r.t. different model configurations and visual knowledge settings. Encouraging empirical results not only demonstrate the effectiveness of introducing visual knowledge at both the entity and turn level but also verify that the proposed model ReSee outperforms several state-of-the-art methods on automatic and human evaluations. By leveraging text and vision knowledge, ReSee can produce informative responses with real-world visual concepts. Our code is available at https://github.com/ImKeTT/ReSee.


Introduction
With the availability of large-scale datasets (Li et al., 2017; Dinan et al., 2018) and pre-trained language models (Radford et al., 2019; Raffel et al., 2020), dialogue generation has developed rapidly in recent years. Conducting effective linguistic communications often requires real-world experiences shared between speakers (Bisk et al., 2020). Text alone may fall short in accurately conveying rich world knowledge (Harnad, 1990), where visual signals are essential to share experiences and conduct high-quality conversations. As humans converse day to day, it is common and natural for them to group information into smaller chunks of memory through images. That explains why incorporating visual perceptions in dialogue systems can potentially bring the conversation quality to a higher level.
Visual dialogue (Das et al., 2017) was proposed to learn to communicate with users based on one simple image, making the visual knowledge very limited for a multi-turn dialogue session. In order to enhance dialogue quality by providing larger capacity and flexibility of visual information, recent works have considered employing multiple images and image-searching processes to better align with the dialogue context. Even so, they are confined to retrieving images for a coarse-grained dialogue concept (e.g., session-level) or leverage inaccurate visual knowledge searched from inadequate image resources (Liang et al., 2021; Shen et al., 2021). To sum up, current works have two main issues that may compromise the performance of multimodal dialogue. (1) Insufficient visual knowledge: existing multimodal dialogues mostly follow the framework of image-grounded conversation, which inherently provides insufficient visual knowledge (one image) and leaves many details unexploited for a complete conversation. (2) Potentially inaccurate visual knowledge: though recent explorations propose using fine-grained images, they are limited to searching from small-scale image-caption datasets (e.g., Shen et al. (2021) employ Flickr30k (Young et al., 2014) for this process). These defects introduce knowledge bias into the system (e.g., entity images retrieved from Flickr30k may be wrong or monotonous w.r.t. given entities in Figure 2) and impair the conversational skills of a dialogue agent.
To overcome the above two shortcomings, we believe: (1) Compared with session-level visual knowledge, fine-grained visual knowledge such as entity-level images is more competent to help models build a comprehensive understanding of ongoing conversations. We thus propose to explicitly divide the visual knowledge of a dialogue session into turn-level and entity-level. (2) Instead of matching photos from existing image sets, we search the internet for images of every entity to obtain accurate and diverse visual representations. To justify the advantage of our approach in obtaining pictures of higher quality, we randomly sample 50 entities from existing dialogue data and either search for corresponding images on the internet or retrieve them from a large image corpus with over 150K images. We further conduct a human evaluation to quantify entity-image relevance: images searched from the internet outperform and tie with retrieved ones in 52% and 12% of cases, respectively. Based on the above-mentioned two concepts of visual knowledge, we take a step forward and come up with a novel framework to automatically construct multimodal dialogue data.
To verify the effectiveness of the provided visual information, we present RESEE, a generative conversational framework powered by real-world visual experiences. Our framework follows the encoder-decoder paradigm with either a shared or a separate encoder-decoder setup. We handle the multimodal dialogue context by concatenating this information into the encoder, and the model then generates plausible responses with its decoder. Three types of token embeddings are considered in the encoder module to absorb knowledge from different modalities. To prove the effectiveness of RESEE, we compare our dialogue model with several strong baselines, including four task-oriented pre-trained models and two similar multimodal dialogue systems. RESEE outperforms most baselines on both automatic and human evaluations. We also conduct comprehensive ablation experiments w.r.t. different model configurations and visual knowledge settings.

Multimodal Dialogue Datasets
In this section, we introduce our framework for constructing multimodal dialogue datasets. The overall data flow for dataset construction is shown in Figure 3. A dialogue session should consist of two aspects of visual information, namely the turn-level outline and the entity-level details. We search for both visual concepts from either a very large image pool or the internet. Specifically, we construct multimodal datasets extended from Wizard of Wikipedia (WoW) (Dinan et al., 2018), a knowledge-grounded dialogue dataset, and the commonly used DailyDialog (DD) (Li et al., 2017).

Turn-level Visual Knowledge
One dialogue turn is a single exchange of conversation between two speakers (e.g., a question and an answer). Intuitively, turn-level visual knowledge is helpful when more than one topic is involved in a dialogue session with multiple turns, and it should be highly relevant to the current ongoing conversation turn. Since one complex dialogue is generally long and diverse, instead of being restricted to one specific data domain, we gather a relatively large pool of image-caption data and use sentence similarity between captions and dialogue turns for image retrieval. Using similarity from only the language domain helps us mitigate biases caused by multimodal similarity measurements across varied image domains (Liang et al., 2021).
For the image pool to be searched, we combine four image-caption datasets, i.e., COCO2017 (Lin et al., 2014), Flickr30k (Young et al., 2014), NoCaps (Agrawal et al., 2019), and Localized Narratives (LN) (Pont-Tuset et al., 2020), with 826,539 image-caption pairs in total. We then use the following steps for turn-level image retrieval: (1) Turn Summarization: To avoid the information discrepancy between dialogue turns and image captions arising from different sentence lengths, we first summarize the dialogue turns into a shorter version. (2) Textual Representation: To fully leverage the caption descriptions of images, we use a pre-trained Sentence-BERT (Reimers and Gurevych, 2019) to obtain textual representations of both the summarized dialogue turns and the image captions. (3) Image Retrieval: Finally, we employ the processed textual representations of dialogue turns as queries and the representations of captions as keys to index the most relevant image for every dialogue turn from the image-caption database. We further present the percentage of turn-level images retrieved from each image-caption dataset in Table 2.
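Steps (2) and (3) amount to nearest-neighbor search in a shared sentence-embedding space. A minimal sketch (numpy only; in the paper the embeddings come from a pre-trained Sentence-BERT, which is not reproduced here):

```python
import numpy as np

def retrieve_turn_images(turn_embs, caption_embs, image_ids):
    """Index the most relevant image for every dialogue turn.

    turn_embs:    (n_turns, d) embeddings of summarized dialogue turns
    caption_embs: (n_images, d) embeddings of image captions
    image_ids:    n_images identifiers into the pooled caption data
    Returns one image id per turn, chosen by cosine similarity.
    """
    # L2-normalize so the dot product equals cosine similarity
    t = turn_embs / np.linalg.norm(turn_embs, axis=1, keepdims=True)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = t @ c.T                      # (n_turns, n_images) similarity matrix
    best = sims.argmax(axis=1)          # nearest caption per turn
    return [image_ids[i] for i in best]
```

Operating purely on caption text, as described above, keeps the retrieval independent of the visual domains of the pooled image datasets.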

Entity-level Visual Knowledge
The turn-level knowledge alone is not sufficient to provide full visual details for long and knowledgeable conversations. We thus propose to use entity-level images to give the dialogue agent insight into details. Specifically, entity-level visual knowledge involves images of both nouns and named entities from every dialogue. We use the following steps for entity extraction and the acquisition of their corresponding images: (1) Named Entities: We use a pre-trained RoBERTa model (Liu et al., 2019) to extract named entities in every dialogue instance. (2) Regular Nouns: We then extract all nouns from dialogues using the public toolkit Stanza (Qi et al., 2020). (3) Image Searching: Finally, we use two online search engines, i.e., Qwant and Pixabay, to search images for the entity-level visual knowledge. Since we leverage two search engines in this process, we ensure that there is at least one valid image for every extracted entity.
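The entity-collection side of steps (1) and (2) reduces to a filter over tagged tokens. In the paper the tagging is done by a RoBERTa-based tagger and Stanza; in the sketch below the (word, POS) pairs are passed in directly so it stays self-contained:

```python
def extract_entities(tagged_turn):
    """Collect entity-level query terms from one dialogue turn.

    tagged_turn: list of (word, upos) pairs, as a tagger such as Stanza
    would produce. Keeps nouns and proper nouns, deduplicated and
    lower-cased, preserving first-occurrence order.
    """
    seen, entities = set(), []
    for word, upos in tagged_turn:
        if upos in ("NOUN", "PROPN"):
            w = word.lower()
            if w not in seen:
                seen.add(w)
                entities.append(w)
    return entities
```

Each returned term then becomes a query to the image search engines.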

Overall Dataset
The proposed datasets improve over prior works by providing fine-grained and more accurate images related to the dialogue context. This is because (1) we explicitly split the visual knowledge into turn-level and entity-level; (2) we use a large image pool as well as online search engines to acquire images. We additionally present examples and detailed statistics of RESEE-WoW and RESEE-DD in Appendix B. Note that, since turn-level information is conveyed through sentences, whose semantic information may not be fully captured through conventional word matching, we did not employ online searching for turn-level images.

RESEE Methodology
We consider a simple approach that concatenates and infuses multimodal information into plain dialogue models. As shown in Figure 4, we apply this approach to two transformer models with shared or separate encoder-decoder for dialogue responding. Formally, we define our modeling task as follows: given a dialogue context C, its extracted textual entities E, a set of turn-level images V^T = {V^T_1, ..., V^T_k}, and a set of entity-level images V^E = {V^E_1, ..., V^E_m} associated with C, we aim to learn an appropriate response R with the given information by modeling p(R | C, E, V^T, V^E).
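Since the response is generated autoregressively, the conditional above factorizes over response tokens in the standard way:

```latex
p(R \mid C, E, V^{T}, V^{E}) \;=\; \prod_{t=1}^{|R|} p\!\left(r_{t} \,\middle|\, r_{<t},\, C,\, E,\, V^{T},\, V^{E}\right)
```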

Model Input
We employ different encoders for different modalities. Concretely, we utilize transformer blocks (Vaswani et al., 2017) for word encoding, which project word tokens into a continuous word embedding space. For image encoding, we utilize the CLIP encoder (Radford et al., 2021) to capture the global information of a picture and then use MLP layers to transform it into the same embedding space as the words. To distinguish information from different modalities and to separate dialogue contexts from responses, we employ three kinds of token-wise embeddings and sum them up as the input to our transformer-based dialogue systems: token embedding, position embedding, and segment embedding.

Token Embedding: The token embedding is the concatenation of V^T_w, V^E_w, E_w, C_w, R_w, which denote the embeddings of the turn-level and entity-level visual knowledge, extracted entities, dialogue context, and response, respectively. We additionally add the special token [SEP] between different modalities and between content from distinct speakers in the dialogue. Note that we separate the response embedding R_w from this concatenation for the model with a separate encoder-decoder.

Position Embedding: Since the transformer model itself cannot learn token positions, we employ position embeddings to encode the token order in the input sequence.

Segment Embedding: Segment embeddings differentiate which segment (turn-level or entity-level visual knowledge, textual entities, dialogue context, or response) a token is in.
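The input construction can be sketched as follows. Sizes are illustrative, not the paper's actual configuration, and random numpy matrices stand in for the learned embedding tables and the CLIP-to-word MLP:

```python
import numpy as np

rng = np.random.default_rng(0)

class InputEmbedding:
    """Token + position + segment embedding sum for the multimodal input
    (a sketch; random matrices stand in for learned parameters)."""
    def __init__(self, vocab=100, d=16, max_len=64, n_segments=5, clip_dim=32):
        self.tok = rng.normal(size=(vocab, d))       # word embedding table
        self.pos = rng.normal(size=(max_len, d))     # position embeddings
        self.seg = rng.normal(size=(n_segments, d))  # one id per input segment
        self.img_proj = rng.normal(size=(clip_dim, d))  # stands in for the MLP

    def embed(self, token_ids, segment_ids, clip_feats):
        # Project global CLIP image features into the word embedding space,
        # then prepend them to the word embeddings (order: V_T, V_E, E, C[, R]).
        img = clip_feats @ self.img_proj              # (n_img, d)
        words = self.tok[token_ids]                   # (n_tok, d)
        content = np.concatenate([img, words], axis=0)
        n = content.shape[0]                          # segment_ids covers all n slots
        return content + self.pos[:n] + self.seg[segment_ids]
```

The three addends mirror the token, position, and segment embeddings described above; segment ids tell the model which of the concatenated sources each slot came from.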

Model Training
Separate Encoder-Decoder Model (RESEE (SEP.)): The dialogue model with a separate encoder-decoder employs different sets of parameters for context understanding and response generation. We apply cross-attention (Vaswani et al., 2017) between the encoder output and the decoder input to bridge multimodal dialogue context learning and response generation. We initialize this model with T5 (Raffel et al., 2020) parameters. For the training objective, the model is optimized to recover the response R given the multimodal knowledge X = [V^T, V^E, E, C], i.e., to maximize log p(R | X).

Shared Encoder-Decoder Model (RESEE (SHARE)): The dialogue model with a shared encoder-decoder integrates the understanding and generation processes with the same set of parameters. We take masked response prediction as the main training task to make the model aware of appropriate responses given the multimodal dialogue context. We initialize this model with UNILM (Dong et al., 2019). During training, 70% of the response tokens are replaced by a special token [MASK] or another token in the vocabulary. Denoting the masked part of the response as R̂, we use the unmasked dialogue information [X, R\R̂] to predict R̂, i.e., we maximize log p(R̂ | X, R\R̂). Besides, we also follow Liang et al. (2021) and consider an entity knowledge bias when decoding.
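The corruption step of masked response prediction can be sketched as below. The 70% rate follows the text; the 90/10 split between [MASK] and a random vocabulary token is an illustrative assumption in the spirit of BERT-style corruption, not a detail stated here:

```python
import random

MASK = "[MASK]"

def mask_response(response_tokens, vocab, p=0.7, rng=None):
    """Corrupt a response for masked response prediction.

    Returns (corrupted token list, indices the model must predict).
    Each position is corrupted with probability p; corrupted slots become
    [MASK] most of the time, otherwise a random vocabulary token.
    """
    rng = rng or random.Random(0)
    corrupted, targets = [], []
    for i, tok in enumerate(response_tokens):
        if rng.random() < p:
            targets.append(i)
            corrupted.append(MASK if rng.random() < 0.9 else rng.choice(vocab))
        else:
            corrupted.append(tok)
    return corrupted, targets
```

The model is then trained to recover the tokens at the returned target indices from the uncorrupted context plus the corrupted response.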
Inspired by recent progress in generative language methods (Dong et al., 2019; Wang et al., 2021), for both types of models we process the encoder input with bi-directional attention while giving the decoder causal attention masks. This masking strategy ensures our models fully understand dialogue contexts and autoregressively generate tokens with the learned knowledge.
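For a shared encoder-decoder sequence of src_len context positions followed by tgt_len response positions, the described masking strategy might look like this (1 = may attend; a sketch, not the authors' implementation):

```python
import numpy as np

def seq2seq_attention_mask(src_len, tgt_len):
    """Bidirectional attention within the multimodal context, causal
    attention over the response, and no context-to-response leakage."""
    n = src_len + tgt_len
    mask = np.zeros((n, n), dtype=int)
    mask[:, :src_len] = 1               # every position sees the full context
    for i in range(src_len, n):         # response rows: causal over response
        mask[i, src_len:i + 1] = 1
    mask[:src_len, src_len:] = 0        # context never peeks at the response
    return mask
```

The upper-left block is fully bidirectional, while the lower-right block is lower-triangular, matching the encoder/decoder split described above.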

Response Generation
For the separate encoder-decoder model, we feed the multimodal information X to the model encoder and autoregressively generate responses from the decoder. For the shared encoder-decoder model, we first encode X with a special token [BOS] appended to it. Then the model starts to generate by appending a [MASK] token to the input and sampling a word from the predicted distribution over the vocabulary. The [MASK] token is then replaced by the generated token, and a new [MASK] is appended to the input sequence for the next prediction. Both generation processes terminate when the model predicts the [EOS] token or reaches the maximum length.
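The shared encoder-decoder decoding loop can be sketched as below; predict_next is a stand-in for the actual dialogue model (it maps the current token sequence, ending in [MASK], to the predicted word):

```python
MASK, BOS, EOS = "[MASK]", "[BOS]", "[EOS]"

def generate_shared(predict_next, context_tokens, max_len=30):
    """Mask-append decoding: append [MASK], let the model fill it,
    commit the prediction, repeat until [EOS] or max_len."""
    seq = list(context_tokens) + [BOS]
    response = []
    while len(response) < max_len:
        tok = predict_next(seq + [MASK])  # model fills the trailing [MASK]
        if tok == EOS:
            break
        seq.append(tok)                   # the [MASK] is replaced by tok
        response.append(tok)
    return response
```

Rebuilding `seq + [MASK]` each step is equivalent to replacing the old [MASK] with the generated token and appending a fresh one.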

Baselines
To verify the advantages of the proposed framework in dataset construction and multimodal dialogue generation, we take the competitive DIALOGPT (Zhang et al., 2020), GPT-2 (Radford et al., 2019), UNILM (Dong et al., 2019), and T5 (Raffel et al., 2020) as baselines.

We observe that fine-tuned GPT-2 and DIALOGPT perform better than our method in PPL on both datasets. This is attributed to their pre-training stage, which is dedicated to directly optimizing generation ability. However, our model achieves better diversity than the baselines, especially our model variants without textual entity input and/or entity-level visual knowledge.
We also present human evaluation results in Table 5, which further justify the outcomes and findings from the automatic metrics above.

Visual Knowledge
We conduct extensive ablation experiments over variants of the input information to better understand their respective roles in the dialogue generation task.
(1) The performance improvement of our model benefits from both aspects of visual knowledge providing external information. (2) Fine-grained (i.e., entity-level) visual information plays a more important role in improving generation performance than turn-level visual knowledge, which underlines the necessity of finding and utilizing fine-grained visual clues. (3) Turn-level images also promote model performance (i.e., "-E." vs. "-E.-T.V."), which is consistent with findings from traditional visual dialogue. (4) However, textual entities bring more performance gain than entity-level visual knowledge. We ascribe this to the model pre-training stage being conducted purely in the language domain, which makes it harder for dialogue models to understand visual information than to acquire knowledge from texts. (5) Introducing visual knowledge improves the quality of generated responses but generally degrades their diversity. This is attributed to the constraints brought by fine-grained visual inputs: they provide the model with explicit visual clues, committing it to specific knowledge at a tolerable sacrifice of text diversity.

Multiple Entity-level Images per Entity
Since we provide a one-to-many mapping between entities in the dialogue context and their corresponding images, we conduct experiments with varied numbers of entity-level images as input. In Table 4, we observe that: (1) adding entity-level images generally benefits the model, thanks to the additional information provided by the extra visual knowledge.
(2) However, giving too many entity-level images can be a showstopper: the model with 5 images per entity generally performs worse. This might be attributed to the plain multimodal infusion method we adopt, under which the model may confuse images that belong to one entity with those of another.
(3) More entity-level images weaken the model's output confidence (higher PPL) yet make generated responses more diverse, with consistently more distinct n-grams (i.e., higher Dist-1/2).

External Document Knowledge
Is the visual knowledge a complement to existing textual knowledge? To answer this question, we conduct experiments on RESEE-WoW with the provided topic passages appended to the input. In Table 6, we observe that (1) our visual knowledge can further boost model performance even with document knowledge present, demonstrating that the evidence provided by visual knowledge is complementary to existing textual knowledge. However, the performance gain from adding documents to the visual models is not as significant as for models without visual knowledge (T5). This indicates that there are certain intersections between the information provided by the two modalities.
(2) Bringing document knowledge to the model greatly improves diversity, because abundant textual information helps models understand dialogues comprehensively and generate diverse responses.

Case Analysis
We exhibit an example of generated responses in Figure 5. As this conversation is about the importance of dress code in interviews, our dataset provides one turn-level image showing a professional person with a suit and a tie, as well as three entities and their corresponding images. Compared with models without visual enhancement, our two models focus more on the provided visual contexts and generate responses that are highly relevant to the dialogue and the reference:

RESEE (SHARE): Yes, I'll try my best to make a good impression on the interviewers.
RESEE (SEP.): I think so. You should be comfortable in your clothes and make a good impression on the interviewer.

For example, our models can produce words that pay more attention to "interviewer" and "clothes", which are missing in the unimodal counterparts. This demonstrates that our datasets provide useful visual information, which the proposed multimodal dialogue system captures and subsequently leverages to generate better responses relevant to the reference. Please refer to Appendix D for more examples.

Related Works
Visual Dialogue Datasets. Images can serve different purposes in a dialogue. Visual dialogue (or visual question answering, VQA) is the task of answering questions about the factual contents of an image in a multi-turn manner. VisDial (Das et al., 2017) consists of one image and about 10 independent question-answer pairs grounded on that image. De Vries et al. (2017) introduced an image-grounded QA dataset with pixel-level object locations. IGC (Mostafazadeh et al., 2017) was constructed from Twitter conversations with (image, description, question-answer) triplets as samples. In visual-enhanced conversational recommendation, MMD (Saha et al., 2018) is a multimodal dataset in a shopping scenario aimed at providing applicable recommendations based on textual conversations as well as images of potential shopping items. MMConv (Liao et al., 2021) covers tourism scenarios across 5 real situations and also provides a knowledge base and a photo gallery of recommended items. Recently, MMDialog (Feng et al., 2022) was proposed with massive multimodal open-domain conversations and associated images derived from social media. IMAD (Viktor and Denis, 2023) was constructed from a massive amount of dialogues, with the last utterance replaced by collected images.
Open-domain Dialogue. Open-domain dialogue models aim at responding to general human-like conversations in various circumstances. While dialogue generation has a rich history, the area has made significant progress with the rise of pre-trained models in varied linguistic domains (Zhang et al., 2020; Mi et al., 2022; Zhu et al., 2023b; Touvron et al., 2023b). The introduction of external knowledge plays a vital role in turning traditional models into knowledgeable dialogue agents. For example, Wu et al. (2021) leveraged three domains of knowledge to enhance model performance in Chinese contexts. Wang et al. (2022) employed an extra retrieval process to find knowledgeable evidence as input to enlarge dialogue model capacity. Recent works focus on efficient knowledge integration, such as retrieval-free approaches (Wang et al., 2023a) and few-shot prompting (Wang et al., 2023b). Moreover, visual knowledge has also recently been considered to boost the performance of dialogue models. Multi-Modal BLENDER (Shuster et al., 2021) was pre-trained on large-scale visual question-answer datasets for image-grounded conversation. Liang et al. (2021) introduced a method to pair conversations with a picture as external knowledge. Shen et al. (2021) extended the visual augmentation to the token level, providing versatile visual information to the model. Most recently, with the emergence and wide adoption of large language models (LLMs) such as GPT-3 (Brown et al., 2020) and LLAMA (Touvron et al., 2023a,b), more and more works incorporate LLMs as their text generation backbone and achieve exceptional performance on open-domain dialogue tasks (Zhu et al., 2023a; Liu et al., 2023; Ye et al., 2023; Dai et al., 2023).

Conclusion
In this paper, we present a paradigm for multimodal dialogue construction with two novel datasets and a multimodal dialogue responding framework RESEE. We explicitly separate the visual knowledge into two aspects, using online searching or retrieval from large image corpora to construct accurate and diverse visual knowledge. Transformer-based dialogue models with shared and separate encoder-decoders verify that the provided visual knowledge promotes model capacity. Further, we explore feeding multiple entity-level images and external document knowledge into the models. By providing fine-grained visual knowledge for dialogues, we demonstrate that dialogue models can achieve substantially better performance across different setups and domains.
Limitations

(1) The provided datasets are auto-constructed, meaning visual biases brought by online searching are inevitable. We plan to make the dataset more accurate and to include more visual knowledge (e.g., visual knowledge from the external document knowledge in WoW) in our multimodal dialogues.
(2) For now, we did not consider a one-to-one mapping between textual entities and entity images in the model input; more sophisticated relations could be introduced for better modal interaction and modeling. (3) Our framework offers a novel way to enhance text-only dialogue systems by adding extra information from a multimodal perspective. However, this comes at the cost of extra computational overhead for learning visual knowledge.

Ethics Statement
We are aware that automatic dialogue generation may create deceptive, harmful, or objectionable content due to internal biases (Curry and Rieser, 2018; Gehman et al., 2020). These biases are usually inherited from the training data itself. Since our dataset construction is entirely based on existing text-only dialogues, our RESEE framework can also be used to help mitigate such biases. For instance, one of our future directions is to apply the proposed multimodal data collection method to detoxification dialogues (e.g., The Moral Integrity Corpus (Ziems et al., 2022)) for building safer and better dialogue agents.
We are well aware that the online search process for entity-level images may introduce biases (e.g., gender, race) into our constructed dataset. To mitigate this, we collect multiple images from the internet for each entity in the dialogues (see Appendix B for statistical details of our datasets), so that the model can choose from more than one specific image during training. For the licenses of images, the other dialogue data we employ, and the constructed datasets that are about to be released, please refer to Appendix A.1 for more details.

We use the AdamW optimizer (Loshchilov and Hutter, 2017) with the learning rate linearly increasing from 0 to 0.005 for the first 20% of training steps, then linearly decreasing to 0. We train the model until it makes no progress on the validation set (the valid unseen set for RESEE-WoW). All experiments are conducted on two NVIDIA TITAN GPUs with 24G memory in total; training takes around 12 hours on RESEE-WoW and 7 hours on RESEE-DD.

B.1 Dataset Statistics
First of all, for the two text-only datasets we employ: the WoW dataset is under an MIT License and is publicly available at https://parl.ai/projects/wizard_of_wikipedia/; the DD dataset is licensed under CC BY-NC-SA 4.0 and can be obtained from http://yanran.li/dailydialog. We present detailed dialogue dataset information, including the unique turn-level image number, the unique entity-level image number, turn- and entity-level images averaged over a dialogue session, and the average number of images belonging to one entity, in Table 7. We also show the relationship between entity number per dialogue session and dialogue session number in Figure 6, and the distribution of how many examples there are for each (n entity-level images, m turn-level images) setting in Figure 7. From these distribution figures, we can tell that the RESEE-WoW dataset has more concentrated pairs of turn-level and entity-level image numbers, while the range of entity-level image numbers in RESEE-DD is wider.

B.2 Multimodal Examples
We present sampled examples from our constructed datasets RESEE-WoW and RESEE-DD in Figure 8.
From these examples, we can clearly see how the visual knowledge enhances dialogue understanding, both by grounding named entities and by enriching impressions of regular nouns. For instance, Ikebana is a proper noun in the dialogue; the model would never know what it looks like from just reading the dialogue contexts. The entity-level image, however, provides the model with a straightforward way to access the related visual knowledge. Another example shows that images corresponding to abstract nouns such as love can provide an ambiance of romance for models, which may strengthen the model's understanding of dialogue histories and further assist it in producing high-quality responses.

C Baseline Details
We present the implementation details of several baselines. We took the pre-trained weights from Huggingface for the GPT-2 and DIALOGPT models. For both models, we used their 24-layer versions to make fair comparisons with the rest of the methods. We used the Adam optimizer (Kingma and Ba, 2014) with the learning rate increasing from 0 to 0.001 for the first 20% of iterations for both GPT-2 and DIALOGPT. We truncate the input data to a fixed length of 250 and ensure that the length of every generated response is no more than 30. We train the two models on the two datasets until they make no progress on the validation sets, which takes around 3 epochs. All baselines are trained on the same machine as RESEE with two NVIDIA TITAN GPUs.

D Additional Qualitative Results
We also present more generated examples from our RESEE models as well as several baseline dialogue models in Figures 9, 10, and 11. From these qualitative results, we can conclude that our RESEE method better understands the given dialogue contexts with the enhanced visual knowledge and hence generates responses of higher quality.

E Human Evaluation
For annotators, we hired three undergraduate students from the United States or China with fluent English reading skills. Each annotator was assigned 100 (instances) × 6 (models) × 4 (aspects) = 2,400 rating tasks, resulting in 2,400 (tasks) × 3 (annotators) = 7,200 human ratings in total. The annotators acknowledged the use of the annotated datasets and were paid an average annotation salary. All annotators were aware of the potential risks and ethical concerns of machine-generated texts.
Annotation Instruction. Here we present the human evaluation standard:

Fluency:
1. The system's result does not make sense and is unreadable.
2. Choose this score when you are hesitant between score 1 and score 3.
3. The system's result contains minor errors but they do not affect your understanding.
4. Choose this score when you are hesitant between score 3 and score 5.
5. The system's result is human-like, grammatically correct, and very easy to understand.

Informativeness:
1. The system's result is dull, repetitive, and does not contain new information.
2. Choose this score when you are hesitant between score 1 and score 3.
3. The system's result contains some new information and displays a certain level of diversity.
4. Choose this score when you are hesitant between score 3 and score 5.
5. The system's result is very informative and contains novel content. In addition, it displays a high level of diversity and is enjoyable to read.

Relevance:
1. The system's result is completely irrelevant to the given reference.
2. Choose this score when you are hesitant between score 1 and score 3.
3. The system's result is partially related to the reference and some of its content can be found in the reference.
4. Choose this score when you are hesitant between score 3 and score 5.
5. The system's result is very related to the given reference and contains a diverse set of concepts from the reference.
Make Sense:
• YES: the response is completely reasonable in context.
• NO: the response is confusing, illogical, out of context, or factually wrong.

Being Specific
• YES: the response is specific to the given context.
• NO: the response could be used in dozens of different contexts.

Figure 1: Traditional visual dialogue (left) is grounded on a single given picture, while the proposed multimodal dialogue (right) provides both Turn-Level and Entity-Level images based on text-only dialogue data.
Figure 2: Samples of entity images of Ikebana and flower from searching the internet vs. retrieving from limited image-caption data. Images from the internet are more accurate and diverse compared to the retrieved counterparts.
We further demonstrate (1) the model performance gains brought by different visual knowledge, (2) the model performance with increased visual knowledge volumes, and (3) the relation between the proposed visual knowledge and the conventional document knowledge.

Contributions. (1) We provide a new paradigm to construct multimodal dialogue data and two datasets based on it. A comparison between ours and other multimodal dialogue datasets is in Table 1. (2) We propose a simple yet effective multimodal dialogue framework RESEE, which utilizes visual knowledge to generate informative and plausible responses. (3) Extensive experiments and promising results on two constructed datasets justify the effectiveness of our dialogue framework.

Figure 3: Data processing and construction of our dataset RESEE-WoW using one example from WoW.
(Figure 5 content) Dialogue context: "... you mean I ought to wear the right clothes at the right time." "You got it. Only in this way can you gain the respect of the interviewer and his confidence in your judgement. It may not be true, but the first and lasting impression of you is determined by your clothes and behavior." "... that's the job." Baseline responses — DIALOGPT: I think so. I think I'll wear a suit and tie. UNILM: I'll try to make sure that you don't lose your interviewers. T5: I agree with you.

Figure 5: An example of responses generated by our models and baselines. Highlighted words overlap with entities in the dialogue context or the response reference.

Figure 6: Data distribution of entities per dialogue session on the two datasets. The X axis represents the entity number, while the Y axis represents the dialogue session number.

Figure 7: Distribution of turn-level and entity-level image numbers on the two datasets. We use a logarithm function to normalize the number of samples with varied turn-level and entity-level images and indicate their values with a color bar.

Figure 8: Dataset sample for one dialogue turn in our multimodal datasets. Pictures indicated by dashed lines are entity-level images, while the one indicated by the solid line is the turn-level image for one instance.

Table 1: Statistics and comparison of our ReSee datasets against existing multimodal dialogue datasets. "Avg. Img." is the average number of images per dialogue session and "Img. Level" is the image granularity.

Table 2: Image-caption pairs in four existing datasets for turn-level image retrieval, and the percentage (%) of retrieved images for the two datasets (WoW, DD).

Table 3: Results of the proposed dialogue models on RESEE-WoW (test unseen set) and RESEE-DD (test set). Models without textual entity, turn-level, or entity-level visual knowledge as inputs are appended with "-E.", "-T.V.", and "-E.V." respectively. Our dialogue framework RESEE with shared and separate encoder-decoder is appended with (SHARE) and (SEP.) respectively. We mark the best result in bold face and the second best with underline.

Table 4: Model performance with varied image numbers per entity during training ("n E.V.") on RESEE-DD.

Table 5: Human evaluation results.

Table 7: Statistics of the two constructed multimodal dialogue datasets. We present the unique entity-level image count as well as the unique image count of the 5 most similar images in turn-level visual data retrieval. The average, maximum, and minimum numbers of images are based on one dialogue session. We also present the average number of searched valid pictures for every entity in the last row.