Maria: A Visual Experience Powered Conversational Agent

Arguably, visual perception of the physical world is a key way for conversational agents to exhibit human-like intelligence. Image-grounded conversation has thus been proposed to address this challenge. Existing works focus on multimodal dialog models that ground the conversation on a given image. In this paper, we take a step further and study image-grounded conversation in a fully open-ended setting where no paired dialog and image are assumed available. Specifically, we present Maria, a neural conversational agent powered by visual world experiences that are retrieved from a large-scale image index. Maria consists of three flexible components: a text-to-image retriever, a visual concept detector, and a visual-knowledge-grounded response generator. The retriever retrieves an image correlated with the dialog from an image index, while the visual concept detector extracts rich visual knowledge from the image. The response generator is then grounded on the extracted visual knowledge and the dialog context to generate the target response. Extensive experiments demonstrate that Maria outperforms previous state-of-the-art methods on automatic metrics and human evaluation, and can generate informative responses that exhibit visual commonsense about the physical world.


Introduction
Building intelligent conversational agents that can not only converse freely with humans but also perceive the physical world has been one of the longest-standing goals of natural language processing (NLP) and artificial intelligence (AI). Although recent large-scale conversation models trained on text-only corpora, such as Meena (Adiwardana et al., 2020), Blender (Roller et al., 2020) and DialoGPT, have shown compelling performance, they still lack the ability to perceive our physical world. A recent study (Bisk et al., 2020) points out that successful linguistic communication relies on a shared experience of the world that makes language truly meaningful. Visual perception is a rich signal for modeling a vastness of experiences in the world that cannot be documented by text alone (Harnad, 1990). On the other hand, human-human conversations involve an understanding of context, shared background knowledge, and, perhaps most importantly, shared experiences of the world, e.g., what the speakers have seen before. Figure 1 shows a conversation between humans: Human-A recalls his/her past experience of playing volleyball or having a BBQ on the beach when Human-B talks about a vacation on a beach in Hawaii. However, the association between beach and volleyball (or BBQ) is hard to capture in traditional knowledge bases such as knowledge graphs. Motivated by this, we select a common word, "pizza", and collect the top 17 words that most frequently co-occur with "pizza" on the Google Knowledge Graph 1 and in MS-COCO images 2 (Lin et al., 2014). As shown in Figure 2, the words co-occurring with "pizza" on the knowledge graph tend to be abstract concepts, while the co-occurrence of object tags in images reflects some commonsense about our physical world, e.g., "pizza" is usually on a "dining table", and people usually use a "knife" when eating "pizza".
Interestingly, we find that "pizza" also co-occurs with "cell phone" and even "potted plant". This indicates that when people eat pizza, they sometimes put their cell phones aside on the table, or there may be potted plants in the restaurant. Thus, empowering conversational agents with visual perception of the physical world is a key way for them to exhibit human-like intelligence.
Existing works (Mostafazadeh et al., 2017; Huber et al., 2018) focus on exploring multimodal dialog models that ground the conversation on a given image. Recent work proposes to learn the dialog generation model from both image-grounded dialogs and textual dialogs by resorting to text-to-image synthesis techniques (Qiao et al., 2019) to restore a latent image for the text-only dialog. Even so, these works are still constrained by the assumption that the dialog is conducted around a given (or synthesized) image.
In this paper, we take a step further and extend image-grounded conversation to a fully open-ended setting where no image-dialog pairs are assumed available. Specifically, we present Maria, a neural conversational agent powered by visual world experiences retrieved from a pre-built image index, e.g., the

1 https://developers.google.com/knowledge-graph/
2 We calculate the co-occurrence distribution of object tags from the images in the MS-COCO dataset. More examples can be found in the Appendices.
Open Images Dataset (Kuznetsova et al., 2018). Maria consists of three components: a text-to-image retriever, a visual concept detector, and a visual-knowledge-grounded response generator. The retriever is responsible for retrieving a piece of visual world experience, e.g., an image correlated with the dialog, from an image index. The visual concept detector utilizes the object detector from UpDown (Anderson et al., 2018) to extract region features (i.e., bboxes) and the corresponding visual concepts (i.e., tags) from the retrieved images. Hence, we can construct (bboxes, tags, context, response) 4-tuples as the training data. Finally, these constructed 4-tuples are used to train the visual-knowledge-grounded response generator, which is built on top of a multi-layer Transformer architecture (Vaswani et al., 2017). To effectively inject the visual knowledge into the response generator, we employ Masked Concept Prediction and a Visual Knowledge Bias in addition to the response generation objective. The former aims to align the semantic representations of textual words and image regions, while the latter provides additional visual knowledge to facilitate dialog generation. Experimental results on the Reddit Conversation Corpus (Dziri et al., 2019a) demonstrate that Maria significantly outperforms previous state-of-the-art methods and can generate informative responses with visual commonsense about our physical world.
Overall, the contributions of this paper are summarized as follows:

• We explore the task of image-grounded dialog generation under a fully open-ended setting where no specific image-dialog pairs are assumed available, i.e., zero-resource image-grounded conversation. To the best of our knowledge, this is the first work to connect a dialog corpus with unpaired image data;

• We present Maria, a neural conversational agent consisting of three flexible components, which can effectively capture visual commonsense from images and accordingly generate informative and vivid responses;

• Extensive experiments on the widely used Reddit Conversation Corpus are conducted to justify the effectiveness of Maria.

Related Work
Vision and Language In the research on vision and language, various tasks have been extensively studied, such as image captioning (Lu et al., 2017), visual question answering (Antol et al., 2015; Anderson et al., 2018), and visual dialog (Das et al., 2017a,b). Popular benchmark datasets in this area include MS-COCO (Lin et al., 2014), VisDial (Das et al., 2017a) and Visual Genome (Krishna et al., 2017). Visual dialog is a task that answers questions about the factual content of an image in a multi-turn manner. In contrast, image-grounded conversation studies how to reply to a dialog context and a given image with proper responses in an open-ended way.
Dialog Generation Encouraged by the success of the neural sequence-to-sequence architecture (Sutskever et al., 2014) on machine translation, end-to-end neural approaches to open-domain dialog generation (Vinyals and Le, 2015; Shang et al., 2015; Serban et al., 2016; Sordoni et al., 2015; Xing et al., 2017; Wu et al., 2018; Adiwardana et al., 2020) have been widely studied in the literature. Recently, there has been an emerging trend towards grounding dialog generation models on external knowledge, such as knowledge graphs (Zhou et al., 2018), documents (Ghazvininejad et al., 2018; Dinan et al., 2019; Kim et al., 2020; Zhao et al., 2020a,b; Li et al., 2020) and images (Mostafazadeh et al., 2017). Different from previous work on knowledge-grounded conversation that connects dialogs with unpaired document knowledge, our work lies in the research area of image-grounded conversation, where a response is generated from a dialog context and a given image. Existing works (Mostafazadeh et al., 2017) in this direction assume there is a given (or synthesized) image for the dialog and explore multimodal dialog models. In contrast to these works, we study image-grounded conversation under a fully open-ended assumption where no paired dialog and image are assumed available, i.e., zero-resource image-grounded conversation.

Problem Formalization
Suppose we have a dialog set D = {(C_i, R_i)}^n_{i=1}, where ∀i ∈ {1, . . . , n}, C_i refers to a dialog context and R_i is a response to C_i. We assume there is a set of images V = {V_j}^m_{j=1}, where ∀j ∈ {1, . . . , m}, V_j denotes an image. ∀(C, R) ∈ D, we assume there is an image V that is triggered by the given dialog context C and response R. Our goal is to estimate a generation model P(R|V, C) from D and V. Thus, given a new dialog context C associated with an image V, the model can generate a response R according to P(R|V, C).

Methodology
To learn such a generation model P(R|V, C), we need to tackle several challenges: (1) how to bridge the gap between the unpaired dialog corpus and image data; (2) after obtaining the correlated images, how to extract detailed visual features and concepts; (3) how to effectively inject the visual knowledge into the response generator and enable it to generate visual-knowledge-grounded responses. Figure 3 illustrates the framework of our approach. We first build a large-scale image dataset and leverage a cross-modal matching model to retrieve an image correlated with the content of the dialog. An off-the-shelf object detector is then applied to extract object features and visual concepts from the retrieved image. Finally, the response generator is trained to generate the target response conditioned on the context, the extracted object features, and the visual concepts. In the rest of this section, we elaborate on these three modules.

Text-to-Image Retriever
In this section, we develop a retrieval model that assigns each dialog a correlated image V. Specifically, we train a text-to-image matching model on an image captioning dataset and utilize it to construct the (C, R, V) triple data.
Modeling To improve the efficiency of the cross-modal retrieval model on a large-scale dialog corpus and image dataset, we adopt a two-tower architecture (Lu et al., 2019) to accelerate the retrieval process, where the image features can be pre-extracted offline. The model takes a sentence T and an image V as input and predicts a relevance score s(T, V) between the sentence and the image. We use a text encoder and an image encoder to produce the representations of T and V, respectively. The text encoder is a pre-trained BERT-base model (Devlin et al., 2019), and we use the hidden state of the special token [CLS] as the embedding of T. A Multi-Layer Perceptron (MLP) then projects the sentence embedding into the cross-modal space. We follow Tan and Bansal (2020) in performing L2-normalization on the final output features, by which we can reduce the nearest neighbor search problem in Euclidean space to the Maximum Inner Product problem (Mussmann and Ermon, 2016):

f_t(T) = Norm(MLP_t(BERT(T))).

Similarly, the image encoder is composed of a pre-trained ResNeXt backbone (Xie et al., 2017) and an MLP with L2 normalization:

f_v(V) = Norm(MLP_v(ResNeXt(V))).

Thus, we define the relevance score s(T, V) as the inner product of the language representation f_t(T) and the image representation f_v(V):

s(T, V) = ⟨f_t(T), f_v(V)⟩.

Training We train the cross-modal matching model on the MS-COCO image captioning dataset (Lin et al., 2014), where each image is paired with 5 sentences describing its visual content. The model is optimized by minimizing a hinge loss so that the relevance score s(T, V) of a positive image-sentence pair is larger than that of a negative pair s(T, V−) by at least a margin M:

L_hinge = max(0, M − s(T, V) + s(T, V−)).

Inference Given the trained retrieval model, we can now assign each dialog a correlated image V. To ensure the diversity and richness of the retrieval results, we fetch 500,000 images from the large-scale Open Images dataset (Kuznetsova et al., 2018) as our image set V.
The image V_i ∈ V with the maximum relevance score is paired with the given dialog (C_i, R_i) ∈ D. Note that for dialogs in the training set, the context C and the response R are concatenated as the query for retrieval (i.e., T = (C, R)), which is beneficial for retrieving an image with related visual knowledge. For the validation/test sets of the dialog corpus, by contrast, the query is the context only (i.e., T = C), so as to stay consistent with the real-world setting where the response is unavailable and needs to be generated at inference time.
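The two-tower scoring and hinge objective described above can be sketched as follows. This is a minimal numpy illustration: in the paper the embeddings come from BERT/ResNeXt followed by learned MLPs, while here they are plain arrays, and the in-batch treatment of off-diagonal pairs as negatives is our assumption.

```python
import numpy as np

def l2_normalize(x):
    """Project embeddings onto the unit sphere so that inner product
    equals cosine similarity (enabling maximum inner product search)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def relevance_scores(text_emb, image_emb):
    """Pairwise relevance s(T, V) between every sentence and every image."""
    return l2_normalize(text_emb) @ l2_normalize(image_emb).T  # (B, B)

def hinge_loss(scores, margin=0.5):
    """Positive pairs lie on the diagonal; off-diagonal entries act as
    negatives. Penalize any negative within `margin` of its positive."""
    pos = np.diag(scores)[:, None]                 # s(T, V)
    cost = np.maximum(0.0, margin - pos + scores)  # max(0, M - s(T,V) + s(T,V^-))
    np.fill_diagonal(cost, 0.0)                    # skip the positive itself
    return cost.mean()
```

With well-separated (orthogonal) embeddings the loss collapses to zero, which matches the intent of the margin formulation: negatives only contribute once they come within M of the positive score.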

Visual Concept Detector
Given the image V_i correlated with the dialog as a visual clue, we can now extract visual knowledge from it. One naive approach is to utilize CNN-based models to extract latent image features. However, this approach does not provide fine-grained representations of the image, which are crucial for the dialog model to understand local visual features. To address this issue, we adopt an object detection model (Anderson et al., 2018) pre-trained on Visual Genome (Krishna et al., 2017) to extract a set of salient object features O = {o_k}^K_{k=1}, where each object feature o_k is a 2048-dimensional vector. These features represent the image at the level of objects and other salient regions, which have proven to be vital in many high-level image understanding tasks. Besides, the same detector is used to extract a set of visual concepts Q = {q_m}^K_{m=1}, where each concept q_m is the high-precision textual label of a visual region, e.g., "sunset", "melon", etc. In this manner, we simultaneously obtain fine-grained image representations and the visual concepts necessary for the subsequent dialog generation.
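The detector output and the resulting training tuple can be sketched as follows. This is a minimal illustration of the data layout only; the class and field names are our own, not from the paper's code, and the detector itself is abstracted away.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class VisualKnowledge:
    """Detector output for one retrieved image: K salient region features
    O = {o_k} and the K textual labels (visual concepts) Q = {q_m}."""
    features: np.ndarray  # (K, 2048) region features, one vector per bbox
    concepts: list        # K concept strings, e.g. ["sunset", "melon"]

def build_training_tuple(visual, context, response):
    """Assemble the (bboxes, tags, context, response) 4-tuple used to
    train the response generator."""
    assert visual.features.shape[0] == len(visual.concepts)
    return (visual.features, visual.concepts, context, response)
```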

Visual-Knowledge-Grounded Response Generator
In this section, we propose a unified architecture to effectively inject a set of region features and the corresponding visual concepts into the response generation model. In the following parts, we describe the model design and training objectives in detail. Figure 4 shows the architecture of our response generation model, a multi-layer transformer network that performs both bidirectional vision/context (O, Q, C) encoding and unidirectional response R decoding, via flexible self-attention masks inspired by Dong et al. (2019).

Input Representation
For each token, the final input representation to the multi-layer transformer network is the element-wise summation of four kinds of embeddings: token-level, turn-level, position-level, and segment-level. We then concatenate all the input representations into one sequence for model training.
Token-Level The token-level embeddings are the concatenation of (O_w, Q_w, C_w, R_w), which denote the token embedding sequences of visual objects, visual concepts, contexts and responses, respectively. Note that O_w is the object embedding transformed by a linear layer into the same dimension as the word embeddings.

Turn-Level
Since the dialog is multi-turn, we encode the turn order with relative turn embeddings (Bao et al., 2020). Specifically, the turn number is counted from the last utterance of the dialog back to the beginning. As for the tokens corresponding to O and Q, we simply assign them the same turn number as the first utterance of C.
Position-Level Positional embeddings encode the order of tokens in the overall input sequence, the same as the positional encoding of the original Transformer (Vaswani et al., 2017).

Segment-Level
Segment embeddings are employed to differentiate which segment a token is in, i.e., O, Q, C or R.
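The four embeddings are summed element-wise per token. A minimal sketch with toy, randomly initialized tables follows; in the model these tables are learned parameters and the hidden size is 768, while the sizes and segment-id convention below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                        # toy hidden size (768 in the paper)
vocab, n_turns, max_pos, n_segs = 100, 4, 32, 4

tok_table = rng.normal(size=(vocab, D))      # token-level
turn_table = rng.normal(size=(n_turns, D))   # turn-level (relative turn id)
pos_table = rng.normal(size=(max_pos, D))    # position-level
seg_table = rng.normal(size=(n_segs, D))     # segment-level: O=0, Q=1, C=2, R=3

def input_representation(token_ids, turn_ids, seg_ids):
    """Element-wise sum of the four embeddings for the concatenated
    (O, Q, C, R) input sequence."""
    pos_ids = np.arange(len(token_ids))
    return (tok_table[token_ids] + turn_table[turn_ids]
            + pos_table[pos_ids] + seg_table[seg_ids])
```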

Masked Concept Prediction
Due to the inherent gap between the visual and textual modalities, directly optimizing the model with the response generation objective alone may result in insufficient utilization of the visual knowledge. To align the semantic representations of the two modalities, we devise the Masked Concept Prediction (MCP) objective. In each training instance, 15% of the visual concepts are randomly replaced with [MASK] tokens, which the model must predict. However, one problem remains: the visual concepts have no specific order when extracted from images. In other words, we need to model MCP as a set matching problem, which should not depend on the order of the predicted concepts when two or more concepts are masked out simultaneously. To tackle this, we adopt the Hungarian Matching Loss (Stewart et al., 2016; Carion et al., 2020) to estimate an optimal mapping α so that the prediction at each masked position is assigned one of the target concepts. We denote the set of all inputs as X = (O, Q, C, R), the bidirectional self-attention part of X as B = (O, Q, C), the set of masked concepts as Q̂, the set of unmasked tokens as B\Q̂, and the prediction probabilities at the masked positions in the final transformer layer as H = (h_1, . . . , h_{|Q̂|}), where h_i is the probability distribution for the i-th masked position. Hence, the MCP loss can be defined as:

L_MCP = − Σ^{|Q̂|}_{i=1} log h_i(q_{α(i)} | B\Q̂),

where α(i) is the index of the target concept assigned to the i-th prediction. When predicting a masked concept, the model has to resort to the visual region features, the dialog context and the other unmasked visual concepts. This helps the model align the cross-modal representations between text and visual regions.
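The set matching over masked positions can be sketched by searching for the assignment α that minimizes the total negative log-likelihood. We brute-force the assignment here since only a handful of concepts are masked per instance; scipy's `linear_sum_assignment` would be the scalable choice, and the function below is our own simplified stand-in, not the paper's implementation.

```python
from itertools import permutations

import numpy as np

def mcp_loss(pred_probs, target_ids):
    """Hungarian-matched masked-concept loss.

    pred_probs: (n, vocab) probability distributions h_i, one per masked position
    target_ids: the n target concept token ids (treated as an unordered set)
    """
    n = len(target_ids)
    nll = -np.log(pred_probs[:, target_ids] + 1e-12)   # (n, n) cost matrix
    # Optimal mapping alpha: assign each prediction exactly one target concept.
    best = min(sum(nll[i, perm[i]] for i in range(n))
               for perm in permutations(range(n)))
    return best / n
```

For example, if position 0 assigns high probability to the second target concept and position 1 to the first, the matched loss uses the swapped assignment rather than penalizing the model for predicting the concepts in a different order.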

Masked Response Prediction
Encouraged by the success of UniLM (Dong et al., 2019) on Seq2Seq tasks, we adopt the Masked Response Prediction (MRP) objective to model response generation. During training, 70% of the tokens in R are randomly masked with the special token [MASK], and the model is optimized to recover them. The masked response tokens and the other unmasked tokens in the whole input sequence are denoted as R̂ and X\R̂, respectively. Letting p_i be the conditional probability distribution of the i-th masked token in R, the MRP loss is the Negative Log-Likelihood (NLL) of the masked response tokens:

L_MRP = − Σ_{r_i ∈ R̂} log p_i(r_i | X\R̂). (7)

Note that the self-attention mask within R is left-to-right, while the rest is bidirectional. In other words, the tokens in O, Q and C can attend to each other from both directions, while a token in R can attend to all tokens in O, Q, C and to the leftward tokens in R, including itself. MRP implicitly encourages the model to generate responses by learning the relationships among all input tokens. For decoding, we first encode the image regions, visual concepts, dialog context, and a special token [BOS] as input. The model then starts generation by feeding a [MASK] token and sampling a word from the predicted distribution over the vocabulary. The [MASK] token is then replaced by the generated token, and a new [MASK] is appended to the input sequence for the next word prediction. The generation process terminates when the model predicts the [EOS] token or reaches the pre-defined maximum length.
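The mask-and-append decoding loop can be sketched as follows. The model is abstracted as a `predict_next` callable so that the control flow stands alone; in Maria it would be the transformer's prediction at the trailing [MASK] position, and the token-id conventions here are placeholders.

```python
def generate(predict_next, prefix, bos, mask, eos, max_len=40):
    """Incremental decoding: append [MASK], predict a token for it, replace
    the [MASK] with that token, and repeat until [EOS] or max_len.

    predict_next(seq) -> token id predicted at the trailing [MASK].
    prefix holds the encoded (O, Q, C) tokens.
    """
    seq = list(prefix) + [bos]
    response = []
    for _ in range(max_len):
        token = predict_next(seq + [mask])  # predict at the [MASK] slot
        if token == eos:
            break
        seq.append(token)                   # [MASK] replaced by the token
        response.append(token)
    return response
```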
Visual Knowledge Bias Normally, the top projection layer of a generation model produces a probability distribution over the vocabulary:

p = softmax(W e_r + b),

where e_r ∈ R^d is the last output of the transformer network, and W ∈ R^{|V|×d} and b ∈ R^{|V|} are the weight and bias parameters of the decoding head, respectively; |V| denotes the vocabulary size. So far, visual world knowledge has been introduced into the response generation model only through the shared-parameter self-attention layers. To inject the visual knowledge further, we design a simple but effective strategy, namely the Visual Knowledge Bias (VKB). Concretely, an additional visual vocabulary bias b^q is first calculated as:

b^q = F_q(e^q_avg),

where F_q : R^d → R^{|V|} is a projection layer and e^q_avg denotes average pooling over all hidden representations of the visual concepts, i.e., e^q_avg = AvgPooling(E^q) with E^q = (e^q_1, ..., e^q_K). We then mask the non-visual-concept tokens in the vocabulary, and the masked vocabulary bias b̃^q ∈ R^{|V|} is added to the top layer of the generation model to obtain the final distribution over the vocabulary:

p = softmax(W e_r + b + b̃^q).

We use this final vocabulary distribution to calculate the MRP loss in Eq. 7 and optimize the model. The visual knowledge bias encourages the model to generate more visual-knowledge-related tokens in the response. To sum up, the final objective of our response generation model is to minimize the integrated loss:

L = L_MCP + L_MRP.

Experimental Setup

Datasets
To evaluate the performance of Maria, we conduct comprehensive experiments on the Reddit dataset, which consists of large-scale, high-quality multi-turn conversations extracted from the Reddit Conversation Corpus (Dziri et al., 2019b). Each dialog has 3 to 5 utterances, and the training/validation/test sets contain 1M/20K/20K dialogs, respectively. We train and validate the retrieval model using the Karpathy split 3 of the MS-COCO image captioning data, where the images are split into 113.2K/5K/5K samples for training/validation/test, respectively. After the retrieval model is trained, we fetch 500K images from the Open Images dataset as the image index, and then retrieve images from it using the dialog context and response to construct the training data for the response generator.

Evaluation Metrics
Both automatic metrics and human evaluation are employed to assess the performance of Maria and the baselines. The automatic metrics include: (1) Fluency: perplexity (PPL), which measures the confidence of the generated responses; (2) Relevance: BLEU-1 (Papineni et al., 2002), Rouge-L (Lin, 2004), and, following Serban et al. (2017), Embedding Average cosine similarity, Vector Extrema cosine similarity, and Embedding Greedy Matching score. All these metrics are calculated with the public NLG evaluation script 4; (3) Diversity: Distinct-1 (Dist-1) and Distinct-2 (Dist-2) (Li et al., 2016), defined as the number of distinct uni-grams or bi-grams divided by the total number of words.
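The diversity metrics are straightforward to compute from the definition above; a minimal sketch (whitespace tokenization is our simplifying assumption):

```python
def distinct_n(responses, n):
    """Dist-n: number of distinct n-grams divided by the total number of
    words over all generated responses (Li et al., 2016)."""
    ngrams, total_words = set(), 0
    for response in responses:
        tokens = response.split()
        total_words += len(tokens)
        ngrams.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(total_words, 1)
```

Repetitive outputs drive the score down: for the responses ["a b a b", "a b"] there are only two distinct uni-grams over six words, so Dist-1 = 1/3.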
In the human evaluation, we randomly select 100 dialog contexts and the corresponding responses generated by Maria and the compared baselines. Three human annotators are asked to score response quality on a scale of {0, 1, 2} along three aspects: Fluency, Relevance and Richness; a higher score means better quality. Since each response receives 3 scores per aspect, we report the average scores over annotators and responses. Inter-annotator agreement is measured by Fleiss' Kappa (Fleiss and Cohen, 1973).

Implementation Details
For the retrieval model, the ResNeXt-101-32x8d feature is used as the visual embedding, while the concatenation of the last 4 layers of BERT's outputs is used as the textual embedding. Each embedding is then fed into an MLP composed of three layers of sizes (1024, 1024, 512). When training the retrieval model, we set the margin M = 0.5 for the hinge loss and tune only the parameters of the two MLPs, freezing the parameters of ResNeXt and BERT. The model is trained for 20 epochs. At inference, the FAISS (Johnson et al., 2019) library is utilized to accelerate the inner product search through batch processing. We use the off-the-shelf object detector from UpDown (Anderson et al., 2018) to extract the top-k (k=36) image region features and the corresponding visual concepts. The detector is a Faster R-CNN (Ren et al., 2015) model trained on the Visual Genome dataset (Krishna et al., 2017).
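The inner-product search that FAISS accelerates can be sketched in brute force as follows (a numpy stand-in of our own; with L2-normalized embeddings the inner product is exactly the retrieval model's relevance score):

```python
import numpy as np

def top_k_inner_product(query_emb, index_embs, k=5):
    """Brute-force maximum inner product search over a pre-extracted image
    index; FAISS performs the same search with optimized index structures
    and batching."""
    scores = index_embs @ query_emb          # (N,) relevance per indexed image
    top = np.argsort(-scores)[:k]            # indices of the k best images
    return top, scores[top]
```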
For the response generation model, we set the number of transformer layers L = 12 and the hidden embedding dimension D = 768. The network parameters are initialized from UniLM. The maximum sequence lengths of context and response are set to 110 and 40, respectively, and the sequence lengths of region features and concept tokens are both set to 36. The batch size is 64. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 3e-5 to train the response generation model. Training is conducted on 4 Nvidia Tesla P40 24G GPU cards for 20 epochs.

Baselines
We compare against the following baselines in the experiments: (1) Seq2Seq: a standard sequence-to-sequence model with attention mechanism (Bahdanau et al., 2015). (2)–(4): dialog generation models that achieve state-of-the-art performance on standard dialog generation benchmarks. (5) ImgVAE: a dialog generation model trained on both textual dialogs and image-grounded dialogs by recovering a latent image behind the textual dialog within a conditional variational autoencoding framework. (6) DialoGPT: an open-domain dialog model that fine-tunes GPT-2 (Radford et al., 2019) on massive Reddit data. Since DialoGPT is trained on a text-only corpus, we include it as an auxiliary baseline. For a fair comparison, we choose the DialoGPT model size (L=12, D=768; 117M parameters) that matches our model.

Automatic and Human Evaluations
We summarize the results of the automatic evaluations in Table 1. Maria achieves substantial improvements over the baselines on all metrics, except in comparison to DialoGPT. In particular, Maria significantly surpasses ImgVAE on Dist-1/2, which indicates that introducing richer visual knowledge, i.e., image region features and the corresponding visual concepts, is beneficial for generating more diverse and informative responses. This is also reflected in the human evaluation in Table 2, where the richness score of Maria is higher than that of ImgVAE. Besides, in terms of the relevance metrics, including BLEU-1, Rouge-L, Average, Extrema and Greedy, Maria outperforms all baselines and even performs better than DialoGPT. This indicates that introducing extra visual knowledge related to the dialog context can further push the model to produce more relevant responses.
On the other hand, the discrepancy between the data distributions of the training data (i.e., the Image-Chat dataset) and the test data (i.e., the Reddit conversation dataset) of the text-to-image synthesis model in ImgVAE limits its performance in practice. Moreover, constrained by the capability of the text-to-image synthesis model, the richness and diversity of the synthesized images are limited, whereas Maria can retrieve a variety of images from the large-scale image index. This may be why ImgVAE consistently underperforms Maria on relevance in both automatic evaluation and human judgment, and it shows the advantage of the retrieval approach for zero-resource image-grounded conversation. Another observation is that Maria slightly underperforms DialoGPT on PPL and Dist-1/2. DialoGPT is a large-scale pre-training based dialog generation model and introduces an extra mutual information maximization objective to improve the informativeness of generated responses; this is consistent with the human evaluation results on fluency and richness.

Ablation Study
We conduct extensive ablation experiments over different model variants and input components to better understand their relative importance to the dialog generation task. As shown in Table 1, training simplified versions of Maria or removing any visual signal from the input components leads to worse performance in terms of relevance and diversity. In particular, the ablation results validate that: (1) the performance improvement in dialog generation benefits from MCP's effectiveness in aligning the representations of text and vision; (2) when training Maria, introducing VKB further improves the quality and diversity of the generated responses; (3) rich visual knowledge, i.e., image region features and visual concepts, plays a significant role in improving dialog generation. In particular, removing the visual concepts leads to a dramatic drop in diversity. This is because, lacking the necessary visual concepts, Maria cannot fully understand visual world knowledge when learning from the visual features alone.

Case Analysis
To further investigate the quality of responses generated by Maria, we examine an example of its generated responses, i.e., "the pizza at Aldi is the best in the world". This implies the commonsense that supermarkets usually sell pizza. We also observe that Maria pays more attention to the relevant image regions when generating the word "pizza", which demonstrates that Maria can capture useful visual knowledge from the image and subsequently leverage it to generate commonsense-aware responses. More cases are demonstrated in the Appendices.

Conclusions
In this paper, we present Maria, a neural conversational agent powered by visual world experiences. It is able to retrieve visual world experiences and generate human-like responses grounded in visual commonsense. Extensive experiments demonstrate that Maria achieves substantial improvements over state-of-the-art methods in both automatic and human evaluation. Future work includes: (1) designing a more precise and comprehensive image retriever that returns multiple retrieved images; (2) combining the retrieval module and dialog generation into an end-to-end model, instead of learning them separately; (3) exploring more efficient neural architectures for injecting visual knowledge into response generation.

A Appendices
In this section, we show more examples of word co-occurrence distributions on the Google Knowledge Graph and MS-COCO images. Besides, some conversation samples produced by Maria and the baselines are presented in Section A.2.

A.1 Word Co-occurrence Distribution Examples
In Figure 6, we present some supplementary examples of word co-occurrence distributions on the Google Knowledge Graph and MS-COCO images, including "traffic light", "bed", "book", and "potted plant". Figure 6 (a) shows the co-occurrence distributions of "traffic light" with other words on the knowledge graph and in images, respectively. As we can see, most of the words that co-occur with "traffic light" on the knowledge graph are related concepts such as "smart traffic light", "traffic light protocol", "traffic light rating system", etc. The co-occurring words in images, by contrast, are usually "car", "person", "truck", "bus", etc., which we often see when walking by traffic lights. Interestingly, we find that "umbrella" and "clock" also co-occur with "traffic light" in some images. For the former, the picture we can imagine is that people held "umbrellas" while walking through a zebra crossing under a "traffic light". For the latter, a possible picture is that we see both the "traffic light" and a "clock" on top of a high building from a certain angle when walking on the street. Similar observations hold for the other examples.

Most of the co-occurring words on the knowledge graph are logically related concepts. However, the co-occurrence of object tags in images reflects some commonsense about our physical world, implying pictures that we humans can easily imagine. This kind of knowledge is unique and inherent in images, but it can hardly be captured in traditional knowledge bases, such as knowledge graphs. Figure 7 shows some cases from the test set of the Reddit data. We observe that the responses generated by Maria are more commonsensical and vivid than those of the baseline methods, which is consistent with our automatic and human evaluation results. Interestingly, Maria is able to retrieve correlated images using the dialog contexts, which makes its responses more human-like. For instance, case (a) shows that when the dialog context marvels at "the pass of the world cup", Maria recalls a football player and compliments him as "the best player in the world"; case (b) shows that when the dialog context chats about the "Canada weather", Maria is aware of the fact that "Canada" is often "snowy" and then talks about "Canada" in a funny tone, "I've never been to a place that doesn't have snow"; case (c) shows that Maria understands that a "swan" is sometimes "dangerous" on the "beach"; case (d) shows that when the dialog context tries to guess a type of game, Maria recalls a ping-pong "ball" game and describes it; and so on.