Think Beyond Words: Exploring Context-Relevant Visual Commonsense for Diverse Dialogue Generation



Introduction
Building intelligent dialogue systems is a longstanding goal of artificial intelligence and has attracted increasing research attention in recent years. An ideal conversation agent is supposed to generate diverse and informative responses without sacrificing their relevance to the dialogue context. To avoid general and dull dialogue generation (Li et al., 2016), some approaches modify the model architecture to manipulate latent variables and target distributions (Lin et al., 2020; Wang et al., 2021), yet these works limit themselves to the original conversations without considering useful auxiliary information.
Another series of solutions augments the training corpus with extra information such as emotions or personality (Mazare et al., 2018; Song et al., 2019). Following this line, works like Su et al. (2020) and Majumder et al. (2021) introduce more general non-conversational text, such as forum comments and stories, to help generate richer responses. However, these works only consider information stored in pure text, ignoring the grounding information from the external visual world, which is essential for generating truly meaningful language (Harnad, 1990; Bisk et al., 2020).

Figure 1: Examples from two pure-language dialogue datasets, where the underlined green part is the output that needs to be generated. The hidden visual memory, which contains associative commonsense and needs to be explored, can be essential for humans to make proper responses during the conversation.
As shown in Figure 1, when making conversation it is natural that we do not only focus on the current context; we also expand or shift the topics using associative memory gained from the physical world, so that the chat can be more engaging and last longer. In this work, we introduce visual commonsense as the logical semantic information stored in visual scenes from daily life. Considering that images representing everyday scenarios are typically logical and grounded in commonsense, it is reasonable to introduce them into open-domain conversation as additional information. Liang et al. (2021) and Shen et al. (2021) are pioneering works that introduce visual information into general open-domain response generation. However, these works connect visual information by simply matching the context representation with images, without explicitly considering the topic transitions of the conversation. This may lead to monotonous and narrow semantics in the responses. Besides, the semantic gap between modalities makes it difficult for these methods to effectively integrate visual features. Furthermore, they ignore the balance between the contributions of the two modalities in the decoding stage.
To alleviate the above issues, we present VICTOR, a context-relevant visual commonsense enhanced dialogue generator, which consists of three components: a visual commonsense retriever, a multimodal fusion block, and a self-adaptive response generator. The visual commonsense retriever first extracts concept words from the context. Then, in order to acquire explicit commonsense knowledge, it explores related concepts by multi-hop searching on knowledge graphs. Each of these related concepts is considered globally and mapped into corresponding images, which then produce captions to narrow the semantic gap. In this way, we obtain visual commonsense with rich associative semantic information.
To facilitate diverse dialogue, our multimodal fusion block incorporates the auxiliary visual knowledge at each decoding step. It encodes the visual commonsense with a transformer block and utilizes a co-attention mechanism to fuse the two modalities. The response generator is based on the GPT-2 model. It takes knowledge pairs gained from knowledge graphs as guidance to encourage consistent responses with relevant topics. Finally, at each decoding step, the generator uses a soft probability to adaptively combine the distributions based on the textual and visual information. We demonstrate the effectiveness of our approach on two public datasets in comparison with various representative baselines.
Our contributions are summarized as follows: • We present a novel approach to retrieve visual scenes based on dialogue. It expands concepts on knowledge graphs and maps them to unpaired image data, so as to acquire context-related visual commonsense with high quality.
• We propose VICTOR, a new conversation agent that fuses multimodal information to enrich and steer the generation process. It adaptively balances textual information from context and external visual commonsense, generating diverse responses while maintaining their coherence with contexts.
• We conduct extensive experiments on two open-domain dialogue datasets. The results show the effectiveness of our proposed method, and verify the potential of exploiting multimodal information for intelligent conversation agents.
Related Work

Controllable dialogue response generation
The goal of open-domain dialogue systems is to establish engaging conversations with users. To satisfy the human need for communication and affection, an ideal conversation agent always has higher requirements in consistency, semantics, and diversity (Huang et al., 2020). Therefore, constraints on conversation attributes like persona (Mazare et al., 2018; Zhang et al., 2018) and sentiment (Song et al., 2019; Shen and Feng, 2020), and external non-conversation data like documents and knowledge bases (Li et al., 2020; Majumder et al., 2020), are introduced to control the dialogue response and improve the interactivity of the conversation model. However, most of these works use additional constraints or guiding information in the form of pure text, neglecting the rich commonsense knowledge stored in visual scenes.

Multimodal open-domain dialogue
Alongside the thriving of multimodal learning for tasks like captioning (Tu et al., 2022; Li et al., 2022) and entity mapping (Li et al., 2018; Liu et al., 2022), the use of visual information for improving language tasks has also shown great potential in areas such as machine translation (Caglayan et al., 2019; Fang and Feng, 2022) and semantic parsing (Shi et al., 2019; Kojima et al., 2020). However, its exploration for enhancing dialogue generation is still limited.
Early attempts on this issue assume the conversation to be grounded on a given image (Mostafazadeh et al., 2017; Shuster et al., 2020). Yang et al. (2021) try to recover the latent image of the conversation using a conditional variational auto-encoding framework (Sohn et al., 2015). Recent studies (Liang et al., 2021; Shen et al., 2021) take it a step further by matching the context with extra image data. Distinct from these existing works, our method expands the original topics from the context by searching a commonsense knowledge base, and uses the corresponding images to explore valid visual information for response generation.

The Proposed Method
In this section, we first introduce our task formulation for open-domain dialogue generation with visual commonsense, and then illustrate the three main components of our proposed VICTOR model.

Task Formulation
Let D_T = {(C_i, R_i)}_{i=1}^{N} denote the parallel conversational corpus, where C_i is the context and R_i is the corresponding response, and let D_I denote our collected image data. We assume that for each dialogue context C_i, we can find an image subset V_i = {v_i1, v_i2, ..., v_im} ⊆ D_I containing visual commonsense to assist the response generation. Thus our goal is to learn a generation model P(R_i | C_i, V_i) from D_T and D_I.

Visual Commonsense Retrieval
As shown in Figure 2, we design a static approach to retrieve related visual commonsense for each conversation context.

Concept Expansion
Since an engaging conversation requires dialogue agents to be able to proactively introduce new relevant topics, we expand the topic concepts by searching ConceptNet (Speer et al., 2017), a commonsense knowledge base. Following Ji et al. (2020), we first perform fuzzy matching with the lemmatized forms of surface texts to extract topic concepts from the provided conversation context. After removing stopwords, we keep verbs and nouns as our original topic concepts T_o.
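A minimal sketch of this retrieval stage, covering both the concept extraction above and the multi-hop expansion described next. The toy stopword list, the suffix-stripping lemmatizer, and the hand-written concept graph are all illustrative stand-ins for POS tagging, proper lemmatization, and ConceptNet, and the per-hop cap keeps concepts first-come rather than ranking them as the paper does:

```python
import re

STOPWORDS = {"i", "a", "an", "the", "to", "was", "had", "how", "no", "of",
             "my", "your", "they", "it", "is", "in", "on", "and", "do"}

# Toy directed edges standing in for ConceptNet (hypothetical data).
GRAPH = {
    "soccer": ["ball", "goal", "team"],
    "ball": ["round", "game"],
    "team": ["player"],
    "brazil": ["carnival", "soccer"],
}

def lemmatize(token):
    """Very rough lemmatizer: strips a trailing plural 's' (illustrative only)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def extract_concepts(context):
    """Candidate topic concepts T_o: lemmatized content words with stopwords
    removed (the paper additionally keeps only verbs and nouns)."""
    seen, concepts = set(), []
    for tok in re.findall(r"[a-z]+", context.lower()):
        lemma = lemmatize(tok)
        if lemma not in STOPWORDS and lemma not in seen:
            seen.add(lemma)
            concepts.append(lemma)
    return concepts

def expand_concepts(seeds, hops=2, top_n=5):
    """Expanded set T_e: follow directed edges for `hops` steps, keeping at
    most `top_n` new concepts per hop."""
    expanded, frontier, seen = list(seeds), list(seeds), set(seeds)
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for nb in GRAPH.get(node, []):
                if nb not in seen and len(next_frontier) < top_n:
                    seen.add(nb)
                    next_frontier.append(nb)
        expanded.extend(next_frontier)
        frontier = next_frontier
    return expanded

t_o = extract_concepts("How was your trip to Brazil?")
print(t_o)                                   # ['trip', 'brazil']
print(expand_concepts(t_o, hops=2, top_n=3))
```

With the toy graph above, the two seeds expand to carnival and soccer at hop one, and then to soccer's neighbours at hop two.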
We consider the original concepts as the initial nodes, and iteratively search for their directed neighbours on the knowledge graph, preserving the top-ranked related concepts at each hop to obtain the expanded concept set T_e. (Note: to illustrate our method concisely, we focus on single-turn dialogue generation here; our approach also works in the multi-turn setting when we use the dialogue history as context.)

Image Mapping

Our aim is to utilize the commonsense knowledge existing in the visual scenarios that correspond to the conversation topics.
It is intuitive to consider the connections among the chosen concepts rather than mapping them separately into the visual space. Since there is no large-scale aligned dialogue-image dataset available, we train our concept-image matching model on MSCOCO (Lin et al., 2014), a commonly used image-captioning dataset containing sentence-image pairs. Following Tan and Bansal (2020), we align each token in the caption s to the paired image, and perform token-level matching.
To extract feature representations of the text and images, we adopt a pretrained language model and a pretrained visual model (here BERT_BASE (Devlin et al., 2018) and ResNeXt (Xie et al., 2017), respectively) for encoding. We then project the feature vectors of the two modalities into the aligning space and normalize them to norm-1 vectors of the same dimension d:

h_si = f_map(e_si), i = 1, ..., L;    h_v = f_map(e_v),

where e_si and e_v are the encoder outputs of the i-th caption token and the image, the mapping function f_map(.) is a multilayer perceptron followed by a normalization function, and L is the sentence length. Thus we obtain the aligned textual and visual representations H_s = {h_si} and h_v.
The relevance score of the two modalities is measured by the inner product of their representations, s(h_si, h_v) = h_si · h_v. Finally, a hinge loss is adopted to optimize the matching model:

L_match = Σ_i max(0, α − s(h_si, h_v) + s(h_si, h_v−)),

where v− is a randomly selected negative image sample, and α is the margin between the similarities of a positive and a negative pair.
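A NumPy sketch of the projection and hinge-loss objective described above. The dimensions are toy values and the projection matrices W_t and W_v are random placeholders; the real model learns MLP projections by backpropagating this loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_map(x, W):
    """Projection into the shared aligning space followed by L2 normalization."""
    h = x @ W
    return h / np.linalg.norm(h, axis=-1, keepdims=True)

def hinge_loss(h_token, h_pos, h_neg, alpha=0.5):
    """max(0, alpha - s(token, positive) + s(token, negative)),
    with inner-product relevance scores."""
    s_pos = float(h_token @ h_pos)
    s_neg = float(h_token @ h_neg)
    return max(0.0, alpha - s_pos + s_neg)

d_text, d_img, d = 8, 10, 4                    # toy dimensions (illustrative)
W_t = rng.standard_normal((d_text, d))         # text projection (placeholder)
W_v = rng.standard_normal((d_img, d))          # image projection (placeholder)

tok = f_map(rng.standard_normal(d_text), W_t)  # one caption token
pos = f_map(rng.standard_normal(d_img), W_v)   # paired image
neg = f_map(rng.standard_normal(d_img), W_v)   # random negative image

loss = hinge_loss(tok, pos, neg)
print(round(loss, 4))
```

The loss is zero once the positive pair outscores the negative pair by at least the margin α.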
After training the token-image matching model, we use it to retrieve matched images for the expanded topic concepts T_e. We keep the top K images for each concept word according to their relevance scores, and thus obtain the corresponding visual scenes V = {v_1, v_2, ..., v_m}, which contain the desired commonsense knowledge.
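The top-K selection step can be sketched as a scored nearest-neighbour lookup; the embeddings below are tiny hand-picked vectors, not outputs of a trained model:

```python
import numpy as np

def retrieve_top_k(concept_vecs, image_vecs, k=1):
    """For each concept embedding, return the indices of the k images with the
    highest inner-product relevance scores."""
    scores = concept_vecs @ image_vecs.T        # (n_concepts, n_images)
    return np.argsort(-scores, axis=1)[:, :k]   # descending order, keep top k

# Toy aligned embeddings (illustrative): two concepts, three candidate images.
concepts = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
images = np.array([[0.9, 0.1],
                   [0.1, 0.9],
                   [0.7, 0.7]])

print(retrieve_top_k(concepts, images, k=1))   # [[0], [1]]
```

Each concept picks the image whose embedding points in the most similar direction, matching the K = 1 setting used in the experiments.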

Multimodal Information Fusion
A commonly used captioning model pretrained on the MSCOCO dataset is adopted to caption the retrieved image for each concept. The assumption is that caption-styled visual information is easier for the model to exploit than roughly extracted visual features. We then concatenate these captions using the token [cap], obtaining the corresponding visual commonsense V_c = {u_1, ..., u_z}, where z is its total length. After that, we utilize a transformer block (TB) (Vaswani et al., 2017) to obtain the representation of the visual commonsense. Formally, the representation of each V_c is calculated by:

H_v = TB(W_emb(V_c) + PE(V_c)),

where W_emb ∈ R^{d_voc × d_h} is the word embedding matrix from the generator, d_voc is the size of the vocabulary, and PE(.) is the position embedding used to make use of the sentence order. Afterward, we apply the fusion module to incorporate the context information and the visual knowledge, so as to determine the external information desired by the current context. Formally, at each decoding step t, the response generator produces a hidden state h^c_t that encodes the current context (details are described in the next section).
We leverage this hidden state as a context query, and use a multi-head attention layer to capture the correlated visual information h^vc_t from H_v:

h^vc_t = MultiHead(h^c_t, H_v, H_v).

At the t-th decoding step, based on the extracted commonsense information, the decoding distribution over the vocabulary decided by the visual knowledge is produced by:

P_v(y_t) = softmax(W_v h^vc_t + b_v),

where W_v and b_v are learnable parameters.
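A toy NumPy sketch of this fusion step: single-head attention stands in for the paper's multi-head co-attention, and the self-adaptive mixing weight β described in the next section is included to show how the two vocabulary distributions combine. All sizes, weights, and states are random illustrative placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values):
    """Scaled dot-product attention: the decoder state queries the encoded
    visual commonsense H_v (single head, for illustration)."""
    weights = softmax(query @ keys.T / np.sqrt(keys.shape[-1]))
    return weights @ values

rng = np.random.default_rng(1)
d, z, vocab = 4, 5, 6                      # toy sizes (illustrative)
h_ct = rng.standard_normal(d)              # decoder hidden state at step t
H_v = rng.standard_normal((z, d))          # encoded caption tokens
W_v = rng.standard_normal((d, vocab))      # projection to vocabulary logits
W_c = rng.standard_normal((d, vocab))

h_vc = attend(h_ct, H_v, H_v)              # fused visual information
p_visual = softmax(h_vc @ W_v)             # distribution from visual knowledge
p_text = softmax(h_ct @ W_c)               # distribution from textual knowledge

# Self-adaptive balance (sketch of the gating in the next section):
w_beta = rng.standard_normal(d)
beta = 1.0 / (1.0 + np.exp(-(w_beta @ h_ct)))
p_final = beta * p_text + (1.0 - beta) * p_visual
print(round(float(p_final.sum()), 6))
```

Because β lies in (0, 1) and both inputs are valid distributions, the mixture is itself a valid distribution over the vocabulary.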

Self-adaptive Response Generator
The generation network is based on GPT-2 (Radford et al., 2019), a pretrained multi-layer transformer decoder that learns language granularity from large amounts of open Web text. As shown in Figure 3, given a dialogue context C, the decoding process at each step t is as follows. Using the GPT-2 model, we first obtain the hidden state h_t of the current context. To encourage the generated response to use topic knowledge, we explicitly consider the extracted concepts here. Since we obtained the expanded concept set by searching for neighbours on the external knowledge base earlier, we can form the related concept pairs T_pr = {(t^hd_i, t^tl_i)}, where each pair consists of a head concept and one of its expanded tail concepts. Following previous work, we first embed the two concepts of each pair and thereafter concatenate them to obtain the related-concepts embedding. Then we use h_t to query the embedded pairs with a single-layer multi-head attention layer, obtaining the topic-aware state h^c_t:

h^c_t = MultiHead(h_t, E_T, E_T),

where E_T denotes the embedded concept pairs. The probability distribution of the t-th token decided by the textual knowledge is then computed as:

P_c(y_t) = softmax(W_c h^c_t + b_c).

Since different conversation turns may require different information, it is crucial to balance the textual information from the context, which constrains the direction of the conversation, and the previously obtained visual knowledge, which indicates related commonsense from real-world grounding. We therefore utilize a weighted average score β to decide the relative contributions of these two knowledge sources when generating responses. Instead of fixing a manual hyperparameter to adjust the balance, we adopt a self-adaptive weight (See et al., 2017) based on the current hidden state of the context:

β = σ(w_β h_t + b_β),

which yields the combined decoding distribution:

P(y_t) = β P_c(y_t) + (1 − β) P_v(y_t).

Finally, following the standard practice of dialogue response generation, we optimize our proposed model with the cross-entropy loss:

L_gen = − Σ_t log P(y_t = y*_t | y*_{<t}, C, V_c).

Experimental Settings

Datasets
We conduct our experiments on two open-domain dialogue corpora, OTTers (Sevegnani et al., 2021) and DailyDialog (Li et al., 2017). OTTers is a dialogue dataset of human one-turn topic transitions. Unlike other common dialogue datasets, which contain a large number of short, generic responses, each utterance in OTTers has a specific topic and is therefore more informative. OTTers also differs slightly from other dialogue corpora in form: given a one-turn conversation [u_a, u_b], where each utterance has a different topic, the goal is to generate a transition response u_t that serves as a smooth link between them. This dataset is particularly suitable for testing our model, since the response generation requires associative commonsense knowledge. During the experiments, we concatenate [u_a, u_b] using separator tokens as the model input, and treat u_t as the output. To test the generalization ability of our model and make a fair comparison with other baselines, we also evaluate VICTOR on the commonly used DailyDialog dataset. Examples from both datasets are shown in Figure 1.
For image retrieval, we train our mapping model on the MSCOCO dataset. We randomly sample 100K images from the Open Images dataset (Kuznetsova et al., 2020) as our candidate image set D_I, and then retrieve images from it following Section 3.2.
Among these comparison methods, Seq2seq is a standard generation model, GPT-2 is a commonly used pretrained language model, GVT and AdaLabel are both transformer-based models for diverse dialogue generation, and GRF and AdaLabel are state-of-the-art approaches for the datasets we use.

Evaluation Metrics
Automatic Evaluation We hypothesize that our proposed approach, which leverages external topic-aware visual commonsense, can increase the diversity of the generated responses while maintaining relevance to their corresponding contexts. For fluency, we use Perplexity (Serban et al., 2015) to measure the confidence of the generated responses; a relatively low perplexity indicates better fluency. For relevance, we adopt the widely used BLEU (Papineni et al., 2002) (here BLEU-1 and BLEU-4) and Rouge-L (Lin, 2004) to measure the n-gram overlaps between the ground-truth references and the generated responses. To measure diversity, we report the percentage of distinct uni-grams and bi-grams (Dist-1 and Dist-2, respectively) (Li et al., 2016) in all generated responses.
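The Dist-n metric is simply the ratio of distinct n-grams to total n-grams pooled over all generated responses, e.g.:

```python
def distinct_n(responses, n):
    """Dist-n: number of unique n-grams divided by the total number of
    n-grams across all responses (Li et al., 2016)."""
    total, unique = 0, set()
    for resp in responses:
        toks = resp.split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

resps = ["i like soccer", "i like music"]
print(round(distinct_n(resps, 1), 4))   # 0.6667
print(round(distinct_n(resps, 2), 4))   # 0.75
```

On the two toy responses, 4 of 6 unigrams and 3 of 4 bigrams are distinct, so repetitive generations drive both scores down.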
Human Evaluation Considering that automatic metrics are not always accurate in evaluating responses (Liu et al., 2016), we further conduct a manual evaluation following previous works (Wu et al., 2021; Zou et al., 2021). Specifically, we randomly sample 200 testing pairs from each test set. Given a dialogue context, three annotators are asked to conduct pair-wise comparisons between the responses generated by VICTOR and three strong baselines, including state-of-the-art methods (1200 comparisons with three baselines on two datasets in total). For each comparison, the three annotators are required to compare the responses from the following perspectives: fluency, context coherence, and informativeness. Each annotator judges which response is better independently. If the two responses are both proper or both inappropriate, the comparison is treated as a "draw". Ultimately, we average the results of the three annotators and calculate their Fleiss' kappa scores (Fleiss, 1971).
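For reference, Fleiss' kappa over such pairwise judgments can be computed with the standard formula; the vote matrix below is toy data (4 items, 3 annotators, labels win/lose/draw), not the paper's actual annotations:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for inter-annotator agreement.
    counts: (n_items, n_categories) matrix where counts[i, j] is the number of
    annotators assigning category j to item i (each row sums to n_raters)."""
    counts = np.asarray(counts, dtype=float)
    n_raters = counts[0].sum()
    # Per-item observed agreement P_i and its mean P_bar.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement P_e from the category marginals.
    p_j = counts.sum(axis=0) / counts.sum()
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1.0 - p_e)

votes = [[3, 0, 0],   # all three annotators pick "win"
         [0, 3, 0],   # all pick "lose"
         [2, 1, 0],
         [2, 0, 1]]
print(round(fleiss_kappa(votes), 4))
```

Values around 0.4 to 0.6 are conventionally read as moderate agreement, the level reported for the annotations in this work.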

Implementation Details
During the topic expansion, we set the number of hops H = 2 and preserve the top N = 5 concepts per hop. For the retrieval model, the concatenation of the last 4 layers of the BERT output and the image features from ResNeXt-101-32x8d are used as the embeddings of each modality. We set the hidden size d of the aligning space to 256 and the hinge-loss margin α to 0.5. We test the performance of retrieving different numbers of top-scored images for each concept, and set K = 1 for the best result (see Section 5.4). The pretrained captioning model combines a ResNet-101 encoder with an LSTM decoder.
For the generator, we base our model on gpt2-small (a Transformer with 12 layers, hidden size 768, and 12 heads). The multi-head transformer block for encoding visual commonsense has 6 layers, hidden size 768, and 6 heads. To train the model, we use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-6. At the inference stage, the maximum decoding length of the response is set to 40, and we adopt beam search decoding with a beam size of 3. All our experiments are implemented with PyTorch, and the entire model is trained on RTX3090 GPUs.

Automatic Evaluations
As shown in Table 1, our proposed model VICTOR outperforms the baselines on most automatic metrics across the two datasets. In terms of relevance, it beats the baselines on all related metrics, indicating that responses generated by VICTOR remain coherent with the help of context-related knowledge. Meanwhile, enhanced by the extracted visual commonsense, VICTOR also achieves the best performance in Dist-1/2, showing it can generate diverse and informative responses. Besides, we can see that although AdaLabel can generate relatively diverse responses, the lack of context-related external knowledge prevents it from keeping high relevance to the context. This problem is particularly acute on the OTTers dataset, since most dialogues in it are topic-specific. The same problem also affects the GVT model: without the assistance of commonsense knowledge, it performs rather poorly on OTTers. Although the GRF baseline integrates information from knowledge bases, its performance is worse than our model on both relevance and diversity. This indicates the superiority of considering commonsense information in visual scenes rather than just pure textual knowledge.

Human Evaluations
The human evaluation results are shown in Table 2. Not surprisingly, VICTOR consistently outperforms all the strong baselines and achieves significant improvements on both datasets. We also analyze the bad cases and find that the baselines still suffer from general or irrelevant responses.

The evaluation results indicate that VICTOR can generate more coherent and informative responses that are attractive to annotators. This validates the benefits of the context-relevant visual commonsense and the fusion mechanism. We also employ Fleiss' kappa scores to measure the reliability between different annotators, and the results show that the annotators reach a moderate agreement.

Ablation Study
To investigate the effectiveness of each part of VICTOR, we conduct ablation studies on the two datasets by removing or replacing particular modules of the original model. We consider three variants: (1) w/o VC: removing the visual commonsense extraction and the multimodal fusion block.
(2) w/o AW: removing the adaptive weight of the response generator and replacing it with a fixed weight of 0.5.
(3) w. RF: replacing the caption-styled visual commonsense with ResNeXt features of the same images, obtained using the pretrained image encoder from our retrieval model.

The ablation results are shown in Table 3. We observe that without fusing visual commonsense, the performance of variant 1 drops sharply with respect to the relevance and diversity metrics. This result verifies the effectiveness of integrating context-relevant visual knowledge into response generation. Besides, although variant 2 maintains a relatively high diversity, its relevance metrics drop largely due to the fixed balancing weight of the generator. This indicates that adaptively deciding the contributions of the language and visual knowledge plays an important role in the generation process across different conversation turns. We also observe a small drop in the performance of variant 3, which uses ResNeXt features instead of image captions as the visual commonsense source. As shown in previous research (Jin et al., 2022; Feng et al., 2021), this phenomenon can be explained by the fact that captions of everyday scenarios, which dampen the reporting bias of general text corpora, are better carriers of logical commonsense and contain less noise than roughly extracted image features.

Number of Images
We further study the effect of visual commonsense by varying the number of retrieved images in experiments on the OTTers dataset. As shown in Table 4, all results obtained with the help of visual commonsense are better than those without, while choosing the top 1 image achieves the best performance. This can be explained by the fact that each selected image already covers key information of all core concepts, resulting in partial semantic overlap; selecting additional images may therefore introduce unnecessary noise that does not help the generation.

Case Study
To further investigate the quality of the responses generated by VICTOR, and to compare the results with other baselines intuitively, we show two dialogue cases from the two datasets in Figure 4. As we can see, the retrieval process obtains proper expanded concepts from the knowledge graph and retrieves related images. The corresponding captions, carrying logical commonsense, then bring auxiliary visual information into the generation process. In these two cases, although all four models generate fluent and informative responses, the responses generated by VICTOR are clearly more consistent with the context and more engaging than those of the three strong baselines. Again, the results demonstrate the effectiveness of exploring context-relevant visual commonsense for dialogue generation.

Conclusion
In this work, we propose a novel context-relevant visual commonsense enhanced approach for open-domain dialogue generation. The model effectively extracts relevant visual commonsense, integrates the multimodal knowledge, and adaptively measures the contributions of the different modalities, so as to produce better responses. Extensive experiments on two pure-language dialogue datasets show that the proposed VICTOR model significantly outperforms previous approaches, indicating that VICTOR can generate more diverse and informative responses while maintaining coherence with the context. For future work, we will continue to investigate the advantages of introducing external visual knowledge into dialogue systems. We notice that the current use of the visual modality in this field is still rather coarse. Further study should focus on how to extract more specific and more necessary information from images or videos for enhancing response generation. Besides, enabling dialogue agents to handle multimodal inputs and outputs is also a relevant hot research problem.

Limitations
We discuss the limitations of this work. The proposed method trains the visual retriever and the generation model separately, which may affect the overall optimization of the system to a certain extent. Besides, limited by the performance of the retrieval modules, the extracted visual commonsense is not always an effective extension of the context. This indicates that there is still room for improvement in the acquisition and utilization of high-quality visual knowledge for dialogue generation.

Figure 2: Retrieval process: extracting and expanding the context concepts, and mapping them to corresponding images.

Figure 3: The overall framework of VICTOR.

Table 2: Human evaluation results on (a) OTTers and (b) DailyDialog datasets. VICTOR is abbreviated as VIC.