MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation

Responding with multi-modal content has been recognized as an essential capability for an intelligent conversational agent. In this paper, we introduce the MMDialog dataset to better facilitate multi-modal conversation. MMDialog is composed of a curated set of 1.08 million real-world dialogues with 1.53 million unique images across 4,184 topics. MMDialog has two main and unique advantages. First, it is the largest multi-modal conversation dataset by number of dialogues, by a factor of 88x. Second, it covers massive topics that generalize to the open domain. To build an engaging dialogue system with this dataset, we propose and formalize two response prediction tasks based on retrieval and generative scenarios. In addition, we build two baselines for the above tasks with state-of-the-art techniques and report their experimental performance. We also propose a novel evaluation metric, MM-Relevance, to measure multi-modal responses. Our dataset is available at https://github.com/victorsungo/MMDialog.


Introduction
Empowering machines to converse like humans is a long-cherished goal of the AI community, and there is growing interest in developing open-domain conversational agents (Ghazvininejad et al., 2018; Zhou et al., 2018a). To usher machines into the world of human knowledge, it is a desirable trait of conversational agents to understand, perceive, and respond appropriately to multi-modal contexts beyond text (Das et al., 2017; Mostafazadeh et al., 2017; Shuster et al., 2020), similar to how humans communicate through messenger tools (e.g., Facebook, WhatsApp, and WeChat).
Existing approaches to building multi-modal dialogue systems are primarily data-driven, requiring the collection of a large-scale dataset first.

* Work done during the internship at MSRA.

[Figure 1: An example dialogue from MMDialog, in which two users chat about nature photography and share images across turns (e.g., "At the base of a muddy ditch is the first primrose of my spring -- glowing in the grey, a little spot of hope, brave, beautiful and perfect." ... "That's an amazing story thanks for sharing.").]
To facilitate this line of research, a few dialogue datasets incorporating visual information have emerged in the community (Meng et al., 2020; Zang et al., 2021). For example, Visual Dialog (Das et al., 2017) is set up for visual question answering involving image inputs. IGC (Mostafazadeh et al., 2017) and Image-Chat (Shuster et al., 2020) are constructed via crowd-sourcing, in which annotators are employed to chat about given images. PhotoChat (Zang et al., 2021) is also built via crowd-sourcing but contains photo-sharing in conversations. MMChat is collected from real conversations on Chinese social media.
Despite the diversity of multi-modal dialogue corpora, these datasets still have limitations. Firstly, several corpora, including Visual Dialog, IGC, and Image-Chat, are derived from crowd-sourced dialogues talking about given images. The topics of human utterances in a dialogue session are often triggered and grounded by these images, which is inconsistent with our daily communications, where the utterances are not always image-related. Secondly, other datasets, such as OpenViDial 1.0/2.0 (Meng et al., 2020) and the dialogues collected by Lee et al. (2021), do not originate from a real multi-modal conversation scenario. The former directly extracts dialogues and their visual contexts from movies and TV series, and the latter replaces some utterances with retrieved relevant images. Both methods artificially attach images to multi-turn conversations to simulate multi-modal dialogues. Finally, some recently proposed multi-modal dialogue datasets like PhotoChat and MMChat introduce real human-human conversations, but they are still limited by their small scale or lack of domain diversity, impeding further exploration of multi-modal dialogue modeling.
To address the aforementioned issues, we present MMDialog, a large-scale multi-turn dialogue dataset containing multi-modal open-domain conversations derived from real human-human chat content on social media. MMDialog contains 1.08M dialogue sessions and 1.53M associated images. We elaborately design a series of data filtering processes during the data collection phase. On average, one dialogue session has 2.59 images, which can be located anywhere at any conversation turn. Figure 1 depicts an example of human conversations in our MMDialog dataset. To the best of our knowledge, this is the first million-scale open-domain multi-modal dialogue corpus. We hope the large number of dialogues and images can shed light on this line of research.
Furthermore, we define multi-modal response generation and retrieval tasks based on MMDialog that are essential for building a more engaging multi-modal dialogue agent. We build baseline models and conduct several analyses of their performance. For the generative task, we follow Sun et al. (2022) and implement the models for multi-modal response generation. For the retrieval task, we propose a CLIP-based dual-encoder inspired by Zang et al. (2021). In our multi-modal response prediction settings, the modality order of a predicted response may not be aligned with that of the ground-truth response, so it is non-trivial to conduct evaluation on cross-modal response elements. To tackle this challenge, we propose a novel evaluation metric named MM-Relevance, which performs vision-language matching based on the large-scale pre-trained multi-modal CLIP model. Evaluation results on MMDialog demonstrate that our designed baselines achieve considerable performance on the generation and retrieval tasks of both modalities.
To sum up, our contributions are four-fold: • We construct a novel multi-turn dialogue dataset MMDialog that contains 1.08M multi-modal open-domain conversations and 1.53M associated images derived from social media, and conduct elaborate data filtering and post-processing. To the best of our knowledge, this is the first million-scale multi-turn open-domain multi-modal dialogue corpus.
• We propose two benchmark tasks, covering generative and retrieval scenarios, on MMDialog that are essential for building more engaging multi-modal dialogue systems.
• We propose a novel evaluation metric MM-Relevance that measures the relevance between the generated multi-modal response and the ground-truth response. It builds upon the large-scale pre-trained multi-modal CLIP model, which can specifically mitigate modality misalignment issues.
• We design two baselines for the corresponding tasks to promote future research on this dataset, which achieve considerable performance on the generation and retrieval tasks of both modalities. We also provide a comprehensive analysis to offer more insights into multi-modal dialogue modeling.
Related Work
Multi-Modal Dialogue Datasets

Concurrent with the above works, several dialogue-related tasks have also been explored. Das et al. (2017) introduced the task of Visual Dialog, which requires an AI agent to hold a meaningful dialogue with humans in natural, conversational language about visual content. Mostafazadeh et al. (2017) proposed IGC, which contains 4K dialogues, each including an image with a textual description along with the questions and responses around the image. However, IGC is usually used only for evaluation due to its small scale. Shuster et al. (2020) released Image-Chat, which is larger than IGC and consists of 202K image-grounded dialogues. However, the above three datasets were created by asking crowd workers to talk about a shared image to generate the conversation. Therefore, the utterances are often triggered and grounded by these images. In contrast, utterances in human daily communication are not always image-related, which leaves a gap with respect to open-domain multi-modal conversation scenarios. Other groups of works proposed to derive the images from the multi-turn conversations: Meng et al. (2020) and follow-up work constructed OpenViDial 1.0/2.0 by directly extracting dialogues and their visual contexts from movies and TV series, and Lee et al. (2021) built a multi-modal dialogue dataset by replacing selected utterances with retrieved relevant images. However, although these corpora were constructed from open-domain conversations with images, they did not originate from a real multi-modal conversation scenario. Therefore, some researchers have recently begun to introduce real human-human conversations. Zang et al. (2021) created the first human-human dialogue dataset with photo-sharing acts via crowd-sourcing, and MMChat collects multi-modal dialogues from real conversations on social media. Nevertheless, these datasets are still limited by their small scale or lack of domain diversity, which may hinder further exploration of multi-modal dialogue modeling.
To address the aforementioned issues, we make the first attempt to construct a million-scale multi-turn dialogue dataset, namely MMDialog, derived from social media, and conduct elaborate data filtering and post-processing.

Multi-Modal Dialogue Modeling
Based on the aforementioned multi-modal dialogue datasets, many modeling works have been proposed. Several works (Qi et al., 2020; Niu et al., 2019; Gan et al., 2019) investigate how to improve the performance of conversational agents in image-grounded dialogue. Afterward, researchers (Liang et al., 2021) explore enriching the textual expressions of generated dialogue responses through associative vision scenes. Zang et al. (2021) propose two tasks: photo-sharing intent prediction, which predicts whether the model should share a photo in the next dialogue turn, and dialogue-based image retrieval, which retrieves the most proper photo given the dialogue context. They also propose a dual-encoder model that uses object labels to encode image features, which achieves the best performance among all models without cross-attention mechanisms. However, the authors do not conduct textual response retrieval tasks. Another line of work proposes a multi-modal dialogue generation model based on the Seq2Seq architecture, which was proved superior to the textual Seq2Seq model; however, this model can only generate plain textual responses, which is not in line with the open-domain multi-modal response generation scenario. Recently, Sun et al. (2022) made the first attempt to build a multi-modal dialogue response generation model named Divter that can effectively understand multi-modal dialogue context and generate informative text and high-resolution image responses. Advanced works on dialogue systems include retrieval-based methods (Wu et al., 2017; Zhou et al., 2018b; Whang et al., 2020) and generative methods (Li et al., 2016; Serban et al., 2016). We therefore adapt Divter (Sun et al., 2022) to our multi-modal response generation settings and extend the dual-encoder (Zang et al., 2021) to the retrieval-based scenario as baselines.
MMDialog Dataset

We collect MMDialog from one of the most influential online social platforms, on which users can converse with each other and freely share daily-life messages in multiple modalities, including plain text, photos, and even videos. We design the data collection process in 3 phases. In the first phase, we extensively and manually collect hashtags commonly used by users, covering as many domains as possible. The second phase starts from the seed hashtags collected before: specifically, we collect all turns with the aforementioned hashtags and keep only the turns that contain at least one image; we refer to these turns as anchors. Then, for each anchor, we retrieve all the turns that replied to it and the turn it replied to. In the final phase, we elaborately design a series of data filtering and post-processing steps to eliminate invalid cases and improve the quality of the multi-modal dialogues in MMDialog. To protect the privacy and security of the data, users, and platform, MMDialog is released under strict terms for academic research only.

Hashtag Collection
To collect MMDialog, we crawl one of the most influential online social platforms using its academically available API. To improve the data quality, we extract dialogues via their hashtags (e.g., '#travel', '#friends', '#golf'), as hashtags tend to indicate the main topic of the textual utterances and the visual media. Specifically, we manually screen out 4,184 popular hashtags, each of which has at least 1,000 dialogues; in this way, our dataset not only satisfies the open-domain property but also ensures a large scale. We depict the most popular hashtags in Figure 4 in Appendix A.2.

Multi-modal Conversations Construction
We then leverage the manually collected hashtags as seeds to construct multi-turn dialogues. First, for each hashtag, we crawl the turns containing the corresponding hashtag and keep only those that contain at least one image object (i.e., anchors). Dialogues containing the anchors are the multi-modal multi-turn dialogues we pursue. Then, within the same conversation, for each anchor, we look for i) all the turns that replied to the anchor, down to the leaf nodes, and ii) the turns that the anchor replied to, up to the root node. By recursively following the chain of replies, we can recover the entire conversation.

Data Filtering and Post-processing
Since the styles of messages posted on social media platforms vary widely, the initial version of MMDialog contains many invalid, noisy, and even harmful conversations, which may hinder research conducted on this dataset. To tackle this issue, we design a series of elaborate data filtering processes to retain high-quality multi-modal conversations: a) We remove dialogues containing toxic statements with explicit offensive words; b) We discard dialogues with GIFs and other modalities (such as videos) that cannot be downloaded immediately.
We leave this part of research as future work; c) We remove irregular characters from the dialogue content; for example, we discard all URLs and '@' items (i.e., expressions for mentioning somebody); d) In particular, we convert emojis and hashtags into their corresponding natural language forms to guarantee the coherence of the dialogues; e) We remove all self-talking cases (such as users replying to themselves for 2 or more consecutive dialogue turns) to enhance the integrity of the conversations; f) We discard dialogues with incomplete or missing images; g) We only keep conversations of no less than 3 dialogue turns. We believe that by adopting the above data filtering and post-processing procedure, the final large-scale multi-turn dialogues can be better leveraged to develop multi-modal open-domain conversation models.
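Rules a), c), e), f), and g) above can be sketched as simple predicates over a dialogue. This is a minimal illustrative sketch, assuming each dialogue is a list of turn dicts with `user`, `text`, and `images` fields; the word list and field names are hypothetical, not the authors' actual implementation.

```python
import re

OFFENSIVE = {"badword1", "badword2"}  # placeholder for the real toxic-word list

def clean_text(text):
    """Rule c): strip URLs and '@' mentions from an utterance."""
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    return text.strip()

def keep_dialogue(turns):
    """Apply rules a), e), f), and g) to one dialogue (a list of turn dicts)."""
    # a) drop dialogues containing explicit offensive words
    if any(w in t["text"].lower().split() for t in turns for w in OFFENSIVE):
        return False
    # e) drop self-talk: the same user posting 2+ consecutive turns
    if any(turns[i]["user"] == turns[i + 1]["user"] for i in range(len(turns) - 1)):
        return False
    # f) drop dialogues with missing images
    if any(img is None for t in turns for img in t["images"]):
        return False
    # g) keep only conversations with at least 3 dialogue turns
    return len(turns) >= 3
```

A dialogue passes only if every rule holds, so the filters compose as a conjunction over the whole conversation.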

Task Definition
Suppose that we have a multi-modal dialogue dataset D = {(U_i, R_i)}_{i=1}^{n}, where U_i is the multi-turn dialogue context and R_i is the response with regard to U_i. Both U_i and R_i could contain multi-modal components: textual elements (e.g., utterances) and visual elements (e.g., images). For any U_i and R_i, we denote U_i = {u_k^m}_{k=1}^{K} and R_i = {r_l^m}_{l=1}^{L} as sequences of multi-modal elements, including textual utterances and visual images. K and L are the numbers of elements in the context and response, respectively. m ∈ {t, v} indicates the modality type of an element, where t represents a textual utterance and v signifies a visual image. The goal is to learn a multi-modal dialogue model g from D, such that for any new context U, one can predict a multi-modal response R with g.
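The element sequences defined above can be represented in memory as below. This is an illustrative sketch; the field names are hypothetical and not the released dataset schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Element:
    modality: str  # "t" for a textual utterance, "v" for a visual image
    content: str   # the utterance text, or an image path/identifier

def modal_sequence(elements: List[Element]) -> List[str]:
    """Return the modality sequence m_1, ..., m_K of a context or response."""
    return [e.modality for e in elements]

# A two-element context: one utterance followed by one image.
context = [Element("t", "Look at this primrose!"), Element("v", "img_001.jpg")]
assert modal_sequence(context) == ["t", "v"]
```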
Advanced works on pure-text open-domain dialogue systems mainly include retrieval-based and generative-based methods. We adapt both to multi-modal scenarios and define the following two tasks, which are essential for building a multi-modal open-domain dialogue system:

Task-1: Multi-modal Response Generation
To generate a multi-modal response R, one should learn a multi-modal generation model P(R|U; θ) with θ the model parameters. Thus, given a new dialogue context U, following P(R|U; θ), one can directly synthesize a multi-modal response R̂ consisting of textual utterances, visual images, or both.
Task-2: Multi-modal Response Retrieval

As for the retrieval-based models, each dialogue example (U, R) additionally provides a series of negative multi-modal elements as distractors. We compose the ground-truth textual utterances {r_l^t} in R and the negative examples into a candidate set C^t = {r_z^t}_{z=1}^{Z} for text retrieval, where Z is the size of C^t. In the same way, we build the image candidate set C^v = {r_z^v}_{z=1}^{Z}. The goal of a response retrieval model is then to select an element from the given candidate set C^t or C^v step by step when predicting each element r_z^m. Through such a retrieval process in an auto-regressive style, we finally obtain a fully retrieved multi-modal response R̂.
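The auto-regressive retrieval process can be sketched as the loop below. `predict_intent` and `score` are stand-ins for the intent prediction and ranking modules (not real APIs), using the intent labels 0 (text), 1 (image), and 2 (stop) defined in the next subsection.

```python
def retrieve_response(context, text_cands, image_cands, predict_intent, score):
    """Retrieve a multi-modal response element by element."""
    response = []
    while True:
        # Decide the next modality, or stop: 0 = text, 1 = image, 2 = stop.
        intent = predict_intent(context, response)
        if intent == 2:
            break
        cands = text_cands if intent == 0 else image_cands
        # Pick the candidate with the highest relevance score given the
        # context and the partially retrieved response so far.
        best = max(cands, key=lambda r: score(context, response, r))
        response.append(best)
    return response
```

With a toy intent sequence [0, 1, 2] and a length-based scorer, the loop retrieves one text candidate, then one image candidate, then stops.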

Response Modal Intent Prediction
In MMDialog, the textual utterances and visual images can be freely located anywhere in the multi-modal response. Therefore, the order in which response elements of different modalities are generated or retrieved is also of great importance for multi-modal conversation. The intent prediction task aims to predict the order of different modalities in the response R̂ given the dialogue context U. Intent prediction can thus be formulated as a classification task:

m̂_j = I(U, R̂_{<j}),

where I(·, ·) is the intent prediction model, which takes the dialogue context U and the previously generated/retrieved response elements R̂_{<j} before the j-th step as inputs and predicts the modality of the next element. Specifically, the model should predict 0 when r_j is a textual utterance and 1 when r_j is a visual image. We also define a third label, 2, which indicates that the response R̂ is complete and the model should stop generating/retrieving new elements.

Evaluation of Multi-Modal Dialogue Tasks
Most of the evaluation metrics used for text generation (e.g., BLEU (Papineni et al., 2002), ROUGE (Lin, 2004)), image generation (e.g., FID and IS), or retrieval (e.g., Recall) can only be computed within a single modality. At the same time, the modality order of elements in a multi-modal dialogue response may not be aligned with that of the ground-truth response. Thus, it is non-trivial to conduct evaluation on cross-modal response elements.
In Task-1, we obtain the BLEU and ROUGE scores by aligning the generated textual parts with those in the ground-truth responses from the left. When predicting the j-th element of the response, if the model does not generate the same modality as the current ground-truth element, we assign a zero score to this step, which in effect means that the model predicts no element. However, we cannot directly adopt the same strategy for PPL and FID, the metrics for textual response generation and image generation respectively, as setting a default zero value for them is non-trivial. Besides, we can only compute IS for the generated images.
In Task-2, we perform element-level evaluation. Specifically, we compute the Recall scores for each retrieved element candidate with the expected modality in the ground-truth response. Nevertheless, this computational strategy is in fact a compromise against the misalignment of element modalities and makes the evaluation of multi-modal responses sub-optimal and inaccurate.
To tackle the evaluation issues in the above two tasks, we propose a novel evaluation metric, named MM-Relevance, which performs vision-language matching based on the large-scale pre-trained multi-modal CLIP model for the multi-modal dialogue response generation and retrieval tasks. CLIP is trained on a vast corpus of image-caption pairs from the Web. It learns to bring the embeddings of both modalities (visual and textual) together via a contrastive objective. We therefore utilize this model to assess the relevance between the predicted responses and the ground-truth responses to mitigate modality misalignment issues. Specifically, suppose we obtain a generated or retrieved multi-modal response R̂ = {r̂_j^m}_{j=1}^{J} and the corresponding ground-truth response R = {r_l^m}_{l=1}^{L}. We first align the two sequences from the left. Then, the representation vector of each element is obtained by encoding the textual utterance or visual image through the text encoder or image encoder pre-trained by CLIP, respectively. We denote the encoded vectors of the two responses as Ê = {ê_j^m}_{j=1}^{J} and E = {e_l^m}_{l=1}^{L}. Then, we compute the CLIP scores of the two elements position by position until they can no longer be aligned:

CLIP_j = cos(ê_j, e_j), j = 1, ..., min(J, L).

To penalize generated/retrieved sequences that are too long or too short, we further refine this metric as:

P_MM = (Σ_{j=1}^{min(J,L)} CLIP_j) / J,
R_MM = (Σ_{j=1}^{min(J,L)} CLIP_j) / L,
F1_MM = 2 · P_MM · R_MM / (P_MM + R_MM),

where P_MM, R_MM, and F1_MM denote the soft-precision, soft-recall, and soft-F1 scores, respectively. We take F1_MM as MM-Relevance. The relevance degree can now be computed between two modality-misaligned responses R and R̂.
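The soft precision/recall/F1 described above can be sketched as follows. This is one plausible instantiation under assumptions beyond the original text: the per-position CLIP scores over the aligned prefix are summed, then normalized by the predicted length J (soft-precision) and by the ground-truth length L (soft-recall). `clip_sim` stands in for cosine similarity between CLIP embeddings of two elements.

```python
def mm_relevance(pred, gold, clip_sim):
    """Soft precision/recall/F1 between a predicted and a gold response.

    pred, gold: sequences of response elements (aligned from the left).
    clip_sim:   a function scoring the similarity of two elements in [0, 1].
    """
    J, L = len(pred), len(gold)
    # Sum of similarities over the aligned prefix (zip stops at min(J, L)).
    total = sum(clip_sim(p, g) for p, g in zip(pred, gold))
    p_mm = total / J if J else 0.0   # penalizes over-long predictions
    r_mm = total / L if L else 0.0   # penalizes too-short predictions
    f1_mm = 2 * p_mm * r_mm / (p_mm + r_mm) if p_mm + r_mm else 0.0
    return p_mm, r_mm, f1_mm
```

With an exact-match similarity, a 3-element prediction matching a 2-element gold response on its first two positions yields P = 2/3, R = 1, F1 = 0.8, illustrating the length penalty.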
With regard to intent prediction, we follow Zang et al. (2021) and adopt the F1 score as the evaluation metric, which measures the accuracy of the model's prediction of the modality order for a dialogue turn. Specifically, we first obtain the modality sequences of the predicted and ground-truth responses as M̂ = {m̂_j}_{j=1}^{J} and M = {m_l}_{l=1}^{L}, respectively.
Then, the F1 score is computed from the element-wise agreements 1(m_l = m̂_l), where 1(·) is an indicator function taking value 1 when m_l = m̂_l and 0 otherwise. Note that since we perform element-level retrieval in Task-2, J is always equal to L, whereas in Task-1, J is determined by the modality sequence of the generated response R̂.
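The modality-agreement computation can be sketched as below, under the assumption that the indicator values are averaged over aligned positions and normalized by the longer sequence; the paper's exact F1 formula may differ.

```python
def modality_agreement(pred_modals, gold_modals):
    """Fraction of positions where predicted and gold modalities agree.

    Positions beyond the shorter sequence count as disagreements, so an
    over-long or too-short prediction is penalized.
    """
    L = max(len(pred_modals), len(gold_modals))
    hits = sum(p == g for p, g in zip(pred_modals, gold_modals))
    return hits / L if L else 0.0
```

For Task-2 the two sequences have equal length, so this reduces to plain per-position accuracy.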

Baselines
As shown in Figure 2, we build baseline models for the two multi-modal tasks defined above and assess them on MMDialog.

Multi-modal Response Generation Model
We implement the state-of-the-art multi-modal dialogue response generation model Divter (Figure 2a) proposed by Sun et al. (2022), which consists of two components: a textual dialogue response generator G and a description-to-image translator F. Specifically, G takes the dialogue context U as input and generates a textual sequence, which may contain a textual response r^t, a textual image description r^c, or both. Note that in our settings on MMDialog, there may also be several images u^v in the multi-turn dialogue context; we therefore replace these images with their descriptions u^c with the help of an image-to-description translation model. In this way, we can concatenate the textual utterances u^t and the descriptions into a sequence as the input of G. In addition, we use [UTT] and [DST] at the beginning of each textual utterance and image description, respectively, to distinguish the following content. Then, for a generated description r^c beginning with [DST], F takes it as conditional input and generates a realistic and consistent high-resolution image r^v as the visual response.
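The flattening of a multi-modal context into G's input can be sketched as follows. This is an illustrative sketch: `caption` stands in for the image-to-description model, and the (modality, content) tuple encoding is a hypothetical representation, not Divter's actual code.

```python
def build_generator_input(context, caption):
    """Flatten a multi-modal context into one marked-up text sequence.

    context: list of ("t", utterance_text) or ("v", image) tuples.
    caption: image-to-description function replacing each image u^v
             with its textual description u^c.
    """
    parts = []
    for modality, content in context:
        if modality == "t":
            parts.append("[UTT] " + content)           # textual utterance
        else:
            parts.append("[DST] " + caption(content))  # image description
    return " ".join(parts)
```

The [UTT]/[DST] markers let the generator distinguish whether the next span it produces should be read as an utterance or as an image description to hand off to F.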

Multi-modal Response Retrieval Model
Inspired by Parekh et al. (2021) and Zang et al. (2021), we also build a retrieval model R named DE++, which consists of a modality intent prediction module R_α and a ranking module R_β. As shown in Figure 2b, before each ranking action, R_α takes the dialogue context U and the previously retrieved response elements R̂_{<j} before the j-th step as inputs and predicts the intent I([U, R̂_{<j}]): either i) the response is complete and the model should stop retrieving new elements, or ii) the modality of the next element. In case i), retrieval terminates; in case ii), R_β calculates the relevance score S([U, R̂_{<j}], r) for every candidate in {r_z^m}_{z=1}^{Z} and selects the one with the highest relevance score as the response element at the j-th step.
Specifically, R_α and R_β have similar architectures. We adopt the CLIP text encoder and CLIP image encoder to represent textual utterances and images, respectively. In R_α, we concatenate all the context embeddings with a special learnable [CLS] embedding prepended at the front and feed the embedding sequence into a transformer module to predict the intent. In R_β, we prepend the [CLS] embedding to the concatenated context embedding sequence or to the candidate embedding, and then feed them into a transformer module separately. We then obtain the representation vectors of the context and the candidate and compute the relevance score as the dot product of the two vectors.
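The final ranking step reduces to a dot product between a context vector and each candidate vector. In the sketch below, plain Python lists stand in for the transformer [CLS] outputs; this is an illustration of the scoring rule, not the model code.

```python
def relevance_score(context_vec, candidate_vec):
    """Dot product of the context and candidate representation vectors."""
    return sum(c * r for c, r in zip(context_vec, candidate_vec))

def rank_candidates(context_vec, candidate_vecs):
    """Return the index of the highest-scoring candidate element."""
    scores = [relevance_score(context_vec, v) for v in candidate_vecs]
    return max(range(len(scores)), key=scores.__getitem__)
```

In training, this same dot product would be contrasted against in-batch negatives; at inference, it ranks the 1K-element candidate set described in the experimental setup.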

Experiments
Experiments are conducted on the MMDialog dataset to assess both of our baselines on the proposed multi-modal dialogue tasks. We perform response/intent prediction for all turns except the first turn of each dialogue, considering all previous turns as context.

Experimental Setup
We first sample 10K and 10K dialogue sessions for validation and testing, respectively. The detailed statistics are presented in Table 2.

[Figure 3: An example from the MMDialog test set. Left: the multi-modal dialogue context between "A" and "B" (e.g., "Animals is also one of my essential albums. Great ProgRock!"). Right: the multi-modal responses generated or retrieved by our designed baselines (e.g., "I love Animals, and I appreciated the Orwellian concept. This album sometimes gets lost by casual fans in between so many titanic achievements by PinkFloyd, but the themes addressed still hold up strong today.").]

For the retrieval task, we sample 999 negative textual
utterances and 999 negative visual images from the same split set for each dialogue, maintaining the total number of candidate elements at 1K. During training, the negatives are sampled in-batch. For the textual dialogue response generator, we fine-tune DialoGPT with the transformers library provided by HuggingFace (https://github.com/huggingface/transformers), using the "DialoGPT-medium" version. For the description-to-image translator, we implement DALL-E using the code of the "mega" version in https://github.com/borisdayma/dalle-mini, which also has the same model settings as Sun et al. (2022). We fine-tune DALL-E mega for one epoch with an initial learning rate of 1e-7 and a mini-batch size of 64. We process all images into 256 × 256 RGB format for DALL-E. To obtain the descriptions of images in MMDialog, we adopt OFA-huge for image captioning, using the code at https://github.com/OFA-Sys/OFA/tree/feature/add_transformers. All CLIP models leveraged in this paper are "openai/clip-vit-base-patch32" from https://huggingface.co/openai/clip-vit-base-patch32. When implementing Divter, we follow the same experimental configuration. As for the retrieval baseline, the representation vectors for both modalities are obtained by the CLIP model and fixed during training. The transformers used in the retrieval tasks consist of 4 Transformer layers with a hidden size of 512 and 8 heads. We train the retrieval models with an initial learning rate of 5e-7 and a mini-batch size of 512. For all baselines, early stopping on the validation set is adopted as a regularization strategy, and the best model is selected based on validation performance. The training of both tasks is conducted on 8 Nvidia Tesla A100 80G GPU cards. The BLEU and ROUGE scores are computed by the code in https://github.com/Maluuba/nlg-eval, while the IS is obtained by https://github.com/toshas/torch-fidelity.
Results of Multi-modal Response Generation Model

Table 3 reports the evaluation results of the multi-modal response generation baseline. Following Sun et al. (2022), we evaluate the textual response generation, image generation, and intent prediction tasks. Firstly, we find that the state-of-the-art model Divter achieves relatively low textual response generation performance (9.44 BLEU-1 and 11.19 ROUGE-L) on our proposed MMDialog, which validates the difficulty of the multi-modal response generation task and also demonstrates the necessity of constructing a large-scale multi-modal dialogue dataset for building data-driven models. Secondly, compared with the results on text generation, it is interesting that the model achieves better performance on the image generation task, reaching 20.53 IS. Thirdly, we observe that the baseline achieves a 71.77 F1 score on the intent prediction task, indicating that the model has a considerable ability to determine whether to generate text or images during the conversation. Finally, we also leverage the proposed MM-Relevance to evaluate the overall relevance between the generated multi-modal dialogue responses and the ground-truth ones; our baseline achieves a score of 61.85.

Results of Multi-modal Response Retrieval Model
We also run the retrieval baseline and show the results in Table 4. Our proposed baseline DE++ achieves 29.84% R@1 and 22.23% R@1 on image retrieval and textual response retrieval respectively, demonstrating the capacity of the multi-modal retrieval model and the effectiveness of CLIP representations. As for intent prediction, the F1 score is 61.81, which is inferior to the counterpart in the generative baseline Divter. This may be due to the fact that a 24-layer transformer (i.e., DialoGPT-medium) is used to encode the context in Divter, whereas only 4 transformer layers without pre-training are used in DE++. Furthermore, DE++ obtains a better MM-Relevance score than Divter, which may be attributed to the element-level retrieval in our retrieval experiments; we observe that modality alignment considerably improves the CLIP matching scores.

Case Study
To further investigate the quality of the multi-modal responses predicted by our proposed baselines, we display an example from the MMDialog test data in Figure 3. The multi-turn dialogue context between "A" and "B" is shown on the left, while the multi-modal responses generated or retrieved by our designed baselines are depicted on the right. As we can see, the textual response generated by Divter is coherent with the dialogue context, and it can also generate a realistic high-resolution image of the "Power Station" in the last turn of the context, which demonstrates the multi-modal generative capability of our designed generative baseline. As for the retrieval model, our baseline also retrieves a textual response about "PinkFloyd" and an image of the "Power Station" semantically related to the dialogue context, which verifies the effectiveness of the retrieval baseline.

Conclusion
We presented MMDialog, a large-scale multi-turn dialogue dataset towards multi-modal open-domain conversation. By extracting turns associated with images and their surrounding contexts from more than 4K topics, MMDialog provides a diverse and open-domain dataset. To facilitate research on building a more engaging multi-modal dialogue system, we define multi-modal response generation and retrieval tasks, and the MM-Relevance metric, based on MMDialog. We also build baseline models and conduct several analyses of their performance. We believe MMDialog can serve as a rich resource to propel research on multi-modal conversation and help the community develop better methods suited to more scenarios.