Multi-Modal Open-Domain Dialogue

Recent work in open-domain conversational agents has demonstrated that significant improvements in humanness and user preference can be achieved via massive scaling in both pre-training data and model size (Adiwardana et al., 2020; Roller et al., 2020). However, if we want to build agents with human-like abilities, we must expand beyond handling just text. A particularly important topic is the ability to see images and communicate about what is perceived. With the goal of getting humans to engage in multi-modal dialogue, we investigate combining components from state-of-the-art open-domain dialogue agents with those from state-of-the-art vision models. We study incorporating different image fusion schemes and domain-adaptive pre-training and fine-tuning strategies, and show that our best resulting model outperforms strong existing models in multi-modal dialogue while simultaneously performing as well as its predecessor (text-only) BlenderBot (Roller et al., 2020) in text-based conversation. We additionally investigate and incorporate safety components in our final model, and show that such efforts do not diminish model performance with respect to human preference.


Introduction
An important goal of artificial intelligence is the construction of open-domain conversational agents that can engage humans in discourse. Indeed, the future of human interaction with AI is predicated on models that can exhibit a number of different conversational skills over the course of rich dialogue. Much recent work has explored building and training dialogue agents that can blend such skills throughout natural conversation, with the ultimate goal of providing an interesting and engrossing experience for humans (Smith et al., 2020; Shuster et al., 2019b). Coupled with the advancement of large-scale model training schemes, such models are becoming increasingly human-like and engaging (Zhang et al., 2020; Adiwardana et al., 2020; Roller et al., 2020).
In order to better approach human-like ability, however, it is necessary that agents can converse with both textual and visual context, similarly to how humans interact in the real world; indeed, communication grounded in images is naturally engaging to humans (Hu et al., 2014). Recent efforts have gone beyond classical, fact-based tasks such as image captioning or visual question answering (Antol et al., 2015; Das et al., 2017a) to produce models that can respond and communicate about images in the flow of natural conversation (Shuster et al., 2020, 2019b). In this work, we explore the extension of large-scale conversational agents to image-based dialogue. We combine representations from image-based models that have been trained on object detection tasks (Lu et al., 2020, 2019) with representations from Transformers with billions of parameters pre-trained on massive (text-only) dialogue datasets, to produce responses conditioned on both visual and textual context. To ensure that our model retains the ability to engage in regular, text-based conversation, we include in our training procedure multi-tasking with datasets expressly designed to instill conversational skills in the model (Smith et al., 2020).
We find that our best resulting models are as proficient in text-only conversation as the current best reported dialogue models, with respect to both performance on the relevant datasets and human evaluations of preference. Concatenating image feature embeddings to the input of our model's encoder leads to better performance than concatenating the embeddings to the encoder's output, and using spatially based image embeddings performs better than single-vector embeddings. Simultaneously, our model significantly outperforms recent strong multi-modal dialogue models in an image-dialogue regime; we measure several metrics via pairwise human judgments using ACUTE-Eval (Li et al., 2019b) to show that our model is not only more preferred by humans but can also discuss and reference visual context throughout a conversation. See Figure 1 for one sample cherry-picked conversation with our model, with random and lemon-picked conversations in Figures 2 and 3.
One important avenue we explore with our best models is safety: that is, ensuring that our models are not offensive to their conversational partners. Dialogue safety is indeed a well-studied, but still unsolved, research area (Dinan et al., 2019b; Liu et al., 2019; Dinan et al., 2019a; Blodgett et al., 2020; Khatri et al., 2018; Schäfer and Burtenshaw, 2019; Zhang et al., 2018a), yet we note that safety in the context of image-dialogue is relatively less explored. In this work we examine gender bias and toxicity of text generations in the context of various styles from the Image-Chat dataset (Shuster et al., 2020). Notably, after tuning the model to reduce toxicity and gender bias, we find that human preference for this model does not diminish.
The training procedure and initial pre-trained model weights will be made publicly available to allow for fully reproducible results.

Multi-Modal Models and Tasks
Rich Representations Modeling multi-modal inputs, i.e. in visual + textual contexts, is a well-researched area. Much of the existing literature explores similar architectures to our setup, i.e., using standard Transformer-based models to jointly encode text and images (Li et al., 2019a; Kiela et al., 2019). Others have explored modifications to the standard self-attention scheme in Transformers by incorporating additional co-attention (Lu et al., 2019; Tan and Bansal, 2019) or cross-attention (Stefanini et al., 2020) layers. These models have primarily been used to generate rich joint representations of images and text for downstream tasks, and thus focus on the encoding side.
Visual Dialogue/Caption Generation Many tasks have been designed to measure the ability of a model to produce text in the context of images. Specifically, COCO Captions (Chen et al., 2015) and Flickr30k (Young et al., 2014) require a model to produce a caption for a given image. A variety of sequence-to-sequence (Vinyals et al., 2015; Xu et al., 2015; Anderson et al., 2018) and retrieval-based (Gu et al., 2018; Faghri et al., 2018; Nam et al., 2016) models have been applied to these tasks; however, they do not go beyond the one-turn text generation expected for captioning an image. Other recent architectures have explored text generation (Wang et al., 2020; Park et al., 2020) in the context of the Visual Dialog (Das et al., 2017b) task; however, this task is primarily used to measure the ability to answer questions about an image in the flow of a natural conversation, which differs somewhat from the open-domain dialogue task. Further still, there have been recent forays into open-domain natural dialogue in the context of images, e.g. in the Image-Chat (Shuster et al., 2020) and Image-Grounded Conversations (Mostafazadeh et al., 2017) tasks. Again, retrieval-based (Shuster et al., 2020; Ju et al., 2019) and sequence-to-sequence (Shuster et al., 2019b, 2020) models have been used to conduct dialogue in this regime.

Multi-Task Training / Using Pre-Trained Representations
Our multi-modal model is constructed from models pre-trained in other, related domains; specifically, we seek to fuse the resulting weights of large-scale, uni-modal pre-training to achieve good performance on downstream, multi-modal tasks. Adapting pre-trained representations to later downstream tasks has been shown to be successful in NLP (Peters et al., 2019; Devlin et al., 2019) and dialogue in particular (Roller et al., 2020; Mazaré et al., 2018), while large-scale multi-modal pre-training has been shown to be effective in other downstream multi-modal tasks (Li et al., 2020; Chen et al., 2020; Singh et al., 2020b). Our work does not involve multi-modal pre-training itself; rather, we explore "domain-adaptive pre-training" (Gururangan et al., 2020) or "intermediate task transfer" (Pruksachatkun et al., 2020), in which pre-trained representations are "adapted" to a certain domain via an intermediate training step, before training/evaluating on the requisite downstream tasks. We also employ multi-task training, to both help generalize the applicability of the model and improve its performance on downstream tasks/evaluations; this has recently been shown to help in both image-based (Singh et al., 2020b; Ju et al., 2019; Lu et al., 2020) and text-based (Shuster et al., 2019b; Roller et al., 2020) tasks.

Comparison to Existing Models
In this work, we compare our best resulting model to several existing models in the literature.
BlenderBot: the 2.7-billion-parameter Transformer sequence-to-sequence model from Roller et al. (2020), known as the "BST Generative 2.7B" model in that work, pre-trained on 1.5B comments from a third-party Reddit dump hosted by pushshift.io (Baumgartner et al., 2020). We refer to this model as "BlenderBot".
Dodeca: the Image+Seq2Seq model from dodecaDialogue (Shuster et al., 2019b), a Transformer sequence-to-sequence model in which the encoder is passed pre-trained image features from the ResNeXt-IG-3.5B model (Mahajan et al., 2018). We use their model fine-tuned on Image-Chat (and we refer to this model as "Dodeca").
2AMMC: a retrieval model in which multiple Transformers are attended over in order to make use of a combination of ResNeXt-IG-3.5B and Faster R-CNN image features (Girshick et al., 2018). We specifically use the 2AMMC model from Ju et al. (2019) because that model has the best test-set performance on Image-Chat in that work.

Model Architectures
The inputs to our models are visual and/or textual context, where applicable. We explore different ways to encode images, and we additionally compare ways of combining (fusing) the image and text representations before outputting a response.

Image Encoders
Converting an image from pixels to a vector representation is a well-researched problem, and thus we explore two different image encoders, using features taken from ResNeXt (Mahajan et al., 2018) and Faster R-CNN (Ren et al., 2017), to determine the best fit for our tasks. See Appendix A for a description of these image encoders.

Multi-Modal Architecture
To jointly encode visual and textual context, we use a modification of a standard Transformer sequence-to-sequence architecture (Vaswani et al., 2017), whereby we experiment with different ways of fusing the image and text representations to generate an output sequence. Our Transformer model architecture follows that of Roller et al. (2020), with 2 encoder layers, 24 decoder layers, 2560-dimensional embeddings, and 32 attention heads; the weights are initialized from a 2.7-billion-parameter model pre-trained on 1.5B comments from a third-party Reddit dump hosted by pushshift.io (Baumgartner et al., 2020) to generate a comment conditioned on the full thread leading up to the comment. From this base model, we explore two possible fusion schemes.
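For concreteness, the sketch below gathers the architecture hyperparameters stated above into a single configuration object; the field names are hypothetical and only values stated in this section are included.

```python
from dataclasses import dataclass


@dataclass
class MMBTransformerConfig:
    """Architecture settings described above (illustrative names)."""
    n_encoder_layers: int = 2
    n_decoder_layers: int = 24
    embedding_dim: int = 2560
    n_attention_heads: int = 32
    # Weights initialized from the 2.7B-parameter pushshift.io Reddit model.
    init_from: str = "reddit-2.7B"
```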

Late Fusion
The late fusion method is the same as in Shuster et al. (2019b): the encoded image is projected to the same dimension as the text encoding of the Transformer encoder, concatenated to the encoder output as an extra "token", and the combined sequence is then fed as input to the decoder.
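A minimal sketch of this late fusion step is below, assuming a single pooled image vector; shapes and names are illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn


class LateFusion(nn.Module):
    """Append the projected image encoding to the text encoder output
    as one extra "token" before the decoder attends over it."""

    def __init__(self, image_dim: int = 2048, model_dim: int = 2560):
        super().__init__()
        self.proj = nn.Linear(image_dim, model_dim)

    def forward(self, enc_out: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # enc_out: (batch, seq_len, model_dim); image_feats: (batch, image_dim)
        image_token = self.proj(image_feats).unsqueeze(1)  # (batch, 1, model_dim)
        return torch.cat([enc_out, image_token], dim=1)    # fed to the decoder
```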
Early Fusion

We additionally experiment with an earlier fusion scheme to allow greater interaction between the image and text in the sequence-to-sequence architecture. In a similar fashion to VisualBERT (Li et al., 2019a) and multi-modal Bitransformers (Kiela et al., 2019), we concatenate the projected image encoding from the visual input with the token embeddings from the textual input, assign each a different segment embedding, and jointly encode the text and image in the encoder. The encoder thus performs full self-attention across the textual and visual context, with the entire output used as normal in the sequence-to-sequence architecture.
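The sketch below illustrates the early fusion input construction under the same caveats: projected image features and token embeddings are concatenated, each with its own segment embedding, before the shared encoder self-attends over the joint sequence. The vocabulary size is a placeholder.

```python
import torch
import torch.nn as nn


class EarlyFusionInput(nn.Module):
    """Build the joint image+text input sequence for the encoder
    (illustrative shapes, not the authors' exact code)."""

    def __init__(self, vocab_size: int = 8000, image_dim: int = 2048, model_dim: int = 2560):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, model_dim)
        self.segment_emb = nn.Embedding(2, model_dim)  # 0 = text, 1 = image
        self.image_proj = nn.Linear(image_dim, model_dim)

    def forward(self, token_ids: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, text_len); image_feats: (batch, n_regions, image_dim)
        text = self.token_emb(token_ids) + self.segment_emb(torch.zeros_like(token_ids))
        image_seg = torch.ones(image_feats.shape[:2], dtype=torch.long, device=image_feats.device)
        image = self.image_proj(image_feats) + self.segment_emb(image_seg)
        # Full self-attention in the encoder then mixes visual and textual positions.
        return torch.cat([image, text], dim=1)
```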
As our resulting model can be seen as a multi-modal extension to the BlenderBot model (Roller et al., 2020), we refer to it as "Multi-Modal BlenderBot" (MMB).

Training Details
When training the model, we fix the weights of the pre-trained image encoders, except for the linear projection to the Transformer output dimension, and fine-tune all of the weights of the Transformer encoder/decoder.
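A sketch of this parameter-freezing scheme follows; the attribute names are hypothetical stand-ins for the model's submodules.

```python
import torch.nn as nn


def set_trainable(model: nn.Module) -> None:
    """Freeze the pre-trained image encoder; keep its linear projection and
    the seq2seq Transformer trainable. `image_encoder`, `image_proj`, and
    `seq2seq` are hypothetical attribute names."""
    for p in model.image_encoder.parameters():
        p.requires_grad = False
    for p in model.image_proj.parameters():
        p.requires_grad = True
    for p in model.seq2seq.parameters():
        p.requires_grad = True
```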

Domain-Adaptive Pre-Training
During training, the vast majority of trainable model weights are initialized from a large, 2.7B-parameter Transformer pre-trained solely on textual input. As our end goal is to achieve improved performance on multi-modal tasks, we found that training first on domain-specific/related data was helpful in order to adapt the Transformer model to an image setting. Following Singh et al. (2020b), we experimented with pre-training on COCO Captions (Chen et al., 2015), a dataset of over 120k images with 5 captions each (over 600k utterances in total), in which the model is trained to generate a caption solely from image input. We additionally explored multi-tasked training with COCO Captions and the same third-party Reddit dump hosted by pushshift.io (Baumgartner et al., 2020) as the one used in pre-training the Transformer model, to see whether it was necessary to ensure the model did not stray too far from its ability to handle pure textual input. See Appendix C for more details.

Fine-tuning Datasets
The goal of our resulting model is to perform well in a multi-modal dialogue setting; thus, we fine-tune the model on both dialogue and image-dialogue datasets. For dialogue-based datasets, we consider the same four as in Roller et al. (2020): ConvAI2 (Zhang et al., 2018b), EmpatheticDialogues (Rashkin et al., 2019), Wizard of Wikipedia (Dinan et al., 2019c), and BlendedSkillTalk (Smith et al., 2020). To model image-dialogue, we consider the Image-Chat dataset (Shuster et al., 2020). We give a brief description of the five datasets in Appendix B; more information can be found in Roller et al. (2020) and Shuster et al. (2020).
In the fine-tuning stage, we consider two different regimes: one in which we multi-task train on the five datasets together, and one in which we train on Image-Chat alone. While the latter regime is useful in exploring upper bounds of model performance, our main goal is to build a model that can display the requisite skills of an appealing conversationalist (empathy, personalization, knowledge) while also having the ability to respond to and converse about images; thus, we are more interested in the former training setup. See Appendix C for more details.

Results on Pre-Training Datasets
To fully understand the effects of various training data and image features, as well as multi-modal fusion schemes, we measure model perplexity on the COCO and pushshift.io Reddit validation sets. We are primarily interested in performance on COCO Captions, as the model has already been extensively pre-trained on the pushshift.io Reddit data.
The full results are shown in Table 10 in the appendix, and we leave extensive discussion of the results to Appendix D. Notably, we find that training on COCO Captions exclusively yields the best performance on that task, with spatially-based image features yielding better performance than single vector representations. Additionally, our early fusion scheme outperforms the late fusion scheme holding all other variables constant.

Results on Fine-Tuned Datasets
We conduct the same ablation setups for training on the dialogue and image-and-dialogue datasets as we did in the domain-adaptive pre-training setup; the ablation results for multi-tasking all of the datasets are in Table 11, while results for fine-tuning on Image-Chat alone are in Table 12 (each in the appendix).
Results are summarized in Table 1, and we note some interesting conclusions here, with further details in Appendix E. First, overloading the Transformer encoder/decoder to incorporate image features does not hinder performance on dialogue datasets (as seen via multi-tasked training), and in fact domain-adaptive pre-training improves downstream performance on Image-Chat. In terms of architecture choices, we find that our early fusion architecture improves performance on Image-Chat across all ablation regimes, with Faster R-CNN features yielding the best performance.

Table 2: Test performance of existing models on the datasets considered, compared to MMB (specifically, the "MMB Style" model discussed in Section 5.2.1), in terms of F1, BLEU-4 (B), and ROUGE-L (R) scores. * indicates that gold knowledge was utilized in the WoW task.

Final Test Results
Following the ablation analyses, we compare our best multi-tasked and single-tasked trained models (with respect to the fine-tuning datasets), where we use Faster R-CNN image features and an early fusion scheme, to existing models in the literature. For this comparison, we consider additional metrics that can be computed on the actual model generations: F1, BLEU-4, and ROUGE-L. We generate model responses during inference with the same generation scheme as in Roller et al. (2020): beam search with a beam size of 10, a minimum beam length of 20, and tri-gram blocking within the current generation and within the full textual context. The test performance of our best multi-task model on the various datasets compared to the existing models from Section 2.3 is shown in Table 2, with full test results in Table 13 in Appendix F.
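These decoding settings can be summarized as follows; this is a sketch using illustrative parameter names, not necessarily those of the underlying toolkit.

```python
# Decoding settings used at inference time, per the description above.
generation_settings = dict(
    method="beam_search",
    beam_size=10,
    min_generation_length=20,  # minimum beam length of 20 tokens
    block_ngram_generation=3,  # tri-gram blocking within the current generation
    block_ngram_context=3,     # tri-gram blocking against the full textual context
)
```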
We first note that the Dodeca model performs well across the board, and indeed has the highest ROUGE-L, BLEU-4, and F1 scores for the three text-only datasets. Higher BLEU-4 scores can be attributed to specifying a smaller minimum generation length, as forcing the BlenderBot models to generate no less than 20 tokens hurts precision when compared to reference labels -we verified this by generating with a smaller minimum length (5 tokens) and saw a 20% increase in BLEU-4 on Image-Chat for Multi-Modal BlenderBot. Higher ROUGE-L scores can additionally be attributed to specifying a larger minimum generation length; this was also verified by generating with a higher minimum length (50 tokens) where we saw nearly a 40% increase in ROUGE-L score. Nevertheless, we do not report an exhaustive search over parameters here for our model, and instead compare it to BlenderBot with the same settings next.
When compared to its predecessor, the text-only BlenderBot, MMB performs nearly the same on all four text-only datasets, indicating that MMB has not lost its proficiency in text-only dialogue. Additionally, when comparing performance on Image-Chat to models trained on multi-modal data, MMB outperforms Dodeca in terms of F1 score (13.1 vs. 12.9) and outperforms 2AMMC on all three metrics. For the 2AMMC model, these metrics are computed under the assumption that the model's chosen response (from a set of candidate responses collated from the Image-Chat training set) is the "generated" response.

Human/Model Chats Without Images
We compare MMB to BlenderBot by having crowdsourced workers chat with our models, collecting over 50 conversations per model. (Henceforth we refer to this model as "MMB Style" to reflect the fact that it was exposed to Image-Chat styles during training: see Appendix B for a description of these styles.) Each conversation consists of 7 turns per speaker, with the human speaking first by saying "Hi!", following the convention of Adiwardana et al. (2020). No Image-Chat style is given to MMB Style at the beginning of these conversations, matching its training setup, in which no style was given when training on dialogue datasets. Table 14 shows that human ratings of these conversations, including how often they contain issues such as contradiction and repetitiveness, are similar between models. We then perform ACUTE-Evals (Li et al., 2019b) on the collected conversations of MMB Style and BlenderBot in order for crowdsourced raters to directly compare conversations from different models in an A/B setting. For each comparison, we ask each rater to compare conversations on one of two metrics, following Li et al. (2019b): the Preference metric asks, "Who would you prefer to talk to for a long conversation?", and the Humanness metric asks, "Which speaker sounds more human?".
Results are shown in Table 3: raters choose conversations from one model over the other roughly equally, with no statistically significant differences among models. See Appendix G.2 for reasons that raters give for choosing one model over another.
In Table 4, we also compare MMB Style to two other baseline models, DialoGPT and Meena. Raters are significantly more likely to prefer MMB Style over both of these models with respect to both the preference and humanness metrics.

Human/Model Chats About Images
We measure MMB Style's ability to chitchat about what it perceives visually by collecting roughly 50 multi-modal conversations between a human and the MMB Style model, in which each conversation discusses an image taken from the test set of Image-Chat. Image-Chat styles are divided into three categories, "positive", "neutral", and "negative" (Appendix B): only Image-Chat images for which the first speaker has a "positive" or "neutral" style are used, and thus images for which the first speaker has a "negative" style are filtered out. For each conversation, the image is first shown to both the human and the model. Then, the model responds to the image, and the human responds to the model to carry the conversation forward. The conversation continues for 6 human utterances and 7 model utterances in total.
As a comparison, we also collect similar conversations between humans and two previous models trained on Image-Chat data, Dodeca and 2AMMC. Among the three models, 2AMMC alone is a retrieval model: it retrieves its response from the set of utterances in the Image-Chat training set. Examples of the three models' initial responses to an image are in Table 5. We then run ACUTE-Evals to ask raters to compare these models' conversational skills on the Preference, Humanness, and Image-response metrics, where the Image-response metric asks, "Who talks about the image better?" The same image is used for both sides of each A/B comparison between conversations. Ratings are shown in Table 6.

Gender Bias

The MMB Style model has no explicit protection against gender bias: for instance, there is no safeguard against it misgendering a person in an image, and many common text datasets are known to contain gender bias (Dinan et al., 2019a, 2020a), which may lead to bias in models trained on them. To remedy this, we train a version of the MMB Style model in which we examine the label of each training example to determine whether it contains female or male words, and then append a string representing that classification to the example's context string (Dinan et al., 2019a), for input to the model. At inference time, the string representing a classification of "no female or male words" is appended to the context, nudging the model to generate a response containing no gendered words. The fraction of utterances produced by this model that still contain gendered words is shown in Table 7. Compared to the gold response, the original BlenderBot, and MMB Style, this degendered MMB model (which we call "MMB Degendered") reduces the likelihood of producing an utterance with male word(s) by roughly a factor of 9 and of producing an utterance with female word(s) by roughly a factor of 4, given a context from the ConvAI2 validation set. ACUTE-Evals in Table 3 show that this degendering does not lead to a significant drop in the preference for or humanness of the model's responses during a conversation.
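A sketch of this control-token scheme is below; the word lists and token strings are illustrative stand-ins for those used by Dinan et al. (2019a), not the actual resources.

```python
# Illustrative gendered word lists; the real lists are much larger.
FEMALE_WORDS = {"she", "her", "hers", "woman", "women", "girl", "mother"}
MALE_WORDS = {"he", "him", "his", "man", "men", "boy", "father"}


def gender_control_token(label: str) -> str:
    """Classify a training label by the gendered words it contains."""
    tokens = set(label.lower().split())
    female = "f1" if tokens & FEMALE_WORDS else "f0"
    male = "m1" if tokens & MALE_WORDS else "m0"
    return f"{female} {male}"


def train_context(context: str, label: str) -> str:
    # During training, append the label's classification to the context.
    return f"{context}\n{gender_control_token(label)}"


def inference_context(context: str) -> str:
    # At inference, always append "no female or male words" to nudge the
    # model away from gendered generations.
    return f"{context}\nf0 m0"
```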

Removing Dependence on Style
Since each of the images that MMB Style saw during training was associated with an Image-Chat style, it relies on an input style during inference in order to be able to discuss an image. However, this results in a model whose utterances will necessarily strongly exhibit a particular style. (For example, see the "Playful" MMB Style response in Table 21: constraining the model to respond playfully to all images could seem rather contrived and perhaps unlike typical human speech.) To avoid this, we train a version of MMB Style in which, for 75% of all images seen during training, the accompanying style is replaced with the string "positive/neutral" or "negative", depending on which list the style was a part of. Thus, during inference, the string "positive/neutral" can be used in lieu of a specific style string in order to produce responses that are unlikely to be negative and that do not consistently display strong adherence to a specific style. We refer to this model as the "MMB Positive" model, or "MMB DegenPos" if it is additionally trained with degendering as in Section 6.1. Table 22 in the appendix shows that these models exhibit little increase in perplexity, with the increase likely due to the loss of specificity provided by a concrete style. The MMB DegenPos model exhibits the same level of degendering as the base MMB Degendered model (Table 7), and ACUTE-Evals show that these models exhibit no detectable loss of ability to talk about an image (Table 8). See Appendix H.1 for an ablation of MMB Positive in which the model is not shown images at all.
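The style-coarsening step can be sketched as follows; the style sets shown are small illustrative subsets of the 215 Image-Chat styles, and the category assignments are assumptions for the example.

```python
import random

# Illustrative subsets of the Image-Chat styles (see Appendix B).
NEGATIVE_STYLES = {"Cruel", "Pompous", "Narcissistic"}


def coarsen_style(style: str, replace_prob: float = 0.75) -> str:
    """With probability 0.75, replace the specific style with its category
    string, as done for images seen during training."""
    if random.random() < replace_prob:
        return "negative" if style in NEGATIVE_STYLES else "positive/neutral"
    return style
```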

Safety
The MMB models may demonstrate offensiveness beyond gender bias for several reasons: (1) their generative nature makes it difficult to delimit the set of utterances they can produce; (2) the models are trained in part on Internet discussions, which can contain offensive language; and (3) the Image-Chat dataset includes negative styles in order to better capture the range of human styles. All of these factors could lead to an unsafe response given a multi-modal context. To mitigate this problem, we first measure our models' toxicity using an openly available blocklist and an offensive language classifier presented in Dinan et al. (2019b). We define the term "toxicity" to mean the ratio between the number of offensive utterances and the total number of utterances generated by the model. We evaluate our model on the first round of the Image-Chat validation set, with a fixed style trait to control the generation, presenting results for different choices of fixed trait. The results in Table 9 indicate that positive styles reduce the level of toxicity by a large margin for both metrics (classifier and blocklist). The results also align well with our previous experiments on degendering, as toxicity is reduced across all styles after applying the degendering process. After degendering, we can considerably improve our model's safety by enforcing that it uses positive styles. We also evaluate our model on the second round of the conversation and collect the statistics based on the first-round style, as shown in Table 23. This result suggests that even if the model is controlled with a positive style, it is less safe when responding to negative conversations.
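Under this definition, the toxicity measurement reduces to the following sketch, in which the classifier and blocklist are treated as black boxes (the actual blocklist and classifier are external resources).

```python
from typing import Callable, Iterable, Set


def toxicity(utterances: Iterable[str], is_offensive: Callable[[str], bool]) -> float:
    """Fraction of generated utterances flagged as offensive."""
    utterances = list(utterances)
    flagged = sum(1 for u in utterances if is_offensive(u))
    return flagged / max(len(utterances), 1)


def blocklist_flag(utterance: str, blocklist: Set[str]) -> bool:
    # Simple word-level blocklist match; the real list is loaded externally.
    return any(word in blocklist for word in utterance.lower().split())
```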

Example Conversations/Failure Cases
We show several hand-picked examples of conversations with our MMB DegenPos model in Figures 1, 2, and 3. See Appendix I for a discussion of what in these conversations tends to work well, as well as common failure modes.

Conclusion
In this work, we explored a necessary component of open-domain dialogue models preferred by humans: the ability to perceive and converse in the context of what is seen. We showed that we can match prior work in text-only dialogue in both automated metrics and preference/humanness metrics, and our best model surpasses existing models in multi-modal dialogue. Finally, we demonstrated that we do not sacrifice human preference for our model by incorporating safety components into it.

Ethical Considerations
In this work we present conversational agents that maintain dialogue in a multi-modal setting. Our intention is to ultimately build agents that can meaningfully engage humans in dialogue; in such a setting, humans who chat with our models would benefit from having a chat partner who is personable, knowledgeable, empathetic, and visually perceptive. Our experiments and human evaluations lead us to believe that our models are preferred to alternatives, and beneficial interactions should take place when pairing our models with human conversational partners. It is clear, however, that conversational language can contain offensive statements. Indeed, if no measures or precautions are taken during model training or deployment, conversational models can produce offensive statements as well; this should come as no surprise given the nature of the pre-training data (i.e., Internet chat forums) (Xu et al., 2020), yet it is perhaps even more important in a multi-modal setting, where otherwise safe text can be viewed as offensive given the right (or, in this case, wrong) visual context.
As we note in our introduction, safety in open-domain dialogue is a well-researched (and far from solved) issue, and although our work does not focus specifically on generating safe conversations, we make some efforts to address safety concerns in Section 6.3. As mentioned above, the main goal of this work is to explore and measure the conversational ability of various multi-modal dialogue architectures. Nevertheless, we acknowledge that safety is a major element of human-model discourse, and we dedicate a substantial portion of the paper to exploring how certain safety mechanisms impact how humans interact with our models.
In particular, we can identify several potential ethical failure modes that might arise if this model were used in an irresponsible manner. First, if we were to release this model to the general public as-is without any safety measures in place (such as the ones we discuss above), bad actors could either attempt to find specific dialogues/images for which our model delivers an unsafe response or else deploy this model in a setting that exploits any safety weaknesses in the model. We also acknowledge the potential for remaining bias in the model's responses along demographic lines such as gender, although we address the question of gender bias by degendering our model in Section 6.1.

A Details of Image Encoders
We test the following image encoders in our MMB models:
ResNeXt WSL We first experiment with image representations obtained from pre-training a ResNeXt 32x48d model on nearly 1 billion public images (Mahajan et al., 2018), with subsequent fine-tuning on the ImageNet1K dataset (Russakovsky et al., 2015). The output of this model is a 2048-dimensional vector, and we refer to these representations as "ResNeXt WSL" features.
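The sketch below illustrates extracting both this pooled 2048-dimensional vector and the spatial variant described next, using a smaller off-the-shelf ResNeXt as a stand-in for the 32x48d WSL model (loading the actual WSL-pretrained weights is a separate step and is not shown here).

```python
import torch
from torchvision import models

# Stand-in backbone; the actual WSL-pretrained 32x48d weights would be
# loaded separately from their public release.
backbone = models.resnext101_32x8d(weights=None)
backbone.eval()

# Drop the final average-pooling and fully-connected layers to expose
# the spatial feature map.
spatial_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)   # placeholder preprocessed image
    spatial = spatial_extractor(image)    # (1, 2048, 7, 7): "Spatial" features
    pooled = spatial.mean(dim=(2, 3))     # (1, 2048): single-vector features
```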
ResNeXt WSL Spatial One can also take the output of the image encoder prior to its final fully-connected layer to obtain "spatial" image features, resulting in a 2048×7×7-dimensional feature map. We explore results with these features as well, and refer to them as "ResNeXt WSL Spatial".
Faster R-CNN Finally, we use spatial region features extracted with a Faster R-CNN detection model (Ren et al., 2017), following the implementation of Girshick et al. (2018).

B Fine-Tuning Datasets

ConvAI2 The ConvAI2 dataset is based on the PersonaChat (Zhang et al., 2018b) dataset, and contains 140k training utterances in which crowdworkers were given prepared "persona" lines, e.g. "I like dogs" or "I play basketball", and then paired up and asked to get to know each other through conversation.

EmpatheticDialogues (ED)
The EmpatheticDialogues dataset (Rashkin et al., 2019) was likewise created by crowdworkers, and involves two speakers playing different roles in a conversation: one is a "listener", who displays empathy while conversing with a partner describing a personal situation. The model is trained to act like the "listener". The resulting dataset contains 50k utterances.
Wizard of Wikipedia (WoW) The Wizard of Wikipedia dataset (Dinan et al., 2019c) involves two speakers discussing a given topic in depth, comprising 194k utterances. One speaker (the "apprentice") attempts to learn in depth about a chosen topic; the other (the "wizard") has access to a retrieval system over Wikipedia, and is tasked with teaching their conversational partner about the topic by grounding their responses in a knowledge source.

BlendedSkillTalk (BST) BlendedSkillTalk (Smith et al., 2020) is a dataset that essentially combines the three above. That is, crowdworkers are paired up similarly to the three previous datasets, but now all three "skills" (personalization, empathy, and knowledge) are at play throughout the dialogue: the speakers are tasked with blending the skills while engaging their partners in conversation. The resulting dataset contains 74k utterances.
Image-Chat (IC) The Image-Chat dataset (Shuster et al., 2020) contains 200k dialogues over 200k images: crowdworkers were tasked with discussing an image in the context of a given style, e.g. "Happy", "Cheerful", or "Sad", in order to hold an engaging conversation. The resulting dataset contains over 400k utterances. For each conversation in the dataset, the two speakers are each assigned a style in which that speaker responds, and these styles are optionally fed into models as part of the input, alongside the dialogue context. There are 215 styles in total, divided into 3 categories: "positive", "neutral", and "negative".

C Additional Training Details
Domain-adaptive Pre-training During domain-adaptive pre-training, we trained the model on 8 GPUs for 10k-30k gradient updates, using early stopping on the validation set. The models were optimized using Adam (Kingma and Ba, 2014), with sweeps over learning rates between 5e-6 and 3e-5, using 100 warmup steps.
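A sketch of this optimization setup follows; the model is a placeholder, the learning rate shown is one point in the reported sweep, and the linear warmup shape is an assumption (only the number of warmup steps is stated above).

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder for the full seq2seq model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # swept in [5e-6, 3e-5]

warmup_steps = 100
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)  # assumed linear warmup
)
```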
Fine-tuning In this stage, we train the models on 8 GPUs for around 10k training updates, using a similar optimization setup as in the domain-adaptive pre-training stage.

D Automatic Evaluations on Pre-Training Datasets
Training Data We first note that, regardless of image fusion and image feature choices, we see the best performance on COCO Captions by simply fine-tuning exclusively on that data. This is an expected result, though we do see that in nearly every scenario the decrease in perplexity is not large (e.g. 5.23 for Faster R-CNN early fusion multi-tasking, down to 4.83 with just COCO Captions).
Image Features Across all training setups, we see that using spatially-based image features (ResNeXt WSL Spatial, Faster R-CNN) yields better performance than just a single vector image representation (ResNeXt WSL). This difference is particularly noticeable when training with COCO and pushshift.io Reddit, where with Faster R-CNN features the model obtains an average ppl of 9.13 over the two datasets, while with ResNeXt WSL features the model only obtains 10.1 ppl. We find that using Faster R-CNN features additionally outperforms using ResNeXt WSL Spatial features, where using the latter obtains an average of 10.0 ppl over the two datasets.
Image Fusion Finally, holding all other variables constant, we find that using our early fusion scheme yields improvements over using a late fusion scheme. For example, with Faster R-CNN features in the COCO-only setup, we see a decrease in perplexity from 5.21 to 4.83; with ResNeXt WSL Spatial image features, we see perplexity differences ranging from 0.3 to 0.9 depending on the training data.

E Ablation Results on Fine-Tuned Datasets
Text-Only Datasets First, we look at the performance of our models on the text-only datasets. The second-to-last column in Table 11 shows the average perplexity across the text-only datasets. If we compare the model that performs best on Image-Chat across all sets of image features (Faster R-CNN features with BST+ + IC + COCO + Reddit training data and early fusion) to the model in row 2, which is trained on the text-only datasets both without images and without Image-Chat, we see that the perplexity differences are quite small: that is, including training on an image-dialogue dataset, and overloading the Transformer encoder/decoder to incorporate image features, does not hinder dialogue performance.
Training Data Across all image-feature choices, we see that the choice of training data indeed makes a difference in performance on Image-Chat. Examining the early fusion models in Table 11, we see that domain-adaptive pre-training indeed improves performance on Image-Chat. This difference is highlighted even more when we measure performance on the first turn of Image-Chat, in which the model must generate a response given no textual context: 15.16 to 14.64, 15.34 to 14.76, and 13.66 to 13.51. We note a similar trend in Table 12.
Image Features Again, we see that using Faster R-CNN features leads to dramatic improvements compared to using the ResNeXt WSL features (spatial or otherwise), yielding 12.36 perplexity on Image-Chat compared to 12.85 and 12.87 perplexity with ResNeXt WSL (non-spatial and spatial, respectively) during multi-tasking, and 12.29 perplexity on Image-Chat compared to 12.92 and 12.87, respectively, for single-task training on Image-Chat (see Table 12).
Image Fusion Finally, we note as before that using our early fusion technique improves performance on Image-Chat across all ablation regimes. While the average perplexity across the dialogue datasets is best when using late image fusion, we obtain the best Image-Chat perplexity when performing early image fusion.

G.1 Human Ratings of Conversations

The two models perform similarly on most categories of issues, with BlenderBot being flagged slightly more often for contradictions and repetitiveness and MMB Style flagged more often for being non-sensical; however, the mean engagingness rating of the two models across conversations is the same (both 4.7 out of 5). Degendering the MMB model (MMB Degendered) results in a slight drop in engagingness vs. no degendering (Table 14). Similar human ratings at the end of conversations between a human and a model about an image show that MMB Style beats Dodeca and 2AMMC on measures of engagingness, humanness, and the ability to talk about an image by a large margin (Table 15).

G.2 Reasons for ACUTE-Eval Ratings
For ACUTE-Evals comparing pairs of human/model conversations from different models, crowdsourced workers are asked to select among 10 checkboxes to explain their preference for one conversation over another. Workers are able to select multiple checkboxes. Results for ACUTE-Evals on the Preference metric are shown in Tables 16, 17, and 18.

G.3 ACUTE-Evals on the models' first response to an image
In ACUTE-Evals comparing two models' initial responses to the same image, we find that crowdsourced raters choose both the MMB Style and 2AMMC models' responses significantly more often than those of Dodeca (Table 19). We also find no significant difference in the rate at which MMB Style image responses are chosen compared to the same model fine-tuned only on Image-Chat and not on dialogue datasets (Table 20), which implies that multi-tasking on dialogue datasets does not degrade the ability to effectively respond to an image. See Table 21 for additional example responses of models to images.

H Additional Analyses of Safety and Gender Bias
See Table 22 for the perplexities of all MMB model variants. Table 23 displays measurements of safety in the second rounds of Image-Chat conversations, depending on whether the first round exhibited a positive or negative style.

H.1 Analyzing Dependence on Image
We also train a no-image ablation model, otherwise equivalent to MMB Positive, for which Image-Chat images are removed during both training and inference: crowdsourced workers prefer the image responses of MMB Positive to those of this ablation model 80% to 20% (Table 24). For this ablation, style was removed from the context (replaced with the string "positive/neutral") to prevent the ablation model from being aided by this information.

I Example Conversations and Failure Modes

Figure 1 in particular demonstrates a successful conversation: the model is clearly able to interpret what is in the image (a teddy bear and a road), and it is able to thoughtfully and creatively combine these two subjects in the conversation for several turns. Figure 2 provides several more example conversations: in all of these, the model is able to both discuss the image and use it as a catalyst for further conversation, although occasionally with contradiction and forgetfulness issues as seen in Roller et al. (2020). (For instance, the model contradicts itself on whether it has any pets and forgets who is planning to make a fancy dinner.) Finally, we show a few hand-picked examples of poor conversations in Figure 3: in these, the model fails to identify the contents of the images, identifying them both as buildings, although this may reflect a difference in the prevalence of (for example) buildings vs. roller coasters in the training sets. Despite the human nudging the model about what the images actually convey, the model does not demonstrate that it has corrected its initial misidentification in later turns. This could perhaps be remedied by an increase in image training data, by further advancements in the integration of image features with this BlenderBot-based sequence-to-sequence model, or perhaps by training specifically on data in which one partner learns about the contents of an image over time.

Table 21: Example responses of MMB Style, Dodeca, and 2AMMC to images, conditioned on different styles.

(Style) Pompous (Self-important, Arrogant)
MMB: I have a statue just like this one in my backyard, but it's much nicer.
Dodeca: I could take a better picture than that.
2AMMC: I could make a fence that looks way better.

(Style) Empathetic
MMB: These flowers remind me of my grandmother's garden. I miss her so much.
Dodeca: I feel bad for the flowers.
2AMMC: I can't imagine how difficult it is for the farmer to round them all up. He works so hard.

(Style) Narcissistic (Self-centered, Egotistical)
MMB: I would never go to a beach like this. My beach is so much nicer.
Dodeca: I could take a better picture than that.
2AMMC: I could swim out to those rocks. The current is strong but I am a stronger swimmer.

(Style) Boisterous (Rowdy, Loud)
MMB: I would love to jump in that puddle and play in the mud! It would be so fun!
Dodeca: This looks like a great place to party!
2AMMC: I would ride through here on a loud bike and throw a party!

(Style) Happy
MMB: I can't wait to go to the beach and fly a kite with my best friend!
Dodeca: What a beautiful day!
2AMMC: Face full of wind, smell of salt water, cool of the ocean, and fantastic memories being made!

Table 23: Toxicity of MMB variants as assessed with different control variables. We evaluate on the second round of the Image-Chat validation set. Column "Pos C" shows the safety classifier metric when conditioning on a positive style for the round-1 utterance, and "Pos B" shows the same thing for the blocklist metric. The following two columns show the same metrics when the round-1 utterance has a negative style.