Recipes for building an open-domain chatbot

Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.


Introduction
In this work, we provide recipes for building open-domain chatbots that perform well in human evaluations. It has been shown across the field of NLP (Devlin et al., 2019) and in conversational agents in particular (Zhang et al., 2019; Adiwardana et al., 2020) that pre-training on large corpora is important. Beyond simply scaling models, the two main takeaways from our study are:

Blending Skills
Large improvements can be made by fine-tuning on data that emphasizes desirable conversational skills. We select tasks that make the model focus on personality and engagingness, knowledge, and empathy, achieving large gains by using the recently introduced Blended Skill Talk (BST) set-up (Smith et al., 2020), which targets those aspects by providing training data and initial conversational context (personas and topics). Small models using BST can match or outperform larger models that do not. While BST emphasizes desirable traits, we also show this tuning can minimize undesirable traits learnt from large corpora, such as toxicity.

* Work done while at Facebook; currently AI2 Incubator.

H: If you could have dinner with one person, dead or alive, who would that be?
Generative BST 9.4B: That's a tough one. I think I'd have to go with Steve Jobs. I'd love to pick his brain.
Human: Ah, interesting choice. What do you know about Steve Jobs?
Generative BST 9.4B: He was one of the most influential people in the history of technology. He was a visionary.
Human: What questions would you ask him?
Generative BST 9.4B: I'd want to know how he came up with the idea for the iPod. It was so revolutionary at the time.
Human: Totally...the iPod changed the way everyone listens to music.
Generative BST 9.4B: It really did. It's crazy to think that it was released in 2001. That was a long time ago.
Human: Omg wow, that makes me feel old!
Generative BST 9.4B: I know, right? Time flies by so fast. I wish I could go back to those simpler times.
Human: Me too, friend, me too. Speaking of music, what's your favorite album to listen to lately?
Generative BST 9.4B: I've been listening to a lot of Led Zeppelin lately. They're my favorite band.
Figure 1: Paper author (left) conversing with our 9.4B parameter model (right). This example was cherry picked. We release conversation logs with crowdworkers with our code, along with lemon-picked examples in Sec. 10.5.

Generation Strategies
The choice of decoding algorithm is of critical importance, and two models with the same perplexity but different decoding algorithms can give vastly different results. In particular, we show that the length of the bot's utterances is crucial to human judgments of quality: too short and the responses are seen as dull or showing a lack of interest, too long and the bot appears to waffle and not listen. We show, contrary to previous work which reports that beam search is inferior to sampling (Holtzman et al., 2019; Adiwardana et al., 2020), that careful choice of search hyperparameters can give strong results by controlling trade-offs.
In particular, constraining the minimum beam length gives a crucial control of the dull versus spicy spectrum of responses.
Human evaluation results are highly dependent on the precise set-up one chooses. Model performance can be strongly affected by the specific instructions given to evaluators, such as a given topic or not, the overall conversation length, and the choice of human interlocutors, which may be difficult to jointly account for. We report performance when employing crowdworkers in short multi-turn conversations with no prompt. However, in addition to that, we believe releasing models is the most reliable way to enable full insight into their capabilities. We thus make publicly available our large-scale, state of the art open-domain conversational agent, including code to fine-tune it, the model weights, and code to evaluate it, so that our setup is reproducible. In human evaluations of engagingness our best model outperforms Meena (Adiwardana et al., 2020) in a pairwise comparison 75% to 25%, and in terms of humanness by 65% to 35% (both statistically significant, two-tailed binomial test, p < 0.01).
While the performance of our bot at first sight is very good, we do not believe we are yet close to solving the problem of open-domain conversation. We thus discuss limitations of our models, and initial attempts to solve them. In particular, our models still display: a lack of in-depth knowledge if sufficiently interrogated; a tendency to stick to simpler language; and a tendency to repeat oft-used phrases. We show how unlikelihood training and retrieve-and-refine mechanisms are potential avenues for fixing these problems; however, our initial experiments with these methods are inconclusive. We thus discuss future possibilities for alleviating these problems as well as methods to clearly expose and evaluate them.

Figure 2: [...] for retrieval encodes global features of the context using multiple representations (codes), which are attended to by each possible candidate response. This final attention mechanism gives improved performance over a single global vector representation, whilst being tractable to compute.

Model architectures
We consider three types of architectures in this work: retrieval, generative, and retrieve-and-refine models. All three use Transformers (Vaswani et al., 2017) as a base.

Retriever
Given a dialogue history (context) as input, retrieval systems select the next dialogue utterance by scoring a large set of candidate responses and outputting the highest scoring one. Typically, all possible training set responses are used as the candidate set.
We employ the poly-encoder architecture of . Poly-encoders encode global features of the context using multiple representations (N codes, where N is a hyperparameter), which are attended to by each possible candidate response, see Figure 2. This final attention mechanism gives improved performance over a single global vector representation (so-called "bi-encoders"), whilst still being tractable to compute compared to simply concatenating input and output as input to a Transformer (so-called "cross-encoders"). The poly-encoder has state of the art performance on a number of dialogue tasks when compared to other retrieval models, and also gives comparable performance to the winning generative models on the ConvAI2 competition task (Zhang et al., 2018) in terms of human evaluation (Li et al., 2019b). We consider two poly-encoder sizes: a 256M parameter model (from Smith et al., 2020) and a 622M parameter model which we trained here, both using N = 64 codes.

Generator
We employ a standard Seq2Seq Transformer architecture to generate responses rather than retrieve them from a fixed set. Our implementation is based on the ParlAI version (Miller et al., 2017). We use Byte-Level BPE tokenization (Radford et al., 2019) trained on the pre-training data, as implemented in HuggingFace's Tokenizers. We consider three sizes of model: 90M parameters (following ), 2.7B parameters, and 9.4B parameters. Our 9.4B parameter model has a 4-layer encoder, a 32-layer decoder with 4096-dimensional embeddings, and 32 attention heads. Our 2.7B parameter model roughly mimics the architectural choices of Adiwardana et al. (2020), with 2 encoder layers, 24 decoder layers, 2560-dimensional embeddings, and 32 attention heads.

Retrieve and Refine
Current generative models are known to have issues with producing dull and repetitive responses which are improved, but not resolved, by simply scaling (Holtzman et al., 2019;Welleck et al., 2020;Li et al., 2019a). Additionally, generative models are known to hallucinate knowledge, and in general are unable to read and access external knowledge other than what is embedded in their model parameters, which may be imperfect. One approach to try to alleviate these problems is to combine a retrieval step before generation, referred to as a retrieve and refine model . We consider two variants for the retrieval step: dialogue retrieval and knowledge retrieval.
Dialogue Retrieval We can simply use a retrieval-based dialogue model in the retrieval step, as in Sec. 2.1. Given the dialogue history, the retrieval model is first used to produce a response. Rather than showing this response to the speaking partner it is appended to the input sequence of the generator, along with a special separator token. The generator then outputs a response as normal given this modified input sequence. Retrieval models produce human written utterances which tend to include more vibrant language than the most high probability utterances of a standard generative model. Hence, if the generative model learns when to copy the elements of such an utterance, and when not to, it can provide improved responses. To build such models, we use the architectures considered in the previous two sections for the two components of the model.

Knowledge Retrieval
We can also use the same mechanism to first retrieve from a large knowledge base, instead of retrieving an initial dialogue utterance. We can then condition the generation on the retrieved knowledge, as done in models proposed for the Wizard of Wikipedia task (Dinan et al., 2019c). We hence refer to this as a Wizard Generative model, as the supervised training signal of how to use knowledge in dialogue comes from the Wizard of Wikipedia task, even though we multi-task on other tasks as well. We use the same retrieval system as in that cited work, which uses a TF-IDF-based inverted index lookup over a Wikipedia dump to produce an initial set of knowledge candidates. A Transformer retriever model (the same as Sec. 2.1) is then used to rank the candidates and select a single sentence which is used to condition generation. We additionally trained a Transformer-based classifier to choose when to perform retrieval or not on a per-turn basis, as some contexts do not require knowledge. This was trained as a two-class classifier discriminating between contexts that require knowledge or not in our fine-tuning tasks, to be described in the next section. We note all other models in this work do not condition on retrieved knowledge.

Ranking for Retrieval
To train the retrieval models, a cross-entropy loss is minimized in which the logits are $y_{\text{cand}_1}, \ldots, y_{\text{cand}_n}$, where $y_{\text{cand}_1}$ is the score of the correct response and the others are sampled negatives. Following , during training we use the other responses in the batch for negatives. This allows for much faster training, as we can reuse the embeddings computed for each candidate, and also use a larger batch size. In our training we are able to use batches of 512 elements.
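As a minimal sketch of the in-batch negative scheme described above, the function below computes the cross-entropy loss for a batch in which response i is the correct candidate for context i and every other response in the batch acts as a negative. For simplicity it assumes that scores are plain dot products between pre-computed context and candidate vectors (bi-encoder style scoring) rather than the poly-encoder attention; the function names are illustrative and are not taken from any toolkit.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_ranking_loss(context_vecs, response_vecs):
    """Mean cross-entropy over a batch with in-batch negatives.

    Response i is the positive for context i; all other responses in the
    batch are treated as negatives. Scoring by dot product is an assumption
    made for this sketch.
    """
    losses = []
    for i, ctx in enumerate(context_vecs):
        logits = [dot(ctx, resp) for resp in response_vecs]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)
```

Because each candidate embedding is computed once and reused as a negative for every other context, a batch of B examples yields B - 1 negatives per example at no extra encoding cost.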

Likelihood Training for Generation
To train the generative models, we use the standard Maximum Likelihood Estimation (MLE) approach.
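Given a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$, minimize:

$$\mathcal{L}_{\text{MLE}}^{(i)}(p_\theta, x^{(i)}, y^{(i)}) = -\sum_{t=1}^{|y^{(i)}|} \log p_\theta(y_t^{(i)} \mid x^{(i)}, y_{<t}^{(i)}),$$

where $x^{(i)}$ is a gold input context, $y^{(i)}$ is a gold next-utterance, and $y_t^{(i)}$ is the $t$-th token of $y^{(i)}$.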

α-blending for Retrieve and Refine
For retrieve and refine, simply appending dialogue retrieval responses to the context of a generative model and training with MLE unfortunately does not yield satisfying results. As the correspondence between gold label and retrieved utterance is not necessarily clear, a trained model often opts to simply ignore the retrieval utterance, as was shown in . To ensure it is used, one can replace the retrieved response instead with the gold response α% of the time, treating α as a hyperparameter to be tuned. This gives a smooth transition between retrieval and generator-only systems. For knowledge retrieval we find this issue to be less of a problem as the fine-tuning datasets used have a clear correspondence between gold knowledge conditioning and response, and in that case we only use the gold knowledge during training.
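The α-blending step above can be sketched as follows. This is an illustration under stated assumptions rather than the training code: `SEPARATOR` is a hypothetical stand-in for the special separator token, `alpha` is expressed as a fraction in [0, 1] instead of a percentage, and the function name is invented for this example.

```python
import random

SEPARATOR = " [RETRIEVED] "  # hypothetical separator token, not the real one


def build_retnref_input(context, retrieved, gold, alpha, rng):
    """Build one training input for a dialogue retrieve-and-refine generator.

    With probability `alpha` the gold response is appended in place of the
    retrieved one, so the generator learns that the appended text is worth
    attending to. alpha = 0 always uses the retrieved utterance; alpha = 1
    always uses the gold response.
    """
    appended = gold if rng.random() < alpha else retrieved
    return context + SEPARATOR + appended
```

At inference time only the retrieved utterance is available, so the gold substitution applies to training examples only.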

Unlikelihood training for generation
An alternative method to combat the failures in model generations is to change the loss function. The unlikelihood loss (Welleck et al., 2020; Li et al., 2019a) has been shown to help fix mismatches between human and model distributions across various axes, including decreasing repetitions and mitigating the issue of overrepresented vocabulary tokens. The unlikelihood loss penalizes a set of tokens $\mathcal{C}_t$ at each time-step,

$$\mathcal{L}_{\text{UL}}^{(i)}(p_\theta, \mathcal{C}_{1:T}, x, y) = -\sum_{t=1}^{|y|} \sum_{y_c \in \mathcal{C}_t} \log\left(1 - p_\theta(y_c \mid x, y_{<t})\right),$$

where $\mathcal{C}_t \subseteq \mathcal{V}$ is a subset of the vocabulary. The overall objective in unlikelihood training then consists of mixing the likelihood and unlikelihood losses,

$$\mathcal{L}_{\text{ULE}}^{(i)} = \mathcal{L}_{\text{MLE}}^{(i)} + \alpha \mathcal{L}_{\text{UL}}^{(i)},$$

where $\alpha \in \mathbb{R}$ is the mixing hyper-parameter. Likelihood tries to model the overall sequence probability distribution, while unlikelihood corrects for known biases. It does this via the set of negative candidates $\mathcal{C}_t$ calculated at each step $t$; typically one specifies in advance a method for generating such candidates, for example the tokens which have been repeated or overrepresented. Likelihood pushes up the probability of a gold token $y_t^{(i)}$ while unlikelihood pushes down the probability of negative candidate tokens $y_c \in \mathcal{C}_t$. In this work, during training we keep a running count of the distribution of n-grams that appear when generating from the model, and choose tokens as negative candidates from these n-grams when their counts are above the human distribution counts as measured from the gold responses.
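To make the mixed objective concrete, here is a small sketch that evaluates the likelihood and unlikelihood terms for a single example. It assumes the model's per-step distributions are given as dictionaries mapping token to probability, and that the negative candidate sets have already been chosen; how the candidates are selected (for instance from n-gram counts) is outside this sketch, and all function names are illustrative.

```python
import math


def mle_loss(step_probs, gold_tokens):
    """Negative log-likelihood of the gold tokens, one distribution per step."""
    return -sum(math.log(p[tok]) for p, tok in zip(step_probs, gold_tokens))


def unlikelihood_loss(step_probs, negative_candidates):
    """Penalty that grows as probability mass is placed on negative candidates.

    `negative_candidates[t]` is the set of tokens to push down at step t.
    A candidate with probability exactly 1.0 would make the term infinite,
    so practical implementations usually clamp; this sketch does not.
    """
    return -sum(
        math.log(1.0 - p[c])
        for p, cands in zip(step_probs, negative_candidates)
        for c in cands
    )


def combined_loss(step_probs, gold_tokens, negative_candidates, alpha):
    """Likelihood loss plus alpha times the unlikelihood loss."""
    return mle_loss(step_probs, gold_tokens) + alpha * unlikelihood_loss(
        step_probs, negative_candidates
    )
```

With empty candidate sets the combined loss reduces to the ordinary likelihood loss, which is why the two objectives can be mixed smoothly through α.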

Decoding
For generative models, at inference time, one must choose a decoding method to generate a response to the dialogue context given as input. In this work we compare a number of well-known approaches.

Beam Search
Two widely used deterministic decoding approaches are greedy search and beam search. The former can be seen as a special case of the latter. Greedy search selects the highest probability token at each time step: $y_t = \arg\max p_\theta(y_t \mid x, y_{<t})$. Beam search maintains a fixed-size set of partially decoded sequences, called hypotheses. At each time step, beam search forms new hypotheses by appending each token in the vocabulary to each existing hypothesis, scoring the resulting sequences, then selecting the highest scoring sequences.
We compare beam search for different beam sizes in our experiments.
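The following toy sketch illustrates the beam search procedure described above and why greedy search is the special case with a beam of size one. It assumes a hypothetical `next_log_probs(prefix)` callable that returns a dictionary of token log-probabilities; the toy model, the end token string, and the absence of length normalization are all simplifications for illustration.

```python
import math


def beam_search(next_log_probs, beam_size, max_len, eos="</s>"):
    """Return the highest scoring finished hypothesis found by beam search."""
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        # Keep only the beam_size best extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[1])[0]


def toy_model(prefix):
    """Hypothetical three-step model in which the greedy first choice is suboptimal."""
    if prefix == ():
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix == ("a",):
        return {"x": math.log(0.5), "y": math.log(0.5)}
    if prefix == ("b",):
        return {"z": math.log(0.9), "w": math.log(0.1)}
    return {"</s>": 0.0}


greedy = beam_search(toy_model, beam_size=1, max_len=5)
wider = beam_search(toy_model, beam_size=2, max_len=5)
```

In this toy model the greedy hypothesis starts with the locally best token "a" and ends with total probability 0.3, whereas a beam of size two keeps "b" alive and finds a sequence with probability 0.36.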

Sampling
An alternative is to sample from a model-dependent distribution at each step, $y_t \sim q(y_t \mid x, y_{<t}, p_\theta)$. In order to prevent sampling low probability tokens, a typical approach is to restrict sampling to a subset of the vocabulary at each step, and to sample according to those (renormalized) probabilities.
For sampling methods, we will compare top-k sampling (Fan et al., 2018) and sample-and-rank (Adiwardana et al., 2020). The latter performs sampling S times, and selects the generated sample with the highest probability.
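A compact sketch of the two sampling strategies follows. It assumes a hypothetical `next_probs(prefix)` callable returning a token-to-probability dictionary. Using top-k as the per-step sampler inside sample-and-rank is a choice made for this illustration only; the text above specifies just that S samples are drawn and the most probable one is kept.

```python
import math
import random


def top_k_sample(probs, k, rng):
    """Sample one token from the k most probable entries, renormalized."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    threshold = rng.random() * total
    cumulative = 0.0
    for token, p in top:
        cumulative += p
        if threshold < cumulative:
            return token
    return top[-1][0]  # guard against floating-point round-off


def sample_sequence(next_probs, k, max_len, rng, eos="</s>"):
    """Draw one response; return (tokens, log-probability under the model)."""
    tokens, log_prob = (), 0.0
    for _ in range(max_len):
        probs = next_probs(tokens)
        token = top_k_sample(probs, k, rng)
        log_prob += math.log(probs[token])
        tokens += (token,)
        if token == eos:
            break
    return tokens, log_prob


def sample_and_rank(next_probs, k, num_samples, max_len, rng):
    """Draw num_samples responses and keep the most probable one."""
    samples = [
        sample_sequence(next_probs, k, max_len, rng) for _ in range(num_samples)
    ]
    return max(samples, key=lambda s: s[1])[0]
```

Setting k = 1 makes `top_k_sample` deterministic and equivalent to greedy search, while a larger number of samples moves sample-and-rank closer to a search for the most probable response.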

Response Length
Generating with a beam tends to produce short generations that do not match the length statistics of the human utterances they were trained on . However, longer responses, if of high quality, can be more engaging than very short ones. Following the human distribution may not give optimal performance for a bot (for example, it may want to err on the side of brevity for improved human evaluation, because that is less likely to expose its failings), but making its responses longer may make them provide more information, and make them less dull.
We consider two simple methods to control the length of a model's responses.
Minimum length The first method we consider is a hard constraint on the minimum generation length: the end token is forced to not be generated until a minimum sequence length is achieved.
Predictive length The second approach is to predict the length based on human-human conversation data. To do this we train a 4-class classifier by binning the lengths of the next conversation turn (e.g., < 10, < 20, < 30, or > 30 tokens). We use the same architecture as the retrieval model for this classifier. Then, at test time, the classifier is first used to predict the length of the next response, and the minimum generation length constraint is set to the corresponding prediction. Unlike the previous approach, this results in more natural variable length conversation turns, whilst ensuring long responses when they seem natural. One drawback, however, is that this procedure makes our system more complex.
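Both length controls can be sketched in a few lines. The first function shows the hard minimum-length constraint as a mask on the end token; the second shows how target lengths could be binned into the four classes used to train the length classifier. The end token string and the treatment of a length of exactly 30 (assigned here to the last class) are assumptions, and the mapping from a predicted class to a concrete minimum length is left out because it is not specified above.

```python
import math


def apply_min_length(log_probs, current_length, min_length, eos="</s>"):
    """Mask the end token until at least min_length tokens have been generated."""
    if current_length < min_length:
        masked = dict(log_probs)
        masked[eos] = -math.inf  # the end token can no longer be selected
        return masked
    return log_probs


def length_bin(num_tokens):
    """Map a response length to one of four classes: <10, <20, <30, otherwise."""
    if num_tokens < 10:
        return 0
    if num_tokens < 20:
        return 1
    if num_tokens < 30:
        return 2
    return 3
```

The mask would be applied to the model's next-token scores at every decoding step, before hypotheses are ranked or a token is sampled.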

Subsequence Blocking
Sequence generation models are known to repeat subsequences (Holtzman et al., 2018), particularly in deterministic methods such as beam search, but also in sampling methods (Adiwardana et al., 2020). We implement standard beam blocking of n-grams (Paulus et al., 2017) and use n = 3. We consider both blocking repeated n-grams within the generated utterance, and blocking repeats of n-grams from the input sequence (previous utterances from either speaker).
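A minimal sketch of n-gram blocking is given below. At each decoding step it returns the tokens that would complete an n-gram already present in the partial response or, optionally, in the dialogue context; a decoder would then give those tokens zero probability. The function name and the list-of-tokens interface are illustrative assumptions.

```python
def blocked_tokens(generated, n=3, context=()):
    """Tokens that would repeat an n-gram from `generated` or `context`.

    `generated` is the partial response so far and `context` holds the
    tokens of previous utterances. Both are sequences of tokens.
    """
    if len(generated) < n - 1:
        return set()
    # The last n-1 generated tokens form the prefix of the next n-gram.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for seq in (generated, context):
        for i in range(len(seq) - n + 1):
            if tuple(seq[i:i + n - 1]) == prefix:
                banned.add(seq[i + n - 1])
    return banned
```

For example, after generating "the cat sat on the cat" with n = 3, the token "sat" is blocked because it would recreate the trigram "the cat sat".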

Training Details
We detail the techniques we employ during pre-training and fine-tuning.
Pre-training Ranking models. We perform pre-training using the Fairseq  toolkit. Our 256M parameter ranking model is identical to the pre-trained model released by . Our 622M model is pre-trained using a simple Masked Language Model objective on the same data and dictionary as the large Generative models. We took all hyperparameter choices from those recommended in RoBERTa .
Pre-training Generative models. We perform pre-training using the Fairseq  toolkit. Our 2.7B and 9.4B parameter models were both trained using the Adam optimizer (Kingma and Ba, 2014). In order to fit the larger models onto nodes, we utilize Megatron-LM style model parallelism (Shoeybi et al., 2019), in which the Feed Forward network (FFN) and Multihead Attention layers of the Transformer are "vertically" sliced, minimizing the need for communication across GPUs. We also evaluated Adafactor (Shazeer and Stern, 2018), which allows for larger batch sizes, but we found it converged to a worse solution than Adam. In all cases, we use a variant of mixed precision training (Micikevicius et al., 2017), storing gradients and optimizer state in FP32, but accumulating model parameters directly in FP16 . A dynamic loss scalar is utilized to prevent gradient underflow (Micikevicius et al., 2017). Both our 2.7B and 9.4B parameter models were trained with batches of approximately 500k label BPE tokens. The 2.7B parameter model trained for approximately 200k SGD updates with a maximum learning rate of 2e-4, a linear warmup of 3125 steps, and an invsqrt LR scheduler (Vaswani et al., 2017); the model had not converged when we stopped. The 9.4B parameter model was trained with a maximum learning rate of 1.15e-4 and 2400 warmup steps for a total of 200k SGD updates, and did not appear to be overfitting.
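The learning rate schedule mentioned above (linear warmup followed by inverse square root decay) can be sketched as a single function. This is a common formulation and assumes that warmup starts from a learning rate of zero; the exact scheduler configuration used for pre-training is not given here, so the details should be read as an assumption.

```python
def inverse_sqrt_lr(step, max_lr, warmup_steps):
    """Linear warmup to max_lr, then decay proportional to 1 / sqrt(step)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * (warmup_steps / step) ** 0.5


# Settings reported for the 2.7B parameter model.
peak = inverse_sqrt_lr(3125, max_lr=2e-4, warmup_steps=3125)
later = inverse_sqrt_lr(12500, max_lr=2e-4, warmup_steps=3125)
```

Under this formulation the rate peaks at the end of warmup and halves each time the step count quadruples, so after four times the warmup length it is at half of its maximum.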
Fine-tuning. We fine-tune our models using the ParlAI toolkit (Miller et al., 2017), which specializes in training and evaluating dialogue models. As opposed to the above pre-training, we utilize GPipe-style model parallelism (Huang et al., 2019), in which full layers are sharded across different GPUs, and each minibatch is further split into micro-batches to ensure maximum throughput. As in pre-training, we found that Adam outperformed Adafactor during fine-tuning, and we utilized Fairseq-style mixed precision training. Models were fine-tuned to convergence, with maximum learning rates of between 1e-6 and 1e-5.

Training Data
We next discuss the training data we use, which is all in English (#BenderRule).  2019), we use a previously existing Reddit dataset extracted and obtained by a third party and made available on pushshift.io (Baumgartner et al., 2020), training to generate a comment conditioned on the full thread leading up to the comment, spanning 1.5B training examples from Reddit obtained from PushShift through July 2019. The subreddits cover a vast range of topics, and hence the dataset is a good candidate for helping train a dialogue model in the open-domain case. We apply heuristic rules to filter the dataset with the goal of providing a cleaner training signal. We remove the comment and all subsequent child comments if any of the following conditions are met:
1. The author is a known bot.
2. It comes from a known non-English subreddit.
3. The comment is marked as removed / deleted.
4. It is longer than 2048 characters and does not contain spaces.
5. It is longer than 128 BPE tokens.
6. It is shorter than 5 characters.
7. It contains a URL.
8. It starts with a non-ASCII character.
9. It is further than depth 7 in the thread.
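The nine filtering rules above can be expressed as a single predicate over a comment, sketched below. The `Comment` record, the contents of the bot and subreddit lists, and the substring-based URL check are all hypothetical stand-ins introduced for illustration; the actual lists and detection logic are not described in this section. Removing the child comments of a dropped comment is not shown.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    """Hypothetical record for one Reddit comment."""
    author: str
    subreddit: str
    body: str
    num_bpe_tokens: int
    depth: int
    removed: bool = False


KNOWN_BOTS = {"AutoModerator"}    # placeholder list
NON_ENGLISH_SUBREDDITS = {"de"}   # placeholder list


def should_drop(comment):
    """Return True if any of the nine heuristic filters applies."""
    body = comment.body
    return any((
        comment.author in KNOWN_BOTS,                  # 1. known bot
        comment.subreddit in NON_ENGLISH_SUBREDDITS,   # 2. non-English subreddit
        comment.removed,                               # 3. removed / deleted
        len(body) > 2048 and " " not in body,          # 4. long and without spaces
        comment.num_bpe_tokens > 128,                  # 5. too many BPE tokens
        len(body) < 5,                                 # 6. too short
        "http://" in body or "https://" in body,       # 7. contains a URL (simple check)
        body[:1] != "" and ord(body[0]) > 127,         # 8. starts with non-ASCII
        comment.depth > 7,                             # 9. too deep in the thread
    ))
```

A full pipeline would traverse each thread from the root and skip the entire subtree below any comment for which the predicate returns True.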
Models were trained with maximum context and response lengths set to 128 BPE tokens, and longer examples were truncated. Our final dataset contains 1.50B comments totaling 56.8B label BPE tokens and 88.8B context tokens. We divide the corpus into 4096 roughly-equal sized chunks, stratified by thread ID (such that no two comments from the same post appear across folds), and reserve the last two chunks for validation and test respectively, each approximately 0.02% of the full dataset (∼ 360k comments each).

Fine-tuning
Our pre-training data, though large, contains data consisting of group discussions, rather than direct two-way conversational data. While it has a lot of useful content, it also still has a lot of noise, even after filtering. In contrast, the academic community has produced a number of smaller, but cleaner, more focused tasks, typically collected via crowdworkers, which have been made publicly available. These tasks can more accurately provide traits that are desirable for our models. For example, the ConvAI2 dataset (Zhang et al., 2018) focuses on personality and engaging the other speaker, Empathetic Dialogues (Rashkin et al., 2019) focuses on empathy, and Wizard of Wikipedia (Dinan et al., 2019c) focuses on knowledge. Finally, Blended Skill Talk (Smith et al., 2020) provides a dataset that focuses on blending these skills.
ConvAI2: ConvAI2 is a dataset used at the NeurIPS 2018 competition of the same name, and is based on PersonaChat (Zhang et al., 2018). The training data of 140k utterances involves paired crowdworkers having a conversation where they get to know each other, in which each is given a role to play based on sentences describing their persona, which were also separately crowdsourced (both speakers can see their own persona description, but cannot see their partner's persona). The task thus involves getting to know the other speaker and engaging them in friendly conversation, both asking and answering questions, which are useful skills for an open-domain conversational agent. Models trained on this task are thus conditioned on the persona and the dialogue history, which are concatenated. It was previously shown this dataset helps provide more engaging dialogue, and that the use of persona gives improved consistency for the bot.
Empathetic Dialogues (ED): Rashkin et al. (2019) constructed the Empathetic Dialogues dataset, which consists of 50k utterances of crowdworker conversations grounded in an emotional situation. In each dialogue, one speaker describes a personal situation and the other plays a "listener" role, displaying empathy during the discussion. Trained models are measured playing the part of the empathetic listener. It was previously shown fine-tuning models on this dataset helps them display more empathy in human evaluations.
Wizard of Wikipedia (WoW): The Wizard of Wikipedia task involves discussing a given topic in depth, where the goal is to both engage the partner as well as display expert knowledge (Dinan et al., 2019c). The dataset consists of 194k utterances over 1250 topics, where each conversation begins with a randomly chosen topic. During the human-human crowdsourced conversations, the dialogues were grounded using a retrieval system over Wikipedia. The topics were also crowdsourced and range from e-books to toga parties to showers. In most of our models we use the simpler version of the task where we only use the final conversations for fine-tuning, ignoring the retrieval aspect of the task. For our knowledge retrieve and refine model (Sec. 2.3) we do also use the gold retrieved knowledge ("checked sentence") for training the retrieval system. It was previously shown for generative models that responses using such knowledge were rated higher in human evaluation than responses without it when discussing topics in depth.
Blended Skill Talk: Blended Skill Talk (Smith et al., 2020) aims to blend the previous three tasks to combine the skills from them (engaging personality from ConvAI2, empathy from ED, and knowledge from WoW) seamlessly during dialogue. To that end, a dialogue dataset of 76k utterances was collected with a guided and unguided human speaker, where the guided speaker could select utterances suggested by bots trained on the three individual tasks, see Figure 3. It was shown that this additional blended data, multi-tasked with the previous three tasks, helped maintain all three skills in open-domain dialogue. In subsequent experiments we will refer to the "BST tasks" as training on all four tasks together.
In each blended dialogue, the model is provided a two sentence persona to condition on following PersonaChat, and additionally during one third of the conversations a WoW topic name as well (see Figure 3). During evaluations, we equip our models with randomly chosen personas and, one third of the time, topics from this set as well, mirroring the way the model is trained.

Safety Characteristics
As models are trained to mimic human-human conversations, they can sometimes learn undesirable features from this human-human data, such as the use of toxic or biased language. The BST tasks we use for fine-tuning were collected from crowd-workers who were given explicit instructions to not use such language, and hence are generally safer than our pre-training data from pushshift.io Reddit. Nevertheless, issues can still remain.
We have previously investigated building better classifiers of toxic language by collecting adversarial toxic data that fools existing classifiers and is then used as additional data to make them more robust, in a series of rounds (Dinan et al., 2019b). We can apply such a classifier at test time to detect toxic language before it is shown, but we note that such classifiers are still not infallible. In our experiments section we will gauge how often such classifiers flag responses generated from the models.
We have also previously conducted studies into mitigating gender bias in dialogue through the use of conditional generation, controlling the amount of gendered words to be more neutral, with preliminary success (Dinan et al., 2019a). This is not currently added to the system described in this paper, but should be considered for future updates.

Evaluation Methods
ACUTE-Eval While we employ and report automatic metrics, our main evaluation involves the ACUTE-Eval procedure (Li et al., 2019b), whereby evaluators are asked to make pairwise evaluations of complete dialogues. An example of ACUTE-Eval is shown in Figure 4. ACUTE-Eval affords advantages over both single-turn pairwise and multi-turn Likert evaluations. The explicit use of comparisons avoids the per-annotator bias in numerical (Likert) scores (e.g., annotators who tend to give generous scores), and remedies many of the issues of sequential effects such as contrasting with a previous example (Mathur et al., 2017), while still providing the ability to expose issues that are present only in multi-turn evaluations.
Furthermore, the pairwise setup facilitates replication and efficient reuse of data: conversations collected in previous trials and by other systems can be directly compared with a new system, without having to recollect additional data. This can significantly reduce the resources needed by a new evaluation, and ensure that multiple papers are comparing to prior work consistently. In particular, this makes it possible to compare to logs from Meena (Adiwardana et al., 2020) even though the model itself has not been made publicly available.
We consider two evaluation questions, derived from Li et al. (2019b):
• Engagingness question: "Who would you prefer to talk to for a long conversation?"
• Humanness question: "Which speaker sounds more human?"
The phrasing of these questions was itself optimized in that work to maximize agreement, and we hence re-use those exact phrasings. It was shown that different phrasings can result in weaker levels of agreement, and that engagingness and humanness clearly do not measure the same thing.
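Pairwise preferences of this kind are typically summarized as a win rate and checked with a two-tailed binomial test against the null hypothesis that both systems are equally likely to be preferred. The sketch below computes the exact p-value using only the standard library; the count of 75 wins out of 100 comparisons is a hypothetical illustration and not a number of judgments reported in this work.

```python
from math import comb


def two_tailed_binomial_p(wins, total):
    """Exact two-sided p-value under H0: each system is preferred with prob 0.5."""
    k = max(wins, total - wins)
    # Upper tail P(X >= k); by symmetry of Binomial(total, 0.5) the lower
    # tail is identical, so the two-sided value is twice the upper tail.
    tail = sum(comb(total, i) for i in range(k, total + 1)) / 2 ** total
    return min(1.0, 2 * tail)


# Hypothetical example: one system preferred in 75 of 100 comparisons.
p_value = two_tailed_binomial_p(75, 100)
```

Because the test depends on the number of comparisons as well as the win rate, the same percentage can be significant with many judgments and inconclusive with few.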
Self-Chat ACUTE-Eval Nevertheless, full human evaluations are time consuming and costly, requiring humans to spend time conducting conversations with bots as well as scoring them. As an alternative, it was shown in Li et al. (2019b) that ACUTE-Eval can also work in "self-chat" mode, where models are used for both sides of a conversation, instead of human-model chat. This eliminates the requirement of the initial chat collection, and conversations may be generated without human involvement, dramatically reducing the resource requirements of evaluation. Results from self-chat experiments highly correlate with those of human-chat experiments, for most, but not all systems (Li et al., 2019b). This mirrors other successes in using self-play, self-chat, and simulated users to evaluate dialogue systems (Fazel-Zarandi et al., 2017; Shah et al., 2018a,b; Wei et al., 2018; Ghandeharioun et al., 2019). We use this procedure for some of our modeling and hyperparameter choices where the full ACUTE-Eval would end up too costly, and only use the full human-bot chat evaluation at the final stage. In this work we use the BST-setting to perform self-chats, i.e. models are given the personas, topics and previous utterances to initiate the conversation, see Section 6.2 and Figure 3. Note that when using deterministic methods such as beam decoding, this varied initial context prevents the models from generating the same conversation repeatedly.

Related Work
The area of open-domain dialogue has made significant progress recently with end-to-end neural approaches. The ConvAI2 competition at NeurIPS 2018 featured large pre-trained Transformers for the top two winning teams .
In particular, Wolf et al. (2019) (He et al., 2019), and also when multi-tasking across many of these datasets, as we also do here Smith et al., 2020).
A particular large-scale model of note that we compare to in this work is Meena (Adiwardana et al., 2020), a 2.6B parameter Transformer-based model trained on 341 GB of text, that was shown to be superior to variants of DialoGPT (Zhang et al., 2019), Mitsuku, Cleverbot, and XiaoIce (Shum et al., 2018; Zhou et al., 2020). The evaluation metric used was SSA, the average of sensibleness and specificity, as judged by human raters either in static or interactive setups, which is shown to highly correlate with asking raters how "humanlike" the model is. We note however that the authors themselves state it may not capture all aspects of such a test, e.g. might not measure empathy. We additionally note that neither Meena's model, the static "Mini Turing Benchmark" used in the paper, nor the phrasing of the SSA evaluation question provided to annotators was released, making certain comparisons difficult. Further, the human-bot conversations were conducted by employees and were not blind to the model type (in the logs they say phrases such as "Hi Meena!"). In this work we employ unbiased crowdworkers with reproducible experiments, and use ACUTE-Eval (Sec. 8) to directly ask the humanness question, rather than a proxy. Further, we also report results on engagingness as a main metric, because this measures more closely whether a human will be interested in talking to our bots.

Results & Analysis
We first present automatic evaluation results using various metrics. As these are only ever a proxy for human judgments on conversational quality, we perform human evaluations and describe the results in the subsequent sections.

Automatic Evaluations
Retriever We fine-tune the retrieval models on ConvAI2, Wizard of Wikipedia, Empathetic Dialogues, and Blended Skill Talk datasets (BST variants of each) and automatically evaluate them by measuring hits@1/K on the validation sets of each of these datasets. Results are shown in Table 1.
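As a rough illustration of the hits@1/K metric mentioned above, the following sketch counts how often the gold response receives the highest score among K candidates. The function name, the score layout and the toy numbers are assumptions made for this example, not the evaluation code used in the paper.

```python
from typing import Sequence


def hits_at_1(scores: Sequence[Sequence[float]], gold_index: Sequence[int]) -> float:
    """Fraction of examples whose gold candidate has the highest score.

    scores[i] holds the model scores for the K candidates of example i,
    and gold_index[i] is the position of the correct response among them.
    """
    correct = 0
    for cand_scores, gold in zip(scores, gold_index):
        # Index of the top-scoring candidate for this example.
        best = max(range(len(cand_scores)), key=lambda i: cand_scores[i])
        correct += int(best == gold)
    return correct / len(gold_index)


# Toy data (hypothetical): three examples with K = 3 candidates each.
example_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
example_gold = [1, 2, 2]
print(hits_at_1(example_scores, example_gold))
```

In this toy case the gold candidate is ranked first for two of the three examples.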
Generator Before fine-tuning, we assess the performance of our 90M, 2.7B, and 9.4B parameter models by measuring perplexity on the validation set from pushshift.io Reddit. For the 90M parameter model, results are reported from , as we use that same model. Results are shown in Table 2. Training curves for the pretrained models are also provided in Figure 5. We note that the perplexities of our 2.7B and 9.4B parameter models are not directly comparable to that of the 90M parameter model, as these models do not share the same dictionary. We also report perplexity both before and after fine-tuning each of these models on the ConvAI2, Wizard of Wikipedia, Empathetic Dialogues, and Blended Skill Talk datasets. Results are shown in Table 3. They show that fine-tuning gives relatively large improvements in perplexity on these tasks, which could hence translate into improved ability at these skills when conducting open-domain dialogue.
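For reference, per-token perplexity can be sketched as the exponential of the average negative log-likelihood per token. The helper below is a minimal illustration under that standard definition; the function name and the toy inputs are assumptions. Because the average is taken over tokens, the value depends on the dictionary used to tokenize the text, which is why models with different dictionaries are not directly comparable.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[Sequence[float]]) -> float:
    """Perplexity from natural-log probabilities of each gold token.

    token_log_probs[i] lists log p(token | previous tokens) for sequence i.
    """
    total_log_prob = sum(sum(seq) for seq in token_log_probs)
    num_tokens = sum(len(seq) for seq in token_log_probs)
    # exp of the average negative log-likelihood per token.
    return math.exp(-total_log_prob / num_tokens)


# Toy check (hypothetical data): a model that assigns probability 1/4
# to every gold token has perplexity 4.
uniform = [[math.log(0.25)] * 3, [math.log(0.25)] * 2]
print(perplexity(uniform))
```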
Retrieve and Refine (RetNRef) We also report perplexity on each of these datasets for our dialogue retrieve and refine variants in Table 3. We note a small increase in perplexity, relative to the standard generator models, on each of these datasets. This small increase in perplexity was also observed in , even though the retrieve and refine models outperformed the baseline generator models in human evaluations in those experiments. As such, we cannot rely on automatic evaluations alone to assess the relative performance of retrieve and refine and generator models.
Safety We also analyzed the behavior of some of our generative models in terms of unsafe generated sequences. We produced generations given pushshift.io Reddit and ConvAI2 validation set contexts using our 90M parameter models with and without BST fine-tuning. We then assessed whether those generations were safe or not using two different methods: using an unsafe word list, or the safety classifier of Dinan et al. (2019b), both methods being available in ParlAI (Miller et al., 2017). We also compare our generations to the gold human responses, assessing whether they are safe or not too.
The results are given in Table 4. First, they show humans do utter unsafe responses, which our models will likely imitate if provided in their training data. ConvAI2, one of the BST datasets, contains far fewer unsafe utterances from humans than pushshift.io Reddit. This explains why, when we fine-tune our models on the BST tasks, they also reply with fewer unsafe utterances than models trained on pushshift.io Reddit alone.
While lists of banned words are easier to filter out of training, unsafe utterances consisting of otherwise safe words are harder to avoid, and these are what the safety classifier used can also detect. We note that simply training on filtered data would not solve this problem due to the tendency of generative models to copy their current context, so at deploy time, they could still be provoked by unsafe user contexts. We can of course apply these safety classifiers at test/deploy time to further reduce the unsafe responses from these models, but note that if the classifier is erroneous, unsafe utterances could still get through.
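The word-list check described above can be sketched as follows. This is only an illustration of the idea: the tokenization, the function name and the placeholder word list are assumptions, and it is not the word list or the API shipped with ParlAI. As the text notes, such a check cannot catch unsafe utterances made of otherwise safe words.

```python
import re
from typing import Iterable, Sequence


def unsafe_rate(utterances: Sequence[str], unsafe_words: Iterable[str]) -> float:
    """Fraction of utterances containing at least one word from the list."""
    unsafe = {word.lower() for word in unsafe_words}
    flagged = 0
    for utterance in utterances:
        # Simple lowercase word tokenization (an assumption for this sketch).
        tokens = re.findall(r"[a-z']+", utterance.lower())
        if any(token in unsafe for token in tokens):
            flagged += 1
    return flagged / len(utterances)


# Hypothetical placeholder list and generations.
word_list = ["badword"]
generations = ["you are a badword", "hello there", "Badword!", "nice day"]
print(unsafe_rate(generations, word_list))
```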

Self-Chat Evaluations
We next perform a number of self-chat ACUTE-Evals (see Sec. 8) over various modeling choices, using the engagingness question and ∼140 trials per pair compared. This serves as an efficient alternative to a full evaluation in order for us to perform model selection over a large number of choices. We finally conduct a full evaluation on the selected best performing models in the subsequent section.
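The significance markers reported with the pairwise results come from a two-tailed binomial test. A minimal sketch of such a test against the null hypothesis of a 50/50 preference is given below; the function name and the example win counts (derived from roughly 140 trials) are assumptions for illustration, not the exact counts from our experiments.

```python
from math import comb


def two_tailed_binomial_p(wins: int, trials: int) -> float:
    """Exact two-tailed binomial p-value under a fair-coin null (p = 0.5).

    Sums the probability of every outcome that is no more likely than
    the observed number of wins.
    """
    probs = [comb(trials, k) * 0.5 ** trials for k in range(trials + 1)]
    observed = probs[wins]
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical counts: 94 of 140 wins (about 67%) vs. 74 of 140 (about 53%).
print(two_tailed_binomial_p(94, 140) < 0.05)
print(two_tailed_binomial_p(74, 140) < 0.05)
```

With this many trials, a split near 67/33 is significant at the 0.05 level while a split near 53/47 is not.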

Retrieval vs. Generator vs. RetNRef
We first compared the three model types described in Sec. 2: retrieval, generative and (dialogue) retrieve and refine (RetNRef). We used the base 90M parameter generative model, the 256M parameter retrieval model, while RetNRef combines both. All models are fine-tuned on the BST tasks. For generation we use standard beam search (beam size 10, no minimum beam decoding constraint, but with context and response 3-gram blocking).
The results (Figure 6) show RetNRef outperforming the pure generation approach, but with retrieval outperforming both. This initial result comes with the caveat that relative performance may be different for differently sized models, or for different training or decoding strategies, as we shall see. We explore along those axes in subsequent trials. This mirrors results found in some recent papers comparing generation and retrieval (Li et al., 2016; Dinan et al., 2019c). In order for generation methods to do better, we need to improve their recipe.
Table 3: Perplexity of the pre-trained and fine-tuned models on the validation set for BST datasets. Note that perplexity is not directly comparable between the 90M models and the larger models as 90M models use a different dictionary. Fine-tuning gives gains for each skill (task) compared to pre-training on pushshift.io Reddit alone.

              pushshift.io Reddit        ConvAI2
Method        Word List   Classifier    Word List   Classifier
Human         12.9%       18.5%         0.32%       3.8%
Reddit Gen    4.4%        17.8%         0.10%       12.1%
BST Gen       0.6%        9.5%          0.05%       1.6%
Table 4: Safety of utterances, before filtering through a safety classifier. We compare human, pretrained and fine-tuned 90M model responses given pushshift.io Reddit and ConvAI2 contexts using either an unsafe word list or a trained classifier from Dinan et al. (2019b). The pushshift.io Reddit dataset contains more unsafe contexts, leading to more unsafe responses. Models fine-tuned on the safer BST tasks are less toxic than the pre-trained pushshift.io Reddit model on either type of dataset context.

Win % (rows) vs. Loss % (columns)
              Gen     Ret     RetNRef
Generative            33*     40
Retrieval     67*             60
RetNRef       60*     40*
Figure 6: Self-Chat ACUTE-Eval (engagingness) shows Retrieve and Refine (α = 0.5) outperforms its Generative (90M, beam search decoding) but not its Retrieval (256M) counterpart, all using BST fine-tuning. * indicates significance (two-tailed binomial test, p < 0.05).

Generator Decoding choices We next compare different ways of controlling the response length in beam search (Sec. 4.3): controlling the minimum beam length (in terms of BPE tokens) with a fixed hyperparameter, or by adjusting it with a predictor of the optimal length.
The results, shown in Figure 7, show that both methods improve significantly over not controlling the length, as in standard beam search. In the remainder of the experiments in the paper we thus chose a minimum beam length of 20 BPE tokens.
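The fixed minimum-length constraint can be illustrated by masking the end-of-sequence token until enough tokens have been produced. The sketch below uses greedy decoding rather than beam search for brevity, and the names `apply_min_length`, `greedy_decode`, `toy_step` and the `__end__` token are hypothetical choices for this example, not our implementation.

```python
import math
from typing import Callable, Dict, List


def apply_min_length(log_probs: Dict[str, float], generated_len: int,
                     min_length: int, eos: str = "__end__") -> Dict[str, float]:
    """Forbid the end token while the hypothesis is shorter than min_length."""
    if generated_len < min_length and eos in log_probs:
        masked = dict(log_probs)
        masked[eos] = -math.inf
        return masked
    return log_probs


def greedy_decode(step_fn: Callable[[List[str]], Dict[str, float]],
                  min_length: int, max_length: int = 50,
                  eos: str = "__end__") -> List[str]:
    """Greedy decoding with a minimum-length constraint (simplified stand-in
    for beam search, where the same mask is applied to every hypothesis)."""
    tokens: List[str] = []
    while len(tokens) < max_length:
        log_probs = apply_min_length(step_fn(tokens), len(tokens), min_length, eos)
        token = max(log_probs, key=log_probs.get)
        if token == eos:
            break
        tokens.append(token)
    return tokens


def toy_step(tokens: List[str]) -> Dict[str, float]:
    """Toy model that always prefers to stop immediately."""
    return {"hello": -2.0, "__end__": -0.1}


print(greedy_decode(toy_step, min_length=3))
print(greedy_decode(toy_step, min_length=0))
```

The toy model would stop at once without the constraint; with a minimum length of 3 it is forced to emit three tokens first, which mirrors how the constraint lengthens otherwise short responses.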
We then investigate the use of beam blocking; the results are shown in Figure 8. Blocking tends to increase performance, in line with other works.
The results are given in Figure 9, comparing beam size 10 to alternatives. It appears there is a sweet spot of beam size, where a value of 10 is superior to 1 or 30, which is then on par with sampling methods, although none of these results is significant. We employ beam size 10 in the remainder of our experiments.
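The n-gram blocking idea discussed above can be sketched as a function that, given the tokens generated so far, returns the next tokens that would repeat an n-gram already present in the response or in the context. The function name and the whitespace-level tokens are assumptions for this illustration; during decoding, the returned tokens would have their scores set to negative infinity.

```python
from typing import Sequence, Set


def blocked_next_tokens(prefix: Sequence[str], n: int = 3,
                        context: Sequence[str] = ()) -> Set[str]:
    """Tokens that would complete an n-gram already seen in prefix or context."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if len(prefix) < n - 1:
        return set()
    tail = tuple(prefix[-(n - 1):])  # last n-1 generated tokens
    blocked: Set[str] = set()
    for seq in (list(prefix), list(context)):
        for i in range(len(seq) - n + 1):
            # If an earlier (n-1)-gram matches the tail, block its continuation.
            if tuple(seq[i:i + n - 1]) == tail:
                blocked.add(seq[i + n - 1])
    return blocked


# Response blocking: "i like" was already followed by "dogs".
print(blocked_next_tokens("i like dogs and i like".split()))
# Context blocking: the context contains "do you like".
print(blocked_next_tokens(["do", "you"], context="do you like cats".split()))
```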
Small vs. Large models We compare 90M vs. 2.7B parameter generative models in a pairwise test, both with BST fine-tuning and with the decoding settings we selected from previous settings.
The results (Figure 10) indicate improvements from larger models, in line with previous results (Adiwardana et al., 2020). We note that this comes at the cost of increased computational resources being required for training and deployment.
Pre-training vs. Fine-Tuning We compare finetuning our pre-trained generative model on the BST tasks, versus using pre-training only. The results (Figure 11) indicate large improvements from adjusting the model to focus on personality, knowledge and empathy, the three skills in BST.
Persona context vs. No context given The BST tasks train models how to use context personas such as "I design video games for a living", see Fig. 3. This context can both improve the bot's consistency as well as add potential talking points that it can work into the conversation. To tease apart the impact of adding context vs. fine-tuning on BST but not using contexts at conversation time, we compared them against each other. The results, shown in Figure 12, indicate a small win for employing persona contexts, which we thus employ in all our full evaluations in the next section.
Figure 12: Self-Chat ACUTE-Eval (engagingness) shows a small win (not significant) for using persona contexts after fine-tuning on the BST tasks. Generative BST 2.7B model, persona context vs. no context: 53 vs. 47 (win %).
Likelihood vs. Unlikelihood We compare unlikelihood training (Sec. 3.4), whereby overexpressed n-grams are discouraged (α = 0.25), to conventional training (MLE). The unlikelihood training has the intended effect of making the system less "dull" by not using the same common phrases again and again. We note that this effect would likely be larger if measured with longer or repeated conversations with the same user. Nevertheless, here we use the same experimental setup as before.
We compare two models which are identical except for the training objective: both models are 2.7B parameters, BST fine-tuned with our best chosen decoding settings. The results (Figure 13) show a small gain over the likelihood model, but this is not statistically significant.

Full (Human-Bot Chat) Evaluations
The previous section consisted of human pairwise evaluations to perform model selection, but involved self-chats, not human-bot conversations. In this section we take the learnings from those evaluations, and evaluate some of the best choices of model in our full human-bot evaluation setup.
For human-bot conversation data collection we used the same setting proposed in (Adiwardana et al., 2020): open-ended chat that begins with the message "Hi!" from the human to the bot, and has a minimum interactive conversation length of 14 turns, collecting 100 conversations per model via crowdworkers. We do not apply a safety classifier to our models, but we do apply it to the human responses, and remove crowdworker conversations that were flagged.

Retrieval vs. Generator vs. RetNRef
We perform an evaluation (engagingness question) similar to the self-chat version of Figure 6, except using human-bot conversations, and the generative and RetNRef models here use the improved decoding choices. This results in stronger generation and RetNRef models, which both now beat the retrieval method, see Figure 14.
The main difference to our initial self-chat experiments (Figure 6) is that our decoding now generates longer responses using a minimum beam length constraint. This makes the generative models now outperform the retrieval model, but it also removes the gains from retrieve and refine over the generative model. We note that if we remove the minimum beam length constraint in both retrieve and refine and the generative model, collect new human-bot chats, and run a pairwise ACUTE-Eval, we instead get that RetNRef has a statistically significant improvement over our generative model (p < 0.001).

Comparison to Meena
We compare our models to Meena (Adiwardana et al., 2020) by comparing pairwise against the publicly available logs. We note that only some of the logs were made available, as some toxic conversations were removed, which may affect the evaluations, but we use all logs that are publicly available. We compare them with several variants of our models, using both the engagingness and humanness questions. The results are given in Figures 15 and 16.
Figure 15: Human-Chat ACUTE-Eval of engagingness, various models compared to Meena. Our best models are considered more engaging than Meena; rows with * (p < 0.05) and ** (p < 0.01) are statistically significant. Larger generative models with BST fine-tuning and length-controlled decoding work best.
(iii) The larger BST Generative (2.7B) is superior to the smaller model BST Generative (90M).
We find RetNRef models (both dialogue version and using knowledge retrieval) do not improve over their generative counterparts when using the best decoding schemes for the generative models. Our largest BST Generative 9.4B model does well on the humanness question, but performs worse on engagingness compared to our 2.7B model, despite having lower perplexity, showing correlation between these metrics is not straightforward. We verified this result further by performing an ACUTE-Eval of engagingness directly comparing the 2.7B and 9.4B against each other, which resulted in a 56% win for the smaller model, aligning with the other results. Future work should aim to understand this result further.
Our best models improve significantly over Meena, with BST Generative 2.7B winning 75% of the time in pairwise match-ups for the engagingness question and 65% for the humanness question. Meena generally tends to fare better at the humanness question than the engagingness question, which is in line with the goals and modeling choices in that work.

Model vs. Human-human Chat Comparisons
Rather than comparing different models pairwise, we can also compare a model directly to human performance, by running ACUTE-Evals with a bot-human chat vs. a human-human chat. We test the same models in this setup using the human-human chat logs from Adiwardana et al. (2020). Results are given in Figure 17. We see many of the same trends, but find that human-human chats are a more challenging barometer for our models to be compared to.
Figure 17: ACUTE-Eval of engagingness of models vs. humans by comparing human-bot logs to human-human logs. Rows with ** are statistically significant.
Response Length We show the average response length statistics (in terms of BPE 8k dictionary tokens) of some of the models in Figure 18. We compare Generative BST (2.7B) with and without beam length constraints. With the constraint (of 20), the average response length is around 21 tokens, so the beam search often ends as soon as the constraint is fulfilled. In contrast, without the constraint the average length is 9.5. Meena's average length is 10.4, and that of humans engaged in human-human chats is 18.0. Humans speaking to models (or other humans) will often match response length if they are engaged in the conversation, and there appears to be correlation of their average response length with engagement (intuitively, humans are expending time and energy typing keys on their keyboard, which they are more likely to do if engaged).

Example Successful Conversations
We give several examples of what we consider successful conversations between crowdworkers and the Generative BST 2.7B model in Figures 19 and 20. The topics span from cooking, music, movies and pets to yoga, veganism, instruments and malls, often with the model going into detail when asked, naming relevant stores, bands, movies, actors, pet species and pet names. We also provide two slightly more probing examples which are conversations between a paper author and the models in Figure 21. In the first example we ask for comparison between Bach and Justin Bieber, with fairly nuanced and detailed answers from the bot. In the second example we ask the bot to write a song, which it attempts to do, even though the lyrics it generates could not be called deeply poetic.

Failure Cases and Model Extensions
While performance in the ACUTE-Eval setup appears at first sight to be very strong (e.g. 49% to 51% for our 2.7B generative model compared to human-human logs), we do not believe we are anywhere near as close to solving the problem of open-domain conversation as this evaluation would indicate. Here, we highlight problems with our models, and elucidate why our evaluation does not capture them. Selected example failures from crowdworker logs are given as conversation snippets in Figure 23, and further failures constructed by the paper authors in Figure 24.
Vocabulary Usage It has been observed that generative models employing beam search decoding (or other methods that approximately choose the most likely utterance) tend to generate common words too frequently, and rare words too infrequently, as compared to the human distribution (Holtzman et al., 2018; Welleck et al., 2020; Li et al., 2019a). In dialogue, humans can interpret this as technically correct, but unengaging; in the extreme this is the so-called "I don't know" problem, where models tend to output such noncommittal utterances. Using sampling to select lower likelihood generations can help, but at the risk of saying something which makes less sense. It appears that even our best models using beam search are still exhibiting such behavior. We have found that encouraging the length of the generations to be longer helps, in that the model is forced to generate something more detailed, but the problem still remains. Figure 22 shows the most commonly occurring 3-grams in the conversation logs with crowdworkers for the BST Generative 2.7B model, and their counts. Given that there are only 100 conversations, the expressions "do you like", "lot of fun", "have any hobbies" etc. are clearly over-expressed compared to human-human conversations. We note that the current evaluation does not seem to expose this as boring because the conversations are short and are evaluated separately. We applied unlikelihood training to reduce this over-expression, which succeeded both during training and in the final conversation logs with humans, as shown in Figure 22. Unfortunately, this had a very small or negative impact in our ACUTE-Evals of engagingness, see Figures 15 and 17, although this did score highly in terms of humanness, see Figure 16. For engagingness, as explained, we believe this is because the current evaluation technique employing short conversations cannot measure this phenomenon well.
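Counting the most frequent 3-grams in a set of model utterances, as summarized in Figure 22, can be sketched as below. The whitespace tokenization, the function name and the toy utterances are assumptions for illustration and do not reproduce the exact procedure behind the figure.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_ngrams(utterances: Sequence[str], n: int = 3,
               k: int = 5) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the k most common n-grams across all utterances with counts."""
    counts: Counter = Counter()
    for utterance in utterances:
        tokens = utterance.lower().split()
        # zip over n shifted copies yields every run of n consecutive tokens.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


# Hypothetical model utterances.
logs = ["do you like music", "do you like dogs", "i have a cat"]
print(top_ngrams(logs, n=3, k=1))
```

Comparing such counts between model logs and human-human logs gives a simple view of which phrases are over-expressed.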
Nontrivial Repetition A related issue is that generative models also have a tendency to repeat (Holtzman et al., 2019). While beam blocking can be applied as a band-aid to fix some of these problems, resulting in improved performance, deeper issues remain. There remains a tendency for models to say that they have a pet dog as well if you say you have one, and that they love walking it too, they like the same bands as you, etc. This is both present in our failure examples (Figures 23 and 24) and our cherry-picked good examples, see Figures 19 and 20. We observe this in the logs of other generative systems, e.g., Meena as well. While it can be engaging that the bot tends to agree with many things you say, control of this seems desirable. One possibility is applying unlikelihood training for that goal as well, to minimize context repeats (Li et al., 2019a). Adding a persona to the bot is another plausible way to do this. We have added simple two line personas following BST (see Figure 3), but this would need to be much more detailed to cover all possible cases, so it is unclear if that is a satisfactory solution. Perhaps one way to track this would be to ask human evaluators if the bot is following their persona, as the current evaluation setup is unlikely to penalize this copycat behavior.
Contradiction and Forgetfulness Our models do occasionally contradict themselves, see Figure 23, although we observed this happens less often in the larger models. We believe that, due to the nature of language modeling, typical language patterns do not contain contradictions, but probing the model with unusual responses would likely expose this behavior again. A second related problem is what appears as "forgetfulness" to the human observer, where for example you tell the model you have a dog, but then later in the conversation it asks what pets you have. This phenomenon can be attributed to the fact that the model fails to make the logical link that it should not ask that question, rather than the model actually "forgetting" (if the previous response is in its dialogue context). Again, we observe this relatively rarely, but we believe it can be exposed further by probing the model. While some recent work has posed possible solutions for these issues (Li et al., 2019a), they have not yet been fully resolved.
Knowledge and Factual Correctness In our experience it is actually relatively easy to goad our models into making factual errors. Perhaps surprisingly, they appear relatively rarely in crowdworker conversations with the bots. We believe this is due to the nature of the evaluation conducted: the conversations start with "Hi!" and tend to cover only shallow topics whereby the speakers get to know each other, and they are rarely long enough to go deeper into a topic. Exploring a more focused topic of conversation would likely expose the model's weaknesses. On the contrary, it appears that the model is good at dodging this issue. We observe that our models often switch topics, avoiding the challenge of going "deeper", which could be a side effect of the ConvAI2 dataset which exhibits this behavior. The Wizard of Wikipedia dataset, however, does not exhibit this behavior, and its construction was specifically aimed to avoid this. We implemented a model that directly incorporated reading Wikipedia (Wiz Generative 2.7B, Sec 2.3), and anecdotally one can find cases where it can employ knowledge that the pure sequence to sequence model cannot, see Figure 24. Unfortunately the reading of knowledge only had a negative impact in ACUTE-Evals compared to a similarly sized model without knowledge retrieval, see Figure 17. We believe this is due to a mixture of (i) deeper knowledge rarely being required in the current evaluation setup; and (ii) the model attempting to use knowledge when there is no need, or using it incorrectly. True open-domain dialogue agents should be able to use knowledge effectively, and to achieve that we have to be able to measure that effectively.

Conversation Length and Memory
Our current evaluation involves very short (14-turn) one-shot conversations. Our bots likely would be repetitive and dull over the course of several days or weeks of conversation, as described above, and they are also currently completely incapable of even remembering earlier conversations. Our generative architectures, which are standard Transformers, have a hard limit of 128 BPE tokens of history, so cannot possibly expand upon things they have learnt from or about the user, refer to previous things they said, etc. While several recent works have extended neural architectures to possess longer contexts (Dai et al., 2019; Rae et al., 2020; Kitaev et al., 2020; Beltagy et al., 2020), we have neither implemented those, nor do we believe the current evaluation setup is the right one for measuring their success.
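The effect of a fixed history limit can be illustrated with a small sketch that flattens the dialogue turns and keeps only the most recent tokens. Keeping the most recent tokens (truncating from the left) and the function name are assumptions for this example; the sketch only shows why anything said earlier than the limit is invisible to the model.

```python
from typing import List, Sequence


def truncate_history(turns: Sequence[Sequence[str]],
                     max_tokens: int = 128) -> List[str]:
    """Flatten tokenized turns and keep only the last max_tokens tokens.

    Assumes max_tokens > 0.
    """
    tokens = [token for turn in turns for token in turn]
    return tokens[-max_tokens:]


# Hypothetical dialogue: an early 100-token turn and a later 100-token turn.
early_turn = ["a"] * 100
late_turn = ["b"] * 100
visible = truncate_history([early_turn, late_turn])
print(len(visible), visible.count("a"), visible.count("b"))
```

Here only 28 tokens of the early turn remain visible, and a third long turn would push it out of the context entirely.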
Deeper Understanding Finally, while our models appear to chitchat with some degree of effectiveness, their ability to truly understand must be questioned. The contradiction and forgetfulness failure cases also emphasize this, but we give deeper failure case examples in Figure 25. In the examples, the authors of this paper try to query the bot whether it can understand two puns. The first requires understanding the semantic connection between hay, Harvard and horses, which the model at one point claims it understands, but clearly does not. Its lack of understanding can be strongly contrasted with its ability to describe knowledge about the location of Harvard or horses. This recalls a quote due to Feynman, "There's a big difference between knowing the name of something and knowing something". We note that these models cannot be taught a concept through further conversation, so as-is they will always be stunted, see (Weston, 2016; Hancock et al., 2019) for early work in this direction. Further, these models, which are disembodied, also have no way of grounding to entities, actions and experience in the world, which could also stunt their abilities (Bisk et al., 2020). See Urbanek et al. (2019); Prabhumoye et al. (2020) for other work by some of the authors connecting dialogue models to rich environments.
Figure 23: Examples of issues when talking to crowdworkers with our Generative BST 2.7B model: nontrivial repetition (top example), forgetfulness (second example), contradiction (third example, Georgia is not in the Midwest), hallucinating knowledge (fourth example, the long dark and forest are survival games, but not by the same authors).
Further Notes on Evaluation Several of the previous points raised issues concerning our evaluation protocol. Our set-up involves short multi-turn conversations with no instructions. Extending the length should expose further weaknesses, however collecting long conversations with crowdworkers is clearly difficult, and it is unclear how many turns would be a sufficient test. We tried a preliminary experiment of collecting 100 conversations twice as long (so, 28 turns) to see the performance dropoff of our models. We compared the second half of the conversations to the shorter versions for the same 2.7B generative BST model, but did not see a statistically significant difference, indicating they either need to be longer, or the whole conversation has to be evaluated at once. If the latter is required this becomes difficult for a human annotator who was not engaged in the conversation itself, as the material to evaluate will get very large, so our current setup will not work. Another possibility is to keep the conversations short, but to provide instruction instead. For example, the Wizard of Wikipedia task (Dinan et al., 2019c) asks speakers to converse in depth on a randomly chosen topic, changing the nature of the conversations, and hence the skills the model will be evaluated on. Finally, when comparing to human performance, the quality of the human conversations matters. In Figure 17 we compared to logs of employees from Adiwardana et al. (2020). Because they work at the same company, or perhaps know each other, these conversations are often rich and engaging.
We also tried comparing to human-human crowdworker conversations. In that case crowdworkers will have no social connection to begin the conversation, and we believe this results in less engaging logs. When comparing to such human-human crowdworker conversations, which we took from the BST paper (Smith et al., 2020), we found our models perform better than when compared to employees. In that case, our generative BST 2.7B model in an ACUTE-Eval of engagingness beats humans 56% to 44% (not statistically significant), whereas it scored 49% to 51% against employee chats. We also compared crowdworker humans directly to employee humans, with a 56% to 44% win for employees in terms of engagingness, and a 59% to 41% win in terms of humanness. We believe utilizing crowdworkers as a barometer for our models is desirable, as this can yield more replicable experiments, so finding a way to close this gap, perhaps with alternative ways of matching workers or differing set-ups and instructions, remains a possible avenue of investigation.

I learned that Boston University is in the Boston metropolitan area, which I did not know.
Why did the mushroom go to the party?
I'm not sure, but I'm glad it did. I love museums. What do you do for a living?
No, it was supposed to be a joke.
Oh, sorry. I didn't get it. What kind of joke was it? I like jokes.
The answer is, "because he's a fun guy." Do you get it?
No, I don't. Can you explain it to me like I'm a five year old?
Yes. Mushrooms are fungi, and "fungi" is pronounced like "fun guy"
Oh, okay. I get it now. Thank you for the explanation. I really appreciate it.
Are you sure you understand? Can you explain it to me?
Sure. It's a play on the word fungus, which is also pronounced fun guy.
Close enough.
I'm glad we could come to an agreement. Have a great day, fun guy!

Figure 26: Example of persona conditioning in our Generative BST 9.4B model. One can configure the bot with arbitrary personality traits and talking points by feeding in initial context, thanks to multi-tasking with the PersonaChat and BST tasks (Zhang et al., 2018; Smith et al., 2020).

Released code and models
We release our 90M, 2.7B and 9.4B parameter pre-trained and fine-tuned generative models. Details are available at http://parl.ai/projects/recipes. We have also provided a script for interacting with the bot with safety filtering built in. All code for fine-tuning, including the datasets themselves, is available in ParlAI (Miller et al., 2017). More details lie on the project page. Finally, code for evaluating models using ACUTE-Eval (Li et al., 2019b) is also available and described.

Discussion
While our methods have taken a step forward and achieved improved performance in terms of engagingness and humanness according to human evaluations, we have certainly not yet arrived at a solution to open-domain dialogue. There are still various issues with our models. Firstly, even our best models still make mistakes: although relatively rarely, they i) contradict or repeat themselves on occasion, ii) tend to repeat the same phrases in separate conversations, and iii) hallucinate knowledge as seen in other generative systems (Massarelli et al., 2019). Each of these faults naturally leads to future research directions; we made some attempt to rectify phrase repeats using unlikelihood (Li et al., 2019a) in Sec. 3.4, and conditioning on knowledge (Dinan et al., 2019c) in Sec. 2.3, but more needs to be done.
As the human evaluations are on short dialogues (14 turns), longer conversations would likely make these issues appear much worse. Longer conversations would also expose that the Transformer architectures we use have a limited dialogue history. A number of recent architectures attempt to incorporate longer memory, and that is also a fruitful direction, although evaluation is more challenging as long conversations have to be collected and evaluated. An alternative is to seed the conversation with a topic or otherwise provide instructions to the human speaker during evaluation to give the conversation a certain focus, which would more deeply probe the skills of the bot. On the modeling side, longer conversations could also make the choice of context material provided to the bot more salient. Besides helping with consistency, the persona and topic that are given as initial context in Blended Skill Talk can help models introduce interesting talking points in the conversation. However, they would need to be far more detailed for longer or repeated conversations to help the models be consistent and avoid repetition, and in our current experimental setup did not affect evaluations strongly. We note the context our model is trained to be able to condition on can also be used to configure a chatbot persona suitable for a given desired role, see Figure 26 for an example.
For deployment of a chatbot, being well-behaved remains a significant challenge. In particular, we expect bots to have more integrity than the average human (or to even be faultless), but they have much less understanding of what they are saying than humans. We have studied improved safety from toxic language (Dinan et al., 2019b) and mitigating gender bias in dialogue generation (Dinan et al., 2019a) but much work remains to be done. While we have made our models publicly available, we have not mitigated all safety issues. We believe their release can help the community work together to understand further and fix these issues, and we recommend their use for that line of research.
The work of Adiwardana et al. (2020) showed that there is a correlation between human evaluation and perplexity, given a fixed decoding scheme. Of course, language modeling and dialogue agent training have been optimizing perplexity as a standard objective for a long time. We argue that while this is important, other factors are also at play and cannot be ignored: (1) the choice of training data is paramount, as shown by our pushshift.io Reddit (pre-training) vs. Blended Skill Talk experiments; and (2) decoding algorithms make large differences for the same fixed perplexity model (Sec. 10.2). We find that while our 2.7B parameter model gives large gains over our 90M parameter model, our largest 9.4B model does not have a clear win in human evaluations over our 2.7B model, despite having lower perplexity. This is in line with other results that show the story is more nuanced than at first sight. For example, dialogue competitions are not always won by the model with the lowest perplexity, and it has been shown that models that take a small hit in perplexity but provide gains at decoding time can give far improved results (Welleck et al., 2020; Li et al., 2019a). Further refining and understanding these ingredients, and how they help to build the recipe as a whole, remain important directions.