Neural Generation Meets Real People: Building a Social, Informative Open-Domain Dialogue Agent

We present Chirpy Cardinal, an open-domain social chatbot. Aiming to be both informative and conversational, our bot chats with users in an authentic, emotionally intelligent way. By integrating controlled neural generation with scaffolded, hand-written dialogue, we let both the user and bot take turns driving the conversation, producing an engaging and socially fluent experience. Deployed in the fourth iteration of the Alexa Prize Socialbot Grand Challenge, Chirpy Cardinal handled thousands of conversations per day, placing second out of nine bots with an average user rating of 3.58/5.


Introduction
Despite recent major advances (Adiwardana et al., 2020), open-domain chit-chat (friendly, social, casual conversation) remains a challenging task. In addition to difficulties with the sheer length and open-endedness required, social chatbots, or "socialbots," often struggle with fluency, whether due to the canned responses of manually constructed dialogue trees (Walker et al., 2001) or the anomalies of neural generators (Nie et al., 2021). But just being error-free isn't enough: to have a rewarding conversation, socialbots must be personable, displaying emotional intelligence, a rich personality, and an understanding of social dynamics. Although methods exist to address many of these issues individually, combining all of these features into a full-bodied conversation remains difficult.
In this paper, we describe Chirpy Cardinal, an open-domain conversational socialbot, which aims to bridge the gap between traditional dialogue tree-based approaches (Walker et al., 2001; Chen et al., 2018) and large pretrained neural dialogue agents (Adiwardana et al., 2020; Roller et al., 2020). Capable of discussing thousands of topics, Chirpy centers emotional and social intelligence with the goal of authentic, engaging interaction. Specifically, we make the following contributions:
• Conversations with open-domain socialbots often lack a stable structure. To ameliorate this, we present an extensible design for open-domain dialogue which prioritizes conversational stability and flexibility through mixed initiative (Horvitz, 1999).
• Although pretrained neural generators (Collins and Ghahramani, 2021) can be extremely fluent, real-life deployment can suffer from a lack of both controllability and consistency (Nie et al., 2021). To address this, we describe several approaches to integrating neural generation into a symbolic setup, achieving local fluency without sacrificing global coherence.
• Towards the goal of a rewarding conversation, we suggest a set of approaches, ranging from small routines to complete submodules, which aim to make our socialbot a good conversational partner. We focus on being both flexible (handling a wide variety of topics in an interesting and informative way; Section 4) and personable (empathizing with the other interlocutor even in difficult topics or situations; Section 5).
Deployed in the Alexa Prize Socialbot Grand Challenge 4, Chirpy Cardinal reached thousands of users per day; with conversations lasting up to 45 minutes at a time, it placed second out of nine agents in the finals. We open-source our system as an extensible framework for open-domain social dialogue (github.com/stanfordnlp/chirpycardinal; stanfordnlp.github.io/chirpycardinal), providing an example of real-world deployment of conversational NLP systems.

Figure 1: An example dialogue. To respect users' privacy, this is a conversation with an author, not a real user. We model dialogue as a series of subconversations, whose topics are parsed and tracked by a neural entity linker. Each subconversation is managed by an RG, whose "scaffolded" architecture comprises hand-written treelets incorporating numerous dynamic elements, ranging from neural generation to retrieval from Wikipedia to neurally rephrased fun facts. Prefix-based generation provides controllability, especially for sensitive topics like personal issues.

[LAUNCH always begins the conversation. If we've talked with the user before, we confirm their identity ("It's nice to chat with you again, Alice!").]
Bot: Hi, this is an Alexa Prize Socialbot. I'd love to get to know you a bit better before we chat! Is it all right if I ask for your name?
User: My name is Alice
Bot: Well it's nice to meet you, Alice! I'm excited to chat with you today.
[NEURAL CHAT prompts first-time users about food, a subject of universal interest.]
Bot: So, I hope you're having a wonderful morning. What did you have for breakfast today?
[Our neural generator responds directly to the user's answer.]
Bot: I'm sorry to hear that. Perhaps you can tell me about a food you like to eat instead?
User: How about fish and chips?
[The Entity Linker (B) identifies fish and chips as the new entity. With a food item mentioned, FOOD (E.5) starts. Phrases like "in the cloud" acknowledge our semi-embodied status.]
Bot: Ah yes, fish and chips are one of my favorite things to eat up here in the cloud. Personally, I especially like the batter in it, I think it gives them a really nice flavor. What do you think?
User: I love it when the batter is crunchy
[Neural generation with the prefix "I agree!" produces a controllable opinion that doesn't contradict the user. We then add a fun fact scraped from Wikipedia.]
Bot: I agree! The crunch and the texture is really good. It's a shame it's hard to find a good crispy batter these days, though. Did you know that fish and chips were first made in England around the th century? I can't believe people have been eating fish and chips for so long!
User: wow that's a long time
[FOOD terminates; note that we respond "It really is" to any comment containing "that's".]
Bot: It really is. Anyway, I'm feeling hungry now! Thanks for recommending fish and chips!
[We generate a TRANSITION from fish and chips to the United Kingdom to continue the conversation smoothly.]
Bot: Speaking of fish and chips, I was reading recently and found out that fish and chips is a common take-away food in the United Kingdom and numerous other countries. Do you wanna talk about the United Kingdom?
User: Sure!
[The user accepts the new entity. WIKI (4.2) offers a fun fact from Reddit, rephrased for conversationality.]
Bot: I just learned recently that violent pirates could have been executed in the united kingdom until .
User: Oh, that's surprising
[WIKI switches to a more informational mode, suggesting sections from the Wikipedia article to discuss.]
Bot: Isn't it? Come to think of it, I did read something the other day about United Kingdom's Culture and Geography, if you're interested in talking about any of that.
User: Tell me about its culture
[WIKI gives a summary of the requested section.]
Bot: The culture of the United Kingdom has been influenced by many factors including: the nation's island status; its history as a western liberal democracy and a major power; as well as being a political union of four countries with each preserving elements of distinctive traditions, customs and symbolism.
User: so how would you describe its culture
[The neural annotator detects a question (D), and WIKI's Q&A module responds (C.4).]
Bot: If I'm not mistaken, cultural superpower.
[PERSONAL ISSUES: we acknowledge user sharing with a neurally generated response indicating willingness to listen further.]
User: no that's okay, I'd like to stop talking

System Design
We model a user dialogue as a series of subconversations (Figure 1), each handled by a response generator (RG). Varying greatly in scope and domain, each RG handles a specific topic (e.g. MOVIES, SPORTS) grounded in the outside world. RGs comprise dialogue trees (Weizenbaum, 1966), whose nodes, which we term treelets, implement custom logic (e.g. intent classification or retrieval) to generate a response. At the start of each turn, the user utterance is annotated for linguistic features (Appendix C), then processed in parallel by all RGs. By default, the previous turn's RG is selected to respond; should that RG crash, or a different RG request to take over, we seamlessly switch RGs and move to a new subconversation.
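The per-turn dispatch logic can be sketched as follows. This is an illustrative reconstruction, not the production interface: the Response fields, RG names, and fallback behavior are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Response:
    text: Optional[str]      # None models a crashed or abstaining RG
    takeover: bool = False   # True if this RG requests control

def select_rg(proposals: Dict[str, Response], previous_rg: str) -> str:
    """Keep the previous turn's RG unless it fails or another RG
    with a usable response explicitly requests to take over."""
    prev = proposals.get(previous_rg)
    if prev is not None and prev.text is not None:
        for name, resp in proposals.items():
            if name != previous_rg and resp.takeover and resp.text is not None:
                return name
        return previous_rg
    # Previous RG crashed or produced nothing: fall back to any RG
    # that offers a usable response.
    for name, resp in proposals.items():
        if resp.text is not None:
            return name
    return "FALLBACK"
```

A takeover request starts a new subconversation; otherwise the previous RG simply continues its dialogue tree.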

Navigation
To enable mixed initiative-shared user-bot responsibility in driving the conversation (Horvitz, 1999)-we provide a suggested navigational path, while letting users deviate drastically from it. Specifically, each RG continues through its dialogue tree until exhausting its subconversation; we then transition to another RG by bringing up a previously user-mentioned topic ("You mentioned cats earlier; would you say you're a big fan?"), mentioning a tangentially related topic that we can discuss well, or simply sampling a new RG and corresponding topic at random. Users may explicitly change the topic ("can we talk about roblox"); implicitly suggest a desire to redirect the conversation ("yeah" or "uh-huh"); or otherwise behave in ways that require the bot to act dynamically ("i don't know, how about you?"). We handle these deviations from the conversational flow through neural handlers that allow periods of flexibility before returning to the overall conversational structure (Appendix F).

Entity Handling
To allow users to discuss a vast array of interesting topics relevant to their lives, we support any Wikipedia entity as a topic of discussion. To do so, we entity-link (Kolitsas et al., 2018) the user utterance to relevant entities using a fine-tuned BERT model (Broscheit, 2019; see also B.3), mitigating ASR errors through a phonetic similarity search (B.2).
Since incorporating Wikipedia article titles directly into bot utterances can be awkward (e.g. "can we talk about cat"), we refer to entities by more natural talkable names (e.g. "cats"), generated using GPT-3 (Brown et al., 2020).

Neural Generation
Although neural generative models (Roller et al., 2021) have achieved success in open-domain dialogue, significant obstacles impede deployment in real-life situations: neural text degeneration (Holtzman et al., 2020; Welleck et al., 2019), hallucination (Dziri et al., 2021), and inconsistency (Zhang et al., 2018). In addition, high latency can make such models challenging to deploy in practice (Worswick, 2020). In this section, we investigate ways to utilize the power of such models in the context of structured dialogue. We propose integrating neural generation into hand-written scaffolding, aiming to benefit from its variety and fluency while maintaining coherence over time.

DistillBlender: A Fast, General-Purpose Neural Generator
For general use, we distill a single model from BlenderBot-3B (Roller et al., 2021) with 9 decoder layers, reducing latency significantly over the original model. We use it as follows:
• The NEURAL CHAT RG, which directly exposes lightly edited neural model outputs as a subconversation. Due to BlenderBot's end-to-end training, this is initially a rich, fluent conversational experience, but due to rapid degradation we terminate after 5 turns.
• Conditional prompting (Keskar et al., 2019), which enables controllability in a structured context. We apply hand-written prefixes to guide the model towards fluent, contextually appropriate completions (Table 1).

WIKI
In contrast to humans, open-domain chatbots are commonly expected to be able to "engage in conversation on any topic" (Adiwardana et al., 2020). Towards this end, the WIKI RG discusses any entity. We aim to be informative, not overwhelming; in addition to encouraging users to share their own knowledge and experience about the entity, we bring up interesting factoids from /r/todayilearned (conversationally rephrased; E.3.4), as well as infilled remarks. We then discuss the entity in more depth based on its article, flexibly acknowledging user questions and comments with the Q&A handler (C.4) or neural generation.

OPINION
A core part of social chit-chat (Walker, 2009), exchanging and commenting on opinions allows a socialbot to project a stronger sense of personality. The OPINION RG solicits users' opinions on topics and reciprocates with its 'own' opinions (sourced from Twitter), including occasional disagreement to help engage user interest (E.4).

Rules-based RGs
In order to broaden the scope of our bot, we manually build several domain-specific response generators. FOOD, which always opens the conversation, discusses common foods scraped from Wikipedia. MOVIES uses the Alexa Linked Data API to discuss movies and actors. MUSIC uses the MusicBrainz database to discuss songs, artists, and music genres. SPORTS uses the ESPN API to discuss NFL football and NBA basketball. We describe these RGs in more detail in Appendix E.

Being Personable
To achieve truly social conversation, a socialbot must be a good conversational partner: empathetic, supportive, and interested in what its human interlocutor has to say (Salovey and Mayer, 1990;Li et al., 2017). In this section, we describe several approaches that aim to achieve this, ranging from full RGs to smaller subroutines.

Handling Personal Issues
Many users, especially those who chat with our socialbot looking for companionship, share personal struggles with our bot, requiring emotional sensitivity and tact. Handling such conversations purely neurally would result in rapid degeneration due to neural toxicity (Dinan et al., 2021). To address this, the PERSONAL ISSUES RG responds to personal disclosures using active listening techniques (Bodie et al., 2015), asking exploratory questions about the nature of the user's issue ("When did you start feeling this way?") and validating their concerns ("I see, that sounds difficult."). On the other hand, a significant subset of users become verbally abusive during the conversation (Curry and Rieser, 2018, 2019). We follow the strategy of Li et al. (2021): a de-escalating statement to avoid confrontation, addressing the user by name ("John"), then changing the topic.

Figure 2: Engagement metrics vs. rating. We bucket conversations based on four engagement metrics (number of turns, number of distinct entities, average user utterance length, and average bot utterance length) and plot each bucket against user rating (Likert 1-5 scale, measured per conversation). 95% confidence intervals computed via bootstrapping (n = 1000).

Self-disclosure
With an overall goal of self-disclosure (disclosing goals, attitudes, and personal interests to support interpersonal intimacy; Altman and Taylor, 1973; Ignatius and Kokkonen, 2007), the ALIENS RG entails a five-part series of discussions in which the socialbot muses about its pet topic, the possible existence of extraterrestrial life, as well as its own identity and sense of purpose. This RG comes up only after sufficient rapport has been built between user and bot, i.e., after 30 turns in the conversation.

Personalization
Users often expect chatbots to remember personal preferences and user details (Chaves and Gerosa, 2021; Svikhnushina et al., 2021) and to tailor their responses accordingly (Neururer et al., 2018; Shum et al., 2018). We personalize bot responses with the user's preferences: for example, regarding the Olympics, "Ah, that makes sense since you did say it's your favorite sport!". Referencing this user state across conversations makes repeated conversations with Chirpy feel fresh and dynamic, rather than retreading past questions and topics.

Results
In this work, we have outlined a set of design priorities and corresponding approaches for designing a fluent, flexible, and sociable chatbot. We validate these through the Alexa Socialbot Grand Challenge 4: engaging in approximately 1,000 conversations per day, our socialbot achieved an average user rating of 3.55, ending the development period tied for first place in rating. Validating our design goals, we observe high ratings for a hybrid neural-scaffolded approach (FOOD, etc.), personable RGs (ALIENS), and open-domain techniques (WIKI) (Figure 3). Our socialbot engages in long, varied conversations without repeating itself (Figure 2).
That said, both overall rating and sample conversations testify that Chirpy remains far from the goal of truly compelling and enjoyable human-bot interaction. We do not argue that our approaches are sufficient-or even necessary-to create such an ideal system; rather, we hope that the priorities outlined here can serve as a starting point to help inform further socialbot development, whether purely neural or hybrid in nature.

Ethics Statement
In this work, we have presented a conversational agent that conducts an open-domain dialogue. We believe that many people would enjoy having a chat partner who is empathetic and knowledgeable, and our ratings seem to suggest that a reasonable number of people appreciate their conversations enough to want to talk to the bot again. Prior to engaging with the chatbot, all user participants are required to consent to their conversations, feedback, and ratings being recorded, as per the Alexa Terms of Use. No actual user conversations or identifying information is used in this paper.
However, as our system incorporates computational methods for generating conversational utterances automatically, there exists a risk that users may be exposed to unsafe utterances or discussion topics. Conversational models of all kinds can produce sexist, racist, or otherwise unsafe statements; neural conversational agents can be particularly vulnerable due to pre-training on Internet chat forums, which can be particularly toxic (Xu et al., 2020). To mitigate this risk, our system incorporates a safety module that prevents our model from producing utterances with certain hard-coded words or categories. Yet the use of a blacklist in itself raises additional ethical issues, as poorly designed blacklists can marginalize communities by blocking topics that, ideally, one should be able to discuss equitably.
Finally, the human-like nature of open-domain dialogue systems can be particularly damaging when used in an adversarial context, e.g. by state actors (Boshmaf et al., 2012). Ultimately, like all text generation methods, the benefit of releasing an open-domain dialogue model must be weighed against its possible downsides.

A System Details

A.1 State Management
For reliability, our function is stateless; therefore, to preserve information between turns, we store our bot's overall state in an external PostgreSQL state table (see Figure 4). At the end of each turn, the bot's overall state contains the user's utterance, the conversational history, the NLP Pipeline annotations for the user's utterance, and a state for each individual RG. Each individual RG state contains information required by that RG: for example, it might contain the current treelet in the RG's dialogue graph, or a list of the utterances and/or entities that have been discussed, to avoid repetition.
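The stateless-turn pattern can be sketched as follows, with sqlite3 standing in for the PostgreSQL state table; the schema, field names, and placeholder response are illustrative assumptions, not the production code.

```python
import json
import sqlite3

def init_db(conn):
    # A single-row-per-user blob table stands in for the real state table.
    conn.execute("CREATE TABLE IF NOT EXISTS state (user_id TEXT PRIMARY KEY, blob TEXT)")

def handle_turn(conn, user_id, utterance):
    """Stateless turn handler: all persistent state is read from and
    written back to the external table within the turn."""
    row = conn.execute("SELECT blob FROM state WHERE user_id = ?", (user_id,)).fetchone()
    state = json.loads(row[0]) if row else {"history": [], "rg_states": {}}
    state["history"].append(utterance)                       # conversational history
    response = f"You have sent {len(state['history'])} message(s)."  # placeholder NLG
    conn.execute(
        "INSERT INTO state (user_id, blob) VALUES (?, ?) "
        "ON CONFLICT(user_id) DO UPDATE SET blob = excluded.blob",
        (user_id, json.dumps(state)),
    )
    return response

conn = sqlite3.connect(":memory:")
init_db(conn)
handle_turn(conn, "alice", "hello")
print(handle_turn(conn, "alice", "hi again"))  # prints: You have sent 2 message(s).
```

Because the handler holds no in-process state, any worker can serve any turn, which is what makes crash recovery and horizontal scaling straightforward.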

A.2 Response Design
Responses and prompts both carry a priority, with the highest-priority response or prompt chosen at the corresponding stage. In general, the RG which responded last has the highest priority; however, RGs can optionally specify a lower priority so that other RGs take over, or a higher priority to take over from another RG. In practice, these priority overrides are rarely used due to their tendency to produce a choppy conversation.

A.3 Navigational Intent Classifier
A user has positive navigational intent if they want to discuss a topic; conversely, negative navigational intent means that the user would like to avoid discussing a topic. Users may express navigational intent while specifying a topic ("can we talk about minecraft"), referring to the current topic ("let's discuss this more"), or referring to no topic ("I don't want to chat anymore"). Positive and negative navigational intents can even be combined ("I don't want to talk about movies any more, let's chat about you"). We classify navigational intent using manually constructed regexes, which achieve extremely high precision.
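The flavor of these regexes can be illustrated with a toy classifier; the two patterns below are simplified stand-ins for the hand-tuned, high-precision production regexes.

```python
import re

# Illustrative (not production) patterns: positive intent optionally
# captures a topic; negative intent matches avoidance phrasing.
POSITIVE = re.compile(
    r"\b(?:can we|let'?s)\s+(?:talk|chat|discuss)(?:\s+about)?\s+(?P<topic>.+)", re.I)
NEGATIVE = re.compile(
    r"\b(?:don'?t|do not|stop)\s+(?:want(?:\s+to)?\s+)?(?:talk|chat|discuss)", re.I)

def navigational_intent(utterance):
    """Return (has_negative_intent, positive_topic_or_None)."""
    neg = bool(NEGATIVE.search(utterance))
    m = POSITIVE.search(utterance)
    return neg, (m.group("topic") if m else None)
```

Note that a single utterance can trigger both patterns at once, matching the combined case described above.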

B Entity-Linking Details
Detecting and understanding references to real-world entities is essential to any open-domain conversational system; we find that users appreciate being able to discuss a wide variety of topics that interest them or are relevant to their lives. For our socialbot, we train and deploy a neural entity linker that links spans to Wikipedia entities.

B.1 Entity Pool
To obtain our pool of potential entities, we process the May 20th, 2020 dump of English-language Wikipedia using MWParserFromHell and Spark. We store our data in a large ElasticSearch index, keeping only entities with at least 200 cross-references in Wikipedia. In total, we have 171,961 entities.
Notably, certain entities are inappropriate to discuss even if correctly entity-linked by our model; for example, our system is unable to handle abstract nouns well (e.g., philosophy, film). To ameliorate this, we manually created a set of low-precision entities composed of both WikiData categories (e.g., conspiracy theory, financial risk, research method) and specific common entity names (e.g., bank, catalog, coast). The bot will not start a conversation itself about such entities; however, it is able to handle explicit user navigational requests (e.g., can we talk about the bank). Separately, we also ban certain racial, religious, and other identity-based terms that are unlikely to result in a good conversation on either the bot's or user's part, as well as certain short acronyms (e.g. cet, ep, fm) that are almost always triggered by ASR errors.

B.2 Candidate generation
For a given user utterance, we want to compute the set of entities that the user could possibly be referring to; for example, if the user mentions "swift", this could refer to the bird, the musical artist, or the programming language. To do so, for each possible span, we pre-compute the set of entities for which the span serves as a Wikipedia anchor text, creating a mapping from spans to sets of candidate entities. At execution time, for all n-grams in the user utterance with 5 or fewer tokens, we retrieve the set of candidate entities from our database.
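A toy version of this span-to-candidates lookup, with a tiny hand-made anchortext map standing in for the precomputed Wikipedia-derived database:

```python
from collections import defaultdict

# Toy anchortext map (the real mapping is precomputed from Wikipedia
# anchor texts and served from a database).
ANCHORTEXT = {
    "swift": {"Swift (bird)", "Taylor Swift", "Swift (programming language)"},
    "taylor swift": {"Taylor Swift"},
}

def candidate_entities(utterance, max_len=5):
    """Collect candidates for every n-gram of at most max_len tokens."""
    tokens = utterance.lower().split()
    candidates = defaultdict(set)
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            span = " ".join(tokens[i:i + n])
            if span in ANCHORTEXT:
                candidates[span] |= ANCHORTEXT[span]
    return dict(candidates)
```

The ambiguous span "swift" yields all three candidates, which the disambiguation model (B.3) must then resolve.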
Since we do not have access to the original user audio, ASR errors can impede candidate generation (Chen et al., 2018). For example, if a user's reference to the film Ford v Ferrari is erroneously transcribed as "four v ferrari", a naïve entity linker will fail to identify the correct entity. To address this, we pre-compute phoneme and metaphone representations for all of our entities (e.g. converting Harry Potter to 'HH EH R IY P AA T ER' and 'HRPTR'). At execution time, each n-gram's candidate set is augmented with the sets for spans with similar phoneme/metaphone representations.
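The idea can be sketched with a deliberately crude phonetic key in place of real phoneme/metaphone representations; the key function and the similarity threshold below are illustrative assumptions, not the production pipeline (which uses actual G2P/metaphone output).

```python
import difflib

def phonetic_key(text):
    """Crude stand-in for a phoneme/metaphone key: keep consonants,
    collapse adjacent repeats. Real systems use G2P or metaphone."""
    out = []
    for ch in text.lower():
        if ch.isalpha() and ch not in "aeiou":
            if not out or out[-1] != ch:
                out.append(ch)
    return "".join(out)

# Toy entity-to-key table (precomputed offline in the real system).
ENTITY_KEYS = {
    "Ford v Ferrari": phonetic_key("ford v ferrari"),
    "Harry Potter": phonetic_key("harry potter"),
}

def phonetic_candidates(span, threshold=0.8):
    """Entities whose phonetic key is similar to the span's key."""
    key = phonetic_key(span)
    return {ent for ent, ekey in ENTITY_KEYS.items()
            if difflib.SequenceMatcher(None, key, ekey).ratio() >= threshold}
```

Under this toy key, the ASR error "four v ferrari" still lands close enough to Ford v Ferrari to recover the candidate.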

B.3 Entity disambiguation
Given a set of candidate entities, we want to select those candidates that the user is interested in. Towards this end, we fine-tune a BERT-medium model (Devlin et al., 2019) to disambiguate entities, following Broscheit (2019) with minor modifications. Specifically, we learn an embedding for each entity in our dataset. Then, given a span within a user utterance, we model the probability that the span refers to a given candidate entity as the dot product between the contextual span representation and the entity's embedding. At deployment, we keep only entities with a predicted likelihood of at least 0.5, and use only the highest-likelihood entity for each span.
We depart from Broscheit by mean-pooling over the contextualized span representation, rather than doing per-token entity-level disambiguation. Finetuning takes about 20 days using 4 Titan X GPUs; during deployment, we execute using CPU only.
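The scoring step can be sketched with toy vectors in place of learned BERT token representations and entity embeddings:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_pool(token_vecs):
    """Mean-pool token vectors into a single span representation."""
    dim = len(token_vecs[0])
    return [sum(v[i] for v in token_vecs) / len(token_vecs) for i in range(dim)]

def best_entity(token_vecs, entity_embs, threshold=0.5):
    """Score each candidate as sigmoid(span . entity_embedding);
    return the top candidate if it clears the likelihood threshold."""
    span = mean_pool(token_vecs)
    scored = {ent: sigmoid(sum(a * b for a, b in zip(span, emb)))
              for ent, emb in entity_embs.items()}
    ent, p = max(scored.items(), key=lambda kv: kv[1])
    return ent if p >= threshold else None
```

Returning None when no candidate clears 0.5 mirrors the deployment rule of dropping low-likelihood links rather than forcing a guess.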
We will release all reproduction code as well as models in the future.

B.4 Entity Tracking
At any given point, we track the current entity (the current subject of conversation), a set of untalked entities (entities which the user has mentioned but we have not yet addressed), and a set of rejected entities (entities which the user does not want to discuss and which are no longer brought up by our bot). These are updated every turn as follows: • Entities receiving negative navigational intent ("can we not talk about paraguay") are rejected. Non-specific negative navigational intent ("let's not discuss this") causes the current entity to be rejected instead.
• Entities receiving positive navigational intent ("can we talk about mexico") are set as the current entity. The previous subconversation ends, and all RGs are prompted to handle the new current entity instead.
• If the currently active RG asked a question on the last turn, the current highest-priority entity is identified as the presumable user answer and set as the current entity. Additionally, if the previous question expects a particular category of entities (e.g. "What's your favorite movie?"), we pick the highest-priority entity matching the expected category (e.g., film).
• All remaining entities are marked as untalked (to be possibly discussed later).
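The update rules above can be sketched as a single per-turn function; the tracker layout, argument names, and dictionary-based category lookup are illustrative simplifications (the real system checks entity categories via WikiData).

```python
def update_entities(tracker, mentioned, neg_intent_topic=None,
                    pos_intent_topic=None, expected_category=None,
                    categories=None):
    """Per-turn entity-tracking sketch. `mentioned` is a priority-ordered
    list of linked entities; neg_intent_topic may be the sentinel
    "CURRENT" for non-specific negative intent."""
    categories = categories or {}
    if neg_intent_topic == "CURRENT":
        tracker["rejected"].add(tracker["current"])   # "let's not discuss this"
    elif neg_intent_topic:
        tracker["rejected"].add(neg_intent_topic)     # explicit rejection
    if pos_intent_topic:
        tracker["current"] = pos_intent_topic         # explicit topic request
    elif mentioned:
        pool = [e for e in mentioned if e not in tracker["rejected"]]
        if expected_category:
            # Prefer entities matching the category the question expects.
            pool = [e for e in pool if categories.get(e) == expected_category] or pool
        if pool:
            tracker["current"] = pool[0]
            tracker["untalked"].update(pool[1:])      # saved for later
    return tracker
```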

C Annotators
All annotators-modules which provide linguistic annotations for the user utterance-are executed in parallel at the beginning of each turn.

C.1 CoreNLP
We use the following annotators from Stanford CoreNLP (Manning et al., 2014): tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition, constituency parsing, dependency parsing, coreference resolution, and sentiment analysis. Because user utterances arrive lowercased and without punctuation, we use caseless models for part-of-speech tagging, constituency parsing, and named entity recognition. We use these annotations for certain hand-written NLU operations.

C.2 Dialogue Act Classifier
Dialogue acts, an ontology over user intents (Stolcke et al., 2000; Jurafsky et al., 1997), have been successfully employed in open-domain dialogue agents (Yu et al., 2019). We modify MIDAS (Yu and Yu, 2021), an annotation schema designed specifically for human-chatbot dialogue, to better fit the needs of our bot, removing 4 labels (apology, apology-response, other, and thanks) due to low frequency in our conversations and creating 5 new labels: correction, clarification, uncertain, non-compliant, and personal question. In total, our modified schema has 24 labels. Evaluated on the MIDAS test set, a fine-tuned BERT baseline achieves .78 micro-F1; however, evaluated on an OOD test set composed of our own conversations, it achieves only .53 (Table 3). Although self-training (McClosky et al., 2006) proved ineffective, hand-labeling additional OOD conversations achieved a micro-F1 of 0.81. The predictions of this final model inform navigation, as well as RG-specific NLU.

C.3 Question Classifier
Users often spontaneously ask factual questions, personal questions, follow-up questions, and even questions unrelated to the current topic. Recognizing and answering these questions is important, particularly for user initiative, but is also non-trivial, as ASR-transcribed user utterances do not contain punctuation. To recognize questions, we fine-tuned a RoBERTa model (Liu et al., 2019; Wolf et al., 2019) on a simplified version of the Dialogue Act training data, framing the task as binary classification conditioned only on the user utterance. This model achieved an F1-score of 0.92 and improved the reliability of question detection.

D Neural Generation
Our neural agent is the distilled BlenderBot model described in Section 3.1. We used a batch size of 1 for training on a single V100 GPU. For decoding, we use top-k sampling (k = 5) with temperature T = 0.7. To encourage response diversity across the conversation, we sample sequences with a minimum length randomly chosen from {5, 10, 15, 20, 25}; in practice, the length of the generations is 0-2 tokens above the minimum selected length. Additionally, we use delayed beam search (Massarelli et al., 2020), with the conversational history up to 128 tokens in the past serving as context. After decoding, we first filter out offensive, null, and repetitive responses, as well as questions after the first turn. We then select a final response based on the posterior likelihood, among other metrics.
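The post-decoding filtering and selection step can be sketched as follows; the offensive-word list, the exact-match repetition check, and the candidate scores are toy stand-ins for the production filters and likelihood ranking.

```python
OFFENSIVE = {"darn"}  # toy placeholder for the real offensive-word filter

def is_repetitive(candidate, history):
    """Toy repetition check: exact (case-insensitive) match against history."""
    return candidate.strip().lower() in {h.strip().lower() for h in history}

def select_response(candidates, history, allow_questions):
    """Drop null/offensive/repetitive candidates (and questions when
    disallowed), then keep the highest-likelihood survivor."""
    best, best_score = None, float("-inf")
    for text, logp in candidates:
        if not text.strip():                                   # null
            continue
        if any(w in text.lower().split() for w in OFFENSIVE):  # offensive
            continue
        if not allow_questions and "?" in text:                # late question
            continue
        if is_repetitive(text, history):                       # repetitive
            continue
        if logp > best_score:
            best, best_score = text, logp
    return best
```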

D.1 Analysis
We find that our model qualitatively outperforms a GPT-2 (Radford et al., 2019) baseline fine-tuned on Empathetic Dialogues (Table 4), with similar latency. That said, our model still suffers certain limitations out-of-the-box; we discuss strategies for mitigating these issues.
Diversity-coherence tradeoff For our model, beam search decoding yields coherent but non-diverse responses, while stochastic decoding results in nonsensical generations even under top-p (Holtzman et al., 2020) or top-k (Fan et al., 2018) sampling. Delayed beam search, which samples the first few tokens before defaulting to beam search, yields more stable behavior than stochastic decoding and better diversity than beam search.
Degeneration The model outputs conversation-ending phrases (e.g., "I have to go", "It was nice talking") after 7 turns, hurting user experience. Manual examination of the training data revealed that this is due to the short length of the crowdworker-collected training conversations. We alleviated this with a manual blacklist of conversation-ending phrases and by forcing NEURAL CHAT (Section 3.1) to hand off to another RG prior to the 7-turn limit.
Changing the topic Our model tends to generate one or more follow-up questions in every utterance, often unrelated to the current topic, creating a disorienting topic change (e.g. "What's your favorite color?"). To address this, NEURAL CHAT limits the number of questions in each neural chat subconversation to one, with subsequent questions truncated. During conditional generation, questions are deleted altogether.
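A minimal sketch of this question-limiting step; the regex-based sentence splitter and the mode names are illustrative assumptions.

```python
import re

def limit_questions(text, mode="one"):
    """Keep at most one question ("one": truncate at the first question,
    inclusive) or strip questions entirely ("none", as in conditional
    generation)."""
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    kept = []
    for s in sentences:
        if s.endswith("?"):
            if mode == "one":
                kept.append(s)
                break          # truncate everything after the first question
            continue           # mode == "none": delete questions entirely
        kept.append(s)
    return " ".join(kept)
```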
Latency Generating responses can take up to 2s in the worst case, averaging just under 1s. To mitigate this, we pre-fetch responses from DistillBlender at the start of every turn, instead of waiting for the NEURAL CHAT RG to request a neural response. If the responding RG returns a response without utilizing the DistillBlender output during the turn, we ignore the output and return early. Although this results in unused pings to the neural model, we find the trade-off worthwhile.
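The pre-fetch pattern can be sketched with a thread pool; the sleep duration, function names, and scripted fallback are toy stand-ins for the real model call and pipeline.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_neural_generate(history):
    time.sleep(0.2)  # stands in for ~1s of model latency
    return "neural reply to: " + history[-1]

POOL = ThreadPoolExecutor(max_workers=2)

def handle_turn(history, rg_needs_neural):
    # Kick off the slow neural request at the start of the turn so it
    # overlaps with the rest of the pipeline.
    future = POOL.submit(slow_neural_generate, history)
    rule_based = "scripted reply"        # rest of the pipeline runs meanwhile
    if rg_needs_neural:
        return future.result()           # pre-fetched result is ready or nearly so
    return rule_based                    # unused ping: neural result is discarded
```

When the responding RG never consumes the neural output, the future simply completes in the background and is thrown away, trading a little wasted compute for lower tail latency.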

E Additional RG Details

E.1 Launch
At the beginning of the conversation, the bot initially selected icebreakers at random. However, we eventually found that certain icebreakers tended to fare better than others. Specifically, conversations starting with food-related icebreakers (e.g. "Do you have any recommendations for what I should cook at home?") had an average rating of 3.49 over a sample of 1405 conversations, compared to an average rating of 3.43 for non-food-related icebreakers (e.g. "What did you do over the weekend?") over a sample of 1418 conversations.

Bot: Have you been doing anything productive?
User: Well, I've been trying to be productive but I keep on breaking things.
Bot: Breaking things can be very frustrating. I hope you are able to get it all sorted out.
Bot: That's no good, it's not productive to be stuck on something and not get to something else.

E.2 News
The NEWS RG (Section 4.1) curates global news from The Washington Post and The Guardian. Article titles, topic categories, body texts, dates, and content URLs are stored in a constantly updating ElasticSearch index. When a topic or entity available in our index appears in conversation, the NEWS RG brings up related stories from our database. In addition, NEWS also initiates conversations about currently trending news topics by scraping trending topics from Google Trends.
Behavior To produce a prompt usable in conversation, we rephrase the headline into conversational form using GPT-3 davinci-instruct-beta. If the user expresses interest in continuing the conversation, we provide a conversational summary generated by Pegasus-Multinews (Zhang et al., 2019; Fabbri et al., 2019). Summaries are decoded using 8 beams and a maximum of 50 tokens for conversationality, and are pre-generated for efficiency; if the neural module fails, we instead use an extractive summary (Mihalcea and Tarau, 2004).
Follow-up If the user continues to be engaged, we prompt for questions or comments. If a comment is detected, a neural response is generated using a set of hand-written prefixes; if a question is detected (C.3), it is answered via the QA annotator (C.4). We then conversationally paraphrase the answer using a GPT-2-medium model (Radford et al., 2019) fine-tuned on Topical Chat (Gopalakrishnan et al., 2019) to produce a more human-like response. The model takes the truncated conversational history as the input history, together with a merged representation of the answer and its span as the factual content, and outputs a conversational-sounding paraphrase of the answer. Finally, we rank the generated paraphrases using Fused-PCMI (Paranjape and Manning, 2021).

E.3 Wiki
To support our goal of high-coverage world knowledge (Section 1), the Wiki RG uses Wikipedia articles as grounding to discuss any entity that interests the user and that is not handled by any other RG. Our goal is to allow the user to conversationally discover interesting information about the entity.

E.3.1 Data
We use the Wikipedia dump from May 20th, 2020 23 , processed using MWParserFromHell 24 and Spark. 25 We store our data in a large ElasticSearch index.

E.3.2 Behavior
The Wiki RG facilitates a discussion about an entity based on how it came up in conversation (see Fig. 5). If the user initiates a discussion about an entity, the RG encourages the user to share their own knowledge and experience about the entity. Otherwise, if the entity came up only in passing or as a response to a bot prompt (e.g. "What's a country you would like to visit?"), then the RG responds with an 'infilled' remark (discussed below) or an interesting fact (i.e. 'TILs' scraped from the /r/todayilearned subreddit) about the entity. These conversation starters serve to draw the user into a more conversational dialogue about the entity before proceeding to a more content-rich discussion of it.
Discussing the entity in depth. If the user responds positively to our initial discussion of the entity, we begin a "Discuss in depth" conversation loop (see Fig. 6). Our bot provides a summary of some section of the entity's Wikipedia article and handles the user's sentiments, opinions, and questions appropriately before checking if the user would like to continue with the discussion. If the user responds affirmatively, we suggest another section for discussion, otherwise we exit the RG. This setup ensures that the user is not overly fatigued by the amount of information generated in these section summaries, while allowing interested users to discuss engrossing topics in great depth.
A short example Wiki interaction is shown in Turns 6 through 10 of Table 1.

E.3.3 Template-Based Infilling
To provide the user with rich, coherent conversation for a wide class of entities, we developed a novel method-infilling-which generates interesting remarks from handwritten templates based on relevant context. For example, given the actor Keanu Reeves as the current entity, the template I love how [[actor]] acted in [[film]], especially their <mask> might be infilled as follows: I love how [Keanu Reeves] acted in [The Matrix], especially their ability to freeze time. By defining a diverse set of templates for each entity category, we are able to provide expressive yet controllable conversation on many different types of entities. In effect, this acts as a more flexible version of standard slot-filling methods that does not require a structured knowledge base.
Infilling has the following steps: • A set of templates and appropriate contexts is retrieved. Given some entity, we select a set of handwritten templates based on its Wikidata category (e.g. actor, musical instrument).
For each template, we retrieve an appropriate short context from Wikipedia (approximately 3 sentences) using the mean-pooled GloVe-based method of Arora et al. (2016).
• Given each (context, template) pair, an infiller model fills in the blanks. This is parameterized by a BART-base model trained on a dataset of ∼4286 examples, mostly generated using GPT-3 (Brown et al., 2020) and augmented with hand-written examples.
• The infills are reranked by an aggregate of their DialogRPT score (Gao et al., 2020) and their likelihood as measured by a GPT-2-medium model fine-tuned on Empathetic Dialogues.
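The steps above can be sketched as follows, with the template registry, the infiller, and the reranker all replaced by illustrative stubs (the real system uses a BART-base infiller and DialogRPT-based reranking, neither of which is reproduced here):

```python
import re

# Illustrative templates keyed by Wikidata category; not the real template set.
TEMPLATES = {
    "actor": ["I love how [[actor]] acted in [[film]], especially their <mask>."],
    "musical instrument": ["I've always wanted to learn the [[instrument]", ],
}

def fill_slots(template, slots):
    # Replace [[slot]] placeholders with known values; <mask> is left
    # for the infiller to complete.
    return re.sub(r"\[\[(\w+)\]\]", lambda m: slots[m.group(1)], template)

def infill(text, context):
    # Stub: a real system conditions a seq2seq model on (context, text)
    # to fill the <mask> span.
    return text.replace("<mask>", "versatility")

def rerank(candidates):
    # Stub: a real system aggregates DialogRPT and LM likelihood scores.
    return max(candidates, key=len)

def generate_remark(category, slots, context):
    candidates = [infill(fill_slots(t, slots), context)
                  for t in TEMPLATES[category]]
    return rerank(candidates)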

E.3.4 TIL's: Conversational Paraphrasing
We use this RG as a testbed for our conversational paraphrasing system. The system takes as input the truncated conversational history, and some knowledge context (either a TIL about the current entity, or an excerpt of the Wikipedia article, selected based on TF-IDF similarity to the user's response to an open-ended question). It outputs a conversational-sounding paraphrase of the knowledge context. The model was trained by finetuning a GPT-2-medium language model (Radford et al., 2019) on a processed and filtered version of the TopicalChat dataset (Gopalakrishnan et al., 2019). The paraphrases are generated using top-p decoding with p = 0.75 and temperature τ = 0.9, and we pick the one which has the highest unigram overlap with the knowledge context.

E.4 Opinion
Exchanging opinions is a core part of social chitchat. To form a stronger sense of personality, and to seem more relatable, it is important that our bot can also express its opinions. The Opinion RG's goal is to listen to users' opinions on certain topics, and reciprocate with its 'own' opinions (sourced from Twitter) on those topics.

E.4.2 Behavior
Currently, the Opinion RG activates when the user mentions one of the whitelisted entities (e.g. Table 1, Turn 8). We ask whether the user likes the entity and classify their response using the CoreNLP sentiment classifier (Section C.1). We then either agree or disagree with the user. If we disagree, we either ask the user for their reason for their opinion, or supply a reason why we disagree, and ask what they think of our reason. Ultimately, we want the user to have a positive experience with our bot, so regardless of whether we disagree or agree with the user, we will ask the user their opinion on a related entity, and always agree with the user about the new entity. The conversation may end earlier, as we detect on each turn whether the user is still interested via their utterance length. If the utterance contains less than 4 words, and it does not contain any of the 'agreement' words (such as 'same', 'me too', etc.) we will hand off the conversation to another RG. Even when the RG is not active, it keeps track of whether the user has already expressed an opinion on an entity, by applying a regex similar to that applied to the tweets.

E.4.3 Agreement Policies
Disagreement is an unavoidable part of humanhuman conversations, and we hypothesize that occasional disagreement is necessary in order for our bot to have a convincing and individual personality.
To test this, we implemented three policies: • (i) ALWAYS_AGREE -we always agree with the user's sentiment on the entity; • (ii) LISTEN_FIRST_DISAGREE -first we ask the user's reason for liking/disliking the entity, then we offer our reason for disagreeing with their sentiment; and  • (iii) CONVINCED_AGREE -we initially disagree with the user's sentiment on the entity, but after the user gives their reason for liking/disliking the entity, we switch our sentiment to match the user's (i.e. we are convinced by the user).
To evaluate the policies, we ask the user Would you like to continue sharing opinions? and interpret the desire to continue is an indication of a successful policy. Table 5 shows that users prefer ALWAYS_AGREE and LISTEN_FIRST_DISAGREE over CONVINCED_AGREE, and all policies have high continuation rates, suggesting that disagreement can be a positive and stimulating part of a conversation, but that the manner and delivery of the disagreement is an important factor.

E.5 Food
The Food RG also focuses on scripted responses to discuss foods and give suggestions. It is often activated at the beginning of the conversation when Neural Chat RG prompts a user for what they have eaten today. The Food RG then goes through a sequence where it asks the user about their favorite variant of that food (e.g. favorite pizza topping), mentions the bot's favorite variant, and possibly provides a fun fact about the food. The Food RG is backed by food data scraped from Wikipedia structured in such a way that subclasses and variants of food are linked to each other. It also uses templated responses with neural infilling to generate descriptions of foods or comments on what the user likes, allowing for variation and flexibility for more interesting responses.

E.6 Movies
The Movies RG is designed to deliver a highquality scripted conversation about a movie the user specifies, using information drawn from the Alexa Knowledge Graph. 27 Currently, the RG is activated when the user asks to talk about movies, mentions a movie keyword (such as movies or film) or talks about any movie-related entity (e.g. Saving Private Ryan, Meryl Streep, the Coen brothers, etc.). Once activated, the RG typically asks the user to name a movie, asks the user's opinion on it, gives a fun fact about the movie, asks the user their opinion on an actor in the movie, then asks the user if they've seen a different movie featuring that actor (See Turns 4-7 in Table 1). The RG uses treelets (Section 2) to organize the dialogue graph, hand-written templates to form the bot utterances, and a mixture of regexes and the CoreNLP sentiment classifier (Section C.1) to classify the user's responses.

E.7 Music
Similar to the Movies RG, the Music RG is designed to deliver scripted conversations about musical entities that the user specify. The RG is activated when a musician/band or a music keyword (such as music or songs) is mentioned. Once activated, the Music RG engages in a conversation specific to the type of the musical entity that was mentioned. Unlike the Movies RG, the Music RG has a randomized internal prompting system that allows the conversation to be centered around music even when a scripted conversation is exhausted for a specific entity. For example, after the Music RG goes until the end of a scripted conversation for a musician, it can ask for an internal prompt, and start a conversation about musical instruments, songs, or music in general. The randomized nature of the internal prompting system makes the conversation more flexible, and mitigates some of the weaknesses of scripted conversations mentioned in Section E.6.

E.8 Sports
The Sports RG is designed to deliver up-to-date and high-quality conversations on a sport for which the user expresses interest. Currently, we support conversations on NFL football and NBA basketball, the two most-watched sports in the US. When prompted to discuss sports, the user is asked if they are a fan of these two sports. If so, they are asked for their favorite team, but otherwise the conversation moves to a different RG. The RG supports detailed, factual conversation on the user's favorite team, as well as their favorite player on that team. The Sports RG is backed by an ESPN API scraper that pulls information on all NFL and NBA teams (their game schedule, their roster, wins/losses, game analysis, etc.) and facts about all players (their age, position, college, statistics, and expert analysis on their overall play). For example, if the user is a fan of the Denver Broncos, the RG is capable of discussing the Broncos' most recent game (who won/lost, what the score was, what player played well, etc.) and then transitions into discussing a specific Broncos player from the game that the user likes. By utilizing automatic summarization, we are able to intersperse current, specific analysis of their favorite player or team that comes directly from ESPN analysts, giving the conversation a sophisticated and natural tone.

E.9 Fallback
When all other RG's fail to produce a suitable response, we rely upon two fallback RG's that always execute. The Neural Fallback RG selects a fallback responses generated by the DistillBlender model (Section 3.1), with all questions removed. Given that the model is trained on end-to-end dialogue, we find that this is a good conversational baseline. If the Neural Fallback RG fails, we resort to the Fallback RG, which returns a pre-written generic fallback (e.g. "Sorry, I don't know how to answer that").
F High-Initiative User Handling: Example Dialog Flows