Unsupervised Enrichment of Persona-grounded Dialog with Background Stories

Humans often refer to personal narratives, life experiences, and events to make a conversation more engaging and rich. While persona-grounded dialog models are able to generate responses that follow a given persona, they often miss out on stating detailed experiences or events related to a persona, often leaving conversations shallow and dull. In this work, we equip dialog models with ‘background stories’ related to a persona by leveraging fictional narratives from existing story datasets (e.g. ROCStories). Since current dialog datasets do not contain such narratives as responses, we perform an unsupervised adaptation of a retrieved story for generating a dialog response using a gradient-based rewriting technique. Our proposed method encourages the generated response to be fluent (i.e., highly likely) with the dialog history, minimally different from the retrieved story to preserve event ordering and consistent with the original persona. We demonstrate that our method can generate responses that are more diverse, and are rated more engaging and human-like by human evaluators, compared to outputs from existing dialog models.


Introduction
Humans often rely on specific incidents and experiences while conversing in social contexts (Dunbar et al., 1997). Responses from existing chitchat dialog agents often lack such specific details. To mitigate this, some prior work has looked into assigning personas to dialog agents (Zhang et al., 2018; Majumder et al., 2020). However, persona descriptions are often shallow and limited in scope, and while they improve response specificity, the responses still lack the level of detail with which humans share experiences.
In this work, we propose methods to enrich dialog personas with relevant background events using fictional narratives from existing story datasets such as ROCStories (Mostafazadeh et al., 2016). For example, for a persona attribute 'I have two children and a dog,' we are able to identify a relevant narrative from a story corpus (Figure 1). However, such stories may not fit fluently in the dialog context. Thus, retrieved stories should be adapted to construct a response that is fluent and relevant to the context. Since existing datasets (such as PersonaChat (Zhang et al., 2018)) do not contain responses with such background stories, this adaptation has to be done in an unsupervised fashion with decoders trained to generate responses conditioned only on a dialog history and persona.
Figure 1: ... from an existing corpus. We propose a gradient-based technique which encourages the generated response to be fluent with the dialog history, minimally different from the retrieved story, and consistent with the persona. The proposed approach leads to more specific and interesting responses.
To adapt a retrieved narrative incident as a relevant background story, we use a decoding procedure which encourages the generated response to (1) be fluent with the dialog history, (2) be consistent with the original persona, and (3) be minimally different from the retrieved story. While fluency with the dialog context is encouraged directly by the likelihood under the underlying language model, the remaining two constraints are incorporated via iterative updates to the decoder output distributions at inference time. Our inference-time decoding method differs from the only other recent effort, by Su et al. (2020), which leverages non-dialog data (forum comments, book snippets) as distant labels to train dialog systems with supervision.
Our contributions can be summarized as follows:
• We propose a novel approach to enrich dialog agent personas with relevant backstories, relying only on existing story datasets.
• We propose to use an unsupervised backpropagation-based decoding procedure to adapt the relevant stories such that the resulting response is fluent with the dialog history and consistent with the dialog agent persona. Our method works with a model trained only on dialog data, i.e., without access to a story corpus at training time.
• Our experiments demonstrate that the proposed approach results in much more engaging and specific dialog outputs in a persona-grounded dialog setup. This fills a gap in existing dialog models, which often lack the capability to generate responses about specific events and experiences relevant to persona attributes.

Unsupervised Persona Enrichment with Background Stories
Given dialog history h and persona C consisting of several attributes (typically 3-5; an example is shown in Figure 1), our goal is to construct a dialog response x. Our underlying model is based on the discrete persona attribute choice model from Majumder et al. (2020). To generate a dialog utterance x, that model first samples a persona attribute c ∼ p(c|h) conditioned on the dialog history h. The response x is then generated conditioned on the dialog history and the chosen persona attribute. The underlying dialog model's decoder is initialized with a pretrained GPT-2 model and is fine-tuned on the PersonaChat dataset (Zhang et al., 2018). However, in our setup, we also have to identify relevant background stories and use them to construct fluent responses at decoding time. Therefore, we propose a different decoding procedure.
To generate a response, we first sample a persona attribute c ∼ p(c|h). Next we retrieve stories corresponding to the persona attribute c (Section 2.1). However, the underlying dialog model is trained to generate responses conditioned only on the dialog history and persona. To incorporate the retrieved story in the response, we perform gradient-based inference (Section 2.2), which assumes only a left-to-right language model trained on dialog context and responses; the story is handled at decoding time in an unsupervised fashion. We refer to the proposed method as PABST (Unsupervised PersonA enrichment with Background STories).

Retrieving Relevant Stories
For a persona attribute c, we aim to identify relevant stories from a story corpus. To this end, we rank the stories by the F1 component of BERTScore, using the persona attribute c as the query, and choose the highest-scoring story. Note that many of the stories are written in the third person. For use as background stories, we must first transform them to first person. Following prior work (Brahman and Chaturvedi, 2020), we identify the protagonist of such stories as the most frequently occurring character. Thereafter, we use co-reference resolution (Lee et al., 2017) to identify all words or phrases that refer to the protagonist. Finally, all words or phrases so identified are replaced with suitable first-person pronouns (e.g. 'his books' to 'my books').
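The ranking step can be illustrated with a minimal sketch. Note the assumptions: a bag-of-words token F1 is swapped in as a stand-in for the BERTScore F1 used in the paper, and the function names (`token_f1`, `retrieve_story`) and the toy stories are hypothetical.

```python
from collections import Counter
from typing import List


def token_f1(query: str, candidate: str) -> float:
    """Bag-of-words F1 between two strings (a stand-in for BERTScore F1)."""
    q = Counter(query.lower().split())
    c = Counter(candidate.lower().split())
    overlap = sum((q & c).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(q.values())
    return 2 * precision * recall / (precision + recall)


def retrieve_story(persona_attribute: str, stories: List[str]) -> str:
    """Return the story that scores highest against the persona attribute."""
    return max(stories, key=lambda s: token_f1(persona_attribute, s))


# Toy corpus (hypothetical, already in first person).
stories = [
    "I took my dog to the park. My two children played fetch with the dog.",
    "I always thought golf was for old people. One day my dad invited me to play golf.",
]
best = retrieve_story("I have two children and a dog", stories)
```

A lexical score like this only rewards exact word overlap; an embedding-based score such as BERTScore can also match paraphrases, which is presumably why it is preferred for short persona attributes.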

Gradient-based Inference
Our underlying dialog model is not trained to condition on a retrieved story, and so cannot be used directly to construct a desirable response from the retrieved story s. To tackle this, we consider a decoding strategy which, in addition to fluency with history h, encourages response x to follow two soft constraints: (1) be minimally different from story s, and (2) be consistent with persona c.
First, we generate an initial response based only on the dialog history. Then we perform an iterative procedure which alternates between a forward pass on the language model to encourage fluency and a backward pass which updates the response via back-propagation to respect the two soft constraints. However, x is discrete and cannot be directly updated using gradients from back-propagation. Instead, we maintain and update a soft representation $o$ of $x$, where $o_i$ corresponds to the last hidden state representation for the $i$-th token position, i.e., $p(x_i) \sim \mathrm{softmax}(W o_i / \tau)$, where $\tau$ is the temperature parameter, $W$ is the embedding matrix, and $W o_i \in \mathbb{R}^V$ ($V$ is the vocabulary size). Our approach is inspired by recent works that use gradient-based decoding for text generation with soft constraints (Dathathri et al., 2020; Qin et al., 2020). Next we describe the backward and forward passes of the iterative procedure.
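As an illustration of how a soft representation maps to a token distribution, the following sketch computes softmax(W o_i / τ) for a toy vocabulary. The matrix `W`, the hidden state `o_i`, and the temperatures are made-up values chosen only to show the effect of τ; they are not taken from the paper.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def token_distribution(W, o_i, temperature=1.0):
    """p(x_i) = softmax(W o_i / tau); W has one row per vocabulary item."""
    logits = [sum(w * h for w, h in zip(row, o_i)) for row in W]
    return softmax(logits, temperature)


# Toy setup (assumed): vocabulary size V = 3, hidden size 2.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
o_i = [2.0, 0.5]

sharp = token_distribution(W, o_i, temperature=0.5)  # low tau: peaked distribution
flat = token_distribution(W, o_i, temperature=5.0)   # high tau: flatter distribution

# Sampling a discrete token id from the soft representation.
rng = random.Random(0)
sampled = rng.choices(range(len(W)), weights=sharp, k=1)[0]
```

Because the distribution is a differentiable function of `o_i`, gradients of a loss on the distribution can be pushed back into `o_i` even though the sampled token itself is discrete.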
Backward Pass with Soft Constraints We define the following soft constraints on response x:
(1) Divergence from story: We want to encourage x to be minimally different from the story s. Following prior work (Qin et al., 2020), we compute a cross-entropy loss (denoted by cross-entr henceforth) with the story tokens $s = \{s_1, \ldots, s_T\}$ as labels and $W o_1, \ldots, W o_T$ as the logits.
(2) Consistency to persona: We want x to be consistent with persona attribute c. Consider a classifier $q_\phi(o, c)$ which predicts the probability of x (or rather the soft representation o of x) entailing c. The classifier $q_\phi(o, c)$ is a bag-of-words classification head on decoder hidden states o, fine-tuned on the Dialogue-NLI dataset (Welleck et al., 2019) to predict whether pairs of persona attributes and responses are entailed or not. The objective to maximize can be written as:
$$\mathcal{L}(c, s; o) = \lambda_c \log q_\phi(o, c) - \lambda_d \, \text{cross-entr}(s_{1:T}, W o_{1:T})$$
where $\lambda_c$ and $\lambda_d$ weight the consistency and divergence terms respectively.
Forward Pass to Encourage Fluency Next we perform a forward pass of the underlying dialog model, with the goal of regularizing the hidden states towards the unmodified language model values. On computing the forward pass at the $j$-th token, we mix the final hidden states $o^f_j$ from the forward pass with $o^b_j$ computed in the backward pass, via weighted addition, to get the resulting $o_j = \gamma \, o^f_j + (1 - \gamma) \, o^b_j$, where $\gamma$ is the mixing weight. The resulting $o_j$ is used for computing the logits at the next time step $j + 1$.
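The two passes can be made concrete with a small numeric sketch. It assumes the objective has the form λ_c log q − λ_d cross-entr and that mixing is γ·o^f + (1 − γ)·o^b; these forms, the mean reduction in the cross-entropy, and all function names are assumptions for illustration. The entailment probability is passed in as a plain number rather than computed by a classifier.

```python
import math


def log_softmax(logits):
    """Log-probabilities for one position, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def cross_entropy(story_ids, logits_per_position):
    """Mean negative log-likelihood of the story tokens under the logits."""
    nll = [-log_softmax(logits)[tok]
           for tok, logits in zip(story_ids, logits_per_position)]
    return sum(nll) / len(nll)


def objective(entail_prob, story_ids, logits_per_position,
              lambda_c=1.0, lambda_d=1.0):
    """Assumed form: lambda_c * log q - lambda_d * cross-entr (to be maximized)."""
    return (lambda_c * math.log(entail_prob)
            - lambda_d * cross_entropy(story_ids, logits_per_position))


def mix(o_forward, o_backward, gamma=0.45):
    """Weighted addition of forward-pass and backward-pass hidden states."""
    return [gamma * f + (1.0 - gamma) * b for f, b in zip(o_forward, o_backward)]


# Toy check: logits that favor the story tokens score higher than logits that do not.
story_ids = [0, 2]
close = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
far = [[0.0, 5.0, 0.0], [0.0, 5.0, 0.0]]
score_close = objective(0.9, story_ids, close)
score_far = objective(0.9, story_ids, far)
mixed = mix([1.0, 0.0], [0.0, 1.0])
```

With γ below 0.5, as in the 0.45 value reported in the appendix, the mixed state leans slightly toward the constraint-updated representation while still being pulled toward the language model's own hidden state.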
We initialize the output response by performing greedy decoding from the underlying dialog model, conditioned on the dialog history and persona attribute. Then we iteratively update o by alternating backward and forward passes. We sample the final response x ∼ softmax(W o/τ). In practice, we found that 5 iterations are sufficient to generate good-quality outputs.
Table 1: ... D-1/2 is the % of distinct uni- and bi-grams. ENTR is the geometric mean of n-gram entropy. Grad. Inf. is the unsupervised gradient-based decoding as opposed to Nucleus sampling (Holtzman et al., 2020).
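The alternation of backward and forward passes over a fixed number of iterations can be sketched on a toy problem. This is a heavily simplified illustration, not the paper's implementation: it treats the logits themselves as the soft representation (W is the identity), replaces the language model's forward pass with a fixed table `lm_logits`, uses only the divergence term (the persona-consistency term is omitted), and uses an assumed step size of 1.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def backward_pass(o, story_ids, lambda_d=1.0, step=1.0):
    """One gradient-ascent step on -lambda_d * cross-entropy w.r.t. the logits."""
    updated = []
    for logits, tok in zip(o, story_ids):
        p = softmax(logits)
        # Gradient of the negative cross-entropy w.r.t. logit k is onehot_k - p_k.
        updated.append([z + step * lambda_d * ((1.0 if k == tok else 0.0) - p[k])
                        for k, z in enumerate(logits)])
    return updated


def forward_pass(o_backward, lm_logits, gamma=0.45):
    """Mix the (fixed, toy) LM states with the backward-pass states."""
    return [[gamma * f + (1.0 - gamma) * b for f, b in zip(fp, bp)]
            for fp, bp in zip(lm_logits, o_backward)]


def decode(lm_logits, story_ids, iterations=5):
    o = [list(row) for row in lm_logits]  # initial response comes from the LM
    for _ in range(iterations):
        o_b = backward_pass(o, story_ids)
        o = forward_pass(o_b, lm_logits)
    return o


lm_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]  # toy LM prefers tokens 0 and 1
story_ids = [0, 2]                               # story token at each position
o = decode(lm_logits, story_ids)
story_prob_before = softmax(lm_logits[1])[2]
story_prob_after = softmax(o[1])[2]
```

In this toy run the probability of the story token at the second position rises above the language model's original value, while the mixing step keeps the representation from drifting all the way to a verbatim copy of the story.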

Experiments
We evaluate methods in terms of their capability to generate diverse, fluent and engaging responses. Hyperparameters are noted in Appendix §A.
Datasets We experiment with the PersonaChat dialog dataset (Zhang et al., 2018) consisting of 131,438 utterances for training, 15,602 for validation, and 15,024 for testing. For stories, we use the training split of the ROCStories dataset (Mostafazadeh et al., 2016), that consists of 78,529 stories, each typically of 4 to 5 sentences.
Baselines We consider two broad groups of models as baselines: (1) Without access to story corpus: We use finetuned GPT2 (TRANSFERO) on PersonaChat, and the discrete persona attribute choice model (DISCCHOICE) from Majumder et al. (2020). We also consider a version of DISCCHOICE which enriches personas with inferences from a commonsense knowledge base (CS-KB).
(2) Baselines using story corpus: To allow DISCCHOICE models to generate story-like responses, we adapt an alternative training regime (PSEUDO) from Su et al. (2020), where we randomly replace some of the target dialog responses with retrieved stories, treating them as pseudo labels. Finally, we also consider a MULTITASK training setup from Su et al. (2020), wherein the decoder is trained on PersonaChat as well as with a language modeling objective on ROCStories. We additionally consider a RETRIEVAL baseline that uses the retrieved story verbatim as the dialog response.
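The PSEUDO data construction can be sketched as follows, using the 30% replacement rate reported in the appendix. The function name, the tuple layout of an example, and the `retrieve` callback are hypothetical; the actual retrieval and training code are not shown in the paper text.

```python
import random


def make_pseudo_labels(examples, retrieve, replace_rate=0.3, seed=0):
    """Replace a random subset of target responses with retrieved stories.

    examples: list of (history, persona_attribute, response) tuples (assumed layout).
    retrieve: callable mapping a persona attribute to a story string (assumed).
    """
    rng = random.Random(seed)
    n_replace = round(replace_rate * len(examples))
    chosen = set(rng.sample(range(len(examples)), n_replace))
    pseudo = []
    for i, (history, persona, response) in enumerate(examples):
        target = retrieve(persona) if i in chosen else response
        pseudo.append((history, persona, target))
    return pseudo


# Toy usage with a dummy retriever.
examples = [(f"h{i}", f"p{i}", f"r{i}") for i in range(10)]
pseudo = make_pseudo_labels(examples, retrieve=lambda p: "STORY:" + p)
n_replaced = sum(t.startswith("STORY:") for _, _, t in pseudo)
```

Sampling indices without replacement fixes the number of substituted targets exactly, rather than replacing each target independently with probability 0.3.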

Automatic Evaluation
We hypothesize that the proposed approach of leveraging external non-dialog data can increase the diversity of the generated responses. Following prior work (Li et al., 2016), we report the percentage of distinct uni-grams and bi-grams (D-1 and D-2 respectively). Note that these values do not capture the actual frequency distribution of different word types. Therefore, we also report the geometric mean of the entropy values of the empirical frequency distributions of word n-grams (n ∈ {1, 2, 3}) (Jhamtani et al., 2018), denoted by ENTR.
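The diversity metrics described above can be computed with a short sketch. Whitespace tokenization, the natural-log base, and pooling n-grams over the whole set of responses are assumptions here; the paper does not specify these details in this section.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(responses, n):
    """Percentage of n-grams (pooled over all responses) that are distinct types."""
    pooled = [g for r in responses for g in ngrams(r.split(), n)]
    return 100.0 * len(set(pooled)) / len(pooled) if pooled else 0.0


def ngram_entropy(responses, n):
    """Entropy of the empirical n-gram frequency distribution (natural log)."""
    counts = Counter(g for r in responses for g in ngrams(r.split(), n))
    total = sum(counts.values())
    return sum((c / total) * math.log(total / c) for c in counts.values())


def entr(responses, orders=(1, 2, 3)):
    """Geometric mean of the n-gram entropies for the given orders."""
    entropies = [ngram_entropy(responses, n) for n in orders]
    return math.prod(entropies) ** (1.0 / len(entropies))


dull = ["i like golf", "i like golf"]
diverse = ["i like golf", "my dad invited me"]
d1_dull = distinct_n(dull, 1)
d1_diverse = distinct_n(diverse, 1)
```

The repeated responses in `dull` share a single trigram, so their trigram entropy is zero and the geometric mean collapses to zero, which illustrates how ENTR penalizes repetitive outputs that D-1 and D-2 alone may not fully expose.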
We observe that methods that use story data show much higher diversity than methods that do not (Table 1). Among methods using story data, gradient-based decoding (PABST) performs better than DISCCHOICE trained with PSEUDO or MULTITASK. Note that using RETRIEVAL outputs as-is leads to even more diverse outputs than PABST. However, these outputs are much less sensible in the dialog context, as shown in human evaluations.

Human Evaluation
Since we do not have ground-truth story-like responses in the dialog dataset, we perform human evaluation with 150 test examples to investigate whether PABST generates responses that are (1) sensible with the dialog history and (2) engaging. We hired two Anglophone (Lifetime HIT acceptance % > 85) annotators for every test sample. The order of the systems presented in the interface is randomized.
A snapshot of the human evaluation interface is provided in Appendix §C. All differences in values from human evaluations are significant with p < 0.05 from bootstrap tests on 1000 subsets of size 50. Cohen's Kappa (Cohen, 1960) values measuring inter-annotator agreement for sensibility and engagement were 0.79 and 0.82 respectively.
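For reference, Cohen's Kappa for two annotators can be computed as in the following sketch. The label values ("win"/"loss") and the function name are illustrative assumptions; the paper does not describe its annotation encoding.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotated independently with their own label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["win", "win", "loss", "loss"]
annotator_2 = ["win", "loss", "win", "loss"]
kappa_same = cohens_kappa(annotator_1, annotator_1)
kappa_chance = cohens_kappa(annotator_1, annotator_2)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the reported values of 0.79 and 0.82 sit toward the high end of that range.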
From the results (shown in Table 2), we note that in comparison to responses from baselines, responses from PABST are more engaging and more sensible with respect to the dialog history. We further make the following observations. Firstly, using the gradient-based decoding approach with retrieved stories (PABST) works significantly better than using distant supervision with story data (PSEUDO and MULTITASK). Secondly, background stories provide sufficient detail for an engaging conversation compared to the DISCCHOICE variant that expands persona attributes using commonsense knowledge (Majumder et al., 2020). Finally, we also observe that PABST performs worse when we do not use the consistency constraint (w/o DNLI).
Choice of $\lambda_d$ We also experiment with different values of the weight for the divergence term ($\lambda_d$) in $\mathcal{L}$: High ($\lambda_d = 5$), Moderate ($\lambda_d = 1$), and Low ($\lambda_d = 0.05$). We consider 100 samples for this experiment. A high $\lambda_d$ pushes responses toward strictly copying the story. We find that PABST (moderate $\lambda_d$) wins 81.2% and 69.1% of cases against PABST (high $\lambda_d$) on the 'sensible' and 'engaging' response criteria respectively. Similarly, PABST (moderate $\lambda_d$) wins 93.2% and 84.7% of cases against PABST (low $\lambda_d$) in terms of sensibility and engagement respectively.
Table 3 shows responses generated by PABST and the baselines. We observe that PABST is able to follow the retrieved story (same as the output from RETRIEVAL) while modifying the response to be conversation-like and sensible with the dialog history. Responses from other baselines remain verbose or incoherent. Mirroring the human evaluation, we observe that choosing a higher $\lambda_d$ makes the model almost repeat the retrieved story, whereas a lower value smooths the output to make it more sensible with the ongoing dialog.

Related Work
A desired impact of the proposed approach is an increase in the diversity of the generated responses. To tackle the issue of diversity in dialog model outputs, prior work has focused on decoding strategies such as diversity-promoting sampling (Holtzman et al., 2020); training strategies such as discouraging undesirable responses via unlikelihood training; model changes such as using stochastic variables (Serban et al., 2017); and using external data such as forum data (Su et al., 2020) or external knowledge bases (Majumder et al., 2020). In contrast to these, our proposed method generates responses with background stories using a gradient-based decoding approach.
One of the steps in our proposed approach is to retrieve relevant stories from an external corpus. Prior work has explored using retrieval of similar dialog instances as an initial step in improving response diversity and other human-like desiderata in dialog. Distant supervision by using retrieved text snippets as pseudo responses has been explored in prior work (Su et al., 2020). We use an external data source to improve dialog responses, a theme shared with some efforts in other tasks such as machine translation (Khandelwal et al.). The use of narrative text in dialog has been explored in prior work, mostly as a 'script' or template for conversation (Zhu et al., 2020).
We adapted a BERT-based retrieval method to retrieve a relevant story given the dialog context, and we use the retrieved story in the decoding phase.
Gradient-based decoding for text generation with soft constraints has been explored in prior work (Dathathri et al., 2020; Qin et al., 2020). Other work has focused on generating responses that are consistent with a given persona. In contrast, we use gradient-based decoding to generate a dialog response while honoring constraints such as consistency with the persona and similarity to the retrieved story.

Conclusion
We propose a method to enrich persona-grounded dialog with background stories at inference time, using only an existing corpus of non-conversational narratives, opening up new ways to generate enriched and engaging responses. One limitation of PABST is the assumption that a background story is needed at every turn. As future work, we can include a decision step to determine whether to incorporate a background story, given the dialog history. We can further explore ways to use retrieved stories over multiple turns instead of a single turn.

Impact Statement
In this work, we discuss ways to make a dialog system generate more engaging responses. Since we use a finetuned version of a pretrained generative model, we inherit the general risk of generating biased or toxic language, which should be carefully filtered. Furthermore, the generations may incorporate biases that are already present in the dialog dataset and story dataset due to crowd-sourced data collection. Hence, we advise any developer who wishes to use a different story dataset for the background stories to be aware of the biases present in that dataset. Finally, we also note that experiments in this paper are limited to the English language.

A Implementation Details
We obtain the PersonaChat dataset from the ParlAI repository. The ROCStories dataset is obtained from the repository of the original release. We adapted code from the original PPLM (Dathathri et al., 2020) repository and modified it for our own objective function.
Network architecture For the generator network, we use GPT2 (Transformer with 12 layers, hidden size 768, and 12 heads; gpt2-small), following the state-of-the-art model (Wolf et al., 2019) from the Conv-AI2 competition. The decoder has a total of 124 million parameters. We used the pre-trained decoder model obtained from Majumder et al. (2020).
Hyperparameters PABST does not require any training since we perform gradient-based decoding at inference time. For our best method, in the objective function $\mathcal{L}$, we set both $\lambda_d$ and $\lambda_c$ to 1. We set the generation length to 100 to encourage longer generations. We train the consistency classifier using code from the PPLM repository. The weight $\gamma$ for mixing forward and backward passes was set to 0.45. For PSEUDO, we substitute a random 30% of the original target responses with retrieved stories.
Figure 2 shows a sample question for the pairwise comparison between responses generated by PABST and a baseline for sensibility and engagingness. A similar setup was used to measure performance between PABST variants with different $\lambda_d$ values (0.5, 1, 5).

C Qualitative Examples
RETRIEVAL: I always thought golf was for old people. One day, my dad invited me to play golf with him. I reluctantly accepted his invitation. On the golf range, I saw lots of people of my age. I realized golf was a sport for everyone to enjoy regardless of age.
PSEUDO: Last week, I went to play golf. My dad invited me. I was reluctant.
MULTITASK: That's great! I play volleyball during weekend.
PABST ($\lambda_d = 5.0$): I like playing golf. I always thought it is for old people. Recently I went to a golf range. But I saw lots of people of my age. I realized golf is a sports for everyone.
PABST ($\lambda_d = 1.0$): That's great. I like playing golf. I always thought it is for old people. Recently I went to a golf range. But I saw lots of people of my age. I realized golf is a sports for everyone.