PeaCoK: Persona Commonsense Knowledge for Consistent and Engaging Narratives

Sustaining coherent and engaging narratives requires dialogue or storytelling agents to understand how the personas of speakers or listeners ground the narrative. Specifically, these agents must infer the personas of their listeners to produce statements that cater to their interests. They must also learn to maintain consistent speaker personas for themselves throughout the narrative, so that their counterparts feel involved in a realistic conversation or story. However, personas are diverse and complex: they entail large quantities of rich interconnected world knowledge that is challenging to robustly represent in general narrative systems (e.g., a singer is good at singing, and may have attended a conservatoire). In this work, we construct a new large-scale persona commonsense knowledge graph, PeaCoK, containing ∼100K human-validated persona facts. Our knowledge graph schematizes five dimensions of persona knowledge identified in previous studies of human interactive behaviours, and distils facts in this schema from both existing commonsense knowledge graphs and large-scale pretrained language models. Our analysis indicates that PeaCoK contains rich and precise world persona inferences that help downstream systems generate more consistent and engaging narratives.


Introduction
Interlocutors or storytellers in narrative scenarios often exhibit varying behaviours, which are affected by their own diverse personas, but also by the personas of the counterparts they are interacting with. For example, an adventurous architect may be interested in talking about outdoor explorations with friends who have similar hobbies, but may prefer to discuss architectural design ideas with colleagues at work. Narrative systems must know when such behaviours should be exhibited, requiring them to learn and represent the rich personas of characters based on self-introductions, biographies and other background profiles. We release our data and code to the community at https://github.com/Silin159/PeaCoK.

Figure 1: Illustration of world persona knowledge grounded on commonsense reasoning.
This goal of modeling diverse persona attributes is at the heart of research in the areas of persona-grounded dialogue (Zhang et al., 2018; Zhong et al., 2020; Xu et al., 2022), story generation (Chandu et al., 2019; Zhang et al., 2022) and narrative understanding (Brahman et al., 2021). However, the complex nature of real-world personas, which involve rich world knowledge and countless ways in which they might interact, is challenging to reliably learn purely from data. For instance, as shown in Figure 1, a singer preparing an album may have studied music at university at one point, which would allow them to share their experience with a student majoring in composition, who may study music as a daily routine.
Prior work takes first steps toward improving the persona knowledge representations available in narrative systems. Mazaré et al. (2018) extract self-comments from Reddit to expand the scale of background persona profiles that can be used in downstream narrative settings. However, their collected profiles are fragmented and ignore the interconnections between personas that govern interactions. Meanwhile, Majumder et al. (2020) use knowledge generators (Bosselut et al., 2019) to expand persona profiles with commonsense inferences, but these commonsense expansions are limited to general social commonsense (Hwang et al., 2021), and do not form a systematic persona-centric knowledge frame. Consequently, the lack of a world-level persona commonsense knowledge resource hinders progress in learning the systematic persona representations necessary to sustain consistent and engaging narratives.
In this work, we propose a Persona-grounded Commonsense Knowledge graph (KG), PEACOK, which represents world-level persona knowledge at scale. Building on the persona concept initially proposed in human-computer interaction (HCI; Cooper, 1999; Mulder and Yaar, 2006; Cooper et al., 2007) and on the behaviour analysis literature for human leisure conversations (Dunbar et al., 1997), we define a persona frame that formalizes five common aspects of persona knowledge: characteristics, routines and habits, goals and plans, experiences, and relationships. Using this knowledge frame, we construct a large-scale graph of persona commonsense knowledge by extracting and generating persona knowledge from both existing hand-crafted commonsense KGs and large-scale pretrained language models (LMs). We validate the knowledge graph via a joint human-AI majority voting scheme that integrates large pretrained LMs into the crowdsourcing loop and efficiently mediates disagreements between human annotators.
Our resulting KG, PEACOK, contains ∼100K high-quality commonsense inferences (i.e., facts) about personas, whose connectivity in the KG reveals countless opportunities to discover common ground between personas. A neural extrapolation from the KG (Hwang et al., 2021) also shows that PEACOK's annotated personas enable the development of effective persona inference generators. Finally, the extended knowledge provided by PEACOK enables a downstream persona-grounded dialogue system to generate more consistent and engaging responses in conversations, particularly when more interconnections between the interlocutor personas are found in PEACOK.

Related Work
Commonsense Knowledge Graphs Commonsense KGs such as ConceptNet (Liu and Singh, 2004; Speer et al., 2017), ATOMIC (Sap et al., 2019a), ANION (Jiang et al., 2021) and ATOMIC 2020 (Hwang et al., 2021) are widely used in NLP applications that involve integrating implicit world knowledge, e.g., question answering (Talmor et al., 2019; Sap et al., 2019b; Chang et al., 2020; Shwartz et al., 2020) and text generation (Lin et al., 2020). However, despite the importance of persona knowledge in modeling human behavior, a crucial component of building reliable narrative systems (Zhang et al., 2018; Chandu et al., 2019), no commonsense KG explicitly focuses on representing human persona knowledge. We present PEACOK to open the field of developing commonsense knowledge graphs around personas.

Persona-Grounded Narratives Integrating personas to improve the consistency and engagement of narratives is an important goal in dialogue (Song et al., 2020; Liu et al., 2020) and storytelling (Chandu et al., 2019; Zhang et al., 2022) systems. One representative work that greatly contributed to the development of faithful persona emulation, PERSONA-CHAT (Zhang et al., 2018), constructs a crowdsourced dialogue dataset by asking participants to perform conversations based on their assigned persona profiles, i.e., 5 statements of self-introduction. More recent work improves persona modeling in narrative systems by generating persona profiles from online resources (Mazaré et al., 2018), training persona detectors (Gu et al., 2021) and predictors (Zhou et al., 2021), and distilling persona knowledge from commonsense inference engines (Majumder et al., 2020). However, while these works align characters in narratives with persona profiles, they only implicitly model the areas of interaction between personas. In contrast, PEACOK explicitly represents interconnections between persona profiles, enabling persona interaction modeling in narrative systems.

Personas in Human Interaction The concept of personas was originally defined as the modeling of users based on their attributes, goals and historical actions, used for identifying system development requirements in HCI design (Cooper, 1999; Mulder and Yaar, 2006; Randolph, 2004; Cooper et al., 2007). Behavioral studies of human leisure conversations (Dunbar et al., 1997) also found similar persona attributes driving conversational topics, including personal relationships, experiences, future activities and hobbies. Based on the above studies, our work distills a systematic persona knowledge frame as a schema for our KG, PEACOK.

PEACOK Knowledge Frame
To construct a systematic representation of persona knowledge, we distill five common aspects of personas from classical persona definitions in the literature on human-computer interaction (Cooper, 1999; Mulder and Yaar, 2006; Cooper et al., 2007) and human conversational behaviour (HCB; Dunbar et al., 1997). We use these dimensions to form persona frames consisting of a persona, five relations, and multiple attributes assigned to the persona through those relations. We describe the five relations below:

Characteristics describe an intrinsic trait, e.g., a quality or a mental state, that the persona likely exhibits. For example, as shown in Figure 1, good at singing describes a talent of a singer, which is one of the singer's characteristics.
Routines or Habits describe an extrinsic behaviour that the persona does on a regular basis, e.g., a singer may regularly write songs.
Goals or Plans describe an extrinsic action or outcome that the persona wants to accomplish or do in the future, e.g., a singer may aim to win a Grammy award some day.
Experiences describe extrinsic events or activities that the persona did in the past. For instance, a singer may have studied music at college.
Relationships encode likely interactions of the persona with other people or social groups. Note that this relation can overlap with the other relations in PEACOK. For example, a singer may want to have more fans, which connotes a relationship between singer and fans, but also a future goal or plan of the singer.

PEACOK Construction
We use our persona frames to construct a knowledge graph of persona commonsense, where personas are treated as head entities in the graph, frame relations constitute the edge types, and attributes are tails in a (head, relation, tail) structure. We then devise a three-step procedure to construct the frames that make up PEACOK, as shown in Figure 2. First, we search existing commonsense KGs to select entities that can serve as head personas. Then we query these KGs and prompt pretrained LMs to collect tail attributes that are potentially associated with the personas via the five relations defined in Sec. 3. Finally, we use crowdsourcing with large LMs in the loop to classify whether these persona inferences are valid.
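To make the frame concrete, here is a minimal Python sketch of the (head, relation, tail) structure described above; the class and field names are ours for illustration and do not reflect PEACOK's released data format.

```python
from dataclasses import dataclass

# Relation names follow the five-dimension frame from Sec. 3.
RELATIONS = {
    "characteristic",  # intrinsic traits, e.g., "good at singing"
    "routine_habit",   # regular extrinsic behaviours
    "goal_plan",       # desired future actions or outcomes
    "experience",      # past events or activities
    "relationship",    # interactions with people or social groups
}

@dataclass(frozen=True)
class PersonaFact:
    head: str      # persona statement, e.g., "I am a singer"
    relation: str  # one of RELATIONS
    tail: str      # attribute, e.g., "studied music at college"

    def __post_init__(self):
        assert self.relation in RELATIONS, f"unknown relation: {self.relation}"

fact = PersonaFact("I am a singer", "experience", "studied music at college")
```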

Persona Selection
We select entities that can represent head personas using ATOMIC 2020 (Hwang et al., 2021), a commonsense KG covering knowledge about physical objects, daily events, and social interactions. We assume that entities related to personas should be about human beings, rather than other animals or non-living objects. Therefore, we first over-sample living entities from ATOMIC 2020 which have animated behaviours, by extracting head entities that possess the CapableOf relation (i.e., are capable of doing something), e.g., an actor who is capable of performing, as shown in Figure 2. Then we filter out non-human beings from our extracted living entities, by removing entities that appear in the Animal Appendix of Wiktionary. We also manually filter out other inappropriate entities which are too generic (e.g., man) or unrealistic (e.g., devil).
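This selection step amounts to a simple filter over KG triples. Below is a minimal sketch, assuming the KG is available as (head, relation, tail) string triples and the Wiktionary Animal Appendix as a set of lowercase terms; the function name and inputs are hypothetical.

```python
def select_head_personas(atomic_triples, animal_terms, blocklist):
    """Sketch of persona selection: over-sample living entities via
    CapableOf, then drop non-human and manually flagged heads.

    atomic_triples: iterable of (head, relation, tail) strings
    animal_terms:   lowercase terms from the Wiktionary Animal Appendix
    blocklist:      generic/unrealistic heads, e.g. {"man", "devil"}
    """
    # Step 1: heads with animated behaviours (possess CapableOf).
    living = {h for h, r, _ in atomic_triples if r == "CapableOf"}
    # Step 2: remove animals and blocklisted heads.
    return {
        h for h in living
        if h.lower() not in animal_terms and h.lower() not in blocklist
    }

personas = select_head_personas(
    [("actor", "CapableOf", "perform"), ("cat", "CapableOf", "meow")],
    animal_terms={"cat"}, blocklist={"man", "devil"},
)
# -> {"actor"}
```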
This initial procedure provides us with a diverse collection of initial coarse personas (e.g., actor, singer). To enlarge our persona set with fine-grained personas (e.g., actor who acts in movies vs. actor who acts in plays), we collect additional persona candidates using three types of event-based entities derived from our initial persona set: a) entities containing the initial persona in a more complex context, e.g., X becomes an actor associates with the process of becoming an actor, rather than being an actor, b) entities that can be linked to the initial persona through the ATOMIC 2020 CapableOf relation, e.g., X acts in play is linked to actor, and c) entities that are returned by Sentence-BERT retrieval (Reimers and Gurevych, 2019) for the initial persona, e.g., X becomes a movie star. For the latter two types of derived event-based entities, we prompt InstructGPT-3 (Ouyang et al., 2022) to filter out extended personas which do not entail their initial seed persona, e.g., X wants to be a lawyer is not entailed by X is a judge, as X would already be a lawyer if they were a judge. Finally, we extract 3.8K personas, which are converted to persona statements and integrated into PEACOK (details regarding head entity conversion and the prompt for InstructGPT-3 entity filtering are in Appendix A).
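For the retrieval-based expansion (type c), the following is a sketch of how Sentence-BERT can rank event-based entity candidates for a seed persona; the checkpoint name and top-k cutoff are our assumptions.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def retrieve_candidates(seed_persona, event_entities, top_k=10):
    """Return the event-based entities most similar to the seed persona."""
    seed_emb = model.encode(seed_persona, convert_to_tensor=True)
    event_embs = model.encode(event_entities, convert_to_tensor=True)
    scores = util.cos_sim(seed_emb, event_embs)[0]
    ranked = scores.argsort(descending=True)[:top_k]
    return [event_entities[int(i)] for i in ranked]

print(retrieve_candidates("actor", ["X becomes a movie star", "X bakes bread"], top_k=1))
# -> ['X becomes a movie star']
```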

Attribute Induction
We derive the attribute knowledge for our collected set of head personas using both hand-crafted KGs and large language models pretrained on natural language corpora (which contain many narratives with implied persona information).

KG-Based Approach
We first select 10 commonsense relations in the ATOMIC 2020 KG which are potentially related to persona knowledge (these relations and their descriptions are listed in Appendix A). For each persona entity selected in Sec. 4.1, we extract potential attributes by taking 1-hop inferences of the persona along one of our selected ATOMIC 2020 relations. As ATOMIC 2020 may have limited coverage of commonsense knowledge, we also use a knowledge model, COMET (Bosselut et al., 2019), pretrained on ATOMIC 2020, to generate potential attributes of each persona. We append each selected ATOMIC 2020 relation to the persona entity, and feed each persona-relation pair to COMET to generate 5 new potential attributes.
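Concretely, querying COMET for new tails reduces to conditional generation with a persona-relation prefix. A sketch with Hugging Face transformers follows; the checkpoint name and the "{head} {relation} [GEN]" input convention are assumptions borrowed from the public COMET-ATOMIC 2020 release, not necessarily the exact setup used here.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CKPT = "mismayil/comet-bart-ai2"  # assumed BART-based COMET checkpoint
tokenizer = AutoTokenizer.from_pretrained(CKPT)
comet = AutoModelForSeq2SeqLM.from_pretrained(CKPT)

def comet_attributes(persona, relation, n=5):
    """Generate n candidate tail attributes for a (persona, relation) pair."""
    inputs = tokenizer(f"{persona} {relation} [GEN]", return_tensors="pt")
    outputs = comet.generate(
        **inputs, num_beams=n, num_return_sequences=n, max_new_tokens=24
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

print(comet_attributes("singer", "CapableOf"))
```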

LM-Based Approach
To mine more persona knowledge implied in natural language corpora, we also prompt InstructGPT-3 to generate new persona attributes. Using each of the five relations defined in Sec. 3, we prompt InstructGPT-3 with our persona statements and generate 5 new attributes for each relation. For example, for the Experience relation, we instruct the model to guess distinctive activities that an individual fitting the persona might have done in the past. We adapt InstructGPT-3 using 5 manually created in-context examples for each type of relation (the prompts are shown in Appendix A).
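A sketch of such a prompt-based attribute generator is shown below, using the legacy (pre-1.0) OpenAI SDK that exposed the InstructGPT-3 completion endpoints. The prompt wording and the two in-context examples are illustrative stand-ins for the 5 manually created examples per relation, and the engine name is an assumption.

```python
import openai  # legacy openai<1.0 SDK; reads OPENAI_API_KEY from the environment

FEW_SHOT = """Guess distinctive activities that the person might have done in the past.
Persona: I am a singer. Past activities: studied music at college; performed at small venues.
Persona: I am a chef. Past activities: trained in a restaurant kitchen; earned a culinary degree.
Persona: {persona}. Past activities:"""

def experience_attributes(persona_statement, n=5):
    """Sample n candidate Experience attributes for one persona statement."""
    response = openai.Completion.create(
        model="text-davinci-002",  # an InstructGPT-3-era engine (assumed)
        prompt=FEW_SHOT.format(persona=persona_statement),
        max_tokens=48,
        temperature=0.9,
        n=n,
    )
    return [choice.text.strip() for choice in response.choices]
```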

Relation Classification
Once we have a large set of initial candidate knowledge tuples to compose our persona frames, we use crowdworkers from Amazon Mechanical Turk to verify every collected relationship consisting of a head persona, relation, and tail attribute. Because we observe that a fine-grained labeling schema helps workers better distinguish different relations and yields more precise annotations, we task workers with classifying fine-grained underlying features of the relations. For each attribute, we independently ask two workers to judge whether it describes: a) an intrinsic or extrinsic feature of the persona, b) a one-off or regular attribute of the persona, c) a past, present or future attribute of the persona, and d) an attribute of only the persona itself, or one describing the persona's relationship with others (interactivity). Finally, for each attribute in the persona frame, we ask workers whether the attribute is distinctively associated with the persona or generically associated with many potential personas (distinctiveness). As an example, in Table 1, get tips from customers is distinctively associated as a common routine of a waiter. Meanwhile, get better is a generic attribute that would not be strongly associated with runner, as many personas can have the goal of self-improvement. We follow Figure 3 to map the first three dimensions of feature labels to one of the first four relations defined in Sec. 3, which we define as the main relation label of each persona-attribute pair. The other two dimensions of feature labels, i.e., interactivity (containing the fifth relation in Sec. 3) and distinctiveness, are defined as two additional relation labels. If a worker judges that an extracted attribute is not associated with the persona at all, we instead ask the worker to label the relation as Not Persona.
Majority Voting with LM in the Loop To mediate the disagreements between two crowdworkers without introducing more human labour (i.e., a third worker), we use InstructGPT-3 and the two workers in a majority vote scheme to determine the final relation labels of the persona-attribute mappings. For each attribute collected in Sec. 4.2, we prompt InstructGPT-3 to produce additional labels for the relation of the attribute with respect to the persona. We prompt InstructGPT-3 on three labeling tasks corresponding to the three dimensions of the relation labeling schema shown in Figure 3. For the main dimension, we set the labeling classes to include the four main relation labels, plus a negative class (No Persona) indicating that the attribute is not a persona attribute or is too generic (e.g., living a happy life). We prompt InstructGPT-3 with 2 examples of each class for the main dimension (i.e., 10 manually labeled in-context examples). For the interactivity and distinctiveness dimensions, we ask InstructGPT-3 to predict a binary label for each dimension, providing it with 4 examples of each class (i.e., 8 manually labeled in-context examples per dimension). For each dimension of the relation labeling schema shown in Figure 3, we determine the final label as the majority label given by InstructGPT-3 and the two workers. We set the final label as Controversial if no unique majority label is found, e.g., when InstructGPT-3 and the two workers all give different labels. Finally, each persona-attribute pair forms a persona fact triple with its annotated relation labels in PEACOK. Table 1 shows some examples of PEACOK facts.
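The voting rule itself is simple; a minimal sketch follows, where the label strings are illustrative.

```python
from collections import Counter

def majority_label(worker_1, worker_2, lm_label):
    """Final label per dimension: majority among two workers plus
    InstructGPT-3; 'Controversial' if all three disagree."""
    votes = Counter([worker_1, worker_2, lm_label])
    label, count = votes.most_common(1)[0]
    return label if count >= 2 else "Controversial"

# The LM vote only decides the outcome when the two workers disagree:
assert majority_label("Experience", "Experience", "Goal or Plan") == "Experience"
assert majority_label("Experience", "Routine or Habit", "Experience") == "Experience"
assert majority_label("Experience", "Routine or Habit", "Goal or Plan") == "Controversial"
```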

PEACOK Analysis
Statistics of the final PEACOK relations are shown in Table 2; PEACOK contains 102,097 facts with valid persona knowledge inferences. We stratify PEACOK statistics based on the two attribute collection approaches (KG-based and LM-based) described in Sec. 4.2. We find that the KG-based distillation (which extracts information initially annotated by human workers) results in more imbalanced persona knowledge: Routine or Habit relations dominate the extracted persona relations (∼57%), and there are fewer Relationship and Distinctive facts as well. This indicates that hand-crafted social commonsense KGs contain a narrower view of real-world persona knowledge, highlighting the importance of also distilling a balanced set of persona knowledge from large pretrained LMs. However, the repurposed knowledge from the KG was initially written by humans, and contains diverse persona inferences less likely to be generated by LLMs.
Persona Interconnectivity In addition to containing diverse knowledge from multiple sources, PEACOK also contains interesting interconnections among personas, which potentially indicate engaging points of common ground for characters in narratives. For example, as shown in Figure 1, a professional singer's experience of studying music at college is also the routine of a music-major student, which presents a common topic for these two personas to discuss. Among the 40,665 distinctive attributes in PEACOK, we find that 9,242 attributes are connected to two or more personas, forming 239,812 bridges, i.e., pairs of personas connected via a shared common attribute (the number of bridges grows combinatorially with the number of personas sharing an attribute).
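Counting bridges reduces to grouping distinctive facts by their shared tail attribute: k personas sharing an attribute contribute C(k, 2) = k(k-1)/2 bridges. A minimal sketch:

```python
from collections import defaultdict
from itertools import combinations

def count_bridges(facts):
    """facts: iterable of (head_persona, tail_attribute) pairs.
    Returns the set of persona pairs connected via a shared attribute."""
    personas_per_attr = defaultdict(set)
    for head, tail in facts:
        personas_per_attr[tail].add(head)
    bridges = set()
    for personas in personas_per_attr.values():
        # k personas sharing one attribute yield C(k, 2) pairs
        bridges.update(combinations(sorted(personas), 2))
    return bridges

facts = [("singer", "studied music at college"),
         ("music-major student", "studied music at college")]
assert count_bridges(facts) == {("music-major student", "singer")}
```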

Attribute Disagreements
One of our innovations in this work is to introduce InstructGPT-3 as a third annotator to resolve disagreements between human annotators via majority voting. We analyze the disagreements between workers across the annotations in Table 3, and observe that labels from InstructGPT-3 effectively resolve many disagreements between human workers. For the main dimension labeling, ∼73% of the disagreements are resolved by adding InstructGPT-3 as a third annotator. However, ∼27% of labels remain Controversial, where both annotators and GPT-3 all disagree in different ways. These controversial labels enable further research on the ambiguities in real-world persona types and the potential stereotypes in persona judgments. In the interactivity and distinctiveness dimensions, where the labeling schema is binary, worker disagreements are fully resolved by majority voting with InstructGPT-3, though ambiguous cases may still remain.
Expert Study However, one question that naturally arises when employing majority voting with InstructGPT-3 in the loop is whether this classification scheme is reliable. To verify this, we ask expert annotators to re-annotate sampled facts in two cases: when InstructGPT-3 agrees with one of the workers but not the other, and when both workers agree with each other but not with InstructGPT-3. Experts are required to pass a qualification test by performing 20 test annotations correctly. Furthermore, in the case of disagreements (7% of cases), a third expert re-checked the annotations of the two experts and resolved the disagreement cases. Table 4 presents the accuracy and F1 of the majority voting results, compared to the re-annotations from experts as ground truth labels. We stratify the results into the same two cases: the two workers disagree with each other but InstructGPT-3 agrees with one of them, and both workers agree with each other but not with InstructGPT-3. We observe a high agreement between the experts and the majority vote, with an average accuracy and F1 of 0.874 and 0.865, respectively. These results validate majority voting with InstructGPT-3 in the loop, showing that InstructGPT-3 serves as a reliable third annotator when disagreements arise. Moreover, the integration of InstructGPT-3 in the verification loop lowers temporal and financial costs compared to introducing further human annotators.

However, we note that InstructGPT-3 is not a panacea on its own. While the model effectively resolves worker disagreements, we find that its individual predictions are only correct with ∼60% macro-F1, far from the ∼85% macro-F1 of majority voting, indicating that not all PEACOK persona relations are known by large-scale language models, and that human crowdsourcing is still necessary to ensure data quality.

To test whether PEACOK supports learning persona knowledge generators, we train a COMET-BART generator on PEACOK facts (implementation details of our neural KG analysis are in Appendix C). As baselines, we compare to a few-shot GPT-3 (Brown et al., 2020) that uses 5 randomly sampled training facts (with the same relation as the testing fact) to prompt the tail knowledge generation, and a zero-shot GPT-3.5 (text-davinci-003) model. These baselines compare PEACOK training to larger LMs that use in-context learning and instruction tuning. We conduct both automatic and human evaluations of the knowledge generators, with results shown in Tables 5 and 6. Compared to few-shot GPT-3, COMET-BART trained on PEACOK achieves overall better automatic evaluation results on various NLG metrics, despite being a much smaller model (GPT-3 and COMET-BART have 175B and 440M parameters, respectively). In the human evaluation, facts generated by COMET-BART receive a high acceptance rate from crowdworkers for plausibility, slightly beating few-shot GPT-3. We also find that the zero-shot GPT-3.5 model, although more advanced than the GPT-3 baseline, scores, on average, ∼15.3% and ∼9.3% lower than COMET-BART in terms of automatic metrics and human acceptance, respectively. All of the above results indicate that PEACOK can serve as a reliable persona knowledge base.

Table 7: Downstream dialogue response generation results on the ConvAI2 PERSONA-CHAT dataset. All results are evaluated on the development set, since the test set is not publicly available. We use the trained model provided by the P²BOT paper to reproduce the baseline results under the same environment as for developing P²BOT + PEACOK.

We link facts from PEACOK to PERSONA-CHAT dialogues, thereby extending P²BOT's persona perception and augmenting its dialogue response generation. We evaluate our models on both the original and revised interlocutor profiles provided in the ConvAI2 PERSONA-CHAT dataset, and measure the perplexity (PPL), word-level F1, and cumulative 4-gram BLEU (Papineni et al., 2002) of the generated responses against the references. We also follow ConvAI2 in measuring Hits@1, i.e., the probability that the real response is ranked highest by the model among 20 candidates. For each interlocutor, we randomly sample 5 PEACOK facts that are linked to their PERSONA-CHAT profile (due to the capacity limitation of the baseline P²BOT, we sample only a subset of the linked PEACOK facts for each interlocutor), and convert them into natural language statements to form their extended persona knowledge (fact preprocessing details are in Appendices C and D). Our augmented model is denoted as P²BOT + PEACOK. To compare PEACOK's persona-centric knowledge augmentations with general commonsense augmentations, we also evaluate another baseline model, P²BOT + ATOMIC 2020, where we follow Majumder et al. (2020) to extend interlocutor personas with 5 randomly sampled commonsense inferences from the COMET-ATOMIC 2020 model (Hwang et al., 2021).
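For concreteness, the following is a sketch of the Hits@1 computation described above; the (gold score, distractor scores) interface is our simplification of the model's candidate ranking.

```python
def hits_at_1(scores_per_example):
    """Hits@1 as used in ConvAI2: fraction of examples where the gold
    response is ranked highest among 20 candidates. Each element is
    (gold_score, list_of_19_distractor_scores); assumes a higher model
    score means a better-ranked candidate."""
    hits = sum(
        all(gold > d for d in distractors)
        for gold, distractors in scores_per_example
    )
    return hits / len(scores_per_example)

# Gold beats all 19 distractors in one of two examples -> Hits@1 = 0.5
examples = [(0.9, [0.1] * 19), (0.2, [0.3] + [0.1] * 18)]
assert hits_at_1(examples) == 0.5
```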

Results
In Table 7, we show that P²BOT + PEACOK significantly outperforms P²BOT on PPL and Hits@1, with comparable F1 and BLEU scores. Compared to P²BOT + ATOMIC 2020, P²BOT + PEACOK also demonstrates a clear improvement across all metrics, indicating the importance of augmenting narrative systems with persona-grounded commonsense knowledge.

Table 9: Pairwise comparisons of dialogue response generation between P²BOT + PEACOK and P²BOT, stratified by the number of shared PEACOK attributes between interlocutors. "#CA" denotes the number of common attributes shared by the two interlocutors' linked PEACOK knowledge. "#DR" denotes the number of dialogue responses evaluated in each stratified experiment. Ties are not shown.
Human Evaluation Automatic metrics are not fully reliable for evaluating dialogue systems (Liu et al., 2016; Novikova et al., 2017), so we also conduct human evaluations of the dialogue responses. We make pairwise comparisons between P²BOT + PEACOK and the other baseline models, based on their generated responses to 200 randomly sampled dialogue histories (100 each with original and revised PERSONA-CHAT profiles). Two expert annotators from our research group manually compare four aspects of response generation quality: fluency, whether the response is fluent and understandable; consistency, whether the response is consistent with the dialogue history; engagement, whether the response is engaging and interesting; and persona expression, whether the response demonstrates persona information related to the interlocutor's profile. To ensure the fairness and reliability of our human evaluation, similar to Sec. 5.1, we require each expert to pass a qualification test on 10 pairwise comparisons, and also include a third qualified expert to re-check the evaluation results. Neither expert annotator sees the source model from which each response is generated. The human evaluation results in Table 8 show that P²BOT + PEACOK generates more consistent and engaging dialogues than the other neural baselines, demonstrating that persona commonsense knowledge is a key contributor to conversation consistency and engagement. However, P²BOT + PEACOK still has room for improvement compared to human performance.
Perhaps most interestingly, we find that PEACOK's impact on the consistency and engagement of dialogues is most pronounced when there are interconnections between the personas of the interlocutors. We stratify the pairwise comparison between P²BOT + PEACOK and P²BOT from Table 8 based on the overlap of the two interlocutors' linked PEACOK knowledge. In Table 9, we show the results of this stratification across the cases where the interlocutors have 0, 1, or more than 1 shared attributes. Specifically, we find that the winning rates of P²BOT + PEACOK on dialogue consistency and engagement increase as the overlap of the two speakers' linked PEACOK personas becomes larger, demonstrating that more connections between interlocutors lead to more consistent and engaging conversations, and highlighting the importance of learning interconnected world persona knowledge in narratives.

Conclusion
In this work, we propose a persona commonsense knowledge graph, PEACOK, to complement the real-world picture of personas that ground consistent and engaging narratives. PEACOK consists of ∼100K commonsense facts on five dimensions of persona knowledge identified in human interactive behaviours, which are distilled from both existing KGs and pretrained LMs. Our KG analysis and downstream experiments demonstrate that PEACOK is a reliable world persona knowledge resource, with great potential to benefit the consistency and engagement of narrative systems.

Limitations
We acknowledge a few limitations in this work. First, PEACOK cannot be comprehensive. Persona knowledge is very broad, and our resource cannot cover all dimensions of personas, nor all attributes of these dimensions. We select five dimensions of personas that we found salient in the background literature on human interaction, and we distill attributes for these dimensions from ATOMIC 2020, COMET and InstructGPT-3. These resources, while rich in knowledge, only represent a subset of the possible background resources for the construction of PEACOK (among other KGs and pretrained language models). Furthermore, the primary language of these three resources is English, making PEACOK a solely English resource. Finally, in the downstream narrative experiments, the usage of our augmented persona knowledge is constrained by the capacity of the baseline model, which leaves the exploration of downstream persona knowledge augmentation at a larger scale for future work.

Ethics Statement
Our work is approved by our institution's human research ethics committee to conduct human-centric or ethics-related experiments, e.g., crowdsourcing and human evaluations. Topic-wise, our research develops a knowledge graph of commonsense knowledge about personas to augment understanding of characters and their interactions in diverse narratives. Given that some of the attributes are extracted from previous KGs or generated by LMs, we cannot guarantee our knowledge graph does not contain attribute alignments with negative connotations that could provide undesired information to a downstream system. However, we took the following steps to mitigate this effect. First, the set of personas we include in PEACOK was manually filtered to not include stereotypical and harmful roles, thereby limiting the negative associations of the personas themselves. Second, we explicitly prompted the LM to generate optimistic attributes about personas, which has been shown in prior work to reduce the toxicity of outputs (Schick et al., 2021). Finally, each attribute in PEACOK is explicitly validated by two human workers for toxicity, providing a final opportunity for workers to flag problematic content. However, we acknowledge that none of these safeguards are perfect, as language models may still produce toxic outputs and annotators may have differing opinions on what constitutes toxic content (Sap et al., 2022).

A PEACOK Construction Details
Head Persona Selection Table 10 shows our designed prompt for the InstructGPT-3 head persona filtering described in Sec. 4.1. We preprocess our extracted human and event-based entities to make them fit into the prompt. Specifically, we fill each human entity into the template "I am a(n) ___." to convert it into a natural language sentence. We also replace the general token "PersonX" in each event-based entity with the pronoun "I", and lemmatize its third-person singular verbs. To build the integral statement (the final head persona in PEACOK) that combines a human entity with each of its derived event-based entities, we instead replace the event-based entity's "PersonX" token with "who", and then append it to the converted sentence of its human entity. Note that for each human entity itself, or event-based entity that contains a human entity (i.e., the first type of derived event-based entities), we directly include its converted sentence alone as one of the head persona statements in PEACOK.
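A minimal sketch of this conversion follows; the regular expression stands in for proper lemmatization, and the exact joining of human and event entities is our guess at the template.

```python
import re

def convert_human_entity(entity):
    """'actor' -> 'I am an actor.'"""
    article = "an" if entity[0].lower() in "aeiou" else "a"
    return f"I am {article} {entity}."

def convert_event_entity(event):
    """'PersonX becomes an actor' -> 'I become an actor.'
    (drops a trailing third-person -s; a real pipeline lemmatizes properly)"""
    sentence = event.replace("PersonX", "I")
    return re.sub(r"^(I \w+?)s\b", r"\1", sentence) + "."

def integral_statement(human_entity, event):
    """Combine a human entity with a derived event-based entity:
    ('actor', 'PersonX acts in play') -> 'I am an actor, who acts in play.'"""
    clause = event.replace("PersonX", "who")
    return f"{convert_human_entity(human_entity)[:-1]}, {clause}."

assert convert_human_entity("actor") == "I am an actor."
assert convert_event_entity("PersonX becomes an actor") == "I become an actor."
```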

KG-Based Tail Attribute Collection
We use ATOMIC 2020 as the background resource for the KG-based tail attribute collection described in Sec. 4.2. This KG contains 1.33M general social commonsense inferences based on a rich variety of entities, including 0.21M inferences about physical objects, 0.20M inferences centered on daily events, and 0.92M inferences based on social interactions. Table 11 lists the 10 ATOMIC 2020 relations that we consider potentially related to persona knowledge, which we use to query tail attributes from the ATOMIC 2020 KG and COMET, based on each original entity collected in the head persona selection (Sec. 4.1).

Table 11: Selected ATOMIC 2020 relations and their descriptions.
HasProperty: the person is characterized by being/having
CapableOf: the person is capable of
Desires: the person desires
xNeed: but before, the person needs
xAttr: the person is seen as
xEffect: as a result, the person will
xReact: as a result, the person feels
xWant: as a result, the person wants
xIntent: because the person wants

Tables 12 and 13 show the prompts provided to InstructGPT-3 to generate tail attributes for each persona (Sec. 4.2), based on each converted persona statement derived from the head persona selection (Sec. 4.1). We use 2 different sets of in-context examples to prompt the InstructGPT-3 generation. Specifically, examples under the Simple Head Personas block are used for head statements converted from human entities, or from event-based entities that directly contain human entities (the first type of derived event-based entities), while examples under the Complex Head Personas block are used for event-based entities that do not contain human entities (the second and third types of derived event-based entities), where the event-based entity is combined with its source human entity to form an integral statement.

Crowdsourcing Relation Classification
We conduct a worker qualification for the persona relation classification described in Sec. 4.3. To select native English speakers, we focus on workers located in the USA. We test workers with 10 head personas, each with 2 tail personas (i.e., 20 head-tail persona pairs in total), and select workers who reasonably annotate 18 or more (i.e., ≥90%) relations between the given head and tail personas. Finally, 72 out of 207 workers are selected as qualified. We pay each worker $0.30 for every 5 annotations; the average hourly wage is about $18.00, which is in the acceptable range suggested by Amazon Mechanical Turk. Figures 4 and 5 show screenshots of our acceptance policy, privacy policy, and task instructions used for crowdsourcing.

Table 17 shows the fine-grained statistics of persona relations included in PEACOK. Each PEACOK fact's relation consists of three dimensions of labels, as shown in Figure 3. The combination of Routine or Habit, Self and Distinctive labels is the most frequent relation in PEACOK, which implies that individual daily activities might be the most common topic involved in human interactions. Table 18 shows several examples of persona facts in PEACOK, showcasing our knowledge graph's rich commonsense inferences on persona-grounded knowledge.

C Neural KG Analysis Details
Fact Preprocessing We develop our neural knowledge generator based on the PEACOK facts whose relations are labeled as Distinctive in the third (distinctiveness) dimension. We preprocess these distinctive PEACOK facts to facilitate knowledge generation. In particular, we follow Table 19 to map each fact's relation into a textual description, and then concatenate it with the fact's head and tail personas. If the relation is labeled as Relationship in the second (interactivity) dimension, we also append its description in Table 19 to the fact's main-dimension label description, i.e., one of the other four descriptions in Table 19. For example, (I am a waiter, Characteristic and Relationship, skilled at customer service) is converted into I am a waiter, here is my character trait related to other people or social groups, skilled at customer service.
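A sketch of this fact-to-text conversion is below. Only the Characteristic description and the Relationship suffix are quoted from the example above; the other relation descriptions are placeholders for the actual Table 19 strings.

```python
MAIN_DESC = {
    "Characteristic": "here is my character trait",
    "Routine or Habit": "here is my routine or habit",   # placeholder
    "Goal or Plan": "here is my goal or plan",           # placeholder
    "Experience": "here is my experience",               # placeholder
}
RELATIONSHIP_SUFFIX = "related to other people or social groups"

def fact_to_text(head, main_relation, tail, is_relationship=False):
    """Concatenate head, relation description, and tail into one statement."""
    desc = MAIN_DESC[main_relation]
    if is_relationship:  # append the interactivity description
        desc = f"{desc} {RELATIONSHIP_SUFFIX}"
    return f"{head}, {desc}, {tail}"

print(fact_to_text("I am a waiter", "Characteristic",
                   "skilled at customer service", is_relationship=True))
# -> I am a waiter, here is my character trait related to other people
#    or social groups, skilled at customer service
```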
Evaluation Details We split our preprocessed facts into three sets, with sizes 64,853, 8,913 and 14,112 for training, validation and testing, respectively. Note that the three sets do not share head personas with each other. We evaluate tail persona generation on the 3,030 unique head-relation combinations in the testing set, with the 14,112 gold tail personas serving as references. Several NLG metrics are adopted for the automatic evaluation, including cumulative 4-gram BLEU (Papineni et al., 2002), ROUGE-L (Lin, 2004), METEOR (Banerjee and Lavie, 2005) and SkipThoughts (Kiros et al., 2015). For human evaluation, we use the same group of workers qualified for the PEACOK relation classification described in Appendix A. Each fact with a generated tail is evaluated by one Amazon Mechanical Turk worker, following the instructions shown in Figure 6. We pay each worker $0.20 for evaluating every 5 facts, which keeps a similar hourly wage to the PEACOK relation classification.

Model Training
We use the Kogito toolkit (Ismayilzada and Bosselut, 2022) to train the COMET-BART knowledge generator, with the default hyperparameters suggested by the toolkit. One NVIDIA TITAN X Pascal GPU is used to train the model for 7 epochs, which takes about 1 hour to reach the highest ROUGE-L score on the validation set. For the 5-shot GPT-3 generation, we prompt the davinci endpoint with the default hyperparameters suggested by the OpenAI GPT-3 platform.
We also train a DeBERTa (He et al., 2020) discriminator to re-rank the facts generated by COMET-BART and GPT-3. For each training fact, we create one negative example by replacing its tail persona with a randomly sampled one from another training fact which has a different head persona but the same relation. We train the DeBERTa model to discriminate true facts from negative samples with a binary classification loss, using the hyperparameters suggested by the ComFact (Gao et al., 2022) benchmark. Four NVIDIA TITAN X Pascal GPUs are used to train the model for 6 epochs, which takes about 21 hours to reach the highest F1 score on the validation set. Finally, for both COMET-BART and GPT-3, we evaluate their top-1 of 5 generated facts re-ranked by our DeBERTa discriminator, with their default decoding methods, i.e., beam search for COMET-BART and nucleus sampling for GPT-3 (top-p sampling rate 1.0 and temperature 0.9).
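The negative sampling scheme can be sketched as follows; function and variable names are ours.

```python
import random

def make_negative(fact, training_facts, rng=random.Random(0)):
    """Create one negative example for a (head, relation, tail) training
    fact by swapping in the tail of another fact with a different head
    but the same relation."""
    head, relation, _ = fact
    candidates = [
        t for h, r, t in training_facts if r == relation and h != head
    ]
    return (head, relation, rng.choice(candidates))

facts = [("I am a singer", "Experience", "studied music at college"),
         ("I am a chef", "Experience", "trained in a restaurant kitchen")]
neg = make_negative(facts[0], facts)
assert neg == ("I am a singer", "Experience", "trained in a restaurant kitchen")
```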

D Persona Dialogue Agent Implementation Details
Our downstream dataset, ConvAI2 PERSONA-CHAT, contains 17,878 and 1,000 crowdsourced dialogues for training and validation, respectively, while the 1,015 testing dialogues are not public. In each dialogue sample, two speakers are given their own persona profiles, i.e., four or five sentences of self-introduction, to conduct conversations. Based on the persona profiles, P²BOT uses a reinforcement learning (Sutton et al., 1999) approach to build mutual persona perception between speakers, which enhances the quality of personalized dialogue generation.
Persona Knowledge Linking We first link candidate facts from PEACOK via the pattern matching and embedding similarity heuristics introduced in ComFact, and then use a DeBERTa (He et al., 2020) entity linker trained on ComFact to select relevant facts from the candidates. We use the DeBERTa entity linker (instead of the fact linker) to check the relevance of each fact's head and tail personas independently, without considering their in-between relations, because the DeBERTa fact linker from ComFact is trained on ATOMIC 2020 relations and cannot reliably identify the new relation set of PEACOK. We link persona facts from PEACOK whose head and tail personas are both relevant to the extracted PERSONA-CHAT statement or sentence. We also include an additional set of persona facts which only have a relevant tail, since the high-level head personas are not always revealed in the dialogue. Similar to the fact preprocessing described in Appendix C, we convert each linked persona fact into a natural language statement, by first following Table 19 to map each fact's relation into a textual description, and then concatenating it with the fact's head and tail personas.
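A rough sketch of the candidate-linking heuristics (lexical overlap plus embedding similarity) is shown below; the encoder checkpoint and threshold are assumptions, and the trained DeBERTa relevance check that follows this step is omitted.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def candidate_facts(persona_sentence, facts, threshold=0.5):
    """facts: list of (head, relation, tail). A fact becomes a candidate
    if its tail lexically overlaps with the sentence (a naive stand-in
    for ComFact's pattern matching) or is embedding-similar to it."""
    sent_emb = encoder.encode(persona_sentence, convert_to_tensor=True)
    candidates = []
    for head, relation, tail in facts:
        lexical = any(w in persona_sentence.lower() for w in tail.lower().split())
        tail_emb = encoder.encode(tail, convert_to_tensor=True)
        semantic = util.cos_sim(sent_emb, tail_emb).item()
        if lexical or semantic >= threshold:
            candidates.append((head, relation, tail))
    return candidates
```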

Model Training
We train our knowledge-augmented models (i.e., P²BOT + PEACOK and P²BOT + ATOMIC 2020) with the same hyperparameters and early stopping settings as the original P²BOT model. Two NVIDIA TITAN X Pascal GPUs are used, and training takes about 20 hours to converge (with early stopping) on the validation set.
Human Evaluation For each pairwise comparison, we show the experts two responses generated by different models, together with the gold dialogue history and the interlocutor persona profiles. We ask the experts to compare the two responses with regard to our four evaluation aspects (i.e., fluency, consistency, engagement and persona expression). To guide the experts, we interpret each evaluation aspect as a specific question, as shown in Table 20. As a case study, we also present below the response generated by each model, along with the dialogue history and the persona profile associated with the speaker of the response. The linked PEACOK knowledge (i.e., fact) involved in the response generation is also presented. We find that the involved PEACOK fact helps identify a potential role of the speaker, i.e., breeder, inferred from milking cows and farmland, and also explains the speaker's persona of having a pet canine, i.e., dog. Therefore, compared to the other baseline models, P²BOT + PEACOK generates a more consistent and engaging response, which is well associated with the counterpart's last utterance in the dialogue history while simultaneously conveying a related persona of the speaker.

Fluency: Which response is more fluent and understandable?
Consistency: Which response is more consistent with the dialogue history?
Engagement: Which response shows higher engagement, e.g., more attractive and interesting, more active involvement?
Persona Expression: Which response shows richer personas of the interlocutor that are consistent with his or her persona profiles?
Table 20: Questions for human evaluation on downstream dialogue response generation, with regard to the four evaluation aspects.

Dialogue History
Hi there, how are you tonight? I'm tired from milking the cows.
Well, I know I am strong for a 100, but I don't think I can do that.
100?! That's older than my dad! Would you make your son milk cows?
Yeah because I sure can't do it.

Persona Profile
The Black Hawks are my favorite team!
I reside in the country surrounded by farmland.
I like to meditate with art.
I have a pet canine that looks like a police canine.
Responses
P²BOT: I'm sorry. I am a Buddhist, but I like to meditate.
P²BOT + ATOMIC 2020: Well, I'm a grown man, and I do not want to be a cop.
P²BOT + PEACOK: You can do it! I have a dog. He can help you.
Human: I would rather be coloring in my books! Do you like sports?
Involved PEACOK Knowledge
(I am a breeder, Routine or Habit, breed dog)