Persona-Guided Planning for Controlling the Protagonist’s Persona in Story Generation

Endowing the protagonist with a specific personality is essential for writing an engaging story. In this paper, we aim to control the protagonist’s persona in story generation, i.e., generating a story from a leading context and a persona description, where the protagonist should exhibit the specified personality through a coherent event sequence. Considering that personas are usually embodied implicitly and sparsely in stories, we propose a planning-based generation model named ConPer to explicitly model the relationship between personas and events. ConPer first plans events of the protagonist’s behavior which are motivated by the specified persona through predicting one target sentence, then plans the plot as a sequence of keywords with the guidance of the predicted persona-related events and commonsense knowledge, and finally generates the whole story. Both automatic and manual evaluation results demonstrate that ConPer outperforms state-of-the-art baselines for generating more coherent and persona-controllable stories. Our code is available at https://github.com/thu-coai/ConPer.


Introduction
Stories are important for entertainment. They are often made engaging by portraying animated and believable characters, since a story plot unfolds as the characters interact with the world created in the story (Young, 2000). Cognitive psychologists have determined that the ability of an audience to comprehend a story is strongly correlated with the characters' believability (Graesser et al., 1991). And believability mostly depends on whether the characters' reactions to what has happened, and their deliberate behavior, accord with their personas (e.g., weaknesses, abilities, occupations) (Madsen and Nielsen, 2009; Riedl and Young, 2010).

[Table 1: An example from STORIUM (Akoury et al., 2020). The protagonist's name is shown in square brackets. Persona B is manually written based on Persona A. Sentences which embody the given personas are highlighted in red.]
Furthermore, previous studies have also stressed the importance of personas in stories to maintain the interest of audience and instigate their sense of empathy and relatedness (Cavazza et al., 2009;Chandu et al., 2019). However, despite the broad recognition of its importance, it has not yet been widely explored to endow characters with specified personalities in story generation.
In this paper, we present the first study to impose free-form controllable personas on story generation. Specifically, we require generation models to generate a coherent story in which the protagonist exhibits the desired personality. We focus on controlling the persona of only the protagonist of a story in this paper and leave the modeling of personas of multiple characters for future work. As exemplified in Table 1, given a context that presents the story settings including characters, location, and problems (e.g., "Boruc" was suffering from a plane crash), together with a persona description, the model should generate a coherent story that exhibits the persona (e.g., what happened when "Boruc" was "skilled" or "unskilled"). In particular, we require the model to embody the personality of the protagonist implicitly through his actions (e.g., "checked his controls" for the personality "skilled pilot"). Therefore, modeling the relations between personas and events is the first challenge of this problem. Then, we observe that only a small number of events in a human-written story relate to personas directly, while the rest serve to explain the cause and effect of these events and maintain the coherence of the whole story. Accordingly, the second challenge is learning to plan a coherent event sequence (e.g., first "finding the plane shaking", then "checking controls", and finally "landing safely") to embody personas naturally.
In this paper, we propose a generation model named CONPER to deal with Controlling the Persona of the protagonist in story generation. Due to the persona-sparsity issue that most events in a story do not embody the persona, directly fine-tuning on real-world stories may mislead the model to focus on persona-unrelated events and regard the persona-related events as noise. Therefore, before generating the whole story, CONPER first plans persona-related events by predicting one target sentence, which should be motivated by the given personality following the leading context. To this end, we extract persona-related events that have a high semantic similarity with the persona description in the training stage. Then, CONPER plans the plot as a sequence of keywords to complete the cause and effect of the predicted persona-related events with the guidance of commonsense knowledge. Finally, CONPER generates the whole story conditioned on the planned plot. The generated stories are shown to have better coherence and persona-consistency than those of state-of-the-art baselines.
We summarize our contributions as follows:
I. We propose a new task of controlling the personality of the protagonist in story generation.
II. We propose a generation model named CONPER to impose a specified persona on story generation by planning persona-related events and a keyword sequence as intermediate representations.
III. We empirically show that CONPER can achieve better controllability of persona and generate more coherent stories than strong baselines.

Related Work
Story Generation There have been wide explorations of various story generation tasks, such as story ending generation (Guan et al., 2019), story completion (Wang and Wan, 2019) and story generation from short prompts (Fan et al., 2018), titles (Yao et al., 2019) or beginnings (Guan et al., 2020). To improve the coherence of story generation, prior studies usually first predicted intermediate representations as plans and then generated stories conditioned on the plans. The plans could be a series of keywords (Yao et al., 2019), an action sequence (Fan et al., 2019; Goldfarb-Tarrant et al., 2020) or a keyword distribution (Kang and Hovy, 2020). In terms of character modeling in stories, some studies focused on learning characters' personas as latent variables (Bamman et al., 2013, 2014) or represented characters as learnable embeddings (Ji et al., 2017; Clark et al., 2018). Chandu et al. (2019) proposed five types of specific personas for visual story generation. Brahman et al. (2021) formulated two new tasks including character description generation and character identification. In contrast, we focus on story generation conditioned on personas in the free form of text describing one's strengths, weaknesses, abilities, occupations and goals.
Controllable Generation Controllable text generation aims to generate texts with specified attributes. For example, Keskar et al. (2019) pretrained a language model conditioned on control codes of different attributes (e.g., domains, links). Dathathri et al. (2020) proposed to combine a pretrained language model with trainable attribute classifiers to increase the likelihood of the target attributes. Recent studies in dialogue models focused on controlling through sentence functions (Ke et al., 2018), politeness (Niu and Bansal, 2018) and conversation targets (Tang et al., 2019). For storytelling, Brahman et al. (2020) incorporated additional phrases to guide the story generation. Brahman and Chaturvedi (2020) proposed to control the emotional trajectory in a story by regularizing the generation process with reinforcement learning. Rashkin et al. (2020) generated stories from outlines of characters and events by tracking the dynamic plot states with a memory network.
A study similar to ours is Zhang et al. (2018), which introduced the PersonaChat dataset for endowing chit-chat dialogue agents with a consistent persona. However, dialogues in PersonaChat tend to exhibit the given personas explicitly (e.g., the agent says "I am terrified of dogs" for the persona "I am afraid of dogs"). For quantitative analysis, we compute the ROUGE score (Lin, 2004) between the persona description and the dialogue or story. We find that the ROUGE-2 score is 0.1584 for PersonaChat and 0.018 for our dataset (i.e., STORIUM). The results indicate that exhibiting personas in stories requires a stronger ability to associate a character's actions with his implicit traits than exhibiting personas in dialogues does.

[Figure 1: Model overview of CONPER. The training process is divided into the following three stages: (a) Target Planning: planning persona-related events (called the "target" for short); (b) Plot Planning: planning a keyword sequence as an intermediate representation of the story with the guidance of the target and a dynamically growing local knowledge graph; and (c) Story Generation: generating the whole story conditioned on the input and plans.]
Commonsense Knowledge Recent studies have demonstrated that incorporating external commonsense knowledge significantly improves coherence and informativeness for dialogue generation (Zhou et al., 2018a; Zhong et al., 2020), story ending generation (Guan et al., 2019), essay generation (Yang et al., 2019), story generation (Guan et al., 2020; Xu et al., 2020; Mao et al., 2019) and story completion (Ammanabrolu et al., 2021). These studies usually retrieved a static local knowledge graph containing the entities mentioned in the input and their related entities. In contrast, we propose to incorporate knowledge dynamically during generation to better model the keyword transitions in a long-form story.

Methodology
We define our task as follows: given a context X = (x_1, x_2, ..., x_{|X|}) with |X| tokens and a persona description for the protagonist P = (p_1, p_2, ..., p_l) of length l, the model should generate a coherent story Y = (y_1, y_2, ..., y_{|Y|}) of length |Y| that exhibits the persona. To tackle the problem, popular generation models such as GPT2 commonly employ a left-to-right decoder to minimize the negative log-likelihood L_ST of human-written stories:

    L_ST = -Σ_{t=1}^{|Y|} log P(y_t | y_{<t}, S),
    P(y_t | y_{<t}, S) = softmax(W s_t + b),

where S is the concatenation of X and P, s_t is the decoder's hidden state at the t-th position of the story, and W and b are trainable parameters. Based on this framework, we divide the training process of CONPER into three stages, as shown in Figure 1.
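The loss above is the standard token-level negative log-likelihood. As an illustrative sketch (the toy shapes and random inputs below are assumptions for demonstration, not the paper's actual configuration):

```python
import numpy as np

def nll_loss(hidden_states, targets, W, b):
    """Token-level negative log-likelihood L_ST.

    hidden_states: (T, d) decoder states s_t; targets: (T,) gold token ids;
    W: (V, d) and b: (V,) project states to vocabulary logits.
    """
    logits = hidden_states @ W.T + b              # (T, V)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Mean negative log-probability of the gold token at each position.
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
T, d, V = 4, 8, 10
loss = nll_loss(rng.normal(size=(T, d)), rng.integers(0, V, size=T),
                rng.normal(size=(V, d)), np.zeros(V))
```

With all-zero states and weights the distribution is uniform, so the loss reduces to log V, which is a quick sanity check on the implementation.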

Target Planning
We observe that most sentences in a human-written story do not aim to exhibit any personas, but serve to maintain the coherence of the story. Fine-tuning on these stories directly may mislead the model to regard the input personas as noise and focus on modeling the persona-unrelated events, which are in the majority. Therefore, we propose to first predict persona-related events (i.e., the target) before generating the whole story. We use an automatic approach to extract the target from a story since no manual annotation is available. Specifically, we regard the sentence with the highest semantic similarity to the persona description as the target. We consider only one sentence as the target in this work due to the persona-sparsity issue, and we also present the result of experimenting with two sentences as the target in Appendix B.1. More explorations of using multiple target sentences are left as future work. We adopt NLTK (Bird et al., 2009) for sentence tokenization, and we measure the similarity between sentences using BERTScore Recall (Zhang et al., 2019) with RoBERTa Large as the backbone model. Let T = (τ_1, τ_2, ..., τ_ι) denote the target sentence of length ι, which should be a sub-sequence of Y. Formally, the loss function L_TP for this stage can be derived as follows:

    L_TP = -Σ_{t=1}^{ι} log P(τ_t | τ_{<t}, S)

In this way, we exert explicit supervision to encourage the model to condition on the input personas.
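The target-extraction step can be sketched as follows. The paper scores candidates with BERTScore Recall over RoBERTa embeddings; the `token_recall` function below is a deliberately simplified lexical stand-in (an assumption) so the selection logic stays self-contained:

```python
def token_recall(candidate, reference):
    """Stand-in for BERTScore Recall: fraction of reference tokens
    covered by the candidate (the real model uses RoBERTa embeddings)."""
    cand = set(candidate.lower().split())
    ref = reference.lower().split()
    return sum(tok in cand for tok in ref) / max(len(ref), 1)

def extract_target(sentences, persona):
    """Pick the story sentence most similar to the persona description."""
    return max(sentences, key=lambda s: token_recall(s, persona))

story = ["He boarded the plane.",
         "He calmly checked his controls like a skilled pilot.",
         "The sun was setting."]
target = extract_target(story, "You are a skilled pilot.")
```

Swapping `token_recall` for a call to an embedding-based scorer recovers the procedure described above without changing the selection logic.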

Plot Planning
At this stage, CONPER learns to plan a keyword sequence for subsequent story generation (Yao et al., 2019). Plot planning requires a strong ability to model the causal and temporal relationships in the context in order to expand a reasonable story plot (e.g., associating "unskilled" with "failure" for the example in Table 1), which is extremely challenging without external guidance such as commonsense knowledge. In order to plan a coherent event sequence, we introduce a dynamically growing local knowledge graph, a subset of the external commonsense knowledge base ConceptNet (Speer et al., 2017), which is initialized to contain triples related to the keywords mentioned in the input and target. When planning the next keyword, CONPER combines the knowledge information from the local graph and the contextualized features captured by the language model with learnable weights. Then CONPER grows the local graph by adding the knowledge triples neighboring the predicted keyword. Formally, we denote the keyword sequence as W = (w_1, w_2, ..., w_k) of length k and the local graph as G_t when predicting the keyword w_t. The loss function L_KW for generating the keyword sequence is as follows:

    L_KW = -Σ_{t=1}^{k} log P(w_t | w_{<t}, S, T, G_t)

Keyword Extraction We extract words that relate to emotions and events from each sentence of a story as keywords for training, since they are important for modeling characters' evolving psychological states and behavior. We measure the emotional tendency of each word using the sentiment analyzer in NLTK, which predicts scores for negative, neutral, positive, and compound sentiment. We regard a word as emotion-related if its score for negative or positive is larger than 0.5. Second, we extract and lemmatize the nouns and verbs (excluding stopwords) from a story as event-related keywords, using NLTK for POS-tagging and lemmatization. Then we combine the two types of keywords in their original order as the keyword sequence for planning.
We limit the number of keywords extracted from each sentence of a story to at most 5, and we ensure that there is at least one keyword per sentence by randomly choosing one word if no keywords are extracted. We do not apply this limit when extracting keywords from the leading context and the persona description, since those keywords are only used to initialize the local knowledge graph.
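A simplified sketch of the keyword-extraction procedure is below. The real implementation relies on NLTK's sentiment analyzer, POS tagger and lemmatizer; the toy `EMOTION_WORDS` lexicon and `STOPWORDS` set here are stand-in assumptions that keep the sketch self-contained:

```python
import random

# Toy stand-ins for NLTK's sentiment analyzer and POS-based filtering.
EMOTION_WORDS = {"furious": 0.8, "happy": 0.7, "terrified": 0.9}
STOPWORDS = {"the", "a", "an", "was", "he", "she", "it", "and"}

def extract_keywords(sentence, limit=5):
    """Keep emotion words (score > 0.5) and content words, in their
    original order, at most `limit` per sentence, with a random-word
    fallback so every sentence yields at least one keyword."""
    tokens = [t.strip(".,").lower() for t in sentence.split()]
    keep = [t for t in tokens
            if EMOTION_WORDS.get(t, 0.0) > 0.5   # emotion-related keyword
            or t not in STOPWORDS]               # event-related keyword
    keywords = keep[:limit]
    if not keywords:
        keywords = [random.choice(tokens)]       # fallback: one random word
    return keywords
```

In the paper's pipeline the content-word filter would keep only lemmatized nouns and verbs; the stopword heuristic above is the simplest approximation of that step.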

Incorporating Knowledge
We introduce a dynamically growing local knowledge graph for plot planning. For each example, we initialize the graph G_1 as the set of knowledge triples whose head or tail entities are keywords in S and T, and then update G_t to G_{t+1} by adding the triples related to the generated keyword w_t at the t-th step. The key problem at this stage is representing and utilizing the local graph for next-keyword prediction. The local graph consists of multiple sub-graphs, each of which contains all the triples related to one keyword, denoted as ε_i = {(h_n^i, r_n^i, t_n^i) | h_n^i ∈ V, r_n^i ∈ R, t_n^i ∈ V}_{n=1}^{N}, where R and V are the relation set and entity set of ConceptNet, respectively. We derive the representation g_i for ε_i using graph attention (Zhou et al., 2018b) as follows:

    g_i = Σ_{n=1}^{N} α_n^i [h_n^i; t_n^i],
    α_n^i = exp(β_n^i) / Σ_{n'=1}^{N} exp(β_{n'}^i),
    β_n^i = (W_r r_n^i)^T tanh(W_h h_n^i + W_t t_n^i),

where W_h, W_r and W_t are trainable parameters, and h_n^i, r_n^i and t_n^i are learnable embedding representations of h_n^i, r_n^i and t_n^i, respectively. We use the same BPE tokenizer (Radford et al., 2019) as the language model to tokenize the head and tail entities, which may split an entity into multiple sub-words. Therefore, we derive h_n^i and t_n^i by adding the embeddings of all the sub-words, and we initialize the relation embeddings randomly. After obtaining the graph representation, we predict the distribution of the next keyword by dynamically deciding whether to select the keyword from the local graph:

    P(w_t | w_{<t}, S, T, G_t) = γ_t P_k^t + (1 - γ_t) P_l^t,

where γ_t ∈ {0, 1} is a binary learnable weight, P_l^t is a distribution over the whole vocabulary, and P_k^t is a distribution over the entities in G_t. We incorporate the knowledge information implicitly when computing both distributions:

    P_k^t = softmax(W_k [s_t; c_t] + b_k),
    P_l^t = softmax(W_p [s_t; c_t] + b_p),

where W_k, b_k, W_p and b_p are trainable parameters, and c_t is a summary vector of the knowledge information obtained by attending to the representations of all the sub-graphs in G_t:

    c_t = Σ_i a_i g_i,  a_i ∝ exp(s_t^T W_g g_i),

where W_g is a trainable parameter. During training, we set γ_t to the ground-truth label γ̂_t.
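Assuming the graph attention follows the cited formulation (a weighted sum of concatenated head/tail embeddings, with attention scores from a bilinear-style scoring of each triple), a minimal numpy sketch of one sub-graph representation is:

```python
import numpy as np

def subgraph_repr(H, R, T, Wh, Wr, Wt):
    """Graph attention over the N triples of one sub-graph.

    H, R, T: (N, d) embeddings of heads, relations and tails;
    Wh, Wr, Wt: (d, d) trainable projections (shapes are assumptions).
    Returns a single vector g_i = sum_n alpha_n [h_n; t_n].
    """
    # Per-triple score beta_n = (Wr r_n)^T tanh(Wh h_n + Wt t_n).
    beta = np.einsum("nd,nd->n", R @ Wr.T, np.tanh(H @ Wh.T + T @ Wt.T))
    alpha = np.exp(beta - beta.max())
    alpha /= alpha.sum()                        # softmax over triples
    return alpha @ np.concatenate([H, T], 1)    # weighted sum of [h; t]
```

Note that when all scores tie (e.g., Wr = 0), the attention degenerates to a plain average of the concatenated head/tail embeddings, which is a convenient sanity check.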
During generation, we decide γ_t by deriving the probability p_t of selecting an entity from the local graph as the next keyword, setting γ_t to 1 if p_t ≥ 0.5 and to 0 otherwise. We compute p_t as follows:

    p_t = sigmoid(W_p s_t + b_p),

where W_p and b_p are trainable parameters. We train the classifier with the standard cross-entropy loss L_C derived as follows:

    L_C = -Σ_{t=1}^{k} [γ̂_t log p_t + (1 - γ̂_t) log(1 - p_t)],

where γ̂_t is the ground-truth label. In summary, the overall loss function L_PP for the plot planning stage is:

    L_PP = L_KW + L_C

By incorporating commonsense knowledge for planning and dynamically updating the local graph, CONPER can better model the causal and temporal relationships between events in the context.
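The gating step at generation time can be sketched as follows; the `plan_step` helper and the toy vector shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def plan_step(s_t, W_p, b_p, p_k, p_l):
    """One plot-planning step at generation time: decide whether the
    next keyword comes from the local-graph distribution p_k or the
    language-model distribution p_l (both given as dense vectors)."""
    p_t = sigmoid(W_p @ s_t + b_p)      # prob. of selecting a graph entity
    gamma = 1 if p_t >= 0.5 else 0      # hard gate at generation time
    return gamma * p_k + (1 - gamma) * p_l, p_t
```

During training the gate is instead clamped to the ground-truth label, so the classifier and the keyword predictor can be optimized jointly.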

Target Guidance
In order to further improve coherence and persona-consistency, we propose to exert explicit guidance from the predicted target on plot planning. Specifically, we expect CONPER to predict keywords close to the target in semantics. Therefore, we add bias terms d_k^t and d_l^t to the logits of P_k^t and P_l^t, respectively:

    d_k^t = E_k (W_d s_tar + b_d),

where W_d and b_d are trainable parameters, s_tar is the target representation computed by averaging the hidden states at each position of the predicted target, and E_k is an embedding matrix, each row of which is the embedding of an entity in G_t. The modification for P_l^t is similar, except that we compute the bias term d_l^t with an embedding matrix E_l for the whole vocabulary.
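A minimal sketch of the bias computation, assuming the bias is a linear projection of the target representation scored against each entity embedding (the exact functional form in the paper may differ):

```python
import numpy as np

def biased_entity_logits(logits_k, E_k, s_tar, W_d, b_d):
    """Add a target-guidance bias so entities semantically close to the
    target representation s_tar receive higher probability.

    logits_k: (n_ent,) raw logits over local-graph entities;
    E_k: (n_ent, d) entity embeddings; W_d: (d, d), b_d: (d,)."""
    d_k = E_k @ (W_d @ s_tar + b_d)   # per-entity affinity with the target
    return logits_k + d_k
```

The same pattern applies to the vocabulary distribution by swapping `E_k` for a full-vocabulary embedding matrix.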

Story Generation
After planning the target T and the keyword sequence W, we train CONPER to generate the whole story conditioned on the input and plans with the standard language model loss L_ST. Since we extract one sentence from a story as the target, we do not train CONPER to regenerate that sentence in the story generation stage. Instead, we insert a special token Target into the story to specify the position of the target during training. At inference time, CONPER first plans the target and plot, then generates the whole story, and finally places the target at the position of Target.

Dataset
We conduct the experiments on the STORIUM dataset (Akoury et al., 2020). STORIUM contains nearly 6k long-form stories, and each story unfolds through a series of scenes with several shared characters. A scene consists of multiple short scene entries, each of which is written either to portray one character with an annotation of his personality (i.e., the "card" in STORIUM) or to introduce new story settings (e.g., problems, locations) from the perspective of the narrator. In this paper, we concatenate all entries from the same scene, since a scene can be seen as an independent story. We regard a scene entry written for a certain character as the target output, the personality of the character as the persona description, and the previous entries written for this character or from the perspective of the narrator in the same scene as the leading context. We split the processed examples for training, validation and testing based on the official split of STORIUM. We retain about 1,000 words (truncated at a sentence boundary) for each example due to the length limit of the pretrained language model. At the plot planning stage, we retrieve a set of triples from ConceptNet (Speer et al., 2017) for each keyword extracted from the input or generated by the model. We only retain triples in which both the head and tail entities consist of one word that occurs in our dataset, and whose relation confidence score (annotated by ConceptNet) is more than 1.0. The average number of triples per keyword is 33. We show more statistics in Table 2.
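The triple-filtering rule above can be sketched as a small function; the tuple layout `(head, relation, tail, confidence)` is an assumption about how the retrieved triples are stored:

```python
def filter_triples(triples, vocab, min_conf=1.0):
    """Keep ConceptNet triples whose head and tail are single words that
    appear in the dataset vocabulary, with relation confidence above
    min_conf. Each triple is assumed to be (head, relation, tail, conf)."""
    kept = []
    for head, rel, tail, conf in triples:
        if (" " not in head and " " not in tail      # single-word entities
                and head in vocab and tail in vocab  # occur in the dataset
                and conf > min_conf):                # confident relation
            kept.append((head, rel, tail))
    return kept
```
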

Baselines
We compare CONPER with the following baselines. (1) ConvS2S: a convolutional seq2seq model (Gehring et al., 2017). (2) Fusion: a convolutional seq2seq model with a fusion mechanism for story generation (Fan et al., 2018). (3) Plan&Write: it first plans a keyword sequence and then generates the story conditioned on the plan (Yao et al., 2019). (4) GPT2 Scr: it has the same network architecture as GPT2 but is trained on our dataset from scratch without any pretrained parameters. (5) GPT2 Ft: it is initialized with pretrained parameters and then fine-tuned on our dataset with the standard language modeling objective. (6) PlanAhead: it first predicts a keyword distribution conditioned on the input, and then generates a story by combining the language model prediction and the keyword distribution with a gate mechanism (Kang and Hovy, 2020). We remove the sentence position embedding and the auxiliary training objective (next sentence prediction) used in the original paper for fair comparison. Furthermore, we evaluate the following ablated models to investigate the influence of each component: (1) CONPER w/o KG: removing the guidance of commonsense knowledge in the plot planning stage.
(2) CONPER w/o TG: removing target guidance in the plot planning stage.
(3) CONPER w/o PP: removing the plot planning stage, which means the model first plans a target sentence and then directly generates the whole story. (4) CONPER w/o TP: removing the target planning stage, which also leads to the removal of target guidance in the plot planning stage.

Experiment Settings
We build CONPER on GPT2 (Radford et al., 2019), which is widely used for story generation (Guan et al., 2020). We concatenate the context and the persona description with a special token as the input for each example. For fair comparison, we also add special tokens at both ends of the target sentence in each training example for all baselines. We implement the non-pretrained models based on the scripts provided by the original papers, and the pretrained models based on the public checkpoints and code of HuggingFace's Transformers. We set all the pretrained models to the base version due to limited computational resources. We set the batch size to 8, the initial learning rate of the AdamW optimizer to 5e-5, and the maximum number of training epochs to 5 with an early stopping mechanism. We generate stories using top-p sampling with p = 0.9 (Holtzman et al., 2020). We apply these settings to all the GPT-based models, including GPT2 Scr, GPT2 Ft, PlanAhead, CONPER and its ablated models. As for ConvS2S, Fusion and Plan&Write, we use the settings from their respective papers and codebases.

Automatic Evaluation
Metrics We adopt the following automatic metrics for evaluation on the test set. (1) BLEU: We compute the n-gram overlap between generated and ground-truth stories (Papineni et al., 2002).
(2) BERTScore-target (BS-t): We use BERTScore Recall (Zhang et al., 2019) to measure the semantic similarity between the generated target sentence and the persona description. A higher result indicates the target embodies the persona better.
(3) BERTScore-max (BS-m): It computes the maximum value of BERTScore between each sentence in the generated story and the persona description. (4) Persona-Consistency (PC): It is a learnable automatic metric (Guan and Huang, 2020). We fine-tune RoBERTa BASE on the training set as a classifier to distinguish whether a story exhibits a consistent persona with a persona description. We regard the ground-truth stories as positive examples where the stories and the descriptions are consistent, and construct negative examples by replacing the story with a randomly sampled one. After fine-tuning, the classifier achieves an 83.63% accuracy on the auto-constructed test set. Then we calculate the consistency score as the average classifier score of all the generated texts regarding the corresponding input.
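The construction of training examples for the persona-consistency classifier can be sketched as follows (the `build_pc_examples` helper and its data layout are illustrative assumptions):

```python
import random

def build_pc_examples(pairs, seed=0):
    """Build training data for the persona-consistency classifier:
    each (persona, story) pair is a positive example; a negative example
    pairs the persona with a story sampled from a different example."""
    rng = random.Random(seed)
    data = []
    for i, (persona, story) in enumerate(pairs):
        data.append((persona, story, 1))               # consistent pair
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        data.append((persona, pairs[j][1], 0))         # mismatched pair
    return data
```

A classifier fine-tuned on such auto-constructed pairs can then score any (persona, story) pair, which is how the PC metric above is computed.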
Result Table 4 shows the automatic evaluation results. CONPER generates more word overlap with ground-truth stories, as shown by its higher BLEU scores. CONPER also better embodies the specified persona in both the target sentence and the whole story, as shown by its higher BS-t and BS-m scores. The higher PC score of CONPER further demonstrates the better exhibition of the given personas in the generated stories. As for the ablation tests, all the ablated models score lower than CONPER on all metrics, indicating the effectiveness of each component. Both CONPER w/o PP and CONPER w/o TP drop significantly in BLEU scores, suggesting that planning is important for generating long-form stories. CONPER w/o TP also performs substantially worse than CONPER w/o TG on all metrics, indicating the necessity of explicitly modeling the relations between persona descriptions and story plots. We also show an analysis of target guidance in Appendix C.

Manual Evaluation
We conduct a pairwise comparison between our model and four strong baselines including PlanAhead, GPT2 Ft, Fusion and ConvS2S. We randomly sample 100 stories from the test set, and obtain 500 stories generated by CONPER and the four baseline models. For each pair of stories (one by CONPER, and the other by a baseline, along with the input), we hire three annotators to give a preference (win, lose or tie) in terms of coherence (inter-sentence relatedness, causal and temporal dependencies) and persona-consistency with the input (exhibiting consistent personas). We adopt majority voting to make the final decisions among the three annotators. Note that the two aspects are evaluated independently. We resort to Amazon Mechanical Turk (AMT) for the annotation. As shown in Table 3, CONPER outperforms the baselines significantly in coherence and persona-consistency. Furthermore, we used human annotation to evaluate whether the identified target sentence embodies the given persona. We randomly sampled 100 examples from the test set, and identified the target for each example as the sentence with the maximum BERTScore with the persona description. We used a random policy as a baseline, which randomly samples a sentence from the original story as the target. We hired three annotators on AMT to annotate each example ("Yes" if the sentence embodies the given persona, and "No" otherwise), and adopted majority voting to make the final decision among the three annotators. Table 5 shows that our method significantly outperforms the random policy in identifying persona-related sentences.

Controllability Analysis
To further investigate whether the models can be generalized to generate specific stories to exhibit different personas conditioned on the same context, we perform a quantitative study to observe how many generated stories are successfully controlled as the input persona descriptions change.
Automatic Evaluation For each example in the test set, we use a model to generate ten stories conditioned on the context of this example and ten persona descriptions randomly sampled from other examples, respectively. We regard a generated story as successfully controlled if the pair of the story and its corresponding persona description (along with the context) has the maximum persona-consistency score among all ten descriptions. We regard the average percentage of successfully controlled stories among the ten generated stories for each example, over the whole test set, as the controllability score of the model. We show the results for CONPER and the strong baselines in Table 6. Furthermore, we also compute the superiority (denoted as ∆) of the persona-consistency score computed between a generated story and its corresponding description over that computed between the story and one of the other nine descriptions (Sinha et al., 2020). A larger ∆ means the model can generate more specific stories adhering to the personas. As shown in Table 6, more stories are successfully controlled for CONPER than for the baselines. And the larger ∆ of CONPER suggests that it can generate stories more specific to the input personas. The results show the better generalization ability of CONPER in generating persona-controllable stories.
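The controllability score and ∆ can be sketched as follows, under the assumption that the pairwise consistency scores are arranged in a tensor where `scores[i, j, k]` is the consistency of the story generated for description j of example i, evaluated against description k:

```python
import numpy as np

def controllability(scores):
    """scores: (n_examples, n_desc, n_desc) persona-consistency tensor.
    A story counts as successfully controlled if its own description
    (k == j) receives the highest score among all descriptions."""
    n, m, _ = scores.shape
    controlled = (scores.argmax(axis=2) == np.arange(m)).mean()
    own = scores[:, np.arange(m), np.arange(m)]          # (n, m) diagonal
    others = (scores.sum(axis=2) - own) / (m - 1)        # mean over k != j
    delta = (own - others).mean()                        # superiority
    return controlled, delta
```
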
Manual Evaluation For manual evaluation, we randomly sampled 50 examples from the test set, and manually revised the persona descriptions to exhibit an opposite persona (e.g., from "skilled pilot" to "unskilled pilot"). We required a model to generate two stories conditioned on the original and its opposite persona description, respectively. Finally we obtained 300 stories from three models including GPT2 Ft , PlanAhead and CONPER. Then, we hired three graduates to judge whether each story accords with the input persona. All annotators have good English language proficiency and are well trained for this evaluation task. Table 7 shows the evaluation results. We can see that 66% of the stories generated by CONPER are accordant with the input persona, suggesting the better controllability of CONPER.

Case Study
We present some cases in Table 8. We can see that the story generated by CONPER exhibits the specified persona with a coherent event sequence. The keywords planned by CONPER provide effective discourse-level guidance for the subsequent story generation, such as tablet, which has a commonsense connection with computer skills and Internet in the input. In contrast, the baselines tend not to generate any persona-related events. For example, the given persona description emphasizes the strong computer skills of the protagonist, while the stories generated by PlanAhead and GPT2 have nothing to do with computer skills. We further analyze some error cases generated by our model in Appendix G.

Context: · · · the group has gathered on the rooftop garden of Miyamoto Mansion · · · the TV set out near the long table on the patio is talking about some spree of thefts at low volume · · · the issue of Chloe's disappearance and the missing statue still hang over their heads.
Persona Description: [Aito] You are above average in your computer skills. If information is power, then your ability to use the internet makes you one of the most powerful people on the planet.
GPT2 Ft: Aito looked at the others, still trying to help find a way out of the hotel. He wasn't sure what the rest of the group wanted to see if they were going to survive and all knew if he needed to be needed · · ·
PlanAhead: Miyamoto Mansion · · · perhaps it's just a bit farther away. The music sounds bright enough but the line of visitors does not. Aito was once a pretty girl, he had always been quite witty when talking to people but she always found it annoying that a group of tourists looked like trash just to her · · ·
CONPER: · · · "Oh, wait · · · wait · · · people are talking about Chloe?" · · · "I have a feeling the internet is probably our best chance to get through this" · · · Aito looked around the table a moment before pulling out her tablet and starting typing furiously into her computer. She looked up at the tablet that had appeared, and she could see that it was working on a number of things · · ·
Planned keywords: · · · people → look → around → tablet → see · · ·
Table 8: Generated stories by different models. Italic words indicate improper entities or events in terms of consistency with the input. The bold sentence indicates the generated target by CONPER. Red words denote consistent events adhering to the input. And the extracted keywords are underlined.

Conclusion
We present CONPER, a planning-based model for a new task aiming at controlling the protagonist's persona in story generation. We propose target planning to explicitly model the relations between persona-related events and input personas, and plot planning to learn the keyword transitions in a story with the guidance of the predicted persona-related events and external commonsense knowledge. Extensive experiments show that CONPER can generate more coherent stories with better consistency with the input personas than strong baselines. Further analysis also indicates the better persona-controllability of CONPER.

Ethics Statements
We conduct the experiments by adapting a public story generation dataset, STORIUM, to our task. Automatic and manual evaluation results show that our model CONPER outperforms existing state-of-the-art models in terms of coherence, consistency and controllability, suggesting the generalization ability of CONPER to different input personas. Our approach can be easily extended to different syntactic levels (e.g., phrase-level and paragraph-level events), different model architectures (e.g., BART (Lewis et al., 2020)) and different generation tasks (e.g., stylized long text generation).
In both STORIUM and ConceptNet, we find some potentially offensive words. Therefore, our model may suffer from risks of generating offensive content, although we have not observed such content in the generated results. Furthermore, ConceptNet consists of commonsense triples of concepts, which may not be enough for modeling inter-event relations in long-form stories. We resort to Amazon Mechanical Turk (AMT) for manual evaluation.
We do not ask about personal privacy or collect personal information of annotators in the annotation process. We hire three annotators and pay each annotator $0.1 for comparing each pair of stories. The payment is reasonable given that a comparison takes an annotator about one minute on average.

A Implementation Details
We train our model on one Quadro RTX 6000 GPU. It costs about 25 hours to train our model, and 4 hours to generate stories using our model.

B.1 Target Extraction
We take the sentence with the maximum BERTScore against the persona description as the target in our model. We conducted two experiments to further investigate the influence of the target extraction strategy: (1) CONPER (Rand): regards a sentence randomly sampled from the story as the target for training in the target planning stage.
(2) CONPER (Multi): regards the two sentences with the maximum BERTScore against the persona description as the target.
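The selection logic above can be sketched as follows. For a self-contained, runnable illustration we use a token-overlap F1 as a cheap stand-in for BERTScore's greedy-matching F1 (the actual model computes similarity over BERT embeddings; `f1_overlap` and `select_target` are hypothetical helper names):

```python
def f1_overlap(cand_tokens, ref_tokens):
    """Token-overlap F1 between a candidate sentence and a reference.
    A cheap proxy for BERTScore F1, used here only for illustration."""
    cand, ref = set(cand_tokens), set(ref_tokens)
    if not cand or not ref:
        return 0.0
    common = len(cand & ref)
    if common == 0:
        return 0.0
    precision, recall = common / len(cand), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def select_target(story_sentences, persona_desc):
    """Return the story sentence scoring highest against the persona description."""
    persona_tokens = persona_desc.lower().split()
    return max(story_sentences,
               key=lambda s: f1_overlap(s.lower().split(), persona_tokens))

persona = "she is a skilled hacker with strong computer skills"
sentences = [
    "he walked to the bar",
    "she typed on her computer with strong skills",
    "the sun rose",
]
print(select_target(sentences, persona))  # the persona-related sentence wins
```

CONPER (Multi) corresponds to keeping the two highest-scoring sentences instead of the single best one.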
As shown in Table 9, when using a random sentence as the target, all the metrics drop significantly. And Table 5 in the main paper shows that it is hard for the random policy to select persona-related sentences. The results indicate the benefit of our methods for modeling relations between personas and events. Moreover, using multiple sentences as the target is inferior to using only one in terms of most metrics. It is possibly because stories in STORIUM tend to embody personas sparsely, and modeling the relations between personas and multiple persona-unrelated events directly may hurt the performance. The BS-t score is higher when using multiple sentences because more words can easily lead to a higher recall score.

B.2 Keyword Extraction
We extracted at most 5 keywords from each sentence for the plot planning stage. We also experimented with a sparser plan by extracting only one keyword from each sentence (called CONPER (Sparse)). Table 9 shows that using a sparser plan performs worse on all metrics, possibly because the limited planning keywords cannot fully exploit the external knowledge to form coherent and persona-related plots.
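A minimal sketch of per-sentence keyword extraction with a cap on the number of keywords. The appendix does not spell out the extractor, so the stopword-filtering heuristic and the `STOPWORDS` list below are assumptions for illustration:

```python
# A small illustrative stopword list (an assumption, not the paper's actual list).
STOPWORDS = {
    "the", "a", "an", "and", "or", "to", "of", "in", "on", "at",
    "is", "was", "it", "he", "she", "her", "his", "out",
}

def extract_keywords(sentence, max_keywords=5):
    """Keep up to max_keywords content words per sentence, in order of appearance.
    CONPER uses max_keywords=5; CONPER (Sparse) corresponds to max_keywords=1."""
    keywords = [w for w in sentence.lower().split()
                if w.isalpha() and w not in STOPWORDS]
    return keywords[:max_keywords]

print(extract_keywords("Aito looked around the table and pulled out her tablet"))
```

Chaining the extracted keywords across sentences yields a plan like the "people → look → around → tablet → see" sequence shown in Table 8.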

C Analysis of Target Guidance
We visualize how target guidance affects word prediction in the plot planning stage in Figure 2. The original word distribution is weighted toward words irrelevant to the target sentence, while the bias term (Equation 18) is weighted toward words semantically related to the target sentence, such as bar. After combining the original word distribution with the bias term, the final distribution balances the trade-off between target guidance and language model prediction. This validates our hypothesis that target guidance can draw the planned plots closer to the target, which helps improve story coherence and persona consistency.

Table 9: Automatic evaluation results for several variants of CONPER. The best performance is highlighted in bold. All results are multiplied by 100.
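The combination of the original word distribution with the target-related bias term can be sketched as follows. Equation 18 is not reproduced in this section, so the cosine-similarity bias and the mixing weight `gamma` below are assumptions that convey the idea rather than the exact formulation:

```python
import numpy as np

def target_guided_distribution(lm_logits, word_embs, target_emb, gamma=1.0):
    """Combine LM logits with a bias toward words semantically close to the target.

    lm_logits:  (V,) logits from the language model.
    word_embs:  (V, d) embedding for each vocabulary word.
    target_emb: (d,) embedding of the target sentence.
    gamma:      strength of the target-guidance bias (an assumed hyperparameter).
    """
    # Cosine similarity of every vocabulary word with the target sentence.
    sims = word_embs @ target_emb / (
        np.linalg.norm(word_embs, axis=1) * np.linalg.norm(target_emb) + 1e-8)
    # Add the bias to the original logits, then renormalize with a softmax.
    biased = lm_logits + gamma * sims
    exp = np.exp(biased - biased.max())
    return exp / exp.sum()
```

With `gamma = 0` this reduces to the plain language-model distribution; larger values pull probability mass toward target-related words.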

D Diversity
We compare the diversity of CONPER with baselines using distinct-n (D-n) (Li et al., 2016), the ratio of distinct n-grams to all n-grams in the generated stories. The results in Table 10 show that CONPER achieves better coherence and persona consistency without sacrificing diversity.
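The distinct-n metric of Li et al. (2016) is straightforward to compute; a minimal sketch over a corpus of tokenized stories (the function name is ours):

```python
from typing import List

def distinct_n(stories: List[List[str]], n: int) -> float:
    """Ratio of distinct n-grams to all n-grams across a set of tokenized stories."""
    all_ngrams = []
    for tokens in stories:
        all_ngrams.extend(tuple(tokens[i:i + n])
                          for i in range(len(tokens) - n + 1))
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)

stories = [["the", "cat", "sat"], ["the", "dog", "sat"]]
print(distinct_n(stories, 1))  # 4 distinct unigrams out of 6 total
```

Higher values indicate more lexically diverse generations; repetitive text drives the ratio toward zero.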

E Manual Evaluation
We conduct manual evaluation on Amazon Mechanical Turk. To improve the annotation quality, we provide detailed instructions for annotators, which contain: (1) a summary of our task; (2) a formal definition of coherence and persona consistency; and (3) annotation criteria for coherence and persona consistency. The detailed evaluation guideline is shown in Figure 3.