Stylized Story Generation with Style-Guided Planning

Abstract

Current storytelling systems focus more on generating stories with coherent plots regardless of the narration style, which is important for controllable text generation. Therefore, we propose a new task, stylized story generation, namely generating stories with a specified style given a leading context. To tackle the problem, we propose a novel generation model that first plans the stylized keywords and then generates the whole story with the guidance of the keywords. Besides, we propose two automatic metrics to evaluate the consistency between the generated story and the specified style. Experiments demonstrate that our model can controllably generate emotion-driven or event-driven stories based on the ROCStories dataset (Mostafazadeh et al., 2016). Our study presents insights for stylized story generation in further research.


Introduction
Story generation is a challenging task in natural language generation (NLG), namely generating a reasonable story given a leading context. Recent work focuses on enhancing the coherence of generated stories (Fan et al., 2018; Yao et al., 2019) or introducing commonsense knowledge (Xu et al., 2020). However, generating stories with controllable styles has not yet been investigated, which is important since different styles serve different writing purposes. As exemplified in Figure 1, emotion-driven stories use emotional words (e.g., "excited", "enjoyed") to reveal the inner states of the characters and bring the readers closer to them. In comparison, event-driven stories usually contain a sequence of events with a clear temporal order (e.g., "tearing" → "tried" → "found" → "hooked"), which aims to narrate the story objectively.

Leading Context: Alice bought a new television.
The picture on screen was tearing. Then she tried different adjustments. Eventually, she found out what was wrong. She got the wrong cable hooked up.

Figure 1: Example of stylized story generation given the same leading context. The stylized keywords are in bold.
In this paper, we formalize the task of stylized story generation, which requires generating a coherent story with a specified style given the first sentence as the leading context. Style has multiple interpretations; it can be seen as a unique voice of the author expressed through the use of certain stylistic devices (e.g., choices of words) (Mou and Vechtomova, 2020). In this work, we focus on the choices of words and define story styles based on patterns of wording. Specifically, we focus on two story styles: emotion-driven and event-driven. Emotion-driven stories contain abundant words with emotional inclination. We identify the emotional words using the off-the-shelf toolkit NRCLex (Mohammad, 2020), which supports retrieving the emotional effects of a word from a predefined lexicon. Event-driven stories tend to use serial actions as an event sequence. We use NLTK (Bird et al., 2009) to extract verbs in a story as the actions. Since no public datasets are available for learning to generate stylized stories, we regard the extracted words as stylistic keywords and then annotate the story styles for existing story datasets automatically based on the keyword distribution. Note that the story styles can be extended easily by defining new stylistic keywords.
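To make the extraction concrete, here is a minimal sketch of the two keyword extractors. The tiny lexicon and verb/stop-word lists are hypothetical stand-ins for the NRCLex word-emotion lexicon and the NLTK-based verb extraction described above, not the actual resources.

```python
# Hypothetical stand-in for the NRCLex word-emotion lexicon.
TOY_EMOTION_LEXICON = {
    "excited": {"joy", "surprise"},
    "enjoyed": {"joy"},
    "nervous": {"fear"},
}
# Hypothetical stand-ins for NLTK verb tagging and the low-IDF filter.
TOY_VERBS = {"tried", "found", "hooked", "bought", "enjoyed"}
STOP_VERBS = {"is", "are", "was", "were", "have", "has"}

def extract_keywords(story_tokens):
    """Return (emotion-driven keywords, event-driven keywords)."""
    emo = [w for w in story_tokens if w in TOY_EMOTION_LEXICON]
    eve = [w for w in story_tokens
           if w in TOY_VERBS and w not in STOP_VERBS]
    return emo, eve

tokens = "she was excited and tried until she found the cable".split()
emo, eve = extract_keywords(tokens)
# emo -> ["excited"]; eve -> ["tried", "found"]
```

In the paper's setup, `TOY_EMOTION_LEXICON` would be replaced by NRCLex lookups and `TOY_VERBS` by NLTK POS tagging over the full story.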
In this work, we propose a generation model for stylized story generation. Our model first predicts the distribution of stylistic keywords and then generates a story with the guidance of the distribution. Furthermore, we propose two new automatic metrics to evaluate the consistency between the generated stories and the specified styles: lexical style consistency (LSC) and semantic style consistency (SSC), which focus on the number of stylistic keywords and the overall semantics, respectively. Extensive experiments demonstrate that the stories generated by our model not only achieve better fluency and coherence than strong baselines but also have better consistency with the specified styles.

Related Work
Story Generation Recently there have been significant advances in story generation with the encoder-decoder paradigm (Sutskever et al., 2014), the transformer-based architecture (Vaswani et al., 2017), and large-scale pre-trained models (Radford et al., 2019; Lewis et al., 2020). Prior studies usually decomposed generation into separate steps by first planning a sketch and then generating the whole story from the sketch. The sketch is usually a series of keywords (Yao et al., 2019), a learnable skeleton (Xu et al., 2018), or an action sequence (Fan et al., 2019; Goldfarb-Tarrant et al., 2020). Another line of work incorporates external knowledge into story generation (Xu et al., 2020). However, generating stories with controllable styles has hardly been investigated.
Stylized Generation Stylized generation aims to generate texts with controllable attributes. For example, recent studies in dialogue systems focused on controlling persona (Boyd et al., 2020), sentence functions (Ke et al., 2018), politeness (Niu and Bansal, 2018), and topics (Tang et al., 2019). In story generation, Huang et al. (2019) and Xu et al. (2020) controlled the story topics and planned keywords, respectively. Besides, for general text generation, the authorship (Tikhonov and Yamshchikov, 2018), sentiment (Hu et al., 2017), and topics (Li et al., 2020) can also be controlled for different purposes. We introduce a new controllable attribute in story generation, i.e., the story style, which has received little attention in prior studies.

Proposed Method
In this section, we first show the task formulation for stylized story generation (§3.1). Then we present the details of our two-step model: style-guided keyword planning (§3.2) and generation with planned keywords (§3.3).

Task Formulation
Input: The first sentence x = (x_1, x_2, ..., x_n) of a story with length n, where x_i is the i-th word, together with a special token l indicating the expected style of the generated story, l ∈ {emo, eve}, which refers to the emotion-driven and event-driven styles, respectively. Besides, in the training phase, we set l = other if the training example is neither emotion-driven nor event-driven, to improve data efficiency.

Output: A story y = (y_1, y_2, ..., y_m) of length m with the style l, where y_i is the i-th word.

Planning
We insert l at the beginning of x and encode them as follows:

    (h_0, h_1, ..., h_n) = Enc(l, x_1, x_2, ..., x_n),

where h_i (1 ≤ i ≤ n) is the hidden state corresponding to x_i, h_0 is the hidden state at the position of l, and Enc is a bidirectional or unidirectional encoder. Then, we regard the stylistic keywords as bag-of-words (Kang and Hovy, 2020) and predict the keyword distribution P_k(w|l, x) over the whole vocabulary V as follows:

    P_k(w|l, x) = softmax(W_k h_c + b_k),

where W_k and b_k are trainable parameters, and h_c is the context embedding that summarizes the input information. We directly set h_c = h_0. The training objective in this stage is to minimize the cross-entropy loss L_k between the predicted keyword distribution P_k(w|l, x) and the ground truth P̂_k(w|l, x) as follows:

    L_k = -Σ_{w ∈ V} P̂_k(w|l, x) log P_k(w|l, x),

where P̂_k(w|l, x) is a one-hot vector over V. We do not decode a keyword sequence explicitly (Yao et al., 2019) but generate stories directly based on the keyword distribution P_k(w|l, x) to avoid introducing extra exposure bias (He et al., 2019).
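A minimal PyTorch sketch of this planning step follows. The dimensions, the random context embedding h_c, and the normalized multi-hot target standing in for P̂_k are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

vocab_size, hidden = 10, 8
proj_k = torch.nn.Linear(hidden, vocab_size)  # W_k, b_k

h_c = torch.randn(1, hidden)                  # context embedding h_c = h_0
P_k = F.softmax(proj_k(h_c), dim=-1)          # keyword distribution over V

# Target bag-of-words: mass on the stylistic-keyword indices (here 2 and
# 5), normalized to sum to 1 -- a multi-hot variant of P̂_k for illustration.
target = torch.zeros(1, vocab_size)
target[0, [2, 5]] = 0.5
L_k = -(target * P_k.log()).sum()             # cross-entropy loss L_k
```

At inference time, only P_k is needed; the loss term is used to train the planner jointly with the decoder.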

Generation
We employ a left-to-right decoder to generate a story conditioned upon the input and the predicted keyword distribution. The training objective in this stage is to minimize the negative log-likelihood L_st of the ground-truth stories:

    L_st = -Σ_{t=1}^{m} log P(y_t | l, x, y_{<t}).

We derive P(y_t | l, x, y_{<t}) by explicitly combining the stylistic keyword distribution into the decoding process as follows:

    P(y_t | l, x, y_{<t}) = g_t ⊙ P_k(y_t | l, x) + (1 - g_t) ⊙ P_l(y_t | l, x, y_{<t}),
    P_l(y_t | l, x, y_{<t}) = softmax(W_s s_t + b_s),

where W_s and b_s are trainable parameters, s_t is the decoder hidden state at step t, P_l is a distribution over V without conditioning on the predicted keywords, and g_t ∈ R^|V| is a gate vector indicating the weight of the keyword distribution P_k. We compute g_t as follows:

    g_t = σ(W_g s_t + b_g + W_r h_c + b_r),

where W_g, b_g, W_r and b_r are trainable parameters. In summary, the final training objective L of our model is derived as follows:

    L = L_st + α L_k,

where α is an adjustable scale factor.
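The gated mixing of the two distributions at one decoding step can be sketched as follows. The random decoder state s_t, the gate parameterization (here a single sigmoid projection of s_t), and the final renormalization are simplifying assumptions for illustration.

```python
import torch
import torch.nn.functional as F

vocab_size, hidden = 10, 8
proj_s = torch.nn.Linear(hidden, vocab_size)  # W_s, b_s
proj_g = torch.nn.Linear(hidden, vocab_size)  # gate parameters

s_t = torch.randn(1, hidden)                       # decoder hidden state
P_k = F.softmax(torch.randn(1, vocab_size), dim=-1)  # planned keyword dist.
P_l = F.softmax(proj_s(s_t), dim=-1)               # decoder-only dist.
g_t = torch.sigmoid(proj_g(s_t))                   # gate vector over V

# Element-wise mixture of the two distributions, then renormalize so the
# result is a valid next-token distribution.
P_y = g_t * P_k + (1 - g_t) * P_l
P_y = P_y / P_y.sum(dim=-1, keepdim=True)
```

Because the gate is element-wise over the vocabulary, the raw mixture need not sum to one, hence the explicit renormalization in this sketch.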

Dataset
We conduct the experiments on the ROCStories corpus (Mostafazadeh et al., 2016), which contains 98,159 five-sentence stories. We randomly split ROCStories by 8:1:1 for training/validation/test, respectively. The average numbers of words in the input (the first sentence) and the output (the last four sentences) are 9.1 and 40.8, respectively. Besides, we follow prior work to delexicalize stories in the dataset by masking all the male/female/neutral names with MALE/FEMALE/NEUTRAL to achieve better generalization.
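A toy sketch of the delexicalization step, assuming hypothetical gendered name lists (the paper's actual name lexicons are not specified here):

```python
# Hypothetical name lists standing in for the real gendered name lexicons.
MALE_NAMES = {"Bob", "Tom"}
FEMALE_NAMES = {"Alice", "Mary"}

def delexicalize(tokens):
    """Replace person names with MALE/FEMALE placeholder tokens."""
    out = []
    for tok in tokens:
        if tok in MALE_NAMES:
            out.append("MALE")
        elif tok in FEMALE_NAMES:
            out.append("FEMALE")
        else:
            out.append(tok)
    return out

delexicalize("Alice bought a new television .".split())
# -> ["FEMALE", "bought", "a", "new", "television", "."]
```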

Style Annotation
We extract stylistic keywords from stories in the dataset and assign a style label for each story according to the distribution of stylistic keywords.

Stylistic Keywords
We use NRCLex and NLTK to extract stylistic keywords. NRCLex maps each word in a story to its underlying emotion labels according to a word-emotion lexicon (e.g., "favorite" → "joy"). We select the words with the following emotion labels as the keywords for the emotion-driven style: "fear", "anger", "surprise", "sadness", "disgust", and "joy". Besides, we use NLTK to extract verbs as keywords for the event-driven style. We filter out the stop words and the ten common verbs with the lowest IDF (e.g., "is", "have") from the extracted verbs. Intuitively, the more stylistic keywords of some style a story has, the more consistent it is with that style. Therefore, we propose to compare the numbers of keywords for different styles for style annotation.
Normalized Numbers of Keywords Let N_s denote the number of keywords for style s in a story. We assume N_s is a random variable that follows a Gaussian distribution N(μ_s, σ_s²), where μ_s and σ_s are the mean and standard deviation computed on the training set. Given a story which contains n_s keywords for style s, we normalize n_s to ñ_s = P(N_s ≤ n_s) ∈ [0, 1] for fair comparison between keywords for different styles.
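This normalization is simply the Gaussian CDF evaluated at the raw count, which can be computed with the error function; the μ_s and σ_s values below are made up for illustration.

```python
import math

def normalize_count(n_s, mu_s, sigma_s):
    """Gaussian CDF P(N_s <= n_s): maps a raw keyword count into [0, 1]."""
    z = (n_s - mu_s) / sigma_s
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A story with exactly the mean number of keywords normalizes to 0.5.
normalize_count(4.0, 4.0, 1.5)  # -> 0.5
```

The CDF makes counts for different styles comparable even when their means and variances differ.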

Annotation We annotate the style label l for a given story by comparing its ñ_emo and ñ_eve, which refer to the normalized numbers of keywords for the emotion-driven and event-driven styles, respectively. We annotate the story with emo if ñ_emo is higher than ñ_eve, and eve otherwise. However, if both ñ_emo and ñ_eve are lower than τ_1, or |ñ_emo - ñ_eve| < τ_2, we annotate the story with other since there is no significant tendency toward any style. τ_1 and τ_2 are hyper-parameters, which are set to 0.7 and 0.3, respectively. For stories labeled with other, we select five words as the stylistic keywords from those keywords for the emotion-driven and event-driven styles. Table 1 shows the style distribution of the dataset.
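The annotation rule can be sketched directly from the thresholds above:

```python
TAU_1, TAU_2 = 0.7, 0.3  # thresholds used in the paper

def annotate(n_emo, n_eve):
    """Assign a style label from the normalized keyword counts."""
    # No significant tendency: both counts are low, or they are too close.
    if (n_emo < TAU_1 and n_eve < TAU_1) or abs(n_emo - n_eve) < TAU_2:
        return "other"
    return "emo" if n_emo > n_eve else "eve"

annotate(0.9, 0.4)  # -> "emo"
annotate(0.5, 0.6)  # -> "other" (both below tau_1)
```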

Baselines and Experiment Settings
We compare our model with GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020) as baselines. We fine-tune the baselines on ROCStories with the style tokens and the beginnings as input. We build our model based on BART; our approach can easily adapt to other pre-trained models such as BERT. We set the scale factor α to 0.2. For all models, we generate stories using top-k sampling (Fan et al., 2018) with k = 50 and a softmax temperature of 0.8.
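A sketch of top-k sampling with temperature as used for decoding; the vocabulary size and logits below are toy values, and real decoders apply this per step over the model's logits.

```python
import torch

def top_k_sample(logits, k=50, temperature=0.8):
    """Sample a token id with top-k sampling and temperature scaling."""
    logits = logits / temperature                 # sharpen the distribution
    topk_vals, topk_idx = torch.topk(logits, k)   # keep the k best logits
    probs = torch.softmax(topk_vals, dim=-1)      # renormalize over top-k
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice].item()

logits = torch.randn(100)          # toy vocabulary of 100 tokens
token_id = top_k_sample(logits)    # id of one sampled token
```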

Automatic Evaluation
Evaluation Metrics We use the following metrics for automatic evaluation: (1) Perplexity (PPL): Since the automatically annotated style labels may contain innate bias, we do not calculate the perplexity conditioned on the annotated styles for the stories in the test set. Instead, we calculate the perplexity of a model for each sample conditioned on the two styles (emotion-driven and event-driven), respectively, and then obtain the perplexity on the entire test set by averaging the smaller perplexity for each sample. (2) BLEU (B-n) (Papineni et al., 2002): This metric evaluates n-gram overlap (n = 1, 2). For each beginning in the test set, we generate two stories conditioned on the two styles, respectively. Then we calculate the BLEU score on the test set by averaging the higher BLEU with the reference story for each sample. (3) Distinct (D-n) (Li et al., 2016): This metric measures generation diversity as the percentage of unique n-grams (n = 1, 2). (4) Number of Stylistic Keywords (Number): We use the average normalized keyword count ñ (described in §4.2) to evaluate how many consistent stylistic keywords the generated stories have. (5) Lexical Style Consistency (LSC): We calculate the percentage of stories annotated with the consistent style among all generated stories, using the annotation strategy described in §4.2. (6) Semantic Style Consistency (SSC): A learnable automatic metric. We fine-tune BERT-base on the training set as a classifier to distinguish whether a story is emotion-driven, event-driven, or other, with the automatic labels as the ground truth. For each style, we use the average classification score on that style to measure style consistency. Table 2 shows the accuracy and F1-scores of the BERT model on the test set.
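The LSC metric reduces to a match rate between specified and annotated styles; a minimal sketch, assuming both label sequences are already available:

```python
def lsc(specified_styles, annotated_styles):
    """Fraction of generated stories whose automatically annotated style
    matches the style they were conditioned on."""
    matches = sum(s == a for s, a in zip(specified_styles, annotated_styles))
    return matches / len(specified_styles)

lsc(["emo", "emo", "eve", "eve"],
    ["emo", "other", "eve", "eve"])  # -> 0.75
```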

Results
We show the evaluation results of PPL and BLEU in Table 3. Note that we do not provide PPL for GPT-2 since it does not adopt the same vocabulary used in BART. We can see that our model has lower perplexity and higher word overlap with the human-written stories than the baselines. We present the results of diversity and style consistency on the generated stories with different specified styles in Table 4. Our model achieves comparable diversity with the baselines, generates more keywords of the specified styles, and outperforms the baselines in both lexical and semantic style consistency by a large margin.

Manual Evaluation
We conduct a pairwise comparison between our model and the baselines. We randomly generate 100 stories from the test set for each style and model. For each pair of stories (one by ours, the other by a baseline), we hire three annotators to give a preference (win, lose, or tie) in terms of fluency, coherence, and style consistency. We adopt majority voting to make the final decisions among the annotators. We resort to Amazon Mechanical Turk for manual annotation. Table 5 shows the results, where * and ** mean p-value < 0.05 and p-value < 0.01 (Wilcoxon signed-rank test), respectively, and inter-annotator agreement is measured with Fleiss' kappa (Fleiss, 1971).
Leading Context: Bob has a girlfriend.

Emotion-driven Style

GPT-2: She wants to take a trip to Hawaii. She goes on vacation. She is at the beach. It is a pretty day.

BART: One day, Bob saw a cute necklace on the sidewalk. Bob decided to buy it. After buying it, Bob loves it. Bob likes the necklace.

Ours: He is really nervous about her feeling around her. His girlfriend is very protective. Bob gets along great with her. Bob has a wonderful time with his girlfriend.

Event-driven Style

GPT-2: She likes her hair. She takes a few pictures of her friend's hair. He takes a picture of her hair and posts it. She likes it very much and she is happy.

BART: He knew he was always going to be mean to her. After a while Bob realized that he was being annoying. He had to leave his job and walk. Now he has a new girlfriend and a new job.

Ours: He has been talking to her all day. She stopped listening to him now. One day, she says his name and walked away. He decided to break up with her in another place.

The annotations show moderate or substantial (0.6 ≤ κ ≤ 0.8) agreement, and our model outperforms the baselines significantly in fluency, coherence, and style consistency. Table 6 shows several generated cases. We generate the stories using different models given the same leading context and specified style. For the emotion-driven style, our model can generate various emotional keywords (e.g., "nervous", "protective", "great", and "wonderful") and focuses more on shaping the characters' personalities. For the event-driven style, our model can generate fluent stories with a reasonable event sequence. In comparison, the baselines tend to confuse the two styles. For example, the stories generated by the baselines for the event-driven style still contain many emotional keywords (e.g., "likes", "annoying"). Besides, for the emotion-driven style, the baselines generate fewer and more repetitive emotional keywords. Furthermore, the baselines may suffer from more severe repetition (e.g., "take a picture") than our model, and they sometimes mix up or neglect characters (e.g., GPT-2 and BART only cover one of "Bob" and "his girlfriend" but neglect the other for the emotion-driven style). In summary, our model can generate more coherent stories with specified styles than the baselines.

Conclusion
We present a pilot study on a new task, stylized story generation. We define story style with respect to emotion and event, and propose a generation model which conditions on planned stylistic keywords. Comparative experiments with strong baselines show the promising results of the proposed model. Our work can inspire further research in this new direction.