EtriCA: Event-Triggered Context-Aware Story Generation Augmented by Cross Attention

One of the key challenges of automatic story generation is generating a long narrative that maintains fluency, relevance, and coherence. Despite recent progress, current story generation systems still struggle to effectively capture contextual and event features, which has a profound impact on generation performance. To address these challenges, we present EtriCA, a novel neural generation model that improves the relevance and coherence of generated stories by residually mapping context features onto event sequences with a cross-attention mechanism. This feature-capturing mechanism allows our model to better exploit the logical relatedness between events when generating stories. Extensive experiments based on both automatic and human evaluations show that our model significantly outperforms state-of-the-art baselines, demonstrating the effectiveness of our approach in leveraging context and event features.


Introduction
Story generation aims to generate fluent, relevant, and coherent narratives conditioned on a given context. As the task is notoriously difficult, a common strategy is to employ storylines composed of events to support the generation process (Yao et al., 2019; Chen et al., 2021; Alhussain and Azmi, 2021; Tang et al., 2022b). This process imitates the behaviour of human writers: a story starts from a sketch of keywords containing events, and the writer then unfolds the story along the track of the planned event sequence.
Despite recent progress, existing approaches are still ineffective in exploiting planned events when generating stories. Usually, pre-trained generation models, e.g., BART (Goldfarb-Tarrant et al., 2020; Clark and Smith, 2021; Huang et al., 2022), are employed to generate stories after event planning. However, as shown by the conflicts in Figure 1, the separate sentences generated by BART look reasonable, but several issues arise when considering the story as a whole: as a commonsense story, if the car needs to be "fixed and replaced" then it is too broken to "drive around"; "Ken" should not drive the car "very fast" in the "snow"; and if "Ken" "got stuck in the ditch" or "lost traction", he cannot then be "driving long distances". We hypothesise that these problems stem from inadequately capturing contextual features while keeping track of event sequences, because (i) the planned events generally lack background information, e.g., Ken (the character) and snow (the scene), and (ii) training stories may share the same events but have different reference stories, which may cause confusion during inference if the story-specific scenario is not considered.

Figure 1: Conditioned on a leading context and reference events (extracted from reference stories), existing generation models still suffer from problems of relevance and coherence. For instance, we fine-tune BART (Lewis et al., 2020) to generate stories. The leading context and reference text in this example are collected from ROC Stories (Mostafazadeh et al., 2016). Some conflicts among them are observed and coloured.
Therefore, to address these challenges we propose EtriCA, a novel Event-Triggered Context-Aware end-to-end framework for story generation. Given both a leading context and planned events, EtriCA captures contextual and event features from the inputs more effectively than state-of-the-art (abbr. SOTA) baseline models. Traditional generation models struggle to learn contextual representations while implicitly keeping track of the state of events, owing to the feature differences between events and contexts. As an abstract storyline, an event sequence only contains schematic information related to actions (e.g., the verb), while the context records story-specific details, e.g., the scene and characters of a story.
To comprehensively leverage both kinds of features, we draw inspiration from prior work on information fusion (Chen et al., 2018; Xing et al., 2020; He et al., 2020; You et al., 2020; Wang et al., 2021; Tang et al., 2022a) and encode the heterogeneous features with a cross-attention mechanism (Gheini et al., 2021). We aim to inform our model of the context background while the neural module unfolds each event into a narrative. We propose a novel neural module that learns to implicitly map contextual features to event features through information fusion in their numeric vector spaces (we call this process contextualising events). The whole process is illustrated in Figure 2. With the contextualised event features, an autoregressive decoder is employed to dynamically generate stories by learning to unfold the contextualised events. We also introduce an auxiliary task of Sentence Similarity Prediction (Guan et al., 2021) to enhance the coherence between event-driven sentences.
To support research on event-driven story generation, we propose a new task formulated as writing stories according to a given leading context and event sequence. We improve the event extraction framework of Chen et al. (2021) by exploiting dependency parsing to capture event-related roles from sentences, instead of using heuristic rules. We also present two datasets in which multi-sentence narratives from existing datasets are paired with event sequences produced by our automatic event extraction framework. Importantly, our task formulation can also benefit the study of controllable story generation, given the increasing interest in storyline-based neural generative frameworks (Xu et al., 2020; Ghazarian et al., 2021; Chen et al., 2021). According to our extensive experiments, EtriCA performs better than baseline models on the metrics of fluency, coherence, and relevance. Our contributions can be summarised as follows: • A new task formulation for event-driven story writing, which requires the generation model to write stories according to a given leading context and event sequence.

Related Work

Neural Story Generation
Before the surge of deep learning techniques, story generation models only generated simple sentences and relied heavily on manual designs (McIntyre and Lapata, 2009; Woodsend and Lapata, 2010; McIntyre and Lapata, 2010; Huang and Huang, 2013; Kybartas and Bidarra, 2016). Since neural story generation came into being, end-to-end neural models, especially pre-trained models such as BART (Lewis et al., 2020) and GPT-2 (Radford et al., 2019), have been widely employed as the main module for story writing (Rashkin et al., 2020; Guan et al., 2020; Goldfarb-Tarrant et al., 2020; Clark and Smith, 2021). However, it is hard to guarantee logical correctness with naive Seq2Seq models as the generated text grows longer, so recent work explores multi-step generation that implements neural models within traditional generative pipelines (Guan et al., 2021). For example, Yao et al. (2019), Goldfarb-Tarrant et al. (2020), and Chen et al. (2021) split story generation into planning (inputs to events) and writing (events to stories), and leverage two neural generation models to learn them.

Event Planning for Story Generation
At the planning stage, prior research (Yao et al., 2019; Rashkin et al., 2020; Goldfarb-Tarrant et al., 2020; Jhamtani and Berg-Kirkpatrick, 2020; Ghazarian et al., 2021) mostly focused on extracting event sequences from the reference text as the ground truths for plot planning, and then leveraged neural models (Radford et al., 2019; Lewis et al., 2020) to predict events given a leading context or title. Events have many representation formats, e.g., verbs, tuples, keywords, etc. Among them, a straightforward approach is extracting verbs as events (Jhamtani and Berg-Kirkpatrick, 2020; Guan et al., 2020; Kong et al., 2021), which is also the method we follow. However, verbs alone do not preserve information integrity; for instance, semantic roles such as negation (not) are significant for correct understanding. Peng and Roth (2016) and Chen et al. (2021) use heuristic rules to include these semantic roles, but those rules do not cover all the key roles. Therefore, inspired by related work (Rusu et al., 2014; Björne and Salakoski, 2018; Huang et al., 2018) in open-domain event extraction, we propose an event extraction workflow based on dependency parsing to capture the essential components of verb phrases in sentences as events.

Figure 2: The overview of the event feature contextualising process. The leading context (coloured in red) contains important information which affects the generation process, e.g., the weather "snows" may lead to "accident". These implicit clues help the neural generator to disambiguate the context of events. We first fuse both context and event features, and then feed them to the generator.

Figure 3: TOK is the basic unit of a sentence, POS is the part of speech, and DEP stands for dependencies between tokens. Through parsing dependencies, the event trigger (also recognised as the root of the sentence) filters all significant roles to represent a complete action. Meanwhile, extracted neighbouring events are considered to have temporal relations.

Task Formulation
Under the umbrella of controllable story generation, we define the following task: write a story that leverages both a given leading context and a given planned event sequence. Our primary goal is to investigate how to consider the context while keeping track of the given event sequence with neural generation models, so we extend the original context-aware story generation setting of Guan et al. (2021) by adding, for each leading context, an event sequence as the storyline to follow. Input: The input for each sample includes a leading context C = {c_1, c_2, ..., c_n}, which acts as the first sentence of a story, and an event sequence E = {e_1, e_2, ..., e_m} as a storyline building up a sketch for the story. c_i denotes the i-th token of the leading context, and e_i denotes the i-th event, representing the i-th sentence of the story.

Output:
The output is a multi-sentence story S = {s_1^1, s_2^1, ..., s_1^2, ..., s_n^m}, where s_j^i denotes the j-th token of the i-th sentence in a story.
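To make the formalisation concrete, a single training sample can be sketched as a small data structure. The texts below are invented for illustration; real samples come from the annotated ROC/WP data described later.

```python
# A toy sample illustrating the task's input/output structure.
sample = {
    # Leading context C = {c_1, ..., c_n}: the first sentence, tokenised.
    "leading_context": ["[MALE]", "had", "lost", "his", "dog", "."],
    # Event sequence E = {e_1, ..., e_m}: one event per target sentence.
    "events": ["missed dog", "notices something", "sees dog", "turns out be"],
    # Reference story S: m sentences; s_j^i is the j-th token of sentence i.
    "story": [
        ["He", "missed", "his", "dog", "badly", "."],
        ["He", "notices", "something", "strange", "."],
        ["He", "sees", "the", "dog", "outside", "."],
        ["It", "turns", "out", "to", "be", "a", "stray", "."],
    ],
}

# Each event e_i aligns with sentence i of the story.
assert len(sample["events"]) == len(sample["story"])
```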

Event Sequence Preparation
Following the prior work on event extraction of Chen et al. (2021) (discussed in Sec. 2.2), we present an automatic framework which includes all verb-related roles by analysing dependencies between words (we use spaCy, https://spacy.io/, to split sentences and parse dependencies between words in each sentence). The event representation is, however, not the focus of this study. Figure 3 shows an example of the event extraction process. The details of the event schema are given in Appendix A.2.
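A minimal sketch of the dependency-based extraction idea follows. To stay self-contained it operates on a hand-made parse in spaCy's (token, dependency-label) style rather than calling spaCy itself, and the retained dependency labels are illustrative rather than the exact schema of Appendix A.2.

```python
# Illustrative subset of dependency labels kept as event arguments;
# the paper's full schema is given in its Appendix A.2.
KEEP_DEPS = {"ROOT", "neg", "prt", "dobj", "prep", "pobj", "acomp"}

def extract_event(parsed_sentence):
    """Keep the root verb (event trigger) and arguments attached
    via the selected dependency labels, in sentence order."""
    return " ".join(tok for tok, dep in parsed_sentence if dep in KEEP_DEPS)

# Hand-made parse of "Ken could not drive the car around".
parsed = [
    ("Ken", "nsubj"), ("could", "aux"), ("not", "neg"),
    ("drive", "ROOT"), ("the", "det"), ("car", "dobj"), ("around", "prt"),
]
event = extract_event(parsed)
print(event)  # not drive car around
```

Note that the subject ("Ken") is deliberately dropped, matching the generalisation argument made in Appendix A.2.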

Neural Framework
At the writing stage, conditioned on a leading context C and planned events E (see Sec. 3.1), the neural model generates a multi-sentence story S. Figure 4 shows the overview of the whole process.
Contextualised Features Representation. At the encoding stage, the neural model takes as input C and E, which have the feature differences mentioned above. Conventional end-to-end models usually concatenate the embeddings of different inputs, since neural encoders capture their features in a numeric vector space (e.g., through self-attention). However, as the event sequence grows longer, the growing concatenated embeddings may diminish the influence of C (we discuss this in Appendix A.4). Instead, we first leverage two separate BART (Lewis et al., 2020) encoders to incorporate the features, and then fuse the features with multi-head attention calculated as below:

F_c = Encoder_c(C), F_e = Encoder_e(E)
A_i = softmax( (F_e W_i^Q)(F_c W_i^K)^T / sqrt(d_k) ) (F_c W_i^V)
F_ca = Concat(A_1, ..., A_m) W^M    (1)

where Encoder_c and Encoder_e inherit pre-trained parameters from BART but do not share trainable parameters during fine-tuning. F_c and F_e stand for the features captured from C and E, respectively. i denotes the i-th attention head out of m heads in total. W_i^Q, W_i^K, W_i^V, and W^M are trainable parameters. The i-th head attention A_i is the attention-weighted sum of the feature matrix. Finally, the obtained F_ca represents the attention over ongoing events under consideration of the context.
In order to contextualise the input event features, we incrementally add F_ca to the original event features F_e, so that the neural model is forced to learn the context gap between event sequences and stories, i.e.,

F_he = F_e + β ⊙ F_ca,   F_h = [F_c; F_he]

where β denotes the scale factor of F_ca. β ⊙ F_ca is the representation of the context gap trained via residual mapping. F_h concatenates both the leading context features F_c and the contextualised event features F_he. These are fed into a neural decoder to predict tokens and sentence representations.
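The fusion and residual mapping can be sketched as a single-head toy implementation in numpy (the full model uses m heads and BART-sized features; the dimensions and random weights here are purely illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contextualise(F_e, F_c, W_q, W_k, W_v, W_m, beta=1.0):
    """Single-head sketch of the cross-attention fusion and residual
    mapping: event features attend over context features, and the
    attended context F_ca is added back onto the event features."""
    scores = (F_e @ W_q) @ (F_c @ W_k).T / np.sqrt(W_k.shape[1])
    A = softmax(scores) @ (F_c @ W_v)   # events attend over context
    F_ca = A @ W_m                      # fused (cross-attended) features
    F_he = F_e + beta * F_ca            # contextualised event features
    return np.concatenate([F_c, F_he], axis=0)  # F_h fed to the decoder

rng = np.random.default_rng(0)
d = 8                              # toy hidden size
F_c = rng.normal(size=(5, d))      # 5 context token features
F_e = rng.normal(size=(3, d))      # 3 event token features
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(4)]
F_h = contextualise(F_e, F_c, *Ws)
print(F_h.shape)  # (8, 8): context rows followed by contextualised events
```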
Decoding and Sentence-level Fitting. As in other conventional generation systems, we employ an auto-regressive decoder to generate story tokens y_t as in the equations below.
H_t = Decoder(X, y_<t),   P(y_t | y_<t, F_h) = softmax(W H_t)

where t denotes a time step and X = F_h denotes the input to the neural model. H_t is the t-th hidden state of the decoder module; it is computed from the information of both context and events contained in F_h and the previously predicted tokens y_<t. W is a trainable parameter, and P(y_t | y_<t, F_h) is the probability distribution over the vocabulary, including special tokens. Through a sampling strategy, e.g., argmax, we collect the predicted token y_t. In addition to the token-level representation, we introduce an auxiliary task of Sentence Similarity Prediction (Guan et al., 2021) to learn sentence-level representations and training signals. Due to the page limit, the details are given in Appendix A.4.

Sentence Similarity Prediction
Training and Inference. As shown in Figure 4, the neural model is trained to fit both token-level and sentence-level references as follows:

L_overall = L_lm + λ L_sent

where L_lm is the cross-entropy loss of P(y_t | y_<t, F_h) and L_sent is the loss on the predicted sentence similarities. sim^s_ij and sim^y_ij denote the similarities between the i-th and j-th sentences in a reference story and a generated story, respectively. λ is an adjustable scale factor, and L_overall is the overall loss. By minimising L_overall, the neural model learns to predict a human-like story. Sentence Similarity Prediction is only used during training; at inference time the neural model outputs stories without those special tokens.
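The combined objective can be sketched numerically as follows. The exact form of L_sent is left to the appendix, so the mean-squared-error form below is an assumption made for illustration:

```python
import numpy as np

def overall_loss(token_logprobs, sim_ref, sim_pred, lam=1.0):
    """Sketch of L_overall = L_lm + lambda * L_sent. L_lm is the mean
    negative log-likelihood of the gold tokens; L_sent is assumed here
    to be an MSE between reference and predicted pairwise sentence
    similarities (an illustrative choice, not the paper's exact form)."""
    L_lm = -np.mean(token_logprobs)
    L_sent = np.mean((np.asarray(sim_ref) - np.asarray(sim_pred)) ** 2)
    return L_lm + lam * L_sent

loss = overall_loss(
    token_logprobs=[-0.1, -0.3, -0.2],   # log P(y_t | y_<t, F_h)
    sim_ref=[0.8, 0.4, 0.6],             # sim^s_ij from the reference story
    sim_pred=[0.7, 0.5, 0.6],            # sim^y_ij from the generated story
    lam=0.5,
)
print(round(loss, 4))  # → 0.2033
```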

Datasets
In this study, we annotate two popular datasets, ROCStories (ROC) (Mostafazadeh et al., 2016) and WritingPrompts (WP) (Fan et al., 2018), with additional event sequences as our benchmarks. We follow the settings of prior work (Xu et al., 2020; Guan et al., 2021) to preprocess the data. The stories in both datasets are split into sentences with NLTK (Bird and Loper, 2004). The ROC data are delexicalised by masking all names with the tokens [MALE], [FEMALE], and [NEUTRAL]. The WP data are recollected from the original development and test sets, and we retain the first eleven sentences of each story. For both datasets, the first sentence of a story is extracted to be the leading context C as the input, and the rest is used as the reference story S. Finally, we obtain a long-story dataset, WP (10 sentences), and a short-story dataset, ROC (4 sentences), for the following experiments. The event sequence E is extracted from the reference story S as the planned plot to guide story generation. The Train/Dev/Test split of ROC is 88344/4908/4909 stories, and that of WP is 26758/2000/2000.
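The core preprocessing steps can be sketched as follows; the name list and gender map are illustrative stand-ins for the actual delexicalisation pipeline, and real sentence splitting would use NLTK rather than pre-split input:

```python
# Toy name-to-token map; the real pipeline covers many names.
NAME_TOKENS = {"Ken": "[MALE]", "Anna": "[FEMALE]", "Sam": "[NEUTRAL]"}

def delexicalise(text):
    """Mask character names with placeholder tokens (ROC-style)."""
    for name, token in NAME_TOKENS.items():
        text = text.replace(name, token)
    return text

def split_story(sentences):
    """First sentence -> leading context C; the rest -> reference S."""
    masked = [delexicalise(s) for s in sentences]
    return masked[0], masked[1:]

story = ["Ken had lost his dog.", "Ken searched everywhere.", "He found it."]
context, reference = split_story(story)
print(context)  # [MALE] had lost his dog.
```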

Baselines
We compare EtriCA with the following SOTA generation models: (1) P&W (Plan and Write) (Yao et al., 2019): the main architecture is based on a BiLSTM with an attention mechanism (Garg et al., 2019). To make the comparison fairer, we enhance the original code by replacing the static word embeddings with the dynamic embeddings of pre-trained BART; (2) GPT-2 (Radford et al., 2019): a popular auto-regressive generative model which has been widely employed in story generation.

Implementation Details
The main contribution of our generation model is the contextualising module, which can be adapted to other encoder-decoder frameworks. We therefore employ the encoders and decoders of the BART framework (Lewis et al., 2020), which has shown strong performance in prior studies (Goldfarb-Tarrant et al., 2020; Guan et al., 2021), to build our neural generation model. We fine-tune our generation model from a publicly available BART checkpoint and fix the random seed to 42. All of our code is implemented in PyTorch and trained with the PyTorch Lightning framework. More details of the hyper-parameters of the model, training, and inference are described in Appendix A.1.

Evaluation Metrics
Perplexity (PPL) measures the uncertainty of generated tokens predicted by neural models. ROUGE-n (R-n) (Lin, 2004) is a set of referenced metrics measuring the n-gram coverage between generated stories and reference stories. BLEU-n (B-n) (Papineni et al., 2002) is also a set of referenced metrics, computing n-gram overlaps between generated stories and references. Lexical Repetition-n (LR-n) is an unreferenced metric computing the percentage of generated stories in which a 4-gram is repeated at least n times (Shao et al., 2019). Distinction-n (D-n) is an unreferenced metric quantifying the distinctness of stories by measuring the ratio of distinct n-grams to all generated n-grams (Li et al., 2016). Intra-story Repetition (Yao et al., 2019) measures the repetition of each sentence in a story via trigram overlaps. Intra-story Coherence and Relevance (Xu et al., 2018), originally used in dialogue evaluation, is based on cosine similarity between semantic embeddings to calculate sentence-level coherence and relevance. We use this approach to measure the relatedness between neighbouring generated sentences as intra-story coherence, and the relatedness between the leading context and each story sentence as intra-story relevance. Intra-story Aggregate Metrics, i.e., repetition, coherence, and relevance, are obtained as the mean of the sentence-level metrics.
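The two unreferenced metrics, D-n and LR-n, can be sketched as follows; tokenisation here is naive whitespace splitting, which the real evaluation would refine:

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_n(stories, n):
    """D-n: ratio of distinct n-grams to all generated n-grams."""
    grams = [g for s in stories for g in ngrams(s.split(), n)]
    return len(set(grams)) / len(grams)

def lexical_repetition(stories, n, k=4):
    """LR-n: fraction of stories with some k-gram repeated >= n times."""
    def repeats(s):
        gs = ngrams(s.split(), k)
        return any(gs.count(g) >= n for g in set(gs))
    return sum(repeats(s) for s in stories) / len(stories)

stories = ["the dog ran and the dog ran and the dog ran",
           "he saw a stray dog outside the house"]
print(distinct_n(stories, 2))       # low distinctness: first story repeats
print(lexical_repetition(stories, 2))  # only the first story has a repeated 4-gram
```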

Evaluation Results
Reference metrics. On the BLEU and ROUGE metrics, EtriCA outperforms the other baselines, which demonstrates that EtriCA generates stories more closely resembling the human-written reference stories.
In addition, the ablation study shows that both the context and event features play an important role in improving the generation process. The performance of -w/o leading and -w/o events indicates that the features contained in the two kinds of inputs are complementary, and both are essential for good story writing. Therefore, effectively incorporating both features is important for writing high-quality stories. When EtriCA does not implement our contextualising module (abbr. cm), all the metrics drop substantially, and some even fall below those of BART_l+e and HINT_l+e. This observation suggests that our contextualising module more effectively fuses the heterogeneous features and generates a richer semantic representation for the subsequent story writing. Similarly, the sentence-level representations also improve most of the metrics, although not as much as the contextualising module. We hypothesise this is because the contextualising module has already significantly reduced the gap between event sequences and stories (each event is paired with a sentence), making the improvement from sentence-level representations less prominent. This hypothesis is also confirmed in the following experiments.
Unreferenced metrics. In another set of experiments, we examine the repetition and diversity of the generated stories, with results reported in Table 2. EtriCA gives strong performance on both lexical repetition (LR-2) and diversity (D-4), either achieving the best performance or being on a par with the best-performing baseline. To further investigate how our model performs when writing along the planned events, we follow Yao et al. (2019) and observe the intra-story repetition of each generated sentence, as shown in Figure 5. The results show that EtriCA consistently outperforms the baselines on both sentence-level and story-level (i.e., aggregated) repetition scores, indicating that EtriCA performs better on event-triggered story writing.
In-depth analysis. Furthermore, we present additional experimental results analysing the intra-story coherence and relevance of our models. We select the two strongest baselines from the previous experiments for comparison. As shown in Table 3, our approach consistently outperforms the baselines on both intra-story coherence and relevance. This indicates that our contextualising module improves the capture of both context and event features, enhancing the logical relatedness between story sentences, and between the story and the input.
Additionally, the ablation results, in which EtriCA and -w/o sen have very close performance, also confirm the aforementioned hypothesis that the feature-capturing mechanism of the contextualising module partly replaces the function of the sentence-level representations.
Figure 6 further shows that our model significantly outperforms the baselines on coherence and relevance, which indicates its advantage in feature capturing and in generating stories more related to the events and context.
Leading Context: [MALE] had lost his dog over a month ago.
Event Sequence: missed dog → notices something → sees dog → turns out be
BART_l+e: He missed his dog for a whole month. One day he notices something moving and is startled. He sees the dog on the floor. It turns out to be a squirrel.
-w/o sen: He had missed his dog so much that he had to search for him. As he was searching, he notices something about a dog. He sees the dog with a bag. It turns out to be a stray, a wad of dog spray.
-w/o cm: [MALE] missed his dog this summer. He notices something on his neighbor's wall about a house. [MALE] notices the dog was very sad. It turns out that there must be a really sad day next time.
Table 4: A case study for an example in ROC Stories.
[MALE], [FEMALE], and [NEUTRAL] are the special tokens used to replace names in stories. The highlighted bold words denote the events corresponding to the given event sequence.

Human Evaluation
We conducted a human evaluation based on pairwise comparisons with two competitive baselines and the ablated model without our proposed contextualising module. We randomly sample 150 stories from the ROC Stories test set. Three evaluators were invited to choose which generated story is better (Win/Lose/Tie) on three aspects: (i) Fluency considers each sentence in isolation and measures quality from a linguistic perspective, e.g., grammatical correctness or the correct representation of semantic meaning; (ii) Coherence measures the logical relatedness between story sentences; (iii) Relevance measures the relevance between stories and their leading contexts. When summarising the annotations, the final results are counted by majority voting. As shown in Table 5, EtriCA outperforms the SOTA baselines in terms of fluency, coherence, and relevance. All generation models have relatively few conflicts with the given input, so they all perform well on relevance, which reduces the differences on that aspect. On the other hand, the gains in fluency and coherence are very significant, indicating the advantage of our contextualising module in capturing high-level features from the context and event sequences.

Case Study
As shown in Table 4, the examples indicate that, compared to the baseline models, EtriCA generates a story more logically related to the given inputs and contains fewer confusing expressions. (More examples are provided in the Appendix.)

Conclusion
We present a controllable story generation task conditioned on leading contexts and event sequences. We also provide two new datasets with additional annotated events, and introduce a new set of automatic metrics to measure coherence and fluency. We propose EtriCA, a novel generation model which better exploits context and event features with a cross-attention-based contextualising network. Through extensive experiments and analyses, we demonstrate that EtriCA generates stories with better fluency, coherence, and relevance than competitive baselines.

Limitations
Based on our methodology and observations from the experiments, we summarise the limitations of our study as follows: • Robustness: The quality of the given events may affect the robustness of generation models. It is very hard to define whether a plot (an event sequence) is "interesting" or "bad". When provided with unusual or strange events, the generation model struggles to follow them cohesively. • Sequence Length: Our neural models cannot write very long stories. Because of the nature of neural encoders and decoders, the inputs and outputs generally have a length limitation, e.g., 1024 tokens. • Generalisation: The training datasets limit the topics of open-domain story writing. Although large-scale pre-training has been widely adopted, our neural models still struggle with some topics, writing styles, etc. For instance, the performance of story generation on WP is relatively worse than on ROC, because WP contains spoken language and irregular expressions. • Experiments: Owing to limited resources, we did not conduct some further experiments in this study. For instance, we did not study the performance of different event representations, because that is not the focus of this study.
A.1 Training and Inference Details

Training Settings. Our experiments are carried out on multiple GPUs (e.g., RTX A4000) on a cloud platform, so for ease of reproduction we fix the random seed to 42. We use the PyTorch Lightning framework to set up the training processes. The training parameters are as follows: the batch size is 64; the learning rate is 8e-5; the maximum source length is 1024; the optimiser is Adam (Kingma and Ba, 2014) with ϵ set to 1e-8. The whole training process lasts for 5 epochs, but the results only consider the checkpoint with the best (lowest) validation loss. It is worth mentioning that EtriCA needs two separate encoders to encode the context (natural language) and the events (concatenated serialised events), but the encoder of the public BART checkpoint is pre-trained only on natural language text. Therefore, to help the event encoder learn event features better, we first train a BART model on stories given both the context and planned events, and then restore the pre-trained encoder parameters to the event encoder of EtriCA.
Inference Settings. For evaluation and testing, we adopt the nucleus sampling (Holtzman et al., 2019) strategy to generate text. We reduce the batch size to 15 during inference, because nucleus sampling requires a large amount of memory.
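Nucleus (top-p) sampling keeps the smallest set of most-probable tokens whose cumulative probability reaches p, renormalises their probabilities, and samples from that set. A minimal sketch over a toy vocabulary:

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Top-p (nucleus) sampling sketch (Holtzman et al., 2019)."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]           # most probable first
    cum = np.cumsum(np.asarray(probs)[order])
    cutoff = np.searchsorted(cum, p) + 1      # size of the nucleus
    nucleus = order[:cutoff]
    weights = np.asarray(probs)[nucleus]
    weights = weights / weights.sum()         # renormalise inside the nucleus
    return int(rng.choice(nucleus, p=weights))

probs = [0.5, 0.3, 0.15, 0.05]                # toy vocabulary of 4 tokens
token = nucleus_sample(probs, p=0.7, rng=np.random.default_rng(0))
assert token in (0, 1)                        # nucleus is {0, 1} at p=0.7
```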

A.2 Details of Event Schema
An event is supposed to represent an important change that happens, generally the representation of an action. The schema for an event aims to include all roles relevant to the action and to filter out trivial details. Inspired by the work of Rusu et al. (2014) and Björne and Salakoski (2018), who used dependency parsers to capture dependencies between words belonging to different clauses, we extract event mentions from sentences according to the hierarchy of typed dependencies (De Marneffe and Manning, 2008). In this way we obtain more informative and unambiguous events compared to the single-verb events used in previous work (Jhamtani and Berg-Kirkpatrick, 2020; Guan et al., 2020; Kong et al., 2021). The schema is shown in Figure 7.
As shown in Figure 7, event arguments are extracted according to selected dependencies between words. The schemas of events must balance generalisation and representation: including more dependencies makes an event more informative but may reduce its generalisability. For instance, the Subject (e.g., I, you, Kent, ...) is useful to state the character of an event, but stories usually have different characters, which may make events extracted from one story hard to reuse in another. For example, "Kent is driving" and "He is driving" have the same meaning, but if the subject "Kent" were extracted as an event role, it would be very hard to predict the same event for another story, damaging generalisation. Following this criterion, we select key roles as event arguments with consideration of both generalisation and representation.

A.3 Details of Event Extraction
We extract events from the text of the training dataset, including reference stories and leading contexts. The data structure of an event is a set comprising the relevant trigger and arguments in a sentence. We first use spaCy (https://spacy.io/) to parse dependencies between words in a sentence, and then annotate the event trigger and arguments according to those dependencies. An event e contains the attributes introduced in Figure 7, in which the event trigger, as the root, is generally the predicate. Before the encoders accept text as input, extracted events are serialised into text format to feed the model.
Since existing story datasets do not have reference storylines paired with reference stories, we develop an event extractor to obtain event sequences from reference stories as the storylines. We represent events as verb phrases. Verbs, as the anchors of sentences, can be seen as event triggers, so our primary goal is to extract all the key roles (as event arguments) related to the event trigger. Neighbouring extracted events are considered to be temporally related.
With temporally related events from the training stories, we construct an event graph denoted G, which is an isomorphic graph with a single event type and a single relation type. We suppose G is a data structure composed of triples in ⟨e_h, r, e_t⟩ format.

Figure 8: A snapshot of our evaluation interface. The stories are randomly collected from the ROC dataset, and annotators are required to make a choice for each question in the right column. For convenient and accurate annotation, the system allows annotators to directly compare the different generated stories given the input. A survey is automatically recorded once all three questions are answered and the "submit" button is pressed.

A.4 Details of Sentence Similarity Prediction

For each sentence s_i, where i denotes the i-th sentence, we use
Sentence-BERT to obtain a numeric vector F^sent_i, which contains the features of the sentence through representation learning. We then force the predicted similarity scores between generated sentences to fit the similarities between sentences in the reference story. The similarity score is calculated as below:

u_ij = F^sent_i W_sep (F^sent_j)^T,   sim^y_ij = (u_ij + u_ji) / 2

where i and j denote the indices of sentences and sim denotes similarity. sim^s_ij, the ground-truth similarity, is computed as the cosine similarity between the outputs of Sentence-BERT. u_ij is an intermediate similarity obtained from the predicted sentence representations, and W_sep denotes a trainable parameter. To guarantee that sim^y_ij is symmetric in i and j, both u_ij and u_ji are incorporated.
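The symmetric predicted similarity can be sketched as follows; the vector sizes are toy values, and in the model the inputs would be predicted sentence representations rather than random vectors:

```python
import numpy as np

def predicted_similarity(F_i, F_j, W_sep):
    """Bilinear scores u_ij = F_i W_sep F_j^T and u_ji, averaged so the
    result is symmetric with respect to swapping i and j."""
    u_ij = F_i @ W_sep @ F_j
    u_ji = F_j @ W_sep @ F_i
    return (u_ij + u_ji) / 2.0

rng = np.random.default_rng(1)
F_i, F_j = rng.normal(size=4), rng.normal(size=4)  # toy sentence vectors
W_sep = rng.normal(size=(4, 4))                    # trainable parameter
s1 = predicted_similarity(F_i, F_j, W_sep)
s2 = predicted_similarity(F_j, F_i, W_sep)
assert np.isclose(s1, s2)  # symmetric by construction
```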
Contextualised Features Representation. Conventional end-to-end models usually concatenate the embeddings of different inputs; here, the baseline concatenates C and E as its input. Encoders such as LSTMs or BERT-like encoders then incorporate the heterogeneous features. However, there are potential problems: (i) C and E have different word distributions, since C is natural language but E is not, so a single encoder may not capture the two types of features in a single input vector efficiently and effectively; (ii) as stories grow longer, the relative size of E surpasses that of C, and a single encoder may pay less attention to the features of C, e.g., the vectors of C involved in the attention-score calculations become relatively fewer as the planned events increase.

A.5 Details of Human Evaluation
We developed an evaluation system to collect annotations, which keeps the story pairs anonymous, fairly shuffled, and easy to compare. Figure 8 shows a snapshot of our annotation process.
Evaluators are required to follow the annotation standards shown in the top-left corner. Considering the different biases among individuals, we also notify every annotator of the standards set for this task: (i) Fluency considers the errors in the generated text, e.g., grammatical errors ≥ spelling errors ≥ unnatural repetitions > language quality. (ii) Coherence focuses on the logical relatedness between sentences. We ask annotators to count all the incoherent parts and to consider how many word edits would be needed to make the story coherent (i.e., fewer edits needed = more coherent story). (iii) Relevance focuses on the relatedness between the generated sentences and the leading context. However, it is very subjective to judge whether a story is "interesting" or relevant. Therefore, we suggest evaluators judge how irrelevant a story is by counting the generated sentences that conflict with the leading context.

Table 7 (continued):
BART l+e: He missed his dog for a whole month. One day he notices something moving and is startled. He sees the dog on the floor. It turns out to be a squirrel.
HINT l+e: One day [MALE] missed his dog. He notices something about her name on the dog's tag. [MALE] sees the dog in the tags. It turns out it could be a dog from the police department.
EtriCA: He missed his dog badly. He notices something strange on the curb. He sees the dog outside. It turns out to be a stray dog.
-w/o sen: He had missed his dog so much that he had to search for him. As he was searching, he notices something about a dog. He sees the dog with a bag. It turns out to be a stray, a wad of dog spray.
-w/o cm: [MALE] missed his dog this summer. He notices something on his neighbor's wall about a house. [MALE] notices the dog was very sad. It turns out that there must be a really sad day next time.
-w/o leading: He missed his dog. [MALE] notices something in the area. He sees a dog. It turns out to be a black dog.
-w/o events: He was devastated by the loss. He decided to pull a long string of nail polish. He found a couple of old nail polish cans that were very old. His dog enjoyed his touches.

A.6 Case Study
As shown in Table 7, EtriCA generates better stories in terms of both context relatedness and story quality. The strong baseline models, i.e., BART l+e and HINT l+e, generate stories that follow the planned event sequences closely and are relatively fluent. However, they fail to write reasonable stories that are logically related to the ongoing circumstances. For instance, "It turns out to be a squirrel." is a well-formed sentence that also uses the event "turns out be", but it has nothing to do with the topic ("the dog is missing"), nor is it coherent with the previous sentences.
In terms of the results obtained for the ablation study, we see the importance of the different components to the whole generation model. Without a planned event sequence (see -w/o events), it is very hard for a neural model to write a coherent story, as also demonstrated in previous work (Yao et al., 2019). Without the given leading context (see -w/o leading), neural models struggle to unfold the planned events: lacking the concept of a "topic" for the story, the model can become confused. Without the contextualising module (see -w/o cm), the neural model struggles to process the heterogeneous features from the context and the events.
Figure 2: The overview of the event feature contextualising process. The leading context, coloured in red, contains important information that affects the generation process, e.g., the weather "snows" may lead to an "accident". These implicit clues help the neural generator to disambiguate the context of events. We first fuse the context and event features, and then feed them to the generator.

Figure 3: An example illustrating the process of event extraction. TOK is the basic unit of a sentence, POS is the part of speech, and DEP stands for the dependencies between tokens. By parsing dependencies, the event trigger (recognised as the root of the sentence) filters all significant roles to represent a complete action. Meanwhile, extracted neighbouring events are considered to have temporal relations.
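A simplified Python sketch of the extraction that Figure 3 illustrates, operating on a hand-written toy parse. A real pipeline would obtain the TOK/POS/DEP rows from a dependency parser (e.g. spaCy), and the SIGNIFICANT_DEPS set here is a hypothetical subset chosen for illustration:

```python
# Each token is (text, dep, head_index). The token whose dep is "ROOT" is the
# event trigger; tokens attached directly to it by a "significant" dependency
# become the event's roles.
SIGNIFICANT_DEPS = {"neg", "prt", "dobj", "xcomp", "aux"}  # illustrative subset

def extract_event(parsed):
    # Locate the trigger (root of the sentence).
    root_idx = next(i for i, (_, dep, _) in enumerate(parsed) if dep == "ROOT")
    event = [parsed[root_idx][0]]
    # Keep only roles attached to the trigger via significant dependencies.
    for text, dep, head in parsed:
        if head == root_idx and dep in SIGNIFICANT_DEPS:
            event.append(text)
    return event

# Toy parse of "could not afford to sleep in ." (head indices are token positions)
parsed = [
    ("could", "aux", 2),
    ("not", "neg", 2),
    ("afford", "ROOT", 2),
    ("to", "aux", 4),
    ("sleep", "xcomp", 2),
    ("in", "prt", 4),
    (".", "punct", 2),
]
print(extract_event(parsed))  # ['afford', 'could', 'not', 'sleep']
```

Neighbouring events extracted this way from adjacent sentences are then linked by temporal relations, as the caption above describes.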

Figure 4: The overview of the EtriCA architecture. The technical details are explained in Sec. 3.3. During training, in addition to predicting the text tokens {y^1_1, ..., y^i_j} one by one, we train the decoder to learn sentence-level representations via the similarity prediction auxiliary task shown in the dotted box. Through representation learning, neural models learn how to generate reference-like stories from a given leading context and planned event sequence.

Figure 5: The results of intra-story repetition and aggregate scores on stories of the ROC dataset. The curve graphs illustrate the intra-story repetition for each sentence (with the leading context as the first sentence) in a story. The histograms depict the aggregate scores of intra-story repetition over the story sentences.

Figure 6: The results of intra-story coherence and relevance on the ROC dataset.

[Figure 3 graphic: the dependency parse of "could not afford to sleep in .", with DEP and EVENT rows; the extracted events [:not afford sleep] and [:woke up] are linked by a temporal dependency, and the trigger (root) retains significant roles via dependencies such as aux, prt, det, and arg:mod.]

Table 1: Automatic evaluation results on the ROC and WP datasets. The best performance in each line is highlighted in bold. ↑/↓ means the higher/lower the better, respectively. l+e means the input of the model concatenates the leading context and the event sequence. w/o sen, w/o cm, w/o leading, and w/o events mean ablating, respectively, the auxiliary task of sentence similarity prediction, the contextualising module, the leading context features, and the event features.

widely used in prior works (Rashkin et al., 2020; Guan et al., 2020; Clark and Smith, 2021); (3) BART: a model composed of a BERT-like (Devlin et al., 2019) encoder and a GPT-like decoder, which has shown advances in prior NLG works (Goldfarb-Tarrant et al., 2020; Clark and Smith, 2021).
(4) HINT (Guan et al., 2021): the current SOTA framework for context-aware story generation, which enhances coherence and relevance through two training objectives.

Table 2: Automatic evaluation of unreferenced metrics on the ROC and WP datasets for the generation models writing stories conditioned on both the leading context and the reference event sequence. Golden denotes the reference stories.

Table 1 shows the automatic evaluation results on both the short-story dataset ROC and the long-story dataset WP. It can be observed that EtriCA outperforms all baselines on all metrics for both datasets. Compared to the strongest baselines, BART and HINT, our model reduces perplexity by 15% on ROC and 25% on WP, respectively.

Table 5: Human evaluation results on the ROC dataset. The scores stand for the percentage of times a model is chosen in pair comparisons (i.e., wins over another model). Kappa denotes the Fleiss' Kappa (Fleiss, 1971) coefficient, used to measure inter-annotator agreement. All of our results reach moderate agreement. * refers to significance at p<0.05, whilst ** refers to significance at p<0.01, on a sign test.
The examples in Table 6 indicate the roles of the dependencies in our event schema.

Table 6: Details of the dependencies in the event schema. Examples are extracted in the format [head]-dependency->[tail].
Input:
Leading Context: [MALE] had lost his dog over a month ago.
Event Sequence: missed dog → notices something → sees dog → turns out be

P&W l+e: He wished he could live with his friend. He'd run in them all the time. But one day, he woke up exhausted. He went to the doctor with his best friend.
GPT-2 l+e: [MALE] was only a parent at the time. [MALE] notices the dog and he lets it go. He notices the dog has been moved and so he notices what happened. [MALE] then realises that it is a bad dog and there is something wrong with his life.

Table 7: A case study of generated stories conditioned on a leading context and an event sequence collected from ROC Stories. [MALE], [FEMALE], and [NEUTRAL] are special tokens that replace names in stories. The highlighted bold words denote the events corresponding to the given event sequence.