COINS: Dynamically Generating COntextualized Inference Rules for Narrative Story Completion

Despite recent successes of large pre-trained language models in solving reasoning tasks, their inference capabilities remain opaque. We posit that such models can be made more interpretable by explicitly generating interim inference rules, and by using them to guide the generation of task-specific textual outputs. In this paper we present COINS, a recursive inference framework that i) iteratively reads context sentences, ii) dynamically generates contextualized inference rules and encodes them, and iii) uses them to guide task-specific output generation. We apply COINS to a Narrative Story Completion task that asks a model to complete a story with missing sentences, producing a coherent story with plausible logical connections, causal relationships, and temporal dependencies. By modularizing inference and sentence generation steps in a recurrent model, we aim to make reasoning steps and their effects on next-sentence generation transparent. Our automatic and manual evaluations show that the model generates better story sentences than SOTA baselines, especially in terms of coherence. We further demonstrate improved performance over strong pre-trained LMs in generating commonsense inference rules. The recursive nature of COINS holds the potential for controlled generation of longer sequences.


Introduction
Narrative story understanding, and similarly story generation, requires the ability to construe meaning that is not explicitly stated, through commonsense reasoning over events in the story (Rashkin et al., 2018a). Previous work in modeling narrative stories has focused on learning scripts[1] (Schank and Abelson, 1977; Mooney and DeJong, 1985) and learning narrative schemas using corpus statistics (Chambers and Jurafsky, 2009; Balasubramanian et al., 2013; Nguyen et al., 2015). Recently, large pretrained language models (LMs) such as GPT-2 have shown remarkable performance on various generation tasks. While these pretrained LMs learn probabilistic associations between words and sentences, they still have difficulties in modeling causality (Mostafazadeh et al., 2020). Also, in narrative story generation, models need to be consistent with everyday commonsense norms. Hence, to address a story generation task, i) models need to be equipped with suitable knowledge, ii) they need effective knowledge integration and reasoning methods, and ideally iii) we want to be able to make the effectiveness of these methods transparent.

[1] Scripts are structured knowledge about stereotypical event sequences together with their participants.

[Figure 1: Example NSC instance. Beginning: S1: Janie was excited to see her sister's play in theatre. S2: Janie got a call from her boss about an emergency work. End: S5: Janie watched a video of the play later. Implicit inference rules: SomeoneB wants SomeoneA to work; SomeoneA wasn't able to go SomewhereB (to see the play).]

In this work we focus on the aspects i) to iii), by investigating new methods that build on pretrained LMs to generate missing sentences from an incomplete narrative story. Specifically, we focus on Narrative Story Completion (NSC), a new task setting for story generation. Given an incomplete story, specified only through its beginning and ending, the task is to generate the missing sentences to complete the story (see Figure 1). Our hypothesis is that obtaining a consistent and coherent narrative story requires a model's ability to perform commonsense inference about events and entities in the story. Unlike other existing tasks, NSC requires: i) generating multiple sentences to complete a story, and ii) ensuring that the generated sentences are coherent with respect to both the beginning and ending of the story. Hence, the NSC task offers a challenging setup for investigating the reasoning capacities of a story generation model.
Humans excel in drawing inferences and constructing causal chains that explain the connection between events (Kintsch and Dijk, 1978). Figure 1 illustrates this with an example from our NSC task.2 From the beginning Janie was excited to see her sister's play in theatre (s_1). Janie got a call from her boss about new work (s_2) and the outcome Janie watched a video of the play later (s_5), we can construct inference rules in forward and backward direction: forward via EFFECT: SomeoneB (boss) gave work to SomeoneA (Janie); backward via CAUSE: SomeoneA (Janie) wasn't able to go SomewhereB (to the theatre). By combining these inferences, we can obtain a representation from which to generate a connection that completes the story, e.g., Janie's boss wanted her to look after the issue (s_3). She missed the theatre play (s_4).
In this work, we propose COINS: a recursive model that jointly learns to i) dynamically generate commonsense inference rules3 grounded in the context and to ii) perform controlled and coherent story generation, using the generated inferences as a guide. We hypothesize that jointly learning to generate contextualized inference rules grounded in the current context and to generate story sentences incrementally, while taking these inferences into account, will improve the quality of both the predicted inference rules and the generated story sentences. Moreover, the recursive nature of the model and the individuation of the inference prediction and sentence generation tasks make the process more interpretable: the generated inference rules can be viewed as intermediate representations, and can serve as explanations of how the dynamically produced inferences influence the quality of generated story sentences.
Our main contributions are as follows: 1) We propose a new setting for a Narrative Story Completion task, which asks a system to complete a narrative story given its beginning and ending, with the aim of examining the reasoning capacities of a model that solves the task.
2) We propose an integrated reasoning and NL generation model, COINS, that, based on its current context, generates contextualized commonsense inference rules and follow-up sentences in a stepwise recurrent process.
3) We conduct extensive experiments with automatic and human evaluation. Automatic evaluations show that COINS outperforms strong baselines (+2.2 BLEU score). Human evaluation shows that compared to strong baselines, our model yields better sentence generations with respect to coherence (+50.5%) and grammaticality (+20.5%). 4) We show that COINS generates better inference rules (+2.3 BLEU score) compared to a finetuned GPT-2 model, and that jointly learning to generate inferences and story sentences improves the quality of the generated inference rules.
Our code is made publicly available. 4

Related Work
Sentence-level Commonsense Inference and Beyond. Recent research in this area has focused on commonsense knowledge acquisition (Zhang et al., 2020; Speer et al., 2017) and commonsense reasoning (Zellers et al., 2019; Talmor et al., 2018). In our work, we focus on inferential knowledge about events, and entities participating in such events. Rashkin et al. (2018b) introduced a knowledge resource of commonsense inferences regarding people's intents and reactions towards a diverse set of events. With COMET, Bosselut et al. (2019) have shown that pre-trained neural language models can be fine-tuned on large knowledge bases (such as ATOMIC) to generate inferences for a given event or sentence. However, the knowledge generated by COMET is non-contextualized and hence can be inconsistent. Recently, Mostafazadeh et al. (2020) proposed GLUCOSE, a new resource and dataset that offers semi-structured commonsense inference rules that are grounded in sentences of specific stories. They show that fine-tuning a pre-trained LM on the GLUCOSE dataset helps the model better generate inferable commonsense explanations given a complete story. In concurrent work, Gabriel et al. (2021) proposed PARA-COMET, a model that incorporates paragraph-level information to generate coherent commonsense inferences from narratives. In this work, we investigate how well a neural model can generate contextualized commonsense inference rules for an incomplete story. Learning to predict iterative inference steps for successive events in a narration using semi-structured knowledge rules is still a difficult and underexplored task. We propose a model that learns to iteratively generate a coherent completion of an incomplete narrative story, utilizing semi-structured knowledge as offered by the GLUCOSE framework.
Commonsense Reasoning in Narrative Stories. Early work on narrative events focused on script learning, by defining stereotypical event sequences together with their participants (Schank and Abelson, 1977). In later works, Chambers and Jurafsky (2008, 2009); Balasubramanian et al. (2013); Nguyen et al. (2015); Pichotta and Mooney (2014) proposed methods to learn narrative event chains using a simpler event representation that allows for efficient learning and inference. Chambers and Jurafsky (2009) acquired Narrative Event Schemata from corpora and established the Narrative Cloze Task (Chambers and Jurafsky, 2008), which evaluates script knowledge by predicting a missing event (a verb and its arguments) in a sequence of observed events. More recently, Mostafazadeh et al. (2016) proposed the story cloze task of selecting a plausible (right) over an implausible (wrong) story ending. Subsequent work proposed an abductive reasoning task to test a model's ability to generate plausible explanations for an incomplete set of observations. Paul and Frank (2020) proposed a multi-head knowledge attention method to dynamically incorporate non-contextualized inferential knowledge to address the abductive reasoning task. Qin et al. (2020) proposed an unsupervised decoding algorithm that can flexibly incorporate both past and future contexts, using only off-the-shelf language models, to generate plausible explanations. Concurrent to our work, Paul and Frank (2021) presented a method for addressing the abductive reasoning task by explicitly learning what events could follow other events in a hypothetical scenario. In our work, we make use of the ROCStories dataset (Mostafazadeh et al., 2016) to build a Narrative Story Completion task that tests a model's ability to generate missing sentences in a story. We propose a model that aims to produce coherent narrative stories by performing iterative commonsense inference steps.
Narrative Story Generation. Much existing work on story generation relied on symbolic planning methods (Lebowitz, 1987; Pérez and Sharples, 2001; Józefowicz et al., 2016). With the advance of Seq2Seq models, several works applied them to automatic story generation tasks (Roemmele, 2016; Jain et al., 2017). Fan et al. (2018) proposed a hierarchical approach to generate short stories from initial prompts. Recently, many works have focused on integrating external commonsense knowledge from large static knowledge bases like ATOMIC or ConceptNet (Speer et al., 2017) for tasks such as story ending generation (Ji et al., 2020; Guan et al., 2019) or story generation (Guan et al., 2020). In concurrent work, Ammanabrolu et al. (2021) look into causality for a commonsense plot generation task. In our work, we model the assumption that contextualized inference rules provide inferred information that can guide a system in generating both contextually grounded and coherent follow-up sentences in a story generation task.

Task Definition
We formulate the Narrative Story Completion task (NSC) as follows: given an incomplete story S = (s_1, s_2, s_n), represented as a sequence of tokens t = {t_1, t_2, ..., t_SEP, ..., t_m} (with t_SEP a mask token delimiting s_2 and s_n), the goal is to generate the missing sentences (s_3, ..., s_{n-1}), each as a sequence of tokens y^{s_i} = {y^{s_i}_1, y^{s_i}_2, ..., y^{s_i}_v} (with i = 3, ..., n-1 and v the maximum length of each sentence).
In the setting of the NSC task, we expect the completed story to be coherent. That is, the generated sentences should exhibit reasonable logical connections, causal relationships, and temporal dependencies with each other and with the given beginning and ending of the story. In this paper, we define a discourse to be coherent if successive sentences are about the same entities, and the reported events involving them can be construed to reflect common knowledge about how events are typically connected in a temporal sequence or by causal relations. Similar to Hobbs (1985), our criteria for concluding that a discourse is coherent require that there are reflections of causality in the text.
Our take on this task is to incrementally generate contextualized inference rules from the given context, and to make use of this knowledge to generate missing story sentences.
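The serialization of an NSC instance described above can be sketched as follows. This is a minimal illustration (not the authors' code); the function name and whitespace tokenization are our own simplifying assumptions, a real system would use the GPT-2 subword tokenizer.

```python
# Sketch: serializing an incomplete NSC story into the token sequence
# t = {t_1, ..., t_SEP, ..., t_m}. A [SEP] mask token marks the position
# of the missing sentences between the beginning (s_1, s_2) and ending s_n.

def build_nsc_input(beginning, ending, sep_token="[SEP]"):
    """beginning: list of sentences; ending: final sentence string.
    Whitespace tokenization stands in for a real subword tokenizer."""
    tokens = []
    for sent in beginning:
        tokens.extend(sent.split())
    tokens.append(sep_token)  # placeholder for the missing sentences
    tokens.extend(ending.split())
    return tokens

story = {
    "s1": "Janie was excited to see her sister's play in theatre.",
    "s2": "Janie got a call from her boss about an emergency work.",
    "s5": "Janie watched a video of the play later.",
}
tokens = build_nsc_input([story["s1"], story["s2"]], story["s5"])
```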

Relation Type | Dimensions
Cause (Dim 1-5) | (1) Event that directly causes or enables X; (2) Emotion or basic human drive that motivates X; (3) Location state that enables X; (4) A possession state that enables X; (5) Other attribute that enables X.
Effect (Dim 6-10) | (6) An event that is directly caused or enabled by X; (7) An emotion that is caused by X; (8) A change of location that X results in; (9) A change of possession that X results in; (10) Other change in attribute that X results in.

Example (gold vs. COINS-generated inference rules):
s1: Jane loved cooking. s2: Everyone else in her family did too. s5: Eventually she learned everything there was to teach.
Gold (general): SomeoneA loves SomethingA (that is an activity) >CAUSES/ENABLES> SomeoneA learns everything there is to learn.
Gold (specific): Jane loves cooking >CAUSES/ENABLES> Jane learns everything there is to learn.
COINS (general): SomeoneA is a quick learner >CAUSES/ENABLES> SomeoneA learns everything there is to learn.
COINS (specific): Jane is a quick learner >CAUSES/ENABLES> Jane learns everything there is to learn.

Discourse-Aware Inference Rules
This section details how we construct training data for the NSC task, by enriching stories with automatically predicted contextualized inferences.[5] We utilize the GLUCOSE (Mostafazadeh et al., 2020) dataset, which contains implicit commonsense knowledge in the form of semi-structured general and specific inference rules[6] (cf. Table 1) that are grounded in the context of individual stories from ROCStories. In GLUCOSE, given a story S and a selected sentence X from the story, the authors define ten dimensions d of commonsense causal explanations related to X, inspired by human cognitive psychology. Only a small part of ROCStories is annotated with GLUCOSE inferences (Table 3). Given the amount of commonsense knowledge needed for real-world tasks, a static knowledge resource is always incomplete. Thus, we fine-tune a pre-trained GPT-2 model on the annotated part of GLUCOSE to dynamically generate inference rules for each sentence X_i of each story S_i from the underlying ROCStories data. We fine-tune two separate language models, CSI_gen and CSI_spec, for general and specific rules, respectively (Table 2).

[5] For testing we rely on GLUCOSE's manually validated inference rules on a small subset of the ROCStories corpus.
[6] Specific rules are grounded in a given context; general rules are applicable to other contexts.

The 10 dimensions d in GLUCOSE cover implicit causes and effects of a sentence X in a given story. In our work, we are interested in inference rules that explain a sentence's causes and effects, to study the impact of such inferences on narrative story completion. We therefore cluster all dimensions d into the two categories EFFECT vs. CAUSE (Table 1) and aggregate all rules from the respective categories (preserving their dimensions). Once our models (CSI_gen, CSI_spec) are trained, we apply them to our NSC task training data, to enrich it with inference rules for each sentence and story.
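The clustering of the ten GLUCOSE dimensions into the two relation types can be sketched as below. This is our own minimal illustration of the aggregation step; the function and variable names are hypothetical.

```python
# Sketch: cluster GLUCOSE's ten dimensions into CAUSE vs. EFFECT.
# Dimensions 1-5 explain causes of a sentence X; dimensions 6-10 its effects.

CAUSE_DIMS = {1, 2, 3, 4, 5}
EFFECT_DIMS = {6, 7, 8, 9, 10}

def cluster_rules(rules):
    """rules: list of (dimension, rule_text) pairs for one sentence X.
    Returns the rules aggregated under the two relation types,
    preserving their original dimension labels."""
    clustered = {"CAUSE": [], "EFFECT": []}
    for dim, text in rules:
        key = "CAUSE" if dim in CAUSE_DIMS else "EFFECT"
        clustered[key].append((dim, text))
    return clustered
```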

COINS: COntextualized Inference and Narrative Story Completion Model
In this section we introduce a recursively operating reasoning and sentence generation model: COINS. An overview is given in Figure 2. In each iteration, the model applies two consecutive steps: (1) Inference Step: Given an incomplete story context S' = X ⊕ S_i and a relation r, an inference model CSI (gen or spec) generates contextualized inference rules of type r.
(2) Generation Step: a sentence generator reads the generated inference rules concatenated with the current context S' and generates the next story sentence s_{i+1}. The context S' is updated with s_{i+1}, and steps (1) and (2) are repeated (cf. Algorithm 1).
This formulation allows us to i) examine inference and generation capabilities separately from each other, ii) helps determine the impact of inferential knowledge on story generation, and iii) can give us insight into how knowledge can guide story generation in a recursive inference framework.
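The two-step iteration can be summarized in the following sketch (cf. Algorithm 1). This is not the authors' implementation: the two model calls are stand-ins for the fine-tuned GPT-2 components, and the function names are ours.

```python
# Sketch of the COINS loop: alternate between (1) generating inference
# rules for the current context and (2) generating the next sentence.

def coins_complete(context, n_missing, gen_inferences, gen_sentence):
    """context: incomplete story S' as a list of sentences.
    gen_inferences(context, relation) -> list of inference rules;
    gen_sentence(rules, context) -> next missing sentence (string).
    Returns the generated sentences and the inference-rule memory."""
    generated, memory = [], []
    for _ in range(n_missing):
        # (1) Inference step: EFFECT rules for the last seen sentence,
        #     CAUSE rules for the story ending (I_i = E_i (+) C_i).
        rules = gen_inferences(context, "EFFECT") + gen_inferences(context, "CAUSE")
        # (2) Generation step: next sentence conditioned on rules + context.
        next_sent = gen_sentence(rules, context)
        generated.append(next_sent)
        memory.append(rules)              # Mem_IR: intermediate explanations
        context = context + [next_sent]   # update the story context S'
    return generated, memory
```

With the stand-in generators replaced by the fine-tuned inference and sentence models, `memory` is exactly the per-iteration record that makes the reasoning steps inspectable.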

Inference Step. We define the initial story context S' = {s_1, s_2, [SEP], s_n} and a selected sentence s_i. We adopt a pretrained GPT-2 (base) (Radford et al., 2019) transformer model with multiple Transformer blocks of multi-head self-attention and fully connected layers. During training, in each iteration the input to the model is a concatenation of the current source (S', s_i, r) and the target sequence, i.e., the inference rules. Equation (1) defines the inference rule (IR) generation model:

h^0_p = e_p + P_p
h^l_p = TransformerBlock(h^{l-1}_{<=p})
P(y_p | y_{<p}) = softmax(h^L_p W_e)    (1)

where h^0_p is the summation of the token embedding e_p and position embedding P_p for the p-th token; h^l_p is the l-th layer's output at position p, computed through transformer blocks with the masked multi-head self-attention mechanism; h^L_p is the final layer's hidden state, and y_{<p} indicates the left context of position p. The softmax layer defines the model's output distribution over the most probable target sequence: the most likely inference rules (E_i and C_i) for each relation type (cf. Algorithm 1, lines 4-5).
During training, we minimize the objective

L_I = - Σ_{p=m+1}^{m+N} log P(y_p | y_{<p}; β)    (2)

where m and N denote the number of tokens in the source (S', s_i, r) and the target sequence (inference rules), respectively; β refers to the model parameters.
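The objective in Eq. (2) penalizes only the target tokens that follow the m source tokens. A toy numeric sketch (our own, not the paper's code), operating directly on gold-token probabilities:

```python
import math

# Sketch: negative log-likelihood summed over target tokens only.
# token_probs[p] is the model probability of the gold token y_p given y_<p;
# the first m entries belong to the source (S', s_i, r) and are skipped.

def nll_over_target(token_probs, m):
    return -sum(math.log(p) for p in token_probs[m:])
```

For example, with source length m = 3 and a single target token of probability 0.5, the loss is -log 0.5, i.e., about 0.693.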
In this work, we focus on the NSC task, which requires our model to capture temporal dependencies and causal relationships between events. While we designed our sentence generation model in such a way that it can utilize inference rules from both forward and backward directions for each sentence, we here trigger the generation of CAUSE inference rules for s_n, since we expect that events, motivations or attributes that cause s_n will be relevant for generating the preceding sentences [s_3, ..., s_{n-1}]. Similarly, we generate EFFECT relations for s_i, assuming that an event, change of emotion or change of attribute that is a possible effect of s_i will be most relevant for generating the missing follow-up sentences. [Algorithm 1, lines 6-10: I_i = E_i ⊕ C_i; s_{i+1} = GenNewSentence(I_i, S'); GenS := GenS + s_{i+1}; Mem_IR := Mem_IR ⊕ I_i.] In principle, however, for NSC and other story generation tasks, we may consider CAUSE and EFFECT relations for all sentences, letting the model freely choose from the full space of inferences.
We concatenate the generated inference rules (I_i = E_i ⊕ C_i)7 and store the last hidden representation in Mem_IR ∈ R^{N×L×H}, where N is the number of sentences, L the maximum inference sequence length, and H the hidden state dimension. Mem_IR is updated with the hidden representations of inference rules in each iteration. Hence, Mem_IR can act as an intermediate representation, and as a basis for providing explanations for observed story sentence generations. Mem_IR may also be used as a memory for long-form text generation tasks, to keep track of implicit knowledge triggered by previously generated text, and could support flexible discourse serialization patterns.8

Generation Step. Given the generated inference rules I_i (in the form of tokens) and the incomplete story context S', we aim to generate the next missing sentence. We pass the input through another pretrained GPT-2 (base) model (cf. Equation 1). The loss function for the sentence generator is

L_S = - Σ_{k=1}^{v} log P(y_k | y_{<k}, I_i, [EOK], S'; θ)

where y_k denotes the k-th token and v the maximum length of the generated sentence; i ∈ [2, n-1]; [EOK] denotes the end-of-knowledge-rule token, and θ refers to the model parameters.
Update Story Context. In the final step we update the story context by inserting the generated sentence s i+1 into the previous story context (cf. Algorithm 1, line 12).
Training and Inference. We add the losses L I for inference generation and L S for sentence generation to make the models dependent on each other (Algorithm 1, line. 10-11). For both the inference and the generation step model, we minimize the negative log likelihood loss of the respective target sequence.

Dataset
We apply COINS to the NSC and the Story Ending Generation tasks.9 For data statistics see Table 3. Narrative Story Completion. We follow the task definition as introduced in §3. Data Collection. We construct the NSC dataset on the basis of the ROCStories corpus (Mostafazadeh et al., 2016), which contains 98,162 five-sentence stories with a clear beginning and ending, making it a good choice for this task. We choose the first two sentences (s_1, s_2) as the beginning rather than just s_1, because the first sentence (s_1) tends to be short and usually introduces characters or sets the scene (Mostafazadeh et al., 2016), whereas the second sentence (s_2) provides more information about the initial story situation.
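Turning a five-sentence ROCStories story into an NSC instance then amounts to a simple split. A minimal sketch (function name is ours):

```python
# Sketch: build one NSC instance from a five-sentence ROCStories story.
# Beginning = (s_1, s_2), ending = s_5; the gold targets are the missing
# middle sentences (s_3, s_4).

def make_nsc_instance(story):
    """story: list of exactly five sentences."""
    assert len(story) == 5
    beginning = story[:2]
    ending = story[4]
    targets = story[2:4]  # what the model must generate
    return beginning, ending, targets
```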

Hyperparameter Details
Parameter size. For GPT-2 we use the GPT-2 small checkpoint (117M parameters), based on the implementation of HuggingFace (Wolf et al., 2020). Decoding Strategy. In the inference stage, we adopt beam search decoding with a beam size of 5 for all our models and all baselines we produce. We used the following set of hyperparameters for our COINS model: batch size: {2, 4}; epochs: {3, 5}; learning rate: {1e-5, 5e-6}. We use the Adam optimizer and a dropout rate of 0.1. We ran our experiments on GPUs with 11GB and 24GB of memory.
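For illustration, the beam search decoding used at inference time can be sketched over a toy next-token distribution. This is a didactic stand-in, not the HuggingFace decoder the paper uses; `next_token_probs` simulates the language model's softmax output.

```python
import math

# Toy beam-search sketch (beam size 5, as used by COINS at inference).

def beam_search(next_token_probs, vocab, max_len, beam_size=5, eos="</s>"):
    """next_token_probs(prefix) -> {token: prob}. Returns the best sequence."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:      # finished hypotheses are kept
                candidates.append((seq, score))
                continue
            probs = next_token_probs(seq)
            for tok in vocab:
                candidates.append((seq + [tok], score + math.log(probs[tok])))
        # keep only the beam_size highest-scoring hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]
```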

Baselines
We compare our COINS model to the following baselines: 9 The results for Story Ending Generation will corroborate our results for NSC. All details are given in the Appendix.
(a) GPT-2 (Radford et al., 2018) (12 layers, 768 hidden units, 12 heads), trained with the objective to predict the next word. The input to the GPT-2 model is the concatenation of the source and the target story sequence. We follow the standard procedure to fine-tune GPT-2 on the NSC task, minimizing the negative log-likelihood of the target sequence during training. (b) KE (Guan et al., 2020): a GPT-2 model enhanced with external commonsense knowledge, where the knowledge triples were converted to sentences using templates. A multi-task learning framework further fine-tunes this model on both the Story Ending Generation task and classifying corrupted stories from real ones. As our baseline we choose the version without multi-tasking, since the corrupted-story setting is not applicable to the NSC task.
(c) GRF (Ji et al., 2020) is the current SOTA for the Abductive Reasoning and the Story Ending Generation tasks. GRF enables pre-trained models (GPT-2 small) with dynamic multi-hop reasoning on multi-relational paths extracted from the external ConceptNet commonsense knowledge graph.
(d) GLUCOSE-GPT-2. Similar to Guan et al. (2020), we fine-tune pretrained GPT-2 (small) on the GLUCOSE dataset using general rules (GR). We follow the same procedure as Guan et al. (2020): (i) we first fine-tune a pre-trained GPT-2, but here on the GLUCOSE dataset, minimizing the negative log-likelihood of the inference rules I_i conditioned on the story, the selected sentence, and the relation r (r: CAUSE/EFFECT). (ii) Then we fine-tune the above model again on the NSC dataset, minimizing the negative log-likelihood of the missing story sentences. The main difference between GLUCOSE-GPT-2 and COINS is that COINS explicitly learns to generate (contextualized) inference rules on the fly during the inference step and incorporates them in the story generation step.

Automatic Evaluation Metric
For automatic evaluation in the NSC task we use as metrics Perplexity (indicating fluency of text generation), BLEU-1/2 (Papineni et al., 2002), and ROUGE-L (Lin, 2004). We report performance on the test set.
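For intuition about the overlap metrics, BLEU-1 for a single sentence pair can be sketched as clipped unigram precision with a brevity penalty. This is a simplified single-reference illustration (our own code), not the exact corpus-level script used in the paper:

```python
from collections import Counter
import math

# Sketch: sentence-level BLEU-1 (clipped unigram precision * brevity penalty).

def bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    overlap = Counter(cand) & Counter(ref)  # clipped unigram counts
    precision = sum(overlap.values()) / len(cand)
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```

Clipping matters: a candidate that repeats one reference word is not rewarded for each repetition.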

Results
Our experimental results are summarized in Tables 4 and 6.
NSC task. Table 4 shows the results for the models described in §6.3, evaluated as per §6.4. We observe the following: (i) COINS outperforms all strong baseline models that utilize pre-trained language models and incorporate external commonsense knowledge, with respect to all automatic evaluation metrics. Note that GLUCOSE-GPT-2 and COINS use the same knowledge resource, hence the clear performance increase of COINS (+4.92 BLEU score) indicates that jointly learning to generate contextualized inference rules and missing sentences in a recursive manner can enhance generation quality.10 (ii) In line with Ji et al. (2020), the results indicate that SR performs better than GR when the model sees the full story context. In general we observe that story generation benefits from higher-quality, contextualized inference rules from GLUCOSE (for COINS).11 The improvement of COINS over GLUCOSE-GPT-2 indicates that our model is well able to utilize and profit from the inference rules. In the oracle setting, SR performs much better than GR. This is expected, since oracle rules with access to the full context will deliver more contextually relevant inferences, while GR rules may diverge more from the story context. However, in the realistic NSC task setting (Table 4, lines 5-6), GR outperforms SR, which again underlines the generalization capacities of COINS.

10 Since GRF's architecture is specific to ConceptNet, we cannot exclude that the better performance of COINS (+2.2 BLEU) is in part due to differences in the knowledge used.

Impact of different inputs for the Generation Step. In Table 5 we investigate the performance of COINS with different inputs to the sentence generation component at inference time: (i) When only inference rules (from the inference step) are given to the model, without any story context S' = {s_1, s_2, [SEP], s_n} (IR only), sentence generation benefits when specific rules are used. This is expected, since the specific rules contain statements with concrete character names and paraphrased events from the story. (ii) When only the story beginning (s_{1,2}) is provided to the sentence generation model, without the ending sentence s_n (w/o SE) and without inference rules (w/o IR), performance drops compared to models given the full incomplete context S', indicating that knowing the story ending helps the model generate missing sentences that are coherent with the story. However, (iii) when adding inference rules IR (from the inference step, i.e., E_i + C_i) to the context (s_{1,2}) without the ending sentence (w/o SE), performance again improves (+5.85 BLEU points). Note that the inference rules contain the CAUSE relation for s_n. This indicates that the model is able to utilize inference rules for story generation.12

Performance of inference rule generation. We now investigate how difficult it is to generate contextualized inference rules (specific and general) when multiple sentences are missing from a story. For this we compare COINS to a GPT-2 model fine-tuned on GLUCOSE data to generate inference rules (cf. §4). We study the impact of jointly and dynamically learning sentence and inference rule generation (in COINS) on the inference generation task, while the fine-tuned GPT-2 model only learns to generate inference rules conditioned on the static story context.
We specifically examine the difficulty of generating inference rules for two consecutive sentences (s_3 and s_4) in a 5-sentence context, as opposed to shorter sequences, in three different scenarios: i) when the complete story context S is given; ii) when the incomplete context S' (i.e., s_1, s_2 and s_5) is given, plus either s_3 or s_4 (1-missing-sentence); and iii) when S' is given, but neither of the intermediate sentences s_3 and s_4 (2-missing-sentences). In each setting, we generate EFFECT and CAUSE rules for the targeted sentences s_3, s_4, and compare their quality. The results are reported in Table 6. We observe that in the 2-missing-sentences setting, COINS outperforms GPT-2 (by +2.3 BLEU score on average). This indicates that learning to perform inference rule generation jointly with sentence generation is beneficial for filling in multiple story sentences. Interestingly, for increasing numbers of missing sentences, performance drops drastically for CAUSE (as opposed to EFFECT), but less so for COINS than for GPT-2. A possible reason for this may be the conditional, uni-directional nature of the underlying GPT-2 language model, which is trained to predict follow-up words in forward direction. This may favor future-directed EFFECT rules, as opposed to CAUSE relations. The milder effect on COINS could indicate that the concurrent inference model supports the sentence generation model in overcoming this weakness.13

Manual Evaluation
Automatic metrics can give us some indication of NLG quality; however, these metrics do not necessarily reflect the coherence of generated story sentences. We thus conduct a human evaluation focusing on the grammaticality and coherence of the generated sentences in their story context. We conduct pairwise comparisons for 100 randomly sampled instances of our best model, i.e., COINS with GR (according to automatic metrics), against four strong baseline models (GPT-2, GLUCOSE-GPT-2, GRF, KE). For each pair of instances (one from COINS, the other from a baseline model), we present the generated sentences in their story context and ask three annotators to give a preference rating (win, tie, lose) according to the criteria grammaticality and coherence. For grammaticality, we present each sentence in isolation and ask the annotators to rate which sentence is more fluent, readable, and compliant with standard English usage. For coherence, we ask the annotators to assess which of the two generations is more logically coherent, internally and with the story beginning and ending, in terms of causal and temporal dependencies. We apply majority voting among the three annotators to obtain final decisions. More details about the annotation are given in the Appendix.

13 In future work, we will test the above hypothesis by experimenting with a bi-directional transformer generation model.

Table 6: Automatic evaluation of the quality of inference rules in different context settings. Best results in bold. Metric: BLEU-1 scores, E: EFFECT, C: CAUSE, grey: context-specific rules (SR); regular: general rules (GR), †: fine-tuned on the GLUCOSE dataset.
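The majority-vote aggregation over the three annotators can be sketched as follows (our own helper, not the authors' evaluation script):

```python
from collections import Counter

# Sketch: aggregate three annotators' pairwise preference ratings
# (win / tie / lose for COINS vs. a baseline) by majority vote.
# When all three annotators disagree, we fall back to 'tie'.

def majority(ratings):
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= 2 else "tie"
```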
The human evaluation results are presented in Table 7. 14 The results show that our model produces more coherent and more grammatically correct sentences compared to all baselines. This indicates that with support of learned contextualized inference rules based on GLUCOSE knowledge, our model generates more coherent story sentences that are causally and temporally well connected.
Relevance of Generated Inference Rules. We further conduct a human evaluation to validate the effectiveness and relevance of the generated inference rules. We randomly select 50 instances from the NSC dev set and ask three annotators to evaluate the (GR) inference rules.15 We define an inference rule to be relevant if (a) it captures implicit causes and effects of a selected sentence X given an incomplete story S', and (b) it provides useful explanations for the incomplete story S'. The results for this evaluation are shown in Figure 3, for EFFECT and CAUSE relations. We find that in 36% and 34% of cases for effects and causes, respectively (computed on the basis of majority agreement), our algorithm was able to generate relevant inference rules. Our annotations yielded fair inter-annotator agreement of Fleiss' κ = 0.45.

Figure 3: Human evaluation of the relevance of inference rules generated by COINS.
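The reported inter-annotator agreement statistic can be computed as follows. This is a generic Fleiss' kappa implementation for reference, not the paper's script; the count-matrix layout is our assumption.

```python
# Sketch: Fleiss' kappa for fixed-size rater panels.
# counts[i][j] = number of annotators assigning item i to category j;
# every row must sum to the same number of raters (here, 3 annotators).

def fleiss_kappa(counts):
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    total = n_items * n_raters
    # overall proportion of assignments falling into each category
    p_j = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    # per-item observed agreement among rater pairs
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    P_bar = sum(P_i) / n_items          # mean observed agreement
    P_e = sum(p * p for p in p_j)       # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields κ = 1; κ around 0.4-0.6, as reported above, is conventionally read as moderate/fair agreement.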
Case Study. We provide an example from NSC with different generation outputs.

Conclusion
We addressed a Narrative Story Completion task that allows us to probe the coherence capabilities of a neural generation model. We proposed COINS, a model that iteratively generates commonsense inference rules grounded in the context and generates story sentences, using the generated inferences as a guide. Human and automatic evaluations show that the model outperforms strong commonsense knowledge-based generation models. By separating the inference rule and sentence generation steps, COINS can make the contribution of commonsense knowledge to story generation transparent. The recursive nature of the inference-driven generation model holds potential for knowledge-driven control in the generation of longer sequences. In future work we will explore how an enhanced memory of generated inferences can realize more complex narrative patterns that diverge from strictly ordered narrative sequences.

step it generates EFFECT inference rules for sentence (s4). As seen in Table 9, the COINS model outperforms all previous strong baselines, including GPT2-GLUCOSE, which uses the same knowledge resource. Interestingly, we also observe that fine-tuning on GLUCOSE or ConceptNet knowledge improves text generation diversity, indicating that the models leverage concepts and event knowledge during generation (cf. Table 9, lines 4-8).
Automatic Metrics. For Story Ending Generation (SEG) we follow the metrics used in Guan et al. (2019) and Ji et al. (2020): BLEU-1/2 to measure n-gram overlap between generated and human-written story endings, and Distinct-n (Li et al., 2016) to measure generation diversity as the ratio of distinct n-grams among all generated n-grams.
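Distinct-n itself is straightforward to compute; a minimal sketch (whitespace tokenization is a simplification of the usual tokenized setup):

```python
def distinct_n(sentences, n):
    """Distinct-n: ratio of unique n-grams to total n-grams in the generations."""
    ngrams = []
    for s in sentences:
        toks = s.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

print(distinct_n(["a b a b"], 1))  # 2 unique of 4 unigrams -> 0.5
```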
Baselines. For the Story Ending Generation task, we compare COINS to the IE+GA model (Guan et al., 2019). It is based on incremental encoding and multi-source graph attention (Guan et al.,

KE: When he got home, he noticed his tires were flat. He decided to pull over.
GRF: She decided to move to California. She found a great place to live.

Story: s1: Her favorite glasses were ruined. s2: The pink dye had gotten all over them. s5: She chose pink, and they both laughed at the irony.
Missing sentences (gold): s3: Her mother took her to get a new prescription. s4: It was time to order a new pair.
COINS (MSGR): She took her friend to get a new one. She took it and it was pink.
GPT-2: She bought a new pair of glasses. She wore them to school.
GPT-2 GLUCOSE: She couldn't decide between two colors. She finally decided on pink.
KE: She was sad that she couldn't see anymore. Her boyfriend came over to help.
GRF: She decided to dye them pink instead. She went to the store and bought a pink one.

Missing sentences (gold): s3: But unfortunately the teacher could not even read it. s4: Susy was humiliated.
COINS (MSGR): But she could not. Teacher didn't read the essay.
GPT-2: Suddenly, her hand slipped. She fell and broke her wrist.
GPT-2 GLUCOSE: But all the sudden she got an F. Susy was so embarrassed.
KE: When she got her paper back she realized she had tylenol. She had written the entire essay by hand.
GRF: Susy was very nervous about the essay. The teacher told her she was not allowed to write.

Story: s1: Seth was at a party with his friends. s2: Someone dared a kid to climb on a wall. s5: He immediately began screaming that his leg was broken.
Missing sentences (gold): s3: The kid climbed to the top and everyone cheered. s4: Suddenly he slipped and fell to the ground.

Gold:
Some PeopleA (who should not be there) start daring SomeoneC to climb a SomethingC (without safety gear) >Causes/Enables> SomeoneC (who is acting like a monkey) makes it to the top then falls down.
The kids start daring a kid to climb the wall >Causes/Enables> He makes it to the top then falls down and breaks his leg.

Finetuned GPT-2:
Some PeopleB start daring a SomeoneA to climb a SomethingC >Causes/Enables> SomeoneA quickly shouted that his leg was broken.
Someone start daring a kid to climb the wall >Causes/Enables> He shouted that his leg was broken.

COINS:
Some PeopleB start daring a SomeoneA to climb a SomethingC >Causes/Enables> SomeoneA is on top of SomewhereA.
Someone start daring a kid to climb the wall >Causes/Enables> He climbed at the top.

Table 14: Example of inference rules generated by COINS and fine-tuned GPT-2 when two sentences are missing (compared to Gold from GLUCOSE). Each pair shows the general rule (GR) first and its context-specific instantiation (SR) below. Sentence s2 is the selected sentence X; EFFECT is the relation type r.