Memory and Knowledge Augmented Language Models for Inferring Salience in Long-Form Stories

Measuring event salience is essential in the understanding of stories. This paper takes a recent unsupervised method for salience detection derived from Barthes Cardinal Functions and theories of surprise and applies it to longer narrative forms. We improve the standard transformer language model by incorporating an external knowledgebase (derived from Retrieval Augmented Generation) and adding a memory mechanism to enhance performance on longer works. We use a novel approach to derive salience annotation using chapter-aligned summaries from the Shmoop corpus for classic literary works. Our evaluation against this data demonstrates that our salience detection model improves performance over and above a non-knowledgebase and memory augmented language model, both of which are crucial to this improvement.

Introduction
Forster (1985) compared a story to a wriggling worm of time that can be seen as a series of events arranged in order (see also Abbott 2008): dinner comes after breakfast, night after day, nemesis follows hubris. Not all events are of equal importance, and some are far more salient than others. For example, the beginning of Dickens' Great Expectations, "Keep still, you little devil, or I'll cut your throat!", is more salient to the story than events such as "my sister had a trenchant way of cutting our bread and butter for us". Salient events in storytelling are those that drive the plot forward and change the state of the story world, as opposed to descriptive details or non-consequential activities. As such, detecting salience is an essential part of understanding narrative. Detecting salient events has important downstream applications such as summarisation; salient events are the core of plots and can aid storyline writing and story generation; they represent essential information and are relevant to question answering. This paper builds on the work of Otake et al. (2020), who use Barthes' Cardinal Functions (BCF) for unsupervised salience detection. We augment this approach with a knowledgebase (KB) and memory. Barthes' Cardinal Functions (Barthes and Duisit, 1966) are hinge events that cannot be deleted without altering the story; they are the decision points between alternative consequential paths. Barthes and Duisit also define catalysers, which are inconsequential events such as the bread and butter example; indices, which are descriptive, referring to a character or situation; and informants, which identify time and space. These latter types can be seen as satellites around the nuclei, or as filling in gaps between cardinal functions. Hence to identify BCF is to identify the main skeleton of the plot. We treat the BCF events as the salient events in a story.
This scheme relates, in narratology, to Chatman's (1980) kernels and satellites model, and in discourse theory to RST (Mann and Thompson, 1988), which similarly has nuclei and satellites, and more loosely to SDRT (Asher and Lascarides, 2005) with its coordinating and subordinating relations. The key to the Otake et al. method is that it can be implemented using any LM (language model) on any text and does not require a large corpus of annotated training data.
In this paper, we extend the BCF concept by exploring new measures of salience derived from structural manipulations: We infer swap salience, which is swapping rather than deleting an event within the BCF framework. Schmid (2003) discusses how an event can be salient if a reader expects it, but it is unexpected to the character in the story. The reader puts themselves into the character's shoes. Zillmann (1996) emphasises how suspense is driven by anticipation and apprehension on behalf of characters the reader cares about. Bae and Young (2009) propose to use this knowledge disparity between the reader and the character to create more suspenseful plots and hence more important events. We model knowledge salience as the difference between an expert-informed reader versus a naive one by taking the difference between the average log-likelihood of a base LM and an LM enriched with memory and a KB. We also take inspiration from the model of Wilmot and Keller (2020), who compute suspense and surprise in short stories using vector states from a hierarchical model; this follows from theoretical work by Ely et al. (2015), and cognitive work from Li et al. (2019). We show how a vector salience measure can be computed based on this approach.
In addition to exploring new salience measures, our work aims to overcome limitations of existing work on salience modeling. Otake et al. (2020) only evaluate their model on a single type of narrative (Russian fairytales) and on a very small annotated dataset. We address this by using aligned summaries from the Shmoop corpus (Chaudhury et al., 2019) to provide salience labels. This results in a large dataset of long works (novels and plays) annotated with silver-standard salience labels. A second limitation of Otake et al. is that they use GPT-2 (Radford et al., 2019) as their LM, which has a relatively short context of a few hundred wordpieces. While this works for short stories, the context is too short to track lengthy novels or plays. Often a character will disappear for a long period; for example, Abel Magwitch in Great Expectations. Plots are often non-linear with recalls and flash-forwards, and the same characters and places reoccur at intermittent points in the story. At any moment in the story, the most relevant passages are not the most recent but the previous actions of the characters, places, and situations involved.
We address this limitation by incorporating an episodic knowledge retrieval mechanism (derived from RAG; Lewis et al. 2020b) and fuse this with a short-term memory mechanism that can extend the capabilities of a transformer LM. The intent is that the memory will learn to recall the most relevant parts of the story, act as an implicit index into these dimensions, and the KB will supplement this with typical plot knowledge. This memory mechanism is much more suitable than recent work on extended transformers for longer sequences, see Tay et al. (2020) and Fournier et al. (2021) for thorough reviews. Characters, places, subplots ebb and flow in long stories, so the most relevant information may be hundreds of pages previous with mainly irrelevant information in-between, which suits indexed episodic memory rather than a transformer that must filter out the mainly irrelevant details in-between. For example, Abel Magwitch in Great Expectations is in the first two chapters and then reappears explicitly in Chapter 40.
Our results show that integrating KB and memory components improves the overall performance of salience detection. Using a vector alternative to infer salience is a slight improvement over the LM. Other measures such as swap salience and knowledge salience perform worse than the main salience measures but still show improvements over our baseline model.

Related Work
The main architectural innovation is to use an external knowledgebase, based on RAG (Lewis et al., 2020b), and combine this seamlessly with a memory mechanism to improve the model's predictive performance. The main structure of this model is to use a question and document encoder, both transformers, to learn and look up passages of text from a knowledgebase (based on DPR; Karpukhin et al. 2020) and then fuse this knowledge into a transformer encoder/decoder model such as BART (Lewis et al., 2020a) or T5 (Raffel et al., 2020). Similar models including REALM (Guu et al., 2020), Hard EM (Min et al., 2019a), SpanSeqGen, and Fusion-in-Decoder (Izacard and Grave, 2021) have achieved state-of-the-art results in factual domains such as answering natural language questions, trivia or games such as Jeopardy. In these domains, the key insight is that offloading knowledge externally allows models to perform better than much larger transformers that need to encode all knowledge in their weights. These methods that rely on retrieving raw text are also competitive with those that have tried to incorporate structured information such as GraphRetriever (Min et al., 2019b) or PathRetriever (Asai et al., 2020). We experiment both with a Wikipedia KB and Wikiplots, a KB of story plot summaries. The motive for the latter is that these plot fragments or vignettes act as a planning system (or schema; Schank and Abelson 1977) guiding expectations. Riedl and Sugandh (2008) used a similar concept in a rule-based system. Sap et al. (2020) also use a bag-like episodic memory mechanism for inference in stories without the more sophisticated transformer encoders of the RAG model. After the experimental work in this paper, a follow-up paper by Shuster et al. (2021) on several RAG variants found that the KB was able to reduce the amount of hallucination in generating dialogue, because the KB grounds the text generation in relevant retrieved facts.
While the story domain is different, intuitively the same effect is desirable: inferring salience should be grounded either in plot knowledge from Wikiplots or general knowledge from Wikipedia, and also in the memory of the previous character actions and plot developments.

Model
The RAG model has been extended to incorporate a memory module; see Figure 1. Seen passages are added to the memory cache (configurable as FIFO or LRU). The model retrieves n passages by performing a lookup in both the KB and memory and then reranking the results together using the dot product score between the question and document encoder vectors. A significant benefit is that it naturally integrates both a short-term and long-term KB retrieval mechanism with a relatively simple design while allowing a powerful pre-trained LM (BART Large; Lewis et al. 2020a) and retrieval systems (DPR; Karpukhin et al. 2020) from RAG to be retained. For comparison, we train a baseline model in which only the question encoder from RAG is finetuned so that existing KBs can be used without becoming stale. We also compare to the mem model, where both the question and document encoder are finetuned, and only memory is used during inference.
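The joint lookup described above can be illustrated with a minimal sketch. It is not the paper's implementation: the names `PassageMemory`, `top_n` and `retrieve` are hypothetical, vectors are plain Python lists rather than encoder outputs, and the KB is assumed to be a simple dictionary instead of a dense index.

```python
from collections import OrderedDict


def dot(u, v):
    """Dot product score between a question vector and a document vector."""
    return sum(a * b for a, b in zip(u, v))


def top_n(query, items, source, n):
    """Score every passage in `items` and keep the n highest scoring."""
    hits = [(dot(query, vec), source, pid, text) for pid, (vec, text) in items.items()]
    return sorted(hits, key=lambda h: h[0], reverse=True)[:n]


class PassageMemory:
    """Bounded cache of seen passages with FIFO or LRU eviction (hypothetical)."""

    def __init__(self, capacity, policy="lru"):
        if policy not in ("lru", "fifo"):
            raise ValueError("policy must be 'lru' or 'fifo'")
        self.capacity = capacity
        self.policy = policy
        self.items = OrderedDict()  # passage id -> (vector, text)

    def add(self, pid, vec, text):
        self.items.pop(pid, None)
        self.items[pid] = (vec, text)
        while len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the oldest entry

    def search(self, query, n):
        hits = top_n(query, self.items, "memory", n)
        if self.policy == "lru":
            for _, _, pid, _ in hits:
                self.items.move_to_end(pid)  # retrieved passages count as recently used
        return hits


def retrieve(query, kb, memory, n=5):
    """Look up n candidates in the KB and in memory, then rerank them together."""
    merged = top_n(query, kb, "kb", n) + memory.search(query, n)
    return sorted(merged, key=lambda h: h[0], reverse=True)[:n]
```

Because both sources are scored with the same dot product, a passage from early in the story competes directly with KB passages for the n retrieval slots.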
The notation follows the RAG paper, and the model is derived from the RAG-Token model. Let x be the original input or context text, y the target label for upcoming text, z a passage of text retrieved from the KB or memory, t a time index from the place of the passage in the story, and θ the parameterisation of the model. The generation task, p_θ(y_t | x, z, y_{1:t−1}), is to predict y_t by marginalising over the input, the previously generated word pieces and the retrieved documents; this is defined in (1). Each next token is marginalised over all the retrieved z documents, and the respective probability varies for each z at each step.
The top passages z ∈ Z, five by default, are retrieved by maximising the dot product, p_µ(z | x) ∝ exp(d(z)^T q(x)), where d is the document encoder and q the question encoder, both pretrained from the DPR multiset model, resulting in a bi-encoder setup. Only q is finetuned in training. The text passages z, whether retrieved from the KB or memory, are then concatenated onto the original x text and fed through the BART Large encoder/decoder model. The memory mechanism for training is a single pool of up to 128k vectors that operates as an LRU cache during training.
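For reference, the token-level marginalisation can be written out following the RAG-Token formulation of Lewis et al. (2020b), using the symbols defined above. The top-k notation for the retrieved set is an assumption taken from that paper rather than from the text here.

```latex
p(y \mid x) \;\approx\; \prod_{t} \; \sum_{z \,\in\, \operatorname{top-}k\left(p_{\mu}(\cdot \mid x)\right)}
    p_{\mu}(z \mid x)\; p_{\theta}\!\left(y_t \mid x, z, y_{1:t-1}\right),
\qquad
p_{\mu}(z \mid x) \;\propto\; \exp\!\left(d(z)^{\top} q(x)\right)
```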
The principal training loss in (2) is simply the negative log likelihood over the batch as per the standard left-to-right decoder loss for BART. Because the model marginalises the retrieved passages, back-propagation through this loss also updates the question encoder to retrieve more relevant passages.
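A minimal sketch of this loss for a single (x, y) pair is given below, assuming the passage prior is a softmax over the dot product scores of the retrieved passages and that the per-token log-probabilities have already been computed by the generator. The function names and the list-based inputs are illustrative, not the paper's code.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def rag_token_nll(doc_scores, token_log_probs):
    """Negative log-likelihood with per-token marginalisation over passages.

    doc_scores[z]         : dot product score d(z)^T q(x) of retrieved passage z
    token_log_probs[z][t] : log p_theta(y_t | x, z, y_{1:t-1}) for the gold token
    """
    norm = logsumexp(doc_scores)
    log_prior = [s - norm for s in doc_scores]  # log p_mu(z | x) over retrieved passages
    nll = 0.0
    for t in range(len(token_log_probs[0])):
        joint = [log_prior[z] + token_log_probs[z][t] for z in range(len(doc_scores))]
        nll -= logsumexp(joint)  # log of the sum over z of p_mu * p_theta
    return nll
```

Since the passage prior enters the loss, its gradient reaches the question encoder, which is how retrieval is tuned without separate retrieval supervision.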

Training
Datasets are read in an interleaved or round-robin fashion so that only one (x, y) pair from each story is in a batch. Batches are sliding windows of 12 sentences for both x and y with a k of five passages to retrieve. The combined context for the concatenated encoder text is truncated to 512 word pieces, and the max length for the decoder is 128. The model is trained with a batch size of 32. RAG has a delimiter separating retrieved text when concatenating for BART. We swap the order of RAG's concatenation so that the context is first and answer passages second to prevent truncation of the context text. To allow the model to train on 12GB GPUs, we use the ZeRO optimisation memory-saving features of DeepSpeed (Rasley et al., 2020), which also necessitates using FP16, together with gradient checkpointing for the model. Our training uses the base version of the RAG multiset encoders and the original pre-trained BART Large. We finetune with Adam (Kingma and Ba, 2015) with a learning rate of 2^−6.
Figure 1: Architecture of the memory RAG model. On the left-hand side are caches containing the permanent KB and the transitory memory, to which seen passages are added. The Retriever encodes context text, looks up from both KB and memory, and concatenates the retrieved text to the context text. The generator, the BART encoder-decoder, processes each passage's concatenation and marginalises over them to produce a distribution over output wordpieces.

Datasets
BooksCorpus (Zhu et al., 2015) provides a large corpus of longer novel-length works and is used for training. However, BooksCorpus consists of free books scraped from Smashwords; these works are highly slanted towards particular genres such as romance and fantasy, which are unlike the evaluated task, which is mainly classic works. To supplement BooksCorpus, we add a training dataset from Gutenberg, built using the c-w/gutenberg library and filtered to only English-language fictional works. Another important area of longer-form storytelling is movies or dramatic works, so to improve diversity, the Movie Scripts dataset (Ramakrishna et al., 2017) is also used. Multi-dataset models performed better on the validation set in training than single corpus models, so only these are evaluated. The training set sizes are BooksCorpus circa 18k works, Gutenberg 27k, and Movie Scripts 1.5k. We split sentences using Blingfire.

Baselines
The primary baselines for salience prediction come from Otake et al. (2020). Random just randomly assigns a salience score to each sentence position. Ascending assigns scores that increase per position. Descending, the reverse, assigns decreasing scores per position. The intuition behind these benchmarks is that important information can be clustered at the beginning or end of a story or chapter.
Otake et al. use TF-IDF as another benchmark; we instead use a BERT-derived clustering summarisation approach (Miller, 2019). The method uses k-means to cluster BERT sentence vectors according to the number of desired sentences and then selects the sentences closest to the centroids. Since salience scores are required, we adapt this method to output the cosine distance from the centroid as a salience score. We set k so that there is one cluster for every 10 sentences. One change from Miller is to use the stsb-roberta-large sentence transformers model (Reimers and Gurevych, 2019), which has sentence embeddings that perform much better on a range of semantic tasks than raw BERT.

Inference
Salience detection is based on the BCF method (Otake et al., 2020). We only use the sentence deletion variant. Let S be the set of all sentences and σ the salience. For BCF this uses an event removal function r and a coherence evaluator c. The salience in (3) is the difference in coherence, for the following n sentences, between when sentence t is present and when it is removed. Note that r can be used more broadly as a structural manipulation function. In this paper r is also used for a swap function and a knowledge difference function; these are described later.
The coherence in (4) and (5) is the average log-likelihood of the word pieces of the following sentences, up to the maximum word pieces of the label, normalised by the length (6).
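The deletion variant can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: `deletion_salience` and `toy_avg_log_likelihood` are hypothetical names, the sign convention (present minus removed) is an assumption, and the toy scorer is only a stand-in for the length-normalised log-likelihood that a real LM would provide.

```python
import math


def toy_avg_log_likelihood(context, following):
    """Stand-in for an LM: following words already seen in the context are 'likely'.

    Returns the average log-probability per word, i.e. normalised by length.
    """
    seen = {w for sentence in context for w in sentence.lower().split()}
    words = [w for sentence in following for w in sentence.lower().split()]
    log_probs = [math.log(0.5 if w in seen else 0.05) for w in words]
    return sum(log_probs) / len(log_probs)


def deletion_salience(sentences, t, avg_log_likelihood, n=3):
    """Coherence of the next n sentences with sentence t present minus with it removed."""
    following = sentences[t + 1 : t + 1 + n]
    if not following:
        return 0.0  # nothing left to score at the end of the story
    present = sentences[: t + 1]
    removed = sentences[:t]  # r: delete sentence t from the context
    return avg_log_likelihood(present, following) - avg_log_likelihood(removed, following)


story = [
    "the convict grabbed pip",
    "bread was cut",
    "the convict demanded a file",
]
scores = [deletion_salience(story, t, toy_avg_log_likelihood, n=2) for t in range(len(story))]
```

In this toy story, deleting the first sentence makes the later text less predictable, so it receives a positive score, while deleting the bread sentence changes nothing. Replacing the slicing in `removed` with another manipulation, such as a swap, gives the broader structural reading of r.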
We treat a sentence as an event. In inference, we use a context of 12 sentences (truncated to 512 word pieces), and up to 20 passages are retrieved from either the KB or memory. Otake et al. run salience from each deleted sentence to the end of the story, which is factorial complexity for the number of sentences. This is infeasible on novel-length works, so our salience implementation is more localised and is run over the next 128 LM wordpieces. As well as BCF salience, several other measures for salience are explored. We experiment with knowledge salience, which measures the difference between salience with the RAG KB and memory enabled versus disabled. Swap salience follows the same structure as sentence deletion, but the r function swaps the order of the sentences rather than deleting them, and so tests order dependence as a form of salience. Sentiment is another relevant factor in whether something is salient; more emotional passages, either negative or positive, might be more salient. We use VADER sentiment (Hutto and Gilbert, 2014) as an adjustment factor for other salience measures, salience · (1.0 + abs(sentiment)), where abs(sentiment) is the absolute value of the sentiment in the range 0.0 to 1.0. In addition, we follow Wilmot and Keller (2020) and define measures based on embeddings. We define E as the average of the word piece vectors from the BART encoder after marginalisation. The first measure is the cosine distance between consecutive vectors, defined by Wilmot and Keller as Ely surprise, cos_dist(E_t, E_{t−1}). The second measure takes a vector distance rather than average log-likelihood in the sentence deletion BCF method to create a version based on an embedding, not LM decoding. The evaluated measures are:
• Clus-Sal: The clustering baseline.
• Like-Sal: The main BCF measure described.
• No-Know-Sal: The same but with both the memory and KB disabled as per Otake et al.
• Like-Imp-Sal: Use sentiment to adjust the salience.
• Know-Sal: The difference between average log-likelihood of the LM with the KB and memory on versus off, knowledge salience.
• Swap-Sal: Use the same BCF approach but swaps rather than deletes a sentence to test structural ordering.
• Emb-Sal: Salience based on above embedding distance not average log-likelihood.
We run the evaluation on three models: With the Wikiplots dataset, with the Wikipedia dataset, and with just mem enabled and additional finetuning of the document encoder.
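Two of the simpler measures above can be sketched directly from their definitions. The function names are hypothetical, the embeddings are plain lists standing in for the averaged BART encoder vectors E_t, and the sentiment value is assumed to be a score in the range −1.0 to 1.0 supplied by a sentiment analyser.

```python
import math


def cos_dist(u, v):
    """Cosine distance, 1 - cosine similarity, between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def ely_surprise(embeddings):
    """cos_dist(E_t, E_{t-1}) for each sentence after the first."""
    return [cos_dist(embeddings[t], embeddings[t - 1]) for t in range(1, len(embeddings))]


def sentiment_adjusted(salience, sentiment):
    """salience * (1.0 + abs(sentiment)): emotional sentences are boosted."""
    return salience * (1.0 + abs(sentiment))
```

A strongly negative and a strongly positive sentence receive the same boost, since only the magnitude of the sentiment enters the adjustment.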

Perplexity Model Improvements
The major innovation of the RAG-derived model is incorporating the KB and memory mechanism into the LM, so we first test what impact this has on the model as a general LM. The best model is the baseline with only the memory enabled and the KB turned off. Both versions of the KB, on their own and combined with memory, are slightly worse and have around the same perplexity. The crucial difference is that LM, the model with neither, is far worse, and scrambling, which retrieves random passages, is only slightly better.
Overall, these results validate that memory and the KB substantially improve the predictive power of the LM.

ProppLearner
Following on from the BCF paper (Otake et al., 2020), we evaluate on the ProppLearner task derived from the Propp dataset (Finlayson, 2017), a richly annotated corpus of 15 Russian fairytales translated into English. See Otake et al. for more rationale for the link, but the Proppian functions with which this corpus is annotated define stereotypically important roles in the classic Russian fairytale. They represent the key events of a story's plot, which should therefore be salient. As per Otake et al., the results are reported using MAP (mean average precision; Manning et al. 2008). All the RAG models are the baseline used with different variants of the salience measures. The best RAG measures, Like-Sal and Emb-Sal (see Table 2), score slightly better than Otake et al.'s model. This validation is limited, though, as the Propp dataset is tiny, with only 15 stories of fewer than 150 sentences and limited annotations. In the next section, we extend this evaluation approach to a corpus of much longer works of classical literature, using silver labels derived from a corpus of aligned summaries. This allows us to test both the memory and KB mechanism adequately, as these would be expected to be most advantageous for longer works. This approach will enable us to test whether our method scales beyond short texts and adds robustness to the evaluation through the breadth of the corpus and the challenging nature of the text.

Shmoop Automated Evaluation
Ideally, to evaluate this thesis on longer works, there would be a set of gold-standard annotations of the salient sentences. Typically even short novellas can be over 20K words, and more typical novels are longer than 50K words. More sweeping works such as Anna Karenina, Wuthering Heights, The Fellowship of the Ring, or David Copperfield can be well over 100K words. Per-sentence annotations for longer works such as novels and plays are prohibitively expensive. This is especially true when multiple annotators are required to ensure high inter-annotator agreement. It would also not be possible with insufficiently trained and lower cost crowdsourced workers. Reading a local passage would not be enough, as it is only possible to judge salience over the whole narrative, which can be tens of thousands of words. This requires strong comprehension and thus skilled annotators, and is a daunting annotation task. Instead, this paper builds on a variant of an approach for event salience in news articles (Liu et al., 2018; Jindal et al., 2020). The method is to align expert-written summaries with the full text, tagging sentences that align with the summary as salient, thus turning the evaluation into a binary ranking problem. The intuition is that the summary will mention only salient events and themes.
We use the Shmoop corpus (Chaudhury et al., 2019), which contains classic works of literature, such as Moby Dick, but also plays such as A Midsummer Night's Dream, and short stories including The Masque of the Red Death. The Shmoop corpus has stories split into chapters with aligned summaries. These bullet point summaries, although colloquial in style, are professionally written as study guides for students. They are written with a deep understanding of the plots and the salient events in them, and so can serve as a valid proxy for salience. Conceptually they are also similar to the ProppLearner evaluation, although without specific Proppian roles, which are unused anyway for binary salience classification. The approach also aligns with the BCF concept: if events from the summary were removed, the plot would be significantly altered. (There are occasional exceptions, such as summary points that discuss themes of the overall work and not specific plot events, but these are rare.) Jindal et al. (2020) align summaries to text by using BERT (Devlin et al., 2019) to match constituent parts of events extracted from semantic role labels (SRLs). However, in testing, this performed poorly. Unlike news, the story summaries are looser descriptions of events, which the SRL method struggles with. We instead found that using an S-BERT transformer (Reimers and Gurevych, 2019) on the whole sentence worked much better in aligning summaries to the full text. The method is as follows:
1. Split aligned chapters into sentences, S_t for summaries and F_t for the full text.
2. Encode each sentence with the S-BERT model to give the representations r(S_t) and r(F_t).
3. Calculate cosine similarity for all pairwise r(S_t) and r(F_t) for t ± ρ, where the range is ρ = 10.0% and the valid range for t is x ∈ X for S_t, and y ∈ Y for F_t.
4. Mark up to k full-text sentences as salient per summary sentence, ranking all sentence pairs in the alignment window by s(x, y) = cos_sim(r(S_{t_x}), r(F_{t_y})).
Pairs of summary and full-text sentences are matched within a percentile range. The rationale is that matches are likely to occur in the full text in a roughly similar position to the summary. We allow up to three targets per summary sentence, as the summary sentences often compress information with multiple clauses and because sometimes there are near identically suitable matches. The advantage of this method is that it allows automated evaluation of salience to scale to longer works that test the memory and KB mechanism of the model without excessive annotation cost. The silver Shmoop annotations are on 226 titles, spanning 6,939 chapters with 214,617 silver-standard labels. Each chapter averages 148 sentences, with an average of 31 labelled as salient using the criteria specified. See Figure 2 for an example of the Shmoop labels plotted with the salience for a book chapter.
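The windowed alignment can be illustrated with a short sketch. It assumes the sentence vectors have already been produced by a sentence encoder, treats the ±10% window as a difference in relative position within the chapter, and applies no minimum similarity threshold; the function names and these details are assumptions, since the exact selection criteria are not fully specified above.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rel_pos(i, n):
    """Relative position of item i among n items, in [0, 1]."""
    return i / (n - 1) if n > 1 else 0.0


def align_summary(summary_vecs, full_vecs, window=0.10, k=3):
    """Map each summary sentence index to up to k full-text sentence indices."""
    alignments = {}
    for x, s_vec in enumerate(summary_vecs):
        s_pos = rel_pos(x, len(summary_vecs))
        candidates = [
            (cos_sim(s_vec, f_vec), y)
            for y, f_vec in enumerate(full_vecs)
            if abs(rel_pos(y, len(full_vecs)) - s_pos) <= window
        ]
        best = sorted(candidates, reverse=True)[:k]  # highest similarity first
        alignments[x] = [y for _, y in best]
    return alignments


def silver_labels(alignments, num_full_sentences):
    """Binary salience label for every full-text sentence."""
    salient = {y for targets in alignments.values() for y in targets}
    return [y in salient for y in range(num_full_sentences)]
```

The window keeps a summary point about the opening of a chapter from being matched to a similar-sounding sentence near its end.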
As for the ProppLearner data, we report MAP. We also evaluate with ROUGE-L (Lin, 2004), comparing the text by selecting the k most salient sentences according to the measure where k is the number of salient sentences, and report recall at k. All measures are calculated by chapter, and we take the mean across the dataset.
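The ranking metrics can be computed per chapter as in the following sketch, where `scores` are the salience values for each sentence and `labels` the binary silver labels. The function names are illustrative, and ROUGE-L is omitted since it requires the sentence text.

```python
def average_precision(scores, labels):
    """Average precision of the ranking induced by the salience scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(ranked, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank  # precision at each salient sentence
    return total / hits if hits else 0.0


def recall_at_k(scores, labels):
    """Recall among the top k sentences, with k the number of salient sentences."""
    k = sum(1 for label in labels if label)
    if k == 0:
        return 0.0
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(1 for i in ranked if labels[i]) / k


def mean_over_chapters(metric, chapters):
    """Mean of a per-chapter metric; `chapters` is a list of (scores, labels)."""
    return sum(metric(s, l) for s, l in chapters) / len(chapters)
```

Applying `mean_over_chapters(average_precision, chapters)` gives MAP in the sense used here, a mean over chapters rather than over queries.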
The results in Table 3 reveal several main themes. The Clus-Sal baseline measure improves on all the other baselines, but only by a comparatively small margin over the best of each: by 0.03 compared with the best MAP baseline, 0.04 with ROUGE-L, and 0.02 with recall. The baseline is a centroid-based extractive summarisation model that uses a powerful transformer; the relatively small performance improvement shows that the task is challenging.
The main Like-Sal measure shows an improvement of around 0.05 over Clus-Sal, and 0.10-0.15 over the baseline. This is a reasonable improvement given the model is unsupervised. The No-Know-Sal measure (without memory and KB) is about 0.03-0.04 worse on MAP and recall, which indicates that the RAG enhancements help improve salience detection. The theoretical reason would be that BCF detects shifts in state, and the informed model with the KB and memory is more likely to predict more obvious events. So salient events are more likely to be significant plot shifts. The biggest finding is that salience based on the embedding, Emb-Sal, is the strongest measure. This shows the merit of using the BART model more flexibly as a general-purpose sentence encoding model. The Emb-Surp measure is a slight improvement on the baselines, indicating that it is mainly the BCF method that causes an improvement in salience detection, rather than a simple measure of how much the story changes from sentence to sentence.
One difference from the Otake et al. finding is that combining the Clus measures makes little difference. Neither do the Imp measures that use the absolute sentiment score. While worth exploring further, this is consistent with Wilmot and Keller's (2020) findings when adjusting for sentiment in inferring surprise and suspense.
Of the more esoteric measures, both Swap-Sal and Know-Sal improve on the baseline, although not by much. The more interesting is Know-Sal, which performs similarly to the Clus-Sal baseline. As a proxy for the knowledge difference between reader and character, the measure is quite crude. There may be a more sophisticated way to develop this idea by modelling character knowledge explicitly.
Largely speaking, there does not seem to be much of a difference between the different memory and KB configurations. With the best measure Emb-Sal, the results are nearly identical. With the original BCF measure Like-Sal and its variants, both the Wikiplots dataset (plot summaries from Wikipedia) and the full Wikipedia dataset only result in a tiny improvement. It might be expected that a KB would improve performance for salience prediction, but recall that in the perplexity evaluation, memory-only performed better. The present results also suggest that the memory mechanism is the main reason for the improvement over No-Know-Sal.
The memory and KB access pattern of the model is highly non-linear and references earlier mentions of the same characters, places, or moods. One example of this is from the final chapter of Great Expectations, where Pip and Estella have their last meeting. The passage most recalled is their early meeting, some 100K words earlier, while walking in the garden, where Estella plays a trick on Pip. The memory focuses on the characters and their relationship rather than the many irrelevant details and subplots occurring in between. The episodic memory can be thought of as acting as an index into crucial elements of the plot, which is essential for narrative comprehension (Zwaan, 1999; Zwaan et al., 1995). This supports the suitability of an episodic memory model for understanding longer-form narrative texts.

Conclusion
The main overall finding is that the BCF method can infer salience over and above baselines with an improvement on much longer works. We find that augmenting an LM with memory and an external KB can improve the detection of salience and increase the predictive power of the LM on narrative text. We also find that a vector-based version of the concept can perform slightly better than using the log-likelihood from an LM. Therefore, this paper demonstrates that it is feasible to run an unsupervised method on novels from Dickens or plays by Shakespeare and achieve correlation with an automated silver label benchmark. Nevertheless, the MAP results are around 0.3, and ROUGE-L is 0.4, which leaves room for improvement.
One factor in the moderate increase could be that the salience modelling is explicitly local, over the label of the next n tokens. This is a more local view of salience, as intended from the reader's perspective. The model may flag up false leads that are locally important but not globally important for the plot. In contrast, the Shmoop summaries are written with knowledge of the whole story, and so will exclude them.
A more reader-orientated evaluation is left for future work. Although the Shmoop alignment is generally strong, there are occasions where arguably multiple sentences could be deemed the correct one, and the silver label is one and the salient peak the other. With this unsupervised approach, performance is likely to be underestimated as the labels are entirely independent. This is in contrast to much recent supervised work, such as the PEGASUS summarization system (Zhang et al., 2020), which uses silver labels created with proxies such as ROUGE (although PEGASUS has additional human evaluation on some datasets). The labels are used both to train the system and to evaluate it. Even with a separate test set, the system is more likely to replicate any noisy misalignments in the labelling process and overestimate performance.
On future work, if RAG can improve LM prediction performance and help infer salience, then the same models would seem to hold promise in improving text generation, including story generation.
The knowledge salience approach is a simple attempt to model the informed reader versus the naive one. In narratology, the character's perspective is crucial in, for example, eventfulness (Schmid, 2003; Schmid et al., 2017); Lotman et al.'s (1977) notion of characters crossing a forbidden semantic border; suspense as per Zillmann (1996); or Gerrig and Bernardo's (1994) concept of the reader as problem solver. There is, therefore, rich potential work in modelling character states, knowledge, and intents, and contrasting them with the readers' expectations and the norms of the narrative world, in inferring concepts such as salience, suspense, and surprise. Characters could be implicitly modelled using a per-entity memory model extending the current RAG approach. Alternatively, a more structured approach could be taken, inspired by recent work such as Sims and Bamman (2020) modelling literary character communication, or story generation systems such as CAST (Peng et al., 2021), which models multiple characters' goals, or C2PO (Ammanabrolu et al., 2021), which more explicitly models causal chain relations.

A.1 Interactive Plots
Within supplementary material are shmoop_iteractive_plots. This is a selection of interactive plots all.html from the Shmoop alignment of chapters. The plots show all the reported metrics that can be toggled using the legend. Each plot has the full text of the chapter with a sentence on each data point. The gold stars are the original Shmoop summary sentence, which sentence it aligns with, and the similarity score. Also included with each chapter is a correlation heatmap plot showing the Spearman ρ correlation between all the reported metrics.

A.2 Retrieved Passages
To illustrate how the retrieval mechanism functions, the supplementary material includes, within shmoop_retrieved_passages, chapters from the end or near the end of Kim, Great Expectations and 20,000 Leagues Under the Sea. These json files contain a list of the distinct passages for the chapter of each book. For each passage there is a list of the passages retrieved from the KB and memory for each block, 10 per passage, using the baseline. Each retrieved passage has a dot product score and the marginalised probability. The memory_id for a passage indicates its relative position in the story. The main reason for inclusion is to show that the memory lookup is highly non-linear: the passages retrieved from earlier in the story are strongly related to the characters, places and themes involved, and come from anywhere in the story rather than only the most recent chapters.

A.3 Shmoop Alignment
Some examples of the Shmoop alignment are found in table 4, which illustrates a few different types of sentences and their matches against the full text. In the supplementary material, within the shmoop_alignments folder, are the alignment files for 20,000 Leagues Under the Sea and Richard III. Only two are included to demonstrate the alignment, since Shmoop requires permission to use the summaries for the dataset. The format of the file is:
• Within the main alignment json files, the main json element with the annotations is chapters, which splits books or plays into sub-sections.
• Within each element of chapters there is a summary element.
- summary is a list of the summary sentences. Each one has a list of alignments that contains the index, text, and cosine similarity score of the full text sentences it is linked to.
• Within each element of chapters there is a full_text element.
- full_text contains a list of the sentences of the full text. Each sentence has a salient boolean attribute and a salience_score attribute.
- Not included in the submission, but these are exported to separate per book json files for running through the RAG model.
For running the Shmoop alignment from the Github repository:
• align_shmoop_summaries.py processes the separate raw summaries and full text books into a single jsonl file.
• salience_event_processing.sh, a slurm script, runs the Shmoop alignment and produces runnable files for the model input. The file has configurable parameters for changing the thresholds and the models used.
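The alignment file format described above can be traversed as in the following sketch. The element names chapters, summary, alignments, full_text, salient and salience_score come from the description; the keys "text", "index" and "similarity", the placeholder sentences, the scores, and the helper function names are assumptions made for illustration and may not match the released files.

```python
import json

# Hypothetical miniature alignment file; real files hold whole books or plays.
example = json.loads("""
{
  "chapters": [
    {
      "summary": [
        {
          "text": "placeholder summary sentence",
          "alignments": [
            {"index": 1, "text": "second full text sentence", "similarity": 0.72},
            {"index": 0, "text": "first full text sentence", "similarity": 0.31}
          ]
        }
      ],
      "full_text": [
        {"text": "first full text sentence", "salient": false, "salience_score": 0.0},
        {"text": "second full text sentence", "salient": true, "salience_score": 0.72}
      ]
    }
  ]
}
""")


def aligned_sentences(book, min_score=0.0):
    """Collect (summary text, sentence index, sentence text, score) tuples
    for every alignment whose similarity is at least min_score."""
    rows = []
    for chapter in book["chapters"]:
        for summary_sentence in chapter["summary"]:
            for alignment in summary_sentence["alignments"]:
                if alignment["similarity"] >= min_score:
                    rows.append((
                        summary_sentence["text"],
                        alignment["index"],
                        alignment["text"],
                        alignment["similarity"],
                    ))
    return rows


def salient_indices(chapter):
    """Positions of the full text sentences flagged as salient (silver labels)."""
    return [i for i, s in enumerate(chapter["full_text"]) if s["salient"]]
```

Calling `aligned_sentences(example, min_score=0.5)` keeps only the stronger of the two placeholder alignments, which mirrors the idea of a configurable similarity threshold without reproducing the actual thresholds used.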

B.1 Setup
The environment can be set up either via the requirements.txt file with pip or via the Anaconda environment.yaml file, both in the Github repository.

B.2 Training Datasets
The datasets are: BooksCorpus, Gutenberg, ScriptCorpus, and Wikiplots (as a KB, and not for training).
The preprocessing for all datasets is the same:
1. Sentence splitting using Blingfire.
2. Stories are randomly shuffled according to a fixed seed.
3. There is an 80/10/10 training/validation/test split, but this is only used for early stopping in training since evaluation is on separate datasets: ProppLearner and Shmoop.
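Steps 2 and 3 above can be sketched as follows. This is a minimal illustration rather than the repository code: the function name and the seed value are assumptions, and stories are taken to be already sentence-split.

```python
import random


def shuffle_and_split(stories, seed=42, percents=(80, 10, 10)):
    """Shuffle stories with a fixed seed, then split into train/validation/test.

    The seed value is an assumption; any fixed value gives a reproducible order.
    """
    shuffled = list(stories)
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * percents[0] // 100
    n_val = n * percents[1] // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test split
    return train, validation, test
```

Because the shuffle is seeded, repeated runs over the same story list produce the same three splits, which keeps the early stopping signal comparable across training runs.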

B.3 Training and Inference
• Training GPU. This is reasonable given the length of the works, and much shorter novels and plays have proportionally shorter inference times.
- Memory: The base configuration uses 28GB of general purpose memory; this needs to be increased to 64GB if the full Wikipedia KB with 23M passages is used.

B.4 Evaluation
Within the Github repository the main evaluation script is salience_evaluation.sh. This script produces a per chapter csv file with all the evaluation metrics, and a single aggregated file for the whole. It evaluates output in jsonl format produced by predictor_batch.sh, the main script for running salience inference with an existing model over a batch of stories. There is also a salience_plot.sh script for producing the interactive charts for each evaluation output. More documentation is needed for the env variables to set, but they are fairly self-explanatory in the slurm files. The main inference code is in the story_fragments/predictors package. The config is largely read through env variables in the python script. These need to be documented further for a full code release.

Summary / Full Text / Score

Summary: Then, two huge columns of water shoot from it, knocking the crew down.
Full Text: The electric light suddenly went out, and two enormous waterspouts crashed onto the deck of the frigate, racing like a torrent from stem to stern, toppling crewmen, breaking spare masts and yardarms from their lashings.
Score: 0.723

Summary: He grabs the extinguisher cap thing and tries to smother the kid/grandpa ghost with it.
Full Text: In the struggle, if that can be called a struggle in which the Ghost with no visible resistance on its own part was undisturbed by any effort of its adversary, Scrooge observed that its light was burning high and bright; and dimly connecting that with its influence over him, he seized the extinguisher-cap, and by a sudden action pressed it down upon its head.
Score: 0.600

Summary: Sara wants to say that she already knows French, but she doesn't know how to say so and ends up giving Miss Minchin the impression that she's being difficult and doesn't want to learn the language.
Full Text: Miss Minchin was a very severe and imposing person, and she seemed so absolutely sure that Sara knew nothing whatever of French that she felt as if it would be almost rude to correct her.
Score: 0.722

Summary: Emerson still feels rough about ruining the lecturer's talk in the chapel.
Full Text: But Mr. Emerson, contrite and unhappy, hurried away to apologize to the Rev. Cuthbert Eager.
Score: 0.486

Summary: She thinks Dinah should find a nice man and settle down.
Full Text: And then you might get married to some decent man, and there'd be plenty ready to have you, if you'd only leave off that preaching, as is ten times worse than anything your Aunt Judith ever did.
Score: 0.334

Summary: According to him, the driftwood is dry and ideal for starting a fire.
Full Text: It is now dry and would burn like tinder.
Score: 0.630

Summary: Detectives were sent to each port in England to see if the money might be recovered.
Full Text: As soon as the robbery was discovered, picked detectives hastened off to Liverpool, Glasgow, Havre, Suez, Brindisi, New York, and other ports, inspired by the proffered reward of two thousand pounds, and five per cent. on the sum that might be recovered.