CoRRPUS: Code-based Structured Prompting for Neurosymbolic Story Understanding

Story generation and understanding -- as with all NLG/NLU tasks -- have seen a surge in neurosymbolic work. Researchers have recognized that, while large language models (LLMs) have tremendous utility, they can be augmented with symbolic means to be even better and to make up for any flaws that the neural networks might have. However, symbolic methods are extremely costly in terms of the amount of time and expertise needed to create them. In this work, we capitalize on state-of-the-art Code-LLMs, such as Codex, to bootstrap the use of symbolic methods for tracking the state of stories and aiding in story understanding. We show that our CoRRPUS system and abstracted prompting procedures can beat current state-of-the-art structured LLM techniques on pre-existing story understanding tasks (bAbI Task 2 and Re^3) with minimal hand engineering. We hope that this work can help highlight the importance of symbolic representations and specialized prompting for LLMs, as these models require some guidance to perform reasoning tasks properly.


Introduction
Stories are a complex form of writing, involving many inter-dependent components, such as causal reasoning (Mostafazadeh et al., 2020; Han et al., 2021), temporal ordering (Basu et al., 2021), social commonsense (Hwang et al., 2021), and the consistent portrayal of facts and events as the story unfolds (Yu et al., 2021; Wilmot and Keller, 2021). Despite the recent uptick in research involving story understanding and reasoning (Qin et al., 2019; Han et al., 2019; Mostafazadeh et al., 2020; Brahman et al., 2021; Martin, 2021; Brahman, 2022; Chen et al., 2022; Guan et al., 2022; Andrus et al., 2022), there is still a significant gap in the abilities of models to actually understand (or generate) stories with a reasonable level of commonsense. Ranade et al. (2022) provide a recent overview of the area.

* Work done while at the University of Pennsylvania.

Figure 1: With natural language prompts, GPT-3 fails to keep track of the laptop through the character Amy's movement. It can follow that Amy has her laptop and that she brought it to the dorm, but it "loses track" of the laptop after Amy goes to the cafeteria.
Although large language models (LLMs) like GPT (Radford et al., 2019; Brown et al., 2020) can by themselves produce text that is indistinguishable from human-written stories, these models still struggle to generate coherent long-form text (See et al., 2019) and to solve simple commonsense reasoning tasks (see Figure 1 for an example), and therefore end up writing at human-level quality less than three-quarters of the time (Ippolito et al., 2020). It is also likely that the majority of this generated text is actually memorized and generated verbatim from human-written stories (Lee et al., 2022).
Over the past few years, researchers have begun to see value in integrating symbolic AI methods (which are consistent and logical) with neural networks (which are flexible but unpredictable) (Garcez and Lamb, 2020). These neurosymbolic techniques try to balance the best of both worlds. Nye et al. (2021) draw a parallel between neurosymbolic AI and the System 1/System 2 theory from cognitive science (Evans, 2003): neural sequence models are a fast and intuitive system (System 1) that can be improved by adding a logical reasoning system (System 2). Martin (2021) and Nye et al. (2021) use GPT-2 and GPT-3, respectively, to produce candidate story continuations and then compare them against a symbolic state representation in order to remove suggestions that would be inconsistent.
We extend this work with the use of the Code-LLM Codex (Chen et al., 2021) for structuring and tracking state changes for story understanding. Our system, which we call CoRRPUS (Code Representations to Reason & Prompt over for Understanding in Stories), extracts structured information (the story world model) which is then used for reasoning. We compare against neural-only, symbolic-only, and prior works' neurosymbolic systems on two pre-existing story understanding tasks, bAbI and Re^3, which we describe in Section 4.
Our contributions are as follows:

1. We adapt a Code-LLM for modeling story worlds and tracking information over time. Our CoRRPUS technique outperforms existing state-of-the-art models on the bAbI and Re^3 story understanding tasks.

2. We explore various prompting styles to achieve the best performance from Code-LLMs and report on best practices for working with Code-LLMs.
In the rest of the paper, we outline a brief history of neurosymbolic work in storytelling and the recent use of Code-LLMs. We then go over CoRRPUS's three prompting methods, followed by a description of the two story understanding tasks, bAbI and Re^3, and our experiments with CoRRPUS on both.
Related Work

Neurosymbolic Reasoning in Storytelling
Structured representations have existed in story generation for decades (e.g., Lebowitz, 1984; Turner and Dyer, 1985), and while pure symbolic methods are still researched today (Garbe et al., 2019; Christensen et al., 2020; Ware and Siler, 2021), there has been a recent push to combine these structured symbolic methods with probabilistic neural networks. We outline some of these methods here.
Much of the research in story generation uses a hierarchical or multi-stage technique to first plan out the underlying plot (also known as the fabula) or other underlying structures, and then generate natural language sentences informed by that structure (Martin et al., 2018; Fan et al., 2018; Tambwekar et al., 2019; Yao et al., 2019; Goldfarb-Tarrant et al., 2019; Ammanabrolu et al., 2020; Rashkin et al., 2020; Guan et al., 2020; Sun et al., 2022). Other popular ways to add structure to enhance story generation include using external resources (e.g., ConceptNet (Speer et al., 2017), VerbNet (Brown et al., 2019) in Martin (2021), or ATOMIC (Sap et al., 2019)) or extracting information from the original story beyond the series of events, such as characters' emotions toward each other (Rao Vijjini et al., 2022) or knowledge triplets (Alabdulkarim et al., 2021). Similarly, summary information of the story so far, and a combination of summaries and character relationships (Si et al., 2021), have been used to augment storytelling dialog generation.
Researchers have also seen the benefit of neurosymbolic methods by using external knowledge bases (Zhang et al., 2021) or symbolic representations to augment a transformer's ability to perform story understanding.
Most relevant to our work, however, is the extraction of a story world, or world model. These systems use a dynamic external representation to keep track of the story while generating, making for more consistent stories. So far this has been done with world state dictionaries (Martin, 2021), object-oriented notation (Nye et al., 2021), and knowledge graphs (Andrus et al., 2022).

Introduction of Code-LLMs for Reasoning
Recently, thanks to LLMs and large-scale code training data (Husain et al., 2019), many breakthroughs have been made in automatic code synthesis (Feng et al., 2020; Clement et al., 2020; Chen et al., 2021). Code-LLMs have been shown to be adept at logical reasoning (Wei et al., 2022), numerical reasoning (Cobbe et al., 2021), theorem proving (Wu et al., 2022), and linguistic reasoning tasks, such as the command composition task SCAN (Lake and Baroni, 2018), as shown by Zhou et al. (2023). Madaan et al. (2022) show that Code-LLMs perform better than LLMs on various structured commonsense reasoning tasks, including procedural reasoning and entity state tracking, where they used a graph-based representation. Similarly, we prompt Code-LLMs to extract a structured world model for neurosymbolic story understanding.

Figure 2: An example prompt used for bAbI Task 2. All three prompting methods have the same prompt initialization (a) followed by their respective additional functions (b, c, or d), found inside the World class (end of a). That is, for the 1-shot example, CoRRPUS would be provided (a) + (b, c, or d) depending on the prompting method. To prompt for the next story, CoRRPUS is given (a) + the non-highlighted portion of (b, c, or d). The highlighted section would then be generated by CoRRPUS.

The CoRRPUS Prompting System
In this work, we examine the types of prompts for tracking symbolic story state representations using OpenAI's Code-LLM, Codex (Chen et al., 2021). All experiments that use the CoRRPUS prompt system are conducted with code-davinci-002 via OpenAI's API (which was free for research purposes). Following Holtzman et al. (2020), we use nucleus sampling with a top-p value of 0.95.
We set the temperature to 0 when no diverse generation is needed (i.e., answering multiple-choice questions from the bAbI task, Section 4.2) and to 0.7 when diverse generation is needed (such as in the Re^3 task, Section 4.3). All of the experimental results come from a single run.
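This decoding setup amounts to a small set of request parameters. The sketch below is illustrative only: the helper name and the max_tokens budget are our own assumptions (the paper does not show its client code), and Codex has since been retired from the OpenAI API.

```python
# Hypothetical helper assembling the decoding parameters described above.
# `max_tokens` is an assumed budget; it is not stated in the paper.

def codex_params(prompt, diverse):
    """Build request parameters for a Codex-style completion call."""
    return {
        "model": "code-davinci-002",
        "prompt": prompt,
        "top_p": 0.95,  # nucleus sampling (Holtzman et al., 2020)
        "temperature": 0.7 if diverse else 0.0,  # 0.7 for Re^3, 0 for bAbI
        "max_tokens": 512,  # illustrative assumption
    }
```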
We provide the Codex model with a collection of classes in Python to represent a model of the characters and objects that we want to track in the story, in addition to an initialization of the specific characters and objects for the current story, which is presented in a World class. See Figure 2a and Appendix 8.1 for examples of these classes. In addition to this information for the current story, we also provide Codex with a 1-shot example of the full process we want it to complete.
Formally, we define the problem as follows: given a story S = [S_0, S_1, ..., S_n] and a story world state initialization W_0, we want the model to update the story world state W_i after each sentence until the story is complete at W_n. After each sentence S_i of the story, the model updates the world state using some black-boxed update function U_i, so that W_i = U_i(W_{i-1}) and the final world state is W_n = U_n(U_{n-1}(... U_1(W_0))). Note that, with the exception of W_0 and W_n, these intermediate world states are opaque and unseen from the user's perspective. This system, which we call CoRRPUS, extracts structured information using code-based chain-of-thought-style prompting in order to more accurately track the underlying story state and detect any inconsistencies.
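The update loop above can be sketched in a few lines of Python. The `update` callable stands in for the black-boxed U_i (in the real system, the Code-LLM completing the next lines of the story() function); the rule-based toy update is purely illustrative.

```python
# Minimal sketch of the CoRRPUS state-tracking loop: fold each story
# sentence into the world state, W_i = U_i(W_{i-1}).

def track_story_state(w0, story_sentences, update):
    """Apply the (black-boxed) update function once per sentence."""
    state = w0
    for sentence in story_sentences:
        state = update(state, sentence)  # intermediate W_i, unseen by the user
    return state  # final world state W_n

# Toy stand-in for U_i: parse "<name> journeyed to the <place>." sentences.
def toy_update(state, sentence):
    subject, _, place = sentence.partition(" journeyed to the ")
    if place:
        state = {**state, subject: place.rstrip(".")}
    return state

w0 = {"Sandra": "garden"}
wn = track_story_state(w0, ["Sandra journeyed to the bedroom."], toy_update)
```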
We experiment with three different prompting techniques.
The following examples show how the story sentence "Sandra journeyed to the bedroom" would be modified for each prompt type.
• Comment Only: This is the simplest prompt, which is given no extra structural information.
Thus, it has to rely directly on the comments from the prompt and the story world state initialization. This shows us how well the Code-LLM can infer and fill in information with very little guidance. Example: Given the comment ## Sandra journeyed to the bedroom., the model should generate self.Sandra.location = "bedroom".
• Specific Functions: This prompt converts each individual sentence into a specific function. In addition to the commented sentences at the beginning of the prompt, the system is also prompted to generate the functions for each sentence. Example: Given the function name Sandra_journeyed_to_the_ bedroom(), the model should generate a definition for the function, which should contain self.Sandra.location = "bedroom".
• Abstract Functions: This last prompting style provides the model with functions for actions or for setting attributes. The model then does not have to figure out what it means when a particular event happens, but it still has to map each story sentence to the appropriate function and fill in the function's arguments. Example: Given go(character, destination) and other predefined functions, the model should generate go(character=Sandra, destination=bedroom).
Full examples of these prompts are shown in Figure 2 and Appendix 8.1.
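The three styles differ only in how a story sentence is rendered inside the prompt. The following schematic comparison mirrors the Sandra example above; the helper names are our own, and the surrounding World/Character class definitions are elided.

```python
# Schematic comparison of the three prompt styles for the sentence
# "Sandra journeyed to the bedroom." (illustrative, not the verbatim prompt).

def render_comment_only():
    # Comment Only: a commented sentence followed by a bare state update.
    return "\n".join([
        "## Sandra journeyed to the bedroom.",
        'self.Sandra.location = "bedroom"',
    ])

def render_specific_function():
    # Specific Functions: one uniquely named function per story sentence.
    return "\n".join([
        "def Sandra_journeyed_to_the_bedroom(self):",
        '    self.Sandra.location = "bedroom"',
    ])

def render_abstract_function():
    # Abstract Functions: the model selects a predefined action function
    # (e.g., go(character, destination)) and fills in its arguments.
    return 'self.go(character=self.Sandra, destination="bedroom")'
```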

Experiments
To show how CoRRPUS can improve story understanding by maintaining a world model, we evaluate our system on two tasks: (1) Task 2 of the bAbI set of tasks (Weston et al., 2015), which tests multi-step reasoning, and (2) the consistency task from Re^3, which is built from a set of stories, each with a variation that is counterfactual to the original story. These stories were generated by prompting LLMs for consistent/contradictory stories based on a story premise, and a model is then asked to detect any story inconsistencies. The original work makes use of GPT-3's Edit function to correct detected factual inconsistencies in order to maintain long-range story consistency.
These two tasks have proven extremely challenging for LLMs, with naive few-shot prompting leading to accuracy barely above random chance.

Question-and-Answering (bAbI Task 2)
bAbI (Weston et al., 2015) is a question-answering dataset for logic-based reasoning tasks. Task 2 provides a story S focusing on the movement of objects and characters and a query Q about their locations. The dataset contains 1,000 testing samples of (S, Q, A) tuples, where A is the answer to the query. We choose Task 2 because of the recent neurosymbolic reasoning work evaluated on it (Nye et al., 2021; Creswell et al., 2023), which we use as our baselines.
The CoRRPUS Formulation for bAbI. Starting with the CoRRPUS prompt formulation described in Section 3, we first initialize the character class with the attributes name, location, and inventory. The object class has the attributes name, location, and carrier. Then, we let Codex complete the World class by generating the rest of the story() function, which tracks the story state changes and ends with a print() call that gives the answer to the query Q from the bAbI task. We then measure the accuracy of the model selecting the correct answer. We compare our CoRRPUS system against the following baselines for bAbI Task 2:

1. Random: The likelihood of selecting the right multiple-choice answer randomly.

6. GPT-3 Dual-System (DS) (Nye et al., 2021): This method is based on System 1/System 2 thinking (Evans, 2003). The authors specify the functions of all actions used in bAbI Task 2 to create a logical, symbolic component (System 2). The system then prompts GPT-3 to match each sentence with the actions (System 1) and executes the corresponding function. We did not rerun the results from this experiment.
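The class skeleton described above can be reconstructed roughly as follows, assuming only the attribute names listed (the verbatim prompt is in Appendix 8.1, so the details here are illustrative).

```python
# Illustrative reconstruction of the bAbI Task 2 class skeleton: characters
# have name/location/inventory, objects have name/location/carrier, and the
# World's story() function tracks state and prints the answer to the query.

class Character:
    def __init__(self, name):
        self.name = name
        self.location = None
        self.inventory = []  # objects the character is carrying

class Object:
    def __init__(self, name):
        self.name = name
        self.location = None
        self.carrier = None  # the Character currently holding the object

class World:
    def __init__(self):
        self.Sandra = Character("Sandra")
        self.football = Object("football")

    def story(self):
        ## Sandra journeyed to the bedroom.
        self.Sandra.location = "bedroom"
        ## Sandra picked up the football.
        self.football.carrier = self.Sandra
        self.Sandra.inventory.append(self.football)
        self.football.location = self.Sandra.location
        # Query: Where is the football?
        print(self.football.location)
```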
Results and Discussion. As shown in Table 1, even the one-shot CoRRPUS system with Comment Only or Specific Functions prompting achieves 12% and 22% higher accuracy, respectively, than vanilla one-shot GPT-3. We believe this is because GPT-3, when used out-of-the-box and unaided, is known to be bad at multi-step reasoning (Shridhar et al., 2023; Sap et al., 2022). Meanwhile, Code-LLMs represent knowledge symbolically and are therefore better at logical reasoning (Madaan et al., 2022). However, the increase in accuracy is not solely due to Codex. When we use the same natural language prompt as in the GPT-3 one-shot experiment, the accuracy only goes up 1.3%. It is not until we introduce the CoRRPUS structured prompting system that we start to see accuracies between 67% and 99.1%. When the underlying functions for actions are provided to CoRRPUS (Abstract Functions), it can achieve near-perfect accuracy (99.1%), vastly exceeding the other prompting methods. This experiment shows the importance of abstraction to trackability and composability in symbolic reasoning. The other system that approximately matches CoRRPUS in accuracy is the GPT-3 Dual System (Nye et al., 2021), which splits the reasoning steps into two separate systems, uses highly detailed hand-written rules, and requires 10-shot prompting. We show that providing a simpler one-shot prompt to Codex is enough to perform logical reasoning on these simple stories, as long as it is provided low-level functions to compute over.
With such high accuracies from our CoRRPUS system and Nye et al. (2021)'s Dual System, we are tempted to consider bAbI's Task 2 a solved problem.

Detecting Story Inconsistencies (Re^3)
Given the simplicity of the sentences in bAbI, we wondered how well CoRRPUS could process and understand stories with complicated real-world sentences, such as those in the Re^3 task. This dataset contains 50 (P, P', S, S') tuples, where P denotes a story premise, P' is a premise contradictory to P, S is the story generated from P, and S' is the story generated following P'. The task is framed as classification: one wants to flag (P, S) and (P', S') as consistent and (P, S') and (P', S) as contradictory. Performance is reported as the ROC-AUC (area under the ROC curve) score. An example of the comparison between facts of the premise and the story is shown in Figure 3.
CoRRPUS Formulation for Re^3. Following our CoRRPUS prompt starting point (see Section 3), we initialize the character class with common person attributes, including name, appearance, occupation, gender, age, and relations (to other characters). Then we initialize the main characters in the World class and one-shot prompt Codex to complete the story() function for tracking the state changes of the characters. Full examples of each of these prompts can be found in Appendix 8.1. Once the World state is complete, it is fed into GPT-3 to be converted into natural language text. Following the original task's method, we then pass this natural-language story state to BART-Large to find any contradictions between the story and the corresponding premise from the dataset. We use the following baselines for the Re^3 experiment:

1. GPT-3: We query GPT-3 using zero-shot prompting to determine whether there are inconsistencies in the (S, P) pair.

2. Textual Entailment: This method uses a BART-Large-based (Lewis et al., 2020) entailment model trained on MNLI (Williams et al., 2018) to score (S, P) pairs.

3. Entailment-DPR: For each sentence s_i in S, this method computes its relevance score against each sentence in P using Dense Passage Retrieval (Karpukhin et al., 2020). It then takes the sentence p_i with the highest relevance score and uses the entailment model to detect contradictions.

4. Structured-Detect: This method prompts GPT-3 to extract an attribute dictionary for each character in the story premise and in the story. To prevent hallucinations, the method prompts GPT-3 three times and then uses the BART-Large-based entailment model to keep the most-agreed attributes. Finally, the method converts the attribute-value pairs into natural sentences and uses the entailment model to detect contradictions between values under the same key.

Figure 3: An illustration of the contradiction detection process for Re^3. We check attributes across the premise and the story using a BART-Large-based entailment model and flag any contradictions for attribute-value pairs with the same attribute key. In this example, the "appearance" attribute would be flagged as a contradiction.
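The pre-specified character attributes for this task can be sketched as below, assuming only the attribute names named in the formulation (the verbatim prompt is in Appendix 8.1, so this is an illustrative reconstruction).

```python
# Illustrative sketch of the Re^3 character class: the pre-specified
# attributes tell the model what kinds of facts to extract from each
# sentence (appearance, occupation, gender, age, relations).

class Character:
    def __init__(self, name):
        self.name = name
        self.appearance = []   # e.g., "graying hair", "mustache"
        self.occupation = None
        self.gender = None
        self.age = []          # e.g., "middle-aged"
        self.relations = {}    # other character's name -> relationship
```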
Inspired by the Structured-Detect and Self-Consistency (Wang et al., 2023) methods, we ask CoRRPUS to complete 3 different generations with temperature equal to 0.7 and select each attribute-value pair by majority voting. Finally, as in prior work, we use a BART-Large-based entailment model to flag any contradictions between the values of any attributes found in both the premise and the story. A toy example of this comparison process is shown in Figure 3.

Results and Discussion. Table 2 shows that all versions of CoRRPUS greatly outperform the baseline methods, with CoRRPUS (Specific Functions) performing the best at a ROC-AUC score of 0.79. Subjectively comparing CoRRPUS to GPT-3 shows that GPT-3 tends to struggle with the parsing of the sentences, not knowing what is relevant. Take, for example, the sentence "Mark Woodbury, a middle-aged man with graying hair and a mustache, smiled at Shannon as she walked into his office." CoRRPUS generates

• self.Mark_Woodbury.appearance.append('graying hair')
• self.Mark_Woodbury.appearance.append('mustache')
• self.Mark_Woodbury.age.append('middle-aged')

Meanwhile, GPT-3 generates "Mark is a middle-aged man with graying hair and a mustache." Our particular way of prompting with CoRRPUS uses pre-specified attributes in the character class initialization. By pointing out what types of attributes the model should pay attention to (e.g., appearance or relations), CoRRPUS is better able to extract the relevant information from the natural language sentences of the given story. GPT-3, meanwhile, ends up merely summarizing the original sentence. Among our three CoRRPUS prompting methods, however, performance is similar, with Abstract Functions performing the worst and Specific Functions the best. Their similarity in score could stem from the simplicity of the type of information that needs to be reasoned over, namely attributes of characters, with no functions over verbs and how they unfold. Because of this, the Abstract Functions prompt ends up being a collection of "set" functions (Appendix Figure 6), which likely makes the code more complicated and gives Codex a harder time following it.
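The self-consistency step described above (three samples, majority vote over attribute values, then entailment-based flagging) can be sketched as follows. The `contradicts` callable abstracts the BART-Large entailment check and is a hypothetical stand-in.

```python
# Sketch of majority voting over 3 sampled generations, followed by
# contradiction flagging on attributes shared by premise and story.

from collections import Counter

def majority_vote(generations):
    """generations: list of {attribute: value} dicts, one per sample."""
    votes = Counter()
    for gen in generations:
        for attr, value in gen.items():
            votes[(attr, value)] += 1
    n = len(generations)
    # Keep an attribute-value pair only if a strict majority agrees on it.
    return {attr: value for (attr, value), c in votes.items() if c > n // 2}

def flag_contradictions(premise_attrs, story_attrs, contradicts):
    """Flag attribute keys present in both dicts whose values contradict."""
    shared = premise_attrs.keys() & story_attrs.keys()
    return [a for a in shared if contradicts(premise_attrs[a], story_attrs[a])]
```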
However, it is unclear why Specific Functions performs best on this task. It could be due to keeping the setting of attributes better separated for each sentence via unique functions (Appendix Figure 5), instead of simply separated by comments (Appendix Figure 4), since comments are never operated on in real code.

Conclusion
We present CoRRPUS, a code-based prompting system that can extract structured information from complicated stories using 1-shot prompting. We conducted experiments on bAbI's Task 2 to show that code-based prompting can leverage symbolic information to perform multi-step logical reasoning better than natural-language-based prompting, regardless of whether a Code-LLM or regular LLM is used. We also evaluate CoRRPUS on detecting story inconsistencies using the Re^3 task, showing that Code-LLMs can extract relevant information from story sentences better than LLMs. We emphasize that a careful prompting procedure that provides relevant low-level semantic abstractions can greatly improve the accuracy and generalizability of neurosymbolic reasoning, but the type of prompt needed is entirely task-dependent.

Limitations
We recognize that this work was only performed on two tasks related to story understanding, thus it is difficult to say exactly how robust it really is. However, given the capabilities of LLMs and Code-LLMs, we believe our prompting techniques or similar will prove to be useful to the story understanding community.
Our work also assumes that CoRRPUS will be asked the same question across stories. In other words, given an example as a prompt, CoRRPUS will follow that example to generate code for the next story. We are not providing the task to CoRRPUS and having it interpret the question to figure out what it should be tracking. We simply tell it to track certain information (e.g., objects, physical features of characters) so that it can solve these tasks. Therefore, for CoRRPUS to work, the user needs to know what information is salient for their task and provide it to the system. Even though the Re^3 dataset contains more complicated sentences than bAbI, these are still relatively simple English sentences. We do not know how CoRRPUS would perform on more complex stories or on stories in other languages.
Lastly, there is the issue of access. Due to cost, we were unable to rerun all of the GPT-3 experiments. The pricing of GPT-3 not only hinders new research, but it also hinders reproducibility efforts such as ours. Furthermore, as of the publication of this paper, Codex has been removed from the OpenAI API, and it is as-of-yet unknown whether GPT-3.5 or GPT-4 can handle code-based prompting as well. There are, however, still other code-based LLMs available, such as GitHub's Copilot and Hugging Face's StarCoder.

Risks
Working with LLMs is always a risk in itself since they are trained on huge amounts of data, some of which has never been read by developers. This text can include racist, misogynistic, queerphobic, etc. sentiments. Although the risk of harmful text might be reduced in a Code-LLM, comments, variables, and function names might still contain harmful messaging and should always be used with caution, especially when used outside of controlled research settings.
Furthermore, any code that CoRRPUS produces is not guaranteed to run, nor is it guaranteed to be completely accurate in its reasoning. However, story understanding is a relatively safe testing space for reasoning and understanding tasks.

Responsible NLP Checklist

A2. Did you discuss any potential risks of your work?
Section 7

A3. Do the abstract and introduction summarize the paper's main claims?
Left blank.

A4. Have you used AI writing assistants when working on this paper?
Left blank.

B. Did you use or create scientific artifacts?
Sections 1, 3, 4

B1. Did you cite the creators of artifacts you used?
Sections 1, 3, 4

B2. Did you discuss the license or terms for use and/or distribution of any artifacts?
These are tools and evaluation metrics free for use within research. We cite them and use them as intended.

B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?
Both GPT-3 and Codex were used to generate text in a responsible manner through controlled experiments. Our use fits within OpenAI's usage policies.

B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it?
The data we use is from pre-existing datasets.

C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?
Section 3

The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing assistance.