Open-Domain Hierarchical Event Schema Induction by Incremental Prompting and Verification

Event schemas are a form of world knowledge about the typical progression of events. Recent methods for event schema induction use information extraction systems to construct a large number of event graph instances from documents, and then learn to generalize the schema from such instances. In contrast, we propose to treat event schemas as a form of commonsense knowledge that can be derived from large language models (LLMs). This new paradigm greatly simplifies the schema induction process and allows us to handle both hierarchical relations and temporal relations between events in a straightforward way. Since event schemas have complex graph structures, we design an incremental prompting and verification method, IncSchema, to break down the construction of a complex event graph into three stages: event skeleton construction, event expansion, and event-event relation verification. Compared to directly using LLMs to generate a linearized graph, IncSchema can generate large and complex schemas with 7.2% F1 improvement in temporal relations and 31.0% F1 improvement in hierarchical relations. In addition, compared to the previous state-of-the-art closed-domain schema induction model, human assessors were able to cover ~10% more events when translating the schemas into coherent stories and rated our schemas 1.3 points higher (on a 5-point scale) in terms of readability.


Introduction
Schemas, defined by Schank and Abelson (1975) as "a predetermined, stereotyped sequence of actions that defines a well-known situation", are a manifestation of world knowledge. With the help of schemas, a model can infer missing events, for example that a person must have "been within contact with a pathogen" before the event "the person was sent to the hospital for treatment", and can also predict that if a large-scale incident happened, this might trigger an "investigation of the source of the pathogen". To automate schema creation, the two mainstream approaches are to learn from manually created reference schemas or to learn from large amounts of event instances automatically extracted from documents. Manual creation of complex hierarchical schemas requires expert annotation, which is not scalable. On the other hand, instance-based schema induction methods (Li et al., 2021) rely on complicated preprocessing to transform documents into instance graphs for learning. Moreover, supervised information extraction systems (Ji and Grishman, 2008; Lin et al., 2021c) are domain-specific and suffer from error propagation through multiple components, which makes the downstream schema induction model closed-domain and low in quality.
Figure 1: A comparison between the instance-based schema induction pipeline (gray background) and our INCSCHEMA approach. By directly prompting LLMs to construct the schema graph, our framework is conceptually simpler, open-domain, extensible, and more interpretable.
Figure 2: To create the schema for a given scenario, our model follows 3 rounds of operation: (1) event skeleton construction where we ask the LLM to list the important events; (2) event expansion to discover more related events for each existing event; and (3) event-event relation verification where we update the event-event relations based on the LLM's answers to questions about each event pair.
Tracing back to the original definition of schemas, we observe that "stereotyped sequences of events" or "the typical progression of events" can be viewed as commonsense knowledge that can be implicitly learned by training on large corpora. Through the language modeling objective, models can pick up which events frequently co-occur and how their relationship is typically described. More recently, large language models (LLMs) such as GPT3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022) have shown impressive zero-shot performance on closely related commonsense reasoning tasks such as goal-step reasoning (Zhang et al., 2020) and temporal reasoning (see the BIG-bench temporal_sequences task, https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/temporal_sequences).
By utilizing LLMs to directly prompt for schematic knowledge, our approach is open-domain, extensible and more interpretable for humans. Given a new scenario name, our model only requires lightweight human guidance in providing some top-level chapter structure (as shown on the left of Figure 2) and can produce the entire schema in under an hour, whereas instance-based methods require months to collect the data and retrain the IE system for new domains. Our model is extensible and can support new types of event-event relations by adding new prompt templates. To showcase this, in addition to the temporal relation between events, which is the focus of prior work, we also account for the different event granularities by supporting hierarchical relations between events (for example, a physical conflict could happen as a part of a protest). Finally, by representing events with free-form text instead of types and organizing them into a hierarchy, our generated schemas are considered more interpretable.
We find that directly asking LLMs to generate linearized strings of schemas leads to suboptimal results due to the size and complexity of the graph structure. To solve this problem, we design an incremental prompting and verification scheme to break down the construction of a complex event graph schema into three major stages: event skeleton construction, event expansion, and event-event relation verification. As shown in Figure 2, each stage utilizes templated prompts (What happens before cases increase?) which can be instantiated either with the scenario name or the name of a previously generated event.
The key contributions of this paper are:
• We propose a framework INCSCHEMA for inducing complex event schemas by treating the task as knowledge probing from LLMs. Compared to previous approaches that rely on the creation of event instance graphs, our method greatly simplifies the process and, as a result, is not confined to the working domain of any IE system.
• We extend the expressive power of event schemas by inducing hierarchical relations and temporal relations between events at the same time. Our modularized prompting framework allows us to support a new type of event-event relation easily, whereas prior work (Zhou et al., 2022b) required specialized pipelines or components.
• We verify the effectiveness of our framework on two complex schema datasets: ODIN, an Open-Domain Newswire schema library, and RESIN-11 (Du et al., 2022). Compared to directly generating the schema using a linearized graph description language (Sakaguchi et al., 2021), INCSCHEMA shows 7.2% improvement in temporal relation F1 and 31.0% improvement in hierarchical relation F1.

Task Overview
Given a scenario name, a schema depicts the general progression of events within that scenario. Following (Li et al., 2021), we consider the schema to be a graph structure of events. We output a schema graph of event nodes and event-event relation edges, including temporal relations and hierarchical relations.
Since our algorithm is designed to be open-domain, we represent each event e with a description string such as "A person shows early symptoms of the disease" instead of a type from a restricted ontology (e.g., Illness). Description strings are more flexible in representing different granularities of events and are more informative. Note that event descriptions in a schema should be general rather than describe a specific instance such as "John had a mild fever due to COVID".
In addition, we support the representation of chapters, which are "a collection of events that share the same theme and are connected in space-time". When a high-level chapter structure G_c (as shown on the left side of Figure 2) is available, we condition on the given chapters to guide the schema generation process. Chapters are also treated as events and can potentially have temporal relations between them. Every other event must be a descendant of a chapter event. If no chapter structure is available, we create a single chapter from the scenario name.
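As an illustration of the schema representation described in this section, the following is a minimal Python sketch. The class and field names (Event, Schema, is_chapter, under_chapter) and the event name "Show Symptoms" are hypothetical and chosen for this example only; the sketch simply encodes events as description strings, keeps temporal and hierarchical edges separately, and checks the constraint that every event sits under a chapter.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    name: str                 # short label for human readers
    description: str          # general, instance-independent description
    is_chapter: bool = False  # chapters are treated as events as well


@dataclass
class Schema:
    scenario: str
    events: dict = field(default_factory=dict)   # name -> Event
    temporal: set = field(default_factory=set)   # (earlier, later) name pairs
    hierarchy: set = field(default_factory=set)  # (parent, child) name pairs

    def add_event(self, event, parent=None):
        self.events[event.name] = event
        if parent is not None:
            self.hierarchy.add((parent, event.name))

    def under_chapter(self, name):
        """True if `name` is a chapter or has a chapter among its ancestors."""
        seen, frontier = set(), [name]
        while frontier:
            current = frontier.pop()
            if current in seen:
                continue
            seen.add(current)
            event = self.events.get(current)
            if event is not None and event.is_chapter:
                return True
            frontier.extend(p for p, c in self.hierarchy if c == current)
        return False

    def is_well_formed(self):
        # Every non-chapter event must be a descendant of a chapter event.
        return all(self.under_chapter(name) for name in self.events)


schema = Schema("disease outbreak")
schema.add_event(Event("Outbreak", "The disease starts to spread", is_chapter=True))
schema.add_event(
    Event("Show Symptoms", "A person shows early symptoms of the disease"),
    parent="Outbreak",
)
```

Keeping the two edge types in separate sets mirrors the fact that temporal and hierarchical relations are verified and simplified separately later on.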

Our Approach
Leveraging LLMs to directly generate the full schema graph is challenging due to the size and complexity of schemas. Thus, we divide our schema induction algorithm INCSCHEMA into three stages as depicted in Figure 2. Starting from the scenario node or one of the chapter nodes, the skeleton construction stage first produces a list of major events that are subevents of the scenario (chapter) following sequential order. For each generated event, we expand the schema graph to include its temporally-related neighbors and potential children in the event expansion stage. For each pair of events, we further rescore their temporal and hierarchical relation probability in the relation verification stage to enrich the relations between events.

Retrieval-Augmented Prompting
To make the model more informed of how events are typically depicted in news, we introduce a retrieval component to guide LLMs to focus on scenario-related passages. The key difficulty of schema induction is to generalize from multiple passages and reflect the "stereotyped sequence of events" instead of providing concrete and specific answers. We, therefore, retrieve multiple passages each time and ask the model to provide a generalized answer that is suitable for all passages.
To build a document collection containing typical events of the given scenario, we leverage its Wikipedia category page and retrieve the reference news articles of each Wikipedia article under the category, as detailed in Appendix A. With such a document collection, for each prompt, we use the description of the event as the query and retrieve k = 3 passages using the state-of-the-art document retrieval system TCT-ColBERT (Lin et al., 2021b). The input to the LM is structured as follows:

Retrieval-Augmented Prompt
Based on the following passages {retrieved passages}, {prompt}
Providing more than one passage is critical as we want the model to produce a generalized response instead of a specific response that only pertains to one event instance.
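The template above can be filled mechanically. The sketch below is only illustrative: the function name build_retrieval_prompt, the bracketed passage numbering, and the choice to reject a single passage are assumptions made for this example and are not taken from the paper.

```python
def build_retrieval_prompt(passages, prompt):
    """Fill 'Based on the following passages {retrieved passages}, {prompt}'."""
    if len(passages) < 2:
        # Assumption: enforce several passages so the answer can generalize.
        raise ValueError("provide more than one passage")
    # Assumption: number the passages so they stay distinguishable in the prompt.
    joined = " ".join(f"[{i}] {p.strip()}" for i, p in enumerate(passages, 1))
    return f"Based on the following passages {joined}, {prompt}"


example = build_retrieval_prompt(
    ["First retrieved passage.", "Second retrieved passage.", "Third retrieved passage."],
    'what happened during "Cases increase"? List the answers:',
)
```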

Event Skeleton Construction
We use the following prompt to query the LM about events that belong to the chapter c:

Event Skeleton Prompt
{evt.name} is defined as "{evt.description}". List the major events that happen in the {evt.name} of a {scenario}:
This typically gives us a list of sentences, which is further translated into a linear chain of event nodes by treating each sentence as an event description and regarding the events as listed in temporal order. To assign a name to each event for easier human understanding, we leverage the LLM again with in-context learning using 10 {description, name} pairs such as {Disinfect the area to prevent infection of the disease, Sanitize} (the complete list of in-context examples is in Appendix D).
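The translation from a list-style response into a linear chain can be sketched as follows. This is a minimal illustration under the assumption that the LLM returns one event per line, possibly with numbering or bullets; the function names parse_event_list and linear_chain are hypothetical.

```python
import re


def parse_event_list(response):
    """Split an LLM list response into one event description per item."""
    events = []
    for line in response.splitlines():
        # Drop leading numbering or bullets such as "1.", "2)", "-", "*".
        text = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if text:
            events.append(text)
    return events


def linear_chain(events):
    """Temporal edges (earlier, later) between consecutive listed events."""
    return list(zip(events, events[1:]))


response = "1. People are infected\n2. Cases increase\n\n3. Hospitals fill up"
skeleton = parse_event_list(response)
edges = linear_chain(skeleton)
```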

Event Expansion and Validation
Given an event e (such as Cases Increase in Figure 2), we expand the schema by probing for its connected events in terms of temporal and hierarchical relations using prompts as below:

Event Expansion Prompt
What happened during "{evt.description}"? List the answers:
(See Appendix D for a full list of prompts used.) Every sentence in the generated response will be treated as a candidate event.
For every candidate event e′ (such as Disease-Transmit in Figure 2), we perform a few validation tests as listed below. The event is only added to the schema when all the tests pass.

Duplication Test
To check if a new event is a duplicate of an existing event, we use both embedding similarity, computed as the cosine similarity of SBERT embeddings (Reimers and Gurevych, 2019), and string similarity, computed as Jaro-Winkler similarity (Winkler, 1990). If the event description, event name, or the embedding of the event description is sufficiently similar to that of an existing event in the schema, we discard the new event.
Specificity Test
When we augment the prompt with retrieved documents, at times the model will answer the prompt with details that are too specific to a certain news article, for instance by including the time and location of the event. The specificity test seeks to remove such events. We implement this by asking the LLM "Does the text contain any specific names, numbers, locations, or dates?" and requesting a yes-no answer. We use 10 in-context examples to help the LLM adhere to the correct answer format and understand the instructions.
Chapter Test
For the chapter assignment test, we present the name and the definition of the chapter event c and the target event e′ respectively, then ask "Is e′ a part of c?". If the answer is "yes", we keep the event e′.
If a new event e′ passes validation, we assign a name to the event following the same procedure as in Section 3.2.
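To make the duplication test concrete, here is a rough standard-library sketch. It is not the authors' implementation: difflib's SequenceMatcher ratio is used as a stand-in for Jaro-Winkler similarity, embeddings are passed in as plain lists rather than computed with SBERT, and the use of ">=" at the thresholds is an assumption. The threshold values (0.9 for string similarity, 0.85 for cosine similarity, edit distance below 3 for names) are the ones reported in the implementation details.

```python
import math
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def string_similarity(a, b):
    # Stand-in for Jaro-Winkler similarity (assumption for this sketch).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_duplicate(new, existing, str_thresh=0.9, emb_thresh=0.85, max_edits=3):
    """`new` and `existing` are dicts with 'name', 'description', 'embedding'."""
    return (
        string_similarity(new["description"], existing["description"]) >= str_thresh
        or cosine(new["embedding"], existing["embedding"]) >= emb_thresh
        or levenshtein(new["name"].lower(), existing["name"].lower()) < max_edits
    )
```

A candidate would be compared against every event already in the schema and discarded as soon as one comparison reports a duplicate.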

Event-Event Relation Verification
Although the prompts from the previous step naturally provide us with some relations between events (the answer to "What are the steps in e?" should be subevents of e), such relations may be incomplete or noisy. To remedy this problem, for every pair of events (e1, e2) in the same chapter, we verify their potential temporal/hierarchical relation.
A straightforward way to perform verification would be to ask questions such as "Is e1 a part of e2?" and "Does e1 happen before e2?". Our pilot experiments show that this form of verification leads to sub-optimal results in two aspects: (1) relation confusion: the language model will predict both e2 ≺ e1 and e1 ⊂ e2; and (2) order sensitivity: the language model tends to return "yes" for both "Does e1 happen before e2?" and "Does e1 happen after e2?".
To solve this relation confusion problem, inspired by Allen's interval algebra (Allen, 1983) and the neural-symbolic system of Zhou et al. (2021), we decompose the decision of a temporal relation into questions about start time, end time, and duration. In addition, following HiEve (Glavaš et al., 2014), we define the hierarchical relation as spatial-temporal containment. Thus a necessary condition for a hierarchical relation to hold between e1 and e2 is that the time period of e1 contains that of e2. This allows us to make decisions about temporal relations and hierarchical relations jointly using the three questions as shown in Table 1.
Table 1: Schema event-event relations, the correspondence to Allen's interval algebra, and the related relation questions. In the case of temporal overlap (last two rows), we refrain from adding an edge between the two events. Columns: Relation; Allen's base relations; e1 starts before e2?; e1 ends before e2?; Is the duration of e1 longer than e2?
For each question, to obtain the probability of the answers, we take the log probabilities of the top 5 tokens (footnote 6) in the vocabulary and check the probability predicted for the "yes", "no" and "unknown" tokens.
To handle the order sensitivity, we average the scores obtained from the different orderings ("Does e1 start before e2?" and "Does e2 start before e1?") and different prompts ("Does e1 start before e2?" and "Does e2 start after e1?").
After obtaining the response for start time, end time, and duration questions, we only keep edges that have scores higher than a certain threshold for all of the three questions.
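The scoring and thresholding steps can be sketched as below. The exact scoring formula is not spelled out in the text, so several choices here are assumptions: the "yes" probability is renormalized over the "yes"/"no"/"unknown" tokens found among the returned top tokens, a neutral 0.5 is used when none of them appears, and an answer to a reversed-order question contributes 1 minus its "yes" probability. The threshold values are those given in the implementation details (0.2 for start and end time, 0.7 for duration). All function names are hypothetical.

```python
import math

THRESHOLDS = {"start": 0.2, "end": 0.2, "duration": 0.7}


def yes_probability(top_logprobs):
    """P(yes) renormalized over the answer tokens among the top tokens.

    `top_logprobs` maps a token string to its log probability.
    """
    mass = {"yes": 0.0, "no": 0.0, "unknown": 0.0}
    for token, logprob in top_logprobs.items():
        key = token.strip().lower()
        if key in mass:
            mass[key] += math.exp(logprob)
    total = sum(mass.values())
    return mass["yes"] / total if total else 0.5  # neutral fallback (assumption)


def averaged_score(direct, flipped):
    """Average over phrasings that assert the relation (`direct`) and
    reversed-order phrasings for which "yes" contradicts it (`flipped`)."""
    scores = [yes_probability(lp) for lp in direct]
    scores += [1.0 - yes_probability(lp) for lp in flipped]
    return sum(scores) / len(scores)


def keep_edge(scores):
    """Keep an edge only if all three question scores exceed their thresholds."""
    return all(scores[q] > THRESHOLDS[q] for q in THRESHOLDS)


start_score = averaged_score(
    direct=[{"yes": math.log(0.8), "no": math.log(0.2)}],
    flipped=[{"yes": math.log(0.4), "no": math.log(0.6)}],
)
```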
Since our temporal edges were only scored based on the descriptions of the event pair, we need to remove loops consisting of more than 2 events, ideally with minimal changes, to maintain global consistency. This problem is equivalent to finding the minimum feedback arc set, which is known to be NP-hard. We adopt the greedy algorithm of Eades et al. (1993), using the previously predicted probabilities as edge weights, to obtain a node ordering. Based on this ordering we can keep all edges directionally consistent. The detailed algorithm is provided in Appendix B. Finally, to simplify the schema, we perform transitive reduction on the temporal and hierarchy edges respectively.
6 At the time of writing, OpenAI API only supports returning the log probability of a maximum of 5 tokens.
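The final transitive reduction step can be illustrated with a short sketch. It assumes the edge set is already acyclic (as it is after loop removal) and would be applied separately to each edge type; the function name and the pair-based edge representation are choices made for this example.

```python
def transitive_reduction(edges):
    """Drop every edge (u, v) of a DAG for which v stays reachable from u
    through some other path."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)

    def reachable(src, dst, skip):
        # Depth-first search that ignores the edge being tested.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            for nxt in succ.get(node, ()):
                if (node, nxt) == skip or nxt in seen:
                    continue
                if nxt == dst:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return {(u, v) for u, v in edges if not reachable(u, v, skip=(u, v))}


reduced = transitive_reduction({("a", "b"), ("b", "c"), ("a", "c")})
```

Here the edge from "a" to "c" is implied by the other two and is removed, which keeps the schema graph easier to read.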

Experiments
We design our experiments based on the following three research questions:
Q1: Hierarchical Schema Quality. Can our model produce high-quality event graph schemas with both temporal and hierarchical relations?
Q2: Interpretability. Is our model's output more interpretable than prior instance-based schema induction methods?
Q3: Model Generalization. Can our model also be applied to everyday scenarios as in (Sakaguchi et al., 2021)?

Dataset
RESIN-11 (Du et al., 2022) is a schema library targeted at 11 newsworthy scenarios and includes both temporal and hierarchical relations between events. However, RESIN-11 is still quite heavily focused on attack and disaster-related scenarios, so we expand the coverage and create a new Open-Domain Newswire schema library, ODIN, which consists of 18 new scenarios, including coup, investment, and health care. The complete list of scenarios is in Appendix C.
Upon selecting the scenarios, we collected related documents from Wikipedia (following the procedure described in Section 3.1) and created the ground truth reference schemas by asking human annotators to curate the schemas generated by our algorithm with reference to the news reports of event instances. Human annotators used a schema visualization tool to help visualize the graph structure while performing curation. Curators were encouraged to (1) add or remove events; (2) change the event names and descriptions; (3) change the temporal ordering between events; and (4) change the hierarchical relation between events. After the curation, the schemas were examined by linguistic experts. We present the statistics of ODIN along with RESIN-11 and ProScript in Table 2.

Evaluation Metrics
For automatic evaluation of the schema quality against human-created schemas, we adopt Event F1 and Relation F1 metrics. Event F1 is similar to the Event Match metric proposed in (Li et al., 2021), but since here we are generating event descriptions instead of performing classification over a fixed set of event types, we first compute the similarity score s between each generated event description and ground truth event description using cosine similarity of SBERT embeddings (Reimers and Gurevych, 2019). Then we find the maximum weight matching assignment ϕ between the predicted events Ê and the ground truth events E by treating it as an assignment problem on a bipartite graph.
Based on the event mapping ϕ, we further define Relation F1 metrics for temporal relations and hierarchical relations respectively. Note that this metric only applies to events that have a mapping.
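One plausible reading of Relation F1 is sketched below. The precise definition is not given in the text, so this is an assumption: predicted edges are translated to ground truth event ids through ϕ, and both predicted and ground truth edges are restricted to events that take part in the mapping. The function name relation_f1 and the id strings are hypothetical.

```python
def relation_f1(pred_edges, gold_edges, phi):
    """F1 over relation edges, given a mapping `phi` from predicted event ids
    to matched ground truth event ids."""
    matched_gold = set(phi.values())
    # Only relations whose two endpoints both have a mapping are scored.
    pred = {(phi[a], phi[b]) for a, b in pred_edges if a in phi and b in phi}
    gold = {(a, b) for a, b in gold_edges if a in matched_gold and b in matched_gold}
    correct = len(pred & gold)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


phi = {"p1": "g1", "p2": "g2", "p3": "g3"}
score = relation_f1(
    pred_edges=[("p1", "p2"), ("p2", "p3"), ("p3", "p4")],
    gold_edges=[("g1", "g2"), ("g1", "g3"), ("g3", "g9")],
    phi=phi,
)
```

The same function would be called once with temporal edges and once with hierarchical edges to obtain the two Relation F1 scores.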

Implementation Details
For both our model and the baseline, we use the GPT3 model text-davinci-003 through the OpenAI API. We set the temperature to 0.7 and top_p to 0.95.
For INCSCHEMA we set the minimum number of events within a chapter to be 3 and the maximal number of events to be 10. During the event skeleton construction stage, if the response contains fewer than 3 sentences, we re-sample the response. Once the number of events within a chapter reaches the maximal limit, we do not add any more new events through the event expansion stage.
We set the threshold for the duplication test to be 0.9 for Jaro-Winkler string similarity or 0.85 for cosine similarity between SBERT embeddings. For the shorter event name, we also check if the Levenshtein edit distance is less than 3. For the event-event relation verification, we set the threshold for the start time and end time questions to be 0.2 and the threshold for the duration question to be 0.7.

Q1: Hierarchical Schema Quality
We test our algorithm's ability to induce complex hierarchical schemas for news scenarios in RESIN-11 (Du et al., 2022) and our new Open-Domain Newswire schema library ODIN. We compare our model against a different prompt formulation method using the DOT graph description language as proposed by (Sakaguchi et al., 2021) (GPT-DOT). This method requires the LLM to generate all events and event-event relations in a single pass. To inform the model of the DOT language format, we use one in-context example converted from the Chemical Spill ground truth schema (the prompt is shown in Appendix D). During inference, we input the scenario name and the chapter structure.
We show our results on the RESIN-11 dataset in Table 3 and the results for ODIN in Table 4. Compared to our incremental prompting procedure, GPT-DOT generally outputs fewer events (10.11 events for GPT-DOT vs. 52.6 events for INCSCHEMA on ODIN), which leads to high precision but low recall. While the generated events from GPT-DOT are still reasonable, the real deficiency of this formulation is its inability to identify hierarchical relations, especially when hierarchical relations co-exist with temporal relations.
To test if using an in-context learning prompt is the reason for low performance, we also experiment with an instruction-style prompt (GPT-DOT-Instruct) that explains the task and output format in detail and a step-by-step reasoning prompt (GPT-DOT-StepByStep) that allows the model to output parts of the schema separately, which we then merge. For the ODIN dataset, we find that the different prompt styles do not vary much except for improved temporal relation F1 when we use the step-by-step formulation.
Compared with the variants of our model, we can see that the retrieval component helps improve event generation quality and the question decomposition strategy can greatly improve temporal relation F1.
Since RESIN-11 schemas were created without referencing any automatic results, the scores on RESIN-11 are generally lower than those on ODIN. However, on both datasets, our method generally outperforms GPT-DOT.

Q2: Schema Interpretability
To be able to compare our schemas side-by-side with previous work that assumed a limited ontology, we conduct a human evaluation that focuses on the interpretability of the induced schemas.
Human assessors are presented with the scenario name and a subgraph from the schema induction algorithm's output. We then ask the assessor to write a coherent short story by looking at the graph and indicate which events were included in their story. An example of the subschema and the story is shown in Figure 3. After they complete the story writing task, they will be asked to rate their experience from several aspects on a 5-point Likert scale. The human assessment interface is shown in Appendix F.
We compare against the state-of-the-art closed-domain schema induction method DoubleGAE. DoubleGAE is an example of the instance-based methods that rely on IE: the nodes in the schema graph are typed instead of described with text.
In Table 5 we show the results for the story writing task. We observe that human assessors are able to compose a longer story with higher event coverage when presented with our schemas while taking roughly the same time. In the post-task questionnaire, as shown in Figure 4, the human assessors on average strongly agreed that the event names and event descriptions produced by our model were helpful and thought that our schemas were easier to understand compared to the baseline (4.50 vs 3.20 points). Both schemas contained events that were highly relevant to the scenario and the temporal ordering in the schemas was mostly correct.

Q3: Model Generalization
For this experiment, we use ProScript (Sakaguchi et al., 2021) as our dataset. Schemas in ProScript are typically short (5.45 events on average) and describe everyday scenarios. ProScript schemas only contain temporal relations and have no chapter structure, so we include the first two events (by topological sorting) as part of the prompt. We show the results in Table 6.
For our algorithm INCSCHEMA we omitted the event expansion stage since the event skeleton construction stage already generated enough events. In the event-event relation verification stage, we continue to add temporal relations among events based on their verification score beyond the threshold until the graph is connected. On these small-scale schemas with only temporal relations, we see that directly generating the full schema and incrementally prompting the schema lead to comparable results. This shows that GPT3 can indeed understand and generate valid DOT graph description language, and that the gap we observe in Table 4 is mainly due to the difficulty of capturing long-range dependencies in large schemas and the confusion between temporal and hierarchical relations.
Figure 3: One example response from the schema interpretability human assessment. On the left we show the subevents of the Criminal Investigation chapter produced by our model. On the right is the human-written story describing the schema. We highlight the events that match the chapter in blue, events that appear in the schema in red and additional events in pink.
Figure 3 (task prompt): This is part of a schema about the Criminal Investigation of a car bombing event. Please write a story describing the figure:
Figure 3 (example story): The authorities were on high alert after reports of a car bombing in the city. They immediately began their investigation, determined to apprehend the bomber and bring them to justice. They interviewed witnesses and gathered evidence, piecing together the details of the heinous act. Their hard work paid off when they were able to track down and apprehend the bomber. During questioning, the bomber shocked investigators by confessing to the crime, sparing them the trouble of having to build a case against him. The bomber cited his motives for the attack as a twisted desire for public shaming. With the bomber in custody and a confession in hand, the authorities were able to swiftly bring him to trial. The bomber was found guilty and given a harsh prison sentence, ensuring he would spend a long time behind bars for his terrible crime. ...

Related Work
Event Schema Induction. Event schema induction, or script induction, is the task of inducing typical event-event relation structures for given scenarios/situations. A large fraction of work considers event schemas as narrative chains (Jurafsky, 2008, 2009; Jans et al., 2012; Mooney, 2014, 2016; Rudinger et al., 2015a; Ahrendt and Demberg, 2016; Granroth-Wilding and Clark, 2016; Wang et al., 2017; Weber et al., 2018), limiting the structure to include only sequential temporal relations. More recently, non-sequential, partially ordered temporal relations have been taken into consideration (Li et al., 2018, 2021; Sakaguchi et al., 2021), but these works do not consider the different scales of events and the potential hierarchical relations. In terms of schema expressiveness, the most similar work to ours also considers both partial temporal order and hierarchical relations. Our work also resembles a line of recent research on inducing schema knowledge from pre-trained language models. Our schema induction process can be seen as a super-set of the post-processing in (Sancheti and Rudinger, 2022), which comprises irrelevant event removal, de-duplication, and temporal relation correction. We compare our incremental prompting approach with the end-to-end approach proposed in (Sakaguchi et al., 2021) in Section 4. Work that uses LLMs for data generation instead of probing for schema knowledge is orthogonal to ours.
Language Model Prompting. Prompting has been the major method of interaction with billion-scale language models (Brown et al., 2020; Rae et al., 2021; Wei et al., 2022; Chowdhery et al., 2022). Prompting can be used to inform the model of the task instructions (Wei et al., 2022), provide the model with task input/output examples (Brown et al., 2020), or guide the model with explanations (Lampinen et al., 2022) and reasoning paths. In this work, we explore how a complex knowledge structure such as an event graph schema can be induced using LLMs by decomposing the task through incremental prompting.

Conclusions and Future Work
Prior work on schema induction has either relied on existing information extraction pipelines to convert unstructured documents into event graphs, or required massive human effort in annotating event schemas. We propose to view schemas as a type of event-oriented commonsense knowledge that can be implicitly learned by large language models. However, since schemas are complex graph structures, instead of directly querying for schemas, we design an incremental prompting and verification framework INCSCHEMA to decompose the schema induction task into a series of simple questions. As a result, our model is applicable to the open domain and can jointly induce temporal and hierarchical relations between events.
For future work, we plan to cover more aspects of schemas, including accounting for entity coreference, entity relations and entity attributes. While this work is focused on the task of schema induction, we hope to show the possibility of using LLMs for constructing complex knowledge structures.

Limitations
The event schemas generated by our model are not directly comparable to those generated by previous work that utilized a closed-domain ontology. As a result, we were unable to adopt the same metrics and evaluate our schemas on type-level event prediction tasks as in (Li et al., 2021). Grounding the events generated by the LLM into one of the types in the ontology could be added as a post-processing step to our model, but this would require some ontology-specific training data, which goes against our principles of designing an open-domain, portable framework.
Our event schema does not explicitly represent entity coreference, entity relations, and entity attributes. The current schemas that we produce focus on events and their relations, with entity information captured implicitly through the event descriptions. For instance, the See Medical Professional event is described as "The patient is seen by a doctor or other medical professional" and the subsequent Obtain Medical History event is described as "The medical professional obtains a medical history from the patient". The "medical professional" and "patient" are implied to be coreferential entities in this case, but not explicitly connected in the schema graph.
Our approach is also quite distinct from prior work (Rudinger et al., 2015b; Wang et al., 2017; Li et al., 2021) that considers a probabilistic model as an implicit schema from which the schema graph or event narrative chain can be sampled. Probabilistic schema models have the advantage of being adaptive and can be conditioned on partially observed event sequences, but are hard to interpret. We make the conscious design decision to generate explicit, human-readable schema graphs instead of black-box schema models.
Finally, our model relies on the usage of LMs, which have been observed to sometimes show inconsistent behavior between different runs or when using different prompts with the same meaning (Elazar et al., 2021; Zhou et al., 2022a). However, quantification of consistency has only been done for factual probing tasks, while schema generation is a more open-ended task. For example, in our experiments on everyday scenarios, we observe that the model could generate distinct schemas for Buying a (computer) mouse based on whether the purchase was done online or in person. This variance is often benign and we leave it to future work to take advantage of such variance and possibly aggregate results over multiple runs.

A Retrieval Component
To build our document collection, we first search for the scenario name on Wikipedia and find its corresponding category page. Then, for each Wikipedia article listed under the category, we follow the external reference links to news sources under the Wikipedia pages to retrieve the original news articles. We only keep English articles and filter out articles that have fewer than 4 sentences. Then we split the articles into overlapping segments of 5 sentences with 1 sentence overlap for indexing.
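The segmentation step can be sketched as a sliding window over sentences. This example assumes that sentence splitting has already been done upstream and that the article is given as a list of sentence strings; the function name segment_article is hypothetical.

```python
def segment_article(sentences, size=5, overlap=1, min_sentences=4):
    """Split one article into overlapping segments of `size` sentences."""
    if len(sentences) < min_sentences:
        return []  # articles with fewer than 4 sentences are filtered out
    step = size - overlap
    segments = []
    for start in range(0, len(sentences), step):
        segments.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # the last window already reaches the end of the article
    return segments


article = [f"Sentence {i}." for i in range(9)]
segments = segment_article(article)
```

With 9 sentences this yields two segments, and the sentence at index 4 appears in both, which is the 1-sentence overlap.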
Our retrieval model is based on TCT-ColBERT (Lin et al., 2021b); specifically, we use the implementation provided by Pyserini (Lin et al., 2021a), pretrained on the MSMARCO dataset (Campos et al., 2016). TCT-ColBERT is a distillation of ColBERT (Khattab and Zaharia, 2020), which is a late-interaction bi-encoder model. It encodes the query and the document separately into multiple vectors offline and then employs an interaction step to compute their similarity.
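To make the interaction step concrete, the sketch below computes a ColBERT-style late-interaction (MaxSim) score from per-token vectors. It is only an illustration of the scoring idea under the assumption of dot-product similarity; it does not call Pyserini, the function name `maxsim_score` is hypothetical, and the distilled TCT-ColBERT model used in our pipeline may compute similarity differently.

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction (illustrative sketch).

    For each query token vector, take the maximum dot product over all
    document token vectors, then sum these maxima over the query tokens.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


if __name__ == "__main__":
    # Toy 2-dimensional token embeddings (made up for illustration).
    query = [[1.0, 0.0], [0.0, 1.0]]
    document = [[1.0, 0.0], [0.5, 0.5]]
    print(maxsim_score(query, document))
```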

B Algorithm for Removing Temporal Loops
The key observation for finding the minimum feedback arc set is that the problem can be converted into finding an ordering $v_1 v_2 \cdots v_n$ of the vertices of graph $G$; all edges $(v_i, v_j)$ that violate this ordering by having $i > j$ are then feedback arcs.
To create a good ordering (with a small number of feedback arcs), we maintain two lists $s_1$ and $s_2$, which correspond to the head and tail of the vertex ordering. We first remove the source and sink nodes from the graph recursively by adding the source nodes to $s_1$ and the sink nodes to $s_2$.
For the remaining nodes, we compute a score $\delta(v)$ for each node, which is the total weight of its outgoing edges minus the total weight of its incoming edges. The node with the maximal $\delta(v)$ is then appended to the end of $s_1$ and removed from the graph. This step is also done recursively, and $\delta$ needs to be recomputed after removing nodes. Finally, the ordering is obtained by concatenating $s_1$ and $s_2$. The complete algorithm is shown in Algorithm 1.
This ordering divides the edges in graph $G$ into two sets: the set of edges $(v_i, v_j)$ that follow the ordering ($i < j$) and the set of edges that go against the ordering ($i > j$). The feedback arc set is whichever of these two sets has fewer edges.
Algorithm 1: A greedy algorithm for finding the minimal feedback arc set (Eades et al., 1993).
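The procedure described in this section can be sketched in Python as below. This is a minimal illustration written from the prose description, not a transcription of Algorithm 1: the function name `greedy_feedback_arc_set`, the dictionary-based edge representation, the alphabetical tie-breaking, and the assumption that the graph has no self-loops are all ours. For simplicity, the sketch returns only the edges that go against the computed ordering.

```python
def greedy_feedback_arc_set(nodes, edges):
    """Greedy vertex ordering in the style of Eades et al. (1993).

    nodes: iterable of sortable node names.
    edges: dict mapping (u, v) -> weight; self-loops are assumed absent.
    Returns (ordering, feedback_arcs), where feedback_arcs are the
    edges (u, v) with u placed after v in the ordering.
    """
    remaining = set(nodes)
    s1, s2 = [], []  # head and tail of the vertex ordering

    def has_out(v):
        return any(a == v and b in remaining for (a, b) in edges)

    def has_in(v):
        return any(b == v and a in remaining for (a, b) in edges)

    def delta(v):
        out_w = sum(w for (a, b), w in edges.items() if a == v and b in remaining)
        in_w = sum(w for (a, b), w in edges.items() if b == v and a in remaining)
        return out_w - in_w

    while remaining:
        # Repeatedly peel off sinks (to the front of s2) and sources (to the end of s1).
        progress = True
        while progress:
            progress = False
            for v in sorted(remaining):
                if not has_out(v):
                    s2.insert(0, v)
                    remaining.discard(v)
                    progress = True
                elif not has_in(v):
                    s1.append(v)
                    remaining.discard(v)
                    progress = True
        # Break a cycle: move the node with maximal delta to the end of s1.
        if remaining:
            best = max(sorted(remaining), key=delta)
            s1.append(best)
            remaining.discard(best)

    ordering = s1 + s2
    pos = {v: i for i, v in enumerate(ordering)}
    feedback = [(u, v) for (u, v) in edges if pos[u] > pos[v]]
    return ordering, feedback


if __name__ == "__main__":
    # Toy temporal graph with one low-confidence edge closing a loop.
    toy_edges = {("a", "b"): 0.9, ("b", "c"): 0.8, ("c", "a"): 0.2}
    print(greedy_feedback_arc_set(["a", "b", "c"], toy_edges))
```

In the toy graph, node `a` has the largest $\delta$ and is placed first, so the low-weight edge from `c` back to `a` is the one marked for removal.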

C List of Scenario Names
We show the complete list of scenarios in

Note that in the verification stage, we use the probabilities assigned to the "yes", "no", and "unknown" tokens instead of directly taking the generated text as the answer.
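One way to turn such token probabilities into a decision is sketched below. This is an assumption-laden illustration: the function name `verify_from_logprobs` is hypothetical, the input is assumed to be a dictionary of log-probabilities for the first answer token (as many LM APIs can return), and renormalizing over the three labels and taking the most probable one are our simplifications rather than the exact decision rule used in our system.

```python
import math


def verify_from_logprobs(token_logprobs, labels=("yes", "no", "unknown")):
    """Pick an answer from token log-probabilities (illustrative sketch).

    token_logprobs: dict mapping token string -> log-probability.
    Labels missing from the dict are treated as having probability 0.
    Returns (best_label, normalized_distribution_over_labels).
    """
    probs = {
        label: math.exp(token_logprobs.get(label, float("-inf")))
        for label in labels
    }
    total = sum(probs.values())
    if total == 0:
        raise ValueError("none of the answer tokens received probability mass")
    dist = {label: p / total for label, p in probs.items()}
    best = max(labels, key=lambda label: dist[label])
    return best, dist


if __name__ == "__main__":
    # Made-up log-probabilities for illustration only.
    example = {"yes": math.log(0.6), "no": math.log(0.3), "unknown": math.log(0.05)}
    print(verify_from_logprobs(example))
```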

D.2 In-Context Example for GPT3-DOT
The in-context example follows the DOT language specification (https://graphviz.org/doc/info/lang.html) to linearize a graph. Here we only list a few events and relations due to length considerations.

List relevant events and edges in "chemical spills":

E Schema Examples
We show an example of a schema generated by GPT3-DOT in Figure 5 and an example schema generated by INCSCHEMA in Figure 6. In the visualization, blue nodes are events with subevents (child nodes) and yellow nodes are primitive events (leaf nodes). Blue edges represent hierarchical relations and go from parent to child. Black edges represent temporal relations and go from the earlier event to the following event. Schemas generated by GPT3-DOT are typically much smaller in size and confuse hierarchical relations with temporal relations.

F Human Assessment Details
We designed and distributed our human assessment task with Qualtrics. We recruited 15 graduate students as our human assessors (all of whom were paid as research assistants). The assessors had basic knowledge of what a schema is, but were not involved in the development of our model. Assessors were informed of the purpose of the study. Before they began to work on the story-writing task, they were presented with task instructions (Figure 7) and an example response. We did not collect any personal identifiers during the assessment. The order of the schema graphs was randomized both in terms of the schema induction algorithm and the scenario. We show two screenshots of the interface in Figure 8 and Figure 9. Additionally, we show a figure of the schema generated by Double-GAE and a human response corresponding to the schema in Figure 10.

Figure 9: After the assessors write the story for the schema, they will be asked to choose which events were included in the story.
A car bomb was detonated by terrorists in a city. The explosion killed multiple people at the scene. Other people nearby quickly contacted a local hospital for emergency aid.
A local news agency also covered the incident and urged residents to offer help. A couple of ambulances as well as other concerned citizens helped to take injured victims to the hospital. At the same time, local police force quickly reacted to the incident and chased the suspects. The suspects resisted arrest and skirmished with the police. Eventually, some suspects were severely injured by police fire and the others surrendered to the police. This put an end to this heartbreaking tragedy.

Figure 10: An example schema from Double-GAE and human response for the interpretability assessment.