Human-in-the-loop Schema Induction

Schema induction builds a graph representation explaining how events unfold in a scenario. Existing approaches have been based on information retrieval (IR) and information extraction (IE), often with limited human curation. We demonstrate a human-in-the-loop schema induction system powered by GPT-3. We first describe the different modules of our system, including prompting to generate schematic elements, manual editing of those elements, and conversion of those elements into a schema graph. By qualitatively comparing our system to previous ones, we show that our system not only transfers to new domains more easily than previous approaches, but also reduces the effort of human curation thanks to our interactive interface.


Introduction
Event-centric natural language understanding (NLU) has become increasingly popular in recent years. Systems built from an event-centric perspective have produced impressive improvements in numerous tasks, including open-domain question answering (Yang et al., 2003), intent prediction (Rashkin et al., 2018), timeline construction (Do et al., 2012), text summarization (Daumé and Marcu, 2006), and misinformation detection (Fung et al., 2021). At the heart of event-centric NLU lie event schemas, abstract representations of how complex events typically unfold. The study of such representations dates back to the 1970s, when scripts were proposed as a series of sequential actions (Roger C. Schank, 1977). Back then, schemas were limited to linear, temporal ones. A more recent formulation of event schemas is a graph where the vertices are event flows and the edges are temporal or hierarchical relations between those events (Du et al., 2022). For example, as shown in Figure 1, the event schema for a "cyber attack" could include sub-events such as "gain access", "control system", "exfiltrate files", and "modify system logs". The schema would also include the relationships between these sub-events. For instance, the event "gain access" would take place before the event "modify system logs", since a person needs access to a system before modifying it. For the same reason, "exfiltrate data" would only take place after "gain access". Event schemas like this encode high-level knowledge about the world and allow artificial intelligence systems to reason about unseen events (Du et al., 2022).
The DARPA Knowledge-directed Artificial Intelligence Reasoning Over Schemas (KAIROS) program aims to develop schema-based AI systems that can identify, comprehend, and forecast complex events in a diverse set of domains. To enable such a system, scalable generation of high-quality event schemas is crucial. On the one hand, fully manual schema creation at a large scale can be inefficient, since people have diverse views about a given concept, leading to inconsistent schemas. On the other hand, fully automated systems are scalable but do not produce high-quality schemas. In fact, the majority of existing approaches under the KAIROS program are fully automated IR and IE systems over large collections of news articles (Li et al., 2020, 2021). Only limited human post-processing of schemas (Ciosici et al., 2021) has been explored. Further discussion of the advantages and limitations of existing systems can be found in Related Work.
Instead of focusing on fully automated schema induction systems, we propose a human-in-the-loop schema induction pipeline. Rather than using IR and IE over a large document collection, our system relies on pre-trained large language models (LLMs) and human intervention to jointly produce schemas. Our main motivation is that human-verified schemas are of higher quality, because human curation can filter out the failure cases of previous systems, such as incompleteness, instability, or poor domain transfer (Peng et al., 2019). With human curation, schemas are more reliable and accountable when applied to downstream tasks such as event prediction. This is significant if the downstream tasks involve safety-critical applications like epidemic prevention, where the quality of the schema matters beyond task performance numbers.
Figure 2 is a flowchart of our four-stage schema induction system: step generation, node extraction, graph construction, and node grounding. Each stage has two main components: the LLM (e.g. GPT-3) at the back-end to output predictions (the purple boxes in the figure) and an interactive interface at the front-end for human curation of the model output (the yellow boxes). The GPT-3 prompts used in each stage of the process are given in Appendix A, along with example inputs and outputs.
A more comprehensive description of the implementation and functionalities of our interface can be found in Section 4. A case study is given in Section 5. It walks through each step in our pipeline system under an example scenario, cyber attack. Also, in Section 5, we provide a qualitative evaluation of five example scenarios. The summary and discussion of our system are included in Section 6.

Schema Induction
Early work from Jurafsky (2008, 2009) automatically learned a schema from newswire text based on coreference and statistical probability models. Later, Peng and Roth (2016) and Peng et al. (2019) generated an event schema based on their proposed semantic language model (similar to an RNN structure). Their work represented the whole schema as a linear sequence of abstract verb senses like arrest.01 from VerbNet (Schuler, 2005). Those works had two main shortcomings. First, the schema was created for a single actor (protagonist), e.g. suspect, which limited coverage in more complex scenarios, e.g. business change-acquisition. Second, the generated schema, a simple linear sequence, failed to consider alternatives such as XOR.
More recently, Li et al. (2020, 2021) used transformers to handle schema generation in complex scenarios. They viewed a schema as a graph instead of a linear sequence. However, this approach was unable to transfer to new domains where the supervised event retrieval and extraction model failed. Another approach used GPT-3-generated documents to build a schema. Although it bypassed the event retrieval and extraction process and solved the domain transfer problem, it suffered from the incompleteness and instability of GPT-3 outputs. Currently, these approaches neither offer a perfect solution for schema induction without manual post-processing nor provide a timely human correction system (Du et al., 2022). Our demonstration system provides a curation interface that can generate a comprehensive schema easily with a human curator in the loop. The curated data collected through our tool could be useful for fine-tuning and improving the models.

Human-in-the-loop Schema Curation Interface
Another area related to our work is human-in-the-loop schema generation, where annotators collaborate with computational models to create high-quality event schemas. In this field, one of the closest approaches is Machine-Assisted Script Curation (Ciosici et al., 2021), created for script induction. With a fully interactive interface, they have shown the feasibility of real-time interaction between humans and pre-trained LLMs (e.g. GPT-2 or GPT-3). The main differences from our work are the level of automation and the adaptability to other generative models. In terms of automation, our interface uses pre-trained LLMs to automatically generate schema content, whereas their interface relies largely on human input. In terms of adaptability, our interface supports the curation of schemas generated by different language models (e.g. GPT-3 models of different sizes), which makes it possible for users to evaluate the generations of different models. Their interface offers no such possibility. Another interface built for schema curation focuses on visualization of the schema structure, such as the temporal relations between event nodes and the internal relations among entities (Mishra et al., 2021). While that interface provides a user-friendly experience for schema graph curation, it requires the user to supply the content of event schemas in JSON format, which takes much more human effort than our interface. In addition, our interface provides an optional grounding function after the event graph curation step, which is not available in that interface.

Terminology and Problem Definition
Our work focuses on efficiently building a schema graph of a scenario using both LLMs and human input. Following the workflow of our system (see Figure 2), a scenario is a general event type for which an interested party will build the schema, e.g. a 'disease outbreak'. Steps are a list of sub-events generated by GPT-3 according to a prompt in the step generation stage. Each step can be a phrase or a short sentence, such as 'spread to other areas'. Nodes or tuples are subject-verb-object tuples extracted from steps at the node extraction stage, such as '(disease, spread, to other area)'. Graphs are a visualization of the schema, whose edges joining the nodes represent temporal and hierarchical relations.
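The terms above can be pictured as a small set of data structures. The following is a minimal sketch only: the class names `Node` and `Schema`, their fields, and the edge encoding are illustrative assumptions, not the system's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    """A subject-verb-object tuple extracted from a step (hypothetical class)."""
    subject: str
    verb: str
    obj: str

    def text(self) -> str:
        # Concatenate the non-empty parts into one piece of text.
        return " ".join(p for p in (self.subject, self.verb, self.obj) if p)


@dataclass
class Schema:
    """Container for one scenario's steps, nodes, and edges (hypothetical class)."""
    scenario: str
    steps: List[str] = field(default_factory=list)
    nodes: List[Node] = field(default_factory=list)
    # (source, target, relation) triples; relation is assumed to be
    # either "temporal" or "hierarchical".
    edges: List[Tuple[Node, Node, str]] = field(default_factory=list)


# The running example from the text.
schema = Schema(scenario="disease outbreak")
schema.steps.append("spread to other areas")
node = Node("disease", "spread", "to other area")
schema.nodes.append(node)
```

Keeping steps, nodes, and edges in one container mirrors the stage order of the pipeline: each stage fills in the next field, and a curator can edit any of them.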

Implementation
Our pipeline system contains four sequential stages: step generation, node extraction, graph construction, and node grounding. A flowchart of the interface system is shown in Figure 2. The step generation stage generates steps for a scenario, and the user can specify how many steps to generate. The node extraction stage extracts nodes (subject-verb-object tuples) from the verbose steps of the previous stage. The graph construction stage orders the extracted nodes temporally and hierarchically; the nodes can still be modified at this stage. The node grounding stage maps node text to a node in the XPO ontology (Elizabeth Spaulding et al., In preparation), which is derived from WikiData. The flexible interface system allows users either to go through the entire process to create a schema from scratch or to start directly at any stage to edit the model's prediction. In addition, the back-end GPT-3 models can be replaced by other models pre-trained by the user if deployed locally.

Step Generation
The step generation stage generates steps for a given scenario. At the back-end, zero-shot GPT-3 incorporates a user's input into a prompt and generates ordered steps. The interface allows users to generate steps quickly with prompt templates or to refine the generated steps with user-designed prompts. A typical use case for user-designed prompts is to expand a certain step into more detailed steps. For instance, a template prompt "List the steps involved in {disease outbreak}:" may create steps such as "1. Identify the symptoms of the disease; 2. Collect data from affected individuals; ...". Then, the user can re-prompt for, e.g., the second step, "List the steps involved {step2} in detail:". Additionally, users can modify and select GPT-3-generated steps easily by clicking on them. When the 'save' button is clicked, all user-selected steps are saved in the database for use in the node extraction stage or for further fine-tuning of the step generation model. A screenshot of the step generation interface with the user's operations can be seen in Figure 3.
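The template-based prompting described above can be sketched as follows. This is a hedged illustration, not the system's code: the function names, the `{scenario}` slot name, and the parsing of numbered output are assumptions, and the language model is replaced by a stub (`fake_complete`) so the example runs without any API access.

```python
import re
from typing import Callable, List

# Template modeled on the example prompt in the text; the slot name is an assumption.
SUBEVENT_TEMPLATE = "List the steps involved in {scenario}:"


def build_prompt(scenario: str, template: str = SUBEVENT_TEMPLATE) -> str:
    """Fill the scenario into a prompt template."""
    return template.format(scenario=scenario)


def parse_steps(completion: str) -> List[str]:
    """Split a numbered completion ("1. ...; 2. ..." or one item per line) into steps."""
    parts = re.split(r"(?:^|[;\n])\s*\d+\.\s*", completion)
    return [p.strip() for p in parts if p.strip()]


def generate_steps(scenario: str, complete: Callable[[str], str]) -> List[str]:
    """Prompt a completion function (any callable str -> str) and parse its output."""
    return parse_steps(complete(build_prompt(scenario)))


def fake_complete(prompt: str) -> str:
    # Stand-in for a language model call; returns the example output from the text.
    return ("1. Identify the symptoms of the disease; "
            "2. Collect data from affected individuals")


steps = generate_steps("disease outbreak", fake_complete)
```

Passing the completion function as an argument reflects the point that the back-end model is replaceable: any model that maps a prompt string to a text completion could be plugged in.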

Node Extraction
Nodes are structured representations of events in the form of a {subject, verb, object} tuple. Node extraction derives these nodes from the GPT-3-generated steps saved in the step generation stage, which are unstructured sentences.
Users can choose between two methods to extract nodes, based on AllenNLP (Shi and Lin, 2019) or on GPT-3. The former uses AllenNLP's Semantic Role Labelling (SRL) model to extract nodes from the steps. The SRL model implements a BERT (Devlin et al., 2018) sequence prediction model to identify the predicates and the arguments (e.g. A0, A1) in a text. We simply choose the identified A0 as the subject, A1 as the object, and the predicate as the verb to form a node. Optionally, AllenNLP's SpanBERT-based coreference resolution model can be used to resolve entities referenced across different steps. Here, we concatenate all the steps and replace each pronoun with its referenced entity (noun) in the original steps.
The GPT-3 node extraction method uses instructional few-shot prompting to extract {subject, verb, object} tuples from the steps. Several example sentences are given to show GPT-3 the expected syntactic and semantic output. Following a prior recommendation for few-shot design, we include context examples that are semantically similar to the KAIROS application environment (daily life and news). See Appendix A for our few-shot prompts.
The extracted nodes are shown to the user in a table with three columns (subject, verb, object). For example, for "The CDC collects and analyzes data on disease outbreaks", one of the extracted nodes is "The CDC (subject) collects (verb) data (object)". Users can select and edit nodes (tuples). User edits are saved and used for graph construction and for fine-tuning the GPT-3 node extraction model.
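The mapping from SRL roles to a node (A0 as subject, predicate as verb, A1 as object) can be sketched as below. This is an illustration under stated assumptions: the input is taken to be a word list with one BIO tag per word for a single predicate, with roles spelled `ARG0`, `V`, and `ARG1`; the tags in the example are written by hand to reproduce the CDC example and are not the output of an actual SRL model.

```python
from typing import Dict, List, Tuple


def spans_from_bio(words: List[str], tags: List[str]) -> Dict[str, str]:
    """Group words by role label, given one BIO tag per word."""
    spans: Dict[str, List[str]] = {}
    for word, tag in zip(words, tags):
        if tag == "O":
            continue
        label = tag.split("-", 1)[1]  # strip the "B-" / "I-" prefix
        spans.setdefault(label, []).append(word)
    return {label: " ".join(ws) for label, ws in spans.items()}


def node_from_srl(words: List[str], tags: List[str]) -> Tuple[str, str, str]:
    """Form a (subject, verb, object) node: ARG0 -> subject, V -> verb, ARG1 -> object."""
    roles = spans_from_bio(words, tags)
    return (roles.get("ARG0", ""), roles.get("V", ""), roles.get("ARG1", ""))


# Hand-written tags for the predicate "collects" (assumed, for illustration only).
words = "The CDC collects and analyzes data on disease outbreaks".split()
tags = ["B-ARG0", "I-ARG0", "B-V", "O", "O", "B-ARG1", "O", "O", "O"]
cdc_node = node_from_srl(words, tags)
```

A missing role is returned as an empty string, which leaves the corresponding table cell blank for the curator to fill in.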

Graph Construction
In the graph construction stage, our system automatically adds temporal and hierarchical edges to the previously extracted nodes. The edges are created using zero-shot GPT-3 with multiple-choice questions. For each pair of nodes, GPT-3 is instructed to choose between 'Before', 'After', 'Same time', or 'no relation' for temporal edges, and between 'Parent', 'Child', or 'no relation' for hierarchical edges. For example, for the node pair "collect data" and "identify the signs and symptoms", GPT-3 predicts 'After' for temporal order and 'no relation' for hierarchical order, in which case we add a temporal edge from "identify the signs and symptoms" to "collect data", and no hierarchical edge is created. If a conflict occurs between the (node1, node2) pair and the (node2, node1) pair, e.g. 'After' and 'After' for temporal order or 'Parent' and 'Parent' for hierarchical order, we treat it as no relation to resolve the conflict, thus adding no new edges to the graph.
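The conflict rule for temporal edges can be sketched as follows. This is a hedged reading, not the system's implementation: the function name and the `MIRROR` table are hypothetical, the sketch treats any pair of answers that are not exact mirrors of each other as a conflict (the text only gives 'After' and 'After' as an example), and it adds no directed edge for 'Same time'. Hierarchical edges could be handled analogously with 'Parent' and 'Child'.

```python
from typing import Optional, Tuple

# The answer expected for (node2, node1) given the answer for (node1, node2).
MIRROR = {
    "Before": "After",
    "After": "Before",
    "Same time": "Same time",
    "no relation": "no relation",
}


def temporal_edge(node1: str, node2: str,
                  pred_12: str, pred_21: str) -> Optional[Tuple[str, str]]:
    """Return a directed (earlier, later) edge, or None if no edge should be added.

    pred_12 is the model's answer for the pair (node1, node2);
    pred_21 is its answer for the reversed pair (node2, node1).
    """
    if MIRROR.get(pred_12) != pred_21:
        return None  # conflicting answers are treated as no relation
    if pred_12 == "Before":
        return (node1, node2)
    if pred_12 == "After":
        return (node2, node1)
    return None  # "Same time" or "no relation": no directed edge in this sketch


edge = temporal_edge("collect data", "identify the signs and symptoms",
                     "After", "Before")
```

Querying both orders of each pair doubles the number of model calls, but it gives a cheap consistency check before any edge reaches the curator.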
The graph construction interface allows users to modify the GPT-3-generated schema with ease. After predicting both temporal and hierarchical relations between all pairs of nodes, the interface displays the graph via the Vis-network framework. The interface supports adding, editing, and deleting graph nodes and edges. When the user clicks on a node, detailed information including the ID and description of the node is shown, along with buttons to delete or edit the node. By clicking on an edge, users can modify the edge type or delete the edge. Users can create a new node by double-clicking and a new edge by dragging and dropping an arrow between two nodes. A screenshot of our graph construction interface can be seen in Figure 4.

Node Grounding
Although a schema (graph) is complete after the previous stages, some nodes may express the same semantic information, e.g., "refugees flee" and "refugees ran away". To ensure the reliability and comparability of created schemas, our system grounds the nodes to an ontology, namely the XPO ontology, in the last stage. Each node in the XPO ontology contains a unique node ID, a node name, a concise description (definition), and a list of similar nodes. Our system offers two ways of grounding: "name inference grounding" and "name similarity grounding". Name inference grounding maps the schema nodes to XPO nodes by predicting the XPO node's name; name similarity grounding finds XPO nodes by comparing the similarity between the embeddings of a schema node and an XPO node's name.
In name inference grounding, given a graph node, our system first uses few-shot GPT-3 to deduce a list of possible XPO names (see the few-shot prompt example in Appendix A). Then, the candidate XPO names are post-processed by dropping wrong predictions and adding XPO names similar to each correct prediction. After that, each possible XPO name is checked for entailment with the original graph node. The entailment model is a BART-large model fine-tuned on the MNLI dataset (Lewis et al., 2020; Williams et al., 2018). The input is the original graph node as the premise and the possible XPO name as the hypothesis, and the output is the entailment score. We sort the possible XPO node names by their entailment scores. Users can view and choose from the top-k suggested XPO nodes for the grounding of the original graph node. In name similarity grounding, the top-k related XPO nodes are retrieved by computing the cosine similarity between the GloVe embeddings (Pennington et al., 2014) of the graph node and of each XPO node name. The two methods are complementary, especially when users cannot find the expected XPO nodes with one method. Human-curated data is saved in the back-end database. A screenshot of node grounding can be seen in Figure 5.
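Name similarity grounding amounts to a nearest-neighbor search over name embeddings. The sketch below is illustrative only: the three-dimensional vectors are made up rather than real GloVe vectors, averaging word vectors to embed a phrase is an assumed pooling choice that the text does not specify, and the candidate list is a toy stand-in for the ontology.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is empty or all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embed(text: str, vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of known words (assumed pooling); unknown words are skipped."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return []
    return [sum(vec[i] for vec in known) / len(known) for i in range(len(known[0]))]


def top_k(node_text: str, names: List[str],
          vectors: Dict[str, List[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Rank candidate names by cosine similarity to the node text and keep the top k."""
    query = embed(node_text, vectors)
    scored = [(name, cosine(query, embed(name, vectors))) for name in names]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up toy vectors and candidates, for illustration only.
toy_vectors = {
    "attacker": [0.9, 0.1, 0.0], "access": [0.8, 0.2, 0.1],
    "system": [0.7, 0.3, 0.0], "computer": [0.75, 0.25, 0.05],
    "monitoring": [0.5, 0.5, 0.0], "eat": [0.0, 0.1, 0.9], "food": [0.1, 0.0, 0.9],
}
candidates = ["eat food", "computer monitoring", "access"]
ranked = top_k("cyber attacker access system", candidates, toy_vectors, k=3)
```

With these toy vectors, "access" ranks first and "eat food" last; a curator would then pick the appropriate entry from the ranked list.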

A Case Study
In this section, we walk through the whole process of creating a toy schema with our interface. The toy schema is much simpler than a fully developed schema. We assume the scenario is 'cyber attack'.
In the step generation stage, users can form a prompt from templates such as "list the steps involved in a cyber attack", with 'cyber attack' as the name and sub-event as the prompt type. Then, GPT-3 generates five steps, for example, "1. A cyber attacker gains initial access to a system" and "5. The attacker exfiltrates data from the compromised system." Users can modify the content and choose steps to save. For example, one may change the first step to "1. A cyber attacker access a system." and save the step. See Figure 3 for a screenshot of the five steps. Next, in the node extraction stage, GPT-3 is prompted to extract nodes from the selected steps. For example, GPT-3 outputs {cyber attacker, access, system} for the first step. The user can change the outputs to correct any mistakes. In this example, we extract four nodes: {cyber attacker, access, system}, {attacker, enumerate, system information and user account}, {attacker, escalates, privileges}, and {attacker, exfiltrate, data}. We then concatenate each {subject, verb, object} tuple into a piece of text that serves as a node in the next stage.
Thereafter, in the graph construction stage, we prompt GPT-3 to automatically build linear temporal edges over the above four nodes, which users can modify. We manually add a scenario node 'cyber attack' and link it to the four existing nodes through hierarchical edges. See Figure 4 for a screenshot of the graph.
Finally, we can optionally ground our graph node into the XPO ontology. For example, the node "cyber attacker access system" can be mapped to choices of 'access', 'computer monitoring', 'remote communicating' using name similarity grounding. In this case, we don't get any results

User Evaluation
We followed the evaluation methodology used by Ciosici et al. (2021), with slight modifications, to assess our system. The evaluation was done by NLP researchers who have experience in writing event schemas by hand but had not used the interface before. In the step generation and node extraction stages, we report accuracy as the number of human-selected steps or nodes out of the total number of machine-generated results. For simplicity, we ignore users' modifications (e.g. rephrasing) at this point. In the graph construction stage, we measure how many nodes and edges are modified (added or deleted) using graph edit distance. In the grounding stage, the success rate is measured as the successful retrieval of at least one relevant XPO node within the top-3 grounding results for a given event node. We also ask users to self-report their total interaction time. For all the evaluations, we use the GPT-3 Davinci model as the language model.
We follow prior work and evaluate our system on five scenarios: Evacuation (EVC), Ordering Food in a Restaurant (FOD), Finding and Starting a New Job (JOB), Obtaining Medical Treatment (MED), and Corporate Merger or Acquisition (MRG). As shown in Table 1, our interactive system shows high accuracy in the step generation and node extraction phases, thanks to the rich world knowledge of LLMs. However, graph construction and node grounding require more human curation, due to the difficulty of event reasoning, such as understanding temporal and hierarchical relationships, and of retrieval from a large database. In those cases, we show that human curation can step in promptly and improve the quality of the event schema when LLM-based models make mistakes. In addition, our interface is easy to use, requiring much less time to complete each event schema task than previous work (Ciosici et al., 2021).
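The three measurements described in this section can be sketched as follows. This is an illustration under simplifying assumptions, not the evaluation script: the function names are hypothetical, and the graph edit distance is approximated by counting added and deleted nodes and edges as set differences, which ignores relabeling and graph isomorphism.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def selection_accuracy(generated: List[str], selected: Iterable[str]) -> float:
    """Fraction of machine-generated items that the curator kept."""
    if not generated:
        return 0.0
    kept = set(selected)
    return sum(1 for item in generated if item in kept) / len(generated)


def simple_graph_edit_distance(nodes_before: Iterable[str], edges_before: Iterable[Edge],
                               nodes_after: Iterable[str], edges_after: Iterable[Edge]) -> int:
    """Count nodes and edges added or deleted between two versions of a graph."""
    node_changes = set(nodes_before) ^ set(nodes_after)
    edge_changes = set(edges_before) ^ set(edges_after)
    return len(node_changes) + len(edge_changes)


def success_at_k(ranked: List[List[str]], relevant: List[Set[str]], k: int = 3) -> float:
    """Fraction of event nodes with at least one relevant result in the top k."""
    if not ranked:
        return 0.0
    hits = sum(1 for results, rel in zip(ranked, relevant)
               if any(r in rel for r in results[:k]))
    return hits / len(ranked)


# Toy inputs, for illustration only.
acc = selection_accuracy(["a", "b", "c", "d", "e"], ["a", "b", "c", "d"])
ged = simple_graph_edit_distance({"a", "b"}, {("a", "b")},
                                 {"a", "b", "c"}, {("a", "b"), ("c", "a")})
rate = success_at_k([["x", "y", "z", "w"], ["p", "q", "r"]], [{"w"}, {"q"}], k=3)
```

In the toy inputs, the curator keeps four of five generated items, adds one node and one edge to the graph, and finds a relevant grounding result in the top 3 for one of two event nodes.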
We also report a qualitative study of the types of human modifications made to the automated generations. At the step generation stage, GPT-3 is unlikely to make commonsense or grammar errors. However, if it is required to generate more steps, it may be susceptible to redundancy, such as "A does B" and "A finishes doing B", and the human curator removes these steps. For node extraction, results can be simplistic and ambiguous when the original sentence contains rich information, such as location, condition, or other modifiers. For example, given the step "waitress bring order to the kitchen", automatic node extraction produces "(waitress, bring, order)", and the human curator needs to add back some necessary components, e.g. the location information "kitchen" or the constraint "food order". Last, for graph construction, the generated graph is often linear, following the order of the previous nodes, so human effort plays an essential role in elaborating specific relations, including AND and OR. For example, "person updates the resume" and "person tailors the cover letter" are independent and can be concurrent rather than sequential.

Conclusion
Acknowledging that depending fully on human annotation is expensive and inefficient, while wholly automated generation can be unreliable, we propose a human-in-the-loop schema curation interface with pre-trained large language models (LLMs) as the backbone. We use LLMs to generate candidate components of a schema and involve humans as the final judge of both the content and the structure of the event schema. With empirical evaluations, we show that our system can efficiently produce human-validated event schemas with minor human effort.

Limitations
Our current approach has several limitations. First, our current system uses zero-shot or few-shot prompting of GPT-3 without any fine-tuning. In future work, we plan to fine-tune GPT-3 with the human-curated data that we collect. We expect that fine-tuning will improve our models' performance. It may also be possible to use human-curated data to train a policy network, as recommended by OpenAI (Ouyang et al., 2022). Second, we could replace GPT-3 with more robust task-specific models at some stages, e.g., a pre-trained model for predicting temporal and hierarchical orders. Third, some users suggested incorporating a graph view at the other three stages, which would help users generate content based on the current graph. We will include this graph view in our next version. Fourth, our current evaluation is experimental and probably subjective. As a next step, we will develop more robust evaluation metrics that compare manually written schemas, those of Ciosici et al. (2021), and ours, and test them on downstream tasks.

Ethics Statement
To our knowledge, our back-end GPT-3 model was trained mainly on English web data, so it may favor events that happen in an English-speaking environment. Furthermore, our tests showed that it generated events that specifically fit an American setting, for example, Miranda rights for arrests, and Democrats and Republicans in the United States for elections. These facts suggest that GPT-3 may ignore the knowledge of non-American cultures or minority groups. In addition, we currently only create schemas for scenarios that are reported in mainstream news media, e.g. conflict and communication. This excludes schemas from other domains, such as biology and medicine.