proScript: Partially Ordered Scripts Generation via Pre-trained Language Models

Scripts - standardized event sequences describing typical everyday activities - have been shown to help understand narratives by providing expectations, resolving ambiguity, and filling in unstated information. However, to date they have proved hard to author or extract from text. In this work, we demonstrate for the first time that pre-trained neural language models (LMs) can be finetuned to generate high-quality scripts, at varying levels of granularity, for a wide range of everyday scenarios (e.g., bake a cake). To do this, we collected a large (6.4k) crowdsourced dataset of partially ordered scripts (named proScript), which is substantially larger than prior datasets, and developed models that generate scripts by combining language generation and structure prediction. We define two complementary tasks: (i) edge prediction: given a scenario and unordered events, organize the events into a valid (possibly partial-order) script, and (ii) script generation: given only a scenario, generate events and organize them into a (possibly partial-order) script. Our experiments show that our models perform well (e.g., F1=75.7 in task (i)), illustrating a new approach to overcoming previous barriers to script collection. We also show that there is still significant room for improvement toward human-level performance. Together, our tasks, dataset, and models offer a new research direction for learning script knowledge.


Introduction
Scripts, originally introduced by Schank and Abelson (1975), represent structured commonsense knowledge about prototypical events in everyday situations/scenarios such as bake a cake and fuel a car (Figure 1). However, while scripts have been shown to help understand narratives by providing expectations, resolving ambiguity, and filling in unstated information (Chambers and Jurafsky, 2008; Modi et al., 2017, inter alia), they have proved hard to author or extract from text, with only small script databases available (Regneri et al., 2010; Chambers, 2017; Ostermann, 2020).

Figure 1: We collected 6.4k partially ordered scripts (proScript) and developed models that take a scenario (e.g., bake a cake) as the input and generate a (possibly partial-order) script. The example shows the events "find the cake recipe", "gather the ingredients", "turn on the oven", "mix the ingredients", "put the cake batter in the oven", "bake for the right amount of time", and "take the cake out of the oven".
In this work, we show for the first time that pre-trained neural language models (LMs) can be adapted to generate high-quality scripts that are appropriately partially ordered, imposing a specific temporal ordering between events only when it is necessary. LMs have previously been shown to successfully generate stories (Rashkin et al., 2020), summaries (Lewis et al., 2020), and commonsense facts (Bosselut et al., 2019; Hwang et al., 2020). Here we investigate their application to script generation. First, we crowdsource a large number (6.4k) of partially ordered scripts using a collection method similar to, but simpler than, that of Ciosici et al. (2021). We call the dataset proScript (PaRtial Order SCRIPt for generaTion); it is substantially larger than prior (crowdsourced) datasets such as DeScript (Wanzare et al., 2016), which has 40 scripts. Since the granularity of scripts (and their events) is inherently vague and subjective (Modi et al., 2016), we collected a wider variety of micro- and macroscopic scripts than previous datasets. Additionally, the temporal duration of each event is annotated (e.g., take the cake out of the oven typically takes one minute in the bake a cake script), which can potentially link script knowledge with temporal reasoning in future work.
Second, with the collected data, we introduce two complementary tasks: script edge prediction and entire script generation. In the edge prediction task, given a scenario and unordered intermediate events, models must organize the events into a valid partial-order script. In the script generation task, models must generate the intermediate events and the partial order of those events for a given scenario. The latter task requires both natural language generation (for nodes) and graph structure prediction (for edges).
Finally, based on our proposed dataset, we develop models for both the edge prediction and entire script generation tasks. As Chambers (2017) has argued that models trained and evaluated on missing-event prediction (i.e., narrative cloze) are insufficient to assess script knowledge, our evaluation scheme evaluates the entire script. We compare the models against baselines, and show that our models outperform the baselines on both the edge prediction and the script generation tasks. Nonetheless, there is significant room for improvement toward human-level performance: for edge prediction, the best model achieves an F1 score of 75.71 while humans achieve 89.28, and for script generation, the best model obtains a graph edit distance of 4.97 (i.e., number of human edits), while human-created scripts achieve 2.98 on average. Our contributions are thus:

• A new dataset (proScript) of crowdsourced scripts that is substantially larger than prior (manually crafted) datasets
• Two complementary task definitions over proScript
• Two new models for these tasks, providing the first demonstration that generative models can be successfully applied to script generation, although they still perform significantly below human level

Related Work

Script as narrative chain Mooney and DeJong (1985) and Chambers and Jurafsky (2008, inter alia) have investigated automatically inducing scripts from (unstructured) corpora. In particular, Chambers and Jurafsky (2008) introduced scripts as narrative chains, where verbs with participant information (e.g., (claimed, subj) and (accused, obj)), named narrative events, are partially ordered according to causal and temporal relations. They also introduced the narrative cloze task, where a model is expected to predict one removed narrative event given all the other narrative events, while our proposed task requires generating scripts as a partial-order graph for a given scenario. The "script as narrative chain" approach has been actively studied (Jans et al., 2012; Modi and Titov, 2014; Pichotta and Mooney, 2014; Rudinger et al., 2015; Granroth-Wilding and Clark, 2016; Weber et al., 2018; Belyy and Van Durme, 2020), but it has its drawbacks. First, the source corpora are mainly from the news domain rather than everyday scenarios, and a number of verbs in news texts, such as reporting verbs, are not script-relevant events (Mostafazadeh et al., 2016; Chambers, 2017). Second, events are highly abstracted as tuples of a verb and its dependency (subj or obj) (Ostermann, 2020). Third, the evaluation scheme of the narrative cloze task is insufficient for evaluating script knowledge (Chambers, 2017).
Script as paraphrase sets Scripts as paraphrase sets (Regneri et al., 2010; Modi et al., 2016; Wanzare et al., 2016) are a more recent approach to gathering script knowledge, where crowd workers are asked to write down a sequence of events for a given everyday scenario (e.g., bake a cake) and the collected sequences (called event sequence descriptions) are aligned by clustering paraphrased events. The collected (partially ordered) scripts cover a wide variety of everyday situations compared to narrative chains (news domain), but one shortcoming of this approach is scalability: manual data collection is costly (Chambers, 2017; Ostermann, 2020). In fact, Modi et al. (2016) crowdsourced 1,000 stories that cover only 10 scripts, and similarly Regneri et al. (2010) ended up collecting 40 scripts. The limited amount of data hinders models from learning script knowledge. Furthermore, these datasets provide no evaluation metric for assessing a model's script knowledge.
Story generation Neural models have been shown to successfully generate stories (Kiddon et al., 2016; Peng et al., 2018; Zhai et al., 2019; Rashkin et al., 2020). Our work is related to story generation in that it generates a higher-level agenda (or plot) of a story. However, a main difference between stories and scripts is that stories often call for surprising and incidental sequences of events, as well as descriptions of characters' mental states and landscape, to make the story attractive for readers, whereas our script generation task expects models to generate the essential core events (Chambers, 2017) in partial order.

Definitions
proScript We define a proScript as a directed acyclic graph (DAG) $G(V, E)$ for a given scenario $s$, where $V$ is a set of essential events $\{v_1, \ldots, v_i, \ldots, v_{|V|}\}$ and $E$ is a set of temporal ordering constraints between events $\{e_{ij}\}$, where $e_{ij}$ means that event $v_i$ must precede event $v_j$ ($v_i \prec v_j$).¹ DAGs effectively encode the partial ordering of core events, which is crucial for representing events that can be performed in any order. For example, in a bake a cake scenario, one can "gather the ingredients" and "turn on the oven" in either order (Figure 1). We emphasize that scripts should not include non-core events such as discourse-related events (e.g., reporting verbs), as Chambers (2017) proposed. In proScript, we also exclude alternative events from a proScript DAG. For example, in a bake a cake scenario, "get ingredients" and "buy ingredients" are alternatives to each other because only one of them is necessary in the scenario. By excluding alternative events, we resolve the ambiguity of whether an edge in the partial-order structure denotes a temporal relation or an alternative path; Regneri et al. (2010) and Modi et al. (2016) do not disambiguate these cases.² With this definition, we introduce the proScript task in two complementary settings: script edge prediction and entire script generation.
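To make this representation concrete, here is a minimal sketch (ours, not part of the dataset release) that encodes the bake a cake script from Figure 1 as a DAG using the networkx library:

```python
import networkx as nx

# Build the "bake a cake" script from Figure 1 as a DAG.
# Nodes are event descriptions; each edge is a temporal ordering
# constraint (v_i must precede v_j).
G = nx.DiGraph()
events = [
    "find the cake recipe",               # v1
    "gather the ingredients",             # v2
    "turn on the oven",                   # v3
    "mix the ingredients",                # v4
    "put the cake batter in the oven",    # v5
    "bake for the right amount of time",  # v6
    "take the cake out of the oven",      # v7
]
G.add_nodes_from(events)
G.add_edges_from([
    (events[0], events[1]),  # find recipe -> gather ingredients
    (events[0], events[2]),  # find recipe -> turn on the oven
    (events[1], events[3]),  # gather ingredients -> mix
    (events[3], events[4]),  # mix -> put batter in the oven
    (events[2], events[4]),  # turn on the oven -> put batter in the oven
    (events[4], events[5]),  # batter in the oven -> bake
    (events[5], events[6]),  # bake -> take the cake out
])
assert nx.is_directed_acyclic_graph(G)  # a valid proScript is a DAG
```

Because "gather the ingredients" and "turn on the oven" share a parent and a descendant but have no edge between them, the DAG leaves their relative order unspecified.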
Edge Prediction The script edge prediction task is to predict the set of partial-order edges ($E$) of the script $G(V, E)$, given a scenario and a set of unordered events ($V$).

Script Generation The script generation task is to predict the entire partial-order script $G(V, E)$ given only the scenario. Models are additionally expected to generate the events ($V$) in natural language.

Datasets
Source of Scenarios We collected scenarios from ROCStories (Mostafazadeh et al., 2016), DeScript (Wanzare et al., 2016), and VirtualHome (Puig et al., 2018). As ROCStories consists of sentences rather than scenarios, we extract phrases that match the manually curated patterns "want(ed) to ...", "need(ed) to ...", and "look(ing) to ..." and that do not include personal pronouns or a person's name. The 2,565 scenarios we collected include both high-level, long-term ones (e.g., open a small business) and fine-grained, short-term ones (e.g., sign into an email account). DeScript consists of 40 daily scenarios (e.g., making coffee), and we use all of them. VirtualHome was constructed to learn activities interactively in a household in a 3D simulated world; it has 233 indoor tasks (e.g., turn on light), and we include them as scenarios.
Crowdsourcing proScript For the collected scenarios, we crowdsource the corresponding proScripts on Amazon Mechanical Turk. Our crowdsourcing procedure is similar to, but simpler than, that of Ciosici et al. (2021). First, crowdworkers are asked to describe five to seven core events that they consider essential for the given scenario (Chambers, 2017), along with the estimated time it takes to complete each event. Second, they are asked to sort the events into a (possibly partial) order, i.e., a DAG, which represents the proScript for the scenario.
Due to the complex nature of this crowdsourcing procedure, it is crucial to maintain quality. To identify and filter out noisy instances, two different workers are asked to sort the same set of events into a partial order (i.e., the same as the second question described above). Based on manual analysis, we decided to retain scripts with at least a 65.0 F1 score between the workers.³ To collect proScripts with both micro- and macroscopic scenarios, we iteratively picked two adjacent events in the DAGs and used them as a source of finer-grained scenarios.

Dataset Statistics In total, we collected 6,414 valid scripts that include 311,502 pairs of events, and we split proScript into training (3,252 scenarios), development (1,085), and test (2,077) sets. The training and development sets consist of scenarios collected from ROCStories, and the test set consists of scenarios from ROCStories, DeScript, and VirtualHome. This allows us to evaluate both in- and out-of-domain performance.
The average number of events per proScript scenario is 5.45, and the maximum degrees of the DAGs in the training set are distributed as follows: 2,198 scripts (67.6%) with degree 1, 915 scripts (28.1%) with degree 2, 108 scripts (3.3%) with degree 3, and 31 scripts (0.9%) with degree 4 and above. Figure 2 shows the normalized histogram of the typical time taken by each script in the proScript dataset. Most of the scripts take between a minute and an hour (e.g., "go to bathroom", "buy some new clothes"), while there is also a reasonable number of high-level, long-term scripts (e.g., "find a new job", "open a small business").

Models
For the proScript edge prediction task (§3), we implement a two-step baseline (the pairwise model) and compare it with our proposed end-to-end neural method (proScript_edge-pred).
Pairwise Model We implement a two-step baseline in which we first train a binary classifier to predict the precedence between pairs of events, and then build a partial-order script Ĝ by aggregating the predicted relations across all pairs of events.
The classifier's scores are used as weights to create an adjacency matrix for Ĝ, which is then converted into a partial-order script with a heuristic: while the graph contains a cycle, we iteratively remove the minimum-weight edge until we obtain a valid DAG.
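A minimal sketch of this cycle-removal heuristic, assuming networkx and a dictionary of pairwise classifier scores (the paper does not specify its exact implementation):

```python
import networkx as nx

def scores_to_dag(scores: dict) -> nx.DiGraph:
    """Turn pairwise precedence scores into a DAG by iteratively
    dropping the minimum-weight edge from a remaining cycle.

    `scores` maps (event_i, event_j) -> classifier confidence that
    event_i precedes event_j; only pairs predicted as "precedes"
    should be included.
    """
    G = nx.DiGraph()
    for (u, v), w in scores.items():
        G.add_edge(u, v, weight=w)
    while not nx.is_directed_acyclic_graph(G):
        cycle = nx.find_cycle(G)  # list of (u, v) edges on a cycle
        u, v = min(cycle, key=lambda e: G.edges[e]["weight"])
        G.remove_edge(u, v)       # drop the weakest link in the cycle
    return G
```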
proScript_edge-pred We propose an end-to-end neural model, which takes all the (unordered) events ($v$) and the scenario ($s$) as input ($x$) and predicts the edges ($\hat{E}$) of a partial-order script ($\hat{G}$) in one shot. To represent $\hat{E}$ in a linear format ($y$), we use DOT, a graph description language, as shown in Figure 3.⁴ By flattening the nodes and edges of $G$ (and $\hat{G}$), we can apply neural encoder-decoder models. Formally, the flattened unordered events and scenario ($x$) are embedded as a continuous representation ($\mathrm{emb}(x)$) by the encoder, and the decoder then generates tokens ($y$) as follows:

$$p(y_1, \ldots, y_N \mid x_1, \ldots, x_M) = \prod_{n=1}^{N} p(y_n \mid \mathrm{emb}(x_1, \ldots, x_M), y_1, \ldots, y_{n-1}). \tag{2}$$
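As an illustration of this linearization, the sketch below flattens events and edges into a DOT-style string in the Step-based format shown in Figures 3 and 5; the exact serialization details used in the paper may differ:

```python
def to_dot(events, edges):
    """Flatten a script graph into a DOT-style string.

    events: list of event strings, indexed as Step0, Step1, ...
    edges:  (i, j) index pairs meaning Step_i precedes Step_j.
    """
    parts = [f"Step{i}: {event};" for i, event in enumerate(events)]
    parts += [f"Step{i} --> Step{j};" for i, j in edges]
    return " ".join(parts)

# e.g., the first events of the "bake a cake" script:
y = to_dot(
    ["find the cake recipe", "gather the ingredients", "turn on the oven"],
    [(0, 1), (0, 2)],
)
# 'Step0: find the cake recipe; Step1: gather the ingredients;
#  Step2: turn on the oven; Step0 --> Step1; Step0 --> Step2;'
```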
Compared to the pairwise model, the proScript_edge-pred model uses information from all the events jointly, building the partial-order script with a broader context.

Evaluation Metrics
Given $G(V, E)$ as a predicted (partial-order) script and $\hat{G}(V, \hat{E})$ as the correct (oracle) script, the F1 score over edges is defined as follows:

$$\mathrm{precision} = \frac{|E \cap \hat{E}|}{|E|}, \qquad \mathrm{recall} = \frac{|E \cap \hat{E}|}{|\hat{E}|}, \qquad F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$
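A straightforward implementation of this metric (a sketch under the assumption that the events are already aligned across the two scripts, as in the edge prediction task where events are given):

```python
def edge_f1(pred_edges, gold_edges):
    """Edge-level F1 between a predicted and a gold script,
    assuming the node sets are already aligned."""
    pred, gold = set(pred_edges), set(gold_edges)
    tp = len(pred & gold)                       # edges in both scripts
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```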
Setup For the proScript_edge-pred model, we use T5 with different model sizes (Large and 11B) and training-set sizes (100, 1k, and all 3.2k) to see how these factors affect performance. We follow the default hyper-parameters for the T5 models, using the implementation from Huggingface Transformers (Wolf et al., 2019). We also experimented with BART (Lewis et al., 2020), but found that it did not perform well on this task.
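For illustration, a minimal fine-tuning step with Huggingface Transformers might look like the following. This is a sketch of the seq2seq formulation, not the authors' training code, and the prompt format shown is an assumption:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

# One edge-prediction example: scenario plus shuffled events in,
# DOT-style edges out (the actual flattening follows Figure 3).
x = ("scenario: bake a cake; events: Step0: turn on the oven; "
     "Step1: find the cake recipe; Step2: gather the ingredients")
y = "Step1 --> Step0; Step1 --> Step2;"

enc = tokenizer(x, return_tensors="pt")
labels = tokenizer(y, return_tensors="pt").input_ids
loss = model(**enc, labels=labels).loss  # standard cross-entropy over y
loss.backward()                          # an optimizer step would follow
```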

Results
The results are shown in Table 1. We find that the pairwise and proScript_edge-pred models significantly outperform the random baseline, in which edges are randomly assigned. The proScript_edge-pred T5-11B model outperforms the pairwise T5-11B model, indicating that the proScript_edge-pred model benefits from the larger context of the full input when predicting edges, although there is still significant room for improvement toward human-level performance. (We find that 99% of the outputs from proScript_edge-pred are valid DOT.) Regarding the difference between in- and out-of-domain, we find that in-domain performance is higher than out-of-domain performance, whereas human performance is robust to the domain difference. We also see that training-set size (100, 1k, all) and model size (Large, 11B) significantly affect the performance of proScript_edge-pred.

Figure 4 shows the performance of the pairwise (T5-11B) model, proScript_edge-pred (T5-11B), and humans as a function of the (maximum) degree of the script DAGs. We find that scripts with higher degree are more difficult to predict for both the proScript_edge-pred and pairwise models, whereas humans show a smaller decrease on higher-degree scripts.

proScript Generation

Models

proScript_gen The proScript generation task combines natural language generation (i.e., generating events in natural language) with structure prediction over the generated events (i.e., organizing the events into a DAG). Our approach (proScript_gen) is to formulate it as an end-to-end problem, similar to the proScript_edge-pred model for the proScript edge prediction task (§5.1). Given a scenario (s) and the number of events to generate, proScript_gen generates the events and edges of the partial-order script (Ĝ) in the DOT language (Figure 5). The input is the scenario and the number of events to generate, in natural text format, and the output is a sequence of events and edges of the script, e.g., "Step0: find the cake recipe; Step1: gather the ingredients; Step2: mix the ingredients; ... Step5: bake for the right amount of time; Step6: take the cake out of the oven; Step0 --> Step1; Step0 --> Step3; ... Step5 --> Step6;". Formally, we use the same encoder-decoder framework (Eq. 2), except that the scenario and number of steps to generate are described in natural text as x, and the decoder is expected to generate the events as well as the edges (as y) of the script.
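Since the decoder emits free text, a generated script must be parsed back into a graph and checked for validity (recall that 99% of proScript_edge-pred outputs are valid DOT). A sketch of such a parser, assuming the Step-based format illustrated above:

```python
import re
import networkx as nx

def parse_script(output: str):
    """Parse a generated DOT-style string back into a DAG; returns
    None when the output is malformed or contains a cycle."""
    G = nx.DiGraph()
    # Event declarations: "Step3: mix the ingredients;"
    for m in re.finditer(r"Step(\d+):\s*([^;]+);", output):
        G.add_node(int(m.group(1)), text=m.group(2).strip())
    # Ordering edges: "Step0 --> Step1;"
    for m in re.finditer(r"Step(\d+)\s*-->\s*Step(\d+);", output):
        i, j = int(m.group(1)), int(m.group(2))
        if i not in G or j not in G:
            return None  # edge refers to an undeclared event
        G.add_edge(i, j)
    return G if nx.is_directed_acyclic_graph(G) else None
```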
Transfer learning from WikiHow data Transfer learning often helps improve performance when a model is (pre-)trained on a similar task (Peters et al., 2018; Devlin et al., 2019). As an additional resource for pre-training proScript_gen, we use procedural texts extracted from WikiHow,⁸ which contains 130k instances of sequences of essential steps for a given topic in various categories (e.g., health, finance, hobbies, etc.). It is important to note that all the procedures in WikiHow are formatted as sequences rather than partial orders. We refer to this approach as proScript_gen-transfer.
Pipeline approach An alternative approach is to use proScript_gen followed by the proScript_edge-pred model. This approach relies on proScript_gen to generate the set of events, but allows the predicted edges to be fixed by the proScript_edge-pred model. We refer to this approach as proScript_gen-pipe and study whether it improves performance over proScript_gen.

Evaluation Metrics
Chambers (2017) emphasizes the importance of human annotation for evaluating script knowledge.
However, human evaluation for the proScript generation task is challenging because it involves both natural text generation and structured prediction. As in text generation tasks such as machine translation and text summarization, there are several possible correct answers. We therefore use two complementary evaluation metrics for the proScript generation task: (i) graph edit distance and (ii) pairwise comparison. These are absolute and relative measures of performance, respectively. Graph edit distance (Abu-Aisheh et al., 2015) computes the distance between two graphs. Formally, given two graphs $G_1$ and $G_2$,

$$\mathrm{GED}(G_1, G_2) = \min_{(d_1, \ldots, d_k)} \sum_{i=1}^{k} c(d_i),$$

where the minimum is taken over sequences of graph edit operations $d_1, \ldots, d_k$ that transform $G_1$ into $G_2$, and $c(d_i)$ is the cost of operation $d_i$. The operations include deletion, insertion, and replacement of vertices and edges. For simplicity, we set the cost of every operation to 1 in our evaluation. We use the graph edit distance between a model-generated script and the script as revised by human annotators. The graph edit distance is indicative of the quality of the generated scripts: higher-quality scripts have smaller graph edit distances to the gold standard (i.e., they require fewer human revisions). In addition, we also employ pairwise human judgments, where we ask human annotators to compare the scripts generated by proScript_gen with those from the other approaches.
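networkx ships an implementation of graph edit distance based on the same Abu-Aisheh et al. (2015) algorithm cited above. The sketch below uses it with the default unit costs, matching nodes by their event text; this is an illustration, not necessarily the authors' exact evaluation code:

```python
import networkx as nx

def script_ged(G_pred: nx.DiGraph, G_gold: nx.DiGraph) -> float:
    """Graph edit distance with unit cost for every operation
    (insert/delete/substitute nodes and edges), where two nodes
    match iff their 'text' attributes (event strings) are equal."""
    return nx.graph_edit_distance(
        G_pred,
        G_gold,
        node_match=lambda a, b: a["text"] == b["text"],
    )
```

Exact graph edit distance is expensive in general, but proScript graphs are small (five to seven events), so exhaustive search is practical.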

Experiments
Setup For proScript_gen, we use T5-11B. As with proScript_edge-pred, we follow the default hyper-parameters recommended by Raffel et al. (2020). For proScript_gen-transfer, we pre-train proScript_gen on the 130k WikiHow procedures and then finetune it on the proScript dataset. For proScript_gen-pipe, we first take the events generated by proScript_gen (ignoring the edges) and use that set of events as input to proScript_edge-pred, which is trained (see §5.3) to predict the edges. As defined in §3, we use graph edit distance and pairwise judgments to evaluate the quality of the generated scripts. For computing graph edit distances, we select 500 scripts (250 each from the dev and test sets) and ask crowdworkers to revise the generated scripts as necessary (e.g., adding/deleting/replacing events and edges). We use the revised scripts as the gold standard. Each script is revised by two annotators, and we compute the average of the graph edit distances.
For the pairwise judgments, we compare the scripts generated by proScript_gen with those from the other approaches. We randomly select 150 pairs and ask three crowdworkers to judge whether the script generated by proScript_gen is better than, worse than, or equal to the other (i.e., transfer, pipeline, or human). We use a majority vote to decide the final pairwise judgment between the two scripts.
Results The pairwise judgment results are shown in Figure 6. We see that the pipeline and transfer models are slightly preferred over proScript_gen (except pipeline-dev), although the differences are not large. We also see that the transfer model is consistently preferred over proScript_gen more often than the pipeline model, on both the dev and test sets. Regarding the pairwise comparison with human-created scripts, proScript_gen still has significant room for improvement toward human level.

Table 2 shows the average graph edit distance between the generated scripts and the human revisions. We find that neither transfer nor pipeline improves the graph edit distance over proScript_gen, indicating that proScript_gen is already a strong baseline (see examples in the Appendix). The lack of improvement from the transfer approach may be because WikiHow consists of sequences rather than partially ordered steps; the lack of improvement from the pipeline approach indicates that proScript_gen can already directly generate valid scripts. In terms of edit types, many of the edits are edge-related, suggesting that proScript_gen and its variants are good at generating events but struggle with ordering them. Regarding in- and out-of-domain performance on the test sets, we observe that proScript_gen and its variants perform slightly better on in-domain scripts than on out-of-domain ones, while human-created scripts are not affected by domain. These findings are consistent with the results of the edge prediction task (§5.3).

Figure 7 shows a histogram of the graph edit distance. It is evident that human-created scripts are corrected less often than scripts generated by proScript_gen, whereas the scripts from proScript_gen and its variants often receive a large number of edits (e.g., 4 or more). Interestingly, fewer scripts have 1 to 3 edits (except scripts created by humans). This is because a single simple revision tends to yield multiple graph edits (e.g., one node insertion yields multiple edge insertions).
Error Analysis We performed a manual error analysis of the scripts generated by each model. We selected 40 random scripts with non-zero graph edit distance and classified the human revisions into 7 types: (1) incorrect order, (2) missing event, (3) irrelevant/redundant event, (4) order by context, (5) granularity, (6) paraphrased event, and (7) wrong correction (examples are shown in Table 3).
Roughly, the first three types indicate crucial errors in the script; the next three are trivial revisions where both the generated and revised scripts are plausible; and the last type is a revision where the revised script is wrong (or worse). Table 4 shows the statistics for each error type. We see that edge-related revisions are more frequent than node-related revisions, which is consistent with the graph edit distance results. Overall, we find that minor revisions are more frequent than crucial errors, indicating that proScript_gen and its variants generate reasonably good scripts. In contrast, crucial errors are quite rare in human-created scripts, indicating significant room for future improvement.

Conclusions
We show for the first time that pre-trained neural language models can be adapted to generate partial-order scripts. We collected 6,400 partially ordered scripts via crowdsourcing (proScript), a dataset substantially larger than prior manually crafted datasets. With the proScript dataset, we introduced two complementary tasks and models, providing the first demonstration that generative models can be successfully applied to script generation, although they still perform below human level. We believe that the proScript dataset and models will advance future work on various NLP tasks such as story generation, machine comprehension, temporal reasoning, and high-level planning.