SubeventWriter: Iterative Sub-event Sequence Generation with Coherence Controller

In this paper, we propose a new task of sub-event generation for an unseen process, to evaluate the understanding of the coherence of sub-event actions and objects. To solve the problem, we design SubeventWriter, a sub-event sequence generation framework with a coherence controller. Given an unseen process, the framework iteratively constructs the sub-event sequence by generating one sub-event per iteration. We also design an effective coherence controller to decode more coherent sub-events. As our extensive experiments and analysis indicate, SubeventWriter can generate more reliable and meaningful sub-event sequences for unseen processes.


Introduction
Natural language understanding involves a deep understanding of events. In the NLP community, there have been many event understanding tasks. Most of them focus on parsing events into involved entities, time, and locations as semantic roles (Kingsbury and Palmer, 2002; Li et al., 2013; Lv et al., 2020; Lin et al., 2020; Du and Cardie, 2020; Zhang et al., 2021; Lyu et al., 2021a), or on identifying their binary relations such as temporal or causal relations (Berant et al., 2014; Smith et al., 2018; Sap et al., 2019; Wang et al., 2020, 2021). However, natural language can describe relations beyond binary ones. For example, processes (Craig et al., 1998), also known as scripts (Schank and Abelson, 1977) or activities (Mourelatos, 1978), are complex events constituted by a sequence of sub-events. Understanding processes can be more challenging than understanding individual events or event pairs.
As shown in Figure 1, to complete the process of making a chocolate cake, we need to consider a sequence of actions, "mix," "add," "pour," and "bake,"¹

¹ Code is available at https://github.com/HKUST-KnowComp/SubeventWriter.

Figure 1: A motivating example of SubeventWriter, which generates one sub-event at a time iteratively. We show the process "make a chocolate cake" (Step 1: Mix the dry ingredients. Step 2: Add the coffee and milk. Step 3: Pour batter into cake pans.) and the second iteration of the generation. By considering coherence, we can re-rank candidates and reach the right sub-event. [M] is a mask token.
which involves different objects, e.g., dry ingredients, coffee, milk, etc. Those actions should follow a logically coherent procedure, while the objects should all be related to the target, the chocolate cake. Thus, building such a coherent sequence should take all sub-events into consideration.
There have been two categories of studies related to processes, namely process induction and narrative cloze tasks. Zhang et al. (2020a) proposed a task called process induction to learn the hierarchical structure of events, where a model needs to generate a sub-event sequence to finish a given process. Their framework aggregates existing events so that it can conceptualize and instantiate similar processes. However, the aggregation procedure does not consider the coherence of actions and their objects. In addition, to build the dataset, they extracted events using a dependency parser with pre-defined verb-argument templates (Zhang et al., 2020b, 2022). Such structured events might harm coherence, as only head words are retained after extraction. Consider the first sub-event in Figure 1. After parsing, we lose the indispensable modifier "dry" and the sub-event becomes (mix, ingredients)², which includes the wet ingredients (e.g., "milk") of the second sub-event. Thus, the logical relation between the two adjacent sub-events (i.e., coherence (Van Dijk, 1980)) is defective.
On the other hand, narrative cloze tasks (Chambers and Jurafsky, 2008; Granroth-Wilding and Clark, 2016; Chambers, 2017; Mostafazadeh et al., 2016) evaluate whether a model can predict the missing (usually the last) event in a narrative. These tasks essentially evaluate the semantic similarity and relatedness between the target event and the context. However, they do not emphasize how all events in the context are unified into a whole process in an ordered and coherent way.
To evaluate complex process understanding, we propose a new generation-based task to directly generate sub-event sequences in free-text form, as shown in Figure 1. In the task, better generation of a process means better understanding of the coherence among action verbs as well as their operational objects. In fact, we find that generating free-text events is a non-trivial task, even with strong pre-trained models like T5 (Raffel et al., 2020) and BART (Lewis et al., 2020). First, generating an overlong piece of text containing several temporally ordered sub-events at once is challenging for current pre-trained models (Zhou et al., 2022; Lin et al., 2021; Brown et al., 2020). Second, sub-events are generated without considering the coherence of actions and their objects, which might give rise to irrelevant or redundant results.
To solve the task, we propose SubeventWriter to generate sub-events iteratively in temporal order. SubeventWriter only generates the next sub-event in each generation iteration, given the process and prior generated sub-events. It eases the generation difficulty by decomposing the sub-event sequence. Moreover, sub-events should be coherently organized to complete a process. To consider coherence in each iteration, we obtain a few sub-event candidates from the beam search and select the most coherent one, as shown in Figure 1. In SubeventWriter, we introduce a coherence controller to score whether a candidate is coherent with the process and prior generated sub-events. As a result, SubeventWriter can construct more reliable and meaningful sub-event sequences.

² The matched pre-defined template is (verb, object).
To evaluate our framework, we extract a large-scale general-domain process dataset from WikiHow, containing over 80k examples. We conduct extensive experiments with multiple pre-trained models, and automatic and human evaluations show that SubeventWriter produces more meaningful sub-event sequences than existing models by a large margin. Moreover, we conduct few-shot experiments to demonstrate that our framework handles few-shot cases well. Last but not least, we evaluate the generalization ability of SubeventWriter on two out-of-domain datasets: SMILE (Regneri et al., 2010) and DeScript (Wanzare et al., 2016). The results show that our framework generalizes well.

Textual Sub-event Sequence Generation
We formally define the sub-event sequence generation task as follows. Given a process S, we ask the model to generate a sub-event sequence E, whose steps solve the process. This task is essentially a conditional language modeling problem. Specifically, given a process S consisting of n tokens x_1, x_2, ..., x_n and a sequence E consisting of m sub-events e_1, e_2, ..., e_m (each sub-event is a sentence containing t_i tokens), the model factorizes the conditional probability of the sequence as

P_θ(E | S) = ∏_{i=1}^{m} P_θ(e_i | e_{<i}, S). (Eq. 1)

Figure 2 illustrates the details of the proposed SubeventWriter framework. For a given process, the framework decomposes the generation into multiple iterations. The sequence-to-sequence (seq2seq) language model generates a few candidates for the next sub-event in each iteration. We then leverage a coherence controller to re-rank the generated candidates by considering whether they are coherent with the process and prior generated sub-events. The coherence controller is a discriminative model that assigns a coherence score to a sub-event sequence. It is fine-tuned independently on synthetic data generated according to our manually designed coherence rules. Finally, the framework appends the selected sub-event to the end of the input to serve as new context and starts the next iteration.

Figure 2: The overview of our SubeventWriter. In each iteration, the seq2seq language model takes the process and prior generated sub-events as input and generates a few candidates for the next sub-event. Then the coherence controller is used to select the most coherent candidate as the next sub-event.

The detailed description of the SubeventWriter components is as follows:

Iterative Event-level Decoding
The iterative event-level decoding scheme is built on top of seq2seq language models, including T5 (Raffel et al., 2020) and BART (Lewis et al., 2020). We describe the training and inference details as follows.
Training: The seq2seq language models are fine-tuned to decode one sub-event at a time in chronological order. For each process with its sub-event sequences in the training data, we create an augmented set of training examples with each sub-event in the sequence as the output in turn. For example, if the valid sequence of a process S consists of temporally ordered sub-events e_1, e_2, and e_3, we create four training examples: S → e_1, S ∪ {e_1} → e_2, S ∪ {e_1, e_2} → e_3, and S ∪ {e_1, e_2, e_3} → none, where "none" is a special token to end sequences. The order of adding sub-events e_i follows the temporal order, which ensures that the model only needs to predict what will happen next without a longer-term forecast.
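The augmented-example construction above can be sketched in a few lines of Python. This is a minimal illustration, not the released code; `build_training_examples` and its return format are hypothetical names chosen for the sketch.

```python
def build_training_examples(process, subevents, end_token="none"):
    """Expand one (process, sub-event sequence) pair into seq2seq training
    examples: each example conditions on the process plus all earlier
    sub-events and predicts the next sub-event, or the special end token
    once the sequence is complete."""
    examples = []
    for i in range(len(subevents) + 1):
        context = (process, tuple(subevents[:i]))  # S ∪ {e_1, ..., e_i}
        target = subevents[i] if i < len(subevents) else end_token
        examples.append((context, target))
    return examples

examples = build_training_examples(
    "make a chocolate cake",
    ["Mix the dry ingredients.", "Add the coffee and milk.",
     "Pour batter into cake pans."],
)
# A sequence of 3 sub-events yields 4 examples; the last targets "none".
```

A sequence of m sub-events thus contributes m + 1 examples, matching the S → e_1, ..., S ∪ {e_1, e_2, e_3} → none expansion described above.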
To minimize the gap between pre-training and fine-tuning, we design a textual prompt template to construct the input in human language. If we want to generate the (i+1)-th sub-event given a process S and sub-events e_1, e_2, ..., e_i, the template takes the form of "How to S? Step 1: e_1. ... Step i: e_i. Step i+1: [M]", as in the example shown in Figure 2, where [M] is the mask token of the model. More examples of input/output are shown in Appendix A.1.

Inference: During inference, we apply the seq2seq language model iteratively to generate the sub-event sequence of a process, using the same prompt template. For instance, the model first generates sub-event e_1 for a process S. It then takes S and e_1 as input and generates the second sub-event e_2. The model repeats this process until the special token "none" is generated, which means no more sub-events are required. Finally, the generated sub-events are concatenated into a sequence as the final output.
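The prompt construction and the iterative decoding loop can be sketched as follows. This is a simplified sketch: `generate_next` stands in for the fine-tuned seq2seq model (prompt in, next sub-event out), and the canned-answer "model" below exists only to make the loop runnable.

```python
def build_prompt(process, subevents, mask="[M]"):
    # "How to S? Step 1: e_1. ... Step i: e_i. Step i+1: [M]"
    parts = [f"How to {process}?"]
    for i, e in enumerate(subevents, start=1):
        parts.append(f"Step {i}: {e}")
    parts.append(f"Step {len(subevents) + 1}: {mask}")
    return " ".join(parts)

def generate_sequence(process, generate_next, max_steps=10, end_token="none"):
    """Iterative event-level decoding: call the model once per sub-event,
    appending each generated sub-event to the context, until the model
    emits the end token."""
    subevents = []
    for _ in range(max_steps):
        nxt = generate_next(build_prompt(process, subevents))
        if nxt == end_token:
            break
        subevents.append(nxt)
    return subevents

# Toy stand-in model with canned answers, for illustration only.
canned = iter(["Place eggs in a pot of water.", "Bring the water to a boil.", "none"])
result = generate_sequence("cook eggs", lambda prompt: next(canned))
```

With a real model, `generate_next` would wrap beam-search decoding; the `max_steps` cap guards against a model that never emits "none".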

Coherence Controller
As a sub-event sequence should be coherent to complete a process, we propose a coherence controller to control the iterative event-level decoding. At each iteration, the coherence controller considers whether each sub-event candidate is coherent with the given process and the sub-events generated in previous iterations. Considering that sub-events (one or more sentences) are diverse and complicated, here we employ a coherence model (Jwalapuram et al., 2021) based on BERT (Devlin et al., 2019) as the coherence controller to score sub-event candidates.
We train the coherence controller on a binary classification task to discriminate coherent sub-event sequences from incoherent ones. Following previous work (Mesgar and Strube, 2018; Moon et al., 2019), we regard a human-written sub-event sequence as coherent, and we synthetically build two types of incoherent sub-event sequences by corrupting the local or global coherence of the human-written one. For local coherence, we randomly copy a sub-event in the current process and place the duplicate at a random location. In this way, the relation between the two sub-events adjacent to the duplicate is corrupted; this relation is termed local coherence in linguistics (Van Dijk, 1980). For global coherence, we randomly choose a sub-event from another process with a different theme and insert this irrelevant sub-event at a random location. In this way, the theme shared among all sub-events is corrupted; this is called global coherence (Van Dijk, 1980). We show a positive example and the two types of negative examples in Appendix A.2.
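The two corruption schemes can be sketched as small sampling functions. This is a minimal sketch under the description above; the function names and the fixed random seed are illustrative choices, not part of the original implementation.

```python
import random

def corrupt_local(seq, rng):
    """Duplicate a random sub-event and insert the copy at a random
    position, breaking the relation between its new neighbours
    (local coherence)."""
    seq = list(seq)
    dup = rng.choice(seq)
    seq.insert(rng.randrange(len(seq) + 1), dup)
    return seq

def corrupt_global(seq, other_process_events, rng):
    """Insert a sub-event drawn from an unrelated process at a random
    position, breaking the shared theme (global coherence)."""
    seq = list(seq)
    seq.insert(rng.randrange(len(seq) + 1), rng.choice(other_process_events))
    return seq

rng = random.Random(0)  # fixed seed for reproducibility of the sketch
gold = ["Turn the lights down.", "Put on some music.", "Turn your phone off."]
local_neg = corrupt_local(gold, rng)
global_neg = corrupt_global(gold, ["Place eggs in a pot of water."], rng)
```

For each positive sequence, sampling N negatives from each function yields the 2N negatives used in the balanced loss described below.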
We use the cross-entropy loss shown in Eq. 2 to optimize the coherence controller, where y and ŷ are the label and the predicted coherence score, respectively. Since y equals 1 for positive examples, our model gives higher scores to more coherent input. For each positive example, we sample N negative examples by corrupting local coherence and the same number by corrupting global coherence (2N in total). Thus, we balance the loss function by dividing the negative loss by 2N:

L = − y log ŷ − (1 / 2N) (1 − y) log(1 − ŷ). (Eq. 2)

At the inference stage (Figure 2), we concatenate the process and the currently generated sequence into the input to the coherence controller. For example, in the i-th iteration, the seq2seq language model with beam search returns the top-k possible sub-event candidates ê_i1, ê_i2, ..., ê_ik. We construct the input [S; ê_1, ê_2, ..., ê_{i−1}, ê_ij] for every candidate ê_ij, given the process S and the prior sub-events ê_1, ê_2, ..., ê_{i−1}. With such input, the coherence controller computes a coherence score C(ê_ij). As the seq2seq model returns the logarithm of the conditional probability P′(ê_ij) = log P_θ(ê_ij | ê_{<i}, S) (Eq. 1) for each candidate, we re-rank candidates and return the best one according to the sum of the two scores:

ê_i = argmax_{ê_ij} ( P′(ê_ij) + λ C(ê_ij) ), (Eq. 3)

where λ is a hyper-parameter to weight the coherence scores. Appendix A.3 gives a concrete example of the inference stage of the coherence controller.
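The re-ranking step of Eq. 3 can be sketched as below. This is an illustrative toy, assuming `lm_logprob` and `coherence_score` stand in for the fine-tuned seq2seq model and the coherence controller; the candidate strings and scores are invented for the example.

```python
def rerank(candidates, lm_logprob, coherence_score, lam=1.0):
    """Re-rank beam candidates for the next sub-event by the sum of the
    seq2seq log-probability P'(e) and the weighted coherence score
    lam * C(e), returning the argmax (Eq. 3)."""
    scored = [(lm_logprob(c) + lam * coherence_score(c), c) for c in candidates]
    return max(scored)[1]

# Toy scores: the LM slightly prefers a digressive candidate, but the
# coherence score overturns the ranking.
lm = {"Add the chocolate chips.": -0.9, "Bake the cake.": -1.1}
coco = {"Add the chocolate chips.": 0.2, "Bake the cake.": 0.9}
candidates = ["Add the chocolate chips.", "Bake the cake."]
best = rerank(candidates, lm.get, coco.get, lam=1.0)
# -0.9 + 0.2 = -0.7 vs. -1.1 + 0.9 = -0.2, so "Bake the cake." wins.
```

Setting `lam=0` recovers plain beam-search ranking, which is the behavior ablated as "w/o CoCo" in the experiments.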

Experiments
We conduct extensive experiments and compare SubeventWriter with a wide selection of baselines.

Dataset
We collect processes and corresponding sub-event sequences from the WikiHow website (Koupaee and Wang, 2018), where each process is associated with a sequence of temporally ordered, human-annotated sub-events. We randomly split them into training, validation, and testing sets. As a result, we obtain 73,847 examples for the training set and 5,000 examples each for the validation and testing sets, whose average sub-event sequence length is 4.25.

Evaluation Metric
For each pair of a predicted sequence and a ground-truth sequence, we compute BLEU-1 (Papineni et al., 2002), BLEU-2, ROUGE-L (Lin, 2004), and BERTScore (Zhang et al., 2019) between them and take the average of each metric over all data. For inference cases with multiple references, we take the best performance among all references.
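The best-of-references aggregation can be sketched as below. The overlap score here is a toy stand-in for BLEU-1 (unigram precision without clipping or brevity penalty), used only to make the max-over-references logic concrete; the paper uses the standard metric implementations.

```python
def unigram_precision(pred, ref):
    """Toy stand-in for BLEU-1: fraction of predicted tokens that also
    appear in the reference (count clipping omitted for brevity)."""
    pred_toks, ref_toks = pred.lower().split(), set(ref.lower().split())
    if not pred_toks:
        return 0.0
    return sum(t in ref_toks for t in pred_toks) / len(pred_toks)

def best_reference_score(pred, references, metric=unigram_precision):
    """With multiple references, keep the best score among them."""
    return max(metric(pred, ref) for ref in references)

score = best_reference_score(
    "mix the dry ingredients",
    ["mix the dry ingredients together", "combine flour and cocoa"],
)
```

The same max-over-references wrapper applies unchanged around BLEU-2, ROUGE-L, or BERTScore.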

Baseline Methods
We compare our framework to three methods.

All-at-once Seq2Seq: An intuitive solution to the textual sub-event sequence generation task is to model it as an end-to-end sequence-to-sequence (Seq2Seq) problem, where Seq2Seq language models are fine-tuned to predict all sub-events at once, given a process as input. Here we test multiple Seq2Seq language models: T5-base/large/3b and BART-base/large. We refer to this baseline as "All-at-once" for short in the following sections.
Top-one Similar Sequence: Following previous work (Zhang et al., 2020a), another naive yet potentially strong baseline is Top-one Similar Sequence. For each unseen process in the validation or testing set, the baseline finds the most similar process in the training data. The sub-event sequence of the most similar process is then regarded as the prediction. If more than one sub-event sequence exists for the most similar process, we randomly pick one of them. Here, we consider two methods to measure similarity: cosine similarity of GloVe (Pennington et al., 2014) and Sentence-BERT (SBERT) (Reimers and Gurevych, 2019) embeddings.
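The retrieval baseline reduces to a nearest-neighbour lookup over process embeddings. The sketch below uses tiny hand-made 2-d vectors in place of GloVe/SBERT embeddings; the `train` mapping and vector values are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_one_similar(query_vec, train):
    """Return the sub-event sequence of the training process whose
    embedding is most similar to the query process embedding."""
    best_vec = max(train, key=lambda v: cosine(query_vec, v))
    return train[best_vec]

# Toy "training data": process embedding -> its sub-event sequence.
train = {
    (1.0, 0.0): ["Mix the dry ingredients.", "Bake."],
    (0.0, 1.0): ["Turn the lights down."],
}
pred = top_one_similar((0.9, 0.1), train)
```

In the paper's setup the embeddings would come from averaged GloVe vectors or SBERT sentence encodings, with ties among multiple sequences broken randomly.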
Zero-shot Large LM: Large language models (LMs) have shown strong performance on extensive NLP tasks (Raffel et al., 2020). The third baseline we introduce is prompting large language models in the zero-shot setting. We consider GPT-J (Wang and Komatsuzaki, 2021) and T5-11b, which contain ~6 billion and ~11 billion parameters, respectively. We choose the prompt template "How to S? Generate the events to solve it." for every process S.

Implementation Details
We fine-tune SubeventWriter and All-at-once Seq2Seq based on T5-base/large/3b and BART-base/large for four epochs. The best checkpoint is selected according to the sum of all metrics on the validation set. The grid search explores learning rates of 1e-5, 5e-5, 1e-4, and 5e-4, batch sizes of 32 and 64, and weights λ of the coherence scores (Eq. 3) of 0.5, 1, 2, and 5. We test multiple LMs for the coherence controller and choose BERT-base due to its efficiency. We give more details in Appendix B.

Main Evaluation
We show the results on the testing set of the WikiHow dataset in Table 1. In general, SubeventWriter generates relevant sub-event sequences, outperforming all baseline frameworks by a large margin. For example, 11.30% of the bi-grams generated by SubeventWriter (T5-3b) are covered by the references, an increase of 2.58% absolute and 29.6% relative compared to All-at-once Seq2Seq (T5-3b). Even though GPT-J and T5-11b are much larger, the smallest fine-tuned SubeventWriter (BART-base) can still surpass them.
Besides, our framework improves more significantly on smaller language models and is parameter-efficient. With T5 going down from "3b" to "base", we observe the improvements increase (e.g., from 6.76% to 10.41% for BLEU-1). Also, SubeventWriter (T5-base) achieves performance comparable to All-at-once Seq2Seq (T5-3b) with only about 12% of the parameters, because generating one event at a time is not hard for T5-base with the help of the coherence controller.
Another interesting observation is that, comparing SubeventWriter based on T5 and BART, T5 always performs slightly better than BART in both the "base" and "large" sizes. This advantage is consistent with intuition, since T5 is about 1.5x-2x larger than BART.
In the rest of this section, we conduct more analysis to demonstrate the reason behind the success of SubeventWriter.

Ablation Study
To measure the contribution of each module to the final results, we conduct an ablation study on SubeventWriter in Table 2, examining whether the coherence controller and the iterative event-level decoding boost the performance. The coherence controller depends on iterative event-level decoding, as it controls sub-events one by one in chronological order. Thus, we cannot drop only iterative event-level decoding while the coherence controller is kept (there is no "w/o ITER." setting).
From the results in Table 2, we observe that both the coherence controller and the iterative event-level decoding play essential roles in generating high-quality sub-event sequences. Dropping either of them causes drastic decreases in all metrics. Taking SubeventWriter (BART-large) as an example, BLEU-1 decreases by 4.53% without the coherence controller. When the iterative event-level decoding is further removed, BLEU-1 declines to 21.84%, which is only 70% of the original BLEU-1 score.

Human Evaluation
We perform a human evaluation to complement the automatic evaluation. As sub-event sequences are complicated and diverse, we decompose them and score every sub-event on the following three aspects:

Relevance: whether a sub-event is relevant to solving the given process, measuring how well sub-events focus on the same theme (global coherence).
Conciseness: whether a sub-event is not redundant with others in the same sequence. We introduce this aspect since generating duplicates is a common failure of language models (Brown et al., 2020) and destroys local coherence.
Orderliness: whether a sub-event is placed in the proper order, considering its prior sub-events. As the order of sub-events irrelevant to the given process is not clearly defined, we only consider the order of sub-events that satisfy the first aspect (Relevance).
We choose to evaluate the generations of SubeventWriter (T5-3b) and All-at-once (T5-3b), as they have the best quantitative performance. We randomly select 50 processes from the testing set, containing about 200 sub-events. Three experts are asked to evaluate every sub-event, yielding 1,800 total ratings for each model (200 sub-events × 3 aspects × 3 experts). We take the majority among the three votes as the final result for each sub-event. The inter-annotator agreement (IAA), calculated as the pairwise agreement proportion, is 83.78%, and Fleiss's κ (Fleiss, 1971) is 0.57.
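The majority vote and the pairwise agreement proportion can be sketched as below. This is a generic sketch of the two aggregation steps, assuming each item carries three expert votes; the toy ratings are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def majority_vote(votes):
    """Final label for one rated item: the most common of the votes."""
    return Counter(votes).most_common(1)[0][0]

def pairwise_agreement(all_votes):
    """IAA as the proportion of annotator pairs that agree, pooled over
    all rated items."""
    agree = total = 0
    for votes in all_votes:
        for a, b in combinations(votes, 2):
            agree += (a == b)
            total += 1
    return agree / total

# Two toy items, each rated by three annotators.
ratings = [("good", "good", "bad"), ("good", "good", "good")]
label = majority_vote(ratings[0])
iaa = pairwise_agreement(ratings)
# item 1 has 1 of 3 agreeing pairs, item 2 has 3 of 3, so IAA = 4/6.
```

Fleiss's κ would additionally correct this raw agreement for chance.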
We show the average scores in Figure 3. We observe that both models achieve acceptable scores in orderliness. SubeventWriter generates more relevant and less redundant sub-events than All-at-once Seq2Seq, as the global and local coherence modeled by the coherence controller reflect relevance and conciseness, respectively. Our model also produces more "All-Good" sub-events, which satisfy all three aspects.

Few-shot Learning Ability
We conduct the main evaluation on the WikiHow dataset, which contains a large training set. To better understand the generalization ability of SubeventWriter, we conduct few-shot experiments to confirm its ability to generalize with fewer data.
Referring to the size of the validation set (5,000 examples), we conduct experiments with training sets of 5,000, 10,000, 20,000, and 30,000 examples (1x, 2x, 4x, and 6x as large as the validation set).⁶ As shown in Figure 4, SubeventWriter achieves better performance than All-at-once Seq2Seq, demonstrating that SubeventWriter can generalize with fewer data.

Zero-shot Transfer Learning
To further verify the generalization ability of SubeventWriter, we test it on two small-scale, domain-specific datasets: SMILE (Regneri et al., 2010) and DeScript (Wanzare et al., 2016). Both contain hundreds of human-curated sub-event sequences pertaining to human activities. Statistics for each dataset are in Table 4.
We directly use the SubeventWriter fine-tuned on the WikiHow dataset to test its zero-shot transfer ability, because it performs well on the WikiHow dataset. Since we do not tune hyper-parameters on these datasets, we treat each entire dataset as a testing set, and there is no validation set.

⁶ The full training set is 15x as large as the validation set.
We report the zero-shot transfer results of SubeventWriter on SMILE and DeScript in Table 3. Among all baseline methods introduced in Section 4.3, we choose Top-one Similar Sequence and All-at-once Seq2Seq as baselines. The Zero-shot Large LM method is not fitted to WikiHow data, so it is not suitable for testing zero-shot transfer ability. From Table 3, we find the performance on SMILE and DeScript is higher than on the WikiHow dataset for all models, since more references are provided. SubeventWriter surpasses Top-one Similar Sequence and All-at-once Seq2Seq on both datasets and all model sizes. Such improvements indicate that our framework learns non-trivial knowledge about sub-event sequences and has strong generalization ability.

Cutting Down the Model Parameters
Most of our experiments use T5-base (~220M parameters) and T5-large (~770M parameters), or the BART counterparts, but in practice, we might prefer smaller models due to computational limitations. Here, we investigate the impact of model size by using T5-small (~60M parameters). Table 5 presents the results of fine-tuning All-at-once Seq2Seq (T5-small) and SubeventWriter (T5-small). Since the coherence controller is based on BERT-base (~110M parameters), we remove it from SubeventWriter to keep the number of parameters consistent for a fair comparison.
There are two meaningful observations. First, SubeventWriter (T5-small) still provides superior sub-event sequences compared to All-at-once Seq2Seq (T5-small) by a large margin (e.g., 6.35% in BLEU-1). Second, both SubeventWriter and All-at-once Seq2Seq perform worse when we replace T5-base with T5-small, since the model size is reduced to 27% of the original.

Comparison of Sub-event Sequence Length
While the prior experiments and analysis mainly focus on the generated content, we also compare the lengths of generated sub-event sequences with the ground truth to better assess SubeventWriter. We treat the lengths as a regression problem and use two metrics: Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
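The two length metrics can be sketched directly; the example lengths below are invented for illustration.

```python
import math

def mae(pred_lens, true_lens):
    """Mean absolute error between predicted and true sequence lengths."""
    return sum(abs(p - t) for p, t in zip(pred_lens, true_lens)) / len(true_lens)

def rmse(pred_lens, true_lens):
    """Root mean squared error; penalizes large length mismatches more."""
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(pred_lens, true_lens)) / len(true_lens)
    )

# Numbers of sub-events in predicted vs. ground-truth sequences (toy data).
pred = [3, 5, 4]
gold = [4, 5, 2]
# mae = (1 + 0 + 2) / 3 = 1.0; rmse = sqrt((1 + 0 + 4) / 3) ≈ 1.291
```

RMSE exceeding MAE, as here, signals that a few sequences miss the target length by a wide margin.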
From Table 6, we can observe that SubeventWriter achieves lower mean absolute and root mean squared errors than All-at-once Seq2Seq, which indicates that our framework can generate sequences with more precise numbers of sub-events.

Case Study
We show two sub-event sequences produced by SubeventWriter in Figure 5. We also generate sub-events without the coherence controller to analyze how it works. From the first example, we can observe that SubeventWriter without the coherence controller produces a digressive sub-event, "Add a sentiment." The coherence controller corrects the generation and keeps a consistent theme across the process and all sub-events (global coherence). From the second example, two redundant sub-events, "Add the chocolate chips," are generated without the coherence controller, which the full model avoids (local coherence).

Figure 5: Two example generations. Process 1: How to make a felt heart card? Reference: "Measure heart shapes." → "Cut out heart shapes." → "Glue into place on a card."

Related Work
Understanding events has long been a challenging task in NLP (Chen et al., 2021), to which the community has dedicated many works. Chambers and Jurafsky (2008) first introduced the narrative cloze task, where models are asked to predict the next event from given ones. After them, a few works were devoted to better modeling event representations (Pichotta and Mooney, 2014, 2016; Granroth-Wilding and Clark, 2016; Li et al., 2018; Ding et al., 2019; Bai et al., 2021). Mostafazadeh et al. (2016) studied the story cloze test, where a system needs to choose the correct ending for a short story. Nonetheless, those tasks emphasize the semantic similarity and relatedness among events, ignoring how events are organized coherently.
A work similar to ours is process induction (Zhang et al., 2020a), which proposed a statistical framework to generate a sub-event sequence for a given process. The framework aggregates existing events with conceptualization and instantiation. The difference between our work and theirs is that we consider the coherence of both actions and their objects in generation. Tasks about processes in different forms have also been studied, including sub-event sequence typing (Chen et al., 2020; Pepe et al., 2022), sub-event selection (Zhang et al., 2020c), chronological ordering (Jin et al., 2022), script construction with specified length (Lyu et al., 2021b), and multi-relation prediction (Lee and Goldwasser, 2019). Compared to their settings, our work directly tackles the most challenging one, where models are asked to generate whole sub-event sequences.

Conclusion
In this paper, we construct coherent sub-event sequences by considering coherence in event-level decoding. Our SubeventWriter generates sub-events iteratively, and a coherence controller is introduced to re-rank candidates in each iteration. Extensive experiments demonstrate the effectiveness of SubeventWriter.

Acknowledgement
The authors of this paper were supported by the NSFC Fund (U20B2053) from the NSFC of China, the RIF (R6020-19 and R6021-20) and the GRF (16211520) from RGC of Hong Kong, the MHKJFS (MHP/001/19) from ITC of Hong Kong and the National Key R&D Program of China (2019YFE0198200) with special thanks to HKMAAC and CUSBLT, and the Jiangsu Province Science and Technology Collaboration Fund (BZ2021065).We also thank the support from NVIDIA AI Technology Center (NVAITC) and the UGC Research Matching Grants (RMGS20EG01-D, RMGS20CR11, RMGS20CR12, RMGS20EG19, RMGS20EG21).

Limitations
The main limitation is that our SubeventWriter framework lacks knowledge, while it needs to understand multiple entities in processes and how they interact. As shown in Figure 6, we ask the framework "How to make strawberry cupcakes?" However, SubeventWriter ignores "strawberry" in the question, which shows that it lacks knowledge distinguishing general cupcakes from strawberry cupcakes. Thus, it cannot infer the way to make strawberry cupcakes from making cupcakes. Future work can investigate effective ways to integrate more knowledge and give models stronger reasoning ability. For example, Zhang et al. (2020a) utilized the hierarchical structure among events to conceptualize and instantiate similar processes.

Figure 6: An error analysis of SubeventWriter (T5-3b), which achieves the best machine performance. It ignores "strawberry" in the question and answers how to make cupcakes. Process: How to make strawberry cupcakes? Output: "Preheat the oven to 350 degrees Fahrenheit." → "Make the cupcakes." → "Frost the cupcakes."
We also test human performance to show the limitations of SubeventWriter and the large room for improvement. Notice that a process usually has multiple human-annotated ground-truth references. For every process, we randomly select a sub-event sequence from the ground truth as the human prediction; the selected sequence is then excluded from the references.
From the results in Table 7, we can observe that there is still a notable gap between machine performance and human performance. For example, the BLEU-2 of human performance is more than twice that of SubeventWriter (T5-3b) (28.33% vs. 11.30%).

Table 8: An example of the process "cook eggs" with the prompt template. Each sub-event in the sequence is taken as the output in turn. [M] is the mask token used in pre-trained models, like <extra_id_0> of T5. Input: "How to cook eggs? Step 1: Place eggs in a pot of water. Step 2: Bring the water to a boil. Step 3: Turn off the heat and place the eggs in cold water. Step 4: [M]" Output: none

A Input and Output Examples
We list examples of input and output in this section.

A.1 Examples of the Prompt Template
In Table 8 and Table 9, we give two examples showing how we train SubeventWriter with the prompt template. We train SubeventWriter to generate one sub-event at a time in chronological order and append it back to the input. If all sub-events have been generated, SubeventWriter generates "none".

A.2 Examples of Training the Coherence Controller
In Table 10, we give positive and negative examples showing how we train the coherence controller. For the negative examples, we provide one example that uses a duplicate sub-event to corrupt local coherence and one that uses an irrelevant sub-event to corrupt global coherence.

A.3 Examples of the Inference Stage of the Coherence Controller
In Table 11, we give two candidates for the third sub-event in the process "make a felt heart card". We also show the input to the coherence controller, which is the concatenation of the process, the sub-events generated in prior iterations, and the current candidate. The coherence controller assigns coherent input higher scores and penalizes incoherent input.

Table 11: An example of the process "make a felt heart card" at the inference stage of the coherence controller. We compare two candidates for the third sub-event. We can see that the coherence controller scores the coherent candidate higher.

B Implementation Details
We conduct all experiments on 8 NVIDIA A100 GPUs.

B.1 Coherence Controller
The coherence controller is based on BERT-base for efficiency. We also tested three other variants of the Transformer (Vaswani et al., 2017): BERT-large, RoBERTa-large (Liu et al., 2019), and RoBERTa-base. We fine-tune them with sub-event sequences from WikiHow to keep the domain consistent inside SubeventWriter. Two negative examples are sampled using a duplicate sub-event and the same number using an irrelevant sub-event (2N = 4 in total). We build two testing sets with positive and negative samples in a 1:1 ratio. Negative examples in the first testing set have corrupted local coherence, while those in the second set have corrupted global coherence. Accuracy on both testing sets is shown in Table 12, along with accuracy on all testing data of both sets ("All"). We observe that BERT-base already achieves satisfying accuracy (93.14%). Using larger models does not improve it much and increases computation cost.

Process: have a relaxing evening
Reference: Turn the lights down. → Put on some music or some relaxing nature sounds. → Make sure the temperature is comfortable. → Turn your phone off.

Example: How to have a relaxing evening?
Step 1: Turn the lights down.
Step 2: Put on some music or some relaxing nature sounds.
Step 3: Make sure the temperature is comfortable.
Step 4: Turn your phone off.
Label: Positive

Example: How to have a relaxing evening?
Step 1: Turn the lights down.
Step 2: Put on some music or some relaxing nature sounds.
Step 3: Turn the lights down.
Step 4: Make sure the temperature is comfortable.
Step 5: Turn your phone off.
Label: Negative (with a duplicate sub-event)

Example: How to have a relaxing evening?
Step 1: Turn the lights down.
Step 2: Put on some music or some relaxing nature sounds.
Step 3: Make sure the temperature is comfortable.
Step 4: Place eggs in a pot of water.
Step 5: Turn your phone off.
Label: Negative (with an irrelevant sub-event)

C Results on WikiHow Validation Dataset
We collect the performance on the validation set of the WikiHow dataset in Table 14. SubeventWriter also works well on the validation data.

D Main Evaluation and Analysis
We provide complementary results of main evaluation and analysis as follows.

D.1 Full Results of Ablation Study
Here we present the ablation study results of SubeventWriter based on all BART and T5 models in Table 15 and Table 16, respectively.

D.2 Full Results of Few-shot Learning
We offer the full results of few-shot learning on the testing set of the WikiHow dataset in Table 17 for All-at-once Seq2Seq and Table 18 for SubeventWriter.

Figure 4: Few-shot learning performance of SubeventWriter ("Ours") and All-at-once Seq2Seq ("All-at-once") based on T5. We also include the results for other metrics in Appendix D.2.
Table 9: An example of the iterative inference for the process "cook eggs".
Iteration 1: Input: How to cook eggs? Step 1: [M]
Output: Place eggs in a pot of water.
Iteration 2: Input: How to cook eggs? Step 1: Place eggs in a pot of water. Step 2: [M]
Output: Bring the water to a boil.
Iteration 3: Input: How to cook eggs? Step 1: Place eggs in a pot of water. Step 2: Bring the water to a boil. Step 3: [M]
Output: Turn off the heat and place the eggs in cold water.
Iteration 4: Input: How to cook eggs? Step 1: Place eggs in a pot of water. Step 2: Bring the water to a boil. Step 3: Turn off the heat and place the eggs in cold water. Step 4: [M]
Output: none

Table 1: Performance of all frameworks on the testing set of the WikiHow dataset. SubeventWriter is our model. We abbreviate BLEU-1, BLEU-2, ROUGE-L, and BERTScore as B-1, B-2, R-L, and BERT, respectively. Compared to All-at-once Seq2Seq, the improvements of our framework are shown under ∆B-1 and ∆B-2 for each size of T5 and BART. We also include the performance of all models on the validation set in Appendix C.

Table 2: Ablation study on SubeventWriter. "w/o CoCo" refers to the ablation of the coherence controller. Further ablation of the iterative event-level decoding is shown in "w/o CoCo & ITER." See Appendix D.1 for results on other sizes of BART and T5.

Table 3: Performance of zero-shot transfer learning on SMILE and DeScript. SubeventWriter ("Ours") outperforms the All-at-once Seq2Seq baseline ("All-at-once") by a large margin.

Table 4: Statistics of SMILE and DeScript. #Process, #Seq, Avg-Ref, and Avg-Len are the number of processes, the number of sub-event sequences, the average number of sequences per process, and the average number of sub-events per sequence, respectively.

Table 5: Performance of using T5-small. δB-1 and δB-2 indicate the performance drops when replacing T5-base with T5-small. See Appendix D.3 for full results.

Table 6: Regression errors of sub-event sequence length for SubeventWriter and All-at-once Seq2Seq, which verify that our framework can predict precise lengths.

Table 7: Human performance on the WikiHow dataset.

Table 10: Positive and negative examples of the process "have a relaxing evening" for the training stage of the coherence controller. We mark the sub-events used to build the negative examples in blue.

Table 12: Accuracy of coherence controllers. "Local" and "Global" refer to the testing sets with corrupted local and global coherence, respectively. "All" contains all testing data from both sets.

Table 15: Ablation study results on BART-base and BART-large.