A Diffusion Model for Event Skeleton Generation

Event skeleton generation, aiming to induce an event schema skeleton graph with abstracted event nodes and their temporal relations from a set of event instance graphs, is a critical step in the temporal complex event schema induction task. Existing methods effectively address this task from a graph generation perspective but suffer from noise sensitivity and error accumulation, e.g., the inability to correct errors while generating the schema. We, therefore, propose a novel Diffusion Event Graph Model (DEGM) to address these issues. Our DEGM is the first workable diffusion model for event skeleton generation, where embedding and rounding techniques with a custom edge-based loss are introduced to transform a discrete event graph into a learnable latent representation. Furthermore, we propose a denoising training process to maintain the model's robustness. Consequently, DEGM derives the final schema, where error correction is guaranteed by iteratively refining the latent representation during the schema generation process. Experimental results on three IED bombing datasets demonstrate that our DEGM achieves better results than other state-of-the-art baselines. Our code and data are available at https://github.com/zhufq00/EventSkeletonGeneration.


Introduction
Event schema induction aims to identify common patterns and structures in event data, extracting a high-level representation of the events. Current event schema induction tasks mainly focus on simple event schemas, e.g., templates (Chambers, 2013) and scripts (Chambers and Jurafsky, 2009). However, real-world events are usually more complex: they include multiple atomic events, entities, and their relations, and thus require more advanced techniques to adequately capture and represent the different aspects and relations involved.

[Figure 1: An illustrated example of using multiple instance graphs, extracted from news articles describing complex events, to generate an event schema skeleton graph for the complex event type Car bombing. The presented instance graph represents the complex event known as the Kabul ambulance bombing. A circle symbolizes an atomic event.]
Recently, Li et al. (2021) propose the temporal complex event schema induction task in order to understand these complex events. The task seeks to abstract a general evolution pattern for complex events from multiple event instance graphs. It is divided into two subtasks: event skeleton generation and entity-entity relation completion. The first focuses on creating the event skeleton, i.e., representing each atomic event by its associated event type as an event node and exploring their temporal relations. The second completes entities and entity links for the event skeleton. In this paper, we focus on event skeleton generation, as it is a prerequisite yet formidable task in temporal complex event schema induction. Figure 1 illustrates an example of instance graphs and the corresponding abstracted schema. Both include abstract event types, such as Attack, and their temporal relations, such as Injure happening after Attack.
Event skeleton generation requires a deep understanding of events and their multi-dimensional relations. Previous methods employ autoregressive graph generation models that generate a schema by sequentially producing event nodes conditioned on the previous ones. For example, Li et al. (2021) generate each event node with its potential arguments and propagate edge-aware information along the temporal order. Jin et al. (2022) improve this approach by applying a Graph Convolutional Network (GCN) to better capture structural information in instance graphs and adopting a similar autoregressive procedure to generate event graphs. However, autoregressive generation methods for event skeleton generation accumulate errors over time, which may degrade the performance of the generation model. For instance, as shown in Figure 1, the model may mistakenly generate "Explode" as "Die", causing it to fail to generate subsequent events correctly. Intuitively, as the number of event nodes increases, the error accumulation becomes more severe. This stems from two factors. The first is error propagation in autoregressive graph generation models: they are noise-sensitive and rely strongly on the correctness of previously generated nodes. If the model generates an incorrect node, it causes a cascading effect of errors in generating the schema; robustness is thus a serious issue in autoregressive methods. The second factor is the model's inability to correct errors during the generation procedure. Hence, we need a model that can correct the generated event-type nodes while generating.
To this end, we propose a novel event graph generation model, dubbed Diffusion Event Graph Model (DEGM), to address these issues. To improve the model's robustness, we propose a diffusion-based method, inspired by its outstanding performance in recent research (Sun et al., 2022; Xiao et al., 2022). By carefully selecting the amount of Gaussian noise in the diffusion process, the model can remove adversarial perturbations, thereby increasing its robustness. However, two challenges remain in applying this method directly to event graphs: (1) mapping the discrete graph structures and event types to a continuous space, and (2) recovering the event graph from the continuous space. For the first challenge, we develop the denoising training stage, which converts the event graph into a sequence and applies an embedding technique to project it into the continuous space; additionally, we introduce a custom edge-based loss function to capture the structural information lost during the transformation. For the second challenge, we develop a rounding technique to predict the event types from their representations and a pre-trained classifier to predict the event edges. To address the error-correction issue, we derive the final schema by iteratively refining the latent representation, which guarantees error correction.
We summarize our contributions as follows:
• We propose a novel Diffusion Event Graph Model (DEGM) for event skeleton generation, in which a denoising training stage guarantees the model's robustness and the schema generation process fulfills error correction via iterative refinement of the latent representation.
• We are the first to tackle event skeleton generation via diffusion models: we convert an event graph from discrete nodes to latent variables in a continuous space and train the model parameters by optimizing event sequence reconstruction and graph structure reconstruction simultaneously.
• Experimental results on the event skeleton generation task demonstrate that our approach achieves better results than state-of-the-art baselines.
Preliminaries and Problem Statement

Diffusion Models in a Continuous Space
A diffusion model typically consists of a forward and a reverse process. Given data x_0 ∈ R^d, the forward process gradually adds noise to x_0 to obtain a sequence of latent variables x_1, ..., x_T in R^d, where x_T is Gaussian noise. Formally, each forward step is q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I), where β_t controls the noise level at the t-th step. Denoting α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s, we can directly obtain x_t as q(x_t | x_0) = N(x_t; √ᾱ_t x_0, (1 − ᾱ_t) I). After the forward process is completed, the reverse denoising process can be formulated as p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), Σ_θ(x_t, t)), where μ_θ(·) and Σ_θ(·) can be implemented using a neural network.
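As a concrete illustration, the closed-form sampling q(x_t | x_0) above can be sketched in a few lines (a minimal sketch; the linear beta schedule and all variable names here are our own assumptions, not the paper's):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    rng = rng or np.random.default_rng(0)
    alphas = 1.0 - betas                   # alpha_t = 1 - beta_t
    alpha_bar = np.cumprod(alphas)[t - 1]  # abar_t = prod_{s<=t} alpha_s
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

# illustrative linear schedule (an assumption, not the paper's schedule)
T = 100
betas = np.linspace(1e-4, 0.02, T)
x0 = np.ones(4)
xT = forward_diffuse(x0, T, betas)  # close to pure Gaussian noise
```

At small t the sample stays near x_0; as t approaches T it approaches a standard Gaussian, which is what the reverse process learns to invert.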

Diffusion Models in a Discrete Space
For discrete data, e.g., text, Li et al. (2022) employ embedding and rounding techniques to map the text to a continuous space from which it can also be recovered.
Given the embedding of the text w, EMB(w), and supposing x_0 is sampled as q(x_0 | w) = N(x_0; EMB(w), β_0 I), the corresponding training objective is

L = E_q[ Σ_{t=2}^{T} ||x_0 − f_θ(x_t, t)||² ] + E_q[ ||EMB(w) − f_θ(x_1, 1)||² − log p_θ(w | x_0) ]. (1)

The first expectation trains the prediction model f_θ(x_t, t) to fit x_0 for t from 2 to T. Empirically, this effectively reduces rounding errors (Li et al., 2022). The second expectation consists of two terms: the first makes the predicted x_0, i.e., f_θ(x_1, 1), closer to the embedding EMB(w), while the second aims to correctly round x_0 back to the text w.

Problem Statement
Event skeleton generation is a subtask of temporal complex event schema induction (Li et al., 2021). It aims to automatically induce a schema from instance graphs for a given complex event type, where a complex event type encompasses multiple complex events; see the car-bombing example shown in Fig. 1. An event schema skeleton consists of nodes for atomic event types and edges for their temporal relations. Since event skeleton generation is a prerequisite yet challenging part of the temporal complex event schema induction task, we focus on it in this work.
Formally, let G = (N, E) be an instance graph with N = |N| nodes in N and E the set of directed edges; one can obtain the corresponding adjacency matrix A = {a_ij} ∈ {0, 1}^{N×N}, where a_ij = 1 if edge (i, j) ∈ E and a_ij = 0 otherwise. Due to the temporal relations, G is a directed acyclic graph (DAG) and A is an upper triangular matrix. Each node n ∈ N represents an atomic event and is assigned an event type n_e ∈ Φ, where Φ denotes the set of event types. The type of each atomic event is abstracted by the DARPA KAIROS ontology based on its event mention. In practice, we extract a set of instance graphs G, as outlined in Sec. 4.1, from news articles, where each instance graph G ∈ G describes a complex event, e.g., the Kabul ambulance bombing shown in Fig. 1.
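To make the graph representation concrete, the following toy sketch builds the adjacency matrix A for a small hand-made instance graph (the event types and edges are invented for illustration):

```python
import numpy as np

# A toy instance graph (invented for illustration): nodes are atomic events
# with event types, edges are temporal relations i -> j (i happens before j).
event_types = ["Attack", "Explode", "Injure", "Die"]  # one type per node
edges = [(0, 1), (1, 2), (1, 3)]  # Attack->Explode, Explode->Injure, Explode->Die

N = len(event_types)
A = np.zeros((N, N), dtype=int)
for i, j in edges:
    A[i, j] = 1

# With the nodes listed in a topological (temporal) order, the DAG's
# adjacency matrix is strictly upper triangular.
assert np.array_equal(A, np.triu(A, k=1))
```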

Given an instance graph set G, our goal is to generate a schema S that outlines the underlying evolution pattern of complex events under the given complex event type.

Method
We propose the Diffusion Event Graph Model (DEGM) to tackle the event skeleton generation task. Our DEGM is capable of generating temporal event graphs from random noise. Fig. 2 illustrates an overview of DEGM.

Denoising Training
The denoising training stage consists of three steps to reconstruct the event sequence and graph structure: 1) mapping the event graph into its embedding representation in a continuous space; 2) performing a forward step to obtain the latent variables, i.e., representations with various levels of noise; 3) conducting the denoising step to remove the introduced noise from the latent representations.
Embedding representation. Given an instance graph G, we first convert it into a sequence of m events, E = [e_1, e_2, ..., e_m], where e_i denotes the event type of node i, via topological sorting. We then project E into its embedding representation in a continuous embedding space,

e = [EMB_e(e_1), ..., EMB_e(e_m)] ∈ R^{d×m}, (2)

where d is the representation size. Note that m is a preset number of nodes that keeps all graphs aligned. For graphs with fewer than m nodes, we pad the sequence with a pre-defined event type, PAD, which makes the total number of event types M = |Φ| + 1.
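The conversion from graph to padded event-type sequence can be sketched as follows (a minimal sketch using Python's standard-library topological sorter; the function and variable names are our own):

```python
from graphlib import TopologicalSorter

PAD = "PAD"

def graph_to_sequence(event_types, edges, m):
    """Linearize an instance graph into a length-m event-type sequence."""
    ts = TopologicalSorter({i: [] for i in range(len(event_types))})
    for i, j in edges:
        ts.add(j, i)  # node j has predecessor i (i happens before j)
    order = list(ts.static_order())
    seq = [event_types[i] for i in order]
    return seq + [PAD] * (m - len(seq))  # pad up to the preset length m
```

Note that the topological order of a DAG is generally not unique; the effect of this non-uniqueness is examined in the experiments.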

Forward Step

After obtaining the embedded event sequence e, we run the forward process in the diffusion framework to acquire a sequence of latent variables by monotonically increasing the level of introduced noise. We sample x_0 and x_t via

q(x_0 | e) = N(x_0; e, β_0 I), (3)
q(x_t | x_0) = N(x_t; √ᾱ_t x_0, (1 − ᾱ_t) I), (4)

where t = 1, ..., T. Moreover, we introduce two additional embeddings to enhance the expressiveness of the latent variables: the absolute position embedding W_pos ∈ R^{m×d} and the step embedding EMB_s(t). They allow us to capture the event's temporal order in the obtained event sequence and to specify that it is at the t-th diffusion step. Adding them together, we obtain the latent variables at the t-th diffusion step as

h_la^t = x_t + W_pos + EMB_s(t). (5)
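The additive construction of the latent variable in Eq. (5) can be sketched numerically (toy dimensions and random tables stand in for learned embeddings; all names here are our own assumptions):

```python
import numpy as np

d, m, T = 8, 5, 100  # toy representation size, node count, diffusion steps
rng = np.random.default_rng(0)

x_t = rng.standard_normal((m, d))           # noised embeddings at step t
W_pos = rng.standard_normal((m, d))         # absolute position embedding
step_table = rng.standard_normal((T, d))    # one step embedding per t

def latent(x_t, t):
    """h_la^t = x_t + W_pos + EMB_s(t); the step row broadcasts over positions."""
    return x_t + W_pos + step_table[t]

h = latent(x_t, 10)
```

The position embedding differs per node, while the step embedding is shared across all m positions for a given t.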

Denoising Step

Before optimizing the two objectives, event sequence reconstruction and graph structure reconstruction, we first convert the latent variable h_la^t into three representations at two levels: via a shared encoder E_sh to h_sh^t, and via two task-specific encoders, the node-type encoder E_ty to h_ty^t and the node-structure encoder E_st to h_st^t. That is,

h_sh^t = E_sh(h_la^t), h_ty^t = E_ty(h_sh^t), h_st^t = E_st(h_sh^t).

In the following, we outline the construction of the encoders E_sh, E_ty, and E_st, each containing l layers. With a slight abuse of notation, we define h = [h_1, ..., h_m] as the input representation of a layer and h' = [h'_1, ..., h'_m] as the corresponding output. We utilize graph attention (Veličković et al., 2018) to transform the input representation into a higher-level representation:

h'_i = σ( Σ_j α_ij W h_j ),

where W ∈ R^{d×d} is a weight matrix and σ(·) is a nonlinear activation function. The attention weight α_ij is defined by

α_ij = softmax_j( LR(a^T [W h_i || W h_j]) ),

where a ∈ R^{2d} is a weight vector, LR is the LeakyReLU activation function, and || denotes the concatenation operation. We compute attention weights in this way instead of relying on the inner product to prevent higher attention weights between atomic events of the same event type, which is not appropriate for constructing the event graph. For instance, the attention weight between two independent Attack events should be less than the weight between one Attack and its successor events. After obtaining h_ty^t and h_st^t via E_ty and E_st, respectively, we compute two losses at the t-th diffusion step, the event sequence reconstruction loss L_ty^t(G) and the graph structure reconstruction loss L_st^t(G):

L_ty^t(G) = CE( E, softmax(h_ty^t W_e^T) ), (11)
L_st^t(G) = Σ_{i<j} BCE( MLP([h_st,i^t || h_st,j^t]), a_ij ). (12)

The objective of L_ty^t(G) in Eq. (11) is to reduce the difference between the ground truth E and h_ty^t W_e^T ∈ R^{m×M}, which represents the probabilities of each node belonging to each event type. It is worth noting that L_ty^t(G) is a simplified version of the training objective in Eq. (1), and empirically it improves the quality of the generated schemas. Meanwhile, the objective of L_st^t(G) in Eq. (12) is to predict the probability of a directed edge from node i to node j and fit the corresponding adjacency matrix value a_ij ∈ A. Finally, we obtain the model by minimizing the loss

L(G) = Σ_{t=1}^{T} ( L_ty^t(G) + λ L_st^t(G) ), (13)

where T denotes the total number of diffusion steps and λ is a constant balancing the two objectives. When training our model, we randomly select a few instance graphs and sample a diffusion step t for each of them. We then minimize Eq. (13) to update the model's weights until convergence.
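The attention-weight computation described above can be sketched as follows (a minimal NumPy sketch; the choice of tanh as the activation σ is our own assumption, and the double loop is written for clarity rather than speed):

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def graph_attention(h, W, a):
    """One graph-attention layer:
    h'_i = sigma( sum_j alpha_ij * W h_j ),
    alpha_ij = softmax_j( LeakyReLU( a^T [W h_i || W h_j] ) )."""
    m = h.shape[0]
    Wh = h @ W                                    # (m, d)
    scores = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            cat = np.concatenate([Wh[i], Wh[j]])  # [W h_i || W h_j]
            scores[i, j] = leaky_relu(a @ cat)
    scores -= scores.max(axis=1, keepdims=True)   # numerically stable softmax
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return np.tanh(alpha @ Wh)                    # sigma = tanh (an assumption)
```

Because the score is an asymmetric function of the pair rather than an inner product, two nodes of the same type need not receive inflated attention toward each other.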

Schema Generation
We start the schema generation procedure from h̃_la^T ∈ R^{m×d}, sampled from Gaussian noise. We then compute its shared representation h̃_sh^t and the node-type representation h̃_ty^t at the t-th diffusion step in reverse:

h̃_sh^t = E_sh(h̃_la^t), h̃_ty^t = E_ty(h̃_sh^t), (14)
h̃_la^{t−1} = h̃_ty^t, t = T, ..., 1. (15)

After T denoising steps, we obtain the final representations h̃_sh^0 and h̃_ty^0, and compute h̃_st^0 = E_st(h̃_sh^0). Next, we apply the node-type representation h̃_ty^0 and the structure representation h̃_st^0 to generate the schema. First, with h̃_ty^0 = [h̃_ty^1, ..., h̃_ty^m] ∈ R^{m×d}, we obtain each event's type ẽ_i ∈ Ẽ by assigning the event type whose embedding is nearest to h̃_ty^i:

ẽ_i = argmin_{e ∈ Φ} || EMB_e(e) − h̃_ty^i ||². (16)

Second, with h̃_st^0 = [h̃_st^1, ..., h̃_st^m] ∈ R^{m×d}, we predict the directed edge from node i to node j, where i < j, using a pre-trained classifier MLP trained via Eq. (12):

ã_ij = 1[ MLP([h̃_st^i || h̃_st^j]) > τ ], (17)

where τ is a threshold determining the final edges and ã_ij ∈ Ã is the adjacency matrix value of the generated schema. We generate the schema from the reconstructed event sequence Ẽ and adjacency matrix Ã, remove PAD-type events and their associated edges, and derive the final schema S.
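The rounding and edge-thresholding steps of schema generation can be sketched as follows (a minimal sketch; array shapes follow the text, but the function names and the use of Euclidean distance for "nearest" are our own assumptions):

```python
import numpy as np

def round_to_types(h_ty, type_emb):
    """Assign each node the event type whose embedding is nearest to h_ty^i."""
    # h_ty: (m, d) node-type representations; type_emb: (M, d) type embeddings
    dists = np.linalg.norm(h_ty[:, None, :] - type_emb[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # (m,) event-type ids

def threshold_edges(edge_probs, tau=0.8):
    """Keep a directed edge i -> j (i < j) when its predicted probability > tau."""
    return np.triu(edge_probs > tau, k=1).astype(int)
```

Restricting edges to i < j keeps the generated adjacency matrix upper triangular, so the schema remains a DAG by construction.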

Datasets
We conduct experiments to evaluate our model on three IED bombing datasets (Li et al., 2021; Jin et al., 2022). Each dataset is associated with a distinct complex event type: General IED, Car bombing IED, and Suicide IED. Taking the complex event type Car bombing IED as an example, constructing the corresponding dataset requires building an instance graph set in which each instance graph describes a complex event, e.g., the Kabul ambulance bombing. Li et al. (2021) first identify complex events related to the complex event type based on Wikipedia. Each instance graph is then constructed from the reference news articles in the Wikipedia pages related to the complex event. Specifically, Li et al. (2021) utilized the state-of-the-art information extraction system RESIN (Wen et al., 2021) to extract atomic events, represented as event types, and their temporal relations from news articles, yielding the instance graph set. Next, human curation is performed to ensure the soundness of the instance graphs (Jin et al., 2022). We utilize the released curated datasets for our experiments and follow previous work (Jin et al., 2022) in splitting the data into train, validation, and test sets. The statistics of the three datasets are summarized in Table 1.

Table 1: The statistics for the three datasets. "e" and "ee" denote event and event-event, respectively.

Baselines
We compare our method with the following strong baselines:
• Temporal Event Graph Model (TEGM) (Li et al., 2021): TEGM is an autoregressive method that generates, step by step, each event and the edges between the newly generated event and existing events, and subsequently uses greedy decoding to obtain the schema, starting from a specially predefined START event.
• Frequency-Based Sampling (FBS) (Jin et al., 2022): FBS first counts the occurrence frequency of edges between two event types in the train set. A schema is then constructed in which each node corresponds to one event type and which initially has no edges. After that, FBS samples a pair of event types according to the edge occurrence frequencies and adds an edge between the corresponding nodes to the schema. The process is repeated until a newly added edge would result in a cycle in the schema.
• DoubleGAE (Jin et al., 2022): DoubleGAE generates an event graph based on D-VAE (Zhang et al., 2019). It first uses a directed GCN encoder to obtain the mean and variance of the event graph's latent variables and then recovers the event graph from the sampled latent variables in an autoregressive paradigm similar to TEGM. Finally, it obtains the schema by feeding hidden variables sampled from Gaussian noise into the model.
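For reference, the FBS baseline can be sketched as follows (a minimal sketch under our own assumptions about data structures: `edge_counts` maps an ordered pair of event types to its train-set frequency, and all names here are our own):

```python
import random
from graphlib import TopologicalSorter, CycleError

def fbs_schema(edge_counts, rng=random.Random(0)):
    """Frequency-Based Sampling sketch: repeatedly sample a typed edge in
    proportion to its train-set frequency; stop when an edge would close a cycle."""
    pairs = list(edge_counts)
    weights = [edge_counts[p] for p in pairs]
    edges = set()
    while True:
        u, v = rng.choices(pairs, weights=weights)[0]
        candidate = edges | {(u, v)}
        preds = {}  # predecessor map for a cycle check via topological sort
        for a, b in candidate:
            preds.setdefault(b, set()).add(a)
        try:
            list(TopologicalSorter(preds).static_order())
        except CycleError:
            return edges  # adding (u, v) would create a cycle: stop
        edges = candidate
```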

Experimental Setup
Quantitative metrics. We train our model on the train set of a given dataset and then generate the schema according to Sec. 3.2. To evaluate the quality of the schema, we compare it with the instance graphs in the test set using the following metrics: (1) Event type match. We compute the set of event types in the generated schema and the set for a test instance graph, and compute the F1 score between the two sets to see whether our schema contains the event types of real-world complex events.
(2) Event sequence match. We compute the set of event sequences of length 2 (or 3) in the generated schema, as well as the set for a test instance graph, and compute the F1 score between the two sets to measure how well the schema captures substructures of the test instance graphs. Note that we report, for each metric, the average value over all instance graphs in the test set. We generate a set of candidate schemas, test their performance on the validation set, and select the best-performing one as the final schema for the focused complex event type.
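The set-based F1 and the extraction of length-2 (or length-3) event sequences can be sketched as follows (a minimal sketch; we read an "event sequence" as a directed path of event types in the graph, which is our own interpretation of the metric):

```python
def f1_between_sets(pred, gold):
    """F1 between two sets of items (event types or event-type sequences)."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def event_sequences(edges, types, length=2):
    """All event-type paths of exactly `length` nodes in a DAG."""
    adj = {}
    for i, j in edges:
        adj.setdefault(i, []).append(j)
    out = set()
    def walk(path):
        if len(path) == length:
            out.add(tuple(types[n] for n in path))
            return
        for nxt in adj.get(path[-1], []):
            walk(path + [nxt])
    for n in range(len(types)):
        walk([n])
    return out
```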
Implementation details. For our DEGM, the representation dimension d is 256. The number of encoder layers l is set to 4. The graph structure reconstruction loss weight λ is 1, and the edge classification threshold τ is 0.8. The learning rate is 1e-4 and the number of training epochs is 100. All hyperparameters are chosen based on the validation set. We select the best checkpoint, and the best-performing schema on the validation set, according to the event type match (F1) metric. The maximum number of graph nodes m is 50, and the number of candidate schemas is 500, following Jin et al. (2022). The number of event types in the DARPA KAIROS ontology is 67. We define the noise schedule as ᾱ_t = 1 − √((t + 1)/T) following Li et al. (2022), and the total number of diffusion steps T is 100. All experiments are conducted on a Tesla A100 GPU with 40G memory.

Results and Analysis
Table 2 reports the main results of our model and shows some notable observations: (1) our model achieves significant improvements over the baselines across the three datasets and three metrics; (2) the average performance of the generated candidate schemas is also better than that of previous methods. The first observation can be attributed to our model's ability to iteratively refine the generated schema, enabling the node types and edges between nodes to better match the evolution pattern of unseen complex events, resulting in superior performance on the test set. In contrast, the Temporal Event Graph Model (TEGM) can only generate the next event based on the partially generated event graph during training and generation. DoubleGAE mitigates this problem by utilizing an encoder structure to capture the global structure of instance graphs. However, DoubleGAE still employs a generation procedure similar to TEGM's during schema generation, resulting in a substantial performance gap with our method. Meanwhile, the performance of FBS is much lower than our method's, indicating that a heuristic approach struggles to generate such a schema and demonstrating the necessity of probabilistic modeling of the event graphs.
For the second observation, we argue that our model is proficient at modeling the distribution of instance graphs. Selecting the best-performing schema based on the validation set also helps immensely, especially for the event sequence match (F1) (l=3) metric. This may be because this metric is more sensitive to the gap between the true distribution of instance graphs and the modeled distribution, and selecting the schema based on the validation set reduces this gap.

Ablation Studies
We verify the importance of our simplified training objective and of a design choice made while generating the schema through two ablation studies. As shown in Figure 4, our simplified training objective L_ty^t(G) in Eq. (11) performs significantly better than the original objective in Eq. (1). This may be because the original training objective includes three optimization terms, while ours includes only one; too many optimization terms may lead to a larger loss variance, making convergence difficult and thus degrading performance. At the same time, both training objectives share the same goal: to maximize the model's ability to reconstruct the original event sequence at each diffusion step.
Besides, we investigate an alternative in which we assign h_la^{t−1} as h_st^t in Eq. (15) while generating the schema, to explore whether it would be better to denoise based on the structure representation h_st^t. However, this leads to a collapse of the event type match (F1) metric, as shown in Figure 4, probably because the model is trained on the embedded event sequence to reconstruct the event sequence and its graph structure. Therefore, the model prefers to denoise based on the node-type representation h_ty^t.

Impact of Topological Sorting
Our approach, like previous autoregressive graph generation methods, requires a topological sorting of the instance graph, and the resulting sorted graph is not unique. Therefore, we investigate whether the model's performance is affected when we train it with multiple isomorphic instance graphs randomly sorted from one instance graph. Obtaining n randomly sorted instance graphs from one instance graph is equivalent to expanding the training set n times. We test our model's performance with n ranging from 1 to 9. As shown in Figure 3, training our model on the expanded training set hardly affects its performance across all three datasets and three metrics, indicating that our model captures the evolution pattern of an instance graph from only one sorted version. In Figure 5, we present a snippet of the schema generated by our model. From it, we observe two phenomena: (1) the generated schema contains precise atomic event types and the common substructures; (2) the model has a tendency to generate repeated subsequent events and substructures. The superior performance of our model is explained by the first phenomenon, which demonstrates its ability to accurately generate both events and substructures. The second phenomenon, however, highlights a drawback: the tendency to produce duplicate substructures and events. Further analysis revealed that this repetition is caused by a high number of repetitive substructures in the training set, since the instance graphs were extracted from news articles, which can be noisy. As a result, the model learns to replicate these patterns.

Related Work
According to Jin et al. (2022), event schema induction can be divided into three categories. (1) Atomic event schema induction (Chambers, 2013; Cheung et al., 2013; Nguyen et al., 2015; Sha et al., 2016; Yuan et al., 2018) focuses on inducing an event template, called an atomic event schema, for multiple similar atomic events. The template includes an abstracted event type and a set of entity roles shared by all atomic events, while ignoring the relations between events. (2) Narrative event schema induction (Chambers and Jurafsky, 2008, 2009; Jans et al., 2012; Rudinger et al., 2015; Granroth-Wilding and Clark, 2016; Zhu et al., 2022; Gao et al., 2022a,b; Long et al., 2022; Yang et al., 2021), in contrast, pays attention to the relations between events. In this task, a schema is defined as a narrative-ordered sequence of events, with each event including its entity roles. However, complex events in real-world scenarios often consist of multiple events and entities with intertwined relations.
To understand such complex events, Li et al. (2020) incorporate graph structure into the schema definition; however, they only consider the relations between two events and their entities. (3) Temporal complex event schema induction: recently, Li et al. (2021) propose this task, in which a schema consists of events, entities, temporal relations between events, relations between entities, and relations between events and entities (i.e., arguments). Each event and entity is abstracted as an event type or entity type, and each event type contains multiple predefined arguments associated with entities. To address this task, Li et al. (2021) generate the schema event by event. Each time an event is generated, the model links it to existing events, expands it with predefined arguments and entities, and links the entities to existing nodes. This approach prevents the entities from perceiving the events' positions, so entities cannot distinguish between events of the same type. Therefore, Jin et al. (2022) divide the task into two stages: event skeleton generation and entity-entity relation completion. In the first stage, they employ an autoregressive directed graph generation method (Zhang et al., 2019) to generate the schema skeleton, including events and their relations. In the second stage, they expand the schema skeleton with predefined arguments and entities and complete the remaining relations via the link prediction method VGAE (Kipf and Welling, 2016).
The above event graph induction methods suffer from error accumulation due to the limitations of the autoregressive schema generation paradigm. To address this issue, we propose DEGM, which utilizes a denoising training process to enhance the model's robustness to errors and a schema generation process that continuously corrects errors in the generated schema.

Conclusions
We propose the Diffusion Event Graph Model, the first workable diffusion model for event skeleton generation. A significant breakthrough is converting the discrete nodes in event instance graphs into a continuous space via embedding and rounding techniques and a custom edge-based loss. The denoising training process improves model robustness. During the schema generation process, we iteratively correct errors in the schema via latent representation refinement. Experimental results on the three IED bombing datasets demonstrate that our approach achieves better results than other state-of-the-art baselines.

Limitations
Our proposed DEGM for event skeleton generation still has some limitations:
• It only considers event skeleton generation, a subtask of temporal complex event schema induction. It is promising to explore the whole task, which includes entities and entity-event relations.
• Error analysis shows that our model has a tendency to generate duplicate (albeit correct) substructures.

Ethics Statement
We follow the ACL Code of Ethics. In our work, there are no human subjects, and informed consent is not applicable.

Figure 2:
Figure 2: The procedure of training our DEGM. At the preprocessing step, an instance graph G is converted via topological sorting into a temporal sequence of events e and the associated adjacency matrix A, which represents the graph structure. Following that, we perform DEGM accordingly. We first convert the discrete events into their representations in a continuous space. The forward step and the denoising step are conducted iteratively to reconstruct the event sequence and the graph structure. Note that we convert the latent variable h_la^t into three representations at two levels, i.e., the shared representation h_sh^t and two task-specific representations for the node's type h_ty^t and the node's structure h_st^t, respectively; see more details in the text.

Figure 3:
Figure 3: To investigate the impact of topological sorting, we extend the train set by obtaining multiple (isomorphic graph number) isomorphic instance graphs sorted from one original train instance graph. We train and test our model on the extended dataset. All results are mean values under five different random seeds.

Figure 5:
Figure 5: A snippet of schema generated by DEGM.

Table 2:
Table 2: Results of all methods on the three datasets. Our results include the mean and variance under five different random seeds, while other methods' results are from previous work. The best results are in bold.