SPEECH: Structured Prediction with Energy-Based Event-Centric Hyperspheres

Event-centric structured prediction involves predicting structured outputs of events. In most NLP cases, event structures are complex with manifold dependency, and it is challenging to represent these complicated structured events effectively. To address these issues, we propose Structured Prediction with Energy-based Event-Centric Hyperspheres (SPEECH). SPEECH models complex dependency among structured event components with energy-based modeling, and represents event classes with simple but effective hyperspheres. Experiments on two unified-annotated event datasets indicate that SPEECH is predominant in event detection and event-relation extraction tasks.


Introduction
Structured prediction (Taskar et al., 2005) is a task where the predicted outputs are complex structured components. It arises in many NLP tasks (Smith, 2011; Kreutzer et al., 2017; Wang et al., 2023) and supports various applications (Jagannatha and Yu, 2016; Kreutzer et al., 2021). In event-centric NLP tasks, there exists strong complex dependency between the structured outputs, such as event detection (ED) (Chen et al., 2015), event-relation extraction (ERE), and event schema induction. Thus, these tasks can also be revisited as event-centric structured prediction problems (Li et al., 2013).
Event-centric structured prediction (ECSP) tasks require considering manifold structures and dependency of events, including intra- and inter-sentence structures. For example, as seen in Figure 1, given a document containing the event mentions "David Warren shot and killed Henry Glover ... David was convicted and sentenced to 25 years and 9 months ...", the ED task, which mainly considers intra-sentence structures, requires identifying the event triggers (killed, convicted) from these tokens and categorizing them into specific event classes, while the ERE task, which mainly considers inter-sentence structures, requires identifying the relations between event mention pairs across the document.

[Figure 1: An example document with annotated event mentions. S1: Former NOPD police officer David Warren shot and killed Henry Glover. S2: Five current and former officers of the NOPD were charged with Glover's death. S3: David was convicted and sentenced to 25 years and 9 months in prison for shooting and killing Glover.]
As seen from Figure 1, the outputs of ECSP lie on a complex manifold and possess interdependent structures, e.g., the long-range dependency of tokens, the association among triggers and event classes, and the dependency among event classes and event relations. It is thus challenging to model such complex event structures while efficiently representing these events. Previous works increasingly apply deep representation learning to tackle these problems. Some works propose to predict event structures based on the event graph schema. Hsu et al. (2022) generate event structures with manually designed prompts. However, these methods mainly focus on one of the ECSP tasks, and their event structures are hard to represent effectively. Paolini et al. (2021) and Lu et al. (2021, 2022) propose to extract multiple event structures from texts with a unified generation paradigm. However, the event structures of these approaches are usually quite simplistic, and they often ignore the complex dependency among tasks. In this paper, we focus on: (i) how to learn complex event structures for manifold ECSP tasks; and (ii) how to simultaneously represent events effectively for these complex structured prediction models.
To resolve the first challenge of modeling manifold event structures, we utilize energy networks (Lecun et al., 2006; Belanger and McCallum, 2016; Belanger et al., 2017; Tu and Gimpel, 2018), inspired by their potential benefits in capturing complex dependency of structured components. We define the energy function to evaluate the compatibility of input/output pairs, which places no limits on the size of the structured components, making it powerful for modeling complex and manifold event structures. We consider token-, sentence-, and document-level energy respectively for the trigger classification, event classification, and event-relation extraction tasks. To the best of our knowledge, this work is the first to address event-centric structured prediction with energy-based modeling.
To resolve the second challenge of efficiently representing events, we take advantage of hyperspheres (Mettes et al., 2019; Wang and Isola, 2020), which have been demonstrated to be a simple and effective approach to class representation (Deng et al., 2022). We assume that the event mentions of each event class distribute on the corresponding energy-based hypersphere, so that we can represent each event class with a hyperspherical centroid and radius embedding. This geometrical modeling strategy (Ding et al., 2021; Lai et al., 2021) has been demonstrated to be beneficial for modeling enriched class-level information and suitable for constructing measurements in Euclidean space, making it intuitively applicable to manifold event-centric structured prediction tasks.
In summary, considering these two issues, we propose Structured Prediction with Energy-based Event-Centric Hyperspheres (SPEECH), and our contributions can be summarized as follows:
• We revisit event-centric structured prediction tasks in consideration of both complex event structures with manifold dependency and efficient representation of events.
• We propose a novel approach named SPEECH to model complex event structures with energy-based networks and efficiently represent events with event-centric hyperspheres.
• We evaluate SPEECH on two newly proposed datasets for both event detection and event-relation extraction, and experiments demonstrate that our model is advantageous.
Related Work

Event-Centric Structured Prediction. Since the boom of deep learning, approaches to ECSP have mostly defined a score function between inputs and outputs based on a neural network, such as CNN (Chen et al., 2015; Deng et al., 2020), RNN (Nguyen et al., 2016; Meng and Rumshisky, 2018; Nguyen and Nguyen, 2019), and GCN (Yan et al., 2019; Lai et al., 2020; Cui et al., 2020). With the development of large pre-trained models, more recent research has entered a new era. Lu et al. (2022) propose generative ECSP models based on pre-trained T5 (Raffel et al., 2020). Wang et al. (2023) tackle ECSP with code generation based on code pretraining. However, these approaches are equipped with fairly simplistic event structures and have difficulty tackling complex dependency in events. Besides, most of them fail to represent manifold events effectively.

Energy Networks for Structured Prediction and Hyperspheres for Class Representation. Energy networks define an energy function over input/output pairs with arbitrary neural networks, which places no limits on the size of the structured components, making them advantageous in modeling complex and manifold event structures. Lecun et al. (2006) associate a scalar measure with each configuration of inputs and outputs to evaluate their compatibility. Belanger and McCallum (2016) formulate deep energy-based models for structured prediction, called structured prediction energy networks (SPENs). Belanger et al. (2017) present end-to-end learning for SPENs, and Tu and Gimpel (2018) jointly train structured energy functions and inference networks with large-margin objectives. Some previous studies also regard event-centric NLP tasks as structured prediction (Li et al., 2013; Paolini et al., 2021). Furthermore, to obtain effective event representations, Deng et al. (2022) demonstrate that hyperspherical prototypical networks (Mettes et al., 2019) are powerful in encoding enriched semantics and dependency in event structures, but they merely support pairwise event structures.

Preliminaries
For structured prediction tasks, given an input $x \in \mathcal{X}$, we denote the structured outputs by $M_\Phi(x) \in \tilde{\mathcal{Y}}$ with a prediction model $M_\Phi$. Structured Prediction Energy Networks (SPENs) score structured outputs with an energy function $E_\Theta : \mathcal{X} \times \tilde{\mathcal{Y}} \to \mathbb{R}$ parameterized by $\Theta$, and iteratively optimize the energy between the input/output pair (Belanger and McCallum, 2016), where lower energy means greater compatibility between the pair.
We introduce event-centric structured prediction (ECSP) following a similar setting to SPENs for multi-label classification and sequence labeling proposed by Tu and Gimpel (2018). Given a feature vector $x$ with $T$ candidate labels, the model predicts a label vector $y \in \{0, 1\}^T$. The energy function contains two terms:

$$E_\Theta(x, y) = \sum_{i=1}^{T} y_i \, V_i^\top f(x) + w^\top g(W y) \quad (1)$$

where the first term $E^{feat}_\Theta(x, y) = \sum_{i=1}^{T} y_i V_i^\top f(x)$ is the sum of linear models, $y_i \in y$, $V_i$ is a parameter vector for label $i$, and $f(x)$ is a multi-layer perceptron computing a feature representation for the input $x$. The second term $E^{label}_\Theta(y) = w^\top g(W y)$ returns a scalar quantifying the full set of labels, scoring $y$ independently of $x$; here $w$ is a parameter vector, $g(\cdot)$ is an elementwise non-linearity, and $W$ is a parameter matrix learned from data that captures the interactions between labels.
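To make Eq (1) concrete, the following is a minimal PyTorch sketch of a SPEN energy for multi-label classification; the module name SPENEnergy, the MLP depth, and the initialization scale are our own illustrative choices, not details from the paper.

```python
import torch
import torch.nn as nn

class SPENEnergy(nn.Module):
    """Minimal SPEN energy E(x, y) = sum_i y_i V_i^T f(x) + w^T g(W y)."""

    def __init__(self, input_dim: int, hidden_dim: int, num_labels: int):
        super().__init__()
        # f(x): MLP feature representation of the input
        self.feat_mlp = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.V = nn.Parameter(torch.randn(num_labels, hidden_dim) * 0.01)  # V_i per label
        self.W = nn.Parameter(torch.randn(num_labels, num_labels) * 0.01)  # label interactions
        self.w = nn.Parameter(torch.randn(num_labels) * 0.01)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: (batch, input_dim); y: (batch, num_labels), soft or 0/1 label vectors
        fx = self.feat_mlp(x)                          # f(x): (batch, hidden_dim)
        e_feat = (y * (fx @ self.V.t())).sum(dim=-1)   # sum_i y_i V_i^T f(x)
        e_label = torch.relu(y @ self.W.t()) @ self.w  # w^T g(W y), with g = ReLU
        return e_feat + e_label                        # (batch,) scalar energies
```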
After learning the energy function, prediction minimizes energy:

$$\tilde{y} = \arg\min_{y \in \tilde{\mathcal{Y}}} E_\Theta(x, y) \quad (2)$$

The final theoretical optimum for SPENs is denoted by:

$$\min_\Theta \sum \left[ \triangle(\tilde{y}, y) - E_\Theta(x, \tilde{y}) + E_\Theta(x, y) \right]_+ \quad (3)$$

where $[a]_+ = \max(0, a)$, and $\triangle(\tilde{y}, y)$, often referred to as the "margin-rescaled" structured hinge loss, is a structured cost function that returns a non-negative value indicating the difference between the predicted result $\tilde{y}$ and the ground truth $y$.
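Under the same assumptions, a short sketch of the margin-rescaled hinge objective in Eq (3); `delta` stands for any non-negative structured cost, with per-label Hamming distance shown as one common choice.

```python
import torch

def margin_rescaled_hinge(energy_fn, x, y_pred, y_gold, delta):
    """Margin-rescaled structured hinge of Eq (3):
    [ delta(y_pred, y_gold) - E(x, y_pred) + E(x, y_gold) ]_+ ."""
    margin = delta(y_pred, y_gold)
    loss = margin - energy_fn(x, y_pred) + energy_fn(x, y_gold)
    return torch.clamp(loss, min=0.0).mean()

def hamming(y_pred, y_gold):
    # Example structured cost: number of mismatched labels per instance.
    return (y_pred.round() != y_gold).float().sum(dim=-1)
```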

Problem Formulation
In this paper, we focus on the ECSP tasks of event detection (ED) and event-relation extraction (ERE). ED can be divided into trigger classification for tokens and event classification for sentences. We denote the dataset by $\mathcal{D} = \{\mathcal{E}, \mathcal{R}, \mathcal{X}\}$, containing an event class set $\mathcal{E}$, a multi-faceted event-relation set $\mathcal{R}$, and the event corpus $\mathcal{X}$, where $\mathcal{R}$ comprises temporal, causal, subevent, and coreference relationships among event mentions, including a NA event-relation. For trigger classification, the goal is to predict the index $t$ ($1 \le t \le L$) of the trigger $x_t$ in each token sequence $x$ and categorize $x_t$ into a specific event class $e_i \in \mathcal{E}$. For event classification, we expect to predict the event label $e_i$ for each event mention $X_i$. For event-relation extraction, we are required to identify the relation $r_i \in \mathcal{R}$ for a pair of event mentions $\ddot{X}_{\langle ij \rangle} = (X_i, X_j)$.
In summary, our goal is to design an ECSP model $M_\Phi$ that tackles: (1) trigger classification: predicting the token label $\tilde{y} = M_\Phi(x)$ for the token list $x$; (2) event classification: predicting the event class label $\tilde{Y} = M_\Phi(X)$ for the event mention $X$; and (3) event-relation extraction: predicting the event-relation label $\tilde{z} = M_\Phi(\ddot{X})$ for the event mention pair $\ddot{X}$.

Model Overview
As seen in Figure 2, SPEECH combines three levels of energy: token, sentence, and document, which respectively serve three kinds of ECSP tasks: (1) token-level energy for trigger classification, since energy-based modeling can capture long-range dependency among tokens without limits on token count; (2) sentence-level energy for event classification, since energy-based hyperspheres can model complex event structures and represent events efficiently; and (3) document-level energy for event-relation extraction, since energy-based modeling enables us to address the association among event mention pairs and event-relations. We leverage trigger embeddings as event mention embeddings and energy-based hyperspheres, each with a centroid and a radius, as event class embeddings; the three tasks are associated with each other.

[Figure 2: The SPEECH framework illustrated on the sentence "[CLS] Former NOPD police officer David Warren shot and killed Henry Glover" with event classes Killing and Legal_rulings; an energy-based hypersphere embedding with a centroid and a radius serves as the event type embedding.]

Token-Level Energy
Token-level energy serves trigger classification. Given a token sequence $x = \{x_j \mid j \in [1, L]\}$ with trigger $x_t$, we leverage a pluggable backbone encoder, such as pre-trained BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), or DistilBERT (Sanh et al., 2019), to obtain the contextual representation $f_1(x_j)$ for each token. We then predict the label $\tilde{y} = M_\Phi(x)$ of each token with an additional linear classifier. Inspired by SPENs for sequence labeling (Tu and Gimpel, 2018), we also adopt an energy function for token classification.

Energy Function. The token-level energy function is inherited from Eq (1), defined as:

$$E_\Theta(x, y) = \sum_{n=1}^{L} \sum_{i=1}^{|E|+2} y_n^i \, V_{1,i}^\top f_1(x_n) + \sum_{n=2}^{L} y_{n-1}^\top W_1 \, y_n \quad (4)$$

where $y_n^i$ is the $i$-th entry of the vector $y_n \in y$, indicating the probability of the $n$-th token $x_n$ being labeled with $i$ ($i$ for $e_i$, $|E|+1$ for non-trigger, and $|E|+2$ for the padding token), and $f_1(\cdot)$ denotes the feature encoder of tokens. Here our learnable parameters are $\Theta = (V_1, W_1)$, where $V_{1,i} \in \mathbb{R}^d$ is a parameter vector for token label $i$, and $W_1 \in \mathbb{R}^{(|E|+2) \times (|E|+2)}$ contains the bilinear products between $y_{n-1}$ and $y_n$ for token-label pair terms.
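A minimal sketch of the token-level energy of Eq (4), assuming the encoder features $f_1(x)$ are already computed; the class name TokenEnergy and the tensor layout are our own illustration.

```python
import torch
import torch.nn as nn

class TokenEnergy(nn.Module):
    """Token-level energy of Eq (4): unary label scores plus a
    bilinear transition term between adjacent token labels."""

    def __init__(self, hidden_dim: int, num_token_labels: int):
        super().__init__()
        self.V1 = nn.Parameter(torch.randn(num_token_labels, hidden_dim) * 0.01)
        self.W1 = nn.Parameter(torch.randn(num_token_labels, num_token_labels) * 0.01)

    def forward(self, f1_x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # f1_x: (batch, L, hidden_dim) encoder features f1(x_n)
        # y:    (batch, L, num_token_labels) soft label distributions y_n
        unary = (y * (f1_x @ self.V1.t())).sum(dim=(-1, -2))  # sum_n sum_i y_n^i V_1i^T f1(x_n)
        pair = torch.einsum("bni,ij,bnj->b",
                            y[:, :-1], self.W1, y[:, 1:])     # sum_n y_{n-1}^T W1 y_n
        return unary + pair                                    # (batch,) scalar energies
```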
Loss Function. The training objective for trigger classification is denoted by:

$$\mathcal{L}_1 = \sum_i \Big( \big[ \triangle(\tilde{y}_i, y_i) - E_\Theta(x_i, \tilde{y}_i) + E_\Theta(x_i, y_i) \big]_+ + \mu_1 \, \mathcal{L}_{CE}(\tilde{y}_i, y_i) \Big) \quad (5)$$

where $\tilde{y}_i$ and $y_i$ respectively denote the predicted result and the ground truth. The first half of Eq (5) is inherited from Eq (3) for the energy function; in the latter half, $\mathcal{L}_{CE}(\tilde{y}_i, y_i)$ is the trigger classification cross-entropy loss and $\mu_1$ is its ratio.

Sentence-Level Energy
Sentence-level energy serves event classification. Given the event mention $X_i$ with the trigger $x_t$, we utilize the trigger embedding $f_1(x_t)$ as the event mention embedding $f_2(X_i)$, where $f_2(\cdot)$ denotes the feature encoder of event mentions. We then predict the class of each event mention with energy-based hyperspheres, denoted by $\tilde{Y} = M_\Phi(X)$.
Specifically, we use an energy-based hypersphere to represent each event class, and assume that the event mentions of each event class distribute on the corresponding hypersphere with the lowest energy. We then calculate the probability of the event mention $X$ being categorized into the class $e_i$ with a hyperspherical measurement function:

$$S(X, P_i) = \left[ \gamma - \| f_2(X) - P_i \| \right]_+ \quad (6)$$

where $[a]_+ = \max(0, a)$, $P_i$ denotes the hypersphere centroid embedding of $e_i$, $\|\cdot\|$ denotes the Euclidean distance, and $\gamma$ is the radius of the hypersphere, which can be scalable or constant. We simply set $\gamma = 1$ in this paper, meaning that each event class is represented by a unit hypersphere. A larger $S(X, P_i)$ signifies that the event mention $X$ is more likely to be categorized into the hypersphere $P_i$ corresponding to $e_i$. To measure the energy score between event classes and event mentions, we also adopt an energy function for event classification.

Energy Function. The sentence-level energy function is inherited from Eq (1), defined as:

$$E_\Theta(X, Y) = \sum_{i=1}^{|E|} Y_i \, V_{2,i}^\top f_2(X) + w_2^\top g(W_2 Y) \quad (7)$$

where $Y_i \in Y$ indicates the probability of the event mention $X$ being categorized into $e_i$. Here our learnable parameters are $\Theta = (V_2, w_2, W_2)$, where $V_{2,i} \in \mathbb{R}^d$ is a parameter vector for $e_i$, $w_2 \in \mathbb{R}^{|E|}$, and $W_2 \in \mathbb{R}^{|E| \times |E|}$.

Loss Function. The training objective for event classification is denoted by:

$$\mathcal{L}_2 = \sum_k \Big( \big[ \triangle(\tilde{Y}_k, Y_k) - E_\Theta(X_k, \tilde{Y}_k) + E_\Theta(X_k, Y_k) \big]_+ + \mu_2 \, \mathcal{L}_{CE}(\tilde{Y}_k, Y_k) \Big) \quad (8)$$

where the first half is inherited from Eq (3), and in the latter half, $\mathcal{L}_{CE}$ is a cross-entropy loss between the predicted results $\tilde{Y}_k$ and the ground truth $Y_k$; $\mu_2$ is the ratio for the event classification cross-entropy loss.
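A sketch of the hyperspherical measurement of Eq (6) under our reconstruction of the formula; the softmax normalization in class_probs is our assumption, since the text does not spell out how the scores are turned into probabilities.

```python
import torch

def hypersphere_scores(mention: torch.Tensor, centroids: torch.Tensor,
                       gamma: float = 1.0) -> torch.Tensor:
    """S(X, P_i) = [gamma - ||f2(X) - P_i||]_+ for every class centroid P_i.
    mention:   (batch, d) event mention embeddings f2(X)
    centroids: (num_classes, d) hypersphere centroid embeddings P_i"""
    dists = torch.cdist(mention, centroids)      # Euclidean distances (batch, num_classes)
    return torch.clamp(gamma - dists, min=0.0)   # larger score = more likely class

def class_probs(mention, centroids, gamma=1.0):
    # Assumed normalization: softmax over per-class hypersphere scores.
    return torch.softmax(hypersphere_scores(mention, centroids, gamma), dim=-1)
```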

Document-Level Energy
Document-level energy serves event-relation extraction. Given the event mentions $X$ in each document, we model the embedding interactions of each event mention pair with a comprehensive feature vector $f_3(\ddot{X}_{\langle ij \rangle}) = [f_2(X_i); f_2(X_j); f_2(X_i) \odot f_2(X_j)]$. We then predict the relation between each event mention pair with a linear classifier, denoted by $\tilde{z} = M_\Phi(\ddot{X})$. Inspired by SPENs for multi-label classification (Tu and Gimpel, 2018), we also adopt an energy function for ERE.
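A one-line sketch of the pair feature $f_3(\ddot{X}_{\langle ij \rangle})$: the concatenation of both mention embeddings with their elementwise product, yielding a 3d-dimensional vector per pair (the function name is our own).

```python
import torch

def pair_feature(h_i: torch.Tensor, h_j: torch.Tensor) -> torch.Tensor:
    """f3 = [f2(X_i); f2(X_j); f2(X_i) * f2(X_j)], a 3d-dim vector per pair."""
    return torch.cat([h_i, h_j, h_i * h_j], dim=-1)
```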
Energy Function. The document-level energy function is inherited from Eq (1), defined as:

$$E_\Theta(\ddot{X}, z) = \sum_{i=1}^{|R|} z_i \, V_{3,i}^\top f_3(\ddot{X}) + w_3^\top g(W_3 z) \quad (9)$$

where $z_i \in z$ indicates the probability of the event mention pair $\ddot{X}$ having the relation $r_i$. Here our learnable parameters are $\Theta = (V_3, w_3, W_3)$, where $V_{3,i} \in \mathbb{R}^{3d}$ is a parameter vector for $r_i$, $w_3 \in \mathbb{R}^{|R|}$, and $W_3 \in \mathbb{R}^{|R| \times |R|}$.
Loss Function. The training objective for event-relation extraction is denoted by:

$$\mathcal{L}_3 = \sum_{k=1}^{N} \Big( \big[ \triangle(\tilde{z}_k, z_k) - E_\Theta(\ddot{X}_k, \tilde{z}_k) + E_\Theta(\ddot{X}_k, z_k) \big]_+ + \mu_3 \, \mathcal{L}_{CE}(\tilde{z}_k, z_k) \Big) \quad (10)$$

where the first half is inherited from Eq (3); in the latter half, $\mathcal{L}_{CE}(\tilde{z}_k, z_k)$ is the event-relation extraction cross-entropy loss, $\mu_3$ is its ratio, and $N$ denotes the number of event mention pairs.
The final training loss for SPEECH $M_\Phi$, parameterized by $\Phi$, is defined as:

$$\mathcal{L} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2 + \lambda_3 \mathcal{L}_3 + \|\Phi\|_2^2 \quad (11)$$

where $\lambda_1, \lambda_2, \lambda_3$ are the loss ratios respectively for the trigger classification, event classification, and event-relation extraction tasks. We add the penalty term $\|\Phi\|_2^2$ with L2 regularization.
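A sketch of the joint objective of Eq (11), assuming the three task losses have been computed as above; in practice the L2 penalty is often delegated to the optimizer's weight_decay, and the coefficient l2 here is an illustrative choice rather than a value from the paper.

```python
def speech_loss(l_trigger, l_event, l_relation, model,
                lam1=1.0, lam2=1.0, lam3=1.0, l2=1e-5):
    """L = lam1*L1 + lam2*L2 + lam3*L3 + l2*||Phi||_2^2 (cf. Eq 11).
    The explicit penalty is written out for clarity; passing
    weight_decay to the optimizer is the equivalent, common route."""
    penalty = sum(p.pow(2).sum() for p in model.parameters())
    return lam1 * l_trigger + lam2 * l_event + lam3 * l_relation + l2 * penalty
```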

Experiments
The experiments address event-centric structured prediction (ECSP) and comprise three tasks: (1) Trigger Classification; (2) Event Classification; and (3) Event-Relation Extraction.

Datasets. Since the event-centric structured prediction tasks in this paper require fine-grained annotations for events, such as labels of tokens, event mentions, and event-relations, we select two newly proposed datasets meeting these requirements: MAVEN-ERE (Wang et al., 2022) and ONTOEVENT-DOC. Note that ONTOEVENT-DOC is derived from ONTOEVENT (Deng et al., 2021), which is formatted at the sentence level. We reorganize it into a document-level format, similar to MAVEN-ERE; thus the train, validation, and test sets of ONTOEVENT-DOC also differ from the original ONTOEVENT. We release the reconstructed dataset and our code.1

Baselines. For trigger classification and event classification, we adopt models with a dynamic multi-pooling mechanism, i.e., DMCNN (Chen et al., 2015) and DMBERT (Wang et al., 2019); sequence labeling models with a conditional random field (CRF) (Lafferty et al., 2001), i.e., BiLSTM-CRF and BERT-CRF; and generative ED models, i.e., TANL (Paolini et al., 2021) and TEXT2EVENT (Lu et al., 2021). We also adopt ED models considering document-level associations, i.e., MLBiNet (Lou et al., 2021) and CorED-BERT (Sheng et al., 2022). Besides, we compare our energy-based hyperspheres with the vanilla hyperspherical prototype network (HPN) (Mettes et al., 2019) and the prototype-based model OntoED (Deng et al., 2021). Note that unlike vanilla HPN (Mettes et al., 2019), which represents all classes on one hypersphere, the HPN adopted in this paper represents each class with a distinct hypersphere. For event-relation extraction, we select RoBERTa (Liu et al., 2019), which is the same baseline used in MAVEN-ERE (Wang et al., 2022) and also serves as the backbone of most recent ERE models (Hwang et al., 2022; Man et al., 2022).

1 https://github.com/zjunlp/SPEECH.

Implementation Details
For training, we use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 5e-5. The maximum length L of a token sequence is 128, and the maximum quantity of event mentions in one document is set to 40 for MAVEN-ERE and 50 for ONTOEVENT-DOC. The ratios µ1, µ2, µ3 in the token-, sentence-, and document-level loss functions are all set to 1. The values of the loss ratios λ1, λ2, λ3 for trigger classification, event classification, and event-relation extraction depend on the task, and we introduce them in Appendix B. We evaluate the performance of ED and ERE with micro Precision (P), Recall (R), and F1 score (F1).
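For reference, the stated hyperparameters collected into a small config sketch; the dataclass and field names are our own, and anything not reported in this section (e.g., batch size, epochs) is deliberately omitted.

```python
from dataclasses import dataclass

@dataclass
class SpeechConfig:
    # Values reported above; the loss ratios lam1-lam3 are task-dependent (Appendix B).
    learning_rate: float = 5e-5       # Adam
    max_seq_length: int = 128         # max token sequence length L
    max_mentions_maven_ere: int = 40  # max event mentions per document (MAVEN-ERE)
    max_mentions_ontoevent: int = 50  # max event mentions per document (ONTOEVENT-DOC)
    mu1: float = 1.0                  # token-level cross-entropy ratio
    mu2: float = 1.0                  # sentence-level cross-entropy ratio
    mu3: float = 1.0                  # document-level cross-entropy ratio
```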

Event Trigger Classification
We present the details of the event trigger classification experiment settings in Appendix B.1. As seen from the results in Table 2, SPEECH demonstrates superior performance over all baselines, notably MLBiNet (Lou et al., 2021) and CorED-BERT (Sheng et al., 2022), even though these two models consider cross-sentence semantic information or incorporate type-level and instance-level correlations. The main reason may be the energy-based nature of SPEECH: as seen from the last row of Table 2, removing the energy functions from SPEECH results in a performance decrease. Specifically for trigger classification, energy-based modeling enables capturing long-range dependency of tokens and places no limits on the size of event structures. In addition, SPEECH also surpasses the generative models TANL (Paolini et al., 2021) and TEXT2EVENT (Lu et al., 2021), further demonstrating the efficacy of energy-based modeling.

Event Classification
The specifics of the event classification experiment settings are elaborated in Appendix B.2, with results shown in Table 3. We observe that SPEECH provides considerable advantages on MAVEN-ERE, while its performance on ONTOEVENT-DOC is less pronounced. ONTOEVENT-DOC contains overlaps where multiple event classes may occur in the same event mention, which could be the primary reason SPEECH does not perform as well in this case; this effect can be exacerbated when jointly training with other ECSP tasks. Compared with prototype-based methods without energy-based modeling, i.e., HPN (Mettes et al., 2019) and OntoED (Deng et al., 2021), SPEECH remains dominant on MAVEN-ERE, even though HPN represents classes with hyperspheres and OntoED leverages hyperspheres integrated with event-relation semantics. If we exclude the energy functions from SPEECH, performance degrades, as seen in the last row of Table 3. This suggests that energy functions contribute positively to event classification, since they enable the model to directly capture complicated dependency between event mentions and event classes instead of implicitly inferring it from data. Besides, SPEECH also outperforms generative models such as TANL and TEXT2EVENT on MAVEN-ERE, indicating the superiority of energy-based hyperspherical modeling.

Event-Relation Extraction
We present the specifics of the event-relation extraction experiment settings in Appendix B.3.

[Table 4: F1 (%) performance of ERE on the MAVEN-ERE valid set and the ONTOEVENT-DOC test set. "+joint" in the 2nd column denotes jointly training on all ERE tasks and evaluating on the specific one. "All Joint" in the last two rows denotes treating all ERE tasks as one task.]

As seen from the results in Table 4, SPEECH performs well on ONTOEVENT-DOC, suggesting that energy-based modeling captures the dependency among event mention pairs and event-relation labels. On MAVEN-ERE, SPEECH significantly outperforms RoBERTa on the ERE subtasks referring to subevent relations or trained on all event-relations, but fails to exceed RoBERTa on the ERE subtasks referring to temporal and causal relations. The possible reason is that MAVEN-ERE contains fewer positive event-relations than negative NA relations; since SPEECH models all these relations equivalently with the energy function, it becomes challenging to classify NA effectively. This issue is markedly alleviated when the quantity of positive event-relations decreases: SPEECH performs better on subevent relations even though MAVEN-ERE has far fewer subevent relations than temporal and causal ones, as shown in Table 1. Furthermore, even though ONTOEVENT-DOC contains fewer positive event-relations than NA overall, SPEECH still performs well. These results suggest that SPEECH excels in modeling classes with fewer samples. Note that SPEECH also performs well when trained on all event-relations ("All Joint") of the two datasets, indicating that SPEECH remains advantageous in scenarios with more classes.

Analysis On Energy-Based Modeling
We list some values of the energy loss defined in Eq (5), (8), and (10) when training respectively at the token, sentence, and document levels, as presented in Figure 3. The values of the token-level energy loss are observably larger than those at the sentence and document levels. This can be attributed to the fact that the energy loss is related to the quantity of samples, and a single document typically contains many more tokens than sentences or sentence pairs. All three levels of energy loss exhibit a gradual decrease over the course of training, indicating that SPEECH, through energy-based modeling, effectively minimizes the discrepancy between predicted results and ground truth. The energy functions for token, sentence, and document defined in Eq (4), (7), and (9) reflect that energy-based modeling in SPEECH is geared towards enhancing the compatibility between input/output pairs. The gradually decreasing energy loss demonstrates that SPEECH can model intricate event structures at the token, sentence, and document levels through energy-based optimization, thereby improving the outcomes of structured prediction.

Case Study: Energy-Based Hyperspheres
As seen in Figure 4, we visualize the event class embedding of "Attack" and 20 event mention embeddings as generated by both SPEECH and SPEECH without energy functions. We observe that for SPEECH with energy-based modeling, the instances lie near the surface of the corresponding hypersphere, while they are more scattered when not equipped with energy-based modeling, which subsequently diminishes event classification performance. This observation suggests that SPEECH derives significant benefits from modeling with energy-based hyperspheres, and the visualization results further demonstrate the effectiveness of SPEECH equipped with energy-based modeling.

Error Analysis
We further conduct error analysis by a retrospection of experimental results and datasets.
(1) One typical error relates to the unbalanced data distribution. Since each event type and event-relation contains a different number of instances, unified modeling with energy-based hyperspheres may not always be equally effective.
(2) The second error relates to overlapping event mentions among event types, meaning that the same sentence may mention multiple event types. As ONTOEVENT-DOC contains many such overlaps, this might explain its mediocre performance on ED.
(3) The third error relates to the associations among event-centric structured prediction tasks. As trigger classification is closely related to event classification, wrong predictions of tokens will also affect event classification.

Conclusion and Future Work
In this paper, we propose a novel approach named SPEECH to tackle event-centric structured prediction with energy-based hyperspheres. We represent event classes as hyperspheres, with token-, sentence-, and document-level energy respectively serving the trigger classification, event classification, and event-relation extraction tasks. We evaluate SPEECH on two event-centric structured prediction datasets, and experimental results demonstrate that SPEECH can model manifold event structures with dependency and obtain effective event representations. In the future, we intend to enhance our work by modeling more complicated structures and to extend it to other structured prediction tasks.

Limitations
Although SPEECH performs well on the event-centric structured prediction tasks in this paper, it still has some limitations. The first limitation relates to efficiency: as SPEECH involves many tasks and requires complex computation, the training process is relatively slow. The second limitation relates to robustness: as seen in the experimental analysis in § 4.5, SPEECH is not always robust to unevenly distributed data. The third limitation relates to universality: not all event-centric structured prediction tasks can simultaneously achieve their best performance under the same settings of SPEECH.