GENEVA: Benchmarking Generalizability for Event Argument Extraction with Hundreds of Event Types and Argument Roles

Recent works in Event Argument Extraction (EAE) have focused on improving model generalizability to cater to new events and domains. However, standard benchmarking datasets like ACE and ERE cover less than 40 event types and 25 entity-centric argument roles. Limited diversity and coverage hinder these datasets from adequately evaluating the generalizability of EAE models. In this paper, we first contribute by creating a large and diverse EAE ontology. This ontology is created by transforming FrameNet, a comprehensive semantic role labeling (SRL) dataset for EAE, by exploiting the similarity between these two tasks. Then, exhaustive human expert annotations are collected to build the ontology, culminating in 115 events and 220 argument roles, with a significant portion of roles not being entities. We utilize this ontology to further introduce GENEVA, a diverse generalizability benchmarking dataset comprising four test suites aimed at evaluating models' ability to handle limited data and unseen event type generalization. We benchmark six EAE models from various families. The results show that owing to non-entity argument roles, even the best-performing model can only achieve a 39% F1 score, indicating how GENEVA provides new challenges for generalization in EAE. Overall, our large and diverse EAE ontology can aid in creating more comprehensive future resources, while GENEVA is a challenging benchmarking dataset encouraging further research for improving generalizability in EAE. The code and data can be found at https://github.com/PlusLabNLP/GENEVA.


Introduction
Event Argument Extraction (EAE) aims at extracting structured information of event-specific arguments and their roles for events from a pre-defined taxonomy. EAE is a classic topic (Sundheim, 1992) and elemental for a wide range of applications like building knowledge graphs (Zhang et al., 2020), question answering (Berant et al., 2014), and others (Hogenboom et al., 2016; Yang et al., 2019b). Recent works have focused on building generalizable EAE models (Huang et al., 2018; Lyu et al., 2021; Sainz et al., 2022) and they utilize existing datasets like ACE (Doddington et al., 2004) and ERE (Song et al., 2015) for benchmarking. However, as shown in Figure 1, these datasets have limited diversity as they focus only on two abstract types, Action and Change. Furthermore, they have restricted coverage as they only comprise argument roles that are entities. The limited diversity and coverage restrict the ability of these existing datasets to robustly evaluate the generalizability of EAE models. Toward this end, we propose a new generalizability benchmarking dataset in our work.

Figure 1: Distribution of event types into various abstract event types for GENEVA, ACE, ERE, RAMS, and WikiEvents datasets. We observe that GENEVA is relatively more diverse than the other datasets.
To build a strong comprehensive benchmarking dataset, we first create a large and diverse ontology. Creating such an ontology from scratch is time-consuming and requires expert knowledge. To reduce human effort, we exploit the shared properties between semantic role labeling (SRL) and EAE (Aguilar et al., 2014) and leverage a diverse and exhaustive SRL dataset, FrameNet (Baker et al., 1998), to build the ontology. Through extensive human expert annotations, we design mappings that transform the FrameNet schema to a large and diverse EAE ontology, spanning 115 event types from five different abstract types. Our ontology is also comprehensive, comprising 220 argument roles with a significant 37% of roles as non-entities.
Utilizing this ontology, we create GENEVA, a Generalizability BENchmarking Dataset for EVent Argument Extraction. We exploit the human-curated ontology mappings to transfer FrameNet data for EAE to build GENEVA. We further perform several human validation assessments to ensure high annotation quality. GENEVA comprises four test suites to assess the models' ability to learn from limited training data and generalize to unseen event types. These test suites are distinctly different based on the training and test data creation: (1) low resource, (2) few-shot, (3) zero-shot, and (4) cross-type transfer settings.
We use these test suites to benchmark various classes of EAE models: traditional classification-based models (Wadden et al., 2019; Lin et al., 2020; Wang et al., 2022a), question-answering-based models (Du and Cardie, 2020), and generative approaches (Paolini et al., 2021; Hsu et al., 2022b). We also introduce new automated refinements in the low resource state-of-the-art model DEGREE (Hsu et al., 2022b) to generalize and scale up its manual input prompts. Experiments reveal that DEGREE performs the best and exhibits the best generalizability. However, owing to non-entity arguments in GENEVA, DEGREE achieves an F1 score of only 39% on the zero-shot suite. Under a similar setup on ACE, DEGREE achieves 53%, indicating how GENEVA poses additional challenges for generalizability benchmarking.
To summarize, we make the following contributions. We construct a diverse and comprehensive EAE ontology introducing non-entity argument roles. This ontology can be utilized further to develop more comprehensive datasets for EAE. In addition, we propose a generalizability evaluation dataset GENEVA and benchmark various recent EAE models. Finally, we show how GENEVA is a challenging dataset, thus, encouraging future research for generalization in EAE.

Related Work
Event Extraction Datasets and Ontologies: The earliest datasets in event extraction date back to MUC (Sundheim, 1992; Grishman and Sundheim, 1996). Doddington et al. (2004) introduced the standard dataset ACE while restricting the ontology to focus on entity-centric arguments. The ACE ontology was further simplified and extended to ERE (Song et al., 2015) and various TAC KBP Challenges (Ellis et al., 2014, 2015; Getman et al., 2017). These datasets cover a small and restricted set of event types and argument roles with limited diversity. Later, MAVEN (Wang et al., 2020) introduced a massive dataset spanning a wide range of event types. However, its ontology is limited to the task of Event Detection and does not contain argument roles. Recent works have introduced document-level EAE datasets like RAMS (Ebner et al., 2020), WikiEvents (Li et al., 2021), and DocEE (Tong et al., 2022); but their ontologies are also entity-centric, and their event coverage is limited to specific abstract event types (Figure 1). In our work, we focus on building a diverse and comprehensive dataset for benchmarking generalizability for sentence-level EAE.
Event Argument Extraction Models: Traditionally, EAE has been formulated as a classification problem (Nguyen et al., 2016).
Previous classification-based approaches have utilized pipelined approaches (Yang et al., 2019a; Wadden et al., 2019) as well as approaches incorporating global features for joint inference (Li et al., 2013; Yang and Mitchell, 2016; Lin et al., 2020). However, these approaches exhibit poor generalizability in the low-data setting (Liu et al., 2020; Hsu et al., 2022b). To improve generalizability, some works have explored better usage of label semantics by formulating EAE as a question-answering task (Liu et al., 2020; Li et al., 2020; Du and Cardie, 2020). Recent approaches have explored the use of natural language generative models for structured prediction to boost generalizability (Schick and Schütze, 2021a,b; Paolini et al., 2021; Li et al., 2021). Another set of works transfers knowledge from similar tasks like abstract meaning representation and semantic role labeling (Huang et al., 2018; Lyu et al., 2021; Zhang et al., 2021). DEGREE (Hsu et al., 2022b) is a recently introduced state-of-the-art generative model which has shown the best performance in the limited data regime. In our work, we benchmark the generalizability of various classes of old and new models on our dataset.

Ontology Creation
Event annotations start with ontology creation, which defines the scope of the events and their corresponding argument roles of interest. Toward this end, we aim to construct a large ontology of diverse event types with an exhaustive set of event argument roles. However, building such an ontology from scratch is challenging and tedious, requiring extensive expert supervision. To reduce human effort while maintaining high quality, we leverage the shared properties of SRL and EAE and utilize a diverse and comprehensive SRL dataset, FrameNet, to design our ontology. We first re-iterate the EAE terminologies we follow (§ 3.1) and then describe how FrameNet aids our ontology design (§ 3.2). Finally, we present our steps for creating the final ontology in § 3.3 and ontology statistics in § 3.4.

Task Definition
We follow the definition of an event as a class attribute with values such as occurrence, state, or reporting (Pustejovsky et al., 2003; Han et al., 2021). Event triggers are word phrases that best express the occurrence of an event in a sentence. Following the early works of MUC (Sundheim, 1992; Grishman and Sundheim, 1996), event arguments are defined as participants in the event which provide specific and salient information about the event. An event argument role is the semantic category of the information the event argument provides. We provide an illustration in Figure 2 describing an event about "Destroying", where the event trigger is obliterated, and the event consists of the argument roles Cause and Patient.
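To make these definitions concrete, the sketch below shows one way an annotated event mention could be represented in code; this is purely illustrative (the field names, spans, and the Patient text are our own, since Figure 2 is not reproduced here), not a released data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Argument:
    role: str               # semantic category, e.g. "Cause" or "Patient"
    span: Tuple[int, int]   # token offsets of the argument in the sentence
    text: str

@dataclass
class EventMention:
    event_type: str         # e.g. "Destroying"
    trigger: str            # word phrase expressing the event occurrence
    trigger_span: Tuple[int, int]
    arguments: List[Argument] = field(default_factory=list)

# The "Destroying" example from Figure 2; spans and the Patient text
# are hypothetical since the figure is not reproduced here.
mention = EventMention(
    event_type="Destroying",
    trigger="obliterated",
    trigger_span=(4, 5),
    arguments=[
        Argument(role="Cause", span=(0, 3), text="the subsequent explosions"),
        Argument(role="Patient", span=(5, 7), text="the building"),
    ],
)
```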
It is worth mentioning that these definitions are disparate from the ones that previous works like ACE and its inheritors, ERE and RAMS, follow. In ACE, the scope of events is restricted to the attribute of occurrence only, and event arguments are restricted to entities, wherein entities are defined as objects in the world. For example, in Figure 2, the subsequent explosions isn't an entity and will not be considered an argument as per ACE definitions. Consequently, Cause won't be part of their ontology. This exclusion of non-entities leads to incomplete extraction of event information. In our work, we follow MUC to consider a broader range of events and event arguments.

FrameNet for EAE
To overcome the challenge of constructing an event ontology from scratch, we aim to leverage FrameNet (Baker et al., 1998), which builds on the theory of frame semantics (Fillmore et al., 1976), where a frame is a holistic background that unites similar words. Each frame is composed of frame-specific semantic roles (frame elements) and is evoked by specific sets of words (lexical units).
To transfer FrameNet's schema into an EAE ontology, we map frames as events, lexical units as event triggers, and frame elements as argument roles. However, this basic mapping is inaccurate and has shortcomings since not all frames are events, and not all frame elements are argument roles per the definitions in § 3.1. We highlight these shortcomings in Figure 3, which enlists some FrameNet frames and frame elements for the Arrest frame. Based on EAE definitions, only some frames like Arrest, Travel, etc. (highlighted in yellow) can be mapped as events, and similarly, limited frame elements like Authorities, Charges, etc. (highlighted in green) are mappable as argument roles.
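As a rough illustration, the basic (uncorrected) mapping can be read directly off FrameNet, for instance via NLTK's framenet corpus reader; the minimal sketch below assumes that reader is available, the schema keys are our own naming, and the printed output is abbreviated.

```python
# pip install nltk && python -m nltk.downloader framenet_v17
from nltk.corpus import framenet as fn

def frame_to_event_schema(frame_name: str) -> dict:
    """Naive frame -> event mapping: lexical units become candidate triggers
    and frame elements become candidate argument roles. As noted above, this
    over-generates: not every frame is an event, nor every FE a role."""
    frame = fn.frame(frame_name)
    return {
        "event_type": frame.name,
        # lexical unit names look like "arrest.v"; strip the POS suffix
        "candidate_triggers": sorted(lu.rsplit(".", 1)[0] for lu in frame.lexUnit),
        "candidate_roles": sorted(frame.FE),
    }

print(frame_to_event_schema("Arrest"))
# e.g. {'event_type': 'Arrest',
#       'candidate_triggers': ['apprehend', 'arrest', ...],
#       'candidate_roles': ['Authorities', 'Charges', 'Place', 'Time', ...]}
```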

Building the EAE Ontology
To overcome the shortcomings of the basic mapping, we follow a two-step approach (Figure 4). First, we build an event ontology for accurately mapping frames to events. Then, we augment this ontology with argument roles by building an event argument ontology. We describe these steps below.
Event Ontology: In order to build the event ontology, we utilize the event mapping designed by MAVEN (Wang et al., 2020), which is an event detection dataset. They first recursively filter frames having a relation with the "Event" frame in FrameNet. Then they manually filter and merge frames based on the definitions, resulting in an event ontology comprising 168 event types mapped from 289 filtered frames.
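The recursive filtering step can be sketched as a graph traversal; the snippet below works over a generic frame-relation adjacency map rather than any particular FrameNet API, and the toy graph is purely illustrative.

```python
from collections import deque

def frames_under_event(related, root="Event"):
    """Collect all frames reachable from the "Event" frame by following
    frame relations (BFS). `related` maps a frame name to the frames it is
    directly related to; building it from FrameNet is omitted here."""
    seen, queue = {root}, deque([root])
    while queue:
        for nxt in related.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Toy relation graph; the real one comes from FrameNet's frame relations.
toy = {
    "Event": ["Intentionally_act", "Transitive_action"],
    "Intentionally_act": ["Arrest", "Travel"],
}
print(frames_under_event(toy))
# {'Event', 'Intentionally_act', 'Transitive_action', 'Arrest', 'Travel'}
```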
Event Argument Ontology: In order to augment the event ontology with argument roles, we perform an extensive human expert annotation process. The goal of this annotation process is to create an argument mapping from FrameNet to our ontology by filtering and merging frame elements. We describe this annotation process below.

Annotation Instructions: Annotators are provided with a list of frame elements along with their descriptions for each frame in the event ontology (event ontology frames can be viewed as candidate events). They are also provided with definitions for events and argument roles as discussed in § 3.1. Based on these definitions, they are asked to annotate each frame element as (a) not an argument role, (b) an argument role, or (c) merge with an existing argument role (and mention the argument role to merge with). To ensure arguments are salient, annotators are instructed to filter out frame elements that are super generic (e.g. Time, Place, Purpose) unless they are relevant to the event. Ambiguous cases are flagged and commonly reviewed at a later stage.
Additionally, annotators are asked to classify each argument role as an entity or not. This additional annotation provides flexibility for quick conversion of the ontology to ACE definitions. Figure 14 in the Appendix provides an illustration of these instructions and the annotation process.

Annotation Results: We recruit two human experts who are well-versed in the field of event extraction. We conduct three rounds of annotations and discussions to improve consistency and ensure a high inter-annotator agreement (IAA). The final IAA, measured as Cohen's Kappa (McHugh, 2012), was 0.82 for mapping frame elements and 0.94 for entity classification. A total of 3,729 frame elements from 289 frames were examined as part of the annotation process. About 63% of frame elements were filtered out, 14% were merged, and the remaining 23% constitute argument roles.
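For reference, the reported agreement can be computed as sketched below, assuming each expert's per-frame-element decisions are collected into parallel lists; the label strings are our own shorthand for the three annotation options.

```python
from sklearn.metrics import cohen_kappa_score

# One label per frame element per annotator, following the instructions
# above: not an argument role, an argument role, or merge with a role.
annotator_a = ["role", "not_role", "merge:Authorities", "role"]
annotator_b = ["role", "not_role", "merge:Authorities", "not_role"]
print(f"mapping kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")

# Entity classification agreement uses the same measure over binary labels.
entity_a = [1, 0, 1, 1]
entity_b = [1, 0, 1, 0]
print(f"entity kappa:  {cohen_kappa_score(entity_a, entity_b):.2f}")
```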

Event Ontology Calibration: The MAVEN event ontology is created independent of the argument roles. This leads to some inaccuracies in their ontology wherein two frames with disparate sets of argument roles are mapped as a single event. For example, the Surrendering_possession and Surrendering frames are merged together despite having different argument roles. Based on our human expert-curated event argument ontology, we rectify these inaccuracies (roughly 8% of the event ontology) and create our final ontology.

Ontology Statistics
We present the statistics of our full ontology in Table 1 and compare it with the existing ACE (Doddington et al., 2004) and RAMS (Ebner et al., 2020) ontologies. But as we will specify in § 4.1, we use a subset of this ontology for creating GENEVA. Hence, we also include the statistics of the GENEVA ontology in the last column of Table 1. Overall, our curated full ontology is the largest and most comprehensive as it comprises 179 event types and 362 argument roles. Defining abstract event types as the top nodes of the ontology tree created by MAVEN (Wang et al., 2020), we show that our ontology spans 5 different abstract types and is the most diverse. We organize our ontology into a hierarchy of these abstract event types in Appendix A.3. Our ontology is also dense, with an average of 4.82 argument roles per event type. Finally, we note that a significant 35% of the event argument roles in our ontology are non-entities. This demonstrates how our ontology covers a broader and more comprehensive range of argument roles than other ontologies following ACE definitions of entity-centric argument roles.

GENEVA Dataset
Previous EAE datasets for evaluating generalizability like ACE and ERE have limited event diversity and are restricted to entity-centric arguments.
To overcome these issues, we utilize our ontology to construct a new generalizability benchmarking dataset, GENEVA, comprising four specialized test suites. We describe our data creation process in § 4.1, provide data statistics in § 4.2, and discuss our test suites in § 4.3.

Creation of GENEVA
Since annotating EAE data for our large ontology is an expensive process, we leverage the annotated dataset of FrameNet to create GENEVA (Figure 4). We utilize the previously designed ontology mappings to repurpose the annotated sentences from FrameNet for EAE by mapping frames to corresponding events, lexical units to event triggers, and frame elements to corresponding arguments. Unmapped frames and frame elements (not in the ontology) are filtered out from the dataset. Since FrameNet doesn't provide annotations for all frames, some events from the full ontology are not present in our dataset (e.g. Military_Operation). Additionally, to aid better evaluation, we remove events that have fewer than 5 event mentions (e.g. Lighting). Finally, GENEVA comprises 115 event types and 220 argument roles. Some examples are provided in Figure 10 (Appendix).
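A simplified sketch of this conversion and filtering logic is shown below; the tuple layout and the mapping-table names (frame_to_event, fe_to_role) are hypothetical stand-ins for the human-curated ontology mappings described in § 3.

```python
from collections import Counter

def convert_framenet(sentences, frame_to_event, fe_to_role, min_mentions=5):
    """sentences: (text, frame, trigger_span, [(fe_name, span), ...]) tuples.
    Unmapped frames/frame elements are dropped, and events with fewer than
    `min_mentions` mentions are removed to aid evaluation."""
    data = []
    for text, frame, trigger_span, fes in sentences:
        if frame not in frame_to_event:
            continue  # frame was filtered out of the ontology
        arguments = [(fe_to_role[(frame, fe)], span)
                     for fe, span in fes if (frame, fe) in fe_to_role]
        data.append({"text": text, "event": frame_to_event[frame],
                     "trigger": trigger_span, "arguments": arguments})
    counts = Counter(d["event"] for d in data)
    return [d for d in data if counts[d["event"]] >= min_mentions]
```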
Human Validation: We ensure the high quality of our dataset by conducting two human assessments: (1) Ontology Quality Assessment: We present the human annotators with three sentences, one primary and two candidates, and ask them if the event in the primary sentence is similar to the events in either of the candidates or distinct from both (example in Appendix F). One candidate sentence is chosen from the frame merged with the primary event, while the other candidate is chosen from a similar unmerged sister frame. The annotators chose the merged frame candidates 87% of the time, demonstrating the high quality of the ontology mappings. This validation was done by three annotators over 61 triplets with 0.7 IAA measured by Fleiss' kappa (Fleiss, 1971).
(2) Annotation Comprehensiveness Assessment: Human annotators are presented with annotated samples from our dataset and are asked to report if there are any arguments in the sentence that have not been annotated. The annotation is considered comprehensive if all arguments are annotated correctly. The annotators reported that the annotations were 89% comprehensive, ensuring high dataset quality. Corrections mainly comprise ambiguous cases and incorrect role labels. This assessment was done by two experts over 100 sampled annotations with 0.93 IAA (Cohen's kappa).

Data Analysis
Overall, GENEVA is a dense, challenging, and diverse EAE dataset with good coverage. These characteristics make GENEVA better-suited than existing datasets like ACE/ERE for evaluating the generalizability of EAE models. The major statistics for GENEVA are shown in Table 2 along with its comparison with ACE and ERE. We provide more discussions about the characteristics of our dataset as follows.
Diverse: GENEVA has wide coverage with three times the number of event types and 10 times the number of argument roles relative to ACE/ERE.

Dense: We plot the distribution of arguments per sentence for ACE, ERE, and GENEVA in Figure 5. We note that GENEVA has the highest density of 4 argument mentions per sentence. Both ACE and ERE have more than 70% of sentences with up to 2 arguments. In contrast, GENEVA is denser, with almost 50% of sentences having 3 or more arguments.
Coverage: Qualitatively, we show coverage of diverse examples in Figure 9 (Appendix) and provide coverage for all events categorized by their abstraction in Figure 14 (Appendix). We observe frequent events like Statement, Arriving, and Action, while Recovering, Emergency, and Hindering are less-frequent events. In terms of diversity of data sources, our data comprises a mixture of news articles, Wall Street Journal articles, books, Wikipedia, and other miscellaneous sources.

Benchmarking Test Suites
With a focus on the generalizability evaluation of EAE models, we construct four benchmarking test suites grouped into two higher-level settings:
Limited Training Data: This setting mimics the realistic scenario where only a few annotations are available for the target events and evaluates the models' ability to learn from limited training data. We present two test suites for this setting:
• Low resource (LR): Training data is created by randomly sampling n event mentions. We record the model performance across a spectrum from extremely low resource (n = 10) to moderate resource (n = 1200) settings.
• Few-shot (FS): Training data is curated by sampling n event mentions uniformly across all events. This sampling strategy avoids biases towards high-data events and assesses the model's ability to perform well uniformly across events. We study the model performance from one-shot (n = 1) to five-shot (n = 5).
Unseen Event Data: The second setting focuses on the scenario when there is no annotation available for the target events. This helps test models' ability to generalize to unseen events and argument roles. We propose two test suites:
• Zero-shot (ZS): The training data comprises the top m events with the most data, where m varies from 1 to 10. The remaining 105 events are used for evaluation.
• Cross-type Transfer (CTT): We curate a training dataset comprising events of a single abstraction category (e.g. Scenario), while the test dataset comprises events of all other abstraction types. This test suite also assesses models' transfer learning strength.
Data statistics for these suites are presented in Appendix A.2. For each setup, we sample 5 different datasets and report the average model performance to account for the sampling variation. A sketch of the four sampling strategies is given below.
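The following is a minimal sketch of the four sampling strategies, assuming each event mention is a dict with an "event" field; it illustrates the splits rather than reproducing our exact sampling code.

```python
import random
from collections import Counter, defaultdict

def low_resource_sample(mentions, n, seed=0):
    """LR: randomly sample n event mentions for training."""
    return random.Random(seed).sample(mentions, n)

def few_shot_sample(mentions, n, seed=0):
    """FS: sample n mentions per event type, uniformly across all events."""
    rng, by_event = random.Random(seed), defaultdict(list)
    for m in mentions:
        by_event[m["event"]].append(m)
    return [m for ms in by_event.values()
            for m in rng.sample(ms, min(n, len(ms)))]

def zero_shot_split(mentions, m):
    """ZS: train on the top-m events by frequency, evaluate on the rest."""
    freq = Counter(x["event"] for x in mentions)
    top = {e for e, _ in freq.most_common(m)}
    return ([x for x in mentions if x["event"] in top],
            [x for x in mentions if x["event"] not in top])

def cross_type_split(mentions, abstraction, train_type="Scenario"):
    """CTT: train on one abstract category, test on all other categories."""
    return ([x for x in mentions if abstraction[x["event"]] == train_type],
            [x for x in mentions if abstraction[x["event"]] != train_type])
```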

Experimental Setup
We evaluate the generalizability of various EAE models on GENEVA. We describe these models in § 5.1 and the evaluation metrics in § 5.2.

Benchmarked Models
Overall, we benchmark six EAE models from various representative families, as described below. Implementation details are specified in Appendix G.
Classification-based models: These traditional works predict arguments by learning to trace the argument span using a classification objective. We experiment with three models: (1) DyGIE++ (Wadden et al., 2019), a span-based model built on BERT representations; (2) OneIE (Lin et al., 2020), a joint information extraction framework incorporating global features; and (3) Query&Extract (Wang et al., 2022a), which uses argument roles as queries in an attention-based extraction scheme.

Question-Answering models: Several works formulate event extraction as a machine reading comprehension task. We consider two such models: (4) BERT_QA (Du and Cardie, 2020), a BERT-based model leveraging label semantics using a question-answering objective. In order to scale BERT_QA to the wide range of argument roles, we generate question queries of the form "What is {arg-name}?" for each argument role {arg-name}. (5) TE (Lyu et al., 2021), a zero-shot transfer model that utilizes an existing pre-trained textual entailment model to automatically extract events. Similar to BERT_QA, we design hypothesis questions as "What is {arg-name}?" for each argument role {arg-name}.
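As a small illustration, the question queries (and TE hypotheses) for GENEVA's 220 argument roles can be templated as sketched below; lower-casing and underscore handling of role names are our own assumptions.

```python
def make_role_queries(argument_roles):
    """One "What is {arg-name}?" query per argument role, used to scale
    BERT_QA questions and TE hypotheses to the full GENEVA role set."""
    return {role: f"What is {role.lower().replace('_', ' ')}?"
            for role in argument_roles}

print(make_role_queries(["Authorities", "Charges"]))
# {'Authorities': 'What is authorities?', 'Charges': 'What is charges?'}
```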
Generation-based models: Inspired by great strides in natural language generation, recent works frame EAE as a generation task using a language-modeling objective. We consider two such models: (6) TANL (Paolini et al., 2021), a multi-task language generation model which treats EAE as a translation task. (7) DEGREE (Hsu et al., 2022b), an encoder-decoder framework that extracts event arguments using natural language input prompts.
Automating DEGREE: DEGREE requires human effort for manually creating natural language prompts and thus cannot be directly deployed for the large set of event types in GENEVA. In our work, we undertake efforts to scale up DEGREE by proposing a set of automated refinements. The first refinement automates the event type description as "The event type is {event-type}", where {event-type} is the input event type. The second refinement automates the event template generation by splitting each argument into a separate self-referencing mini-template "The {arg-name} is some {arg-name}", where {arg-name} is the argument role. The final event-agnostic template is a simple concatenation of these mini-templates. We provide an illustration and ablation of these automated refinements for DEGREE in Appendix B.

Evaluation Metrics
Following the traditional evaluation for EAE tasks, we report the micro F1 scores for argument classification. To encourage better generalization across a wide range of events, we also use the macro F1 score, which reports the average of F1 scores across event types. For the limited data test suites, we record a model performance curve, wherein we plot the F1 scores against the number of training instances.
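A simplified sketch of these metrics is given below, scoring predicted arguments as (sentence id, event type, role, span) tuples; exact-match criteria follow standard EAE argument classification evaluation.

```python
from collections import defaultdict

def f1(gold, pred):
    """F1 over two sets of (sent_id, event_type, role, span) tuples."""
    tp = len(gold & pred)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def micro_macro_f1(gold, pred):
    """Micro F1 pools all arguments; macro F1 averages per-event-type F1,
    so rare event types count as much as frequent ones."""
    micro = f1(gold, pred)
    g, p = defaultdict(set), defaultdict(set)
    for x in gold:
        g[x[1]].add(x)
    for x in pred:
        p[x[1]].add(x)
    events = set(g)  # average over event types that have gold arguments
    macro = sum(f1(g[e], p[e]) for e in events) / len(events) if events else 0.0
    return micro, macro
```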

Results
Following § 4.3, we organize the main experimental results into the limited training data and unseen event data settings. When trained on the complete training data, we observe that the OneIE and Query&Extract models achieve poor micro F1 scores of just 30.03 and 40.41, while all other models achieve F1 scores above 55. This can be attributed to the inability of their model designs to effectively handle overlapping arguments. Due to their inferior performance, we do not include OneIE and Query&Extract in the benchmarking results. We present the full results in Appendix H.

Limited Training Data
The limited training data setting comprises the low resource and few-shot test suites. We present the model benchmarking results in terms of macro and micro F1 scores for the low resource test suite in Figure 6 and for the few-shot test suite in Figure 7, respectively. We observe that DEGREE outperforms all other models on both test suites and shows superior generalizability. In general, generation-based models show better generalization, whereas traditional classification-based approaches generalize poorly. This underlines the importance of using label semantics for better generalizability. We also detect a stark drop from micro to macro F1 scores for TANL and DyGIE++ in the low resource test suite. This indicates that these models are more easily biased toward high-data events and do not generalize well uniformly across all events.

Unseen Event Data
This data setting includes the zero-shot and the cross-type transfer test suites. We collate the results in terms of micro F1 scores for both test suites in Table 3. Models like DyGIE++ and TANL cannot support unseen events or argument roles, and thus we do not include them in the experiments for these test suites. TE cannot be trained on additional EAE data, and hence we only report the pure zero-shot performance of this model. From Table 3, we observe that DEGREE achieves the best scores across both test suites, outperforming BERT_QA by a significant margin of almost 13-15% F1 points. Although TE is not directly comparable as it is a pure zero-shot model (without training on any data), its performance is relatively low in both settings. Thus, DEGREE shows superior transferability to unseen event types and argument roles.

Figure 8: Illustration of the template-filling prompt for GPT3.5-turbo, consisting of in-context examples followed by the test example. Test example: Passage: "In the case of North Korea, determining the status of its nuclear weapons program is especially difficult." Event: confronting problem. Trigger: The event trigger word is difficult. Query: "The activity is some activity. The experiencer is some experiencer."

Analysis
In this section, we provide analyses highlighting the various new challenges introduced by GENEVA. We discuss the performance of large language models, the introduction of non-entity argument roles, and model performance when Time and Place argument roles are included.

Large Language Model Performance
Recently, there has been an advent of generative large language models (LLMs). We evaluate GPT3.5-turbo on GENEVA by prompting it to fill the event template: the model replaces placeholders with arguments if present, else copies the original template. An illustration is provided in Figure 8. Despite the strong generation capability, GPT3.5-turbo achieves a mere 22.73 F1 score, while DEGREE achieves 24.06 and 39.43 F1 scores in the ZS-1 and ZS-10 test suites respectively. Although these scores aren't directly comparable, this shows how GENEVA is quite challenging for LLMs in the zero-shot/few-shot setting.
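A sketch of how such a template-filling prompt could be assembled is shown below; the wording is reconstructed from Figure 8 and is not the released prompt.

```python
def build_llm_prompt(icl_examples, passage, event_type, trigger, template):
    """Assemble the prompt illustrated in Figure 8: in-context examples
    followed by the test example. The model is expected to rewrite the query
    template, replacing each "some {role}" placeholder with the extracted
    argument, or copying the placeholder unchanged if the role is absent."""
    test_block = (f"Passage: {passage}\n"
                  f"Event: {event_type}. Trigger: The event trigger word is {trigger}\n"
                  f"Query: {template}")
    return "\n\n".join(icl_examples + [test_block])

prompt = build_llm_prompt(
    icl_examples=[],  # zero-shot; in-context examples would go here
    passage="In the case of North Korea, determining the status of its "
            "nuclear weapons program is especially difficult.",
    event_type="confronting problem",
    trigger="difficult",
    template="The activity is some activity. The experiencer is some experiencer.",
)
```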

New Challenge of Non-entity Roles
In Table 4, we show the model performances of BERT_QA and DEGREE on GENEVA and ACE under similar benchmarking setups. We note how both models exhibit relatively poor performance on GENEVA (especially on the zero-shot test suite). To investigate this phenomenon, we break down the model performance based on entity and non-entity argument roles and show this analysis in Table 5. This ablation reveals a stark drop of 10-14% F1 points across all models when predicting non-entity arguments relative to entity-based arguments. This trend is observed consistently across all different test suites as well. We can attribute this difference in model performance to non-entity arguments being more abstract and having longer spans, in turn being more challenging to predict accurately. Thus, owing to a significant 37% non-entity argument roles, GENEVA poses a new and interesting challenge for generalization in EAE.

GENEVA with Time and Place
In the original GENEVA dataset, we filtered super generic argument roles, but some of these roles like Time and Place are key for several downstream tasks. We therefore include Time and Place arguments in GENEVA and provide results of the models on this full dataset in Table 6, compared against the original GENEVA results in the same setting. We release this data for future development.

Limitations
We would like to highlight a few limitations of our work. First, we would like to point out that GENEVA is designed to evaluate the generalizability of EAE models. Although the dataset contains event type and event trigger annotations, it can only be viewed as a partially-annotated dataset if end-to-end event extraction is considered. Second, GENEVA is derived from an existing dataset FrameNet. Despite human validation efforts, there is no guarantee that all possible events in the sentence are exhaustively annotated.

Ethical Consideration
We would like to list a few ethical considerations for our work. First, GENEVA is derived from FrameNet, which comprises annotated sentences from various news articles. Many of these news articles cover various political issues which might be biased and sensitive to specific demographic groups. We encourage careful consideration when utilizing this data for training models for real-world applications.

A Additional Analysis of GENEVA

A.1 Event Type Distribution for GENEVA
We show the distribution of event mentions per event type for GENEVA in Figure 9. We observe a highly skewed distribution, with 44 event types having fewer than 25 event mentions. Furthermore, 93 event types have fewer than 100 event mentions. We believe that this resembles a more practical scenario where there is a wide range of events with limited event mentions while a few events have a large number of mentions.

A.2 Data Statistics for different benchmarking test suites
We present the data statistics for the various test suites in Table 7. For the training set of the low resource and few-shot test suites (indicated by * in Table 7), we sample a smaller training set (as discussed in § 4.3). For the zero-shot setup, the top 10 event types contribute a large pool of 1,889 sentences; for these test suites, fixed sets of 450 and 115 sentences are sampled from this larger pool for the training and development sets (indicated by + in Table 7).

A.3 Event Ontology Organization
The broad set of event types in GENEVA can be organized into a hierarchical structure of abstract event types. Adhering to the hierarchical tree structure introduced in MAVEN, we show the corresponding organization for event types in GENEVA in Figure 15. The organization mainly assumes five abstract event categories: Action, Change, Scenario, Sentiment, and Possession. The most populous abstract type is Action with a total of 53 events, while the Scenario abstraction has the fewest with 9 events. We also study the distribution of event mentions per event type in Figure 15, where the bar heights indicate the number of event mentions for the corresponding event type (heights in log scale). We observe that the most populous event is Statement, which falls under the Action abstraction. On the other hand, the least populous event is Recovering, which belongs to the Change abstraction.
GENEVA comprises a diverse set of 115 event types, and it naturally shares some of these with the ACE dataset. In Figure 15, we show the extent of the overlap of the mapped ACE events in the GENEVA event schema (text labels colored in red). We only show the events that could be directly mapped from ACE to GENEVA; this overlap is not exhaustively complete, and the mapping can be many-to-one and one-to-many in nature. We can observe that although there is some overlap between the datasets, GENEVA brings in a vast pool of new event types. Furthermore, most of the overlap is for the Possession and Action abstraction types.

A.4 Dataset Examples
We provide some examples of annotated sentences from the GENEVA dataset in Figure 10. We indicate the abstract event type in braces and cover an example from each abstraction.

B Automated Refinements for DEGREE

B.1 DEGREE
DEGREE is an encoder-decoder based generative model which utilizes natural language templates as part of its input prompts. The input prompt comprises three components: (1) Event Type Description, which provides a definition of the given event type; (2) Query Trigger, which indicates the trigger word for the event mention; and (3) EAE Template, which is a natural sentence combining the different argument roles of the event. We illustrate DEGREE along with an example of its input prompt design in Figure 11.

Figure 10: Illustration of example annotations from the GENEVA dataset for various different abstract types, e.g. the sentence "The US administration calls for a total embargo on nuclear technology to Iran, and urges other nuclear suppliers, including the PRC, to take similar action." annotated as a Convincing (Sentiment) event triggered by "urges", with Speaker: "The US administration", Addressee: "other nuclear suppliers, including the PRC", and Content: "to take similar action".

Figure 11: Model architecture of DEGREE (top half) and an illustration of a manually created prompt for the event type Employment (bottom half). The passage (e.g. "Louise has a job of an engineer at Google.") and the prompt are fed to the encoder (separated by [SEP]), and the decoder generates the output text. The manual prompt comprises the Event Type Description ("The event is related to employment, jobs or paid work."), the Query Trigger ("The event trigger word is job."), and the EAE Template ("Some person works at some organization as some position."); the corresponding output text is the filled template "Louise works at Google as engineer."

Figure 12: An illustration of an automatically generated prompt by DEGREE for the event type Employment. The Event Type Description is "The event type is employment.", the Query Trigger is "The event trigger word is job.", and the EAE Template is "The employer is some employer. The employee is some employee. The position is some position."; the corresponding output text is "The employer is Google. The employee is Louise. The position is engineer."
Despite the superior performance of DEGREE in the low-data setting, it cannot be directly deployed on GENEVA. This is because DEGREE requires manual human effort to create input prompts for each event type and argument role, and this effort cannot be scaled to the wide set of events in GENEVA. Thus, there is a need to automate the manual human effort to scale up DEGREE.

B.2 Automated Refinements
DEGREE requires human effort for two input prompt components -(1) Event Type Description and (2) EAE Template. We describe the automated refinements in DEGREE for these components below.
Automating Event Type Description: Event type description is a natural language sentence describing the event type. In order to automate this component, we propose a simple heuristic that generates a natural language sentence mentioning the event type: "The event type is {event-type}.", as illustrated in Figure 12.
Automating EAE Template EAE template generation in DEGREE can be split into two subtasks, which we discuss in detail below.
Argument Role Mapping: This subtask maps each argument role to a natural language placeholder phrase based on the characteristics of the argument role. For example, the argument role Employer is mapped to "some organization" in Figure 11. To automate this mapping process, we propose a simple refinement of self-mapping, which maps each argument role to a self-referencing placeholder phrase "some {arg-name}", where {arg-name} is the argument role itself. For example, the argument role Employer would be mapped to "some employer". We illustrate an example of this heuristic in Figure 12.
Template Generation: The second subtask requires generating a natural sentence (or sentences) using the argument role-mapped placeholder phrases (as shown in Figure 11). To automate this subtask, we create an event-agnostic template composed of argument role-specific sentences. For each argument role in the event, we generate a sentence of the form "The {arg-name} is {arg-map}.", where {arg-name} and {arg-map} are the argument role and its mapped placeholder phrase respectively. For example, the sentence for argument role Employer with self-mapping would be "The employer is some employer.". The final event-agnostic template is a simple concatenation of all the argument role sentences. We provide an illustration of the event-agnostic template in Figure 12.
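Putting both refinements together, a minimal sketch of the automated prompt construction is shown below; lower-casing of event and role names is our own assumption, and the example output reproduces Figure 12.

```python
def auto_event_description(event_type):
    """First refinement: templated event type description."""
    return f"The event type is {event_type.lower().replace('_', ' ')}."

def auto_eae_template(argument_roles):
    """Second refinement: concatenate one self-referencing mini-template
    per argument role into an event-agnostic EAE template."""
    roles = [r.lower().replace("_", " ") for r in argument_roles]
    return " ".join(f"The {r} is some {r}." for r in roles)

print(auto_event_description("Employment"))
# The event type is employment.
print(auto_eae_template(["Employer", "Employee", "Position"]))
# The employer is some employer. The employee is some employee. The position is some position.
```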

B.3 Ablation Study
In our work, we introduce automated refinements for scaling DEGREE to GENEVA. We provide an ablation study for these automated refinements (Automated DEGREE) on the ACE dataset in Table 8. We observe that the automated DEGREE is almost at par with the original DEGREE, with a minor difference of only 0.8% F1 points.

C Impact of Pre-training
In this section, we explore the impact of pre-training models on the generalizability evaluation. We consider DEGREE and BERT_QA, pre-train them on the ACE dataset, and show the model performance on the low resource test suite in Figure 13. We observe that pre-training helps model performance by 5-10% F1 points, naturally most in the low-data regime. However, the gains diminish and become almost negligible as the number of training event mentions increases. In terms of zero-shot performance of the pre-trained models, DEGREE achieves a micro F1 score of 12.83% and BERT_QA achieves 6.82%. Poor zero-shot performance and diminishing performance gains indicate that GENEVA is distributionally distinct from ACE, which makes it challenging to achieve good model performance on GENEVA merely via transfer learning.

D Case Study: Is ACE diverse enough?
We conduct a case study to analyze how the limited diversity of ACE can affect the generalizability of EAE models. We compare the performance of two models with different initializations, (1) DEGREE pre-trained on the ACE dataset and (2) DEGREE with no pre-training, on the zero-shot benchmarking setup with 10 training event types. We dissect the F1 scores into different abstract event types and show the results in Table 9.
We observe that pre-training yields major improvements for the abstractions of Action, Possession, and Change -which are well-represented in ACE. On the other hand, we observe lower performance improvement for the abstractions of Sentiment and Scenario -which are not represented in ACE. This trend clearly shows that the lack of diversity in ACE restricts the models' ability to generalize well to out-of-domain event types. We also highlight the significance of GENEVA as its diverse evaluation setup helps analyze these trends.

E Human expert annotation for EAE ontology creation

Figure 14 presents the annotation instructions and example input data for the human expert annotation process used for event argument ontology creation. The instructions read as follows:

ANNOTATION INSTRUCTIONS. Event: An event includes a class attribute with values such as occurrence, state, or reporting. Event Arguments: Event participants that provide some event-centric information. We will try to make a sentence with the frame element describing the event to consider it as a participant. If two frame elements are similar, we'll merge them and mark the more specific one as the event argument. We will remove frame elements which are super generic / present in most frames (e.g. Time, Place, Manner, Degree, Purpose, Explanation, Means); but we might want to include them if they are salient to the current event. Entities: Some object in the world. We mark a frame element as an entity if it's highly probable to be an entity (and/or a short noun phrase). We mark only those frame elements as entities which have been marked as event arguments.

Example frame elements shown to annotators include Depictive ("This FE describes a participant of the institutionalization as being in some state during the action."), Place ("The location in which the Facility is situated."), and Patient ("The person who is committed to a facility with a view towards helping them mentally or physically.").

F Human validation for GENEVA
We provide an example of the annotation setup used for the Ontology Quality Assessment as part of the GENEVA validation process in Table 10. Similarly, we provide the annotation setup and some examples for the Annotation Comprehensiveness Assessment in Table 11.

G Implementation Details
In this section, we provide details about the experimental setups and training details for various EAE models we mentioned in our work.

G.1 DEGREE
We closely follow the training setup of DEGREE for training the DEGREE models. We run experiments for DEGREE on an NVIDIA GeForce RTX 2080 Ti machine with support for 8 GPUs. We present the complete range of hyperparameter details in Table 12. We use an early stopping criterion to stop model training.

G.2 BERT_QA
We mostly follow the original experimental setup and hyperparameters as described in Du and Cardie (2020). We use BERT-LARGE instead of the original BERT-BASE to ensure that the PLMs are of comparable sizes for DEGREE and BERT_QA. We run experiments for this model on an NVIDIA A100-SXM4-40GB machine with support for 4 GPUs. A more comprehensive list of hyperparameters is provided in Table 13.

G.3 TANL
We report the hyperparameter settings for the TANL experiments in Table 14. We make optimization changes in the provided source code of TANL to include multiple triggers in a single sentence. Experiments for TANL were run on an NVIDIA GeForce RTX 2080 Ti machine with support for 8 GPUs.

G.4 DyGIE++
We report the hyperparameter settings for the DyGIE++ experiments in Table 15. Experiments for DyGIE++ were run on an NVIDIA GeForce RTX 2080 Ti machine with support for 4 GPUs.

G.5 OneIE
We report the hyperparameter settings for the OneIE experiments in Table 16. Experiments for OneIE were run on an NVIDIA GeForce RTX 2080 Ti machine with support for 4 GPUs.

G.6 Query&Extract
We report the hyperparameter settings for the Query&Extract experiments in Table 17. Experiments for Query&Extract were run on an NVIDIA GeForce RTX 2080 Ti machine with support for 4 GPUs.

G.7 TE
We use the original SRL engine and model provided in the repository for running the TE model. Since the model requires no training, we do not change any hyperparameters.

H Complete Results
In this section, we present the exhaustive set of results for each of the runs for the different benchmarking suites. Results for the low resource and few-shot settings are shown in Figures 16 and 17, respectively. Figure 18 shows the complete set of results of the 5 different runs for all models on the zero-shot (ZS) and cross-type transfer (CTT) test suites, where Micro and Macro denote the micro and macro F1 scores, and ZS-X denotes zero-shot with X training events.