DICE: Data-Efficient Clinical Event Extraction with Generative Models

Event extraction for the clinical domain is an under-explored research area. The lack of training data along with the high volume of domain-specific terminologies with vague entity boundaries makes the task especially challenging. In this paper, we introduce DICE, a robust and data-efficient generative model for clinical event extraction. DICE frames event extraction as a conditional generation problem and introduces a contrastive learning objective to accurately decide the boundaries of biomedical mentions. DICE also trains an auxiliary mention identification task jointly with event extraction tasks to better identify entity mention boundaries, and further introduces special markers to incorporate identified entity mentions as trigger and argument candidates for their respective tasks. To benchmark clinical event extraction, we compose MACCROBAT-EE, the first clinical event extraction dataset with argument annotation, based on an existing clinical information extraction dataset MACCROBAT. Our experiments demonstrate state-of-the-art performances of DICE for clinical and news domain event extraction, especially under low data settings.


Introduction
Event extraction (EE) is an information extraction task that aims to identify event triggers and arguments from a given text sequence (Ahn 2006). The EE task consists of two subtasks: 1) event detection, in which the model extracts trigger text and predicts the event type; and 2) event argument extraction, in which the model extracts argument text and predicts the role of each argument given an event trigger and associated event type.
Clinical EE aims to extract clinical events, which are occurrences at specific points in time during a clinical process, and textual entities that modify or describe properties of these events (Caufield et al. 2019). Fig. 1 shows an example sentence with two clinical events. There is a SIGN SYMPTOM event triggered by "nodule" where "0.8x1.5cm" functions as an argument of role type AREA.
The detail and volume of clinical information are often beyond the abilities of human readers, which necessitates the development of EE methods that are applicable to the clinical domain (Wang et al. 2018). Such techniques are highly applicable to downstream tasks such as the construction of patient histories to inform automated clinical decision support (Yadav et al. 2013), adverse medical event detection (Rochefort, Buckeridge, and Forster 2015), drug discovery (Wang et al. 2009) and clinical workflow optimization (Hsu et al. 2016).
* Equal contribution.
[Figure: example sentence "A 45-year-old lady sought dermatology consultation for severely tender erythematous vesicles and bullae over back, chest and arms."]
Event extraction in the clinical domain faces several nontrivial challenges compared to EE in the general domain. First, domain knowledge such as clinical term definitions and semantic meanings is mostly preserved in natural language; however, traditional information extraction models cannot easily leverage such information. Second, trigger and argument spans are mostly domain-specific terms that are difficult to extract precisely. Specifically, they are on average more than 50% longer than those in the general domain, as shown in Table 1, and have vague boundaries. Locating and bounding clinical mentions requires special model design and domain understanding, as adjectives and descriptive words can also be part of event triggers or arguments. For example, given the text "massive heart attack", we should identify "heart attack" (no more and no fewer words) as the trigger because it refers to a specific condition, and "massive" as an argument of the role type SEVERITY. However, when we consider "right common carotid artery", we find that although "right" and "common" describe the "carotid artery", the entire text span refers to a biological structure; thus it is an argument of the role type BIOLOGICAL STRUCTURE. Third, the arguments associated with events are much more diverse than in the general domain. There are on average 2.6× as many argument roles for clinical events as for general-domain events, which calls for a model that handles long-tail argument roles better.
Finally, as clinical data is expensive to collect due to the high cost of expert annotations and patient privacy concerns, there is no existing clinical EE dataset with argument annotation.
Table 1:
Metric                   ACE2005    ERE       MACCROBAT-EE
Unique event types       33         38        13
Unique argument roles    22         21        22
Documents #              599        459       200
Sentences #              20,862     17,114    4,539
Entities #               54,820     46,185    23,898
Triggers #               5,348      7,287     13,128
Arguments #              8,102      10,…      …
In this paper, we present DICE, a Data-effIcient generative model for Clinical Event extraction. It 1) formulates EE as a sequence-to-sequence text generation task and uses event type definitions and argument role descriptions as prompts to incorporate domain knowledge and boost low-resource performance; 2) specializes in clinical mention identification by training an auxiliary mention identification module to learn implicit mention properties and by adding mention markers to hint at mention boundaries explicitly; 3) performs an independent query for each argument role to better handle long-tail argument roles. These techniques are inspired by the clinical domain, but are applicable to general domains as well. We also introduce MACCROBAT-EE, the first clinical event extraction dataset with argument information, which is derived from clinical experts' annotation of PubMed clinical case reports.
We benchmark DICE on MACCROBAT-EE against several recent event extraction models. Experiments show that DICE achieves the state-of-the-art performance on clinical event extraction, and we observe a larger performance gain under low-resource settings. We also perform ablation studies to demonstrate the effectiveness of each input segment and the design choice of DICE. Our contributions are threefold: 1) We develop a data-efficient generative model for clinical event extraction that is scalable to many argument roles and specialized in biomedical mention identification; 2) We present the first clinical event extraction dataset with argument annotations and report the performances of recent EE methods on it; 3) Our model achieves state-of-the-art performance on clinical event extraction and achieves higher performance gain under low-resource settings.

Event Extraction in the Clinical Domain
Due to high annotation costs and privacy concerns, dataset availability is a primary bottleneck for clinical EE. We propose a repurposing of an existing expert-annotated dataset, MACCROBAT, to compose a clinical EE benchmark, MACCROBAT-EE. We first introduce the problem formulation, then provide details about the data composition process and data statistics.

Task Formulation
We divide the event extraction task into Event Trigger Detection (ED) and Event Argument Extraction (EAE) and formulate a pipeline where the triggers extracted during the ED task are used as part of the inputs for the EAE task. The ED subtask takes a sentence (PASSAGE) as input to extract event triggers and predict event types. The trigger must be a sub-sequence of the input passage and the event type must be one of the n_event_type pre-defined types. The EAE subtask takes a tuple of (PASSAGE, EVENT TRIGGER, EVENT TYPE) as input, extracts arguments from PASSAGE and predicts the role of each argument. Each event type holds a pool of n_event_type_arg_role argument roles as defined in the event ontology.
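The ED-then-EAE interface described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Event` and `Argument` classes, the `run_pipeline` function, and the stub models in the usage note are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Event:
    trigger: str      # trigger text taken from the passage
    event_type: str   # one of the pre-defined event types


@dataclass(frozen=True)
class Argument:
    text: str         # argument text taken from the passage
    role: str         # one of the roles allowed for the event type


# Hypothetical signatures standing in for trained ED and EAE models.
EDModel = Callable[[str], List[Event]]
EAEModel = Callable[[str, str, str], List[Argument]]


def run_pipeline(passage: str, ed: EDModel, eae: EAEModel) -> List[Tuple[Event, List[Argument]]]:
    """Run ED first, then feed each (passage, trigger, type) tuple to EAE."""
    results = []
    for event in ed(passage):
        arguments = eae(passage, event.trigger, event.event_type)
        results.append((event, arguments))
    return results
```

With stub models that return the "nodule" / "0.8x1.5cm" example from the introduction, `run_pipeline` yields one event paired with one AREA argument; the passage string used for such a check is invented for illustration.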

The MACCROBAT-EE Dataset
We derive our dataset from the MACCROBAT dataset (Caufield et al. 2019), which consists of 200 pairs of clinical case reports from PubMed and accompanying annotation files with partial event annotation. To our knowledge, this is the only openly accessible collection of clinical case reports annotated for entities and relations by human experts. Following existing sentence-level EE works (Lin et al. 2020a), we construct an event extraction dataset with full event structure, MACCROBAT-EE, which contains annotated span information for entities, event triggers, event types, event arguments and argument roles for each sentence. We include all tagged mentions in MACCROBAT as entities, and further specify that mentions tagged as events and their respective types are included as event triggers and event types.
To infer event arguments and their roles, which are not provided in MACCROBAT, we consider non-event entities that hold a MODIFY relation with event triggers as arguments, and we use the assigned entity types as argument roles. We infer arguments via the MODIFY relation because its definition matches well with the argument definition of further characterizing the properties of an event, as we show in § A.2. Each entity type in MACCROBAT defines a certain type of fine-grained physical or procedural property, which matches the argument role definition of being a type of participant or attribute of an event. We traverse all (event type, argument role) pairs to obtain the argument roles possible for each event type and thereby create an event ontology, as shown in § A.3. The definition sentences of each event type and argument role, written by clinical experts, are provided.
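The argument inference rule above can be sketched in a few lines. The in-memory format is an assumption made for illustration (a dict of mentions keyed by ID and a list of `(relation_type, source_id, target_id)` triples, with the modifier as the source); it is not the MACCROBAT file format, and the type name "Disease disorder" in the usage note is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def infer_arguments(mentions: Dict[str, dict],
                    relations: List[Tuple[str, str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each event trigger ID to its (argument text, argument role) pairs."""
    event_args: Dict[str, List[Tuple[str, str]]] = {}
    for rel_type, source_id, target_id in relations:
        if rel_type != "MODIFY":
            continue  # only MODIFY relations are used to infer arguments
        modifier, head = mentions[source_id], mentions[target_id]
        # Keep non-event entities that modify an event trigger.
        if head["is_event"] and not modifier["is_event"]:
            # The entity type of the modifier becomes the argument role.
            event_args.setdefault(target_id, []).append((modifier["text"], modifier["type"]))
    return event_args


def build_ontology(mentions: Dict[str, dict],
                   event_args: Dict[str, List[Tuple[str, str]]]) -> Dict[str, Set[str]]:
    """Collect the unique argument roles observed for each event type."""
    ontology: Dict[str, Set[str]] = defaultdict(set)
    for trigger_id, args in event_args.items():
        for _, role in args:
            ontology[mentions[trigger_id]["type"]].add(role)
    return dict(ontology)
```

For the "massive heart attack" example, a MODIFY relation from "massive" (type Severity) to the event "heart attack" yields one argument with role Severity, and the ontology records Severity as a possible role for that event type.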

Data Statistics
In […] entities used as event arguments in MACCROBAT-EE. This indicates that MACCROBAT-EE fills a different niche than ACE2005 and ERE-EN and will provide a valuable benchmark for event extraction in a clinical setting with high mention density, as well as allowing future work to adapt clinical case report domain-specific features.
Figure 2: Model design of DICE. We use the mentions identified by the MI module to add mention markers used by the ED/EAE modules. The ED module extracts event triggers and types, and the EAE module extracts arguments and roles. They are trained jointly with the MI module for mention-enhanced event extraction.

The DICE Event Extraction Model
We follow existing works (Hsu et al. 2022; Ma et al. 2022) by formulating the Event Extraction task as a conditional generation task. We first introduce each model component in § 3.1 and techniques to leverage mention information in § 3.2. Finally, we introduce how to combine the components during training and inference time in § 3.3. Fig. 2 shows the overall model design.

Seq2seq Extraction Modules
There are three components in our design: 1) mention identification (MI), which identifies candidates for event triggers or arguments; 2) ED, which extracts event triggers and predicts event types; and 3) EAE, which extracts arguments and predicts argument roles. We integrate these components to form the MI-ED-EAE pipeline (details in § 3.3). Each component is formulated as a sequence-to-sequence conditional generation task using a pre-trained text generation model, T5-large (Raffel et al. 2020), as the backbone. The input is a natural language sequence consisting of:
• Passage: the original input passage
• Prompt: task-specific query information or instruction to guide the model to produce the expected output
The model learns to generate natural language text that can be parsed into the final output (e.g., event trigger and argument mentions). With this formulation, the generative model learns the input and output sequence format, as well as the correlations and underlying patterns needed to perform information extraction. We design different prompts for each of the three tasks so that extraction is conditioned on the task-specific prompt. This formulation enables us to 1) handle nested and discontinuous entities (e.g. the argument "origin of right common carotid artery" in Fig. 2 is split into two spans in the original passage) that traditional sequence tagging models cannot handle; 2) incorporate domain knowledge such as event type and argument role definitions via natural language in the input prompt; 3) scale to a large number of event types and argument roles and achieve better long-tail performance by performing an independent query for each type/role.
Mention Identification (MI). The MI module extracts all candidate trigger or argument mentions from the input passage. The input is just the passage and the output includes all mentions that are trigger or argument candidates in the input passage separated by a special token "[SEP]" following the prefix "Mentions are". If there are no mentions, a placeholder is generated (i.e. "Mentions are <mention>"). We conduct a global-then-local searching strategy for mentions by extracting mentions from the entire passage as well as from sentence segments selected by a sliding window, which enables shorter outputs and higher mention coverage. We enforce the condition that the order of output mentions match the order of their appearance in the input passage. This consistency helps the generative model to learn its expected behavior as well as allows for prior mention predictions to inform subsequent mention predictions.
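The MI target format described above can be illustrated with a small builder and parser. This is a sketch under assumptions: the exact whitespace around "[SEP]" is not specified in the text, mentions are assumed to be contiguous substrings when ordering them, and all function names are hypothetical.

```python
from typing import List

SEP = "[SEP]"
PREFIX = "Mentions are"
PLACEHOLDER = "<mention>"


def order_by_appearance(passage: str, mentions: List[str]) -> List[str]:
    """Sort mentions by their first position in the passage (contiguous spans assumed)."""
    return sorted(mentions, key=passage.find)


def build_mi_target(passage: str, mentions: List[str]) -> str:
    """Build the target string: prefix, then mentions joined by the separator."""
    if not mentions:
        return f"{PREFIX} {PLACEHOLDER}"
    ordered = order_by_appearance(passage, mentions)
    return f"{PREFIX} " + f" {SEP} ".join(ordered)


def parse_mi_output(output: str) -> List[str]:
    """Recover the mention list from a generated string; malformed output yields []."""
    if not output.startswith(PREFIX):
        return []
    body = output[len(PREFIX):].strip()
    if body in ("", PLACEHOLDER):
        return []
    return [m.strip() for m in body.split(SEP) if m.strip()]
```

Parsing the built target returns the mentions in passage order, and the placeholder string parses to an empty list.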
Event Detection (ED). The ED module extracts event triggers from the passage. For a given passage, we construct n_event_type queries. For each query, we input the concatenation of the passage and the following prompt segments:
• Event type name: the name of the query event type, such as "Sign symptom", with the prefix "Event type is"
• Event type description: a brief definition of the event type, such as "Any symptom or clinical finding"
The output of the ED task is the concatenation of the event trigger texts predicted for the queried event type, separated by a special token "[SEP]" and following the prefix "Event trigger is". When there are no valid triggers for the queried event type (such queries are considered negative samples), a special placeholder token is generated (i.e. "Event trigger is <trigger>"). The balance between positive and negative samples is a hyperparameter that may be tuned for a better precision-recall trade-off. To obtain the final predictions for a given passage, we decode the output sequence and obtain a list of (EVENT TYPE, TRIGGER) pairs. We then evaluate the correctness of the predicted triggers and their respective types. We consider an event trigger to be correctly identified if it is extracted by any query and to be correctly classified if the queried event type matches the ground-truth event type. Compared with existing works that perform n_event_type-to-1 classification, our design retains the flexibility to capture triggers that represent multiple events of different types.
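A per-type ED query and its decoding might look like the sketch below. The segment order and punctuation in the prompt are assumptions, since the text only names the segments; `build_ed_queries` and `decode_ed_output` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

SEP = "[SEP]"
TRIGGER_PREFIX = "Event trigger is"
TRIGGER_PLACEHOLDER = "<trigger>"


def build_ed_queries(passage: str, event_types: Dict[str, str]) -> List[Tuple[str, str]]:
    """Build one (event type, input sequence) pair per event type."""
    queries = []
    for name, description in event_types.items():
        # Assumed layout: passage, type-name phrase, then type description.
        queries.append((name, f"{passage} Event type is {name}. {description}"))
    return queries


def decode_ed_output(event_type: str, output: str) -> List[Tuple[str, str]]:
    """Turn a generated string into (event type, trigger) pairs."""
    if not output.startswith(TRIGGER_PREFIX):
        return []  # malformed generation is treated as no prediction
    body = output[len(TRIGGER_PREFIX):].strip()
    if body in ("", TRIGGER_PLACEHOLDER):
        return []  # negative sample: no trigger for this event type
    return [(event_type, t.strip()) for t in body.split(SEP) if t.strip()]
```

Because each event type is queried independently, the same trigger text can be returned for more than one type, which is the flexibility the paragraph above refers to.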
Event Argument Extraction (EAE). The EAE module extracts event arguments from queries consisting of the input passage, a given role type, and a pair consisting of an event trigger and its event type. We perform n_event_type_arg_role queries to extract arguments corresponding to each potential argument role, where n_event_type_arg_role is the number of unique argument roles for a certain event type. The input sequence contains the passage, event type name, and event type description segments as defined in the ED module, in addition to:
• Trigger marker: the trigger text is wrapped by special tokens "<trigger>" and "</trigger>" to explicitly provide the position information of the trigger
• Event trigger phrase: a short sentence indicating the event trigger, such as "Event trigger is thrombus"
• Argument role name: a short sentence describing the name of the queried argument role with a prefix, such as "Argument role is Biological structure"
• Argument role description: a brief definition of the role
The expected output of the decoder begins with a reiteration of the queried argument role (e.g. "Biological structure is") followed by the concatenated predicted argument texts, or a placeholder ("<argument>") if there are no valid predictions. We then decode and evaluate the generated output according to the same rules outlined for ED.
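The EAE input segments listed above can be assembled as in the following sketch. The segment order, the punctuation, and the use of "[SEP]" between multiple arguments are assumptions; the helper names are hypothetical, and the trigger is assumed to be a contiguous substring of the passage.

```python
from typing import List


def mark_trigger(passage: str, trigger: str) -> str:
    """Wrap the first occurrence of the trigger in <trigger>...</trigger>."""
    start = passage.find(trigger)
    if start < 0:
        return passage  # trigger not found verbatim: leave the passage unmarked
    end = start + len(trigger)
    return f"{passage[:start]}<trigger>{trigger}</trigger>{passage[end:]}"


def build_eae_input(passage: str, trigger: str, event_type: str, type_desc: str,
                    role: str, role_desc: str) -> str:
    """Concatenate the marked passage with the prompt segments for one role query."""
    segments = [
        mark_trigger(passage, trigger),
        f"Event type is {event_type}.",
        type_desc,
        f"Event trigger is {trigger}.",
        f"Argument role is {role}.",
        role_desc,
    ]
    return " ".join(segments)


def decode_eae_output(role: str, output: str, sep: str = "[SEP]") -> List[str]:
    """Parse '<role> is <arg> [SEP] <arg>' into a list of argument texts."""
    prefix = f"{role} is"
    if not output.startswith(prefix):
        return []
    body = output[len(prefix):].strip()
    if body in ("", "<argument>"):
        return []
    return [a.strip() for a in body.split(sep) if a.strip()]
```

One such input is built per argument role of the queried event type, so rare roles receive their own query rather than competing with frequent roles in a single output sequence.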

Mention Identification Enhanced EE
We use two techniques to enhance the generative model's ability to precisely identify long mentions with vague boundaries.
Explicit mention marker. As demonstrated by existing work on generative IE models (Huang, Tang, and Peng 2021), wrapping key concepts with markers provides beneficial hints to the generative model that improve its understanding of the properties and syntactic associations of the marked text. In this work, we use markers to wrap potential trigger mentions for ED or argument mentions for EAE, which provides a pool of candidates to extract from. To maximize the performance gain provided by the markers given an imperfect MI module, we must consider two conditions. First, the ED/EAE modules with markers should be robust enough to handle cases where the precision and coverage of the predicted mentions are compromised. We thus perform data augmentation and train the ED/EAE modules on two versions of the data together: one with ground-truth mention markers and one with no markers. This is done so that the model is trained to perform reasonably when markers are unavailable. Second, the granularity of the candidate set must be neither too coarse nor too fine (as shown by our experiments in § 4.4), as too broad a candidate pool makes the markers less informative and too strict a candidate pool makes it difficult for the MI module to correctly identify mentions. Instead of using all nouns/verbs or entities as the targets of MI, we select trigger mentions for the ED task and argument mentions for the EAE task. The unique properties of triggers (mostly describing an entire process or a behaviour, which could be linked to a specific timestamp) and arguments (mostly concrete details or descriptive content) make them more useful as candidate sets.
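The marker-based augmentation can be sketched as below. The marker tokens "<m>" and "</m>" are placeholders chosen for illustration (the text does not state the actual tokens), mentions are assumed to be contiguous and non-overlapping, and the function names are hypothetical.

```python
from typing import Dict, List


def add_mention_markers(passage: str, mentions: List[str],
                        open_tok: str = "<m>", close_tok: str = "</m>") -> str:
    """Wrap the first occurrence of each mention in marker tokens."""
    spans = []
    for mention in mentions:
        start = passage.find(mention)
        if start >= 0:
            spans.append((start, start + len(mention)))
    # Insert from right to left so that earlier character offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        passage = passage[:start] + open_tok + passage[start:end] + close_tok + passage[end:]
    return passage


def augment(examples: List[Dict]) -> List[Dict]:
    """Emit two training instances per example: with markers and without."""
    augmented = []
    for ex in examples:
        marked = add_mention_markers(ex["passage"], ex["mentions"])
        augmented.append({"input": marked, "target": ex["target"]})
        augmented.append({"input": ex["passage"], "target": ex["target"]})
    return augmented
```

Training on both versions means the unmarked input is never out of distribution at inference time, which matters when the MI module misses a mention.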
MI as an implicit auxiliary task. We design all three generative tasks (ED, EAE and MI) as extraction tasks that extract a sub-sequence from the input passage given a task-specific prompt. Compared to sequence tagging models with complex decoding designs, generative models are not as adept at extracting long and multi-token concepts despite the advantages we introduced at the beginning of § 3.1, especially in a domain they are not pre-trained for. Therefore, we add the trigger MI and argument MI tasks as auxiliary tasks that we jointly optimize with the ED and EAE tasks, respectively.

Training and Inference
Training. We provide ground-truth upstream data points (i.e. mention markers for ED and EAE, event triggers and types for EAE) when separately training the ED, EAE, and MI modules and use the standard teacher-forcing cross-entropy text generation loss to train the model. In the complete design, we first train standalone trigger and argument MI modules to provide mention markers for the ED and EAE modules. We then train the ED and EAE modules jointly with auxiliary trigger and argument MI modules, respectively, using ground-truth mention markers.
Inference. We use the trigger and argument mention markers produced by the standalone MI modules in the downstream ED and EAE modules. The predicted event triggers and their types output by the ED module are provided as input to the EAE module in a pipeline fashion.
Variations. We define two variations of our model:
• Vanilla DICE: pipelined ED and EAE modules without the mention enhancement techniques described in § 3.2;
• DICE: pipelined ED and EAE modules using mention markers and the mention identification auxiliary task.

Experiments
We evaluate DICE on MACCROBAT-EE and compare it with existing event extraction models.
Table 2: Overall experimental results (%). The argument extraction task takes the predicted event triggers and types of the corresponding ED model as input in a pipeline style. We use the codebases provided by the authors to produce results for all baselines.

Experimental Setup
Data splits. We first divide the 200 MACCROBAT-EE documents according to an 80/10/10 split for the training, validation, and testing sets, respectively. For each data split, we use the individual sentences and annotated mentions in corresponding documents as data instances for the ED and EAE tasks. For low-resource settings, we consider 10%, 25%, 50%, and 75% of the training data in terms of the number of documents to study the influence of resource availability on the model's performance while using the original validation and testing sets for evaluation.
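A document-level split and low-resource sampling could be implemented as in the sketch below. Random shuffling with a fixed seed is an assumption (the text does not say how documents were assigned to splits), and both function names are hypothetical.

```python
import random
from typing import List, Tuple


def split_documents(doc_ids: List[str], seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Split document IDs 80/10/10 into train, validation and test sets."""
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 8 // 10
    n_dev = len(ids) // 10
    return ids[:n_train], ids[n_train:n_train + n_dev], ids[n_train + n_dev:]


def downsample(train_ids: List[str], proportion: float, seed: int = 0) -> List[str]:
    """Sample a proportion of the training documents for a low-resource setting."""
    k = max(1, round(proportion * len(train_ids)))
    return random.Random(seed).sample(train_ids, k)
```

Splitting by document rather than by sentence keeps all sentences of a case report in the same split, and the validation and test sets stay fixed across the 10%, 25%, 50% and 75% settings.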
Evaluation metrics. We follow previous EE works and report precision, recall and F1 scores for the following four tasks. 1) Trigger Identification: a trigger is correctly identified if it matches the ground-truth span. 2) Trigger Classification: a trigger is correctly classified if it is correctly identified and its predicted event type matches the ground-truth event type. 3) Argument Identification: an argument of an event is correctly identified if it matches the ground-truth span. 4) Argument Classification: an argument of an event is correctly classified if it is correctly identified and its predicted argument role matches the ground-truth argument role.
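The four metrics share one computation over sets of predicted and gold tuples; only the tuple contents differ. The sketch below assumes exact-match comparison and a hypothetical tuple layout of (sentence ID, span text) for identification and (sentence ID, span text, label) for classification.

```python
from typing import Hashable, Iterable, Tuple


def prf1(predicted: Iterable[Hashable], gold: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 over exact-match tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Dropping the label from each tuple gives the identification score, and keeping it gives the classification score, so classification F1 can never exceed identification F1 for the same predictions.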
Baselines. We benchmark the performance of recent EE models on the MACCROBAT-EE dataset; please refer to § B.3 for reproduction details. The baseline models are: •

Overall ED and EAE Results
High-resource results. […] extraction tasks. DEGREE reports low performance on the argument extraction tasks due to the challenges of generating long sequences containing all arguments. Our model outperforms the baselines in the argument extraction task with an improvement of more than 2 F1 points on the classification task. DICE also achieves a marginal performance increase over OneIE on the trigger classification task.
Figure 3: Performance on downsampled training data. We report F1 score (y-axis) for each proportion (x-axis).
Low-resource results. We show the results of training in lower-resource settings in Fig. 3 and § C.1. Despite OneIE achieving the best performance on ED tasks in higher-resource settings, we observe that DICE significantly outperforms OneIE in both trigger extraction tasks in the 10% data setting and marginally outperforms OneIE in trigger classification in the 25% data setting. We also observe that DICE significantly outperforms all baselines in the argument extraction tasks under all low-resource settings. The performance gap between DICE and the baselines increases as the training data percentage decreases, with improvements of more than 7 (10%) and 8 (25%) F1 points over Text2Event.

Mention Identification Results
MI module design. We compare our MI module with several state-of-the-art mention identification methods on the entity identification task. The accuracy of the MI module is critical to the performance gain in the ED and EAE modules provided by the addition of the markers. In addition to the MI module used in DICE (Line 5) and its variant, MI without sliding windows, we also report the performance of the following state-of-the-art entity recognition modules:
Table 4 shows that the sliding window technique marginally improves the model's coverage for the approach outlined in Yan et al. (Line 2 vs. 1) and significantly improves our MI module (Line 5 vs. 4). Our MI module outperforms OneIE and achieves the best F1 score for entity identification.
Table 3: Performance of the methods used to incorporate mention information. The argument extraction results reported here use ground-truth event information, which enables us to remove the influence of the upstream ED result. † indicates settings that use mention markers to wrap ground-truth mentions; these are not comparable with the other lines.
Performance on different mention definitions. We use the MI module to identify trigger mentions and argument mentions that serve as the markers for the ED and EAE tasks and report the performance in Lines 6 and 7 of Table 4. The results confirm that extracting trigger and argument mentions is more difficult than extracting all mentions and indicate that identifying argument mentions is the most difficult task among the three mention definitions.

Ablation on the Usage of Mention
Techniques. We analyze the effects of the proposed techniques for incorporating mention information in Table 3. In our final design, which uses triggers and arguments as mentions (Lines 6-8), we observe that the auxiliary task and the mention markers each contribute to the performance improvement, with gains of 1.14 and 1.75 F1 points, respectively, on argument classification. When adding both mention markers and the auxiliary task, we observe an improvement over vanilla DICE of 3.84 and 2.27 F1 points for trigger and argument classification, respectively. This result indicates that the ED and EAE modules benefit from both the implicit joint learning signal gained from the auxiliary task and the explicit mention hints provided by including mention markers in the input sequence. We include oracle settings in Lines 5 and 9 that provide ground-truth mention markers during inference to show the oracle performance given a perfect MI module and to illustrate the influence of the accuracy of the MI module.
Mention definition. Trigger and argument mentions are subsets of the entity mentions in MACCROBAT-EE. If we equate mentions and entities, we essentially perform entity identification as the auxiliary task and add markers to wrap all entities. When defining mentions as triggers for ED and arguments for EAE, the auxiliary task is more similar to the ED or EAE task it is jointly trained with and the mention markers are more targeted. We observe a marginal performance gain for ED when using trigger mentions as opposed to entity mentions, while the performance gain of EAE with argument mentions over EAE with entity mentions is quite significant. These observations support the hypothesis that incorporating the information of all entities brings noise into the training signal and that more targeted mentions improve performance for both ED and EAE tasks.

Other Ablation Studies
Extraction vs typing formulation. We formulate ED and EAE as conditional text generation tasks and consider two designs for our input and target format. The first is the DICE design, in which we expect the model to extract content given queries with event type/argument role information. The second design formulates a typing task that provides a query to the generative model for each mention so that the expected output is the predicted event type or argument role for the queried mention (see § B.4 for details). This approach is motivated by the notion that the output space of the typing formulation is much smaller than that of the extraction task. For the ED example in Fig. 2, we instead query with the input "... calcified <query>plaque</query> ... artery." and the expected output is "Event type is Sign symptom".
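The typing formulation swaps the roles of query and answer: the mention is given and the type is generated. A minimal sketch follows, assuming the queried mention is a contiguous substring of the passage; the helper names and the example passage are hypothetical.

```python
from typing import Optional


def build_typing_query(passage: str, mention: str) -> str:
    """Wrap the queried mention in <query>...</query> markers."""
    start = passage.find(mention)
    if start < 0:
        raise ValueError(f"mention {mention!r} not found in passage")
    end = start + len(mention)
    return f"{passage[:start]}<query>{mention}</query>{passage[end:]}"


def parse_typing_output(output: str, prefix: str = "Event type is") -> Optional[str]:
    """Return the predicted type, or None if the output is malformed."""
    if not output.startswith(prefix):
        return None
    return output[len(prefix):].strip() or None
```

Here one query is issued per candidate mention rather than per event type, so the number of queries grows with mention density instead of with the size of the ontology.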
The results in Table 5 show that the typing formulation improves ED performance over extraction (though it is still worse than mention-enhanced DICE), but leads to much worse EAE performance. This is likely because the typing task becomes more difficult as the number of candidate classes increases and because the typing space is more complicated, varying by event type.
Table 6: Ablation study of the effects of sequence segments.
Input prompt segments. We analyze the importance of prompt segments in Table 6. For ED, we find that the event type name is more important than the event type description. For EAE, removing either the event type description (Line 5) or the argument role description (Line 9) leads to the most significant performance decreases. These results emphasize the benefits of incorporating the rich semantic information contained in the names and definitions of both event types and argument roles.

Error Analysis
We analyze the error introduced by the 4 steps in the pipeline for our best-performing EAE model using predicted triggers on the argument classification task. The results in Fig. 4 indicate that the identification sub-tasks, especially trigger identification, are the performance bottlenecks of our model.

General Domain Event Extraction
Many prior works formulate EE as token-level classification tasks and train in an ED-EAE pipeline style (Wadden et al. 2019; Yang et al. 2019; Ma et al. 2021). Other approaches jointly optimize ED and EAE tasks (Li, Ji, and Huang 2013; Yang and Mitchell 2016; Lin et al. 2020a) while incorporating constraints (Han, Zhou, and Peng 2020; Han, Ning, and Peng 2019), or include a named-entity recognition task to provide an additional supervision signal (Zhao et al. 2019; Zhang et al. 2019; Sun et al. 2020; Wadden et al. 2019). Recent work formulates the EE task as text generation with transformer-based pre-trained language models that prompt the generative model to fill in synthetic (Paolini et al. 2021; Huang, Tang, and Peng 2021; Lu et al. 2021b; Li, Ji, and Han 2021) or natural language templates (Hsu et al. 2022; Ma et al. 2022). To our knowledge, there is no existing approach to clinical EE using a text generation framework, which we hypothesize is due both to data unavailability and to the aforementioned domain challenges.

Event Extraction in Biomedical Domain
Existing approaches to biomedical EE (Huang, Yang, and Peng 2020; Trieu et al. 2020; Wadden et al. 2019; Ramponi et al. 2020; Wang, Weber, and Leser 2020) typically focus on extracting interactions or relationships between biological components such as proteins, genes, drugs, diseases and outcomes related to these interactions (Ananiadou et al. 2010). The mentions in these biological component interactions are short, distinctive biomedical terms and do not have rich event type-argument role ontologies because of the lack of interaction types present in the datasets (Kim et al. 2011; Kim, Wang, and Yasunori 2013; Pyysalo et al. 2011, 2012). Our work addresses these concerns by introducing MACCROBAT-EE as well as providing a benchmark in a previously under-explored domain.

Conclusion and Future Work
In this work, we present DICE, a generative event extraction model designed for the clinical domain. Our approach formulates the EE task as a conditional generation task that leverages domain knowledge in the form of a natural language prompt. DICE is adapted to tackle long and complicated mentions by jointly optimizing EE tasks with the auxiliary mention identification task and by adding mention boundary markers. We also introduce MACCROBAT-EE, the first clinical EE dataset with argument annotation, as a testbench for future clinical EE works. Lastly, our evaluation shows that DICE achieves state-of-the-art performance on MACCROBAT-EE.
In future work, we aim to pre-train generative language models on a clinical corpus as well as to apply transfer learning from higher-resource domains to improve performance.

A.1 MACCROBAT Annotation
MACCROBAT is annotated according to the Annotation for Case Reports using Open Biomedical Annotation Terms (ACROBAT) defined by Caufield et al. (2019). ACROBAT describes events and entities as meaningful text spans, but differentiates events as occurrences that may be ordered chronologically and entities as objects that may modify or describe events. Each event and entity is given a type such that certain events are associated with certain argument roles. According to ACROBAT, entity text spans are limited to the shortest viable length. For example, the text span "mild asthma attack" would be annotated by labeling "asthma attack" as an event, as that is the shortest span that conveys the occurrence of the event. "Mild" would be labeled an entity and the annotation would add a relation indicating that "mild" modifies "asthma attack". MACCROBAT contains 12 relation types, but for our purposes we only consider the MODIFY relation that occurs when an entity describes or characterizes an event.

A.2 Details of Inferring Event Arguments
According to the ACE2005 English Events Guidelines (AEEG), the arguments of events are defined as entities and values within the scope of an event and only the closest entities and values will be selected, where a value is defined to be "a string that further characterizes the properties of some Entity or Event". The MODIFY relation in the MACCROBAT dataset connects two arguments, and it is defined as the "generic relationship in which one entity or event modifies another entity or event, including instances where an entity is identified following an event" (Caufield et al. 2019). The MODIFY relation satisfies the argument definition described by the AEEG by capturing within-sentence relationships between an event and an entity that modifies or describes it. Thus, given a certain event trigger, we consider non-event entities that hold a MODIFY relation with the trigger as arguments of this event. We take the assigned type of the selected entity according to MACCROBAT as the role of the argument. To create an event ontology, which includes all possible event types and the possible argument roles for each event type, we traverse all (event type, argument role) pairs to obtain the unique argument roles possible for each event type.

A.3 Event Ontology
We show the full event ontology, including all event types and their possible argument roles, in Table 8.

B Details of Implementation and Experiments
B.1 Implementation Details
Mention Identification. The sliding window scans the passage from beginning to end with a pre-defined window size and step size, which significantly boosts the coverage of the predicted mentions. During both training and inference, we retain the original full-length input passage in addition to the sliding window segments.
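As an illustration of the segmentation step, the sketch below splits a passage into overlapping word windows and keeps the full passage as an extra input. The function name, whitespace tokenization, and the handling of the final shorter window are assumptions; the values 10 and 4 are the window and step sizes reported in B.5.

```python
def sliding_windows(passage, window_size=10, step_size=4):
    """Return the full passage followed by overlapping word windows."""
    words = passage.split()  # assumption: whitespace tokenization
    segments = [passage]     # the original full-length passage is retained
    if len(words) <= window_size:
        return segments      # a single window would duplicate the passage
    start = 0
    while True:
        segments.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break            # the last window reached the end of the passage
        start += step_size
    return segments


example = " ".join(f"w{i}" for i in range(12))
segments = sliding_windows(example)
```

With a 12-word passage this produces the full passage plus two windows, the second of which is shorter because it runs into the end of the text.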
Training and evaluation. We select the best epoch based on the highest F1 score of the pipelined EAE classification task on the validation set. When evaluating correctness, we only accept an exact match between the generated trigger/argument and the ground-truth trigger/argument as a correct prediction. We use beam search with 2 beams to generate the output sequences for all three generative tasks. The generation stops either when the "end of sentence" token is generated or the output length reaches 30.
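The exact-match criterion can be made concrete with a small scoring sketch. The tuple layout (passage identifier, span text, label) is an assumption made for illustration, and the scoring script used for the reported numbers may differ in detail.

```python
from collections import Counter


def exact_match_f1(predictions, references):
    """Micro precision, recall, and F1 where only exact matches count."""
    predicted = Counter(predictions)
    gold = Counter(references)
    # Multiset intersection: a prediction is correct only if an identical
    # tuple exists in the ground truth.
    num_correct = sum((predicted & gold).values())
    num_predicted = sum(predicted.values())
    num_gold = sum(gold.values())
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


predictions = [
    ("doc1", "plaque", "Sign_symptom"),
    ("doc1", "calcified plaque", "Sign_symptom"),  # boundary error
]
references = [
    ("doc1", "plaque", "Sign_symptom"),
    ("doc1", "nodule", "Sign_symptom"),
]
precision, recall, f1 = exact_match_f1(predictions, references)
```

In this example the over-long span "calcified plaque" receives no partial credit, which is why accurate mention boundaries matter under this metric.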
Frameworks. Our entire codebase is implemented in PyTorch. The implementations of the transformer-based models are extended from the Huggingface codebase (Wolf et al. 2020).

B.2 Experiments Details
We report the averaged result of three runs with different random seeds for each experiment. For the low-resource results shown in Fig. 3, we sample three different selections of training data of the corresponding proportion and report the average F1 score. All the models in this work are trained on a single Nvidia A6000 GPU on an Ubuntu 20.04.2 operating system.

B.3 Baseline Reproduction
Mention Identification. For the results in Table 4, we use BART-large for the model of Yan et al. (2021) because it only supports generative models with absolute position embeddings. OneIE uses BERT-large, which is its default, and we use T5-large for our module in Lines 4 and 5.
ED and EAE. OneIE jointly learns the ED, EAE, and MI tasks. We provide entity information to its MI module with event types and role types stripped, so that its training information matches the training information provided to our model DICE. DEGREE requires human-written templates that organize the argument roles of an event type into a sentence. We construct the template for each event type from phrases of the form "<Argument role> is <argument text>", one for every potential argument role of that event type.
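A template of this form could be assembled as in the sketch below. The function name, the placeholder string, and the choice to join the phrases with periods are assumptions for illustration; only the "<Argument role> is <argument text>" phrase pattern comes from the description above.

```python
def build_template(argument_roles, arguments=None, placeholder="<argument text>"):
    """Build one sentence-like template covering all roles of an event type."""
    arguments = arguments or {}
    phrases = [
        f"{role} is {arguments.get(role, placeholder)}"
        for role in argument_roles
    ]
    # Assumption: phrases are joined with periods into a single string.
    return ". ".join(phrases) + "." if phrases else ""


roles = ["Area", "Detailed description"]
empty_template = build_template(roles)
filled_template = build_template(roles, {"Area": "0.8x1.5cm"})
```

Roles without an extracted argument keep the placeholder, so the same function produces both the unfilled template and a partially filled target sequence.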

B.4 Details of Ablation Studies
For the extraction vs typing formulation ablation shown in § 4.5, we formulate the ED and EAE tasks as typing tasks by querying each possible mention. For the ED task, we first use the standalone mention identification module introduced in § 3.1 to extract all possible triggers, and then we query the generative model with the following example input and output format:
Input: ... calcified <query>plaque</query> ... artery.
Output: Event type is Sign symptom.
The output is constrained to belong to the candidate pool of event types or the placeholder event type "<Type>", following the prefix "Event type is ". For the EAE task, we first extract all possible argument candidates and then query each candidate with the input sentence containing the event trigger, event trigger marker, event type name, and event type description:
Input: ... densely <query>calcified</query> <trigger>plaque</trigger> ... artery. \n Event type is Sign symptom. \n Any symptom or clinical finding. \n Event trigger is plaque.
Output: Argument role is Detailed description.
Similarly, the output is constrained to the candidate pool of argument roles possible for the given event type, following the prefix "Argument role is ".
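The sketch below shows one way to assemble the EAE typing query and to check a generated output against the candidate pool. The helper names and the use of character offsets are assumptions, the field separator is assumed to be a newline, and the pool check is applied here after generation, whereas the constraint could equally be enforced during decoding.

```python
def wrap(text, start, end, tag):
    """Surround text[start:end] with <tag>...</tag> markers."""
    return f"{text[:start]}<{tag}>{text[start:end]}</{tag}>{text[end:]}"


def build_eae_query(passage, candidate, trigger, event_type, description):
    """candidate and trigger are (start, end) offsets of non-overlapping spans."""
    marked = passage
    # Wrap the later span first so the earlier offsets remain valid.
    for (start, end), tag in sorted(
        [(candidate, "query"), (trigger, "trigger")], reverse=True
    ):
        marked = wrap(marked, start, end, tag)
    trigger_text = passage[trigger[0]:trigger[1]]
    return "\n".join([
        marked,
        f"Event type is {event_type}.",
        description,
        f"Event trigger is {trigger_text}.",
    ])


def parse_typed_output(output, prefix, candidate_pool):
    """Return the label after the prefix, or None if it is not in the pool."""
    if not output.startswith(prefix):
        return None
    label = output[len(prefix):].strip().rstrip(".")
    return label if label in candidate_pool else None


passage = "densely calcified plaque"
cand_start = passage.index("calcified")
trig_start = passage.index("plaque")
query = build_eae_query(
    passage,
    (cand_start, cand_start + len("calcified")),
    (trig_start, trig_start + len("plaque")),
    "Sign symptom",
    "Any symptom or clinical finding.",
)
role = parse_typed_output(
    "Argument role is Detailed description.",
    "Argument role is ",
    {"Detailed description", "Area"},
)
```

The same `parse_typed_output` helper covers the ED case by passing the prefix "Event type is " and a pool that contains the event types plus the placeholder "<Type>".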

B.5 Hyperparameters
For the ED module, we define positive instances as (PASSAGE, EVENT TYPE) pairs where the passage contains one or more event triggers of this event type. Negative instances are pairs in which the passage contains no event triggers of the event type. We create 10 negative instances for each positive instance. For the EAE module, we define a positive instance as a (PASSAGE, EVENT TRIGGER, EVENT TYPE, ARGUMENT ROLE) tuple for which the passage contains an argument text that meets the query criteria. We create 10 negative instances for each positive instance. For the MI module, we use a window size of 10 words with a sliding step of 4 words. We retain the original full sequence in both training and evaluation. We use the AdamW optimizer with a learning rate of 1e-5 and no gradient accumulation.
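The construction of ED training instances can be sketched as follows. How the negatives are drawn is not specified above, so uniform sampling from all negative (passage, event type) pairs is an assumption, as are the function name, the data layout, and the placeholder type names other than "Sign symptom".

```python
import random


def build_ed_instances(passage_types, all_event_types, negatives_per_positive=10, seed=0):
    """passage_types: passage -> set of event types that have a trigger in it."""
    rng = random.Random(seed)
    positives = [
        (passage, event_type)
        for passage, types in passage_types.items()
        for event_type in sorted(types)
    ]
    negative_pool = [
        (passage, event_type)
        for passage, types in passage_types.items()
        for event_type in sorted(all_event_types)
        if event_type not in types
    ]
    # Assumption: sample uniformly, capped by the number of available pairs.
    num_negatives = min(len(negative_pool), negatives_per_positive * len(positives))
    negatives = rng.sample(negative_pool, num_negatives)
    return positives, negatives


passage_types = {"passage 1": {"Sign symptom"}, "passage 2": set()}
event_types = {"Sign symptom", "Type B", "Type C"}  # last two are placeholders
positives, negatives = build_ed_instances(passage_types, event_types)
```

In this toy example only five negative pairs exist, so the 10:1 ratio is capped; on the full dataset the pool is much larger than the number requested.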

C.1 Full Low-Resource Results
We show the full low-resource experimental results illustrated in Fig. 3 in Table 7.

Table 7: Performance on downsampled training sets. We report the F1 score for each task using different proportions of downsampled training data. We create three random splits for each proportion and report the averaged performance.