PHEE: A Dataset for Pharmacovigilance Event Extraction from Text

The primary goal of drug safety researchers and regulators is to promptly identify adverse drug reactions. Doing so may in turn prevent or reduce the harm to patients and ultimately improve public health. Evaluating and monitoring drug safety (i.e., pharmacovigilance) involves analyzing an ever growing collection of spontaneous reports from health professionals, physicians, and pharmacists, and information voluntarily submitted by patients. In this scenario, facilitating analysis of such reports via automation has the potential to rapidly identify safety signals. Unfortunately, public resources for developing natural language models for this task are scant. We present PHEE, a novel dataset for pharmacovigilance comprising over 5000 annotated events from medical case reports and biomedical literature, making it the largest such public dataset to date. We describe the hierarchical event schema designed to provide coarse and fine-grained information about patients’ demographics, treatments and (side) effects. Along with the discussion of the dataset, we present a thorough experimental evaluation of current state-of-the-art approaches for biomedical event extraction, point out their limitations, and highlight open challenges to foster future research in this area.


Introduction
Pharmacovigilance is the pharmaceutical science that entails monitoring and evaluating the safety and efficacy of medicine use, which is vital for improving public health (World Health Organization, 2004). Unexpected adverse drug effects (ADEs) can lead to considerable morbidity and mortality (Lazarou et al., 1998), and it has been reported that more than half of ADEs are preventable (Gurwitz et al., 2000). Pharmacovigilance is therefore important for detecting and understanding ADE-related events, as it may inform clinical practice and ultimately mitigate preventable hazards. Our data and code are available at https://github.com/ZhaoyueSun/PHEE.
Collecting and maintaining the clinical evidence for pharmacovigilance can be difficult because it requires time-consuming manual curation to capture emerging data about drugs (Thompson et al., 2018). Much of this information can be found in unstructured textual data, including the medical literature, notes in electronic health records (EHRs), and social media posts. Using NLP methods to discover and extract adverse drug events from unstructured text may permit efficient monitoring of such sources (Nikfarjam et al., 2015; Huynh et al., 2016; Ju et al., 2020; Wei et al., 2020).
Past work has introduced pharmacovigilance corpora to support the training and evaluation of NLP approaches for ADE extraction. However, most of these datasets (e.g., the ADE corpus; Gurulingappa et al. 2012b) contain annotations only of entities (such as drugs and side effects) and their binary relations, as shown in Figure 1(a). This ignores contextual information relating to human subjects, treatments administered, and more complex situations such as multi-drug concomitant use. To address this problem, Thompson et al. (2018) developed the PHAEDRA corpus, which includes annotations not only of drugs and side effects, but also of subjects (humans, specific species, bacteria, and so on) and of events encoding descriptions of drug effects, which involve multiple arguments and event attributes (see Figure 1(b)).
Despite these refinements, however, PHAEDRA does not provide detailed, nested annotations such as dosages, conditions, and patient demographic details. This granular information may provide critical context for clinical studies. Furthermore, PHAEDRA consists of only 600 annotated abstracts of medical case reports, making it challenging to train NLP models for pharmacovigilance event extraction, since its annotations are at the document level and the actual annotated events are sparse.
In this work we introduce a new annotated corpus, PHEE, for extracting adverse and potential therapeutic effect events for pharmacovigilance studies. The dataset consists of nearly 5,000 sentences extracted from MEDLINE case reports, each featuring two levels of annotation. At the coarse-grained level, each sentence is annotated with the event trigger word/phrase, the event type, and text spans indicating the event's associated subject, treatment, and effect. In a fine-grained annotation pass, further details are marked, such as patient demographic information, context about the treatments (including drug dosage, administration route, and frequency), and attributes relating to events. An example annotation is shown in Figure 1(c).
Using PHEE as the benchmark, we conduct thorough experiments to assess state-of-the-art NLP technologies on the pharmacovigilance-related event extraction task. We use sequence labelling and (both extractive and generative) QA-based methods as baselines and evaluate event trigger extraction and argument extraction. The extractive QA method performs best for trigger extraction, with an exact match F1 score of 70.09%, while the generative QA method achieves the best exact match F1 scores of 68.60% and 76.16% for main argument and sub-argument extraction, respectively. Further analysis shows that current models perform well on average cases but often fail on more complex examples.
Our contributions can be summarised as follows: 1) We introduce PHEE, a new pharmacovigilance dataset containing over 5,000 finely annotated events from public medical case reports. To the best of our knowledge, this is the largest and most comprehensively annotated dataset of this type to date. 2) We collect hierarchical annotations to provide granular information about patients and conditions in addition to coarse-grained event information. 3) We conduct thorough experiments to compare current state-of-the-art approaches for biomedical event extraction, demonstrating the strengths and weaknesses of current technologies, and use this to highlight challenges for future research in this area.

Related Work
Pharmacovigilance Related Corpora Prior pharmacovigilance-related corpora have mainly focused on annotating entities (e.g., drugs, diseases, medications) and binary relations between them, namely drug-ADE relations (Gurulingappa et al., 2012a; Patki et al., 2014; Ginn et al., 2014), disorder-treatment relations (Rosario and Hearst, 2004; Roberts et al., 2009; Uzuner et al., 2011; Van Mulligen et al., 2012), and drug-drug interactions (Segura-Bedmar et al., 2011; Boyce et al., 2012; Rubrichi and Quaglini, 2012; Herrero-Zazo et al., 2013). More recent open challenges, including the 2018 n2c2 shared task (Henry et al., 2020) and the MADE1.0 challenge (Jagannatha et al., 2019), have considered additional relation types, such as drug-attribute and drug-reason relations, but these are still binary relationships. Thompson et al. (2018) introduced the PHAEDRA corpus, extending drug-ADE annotations to pharmacovigilance events. Compared to corpora that only annotate simple drug-ADE relations (referred to as AE events in PHAEDRA), they further annotate three additional event types: the Potential Therapeutic Effect (PTE) event, which refers to the potential beneficial effects of drugs, and the Combination and Drug-Drug Interaction events, which indicate multiple drug use and interactions between administered drugs, respectively. In addition, PHAEDRA includes the subject as a type of named entity (NE) and annotates three types of event attributes, i.e., negated, speculated, and manner. However, some key informative details are still missing in PHAEDRA. As the NE annotations of PHAEDRA are usually single nouns or short noun phrases, detailed information about the subject (such as age and gender) and about the medication (e.g., dosage and frequency) is not captured.
We set out to annotate a larger corpus with more detailed information to facilitate the training of pharmacovigilance event extraction models. We build on existing corpora (PHAEDRA and ADE). The ADE corpus comprises ∼3,000 MEDLINE case reports with annotations on ∼4,000 sentences indicating adverse effects, but its annotations only involve drugs, dosages, and adverse effects, and lack sufficient event details of interest. The PHAEDRA corpus reuses 227 abstracts from ADE and integrates an additional 370 abstracts (from other corpora and some novel entries). However, since the PHAEDRA corpus is annotated at the document level, the actual annotated events are very sparse. We collected sentences in ADE and those in PHAEDRA with AE or PTE event annotations and enriched these using our proposed annotation scheme.
Biomedical Event Extraction Most existing biomedical event extraction methods work as "pipelines", treating trigger extraction and argument extraction as two stages (Björne and Salakoski, 2018; Li et al., 2018, 2020a; Huang et al., 2020; Zhu and Zheng, 2020); this can lead to error propagation. Trieu et al. (2020) propose an end-to-end model that jointly extracts triggers/entities and assigns argument roles to mitigate error propagation, but in contrast to our span-based annotation, this requires full annotation of all entities. Ramponi et al. (2020) treat biomedical event extraction as a sequence labelling task, allowing them to jointly model event trigger and argument extraction via multi-task learning.
In other domains, recent work has formulated event extraction as a question answering task (Du and Cardie, 2020; Li et al., 2020b; Liu et al., 2020). This paradigm transforms the extraction of event triggers and arguments into multiple rounds of questioning, obtaining an answer about a trigger or an argument in each round. Such methods can reduce the reliance on entity information for argument extraction and have proved to be data efficient. Current QA-based event extraction methods are mainly built on extractive QA, which obtains the answer to a question by predicting the position of the target span in the original text. As such, a separate question needs to be formulated for each event and argument type. We also experiment with a generative QA method, which generates the answers directly, for comparison.
The PHEE Dataset

Task Definition and Schema
The PHEE corpus comprises sentences from the biomedical literature annotated with information relevant to pharmacovigilance. Annotations are hierarchically structured in terms of textual events. Following prior work (Thompson et al., 2018), we define two main clinical event types: Adverse Drug Effect (ADE) and Potential Therapeutic Effect (PTE), denoting potentially harmful and beneficial effects of medical therapies, respectively. Events consist of a trigger and several arguments, as defined by the ACE Semantic Structure (LDC, 2005). The trigger is a word or phrase that best indicates the occurrence of an event (e.g., 'induced', 'developed'), while the arguments specify the information characterizing an event, such as patients' demographic information, treatments, and (side) effects (Figure 1(c)). We further organise arguments into two hierarchical levels, namely main and sub-arguments. Main arguments are longer text spans that contain the full description of an event aspect (e.g., treatment), while sub-arguments are usually words or short phrases included in main argument spans, highlighting specific details of the argument (e.g., drug, dosage, duration).
More specifically, in PHEE, event arguments are defined as follows:
Subject highlights the patients involved in the medical event, with sub-arguments including age, gender, race, number of patients (labeled as population), and preexisting conditions (labeled as subject.disorder) of the subject.
Treatment describes the therapy administered to the patients, with sub-arguments specifying drug (and their combinations), dosage, frequency, route, time elapsed, duration and the target disorder (labeled as treatment.disorder) of the treatment.
Effect indicates the outcome of the treatment.
We also collected annotations indicating three types of attributes, characterizing whether an event is negated, speculated, or has its severity indicated. See more details about the schema in Appendix A.

Data Collection and Validation
Data Collection To compose the PHEE corpus, we collect existing medical case report abstracts from the ADE (Gurulingappa et al., 2012b) and PHAEDRA (Thompson et al., 2018) datasets. We extract sentences from the abstracts and annotate those containing at least one adverse or potential therapeutic effect (ADE or PTE) event, for a total of over 4.8k sentences after deduplication.
Annotation Process We hired 15 annotators in total, all PhD students in computer science or medicine. Before starting the annotation, we consulted pharmacovigilance researchers and biomedical NLP researchers on our annotation schema.
We conducted the corpus annotation in two stages to reduce the difficulty of dealing with medical text. In the first stage, we provided the annotators with sets of single sentences and asked them to highlight the event triggers and the text spans functioning as main arguments (i.e., subject, treatment, and effect). Each annotator annotated about 330 sentences during this stage. In the second stage, we randomly assigned the annotated sentences to different annotators, who were required to verify the correctness of the previous annotations. Once confirmed, the annotations were expanded to specify the possible sub-arguments (e.g., for subjects: age, gender, population, race, subject.disorder) and attributes (e.g., negation). To ease the cognitive demand of highlighting fine-grained sub-arguments during the second stage, the annotators were split into three groups, each specialising in just one of the three main argument types. Specifically, four annotators were allocated for subject sub-argument annotation and four for effect and attribute annotation, while seven annotators were allocated for treatment sub-argument annotation due to the task's complexity. Each annotator was responsible for around 1.4k or 700 instances during this stage. Additional notes on the annotation process can be found in Appendix B.

Data Validation
To ensure quality annotations, each stage of annotation was preceded by several rounds of annotation trials, after which we discussed frequent inconsistencies. When questions about specific instances surfaced during the annotation process, annotators flagged these sentences for review. While the main annotations of stage one were double-checked by the annotators in stage two, we randomly duplicated 20% of the stage-two samples and assigned them to different groups to measure Inter-Annotator Agreement (IAA).
We compute the F1 score as a measure of agreement between annotators. We calculate F1 scores between the sets of duplicated cases by (arbitrarily) selecting one annotation set as the "reference" for the other. Specifically, we adopt the EM_F1 (span-level) and Token_F1 (token-level) metrics, which are explained in detail in Section 4.2. We report agreement scores in Table 1.
Consistency across trigger and argument types is over 80%, indicating the effectiveness of the two-stage approach. Agreement on sub-arguments is lower, which is expected given the higher complexity of fine-grained medical annotations. In particular, we notice low consistency in the annotation of duration and time_elapsed. One common type of inconsistency involves "generalized expressions" (e.g., "chronic", "long-term", "shortly after"), which are annotated by some annotators but ignored by others. In addition, annotators easily confuse these two types: for example, the phrase "48 months" in "48 months post-chemotherapy" was mistakenly annotated as duration, although it is generally agreed that it should be time_elapsed. Other sub-argument types with somewhat lower agreement include frequency and subject.disorder. For frequency, inconsistent cases include generalized expressions (e.g., "repeated", "continuous") and certain specific expressions such as "0.32mg/kg/day", where some annotators prefer to annotate "0.32mg/kg" as dosage and "/day" as frequency, while others annotate the whole span as dosage. For subject.disorder, conflicts exist for "neutral" expressions that describe the subject's health condition but are not necessarily disorders, such as "pregnant" and "nondiabetic". Apart from these difficult cases, inconsistency also occurs in the choice of span boundaries, especially for long arguments, or sometimes due to accidental mistakes. Attribute annotations are inconsistent, probably due to their rarity in the corpus.

Dataset Statistics and Analysis
PHEE includes a total of 4,827 sentences and 5,019 annotated events. This makes PHEE the largest annotated dataset on adverse drug events of which we are aware. We randomly divided the train, dev, and test splits based on documents. Details about these splits are provided in Table 2. Table 3 reports statistics of the main event arguments. In general, each event contains at most one main argument of a particular type, but arguments might be discontinuous, leading to multiple spans representing a single argument. The average number of tokens per argument is about 3-4, which is generally longer than in other datasets focusing only on biomedical entities (drugs, diseases, or effects).
Statistics about sub-argument annotations are provided in Figure 2. For the sub-arguments of the subject, age is the most frequently mentioned feature. Gender, population, and subject.disorder are also comparatively common; race is the rarest attribute. For treatment, drug names are the most frequently mentioned, even exceeding the number of treatment arguments due to administered combinations of drugs. The target disorder of the treatment is the second most frequently mentioned, providing context information in which the therapeutic or adverse events occurred. In contrast, the other treatment sub-arguments occur less frequently, resulting in a rather imbalanced argument distribution.
Statistics of attributes are in Appendix C.

Experiments
We evaluate sequence labelling and QA-based methods (both extractive and generative) on our PHEE dataset. We describe our experimental design, evaluation metrics, and main results in this section. Reproduction details are in Appendix D.
Sequence Labelling We use the "I-O" scheme, in which the label "I-X" indicates that a token is within a span of argument type X, and "O" indicates that it is outside any argument span. As the main arguments and their associated sub-arguments usually overlap, we set the label to "I-A.B" if a token is simultaneously within a main argument span of type A and a sub-argument span of type B. Correspondingly, the label is "I-A" or "I-B" if the token appears only in a main argument or only in a sub-argument. For triggers, labels denote event types. An example of the flattened label sequence is shown in Figure 3(a). We use the ACE (Wang et al., 2021) model, which achieves state-of-the-art results for Named Entity Recognition (NER), as a representative sequence labelling method in our experiments.
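The label flattening described above can be sketched as follows (a minimal illustration with hypothetical token offsets and type names, not the authors' preprocessing code):

```python
def flatten_labels(n_tokens, main_args, sub_args):
    """main_args / sub_args map a type name to a (start, end) token span,
    end exclusive. Returns one flattened 'I-O' label per token."""
    labels = ["O"] * n_tokens
    # First mark main arguments: "I-A"
    for a_type, (s, e) in main_args.items():
        for i in range(s, e):
            labels[i] = f"I-{a_type}"
    # Then overlay sub-arguments: "I-A.B" where they overlap a main
    # argument, plain "I-B" otherwise
    for b_type, (s, e) in sub_args.items():
        for i in range(s, e):
            if labels[i] != "O":
                labels[i] = f"{labels[i]}.{b_type}"
            else:
                labels[i] = f"I-{b_type}"
    return labels

# Toy sentence: "A 52-year-old woman developed hemiparesis"
labels = flatten_labels(
    5,
    main_args={"Subject": (0, 3), "Effect": (4, 5)},
    sub_args={"Age": (1, 2)},
)
# → ["I-Subject", "I-Subject.Age", "I-Subject", "O", "I-Effect"]
```

The overlapping token receives the composite "I-Subject.Age" label, which is how the two annotation levels are collapsed into a single sequence.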
Extractive QA We build our extractive QA model upon the EEQA method (Du and Cardie, 2020). Event triggers, main arguments, and sub-arguments are extracted in three sequential steps, as shown in Figure 3(b). We fine-tune the pretrained BioBERT (Lee et al., 2020) model on inputs of the form "[CLS] <question> [SEP] <sentence> [SEP]", where <question> and <sentence> are placeholders for a question template and an input sentence, respectively. The output is the text span extracted from the input sentence as the answer. We experiment with different question templates for event triggers, main arguments, and sub-arguments.
For event trigger extraction, the model predicts a probability distribution across all event types (including a non-event case) for each input token based on BioBERT representations. Argument extraction is performed for each argument type: the probabilities of being the start/end position of an argument span are predicted for each token by a classification layer added on top of the BioBERT encoder. All possible <start, end> pairs are then filtered against thresholds derived from the scores of the [CLS] token (which indicates a non-event prediction) to retrieve the extracted arguments. We also filter out spans that overlap with better-scoring spans. We train the QA models for main-argument extraction and sub-argument extraction separately.
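The span-decoding step can be sketched roughly as follows (illustrative scores and a simplified null-threshold, not the actual EEQA implementation): candidate <start, end> pairs that beat the [CLS] (no-answer) score are kept, then overlapping spans are greedily dropped in favour of better-scoring ones.

```python
def decode_spans(start_scores, end_scores, cls_start, cls_end, max_len=10):
    """start_scores/end_scores: per-token logits; cls_start/cls_end: the
    [CLS] (no-answer) logits used as the filtering threshold."""
    candidates = []
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            # keep pairs scoring above the null (no-answer) prediction
            if ss + end_scores[e] > cls_start + cls_end:
                candidates.append((ss + end_scores[e], s, e))
    # greedy non-overlap filtering, best-scoring spans first
    candidates.sort(reverse=True)
    kept = []
    for score, s, e in candidates:
        if all(e < ks or s > ke for _, ks, ke in kept):
            kept.append((score, s, e))
    return [(s, e) for _, s, e in kept]

# With toy logits, only the single best non-overlapping span survives:
spans = decode_spans([0.1, 2.0, 0.1], [0.1, 0.1, 2.0], 1.0, 1.0)
# → [(1, 2)]
```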
Generative QA Under the generative QA setting, we split the event extraction task into two stages. In the first, event triggers and main arguments are extracted simultaneously. In the second, sub-arguments are extracted. We fine-tune the SciFive (PMC) model (Phan et al., 2021), a T5 model pretrained on PubMed. An example of the input/output in the QA pipeline is shown in Figure 3(c).
For the first stage, the question is simply 'What are the events?'. Each sentence is paired with the question in the form 'question: <question> context: <sentence>', where 'question:' and 'context:' are fixed prompts, and <question> and <sentence> are placeholders for a pre-defined question and an input sentence, respectively. The gold-standard answer is constructed using the template '[<event type>] <trigger> [<main argument type>] <main argument content> ...', where <•> is a placeholder to be replaced by the relevant content. For each event, the trigger comes first, followed by the main arguments in the order subject, treatment, and effect. Multiple events are flattened into a single sequence. The QA model then generates answers, from which we obtain the event type, event trigger, and main argument spans via pattern matching. For the second stage, we use the questions defined for sub-arguments in extractive QA. The model input and gold-standard answers are formulated in the same way as in the first stage.
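The answer template and the pattern matching can be sketched with a toy round-trip (hypothetical helper names and regex; the released code may linearize and parse differently):

```python
import re

def linearize(events):
    """Build the '[<event type>] <trigger> [<role>] <content> ...' answer."""
    parts = []
    for ev in events:
        parts.append(f"[{ev['type']}] {ev['trigger']}")
        for role in ("subject", "treatment", "effect"):  # fixed order
            if role in ev:
                parts.append(f"[{role}] {ev[role]}")
    return " ".join(parts)

def parse(answer):
    """Recover events by pairing each bracketed tag with its trailing text."""
    chunks = re.findall(r"\[([^\]]+)\]\s*([^\[]*)", answer)
    events = []
    for tag, text in chunks:
        if tag in ("ADE", "PTE"):          # a new event starts
            events.append({"type": tag, "trigger": text.strip()})
        elif events:                       # argument of the current event
            events[-1][tag] = text.strip()
    return events

ev = {"type": "ADE", "trigger": "developed",
      "subject": "A 52-year-old woman",
      "treatment": "phenytoin", "effect": "hemiparesis"}
assert parse(linearize([ev])) == [ev]  # round-trips
```

Because multiple events are flattened into one sequence, the parser treats each '[ADE]' or '[PTE]' tag as the start of a new event, so multi-event answers decode naturally.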

Evaluation Metrics
We evaluate model performance on event trigger extraction and argument extraction separately. Punctuation and articles are ignored during evaluation.
Event trigger extraction Following Lin et al. (2020), we use the F1 metric for the evaluation of event trigger identification and event trigger classification. Specifically, trigger identification (Trig-I) evaluates how well the trigger words match their corresponding references; trigger classification (Trig-C) additionally evaluates the event types. As event trigger words can be ambiguous even for humans, and detecting the presence of an ADE or PTE event is arguably more important, we further compute the event classification (Event-C) F1 score, which evaluates whether the event type of a trigger matches its reference.
Argument extraction Argument evaluation is also conducted from both identification and classification perspectives. Specifically, an argument span is correctly identified if its event type and offsets match a gold-standard span, and it is correctly classified if the argument type also matches. Considering that argument spans can be long and exact match (i.e., span-level) evaluation might be too strict, we additionally report token-level evaluation results. Specifically, EM_F1 measures the percentage of predicted spans that exactly match the ground-truth spans, and Token_F1 measures the average token overlap between predictions and references. As there might be multiple spans for each argument, we compute both metrics by micro-averaging. That is, we accumulate the number of matched spans (or tokens) across the corpus as the True Positive (TP) count, and compute precision, recall, and F1 accordingly.
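Under our reading of this description, the micro-averaged metrics can be sketched as follows (spans given as sets of (start, end) token offsets per instance; a minimal sketch, not the official scorer):

```python
def em_f1(pred_spans, gold_spans):
    """Span-level micro F1: a span counts as TP only on an exact match."""
    tp = sum(len(p & g) for p, g in zip(pred_spans, gold_spans))
    n_pred = sum(len(p) for p in pred_spans)
    n_gold = sum(len(g) for g in gold_spans)
    prec = tp / n_pred if n_pred else 0.0
    rec = tp / n_gold if n_gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def token_f1(pred_spans, gold_spans):
    """Token-level micro F1: credit for every overlapping token."""
    def tokens(spans):
        return {(i, t) for i, inst in enumerate(spans)
                for s, e in inst for t in range(s, e)}
    p, g = tokens(pred_spans), tokens(gold_spans)
    tp = len(p & g)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

For example, predicting only one of two gold spans yields EM_F1 of 2/3 but a higher Token_F1 when the missed span is short, which is why the token-level score is the more forgiving of the two.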

Results and Analysis
We compare three families of baselines on the PHEE dataset. For the extractive QA and generative QA approaches, we explored several question templates and report only the results of the templates that performed best on the development set. A more extensive analysis of different template formats is discussed in Appendix E.

Evaluation: Trigger Extraction

[...] the trigger could be used for training and evaluation. Nevertheless, the comparison is still relevant, since the trigger's only linguistic function is to signal the occurrence of an event, and it carries limited semantic content. Instead, the generative QA model obtained its best comparative performance when classifying the event type(s) of the whole sentence, independently of the particular trigger extracted.

Evaluation: Argument Extraction
We present the main argument and sub-argument extraction results in Table 5. Generative QA achieves the best results in both main argument and sub-argument extraction. Extractive QA performs better than sequence labelling for main argument extraction, but worse for sub-argument extraction. We sampled and analysed error cases of the three approaches, and present some of them in Table A5.
In particular, for main argument extraction, we observe that a common error of extractive QA is the failure to detect an event trigger at an early stage, causing it to skip extracting main arguments in the subsequent stages. Generative QA performs better, probably because it extracts the trigger and main arguments simultaneously, thus avoiding this error propagation. For sequence labelling, the most prominent problem is the incompleteness of the extracted main arguments, especially for the subject argument. One possible reason is that the main argument and sub-argument labels are flattened into one sequence, which loses information about the relations between a main argument and its sub-arguments, hurting extraction performance. For sub-argument extraction, the performance of extractive QA drops to the worst of the three, probably due to further error propagation from the previous two stages. Of the other two methods, sequence labelling seems to be more severely affected by trigger extraction errors: in some cases, the argument spans are matched, but no trigger is detected in the sentence, leading to a failure. The generative QA method's performance at this stage is relatively less influenced by the main argument extraction results than the other two approaches, but we notice that it can easily fail to extract less frequent sub-arguments. One possible downside of generative QA models for information extraction is that they may generate tokens not present in the original input sentence, but in our sampled cases such errors are very rare.

Evaluation for Each Argument Type
In Table 6, we present the results for each argument type. Firstly, among all main argument types, the effect seems to be the easiest to extract, probably due to its abundant occurrences and relatively distinct features compared to other argument types. Although the treatment also occurs frequently in the corpus, models perform much worse on treatment extraction. The main reason is that the length of treatment spans varies, and the information they carry can be more complex, leading to fragmented extraction results. The subject, while appearing less frequently than treatment and effect, has relatively simpler linguistic patterns; as such, its extraction results are better than those for treatment when using QA models.
For the sub-arguments, highly frequent arguments with simpler linguistic patterns, such as age, gender, and drug, obtain promising results. Some arguments with relatively limited expressions, such as race and frequency, although very rare in our dataset, still achieve fair or moderate extraction results. Sub-arguments such as subject.disorder and treatment.disorder can be confusing even for human annotators, and obtain relatively low extraction performance; performance on subject.disorder is even poorer due to its lower frequency in the dataset. Another pair of easily confused arguments is time_elapsed and duration, both of which contain temporal expressions. Combined with their low occurrence frequencies, these two arguments also obtain quite low extraction results.

Challenges and Future Directions
Our analysis of the experimental results suggests the following open challenges for the extraction of pharmacovigilance events. Firstly, the models perform poorly on arguments with similar entity mentions but different argument roles. For example, a disease mentioned in text could be annotated as treatment.disorder if it is the target of the treatment, or as subject.disorder if it refers to the subject's disease but is not targeted by the treatment. A similar problem can be observed for arguments with temporal expressions, such as time_elapsed and frequency. The poor performance on such arguments seems to indicate that existing models are not able to perform deep semantic analysis. Additional constraints encoding linguistic relations between entity mentions and main argument types could be explored to guide the event extraction model, for example through posterior regularisation.
Secondly, the models' performance deteriorates drastically on argument types with limited annotated training instances. One path forward is therefore to explore efficient few-shot learning strategies to improve models' generalisability on rare argument types. There may also exist corpora annotated with similar argument roles but for different purposes, for example corpora for medication extraction (Jagannatha et al., 2019), where drug dosage and frequency are annotated. It is also possible to leverage external drug or disease knowledge through knowledge distillation.
Finally, none of the existing models cope well with the presence of multiple events in a sentence. This is mainly because existing annotations rely heavily on event triggers to differentiate events and require explicit links between arguments and their respective event triggers. However, trigger identification is itself ambiguous and difficult, even for human annotators, and in some cases multiple events share the same trigger. For pipeline-based models, i.e., the QA models in our work, detection of multiple triggers is error-prone, making subsequent argument extraction hard due to error propagation. For the sequence labelling model, it is difficult to flatten the annotations of multiple events into a single label sequence. We thus duplicate multi-event cases during training and provide only a single event annotation for each case at a time. However, this makes it impossible to obtain full extraction results for multiple events at inference time. In the future, rather than sequence labelling or QA-based extraction approaches, it is worth exploring graph-based approaches to multi-event extraction, in which entity mentions are nodes in a graph and event extraction is framed as soft clustering of entity mentions.

Conclusion
In this paper, we present the development of a novel corpus, PHEE, composed of sentences from medical case reports annotated with pharmacovigilance-related events. Events in PHEE are hierarchically annotated with coarse- and fine-grained information about patient demographics, treatments, and (side) effects. We use it to evaluate state-of-the-art NLP models for pharmacovigilance event extraction. Experimental results show that current models can capture reasonable information in common cases but face challenges in complex situations, such as distinguishing semantically similar arguments, dealing with low-resource settings, and extracting multiple events from text.

Limitations
This study has several limitations. First, despite the implementation of a quality control process, the collected annotations inevitably have some quality issues. For example, to reduce cognitive load, we split the annotation process into two stages and required annotators working on the second stage to check and correct first-stage annotations. However, we noticed that many annotators were inclined to keep the previous annotations as they were unless errors could be easily identified. This may lead to inflated IAA results.
Also, although trained for the task, the annotators' lack of medical background may have some impact on the quality of the dataset. Second, our dataset only contains two event types, Adverse Drug Event (ADE) and Potential Therapeutic Effect (PTE). It is worth considering adding sentences with a null event type, that is, sentences not associated with an ADE or PTE. Furthermore, only one base pretrained language model was chosen for each baseline in our experiments, and more encoding methods are worth exploring in the future. Finally, although we have provided annotations of event attributes such as speculation, negation, and severity, we have not implemented baseline models for event attribute detection, partly due to the few annotated cases. In the future, we will explore semi-supervised learning approaches for event attribute detection.

E Experimental Results using Different Question Templates
We present the experimental results on the development set with different question templates below. Table A2 and Table A3 show the results of the extractive QA model when using different templates for trigger extraction and main argument extraction, respectively. Table A4 shows the sub-argument extraction results of the extractive QA and generative QA methods. Overall, we observe that using different templates does not have a large impact on the results, probably because our dataset involves few event types and relatively fixed argument types. The templates that achieve the best results vary across metrics: the query template verb obtains the best trigger and event type detection performance, while the full-sentence question What is the trigger in the event? achieves modestly better trigger identification performance.
For main argument extraction, the template that includes a query of the argument and the event trigger achieves the best scores. For sub-argument extraction, the best-performing templates also differ slightly depending on the model. For the extractive QA model, a brief question including the event type, the main argument type and its extracted span, and the queried sub-argument type obtains the best exact-match result, while providing the same information but phrasing the argument-type query as a complete sentence achieves the best token-level score. For the generative QA model, the argument-type-specific query with all information about the event type and the main argument performs best under both span-level and token-level evaluation.
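The template comparison above can be made concrete with a small sketch of how such question templates are instantiated per argument type. The template wordings and placeholder names below are illustrative assumptions, not the exact strings from the released code:

```python
# Sketch: instantiating QA question templates for argument extraction.
# Placeholder names (<arg>, <event>, <trigger>) and template wordings
# are illustrative assumptions, not the paper's exact templates.

def build_question(template: str, arg_type: str,
                   event_type: str = "", trigger: str = "") -> str:
    """Fill a question template with a given argument/event/trigger."""
    return (template
            .replace("<arg>", arg_type)
            .replace("<event>", event_type)
            .replace("<trigger>", trigger))

# Three illustrative template styles, from terse keyword to full sentence:
TEMPLATES = {
    "keyword": "<arg>",
    "with_trigger": "<arg> in the event triggered by <trigger>?",
    "full_sentence": "What is the <arg> of the <event> event "
                     "triggered by <trigger>?",
}

question = build_question(TEMPLATES["full_sentence"],
                          arg_type="subject",
                          event_type="adverse",
                          trigger="developed")
# question == "What is the subject of the adverse event triggered by developed?"
```

Each instantiated question is then paired with the input sentence and fed to the QA model, so the comparison across templates only changes this question string while the rest of the pipeline stays fixed.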


Figure 1: Comparison of annotations from (a) the ADE corpus, (b) the PHAEDRA corpus and (c) our developed PHEE corpus.

Figure 3: Illustrations of the three baseline methods with the example: "A 52-year-old Black woman on phenytoin therapy for post-traumatic epilepsy developed transient hemiparesis contralateral to the injury." The diagram shows the extraction of the subject for main argument extraction and the age of the subject for sub-argument extraction.

Table 2: Dataset statistics on the train/dev/test sets.
base model on our dataset. In particular, the input is a sentence paired with a question, formatted as '[CLS] <question> [SEP] <sentence>', where '[CLS]' and '[SEP]' are the standard BERT special tokens.
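The input layout above can be sketched as follows. A real run would use a BERT tokenizer (e.g. `BertTokenizerFast` from the transformers library, with `text` and `text_pair` arguments), which inserts these special tokens automatically; here we only assemble the string to make the layout explicit:

```python
# Minimal sketch of the '[CLS] <question> [SEP] <sentence>' input format.
# In practice a BERT-style tokenizer builds this pair encoding itself;
# this helper just makes the token layout visible.

def format_qa_input(question: str, sentence: str) -> str:
    # [CLS] marks the start of the sequence; [SEP] separates the
    # question segment from the sentence segment.
    return f"[CLS] {question} [SEP] {sentence}"

example = format_qa_input(
    "What is the subject in the event?",
    "A 52-year-old Black woman developed transient hemiparesis.")
```

The extractive QA model then predicts start and end positions over the sentence segment of this paired input.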

Table 4: Results for trigger extraction.
Table 4 reports the performance of the three baselines on trigger extraction. Extractive QA achieves the best result on both trigger identification and classification. However, it is worth mentioning that, due to the EEQA design, only the first token of a multi-token trigger is predicted.

Table 5: Results for argument extraction.

Table 6: Classification results for each argument type. The best results for each argument type are highlighted in bold.

Table A1: (Du and Cardie, 2020) attributes.
Experiments − Extractive QA. For the extractive QA experiments, we fine-tune the EEQA (Du and Cardie, 2020) model on our data 5 . We use BioBERT (Base Cased, 110M parameters; Kenton and Toutanova 2019) as the base model. The SGD algorithm is used as the optimizer, and the learning rate is set to 1 × 10 −5 , 5 × 10 −5 and 5 × 10 −5 for trigger extraction, main argument extraction and sub-argument extraction, respectively. We use a batch size of 32 for trigger extraction and 16 for argument extraction. We set the maximum number of training epochs to 10. Experiments are conducted on an NVIDIA TITAN RTX GPU. Training times for trigger extraction, main argument extraction and sub-argument extraction are about 0.5, 1 and 4 hours, respectively. The training time varies according to the number of argument (or trigger) types that need to be queried for each instance.
Experiments − Generative QA. For the generative QA experiments, we use the Huggingface example code for question answering 6 . We fine-tune the SciFive (PMC Base, 220M parameters; Phan et al. 2021) model, a T5 model pre-trained on a large-scale PubMed corpus, on our dataset. The training batch size is 16. The learning rate is 5 × 10 −4 for main argument extraction and 5 × 10 −5 for sub-argument extraction. We train the model for at most 20 epochs with early-stopping patience of 2 epochs. We use beam search for decoding with a beam size of 3. We use an NVIDIA TITAN RTX GPU for model training. Training the generative QA model takes about one hour (or less) for (trigger and) main argument extraction and about four hours for sub-argument extraction.
4 https://github.com/Alibaba-NLP/ACE
5 https://github.com/xinyadu/eeqa
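For reference, the hyperparameters listed above can be collected into a single configuration sketch. The values are taken from the text; the field names and the Hub checkpoint identifiers are our own assumptions, not from the released code:

```python
# Hyperparameter sketch for the two baselines, assembled from the
# appendix text. Field names and checkpoint ids are assumptions.

EXTRACTIVE_QA_CONFIG = {
    "base_model": "dmis-lab/biobert-base-cased-v1.1",  # assumed checkpoint id
    "optimizer": "SGD",
    "learning_rate": {"trigger": 1e-5, "main_arg": 5e-5, "sub_arg": 5e-5},
    "batch_size": {"trigger": 32, "main_arg": 16, "sub_arg": 16},
    "max_epochs": 10,
}

GENERATIVE_QA_CONFIG = {
    "base_model": "SciFive (PMC Base)",
    "learning_rate": {"main_arg": 5e-4, "sub_arg": 5e-5},
    "batch_size": 16,
    "max_epochs": 20,
    "early_stopping_patience": 2,   # epochs without dev improvement
    "num_beams": 3,                 # beam search width at decoding time
}
```

Keeping the settings in one place like this makes it easier to reproduce the per-subtask learning-rate and batch-size differences described in the text.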
Table A5 lists some example error cases as complementary material for the discussion in Section 4.3. We present one example for each argument type in the table.