AniEE: A Dataset of Animal Experimental Literature for Event Extraction



Introduction
Scientific literature can provide a groundwork for further investigations and offer a valuable platform for turning academic achievements into tangible industrial outcomes. Especially in the biomedical domain, the volume of literature grows exponentially (Frisoni et al., 2021), posing challenges in locating key experimental facts within unstructured text. To help researchers discover related literature that fulfills their needs, information extraction (IE) is actively utilized. Across diverse IE tasks, event extraction (EE) has been spotlighted for extracting the complex syntactic structures of biological events.

Figure 1: Each example for three event types from AniEE: SampleAdministration, PositiveRegulation, and NegativeRegulation. (Panel text includes "BNN27 (2, 10, and 50 mg/kg) i.p. for 7 days", "BNN27 increased the anti-inflammatory (IL-10 and IL-4)", and "BNN27 reversed the glial activation".)

Each event consists of an event trigger and event arguments, as well as the relations between them. Figure 1 (a) illustrates an example of three SampleAdministration events, which includes an event trigger "i.p." and Object, Subject, Schedule, and three Amount arguments. Additionally, Figure 1 (b) and (c) describe examples of PositiveRegulation and NegativeRegulation, where "BNN27" is the Cause of the events.

Among the various kinds of biomedical literature, animal experiment articles are some of the most difficult texts to extract valuable information from. Experimental phases can generally be divided into three stages: cell experiments, animal experiments, and clinical trials. Compared to cell experiments, animal research requires further due diligence because it must consider ethical guidelines and has extensive resource requirements. More importantly, before moving on to clinical trials with human participants, animal research serves as a significant step to evaluate safety and efficacy. Therefore, thorough investigations of previous research are essential to design animal experiments, verifying information such as species, dosage, and duration, and the relations between them, as shown in Figure 1.
Despite the importance of experimental information, EE studies targeting animal experiment literature have rarely been conducted. One reason is that, as described in Table 1, existing EE datasets contain literature that is either limited to the cell experiment stage (Kim et al., 2011; Pyysalo et al., 2013; Ohta et al., 2013) or does not specify a concrete experimental stage (Pyysalo et al., 2012). As a result, entity and event schemes that do not align with the dataset scope make it difficult to identify the specific event triggers and associated arguments that are prevalent in animal experiments.
Therefore, we introduce AniEE, a named entity recognition (NER) and event extraction dataset focused on animal experiments. We establish a new entity and event scheme designed for the animal experiment stage in collaboration with domain experts. Our dataset is annotated by biomedical professionals, and we apply the latest competitive NER and EE models to it. As described in Table 1, the novelty of our dataset lies in two aspects: it addresses both 1) discontinuous entities and 2) nested events, whereas existing benchmark datasets do not include both. We therefore anticipate that our dataset will contribute to the advance of all relevant NER and EE sub-tasks. (In Table 1, the numbers of events, relations, and entities for GE11, CG, and PC are the sums of the train and validation sets, because their test sets are not publicly available. Also, if the number of trigger types is not equal to the number of event types in a dataset, we count the trigger type as the event type.)
We sum up our contributions as follows:
• We introduce AniEE, a new high-quality dataset consisting of 350 animal experimental articles annotated by domain experts.
• We define a novel entity and event scheme tailored to animal experiments.
• We evaluate recent NER and EE approaches on our dataset, which contains both discontinuous entities and nested events.
2 Related Work

Complex Information Extraction
Traditional NER and EE formulate their tasks as a sequence labeling problem (Lample et al., 2016; Liu et al., 2018; Lin et al., 2019; Cao et al., 2019), assigning a tag to each token from a pre-defined tagging scheme (e.g., the BIO tagging scheme). When faced with challenging syntactic scenarios such as nested or discontinuous text structures, the tagging scheme lacks the flexibility to address such complexities adequately. Consequently, alternative approaches have been explored for each task.
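The limitation can be sketched concretely. In the toy decoder below (illustrative, not taken from any of the cited baselines), BIO tags can only ever yield contiguous spans, so a discontinuous Dosage mention such as "2 ... mg/kg" has no faithful encoding:

```python
# Minimal sketch of why BIO tagging cannot represent discontinuous
# entities: token-level BIO tags only describe contiguous spans, so the
# gold Dosage mention "2 ... mg/kg" (skipping "10, and 50") is lost.
# The sentence and tags are illustrative only.

def bio_decode(tokens, tags):
    """Decode BIO tags into (label, token_span) entities.

    Only contiguous spans can ever be produced, regardless of input.
    """
    entities, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                entities.append((label, tuple(tokens[start:i])))
                start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities

tokens = ["BNN27", "(", "2", ",", "10", ",", "and", "50", "mg/kg", ")"]
# Best effort: tag both fragments of the discontinuous Dosage "2 mg/kg".
tags = ["O", "O", "B-Dosage", "O", "O", "O", "O", "O", "B-Dosage", "O"]

decoded = bio_decode(tokens, tags)
# The two fragments decode as two separate entities, not one
# discontinuous mention: the link between "2" and "mg/kg" is lost.
print(decoded)  # [('Dosage', ('2',)), ('Dosage', ('mg/kg',))]
```

This is precisely the gap that span-based and grid-based unified models address.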
Due to this syntactic complexity, research has focused on either nested or discontinuous entities. Recently, unified NER models (Li et al., 2020b, 2021, 2022) have been proposed to jointly extract these entity types.
On the other hand, to address advanced EE subtasks, pipeline-based methods (Li et al., 2020a; Sheng et al., 2021) have been introduced that sequentially extract event triggers and arguments. However, due to the error propagation inherent in such sequential processes, OneEE (Cao et al., 2022) proposes a one-step framework that simultaneously predicts the relationships between triggers and arguments.

Clinical and Biomedical Datasets
Among the various domains where information extraction research is conducted, the clinical and biomedical domains are highly active fields with numerous training datasets. In the clinical domain, ShARe13 (Danielle et al., 2013) and CADEC (Sarvnaz et al., 2015) are clinical report datasets that include discontinuous drug event and disease entities. In the biomedical domain, several datasets (Pyysalo et al., 2009; Ohta et al., 2010) are derived from the GENIA corpus (Ohta et al., 2002; Kim et al., 2003), including JNLPBA (Kim et al., 2004). MLEE (Pyysalo et al., 2012) extends throughout all levels of biological organization, from the molecular to the organism level, for both NER and EE tasks. In short, existing EE datasets consist of literature that is either restricted to cell experiments or generalized to cover all experimental stages. Hence, we introduce a novel dataset, AniEE, which aims to extract key animal experimental information.

Dataset Collection
The raw corpus was collected from PubMed, a widely used online database containing a vast collection of biomedical literature. We collaborated with two senior domain experts to define the search terms, with the aim of crawling a diverse set of animal experimental literature. Medical Subject Headings (MeSH), which serve as a hierarchical search index for the PubMed database, were used to determine the search terms.

Table 3: Definition of event types and their corresponding argument roles (e.g., NegativeRegulation, with 3,371 occurrences and a 32.0% ratio: suppression or inhibition of a biological process or system in animals, resulting in reduced activity, expression, or response of a specific target). Ratios are presented rounded to the second decimal place.

The resulting query was ([MeSH Terms] Animals) AND ([MeSH Subheading] Physiopathology OR [MeSH Subheading] Chemistry) AND (NOT Review). We collected the titles and abstracts of the literature from the top search results. We then removed articles without direct involvement of animal experiments, resulting in a total of 350 articles.

Entity and Event Type Scheme
AniEE contains 12 entity types and 3 event types. Table 2 and Table 3 describe the entity types and event types in our dataset, respectively. The event arguments are detailed in Appendix Table 13.
Entity Types Existing benchmark datasets (Ohta et al., 2002; Pyysalo et al., 2012) have typically focused on anatomical information, encompassing various entity types, especially at the cellular level. Given our primary focus on animal experiments, we consolidated these various types into a single unified category named Anatomy. On the other hand, the animal types are more fine-grained: AnimalSubject as a general identifier (e.g., "mice") and AnimalStrain as a specific identifier (e.g., "C57BL/6J"). We also introduce numerical entities that are key attributes of animal experiment design, such as Dosage, Duration, and DosageFrequency.

Sample Administration We annotate SampleAdministration on the text spans representing a specific approach to administering a sample. The trigger can be a verb, as is typical of event triggers, but it can also be another part of speech, such as a specialized abbreviation (e.g., "i.c.v") or an adverb (e.g., "orally"). When the literature explicitly describes a method of administering a sample, we prioritize annotating specific expressions over generic verbs (e.g., "administered" and "treated"), as illustrated in Figure 1 (a). The event has five argument roles, including two novel argument roles that express the relations between the event trigger and arguments linked to experimental conditions (i.e., Dosage and Duration).

Positive and Negative Regulation
We annotate PositiveRegulation and NegativeRegulation on the text spans that induce or interfere with biological processes. PositiveRegulation, such as an increase in cancer cells, does not necessarily mean an ultimately positive effect, and the same applies to NegativeRegulation. A unique characteristic of our dataset is that it contains nested events. Figure 1 (c) describes an example of nested events where the event PositiveRegulation ("activation") is the Object argument of another event, NegativeRegulation ("reversed").

Annotation Procedure
To maintain annotation consistency between annotators, our annotation process consisted of two stages: pilot annotation and expert annotation. All annotators were instructed with detailed annotation guidelines.
Pilot Annotation Prior to the actual annotation process, a pilot annotation was conducted to train the annotators, given their varying levels of domain knowledge. We applied our newly defined scheme to 10 articles related to the animal experiment stage, extracted from the MLEE corpus (Pyysalo et al., 2012), a publicly available biomedical event extraction dataset. During this stage, we developed the annotation guidelines, and the annotators became familiar with the task and with tagtog, a web-based annotation tool with a user-friendly interface.
Expert Annotation Six domain experts, all master's or Ph.D. candidates in the biomedical field, were recruited as annotators. Two annotators independently annotated each piece of literature.

Annotation for Complex Scenarios
Traditional annotation methods focus on continuous text spans, making it difficult to annotate certain entities and events due to the complex semantic nature of animal experiment literature. To address this issue, we developed specialized annotation strategies for two scenarios: 1) occurrences of discontinuous entities and 2) instances where a single event trigger indicates several events.

Discontinuous Entity
As shown in the Dosage case in Figure 1 (a), for numerical entity types, much of the literature lists several numbers and then mentions their unit only once at the end, resulting in discontinuous entities. To minimize annotator errors, these entities were subdivided into numeric (e.g., "8") and unit (e.g., "mg/kg") entities during the annotation process, with a special relation type used only for mapping a number to its unit. We then post-process to merge them into a single entity (e.g., "8 mg/kg"). For Dosage, because daily dosage units can also be described, we temporarily segmented the unit entity into two unit sub-entities, DosageUnit (e.g., "mg/kg") and DosageDayUnit (e.g., "/day"), which were later combined into one (e.g., "8 mg/kg /day").
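The merging step above can be sketched as follows. This is a simplified illustration of the described post-processing, not the authors' released code; the function name and argument layout are assumptions:

```python
# Sketch of the post-processing described above: numeric fragments are
# linked to a shared unit entity via a mapping relation during
# annotation, and each (number, unit) pair is merged back into a single
# Dosage entity, optionally appending a day unit such as "/day".

def merge_dosage_entities(numbers, unit, day_unit=None):
    """Merge numeric fragments with their shared unit entity.

    numbers  -- numeric mention strings, e.g. ["2", "10", "50"]
    unit     -- the shared DosageUnit mention, e.g. "mg/kg"
    day_unit -- optional DosageDayUnit mention, e.g. "/day"
    """
    suffix = unit if day_unit is None else f"{unit} {day_unit}"
    return [f"{n} {suffix}" for n in numbers]

# "BNN27 (2, 10, and 50 mg/kg)" yields three Dosage entities:
print(merge_dosage_entities(["2", "10", "50"], "mg/kg"))
# ['2 mg/kg', '10 mg/kg', '50 mg/kg']
print(merge_dosage_entities(["8"], "mg/kg", "/day"))
# ['8 mg/kg /day']
```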
Multiple Events on a Single Event Trigger Given the example "ginsenoside Rb1 (35 mg/kg) and losartan (4.5 mg/kg) i.c.v", the event trigger ("i.c.v") has two samples with a corresponding dosage for each. Since an event trigger corresponds to only a single instance of an Object argument, the example represents one event per sample, for a total of two events. Prior research has found such scenarios challenging to extract due to their inherent semantic intricacy, which has consequently been acknowledged as a limitation (Friedrich et al., 2020). To extract these events accurately, we introduce a supplementary relation type that links each dosage to its respective sample (e.g., "losartan" and "4.5 mg/kg") and instruct the annotators accordingly. After the annotation process, post-processing produces a distinct event for each sample.
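The expansion step can be sketched as below. The field names and data layout are illustrative assumptions, not the dataset's actual schema:

```python
# Hedged sketch of the post-processing for a single trigger carrying
# several samples: each (sample, dosage) pair connected by the
# supplementary relation is expanded into its own event.

def expand_events(trigger, samples, dosage_links):
    """Produce one event per Object (sample) on a shared trigger.

    dosage_links maps a sample mention to its linked dosage, following
    the supplementary sample-dosage relation described above.
    """
    return [
        {"trigger": trigger, "Object": s, "Dosage": dosage_links.get(s)}
        for s in samples
    ]

events = expand_events(
    "i.c.v",
    ["ginsenoside Rb1", "losartan"],
    {"ginsenoside Rb1": "35 mg/kg", "losartan": "4.5 mg/kg"},
)
print(len(events))  # 2 distinct SampleAdministration events
print(events[1])    # {'trigger': 'i.c.v', 'Object': 'losartan', 'Dosage': '4.5 mg/kg'}
```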

Dataset Statistics and Characteristics
The AniEE corpus contains a total of 350 animal experimental articles. We split our dataset into training, validation, and test sets at a 4:1:1 ratio. Table 4 presents their statistics.
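The exact split procedure is not described in the text; a seeded document-level shuffle along the following lines would reproduce the 4:1:1 ratio (the seed and rounding rule are assumptions):

```python
# Sketch of a 4:1:1 document-level split of 350 articles (roughly
# 233/58/59), assuming a fixed-seed shuffle for reproducibility.
import random

def split_411(items, seed=0):
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = round(n * 4 / 6)   # 4 parts of 6
    n_valid = round(n * 1 / 6)   # 1 part of 6
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])

train, valid, test = split_411(range(350))
print(len(train), len(valid), len(test))  # 233 58 59
```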

4 Experiments

4.1 Settings
To examine the effectiveness and challenges of the AniEE corpus, we conduct experiments with recent, competitive baseline models for the NER and EE tasks.

NER Models
We evaluate our dataset on unified NER models, which allow us to extract discontinuous and flat entities. W2NER (Li et al., 2022) is a unified NER framework that models a 2D grid matrix of word pairs. SpanNER (Li et al., 2021) proposes a span-based model that extracts entity fragments and the relations between fragments to jointly recognize both discontinuous and overlapped entities.

Evaluation Metric Following previous work in the NER and EE tasks (Sheng et al., 2021; Cao et al., 2022), we report Precision (P), Recall (R), and the F1 measure. We measure four evaluation metrics for the EE task: 1) Trigger Identification: an event trigger is correctly identified if the predicted trigger matches a ground truth; 2) Trigger Classification: an event trigger is correctly classified if the identified trigger is assigned the right event type; 3) Argument Identification: an event argument is correctly identified if the predicted argument aligns with a ground truth; 4) Argument Classification: an event argument is correctly classified if the identified argument is assigned the right role. In short, identification is the task of finding the region of a trigger or entity, and classification is predicting the class of each object.
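The identification/classification distinction can be made concrete with a minimal span-matching sketch, assuming gold and predicted triggers are (start, end, type) tuples (the tuple layout is an assumption for illustration):

```python
# Minimal sketch of the span-matching evaluation. Identification
# ignores the event type; classification requires it to match as well.

def precision_recall_f1(gold, pred):
    tp = len(set(gold) & set(pred))
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [(0, 2, "PositiveRegulation"), (5, 6, "SampleAdministration")]
pred = [(0, 2, "NegativeRegulation"), (5, 6, "SampleAdministration")]

# Identification: drop the type before matching.
ident = precision_recall_f1([g[:2] for g in gold], [p[:2] for p in pred])
# Classification: require the exact (span, type) match.
clas = precision_recall_f1(gold, pred)

print(ident)  # (1.0, 1.0, 1.0) -- both spans found
print(clas)   # (0.5, 0.5, 0.5) -- one span mistyped
```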

Implementation Detail
We use BioBERT (Lee et al., 2020) as the pretrained language model and adopt the AdamW optimizer (Ilya and Frank, 2019) with a learning rate of 2e-5. The batch size is 8, and the hidden size d_h is 768. We train all baselines for 200 epochs.

Named Entity Recognition
Table 6 shows the precision, recall, and micro-average F1 scores of the NER baselines. W2NER slightly outperforms SpanNER in F1 score. Table 7 presents the precision, recall, and F1 scores of W2NER for each entity type. We assume that the long-tail frequency distribution and the number of unique expressions within each entity type affect performance. High-frequency (head) entity types, such as SampleName and DiseaseName, with frequencies of 4,566 and 1,620, show higher performance than tail entity types, such as SampleType and DosageFrequency, with frequencies of 114 and 52, respectively. Also, the F1 scores of AnimalSubject and AnimalSex are higher than those of other entity types. This is because the concepts of animal subject and sex are less specific to animal experiments than the other entity types, making it easier to leverage the knowledge gained from the pretrained language model.

Event Extraction
Table 8 shows the EE results for the four evaluation metrics. CasEE consistently outperforms OneEE, except for the precision scores on the argument identification and classification metrics. Overall, CasEE has higher recall than precision across all metrics, which suggests that the model produces more false positive predictions than false negatives. On the other hand, OneEE has a large gap between precision and recall across all metrics, which implies that the model generates many false negatives and easily misses gold triggers and entities.

Extraction of the Event Types Table 9 shows the precision, recall, and F1 score for each event type, obtained when the model performs trigger identification and classification jointly. The low accuracy on SampleAdministration, whose frequency ratio is 12.9%, can be explained by its low frequency compared with the other event types. However, although PositiveRegulation (55.1%) appears more frequently than NegativeRegulation (32.0%), the model predicts NegativeRegulation more accurately than PositiveRegulation. We analyze this observation further in Section 5.1.

Extraction of the Event Arguments Table 10 describes the extraction performance of the event arguments for each event type, obtained when the model predicts argument identification and classification together. CasEE (Sheng et al., 2021) shows the highest F1 score on Amount in SampleAdministration. This can be explained by the fact that mentions of Amount usually consist of a numeric and a specific unit component, making it easier for the model to detect this consistent pattern in unstructured documents.

In addition, the F1 scores of Site in PositiveRegulation and NegativeRegulation are significantly lower than in SampleAdministration. Site in SampleAdministration typically refers to the location of administration and has low variance. In contrast, Site in regulation events refers to where the effect occurs and therefore covers a wider range of locations. Consequently, Site in PositiveRegulation and NegativeRegulation is more difficult to detect than in SampleAdministration because of the variety of places where effects can occur.

Recognition of the Event Trigger
In the EE task, the model needs to detect the span of the event trigger. We cast this as an NER task that detects the span of the event trigger mention. As described in Table 11, W2NER (Li et al., 2022) shows progressively higher scores for NegativeRegulation, PositiveRegulation, and SampleAdministration. This resembles the event trigger classification performance in Table 9, where the ordering of scores is the same. Thus, we attribute the low NER performance on PositiveRegulation to the imbalanced distribution of its trigger mentions.

Distribution of the Event Mentions
In the event trigger classification task, we expected the model to predict PositiveRegulation more accurately than NegativeRegulation, because the frequency of PositiveRegulation is 1.7 times that of NegativeRegulation. However, the results in Table 9 contradict this expectation.
To analyze this, we collected the event trigger mentions for each event type and extracted their lemmas to group the mentions semantically. To calculate the frequency ratio of a lemma cluster relative to the total mention frequency, we summed the frequencies of all mentions belonging to the cluster. As shown in Figure 2, in PositiveRegulation, the top-1 lemma cluster, with the lemma "induced", accounts for 20% of the overall frequency, while the second cluster ("increased") accounts for 5.8%. This distinguishes it from the other event types, SampleAdministration and NegativeRegulation, where the frequency percentage decreases gradually with each subsequent lemma cluster. Therefore, the low performance on PositiveRegulation in Table 9 can be explained by the imbalanced distribution of its trigger mentions.
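The frequency-ratio computation can be sketched as follows. The lemma map and mention counts below are toy illustrations, not the dataset's actual statistics, and a real lemmatizer (e.g., from an NLP library) would replace the hand-built dictionary:

```python
# Sketch of the lemma-cluster frequency ratio: trigger mentions are
# grouped by lemma, and each cluster's share of the total mention
# count is reported in descending order.
from collections import Counter

def cluster_ratios(mentions, lemma_of):
    """Return {lemma: share of all mentions}, most frequent first."""
    counts = Counter(lemma_of.get(m, m) for m in mentions)
    total = sum(counts.values())
    return {lem: n / total for lem, n in counts.most_common()}

# Toy lemma map standing in for a real lemmatizer.
lemma_of = {"induced": "induce", "induces": "induce",
            "induction": "induce", "increased": "increase"}
mentions = ["induced", "induces", "induction", "increased", "promoted"]

ratios = cluster_ratios(mentions, lemma_of)
print(ratios)  # {'induce': 0.6, 'increase': 0.2, 'promoted': 0.2}
```

A skewed distribution like this toy one, where one cluster dominates, mirrors the "induced" cluster's outsized share in PositiveRegulation.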

Conclusion
To enhance the efficiency and accuracy of comprehensive reviews of existing animal experiment literature, we introduce AniEE, an event extraction dataset designed specifically for the animal experimentation stage. The distinctiveness of our dataset can be summarized in two key points. To the best of our knowledge, it is the first event extraction dataset focusing on animal experimentation. In addition, it encompasses both discontinuous named entities and nested events within document-level texts. We anticipate that AniEE will contribute significantly to advancing document-level event extraction, not only in the biomedical domain but also in natural language processing more broadly.

Limitations
We acknowledge that our dataset, which contains 350 abstracts, is not large, owing to labor-intensive manual annotation. However, considering the number of annotated events and entities, as well as our experimental results, the current dataset size is sufficient to develop NER and EE models.

Ethics Statement
As each annotator is a master's or Ph.D. student, we compensated each annotator at a rate reasonably comparable to a lab stipend. Additionally, we adjusted the workload weekly, accounting for each annotator's schedule.

Table 12 compares our dataset with existing benchmarks in terms of corpus information. We also describe the definition and frequency of the event arguments in Table 13.

B Case Study
To further investigate 1) discontinuous entities and 2) nested events, we visualize six samples from our dataset.

B.1 Discontinuous Entity
We extract data samples containing discontinuous entities, color each named entity according to its entity type, and tag whether the prediction of the entity is a success or a failure. W2NER (Li et al., 2022) is used to produce the model predictions. As shown in Table 14, the model predicts the discontinuous entities in the first three examples accurately. However, it fails to detect the duration entity in the fourth example (i.e., "five days"), instead predicting "five consecutive days" as a flat entity. This is because we define Duration as a number plus a unit in our annotation strategy.

B.2 Nested Event
Similar to the discontinuous entities, we color the event triggers in a given data sample and tag whether CasEE (Sheng et al., 2021) predicts them correctly. We also extract the relations between two event triggers when one trigger is an argument of the other. Such a relation is described by a triplet, where the first element is the event trigger of the current example, the second is its argument, and the third is the role the second element plays within the event of the first trigger.
Table 15 shows two examples of nested events.
The model predicts the first example incorrectly and the second correctly; in both cases, the argument role is Object.

Figure 2 :
Figure 2: Frequency ratio comparison for each event type. Each line represents an event type. We plot the percentage of the event mention distribution (Y-axis) accounted for by the top five lemma clusters (X-axis).

Table 1 :
Comparison of two clinical-domain named entity recognition (NER) datasets and two biomedical-domain event extraction (EE) datasets: CLEF eHealth Task 2013 (ShARe13) (Danielle et al., 2013), CADEC (Sarvnaz et al., 2015), GENIA 2011 (GE11) (Kim et al., 2011), and Multi-level Event Extraction (MLEE) (Pyysalo et al., 2012). To the best of our knowledge, AniEE is the first dataset on animal experiment literature containing both discontinuous entities (Disc. Entity) and nested events (Nest. Event). The number of documents and sentences is described in Appendix Table 12.

Table 2 :
Definition and frequency of entity types.Ratios are presented rounded to the second decimal place.

Table 8 :
Event extraction performance comparison of two baseline models for the four evaluation metrics: trigger identification, trigger classification, argument identification, and argument classification (see Section 4.1).

Table 9 :
Event extraction performance of the triggers for each event type.

Table 10 :
Event extraction performance of the arguments for each event type.