OntoED: Low-resource Event Detection with Ontology Embedding

Event Detection (ED) aims to identify event trigger words from a given text and classify them into event types. Most current ED methods rely heavily on training instances and almost ignore the correlation of event types. Hence, they tend to suffer from data scarcity and fail to handle new unseen event types. To address these problems, we formulate ED as a process of event ontology population: linking event instances to pre-defined event types in an event ontology, and propose a novel ED framework entitled OntoED with ontology embedding. We enrich the event ontology with linkages among event types, and further induce more event-event correlations. Based on the event ontology, OntoED can leverage and propagate correlation knowledge, particularly from data-rich to data-poor event types. Furthermore, OntoED can be applied to new unseen event types by establishing linkages to existing ones. Experiments indicate that OntoED outperforms previous approaches to ED and is more robust, especially in data-scarce scenarios.


Introduction
Event Detection (ED) (Chen et al., 2015) is the task of extracting structured information about events from unstructured texts. For example, in the event mention "Jack is married to the Iraqi microbiologist known as Dr. Germ.", an ED model should identify the event type as 'Marry', where the word 'married' triggers the event. The extracted events with canonical structure facilitate various social applications, such as biomedical science (Wang et al., 2020c), financial analysis (Deng et al., 2019; Liang et al., 2020), fake news detection (Wang et al., 2018; Nikiforos et al., 2020) and so on.
As a non-trivial task, ED suffers from low-resource issues. On the one hand, the maldistribution of samples is quite serious in ED benchmark datasets, e.g., FewEvent (Deng et al., 2020) and MAVEN (Wang et al., 2020b), where a large portion of event types contain relatively few training instances. As shown in Figure 1, the sample sizes of the two event types Attack and Riot differ greatly (4,816 vs. 30). In low-resource scenarios, supervised ED models (Chen et al., 2015; Nguyen et al., 2016; Liu et al., 2018) are prone to overfitting, since they require sufficient training instances for all event types. On the other hand, real-world applications tend to be open and evolve promptly, and accordingly there can be numerous new unseen event types. Handling new event types may even entail starting over, without being able to re-use annotations from previous ones.
Regarding low-resource ED, recent work takes a fresh look at the task by mapping each event mention to a specific type in a target event ontology, which makes it possible to train on a few seen event types and then transfer knowledge to new unseen ones. However, the event ontology there merely considers the intra-structure of each event mention and event type.
In this paper, we enrich the event ontology with more inter-structures of event types, such as temporal, causal and hierarchical event-event relations (Ning et al., 2018; Wang et al., 2020a), as illustrated for the event type Attack in Figure 1. Our key intention is to fully utilize the event ontology and leverage correlation knowledge from data-rich event types (i.e., Attack) to data-poor ones (i.e., Sentence, Acquit and Riot). Besides, new event types (i.e., Be-Born) can be learned through correlations (i.e., COSUPER) with existing ones (i.e., Injure).
As the first attempt to construct such an event ontology, we propose a novel ED framework with ontology embedding called OntoED. First, we establish the initial event ontology with event instances and types. We capture semantic features and relations of event instances with BERT (Devlin et al., 2019) and utilize prototypes (Snell et al., 2017) to represent event types, where a prototype is the average of its instance embeddings. Second, we extend the event ontology with event-event relations based on extracted relations among event instances, and then learn ontology embedding by aggregating neighbor prototypes for each prototype w.r.t. correlations among event types. In this way, semantically similar event types will be closer in vector space, thus improving the discrimination of dissimilar event types. Third, we design an event correlation inference mechanism to induce new event correlations based on symbolic rules, e.g., (Sentence, BEFORE, Acquit) ∧ (Acquit, BEFORE, Pardon) → (Sentence, BEFORE, Pardon). Thus, we can induce new event-event relations to further enrich the event ontology. To the best of our knowledge, this is the first work to explicitly model correlations among event types with an event ontology in low-resource ED.
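The rule above can be read as plain transitive closure over typed triples. A minimal sketch of one inference round (the function name and data layout are our own, not the paper's):

```python
def infer_transitive(triples, rel="BEFORE"):
    """One inference round for a transitive relation over event-type triples,
    e.g. (Sentence, BEFORE, Acquit) and (Acquit, BEFORE, Pardon)
    together yield (Sentence, BEFORE, Pardon)."""
    known = set(triples)
    inferred = set()
    for h1, r1, t1 in known:
        for h2, r2, t2 in known:
            # Chain two triples of the same transitive relation.
            if r1 == rel and r2 == rel and t1 == h2:
                cand = (h1, rel, t2)
                if cand not in known:
                    inferred.add(cand)
    return inferred
```

Iterating this until no new triples appear yields the full closure; OntoED additionally scores each inferred triple rather than accepting it outright.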
Our contributions can be summarized as follows: • We study the low-resource event detection problem and propose a novel ontology-based model, OntoED, that encodes intra and inter structures of events.
• We provide a novel ED framework based on ontology embedding with event correlations, which integrates symbolic rules with popular deep neural networks.
• We build a new dataset OntoEvent for ED. Extensive experimental results demonstrate that our model can achieve better performance in the overall, few-shot, and zero-shot settings.

Related Work
Traditional approaches to ED are mostly based on neural networks (Chen et al., 2015; Nguyen et al., 2016; Liu et al., 2018; Yan et al., 2019; Cui et al., 2020; Shen et al., 2020; Lou et al., 2021), and ignore correlation knowledge of event types, especially in low-resource scenarios. Most previous low-resource ED methods (Peng et al., 2016) have been based on supervised learning. However, supervised methods are too dependent on data, and fail to apply to new types without additional annotation effort. Another popular line of methods for low-resource ED is based on meta learning. Deng et al. (2020); Lai et al. (2020); Shen et al. (2021) reformulate ED as a few-shot learning problem to extend ED with limited labeled samples to new event types, and propose to resolve few-shot ED with meta learning. Besides, knowledge enhancement and transfer learning are applied to tackle low-resource ED problems. Tong et al. (2020) leverage open-domain trigger knowledge to address long-tail issues in ED. Du and Cardie (2020) propose to handle few-shot and zero-shot ED tasks by casting them as a machine reading comprehension problem. Other work tackles the zero-shot ED problem by mapping each event mention to a specific type in a target event ontology; note that this line of work establishes the event ontology merely with the intra-structure of events, while we extend it with the inter-structure of event correlations. Though these methods are suitable for low-resource scenarios, they mostly ignore implicit correlations among event types and lack reasoning ability. In order to utilize correlation knowledge among event types, recent work proposes a new event graph schema, where two event types are connected through multiple paths involving entities. However, it requires various annotations of entities and entity-entity relations, which is complicated and demanding.
Different from these approaches, we propose to revisit the ED task as an ontology learning process, inspired by relation extraction (RE) tasks based on ontology and logic-based learning. Lima et al. (2018, 2019) present a logic-based relational learning approach to RE that uses inductive logic programming to generate information extraction (IE) models in the form of symbolic rules, demonstrating that ontology-based IE approaches are advantageous in capturing correlations among classes, and succeed in symbolic reasoning.

Problem Formulation
We revisit the event detection task as an iterative process of event ontology population. Given an event ontology O with an event type set E = {e_i | i ∈ [1, N_e]}, and a corpus T = {X_i | i ∈ [1, K]} that contains K instances, the goal of event ontology population is to establish proper linkages between event types and instances. Specifically, each instance X_i in T is denoted as a token sequence X_i = {x_i^j | j ∈ [1, L]} with maximum length L, where the event trigger x_i^t is annotated. We expect to predict the trigger index t (1 ≤ t ≤ L) and the event label e_i for each instance respectively.
Besides, we utilize a multi-faceted event-event relation set R = R_H ∪ R_T ∪ R_C for event ontology population and learning. Thereinto, R_H = {SUBSUPER, SUPERSUB, COSUPER} denotes a set of relation labels defined in the subevent relation extraction task (Wang et al., 2020a; Yao et al., 2020). R_T = {BEFORE, AFTER, EQUAL} denotes a set of temporal relations (Han et al., 2020). R_C = {CAUSE, CAUSEDBY} denotes a set of causal relations (Ning et al., 2018).
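The relation set above can be sketched as plain Python data; the inverse-pair map is our own reading of the relation semantics (it is used later by the inverseOP rule property), not something the paper defines explicitly:

```python
# The multi-faceted relation set R = R_H ∪ R_T ∪ R_C described above.
R_H = {"SUBSUPER", "SUPERSUB", "COSUPER"}   # hierarchical relations
R_T = {"BEFORE", "AFTER", "EQUAL"}          # temporal relations
R_C = {"CAUSE", "CAUSEDBY"}                 # causal relations
R = R_H | R_T | R_C

# Inverse/symmetric pairs implied by the relation semantics (assumption):
# EQUAL and COSUPER are symmetric, i.e. their own inverses.
INVERSE = {
    "SUBSUPER": "SUPERSUB", "SUPERSUB": "SUBSUPER",
    "BEFORE": "AFTER", "AFTER": "BEFORE",
    "CAUSE": "CAUSEDBY", "CAUSEDBY": "CAUSE",
    "EQUAL": "EQUAL", "COSUPER": "COSUPER",
}
```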

Model Overview
In this paper, we propose a general framework called OntoED with three modules: (1) Event Detection (Ontology Population), (2) Event Ontology Learning, and (3) Event Correlation Inference. Event Detection aims at identifying the event trigger x_i^t and type e_i for each input token sequence X_i, and then identifying relations among event instances. The average instance embedding of each type is calculated as the primitive event prototype.
Event Ontology Learning aims to obtain event ontology embedding with the correlation of event prototypes, based on the relations among event types derived from instances. Event Correlation Inference seeks to infer new event correlations based on existing event-event relations, so as to obtain a solid event ontology.
The detailed architecture of OntoED with running examples is illustrated in Figure 3.

Event Detection (Ontology Population)
The input of ED is an initial event ontology with event types E and a coarse corpus T.
Instance Encoder. Given a token sequence X_i, we use pretrained BERT (Devlin et al., 2019) to get a contextual representation X_i^t for the trigger candidate x_i^t, and use the token embedding of [CLS] as the contextual representation X_i for X_i. Note that the instance encoder is pluggable and can be replaced with other models, following (Deng et al., 2020; Cui et al., 2020).
Class Encoder. We then represent event types as prototypes (Snell et al., 2017), as prototypes have proven robust for low-resource ED (Deng et al., 2020).
Initially, event types have no correlations with others, so we compute the prototype P_k for e_k ∈ E by averaging its instance embeddings:

P_k = (1 / N_k) Σ_{X_i ∈ e_k} X_i, (1)

where N_k is the instance number of e_k. Afterward, event prototypes will be refined by the module of event correlation inference, as shown in Figure 3.
Event Detector. Given the embeddings of a token sequence, we treat each token as an event trigger candidate and compute the probability of event type e_k for trigger candidate x_i^t:

P(y = e_k | x_i^t) = exp(-||X_i^t - P_k||) / Σ_{j=1}^{N_e} exp(-||X_i^t - P_j||), (2)

where || · || denotes Euclidean distance, and N_e = |E| denotes the number of event types. As usual, we adopt cross entropy as the loss function for event detection:

L_ED = -Σ_{k=1}^{N_e} y_k log P(y = e_k | x_i^t), (3)

where y is the ground-truth label for x_i^t.
Instance Relation Extractor. For each event instance pair (X_i, X_j), we adopt a comprehensive way to model embedding interactions:

X_ij = [X_i, X_j, X_i ∘ X_j], (4)

where [·, ·] denotes vector concatenation, and ∘ is the element-wise Hadamard product.

Figure 3: Detailed example for the process of OntoED. Note that we ignore instance nodes in the No.2 and No.3 event ontologies for space limit.
Step 1: Event Detection (Ontology Population) connects event types with instances, given the initial event ontology with a coarse corpus.
Step 2: Event Ontology Learning establishes correlations among event types, given the event ontology enriched with instances.
Step 3: Event Correlation Inference induces more event correlations based on existing event-event relations, e.g., (e_1, CAUSE, e_2) → (e_1, BEFORE, e_2).

We then calculate the probability P(y = r_k) of relation r_k ∈ R between (X_i, X_j) by applying softmax over the interaction embedding X_ij. Generally, we adopt cross entropy as the loss function for instance relation extraction:

L_RE = -Σ_{k=1}^{N_r} y_k log P(y = r_k | (X_i, X_j)),

where y is the ground-truth for (X_i, X_j), and N_r = |R| denotes the number of event-event relations. Overall, the loss function for event detection (ontology population) is calculated by:

L_OP = L_ED + γ L_RE, (5)

where γ is a hyperparameter.
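The prototype-based detector above can be sketched in a few lines of pure Python (names are our own; a real implementation would operate on batched BERT embeddings as tensors):

```python
import math

def prototype(instance_embs):
    """Class encoder: average the instance embeddings of one event type."""
    n, d = len(instance_embs), len(instance_embs[0])
    return [sum(v[j] for v in instance_embs) / n for j in range(d)]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def type_probs(trigger_emb, prototypes):
    """Event detector: softmax over negative Euclidean distances
    from a trigger-candidate embedding to every event prototype."""
    scores = [math.exp(-euclidean(trigger_emb, p)) for p in prototypes]
    z = sum(scores)
    return [s / z for s in scores]
```

A trigger candidate is assigned the highest probability for the event type whose prototype lies closest in embedding space, which is exactly what makes the classifier usable for data-poor types once prototypes absorb correlation knowledge.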

Event Ontology Learning
Ontology Completion. We complete the event ontology O with both intra- and inter-structures of events. We normatively link event instances T to event types E, and establish correlations among event types based on linkages among event instances. Instance-to-class Linking. Given a sentence S_i (formalized as a token sequence X_i) with a trigger x_i^t of an event instance, we link this information to its corresponding event type e_i with normative triples: (S_i, triggerIs, x_i^t) and (S_i, instanceOf, e_i). Class-to-class Linking. Given an event instance pair (X_i, X_j) with a relation r, we lift the instance correlation to the corresponding event types, denoted by (e_i, r, e_j). Besides, we link each event subtype to its corresponding supertype³ with a SUBSUPER relation (SUPERSUB in reverse), and we link each event subtype pair having the same supertype with a COSUPER relation.
Ontology Embedding. We represent the event ontology considering both instances and correlations for each event type. Specifically, given a triple (e_h, r, e_t) ∈ O, we propagate the prototype P_h of head event type e_h to the prototype P_t of tail event type e_t with a relation transformation matrix M_r ∈ R^{d×d}. We embed r as a matrix since matrices show great robustness for modeling relations in low-resource scenarios. We then aggregate propagation from all head event types:

P_t^* = (1 / |O_t|) Σ_{(e_h, r, e_t) ∈ O_t} P_h M_r, (6)

where O_t is the set of all one-hop neighbor triples of e_t in O.
The prototype P_t of e_t after propagation is a weighted average of P_t and P_t^* with weight λ ∈ [0, 1]:

P_t ← λ P_t + (1 - λ) P_t^*. (7)

We calculate the possibility that r is the relation between e_h and e_t with a truth value for (e_h, r, e_t): φ(e_h, r, e_t) = sim(P_h M_r, P_t) = σ(P_h M_r P_t^⊤), where σ is the sigmoid function, and the similarity between P_h M_r and P_t is evaluated via dot product.

³ The supertypes and their corresponding subtypes in this paper are pre-defined and are introduced in the appendix.
Overall, the loss function for event ontology learning is defined by:

L_OL = -Σ_{(e_h, r, e_t)} [y log φ(e_h, r, e_t) + (1 - y) log(1 - φ(e_h, r, e_t))], (8)

where y denotes the ground-truth label for (e_h, r, e_t).
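The propagation and scoring steps above amount to a matrix transform, an average over one-hop neighbors, and a sigmoid-squashed dot product. A minimal sketch (pure Python; the uniform averaging over neighbors is an assumption stated with Eq (6)):

```python
import math

def matvec(v, M):
    """Row-vector times matrix: (v M)_j = sum_i v_i * M[i][j]."""
    return [sum(v[i] * M[i][j] for i in range(len(v)))
            for j in range(len(M[0]))]

def aggregate(P_t, neighbors, lam=0.5):
    """Ontology embedding update: average the relation-transformed head
    prototypes (P_t^*), then mix with the old prototype via weight lam.
    neighbors: list of (P_h, M_r) pairs for one-hop triples (e_h, r, e_t)."""
    agg = [0.0] * len(P_t)
    for P_h, M_r in neighbors:
        agg = [a + x for a, x in zip(agg, matvec(P_h, M_r))]
    P_star = [a / len(neighbors) for a in agg]
    return [lam * p + (1 - lam) * q for p, q in zip(P_t, P_star)]

def truth_value(P_h, M_r, P_t):
    """phi(e_h, r, e_t) = sigmoid((P_h M_r) . P_t)."""
    s = sum(a * b for a, b in zip(matvec(P_h, M_r), P_t))
    return 1.0 / (1.0 + math.exp(-s))
```

With λ = 0.5 a prototype moves halfway toward the aggregate of its transformed neighbors, which is how correlation knowledge flows from data-rich to data-poor types.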

Event Correlation Inference
Given the event ontology with correlations among event types, we infer new event correlations based on existing ones. To be specific, we utilize a grounding g to infer new event correlation triples, which can be generalized as the following form:

(e_h^I, r^I, e_t^I) ← (e_h^1, r^1, e_t^1), · · · , (e_h^n, r^n, e_t^n), (9)

where the right-side event triples (e_h^k, r^k, e_t^k) ∈ O with k ∈ [1, n] already exist in O, and (e_h^I, r^I, e_t^I) ∉ O is a new inferred triple to be added. To compute the truth value of the grounding g, we select three object properties (OP) of relations defined in the OWL2⁴ Web Ontology Language: subOP, inverseOP, and transitiveOP, and then learn matrices of relations under the linear map assumption, as presented in Table 1. Assume that M_r^† and M_r^‡ denote the relation matrices on the left and right of Eq (9) respectively; each is either a single matrix or a product of two matrices. As relation constraints are derived from the ideal linear map assumption (the 3rd column in Table 1), M_r^† and M_r^‡ are usually unequal but similar during training. Thus, the normalized truth value F_p of g can be calculated based on relation constraints (the 4th column in Table 1):

F_p = (F_p^max - ||M_r^† - M_r^‡||_F) / (F_p^max - F_p^min),

where || · ||_F denotes the Frobenius norm, and the subscript p denotes one of the three object properties. F_p^max and F_p^min are the maximum and minimum Frobenius norm scores. F_p ∈ [0, 1] is the truth value for the grounding g, and a higher F_p means more confidence that g is valid.

⁴ https://www.w3.org/TR/owl2-profiles/
The loss function for new event correlation inference is defined by:

L_CI = -Σ_{p ∈ {S, V, T}} ψ_p Σ_{g_k ∈ G_p} log F_p^k, (10)

where G_(·) denotes all groundings w.r.t. subOP (S), inverseOP (V), and transitiveOP (T). ψ_S, ψ_V, and ψ_T are hyperparameters for the loss of the three object properties respectively.
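The truth value of a grounding can be sketched directly from the definition above: a min-max-normalized Frobenius distance between the left- and right-side relation matrices, so that identical matrices score 1 (the normalization form is our reading of "normalized truth value", not an equation the extraction preserved):

```python
import math

def fro_diff(A, B):
    """Frobenius norm of the element-wise difference A - B."""
    return math.sqrt(sum((a - b) ** 2
                         for ra, rb in zip(A, B)
                         for a, b in zip(ra, rb)))

def rule_truth(M_dagger, M_ddagger, f_min, f_max):
    """Normalized truth value F_p of a grounding: smaller distance between
    the two relation matrices means higher confidence, scaled to [0, 1]
    by the minimum and maximum Frobenius scores observed for property p."""
    f = fro_diff(M_dagger, M_ddagger)
    return (f_max - f) / (f_max - f_min)
```

For transitiveOP, M_dagger would be the product of the two chained relation matrices and M_ddagger the matrix of the inferred relation, per the linear map assumption.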
As a whole, the final loss function for OntoED is denoted by:

L = α L_OP + β L_OL + L_CI,

where α and β are hyperparameters for the loss of event ontology population (Eq (5)) and event ontology learning (Eq (8)) respectively.

Experiments
The experiments seek to: (1) demonstrate that OntoED with ontology embedding can benefit both standard and low-resource ED, and (2) assess the effectiveness of different modules in OntoED and provide error analysis. To this end, we verify the effectiveness of OntoED in three types of evaluation: (1) Overall Evaluation, (2) Few-shot Evaluation, and (3) Zero-shot Evaluation.

Datasets
As none of the existing datasets for ED is annotated with relations among events, we propose a new ED dataset, named OntoEvent, with event correlations. It contains 13 supertypes with 100 subtypes, derived from 4,115 documents with 60,546 event instances. The details of OntoEvent are introduced in the appendix. We show the main statistics of OntoEvent and compare them with some existing widely-used ED datasets in Table 3. OntoEvent is established based on two newly proposed datasets for ED: MAVEN (Wang et al., 2020b) and FewEvent (Deng et al., 2020), which are constructed from Wikipedia documents or based on existing event datasets, such as ACE-2005 and TAC-KBP-2017. In terms of event-event relation annotation in OntoEvent, we jointly use two models: TCR (Ning et al., 2018) is applied to extract temporal and causal relations, and JCL (Wang et al., 2020a) is used to extract hierarchical relations. The code of OntoED and the OntoEvent dataset can be obtained from GitHub.

Baselines
For overall evaluation, we adopt the CNN-based model DMCNN (Chen et al., 2015), the RNN-based model JRNN (Nguyen et al., 2016), and the GCN-based model JMEE (Liu et al., 2018). Besides, we adopt the BERT-based model AD-DMBERT with adversarial imitation learning. We also adopt the graph-based models OneIE and PathLM, which generate graphs from event instances for ED. For few-shot evaluation and zero-shot evaluation, we adopt some metric-based models for few-shot ED, such as MatchNet (Lai et al., 2020), ProtoNet (Snell et al., 2017) and DMBPN (Deng et al., 2020). We also adopt the knowledge-enhanced model EKD (Tong et al., 2020) and the BERT-based models QAEE (Du and Cardie, 2020) as well as RCEE, both based on machine reading comprehension. Besides, we adopt ZSEE especially for zero-shot ED.

Experiment Settings
With regard to the settings of the training process, the SGD (Ketkar, 2014) optimizer is used, with 30,000 iterations of training and 2,000 iterations of testing. The dimension of token embedding is 50, and the maximum length of a token sequence is 128. In OntoED, a dropout rate of 0.2 is used to avoid over-fitting, and the learning rate is 1 × 10^-3. The hyperparameters γ, λ, α, and β are set to 0.5, 0.5, 1.5 and 1 respectively. ψ_S, ψ_V, and ψ_T are set to 0.5, 0.5 and 1 respectively. As the dataset is unbalanced, we evaluate the performance of ED with macro precision (P) and recall (R), and adopt micro F1 score (F) following (Chen et al., 2015). Detailed performance can be found on GitHub.

Overall Evaluation
Setting. We follow a similar evaluation protocol to standard ED models, e.g., DMCNN (Chen et al., 2015). Event instances are split into training, validation, and testing subsets with a ratio of 0.8, 0.1 and 0.1 respectively. Note that the testing set contains no new event types unseen in training.
As seen from Table 4, OntoED achieves larger gains compared to conventional baselines, e.g., DMCNN, JRNN and JMEE. Moreover, OntoED still generally outperforms the BERT-based AD-DMBERT. This implies the effectiveness of an ED framework with ontology embedding, which can leverage and propagate correlations among event types, thereby reducing the dependence on data to some extent. Notably, OntoED also outperforms the graph-based models, i.e., OneIE and PathLM. The possible reason is that although both convert sentences into instance graphs, and PathLM even connects event types through multiple entities, the event correlations remain implicit and hard to capture. OntoED can explicitly utilize event correlations and directly propagate information among event types.

Few-shot Evaluation
Setting. We follow a similar evaluation protocol and metrics to data-scarce ED models, i.e., RCEE, which train models with partial data. We randomly sample nearly 80% of event types for training, 10% for validation, and 10% for testing. Differently from the overall evaluation, the event types in the testing set do not exist in the training set.

Table 4: Evaluation of event detection with overall instances. P (%), R (%) and F (%) stand for precision, recall, and F1-score respectively.

In Table 5, we demonstrate F1 score results in extremely low-resource scenarios (training with less than 20% of the data, in a similar setting to prior work). Obviously, OntoED shows tremendous advantages in low-resource ED. For example, OntoED obtains 44.98% F1 with 1% of the data, in comparison to 7.09% for MatchNet and 8.18% for ProtoNet. We also illustrate accuracy results with different ratios of training data, shown in Figure 4. As seen, OntoED demonstrates superior performance with less data dependence than the baselines, especially compared with DMBPN and EKD, which require 60% of the training data to approach their best results, while OntoED only uses 20%. Besides, we find that the performance of DMBPN increases first and then slightly decreases as the ratio of training data increases; the possible reason may lie in data noise and redundancy. In low-resource scenarios, more data are not always better. Some merely data-driven ED models, such as DMBPN, may even obtain a worse effect if the added data are dirty or duplicated. But OntoED utilizes correlation knowledge in the event ontology and has less dependence on event instances, making it more robust to noisy and redundant data.
Furthermore, OntoED also outperforms the BERT-based models that treat each event instance as a question, i.e., QAEE and RCEE. This implies that event ontology learning with event type knowledge may resolve low-resource ED more effectively than training merely with event instances.

Zero-shot Evaluation
Setting. We follow a similar evaluation protocol and metrics to zero-shot ED models, i.e., ZSEE, and comply with the same dataset segmentation policy as the few-shot evaluation, so there are also new unseen event types for testing. Differently, instance data are completely withheld from training, meaning that we train models only with event types rather than instances. Table 6 demonstrates the results for zero-shot ED. We can see that OntoED achieves the best precision and F1 score as well as comparable recall results in comparison to baselines. This illustrates the effectiveness of OntoED in handling new unseen event types without introducing outsourced data. Traditional models, such as EKD and RCEE, need to adopt other resources, e.g., WordNet (Miller et al., 1990) (where words are grouped and interlinked with semantic relations) and FrameNet (Baker, 2014) (where frames are treated as meta event types), to increase the persuasiveness of results. In contrast, OntoED naturally models the structure of event types with an event ontology; thus, even for a new unseen event type without instance data, we can obtain its representation through event-event correlations. Moreover, OntoED is also more effective than ZSEE at resolving zero-shot ED. This may be because OntoED models both intra- and inter-structures of events while ZSEE merely considers the intra-structure.

Ablation Study
To assess the effect of event ontology learning and correlation inference, we remove the two modules from OntoED and evaluate the F1 score, as shown in Figure 5. From the results, we observe that OntoED outperforms the two ablated baselines in all evaluation settings, indicating that event ontology learning and correlation inference facilitate ED, as they utilize knowledge among event types and reduce dependence on instance data. Furthermore, in terms of performance degradation compared to full OntoED, the F1 score of OntoED without event correlation inference drops more seriously (e.g., 10.9%↓) than that without event ontology learning (e.g., 6.6%↓), and the phenomenon is more obvious in few-shot and zero-shot evaluation (e.g., 15.9%↓ and 28.1%↓). This illustrates that event correlation inference is more necessary in OntoED, as it establishes more correlations among event types, so knowledge can be propagated more adequately, especially from data-rich to data-poor events.

Error Analysis
We further conduct error analysis and provide some representative examples.
(1) One typical error relates to similar event-event structures in the event ontology. As OntoED considers event correlations, event types with similar neighbor triples can be indistinguishable. For example, Robbery and Kidnapping have the same supertype Crime, and both have the neighbor triple (*, CAUSE, Arrest).
(2) The second error relates to wrong instance relations. As instance relation extraction directly influences the establishment of event correlations, wrong instance relations cause error propagation.
(3) The third error relates to the same event mention expressing different event types. For example, 'Of the 126 people aboard, 47 died and 74 sustained serious injuries.' mentions both Die and Injure.

Conclusion and Future Work
This paper proposes a novel event detection framework with ontology embedding called OntoED. We revisit the ED task by linking each event instance to a specific type in a target event ontology. To facilitate the linkage, we enrich the event ontology with event-event relations, such as temporal, causal and hierarchical correlations, and induce more event correlations based on existing ones. The key insight is that the event ontology can help to reduce model dependence on instance data, especially in low-resource scenarios, as data-rich event types can propagate correlation knowledge to data-poor ones, and new event types can establish linkages to the event ontology. We demonstrate the effectiveness of OntoED in three settings: overall, few-shot, and zero-shot, and experiments show that OntoED outperforms previous methods with great robustness.
In the future, we intend to extend our work in several aspects. First, we would improve the event ontology and consider more event correlations. Second, we would explore if low-resource ED can also boost to identify event correlation. Third, we would develop more neuro-symbolic methods for ED.

Broader Impact Statement
A broad goal of event detection is to extract structured knowledge from unstructured texts to facilitate knowledge acquisition. For example, it is valuable in the medical domain and provides social benefits to analyze dispensatory details as well as electronic health records. Furthermore, a solid ED system can also be applied to many societal issues, such as counter-terrorism and public opinion analysis.
In this paper, we present a new dataset OntoEvent for ED with event-event correlations. The event data are all collected from existing datasets (i.e., ACE 2005) or open-source databases (e.g., Wikipedia), and the annotations are generated from existing models with citations. In the experiments, we describe in detail how to evaluate the newly-proposed OntoEvent and provide specific analysis. The code and dataset are both available.
Our approach to ED can leverage only a small event corpus to establish linkages between event types and event instances w.r.t. event correlations. In addition, this work is also a brand-new attempt to combine information extraction and symbolic reasoning, based on ontology embedding. Our intention is to develop an ontology-based ED system for the NLP community, and we hope our innovation can become a small step in this direction.