Fine-Grained Event Trigger Detection

Most previous work on Event Detection (ED) has considered only datasets with a small number of event types (i.e., up to 38 types). In this work, we present the first study on fine-grained ED (FED), where the evaluation dataset involves far more fine-grained event types (i.e., 449 types). We propose a novel method to transform the Semcor dataset for Word Sense Disambiguation into a large and high-quality dataset for FED. We conduct an extensive evaluation of current ED methods to demonstrate the challenges of the generated dataset for FED, calling for more research effort in this area.


Introduction
Understanding events in text is an important aspect of Natural Language Processing (NLP). Toward this end, Event Detection (ED), a task of Information Extraction (IE), aims to identify event triggers in sentences and classify them into some predefined types of interest. Event triggers are the most important words (usually single verbs or nominalizations) in the sentences that evoke the events. The current state-of-the-art methods for ED are deep learning models, for which many new network architectures have been introduced in the last few years (Nguyen and Grishman, 2015; Chen et al., 2015; Liu et al., 2019a; Lai et al., 2020b).
Among other factors, the rapid development of deep learning models for ED can be partly attributed to the availability of large datasets to evaluate the models (e.g., the ACE 2005 and TAC KBP 2015 datasets (Walker et al., 2006; Mitamura et al., 2015)). Unfortunately, a major issue with these existing datasets for ED is that they tend to focus on only a limited set of event types. For example, the popular ACE 2005 dataset is only annotated for 33 event subtypes (e.g., Attack, Start-Position, Elect), while the number of event types in the TAC KBP dataset (Mitamura et al., 2015) is 38. On the one hand, such a limited number of types cannot cover the wide range of possible events in practice (Araki and Mitamura, 2018). On the other hand, the small label sets often amount to coarse-grained event types that cannot capture the slightly different nuances (i.e., fine-grained distinctions) of the events. For instance, both the words "quit" and "fired" in the two sentences "He decided to quit the job." and "He was fired due to a policy violation." (respectively) would be considered trigger words of the same event type, End-Position, in the ACE 2005 dataset. However, the nuances of these two events are quite different (i.e., in terms of the willingness of the job termination), and the ability to characterize such subtle distinctions would be useful for downstream applications (Choi et al., 2018).
In order to address these problems, we propose to explore the problem of Fine-grained Event Detection (FED), which seeks to solve ED with much larger and finer-grained sets of event types (motivated by the fine-grained entity typing task (Ling and Weld, 2012; Choi et al., 2018)). To our knowledge, this is the first work to explicitly study FED in the literature. A major challenge in this research direction is the creation of evaluation datasets that enable effective model development and analysis. In particular, it is non-trivial to design a large set of fine-grained event types with which to annotate the datasets. In addition, with such a large number of fine-grained event types (i.e., 449 in this work), the traditional labeling procedure with human involvement might be too expensive and error-prone for generating large datasets for FED. To this end, we introduce a novel method to address these challenges and produce a large dataset for FED based on WordNet and Word Sense Disambiguation (WSD) datasets. Our method involves two major steps: we first leverage the synset typology in WordNet to formulate the fine-grained event types, and then convert the annotated datasets for WSD into datasets for our FED problem. This novel data generation procedure minimizes human effort and allows us to create a large and high-quality dataset with 449 fine-grained event types for FED. Finally, we extensively evaluate the state-of-the-art ED models on the proposed FED dataset. The experiments show that the performance of the current ED models is not yet satisfactory for FED and that further research is needed to advance the performance in this area. We will publicly release the proposed dataset to promote future research on FED.

Data Generation Procedure
The goal of this section is to generate a large dataset for ED with many fine-grained event types to evaluate FED models. Our proposed procedure involves two major steps. First, we identify the eventive synsets/senses in WordNet 3.0 (Miller, 1995) and group them into classes with similar eventive meanings. These classes serve as the fine-grained event types in the resulting FED dataset. As a result, we obtain a mapping from the set of WordNet synsets to the set of fine-grained event types for our problem (some WordNet synsets might not be mapped to any event type in our case). Afterward, we leverage the Semcor dataset for WSD (Miller et al., 1994) and map the synsets annotated for the words in this dataset into the event types in our setting. This conversion process produces a dataset whose words are assigned the fine-grained event types of our FED problem. As the Semcor dataset is manually annotated, the resulting FED dataset will be large and of high quality, provided that the synset-event type mapping is constructed well.
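The conversion step can be illustrated with a minimal sketch. It assumes that each WSD example is a (sentence, word index, synset) triple, that unmapped synsets receive the Other type introduced later in this section, and that examples of rare positive types are simply discarded by the final filtering step (the text does not say whether they are dropped or relabeled). The function names, synset identifiers, and type names below are hypothetical.

```python
from collections import Counter

OTHER = "Other"  # label for synsets outside every fine-grained event type


def convert_examples(wsd_examples, synset_to_type):
    """Relabel (sentence, word_index, synset) triples with event types."""
    return [
        (sentence, idx, synset_to_type.get(synset, OTHER))
        for sentence, idx, synset in wsd_examples
    ]


def drop_rare_types(examples, min_examples=10):
    """Discard examples of positive types with fewer than min_examples instances."""
    counts = Counter(label for _, _, label in examples)
    return [
        ex for ex in examples
        if ex[2] == OTHER or counts[ex[2]] >= min_examples
    ]


# Toy data: synset ids and type names are made up for illustration only.
toy_mapping = {"quit.v.x": "Resignation", "fire.v.x": "Dismissal"}
toy_wsd = [
    ("He decided to quit the job .", 3, "quit.v.x"),
    ("He was fired due to a policy violation .", 2, "fire.v.x"),
    ("He was fired due to a policy violation .", 7, "violation.n.x"),
]
converted = convert_examples(toy_wsd, toy_mapping)
labels = [label for _, _, label in converted]
```

With the threshold of 10 used in this work, all positive examples of this toy set would be removed; a lower `min_examples` is only useful for experimenting with small inputs.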
In particular, for eventive synset/sense identification, we start with nouns. Following Araki and Mitamura (2018), we consider any noun synset that is subsumed by one of the following three synsets via the WordNet hyponym relation to be eventive: state^2_n (i.e., the way something is with respect to its main attributes), process^6_n (i.e., a sustained phenomenon or one marked by gradual changes through a series of states), and event^1_n (i.e., something that happens at a given place and time). In this way, we find 13,166 eventive synsets among the 82,115 synsets for nouns. We call these three general synsets the eventive root synsets in the following. Starting from these root synsets, we traverse the synset graph in WordNet by following the hyponym links. The graph traversal procedure generates three different trees whose nodes are the eventive synsets and whose roots are the three selected synsets. For convenience, we call the synsets in WordNet that can be reached from one of the three root synsets after n hyponym links the synsets at the n-th level [footnote 1] (so the root synsets are at level zero). In order to form the fine-grained event types for FED, we select the WordNet synsets at the 4th level as the core meanings (called the core synsets) for the event types in our dataset (there are 2,637 core eventive synsets found in this way). We empirically choose the synsets at the 4th level to balance two factors. On the one hand, the synsets at shallower levels lead to event types that are too general to achieve the expected fine-grained property. On the other hand, going deeper for the core event meanings reduces the number of examples per event type in the final FED dataset converted from Semcor.
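The level assignment described above can be sketched as a breadth-first search from the three root synsets, which naturally places a synset reachable via several paths at its closest level to the roots. This is a minimal illustration over a toy hyponym graph, not the actual WordNet traversal; the node names and the `hyponyms` adjacency dictionary are hypothetical.

```python
from collections import deque


def synset_levels(hyponyms, roots):
    """Breadth-first search; each synset gets its smallest distance to a root."""
    level = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in hyponyms.get(node, ()):
            if child not in level:  # first visit is the closest level
                level[child] = level[node] + 1
                queue.append(child)
    return level


def core_synsets(level, core_level=4):
    """Synsets whose closest level equals core_level."""
    return sorted(s for s, d in level.items() if d == core_level)


# Toy graph: "e3" is reachable from "event" in 3 links and from "state" in 2.
toy_hyponyms = {
    "event": ["e1"], "e1": ["e2"], "e2": ["e3"],
    "state": ["s1"], "s1": ["e3"],
    "e3": ["e4"], "e4": ["e5"],
}
levels = synset_levels(toy_hyponyms, ["event", "state", "process"])
cores = core_synsets(levels)
```

On real data, the adjacency information could be read from a WordNet interface such as NLTK's `Synset.hyponyms()`; that choice is an assumption here and not something the text specifies.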
Given a core synset A, we identify the other synsets with similar meanings to A and combine them to represent a fine-grained event type (called E) in our dataset (i.e., the event type E involves several semantically similar synsets in WordNet). In this work, we include the following two classes of synsets in the event type E for A:
• The synsets for nouns that can be reached from A via the hyponym links: Intuitively, the synsets subsumed by A exhibit the general eventive meaning of A with certain distinctions.
• The synsets of the derivationally related forms of the lemmas/senses l in A and their synset descendants (via the hyponym links): In WordNet, the derivationally related forms of a lemma l in the synset A involve lemmas from different syntactic categories (e.g., verbs and adjectives) that have the same root form as l and are semantically related to l and A (e.g., destruction → destroy) (Miller, 1995). Due to this semantic similarity, we expect the synsets of the derivationally related forms of the lemmas in A and their descendants to express the same eventive meaning as A, thereby enriching the synsets for E with syntactic categories beyond nouns.
Footnote 1: It is possible that some eventive synsets in WordNet reside at more than one level, as they can be reached from the three root nodes via multiple paths. We resolve this conflict by placing these synsets at the closest level to the roots.
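The two classes of synsets above can be combined as in the following minimal sketch, which follows the general description (without any depth restriction). It assumes two hypothetical inputs: `hyponyms`, an adjacency dictionary over synsets, and `derived`, a dictionary from a core synset to the synsets of the derivationally related forms of its lemmas. The synset names are made up.

```python
def descendants(hyponyms, start):
    """All synsets reachable from start via hyponym links (start excluded)."""
    seen, stack = set(), [start]
    while stack:
        for child in hyponyms.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def event_type_synsets(core, hyponyms, derived):
    """Core synset, its descendants, and derivationally related synsets with theirs."""
    members = {core} | descendants(hyponyms, core)
    for related in derived.get(core, ()):
        members.add(related)
        members |= descendants(hyponyms, related)
    return members


# Toy example built around the destruction -> destroy pair mentioned above.
toy_hyponyms = {"destruction.n": ["demolition.n"], "destroy.v": ["raze.v"]}
toy_derived = {"destruction.n": ["destroy.v"]}
event_type = event_type_synsets("destruction.n", toy_hyponyms, toy_derived)
```

In NLTK, derivational links are stored on lemmas rather than synsets (`Lemma.derivationally_related_forms()`), so a real implementation would likely collect the synsets of those lemmas first; this is an assumption about tooling, not a description of the authors' code.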
Up to this point, we have obtained a set of 2,637 event types, each represented by a core synset and a set of related synsets. We combine all the other synsets (i.e., the ones that do not appear in any of the 2,637 types) to create a single event type called Other, as in the traditional ED task. With this grouping information, we can now create a mapping from the synsets/senses in WordNet to the 2,638 established event types for our dataset (called the M-WordNet-Event mapping). Based on this mapping, we transform each example in the WSD Semcor dataset, which involves a sentence and a word of interest, into an example in the new dataset for FED (called FedSemcor), where the synset/sense label for the word in the original Semcor example is mapped into the corresponding event type in FedSemcor. As the final processing step, we remove from FedSemcor any event types that have fewer than 10 examples to ensure that the event types are adequately represented in our dataset. This step significantly reduces the number of event types in FedSemcor, leaving the 449 fine-grained event types (plus Other) used in this work.
Implementation Details: In the actual implementation of the data generation procedure for FedSemcor, given a core synset A, we do not include all the descendants of A and of the synsets of the derivationally related forms of A's lemmas in the synset set for the corresponding event type E for A. Instead, we only include the descendants that are at most 2 hyponym links away from A, together with the synsets of the derivationally related forms of A's lemmas. This is based on our empirical investigation of the data, where the descendants more than 2 links away tend to drift semantically from A, potentially introducing noise into the event type E.
For example, with the core synset motion^6_n (i.e., the act of changing location from one place to another) at the 4th level, the descendants at the 5th, 6th and 7th levels include: level 5: approach^2_n (i.e., the act of drawing spatially closer to something), level 6: access^6_n (i.e., the act of approaching or entering), and level 7: back door^6_n (i.e., a secret or underhand means of access (to a place or a position)). As we can see, while the synsets at the 5th and 6th levels are related to the original core synset, the synset at the 7th level already involves some semantic departure from the one at the 4th level, which should be avoided to improve the precision.
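The depth restriction can be sketched with the chain from this example. The code below is a minimal illustration that assumes a hypothetical `hyponyms` adjacency dictionary and uses plain strings in place of WordNet synsets.

```python
def descendants_within(hyponyms, start, max_links=2):
    """Synsets reachable from start in at most max_links hyponym steps."""
    found, frontier = set(), {start}
    for _ in range(max_links):
        # Expand one hyponym step, skipping synsets that were already collected.
        frontier = {
            child
            for node in frontier
            for child in hyponyms.get(node, ())
        } - found - {start}
        found |= frontier
    return found


# The chain from the example: level 4 -> level 5 -> level 6 -> level 7.
chain = {
    "motion": ["approach"],
    "approach": ["access"],
    "access": ["back door"],
}
kept = descendants_within(chain, "motion")
```

With the limit of 2 links, `kept` contains the level-5 and level-6 synsets and excludes the level-7 synset, matching the behavior described above.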
Dataset Statistics: Table 1 reports some statistics for FedSemcor and some popular prior datasets for ED (i.e., ACE 2005 (Walker et al., 2006) and TAC KBP 2015 (Mitamura et al., 2015)) to facilitate the comparison. As we can see from the table, FedSemcor has more positive examples but fewer negative examples than ACE 2005 and TAC KBP 2015, making FedSemcor a more balanced dataset than the other two. In addition, we show the distribution of the 50 event types with the highest numbers of examples in FedSemcor in Figure 2. Finally, Figure 3 illustrates the distribution of the sentence lengths for the examples in FedSemcor.
Evaluation of FedSemcor: As we rely on the manual annotation in Semcor for the synsets of the words, the main bottleneck in the data generation procedure is the mapping from the WordNet synsets to the 450 event types in FedSemcor (including Other). In order to evaluate the quality of this mapping, we sample 500 synsets from WordNet that are different from the core synsets of the 449 positive event types. Two experienced NLP researchers then independently examine each of these 500 sampled synsets to determine the appropriate event type for it (among the 450 types). In doing so, they examine the glosses of the synsets as well as the examples provided by WordNet. The two annotators achieve 79.8% agreement; the synsets with conflicts are resolved by a third NLP researcher. Afterward, we apply the synset-event type mapping obtained in the data generation procedure to annotate the 500 sampled synsets with event types. The event types provided by the mapping are then compared with those from the annotators, leading to 83.6%, 78.6% and 81.0% as the precision, recall, and F1 scores, respectively.
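The reported F1 score is the harmonic mean of the reported precision and recall, which can be checked directly. The sketch below also shows one way such scores could be computed from the mapping and the adjudicated annotations; treating Other as the negative class is an assumption, since the text does not spell out the scoring convention, and the function names and toy labels are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def score_mapping(predicted, gold, negative="Other"):
    """Precision/recall/F1 over synsets, counting `negative` as 'no event type'."""
    correct = sum(p == g and g != negative for p, g in zip(predicted, gold))
    n_pred = sum(p != negative for p in predicted)
    n_gold = sum(g != negative for g in gold)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Consistency check on the reported numbers (in percent).
reported_f1 = round(f1_score(83.6, 78.6), 1)

# Toy labels, for illustration only.
toy_pred = ["A", "B", "Other", "A"]
toy_gold = ["A", "Other", "B", "B"]
toy_p, toy_r, toy_f = score_mapping(toy_pred, toy_gold)
```

Here `reported_f1` evaluates to 81.0, in agreement with the value stated above.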

Evaluation
Models and Data: In order to understand the complexity of the FedSemcor dataset for FED, this section evaluates the performance of the state-of-the-art models for the traditional ED problem on this dataset. In particular, we first split FedSemcor into training, development and test data using a 6:2:2 ratio over the entire dataset. Table 2 presents the statistics for these data portions. Note that, similar to some prior ED work (Nguyen and Grishman, 2015; Chen et al., 2015), our FED problem is formulated as a word classification problem where, given a word in an input sentence, the models need to predict the event type for the word.
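A 6:2:2 split can be sketched as follows. The text does not state whether the split is random, stratified by event type, or done at the document level, so the shuffled example-level split and the fixed seed below are assumptions made only for illustration.

```python
import random


def split_dataset(examples, ratios=(6, 2, 2), seed=0):
    """Shuffle examples and cut them into train/dev/test by the given ratios."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # local RNG keeps the split reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_dev = len(items) * ratios[1] // total
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


train, dev, test = split_dataset(range(100))
```

Any remainder left by the integer division ends up in the test portion, so no example is lost.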
Afterward, we consider the following representative models for ED: CNN (Nguyen and Grish[...]
For the experiments in this work, we re-tune the hyper-parameters of the models on the development set of FedSemcor. In particular, depending on which components each model has, we search over the following candidate values for the hyper-parameters: [100, 200, 300, 400, 500] for the dimensionality of the hidden vectors in the layers of all the feed-forward, BiLSTM, and GCN networks, [1, 2, 3] for the number of layers for BiLSTM and GCN, [16, 32, 64] for the mini-batch size, [1e-5, 1e-4, 1e-3, 1e-2, 1e-1] for the learning rate of the Adam optimizer, and [10, 20, 30, 40, 50] for the dimensions of the feature embeddings, i.e., position embeddings in CNN (Nguyen and Grishman, 2015; Chen et al., 2015).
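The search space above can be written down explicitly, as in the following sketch. Whether the tuning is an exhaustive grid search or a sampled search is not stated, so the exhaustive enumeration is an assumption; the dictionary keys and the `used` parameter are hypothetical names.

```python
from itertools import product

SEARCH_SPACE = {
    "hidden_dim": [100, 200, 300, 400, 500],
    "num_layers": [1, 2, 3],            # BiLSTM and GCN only
    "batch_size": [16, 32, 64],
    "learning_rate": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "feature_dim": [10, 20, 30, 40, 50],  # e.g., position embeddings in CNN
}


def configurations(space, used=None):
    """Yield every combination of the hyper-parameters a model actually has."""
    keys = [k for k in space if used is None or k in used]
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


all_configs = list(configurations(SEARCH_SPACE))
small_configs = list(
    configurations(SEARCH_SPACE, used={"hidden_dim", "batch_size", "learning_rate"})
)
```

The full space contains 5 × 3 × 3 × 5 × 5 = 1,125 combinations, while a model that only has a hidden size, a batch size and a learning rate has 75, which illustrates why the values searched depend on the components of each model.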
Finally, in order to demonstrate the benefit of converting Semcor (i.e., a WSD dataset) into FedSemcor for FED, we consider a WSD-based baseline for FED where the state-of-the-art WSD model of Hadiwinoto et al. (2019) is trained on the training data of FedSemcor. As this is a WSD model, instead of using the mapped event types (i.e., the 450 types) as the labels for the examples in the training data, we employ the original word senses of the words as the labels to train this model. Afterward, we apply the trained WSD model to the test data of FedSemcor, producing a word sense for each example. In the last step, the mapping M-WordNet-Event is used to convert the predicted word senses for the test set examples into event types for FedSemcor, which are then evaluated to obtain the FED performance for this baseline (called WSD-based). Note that this WSD model also uses the BERT embeddings.
Results: Table 3 shows the performance of the models on the test set of FedSemcor. From the table, we see that GCN has the best performance among the models with word2vec, while BERT-ED outperforms all the other BERT-based models. However, the best performance on FedSemcor (i.e., 65.0% F1 score with BERT-ED) is still far behind the typical performance (i.e., up to 80.7% in (Yang et al., 2019)) of the models on the traditional ED datasets (i.e., ACE 2005). This suggests that FedSemcor and FED are more challenging than traditional ED, presenting a challenge for future research in this area. Importantly, the performance of the ED models (i.e., with the BERT embeddings) is significantly better than that of the WSD-based baseline (i.e., up to a 9% performance gap with BERT-ED), clearly testifying to the advantages of converting Semcor into FedSemcor for FED.

Related Work
ED has been studied extensively in the last decade, with feature-based models (Ahn, 2006; Ji and Grishman, 2008; Li et al., 2013, 2015), deep learning models (Chen et al., 2015; [...] 2016b,a; Nguyen and Grishman, 2016; Liu et al., 2018; Yan et al., 2019; Ngo et al., 2020; Lai et al., 2020b), and few/zero-shot learning models (Huang et al., 2018; Lai and Nguyen, 2019; Lai et al., 2020a). The rapid development of such models has been facilitated by the availability of ED datasets in different domains, including the general domain with the popular ACE and TAC KBP datasets (Walker et al., 2006; Mitamura et al., 2015, 2016), the biomedical domain (Kim et al., 2009, 2011), literature (Sims et al., 2019), cybersecurity (Satyapanich et al., 2020; Man Duc Trong et al., 2020), and the open domain (Araki and Mitamura, 2018; Liu et al., 2019b). However, these datasets only involve a small number of event types, and none of them has considered ED with many fine-grained event types as we do. Our FED task is also related to fine-grained entity typing, which aims to classify entity mentions into a fine-grained set of types (Karn et al., 2017; Shimaoka et al., 2016; Lin and Ji, 2019). The techniques to generate datasets for fine-grained entity typing include distant supervision (Ling and Weld, 2012; Abhishek et al., 2017) and manual annotation (Murty et al., 2018; Choi et al., 2018). Notably, Del Corro et al. (2015) also use WordNet to establish fine-grained entity types, applying different entity mention extractors over an external corpus. Our work is different in that we focus on fine-grained event types and use the manually annotated corpus Semcor to generate data.

Conclusion
We study the new task of FED, featuring 449 fine-grained event types in the dataset for ED. We introduce a novel method to generate the evaluation dataset for FED, leveraging manually annotated WSD datasets (i.e., Semcor) and the eventive synsets in WordNet. We evaluate the state-of-the-art ED models on the new dataset to show the opportunities for future research on FED.