Treasures Outside Contexts: Improving Event Detection via Global Statistics

Event detection (ED) aims to identify event instances of specified types in given texts and has been formalized as a sequence labeling task. To the best of our knowledge, existing neural-based ED models make decisions relying entirely on the contextual semantic features of each word in the input text, which we find are easily confused by the varied contexts in the test stage. To this end, we propose introducing a set of statistical features derived from word-event co-occurrence frequencies over the entire training set to cooperate with the contextual features. Specifically, we propose a Semantic and Statistic-Joint Discriminative Network (S²-JDN) consisting of a semantic feature extractor, a statistical feature extractor, and a joint event discriminator. In experiments, S²-JDN effectively exceeds ten recent strong baselines on the ACE2005 and KBP2015 datasets. Further, we perform extensive experiments to comprehensively probe S²-JDN.


Introduction
Event detection (ED) is an important information extraction task in NLP, which aims to identify event instances of specified types in given texts. Associated with each event mention is a phrase, i.e., the event trigger¹, evoking that event. More precisely, the task involves identifying event triggers and classifying them into specific types. For instance, according to the ACE2005 annotation guidelines, in the sentence "A police officer was killed in New Jersey today", an ED model should recognize the word "killed" as a trigger for the event type "Die".
* Corresponding author.
¹ The event trigger is usually a single verb or nominalization; some papers also refer to a multi-word trigger as an event nugget. In this paper, we uniformly use "trigger".

[Figure 1: A case study of the SOTA model DMBERT (Wang et al., 2019) and our proposed model. We train DMBERT and our proposed model on ACE2005 and select two texts (T1 and T2) with "killed" and "wounded" as triggers in the training set. For testing, we change the contexts of the two triggers in T1 and T2 to obtain V1 to V4, and let the two ED models predict their event types in the new contexts. DMBERT is easily confused in the new contexts, while our model is obviously more stable. Detailed explanations are in Section 4.3.5.]

With the development of deep learning, ED has been formalized as a sequence labeling task and implemented by a variety of neural network models (Chen et al., 2015; Zhao et al., 2018). As claimed by the famous distributional hypothesis (Harris, 1954; Firth, 1957), the words in a text are characterized by their contexts, i.e., themselves and their surrounding words. Thus, to fully exploit the information of each word in the sequential prediction process, existing neural-based ED models make decisions based on elaborately extracted contextual features of the words (Hong et al., 2018; Tong et al., 2020b,a), which has become a "standard" decision-making pattern. Theoretically, the contextual information in a given text is enough to determine the event type of each word. In practice, however, an event can be described by various contexts, and the contexts in the test stage are usually not covered by the training set. Therefore, a common shortcoming of existing neural-based ED models is that they are prone to confusion by the changeable contexts during testing. As shown in Figure 1, the variation of any surrounding word leads to a different set of extracted contextual features and may misguide the final judgment.
To alleviate this shortcoming, a natural idea is to seek an additional "stable" decision-making basis that remains unchanged across varied contexts to cooperate with the contextual features. Therefore, we envisage extracting a set of event features for each word from the entire training set through global statistics: word-event co-occurrence frequencies.
Global statistics were widely used by many early pattern-based and feature-based ED models (Liao and Grishman, 2010; Li et al., 2013). However, with the introduction of deep learning technologies (Chen et al., 2015), various powerful neural-based ED models have directly extracted contextual semantic features from each text via end-to-end architectures, and the use of global statistics has gradually faded from researchers' attention. In our work, we find that the word-event co-occurrence frequencies in ED datasets have the potential to help neural-based ED models eliminate a large number of interference options during testing (Section 3.1). Thus, as the title suggests, valuable features for neural-based ED models can be collected outside the contexts.
Specifically, we design a simple but novel Semantic and Statistic-Joint Discriminative Network (S²-JDN) to implement the new decision-making pattern, which consists of three modules. (i) The semantic feature extractor takes Dynamic Multi-pooling BERT (Wang et al., 2019) as its prototype and obtains token-level contextual features from each text. (ii) The statistical feature extractor mines event features in both direct and indirect ways from word-event co-occurrence frequencies, and then fuses and rescales them into the final statistical event features. (iii) The joint event discriminator adopts layer normalization to unify the two types of event features and combines them for decision-making.
In experiments, we compare with ten strong baselines on the ACE2005 and KBP2015 datasets, where S²-JDN exceeds the state-of-the-art (SOTA) models by 1.9% and 1.9% in trigger classification F1, respectively. Further, we perform extensive experiments and draw multiple useful conclusions about S²-JDN.
Our contributions can be summarized as follows: (1) To the best of our knowledge, all existing neural-based ED models in the NLP field make decisions entirely based on contextual features (i.e., the standard pattern), and we are the first to introduce statistical features as an additional stable decision-making basis.
(2) We propose S²-JDN for the ED task, which takes into account both the (contextual) semantic features and the statistical features of each word to make a decision. Specifically, the statistical feature extractor of S²-JDN mines features in both direct and indirect ways from word-event co-occurrence frequencies, thereby acquiring abundant statistical event features.
(3) We demonstrate that S²-JDN effectively exceeds ten strong baselines on the ACE2005 and KBP2015 datasets, and conduct extensive exploration experiments to comprehensively probe S²-JDN.

Related Work
As the key component of event extraction systems (Yang et al., 2019; Li et al., 2020; Ferguson et al., 2018), research on ED has gone through the periods of traditional methods and deep learning methods.
During the period of traditional methods, global statistics collected from the training set were widely used as knowledge sources or decision-making bases of different ED models (Saurí et al., 2005; Ahn, 2006; Ma and Cisar, 2009; Liao and Grishman, 2010; Li et al., 2013). Grishman et al. (2005) and Shinyama and Sekine (2006) used statistical information to train a MaxEnt event classifier. Wan et al. (2009) constructed a frequency pattern-based framework for ED. Ji and Grishman (2008) obtained document-wide and cluster-wide statistics to correct the results of ED. Qin et al. (2013) proposed a classifier-based method to process statistical features for event filtering. Cao et al. (2015) recorded the frequency with which each pattern is associated with an event type and treated the frequencies as core features.
With the introduction of deep learning technologies, many neural-based ED models extract contextual semantic features from each text via end-to-end architectures (Orr et al., 2018; Lai et al., 2020). Chen et al. (2015) acquired contextual features via convolution and dynamic multi-pooling techniques. Liu et al. (2018) and Zhao et al. (2018) employed attention mechanisms during contextual feature extraction. Nguyen and , Ding et al. (2019), and Yan et al. (2019) introduced external knowledge to assist neural networks in better understanding each given text. Recently, many studies have applied powerful pre-trained language models to better comprehend contexts (Du and Cardie, 2020; Liu et al., 2020a,b; Huang and Ji, 2020).
Neural-based ED models significantly outperform the traditional ones and have become the new research hotspot. Accordingly, the use of global statistics has faded from researchers' attention. To the best of our knowledge, no neural-based ED model explicitly uses global statistics from the training sets. Although neural networks possess powerful learning ability in theory, their training is based on individual sentences, so it is not easy for them to capture global statistics. More importantly, we observe that the word-event co-occurrence frequencies of most words concentrate on only a few events, which has the potential to help neural-based ED models eliminate a large number of interference options during testing (Section 3.1). To this end, we propose S²-JDN, which takes into account both the (contextual) semantic features and the statistical features of each word to make a decision. Besides directly employing the collected global statistics as features, like traditional ED models, S²-JDN also leverages the properties of neural networks to indirectly extract more abundant features from word-event co-occurrence frequencies (Section 3.2.2).

Methodology
In this section, we first elaborate the motivation for introducing word-event co-occurrence frequencies (Section 3.1), then propose a concrete instantiation called the Semantic and Statistic-Joint Discriminative Network (S²-JDN) (Section 3.2), and finally describe the training details of S²-JDN (Section 3.3).

Motivation for Introducing Word-Event Co-occurrence Frequencies

The benefits of word-event co-occurrence frequencies for neural-based ED models are as follows:

Stability and Accessibility. First of all, word-event co-occurrence frequencies are collected from the entire training set and are not disturbed by varying contexts, which satisfies our requirement of stability. With their assistance, neural-based ED models can more easily make correct predictions in the changeable contexts encountered during testing. Also, the collection of these statistics does not rely on any external tools or data.
Clear Directivity. The event set of an ED dataset consists of many event types and a special "Other" type for non-triggers. As plotted in Figure 2, an ED dataset often contains dozens of events, but most words can only evoke at most three of them. This phenomenon can be explained by a characteristic of the ED task: although the total number of events in a realistic ED dataset can be large, most triggers are single words with limited meaning and can only evoke very few events. This characteristic is reflected by the word-event co-occurrence frequencies in the ED dataset. We refer to the trait that word-event co-occurrence frequencies concentrate on only a few events as clear directivity. Therefore, such statistics can provide clear indications for neural-based ED models and have great potential to help them eliminate many interference options during testing.

[Figure 2: Statistics on the number of events that can be evoked by each trigger word in the ACE2005 and KBP2015 datasets. In our statistics, a word is regarded as a trigger word if it evokes a specified event in some text in the dataset. When counting the number of events evoked by a trigger word, "Other" is also considered. k-E-T (k = 1, 2, 3) denotes the proportion of trigger words evoking k events over all trigger words, and O-T corresponds to the remaining trigger words.]
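To make the notion concrete, the statistics above can be gathered with a few lines of code. The following sketch (with illustrative toy data, not the authors' implementation) counts word-event co-occurrences over a training set and shows how many event types a given word evokes:

```python
# Sketch: collecting word-event co-occurrence counts from a toy
# training set; the sentences below are illustrative, not from ACE2005.
from collections import Counter, defaultdict

def cooccurrence_counts(annotated_sentences):
    """annotated_sentences: list of [(word, event_type), ...] lists,
    where every non-trigger word carries the special type "Other"."""
    counts = defaultdict(Counter)
    for sentence in annotated_sentences:
        for word, event in sentence:
            counts[word][event] += 1
    return counts

train = [
    [("police", "Other"), ("officer", "Other"), ("killed", "Die")],
    [("the", "Other"), ("blast", "Attack"), ("killed", "Die")],
    [("time", "Other"), ("killed", "Other")],  # idiomatic, non-trigger use
]
counts = cooccurrence_counts(train)
```

Here `counts["killed"]` concentrates on just two labels ("Die" and "Other"), mirroring the clear directivity plotted in Figure 2.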

Semantic and Statistic-Joint Discriminative Network
To take advantage of both the contextual semantic features and the statistical features extracted from word-event co-occurrence frequencies, we propose a simple but novel Semantic and Statistic-Joint Discriminative Network (S²-JDN), illustrated in Figure 3, which consists of a semantic feature extractor, a statistical feature extractor, and a joint event discriminator.

Semantic Feature Extractor
The semantic feature extractor collects contextual features from given texts and is instantiated by Dynamic Multi-pooling BERT (Wang et al., 2019) in this work. Other choices can be explored in future work.
Let s_w = (w_1, ..., w_m) denote the input word sequence. After wordpiece tokenization (Wu et al., 2016), s_w is further decomposed into ([CLS], t_{1,1}, ..., t_{1,k_1}, ..., t_{m,1}, ..., t_{m,k_m}), where [CLS] is a special token of BERT (Devlin et al., 2018) and t_{i,j} is the j-th token of w_i. We use e_{i,j} to denote BERT's input corresponding to t_{i,j}, which is the sum of the token and position embeddings. BERT's output for e_{i,j} is h_{i,j}. Next, (h_{1,1}, ..., h_{m,k_m}) are processed by a dynamic multi-pooling operation, whose output for h_{i,j} is denoted c_{i,j} and calculated as:

c^z_{i,j} = max(h^z_{1,1}, ..., h^z_{i,j}) + max(h^z_{i,j}, ..., h^z_{m,k_m}),    (1)

where c^z_{i,j} and h^z_{i,j} are the z-th features of c_{i,j} and h_{i,j}. Dynamic multi-pooling extracts important features on both sides of t_{i,j} to form its contextual vector. Since the size of h_{i,j} is large enough, we combine the pooling results by addition rather than concatenation.
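A minimal numpy sketch of this pooling step, under the assumption that the BERT outputs are given as a (seq_len, hidden) matrix, is:

```python
# Sketch of dynamic multi-pooling with addition: for each token
# position, max-pool the features up to and including that position,
# max-pool from that position onward, and add the two results.
import numpy as np

def dynamic_multi_pool(h):
    """h: (seq_len, hidden) token outputs. Returns c with
    c[t] = max(h[0..t], axis=0) + max(h[t..end], axis=0)."""
    left = np.maximum.accumulate(h, axis=0)               # prefix max
    right = np.maximum.accumulate(h[::-1], axis=0)[::-1]  # suffix max
    return left + right
```

For instance, on the toy input [[1,0],[0,2],[3,1]] the middle token receives [4,4], combining the strongest feature from each side of it.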

Statistical Feature Extractor
Parallel to the semantic feature extractor, the statistical feature extractor mines a set of features from word-event co-occurrence frequencies as additional decision-making basis. To take full advantage of these statistics, we extract features in both direct and indirect ways.
Extracting Direct Statistical Features: For word w_i in text s_w, we denote its word-event co-occurrence frequency vector as f_i = (n^1_i, ..., n^K_i), where n^z_i is the number of times w_i evokes the z-th event type in the training set, and K is the total number of event types (including "Other"). f_i is normalized to a vector of direct statistical features:

f̂_i = f_i / n_i,    (2)

where n_i = Σ^K_{z=1} n^z_i is the total number of w_i's occurrences in the training set. In testing, f̂_i = 0 for words unseen in the training set. Although most f̂_i have clear directivity and can reflect the global event information of w_i (Section 3.1), their dimension is much smaller than that of the contextual vectors c_{i,j}. Therefore, the information provided by f̂_i might be insufficient to guide the final decision. In light of this, we propose a temporary training task named Frequency-Supervised Multi-Label Classification (FSMLC), which finetunes the generic word embedding of each word w_i to acquire more statistical features indirectly.
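Under the definitions above, the direct statistical features can be sketched as follows (toy event inventory and counts, not the authors' code):

```python
# Sketch: normalized word-event frequency vector f_hat_i (Eq. 2).
import numpy as np

EVENTS = ["Other", "Die", "Attack"]  # toy inventory, K = 3

def direct_features(word, counts):
    """counts: dict word -> {event_type: frequency}."""
    f = np.array([counts.get(word, {}).get(e, 0) for e in EVENTS], float)
    n = f.sum()                      # n_i: total occurrences of the word
    return f / n if n > 0 else f     # all-zeros vector for unseen words

counts = {"killed": {"Die": 3, "Attack": 1}}
```

Here `direct_features("killed", counts)` gives (0, 0.75, 0.25), while a word unseen in training maps to the zero vector.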
Extracting Indirect Statistical Features: As mentioned above, FSMLC encodes statistical information into the generic word embedding e_{w_i} of w_i by finetuning. In our work, e_{w_i} is the 300-dimensional GloVe embedding (Pennington et al., 2014). Concretely, FSMLC first maps e_{w_i} into a new vector ẽ_{w_i} by a linear transformation, then feeds ẽ_{w_i} into a temporary event classifier and adopts the normalized frequencies in f̂_i as the targets for training:

ẽ_{w_i} = M_1 · e_{w_i},    (3)

L_FSMLC(s_w) = -Σ^m_{i=1} Σ^K_{z=1} f̂^z_i log p̃^z_i,    (4)

where p̃^z_i and ũ_z are the predicted probability and trainable projection vector of the z-th event type, and M_1 is the trainable linear transformation matrix. Now we explain the settings of FSMLC. According to previous studies (Bespalov et al., 2012; Tang et al., 2014), word embeddings can gain the ability to represent specific information via targeted training. Therefore, the temporary training task FSMLC finetunes the generic semantic features within e_{w_i} into a new set in ẽ_{w_i} that better represents the events evoked by w_i in the training set, thereby indirectly using f̂_i to obtain more features. Note that the trained ẽ_{w_i} possesses both semantic information and word-event co-occurrence information, but we refer to it as an indirect statistical feature vector. Finally, the loss of FSMLC on the entire training set is:

L_FSMLC = -Σ_{w_i ∈ V} Σ^K_{z=1} f̂^z_i log p̃^z_i,    (5)

where V is the vocabulary of the training set, so that all words receive the same level of training. In testing, the temporary classifier is discarded.
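The FSMLC step can be sketched as below. The softmax classifier form, the toy dimensions, and the random initialization are assumptions for illustration; in the real model, M_1 and the projection vectors ũ_z are trained jointly with the rest of the network:

```python
# Sketch: FSMLC loss for one word -- transform the embedding (Eq. 3),
# classify with a temporary softmax head, and use the normalized
# frequencies f_hat as soft cross-entropy targets (Eq. 4).
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 8                                   # toy event count, embedding dim
M1 = rng.normal(scale=0.1, size=(d, d))       # trainable transformation
U_tilde = rng.normal(scale=0.1, size=(K, d))  # temporary projection vectors

def fsmlc_loss(e_w, f_hat):
    e_tilde = M1 @ e_w                        # indirect statistical features
    logits = U_tilde @ e_tilde
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.sum(f_hat * np.log(p + 1e-12))
```

Minimizing this loss pushes the transformed embedding to predict the word's frequency profile, which is how the co-occurrence information is absorbed indirectly.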
Feature Fusing and Rescaling: The statistical event vector of w_i is the fusion of f̂_i and ẽ_{w_i}:

v_i = ṽ_i + M_2 · ṽ_i,    (6)

where ṽ_i = [f̂_i ; ẽ_{w_i}] and M_2 is a trainable fusion matrix. Although Eq.(6) is formally equivalent to a linear transformation of ṽ_i, we find that separating out ṽ_i and adding it to M_2 · ṽ_i sometimes performs better in practice.

As is well known, the credibility of statistics is closely related to the number of each word's occurrences: the fewer times a word appears in the training set, the less reliable its word-event co-occurrence frequencies are. To weaken the influence of low-frequency words' statistical features, we rescale v_i as:

v_i ← (min(n_i, c) / c) · v_i,    (7)

where c is an integer hyperparameter denoting the occurrence threshold above which a word's statistics can be trusted. For n_i < c, v_i is scaled down, and v_i = 0 for words unseen in the training set.
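A compact sketch of the fusion and rescaling steps, with a zero fusion matrix assumed purely for illustration:

```python
# Sketch: fuse f_hat with the finetuned embedding (Eq. 6) and rescale
# by how often the word was seen in training (Eq. 7).
import numpy as np

def fuse_and_rescale(f_hat, e_tilde, M2, n, c=4):
    v_tilde = np.concatenate([f_hat, e_tilde])   # [f_hat ; e_tilde]
    v = v_tilde + M2 @ v_tilde                   # residual-style fusion, Eq. (6)
    return (min(n, c) / c) * v                   # Eq. (7): zero when n = 0
```

With M2 = 0, n = 2, and c = 4, the output is simply half of the concatenated vector, and a word unseen in training (n = 0) yields the zero vector.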

Joint Event Discriminator
The joint event discriminator combines the token-level semantic vectors c_{i,j} and the word-level statistical vector v_i for decision-making. First, each c_{i,j} (j ∈ {1, ..., k_i}) is concatenated with v_i to form the token-level event vector r_{i,j} = [c_{i,j} ; v_i]. Since the features in c_{i,j} and v_i come from two different sources, we apply a simple layer normalization to unify them:

μ_{i,j} = (1/d_r) Σ^{d_r}_{z=1} r^z_{i,j},    (8)

σ²_{i,j} = (1/d_r) Σ^{d_r}_{z=1} (r^z_{i,j} - μ_{i,j})²,    (9)

r̂^z_{i,j} = (r^z_{i,j} - μ_{i,j}) / sqrt(σ²_{i,j} + ε),    (10)

where r^z_{i,j} and r̂^z_{i,j} are the z-th features of r_{i,j} and r̂_{i,j}, and d_r is the dimension of r̂_{i,j}.
In testing, if n_i = 0 we have v_i = 0, but Eq.(8-10) changes the statistical part of r̂_{i,j} to non-zero values. We therefore multiply r̂_{i,j} by a vector to zero out the statistical part again after normalization, i.e., r̂_{i,j} ← r̂_{i,j} ⊙ [1 ; I^sta_i], where I^sta_i corresponds to the statistical part of r̂_{i,j}, I^sta_i = 0 if n_i = 0 and I^sta_i = 1 otherwise, and ⊙ is element-wise multiplication.
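The normalize-then-mask step can be sketched as follows; the split point between the semantic and statistical parts is an assumed parameter here:

```python
# Sketch: layer-normalize the concatenated vector (Eqs. 8-10), then
# re-zero the statistical tail for words unseen in training.
import numpy as np

def normalize_and_mask(r, d_sta, seen, eps=1e-5):
    """r: concatenated [c_ij ; v_i]; d_sta: length of the statistical part."""
    mu = r.mean()
    var = r.var()
    r_hat = (r - mu) / np.sqrt(var + eps)
    if not seen:                     # n_i = 0: statistics are meaningless
        r_hat[-d_sta:] = 0.0
    return r_hat
```

The masking matters because layer normalization shifts the zero statistical entries of an unseen word to non-zero values, which would otherwise inject noise into the final classifier.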
Then, r̂_{i,j} (j = 1, ..., k_i) are fed into a Softmax layer to calculate the token-level prediction distributions p_{i,j}:

p^z_{i,j} = exp(u_z · r̂_{i,j}) / Σ^K_{z'=1} exp(u_{z'} · r̂_{i,j}),    (11)

where p^z_{i,j} and u_z are the predicted probability and trainable projection vector of the z-th event type, respectively. The prediction distribution of word w_i is the average of its token distributions, i.e., p_i = (1/k_i) Σ^{k_i}_{j=1} p_{i,j}.
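The softmax and wordpiece-averaging steps can be sketched as below, with the projection vectors u_z stacked into a matrix:

```python
# Sketch: per-token softmax over event types (Eq. 11), then average
# the token distributions to obtain the word-level distribution p_i.
import numpy as np

def word_distribution(r_hats, U):
    """r_hats: (k_i, d) normalized token vectors; U: (K, d) projections."""
    logits = r_hats @ U.T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)            # p_{i,j} per token
    return p.mean(axis=0)                        # p_i, averaged over tokens
```

The averaged distribution still sums to 1, so it can be consumed directly by the cross-entropy loss in Section 3.3.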

Training of S 2 -JDN
In the training of S²-JDN, besides BERT, the trainable parameters also include ũ_z and M_1 in Eq.(3-4), M_2 in Eq.(6), and u_z in Eq.(11). We jointly train the main ED task and the temporary FSMLC task, and the total loss for sequence s_w is:

L(s_w) = -Σ^m_{i=1} Σ^K_{z=1} q^z_i log p^z_i + β · L_FSMLC,    (12)

where q_i is the one-hot ground-truth event label of w_i and β is the coefficient of FSMLC's loss. In experiments, the loss in Eq.(12) is optimized with the Adam optimizer (Kingma and Ba, 2014).
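The joint objective can be assembled as in this sketch (β and the toy tensors are illustrative values, not the tuned hyperparameters):

```python
# Sketch: total loss (Eq. 12) -- cross-entropy on the word-level ED
# predictions plus beta times the FSMLC auxiliary loss.
import numpy as np

def total_loss(p, q, fsmlc, beta=0.1):
    """p: (m, K) predicted distributions; q: (m, K) one-hot labels."""
    ed = -np.sum(q * np.log(p + 1e-12))  # ED cross-entropy term
    return ed + beta * fsmlc
```

For example, with p = [[0.5, 0.5]], q = [[1, 0]], fsmlc = 2.0, and beta = 0.5, the loss is -log 0.5 + 1.0 ≈ 1.693.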

Experimental Setups
Datasets: We take two benchmark datasets, ACE2005 and KBP2015, for evaluation. ACE2005/KBP2015 contains 599/360 documents and 33/38 specified event types. For ACE2005, we follow previous studies (Peng et al., 2016; Du and Cardie, 2020; Liu et al., 2020b; Lai et al., 2020) and use 40 documents from the newswire domain for testing, 30 documents for validation, and the rest for training. For KBP2015, we use the official test set and split about 20% of the sentences from the training set for validation. Their statistics are presented in Table 1.

Main Experiments
Baselines: We take SOTA models without external ED-related knowledge on ACE2005 and KBP2015 as baselines for fair comparisons. The baselines include: MSEP (Peng et al., 2016) develops an event detection and co-reference system with minimal supervision; FBRNN (Ghaeini et al., 2016)

Results and Analysis

By convention, we use precision, recall, and micro-F1 as metrics, which are presented in Table 2.
As shown in Table 2, on ACE2005, S²-JDN exceeds the SOTA models EEGCN/G-GCN by 3.6%/1.5%, 0.2%/2.5%, and 1.9%/1.9% in precision, recall, and F1, respectively. On KBP2015, DMBERT and S²-JDN are significantly better than MSEP, FBRNN, GCNED, and TACTop with the help of pre-trained BERT, and S²-JDN further outperforms DMBERT by 2.3%, 1.6%, and 1.9% in precision, recall, and F1. All the neural-based baselines employ the standard decision-making pattern, and RCEE, ED-QA, M-FULL, G-GCN, PLMEE, and DMBERT are powered by BERT. These results show the effectiveness of the statistical features that S²-JDN introduces from word-event co-occurrence frequencies. We can therefore infer that the new decision-making pattern used by S²-JDN is more suitable for the ED task than the standard one.

Ablation Study
We evaluate the importance of each part of S²-JDN via ablation studies on five variants. V1 removes the indirect statistical features (ISF) by substituting f̂_i for v_i. V2 omits the direct statistical features (DSF) by replacing v_i with ẽ_{w_i}. V3 removes the temporary training task FSMLC by setting β = 0. V4 omits layer normalization (LN) in the joint event discriminator by directly inputting r_{i,j} into the final Softmax layer. V5 eliminates the introduction of global statistics (GS) by replacing v_i with the GloVe embedding e_{w_i}. The results are reported in Table 3.
We analyze the results in Table 3 from five aspects. (i) V0 outperforms V1-V5 on all datasets, which shows that each part of S²-JDN has a positive effect on overall performance. (ii) V1 is worse than V2. This is because the information in the direct statistical features is relatively deficient and sparse, so it is harder for them to change the final decisions; the indirect statistical features are therefore more powerful than the direct ones. (iii) V3 is better than V1, which means that simply using the generic word embeddings as the indirect statistical features can also bring some improvement. We speculate that this is because the word embeddings indicate the "identities" of the words, making it easier to recognize words that consistently evoke certain events. (iv) V4 is worse than V2, so layer normalization is necessary; its importance even exceeds that of the direct statistical features. (v) Theoretically, V5 should be the worst among V1-V5, but it performs slightly better than V1. Analyses (ii) and (iii) above can be combined to explain this: on the one hand, the deficient and sparse direct statistical features in V1 are hard to exploit; on the other hand, the generic word embeddings used by V5 can still indicate the "identities" of the words.

To further ensure the credibility of the results, we train each variant three times with different random seeds, which shows that the standard deviations of S²-JDN are about 0.3% on both ACE2005 and KBP2015. For the other five variants, the standard deviations are within 0.2%-0.5%. In conclusion, our proposed model is stable, and the improvements over the five variants are significant.

Effects of Statistics with Different Degrees of Directivity
In Section 3.1, we pointed out that an advantage of word-event co-occurrence frequencies is their clear directivity, which stems from the characteristic of the ED task: the fewer the events evoked by a word, the clearer the directivity. In this experiment, we study the effect of statistics with different degrees of directivity. Concretely, we divide the words into four categories during testing: C1/C2/C3/C4 respectively correspond to the words evoking 1/2/3/≥4 events in the training set. For comparison, we construct a base model with only the semantic event features (Section 3.2.1) as its decision-making basis and denote it as Base. The results of the base model and S²-JDN on the four categories of words are reported in Table 4.
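The partition used here can be expressed as a small helper (illustrative counts; unseen words are handled separately in Section 4.3.4):

```python
# Sketch: assign a test word to C1-C4 by how many distinct event types
# (including "Other") it evokes in the training counts.
def directivity_category(word, counts):
    k = len(counts.get(word, {}))
    if k == 0:
        return None                  # unseen in training
    return "C4" if k >= 4 else "C" + str(k)

counts = {"killed": {"Die": 2, "Attack": 1},
          "today": {"Other": 5}}
```

A word in C1 always carries the same label in training, so its statistics give the strongest, least ambiguous signal.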
From Table 4, we find that S²-JDN brings a certain improvement on each category, and its advantage decreases across C1, C2, C3, and C4. That is, the clearer the directivity of the word-event co-occurrence frequencies, the greater the effect of the statistical features. Thus, the effectiveness of S²-JDN is closely related to the characteristic of the ED task.

Performance on Predicting Global and Non-Global Events
In the training set, we refer to the most frequent event (including "Other") evoked by a word as its global event, which corresponds to the word's maximum word-event co-occurrence frequency. Are the statistical features introduced by S²-JDN only helpful for predicting global events, but not for words whose ground-truth labels in testing are non-global events? To answer this question, we conduct the following experiments.
In each test set, we gather all word instances wrongly predicted by an ED model into a collection A. From A, we further select the word instances whose ground-truth labels are their global events to form the collection A_G, and pick out the word instances whose ground-truth labels are non-global events to constitute the collection A_NG. Obviously, A = A_G ∪ A_NG. The instance proportions of A_G and A_NG in the entire test set respectively indicate the proportions of model mistakes caused by inconsistency with global events and with non-global events. The results of the base model and S²-JDN are presented in Table 5.
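This error partition can be sketched as follows (toy error list; the global event is taken as the argmax of a word's training frequencies):

```python
# Sketch: split wrongly predicted instances into A_G (gold label equals
# the word's global event) and A_NG (gold label is a non-global event).
def partition_errors(errors, counts):
    """errors: list of (word, gold_label) instances the model missed."""
    a_g, a_ng = [], []
    for word, gold in errors:
        freq = counts.get(word, {})
        global_event = max(freq, key=freq.get) if freq else None
        (a_g if gold == global_event else a_ng).append((word, gold))
    return a_g, a_ng

counts = {"killed": {"Die": 3, "Attack": 1}}
errors = [("killed", "Die"), ("killed", "Attack")]
```

In this toy example, the missed "Die" instance lands in A_G (its gold label is the global event of "killed"), while the missed "Attack" instance lands in A_NG.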
As shown in Table 5, after introducing the statistical features, the proportion of |A_G| decreases by 2.0% and 1.6% on the ACE2005 and KBP2015 test sets, and the proportion of |A_NG| correspondingly decreases by 0.8% and 0.7%. Therefore, the statistical features are not only helpful for predicting the most frequent events evoked by each word; they also bring certain benefits for words whose ground-truth labels are non-global events.

[Table 6: Trigger classification F1 scores (%) of Base, S²-JDN⁻, and S²-JDN on U (unseen words), L (low-frequency words), and H (high-frequency words).]

Effects on Words with Different Occurrence Frequencies
As discussed in Section 3.2.2, the credibility of a word's statistics decreases as its number of occurrences in the training set decreases. To this end, we rescale v_i as in Eq.(7) to weaken the impact of low-frequency words' statistical features. Now, we compare the model performance on words with different occurrence frequencies in the training set. Here we define low-frequency words according to the hyperparameter c in Eq.(7), which is set to 4 on both ACE2005 and KBP2015 (Section 4.1). Accordingly, we split the words in each test set into three parts: U = {words unseen in the training set}, L = {low-frequency words occurring 1, 2, or 3 times in the training set}, and H = {high-frequency words appearing at least 4 times in the training set}. Besides the base model and S²-JDN, we also test a variant of S²-JDN that removes the rescaling operation in Eq.(7) and the element-wise multiplication with [1 ; I^sta_i] after layer normalization in the joint event discriminator (Section 3.2.3), denoted S²-JDN⁻. Their results are shown in Table 6.
As expected, all models in Table 6 perform best on H and worst on U. When predicting the events of words in U, the statistical features are useless and even become interference, so the results of S²-JDN⁻ are significantly worse than those of the other two models, while S²-JDN avoids the interference via the two rescaling operations. On L and H, S²-JDN⁻ and S²-JDN are better than Base, owing to the effect of the statistical features. However, the training and testing of S²-JDN⁻ are more seriously disturbed by some low-frequency words' misleading statistical information, so it cannot achieve the same results as S²-JDN on L and H.

Case Study
In this subsection, we analyze the case study results of DMBERT and our S²-JDN shown in Figure 1. V1 is the simplest variant, which just cuts off the part after "killed" in T1. Although the contextual features of the two models change in V1, both models predict the correct event "Die". V2 and V3 exchange the surrounding words of "killed" and "wounded" in T1 and T2, which successfully confuses DMBERT, while S²-JDN still makes the correct decisions with the help of the statistical features. V4 is the most complex variant. Besides replacing the trigger word in T1 with "wounded", V4 also adds a long piece of content (the green part). This time, DMBERT cannot even identify "wounded" as a trigger, and S²-JDN also wrongly predicts the event as "Attack".
This experiment shows that the statistical features are indeed helpful in varied contexts. However, if the context changes too drastically, S²-JDN may still be misled.

Conclusion & Future Work
In this paper, we find that existing neural-based ED models are likely to be confused by changeable contexts during testing. To alleviate this problem, we propose the S²-JDN model, which extracts a set of statistical event features from word-event co-occurrence frequencies as an additional decision basis besides contextual information. Experimental results on the two benchmark datasets ACE2005 and KBP2015 against ten recent SOTA ED models demonstrate the effectiveness of S²-JDN and of each proposed module.
For future work, there are three intriguing directions: (i) extending the decision-making pattern of S²-JDN to other neural-based ED models and thus absorbing their advantages; (ii) incorporating the word-event co-occurrence frequencies into neural-based ED models via a prior on their output distributions in a Bayesian framework; and (iii) combining S²-JDN with data augmentation methods, so as to collect more accurate global statistics and further boost performance.