MLBiNet: A Cross-Sentence Collective Event Detection Network

We consider the problem of collectively detecting multiple events, particularly in cross-sentence settings. The key to dealing with the problem is to encode semantic information and model event inter-dependency at the document level. In this paper, we reformulate it as a Seq2Seq task and propose a Multi-Layer Bidirectional Network (MLBiNet) to capture the document-level association of events and semantic information simultaneously. Specifically, a bidirectional decoder is first devised to model event inter-dependency within a sentence when decoding the event tag vector sequence. Second, an information aggregation module is employed to aggregate sentence-level semantic and event tag information. Finally, we stack multiple bidirectional decoders and feed in cross-sentence information, forming a multi-layer bidirectional tagging architecture that iteratively propagates information across sentences. We show that our approach provides a significant performance improvement over the current state of the art.


Introduction
Event detection (ED) is a crucial sub-task of event extraction, which aims to identify and classify event triggers. For instance, given the document shown in Table 1, which contains six sentences {s_1, ..., s_6}, the ED system is required to identify four events: an Injure event triggered by "injuries", two Attack events triggered by "firing" and "fight", and a Die event triggered by "death".
Detecting event triggers from natural language text is a challenging task because of the following problems:

a) Sentence-level contextual representation and document-level information aggregation (Chen et al., 2018; Shen et al., 2020). In the ACE 2005 corpus, the arguments of a single event instance may be scattered across multiple sentences (Zheng et al., 2019; Ebner et al., 2019), which indicates that document-level information aggregation is critical for the ED task. What's more, a word in different contexts can express different meanings and trigger different events. For example, in Table 1, "firing" in s_3 can mean the action of firing guns (Attack event) or forcing somebody to leave their job (End-Position event). To specify its event type, cross-sentence information should be considered.

b) Intra-sentence and inter-sentence event inter-dependency modeling (Liao and Grishman, 2010; Chen et al., 2018). For s_4 in Table 1, an Attack event is triggered by "fight", and a Die event is triggered by "death". This kind of event co-occurrence is common in the ACE 2005 corpus; we investigated the dataset and found that about 44.4% of the triggers appear in this way. The cross-sentence event co-occurrence shown in s_4 and s_3 is also very common. Therefore, modeling sentence-level and document-level event inter-dependency is crucial for jointly detecting multiple events.

s1: what a brave young woman
s2: did you hear about the injuries[Injure] she sustained
s3: did you hear about the firing[Attack] she did
s4: she was going to fight[Attack] to the death[Die]
s5: she was captured but she was one tough cookie
s6: god bless here

Table 1: An example document in the ACE 2005 corpus with cross-sentence semantic enhancement and event inter-dependency. Specifically, semantic information of s_2 provides latent information to enhance s_3, and the Attack event in s_4 also contributes to s_3.

* Equal contribution and shared co-first authorship.
† Corresponding author.
1 The code is available at https://github.com/zjunlp/DocED.
To address those issues, previous approaches (Chen et al., 2015; Nguyen et al., 2016; Yan et al., 2019; Zhang et al., 2019) mainly focused on sentence-level event detection, neglecting document-level event inter-dependency and semantic information. Some studies (Chen et al., 2018) tried to integrate semantic information across sentences via the attention mechanism. For document-level event inter-dependency modeling, Liao and Grishman (2010) extended the features with event types to capture dependencies between different events in a document. Although great progress has been made in the ED task due to recent advances in deep learning, there is still no unified framework to model document-level semantic information and event inter-dependency.
We analyze the ACE 2005 data to re-understand the challenges encountered in the ED task. Firstly, we find that event detection is essentially a special Seq2Seq task, in which the source sequence is a given document or sentence, and the event tag sequence is the target of the task. Seq2Seq tasks can be effectively modeled via the RNN-based encoder-decoder framework, in which the encoder captures rich semantic information, while the decoder generates a sequence of target symbols with inter-dependency being captured. This separate encoder-decoder framework can correspondingly deal with the semantic aggregation and event inter-dependency modeling challenges in the ED task. Secondly, for the propagation of cross-sentence information, we find that the relevant information is mainly stored in several neighboring sentences, while little is stored in distant sentences. For example, as shown in Table 1, s_2 and s_4 contribute more to s_3 than s_1 and s_5 do.
In this paper, we propose a novel Multi-Layer Bidirectional Network (MLBiNet) for the ED task. A bidirectional decoder layer is first devised to decode the event tag vector corresponding to each token, with forward and backward event inter-dependency being captured. Then, the event-related information in the sentence is summarized through a sentence information aggregation module. Finally, a multiple bidirectional tagging layers stacking mechanism is proposed to propagate cross-sentence information between adjacent sentences and capture long-range information as the number of layers increases. We conducted experimental studies on the ACE 2005 corpus to demonstrate its benefits in cross-sentence joint event detection. Our contributions are summarized as follows:

• We propose a novel bidirectional decoder model to explicitly capture bidirectional event inter-dependency within a sentence, alleviating the long-range forgetting problem of traditional tagging structures;

• We propose a model called MLBiNet to propagate semantic and event inter-dependency information across sentences and detect multiple events collectively;

• We achieve the best performance (F_1 value) on the ACE 2005 corpus, surpassing the state-of-the-art by 1.9 points.

Approach
Generally, event detection on the ACE 2005 corpus is treated as a classification problem: for each token, we determine whether it forms part of an event trigger and, if so, which event type it triggers. Specifically, for a given document d = {s_1, ..., s_n}, where s_i = {w_{i,1}, ..., w_{i,n_i}} denotes the i-th sentence containing n_i tokens, we are required to predict the triggered event type sequence y_i = {y_{i,1}, ..., y_{i,n_i}} based on the contextual information of d. Without ambiguity, we omit the subscript i. For a given sentence, the event tags corresponding to the tokens are associated, which is important for collectively detecting multiple events (Chen et al., 2018). Classifying tokens independently misses this association. In order to capture the event inter-dependency, the sequential information of the event tags should be retained. Intuitively, the ED task can be regarded as an event tag sequence generation problem, which is essentially a Seq2Seq task. Specifically, the source sequence is a given document or sentence, and the event tag sequence to be generated is the target sequence. For instance, for the sentence "did you hear about the injuries she sustained", the decoder model is required to generate the tag sequence "O, O, O, O, O, B_Injure, O, O", where "O" denotes that the corresponding token is not part of an event trigger and "B_Injure" indicates an Injure event is triggered. We introduce the RNN-based encoder-decoder framework for the ED task, considering that it is an efficient solution for Seq2Seq tasks, and we propose a multi-layer bidirectional network called MLBiNet, shown in Figure 1, to deal with the challenges in detecting multiple events collectively. The model framework consists of four components: the semantic encoder, the bidirectional decoder, the information aggregation module, and the stacking of multiple bidirectional tagging layers.

Figure 1: The architecture of our multi-layer bidirectional network (MLBiNet). The red arrow represents the input of the semantic representation x_t, the green arrow represents the input of the adjacent-sentence information [I^{k-1}_{i-1}; I^{k-1}_{i+1}] integrated in the previous layer, and the blue arrow represents the input of the forward event tag vector.

We first introduce the encoder-decoder framework and discuss its compatibility with the ED task.

Encoder-Decoder
The RNN-based encoder-decoder framework (Cho et al., 2014; Bahdanau et al., 2015; Luong et al., 2015; Gu et al., 2016) consists of two components: a) an encoder, which converts the source sentence into a fixed-length vector c, and b) a decoder, which unfolds the context vector c into the target sentence. As formalized in Gu et al. (2016), the source sentence s_i is converted into a fixed-length vector c by the encoder RNN:

h_t = f(w_t, h_{t-1}),    c = φ({h_1, ..., h_{n_i}}),

where f is the RNN function, {h_t} are the RNN states, w_t is the t-th token of the source sentence, c is the so-called context vector, and φ summarizes the hidden states, e.g., by choosing the last state h_{n_i}. The decoder RNN then translates c into the target sentence according to

s_t = f(y_{t-1}, s_{t-1}, c),    p(y_t | y_{<t}, c) = g(y_{t-1}, s_t, c),    (1)

where s_t is the state at time t, y_t is the predicted symbol at time t, g is a classifier over the vocabulary, and y_{<t} denotes the history {y_1, ..., y_{t-1}}. Studies (Bahdanau et al., 2015; Luong et al., 2015) have shown that summarizing the entire source sentence into a fixed-length vector limits the performance of the decoder. They introduced the attention mechanism to dynamically change the context vector c_t in the decoding process, where c_t can be uniformly expressed as

c_t = Σ_{τ=1}^{n_i} α_{tτ} h_τ,    (2)

where α_{tτ} is the contribution weight of the τ-th source token's state to the context vector at time t, and h_τ denotes the representation of the τ-th token. We introduce the encoder-decoder framework to model the ED task, mainly considering the following advantages: a) the separate encoder module is flexible in fusing sentence-level and document-level semantic information, and b) the RNN decoder model (1) can capture sequential event tag dependency, as the tag vectors predicted before time t are used as input for predicting the t-th symbol.
The encoder-decoder framework for the ED task differs slightly from the general Seq2Seq task as follows: a) For the ED task, the length of the event tag sequence (target sequence) is known, because its elements correspond one-to-one with tokens in the source sequence, whereas the length of the target sequence in a general Seq2Seq task is unknown. b) The decoder vocabulary for the ED task is a collection of event types, instead of words.
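As a concrete illustration of these two differences, the toy numpy sketch below decodes exactly one tag per source token (known target length) and classifies over a small hypothetical event-type vocabulary. The weight matrix, state dimensions, and the greedy tanh-free decoding step are all invented for the example; this is not the paper's implementation.

```python
import numpy as np

# Hypothetical subset of event tags; the real tag set is much larger.
EVENT_TYPES = ["O", "B_Attack", "B_Injure", "B_Die"]

rng = np.random.default_rng(0)

def decode_tags(token_states, W, prev_tag_embed):
    """Greedy tag decoding: exactly one step per source token."""
    tags = []
    prev = prev_tag_embed                   # embedding of a start tag
    for h in token_states:                  # len(tokens) steps, no <eos>
        logits = W @ np.concatenate([h, prev])
        j = int(np.argmax(logits))          # classify over event types
        tags.append(EVENT_TYPES[j])
        prev = np.eye(len(EVENT_TYPES))[j]  # feed predicted tag forward
    return tags

states = rng.standard_normal((5, 8))        # 5 tokens, hidden size 8
W = rng.standard_normal((len(EVENT_TYPES), 8 + len(EVENT_TYPES)))
tags = decode_tags(states, W, np.zeros(len(EVENT_TYPES)))
print(len(tags))  # 5 -- one tag per source token
```

The loop structure makes difference a) explicit: decoding stops after the last source token rather than when an end-of-sequence symbol is emitted.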

Semantic Encoder
In this module, we encode sentence-level contextual information for each token with a Bidirectional LSTM (BiLSTM) and a self-attention mechanism. Firstly, each token is transformed into a comprehensive representation by concatenating its word embedding and NER type embedding. The word embedding matrix is pretrained with the Skip-gram model (Mikolov et al., 2013), and the NER type embedding matrix is randomly initialized and updated during training. For a given token w_t, its embedded vector is denoted as e_t.
We apply the BiLSTM (Zaremba and Sutskever, 2014) model for sentence-level semantic encoding, which can effectively capture sequential and contextual information for each token. The BiLSTM architecture is composed of a forward LSTM and a backward LSTM, i.e.,

→h_t = LSTM_fw(e_t, →h_{t-1}),    ←h_t = LSTM_bw(e_t, ←h_{t+1}),    h_t = [→h_t; ←h_t].

After encoding, the contextual representation of each token w_t is h_t. An attention mechanism between tokens within a sentence has been proven to further integrate long-range contextual semantic information. For each token w_t, its attended representation is the weighted average of the semantic information of all tokens in the sentence. We apply the attention mechanism proposed by Luong et al. (2015), with the weights derived by

α_{t,j} = exp(h_t^T W_a h_j) / Σ_{j'=1}^{n_i} exp(h_t^T W_a h_{j'}),

and the attended representation of w_t is h^a_t = Σ_{j=1}^{n_i} α_{t,j} h_j. By concatenating its lexical embedding and attended contextual representation, we obtain the final comprehensive semantic representation of w_t as

x_t = [e_t; h^a_t].    (3)
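The self-attention step described above can be sketched as follows. This is a minimal numpy rendering of Luong-style "general" scoring (h_t^T W_a h_j) followed by the weighted average; the dimensions are made up, and the BiLSTM producing the token states is omitted.

```python
import numpy as np

def self_attention(H, W_a):
    """H: (n, d) token states; returns (n, d) attended representations."""
    scores = H @ W_a @ H.T                        # score(t, j) = h_t^T W_a h_j
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)     # softmax over tokens j
    return alpha @ H                              # h^a_t = sum_j alpha_{t,j} h_j

rng = np.random.default_rng(1)
H = rng.standard_normal((4, 6))     # 4 tokens, hidden size 6 (illustrative)
W_a = rng.standard_normal((6, 6))
Ha = self_attention(H, W_a)
# The comprehensive representation then concatenates e_t with h^a_t.
```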

Bidirectional Decoder
The decoder layer for the ED task generates a sequence of event tags corresponding to the tokens. As noted, the tag sequence (target sequence) elements and the tokens (source sequence) are in one-to-one correspondence. Therefore, the context vector c shown in (1) and (2) can be personalized directly as c_t = x_t, which is equivalent to attention with degenerate weights, that is, α_{tt} = 1 and α_{tτ} = 0 for all τ ≠ t.
In traditional Seq2Seq tasks, the target sequence length is unknown during inference, so only a forward decoder is feasible. For the ED task, however, the length of the target sequence is known given the source sequence. Thus, we devise a bidirectional decoder to model event inter-dependency within a sentence.
Forward Decoder In addition to the semantic context vector c_t = x_t, the event information previously involved can help determine the event type triggered by the t-th token. This kind of association can be captured by the forward decoder model:

→s_t = f(→y_{t-1}, →s_{t-1}, x_t),    →y_t = f̃(→y_{t-1}, →s_t, x_t),    (4)

where {→y_t} are the forward event tag vectors. Compared with the general decoder (1), the classifier g(·) over the vocabulary is replaced with a transformation f̃(·) (identity function, tanh, sigmoid, etc.) to obtain the event tag vector.
Backward Decoder Considering that associated events may also be mentioned later, we devise a backward decoder to capture this kind of dependency as follows:

←s_t = f(←y_{t+1}, ←s_{t+1}, x_t),    ←y_t = f̃(←y_{t+1}, ←s_t, x_t),    (5)

where {←y_t} are the backward event tag vectors. The final event tag vector of w_t is y_t = [→y_t; ←y_t], with bidirectional event inter-dependency being captured. The semantic and event-related entity information is also carried by y_t, as x_t is an indirect input.
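A minimal sketch of the two decoding passes and their concatenation is given below. A plain tanh cell stands in for both the RNN cell f and the transformation f̃ (the paper allows identity, tanh, sigmoid, etc.); weight shapes are illustrative only.

```python
import numpy as np

def bidirectional_decode(X, W_f, W_b):
    """X: (n, d) semantic vectors x_t; returns (n, 2m) tag vectors.
    Each direction feeds its previous tag vector back as input."""
    n, _ = X.shape
    m = W_f.shape[0]
    fwd = np.zeros((n, m))
    prev = np.zeros(m)
    for t in range(n):                    # left-to-right pass
        prev = np.tanh(W_f @ np.concatenate([X[t], prev]))
        fwd[t] = prev
    bwd = np.zeros((n, m))
    prev = np.zeros(m)
    for t in reversed(range(n)):          # right-to-left pass
        prev = np.tanh(W_b @ np.concatenate([X[t], prev]))
        bwd[t] = prev
    return np.concatenate([fwd, bwd], axis=1)   # y_t = [fwd_t; bwd_t]

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 8))                 # 5 tokens, x_t of size 8
W_f = rng.standard_normal((4, 8 + 4))
W_b = rng.standard_normal((4, 8 + 4))
Y = bidirectional_decode(X, W_f, W_b)
print(Y.shape)  # (5, 8) -- forward and backward tag vectors concatenated
```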
An alternative method for modeling sentence-level event inter-dependency, the hierarchical tagging layer, was proposed by Chen et al. (2018). The bidirectional decoder is quite different from the hierarchical tagging layer as follows:

• The bidirectional decoder models event inter-dependency directly by combining a forward and a backward decoder. The hierarchical tagging layer instead utilizes two forward decoders and a tag attention mechanism to capture bidirectional event inter-dependency.
• In the bidirectional decoder, the ED task is formalized as a special Seq2Seq task, which can simplify the event inter-dependency modeling problem and cross-sentence information propagation problem discussed below.
The bidirectional RNN decoder unfolds the event tag vector corresponding to each token and captures the bidirectional event inter-dependency within the sentence. To propagate information across sentences, we first need to aggregate the useful information of each sentence.

Information Aggregation
For the current sentence s_i, the information we are concerned about can be summarized as recording which entities and tokens trigger which events. Thus, to summarize this information, we devise another LSTM layer (the information aggregation module shown in Figure 1) with the event tag vectors y_t as input. The information at the t-th token is computed by

Ĩ_t = LSTM(y_t, Ĩ_{t-1}).    (6)

We choose the last state Ĩ_{n_i} as the summary information, i.e., I_i = Ĩ_{n_i}. The sentence-level information aggregation module bridges the information across sentences, as the well-formalized information can be easily integrated into the decoding process of other sentences, enhancing the event-related signal.

Multi-Layer Bidirectional Network
In this module, we introduce a multiple bidirectional tagging layers stacking mechanism to aggregate information of adjacent sentences into the bidirectional decoder and propagate information across sentences. The information ({y_t}, I_i) obtained by the bidirectional decoder layer and the information aggregation module captures the event-relevant information within a sentence. However, cross-sentence information has not yet interacted. For a given sentence, as we can see in Table 1, its relevant information is mainly stored in several neighboring sentences, while distant sentences are rarely relevant. Thus, we propose to transmit the summarized sentence information I_i among adjacent sentences.
For the decoder framework shown in (4) and (5), the cross-sentence information can be integrated by extending the input with I_{i-1} and I_{i+1}. Further, we introduce a multiple bidirectional tagging layers stacking mechanism, shown in Figure 1, to iteratively aggregate information of adjacent sentences. The overall framework is named the Multi-Layer Bidirectional Network (MLBiNet). As shown in Figure 1, a bidirectional tagging layer is composed of a bidirectional decoder and an information aggregation module. For sentence s_i, the outputs of the k-th layer can be computed by applying (4)-(6) with the context vector extended to

c^k_t = [x_t; I^{k-1}_{i-1}; I^{k-1}_{i+1}],    (7)

where I^{k-1}_{i-1} is the sentence information of s_{i-1} aggregated in the (k-1)-th layer, and {y^k_t} are the event tag vectors obtained in the k-th layer. The equation suggests that, for each token of the source sentence s_i, the input of cross-sentence information is identical. This is reasonable, as the cross-sentence information available is the same for each token of the current sentence.
The iteration process shown in equation (7) is actually an evolutionary diffusion of cross-sentence semantic and event information through the document. Specifically, in the first tagging layer, the information of the current sentence is effectively modeled by the bidirectional decoder and the information aggregation module. In the second layer, information of adjacent sentences is propagated to the current sentence by plugging I^1_{i-1} and I^1_{i+1} into the decoder. In general, in the k-th (k ≥ 3) layer, since s_{i-1} has captured the information of sentence s_{i-k+1} in the (k-1)-th layer, s_i can obtain information about s_{i-k+1} by acquiring the information in s_{i-1}. Thus, as the number of decoder layers increases, the model captures information from more distant sentences. For a K-layer bidirectional tagging model, sentence information at a distance of up to K-1 can be captured.
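This diffusion argument can be checked with a toy simulation: if each layer gives a sentence access to its two neighbors' summaries, a signal spreads one sentence further per additional layer. The boolean "reachability" model below is our own simplification for illustration, not the paper's computation.

```python
import numpy as np

def propagate(reach, layers):
    """reach: boolean vector, True where a signal is visible."""
    for _ in range(layers - 1):          # the first layer is sentence-local
        left = np.roll(reach, 1); left[0] = False
        right = np.roll(reach, -1); right[-1] = False
        reach = reach | left | right     # receive both neighbors' summaries
    return reach

# A signal placed in s_3 of a 5-sentence document:
signal = np.array([False, False, True, False, False])
out = propagate(signal.copy(), layers=2)
print(out)  # with 2 layers the signal reaches s_2 and s_4 (distance 1)
```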
We define the final event tag vector of w_t as the weighted sum of the {y^k_t} from the different layers, i.e.,

y_t = Σ_{k=1}^{K} α^{k-1} y^k_t,

where α ∈ (0, 1] is a weight decay parameter. This means that cross-sentence information supplements the current sentence, and its contribution gradually decreases as the distance increases when α < 1.
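Assuming the decay applies per layer, with the first layer weighted α^0 (our reading of the weighted-sum description), the combination can be sketched as:

```python
import numpy as np

def combine_layers(per_layer_tags, alpha=0.8):
    """per_layer_tags: list of (n, m) arrays, one per tagging layer."""
    total = np.zeros_like(per_layer_tags[0])
    for k, Y in enumerate(per_layer_tags):
        total += (alpha ** k) * Y        # deeper layers decay when alpha < 1
    return total

Y1 = np.ones((3, 2))                     # layer-1 tag vectors (toy values)
Y2 = 2 * np.ones((3, 2))                 # layer-2 tag vectors
out = combine_layers([Y1, Y2], alpha=0.5)
print(out[0])  # [2. 2.]  (1*1 + 0.5*2)
```

With α = 1.0, as used in the experiments below, every layer contributes equally.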
We note that the parameters of bidirectional decoder and information aggregation module at different layers can be shared, because they encode and propagate the same structured information. In this paper, we set the parameters of different layers to be the same.

Loss Function
In order to train the networks, we minimize the negative log-likelihood loss function

J(θ) = − Σ_{d∈D} Σ_{w_t∈d} log p(y_t | d; θ),

where D denotes the set of training documents. The tag probability for token w_t is computed by a softmax over the M event classes,

p(O^j_t | d; θ) = exp(o^j_t) / Σ_{m=1}^{M} exp(o^m_t),

where o_t is a linear transformation of the final event tag vector of w_t, M is the number of event classes, and p(O^j_t | d; θ) is the probability of assigning event type j to token w_t in document d under parameters θ.
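A toy rendering of the loss computation is given below; the linear projection that produces per-class logits from the final tag vector is assumed rather than taken from the paper, and the numbers are invented.

```python
import numpy as np

def nll_loss(logits, gold):
    """logits: (T, M) per-token class scores; gold: (T,) gold indices.
    Returns the mean negative log-likelihood over tokens."""
    z = logits - logits.max(axis=1, keepdims=True)    # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(gold)), gold].mean()

# Two tokens, M = 3 event classes; gold classes are 0 and 1.
logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0]])
gold = np.array([0, 1])
loss = nll_loss(logits, gold)
print(round(float(loss), 4))
```

The loss shrinks toward zero as the gold class dominates each token's softmax.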

Dataset and Settings
We performed extensive experimental studies on the ACE 2005 corpus to demonstrate the effectiveness of our method on the ED task. The corpus defines 33 types of events and an extra NONE type for non-trigger tokens. We formalize the task as generating a sequence of 67-class event tags (with the BIO tagging schema). The data splitting for training, validation and testing follows (Ji and Grishman, 2008; Chen et al., 2015; Chen et al., 2018), where the training set contains 529 documents, the validation set contains 30 documents, and the remaining 40 documents are used as the testing set. We evaluated the performance of three multi-layer settings with 1-, 2- and 3-layer MLBiNet, respectively. We use Adam (Kingma and Ba, 2017) for optimization. In all three settings, we cut every 8 consecutive sentences into a new document and pad when needed. Each sentence is truncated or padded to a length of 50. We set the dimension of word embeddings to 100 and the dimension of gold NER type and subtype embeddings to 20. We set the dropout rate to 0.5 and the penalty coefficient to 2×10^-5 to avoid overfitting. The hidden sizes of the semantic encoder layer and the decoder layer are set to 100 and 200, respectively. The size of the forward and backward event tag vectors is set to 100. We set the batch size to 64, the learning rate to 5×10^-4 with decay rate 0.99, and the weight decay parameter α to 1.0. The results we report are the average of 10 trials.
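The 67-class figure follows directly from the BIO scheme over the 33 ACE event types, which can be checked in one line:

```python
# 33 ACE event types, each with a B- and I- tag, plus one O tag for
# non-trigger tokens under the BIO tagging schema.
EVENT_TYPE_COUNT = 33
tag_count = 2 * EVENT_TYPE_COUNT + 1
print(tag_count)  # 67 -- matches the 67-class sequence labeling setup
```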

Baselines
For comparison, we investigated the performance of the following state-of-the-art methods: 1) DMCNN (Chen et al., 2015), which extracts multiple events from one sentence with a dynamic multi-pooling CNN; 2) HBTNGMA (Chen et al., 2018), which models sentence-level event inter-dependency via a hierarchical tagging model; 3) JMEE, which models sentence-level event inter-dependency via a graph model over the sentence's syntactic parse; 4) DMBERT-Boot (Wang et al., 2019), which augments the training data with external unlabeled data via an adversarial mechanism; 5) MOGANED (Yan et al., 2019), which uses a graph convolution network with aggregative attention to explicitly model and aggregate multi-order syntactic representations; 6) SS-VQ-VAE, which learns to induce new event types via a semi-supervised vector-quantized variational autoencoder framework, fine-tuned with the pre-trained BERT-large model.

Overall Performance

Table 2 presents the overall performance comparison between the different methods with gold-standard entities. As shown, under the 2-layer and 3-layer settings, our proposed model MLBiNet achieves better performance, surpassing the current state-of-the-art by 1.9 points. More specifically, our models achieve higher recalls by at least 0.7, 5.9 and 5.2 points, respectively.

The powerful encoder of the BERT pre-trained model (Devlin et al., 2018) has been proven to improve the performance of downstream NLP tasks. The 2-layer MLBiNet outperforms DMBERT-Boot (BERT-base) and SS-VQ-VAE (BERT-large) by 3.5 and 1.9 points, respectively. This proves the importance of event inter-dependency modeling and cross-sentence information integration for the ED task.

Table 3: System performance on single-event sentences (1/1) and multiple-event sentences (1/n). 1/1 means one sentence that has one event; otherwise, 1/n is used. "all" means all test data are included.

When only information of the current sentence is available, the 1-layer MLBiNet outperforms HBTNGMA by 2.9 points. This proves that the hierarchical tagging mechanism adopted by HBTNGMA is not as effective as the bidirectional decoding mechanism we proposed. Intuitively, the bidirectional decoder models event inter-dependency explicitly with a forward decoder and a backward decoder, which is more efficient than hierarchies.

Effect on Extracting Multiple Events
Existing event inter-dependency modeling methods (Chen et al., 2015, 2018) aim to extract multiple events jointly within a sentence. To demonstrate that sentence-level event inter-dependency modeling benefits from cross-sentence information propagation, we evaluated the performance of our model on single event extraction (1/1) and multiple events joint extraction (1/n). 1/1 means one sentence that has one event; otherwise, 1/n is used.
The experimental results are presented in Table 3. As shown, we can verify the importance of the cross-sentence information propagation mechanism and the bidirectional decoder in sentence-level multiple events joint extraction based on the following results: a) When only the current sentence information is available, the 1-layer MLBiNet outperforms existing methods by at least 2.4 points in the 1/n case, which proves the effectiveness of the bidirectional decoder we proposed; b) For our 2-layer and 3-layer models, performance in both the 1/1 and 1/n cases surpasses the current methods by a large margin, which proves the importance of propagating information across sentences for both single event and multiple events extraction. We conclude that it is cross-sentence information propagation and the bidirectional decoder that make cross-sentence joint event detection successful.

Analysis of Decoder Mechanisms

Table 4 presents the performance of the model under three decoder mechanisms: the forward, backward and bidirectional decoder, across the three multi-layer settings. We can reach the following conclusions: a) Under all three decoder mechanisms, the performance of the proposed model improves significantly as the number of decoder layers increases; b) The bidirectional decoder dominates both the forward decoder and the backward decoder, and the forward decoder dominates the backward decoder; c) Information propagation across sentences enhances the event-relevant signal regardless of the decoder mechanism applied. Among the three decoder models, the bidirectional decoder performs best because of its ability to capture bidirectional event inter-dependency, which proves that both the forward and backward decoders are critical for event inter-dependency modeling.

Analysis of Aggregation Model
In the information aggregation module, we introduce an LSTM, shown in (6), to aggregate sentence information, which is then propagated to other sentences via the bidirectional decoder. We compare it with other aggregation methods: a) concat, where the sentence information is aggregated by simply concatenating the first and last event tag vectors of the sentence, and b) average, where the sentence information is aggregated by averaging the event tag vectors of the tokens in the sentence. The experimental results are presented in Table 5.
Compared with the baseline 1-layer model, the other three 2-layer settings equipped with information aggregation and cross-sentence propagation perform better. This proves that the sentence information aggregation module can integrate useful information and propagate it to other sentences through the decoder. On the other hand, the performance of LSTM and concat is comparable and stronger than that of average. Considering that the input of the information aggregation module is the event tag vector obtained by the bidirectional decoder, which has already captured sequential event information, it is not surprising that LSTM does not have a great advantage over concat and average.

Related Work
Event detection is a well-studied task with substantial research effort over the last decade. Existing methods (Chen et al., 2015; Nguyen and Grishman, 2015; Liu et al., 2017; Nguyen and Grishman, 2018; Deng et al., 2020; Tong et al., 2020; Lai et al., 2020; Cui et al., 2020; Deng et al., 2021; Shen et al., 2021) mainly focus on sentence-level event trigger extraction, either neglecting document information or modeling document-level semantic information and event inter-dependency separately.
For the problem of event inter-dependency modeling, some methods were proposed to jointly extract triggers within a sentence. Among them, Chen et al. (2015) used a dynamic multi-pooling CNN to preserve information about multiple events; Nguyen et al. (2016) utilized bidirectional recurrent neural networks to extract events; other work introduced syntactic shortcut arcs to enhance information flow and used graph neural networks to model graph information; and Chen et al. (2018) proposed a hierarchical tagging LSTM layer and a tag attention mechanism to model event inter-dependency within a sentence. However, adjacent sentences also store relevant event information that can enhance the event signals of other sentences; these methods miss such cross-sentence event inter-dependency information. For document-level event inter-dependency modeling, Lin et al. (2020) proposed to incorporate global features to capture cross-subtask and cross-instance interactions.
Deep learning methods for document-level semantic information aggregation are primarily based on multi-level attention mechanisms. Chen et al. (2018) integrated document information by introducing multi-level attention. Other work used trigger- and sentence-level supervised attention to aggregate information and enhance sentence-level event detection. Zheng et al. (2019) utilized a memory network to store document-level contextual information and entities. Feature-based document-level information aggregation methods were proposed by (Ji and Grishman, 2008; Liao and Grishman, 2010; Hong et al., 2011; Huang and Riloff, 2012; Reichart and Barzilay, 2012; Lu and Roth, 2012), and another line of work aggregates document-level information via latent topic modeling. The attention-based document-level aggregation mechanisms treat all sentences in the document equally, which may introduce noise from distant sentences; the feature-based methods require extensive human engineering, which greatly limits the portability of the model.

Conclusions
This paper presents a novel Multi-Layer Bidirectional Network (MLBiNet) to propagate document-level semantic and event inter-dependency information for the event detection task. To the best of our knowledge, this is the first work to unify them in one model. First, a bidirectional decoder is proposed to explicitly model sentence-level event inter-dependency, and event-relevant information within a sentence is aggregated by an information aggregation module. Then, a multiple bidirectional tagging layers stacking mechanism is devised to iteratively propagate semantic and event-related information across sentences. We conducted extensive experiments on the widely used ACE 2005 corpus; the results demonstrate the effectiveness of our model, as well as of all the modules we proposed.
In the future, we will extend the model to the event argument extraction task and other information extraction tasks where document-level semantic aggregation and object inter-dependency are critical. One example is document-level relation extraction (Quirk and Poon, 2017; Yao et al., 2019), which requires reading multiple sentences in a document to extract entities and infer their relations by synthesizing all the information in the document. For other sequence labeling tasks, such as named entity recognition, we can also utilize the proposed architecture to model the dependency between entity labels.