Document-level Event Extraction via Parallel Prediction Networks

Document-level event extraction (DEE) is indispensable when events are described across an entire document. We argue that sentence-level extractors are ill-suited to the DEE task, where event arguments often scatter across sentences and multiple events may co-exist in a document. DEE is challenging because it requires a holistic understanding of the document and the ability to assemble arguments across multiple sentences. In this paper, we propose an end-to-end model that extracts structured events from a document in a parallel manner. Specifically, we first introduce a document-level encoder to obtain document-aware representations. Then, a multi-granularity non-autoregressive decoder generates events in parallel. Finally, to train the entire model, we propose a matching loss function that enables global optimization. Empirical results on the widely used DEE dataset show that our approach significantly outperforms current state-of-the-art methods on the challenging DEE task. Code will be available at https://github.com/HangYang-NLP/DE-PPN.


Introduction
The goal of event extraction (EE) is to identify events of pre-specified types along with their corresponding arguments from plain text. A great number of previous studies (Ahn, 2006; Ji and Grishman, 2008; Liao and Grishman, 2010; Hong et al., 2011; Li et al., 2013; Chen et al., 2015; Nguyen et al., 2016; Yang and Mitchell, 2016; Chen et al., 2017; Huang et al., 2018; Yang et al., 2019) focus on sentence-level EE (SEE), and most of these works are based on the ACE evaluation (Doddington et al., 2004). 1 However, these SEE-based methods make predictions within a sentence and fail to extract events that span sentences. To this end, document-level EE (DEE) is needed when the event information scatters across the whole document.

Figure 1: An example of a document containing two Equity Freeze events, Event-1 and Event-2. Words in bold are arguments that scatter across multiple sentences.
In contrast to SEE, DEE poses two specific challenges: arguments-scattering and multi-events. Arguments-scattering means that the arguments of an event may be distributed over multiple sentences. For example, as shown in Figure 1, the arguments of Event-1 appear in different sentences ([S3] and [S7]), so extraction within an individual sentence yields incomplete results. This challenge requires a DEE model to have a holistic understanding of the entire document and the ability to assemble all relevant arguments across sentences. The task becomes even harder when coupled with the second challenge, multi-events, where a document contains multiple events. 2 As shown in Figure 1, Event-1 and Event-2 share the same event type, and there is no obvious textual boundary between them. The multi-events problem requires a DEE method to recognize how many events a document contains and to assemble arguments accurately (i.e., assign each argument to its corresponding event). As a result of these two complications, SEE methods are ill-suited for DEE, which calls for a model that can integrate document-level information, assemble relevant arguments across multiple sentences, and capture multiple events simultaneously.
To handle these challenges, previous works (Yang et al., 2018; Zheng et al., 2019) formulate DEE as an event table filling task, i.e., filling candidate arguments into a predefined event table. Specifically, they model DEE as a serial prediction paradigm, in which arguments are predicted in a predefined role order and multiple events are extracted in a predefined event order. Such a manner is restricted to the extraction of individual arguments, and earlier predictions cannot take later extraction results into account. As a result, errors propagate and the extraction performance is unsatisfactory.
In this paper, to avoid the shortcomings of serial prediction and tackle the aforementioned challenges in DEE, we propose an end-to-end model named Document-to-Events via Parallel Prediction Networks (DE-PPN). DE-PPN is based on an encoder-decoder framework that extracts structured events from a whole document in parallel. In detail, we first introduce a document-level encoder to obtain document-aware representations, which provide a holistic understanding of the entire document. Then, we leverage a multi-granularity decoder to generate events, which consists of two key parts: a role decoder and an event decoder. The role decoder handles the arguments-scattering challenge by assembling the arguments of an event based on the document-aware representations. To address the multi-events challenge, the event decoder supports generating multiple events. Both are based on the non-autoregressive mechanism (Gu et al., 2018), which supports extracting multiple events in parallel. Finally, to compare extracted events with ground truths, we propose a matching loss function inspired by the Hungarian algorithm (Kuhn, 1955; Munkres, 1957). The proposed loss performs a global optimization by computing a bipartite matching between predicted and ground-truth events.
In summary, our contributions are as follows:
• We propose an encoder-decoder model, DE-PPN, based on a document-level encoder and a multi-granularity decoder, which extracts events in parallel with document-aware representations.
• We introduce a novel matching loss function to train the end-to-end model, which enables global optimization.
• We conduct extensive experiments on the widely used DEE dataset, and the results demonstrate that DE-PPN significantly outperforms state-of-the-art methods when facing the specific challenges of DEE.

Methodology
Before introducing our approach, we first formalize the DEE task. Formally, we denote T and R as the sets of pre-defined event types and role categories, respectively. Given an input document comprised of N_s sentences, D = {S_i}_{i=1}^{N_s}, the DEE task aims to extract k structured events {y_i^t}_{i=1}^{k}, where each event y_i^t with event type t contains a series of roles (r_i^1, r_i^2, ..., r_i^n) filled by arguments (a_i^1, a_i^2, ..., a_i^n). Here k is the number of events contained in the document, n is the number of pre-defined roles for event type t, t ∈ T, and r ∈ R.
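As a concrete illustration, the extraction target can be pictured as follows (a minimal sketch; the event type, role names, argument strings, and helper function are illustrative, not taken from the paper or its code):

```python
# A document yields k events; each event is an event type plus a mapping
# from pre-defined roles to argument strings (None when a role is unfilled).
events = [
    {
        "event_type": "EquityFreeze",
        "roles": {
            "EquityHolder": "Zhang San",
            "FrozeShares": "1,200,000 shares",
            "StartDate": None,  # role left unfilled when no argument is found
        },
    },
]

def filled_roles(event):
    """Return only the roles of an event that are filled by arguments."""
    return {r: a for r, a in event["roles"].items() if a is not None}
```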
The key idea of our proposed model, DE-PPN, is to aggregate document-level context to predict events in parallel. Figure 2 illustrates the architecture of DE-PPN, which consists of five key components: (1) candidate argument recognition, (2) a document-level encoder, (3) a multi-granularity decoder, (4) events prediction, and (5) a matching loss function.

Figure 2: The overall architecture of DE-PPN. Given a document, DE-PPN first encodes each sentence separately and recognizes candidate arguments from it. A document-level encoder then produces the document-level representations, and a multi-granularity decoder generates events in parallel based on these document-aware representations. Finally, the matching loss function produces an optimal bipartite matching between predicted and ground-truth events, which enables global optimization.

Candidate Argument Recognition
Given a document D = {S_i}_{i=1}^{N_s} with N_s sentences, each sentence S_i, a sequence of tokens, is first embedded as [w_{i,1}, w_{i,2}, ..., w_{i,l}], where l is the sentence length. The word embeddings are then fed into an encoder to obtain contextualized representations. In this paper, we adopt the Transformer (Vaswani et al., 2017) as the primary context encoder. Through the encoder, we obtain the context-aware embedding C_i of sentence S_i:

C_i = Transformer-1([w_{i,1}, w_{i,2}, ..., w_{i,l}])    (1)

where C_i ∈ R^{l×d} and d is the size of the hidden layer; each sentence in the given document is thus represented as {C_i}_{i=1}^{N_s}. Finally, following Zheng et al. (2019), we model sentence-level candidate argument recognition as a typical sequence tagging task. Through candidate argument recognition, we obtain candidate arguments A = {a_i}_{i=1}^{N_a} from the document, where N_a is the number of recognized candidate arguments.
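The sequence-tagging step can be sketched as follows, assuming a standard BIO scheme over char-level tokens (the function and tag names are illustrative; the paper only states that tagging follows Zheng et al. (2019)):

```python
def decode_bio(tokens, tags):
    """Collect candidate argument spans from BIO tags produced by the
    sentence-level tagger (simplified: a single argument label is assumed)."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:          # close the previous open span
                spans.append("".join(tokens[start:i]))
            start = i
        elif tag.startswith("I-") and start is not None:
            continue                        # span continues
        else:                               # "O" or a stray "I-" closes the span
            if start is not None:
                spans.append("".join(tokens[start:i]))
            start = None
    if start is not None:                   # span running to the end of sentence
        spans.append("".join(tokens[start:]))
    return spans
```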

Document-level Encoder
To enable the awareness of document-level context for sentences and candidate arguments, we employ a document-aware encoder to facilitate the interaction between all sentences and candidate arguments. Formally, given an argument a_i whose span covers the j-th to k-th tokens of sentence S_i, we conduct a max-pooling operation over the token-level embeddings [c_{i,j}, ..., c_{i,k}] ⊂ C_i to obtain its local embedding c_i^a ∈ R^d. Similarly, the sentence embedding c_i^s ∈ R^d is obtained by max-pooling over the token sequence representation C_i of sentence S_i. We then employ a Transformer module, Transformer-2, as the encoder to model the interaction between all sentences and candidate arguments via a multi-head self-attention mechanism, yielding document-aware representations for sentences and arguments. Note that we add sentence position embeddings to the sentence representations to inform the sentence order before feeding them into Transformer-2.
[H^a; H^s] = Transformer-2(c_1^a, ..., c_{N_a}^a; c_1^s, ..., c_{N_s}^s)    (2)

Since an argument may have multiple mentions in a document, we use max-pooling to merge the embeddings of mentions with the same char-level tokens into a single embedding. After the document-level encoding stage, we obtain the document-aware sentence representations H^s ∈ R^{N_s×d} and candidate argument representations H^a ∈ R^{N_a×d}. Before decoding, we stack a linear classifier over the document representation, obtained by max-pooling over H^s, to conduct a binary classification for each event type. Then, for each predicted event type t with its pre-defined role types, DE-PPN learns to generate events from the document-aware candidate argument representations H^a ∈ R^{N_a×d} and sentence representations H^s ∈ R^{N_s×d}.
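The two max-pooling operations, and the merging of repeated mentions, can be sketched as follows (illustrative helper names; plain Python lists stand in for d-dimensional embeddings):

```python
def max_pool(vectors):
    """Element-wise max over a list of equal-length vectors."""
    return [max(col) for col in zip(*vectors)]

# Local span embedding: max-pool the token embeddings covering the span.
token_embs = [[0.1, 0.5], [0.3, 0.2], [0.7, 0.0]]
span_emb = max_pool(token_embs)  # [0.7, 0.5]

def merge_mentions(mentions):
    """Merge mentions with identical surface tokens into one embedding,
    again by element-wise max-pooling."""
    merged = {}
    for surface, emb in mentions:
        merged.setdefault(surface, []).append(emb)
    return {s: max_pool(embs) for s, embs in merged.items()}
```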

Multi-Granularity Decoder
To effectively address arguments-scattering and multi-events in DEE, we introduce a multi-granularity decoder that generates all possible events in parallel based on the document-aware representations (H^a and H^s). The multi-granularity decoder is composed of three parts: an event decoder, a role decoder, and an event-to-role decoder. All three are based on the non-autoregressive mechanism (Gu et al., 2018), which supports extracting all events in parallel.
Event Decoder. The event decoder supports extracting all events in parallel and models the interaction between events. Before decoding, the decoder needs to know how many events to generate. We use m learnable embeddings as the input of the event decoder, denoted as event queries Q^{event} ∈ R^{m×d}, where m is a hyperparameter for the number of generated events. In our work, m is set significantly larger than the average number of events in a document. The event queries Q^{event} are fed into a non-autoregressive decoder composed of a stack of N identical Transformer layers. Each layer contains a multi-head self-attention mechanism to model the interaction among events and a multi-head cross-attention mechanism to integrate the document-aware representations H^s into the event queries. Formally, the m event queries are decoded into m output embeddings H^{event} by:

H^{event} = EventDecoder(Q^{event}; H^s)    (3)

where H^{event} ∈ R^{m×d}.
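The cross-attention step that injects H^s into the event queries can be sketched as follows (a single-head, projection-free simplification of the Transformer layer, for illustration only):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, memory):
    """Single-head scaled dot-product cross-attention: each query vector
    attends over all memory vectors (here, document-aware sentence
    embeddings) and returns a weighted sum of them."""
    d = len(memory[0])
    outs = []
    for q in queries:
        scores = [sum(qi * mi for qi, mi in zip(q, m)) / math.sqrt(d) for m in memory]
        weights = softmax(scores)
        outs.append([sum(w * m[j] for w, m in zip(weights, memory)) for j in range(d)])
    return outs
```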
Role Decoder. The role decoder supports filling all roles of an event in parallel and models the interaction between roles. Given the predicted event type t with semantic role types (r^1, r^2, ..., r^n), we use n learnable embeddings as the input of the role decoder, denoted as role queries Q^{role} ∈ R^{n×d}. The role queries are fed into a decoder with the same architecture as the event decoder: the self-attention mechanism models the relationships among roles, and the cross-attention mechanism fuses in the document-aware candidate argument representations H^a. Formally, the n role queries are decoded into n output embeddings H^{role} by:

H^{role} = RoleDecoder(Q^{role}; H^a)    (4)

where H^{role} ∈ R^{n×d}.
Event-to-Role Decoder. To generate diverse events with relevant arguments for different event queries, an event-to-role decoder models the interaction between the event outputs H^{event} and the role outputs H^{role}:

H^{event2role} = Event2RoleDecoder(H^{role}; H^{event})    (5)

where H^{event2role} ∈ R^{m×n×d} contains n role embeddings for each of the m event queries.

Events Prediction
After the multi-granularity decoding, the m event queries and n role queries are transformed into m predicted events, each containing n role embeddings. To filter out spurious events, the m event embeddings H^{event} are fed into a feed-forward network (FFN) that judges whether each event prediction is non-null or null. Concretely, the prediction is obtained by:

judge = softmax(H^{event} W_e)    (6)

where W_e ∈ R^{d×2} is a learnable parameter matrix. Then, for each predicted event with its pre-defined roles, the predicted arguments are decoded by filling in candidate argument indices or the null value with (N_a + 1)-class classifiers: 3

P^{role} = softmax(H^{event2role} [H^a; h_∅]^T)    (7)

where h_∅ ∈ R^d is a learnable embedding for the null class. After the prediction network, we obtain m events Ŷ = (Ŷ_1, Ŷ_2, ..., Ŷ_m), where each event Ŷ_i = (P_i^1, P_i^2, ..., P_i^n) contains n predicted arguments with role types, and P_i^j = P^{role}[i, j, :] ∈ R^{N_a+1}.
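Putting Equations 6 and 7 together, the prediction step can be sketched as follows (illustrative only; a 0.5 threshold on the non-null probability stands in for the argmax over the two judge classes):

```python
def decode_events(event_nonnull_probs, role_arg_probs, candidates):
    """Keep events judged non-null, then map each role's distribution over
    the N_a candidate arguments plus one null class to an argument.

    event_nonnull_probs: per-event probability of being a real event.
    role_arg_probs: per event, per role, a distribution of length N_a + 1.
    candidates: the N_a recognized candidate argument strings.
    """
    NULL = len(candidates)              # index N_a is the null class
    events = []
    for p_nonnull, roles in zip(event_nonnull_probs, role_arg_probs):
        if p_nonnull < 0.5:             # spurious (null) event, filtered out
            continue
        event = []
        for dist in roles:
            idx = max(range(len(dist)), key=dist.__getitem__)  # argmax
            event.append(candidates[idx] if idx != NULL else None)
        events.append(event)
    return events
```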

Matching Loss
The main training problem is how to assign the m predicted events, each with a series of arguments, to the k ground-truth events. Inspired by the assignment problem in operations research (Kuhn, 1955; Munkres, 1957), we propose a matching loss function that produces an optimal bipartite matching between predicted and ground-truth events.
Formally, we denote the predicted and ground-truth events as Ŷ = (Ŷ_1, Ŷ_2, ..., Ŷ_m) and Y = (Y_1, Y_2, ..., Y_k), respectively, where k is the actual number of events in the document and m is the fixed number of generated events; note that m ≥ k. The i-th predicted event is denoted as Ŷ_i = (P_i^1, P_i^2, ..., P_i^n), where P_i^j is computed by Equation 7. The i-th ground-truth event is denoted as Y_i = (r_i^1, r_i^2, ..., r_i^n), where r_i^j is the candidate argument index for the j-th role type of the i-th target event.
To find a bipartite matching between these two sets, we search for the permutation of m elements with the lowest cost:

σ̂ = argmin_{σ ∈ S(m)} Σ_{i=1}^{k} C_match(Ŷ_{σ(i)}, Y_i)    (8)

where S(m) is the space of all m-length permutations and C_match(Ŷ_{σ(i)}, Y_i) is a pair-wise matching cost between the ground truth Y_i and the prediction Ŷ_{σ(i)} with index σ(i). Taking into account all of the predicted arguments for the roles of an event, we define C_match as:

C_match(Ŷ_{σ(i)}, Y_i) = -judge_{σ(i)} - Σ_{j=1}^{n} P_{σ(i)}^j (r_i^j)    (9)

where judge_{σ(i)} is the non-null/null judgment of event σ(i) computed by Equation 6, and P_{σ(i)}^j(r_i^j) is the predicted probability of the gold argument index r_i^j for the j-th role. The optimal assignment σ̂ can be computed efficiently with the Hungarian algorithm. 4 Then, for all pairs matched in the previous step, we define the loss function as the negative log-likelihood:

L_match = Σ_{i=1}^{k} [ -log judge_{σ̂(i)} - Σ_{j=1}^{n} log P_{σ̂(i)}^j (r_i^j) ]    (10)

where σ̂ is the optimal assignment computed by Equation 8.
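The search in Equation 8 can be sketched by brute force over permutations (for illustration only; the Hungarian algorithm finds the same assignment in polynomial time):

```python
from itertools import permutations

def best_matching(cost):
    """Exhaustively search assignments of the m predictions to the k
    ground-truth events for the minimum total cost.

    cost[i][j] = C_match(prediction i, ground truth j), with m >= k;
    the returned perm satisfies perm[j] = index of the prediction
    assigned to ground truth j.
    """
    m, k = len(cost), len(cost[0])
    best, best_perm = float("inf"), None
    for perm in permutations(range(m), k):
        total = sum(cost[perm[j]][j] for j in range(k))
        if total < best:
            best, best_perm = total, perm
    return best_perm, best
```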

Optimization
During training, we sum the matching loss for events prediction with the losses of the preconditioned steps before decoding:

L_all = λ_1 L_ae + λ_2 L_ec + λ_3 L_match    (11)

where L_ae and L_ec are the cross-entropy losses for sentence-level candidate argument recognition and event type classification, respectively, and λ_1, λ_2, and λ_3 are hyper-parameters.
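The combined objective can be sketched as follows (the default λ values here are placeholders, not the paper's tuned hyper-parameters):

```python
def total_loss(l_ae, l_ec, l_match, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the three training objectives: candidate argument
    recognition, event type classification, and the matching loss."""
    l1, l2, l3 = lambdas
    return l1 * l_ae + l2 * l_ec + l3 * l_match
```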

Experiments and Analysis
In this section, we present empirical studies to answer the following questions:

Evaluation Metrics.
For a fair comparison, we adopt the evaluation standard used in Doc2EDAG (Zheng et al., 2019). Specifically, for each predicted event, the most similar ground truth is selected without replacement to calculate Precision (P), Recall (R), and F1-score (F1). As an event type often includes multiple roles, micro-averaged role-level scores are calculated as the final DEE metric.
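A simplified sketch of this metric follows (greedy most-similar matching without replacement, then micro-averaged role-level scores; details may differ from the official Doc2EDAG scorer):

```python
def role_level_f1(predicted, gold):
    """Each predicted event (a role -> argument dict) greedily picks the
    most similar remaining ground-truth event; P/R/F1 are then computed
    over filled role slots, micro-averaged."""
    remaining = list(gold)
    tp = fp = 0
    for pred in predicted:
        if remaining:
            # similarity = number of role slots with identical fillers
            match = max(remaining, key=lambda g: sum(pred.get(r) == a for r, a in g.items()))
            remaining.remove(match)
        else:
            match = {}
        for r, a in pred.items():
            if a is None:
                continue
            if match.get(r) == a:
                tp += 1
            else:
                fp += 1
    fn = sum(1 for g in gold for a in g.values() if a is not None) - tp
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```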
Implementation Details. For an input document, we set the maximum number of sentences to 64 and the maximum sentence length to 128. We adopt the basic Transformer, with 768 hidden units and 8 attention heads per layer, as the encoder and decoder architecture. During training, we employ the AdamW optimizer (Kingma and Ba, 2014) with a learning rate of 1e-5 and a batch size of 16. Test set performance is reported at the step with the best development set performance within 100 epochs. We leave detailed hyper-parameters and additional results to the Appendix.

ASR is defined in terms of Num_ments, the number of event mentions (i.e., sentences that contain arguments), and Num_args, the number of arguments: the higher the ASR, the more scattered the arguments of an event are.

Figure 3: F1-score for different numbers of event decoder and role decoder layers.
Our model maintains the best performance, and the results indicate that the encoder-decoder framework can better assemble arguments to the corresponding event across sentences, thanks to parallel prediction and document-aware representations.
Single-Event vs. Multi-Event. To show the extreme difficulty when arguments-scattering meets multi-events, we conduct experiments in two scenarios: single-event (documents containing one event) and multi-event (documents containing multiple events). Table 2 shows the F1-scores on the single-event and multi-event sets for each event type and the average (Avg.). We observe that multi-events is extremely challenging, as the extraction performance of all models drops significantly; nevertheless, DE-PPN improves the average F1-score from 67.3% to 68.7% over Doc2EDAG. The results demonstrate the effectiveness of our method in handling the multi-events challenge. This improvement benefits from the event decoder, which generates multiple events in parallel, and the matching loss function, which performs a global optimization. Besides, DE-PPN-1 achieves acceptable performance in the single-event scenario, which demonstrates the effectiveness of our end-to-end model; however, DE-PPN-1 only generates one event and cannot deal with the multi-events problem, resulting in low performance on the multi-event sets.

Ablation Studies
To verify the effectiveness of each component of DE-PPN, we conduct ablation tests with the following variants: 1) -DocEnc: removing the Transformer-based document-level encoder, which supplies document-aware information for decoding; 2) -MultiDec: replacing the multi-granularity decoder module with simple embedding initialization for event queries and role queries; 3) -MatchingLoss: replacing the matching loss function with a normal cross-entropy loss. The results are shown in Table 4:

-DocEnc: -2.1, -3.4, -1.7, -2.6, -3.2 (Avg. -2.6)
-MultiDec: -5.1, -3.8, -4.3, -4.7, -3.6 (Avg. -4.3)
-MatchingLoss: -9.2, -12.8, -13.1, -17.5, -14.3 (Avg. -13.4)

We can observe that: 1) the document-level encoder is of prime importance, enhancing the document-aware representations for the generative decoder and contributing +2.6 F1 on average; 2) the multi-granularity decoder alleviates the challenges of arguments-scattering and multi-events by assembling arguments and generating events in parallel, improving results by +4.3 F1 on average; 3) the matching loss function is a crucial component, with a +13.4 F1 improvement, indicating that the matching loss guides a global optimization between predicted and ground-truth events during training.

Effect of Different Decoder Layers
To investigate the importance of the multi-granularity decoder, we explore the effect of the number of layers in the event decoder and the role decoder. Specifically, the number of decoder layers is set to 0, 1, 2, 3, and 4, where 0 means removing the decoder. 1) The effect of different event decoder layers is shown on the left of Figure 3: our method achieves the best average F1-score with 2 layers. We conjecture that more layers of the non-autoregressive decoder allow better modeling of the interaction between event queries and generation of diverse events; however, when the number of layers is too large, the model tends to generate redundant events. 2) The effect of different role decoder layers is shown on the right of Figure 3: the more decoder layers, the better the results. We conjecture that more decoder layers, with more self-attention modules, allow better modeling of the relationships between event roles, while more cross-attention modules integrate more candidate-argument information into the roles.

Effect of Different Generated Sets
For the training and testing of DE-PPN, the number of generated events is an important hyperparameter. In this section, we explore the influence of different numbers of generated events on the results. We divide the development set into 5 sub-classes, containing documents with 1, 2, 3, 4, and 5 events, respectively. Table 5 shows the statistics of documents with different numbers of annotated events in the development set. To validate the impact of the number of generated events on performance, we evaluate DE-PPN with 1, 2, 5, and 10 generated events, named DE-PPN-1, DE-PPN-2, DE-PPN-5, and DE-PPN-10, respectively. The results are shown in Figure 4, alongside the SOTA model Doc2EDAG. We observe that as the number of annotated events increases, event prediction becomes more difficult, reflected in the performance decline of all models. In general, DE-PPN achieves nearly the best average F1-score when the number of generated events is set to 5. Besides, there is a performance gap between Doc2EDAG and DE-PPN when a document contains more than 2 annotated events, which further demonstrates that our parallel decoder can better handle the multi-events challenge in DEE.

Related Work

Sentence-level Event Extraction

Many methods have been proposed to improve performance on SEE. These studies are mainly based on hand-designed features (Li et al., 2013; Kai and Grishman, 2015) or neural networks that learn features automatically (Chen et al., 2015; Nguyen et al., 2016; Björne and Salakoski, 2018; Yang et al., 2019; Chan et al., 2019). A few methods make extraction decisions beyond individual sentences: Ji and Grishman (2008) and Liao and Grishman (2010) used event type co-occurrence patterns for event detection, and Yang and Mitchell (2016) introduced event structures to jointly extract events and entities within a document.
Although these approaches make decisions beyond the sentence boundary, their extraction is still done at the sentence level.

Document-level Event Extraction
Many real-world applications need DEE, where the event information scatters across the whole document. MUC-4 (1992) proposed the template-filling task, which aims to identify event role fillers with associated role types from a document. Subsequent works explored local and additional context to extract role fillers via manually designed linguistic features (Patwardhan and Riloff, 2009; Riloff, 2011, 2012) or neural contextual representations. Recently, Ebner et al. (2020) published the Roles Across Multiple Sentences (RAMS) dataset, which contains annotations for the task of multi-sentence argument linking, and Zhang et al. (2020) proposed a two-step approach for argument linking that detects implicit arguments across sentences. Li et al. (2021) extended this task and compiled a new benchmark dataset, WIKIEVENTS, for document-level argument extraction, and proposed an end-to-end neural event argument extraction model based on conditional text generation. However, these works focus on sub-tasks of DEE (i.e., role filler extraction or argument extraction) and ignore the multi-events challenge. To simultaneously address both DEE challenges (i.e., arguments-scattering and multi-events), previous works use the ChFinAnn dataset (Zheng et al., 2019) and model DEE as an event table filling task, i.e., filling candidate arguments into a predefined event table. Yang et al. (2018) proposed key-event detection to guide filling the event table with arguments from the key-event mention and surrounding sentences. Zheng et al. (2019) transformed DEE into filling event tables following a predefined role order with entity-based path expanding, achieving the previous SOTA for DEE. However, these methods suffer from serial prediction, which leads to error propagation and individual argument prediction.

Conclusion and Future Work
In this paper, we propose an encoder-decoder model, DE-PPN, to extract events from a document in parallel. To address the DEE challenges (i.e., arguments-scattering and multi-events), we introduce a document-level encoder and a multi-granularity decoder to generate events in parallel with document-aware representations. To train the parallel networks, we propose a matching loss function that performs a global optimization. Experimental results show that DE-PPN significantly outperforms SOTA methods, especially when facing the specific challenges of DEE.

A Appendix
In the appendix, we incorporate the following details that are omitted in the main body due to the space limit.
• Section A.1 introduces the Hungarian algorithm.
• Section A.2 provides additional evaluation results for event classification and candidate argument extraction.
• Section A.3 shows the hyper-parameter settings.

A.1 Hungarian Algorithm
The linear sum assignment problem is also known as minimum-weight matching in bipartite graphs. A problem instance is described by a matrix C, where each C_{i,j} is the cost of matching vertex i of the first partite set (a "worker") with vertex j of the second set (a "job"). The goal is to find a complete assignment of workers to jobs with minimal cost. Formally, let X be a boolean matrix where X_{i,j} = 1 if row i is assigned to column j, and let C be the cost matrix of the bipartite graph. Then the optimal assignment has cost

min Σ_i Σ_j C_{i,j} X_{i,j}

s.t. each row is assigned to at most one column, and each column to at most one row. This formulation also covers a generalization of the classic assignment problem in which the cost matrix is rectangular: if it has more rows than columns, then not every row needs to be assigned to a column, and vice versa. The method used is the Hungarian algorithm, also known as the Munkres or Kuhn-Munkres algorithm.

A.2 Additional Evaluation Results

Table 6 shows the results of event type classification and candidate argument extraction, the two sub-tasks that precede the decoder's parallel prediction of events with corresponding arguments. We can observe that: 1) document-level event type classification achieves good performance, showing that event classification is not the difficult part of this task; 2) assembling candidate arguments into their corresponding events is the key challenge of DEE.

A.3 Hyperparameter setting
The detailed hyperparameter settings are shown in Table 7.