Exploring Sentence Community for Document-Level Event Extraction

Document-level event extraction is critical to various natural language processing tasks for providing structured information. Existing approaches based on sequential modeling neglect the complex logical structures of long texts. In this paper, we leverage the entity interactions and sentence interactions within long documents, and transform each document into an undirected unweighted graph by exploiting the relationships between sentences. We introduce the Sentence Community to represent each event as a subgraph. Furthermore, our framework SCDEE maintains the ability to extract multiple events via sentence community detection using graph attention networks, and alleviates the role overlapping issue by predicting arguments in terms of roles. Experiments demonstrate that our framework achieves competitive results over state-of-the-art methods on a large-scale document-level event extraction dataset.


Introduction
Document-level Event Extraction (DEE) aims to identify events in a long text with pre-specified types and corresponding event-specific argument roles. Figure 1 illustrates a DEE example for the Covid-19 Tracking type with 5 arguments spread across multiple sentences.
Generating document-level events is beneficial for a variety of natural language processing downstream tasks, such as knowledge base construction (Li et al., 2018), article summarization (Lee et al., 2003), and question answering (Srihari and Li, 2000), since it produces valuable structured information. However, the complex logical structures in long documents make it a more challenging task than Sentence-level Event Extraction (SEE), which extracts events from a single sentence.
Recently, a wide variety of deep neural network models (Nguyen et al., 2016; Sha et al., 2018; Yang et al., 2019; Ahmad et al., 2020; Ma et al., 2020a) have been proposed for event extraction, which capture semantic dependencies (mainly sequential dependencies) through recurrent neural networks or Transformer-based networks. However, existing models are mainly designed for sentence-level event extraction, omitting the complex interactions among entities or sentences in a long document. Therefore, document-level event extraction remains under-explored in spite of its importance. Intuitively, for long texts, (1) Entity Interaction: Entities existing in the same sentence have a higher probability of being arguments of the same event. For example, in Figure 1, entities "Israel" and "120" in [S3] tend to portray the same event.
(2) Sentence Interaction: Sentences containing the same entity tend to narrate the same event. For example, in Figure 1, [S1]-[S4], which contain the same entity "Israel", tend to depict the same event.
Considering the above properties, in this paper, we propose to build document graphs based on these interactions and bring the document-level event extraction from sequential modeling to graphical document representation, which could be exploited to handle multiple problems in DEE.
Specifically, we first propose a novel method that transforms each document into an undirected unweighted graph. Each sentence is represented as one node in light of the entity interaction, and we assign each node a comprehensively encoded attribute vector based on BERT (Devlin et al., 2019). Besides, the edges are constructed from entity co-occurrences between sentences in view of the sentence interaction. Compared with sequential modeling, the graph structure maintains the capability to propagate information from long-distance sentences to their related sentences through far fewer transitions.
Second, we propose the so-called Sentence Community to represent each event as a subgraph of the constructed document graph. Specifically, we designate each sentence community as the set of sentences that contain the arguments required for the corresponding event.
In this way, the selected sentences also contain information about the corresponding event type. Therefore, each sentence community contains all the information needed for the event. Each sentence community corresponds to the related sentence nodes and edges in the document graph.
Third, we are able to mitigate the following issues based on our graphical representation: (1) Multi-event issue. Extracting multiple events for DEE is challenging because of argument scattering and overlapping. In the real world, long texts are prone to contain multiple events. To extract multiple events, we employ Graph Attention Networks (GAT) (Velickovic et al., 2018) with multi-head graph attention to detect overlapping sentence communities (Shchur and Günnemann, 2019), and then we classify event types and extract corresponding arguments with an entity-level attention mechanism for each sentence community. (2) Role overlapping issue. An interesting problem in DEE is the role overlapping issue, which refers to the phenomenon that an argument can play multiple roles; little attention has been paid to this problem. For example, in the sentence "On Mar 3 2021, FedEx pledges $2 billion toward sustainable energy initiatives", "Mar 3 2021" plays both the role "StartDate" and the role "EndDate" at the same time. We mitigate this issue by predicting arguments in terms of roles.
In summary, our contributions include:
• We propose a novel graph construction method for long documents with a comprehensively encoded attribute vector for each sentence node.
• We propose a novel framework, SCDEE, which explores Sentence Communities for Document-level Event Extraction and alleviates both the multi-event issue and the role overlapping issue.
• We perform a thorough evaluation of our framework and show the effectiveness on a large-scale document-level event extraction dataset.

Methodology
In this section, we present our proposed framework. We first introduce the document graph construction method. Then we present the GNN-based sentence community detection approach. Finally, we explain the event type and argument classification module. An overview is shown in Figure 2.

Document Graph Construction
We denote one document D as a sequence of sentences D = [s_1, ..., s_i, ..., s_N]. For each document, we construct an undirected unweighted graph G = (V, E), where the node set V = {v_1, v_2, ..., v_N} contains one node per sentence and E = {(u, v) ∈ V × V : A_uv = 1} is the set of edges, with A ∈ {0, 1}^{N×N} being a binary adjacency matrix.

Adjacency Matrix. The adjacency matrix is constructed based on entity co-occurrences between sentences. For each sentence, entities are recognized by the well-performing BiLSTM-CRF model (Huang et al., 2015). Then we set A_ij = A_ji = 1 for any sentences s_i and s_j containing the same entity. Besides, we add self-loops to A, i.e., A_ii = 1 for 1 ≤ i ≤ N.

Node Attribute Vector. To comprehensively encode the sentence information for each node, the attribute vector is constructed from two segments: (1) the entity-level feature vector α, which presents the information of event argument candidates, and (2) the sentence-level feature vector β, which reflects the information of the event type.

Figure 2: An overview of our SCDEE architecture. The input document contains 6 sentences with 2 events. Arguments of the first event (in orange and blue) are scattered in S1-S4, which form the first sentence community (in purple). Arguments of the second event (in green) are scattered in S4-S6, which form the second sentence community (in grey). The two sentence communities overlap on S4.
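The adjacency-matrix construction above can be sketched in a few lines of Python. This is a minimal illustration, assuming per-sentence entity mentions are already available from the upstream NER step; it is not our exact implementation:

```python
def build_adjacency(sentence_entities):
    """Return an N x N binary adjacency matrix with self-loops,
    connecting any two sentences that share at least one entity.

    sentence_entities: list of sets of entity surface strings,
    one set per sentence (assumed output of an NER model).
    """
    n = len(sentence_entities)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1  # self-loop: A_ii = 1
        for j in range(i + 1, n):
            # A_ij = A_ji = 1 iff s_i and s_j share an entity
            if sentence_entities[i] & sentence_entities[j]:
                A[i][j] = A[j][i] = 1
    return A
```

For instance, sentences mentioning the same entity (e.g. "Israel" in the Figure 1 example) become connected, while unrelated sentences keep only their self-loops.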
Specifically, for each sentence s_i containing N_i words, we employ the BERT representation model on s_i and obtain the last-layer embedding matrix B_i ∈ R^{N_i×d_B}, where d_B denotes the hidden layer dimensionality of BERT. For each recognized entity in s_i covering the j-th to k-th tokens, we obtain the entity embedding e_i ∈ R^{d_B} by conducting a max-pooling operation on the corresponding index range of B_i, i.e., e_i = MaxPooling(B_i[j:k]). Then we conduct another max-pooling operation on all l entities existing in s_i to obtain the fixed-size entity-level feature vector α ∈ R^{d_B}, i.e., α = MaxPooling([e_1, ..., e_l]). The sentence-level feature vector β is obtained by max-pooling over B_i. Finally, we employ a Bi-LSTM layer on the concatenation of α and β to get the node attribute vector h_i ∈ R^D, i.e., h_i = Bi-LSTM(α ⊕ β), where D is the dimensionality of the Bi-LSTM hidden states and ⊕ denotes the concatenation operation.
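The two pooled feature segments can be sketched as follows. This is a simplified NumPy illustration of the max-pooling steps only; the BERT encoder and the final Bi-LSTM layer are omitted:

```python
import numpy as np

def node_attribute(B_i, entity_spans):
    """B_i: (N_i, d) token embeddings for sentence s_i (e.g. from BERT).
    entity_spans: list of inclusive (j, k) token index ranges for the
    recognized entities.

    Returns the concatenation [alpha; beta] that would be fed to the
    Bi-LSTM (the Bi-LSTM itself is omitted in this sketch).
    """
    # entity embeddings e_i: max-pool over each entity's token span
    ents = np.stack([B_i[j:k + 1].max(axis=0) for j, k in entity_spans])
    alpha = ents.max(axis=0)   # entity-level feature vector
    beta = B_i.max(axis=0)     # sentence-level feature vector
    return np.concatenate([alpha, beta])
```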

Sentence Community Detection
Given the constructed document graph G = (V, E) with N vertices and node attribute vectors h = [h_1, h_2, ..., h_N], we first generate the target sentence community for each event within the document. Then we utilize GAT to detect overlapping sentence communities, as nodes might be shared by several sentence communities.

Target Sentence Community. For a document containing C events and N sentences, we construct a binary affiliation matrix F ∈ {0, 1}^{N×C}, with each column representing one sentence community, and we set F_{i,j} = 1 if the i-th sentence contains any argument of the j-th event. Each sentence may be assigned to multiple sentence communities or to none, depending on whether these sentence communities overlap with each other.

Community Detection via GAT. We employ GAT to model the information flow between nodes and predict overlapping sentence communities. There are several advantages of utilizing GNN-based models for overlapping sentence community detection. First, GNNs can capture long-range dependencies between sentences through edges. Second, GNNs tend to produce similar community affiliation vectors for densely connected subgraphs. In our implementation, we exploit GAT for sentence community detection: the local node attribute vectors are aggregated into more informative vectors via an attention mechanism over neighbor features. Besides, GAT does not depend on upfront access to the global graph structure, as the attention mechanism is applied in a shared manner to all edges in the graph. Therefore, it is directly applicable to inductive learning, which means it can predict communities on graphs that are completely unseen during training.
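The construction of the target affiliation matrix F described above can be illustrated as follows. This is a minimal sketch assuming gold events are given as sets of argument surface strings:

```python
def build_affiliation(sentence_entities, events):
    """sentence_entities: per-sentence set of entity strings.
    events: list of gold events, each a set of argument strings.

    Returns the N x C binary matrix F with F[i][j] = 1 iff
    sentence i contains any argument of event j.
    """
    N, C = len(sentence_entities), len(events)
    return [[1 if sentence_entities[i] & events[j] else 0
             for j in range(C)]
            for i in range(N)]
```

Note that a sentence sharing arguments with two events receives two 1-entries in its row, which is exactly how overlapping sentence communities arise.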
In general, the input to the GAT layers is an undirected unweighted graph G = (V, E) with the adjacency matrix A and node attribute vectors h = [h_1, h_2, ..., h_N]. We use D' to denote the cardinality of the GAT outputs. We briefly describe the GAT layer used in our implementation. The attention score α_ij, which indicates the importance of the neighbor node j to the attended node i, is

α_ij = softmax_j( σ( a^T [W h_i ⊕ W h_j] ) ), j ∈ N_i,

where σ is the LeakyReLU activation, a ∈ R^{2D'} is a fully connected layer, ·^T represents transposition, ⊕ denotes concatenation, W ∈ R^{D'×D} denotes a weight matrix, and N_i denotes the neighbors of node i.
We employ the multi-head attention mechanism with K heads to capture more information from different representation subspaces:

h'_i = ⊕_{k=1}^{K} σ( Σ_{j∈N_i} α_ij^k W^k h_j ).

Then we obtain the predicted feature matrix X ∈ R^{N×(K·D')} by stacking the K-head GAT outputs h'_i ∈ R^{K·D'}, i = 1, 2, ..., N.
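For concreteness, a single attention head of the GAT layer can be sketched as below. This is a simplified NumPy illustration rather than our trained implementation; the multi-head variant runs K such heads with separate parameters and concatenates their outputs:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H, A, W, a):
    """One single-head GAT layer (sketch).
    H: (N, D) node features; A: (N, N) binary adjacency with self-loops;
    W: (D, Dp) weight matrix; a: (2*Dp,) attention vector.
    """
    Z = H @ W                          # projected features W h_i
    N = Z.shape[0]
    out = np.zeros_like(Z)
    for i in range(N):
        nbrs = [j for j in range(N) if A[i][j] == 1]
        # e_ij = LeakyReLU(a^T [W h_i ; W h_j])
        e = np.array([leaky_relu(a @ np.concatenate([Z[i], Z[j]]))
                      for j in nbrs])
        att = np.exp(e - e.max())
        att /= att.sum()               # softmax over the neighborhood N_i
        out[i] = att @ Z[nbrs]         # attention-weighted aggregation
    return out
```

With self-loops in A, an isolated node simply attends to itself and keeps its projected features.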
We employ a Multi-Layer Perceptron (MLP) on X with a hidden dimension of 2C, and we reshape the output into R^{N×2×C}, with 2 being the cardinality of the target affiliation matrix F ∈ {0, 1}^{N×C}. Then we employ softmax along the second dimension, i.e.,

P = softmax( reshape( MLP(X) ) ),

where values along the second dimension of P ∈ R^{N×2×C} represent the probabilities of nodes affiliating with sentence communities. We assign node v_i ∈ V to sentence community c if the corresponding probability exceeds one half, and we thereby obtain our predicted affiliation matrix F ∈ {0, 1}^{N×C}.
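The thresholded node-to-community assignment can be sketched as:

```python
def assign_communities(P, threshold=0.5):
    """P: nested list of shape (N, 2, C), where P[i][1][c] is the
    probability that node i belongs to community c (and P[i][0][c]
    the probability that it does not).

    Returns the predicted N x C binary affiliation matrix.
    """
    N = len(P)
    C = len(P[0][0])
    return [[1 if P[i][1][c] > threshold else 0 for c in range(C)]
            for i in range(N)]
```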
Besides, we calculate the high-dimensional cross-entropy loss L_CD based on P and the target affiliation matrix F:

L_CD = − (1/(N·C)) Σ_{i=1}^{N} Σ_{c=1}^{C} log P_{i, F_{i,c}, c}.
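The community-detection loss can be illustrated as follows. This sketch assumes a mean reduction over all node-community pairs, which may differ from the exact reduction used in the implementation:

```python
import math

def community_loss(P, F, eps=1e-9):
    """Cross-entropy between predicted probabilities P (N, 2, C) and
    the binary target affiliation matrix F (N, C).

    For each (node, community) pair, the target label F[i][c] selects
    which of the two probability slots should be maximized.
    """
    N, C = len(F), len(F[0])
    total = 0.0
    for i in range(N):
        for c in range(C):
            t = F[i][c]  # 0 or 1 indexes the target slot
            total -= math.log(P[i][t][c] + eps)
    return total / (N * C)
```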

Event Type Classification
We predict the event type for sentence community j based on the predicted affiliation matrix F ∈ {0, 1}^{N×C}. First, the event embedding E_event is obtained by conducting a max-pooling operation on the selected node attribute vectors, i.e., E_event = MaxPooling( F_{:,j} ⊙ h ), where ⊙ denotes the element-wise product.
Then, for the V pre-defined target event types, the event type is predicted by applying a fully connected layer on the event embedding E_event with a softmax function to estimate the probability distribution, i.e., p_ET = softmax( W E_event + b ), where W ∈ R^{V×D} and b ∈ R^V are weights. The loss function for event type classification, L_ET, is the cross-entropy loss L_ET = − Σ_v y_ET,v log p_ET,v, where y_ET is the one-hot label of the event type.
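The masked max-pooling and softmax classification above can be sketched as below, a NumPy illustration with toy (untrained) weights:

```python
import numpy as np

def classify_event_type(H, f_col, W, b):
    """H: (N, D) node attribute vectors.
    f_col: length-N binary membership vector (one column of F)
    for the sentence community under consideration.
    W: (V, D) weight matrix; b: (V,) bias, for V event types.

    Returns the softmax probability distribution over event types.
    """
    members = H[np.array(f_col, dtype=bool)]
    e_event = members.max(axis=0)      # max-pool over community nodes
    logits = W @ e_event + b
    p = np.exp(logits - logits.max())
    return p / p.sum()                 # softmax
```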

Event Argument Classification
Given the sentences in each sentence community and the predicted event type, we extract the corresponding arguments. First, we take the entities within these sentences and their embeddings as depicted in Equation 1. For entities sharing the same surface name, we merge their embeddings by a max-pooling operation. We then obtain m entity embeddings with distinct surface names, denoted as E ∈ R^{m×d_B}. We employ a Bi-LSTM layer to make the embeddings more informative and obtain E' ∈ R^{m×L}, with L being the hidden size of the Bi-LSTM.

Entity-Level Attention Layer. To capture the associations between entities, we further design an entity-level attention mechanism to aggregate information. The attention scores α ∈ R^m (measuring similarity or relatedness) are calculated as α = softmax( E' W + b ), where W ∈ R^L and b ∈ R are weights.
Then the final entity embedding F_i ∈ R^{2L} is computed by concatenating each entity embedding with the attention-weighted context, i.e., F_i = E'_i ⊕ Σ_{j=1}^{m} α_j E'_j.

Role Overlapping Issue. We predict arguments for each argument role to mitigate this issue. First, we feed the final entity embedding F_i to a sigmoid function to estimate the relative scores for argument classification, instead of using an ordinary softmax classifier: p_i = sigmoid( W F_i + b ), where W ∈ R^{C×2L} and b ∈ R^C are weights, and C denotes the number of roles corresponding to the predicted event type. Then, for each role, we select the entity with the highest score exceeding the threshold p_0 as the argument. In this way, an entity can be the argument of multiple roles.
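The role-wise argument selection can be illustrated as below, a minimal sketch operating on precomputed sigmoid scores:

```python
def select_arguments(scores, p0=0.5):
    """scores: (m, C) nested list of sigmoid scores for m entities
    over C roles of the predicted event type.

    For each role, pick the highest-scoring entity if its score
    exceeds the threshold p0, else None. The same entity may be
    selected for several roles (role overlapping).
    """
    m, C = len(scores), len(scores[0])
    args = []
    for c in range(C):
        best = max(range(m), key=lambda i: scores[i][c])
        args.append(best if scores[best][c] > p0 else None)
    return args
```

Because each role is decided independently, an entity such as "Mar 3 2021" can be selected for both "StartDate" and "EndDate".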
We assume the ground-truth label for each entity is y ∈ R^C, where y_i ∈ {0, 1} denotes whether the entity is the argument of the i-th role, and we utilize the binary cross-entropy loss L_EA for argument classification:

L_EA = − Σ_{i=1}^{C} [ y_i log p_i + (1 − y_i) log(1 − p_i) ].

Objective Function
We utilize the weighted summation of L_CD, L_ET, and L_EA as our final loss, i.e., L = λ_1 L_CD + λ_2 L_ET + λ_3 L_EA, where λ_1, λ_2 and λ_3 are hyper-parameters.


Experimental Setup

Dataset. We evaluate our framework on a large-scale document-level event extraction dataset covering five event types, with 8, 6, 6, 6, and 9 pre-defined roles, respectively. The training set accounts for 80%, and the development and test sets each account for 10%. The detailed statistics are shown in Table 1.
We can see from Table 1 that the number of EP-type documents is much larger than that of the other types. Therefore, in each epoch, we randomly sample 40% of the EP-type documents, which yields a size similar to the EU and EO types. Besides, we randomly resample the documents so that the numbers of single-event, double-event, and triple-event documents are the same.

Implementation Details. In our experiments, we set the hidden dimension of all the LSTM layers used in our framework to 250, and set the dropout rate to 0.2 to avoid overfitting. We employ a one-layer GAT model with K = 3 attention heads computing D' = 200 features per head (for a total of 600 features). In the event argument classification part, the probability threshold p_0 is set to 0.5 to mitigate the role overlapping issue. During training, we set λ_1 = 3 and λ_2 = λ_3 = 1 in the objective function. We employ Adam (Kingma and Ba, 2015) to optimize the model parameters with an initial learning rate of 0.001, β_1 = 0.9, β_2 = 0.999, and ε = 10^-8. We implement our model in PyTorch 1.7.1 with one NVIDIA Titan Xp GPU. For all experiments, we set the maximal number of training epochs to 50.
Evaluation Metrics. The goal of DEE is to correctly predict the event type and extract the related arguments. Following Zheng et al. (2019), for each document, we select the most similar predicted event record when the predicted event type is correct, and then we calculate the event-role-specific true positive, false positive, and false negative statistics until no target event records remain. Then we aggregate the statistics for each event type and report the precision and F1 scores in percentage format.

Experimental Results and Analysis
Baseline Models. To comprehensively evaluate our framework, we compare it with the following state-of-the-art baselines:

• DCFEE employs an argument-completion strategy to generate document-level event records by utilizing the arguments from sentence-level event extraction results. To handle multi-event extraction, DCFEE-O and DCFEE-M (Zheng et al., 2019) are proposed, producing one event record and multiple possible argument combinations from one key-event sentence, respectively.
• Doc2EDAG (Zheng et al., 2019) generates an entity-based directed acyclic graph to extract multiple events from documents. Besides, GreedyDec fills one event table entry greedily by using recognized entity roles, sharing the same architecture as Doc2EDAG.

Main Results. Table 2 presents the performance comparison of different models. Overall, our framework SCDEE outperforms all other methods on the test set and improves the averaged precision and F1 scores by 5.1% and 2.6%, respectively, over the state-of-the-art Doc2EDAG model. Specifically, compared with DCFEE-O and DCFEE-M, our framework achieves better results in both precision and F1 scores on all five event types. When compared with GreedyDec, which holds relatively high precision, our framework still improves the averaged precision by 9.8%.

Performance Analysis. Concretely, we transform the long document into a graph and provide shortcuts between closely related sentences in a sentence community. Compared with DCFEE-O and DCFEE-M, which predict missing arguments from surrounding sentences, we attribute the improvements over DCFEE to the graph structure and the GAT layer, which alleviate the long-range dependency issue. When compared with GreedyDec, which extracts events greedily using the recognized entity roles, we consider the reason to lie in the stronger association between entities within the same sentence, which means that these entities are more likely to portray the same event. The overall performance of the strongest baseline, Doc2EDAG, is slightly inferior to our model: though Doc2EDAG generates multiple events via path-expanding subtasks, it ignores the role overlapping problem in DEE, which we alleviate by predicting arguments in terms of roles.

Ablation Study
As shown in Table 3, we conduct ablation experiments by evaluating three key designs to demonstrate the effectiveness of components in our framework.
• -GAT. We investigate the effectiveness of the GAT layer in our framework. To be fair, we replace the GAT layer with a fully connected layer. Experimental results confirm the effectiveness of the GAT layer in our framework.
• -ELA. We remove the entity-level attention layer that aims to capture the associations between entities. We show that the attention layer helps to incorporate information from other entities and improves the overall performance.
• -ROI. We replace the sigmoid function and binary cross-entropy loss in the event argument classification with the general softmax classifier and cross-entropy loss respectively in order to explore how the role overlapping issue affects the experimental results. We find that F1 scores of the EF and EO types drop significantly, which might mean that they suffer the most from this issue.

Single & Multiple DEE Analysis
We conduct experiments to study the performance of our framework on single-event and multi-event documents, and the influence of the aforementioned three key components. As shown in Table 4, we find that (1) for single-event documents, our framework achieves superior performance in terms of both precision and F1 scores. In addition, -GAT leads to the largest drop in precision, and -ROI causes the largest drop in F1 score, which suggests that the role overlapping issue might be the critical obstacle.
(2) For multi-event documents, our framework achieves reasonably good performance. Besides, -ROI results in noteworthy performance degradation in both precision and F1 scores, demonstrating that the role overlapping issue hinders multiple event extraction.

Effect of GAT Architecture
We conduct experiments to see how the model's performance is affected by the GAT architecture. First, we perform a set of experiments on a single-layer GAT with different numbers of heads. Experimental results in Table 5 show no notable difference between the 1-head and 4-head GAT, although the 4-head model needs more time to converge due to the increased number of parameters; further increasing the number of heads leads to performance degradation.

The deeper, the better? We further investigate the framework performance using two-layer GAT networks with different numbers of heads. We employ the exponential linear unit (ELU) (Clevert et al., 2016) as the activation function between layers. As described in Table 6, the overall F1 scores drop significantly whether we increase the number of heads in the first or the second layer. The possible reason may lie in the over-smoothing issue (Zhou et al., 2018), whereby the node attribute vectors tend to converge to similar values.

Time complexity
In real-world industry applications, entities in news articles are usually extracted in advance by highly efficient tools. For the document graph G = (V, E), let N_s be the number of sentences, N_e the number of all extracted entities, and N_u the number of entities with distinct surface names. Generating node attribute vectors requires O(N_e) complexity. For sentence community detection, the GAT layer requires O(N_s · D · D' + |E| · D'), with D and D' being the input and output dimensionality, and the complexity of node assignment is O(N_s).
For argument classification, the complexity of the entity-level attention layer is O(N_u). Notably, N_s, N_e, and N_u are far smaller than the length of the document, which makes our model work efficiently.

Case Study
We visualize the graph structure of a document and analyse its properties, as shown in Figure 3. First, as shown in Figure 3(a), two thirds of the sentences contain no entity. Our framework can filter out these noisy sentences and focus on informative ones, which is an advantage over the baseline DCFEE.
Second, in Figure 3(b), from the perspective of sentence communities, the document graph is composed of two overlapping sentence communities. Notably, the first sentence community corresponds to the complete graph K6, since all its sentence nodes share the entity Wu Peifu. The second sentence community corresponds to the complete graph K5, with all its sentence nodes sharing the entity Wu Di. Sentences related to each event are densely connected under the definition of sentence community. Third, as depicted in Figure 3(c), our framework reduces the irrelevant argument candidates for each event compared with the baseline Doc2EDAG, as entities within each sentence community are more closely related.

Figure 3: (a) An example document containing 24 sentences with 2 EquityOverweight events; we exclusively present the 8 sentences with recognized entities (in red). (b) An example document graph with two sentence communities: the first corresponds to the complete graph K6, the second to the complete graph K5. (c) An example of argument classification: an entity might be classified into multiple roles if these roles overlap.
The above results verify that graphical representation is advantageous for document-level event extraction.

Related Work
Event Extraction (EE), a challenging sub-task of information extraction, has recently been studied under two paradigms: sentence-level EE and document-level EE.

Sentence-level Event Extraction mainly follows the requirements of the ACE event extraction task (Doddington et al., 2004), which aims to detect the event trigger and arguments from a sentence. This task can be further decomposed into two sub-tasks: Event Detection, which aims to identify the event triggers (Feng et al., 2016; Liu et al., 2017; Yan et al., 2019; Cui et al., 2020; Lai et al., 2020a,b), and Event Argument Role Labeling, which aims to predict whether words or phrases participate in the event argument roles (Wang et al., 2019; Yun et al., 2019; Pouran Ben Veyseh et al., 2020; Ma et al., 2020b; Ahmad et al., 2020). Furthermore, various studies have been dedicated to extracting event triggers and arguments simultaneously (Sha et al., 2018; Yang et al., 2019; Tang et al., 2020; Du and Cardie, 2020b).

Document-level Event Extraction aims to identify event types and corresponding event argument roles. Compared with sentence-level event extraction, the main difference is that it is no longer necessary to identify the event trigger words explicitly.
From the perspective of modeling, early approaches employ a sequence tagging model to extract document-level events by utilizing sentence-level results. Zheng et al. (2019) propose an end-to-end model that transforms the DEE task into several sequential path-expanding sub-tasks, with each final path being a predicted event record. Du and Cardie (2020a) show that longer text might hurt model performance and propose a multi-granularity reader to incorporate sentence-level and paragraph-level information. Huang and Peng (2020) propose to leverage Deep Value Networks (DVN), which capture cross-event dependencies, to jointly resolve both entity and event coreference for DEE. Recent work also introduces end-to-end generative transformer-based models to extract arguments across sentence boundaries.