Document-level Event Extraction via Heterogeneous Graph-based Interaction Model with a Tracker

Document-level event extraction aims to recognize event information from a whole article. Existing methods are not effective due to two challenges of this task: a) the target event arguments are scattered across sentences; b) the correlation among events in a document is non-trivial to model. In this paper, we propose a Heterogeneous Graph-based Interaction Model with a Tracker (GIT) to solve these two challenges. For the first challenge, GIT constructs a heterogeneous graph interaction network to capture global interactions among different sentences and entity mentions. For the second, GIT introduces a Tracker module to track the extracted events and hence capture the interdependency among them. Experiments on a large-scale dataset (Zheng et al., 2019) show that GIT outperforms previous methods by 2.8 F1. Further analysis reveals that GIT is effective in extracting multiple correlated events and event arguments that scatter across the document.


Introduction
Event Extraction (EE) is one of the key and challenging tasks in Information Extraction (IE), which aims to detect events and extract their arguments from text. Most previous methods (Chen et al., 2015; Nguyen et al., 2016; Yang et al., 2019; Du and Cardie, 2020b) focus on sentence-level EE, extracting events from a single sentence. Sentence-level models, however, fail to extract events whose arguments spread over multiple sentences, which is much more common in real-world scenarios. Hence, extracting events at the document level is critical, and it has attracted much attention recently (Du and Cardie, 2020a; Du et al., 2020).

Figure 1: An example document in the financial domain, translated into English for illustration. Entity mentions are colored. Due to space limitations, we only show four associated sentences and three argument roles of each event type. The complete original document can be found in Appendix C. EU: Equity Underweight, EO: Equity Overweight.

Though promising, document-level EE still faces two critical challenges. First, the arguments of an event record may scatter across sentences, which requires a comprehensive understanding of the cross-sentence context. Figure 1 illustrates an example in which one Equity Underweight (EU) and one Equity Overweight (EO) event record are extracted from a financial document. It is less challenging to extract the EU event because all the related arguments appear in the same sentence (Sentence 2). However, for the arguments of the EO record, Nov 6, 2014 appears in Sentences 1 and 2 while Xiaoting Wu appears in Sentences 3 and 4. It would be quite challenging to identify such events without considering global interactions among sentences and entity mentions. Second, a document may express several correlated events simultaneously, and recognizing the interdependency among them is fundamental to successful extraction.
As shown in Figure 1, the two events are interdependent because they correspond to exactly the same transaction and therefore share the same StartDate. Effectively modeling such interdependency among correlated events remains a key challenge in this task. Earlier work extracts events from a central sentence and queries the neighboring sentences for missing arguments, which ignores the cross-sentence correspondence between arguments. Though Zheng et al. (2019) take a first step towards fusing sentence and entity information via Transformer, they neglect the interdependency among events. Focusing on single-event extraction, Du and Cardie (2020a) and Du et al. (2020) concatenate multiple sentences and only consider a single event, which lacks the ability to model multiple events scattered in a long document.
To tackle the aforementioned two challenges, in this paper, we propose a Heterogeneous Graph-based Interaction Model with a Tracker (GIT) for document-level EE. To deal with arguments scattered across sentences, we focus on the Global Interactions among sentences and entity mentions. Specifically, we construct a heterogeneous graph interaction network with mention nodes and sentence nodes, and model the interactions among them with four types of edges (i.e., sentence-sentence, sentence-mention, intra-mention-mention, and inter-mention-mention edges) in a graph neural network. In this way, GIT jointly models the entities and sentences in the document from a global perspective.
To facilitate multi-event extraction, we target the Global Interdependency among correlated events. Concretely, we propose a Tracker module that continually tracks the extracted event records with a global memory. In this way, the model is encouraged to incorporate the interdependency with other correlated event records while predicting.
We summarize our contributions as follows:
• We construct a heterogeneous graph interaction network for document-level EE. With different heterogeneous edges, the model can capture the global context for event arguments scattered across different sentences.
• We introduce a novel Tracker module to track the extracted event records. The Tracker eases the difficulty of extracting correlated events, as interdependency among events would be taken into consideration.
• Experiments show that GIT outperforms the previous state-of-the-art model by 2.8 F1 on the large-scale public dataset (Zheng et al., 2019) with 32,040 documents, especially in cross-sentence and multi-event scenarios (with 3.7 and 4.9 absolute gains in F1, respectively).

Preliminaries
We first clarify some important notions. a) entity mention: a text span within the document that refers to an entity object; b) event argument: an entity playing a specific event role. Event roles are predefined for each event type; c) event record: an entry of a specific event type containing arguments for the different roles in the event. For simplicity, we use record for short in the following sections. Following prior work, given a document composed of sentences D = {s_i}_{i=1}^{|D|}, with each sentence containing a sequence of words s_i = {w_j}_{j=1}^{|s_i|}, the task comprises three sub-tasks: 1) entity extraction: extracting entities from the document to serve as argument candidates. An entity may have multiple mentions across the document. 2) event type detection: detecting the specific event types expressed by the document. 3) event record extraction: finding appropriate arguments for the expressed events from the entities, which is the most challenging sub-task and the focus of our paper. The task does not require identifying event triggers (Zeng et al., 2018; Liu et al., 2019b), which reduces the manual annotation effort and broadens the application scenarios.
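The notions above can be summarized in a small set of data structures. This is a hypothetical sketch (the class and field names are ours, not from the dataset) of how inputs and outputs of the task might be typed:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class EntityMention:
    """A text span that refers to an entity: sentence index + character span."""
    text: str
    sent_idx: int
    span: Tuple[int, int]  # (start, end) offsets within the sentence

@dataclass
class Entity:
    """An entity may surface as multiple mentions across the document."""
    mentions: List[EntityMention]

@dataclass
class EventRecord:
    """One record of a given event type: role name -> filler entity."""
    event_type: str
    arguments: Dict[str, Entity] = field(default_factory=dict)

# A document is a list of sentences; outputs are entities, types, and records.
doc = ["On Nov 6, 2014 shares were sold.", "Xiaoting Wu bought them."]
record = EventRecord("EquityOverweight")
record.arguments["StartDate"] = Entity([EntityMention("Nov 6, 2014", 0, (3, 14))])
```

Note that one `Entity` can fill roles in several `EventRecord`s, which is exactly the record interdependency the Tracker module exploits.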

Methodology
As shown in Figure 2, GIT first extracts candidate entities with a sentence-level neural extractor (Sec 3.1). Then we construct a heterogeneous graph to model the interactions among sentences and entity mentions (Sec 3.2), and detect the event types expressed by the document (Sec 3.3). Finally, we introduce a Tracker module that continuously tracks all the records with a global memory, in which we utilize the global interdependency among records for multi-event extraction (Sec 3.4).

Entity Extraction
Given a sentence s = {w_j}_{j=1}^{|s|} ∈ D, we encode s into a sequence of vectors {g_j}_{j=1}^{|s|} using Transformer (Vaswani et al., 2017). The word representation of w_j is the sum of the corresponding token and position embeddings.
We extract entities at the sentence level and formulate it as a sequence tagging task with the BIO (Begin, Inside, Other) scheme. We leverage a conditional random field (CRF) layer to identify entities. For training, we minimize the CRF negative log-likelihood, L_1 = −∑_{s∈D} log p(y_s | s), where y_s is the gold label sequence of s. For inference, we use the Viterbi algorithm to decode the label sequence with the maximum probability.
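The Viterbi decoding step can be illustrated with a minimal sketch over per-token emission scores and a label-transition matrix. The scores below are toy values chosen for illustration, not the trained CRF parameters:

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, K) token-label scores; transitions: (K, K) label->label.
    Returns the highest-scoring label sequence (BIO decoding)."""
    T, K = emissions.shape
    score = emissions[0].copy()          # best score ending in each label
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        # score'[j] = max_i (score[i] + transitions[i, j]) + emissions[t, j]
        cand = score[:, None] + transitions
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    # follow backpointers from the best final label
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 3-label (0=O, 1=B, 2=I) example: "I" is rewarded only after "B"/"I".
em = np.array([[0.1, 2.0, 0.0], [0.0, 0.1, 2.0], [2.0, 0.1, 0.0]])
tr = np.array([[0, 0, -9], [0, 0, 1], [0, 0, 1]], dtype=float)
print(viterbi(em, tr))  # prints [1, 2, 0], i.e. B I O
```

The strongly negative O→I transition is what enforces BIO validity: a high emission score for I cannot override an illegal transition.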

Heterogeneous Graph Interaction Network
An event may span multiple sentences in the document, which means its corresponding entity mentions may also scatter across different sentences. Identifying and modeling these entity mentions in the cross-sentence context is fundamental in document EE. Thus we build a heterogeneous graph G which contains entity mention nodes and sentence nodes in the document D. In the graph G, interactions among multiple entity mentions and sentences can be explicitly modeled. For each entity mention node e, we initialize its node embedding h_e^{(0)} by averaging the representations of the words it contains. For each sentence node s, we initialize the node embedding h_s^{(0)} = Max({g_j}_{j∈s}) + SentPos(s) by max-pooling the representations of all words within the sentence and adding a sentence position embedding.
To capture the interactions among sentences and mentions, we introduce four types of edges.
Sentence-Sentence Edge (S-S) Sentence nodes are fully connected to each other with S-S edges. In this way, we can easily capture the global properties of the document with sentence-level interactions, e.g., the long-range dependency between any two separate sentences in the document can be modeled efficiently with S-S edges.

Sentence-Mention Edge (S-M)
We model the local context of an entity mention in a specific sentence with S-M edges, i.e., the edge connecting a mention node to the sentence node it belongs to.

Intra-Mention-Mention Edge (M-M intra )
We connect distinct entity mentions in the same sentence with M-M intra edges. The co-occurrence of mentions in a sentence indicates that those mentions are likely to be involved in the same event. We explicitly model this indication with M-M intra edges.

Inter-Mention-Mention Edge (M-M inter )
Entity mentions that correspond to the same entity are fully connected with each other by M-M inter edges. Since in document EE an entity usually corresponds to multiple mentions across sentences, we use M-M inter edges to track all the appearances of a specific entity, which facilitates long-distance event extraction from a global perspective.
In Section 4.5, experiments show that all four kinds of edges play an important role in event extraction, and the performance decreases without any one of them.
After constructing the heterogeneous graph*, we apply a multi-layer Graph Convolutional Network (Kipf and Welling, 2017) to model the global interactions, inspired by Zeng et al. (2020). Given node u at the l-th layer, the graph convolutional operation over the different edge types is defined as follows:

h_u^{(l+1)} = ReLU( Σ_{k∈K} Σ_{v∈N_k(u)} (1/c_{u,k}) W_k^{(l)} h_v^{(l)} ),

where W_k^{(l)} ∈ R^{d_m×d_m} are trainable parameters, N_k(u) denotes the neighbors of node u connected by the k-th type of edge, and c_{u,k} is a normalization constant. We then derive the final hidden state h_u for node u by combining the outputs of all layers, where h_u^{(0)} is the initial node embedding of node u and L is the number of GCN layers.
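One relation-aware graph convolution layer over typed edges can be sketched in a few lines of numpy. The edge lists, dimensions, and weights below are illustrative toys, not the trained model:

```python
import numpy as np

def rgcn_layer(H, edges_by_type, weights):
    """H: (N, d) node states; edges_by_type: {type: list of (u, v) pairs};
    weights: {type: (d, d) matrix W_k}. Computes, per node u,
    ReLU(sum over edge types k, neighbors v of (1/c_{u,k}) * W_k @ h_v)."""
    N, d = H.shape
    out = np.zeros_like(H)
    for k, edges in edges_by_type.items():
        deg = np.zeros(N)                # neighbor counts = c_{u,k}
        for u, v in edges:
            deg[u] += 1
        for u, v in edges:
            out[u] += (weights[k] @ H[v]) / deg[u]
    return np.maximum(out, 0.0)          # ReLU

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))                        # 2 sentence + 2 mention nodes
edges = {"S-S": [(0, 1), (1, 0)],                  # sentences fully connected
         "S-M": [(0, 2), (2, 0), (1, 3), (3, 1)]}  # mention <-> its sentence
W = {k: 0.1 * rng.normal(size=(8, 8)) for k in edges}
H1 = rgcn_layer(H, edges, W)
```

Stacking L such layers (here L = 3, as in the experiments) lets information from one sentence reach a mention in another sentence via the typed-edge paths.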
Finally, we obtain the sentence embedding matrix S = [h_1 h_2 ... h_{|D|}] ∈ R^{d_m×|D|} and the entity embedding matrix E ∈ R^{d_m×|E|}. The i-th entity may have multiple mentions; we simply use string matching to detect entity coreference following prior work, and the entity embedding E_i is computed as the average of its mention node embeddings, E_i = Mean({h_j}_{j∈Mention(i)}). In this way, the sentences and entities are represented interactively in a context-aware way.

Event Types Detection
Since a document can express events of different types, we formulate the task as multi-label classification and leverage the sentence feature matrix S to detect event types. For each event type t, a type-specific query attends over the sentence features and a binary classifier predicts whether type t is expressed:

p_t = sigmoid( W_t^T MultiHead(Q_t, S, S) ),

where Q ∈ R^{d_m×T} and W_t ∈ R^{d_m} are trainable parameters, T denotes the number of possible event types, and MultiHead refers to the standard multi-head attention mechanism with Query/Key/Value. We then derive the event type detection loss L_2 as the binary cross-entropy against the gold labels R ∈ R^T.

* Traditional methods in sentence-level EE also utilize graphs to extract events based on the dependency tree (e.g., Yan et al., 2019). However, our interaction graph is heterogeneous and does not rely on a dependency tree.
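The detection head can be sketched with a single-head dot-product attention standing in for the full multi-head version. All weights and dimensions below are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def detect_event_types(S, Q, W):
    """S: (d, n_sent) sentence features; Q: (d, T) one query per event type;
    W: (T, d) per-type classifiers. Returns T independent probabilities
    (multi-label: a document may express several event types)."""
    d, T = Q.shape
    probs = np.zeros(T)
    for t in range(T):
        attn = softmax(Q[:, t] @ S / np.sqrt(d))  # attention over sentences
        v_t = S @ attn                            # type-specific doc vector
        probs[t] = sigmoid(W[t] @ v_t)            # binary decision for type t
    return probs

rng = np.random.default_rng(1)
S = rng.normal(size=(16, 5))   # 5 sentences, hidden size 16
Q = rng.normal(size=(16, 3))   # 3 candidate event types
W = rng.normal(size=(3, 16))
p = detect_event_types(S, Q, W)
```

Each probability is thresholded independently, so zero, one, or several event types can fire for the same document.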

Event Records Extraction
Since a document is likely to express multiple event records and the number of records cannot be known in advance, we decode records by expanding a tree in order, as in previous methods. However, previous methods treat each record independently. Instead, to incorporate the interdependency among event records, we propose a Tracker module, which improves model performance.
To be self-contained, we introduce ordered tree expansion in this paragraph. In each step, we extract the event records of a specific event type. The argument extraction order is predefined, so that extraction is modeled as a constrained tree expansion task†. Taking Equity Freeze records as an example, as shown in Figure 3, we first extract EquityHolder, followed by FrozeShares and the other roles. Starting from a virtual root node, the tree expands by predicting arguments in a sequential order. As there may exist multiple eligible entities for an argument role, the current node may expand several branches during extraction, with different entities assigned to the current role. This branching operation is formulated as a multi-label classification task. In this way, each path from the root node to a leaf node identifies a unique event record.
Interdependency exists extensively among different event records. For example, as shown in Figure 1, an Equity Underweight event record is closely related to an Equity Overweight event record, and they may share some key arguments or provide useful reasoning information. To take advantage of such interdependency, we propose a novel Tracker module inspired by memory networks (Weston et al., 2015). Intuitively, the Tracker continually tracks the extracted records on-the-fly and stores the information in a global memory. When predicting arguments for the current record, the model queries the global memory and thereby makes use of the interdependency information of other records.
In detail, for the i-th record path consisting of a sequence of entities, the Tracker encodes the corresponding entity representation sequence U_i = [E_{i1}, E_{i2}, ...] into a vector G_i with an LSTM (last hidden state) and adds the event type embedding. The compressed record information is then stored in the global memory G, which is shared across different event types as shown in Figure 3. For extraction, given a record path U_i ∈ R^{d_m×(J−1)} with the first J−1 argument roles filled, we predict the J-th role by injecting role-specific information into the entity representations, Ê = E + Role_J, where Role_J is the role embedding for the J-th role. Then we concatenate Ê, the sentence features S, the current entity path U_i, and the global memory G, followed by a transformer, to obtain a new entity feature matrix E′ ∈ R^{d_m×|E|}, which contains global role-specific information for all entity candidates.‡ We treat path expansion as a multi-label classification problem with a binary classifier over E′_i, i.e., predicting whether the i-th entity fills the next argument role for the current record, and expand the path accordingly as shown in Figure 3.

† We simply adopt the argument order used in previous work.
During training, we minimize the following loss:

L_3 = − Σ_{n∈N_D} Σ_t [ y_t^n log p_t^n + (1 − y_t^n) log(1 − p_t^n) ],

where N_D denotes the node set of the event record tree, p_t^n is the predicted probability that the t-th entity is the next argument at node n, and y_t^n is the gold label: y_t^n = 1 if the t-th entity is a valid next argument at node n, and y_t^n = 0 otherwise.
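A much-simplified sketch of one path-expansion step with a global memory: mean-pooling stands in for the LSTM, a plain linear scorer stands in for the transformer, and the threshold and all weights are illustrative, not the trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def expand_paths(paths, E, role_emb, memory, w, thresh=0.5):
    """One expansion step: for each partial record path, score every entity
    as the next role filler and branch on all entities above the threshold.
    paths: list of entity-index lists; E: (n_ent, d) entity features;
    memory: (m, d) compressed finished records (the Tracker's global memory);
    w: (d,) scoring vector. Returns the expanded set of paths."""
    d = E.shape[1]
    mem_summary = memory.mean(axis=0) if len(memory) else np.zeros(d)
    new_paths = []
    for path in paths:
        path_summary = E[path].mean(axis=0) if path else np.zeros(d)
        for i in range(E.shape[0]):
            # role-aware entity feature, conditioned on own path + memory
            feat = E[i] + role_emb + path_summary + mem_summary
            if sigmoid(w @ feat) > thresh:
                new_paths.append(path + [i])  # branch: entity i fills the role
    return new_paths

rng = np.random.default_rng(2)
E = rng.normal(size=(4, 8))
paths = expand_paths([[]], E, rng.normal(size=8), np.empty((0, 8)),
                     rng.normal(size=8))
```

Because several entities can clear the threshold, one partial path can branch into several, and each root-to-leaf path eventually yields one event record; finished paths are compressed and appended to `memory` for later records to query.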

Training
We sum the losses of the three sub-tasks in Eq. (1), (2) and (3) with respective weights as follows:

L = λ_1 L_1 + λ_2 L_2 + λ_3 L_3.
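With the weights reported in the experimental setting (λ_1 = 0.05, λ_2 = λ_3 = 1), the combined objective is simply a weighted sum:

```python
def total_loss(l1, l2, l3, lam1=0.05, lam2=1.0, lam3=1.0):
    """Weighted sum of the entity-extraction (l1), event-type detection (l2),
    and event-record extraction (l3) losses."""
    return lam1 * l1 + lam2 * l2 + lam3 * l3

# e.g. sub-task losses 2.0, 1.0, 3.0 -> 0.05*2.0 + 1.0 + 3.0 = 4.1
total = total_loss(2.0, 1.0, 3.0)
```

The small λ_1 reflects that entity extraction is the easiest sub-task and should not dominate the gradient.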

Experiments Setting
In our implementation of GIT, we use 8 and 4 Transformer (Vaswani et al., 2017) layers in the encoding and decoding modules, respectively. The dimensions of the hidden layers and feed-forward layers are the same as in previous work, i.e., 768 and 1,024. We use L = 3 GCN layers, and set the dropout rate to 0.1 and the batch size to 64. GIT is trained with the Adam optimizer (Kingma and Ba, 2015) with a 1e-4 learning rate for 100 epochs. We set λ_1 = 0.05 and λ_2 = λ_3 = 1 for the loss function. All three sub-tasks of document-level EE are evaluated by F1 score. Due to limited space, we leave the results of entity extraction and event type detection to Appendix B; GIT only slightly outperforms Doc2EDAG on these two sub-tasks, because we mainly focus on event record extraction and our methods for the other two sub-tasks are similar to Doc2EDAG. In the following, we mainly report and analyze the results of event record extraction.

Main Results
Overall performance. The overall results on the document-level EE dataset are shown in Table 1. GIT consistently outperforms the other baselines, thanks to better modeling of global interactions and interdependency. Specifically, GIT improves over the previous state-of-the-art, Doc2EDAG, by 2.8 micro F1, with a notable 4.5 improvement on the Equity Underweight (EU) event type.

Cross-sentence records scenario. More than 99.5% of the records in the test set are cross-sentence event records, and extraction becomes gradually more difficult as the number of involved sentences grows. To verify the effectiveness of GIT in capturing cross-sentence information, we first compute, for each document, the average number of sentences its records involve, and sort the documents in ascending order. Then we divide them into four sets I/II/III/IV of equal size. Documents in Set IV are considered the most challenging, as they require the largest number of sentences to successfully extract records. As Table 2 shows, GIT consistently outperforms Doc2EDAG, especially on the most challenging Set IV, by 3.7 F1. This suggests that GIT can capture global context well and mitigate the argument-scattering challenge with the help of the heterogeneous graph interaction network.
Multiple records scenario. GIT introduces the Tracker to make use of the global interdependency among event records, which is important in the multiple-records scenario. To illustrate its effectiveness, we divide the test set into a single-record set (S.) containing documents with one record, and a multi-record set (M.) containing those with multiple records. As shown in Table 3, the F1 score on M. is much lower than that on S., indicating that it is challenging to extract multiple records. Nevertheless, GIT surpasses the other strong baselines by 4.9 ∼ 35.3 F1 on the multi-record set (M.). This is because GIT is aware of other records through the Tracker module and leverages the interdependency information to improve performance.¶

Table 5: Performance of GIT in the ablation study of the Tracker module. Removing the Tracker (GIT-NT) causes a larger F1 decrease on M. than on S.. S.: single-record set, M.: multi-record set.

¶ Nguyen et al. (2016) maintain three binary matrices to memorize entity and event states. Although they target sentence-level EE, which involves fewer entities and event records, it would be interesting to compare with them, and we leave this as future work.

Analysis
We conduct further experiments to analyze the key modules in GIT more deeply.
On the effect of the heterogeneous graph interaction network. The heterogeneous graph we construct contains four types of edges. To explore their functions, we remove one type of edge at a time, and finally remove the whole graph network. Results are shown in Table 4, including micro F1 and F1 on the four sets divided by the number of sentences involved in the records, as before. The micro F1 decreases by 1.0 ∼ 1.4 without a certain type of edge. Moreover, removing the whole graph causes a significant drop of 2.0 F1, and of 2.5 on Set IV, which requires the largest number of sentences to extract an event record. This demonstrates that the graph interaction network improves performance, especially on records involving many sentences, and that all kinds of edges play an important role in extraction.
On the effect of the Tracker module. GIT can leverage the interdependency among records based on the information of other event records tracked by the Tracker. To explore its effect, we first remove the global interdependency information between records of different event types, by clearing the global memory whenever we start extracting events for another new event type (GIT-Own Type). Next, we remove all tracking information except a record's own path, to explore whether the tracking of other records indeed has an effect (GIT-Own Path). Finally, we remove the whole Tracker module (GIT-No Tracker). As Table 5 shows, the F1 of GIT-OT/GIT-OP decreases by 0.5/1.2, suggesting that the interdependency among records of both the same and different event types plays an essential role. Besides, their F1 decreases by 0.7/1.5 on M. and by 0.8/1.0 on S., verifying the effectiveness of the Tracker in multi-event scenarios. Moreover, the performance of GIT-OP is similar to that of GIT-NT, which also provides evidence that other records do help. We also report F1 on documents with different numbers of records in Figure 4. The gap between models with and without the Tracker grows as the number of records increases, which validates the effectiveness of our Tracker.

Figure 5 (case-study document, excerpt): The shareholder of the company, Quanlie Chen, pledged 52.4 million to GDZQ Co., Ltd. in 2018, and supplemented the pledge recently because of the decline of the share price. ... [7] Since the borrowings have been paid off, Quanlie Chen completed the pledge cancellation procedures of the 35.5 million that were pledged to GTJA Co., Ltd. on Nov 7, 2018.

Related Work
Sentence-level Event Extraction. Previous approaches mainly focus on sentence-level event extraction. Chen et al. (2015) propose a neural pipeline model that first identifies triggers and then extracts argument roles. Nguyen et al. (2016) use a joint model to extract triggers and argument roles simultaneously. Some studies also utilize dependency tree information (e.g., Yan et al., 2019). To exploit more knowledge, some studies leverage document context, pre-trained language models (Yang et al., 2019), and explicit external knowledge (Liu et al., 2019a; Tong et al., 2020) such as WordNet (Miller, 1995). Du and Cardie (2020b) also try to extract events in a question-answering manner. These studies usually conduct experiments on the sentence-level event extraction dataset ACE05 (Walker et al., 2006). However, it is hard for sentence-level models to extract multiple qualified events spanning multiple sentences, which is more common in real-world scenarios.
Document-level Event Extraction. Document-level EE has attracted more and more attention recently. Yang and Mitchell (2016) use well-defined features to handle event-argument relations across sentences, which is, unfortunately, quite non-trivial. Other work extracts events from a central sentence and finds the remaining arguments in neighboring sentences separately. Although Zheng et al. (2019) use Transformer to fuse sentence and entity information, the interdependency among events is neglected. Du and Cardie (2020a) try to encode the sentences in a multi-granularity way, and Du et al. (2020) leverage a seq2seq model. They conduct experiments on the MUC-4 (Sundheim, 1992) dataset with 1,700 documents and 5 kinds of entity-based arguments, formulated as a table-filling task that copes with a single event record of a single event type. Our work differs from these studies in that a) we utilize a heterogeneous graph to model the global interactions among sentences and mentions to capture cross-sentence context, and b) we leverage the global interdependency through the Tracker to extract multiple event records of multiple event types.

Conclusion
Although promising in practical applications, document-level EE still faces challenges such as the argument-scattering phenomenon and multiple correlated events expressed in a single document. To tackle these challenges, we introduce the Heterogeneous Graph-based Interaction Model with a Tracker (GIT). GIT uses a heterogeneous graph interaction network to model global interactions among sentences and entity mentions. GIT also uses a Tracker to track the extracted records and thereby consider global interdependency during extraction. Experiments on a large-scale public dataset (Zheng et al., 2019) show that GIT outperforms the previous state-of-the-art by 2.8 F1. Further analysis verifies the effectiveness of GIT, especially for cross-sentence event extraction and multi-event scenarios.

B Additional Results

Table 7: Results of the entity extraction sub-task on the test set. The performance of the different models is similar, because they all use the same structure and method to extract entities.

In this appendix, we illustrate the results of entity extraction in Table 7 and event type detection in Table 8. Moreover, the comprehensive results of event record extraction are shown in Table 10, including precision, recall and F1 scores.

C Complete Document for the Examples
We show an example document in Figure 1 in the paper. To better illustrate, we translate it from Chinese into English and make some simplifications.
Here we present the original complete document example in Figure 7. For the specific meanings of the argument roles, we refer readers to the original dataset description.
We also demonstrate a case study in Figure 5 in the paper. Here we also show its original Chinese version in Figure 6.

Figure 7: The original complete document corresponding to the running example in Figure 1. Sentences in red are presented in Figure 1.