GRIT: Generative Role-filler Transformers for Document-level Event Entity Extraction

We revisit the classic problem of document-level role-filler entity extraction (REE) for template filling. We argue that sentence-level approaches are ill-suited to the task and introduce a generative transformer-based encoder-decoder framework (GRIT) that is designed to model context at the document level: it can make extraction decisions across sentence boundaries; is implicitly aware of noun phrase coreference structure; and has the capacity to respect cross-role dependencies in the template structure. We evaluate our approach on the MUC-4 dataset, and show that our model performs substantially better than prior work. We also show that our modeling choices contribute to model performance, e.g., by implicitly capturing linguistic knowledge such as recognizing coreferent entity mentions.


Introduction
Document-level template filling (Sundheim, 1991, 1993; Grishman and Sundheim, 1996) is a classic problem in information extraction (IE) and NLP (Jurafsky and Martin, 2014). It is of great importance for automating many real-world tasks, such as event extraction from newswire (Sundheim, 1991). The complete task is generally tackled in two steps. The first step detects events in the article and assigns a template to each of them (template recognition); the second step performs role-filler entity extraction (REE) to fill in the templates. In this work we focus on the role-filler entity extraction (REE) sub-task of template filling (Figure 1).1 The input text describes a bombing event; the goal is to identify the entities that fill any of the roles associated with the event (e.g., the perpetrator, their organization, the weapon) by extracting a descriptive "mention" of each one, a string from the document.
In contrast to sentence-level event extraction (see, e.g., the ACE evaluation (Linguistic Data Consortium, 2005)), document-level REE introduces several complications. First, role-filler entities must be extracted even if they never appear in the same sentence as an event trigger. In Figure 1, for example, the WEAPON and the first mention of the telephone company building (TARGET) appear in a sentence that does not explicitly mention the explosion of the bomb. In addition, REE is ultimately an entity-based task: exactly one descriptive mention for each role-filler should be extracted even when the entity is referenced multiple times in connection with the event. The final output for the bombing example should, therefore, include just one of the "water pipes" references, and one of the three alternative descriptions of the PERPIND and the second TARGET, the telephone company building. As a result of these complications, end-to-end sentence-level event extraction models (Chen et al., 2015; Lample et al., 2016), which dominate the literature, are ill-suited for the REE task, which calls for models that encode information and track entities across a longer context.
Fortunately, neural models for event extraction that have the ability to model longer contexts have been developed. Du and Cardie (2020), for example, extend standard contextualized representations (Devlin et al., 2019) to produce a document-level sequence tagging model for event argument extraction. These approaches show improvements over sentence-level models on event extraction. Regrettably, these approaches (as well as most sentence-level methods) handle each candidate role-filler prediction in isolation. Consequently, they cannot easily model the coreference structure required to limit spurious role-filler mention extractions. Nor can they easily exploit semantic dependencies between closely related roles like the PERPIND and the PERPORG, which can share a portion of the same entity span. "Shining Path members", for instance, describes the PERPIND in Figure 1, and its sub-phrase, "Shining Path", describes the associated PERPORG.
Contributions In this work we revisit the classic but recently under-studied problem of document-level role-filler entity extraction and introduce a novel end-to-end generative transformer model, the "Generative Role-filler Transformer" (GRIT) (Figure 2).
• Designed to model context at the document level, GRIT (1) has the ability to make extraction decisions across sentence boundaries; (2) is implicitly aware of noun phrase coreference structure; and (3) has the capacity to respect cross-role dependencies. More specifically, GRIT is built upon the pre-trained transformer model (BERT): we add a pointer selection module in the decoder to permit access to the entire input document, and a generative head to model document-level extraction decisions. In spite of the added extraction capability, GRIT requires no additional parameters beyond those in the pre-trained BERT.
• To measure the model's ability to both extract entities for each role, and implicitly recognize coreferent relations between entity mentions, we design a metric (CEAF-REE) based on a maximum bipartite matching algorithm, drawing insights from the CEAF (Luo, 2005) coreference resolution measure.
• We evaluate GRIT on the MUC-4 (1992) REE task (Section 3). Empirically, our model substantially outperforms strong baseline models. We also demonstrate that GRIT is better than existing document-level event extraction approaches at capturing linguistic properties critical for the task, including coreference between entity mentions and cross-role extraction dependencies. 2

Related Work
Sentence-level Event Extraction Most work in event extraction has focused on the ACE sentence-level event task (Walker et al., 2006), which requires the detection of an event trigger and extraction of its arguments from within a single sentence. Previous state-of-the-art methods include Li et al. (2013) and Li et al. (2015), which explored a variety of hand-designed features. More recently, neural network based models such as recurrent neural networks (Nguyen et al., 2016; Feng et al., 2018), convolutional neural networks (Nguyen and Grishman, 2015; Chen et al., 2015) and attention mechanisms (Liu et al., 2017, 2018) have also been shown to improve performance. Beyond the task-specific features learned by deep neural models, Zhang et al. (2019) and others also utilize pre-trained contextualized representations. Only a few models have gone beyond individual sentences to make decisions. Ji and Grishman (2008) and Liao and Grishman (2010) utilize event type co-occurrence patterns to propagate event classification decisions. Yang and Mitchell (2016) propose to learn within-event (sentence) structures for jointly extracting events and entities within a document context. Similarly, from a methodological perspective, our GRIT model also learns structured information, but it learns the dependencies between role-filler entity mentions and between different roles. Duan et al. (2017) and Zhao et al. (2018) leverage document embeddings as additional features to aid event detection. Although the approaches above make decisions with cross-sentence information, their extractions are still done at the sentence level.
Document-level IE Document-level event role-filler mention extraction has been explored in recent work, using hand-designed features for both local and additional context (Patwardhan and Riloff, 2009; Huang and Riloff, 2011, 2012), and with end-to-end sequence-tagging models built on contextualized pre-trained representations (Du and Cardie, 2020). These efforts are the most related to our work. The key difference is that our work focuses on a more challenging, and more realistic, setting: extracting role-filler entities rather than lists of role-filler mentions that are not grouped according to their associated entity. Also on a related note, Chambers and Jurafsky (2011) study unsupervised induction of event templates. Recently, there has also been increasing interest in cross-sentence/document-level relation extraction (RE). In the scientific domain, Peng et al. (2017), Wang and Poon (2018), and Jia et al. (2019) study N-ary cross-sentence RE using distant supervision annotations. Luan et al. (2018) introduce the SciERC dataset; their model relies on multi-task learning to share representations between entity span extraction and relations. Yao et al. (2019) construct an RE dataset of cross-sentence relations on Wikipedia paragraphs. Ebner et al. (2020) introduce the RAMS dataset for multi-sentence argument mention linking, while we focus on entity-level extraction in our work. Different from work on joint modeling (Miwa and Bansal, 2016) and multi-task learning settings for extracting entities and relations, through the generative modeling setup our GRIT model implicitly captures (non-)coreference relations between noun phrases, without relying on cross-sentence coreference or relation annotations during training.
Neural Generative Models with a Shared Module for Encoder and Decoder Our GRIT model uses one shared transformer module for both the encoder and decoder, which is simple and effective. For the machine translation task, He et al. (2018) propose a model that shares the parameters of each layer between the encoder and decoder to regularize and coordinate learning. Dong et al. (2019) present a unified pre-trained language model that can be fine-tuned for both NLU and NLG tasks. Similar to our work, they also introduce different masking strategies for different kinds of tasks (see Section 5).

The Role-filler Entity Extraction Task and Evaluation Metric
We base the REE task on the original MUC-3 formulation (Sundheim, 1991), but simplify it as done in prior research (Huang and Riloff, 2012; Du and Cardie, 2020). First, we assume that one generic template should be produced for each document: for documents that recount more than one event, the extracted role-filler entities for each are merged into a single event template. Second, we focus on entity-based roles with string-based fillers.
• Each event consists of the set of roles that describe it (shown in Figure 1). The MUC-4 dataset that we use consists of ∼1k terrorism events.
• Each role is filled with one or more entities.
• Each role-filler entity is denoted by a single descriptive mention, a span of text from the input document. Because multiple such mentions for each entity may appear in the input, the gold-standard template lists all alternatives (shown in Figure 1), but systems are required to produce just one.
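The task structure above can be sketched as a small data structure. This is a hedged illustration, not the MUC-4 distribution format: the role names follow MUC-4, but the mention strings and the helper function are our own.

```python
# Hypothetical representation of a gold template: each role maps to a list
# of entities, and each gold entity lists all acceptable alternative
# mentions; a system must produce exactly one mention per entity.
gold_template = {
    "PerpInd": [["two Shining Path members", "Shining Path members"]],
    "Target": [["Pilmai telephone company building", "telephone company offices"],
               ["water pipes"]],
}

def is_credited(role, predicted_mention, gold):
    # A predicted mention is credited if it matches any alternative
    # mention of some gold entity for that role.
    return any(predicted_mention in entity for entity in gold.get(role, []))
```

Under this representation, either "Pilmai telephone company building" or "telephone company offices" would be credited for the first TARGET entity, but producing both counts as two predicted entities.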

Evaluation Metric
The metric used in past work on document-level role-filler mention extraction (Patwardhan and Riloff, 2009; Huang and Riloff, 2011; Du and Cardie, 2020) calculates mention-level precision across all alternative mentions for each role-filler entity. It is thus not suited to our problem setting, where entity-level precision is needed and spurious entity extractions are penalized (e.g., recognizing "telephone company building" and "telephone company offices" as two entities results in lower precision).
Drawing insights from the entity-based CEAF metric (Luo, 2005) from the coreference resolution literature, we design a metric (CEAF-REE) for measuring model performance on this document-level role-filler entity extraction task. It is based on the maximum bipartite matching algorithm (Kuhn, 1955; Munkres, 1957). The general idea is that, for each role, the metric is computed by aligning gold and predicted entities, with the constraint that a predicted (gold) entity is aligned with at most one gold (predicted) entity. Thus, a system that fails to recognize coreferent mentions and extracts them as separate entities is penalized in precision. For the example in Figure 1, if the system extracts "Pilmai telephone company building" and "telephone company offices" as two distinct TARGETs, precision will drop. We include more details on our CEAF-REE metric in the appendix.
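The matching idea can be sketched per role as follows. This is a toy simplification under our own assumptions, not the official scorer: each gold entity is a set of acceptable alternative mentions, each predicted entity is a single mention, and we brute-force the maximum one-to-one alignment (a real implementation would use the Hungarian algorithm).

```python
from itertools import permutations

def ceaf_ree_role(preds, golds):
    """Toy per-role CEAF-REE sketch.  `preds`: list of predicted mentions
    (one per predicted entity).  `golds`: list of sets of acceptable
    alternative mentions (one set per gold entity).  Gold and predicted
    entities are aligned one-to-one, maximizing the number of matches."""
    n = max(len(preds), len(golds), 1)
    best = 0
    # Brute-force maximum bipartite matching: assign each gold entity a
    # distinct slot (a pred index, or a dummy index >= len(preds)).
    for perm in permutations(range(n), len(golds)):
        best = max(best, sum(1 for gi, pi in enumerate(perm)
                             if pi < len(preds) and preds[pi] in golds[gi]))
    p = best / len(preds) if preds else 0.0
    r = best / len(golds) if golds else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

On the Figure 1 example, extracting both "telephone company offices" and "Pilmai telephone company building" (two mentions of one gold entity) aligns only one prediction, so precision drops to 0.5.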

REE as Sequence Generation
We treat document-level REE as a sequence-to-sequence task (Sutskever et al., 2014) in order to better model the cross-role dependencies and cross-sentence noun phrase coreference structure. We first transform the task definition into a source and target sequence.
As shown in Figure 2, the source sequence simply consists of the tokens of the original document, prepended with a "classification" token (i.e., [CLS] in BERT) and appended with a separator token (i.e., [SEP] in BERT). The target sequence is the concatenation of the target extractions for each role, separated by the separator token. For role k, the target extraction consists of the beginning (b) and end (e) tokens of the first mention of each role-filler entity:

  y^(k) = b^(k)_1, e^(k)_1, ..., b^(k)_{n_k}, e^(k)_{n_k}

Note that we list the roles in a fixed order for all examples. So for the example used in Figure 2, b^(1)_1 and e^(1)_1 would be "two" and "men" respectively, and b^(3)_1 and e^(3)_1 would be "water" and "pipes" respectively. Henceforth, we denote the resulting sequence of source tokens as x_0, x_1, ..., x_m and the sequence of target tokens as y_0, y_1, ..., y_n.
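The linearization can be sketched in a few lines. The role order follows the MUC-4 roles used in this paper; the exact tokenization details are illustrative assumptions.

```python
# Fixed role order (MUC-4); each role contributes the begin/end tokens of
# every extracted entity's first mention, then a [SEP] closing the role.
ROLE_ORDER = ["PerpInd", "PerpOrg", "Target", "Victim", "Weapon"]

def linearize(extractions):
    """extractions: role -> list of (begin_token, end_token) pairs.
    Returns the flat target token sequence."""
    target = []
    for role in ROLE_ORDER:
        for b, e in extractions.get(role, []):
            target += [b, e]
        target.append("[SEP]")  # one separator closes each role
    return target
```

Note that an empty role still emits its [SEP], so the fifth [SEP] always marks the end of the last role's extractions.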

Model: Generative Role-filler Transformer (GRIT)
Our model is shown in Figure 2. It consists of two parts: the encoder (left) for the source tokens; and the decoder (right) for the target tokens. Instead of using a sequence-to-sequence learning architecture with separate modules (Sutskever et al., 2014;Bahdanau et al., 2015), we use a single pretrained transformer model (Devlin et al., 2019) for both parts, and introduce no additional fine-tuned parameters.
Pointer Embeddings The first change to the model is to ensure that the decoder is aware of where its previous predictions come from in the source document, an approach we call "pointer embeddings". Similar to BERT, the input to the model consists of the sum of token, position, and segment embeddings. However, for the position embedding we use the corresponding source token's position. For example, the target token "two" has the same position embedding as the word "two" in the source document. Interestingly, we do not use any explicit target position embeddings, but instead separate each role with a [SEP] token. Empirically, we find that the model is able to use these separators to learn which role to fill and which mentions have filled previous roles. Our encoder's embedding layer is the standard BERT embedding layer, applied to the source document tokens. To denote the boundary between source and target tokens, we use sequence A (first sequence) segment embeddings for the source tokens and sequence B (second sequence) segment embeddings for the target tokens. We pass the source document tokens through the encoder's embedding layer to obtain their embeddings x_0, x_1, ..., x_m, and the target tokens y_0, y_1, ..., y_n through the decoder's embedding layer to obtain their embeddings y_0, y_1, ..., y_n.
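A minimal sketch of the pointer-embedding scheme, with plain lookup tables standing in for BERT's learned embedding matrices (the function names and table shapes are our assumptions):

```python
def embed(token_id, position, segment, tok_emb, pos_emb, seg_emb):
    # BERT-style input embedding: sum of token, position, and segment rows.
    return [t + p + s for t, p, s in
            zip(tok_emb[token_id], pos_emb[position], seg_emb[segment])]

def encoder_embedding(token_id, position, tables):
    # Source tokens use their own position and segment A (0).
    return embed(token_id, position, 0, *tables)

def decoder_embedding(token_id, src_position, tables):
    # Target tokens reuse the position of the source token they point to
    # (the "pointer embedding") and segment B (1); no separate target-side
    # position embedding is used.
    return embed(token_id, src_position, 1, *tables)
```

The only difference between the two sides is thus the position index fed in and the segment id, which matches the claim that no new parameters are introduced.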

BERT as Encoder / Decoder
We utilize a single BERT model for both the encoder and the decoder. To distinguish encoder and decoder representations, we apply a partial causal attention mask on the decoder side.
In Figure 3, we provide an illustration of the attention mask, a 2-dimensional matrix denoted M. For the source tokens, the mask allows full self-attention over the source but masks out all target tokens: for i ∈ {0, 1, ..., m},

  M_ij = 1 if j ≤ m, and 0 otherwise.

For the target tokens, to guarantee that the decoder is autoregressive (the current token should not attend to future tokens), we use a causal masking strategy. Treating the target tokens as concatenated to the source tokens (the joint sequence mentioned below), for i ∈ {m + 1, ..., m + 1 + n},

  M_ij = 1 if j ≤ i, and 0 otherwise.

The joint sequence of source token embeddings (x_0, x_1, ..., x_m) and target token embeddings (y_0, y_1, ..., y_n) is passed through BERT to obtain their contextualized representations.

Pointer Decoding For the final layer, we replace word prediction with a simple pointer selection mechanism. For target time step t (0 ≤ t ≤ n), we first calculate the dot product between ŷ_t and each contextualized source representation x̂_0, x̂_1, ..., x̂_m:

  z_0, z_1, ..., z_m = ŷ_t · x̂_0, ŷ_t · x̂_1, ..., ŷ_t · x̂_m

Then we apply a softmax to z_0, z_1, ..., z_m to obtain the probability of pointing to each source token:

  p_0, p_1, ..., p_m = softmax(z_0, z_1, ..., z_m)

Test prediction is done with greedy decoding. At each time step t, argmax is applied to find the source token with the highest probability. The predicted token is added to the target sequence for the next time step t + 1, with its pointer embedding. We stop decoding when the fifth [SEP] token is predicted, which marks the end of extractions for the last role.
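The mask construction and the pointer step can be sketched as follows. This is a pure-Python illustration under our own indexing assumptions (source positions 0..m, target positions m+1..m+1+n), not the training implementation:

```python
import math

def build_mask(m, n):
    """mask[i][j] = 1 iff position i may attend to position j, over the
    joint sequence of m+1 source tokens and n+1 target tokens."""
    size = m + 1 + n + 1
    mask = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i <= m:
                mask[i][j] = 1 if j <= m else 0   # source: attend to source only
            else:
                mask[i][j] = 1 if j <= i else 0   # target: causal attention
    return mask

def pointer_distribution(y_t, xs):
    # Dot-product scores against every source representation, then softmax.
    z = [sum(a * b for a, b in zip(y_t, x)) for x in xs]
    mx = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - mx) for v in z]
    total = sum(exps)
    return [v / total for v in exps]
```

Greedy decoding then just takes the argmax of `pointer_distribution` at each step and feeds the chosen source token (with its pointer embedding) back in.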
In addition, we add the following decoding constraints:
• Down-weight the probability of generating [SEP]. This encourages the model to point to other source tokens and thus extract more entities for each role, which helps increase recall. (We set the down-weighting hyperparameter to 0.01, i.e., for the [SEP] token p = 0.01 * p.)
• Ensure that the token position increases from start token to end token. When decoding tokens for each role, we know that mention spans must obey this property, so we eliminate invalid choices during decoding.
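Both constraints amount to masking or rescaling the pointer distribution before the argmax. A minimal sketch, assuming a flat list of probabilities over source positions plus a [SEP] slot (the function and argument names are ours):

```python
def constrained_step(probs, sep_index, start_pos=None, sep_weight=0.01):
    """One greedy pointer-decoding step with the two constraints.
    0.01 is the [SEP] down-weighting value given in the text.  When
    decoding an end token, `start_pos` is the span's start position, and
    earlier source positions are ruled out (end >= start)."""
    probs = list(probs)
    probs[sep_index] *= sep_weight          # constraint 1: discourage [SEP]
    if start_pos is not None:
        for j in range(min(start_pos, len(probs))):
            if j != sep_index:
                probs[j] = 0.0              # constraint 2: end >= start
    return max(range(len(probs)), key=probs.__getitem__)
```

Because the probabilities are only compared by argmax here, no renormalization is needed after the rescaling.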

Experimental Setup
We conduct evaluations on the MUC-4 dataset (1992), and compare to recent competitive end-to-end IE models (Wadden et al., 2019; Du and Cardie, 2020) (Section 7). Beyond the standard evaluation, we are also interested in how well our GRIT model captures coreference knowledge compared with prior models; in Section 8, we present targeted evaluations on subsets of the test documents. We use CEAF-REE, described in Section 3, as the evaluation metric. Results are reported as precision (P), recall (R), and F-measure (F1), micro-averaged over all event roles (Table 4). We also report per-role results for a fine-grained understanding of the numbers (Table 2).

Baselines We compare to recent strong models for (document-level) information/event extraction. CohesionExtract (Huang and Riloff, 2012) is a bottom-up approach for event extraction that first aggressively identifies candidate role-fillers and then prunes the candidates located in event-irrelevant sentences. 5 Du and Cardie (2020) propose neural sequence tagging (NST) models with contextualized representations for document-level role-filler mention extraction. We train this model with the BIO tagging scheme to identify the first mention for each role-filler entity and its type (i.e., B-PerpInd, I-PerpInd for perpetrator individual). DYGIE++ (Wadden et al., 2019) is a span-enumeration-based extraction model for entity, relation, and event extraction. The model (1) enumerates all possible spans in the document; (2) concatenates the representations of each span's beginning and end tokens, uses the result as the span representation, and passes it through a classifier layer to predict whether the span represents a role-filler entity and what the role is. Both NST and DYGIE++ are end-to-end models that fine-tune BERT (Devlin et al., 2019) contextualized representations on task-specific data. We train them to identify the first mention for each role-filler entity (to ensure a fair comparison with our proposed model). Unsupervised event schema induction approaches (Chambers and Jurafsky, 2011; Chambers, 2013; Cheung et al., 2013) are not directly comparable in our supervised setting.
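The BIO scheme used to train the NST baseline can be sketched as follows; the helper and its token-index convention are illustrative assumptions, not the baseline's code.

```python
def bio_tags(tokens, span, role):
    # Tag the first mention of a role-filler entity with B-<role>/I-<role>;
    # all other tokens get O.  `span` holds inclusive token indices.
    start, end = span
    tags = ["O"] * len(tokens)
    tags[start] = f"B-{role}"
    for i in range(start + 1, end + 1):
        tags[i] = f"I-{role}"
    return tags
```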

Results
In Table 4, we report the micro-average performance on the test set. We observe that our GRIT model substantially outperforms the baseline extraction models in precision and F1, with an over 5% improvement in precision over DYGIE++. Table 2 compares the models' performance on each role (PERPIND, PERPORG, TARGET, VICTIM, WEAPON). We see that (1) our model achieves the best precision across the roles; (2) for roles whose entities contain more human names (e.g., PERPIND and VICTIM), our model substantially outperforms the baselines; (3) for the role PERPORG, our model scores better precision but lower recall than neural sequence tagging, which results in a slightly better F1 score; (4) for the roles TARGET and WEAPON, our model is more conservative (lower recall) and achieves lower F1. One possibility is that for a role like TARGET, on average there are more entities (though with only one mention each), and it is harder for our model to decode as many TARGET entities correctly in a generative way.

Table 4: Micro-average results (the highest number in each column is boldfaced). Significance is indicated with ** (p < 0.01), * (p < 0.1); all tests are computed using the paired bootstrap procedure (Berg-Kirkpatrick et al., 2012).

Discussion
How well do the models capture coreference relations between mentions? We conduct targeted evaluations on subsets of test documents whose gold extractions come with coreferent mentions. From left to right in Table 3, we report results on the subsets of documents with an increasing number (k) of possible (coreferent) mentions per role-filler entity. We find that: (1) on the subset of documents with only one mention per role-filler entity (k = 1), our model has no significant advantage over DYGIE++ and the sequence-tagging-based model; (2) as k increases, the advantage of GRIT grows substantially, with an over 10% gap in precision when 1 < k ≤ 1.5, and a near 5% gap in precision when k > 1.5. From the qualitative example (document excerpt and extractions in Figure 4), we also observe that our model recognizes the coreference relation between candidate role-filler entity mentions while the baselines do not, which shows that our model is better at capturing the (non-)coreference relations between role-filler entity mentions and demonstrates the advantage of a generative model in this setting.

How well do models capture dependencies between different roles? To study this phenomenon, we consider nested role-filler entity mentions in the documents. In the example of Figure 1, "Shining Path" is a role-filler entity mention for PERPORG nested in "two Shining Path members" (a role-filler entity mention for PERPIND). Nesting happens more often between closely related roles (e.g., PERPIND and PERPORG): we find that 33 of the 200 test documents' gold extractions contain nested role-filler entity mentions between these two roles.

In Table 5, we present the CEAF-REE scores for the role PERPORG on the subset of documents with nested roles. As we hypothesized, GRIT is able to learn the dependency between different roles and can avoid missing relevant role-filler entities for later roles. The results provide empirical evidence: by learning the dependency between PERPIND and PERPORG, GRIT improves the relative recall score on this subset compared to DYGIE++. On all 200 test documents, our model is ∼2% below DYGIE++ in recall, while on the 33 documents with nested roles it scores much higher than DYGIE++ in recall. For the document in the example of Figure 5, our model correctly extracts the two role-filler entities for PERPORG, "FARC" and "popular liberation army", which are closely related to the PERPIND entity "guerrilla", while DYGIE++ and NST both miss the PERPORG entities.

Decoding Ablation Study
In the table below, we present ablation results for the decoding constraints, illustrating their influence on our model's performance. Both constraints significantly improve model predictions. Without down-weighting the probability of pointing to [SEP], precision increases but recall and F1 drop significantly.

Additional Parameters and Training Cost Finally, we consider the additional parameters and training time of the models. As introduced previously, the baseline models DYGIE++ and NST both require an additional classifier layer on top of BERT's hidden state (of size H) for making predictions, while our GRIT model does not add any new parameters. As for training time, training the DYGIE++ model takes over 10 times longer than NST and our model. This cost comes from DYGIE++'s requirement of enumerating all possible spans (up to a length constraint) in the document and computing the loss with their labels.

Conclusion
We revisit the classic and challenging problem of document-level role-filler entity extraction (REE), and find that there is still room for improvement. We introduce an effective end-to-end transformer-based generative model, which learns a document-level representation and encodes dependencies between role-filler entities and between event roles. It outperforms the baselines on the task and better captures coreference phenomena. In the future, it would be interesting to investigate how to enable the model to also perform template recognition.