Document-level Entity-based Extraction as Template Generation

Document-level entity-based extraction (EE), aiming at extracting entity-centric information such as entity roles and entity relations, is key to automatic knowledge acquisition from text corpora in various domains. Most document-level EE systems build extractive models, which struggle to model long-term dependencies among entities at the document level. To address this issue, we propose a generative framework for two document-level EE tasks: role-filler entity extraction (REE) and relation extraction (RE). We first formulate them as a template generation problem, allowing models to efficiently capture cross-entity dependencies, exploit label semantics, and avoid the exponential computational complexity of identifying N-ary relations. A novel cross-attention guided copy mechanism, TopK Copy, is incorporated into a pre-trained sequence-to-sequence model to enhance its capability to identify key information in the input document. Experiments on the MUC-4 and SciREX datasets show new state-of-the-art results on REE (+3.26%), binary RE (+4.8%), and 4-ary RE (+2.7%) in F1 score.


Introduction
Document-level entity-based extraction (EE) refers to tasks that extract entity-centric information, such as entities and their relations, from unstructured text spanning multiple sentences. With the rise of big data in recent years, document-level EE is growing in importance, with applications such as understanding clinical reports (Nye et al., 2020), extracting document-level events, and building knowledge graphs from journals. In this work, we focus on two classic document-level EE tasks: role-filler entity extraction (REE) and relation extraction (RE). The source code is publicly available at https://github.com/PlusLabNLP/TempGen.

Figure 1: A comparison between our approach and a competitive extractive system, SCIREX-P (Jain et al., 2020), on a relation extraction example from SCIREX. The task is to extract entities and identify which entities are related in the given scientific article. Due to the long distances between entities, SCIREX-P struggles to extract the right entity pair, while our approach correctly identifies it. This reflects our method's advantage in modeling long-term cross-entity dependencies.

Recent works on document-level EE usually build
task-specific classifiers on top of large pre-trained language models. For example, Du and Cardie (2020a) build a sequence-tagging framework with multi-granularity representations based on BERT (Devlin et al., 2019) for role-filler entity extraction, and Jain et al. (2020) build a relation extraction pipeline upon SCIBERT (Beltagy et al., 2019). However, this model architecture has a few drawbacks. First, as the size of the document increases, it becomes increasingly difficult for extractive methods to capture cross-entity dependencies due to the long distances between entities, as shown in Figure 1. Additionally, discriminative models have no information about the semantics of the labels when classifying relations or entity types, and are thus unable to take advantage of the label semantics embedded in pre-trained encoders.
Motivated by these challenges, we propose to formulate the REE and RE tasks as template generation. Due to the autoregressive nature of the generative setup, this formulation makes dependencies among the output entities easier to capture than sequence tagging methods do. Moreover, label names are incorporated into the decoder targets, exploiting label semantics that are absent from the extractive counterparts. Furthermore, for tasks that involve identifying N-ary relations, this formulation significantly alleviates the computational cost of comparing exponentially many combinations of entities. We propose a generative framework, Cross-attention Guided Template Generation (TEMPGEN), which incorporates a novel copy mechanism into a pre-trained sequence-to-sequence model to solve the template generation problem effectively.
Our contributions can be summarized as follows: • We propose to formulate document-level EE tasks as a template generation problem, which allows our generative framework to effectively capture cross-entity dependencies, better identify entities with label semantics, and avoid the exponential computational complexity of identifying N-ary relations.
• We devise a novel copy mechanism based on cross-attention to enable our model to better learn how to copy key information from the input document.
• Our approach achieves state-of-the-art results on the MUC-4 role-filler entity extraction task and the SCIREX relation extraction task, while being more data-efficient than previous systems.

Tasks
This section gives an overview of the two document-level EE tasks we tackled in this work: role-filler entity extraction (REE) and relation extraction (RE).

Role-filler Entity Extraction
The REE task aims to extract all entities involved in events from the input article (Du et al., 2021). It differs from the event template extraction task introduced by the MUC-4 dataset (muc, 1992) in that only one event template, as opposed to many, is output for each input document. For documents associated with multiple event templates (all events in MUC-4 are of the ATTACK type), the event templates are collapsed into one: systems are required to identify, for each role type, all entities associated with the different events. An event template consists of a set of pre-defined roles, and each role is filled with zero or more entities, as shown in Figure 2. An entity is characterized by a group of mentions, which are spans of text in the input document.
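The collapsing step can be pictured as a union of role fillers across templates. Below is a small sketch with hypothetical roles and entities, not the dataset's actual preprocessing code; each entity is a tuple of coreferent mention strings:

```python
from collections import defaultdict

def collapse_templates(templates):
    """Union the role fillers of several event templates into one template.

    Each template maps a role name to a list of entities; each entity is a
    tuple of coreferent mention strings.
    """
    collapsed = defaultdict(list)
    for template in templates:
        for role, entities in template.items():
            for entity in entities:
                if entity not in collapsed[role]:  # de-duplicate shared entities
                    collapsed[role].append(entity)
    return dict(collapsed)

# Two hypothetical ATTACK templates sharing a perpetrator
t1 = {"PerpInd": [("group of terrorists",)], "Victim": [("Alice",)]}
t2 = {"PerpInd": [("group of terrorists",)], "Victim": [("Bob",)]}
merged = collapse_templates([t1, t2])
```

After collapsing, each role lists every entity from any of the document's events, which is exactly the single-template target the REE systems are trained to produce.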

Relation Extraction
We focus on end-to-end document-level relation extraction, where systems first extract entities from the input document and then identify the N-ary non-typed relations among the extracted entities. To our knowledge, the SCIREX (Jain et al., 2020) dataset is the only dataset that supports such an end-to-end configuration. Thus, we follow the definition of document-level RE in SCIREX, which contains binary and 4-ary relation annotations. A binary relation contains two typed entities, and a 4-ary relation contains four typed entities. An entity is represented by a cluster of mentions, similar to the REE task. Systems should first extract salient entities of pre-defined types. Then, binary and 4-ary relations among the salient entities are identified. A binary relation example is shown in Figure 2.
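To make the combinatorial burden concrete: a classifier-based pipeline must score every candidate entity tuple, and the number of such tuples grows as C(n, N) in the number of extracted entities n. A quick illustration with a hypothetical entity count:

```python
from math import comb

def num_candidate_relations(n_entities: int, arity: int) -> int:
    """Number of unordered candidate tuples a classifier-style RE system must score."""
    return comb(n_entities, arity)

# For a SciREX-scale article with, say, 50 candidate salient entities:
binary = num_candidate_relations(50, 2)    # candidate pairs
four_ary = num_candidate_relations(50, 4)  # candidate 4-tuples
```

With 50 entities there are 1,225 candidate pairs but 230,300 candidate 4-tuples, which is why generating only the relations that exist, rather than classifying every tuple, pays off as the arity grows.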

Proposed Methods
In this section, we first illustrate how the REE and RE tasks can be framed as a template generation problem. This formulation then allows us to easily capture cross-entity dependencies with our proposed generative model, a pre-trained sequence-to-sequence model integrated with a copy mechanism.

Template Generation Formulation
We frame the REE and RE tasks as a template generation problem, as shown in Figure 2. A template is composed of slot names and slot values. For both tasks, slot names are entity types, and slot values are all entity mentions corresponding to those entity types. Following previous works on REE (Huang and Riloff, 2011; Du and Cardie, 2020a; Du et al., 2021), a slot sequence S_{i,j} is represented by a slot name and entities:

<SOSN> L <EOSN> <SOE> D^{(e_k)}_1, ..., D^{(e_k)}_n <EOE>,

where L is the slot name (the role in REE and the entity type in RE), and D^{(e_k)}_1, ..., D^{(e_k)}_n is the token sequence corresponding to one mention randomly sampled from entity e_k. Special tokens, such as <SOSN> and <EOSN>, indicate whether a tag-enclosed string is a slot name or an entity mention. In the first row of the REE example from Figure 2, L would be PERPIND and D^{(e_k)}_1, ..., D^{(e_k)}_n would be "group of terrorists". Using this formulation, the scalability challenge of modeling cross-entity dependencies is alleviated because the distances between entities in template sequences are significantly reduced.
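As a concrete sketch, the linearization can be implemented as below. The special-token names follow the formulation above, while the helper name and the fixed random seed are our illustrative choices, not the paper's implementation:

```python
import random

def linearize_template(template, seed=0):
    """Serialize a role -> entities mapping into a template sequence.

    Each entity is a tuple of coreferent mentions; one mention is randomly
    sampled per entity, mirroring the decoding-target construction above.
    """
    rng = random.Random(seed)  # fixed seed here only for reproducibility
    parts = []
    for role, entities in template.items():
        for entity in entities:
            mention = rng.choice(entity)  # one sampled mention per entity
            parts.append(f"<SOSN> {role} <EOSN> <SOE> {mention} <EOE>")
    return " ".join(parts)

seq = linearize_template({"PerpInd": [("group of terrorists", "terrorists")]})
```

The resulting string is the decoder target for one template; at evaluation time the tags are parsed back into (slot name, mention) pairs.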

Cross-attention Guided Template Generation
The template generation problem can be broken down into two sub-goals: (1) generating valid template structures while capturing the dependencies between the input document and the decoder targets, and (2) ensuring that salient mentions in the input document are correctly identified and output by the decoder. To achieve the first sub-goal, we leverage BART (Lewis et al., 2020), a pre-trained sequence-to-sequence model. The second sub-goal is achieved with a novel copy mechanism incorporated into BART.

Seq2Seq Model for Template Generation
BART (Lewis et al., 2020) is a pre-trained language model that combines bidirectional and auto-regressive transformers. Pre-trained with multiple denoising objectives, BART has demonstrated significant advantages in various text generation tasks, especially summarization (Lewis et al., 2020). The template generation problem closely resembles summarization, except that the generated template sequences contain implicit structure. Given its varied denoising pre-training objectives, we believe that BART can capture the implicit structure within template sequences, effectively model the dependencies among predicted entities, and produce rich semantics for reasoning between slot names and entities.
Cross-attention guided copy mechanism To enhance BART's capability to identify salient mentions in the input document, we incorporate a copy mechanism based on cross-attention. Since cross-attention often implies the saliency of input tokens, a naive approach to computing the copy distribution P^t_copy at time step t over the input tokens is to take the mean of the last decoder layer's cross-attention across all heads, as in Xu et al. (2020):

α_{t,h} = softmax((s_t W_s)(e W_e)^T),    P^t_copy = (1/h) Σ_{h'=1}^{h} α_{t,h'},

where α_{t,h} is the attention score over input tokens at decoding step t for head h, W_s and W_e are the projection matrices applied to the decoder and encoder hidden states, respectively, s_t is the decoder hidden state at step t, and e denotes the encoder hidden states.
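Numerically, the copy distribution is just a (possibly restricted) average over per-head cross-attention rows. Below is a minimal numpy sketch with hypothetical attention values; restricting the average to the highest-scoring heads corresponds to the top-k variant discussed below:

```python
import numpy as np

def copy_distribution(cross_attn, head_scores=None, k=None):
    """Average cross-attention over heads to form a copy distribution.

    cross_attn: array of shape (num_heads, src_len) -- one decoding step's
    attention over input tokens, one row per head (each row sums to 1).
    If head_scores and k are given, only the k highest-scoring heads are
    averaged (the top-k variant); otherwise all heads are used (naive).
    """
    if head_scores is not None and k is not None:
        top = np.argsort(head_scores)[-k:]   # indices of the k top-scoring heads
        cross_attn = cross_attn[top]
    p_copy = cross_attn.mean(axis=0)
    return p_copy / p_copy.sum()             # renormalize to a distribution

# Hypothetical cross-attention of 3 heads over 3 input tokens
attn = np.array([[0.7, 0.2, 0.1],   # head 0
                 [0.1, 0.1, 0.8],   # head 1 (noisy)
                 [0.6, 0.3, 0.1]])  # head 2
p_all = copy_distribution(attn)                                # mean of all heads
p_top = copy_distribution(attn, head_scores=[0.9, 0.1, 0.8], k=2)
```

Note how dropping the noisy head sharpens the probability mass on the first token, which is the intuition behind pruning low-importance heads before copying.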
However, recent studies have shown that attention heads are not equally important, and that some heads can be pruned with only a marginal decrease in overall performance (Voita et al., 2019; Michel et al., 2019). We hypothesize that the attention probabilities produced by insignificant attention heads may be noisy; computing copy distributions without these heads could therefore improve the model's ability to infer the importance of each token in the input document. Motivated by this hypothesis, we propose TOPK COPY, a copy mechanism where only the top-k most important attention heads are used for computing copy distributions. Consider the formulation of multi-head attention, following the notation from Vaswani et al. (2017):

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,    head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),

where W_i^Q, W_i^K, and W_i^V are the projection matrices for computing attention, and W^O ∈ R^{h d_v × d_model} is the matrix that allows interaction between different attention heads, with h the number of heads. To determine the importance of each attention head, we first reshape W^O into Ŵ^O ∈ R^{h × d_v × d_model} (Equation (5)), and then sum over the last two dimensions (Equation (6)):

score_i = Σ_{j=1}^{d_v} Σ_{k=1}^{d_model} Ŵ^O_{i,j,k},

where score_i denotes the significance score for head i. We take the attention heads with the top-k highest significance scores in the last cross-attention layer, and use the mean of the attention probabilities output by these heads as the copy distribution (Equations (7) and (8)):

P^t_copy = (1/k) Σ_{h' ∈ TopK} α_{t,h'}.

Objective function. The final probability P_final of a word w_t is a weighted sum of the vocabulary distribution P_vocab computed by BART and the copy distribution P_copy:

P_final(w_t) = p_gen P_vocab(w_t) + (1 − p_gen) P^t_copy(w_t),

where p_gen ∈ [0, 1] is the generation probability, computed by passing the dot product of the mean encoder hidden state ē = (1/n) Σ_{i=1}^{n} e_i and the decoder hidden state s_t at time step t through the sigmoid function σ:

p_gen = σ(ē^T s_t).

Using the final probability distribution P_final, we compute the loss as the average negative log likelihood of the target word y_t over all time steps, following See et al. (2017):

L = −(1/T) Σ_{t=1}^{T} log P_final(y_t).

Experimental Setup

Dataset and Evaluation Metric
Experiments are conducted on two English datasets: MUC-4 (1992) for the role-filler entity extraction task and SciREX (Jain et al., 2020) for the binary and 4-ary end-to-end relation extraction tasks. MUC-4 contains 1700 documents, with on average about 400 tokens per document. Documents are annotated with zero or more event templates. Following Du et al. (2021)'s pre-processing, we use a 13:2:2 split of the documents for train, development, and test, respectively. We evaluate the REE task on this dataset using the entity-level metric CEAF-REE (Du et al., 2021). The metric aligns predicted entities with gold entities using the Kuhn-Munkres algorithm (Kuhn, 1955; Munkres, 1957), where a predicted entity is considered correct if and only if its mentions are a subset of the aligned gold entity's mentions. The SCIREX dataset (footnote 5) consists of scientific articles with entity, coreference, and relation annotations.
With an average token count of about 5700, the articles are significantly longer than the documents in MUC-4. We use the pre-processed data from Jain et al. (2020), which contains 306 documents for training, 66 for validation, and 66 for testing. In contrast to conventional relation extraction datasets such as ACE05, relations are not typed in SCIREX. Hence, the official SCIREX evaluator (Jain et al., 2020) only considers the correctness of predicted entities and entity types (footnote 6) in each relation. Predicted entities are aligned with gold entities based on mention overlap. Once the entities are aligned, predicted relations are aligned with gold relations accordingly. A predicted relation is correct if and only if both the associated entities and the entity types match the aligned gold relation.
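The CEAF-REE alignment step can be sketched with a brute-force assignment in place of the Kuhn-Munkres algorithm, which is fine for the handful of entities per role; the entity sets here are illustrative:

```python
from itertools import permutations

def ceaf_ree_matches(pred_entities, gold_entities):
    """Count correct predictions under the best one-to-one alignment.

    Entities are sets of mention strings. A predicted entity is correct iff
    its mentions are a subset of the aligned gold entity's mentions.
    Brute force over alignments -- practical CEAF implementations use the
    Kuhn-Munkres (Hungarian) algorithm instead.
    """
    n = max(len(pred_entities), len(gold_entities))
    padded_gold = list(gold_entities) + [set()] * (n - len(gold_entities))
    best = 0
    for perm in permutations(range(n), len(pred_entities)):
        correct = sum(
            1 for p, gi in zip(pred_entities, perm)
            if padded_gold[gi] and p <= padded_gold[gi]  # subset criterion
        )
        best = max(best, correct)
    return best

gold = [{"Gerardo Olivos Silva"}, {"terrorists", "assassins"}]
pred = [{"terrorists"}, {"Gerardo Olivos Silva"}]
matched = ceaf_ree_matches(pred, gold)
```

Here both predictions can be aligned to gold entities whose mention sets contain them, so both count as correct; precision, recall, and F1 follow from the matched count.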

Baselines
We compare our method with the following competitive baseline systems.
NST (Du and Cardie, 2020a) builds multi-granularity representations of documents and utilizes a gating mechanism to fuse representations of different granularities.
TANL (Paolini et al., 2021) augments input sentences with sequential labels, allowing it to be applied to various structured prediction tasks (footnote 7).
GRIT (Du et al., 2021) shares transformer parameters between the encoder and the pointer-network decoder, and is the SOTA system for the REE task on the MUC dataset.
Footnote 5: The SCIREX dataset is available at https://github.com/allenai/SciREX.
Footnote 6: There are 4 entity types: MATERIAL, METRIC, TASK, and METHOD.
Footnote 7: Since the source code of TANL had not been released by the time we conducted experiments, we re-implemented it by closely following the method described in Paolini et al. (2021).
DYGIE++ (Wadden et al., 2019) is a span-based multi-task IE framework jointly trained on relation extraction, named entity recognition, and coreference resolution.
SCIREX-P (Jain et al., 2020) is the SOTA framework for end-to-end binary and 4-ary relation extraction on SCIREX. The pipeline is composed of 4 components: mention identification, mention clustering, salient entity cluster identification, and relation classification.
In terms of the pre-trained language models used, BERT-BASE (Devlin et al., 2019) is used for NST, DYGIE++, and GRIT, while SCIREX-P is fine-tuned on SCIBERT (Beltagy et al., 2019). For a fair comparison with our method, we replace TANL's T5 (Raffel et al., 2020) backbone with BART-BASE.

Implementation details
The proposed models are optimized using AdamW (Loshchilov and Hutter, 2019) with a learning rate of 5e-5 and a weight decay of 1e-5. We used grid search to find the best k for TOPK COPY and found that k = 10 yields the best overall performance across REE and RE. The maximum input sequence lengths for RE and REE are 1024 and 512, respectively. At inference time, all generative models use beam search with a beam width of 4. Additionally, when we set the maximum input sequence length to 512 for TEMPGEN for a fairer comparison with SCIREX-P, we obtain F1 scores of 11.94% and 2.18% on binary and 4-ary relation extraction, respectively, which confirms the advantage of our model on the relation extraction tasks.

Main Results
While TANL performs worse than our model on REE, it still achieves a higher score than GRIT. This suggests that augmenting decoding targets with label names provides useful semantics, whereas adding input documents to decoding targets may not yield better results on the REE task. We also observe that TANL scores extremely low on both RE tasks: 58% of the binary relations and 26% of the 4-ary relations in the decoding targets are filtered out because they exceed BART's maximum sequence length. Of the remaining relations, 57% of the binary relations and 78% of the 4-ary relations have at least one entity removed from the decoding targets due to its long distance from the first-appearing entity (see Appendix B for more details), suggesting that TANL's poor performance on the RE tasks stems from the scarcity of gold labels. This reflects that TANL is ill-suited for document-level EE tasks.
We observe extremely low performance across all systems on both SCIREX tasks, even though TEMPGEN outperforms the baseline systems significantly. This is mainly caused by characteristics of the SCIREX dataset. First, syntactic characteristics specific to scientific articles, such as algorithm blocks, result in unusually long sequences despite best parsing efforts. Second, scientific articles frequently convey key information in table and figure captions; since captions are not included in the input text, the number of recoverable relations decreases drastically.

Figure 3: TOPK COPY produces a more reliable copy distribution P_copy than NAIVE COPY in a MUC-4 example. Given an input document and the decoded tokens "<s> <SOT> <SOSN> PERPIND <EOSN> <SOE>", the gold PERPIND entity is "Gerardo Olivos Silva". However, NAIVE COPY assigns the highest copy probability to "terrorists", leading to an incorrect entity being extracted. Conversely, TOPK COPY assigns the highest P_copy to the first token of the gold entity, "Ger", eventually resulting in extraction of the correct entity.

Performance Analysis
Ablation Study We conducted ablation studies by replacing the TOPK COPY module with other copy mechanisms. NAIVE COPY refers to computing copy distributions with the attentions from all cross-attention heads. SAGCopy (Xu et al., 2020) utilizes encoder self-attention to compute centrality scores measuring the saliency of each input token. As shown in Table 2, NAIVE COPY leads to a performance drop on all three tasks, especially on binary and 4-ary relation extraction. That NAIVE COPY achieves scores even lower than fine-tuning BART alone (i.e., w/o TOPK COPY) reflects that copy mechanisms may mislead models into copying incorrect input tokens. A qualitative example of the difference between TOPK COPY and NAIVE COPY is shown in Figure 3.
We also experimented with replacing the original slot names with numeric slot names (i.e., converting PERPIND to <ROLE_1>, PERPORG to <ROLE_2>, etc.). This conversion removes the semantics of slot names from the decoding targets. While little performance drop was observed on the REE task, using numeric slot names resulted in the worst performance on the binary and 4-ary relation extraction tasks, which could be a result of the stronger slot dependencies in RE compared with REE: in RE, slots are directly semantically related to the other slots in each template, whereas slots in REE are relatively independent. This shows that slot-name semantics are useful for template generation tasks with strong slot dependencies within each template. Finally, we conducted ablation studies on different variations of templates as decoding targets. Specifically, three variations are tested on the REE task: (1) We merge entities with the same role name into the same "slot" (e.g., transforming the decoding target from "<SOSN> PerpInd <EOSN> <SOE> Alice <EOE> <SOSN> PerpInd <EOSN> <SOE> Bob <EOE>" to "<SOSN> PerpInd <EOSN> <SOE> Alice; Bob <EOE>").
(2) Based on (1), all slot names, such as "PerpInd" and "PerpOrg", are replaced with the same special token "<ROLE>". (3) We use the same decoding targets as GRIT's. These three settings achieve test-set F1 scores of 56.65, 54.16, and 52.55, respectively. The results suggest that differentiating entities of different entity types helps improve performance. Furthermore, comparing with the results in Table 1, we find that GRIT performs better than our system under setting (3), reflecting that a pointer-network-based model, which has a smaller search space than ours, is more advantageous when using the same decoding targets.
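Variant (1) above is a simple regrouping of the decoding target. A sketch, with the delimiter conventions taken from the example and the function name ours:

```python
def merge_same_role_slots(slots):
    """Merge (role, mention) pairs so each role appears once, mentions joined by '; '."""
    merged = {}
    for role, mention in slots:
        merged.setdefault(role, []).append(mention)
    return " ".join(
        f"<SOSN> {role} <EOSN> <SOE> {'; '.join(mentions)} <EOE>"
        for role, mentions in merged.items()
    )

target = merge_same_role_slots([("PerpInd", "Alice"), ("PerpInd", "Bob")])
```

Variant (2) would additionally map every role name to the single token "<ROLE>" before serialization, removing the label semantics entirely.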

Impact of the Amount of Training Data
To test the data efficiency of our approach, we compared TEMPGEN and TEMPGEN - TOPK COPY with GRIT on the REE task using different amounts of MUC training data. As seen in Figure 5, both TEMPGEN and TEMPGEN - TOPK COPY outperform GRIT across all settings, with a slightly larger performance margin in low-resource settings. This indicates that our approach is more data-efficient than the previous SOTA system on REE.

Impact of K Cross-attention Heads
Figure 4 shows our model's change in performance for various values of K in the TOPK COPY mechanism. Consistent with our results in Section 5.1, we see that removing some of the cross-attention heads (12 → 10) can lead to a performance gain because the noise introduced by unimportant attention heads is filtered out. However, performance drops across all three tasks as soon as K decreases below 10, suggesting that beneficial cross-attention heads are being removed and that only a small portion of the cross-attention heads are unimportant. This trend is consistent with Michel et al. (2019)'s finding that pruning cross-attention heads beyond a certain extent easily hurts performance. Additionally, the model with no copy mechanism (K = 0) outperforms the models with few attention heads (K ∈ {2, 4, 6}), suggesting that copy distributions obtained from insufficiently informative cross-attentions can mislead the model.

Figure 6: An example showing how GRIT misidentifies the VICTIM entities and TARGET entities, likely due to the lack of role-type semantics. Here, VICTIM entities are the people attacked, and TARGET entities are the objects compromised.

Qualitative Analysis
The following qualitative analysis provides intuition for our model's ability to capture dependencies across entities and utilize slot name semantics.
Cross-entity Dependencies To validate our approach's capability to capture cross-entity dependencies, we considered binary relations on SCIREX where at least one of the associated entities is involved in multiple relations. The dependencies among entities are better captured by the model that predicts fewer unlikely relations. Comparing the test set outputs of TEMPGEN and SCIREX-P, we see that 13131 errors made by SCIREX-P are corrected by our model, which only introduces 604 errors. This result demonstrates the strength of TEMPGEN in modeling cross-entity dependencies.

Importance of Label Semantics
Comparing the test-set predictions of TEMPGEN and GRIT on the MUC-4 REE task, we see that our approach better distinguishes confusable entities such as VICTIM and TARGET entities. As shown in the example in Figure 6, GRIT incorrectly predicts the two victims, "Miguel Soler Rodrigues" and "Martha Luz Lopez", as TARGET entities. It also misidentifies "El Espectador", a newspaper company, as a victim of the attack. In contrast, TEMPGEN correctly recognizes the roles of the two victims. Even though its prediction is not an exact match, the predicted TARGET entity has the correct role type and a semantic meaning similar to the gold label.

Inference time comparison
As discussed earlier, TEMPGEN can significantly reduce the exponential computational complexity of document-level N-ary relation identification. To illustrate this, we compared the inference time of TEMPGEN with two other systems, TANL and SCIREX-P, on the SCIREX 4-ary RE task. As shown in Figure 7, TEMPGEN reduces inference time by around 39 times compared to SCIREX-P. TANL also runs much faster than SCIREX-P, but is still around 4 times slower than TEMPGEN. This results from the fact that TANL generates the entire input document in addition to entity and relation labels, which is much longer than TEMPGEN's generated sequences.

Related Works
In the following sections, we first discuss important works on the REE task and the document-level RE task. We then turn to works that apply a similar sequence-generation approach to various document-level IE tasks.

Role-filler Entity Extraction
Document-level REE has been explored in recent works using a variety of model architectures. Du and Cardie (2020b) formulates the task as a sequence tagging problem, and trains layered classifiers as sequence readers on multiple granularities. In contrast, GRIT (Du et al., 2021) formulates the problem as sequence generation, and employs a single transformer layer whose parameters are shared between encoder and decoder to enrich semantics in the shared parameters. A pointer selection network is used for the final layer of decoding.

Document-level Relation Extraction
Due to long-term dependencies that often span hundreds of tokens, capturing entity relations at the document level has proven challenging. One approach constructs a document-level graph from sentence encodings and then extracts entity relations from edge representations in the graph (Christopoulou et al., 2019). Other works, such as Jia et al. (2019), layer classifiers in a pipeline architecture to obtain hierarchical representations of N-ary relations.

IE as Sequence Generation
Recently, an increasing number of works have framed information extraction tasks as sequence generation problems. Zeng et al. (2018) framed triple extraction as a sequence generation task and adopted an RNN-based model with copy mechanisms. To encourage the faithfulness of the extracted triplets, Ye et al. (2021) designed a triplet contrastive training objective. These works focus on sentence-level triplet extraction, while our work extracts role-filler entities and entity relations at the document level. Li et al. (2021) and Hsu et al. (2021) formulate the document-level event argument extraction task as a conditional generation problem conditioned on the event ontology. However, their methods cannot be applied to REE or RE due to the lack of an ontology for role-filler entities and relations. Du et al. (2021) rely on a pointer-network-based decoder (Vinyals et al., 2015) to extract event role-filler entities, with the parameters of BERT (Devlin et al., 2019) shared between the encoder and the decoder. Nevertheless, their method cannot incorporate role labels, whereas our approach can take advantage of label semantics. Paolini et al. (2021) use a very similar generative approach, constructing decoder targets by inserting text markers and labels around entity mentions in the input sentence; the key idea is that augmenting the decoder targets with the original input sentence and labels provides stronger semantics to the model. Unfortunately, modeling cross-entity dependencies remains a challenge because entities are far apart in their decoding targets. We instead transform annotations into template sequences as decoding targets, where the distances between entities are significantly shortened. Thus, our approach alleviates the scalability challenge of capturing cross-entity dependencies at the scale of documents.
Additionally, our approach differs in that our decoder targets are significantly shorter, allowing non-truncated decoder targets to fit within pre-trained language models. In contrast, their gold decoder targets are guaranteed to be longer than the corresponding input document. Since the input length often exceeds the maximum sequence length of pre-trained language models in document-level EE, a large portion of the gold labels would be skipped using Paolini et al. (2021)'s method.

Conclusion
We have proposed TEMPGEN, a framework that frames document-level REE and RE as template generation tasks. A copy mechanism that computes copy distributions from the top-k most important cross-attention heads is incorporated into BART to capture key information in the input document. Experimental results on MUC-4 and SCIREX show that TEMPGEN outperforms prior approaches on role-filler entity extraction and end-to-end document-level relation extraction. Under different amounts of training data, TEMPGEN demonstrates robustness across all settings, while being especially advantageous in lower-resource regimes.

A REE Performance Breakdown
Table 3 shows the per-role performance comparison between TEMPGEN and the other baselines. We observe that:
• TEMPGEN achieves the best precision across all roles.
• Except for PERPIND, TEMPGEN obtains substantial improvements in F1 over the other baselines.
• While TEMPGEN has higher precision over GRIT in extracting PERPIND entities, it scores slightly lower in recall, leading to worse F1 performance.

B TANL Decoding Target Formulation
In this section, we illustrate how we formulate the TANL (Paolini et al., 2021) decoding targets for REE and RE. The formulation for REE is straightforward due to its similarity to the NER task: we produce REE decoding targets exactly as NER decoding targets are formed in Paolini et al. (2021), applied to the REE example in Figure 2. As for RE, we follow how Paolini et al. (2021) handle nested entities and multiple relations, but with a small modification to the decoding targets. Since SCIREX does not contain relation-type annotations, we use the related entities' types as the relation type in the decoding targets. With their formulation, the decoding target is created by inserting each relation annotation around the first-appearing entity in the input document. Taking the RE instance in Figure 2 as an example, the corresponding TANL decoding target would be: Introduction: [Natural language inference | Task | Method = aESIM] (NLI) is an important and significant task in natural language processing (NLP)...
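A simplified sketch of this insertion scheme is given below; the bracket syntax follows the example above, but the helper and the type-only annotation are our illustrative simplification of TANL's full format:

```python
def tanl_annotate(text, annotations):
    """Wrap the first occurrence of each mention as '[mention | type]'.

    annotations: list of (mention, entity_type) pairs. Each insertion
    searches the progressively annotated text, so later insertions account
    for the offsets introduced by earlier ones.
    """
    for mention, ent_type in annotations:
        idx = text.find(mention)
        if idx == -1:  # mention not found (e.g., truncated away): skip it
            continue
        text = text[:idx] + f"[{mention} | {ent_type}]" + text[idx + len(mention):]
    return text

annotated = tanl_annotate(
    "Natural language inference (NLI) is an important task.",
    [("Natural language inference", "Task")],
)
```

Because the whole (annotated) document is the decoding target, target length always exceeds input length, which is the truncation problem discussed in the main text.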

D Implementation Details
We conducted a grid search for the best learning rate over {1e-5, 3e-5, 5e-5, 7e-5, 9e-5} using TEMPGEN w/o TOPK COPY on the MUC-4 REE task. The best learning rate, 5e-5, is fixed for all other experiments. Models are trained for 150 epochs for the REE and binary RE experiments, and 50 epochs for the 4-ary RE experiments. To reproduce our results, please follow the README.md file in https://github.com/PlusLabNLP/TempGen. The weights of the trained models are also included for reproduction purposes.

E Validation Performance
For all reported test set results in Table 1, the corresponding development set performance is listed in Table 4.