Iterative Document-level Information Extraction via Imitation Learning

We present a novel iterative extraction model, IterX, for extracting complex relations, or templates, i.e., N-tuples representing a mapping from named slots to spans of text within a document. Documents may feature zero or more instances of a template of any given type, and the task of template extraction entails identifying the templates in a document and extracting each template’s slot values. Our imitation learning approach casts the problem as a Markov decision process (MDP), and relieves the need to use predefined template orders to train an extractor. It leads to state-of-the-art results on two established benchmarks – 4-ary relation extraction on SciREX and template extraction on MUC-4 – as well as a strong baseline on the new BETTER Granular task.


Introduction
A variety of tasks in information extraction (IE) require synthesizing information across multiple sentences, up to the length of an entire document. The centrality of document-level reasoning to IE has been underscored by an intense research focus in recent years on problems such as argument linking (Ebner et al., 2020; Li et al., 2021, i.a.), N-ary relation extraction (Quirk and Poon, 2017; Yao et al., 2019; Jain et al., 2020, i.a.), and, our primary focus, template extraction (Du et al., 2021b; Huang et al., 2021, i.a.).
Construed broadly, template extraction is general enough to subsume certain other document-level extraction tasks, including N-ary relation extraction. Motivated by this consideration, we propose to treat these problems under a unified framework of generalized template extraction (§2). Figure 1 shows 4-ary relations from the SCIREX dataset (Jain et al., 2020), presented as simple templates. Since documents typically describe multiple complex events and relations, template extraction systems must be capable of predicting multiple templates per document. Existing approaches such as Du et al. (2021b) and Huang et al. (2021) rely on a linearization strategy, forcing models to learn to predict templates in a pre-defined order. In general, however, such orderings are arbitrary. Others have instead focused on the simplified problem of role-filler entity extraction (REE), which entails extracting all slot-filling entities but does not involve mapping them to individual templates (Patwardhan and Riloff, 2009; Du et al., 2021a, i.a.).
We present a new model for generalized template extraction, ITERX, that iteratively extracts multiple templates from a document without requiring a pre-defined linearization scheme. We formulate the problem as a Markov decision process (MDP, §2), where an action corresponds to the generation of a single template (§3.2), and states are sets of predicted templates (§3.3). Our system is trained via imitation learning, where the agent imitates a dynamic oracle drawn from an expert policy (§3.4).

Figure 2: Example templates from the MUC-4 (left) and BETTER Granular (right) datasets. Event triggers (e.g. burned) are not annotated in MUC-4 and are highlighted here only for clarity.

Our contributions can be summarized as follows:
• We show that generalized template extraction can be treated as a Markov decision process, and that imitation learning can be effectively used to train a model to learn this process without making explicit assumptions about template orderings.

Generalized Template Extraction

We define a template ontology as a set of template types T, where each type τ ∈ T is associated with a set of slot types S_τ. A template instance is defined as a pair (τ, {(σ_i : F_i), ...}), where τ ∈ T is a template type, σ_i ∈ S_τ is a slot type associated with τ, and F_i ⊆ X is a subset of all mention spans extracted from the document that fills slot type σ_i (F_i = ∅ indicates that slot σ_i has no filler). Taking Figure 2 (left) as an example, Template 1 has type τ = Arson and slots {PerpetratorIndiv : {"guerrillas"}, PhysicalTarget : {"facilities"}}.
We reduce the problem of extracting a single template to the problem of assigning a slot type to each extracted span s_i ∈ X, where some spans may be assigned a special null type (ε), indicating that they fill no slot in the current template. Given this formulation, we can equivalently specify a template instance as (τ, a), where a is an assignment of spans to slot types: a = {s_i : σ_i}_{s_i ∈ X}. We denote the union of all slot types across all template types, along with the null slot type ε, as S = {ε} ∪ ⋃_{τ ∈ T} S_τ. With these definitions in hand, the problem of generalized template extraction can be stated succinctly: given a template ontology T, a document D, and a set of candidate mentions X extracted from D, generate a set of template instances {(τ_k, a_k)}.

As an MDP

We treat generalized template extraction as a Markov decision process (MDP), where each step of the process generates one whole template instance. For simplicity, we consider the problem of extracting templates of a specific type τ ∈ T; extracting all templates then simply requires iterating over T, where |T| is typically very small. This MDP (2^A, A, E, R) comprises the following:
• 2^A: the set of states. In our case, this is the set of all template generation histories. Each state y ⊆ A is a set of generated templates;
• A: the set of actions, or assignments: an action a is the generation of a single template (an assignment of slot types to spans);
• E: the environment that dictates state transitions.
Here, each transition simply adds a generated template to the set of all templates generated for the document: E(y, a) = y ∪ {a};
• R(y, a): the reward from action a under the current state y.
These components are detailed in the following section. Figure 3 shows ITERX in action: the MDP produces two templates sequentially, terminating on a null assignment to all input spans in X. (Our notation is consistent with prior NLP work that uses MDPs, e.g., Levenshtein Transformers; Gu et al., 2019.)
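The MDP above can be sketched in a few lines of Python: a state is the set of templates generated so far, the transition adds one full template to the state, and decoding terminates on an all-null assignment. The toy policy and the span/slot names here are illustrative stand-ins, not part of the model.

```python
# Minimal sketch of the template-extraction MDP: states are sets of generated
# templates, an action is one full slot assignment, and the environment simply
# adds the action to the state. The toy policy below stands in for the learned
# model.
NULL = "null"  # the special null slot type (epsilon)

def transition(state, action):
    """E(y, a) = y U {a}: add one generated template to the history."""
    return state | {action}

def is_terminal(action):
    """Decoding stops when every span is assigned the null slot type."""
    return all(slot == NULL for slot in action)

def run_episode(policy, spans):
    state = frozenset()  # initial state: no templates generated yet
    while True:
        action = policy(state, spans)  # one template: a slot type per span
        if is_terminal(action):
            return state
        state = transition(state, action)

# A toy deterministic policy over two candidate spans: emit one Arson-style
# template, then terminate on an all-null assignment.
def toy_policy(state, spans):
    if not state:
        return ("PerpetratorIndiv", "PhysicalTarget")
    return (NULL, NULL)

templates = run_episode(toy_policy, spans=["guerrillas", "facilities"])
```

Because the state is just a set, template order never enters the formulation: any permutation of the same actions yields the same final state.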

Model
Our ITERX model is a parameterized agent that makes decisions under the MDP above: conditioned on an input document D, spans X extracted from D, and a specific template type τ, ITERX generates a single template of type τ at each step. The model consists of two parameterized components:
• Policy π: a policy π(a | y, X) that generates a distribution over potential assignments of spans to slots in the current template of type τ;
• State transition model f: an autoregressive state encoder that maps a state y (i.e., a set of predicted templates) to a continuous representation X via X^(t+1) = f(X^(t), a_t).

Using these components, ITERX generates a sequence of templates: it starts with the initial state X^(0) (see Figure 3 for a running example), comprising only span representations derived from the encoder (as no templates have yet been predicted), and ends when no new template is generated (§3.5). ITERX is trained via imitation learning, aiming to imitate an expert policy π* derived from reference templates.

Span Extraction and Representation
ITERX takes spans as input and thus relies on a span proposal component to obtain candidate spans. For all experiments, we use the neural CRF-based span finder employed for FrameNet parsing in Xia et al. (2021) and for the BETTER Abstract task in Yarmohammadi et al. (2021). CRF-based span finders have been empirically shown to excel at IE tasks (Gu et al., 2022). The input document is first embedded using a pretrained Transformer encoder (Devlin et al., 2019; Raffel et al., 2020) that is fine-tuned during model training. Each span s = D[i:j] extracted by the span extractor is encoded as a vector x_enc, obtained by first concatenating three vectors of dimension d: the embeddings of the first and last tokens in the span, and a weighted sum of the embeddings of all tokens within the span, computed using a learned global query (Lee et al., 2017). This 3d-dimensional vector is then compressed to size d using a two-layer feedforward network. Lastly, to incorporate positional information, we add sinusoidal positional embeddings based on the token-level offset of s within the document, yielding x_enc.
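As a concrete illustration, here is a minimal numpy sketch of the span encoding just described. The dimension d = 8 and all weight matrices are illustrative random stand-ins for learned parameters; only the shapes and the concatenate-compress-add-position recipe reflect the text above.

```python
import numpy as np

# Sketch of the span encoder: concatenate first-token, last-token, and
# attention-pooled embeddings (a 3d vector), compress to d with a two-layer
# feedforward net, then add a sinusoidal embedding of the span's token offset.
rng = np.random.default_rng(0)
d = 8  # toy hidden dimension

def sinusoidal(pos, dim):
    i = np.arange(dim)
    angle = pos / np.power(10000.0, (i - i % 2) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def encode_span(token_embs, start, end, query, W1, W2):
    span = token_embs[start:end + 1]                 # tokens inside the span
    attn = np.exp(span @ query)                      # learned-global-query scores
    attn /= attn.sum()
    pooled = attn @ span                             # weighted sum of tokens
    concat = np.concatenate([span[0], span[-1], pooled])  # 3d-dimensional
    hidden = np.tanh(W1 @ concat)                    # two-layer FFN: 3d -> d
    compressed = W2 @ hidden
    return compressed + sinusoidal(start, d)         # add positional signal

tokens = rng.normal(size=(12, d))        # a 12-token toy "document"
W1 = rng.normal(size=(d, 3 * d)) * 0.1   # stand-in FFN weights
W2 = rng.normal(size=(d, d)) * 0.1
x = encode_span(tokens, start=3, end=6, query=rng.normal(size=d), W1=W1, W2=W2)
```

The resulting x plays the role of x_enc for one candidate span.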

Policy: Generating a Single Template
A policy π generates a single template given span states X = (x_1, ..., x_n) and template type τ ∈ T, conditioned on the document and all of its candidate mention spans.
Since an action a represents a set of slot type assignments for all candidate mentions, the policy π(a | y, X) can be factorized as

π(a | y, X) = ∏_{i=1}^{n} p(σ_i | y, X).    (1)

Thus we only need to model the slot type distribution for each candidate span. Here, we employ the two models described below.
Independent Modeling We train a classifier that outputs a slot type (or ε) for each span given both the template type embedding t_τ and the slot type embedding s_σ, inspired by standard practice in binary relation extraction (Ebner et al., 2020; Lin et al., 2020, i.a.).
Joint Modeling Following Chen et al. (2020), we create a model that jointly considers all candidate spans given the template type. We begin by prepending t_τ to the sequence of span states X to yield the sequence (t_τ, x_1, ..., x_n). This sequence is fed to a separate Transformer encoder, which naturally models interactions both between spans and between a span and the template type via self-attention (Vaswani et al., 2017):

(t̃, x̃_1, ..., x̃_n) = Transformer(t_τ, x_1, ..., x_n).    (3)

We emphasize that the inputs to this Transformer are embeddings of spans (see §3.1) and not tokens, following Chen et al. (2019, 2020); positional embeddings are not needed in this Transformer, since sinusoidal embeddings are already added to the span representations. For each x_i, we pass the representation x̃_i output by the Transformer to a linear layer with output size |S|, the total number of slot types. A softmax activation is then applied over all slot types σ that are valid for τ (i.e., σ ∈ S_τ ∪ {ε}), with invalid types masked out, yielding the following distribution:

p(σ_i = σ | y, X) ∝ exp((W x̃_i)_σ), σ ∈ S_τ ∪ {ε}.    (4)
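The masked softmax over valid slot types and the product factorization of the policy can be sketched as follows; the slot inventory and the uniform scores are hypothetical stand-ins for model outputs.

```python
import numpy as np

# Sketch of the factorized policy: each span gets a softmax over slot types
# valid for the template type (invalid types masked out), and the action
# log-probability is the sum of per-span log-probabilities.
ALL_SLOTS = ["null", "PerpetratorIndiv", "PhysicalTarget", "HumanTarget"]

def slot_distribution(logits, valid):
    """Softmax over valid slot types (plus null); invalid types get mass 0."""
    mask = np.array([s in valid or s == "null" for s in ALL_SLOTS])
    z = np.where(mask, logits, -np.inf)
    z = z - z[mask].max()          # stabilize before exponentiating
    p = np.exp(z)
    return p / p.sum()

def action_log_prob(per_span_logits, assignment, valid):
    """log pi(a | y, X) as a sum of per-span log-probabilities."""
    total = 0.0
    for logits, slot in zip(per_span_logits, assignment):
        p = slot_distribution(logits, valid)
        total += np.log(p[ALL_SLOTS.index(slot)])
    return total

valid = {"PerpetratorIndiv", "PhysicalTarget"}  # slots of an Arson-like type
logits = np.zeros((2, len(ALL_SLOTS)))          # uniform scores over 2 spans
lp = action_log_prob(logits, ["PerpetratorIndiv", "null"], valid)
```

With uniform scores and three valid options per span, each span contributes log(1/3) to the action log-probability.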

State Transition Model
A state transition model models the environment E(y, a). Recall that a state transition consists simply in the generation of a single template, where the current state y is the set of all templates that have been generated up to the current step.
Here, we propose a neural model that produces a representation of y. Specifically, we model y as a sequence of vectors X_mem(y) ∈ R^{n×d}: one d-dimensional state vector for each of the n candidate spans s_i ∈ X. Each state vector x_mem ∈ R^d acts as a span memory, tracking the use of that span across generated templates. We model state transitions using a single gated recurrent unit (GRU; Cho et al., 2014). Given the current template's assignment (σ : s) ∈ a of a slot type σ to a span s, the state transition for y' = y ∪ {a} updates that span's memory as

x'_mem = GRU(x_mem, [s_σ ; t̃]),

where t̃ is a template embedding given by t̃ = t_τ when using the independent policy model (§3.2), and given by Equation 3 when using the joint model. Intuitively, t̃ is a summarized vector of the current template, akin to the role of the [CLS] token employed in BERT (Devlin et al., 2019). We thus use the concatenation of the slot type embedding s_σ and the template vector t̃ as the input to the state transition GRU to track the use of the span.
The input representation of a span s at each step is simply x = x_enc + x_mem: the sum of the original span embedding x_enc described in §3.1 and the current memory vector x_mem.
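A minimal numpy sketch of the memory update follows, assuming toy dimensions and random stand-in weights (in the real model these are learned jointly with the rest of ITERX).

```python
import numpy as np

# Sketch of the span-memory state transition: for a span assigned a slot in the
# new template, a single GRU cell updates that span's memory vector from the
# concatenation [slot embedding; template embedding].
def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_cell(h, x, Wz, Wr, Wh):
    """One GRU step: h is the span memory, x the concatenated input."""
    hx = np.concatenate([h, x])
    z = sigmoid(Wz @ hx)                              # update gate
    r = sigmoid(Wr @ hx)                              # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h, x]))
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(1)
d = 4                                                 # toy memory dimension
Wz, Wr, Wh = (rng.normal(size=(d, 3 * d)) * 0.1 for _ in range(3))
x_mem = np.zeros(d)                                   # fresh memory for one span
slot_emb, templ_emb = rng.normal(size=d), rng.normal(size=d)
x_mem = gru_cell(x_mem, np.concatenate([slot_emb, templ_emb]), Wz, Wr, Wh)
# At the next step, this span's input would be x = x_enc + x_mem.
```

Spans not assigned a slot in the current template would simply keep their previous memory vector.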

Policy Learning
We use direct policy learning (DPL), a type of imitation learning, to train our model. DPL entails training an agent to imitate the behavior of an interactive demonstrator, as given by optimal actions a* drawn from some expert policy π*, a proposal distribution over actions. This expert policy is computed dynamically based on the current state of the agent, as we describe below. For this reason, the interactive demonstrator is sometimes referred to as a dynamic oracle (Goldberg and Nivre, 2012).
The log-likelihood of the oracle action under the ITERX policy model is the reward. This ensures that the learning problem can be optimized directly using gradient descent, where the objective is given by the expected reward:

J(θ) = E_{y ~ d_π̄} [ Σ_t γ^t log π_θ(a*_t | y_t, X) ].

Here, γ is a discount factor, π̄ is the mixed policy, and states are repeatedly sampled from their induced state distribution d_π̄. The mixed policy π̄ is a mixture of the expert policy and the agent's policy (Ross et al., 2011). Sampling from π̄ can thus be described as first sampling some b ∈ {0, 1}, then sampling an action from the agent's parameterized policy π_θ if b = 1, or from the dynamic oracle π* if b = 0. Here β, the agent roll-out rate (or agent policy mixing rate), is a hyperparameter that controls the probability b = 1 of the agent following its own policy vs. the dynamic oracle. This process resembles scheduled sampling (Bengio et al., 2015), a technique commonly employed in training models for sequence generation tasks like machine translation: when updating decoder hidden states, either the gold token y* or the predicted token ŷ may be used, and the decision is made via a random draw. Here, the difference is that we are generating templates at each step instead of tokens.
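The policy-mixing step can be sketched directly; `agent_policy` and `expert_policy` here are hypothetical callables standing in for the agent's policy and the dynamic oracle.

```python
import random

# Sketch of roll-out mixing: at each step, draw b ~ Bernoulli(beta) and follow
# the agent's own policy when b = 1, the dynamic oracle when b = 0.
def sample_action(agent_policy, expert_policy, state, beta, rng):
    b = 1 if rng.random() < beta else 0
    return agent_policy(state) if b == 1 else expert_policy(state)

rng = random.Random(0)
agent = lambda y: "agent_action"    # stand-in for the parameterized policy
oracle = lambda y: "oracle_action"  # stand-in for the dynamic oracle

# beta = 0: always the oracle; beta = 1: always the agent.
a0 = sample_action(agent, oracle, set(), beta=0.0, rng=rng)
a1 = sample_action(agent, oracle, set(), beta=1.0, rng=rng)
```

Intermediate values of beta interpolate between the two, exactly as in scheduled sampling.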

Expert Policy
We construct an expert policy based on the agent's policy. At training time, given the set of gold templates y* and the current state y (all templates predicted thus far), the set Ā = y* \ y contains all gold templates not yet predicted. Our expert policy is formulated as

π*(a | y) ∝ exp( log π_θ(a | y, X) / T ) for a ∈ Ā, and 0 otherwise,    (8)

where T is a temperature parameter. Intuitively, our expert policy seeks to "please" the agent: a (viable) action's probability under the expert policy is proportional to its probability under the agent's policy. The temperature T controls concentration: T → 0+ reduces the distribution to a point mass on a single action, while T → ∞ assigns equal probability to all remaining gold templates.
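A sketch of the temperature-scaled expert distribution, computed over the remaining gold templates from (assumed) agent log-probabilities:

```python
import math

# Sketch of the temperature-controlled expert policy: remaining gold templates
# are weighted by the agent's own log-probabilities, scaled by 1/T. The
# `agent_log_probs` dict is a hypothetical stand-in mapping each not-yet-
# predicted gold template to its log-probability under the agent.
def expert_distribution(agent_log_probs, T):
    scaled = {a: lp / T for a, lp in agent_log_probs.items()}
    m = max(scaled.values())                       # stabilize the softmax
    exp = {a: math.exp(s - m) for a, s in scaled.items()}
    z = sum(exp.values())
    return {a: e / z for a, e in exp.items()}

remaining = {"template_1": math.log(0.7), "template_2": math.log(0.3)}
sharp = expert_distribution(remaining, T=1e-6)     # T -> 0+: near-argmax
flat = expert_distribution(remaining, T=1e6)       # T -> inf: near-uniform
```

Small T concentrates the oracle on the agent's favorite remaining template; large T spreads it uniformly, matching the ARGMAX and UNIFORM settings in §5.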

Inference
Although many search algorithms for sequence prediction could be employed (e.g. beam search, A*), we find greedy decoding to be effective, and leave further exploration to future work. Setting the initial state y^(0) = ∅, we take actions (i.e., generate templates) by greedy decoding â = argmax_a π(a | y, X) at every step. Decoding stops when all spans are assigned the null slot type ε in â.

Experiments
We evaluate ITERX on three datasets: SCIREX (Jain et al., 2020), MUC-4 (Grishman and Sundheim, 1996), and BETTER Phase II English Granular. SCIREX is a challenge dataset for 4-ary relation extraction on full academic articles related to machine learning. MUC-4 and Granular are both traditional template extraction tasks, though they differ in important respects, which we discuss in Appendix C. For summary statistics, see Table 1.

Our primary evaluation metric is CEAF-REE (Du et al., 2021a), also adopted in subsequent work (Du et al., 2021b; Huang et al., 2021). CEAF-REE is based on the CEAF metric for coreference resolution (Luo, 2005), which computes an alignment between gold and predicted entities that maximizes a measure of similarity φ between aligned entities (e.g. CEAF_φ4 in coreference resolution). This alignment is subject to the constraint that each reference entity is aligned to at most one predicted entity.
The CEAF-REE implementation (henceforth, CEAF-REE_impl) employed in Du et al. (2021a,b) and Huang et al. (2021) unfortunately departs from the stated metric definition (CEAF-REE_def) in two ways: (1) it eliminates the constraint on entity alignments, and (2) it treats the template type as an additional slot when reporting cross-slot averages. For maximally transparent comparisons to prior work, we report scores under both CEAF-REE_def and CEAF-REE_impl, obtaining state-of-the-art results on MUC-4 with each.
However, we argue that neither CEAF-REE_def nor CEAF-REE_impl is consistent with historical evaluation of template extraction systems. CEAF-REE_def errs in enforcing the entity alignment constraint: doing so effectively requires systems to perform coreference resolution, which is too strict and runs contrary to the original MUC-4 evaluation.
By contrast, CEAF-REE_impl also errs in treating the template type as just another slot: this elides the important distinction between the kind of event being described and the participants in that event (§6).
In the interest of clarity, we define a modified version of the CEAF-REE metric that avoids both pitfalls: it relaxes the entity alignment constraint and does not include the template type in cross-slot averages. We call this version CEAF-RME, where "M" stands for mention, emphasizing the focus on mention-level rather than entity-level ("E") scoring. Intuitively, relaxing the alignment constraint amounts to placing the burden of coreference resolution on the metric: if the scorer aligns two predicted mentions to the same reference entity, the mentions are implicitly deemed coreferent.
Note that for a CEAF-family metric, the similarity function φ(R, S) between a reference entity R and a predicted entity S is arbitrary (Luo, 2005). In Du et al. (2021a), CEAF-REE_impl uses φ_⊆(R, S) = 1[S ⊆ R]. We argue that φ_⊆ overly penalizes models for predicting incorrect mentions, as even a single incorrect mention reduces the score to 0. A better choice is φ_3(R, S) = |R ∩ S| from Luo (2005): this computes a micro-averaged score over all mentions, and adequately assigns partial credit for the overlap between the predicted and reference mention sets. See Figure 4 for a succinct comparison among these variants; a more detailed discussion can be found in Appendix D.
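To make the contrast concrete, here is a small sketch of the two similarity functions on toy mention sets (the mention strings are invented for illustration):

```python
# Sketch of the two CEAF similarity functions compared above, on mention sets:
# phi_subset gives credit only when the predicted mentions form a subset of the
# reference entity's mentions, while phi_3 counts overlapping mentions and so
# grants partial credit.
def phi_subset(ref, pred):
    return 1 if pred <= ref else 0

def phi_3(ref, pred):
    return len(ref & pred)

ref = {"guerrillas", "the guerrillas"}    # one reference entity
exact = {"guerrillas"}                    # a clean (subset) prediction
noisy = {"guerrillas", "a peasant"}       # one spurious mention added
```

Under phi_subset, the single spurious mention in `noisy` zeroes out the score; under phi_3, the correct overlap still counts.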
For SCIREX, we report CEAF-RME (under both φ_3 and φ_⊆). For MUC-4, we report all metrics so that fair comparisons with prior work can be made. For BETTER Granular, we use its official metrics, described in Appendix D.
Figure 4: A comparison of the metrics discussed. Features in blue are "desired" for the evaluation of our task. See github.com/wanmok/iterx for implementations.

Results
SCIREX For TEMPGEN, we report models trained with BART_base and BART_large, where only BART_base was used in Huang et al. (2021). While BART is an encoder-decoder architecture, ITERX uses only the encoder part, and thus requires about half the pretrained parameters that TEMPGEN does. Even with far fewer parameters, ITERX outperforms the BART_large baseline by a wide margin. Moreover, our best-performing model under T5enc-large (Raffel et al., 2020) achieves roughly 2× the performance of TEMPGEN (see Table 2).

MUC-4
Under the most comparable setting, ITERX outperforms GTT under all metrics by 1-2%, both using BERT_base (Table 3). With T5enc-large, ITERX obtains even better performance, with most gains coming from increased precision. Furthermore, we note a consistent gap of ≈5% F1 between CEAF-RME_φ⊆ and CEAF-REE_impl, which we suspect is due to CEAF-REE_impl's inclusion of scores for template type in the aggregated slot F1: as template type scores are higher across models than slot scores, they are liable to inflate the aggregate score.
BETTER Granular We report scores on the English-only Phase II BETTER Granular task using the official BETTER scoring metric in Table 4. Given the complexity of the Granular task, the accompanying difficulty of developing models to perform it, and the lack of existing work on Granular, we report scores only for ITERX under T5enc-large. We intend these to serve as a solid baseline against which future work may be measured.

We next conduct ablations to examine how specific aspects of ITERX's design affect learning. Here, we focus on SCIREX as a case study: it has the highest average number of templates per document of the three datasets, allowing us to best investigate the behavior of ITERX over long action sequences.
Recall that the dynamic oracle specifies an expert policy π* (Equation 8) from which expert actions a* are drawn. One design decision concerns the agent roll-out rate β, which controls how often we draw from the expert policy vs. the agent's policy when making updates. Another concerns how entropic the expert distribution should be, controlled by the temperature T. Both decisions reflect a trade-off between exploration and exploitation in the space of action sequences.
Agent Roll-out Rate β We show how model performance changes as we increase the agent roll-out rate β ∈ [0, 1] in Figure 5, where β = 0 corresponds to always following the expert policy, and β = 1 corresponds to always updating based on the agent's own policy. The model performs poorly under low β, but improves quickly as β increases, reaching a plateau past β ≥ 0.5. These results are intuitive: relying more on the expert (lower β) for learning yields a fixed and deterministic set of training states, which may prevent the agent from visiting the kinds of new states it often encounters at test time. With higher β, the agent's behavior is more consistent between train and test time.

Temperature T We compare the following four settings for sampling from π*, keeping β = 0 to control for effects of policy mixing:
• FIXED: Select the next template in the document in the order given in the dataset. In this case, T does not come into play. This setting corresponds to the standard practice of using fixed template linearizations (Du et al., 2021b; Huang et al., 2021).
• T → 0+ (ARGMAX): Select the template that maximizes the likelihood under the system-predicted distributions over slots.
• T = 1 (XENT): Sample a template according to the distribution defined by the cross-entropy between references and predictions.
• T → ∞ (UNIFORM): Sample a template uniformly from the set of correct templates.
Comparing test set performance across these settings, we find that while CEAF-RME_φ⊆ F1 scores are consistent across settings, CEAF-RME_φ3 F1 scores are higher under higher-temperature settings. To the extent that the more entropic settings conduce to higher template and mention recall, we would expect them to yield more partial-credit template alignments (under the φ3 similarity function) than non-random settings, which tend to focus on correct prediction of fewer templates, thus potentially missing templates entirely and receiving no partial credit.

Related Work
Template Extraction The term template extraction was originally proposed in the Message Understanding Conferences (MUC; Sundheim, 1991, i.a.) to describe the task of extracting templates from articles. Researchers later focused more heavily on sentence-level IE, especially after the release of the ACE 2005 dataset (Walker et al., 2006). But following renewed interest in document-level IE, researchers (Du et al., 2021b; Huang et al., 2021; Gantt et al., 2022, i.a.) have begun to revisit MUC and to develop new template extraction datasets (notably, BETTER Granular).
Traditionally, template extraction comprises two sub-tasks: template identification, in which a system identifies and types all templates in a document, and slot filling or role-filler entity extraction (REE), in which the slots associated with each template are filled with extracted entities. Much recent work in this domain has turned away from the full task, focusing only on REE, which is tantamount to assuming that there is just a single aggregate template per document (Patwardhan and Riloff, 2009; Huang and Riloff, 2011, 2012; Du et al., 2021a; Huang et al., 2021).

Document-Level Relation Extraction
Alongside template extraction, there has been considerable recent interest within IE in various challenging document-level relation extraction objectives, beyond the longstanding and dominant focus on coreference resolution. Argument linking, a generalization of semantic role labeling (SRL; Gildea and Jurafsky, 2002) in which a predicate's extrasentential arguments must also be labeled, is one notable example, and has attracted recent attention through the RAMS (Ebner et al., 2020) and WikiEvents (Li et al., 2021) benchmarks. (Prior benchmarks on this task include SemEval 2010 Task 10 (Ruppenhofer et al., 2010), Beyond NomBank (Gerber and Chai, 2010), ONV5 (Moor et al., 2013), and Multi-sentence AMR (O'Gorman et al., 2018).) A separate line of work has concentrated on general N-ary relation extraction challenge tasks, in which entities participating in the same relation may be scattered widely throughout a document. Beyond SCIREX, PubMed (Quirk and Poon, 2017; Peng et al., 2017) and DocRED (Yao et al., 2019) are two other prominent benchmarks in this area.
Imitation Learning Our approach casts the problem of generalized template extraction as a Markov decision process. SEARN (Daumé III et al., 2009) and other related work (Ross et al., 2011; Venkatraman et al., 2015; Chang et al., 2015, i.a.) have considered structured prediction under a reinforcement learning setting. Notably, in dependency parsing, Goldberg and Nivre (2012) proposed the use of a dynamic oracle to guide an agent toward the correct parse (see §3.4).

We also employ direct policy learning to optimize the template extraction MDP, thus reducing the problem to one of supervised sequence learning amenable to gradient descent. This treatment is reminiscent of other similar techniques in NLP. Scheduled sampling (Bengio et al., 2015), for instance, trains a sequence generator with an expert policy consisting of a mixture of the predicted token and the gold token. Relatedly, Levenshtein Transformers (Gu et al., 2019) learn to edit a sequence by imitating an expert policy based on the Levenshtein edit distance.

Conclusion
We have presented ITERX, a new model for generalized template extraction that iteratively generates templates via a Markov decision process. ITERX demonstrates state-of-the-art performance on two benchmarks in this domain, 4-ary relation extraction on SCIREX and template extraction on MUC-4, and establishes a strong baseline on a third benchmark, BETTER Granular. In our experiments, we have also shown that imitation learning is a viable paradigm for these problems. We hope that our findings encourage future work to confront head-on the challenge of documents that describe multiple complex events and relations, rather than veiling this difficulty behind simplified task formulations.

Limitations
Although we believe our iterative extraction paradigm to be promising, we acknowledge that this work is not without limitations. First, ITERX features a significant number of hyperparameters. We found that these generally required some effort to tune for specific datasets, and that no single configuration was uniformly best across domains. We showed the impact of manipulating some of these hyperparameters in §5. Second, our ITERX implementation iterates over all template types in the template ontology during training and inference, which means that runtime grows linearly with the number of template types. While our framework could in principle support template type prediction as well (which would reduce this to O(1)), it does not do so in practice, and hence runtime may be long for large ontologies. However, we again stress that actual template ontologies tend to be small.

A Terminology
Information Extraction is rife with vague and competing terms for similar concepts, and we recognize some hazard in introducing generalized template extraction (GTE) into this landscape. To head off possible confusion, we highlight two important differences between this problem and the well-established problem of event extraction (EE).

First, EE requires identifying lexical event triggers, whereas GTE does not, as template instances do not necessarily have one specific lexical anchor. A document describing a terrorist attack may only explicitly describe a series of bombings, and a document describing an epidemic may only explicitly state that thousands of people have contracted a particular disease. This property holds for all three datasets we focus on, and can be seen in both Figure 1 and Figure 2. Template anchors are annotated neither for MUC-4 nor for SCIREX. And while they are annotated for BETTER Granular, they do not factor into scoring. This contrasts with major EE datasets, such as ACE or PropBank, for which typed lexical triggers must be extracted.

Second, we take GTE to be a fundamentally document-level task: templates concern events described over an entire document. In practice, EE has historically referred to extraction of predicate-argument structures within a single sentence. One could conceivably argue that this usage has begun to change with the recent interest in argument linking datasets like RAMS (Ebner et al., 2020) and WikiEvents (Li et al., 2021), in which arguments may appear in different sentences from the one containing their predicate. Even so, these cross-sentence arguments are still arguments of a particular predicate, in a particular sentence. Moreover, the overwhelming majority of arguments in these datasets are sentence-local (Ebner et al., 2020). As emphasized above, templates are not necessarily anchored to particular lexical items. For this reason, they also do not necessarily exhibit the level of locality one finds in EE.

These differences are what motivate the use of CEAF-REE as an evaluation metric, in contrast to the precision, recall, and F1 scores for events and arguments that are typically reported for EE. In brief, it simply is not possible to compute these for GTE in the same way as they are computed for EE. We elaborate on this point in Appendix D.

B Model Training and Hyperparameters
We implemented our models in PyTorch (Paszke et al., 2019) and AllenNLP (Gardner et al., 2018), and trained all models on a single NVIDIA RTX 6000 GPU. For all experiments that reproduce prior work, we trained models to full convergence under the patience settings provided in the publicly released code. For all ITERX models, we trained and tuned hyperparameters under our grid's limit of 24 hours per run, which was sufficient to obtain solid performance on all datasets. We performed hyperparameter search manually and report the best-performing hyperparameters and the bounds searched in Table 6, Table 7, and Table 8.

C.1 MUC-4
The MUC-4 dataset features a total of 1,700 English documents (1,300 for train and 200 each for dev and test) concerning geopolitical conflict and terrorism in South America. Documents are annotated with templates of one of six kinds (Attack, Arson, Bombing, Murder, Robbery, and ForcedWorkStoppage) and may have multiple templates (often of the same type) or no templates at all. All templates contain the same set of slots.

C.2 BETTER Granular

Finally, the formal evaluation setting for Granular, which we do not adopt in this paper, is zero-shot and cross-lingual: systems trained only on English documents are evaluated exclusively on documents in a different target language. The data used in our experiments is English-only and comprises the "train," "analysis," and "devtest" splits from Phase II of the BETTER program, for which the target language is Farsi.

C.3 SCIREX
The 4-ary SCIREX relation extraction task seeks to identify entity 4-tuples that describe a metric used to evaluate a method applied to an ML task as realized by a specific dataset, e.g. (span F1, BERT, SRL, ACE 2005). The challenge of SCIREX lies not only in these pieces of information tending to be widely dispersed throughout an article, but also in the fact that only tuples describing novel work presented in the paper (and not merely cited work) are labeled as gold examples. Following Huang et al. (2021), we frame this as a template extraction task, treating each 4-tuple as a template with four slots.

D Model Evaluation Details
A key consideration that arises in evaluating generalized template extraction is the need to align predicted and reference templates: a given predicted template may be reasonably similar to multiple different reference templates, and one must decide on a single template to use as the reference for each prediction. Generalized template extraction is similar in this respect to coreference resolution, in which predicted entities may (partially) match multiple reference entities, and one must determine a ground-truth alignment. Importantly, this consideration also renders inappropriate the metrics traditionally reported for event extraction, namely event and argument precision, recall, and F1. This is because event extraction is fundamentally a span labeling problem, and the identity of the appropriate reference span is always clear for a given predicted span: either a reference span with the same boundary and type exists or it does not. By contrast, the mapping from prediction to reference for templates is only this transparent in cases of perfectly accurate predictions.
All the evaluation metrics presented in this appendix are, at base, minimal extensions of precision, recall, and F1 to cases where template alignments are both necessary and nontrivial. For CEAF-REE in particular, the various versions of the metric that we discuss (CEAF-RMEφ3, CEAF-RMEφ⊆, CEAF-REEdef, and CEAF-REEimpl) merely reflect differences in how this alignment should be performed and whether the template type should be treated in the same way as slot types for reporting purposes.

D.1 MUC-4
MUC-4 evaluation presents a special challenge, owing to its long and complicated history, and to terminological confusion.²¹ Here, we discuss CEAF-REE (Du et al., 2021a), the current standard metric for MUC-4 evaluation. We begin with definitions, follow with a discussion of some of its problems, and conclude with an extended presentation of our CEAF-RME variant, introduced in §4.

D.1.1 CEAF and CEAF-REE: Definitions
The CEAF-REE metric, introduced by Du et al. (2021a), has since been adopted as the standard evaluation metric for MUC-4 (Du et al., 2021b; Huang et al., 2021). To our knowledge, no official scoring script has ever been released for MUC-4, although the metrics used as part of the original evaluation are described in detail in Chinchor (1992). CEAF-REE does not attempt to implement these original metrics, but is rather a lightly adapted version of the widely used CEAF metric for coreference resolution, proposed in Luo (2005).²²

CEAF computes an alignment between reference (R) and system-predicted (S) entities, with each entity represented by a set of coreferent mentions, and with the constraint that each predicted entity is aligned to at most one reference entity. This is treated as a maximum bipartite matching problem, in which one seeks the alignment that maximizes the sum of an entity-level similarity function φ(R, S) over all aligned entities R ∈ R and S ∈ S within a document. In principle, CEAF is agnostic to the choice of φ, though it is generally desirable that φ(R, S) = 0 when no mention m ∈ S also appears in R, and that φ(R, S) = 1 when R = S, for reasons described in Luo (2005). In practice, the φ4 similarity function is most commonly used, defined as the Dice coefficient (or F1 score) between R and S:

    φ4(R, S) = 2|R ∩ S| / (|R| + |S|)

Another possible version is φ3:

    φ3(R, S) = |R ∩ S|

Given this φ similarity function and the maximal match g* between entities, the final precision and recall are computed as follows:

    P = Σ_{(R,S) ∈ g*} φ(R, S) / Σ_{S ∈ S} φ(S, S)
    R = Σ_{(R,S) ∈ g*} φ(R, S) / Σ_{R ∈ R} φ(R, R)

CEAF-REE departs from standard CEAF in its choice of similarity function:

• A binary similarity function φ⊆ is used, defined as follows:

    φ⊆(R, S) = 1 if S ⊆ R, and 0 otherwise

This function φ⊆ says that a model receives full credit (1.0) for a predicted entity if and only if its mentions form a subset of those in the reference entity. If even one incorrect mention is included, the model receives a score of zero for that entity.

²² Luo's motivations for proposing CEAF actually derive in large part from observed shortcomings with the original MUC-4 F1 score. See Luo (2005) for details.
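To make the definitions above concrete, here is a minimal brute-force sketch of CEAF precision and recall, assuming entities are represented as frozensets of mention strings. This is not the official scorer: the optimal one-to-one alignment is found by enumerating permutations for clarity, whereas real implementations use the Kuhn-Munkres algorithm.

```python
from itertools import permutations

def phi4(r, s):
    """Dice coefficient (F1) between reference entity r and predicted entity s."""
    return 2 * len(r & s) / (len(r) + len(s))

def phi3(r, s):
    """Raw count of shared mentions."""
    return len(r & s)

def ceaf(refs, preds, phi=phi4):
    """Return (precision, recall) under the alignment maximizing total phi."""
    n = max(len(refs), len(preds))
    best = 0.0
    for perm in permutations(range(n)):
        total = sum(
            phi(refs[i], preds[perm[i]])
            for i in range(len(refs))
            if perm[i] < len(preds)  # unmatched entities contribute nothing
        )
        best = max(best, total)
    precision = best / sum(phi(s, s) for s in preds)
    recall = best / sum(phi(r, r) for r in refs)
    return precision, recall

# Hypothetical mentions for illustration only.
refs = [frozenset({"Obama", "he"}), frozenset({"the embassy"})]
preds = [frozenset({"Obama"})]
p, r = ceaf(refs, preds)  # p = 2/3, r = 1/3 under phi4
```

Note that φ(E, E) = 1 for φ4 but φ(E, E) = |E| for φ3, so the two denominators normalize very differently: under φ3, precision is the fraction of predicted mentions that land in their aligned reference entity.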

D.1.2 CEAF-REE: Problems and Solutions
Our principal concerns with CEAF-REE lie with how it has so far been reported and implemented, and with challenges in extending it to the full template extraction task, in which multiple templates of the same type may be present in a document. We elaborate on two issues discussed briefly in §4 and also introduce a third.
First, previous work that reports CEAF-REE treats the template type merely as another slot, with template type labels treated as special kinds of "entities" that may fill this slot. This is not necessarily problematic in itself: template type-level metrics are valuable for evaluating system performance. However, it is problematic when reporting (micro or macro) average CEAF-REE figures across slots, as these works do. This is because incorporating the scores for template type into the average elides the distinction between roles (slots) and the kind of event being described (the template type). Moreover, the alignment between slot-filling entities is also already conditioned on a match between the template types. There are thus two distinct ways in which information about a system's predictive ability with respect to template type ends up in a slot-level average CEAF-REE score. This results in reported values that are very difficult to interpret, and potentially misleading to the extent that these features of CEAF-REE implementations are not made apparent in writing.
Second, the constraint that at most one predicted entity be aligned to each reference entity - stipulated in the metric definition (CEAF-REEdef) - is not enforced in the implementation (CEAF-REEimpl). Practically, this means that the alignment shown in Figure 6 would receive full credit, whereas it ought to receive a precision score of only 0.75, as Du et al. (2021a) describe. As we argue in §4, we believe this constraint to be overly strict. But this point aside, the discrepancy between definition and implementation is problematic in itself.
Third, full template extraction introduces a second maximum bipartite matching problem, which requires aligning predicted and reference templates of the same type, and which CEAF-REE (either CEAF-REEdef or CEAF-REEimpl) is not natively equipped to handle, given that it operates at the level of slots. Du et al. (2021b) report CEAF-REE for GTT under an optimal template alignment, but this is obtained via brute force, enumerating and evaluating every possible alignment, including those between templates of different types. The similarity function (call it φTEMP(Tp, Tr)) that they use for template alignment is itself the cross-slot average CEAF-REEimpl score for predicted template Tp and reference template Tr. This brute-force template alignment, in conjunction with the two-level maximum bipartite matching problem, results in prohibitively long scorer execution times in cases where there are even a modest number of predicted or reference templates of the same type.²³

In addition to our implementation of CEAF-RME (see below), we also present the first correct implementation of CEAF-REEdef that fully addresses the first two points above: template types are no longer treated as additional slots and the entity-level alignment constraint is enforced. On the third point, our implementation efficiently computes optimal template alignments using the Kuhn-Munkres algorithm (Kuhn, 1955; Munkres, 1957). However, even with this efficient implementation, solving the two-level maximum bipartite matching problem is still computationally intensive.

Figure 6: An example alignment between predicted and reference entities from Du et al. (2021a). In past implementations of CEAF-REE, this alignment would receive full credit, rather than being penalized for precision (P = 0.75).

²³ Only CEAF-REEdef requires solving a two-level maximum bipartite matching problem. Since CEAF-REEimpl does not enforce the entity alignment constraint, its alignments will not necessarily be bipartite.
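As a toy illustration of the template-level alignment problem, the sketch below finds an optimal one-to-one template matching by enumeration, using a simple per-slot F1 average as a stand-in for the template similarity function. The template and slot structures here are invented for illustration and do not reflect any released scorer's interfaces; an efficient implementation would replace the enumeration with the Kuhn-Munkres algorithm.

```python
from itertools import permutations

def slot_f1(ref, pred):
    """F1 between two sets of mention strings filling one slot."""
    if not ref and not pred:
        return 1.0
    if not ref or not pred:
        return 0.0
    tp = len(ref & pred)
    p, r = tp / len(pred), tp / len(ref)
    return 2 * p * r / (p + r) if p + r else 0.0

def phi_temp(ref_t, pred_t):
    """Cross-slot average similarity between two templates (dicts: slot -> set)."""
    slots = set(ref_t) | set(pred_t)
    return sum(slot_f1(ref_t.get(s, set()), pred_t.get(s, set())) for s in slots) / len(slots)

def align_templates(refs, preds):
    """Return (ref_idx, pred_idx) pairs maximizing total template similarity."""
    n = max(len(refs), len(preds))
    best, best_pairs = -1.0, []
    for perm in permutations(range(n)):
        pairs = [(i, perm[i]) for i in range(len(refs)) if perm[i] < len(preds)]
        score = sum(phi_temp(refs[i], preds[j]) for i, j in pairs)
        if score > best:
            best, best_pairs = score, pairs
    return best_pairs

# Hypothetical MUC-4-style templates for illustration.
refs = [{"Victim": {"mayor"}}, {"Victim": {"priest"}, "Weapon": {"bomb"}}]
preds = [{"Victim": {"priest"}, "Weapon": {"bomb"}}]
# The single predicted template aligns with the second reference template.
```

The brute-force loop makes the cost of the naive approach visible: it is factorial in the number of templates, which is exactly why Kuhn-Munkres (cubic time) matters once a document has more than a handful of templates of the same type.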

D.1.3 Coreference and CEAF-RME
As CEAF was designed for coreference resolution, it is unsurprising that coreference considerations introduce a further wrinkle for CEAF-REE. None of the three models described in this work (including ITERX) performs entity coreference resolution. This clearly presents a problem because CEAF-REEdef is an entity-level metric. One way to score these models is simply to treat each extracted mention as a singleton entity and use CEAF-REEdef exactly as defined, and we report these scores in the main text for MUC-4. However, reporting only CEAF-REEdef would be undesirable for several reasons:

• It would render our results incomparable to past work, which reports only CEAF-REEimpl.
• It would put our work at odds with the overwhelming majority of the template extraction literature, where evaluation criteria focus on string matching between predicted and reference mentions. (The original MUC-4 evaluation only required systems to extract a single representative mention for each entity - not to identify all such mentions.)

• The constraint that at most one predicted entity be aligned to a given reference entity would yield punishingly low scores for systems that are highly effective at extracting relevant spans, but that simply do not perform the additional step of coreference.
For these reasons, we disfavor a template extraction metric that requires template extraction systems to do coreference. These considerations motivate our introduction of CEAF-RME (role-filler mention extraction), which makes a minimal modification to CEAF-REEdef to address the first two concerns above. CEAF-RME treats system-predicted mentions as singleton entities, but deliberately relaxes the alignment constraint, potentially allowing multiple predicted singletons to map to the same reference entity, effectively pushing the burden of coreference into the metric. We believe CEAF-RME is consistent with what template extraction research has in fact historically cared about (identifying mentions that fill some slot) while correcting implementation problems with CEAF-REE that produce misleading results.
The micro-average CEAF-RMEφ3 results that we report on MUC-4 in the main body of the paper are micro-average CEAF-RME scores under an optimal template alignment (using CEAF-RME as the template similarity function), which is efficiently obtained using the Kuhn-Munkres algorithm.
We additionally include a version of CEAF-RME that uses φ⊆ (CEAF-RMEφ⊆) for parallel comparison against CEAF-REEimpl. Recall that CEAF-REEimpl is essentially CEAF-RMEφ⊆ with the template type included as an additional slot. We reiterate that CEAF-RMEφ3 is the more appropriate metric, since it can award partial credit for predicted entities whose mentions overlap imperfectly with those in the reference, whereas CEAF-RMEφ⊆ assigns zero credit in such cases.
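A small illustration of this difference, with invented mention strings: φ3 awards credit for each overlapping mention, while φ⊆ is all-or-nothing.

```python
def phi3(ref, pred):
    """Count of mentions shared between reference and predicted entities."""
    return len(ref & pred)

def phi_subset(ref, pred):
    """Full credit iff every predicted mention appears in the reference."""
    return 1.0 if pred <= ref else 0.0

# Hypothetical entities: one correct mention plus one spurious mention.
ref = {"the guerrillas", "they", "FMLN members"}
pred = {"the guerrillas", "soldiers"}

phi3(ref, pred)        # 1: partial credit for the overlapping mention
phi_subset(ref, pred)  # 0.0: the spurious mention zeroes out the entity
```

Under φ⊆, a system extracting nine correct mentions and one spurious one for an entity scores exactly as badly as a system extracting nothing at all, which is the behavior CEAF-RMEφ3 is meant to correct.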

D.2 SCIREX
We use the same CEAF-RMEφ3 implementations for scoring SCIREX as we use for MUC-4. Full evaluation using the original SCIREX scoring script requires systems to perform coreference resolution, which makes it, like CEAF-REEdef, inappropriate for evaluating the systems presented in this work, none of which features a coreference module. The CEAF-RMEφ3 and CEAF-RMEφ⊆ results presented in the main text together give a clearer picture of these models' ability to extract relevant mentions (short of clustering them) than would a coreference-based metric. We simply treat the SCIREX 4-tuples as 4-slot templates, following Huang et al. (2021).

D.3 BETTER Granular
Evaluation for the BETTER Granular task bears some core similarity to CEAF-REEdef in that it relies on obtaining the alignment between system and reference templates that maximizes some similarity function that decomposes over slot fillers. And just as with (our corrected implementation of) CEAF-REEdef, this is achieved via the Kuhn-Munkres algorithm. However, Granular scoring differs from CEAF-REEdef in four key respects. First, the overall system score - referred to as the combined score - incorporates both a slot-level F1 score and a template-level F1 score:

    CombinedScore = TemplateF1 × SlotF1

Only exact matches between system and reference template types are awarded credit. It is worth noting that because this score does not decompose over template pairs, it cannot be optimized directly using Kuhn-Munkres. In practice, what is optimized is response gain - the number of correct slot fillers minus the number of incorrect ones - which provably yields alignments that optimize the combined score within a probabilistic error bound.
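A hedged sketch of these two quantities, taking raw true/false-positive counts as inputs; the function names and interfaces here are illustrative and do not reflect the official Granular scorer's API.

```python
def f1(tp, fp, fn):
    """Standard F1 from true-positive, false-positive, false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def combined_score(slot_tp, slot_fp, slot_fn, tmpl_tp, tmpl_fp, tmpl_fn):
    """Combined score: template-level F1 multiplied by slot-level F1."""
    return f1(slot_tp, slot_fp, slot_fn) * f1(tmpl_tp, tmpl_fp, tmpl_fn)

def response_gain(correct_fillers, incorrect_fillers):
    # The quantity actually optimized during template alignment, since the
    # combined score itself does not decompose over template pairs.
    return correct_fillers - incorrect_fillers
```

Because the combined score is a product of two F1s, errors at either level compound: a system that halves its template F1 halves its combined score even with perfect slot extraction, which response gain approximates with a decomposable objective.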
The remaining three key differences relate to the calculation of the slot-level F1. For one, Granular slots are not exclusively entity-valued, but may also be event-, (mixed) event-and-entity-, boolean-, and (categorical) string-valued, and different similarity functions must be employed in these different cases. For another, where CEAF-REE defines mentions by their string representation, the Granular score defines mentions based on document offsets. Finally, Granular also requires extraction of temporal and irrealis information for slots, and this in turn impacts the SlotF1 score.
Borrowing terminology from the discussion of MUC-4 above, we describe below how φ(R, S) is calculated for some generic reference slot filler R and system-predicted slot filler S for slots of different types.
Boolean and Categorical Values For boolean- and categorical-string-valued slots (i.e., slots taking on one of a predefined set of values), φ(R, S) = 1 if there is an exact match between the system and reference fillers, and is 0 otherwise.

Entities Unique among the three tasks discussed in this paper, Granular features an explicit preference for informative arguments in its scoring structure. In particular, (proper) name mentions of an entity are worth more than nominal mentions, which in turn are worth more than pronominal ones.²⁴ Thus, if Barack Obama were represented by the reference entity {Obama, the former President, he}, full credit would be awarded for returning only the mention Obama, less credit for the former President, and still less for he. Exact point values depend on the mentions present in the reference entity:

• Correct name mentions always receive full credit (φ(R, S) = 1).

• Correct nominal mentions receive half credit (φ(R, S) = 0.5) if the reference entity additionally contains a name mention, and receive full credit otherwise.
• Correct pronominal mentions receive quarter credit (φ(R, S) = 0.25) if the reference entity additionally contains both a name and a nominal mention, and half credit if only a nominal mention is featured. (Note that entities will never feature only pronominal mentions.)

Events Some Granular slots require events as fillers. Like entities, events are represented as sets of mentions (event anchors or triggers). Unlike entities, there is no informativity hierarchy for events. Furthermore, while event coreference is not a part of the Granular task, annotations for event coreference are nonetheless provided for scoring purposes: φ(R, S) = 1.0 iff S contains only mentions belonging to events in the set of gold coreferent events R, and is 0 otherwise, akin to φ⊆.
Mixed Entities and Events Some slots may take a mix of events and entities as fillers. Since systems must indicate whether predicted mention clusters are entity- or event-denoting, the same similarity criteria for events and entities described above are used to compute φ for events and entities that fill these slots.
Temporal and Irrealis Information One of the features of Granular that makes it decidedly more difficult than either MUC-4 or SCIREX is the requirement to extract information relating to the time and irrealis status of an event when such information is available in the document. This information is encapsulated in special time-attachments and irrealis fields associated with each slot-filling entity or event. The former is given as a set of temporal expressions that describe the time at or during which the filler satisfied the role denoted by the slot (e.g., when individuals filling the tested-count slot in the Epidemic template were tested for the disease). The latter is given as one of a set of strings that describe whether or how the filler satisfied the role denoted by the slot: counterfactual, hypothetical, future, unconfirmed, unspecified, and non-occurrence.
time-attachments and irrealis are each worth 0.25 points, where exact matches are required for full credit on either, and where zero points are awarded otherwise. For slots for which time-attachments and irrealis are required, the value of φ appropriate to the filler type is scaled by 0.5, such that the maximum overall score φ(R, S) for a given filler - factoring in time-attachments, irrealis, and event or entity similarity - is 1.
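Putting the pieces together, here is a minimal sketch of how a single slot filler's score might be assembled under the rules above. Function and argument names are illustrative, and cases the text leaves unspecified (e.g., a pronoun whose reference entity has a name but no nominal mention) default to half credit here.

```python
def entity_phi(mention_kind, reference_kinds):
    """Informativity-weighted credit for a single correct entity mention."""
    if mention_kind == "name":
        return 1.0
    if mention_kind == "nominal":
        # Half credit when a more informative (name) mention was available.
        return 0.5 if "name" in reference_kinds else 1.0
    if mention_kind == "pronoun":
        if "name" in reference_kinds and "nominal" in reference_kinds:
            return 0.25
        # Text specifies half credit when only a nominal is featured;
        # other unspecified cases default to half credit in this sketch.
        return 0.5
    return 0.0

def slot_filler_score(phi, time_correct, irrealis_correct):
    """phi scaled by 0.5, plus 0.25 each for exact time-attachments
    and irrealis matches, for a maximum of 1.0."""
    return 0.5 * phi + 0.25 * time_correct + 0.25 * irrealis_correct
```

For example, a correct nominal mention of an entity whose reference cluster also contains a name (φ = 0.5), with correct time-attachments but a wrong irrealis label, would score 0.5 × 0.5 + 0.25 = 0.5 under this sketch.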

Figure 1: An example of multi-template extraction on a document (an NLP paper; Lei et al. (2018)) from the SCIREX dataset. An agent reads the entire paper and iteratively generates templates, each consisting of slots for Task, Method, Dataset, and Metric.

Figure 5: Performance changes on CEAF-RMEφ3 with respect to α, where higher α means a higher probability of rolling out the agent's policy for state updates.
In early September, illegal border crossings by two people infected with COVID-19 triggered a week-long lockdown of another Yunnan border city, Ruili, and prompted at least eight prefectures and 25 counties to enter "wartime status." Following the incident, Yunnan vowed to strengthen border patrols.

Table 1: Summary statistics of the datasets. † indicates slot types that take non-span values as fillers.

Table 3: Results on MUC-4. † Note that scores under CEAF-REEdef are computed as if every mention forms a singleton entity. We make this assumption since neither prior work nor our model performs coreference resolution. Thus, CEAF-REEdef is a somewhat inappropriate metric for these systems, but is included for completeness.

Table 4: Results on the BETTER Granular dataset. The combined score is Template F1 × Slot F1.

Table 5. The results for CEAF-RMEφ⊆ show a trade-off in precision and recall corresponding to the exploitation-exploration trade-off induced by τ, with the higher τ (more exploration) of XENT and UNIFORM yielding higher recall. The trend for CEAF-RMEφ3 is similar, though less pronounced.

Table 5: Results with different choices of temperature.

Table 8: Hyperparameters and other reproducibility information for Granular.
The CEAF that uses φ4 is sensibly denoted CEAFφ4 in coreference resolution. CEAF-REEimpl differs from CEAFφ4 in the following ways:

• All entities are aligned within role, conditioned on matching template type. E.g., only predicted entities for the Victim slot in Bombing templates would be considered valid candidates for alignment with entities filling the Victim slot in reference Bombing templates.