A Unified View of Evaluation Metrics for Structured Prediction

We present a conceptual framework that unifies a variety of evaluation metrics for different structured prediction tasks (e.g. event and relation extraction, syntactic and semantic parsing). Our framework requires representing the outputs of these tasks as objects of certain data types, and derives metrics through matching of common substructures, possibly followed by normalization. We demonstrate how commonly used metrics for a number of tasks can be succinctly expressed by this framework, and show that new metrics can be naturally derived in a bottom-up way based on an output structure. We release a library that supports this derivation process for creating new metrics. Finally, we consider how specific characteristics of tasks motivate metric design decisions, and suggest possible modifications to existing metrics in line with those motivations.


Introduction
A wide range of tasks in NLP can be considered as forms of structured prediction. Syntactic and semantic parsing produces a tree or graph based on text. Information extraction (IE) aims to produce structured representations of data extracted from unstructured sources, often in the form of relations that may be used to populate a database (Grishman, 2019). Such relations may be typed or untyped, may have different numbers of arguments, and may relate objects of different kinds (e.g. mentions, entities, events, or even images).
The structural complexity of these representations varies considerably between tasks. On the simpler end, problems like binary relation extraction require identifying relationships between pairs of entity mentions. On the more complex end are

Figure 1: Our generic framework, with the CEAF_φ4 metric (Luo, 2005) for coreference resolution as an example. Here the task output is a set of entities, where each entity is a set of coreferent mentions identified in the document. Computing CEAF_φ4 thus amounts to calculating the matching similarity between the predicted and reference sets of entities.
tasks like template extraction, which requires populating various types of slots with sets of mentions, categorical values, or even whole event structures, and AMR parsing (Langkilde and Knight, 1998; Banarescu et al., 2013), which requires generating a DAG of entities and values representing their semantic relations.
A wide array of evaluation metrics have been proposed across this spectrum of tasks. For simpler ones, researchers have generally converged on a standardized set of metrics (e.g. trigger and argument F1 for event extraction). However, for more complex tasks like template extraction, researchers have often proposed bespoke metrics tailored to the problem at hand, complicating comparison with prior work on similar problems (Chinchor, 1991, 1992; Du et al., 2021b; Chen et al., 2023).
Given the common goal of predicting structured objects, our aim is to present a similarly unified, high-level picture of evaluation. We observe that a variety of metrics can be viewed as computing scores over a matching between substructures of predicted and reference objects, where this score decomposes as a normalized sum over matched pairs. The process of computing metrics can thus be abstracted to a framework as shown in Figure 1.
On the one hand, this observation drives a contribution to structured prediction theory, clarifying the relationships among numerous metrics proposed over the years by identifying their core components. On the other, it drives a contribution to NLP practice, offering a bottom-up process for designing new metrics based on an output structure. Our contributions can be summarized as follows:
• We present a unified framework for expressing structured prediction metrics;
• We demonstrate how to derive various classic metrics using this framework, given a specification of a task's output structure;
• We consider how different problem features may recommend particular design decisions within the framework, often different decisions from those realized by existing metrics;
• We release a library that enables bottom-up creation of new metrics based on the predefined output data structure of a given task.
Throughout, we emphasize both how evaluation of substructures (e.g. mentions) composes in the evaluation of superstructures (e.g. relations, templates), as well as the different notions of similarity employed for different structures. Our discussion starts with simpler tasks and proceeds to more complex ones, interleaving examples throughout our exposition.

Records and Sets
We begin by focusing on records with non-nested, fixed-named fields or slots. Specifically, for predicted and reference objects P, R ∈ X of record type X, we induce a similarity function φ over X.

A similarity over X is a function φ : X × X → [0, 1] satisfying φ(x, x) ≥ φ(x, y) for all x, y ∈ X: an object is at least as similar to itself as to any other. A relaxed version is an unnormalized similarity, where φ : X × X → [0, +∞).
Discrete Similarity Equality is a trivial but important notion of similarity, which can be expressed by the Kronecker delta or the Iverson bracket as δ(x, y) = ⟦x = y⟧.

Product Similarity for Records Given two similarities φX and φY over sets X and Y, we can define a product similarity φX × φY for tuples of X × Y: (φX × φY)((x1, y1), (x2, y2)) = φX(x1, x2) · φY(y1, y2). Clearly, the product similarity of two similarities is also a similarity. This generalizes to n-tuples, or record/class types (product types in the programming-languages literature), if a similarity function is defined for each field in the record.
Set Intersection and Normalization Sets are commonly compared with Jaccard similarity, or F1 score. The core of such comparison is the overlap between two sets P, R ⊆ X, namely Σδ(P, R) = |P ∩ R|, if we consider the elements of X as discrete (using δ as their similarity). This overlap score Σδ is an unnormalized similarity under our definition.
There are multiple ways to normalize this Σ score so that the result is a (proper) similarity. We consider a few common choices: precision (Eq. 4), recall (Eq. 5), and F1 (or Dice score; Eq. 6):

P = Σδ(P, R) / |P|,  R = Σδ(P, R) / |R|,  F = 2 Σδ(P, R) / (|P| + |R|)

and the Jaccard similarity J = Σδ(P, R) / |P ∪ R|. Note that all these normalizers can be expressed solely with the overlap scoring function Σ. Let N ∈ {P, R, F, J} be a normalizer over objects of type X. Hence we arrive at a normalized similarity over sets of X: N[δ](P, R) = N(Σδ(P, R)).
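For discrete elements, these normalizers can be sketched directly in Python (a minimal illustration of the framework, not the released library):

```python
# Minimal sketch: the overlap score Sigma_delta(P, R) = |P ∩ R| and the
# four normalizers P, R, F, J for sets of discrete (hashable) elements.
def overlap(pred: set, ref: set) -> int:
    return len(pred & ref)

def precision(pred: set, ref: set) -> float:
    return overlap(pred, ref) / len(pred) if pred else 0.0

def recall(pred: set, ref: set) -> float:
    return overlap(pred, ref) / len(ref) if ref else 0.0

def f1(pred: set, ref: set) -> float:
    denom = len(pred) + len(ref)
    return 2 * overlap(pred, ref) / denom if denom else 0.0

def jaccard(pred: set, ref: set) -> float:
    union = len(pred | ref)
    return overlap(pred, ref) / union if union else 0.0
```

Each normalizer is a function of the same unnormalized overlap, mirroring the N(Σ) formulation above.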
We have created the basic tools needed to derive metrics for many simple tasks. Next, we illustrate how to do so for two common NLP tasks.

Binary Relation Extraction
Binary relation extraction (RE) focuses on typed relations (e.g. IS-CAPITAL-OF) with two arguments, a subject and an object. Traditionally, both the subject and the object are text spans (i.e. mentions).
Given a text passage, the objective is to output a set of binary relations.
To ground our discussion of concrete structured prediction tasks, we specify relevant output data structure(s) in a Python dataclass-like syntax. We will now derive a metric for binary RE bottom-up. A standard similarity for mentions is exact offset match, where two mentions are considered the same if and only if both the left and right boundaries match. This is an instance of product similarity: δMention = δleft × δright. On the outer level, relation instances are considered correct only when all of their components are correct: δBinaryRelation = δtype × δMention × δMention. Finally, precision, recall, and F1 score are the most common metrics used to evaluate predicted relations. Practically, this requires finding the intersection between predicted and reference relations.
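The binary RE output structure and the derived metric can be sketched as follows (the field names are our own assumption for illustration, not necessarily those of the released library):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    left: int    # left boundary offset
    right: int   # right boundary offset

@dataclass(frozen=True)
class BinaryRelation:
    rel_type: str
    subj: Mention
    obj: Mention

# With frozen (hashable) dataclasses, the exact-match product similarity
# delta_type x delta_Mention x delta_Mention reduces to tuple equality,
# so relation F1 is just F-normalized set overlap.
def relation_f1(pred: set, ref: set) -> float:
    denom = len(pred) + len(ref)
    return 2 * len(pred & ref) / denom if denom else 0.0
```

Freezing the dataclasses is what lets the product of discrete similarities collapse into plain set intersection.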

Dependency Parsing
Our next example is dependency parsing, where dependencies are relations between a governor and its dependent: each output edge pairs a governor token, a dependent token, and (optionally) a dependency label. Dependency parsing is evaluated using unlabeled (UAS) and labeled (LAS) attachment scores (Buchholz and Marsi, 2006), which are simply F1 scores over (unlabeled or labeled) dependency edges.

Matching of Sets

In the previous section, we derived Σδ, a similarity for sets whose elements are discrete. However, elements of sets can be equipped with their own similarity. For example, in coreference resolution, the output of a system is a set of entities, where each entity is in turn a set of mentions that may partially overlap. We develop the idea of a matching of sets to express these cases. We derive a similarity φP(X) over sets of elements of X (i.e. elements of the power set P(X)) using bipartite graphs. Assuming that elements of X are compared with a custom similarity φX, given two sets P, R ⊆ X, we can construct a bipartite similarity graph G = (P, R, E) between P and R, where E ⊆ P × R is the set of edges, and the weight on each edge (p, r) corresponds to the value of the similarity φX(p, r) between nodes p and r.
We then determine a matching M⋄ ⊆ E on this bipartite graph G. An unnormalized matching score between P and R is defined to be the maximum sum of weights of all edges in a matching, subject to some constraint:

Σ⋄[φ](P, R) = max over matchings M⋄ of Σ(p,r)∈M⋄ φ(p, r)

where ⋄ ∈ {↔, →, ←, ∼} is the matching constraint. Specifically we have the following:
• 1:1 (↔): Each element of P can be matched to at most one element of R, and vice versa. This corresponds to the unbalanced assignment problem, and can be solved efficiently with the Hungarian algorithm (Kuhn, 1955; Munkres, 1957). We denote this Σ↔, since the matching is a (partial) bijection between P and R.
• N:1 (→) / 1:N (←): Each element of P can be matched to at most one element of R, but each element of R can be matched to multiple elements of P. We denote this Σ→, since the matching is a (partial) function from P to R. A flipped version Σ← obviously follows.
• N:N (∼): Every element of P may be matched with multiple elements of R, and vice versa, without constraints. We denote this Σ∼, as the matching may be any relation between P and R.
Note that the overlap score developed in §2 is a special case of the 1:1 matching score here, since Σδ(P, R) = Σ↔[δ](P, R). Thus we arrive at a generalization of our original overlap score. We denote the N-normalized matching score N⋄[φ](P, R) = N(Σ⋄[φ](P, R)). Such (normalized) matching scores are sometimes kernels, which have additional nice properties. For discussion, see Appendix B.
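A brute-force sketch of the 1:1 matching score Σ↔[φ] and its F-normalization (illustrative only; as noted above, real implementations use the Hungarian algorithm):

```python
from itertools import permutations

# Brute-force 1:1 matching score: enumerate injections from the smaller
# set into the larger one and keep the best total similarity. Exponential,
# so suitable only for tiny examples.
def matching_score(pred: list, ref: list, phi) -> float:
    if len(pred) > len(ref):
        pred, ref = ref, pred
        phi = lambda a, b, _phi=phi: _phi(b, a)  # keep argument order
    best = 0.0
    for perm in permutations(ref, len(pred)):
        best = max(best, sum(phi(p, r) for p, r in zip(pred, perm)))
    return best

# F-normalization of an unnormalized matching score.
def f_normalize(score: float, pred: list, ref: list) -> float:
    denom = len(pred) + len(ref)
    return 2 * score / denom if denom else 0.0
```

With a discrete φ (equality), this reduces exactly to the set-overlap F1 of §2, illustrating the special-case relationship noted above.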
With the notion of matching of sets, we next consider metrics for several more complex tasks.

Event Extraction
Our first such task is event extraction. We imagine that events are represented using data structures in which an event pairs a trigger Mention with an event type and a set of arguments, each pairing a Mention with a role. The canonical metrics for event extraction are labeled precision, recall, and F1 score for both event triggers and arguments (Li et al., 2013, i.a.). An event trigger is considered correct iff both the event type and the trigger mention offsets exactly match those of the reference (i.e. δTrigger = δmention × δtype). An event argument is considered correct iff the argument mention offsets and role exactly match the reference (i.e. δArgument = δmention × δrole) and the associated trigger is correct. Given these, we can express trigger and argument F1 scores as F-normalized matching scores (e.g. TrigF1 = F↔[δTrigger]). Note that the definition of ArgF1 suggests that the metric can be viewed as a nested matching, in which we first compute an unnormalized optimal argument matching score (Σ↔args, i.e., a raw count of matched arguments) based only on role type and argument boundaries, and then use this score to identify the optimal matching and score conditioned on the trigger. As with F↔relations in §2.1, δtrig renders F↔events trivial to compute, as an aligned event pair receives no credit if the triggers do not match. However, this nested matching view articulates a key aspect of our framework, evidenced by other metrics discussed in this section: namely, that evaluation of complex structures depends on an optimal matching of their substructures.

Coreference Resolution
Event extraction deals only with trigger and argument mentions, but IE also deals with coreference resolution, where systems predict a set of entities, each of which is in turn a set of coreferent mentions. A variety of metrics have been proposed for coreference resolution. Commonly used are CEAF (Luo, 2005), MUC (Vilain et al., 1995), and B3 (Bagga and Baldwin, 1998).
CEAF We start with CEAF since it explicitly evaluates coreference as sets of mentions. CEAF computes entity precision, recall, and F1 by finding a partial bijection between predicted and reference entities that maximizes an entity similarity. Luo (2005) considers several functions, denoted φ{1,2,3,4}, ultimately preferring φ3 and φ4:

φ3(E, E′) = |E ∩ E′|,  φ4(E, E′) = 2|E ∩ E′| / (|E| + |E′|)

Both correspond to intuitive notions of entity similarity, with φ3 simply counting the number of mentions a pair of entities have in common, while φ4 F-normalizes this value. In contrast to the identity similarities (δ's) typically used for mentions, the similarity used in coreference resolution is gradient: entities can be more or less correct based on their constituent mentions. Coreference resolution researchers have often used φ4 (Moosavi and Strube, 2016; Joshi et al., 2020, i.a.), where CEAF_φ4 is just the F-normalized total score under a φ4-optimal entity matching: CEAF_φ4 = F↔[φ4]. CEAF offers a nice illustration of the expressiveness of our framework, computing a matching score between sets (of entities), where the internal metric over elements (entities) is also a matching score over sets (of mentions).
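As an illustrative sketch (our own code, not the released library), CEAF_φ4 can be computed by brute-force search over entity bijections, with entities as frozensets of mention ids:

```python
from itertools import permutations

def phi3(e1: frozenset, e2: frozenset) -> float:
    # number of shared mentions
    return float(len(e1 & e2))

def phi4(e1: frozenset, e2: frozenset) -> float:
    # F-normalized shared-mention count
    denom = len(e1) + len(e2)
    return 2 * len(e1 & e2) / denom if denom else 0.0

def ceaf_phi4_f1(pred: list, ref: list) -> float:
    # Brute-force phi4-optimal 1:1 entity matching, then F-normalize by
    # entity counts (the CEAF_phi4 convention; CEAF_phi3 normalizes by
    # mention counts instead).
    small, large = (pred, ref) if len(pred) <= len(ref) else (ref, pred)
    best = 0.0
    for perm in permutations(large, len(small)):
        best = max(best, sum(phi4(a, b) for a, b in zip(small, perm)))
    denom = len(pred) + len(ref)
    return 2 * best / denom if denom else 0.0
```

The nesting is visible directly: the outer optimization matches entities, while φ4 is itself a normalized overlap over mention sets.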
MUC The main step of MUC scoring is to create (separate) partitions of the predicted and reference entities (Pradhan et al., 2014). Assume that the predicted and reference entity sets are P and R, and that the partition of each reference entity R ∈ R created by intersecting it with predicted entities P is PartP(R), i.e. the union of the parts in PartP(R) is R itself. MUC recall is computed as

RMUC = ΣR∈R (|R| − |PartP(R)|) / ΣR∈R (|R| − 1)

We can define an unnormalized similarity (the number of shared links that connect mentions to form a coreference chain) between entities; using this, we see that MUC recall is a normalized link-level matching score. Precision can be defined similarly by switching the roles of predicted and reference entities.

B3 The recall of B3 assigns to each reference mention a score equal to the ratio of the number of correct mentions in the predicted entity containing the reference mention to the size of the reference entity to which that mention belongs (Pradhan et al., 2014). Under our new data structure, this ratio is just R↔entity[δ]. Precision is computed similarly by switching the roles of predicted and reference entities. Thus, B3 can be succinctly expressed in terms of these per-mention normalized scores. Our framework thus captures all three of the standard coreference resolution metrics.
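A minimal B3 sketch, under the assumption that entities are disjoint frozensets of hashable mention ids (our own illustration, not the released library):

```python
# B^3 (Bagga and Baldwin, 1998): per-mention credit rather than a
# per-entity matching. Entities are assumed disjoint within each side.
def _entity_of(mention, entities):
    # the (unique) entity containing a mention, or the empty set
    return next((e for e in entities if mention in e), frozenset())

def b3_recall(pred: list, ref: list) -> float:
    mentions = [m for e in ref for m in e]
    if not mentions:
        return 0.0
    total = 0.0
    for m in mentions:
        r, p = _entity_of(m, ref), _entity_of(m, pred)
        total += len(r & p) / len(r)  # correct mentions / reference entity size
    return total / len(mentions)

def b3_precision(pred: list, ref: list) -> float:
    # precision is recall with the roles of the two sides swapped
    return b3_recall(ref, pred)

def b3_f1(pred: list, ref: list) -> float:
    p, r = b3_precision(pred, ref), b3_recall(pred, ref)
    return 2 * p * r / (p + r) if p + r else 0.0
```

The role-swap for precision mirrors the symmetric definition given in the text.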

Role-Filler Entity Extraction & N-ary Relation Extraction

Relation and event extraction generally take mentions as arguments, but some tasks take entities as arguments (fillers) of roles (slots) in some relation. Tasks of this form have been instantiated in various ways in prior work, which we discuss below.
Role-Filler Entity Extraction One such instance is role-filler entity extraction (REE), a subtask of template extraction in which one must populate the subset of slots (roles) of a single identified template that takes entities as fillers (Du et al., 2021a; Huang et al., 2021, i.a.). Since the task deals with a single template, the output is a single NAryRelation. Du et al. (2021a) introduced the CEAF-REE metric for REE, which differs from CEAF only in requiring matched entities to share a role type and in using a different φ for entities: φ⊆(P, R) = ⟦∅ ≠ P ⊆ R⟧, where P and R are predicted and reference entities (sets of mentions). CEAF-REE is then the F-normalized matching score under φ⊆. Whereas φ3 and φ4 award partial credit to predicted entities that contain at least one correct mention, φ⊆ is much stricter, awarding no credit in cases where even one mention is incorrect, while simultaneously awarding full credit to any non-empty subset of the reference entity. This may make sense in some settings, but in most, it is unduly harsh (see §6). Responding to this observation, Chen et al. (2023) suggest a pair of alternatives to CEAF-REE, CEAF-RME_φ⊆ and CEAF-RME_φ3, that treat predicted mentions as singleton entities and relax the two-sided matching constraints to one-sided.

N-ary Relation Extraction N-ary RE generalizes binary RE to relations among N entity or mention arguments. Here, we will assume we are dealing with entities; the case of mention arguments is comparatively straightforward.
Often, work on N-ary RE assumes gold entities are given to the model as input, along with a set of candidate relations, and results are reported as relation type classification accuracy or F1. This is true of much work on a number of recent, popular N-ary RE benchmarks, including SCIERC (Luan et al., 2018), DOCRED (Yao et al., 2019), and the dataset released by Peng et al. (2017).
In a more comprehensive task setting, entities or mentions must also be predicted, along with the relations. We highlight the SCIREX benchmark (Jain et al., 2020), an extension of SCIERC, as an example of evaluation in this setting. SCIREX requires extraction of quaternary (dataset, method, task, metric) relations over entities extracted from ML papers. We formulate the SCIREX metric in our framework below. For this task, mentions are represented as index ranges. A predicted mention is considered to match a reference mention iff their Jaccard similarity (considering each as a bag of integer offsets) exceeds 0.5. Jain et al. propose computing a role-filler entity matching based on mention and role matching: a predicted entity Ep and a reference entity Er will be matched iff more than half of Ep's mentions appear in Er, and their roles match. Given this matching, predicted 4-ary relations are then evaluated against reference ones using F↔[δNAryRelation], where F↔args[φRFE] = 1 means that all four role-filler entities must match under φRFE to receive credit. F↔relations[δNAryRelation] further illustrates how matching superstructures depends on matching substructures, with optimal matching of relations depending on optimal matching of entities, which in turn depends on optimal matching of mentions.
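A small sketch of the SCIREX-style mention test, assuming spans are (start, end) pairs with exclusive ends (our assumption; the benchmark's exact span convention may differ):

```python
# Jaccard similarity of two index-range mentions, treating each span as
# a set of integer offsets.
def span_jaccard(a: tuple, b: tuple) -> float:
    sa, sb = set(range(*a)), set(range(*b))
    union = len(sa | sb)
    return len(sa & sb) / union if union else 0.0

# SCIREX-style test: a predicted span counts as matching a reference
# span iff their offset-set Jaccard similarity exceeds 0.5.
def mentions_match(pred: tuple, ref: tuple) -> bool:
    return span_jaccard(pred, ref) > 0.5
```

Replacing exact offset equality with this thresholded Jaccard test is exactly the kind of swap of φMention discussed in §6.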

Template Extraction
We now turn to template extraction, which arguably features the most complex outputs of any IE task. The output is a set of templates, each of some type T, where a distinct similarity function φT may be needed for each T. Constraints on template matchings are traditionally two-sided. Below, we consider the metrics employed for the classic MUC-4 task. In Appendix D, we also consider the more recent BETTER Granular benchmark.
The MUC-4 dataset (MUC, 1992; Sundheim, 1992) features 6 template types, which concern varieties of terrorist act (e.g. bombing, kidnapping) and which all contain the same slots. Some are "string-fill" slots, which take entity mentions as fillers, and others are "set-fill" slots, which take a categorical value. Although the official MUC-4 evaluation reported several metrics (see Chinchor (1992) for details), the overall score was F1 over slot fillers, where φT ∈ {φset, φstr} is the type-appropriate filler similarity function. Both φset and φstr are somewhat complex and, similar to φ3 and φ4, allow for partial credit. For some of the set-fill slots, the possible values are hierarchical; i.e., some values are more specific, and thus considered more accurate, than others. Suppose a set-fill slot s takes values from a set V, and we write v <: w to denote that v is a subtype of w, i.e. v is a descendant of w according to some hierarchy over V. MUC-4 defines φset to award full credit for an exact match and partial credit when the predicted value is a correct but more general supertype of the reference value. This choice of φset is notable for suggesting a means of handling hierarchical sub-ontologies of the template ontology itself; such ontologies have seen considerable interest in many recent IE benchmarks, including RAMS (Ebner et al., 2020), WikiEvents (Li et al., 2021), and BETTER (Mckinnon and Rubino, 2022). We return to this in §6.
String-fill slots were evaluated based on maximum lexical overlap between a predicted mention and all mentions in a reference entity. We provide more detailed discussion in Appendix C.

Sets with Latent Variables
Next, we consider Abstract Meaning Representation (AMR) parsing (Langkilde and Knight, 1998; Banarescu et al., 2013), which involves outputs with latent variables. AMR describes the semantics of a sentence as a rooted, directed graph represented by a set of neo-Davidsonian triples, each with a subject, an object, and a relation. Subjects are variables and objects can be variables or concepts (e.g. frames from PropBank (Palmer et al., 2005)). Following the metrics for relation extraction, a prima facie appealing metric for AMR graphs would be just like Eq. 10 for binary RE: an F-normalized overlap of triples. However, this poses a problem, as we cannot know whether two variables x and y refer to the same object: instance(x, boy) and instance(y, boy) could both match a single reference triple if there is no constraint enforcing that x ≠ y. Thus, it is not immediately clear what the similarity function for variables (φVar) should be.
The commonly used SMATCH metric solves this problem. SMATCH is defined to be the maximum F-score obtainable via a one-to-one matching of variables between two AMRs (Cai and Knight, 2013). That is, it looks for an optimal partial bijection M↔Var ⊆ VP × VR between the variables of the predicted and reference AMRs (VP and VR, respectively). Given M↔Var, we can define a triple similarity conditioned on the variables in its arguments being matched. Hence SMATCH is given by the F-normalized triple matching score under the optimal variable matching. We generalize the spirit of SMATCH to any set of X with latent variables yet to be matched. The matching score of P, R with latent variables VP, VR is defined to be the maximum, over one-to-one matchings M↔Var between the variable sets VP and VR, of the matching score Σ⋄ between objects in P and R under constraint ⋄, conditioned on M↔Var.
Solving this constrained optimization problem requires finding M↔Var, which can be done via an integer linear programming (ILP) solver (Cai and Knight, 2013). See Appendix A for more details.
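To make the variable-matching idea concrete, here is a toy brute-force score over (relation, subject, object) triples. Real SMATCH implementations use hill-climbing or ILP, and this sketch assumes the predicted AMR has no more variables than the reference:

```python
from itertools import permutations

# Toy brute-force SMATCH-style score: try every injection of predicted
# variables into reference variables, rename the predicted triples, and
# count exact triple matches; report the F-normalized best count.
def smatch_f1(pred: set, ref: set, pred_vars: list, ref_vars: list) -> float:
    best = 0
    for perm in permutations(ref_vars, len(pred_vars)):
        rename = dict(zip(pred_vars, perm))
        mapped = {(rel, rename.get(s, s), rename.get(o, o))
                  for rel, s, o in pred}  # concepts pass through unchanged
        best = max(best, len(mapped & set(ref)))
    denom = len(pred) + len(ref)
    return 2 * best / denom if denom else 0.0
```

The renaming step is exactly the latent-variable matching M↔Var; everything else is the familiar F-normalized triple overlap.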

Matching of Other Structures
In the past few sections we developed tools to obtain matching of sets.We can extend this to match more complex structures such as sequences, DAGs, and arbitrary directed graphs.
Recall the matching score in Eq. 13: we computed a sum of similarities based on matched pairs. In the matching of structures, the matching should preserve the structure of the object being matched.
Elements of a sequence form a total order where earlier elements precede later elements. Given two sequences x, y whose elements are of type X, each is equipped with a total order: (x, ⪯x), (y, ⪯y).
To compute the matching score of two sequences, we seek a maximum monotonic matching between x and y, i.e. one that preserves the total order. For example, the matching score between (1, 2, 3, 4, 5) and (1, 3, 5, 7, 9) under discrete similarity is 3, since 1, 3, 5 are monotonically matched. The sequence matching problem given by Eq. 37 is a weighted longest common subsequence (LCS) problem, and thus can be solved with dynamic programming. We can further generalize this matching score to DAGs and graphs by noting that the total order ⪯ on sequence elements relaxes to a partial order in DAGs and a preorder in arbitrary directed graphs. The constrained optimization problem in Eq. 37 can then be solved via ILP; see Appendix A.
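The weighted-LCS dynamic program can be sketched as follows (a minimal illustration; each matched pair contributes φ(x, y) instead of 1):

```python
# Monotonic (order-preserving) matching score for two sequences via a
# weighted longest-common-subsequence recurrence.
def sequence_matching_score(xs: list, ys: list, phi) -> float:
    m, n = len(xs), len(ys)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(dp[i - 1][j],                       # skip xs[i-1]
                           dp[i][j - 1],                       # skip ys[j-1]
                           dp[i - 1][j - 1] + phi(xs[i - 1], ys[j - 1]))
    return dp[m][n]
```

With a discrete φ this recovers ordinary LCS length, matching the (1, 2, 3, 4, 5) vs. (1, 3, 5, 7, 9) example above.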

Discussion
We have seen that a diverse set of structured prediction metrics can be framed as computing a normalized total matching score for an optimal matching, given some similarity, which may itself reflect a score over an optimal matching of the relevant substructures. We now consider how different problem settings may motivate particular design decisions within this framework. We also highlight a couple of cases in which the actual metrics used for a task might be modified to better fit the problem setting.
Partial Credit For many tasks, we want to award some credit for partially correct responses. In applications where precision is paramount, it may be appropriate to insist on exact matches, but less so when some modest tradeoff with recall is desired. Moreover, many IE objects intuitively admit gradient notions of correctness. Perhaps the most obvious example of this is mentions. Exact match on mentions, whether string- or offset-based, remains surprisingly common despite the possibility of variation in how they are annotated (e.g. disagreements about the extent of NPs). More relaxed mention similarities, such as head word matching or Jaccard score, are typically more appropriate. Recently, there has also been greater interest in the informativity of entity mentions (Li et al., 2021; Chen et al., 2023), where, e.g., names > nominal expressions > pronouns, and where scores may need to vary according to a mention's informativity. All of these can be captured by different choices of φMention.
REE offers another example. Earlier (§3.3), we saw that CEAF-REE uses the φ⊆ entity similarity, which awards no credit at all to entities containing even one incorrect mention, but full credit to entities containing just one correct mention. A more natural extension of the CEAF metric to the REE setting, and one that permits partial credit, would be to replace φ⊆ with φ3 or φ4.
Hierarchical Ontologies Type hierarchies are another common feature of IE problems: both events and entities may have types, subtypes, and even sub-subtypes. This is true of the event ontologies for FrameNet (Baker et al., 1998), RAMS (Ebner et al., 2020), WikiEvents (Li et al., 2021), and even MUC-4. Yet, the standard evaluation metrics for these datasets do not take the hierarchy into account, instead treating the ontology as flat.
Following the discussion above, it may often be appropriate to replace exact type matches (δtype) with similarities that award partial credit for correct ancestor type prediction. One possibility is level-based partial scoring: given a k-level type ontology with types specified as k-tuples t = (t1, . . ., tk), we could, for instance, award credit based on the depth d ∈ {0, . . ., k} of the most specific correctly predicted type, e.g. φ(t, t′) = d / k, where d = 0 iff even the most general type is incorrectly predicted. Or, one could adopt practices from related work on fine-grained entity typing (Ling and Weld, 2012; Chen et al., 2020, i.a.), which uses the F1 score of the sets of all supertypes of the predicted and reference types, S(t) = {s | t <: s, s ∈ V}. There is some precedent for schemes like this, but proper analysis of performance on tasks with hierarchical ontologies requires metrics that account for that hierarchy, and the field of IE as a whole would benefit from adopting them more widely.
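Both proposals can be sketched as follows, with type tuples ordered from most general to most specific (the type names in the usage below are purely illustrative):

```python
# Level-based partial credit: depth of the most specific correctly
# predicted level, normalized by the ontology depth k.
def level_phi(pred: tuple, ref: tuple) -> float:
    k = len(ref)
    d = 0
    for p, r in zip(pred, ref):
        if p != r:
            break
        d += 1
    return d / k if k else 0.0

# Fine-grained-typing style credit: F1 over the sets of ancestors
# (including the type itself), encoded here as tuple prefixes.
def supertype_f1(pred: tuple, ref: tuple) -> float:
    sp = {pred[:i] for i in range(1, len(pred) + 1)}
    sr = {ref[:i] for i in range(1, len(ref) + 1)}
    denom = len(sp) + len(sr)
    return 2 * len(sp & sr) / denom if denom else 0.0
```

For type tuples of equal depth the two scores coincide; they differ when predicted and reference types sit at different depths of the hierarchy.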
One-Sided vs. Two-Sided Constraints In general, metrics impose constraints on the matching between the predictions and the reference. Overwhelmingly, these tend to be two-sided (bijective) constraints, as systems usually try to generate just one predicted object for each one in the reference. But this is not always the case. The CEAF-RME metrics (Eqs. 27 and 28) proposed by Chen et al. (2023), which use one-sided constraints, are motivated in part by a need to evaluate a model that predicts mention fillers against references that contain entity fillers. This suggests a more general motivation for one-sided constraints: namely, cases where the reference outputs are sets, but where predictions take the form of members of those sets.

Conclusion
We have presented a framework that unifies a variety of structured prediction metrics as normalized scores over (possibly hierarchical) constrained optimal matchings of structured objects. On the side of theory, our framework elucidates the relationships among tasks by defining the core components of their metrics. On the side of practice, it offers a compositional toolkit for the design of new metrics (aided by our library) and for critically evaluating existing ones, showing where they may inadequately capture important task features (§6).
We intend this work to help the NLP community converge both on a common language for metric design and on more standardized metric implementations.

Ethics Statement
As this work principally describes a conceptual framework and presents a survey of evaluation metrics, we do not believe it raises ethical concerns.

Limitations
While this work aims to give a unified treatment of a variety of different metrics, our coverage of existing metrics is not exhaustive, and is intended rather to convey the expressiveness of the framework.
Our framework for evaluation is based on matching of substructures; thus metrics based on structure editing (e.g. string or tree edit distances, or word error rate (WER) in speech recognition) cannot be expressed naturally in our formulation. One can of course define a φ based on edit distances over sequences, but it would have to be an atomic definition and could not be derived naturally under our bottom-up approach.

References
Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In Proceedings of the LREC Workshop on Linguistic Coreference.

B Kernels

This is the kernel version of the product similarity we discussed in the main text. Note that the F score can be written solely in terms of the overlap scoring function Σ. Therefore, given the lemmata above, we have the nice property that any kernels composed with F↔ and J↔ are kernels. The deductions presented here follow Shen (2019).

C MUC-4 Evaluation: Additional Details
String-Valued Similarities and Interactive Scoring For string-valued slots, although full entities are annotated in the reference, systems are required to predict just one mention per entity. Two different versions of φstr were used: one for determining the template alignment and one for computing the score given that alignment. The first version awarded full credit when there was at least a one-word overlap between the predicted string and at least one of the strings in the reference entity, so long as that word was not a designated premodifier; zero credit was awarded otherwise. Suppose valid(x, y) is true iff neither word x nor word y is a premodifier. Then we can write: φword(x, y) = δword(x, y) · ⟦valid(x, y)⟧ (45). The second version of φstr, used for final reporting, merely enhanced Eq. 46 by interactively querying the user in cases where a mismatch could not be automatically resolved, whereupon the user could determine the appropriate amount of credit to award, including partial (= half) credit. Full guidelines on interactive scoring can be found in the gzip archive containing the MUC-3 and MUC-4 data at https://www-nlpir.nist.gov/related_projects/muc/muc_data/muc_data_index.html (see TEST/SCORER/scoring-guidelines.v7).
Template Alignment Constraints Template alignments for the original evaluation featured a couple of quirks. For one, it was possible to obtain partial (= half) credit for the template ("incident") type by predicting the generic attack label in place of any of the other, more specific labels (bombing, kidnapping, etc.). For another, a partial match on at least one of the following slots was required: physical target identifier, physical target type, human target name, human target description, human target type, perpetrator individual identifier, and perpetrator organization identifier. Chinchor (1992) notes that this constraint was put in place to prevent "fortuitous" but spurious template alignments that were observed in the MUC-3 evaluation. To our knowledge, researchers have not applied these rules in evaluating their own systems on MUC-4 since the original evaluation.

MUC-4: Recent Work
In recent years, it has become standard to evaluate only on the string-fill slots, plus the template type (Chambers and Jurafsky, 2011; Du et al., 2021b; Das et al., 2022; Chen et al., 2023). Du et al. (2021b) thus proposed a version of Eq. 32 that sets φT = φ⊆, which amounts to using CEAF-REE (Eq. 26) to determine the optimal template alignment. Following this work, Chen et al. (2023) additionally present MUC-4 results using their relaxed (one-sided) metrics (Eqs. 27, 28) for φT.

D BETTER
BETTER Granular is a recent template extraction dataset released as part of the IARPA BETTER program that is more complex than MUC-4 both in having different slots for each template type and in having a greater diversity of filler types. Here, we focus just on the key difference in overall score calculation compared to MUC-4. The Granular score is the product of the slot filler F1 score (Eq. 32) and the template type F1 score, where δtype applies to template types and φT applies to filler types, as in Eq. 32. Because this product cannot be expressed as a sum of scores over aligned template pairs (Eq. 13), it does not, on its face, fit within our framework. However, this score could still be optimized indirectly by instead optimizing the template alignment against the second term only, as this will be non-zero only in cases where there is a match on template type. For more on BETTER, see Soboroff (2023), Mckinnon and Rubino (2022), and https://www.iarpa.gov/index.php/research-programs/better. All BETTER program datasets are available at https://ir.nist.gov/better/ (note that account registration is required, but is free). Appendices C and D of Chen et al. (2023) also provide a good overview of the Granular task and its evaluation, including definitions of φT for all slot filler types.

Figure 2: Output structure of common IE tasks discussed in this paper, with examples of their outputs.

B3 Data Structure

Different from MUC and CEAF, B3 assigns a score to each mention instead of each entity. Here, we need a slightly different data structure, where we pair each mention with the entity it belongs to:

class Membership:  # an instance of a membership relation
    mention: Mention
    entity: Entity

class CorefOutputForB3:
    rels: Set[Membership]  # membership relations

Note: some work on role-filler extraction has evaluated at the mention level (Patwardhan and Riloff, 2009; Huang and Riloff, 2011; Du and Cardie, 2020), essentially doing named entity recognition (NER).