Evaluation for Partial Event Coreference

This paper proposes an evaluation scheme to measure the performance of a system that detects hierarchical event structure for event coreference resolution. We show that each system output is represented as a forest of unordered trees, and introduce the notion of conceptual event hierarchy to simplify the evaluation process. We enumerate the desiderata for a similarity metric to measure the system performance. We examine three metrics against the desiderata, and show that metrics extended from MUC and BLANC are more adequate than a metric based on Simple Tree Matching.


Introduction
Event coreference resolution is the task of determining whether two event mentions refer to the same event. This task is important because resolved event coreference is useful in various tasks such as topic detection and tracking, information extraction, question answering, textual entailment, and contradiction detection.
A key challenge for event coreference resolution is that one can define several relations between two events, some of which deviate subtly from perfect event identity. For clarification, we refer to perfect event identity as full (event) coreference in this paper. To address the subtlety in event identity, previous work focused on two types of partial event identity: subevent and membership. Subevent relations form a stereotypical sequence of events, or a script (Schank and Abelson, 1977; Chambers and Jurafsky, 2008). Membership relations represent instances of an event collection. We refer to both as partial (event) coreference in this paper. Figure 1 shows some examples of the subevent and membership relations in the illustrative text below, taken from the Intelligence Community domain of violent events. Unlike full coreference, partial coreference is a directed relation, and forms hierarchical event structure, as shown in Figure 1. Detecting partial coreference itself is an important task because the resulting event structures are beneficial to text comprehension. In addition, such structures are also useful as background knowledge for resolving event coreference.
A car bomb that police said was set by Shining Path guerrillas ripped off(E4) the front of a Lima police station before dawn Thursday, wounding(E5) 25 people. The attack(E6) marked the return to the spotlight of the feared Maoist group, recently overshadowed by a smaller rival band of rebels. The pre-dawn bombing(E7) destroyed(E8) part of the police station and a municipal office in Lima's industrial suburb of Ate-Vitarte, wounding(E9) 8 police officers, one seriously, Interior Minister Cesar Saucedo told reporters. The bomb collapsed(E11) the roof of a neighboring hospital, injuring(E12) 15, and blew out(E13) windows and doors in a public market, wounding(E14) two guards.
Figure 1: Examples of subevent and membership relations. Solid and dashed arrows represent subevent and membership relations respectively, with the direction from a parent to its subevent or member. For example, we say that E4 is a subevent of E6. Solid lines without any arrow heads represent full coreference.
In this paper, we address the problem of evaluating the performance of a system that detects partial coreference in the context of event coreference resolution. This problem is important because, as with other tasks, a good evaluation method for partial coreference will facilitate future research on the task in a consistent and comparable manner. When one introduces an evaluation metric for a new and complex task such as partial event coreference, it is often unclear which metric suits which evaluation scheme, and under what assumptions. It is also unclear how effectively and readily existing algorithms or tools, if any, can be used in a practical evaluation setting. To resolve these sub-problems for partial coreference evaluation, we need to formulate an evaluation scheme that defines the assumptions to be made regarding the evaluation, specifies the desiderata that an ideal metric should satisfy for the task, and examines how adequately particular metrics satisfy them. For this purpose, we specifically investigate three existing algorithms: MUC, BLANC, and Simple Tree Matching (STM).
The contributions of this work are as follows:
• We introduce a conceptual event hierarchy that simplifies the evaluation process for partial event coreference.
• We present a way to extend MUC, BLANC, and STM for the case of unordered trees. Those metrics are generic and flexible enough to be used in evaluations involving data structures based on unordered trees.
• Our experimental results indicate that the extended MUC and BLANC are better than Simple Tree Matching for evaluating partial coreference.

Related Work
Recent studies on both entity and event coreference resolution use several metrics to evaluate system performance (Bejan and Harabagiu, 2010; Lee et al., 2012; Durrett et al., 2013; Lassalle and Denis, 2013) since there is no agreement on a single metric. Currently, five metrics are widely used: MUC (Vilain et al., 1995), B-CUBED (Bagga and Baldwin, 1998), two CEAF metrics, CEAF-$\phi_3$ and CEAF-$\phi_4$ (Luo, 2005), and BLANC (Recasens and Hovy, 2011). We can divide these metrics into two groups: cluster-based metrics, e.g., B-CUBED and CEAF, and link-based metrics, e.g., MUC and BLANC. The former group is not applicable to evaluating partial coreference because it is unclear how to define a cluster. The latter is not readily applicable to the evaluation because it is unclear how to penalize incorrect directions of links. We discuss these aspects in Section 4.1 and Section 4.2.
Tree Edit Distance (TED) is one of the traditional algorithms for measuring tree similarity. It has a long history of theoretical studies (Tai, 1979; Zhang and Shasha, 1989; Klein, 1998; Bille, 2005; Demaine et al., 2009; Pawlik and Augsten, 2011). It is also widely studied in many applications, including Natural Language Processing (NLP) tasks (Mehdad, 2009; Wang and Manning, 2010; Heilman and Smith, 2010; Yao et al., 2013). However, TED has a disadvantage: we need to predefine appropriate costs for basic tree-edit operations. In addition, an implementation of TED for unordered trees is fairly complex.
Another tree similarity metric is Simple Tree Matching (STM) (Yang, 1991). STM measures the similarity of two trees by counting the maximum match with dynamic programming. Although this algorithm was also originally developed for ordered trees, the underlying idea of the algorithm is simple, making it relatively easy to extend the algorithm for unordered trees.
Tree kernels have also been widely studied and applied to NLP tasks, more specifically, to capture the similarity between parse trees (Collins and Duffy, 2001; Moschitti et al., 2008) or between dependency trees (Croce et al., 2011; Srivastava et al., 2013). This method is based on a supervised learning model with training data; hence it requires as input a number of pairs of trees together with numeric similarity values for those pairs. Thus, it is not appropriate for an evaluation setting.

Evaluation Scheme
When one formulates an evaluation scheme for a new task, it is important to define assumptions for the evaluation and desiderata that an ideal metric should satisfy. In this section, we first describe assumptions for partial coreference evaluation, and introduce the notion of conceptual event hierarchy to address the challenge posed by one of the assumptions. We then enumerate the desiderata for a metric.

Assumptions on Partial Coreference
We make the following three assumptions to evaluate partial coreference.
Twinless mentions: Twinless mentions (Stoyanov et al., 2009) are the mentions that exist in the gold standard but not in a system response, or vice versa. In practice, twinless mentions occur often, since an end-to-end system might produce them in the process of detecting mentions. The assumption regarding twinless mentions has been investigated in research on entity coreference resolution. Cluster-based metrics such as B-CUBED and CEAF assume that a system is given true mentions without any twinless mentions in the gold standard, and then resolves full coreference on them. Researchers have made different assumptions about this issue. Early work such as Ji et al. (2005) and Bengtson and Roth (2008) simply ignored such mentions. Rahman and Ng (2009) removed twinless mentions that are singletons in a system response. Cai and Strube (2010) proposed two variants of B-CUBED and CEAF that can deal with twinless mentions in order to make the evaluation of end-to-end coreference resolution systems consistent.
In evaluation of partial coreference, where twinless mentions can also exist, we believe that making evaluation consistent and comparable is the most important goal, and hypothesize that it is possible to create a metric that measures the performance of partial coreference while dealing with twinless mentions. A potential problem of making a single metric handle twinless mentions is that the metric would not be informative enough to show whether a system is good at identifying coreference links but poor at identifying mentions, or vice versa (Recasens and Hovy, 2011). However, our intuition is that the problem is avoidable by reporting the performance of mention identification with metrics such as precision, recall, and the F-measure alongside the performance of link identification. In this work, therefore, we assume that a metric for partial coreference should be able to handle twinless mentions.
Intransitivity: As described earlier, partial coreference is a directed relation. We assume that partial coreference is not transitive. To illustrate the intransitivity, let $e_i \xrightarrow{s} e_j$ denote a subevent relation in which $e_j$ is a subevent of $e_i$. In Figure 1, we have $E7 \xrightarrow{s} E8$ and $E8 \xrightarrow{s} E9$. In this case, E9 is not a subevent of E7 due to the intransitivity of subevent relations. One could argue that the event 'wounding(E9)' is one of the stereotypical events triggered by the event 'bombing(E7)', and thus $E7 \xrightarrow{s} E9$. However, if we allow transitivity of partial coreference, then we have to measure all implicit partial coreference links (e.g., the one between E7 and E9) from hierarchical event structures. Consequently, this evaluation policy could result in an unfair scoring scheme biased toward large event hierarchies.
Link propagation: We assume that partial coreference links can be propagated through their combination with full coreference links. To illustrate the phenomenon, let $e_i \Leftrightarrow e_j$ denote full coreference between $e_i$ and $e_j$. In Figure 1, if a system identifies $E6 \xrightarrow{s} E8$, then there is no reason to argue that the identified subevent relation is incorrect given that $E6 \Leftrightarrow E7$ and $E7 \xrightarrow{s} E8$. The discussion here also applies to membership relations.

Conceptual Event Hierarchy
The assumption of link propagation poses a challenge in measuring the performance of partial coreference. We illustrate the challenge with the example in the discussion on link propagation above. We focus only on subevent relations to describe our idea, but one can apply the same discussion to membership relations. Suppose that a system detects a subevent link $E7 \xrightarrow{s} E8$, but not $E6 \xrightarrow{s} E8$. Then, is it reasonable to give the system a double reward for the two links $E7 \xrightarrow{s} E8$ and $E6 \xrightarrow{s} E8$ due to link propagation, or should one require the system to perform such link propagation and detect $E6 \xrightarrow{s} E8$ as well in order to achieve the double reward? In an evaluation scheme based on event trees whose nodes represent event mentions, we need to predefine how to deal with link propagation of full and partial coreference. In particular, we must pay attention to the potential risk of overcounting partial coreference links due to link propagation.
To address the complexity of link propagation, we introduce a conceptual event tree where each node represents a conceptual event rather than an event mention. Figure 2 shows an example of a conceptual subevent tree constructed from full coreference and subevent relations in Figure 1. Using set notation, each node of the tree represents an abstract event. For instance, node {E6, E7} represents an "attacking" event which both event mentions E6 and E7 refer to.
Figure 2: A conceptual subevent tree constructed from the full coreference and subevent relations in Figure 1.
The notion of a conceptual event tree obviates the need to cope with link propagation, thereby simplifying the evaluation for partial coreference. Given a conceptual event tree, an evaluation metric is basically just required to measure how many links in the tree a system successfully detects. When comparing two conceptual event trees, a link in a tree is identical to one in the other tree if there is at least one event mention shared in parent nodes of those links and at least one shared in child nodes of those links. For example, suppose that system A identifies $E6 \xrightarrow{s} E8$, system B $E7 \xrightarrow{s} E8$, system C both, and all the systems identify $E6 \Leftrightarrow E7$ in Figure 1. In this case, they gain the same score since the subevent links that they identify correspond to one correct subevent link $\{E6, E7\} \xrightarrow{s} \{E8\}$ in Figure 2. It is possible to construct the conceptual event hierarchy for membership relations in the same way as described above. This means that the conceptual event hierarchy allows us to show the performance of a system on each type of partial coreference separately, which leads to more informative evaluation output.
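The collapse from mention-level links to conceptual links can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names (`conceptual_nodes`, `conceptual_links`, `same_link`) and the union-find grouping are our own choices, not part of the original scheme, and links are represented as (parent, child) pairs.

```python
def conceptual_nodes(mentions, full_coref_pairs):
    """Map each mention to its conceptual event (a frozenset of mentions)."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the representative of the cluster.
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    # Merge mentions connected by full coreference.
    for a, b in full_coref_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), set()).add(m)
    return {m: frozenset(clusters[find(m)]) for m in mentions}


def conceptual_links(mentions, full_coref_pairs, partial_links):
    """Lift mention-level (parent, child) links to conceptual-event links."""
    node_of = conceptual_nodes(mentions, full_coref_pairs)
    return {(node_of[p], node_of[c]) for p, c in partial_links}


def same_link(link_a, link_b):
    """Links match if their parents share a mention and their children do."""
    return bool(link_a[0] & link_b[0]) and bool(link_a[1] & link_b[1])


mentions = ["E6", "E7", "E8"]
full = [("E6", "E7")]
system_a = conceptual_links(mentions, full, [("E6", "E8")])
system_b = conceptual_links(mentions, full, [("E7", "E8")])
system_c = conceptual_links(mentions, full, [("E6", "E8"), ("E7", "E8")])
print(system_a == system_b == system_c)
```

Because the three hypothetical systems all resolve E6 and E7 to the same conceptual node, their outputs reduce to the same single conceptual link, so none of them is rewarded twice.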
One additional note is that the conceptual event tree representing partial coreference is an unordered tree, as illustrated in Figure 2. Although we could represent a subevent tree with an ordered tree because of the stereotypical sequence of subevents given by definition, partial coreference is in general represented with a forest of unordered trees.¹
¹ For example, it is impossible to intuitively define a sequence of child nodes in a membership event tree in Figure 1.

Desiderata for Metrics
In general, a system output of partial event coreference in a document is represented not by a single tree but by a forest, i.e., a set of disjoint trees whose nodes are event mentions that appear in the document. Let $T$ be a tree, and let $F = \{T_i\}$ be a forest. Let $\mathrm{sim}(F_g, F_r) \in [0, 1]$ denote a similarity score between the gold standard forest $F_g$ and a system response forest $F_r$. We define the following properties that an ideal evaluation metric for partial event coreference should satisfy.
P1. Identity: $\mathrm{sim}(F_1, F_1) = 1$.
P2. Symmetry: $\mathrm{sim}(F_1, F_2) = \mathrm{sim}(F_2, F_1)$.
P3. Zero: $\mathrm{sim}(F_1, F_2) = 0$ if $F_1$ and $F_2$ are totally different forests.
P4. Monotonicity: The metric score should increase from 0 to 1 monotonically as two totally different forests approach the identical one.
P5. Linearity: The metric score should increase linearly as each single individual correct piece of information is added to a system response.
The first three properties are relatively intuitive. P4 is important because otherwise a higher score by the metric does not necessarily mean higher quality of partial event coreference output. In P5, a correct piece of information is the addition of one correct link or the deletion of one incorrect link. This property is useful for tracking performance progress over a certain period of time. If the metric score increases nonlinearly, then it is difficult to compare performance progress such as a 0.1 gain last year and a 0.1 gain this year, for example.
In addition, one can think of another property with respect to structural consistency. The motivation for the property is that one might want to give more reward to partial coreference links that form hierarchical structures, since they implicitly form sibling relations among child nodes. For instance, suppose that system A detects two links from {E6, E7}, one to {E8} and one to {E11}, in Figure 2. We can think that system A performs better since the system successfully detects an implicit subevent sibling relation between {E8} and {E11} as well. Due to space limitations, however, we do not explore the property in this work, and leave it for future work.

Evaluation Metrics
In this section, we examine three evaluation metrics based on MUC, BLANC, and STM respectively under the evaluation scheme described in Section 3.

B-CUBED and CEAF
B-CUBED regards a coreference chain as a set of mentions, and examines the presence and absence of mentions in a system response that are relative to each of their corresponding mentions in the gold standard (Bagga and Baldwin, 1998). Let us call such a set a mention cluster. A problem in applying B-CUBED to partial coreference is that it is difficult to properly form a mention cluster for partial coreference. In Figure 2, for example, we could form a gold standard cluster containing all nodes in the tree. We could then form a system response cluster, given a certain system output. The problem is that B-CUBED's way of counting the mentions that overlap in those clusters cannot capture parent-child relations between the mentions in a cluster. It is also difficult to extend the counting algorithm to incorporate such relations in an intuitive manner. Therefore, we conclude that B-CUBED is not appropriate for evaluating partial coreference.
CEAF is inadequate for basically the same reason. It also regards a coreference chain as a set of mentions, and measures how many mentions two clusters share using two similarity metrics, $\phi_3(R, S) = |R \cap S|$ and $\phi_4(R, S) = \frac{2|R \cap S|}{|R| + |S|}$, given two clusters $R$ and $S$. One can extend CEAF for partial coreference by selecting the most appropriate tree similarity algorithm for $\phi$ that works well with the algorithm to compute maximum bipartite matching in CEAF. However, that is another line of work, and due to space limitations we leave it for future work.
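A small worked example may make the limitation concrete. The Python sketch below implements only the two cluster similarity functions (not the bipartite matching step of CEAF); the names `phi3` and `phi4` and the example clusters are illustrative assumptions.

```python
def phi3(r, s):
    """Number of mentions shared by clusters r and s."""
    return len(r & s)


def phi4(r, s):
    """Shared mentions normalized by the sizes of both clusters."""
    total = len(r) + len(s)
    return 2 * len(r & s) / total if total else 0.0


# Hypothetical clusters: every node of a tree is put into one set.
gold_cluster = {"E6", "E7", "E8"}
response_cluster = {"E7", "E8", "E9"}

print(phi3(gold_cluster, response_cluster))
print(phi4(gold_cluster, response_cluster))
```

Both functions see only set membership, so a response that reverses the direction of a link between E7 and E8 would receive exactly the same values as one that gets the direction right.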

Extension to MUC and BLANC
MUC relies on the minimum number of links needed when mapping a system response to the gold standard (Vilain et al., 1995). Given a set of key entities K and a set of response entities R, precision of MUC is defined as the number of common links between entities in K and R divided by the number of links in R, whereas recall of MUC is defined as the number of common links between entities in K and R divided by the number of links in K. After finding a set of mention clusters by resolving full coreference, we can compute the number of correct links by counting all links spanning those mention clusters that match the gold standard. It is possible to apply the idea of MUC to the case of partial coreference simply by changing the definition of a correct link. In the partial coreference case, we define a correct link as a link that matches the gold standard, including its direction. Let MUC_p denote this extension of MUC to partial coreference.
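The direction-sensitive notion of a correct link can be illustrated as follows. This is a minimal sketch, not the authors' implementation: it assumes links are given as (parent, child) pairs over already-built conceptual nodes, uses exact pair equality for matching, and adds an F1 score as the usual harmonic mean. The name `muc_p` and the example labels are hypothetical.

```python
def muc_p(gold_links, response_links):
    """Precision, recall, and F1 over directed (parent, child) links."""
    gold, resp = set(gold_links), set(response_links)
    # A link is correct only if both endpoints and the direction match.
    correct = len(gold & resp)
    precision = correct / len(resp) if resp else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = {("A", "B"), ("A", "C"), ("B", "D")}
# The response reverses one link: ("C", "A") instead of ("A", "C").
response = {("A", "B"), ("C", "A"), ("B", "D")}

print(muc_p(gold, response))
```

In this toy case the reversed link is counted as incorrect, so only two of the three response links are rewarded.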
Similarly, it is also possible to define an extension to BLANC. Let BLANC_p denote the extension. BLANC computes precision, recall, and F1 scores for both coreference and non-coreference links, and averages them for the final score (Recasens and Hovy, 2011). As with MUC_p, BLANC_p defines a correct link as a link that matches the gold standard, including its direction. Another difference between BLANC and BLANC_p is the total number of mention pairs, denoted as $L$. In the original BLANC, $L = N(N - 1)/2$, where $N$ is the total number of mentions in a document. We use $L_p = N(N - 1)$ instead for BLANC_p since we consider two directed links in partial coreference for each undirected link in full coreference.
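A rough sketch of this computation is given below. It is written under several assumptions that the text does not spell out: nodes are hashable labels, every link connects two distinct nodes from `nodes`, matching is exact, and the boundary cases treated specially in the original BLANC definition are simply mapped to 0. The function names are our own.

```python
def _prf(right, wrong_in_response, wrong_in_gold):
    """Precision, recall, and F1 for one link type."""
    p_den = right + wrong_in_response
    r_den = right + wrong_in_gold
    p = right / p_den if p_den else 0.0
    r = right / r_den if r_den else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def blanc_p(nodes, gold_links, response_links):
    """Average the scores for partial-coreference and non-coreference links."""
    n = len(nodes)
    total = n * (n - 1)  # L_p: ordered pairs, one per direction
    gold, resp = set(gold_links), set(response_links)
    right_c = len(gold & resp)   # links present in both
    wrong_c = len(resp - gold)   # response links absent from gold
    wrong_n = len(gold - resp)   # gold links the response missed
    right_n = total - right_c - wrong_c - wrong_n  # pairs unlinked in both
    pc, rc, fc = _prf(right_c, wrong_c, wrong_n)
    pn, rn, fn = _prf(right_n, wrong_n, wrong_c)
    return (pc + pn) / 2, (rc + rn) / 2, (fc + fn) / 2


nodes = ["A", "B", "C", "D"]
gold = {("A", "B"), ("A", "C"), ("B", "D")}
response = {("A", "B"), ("C", "A"), ("B", "D")}

print(blanc_p(nodes, gold, response))
```

With four nodes there are $4 \times 3 = 12$ ordered pairs, so the single reversed link lowers both the link score (2/3) and the non-link score (8/9).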

Extension to Simple Tree Matching
The underlying idea of STM is that the more matching nodes two trees have, the more similar they are. The original STM uses a dynamic programming approach to perform recursive node-level matching in a top-down fashion. In the case of partial coreference, we cannot readily use the approach because partial coreference is represented with unordered trees, and thus the time complexity of node matching is exponential in the number of child nodes. However, partial event coreference is normally given in a small hierarchy with three levels or fewer. Taking advantage of this fact and assuming that each event mention is uniquely identified in a tree, we extend STM for the case of unordered trees by using greedy search. Algorithm 1 shows an extension to the STM algorithm for unordered trees.
We can also naturally extend STM to take forests as input. Figure 3 shows how one can convert a forest into a single tree whose subtrees are the trees in the forest by introducing an additional dummy root node on top of those trees. The resulting tree is also an unordered tree, and thus we can apply Algorithm 1 to that tree to measure the similarity of two forests comprising unordered trees. Let STM_p denote the extended STM. Finally, we normalize STM_p. Let NSTM_p be a normalized version of STM_p as follows: $\mathrm{NSTM}_p(F_1, F_2) = \mathrm{STM}_p(F_1, F_2) / \max(|F_1|, |F_2|)$, where $|F|$ denotes the number of nodes in $F$.
Algorithm 1: Extended simple tree matching for unordered trees.
Input: two unordered trees A and B
Output: score
procedure SimpleTreeMatching(A, B)
    if the roots of A and B have different elements then
        return 0
    else
        s := 1    (the initial score for the root match)
        m := the number of first-level sub-trees of A
        n := the number of first-level sub-trees of B
        for i = 1 → m do
            for j = 1 → n do
                if A_i and B_j have the same element then
                    s := s + SimpleTreeMatching(A_i, B_j)
        return s
Figure 3: Conversion from a forest to a single tree with an additional dummy root.
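Algorithm 1, the dummy-root conversion, and the normalization can be sketched together in Python. Several details here are assumptions rather than statements from the paper: the `Node` class, the `DUMMY` label, and in particular the choice to exclude the dummy root both from the match count and from the node count, which we make so that identical forests score 1 and forests with no matching roots score 0.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    element: object  # e.g. a frozenset of mentions or a label
    children: list = field(default_factory=list)


def stm(a, b):
    """Greedy simple tree matching for unordered trees (after Algorithm 1)."""
    if a.element != b.element:
        return 0
    score = 1  # the roots match
    # Elements are assumed unique within a tree, so each child of `a`
    # matches at most one child of `b`.
    for child_a in a.children:
        for child_b in b.children:
            if child_a.element == child_b.element:
                score += stm(child_a, child_b)
    return score


def size(node):
    """Number of nodes in the tree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)


DUMMY = "<dummy-root>"  # hypothetical label for the added root


def nstm(forest_1, forest_2):
    """Normalized forest similarity; the dummy root is not counted."""
    t1, t2 = Node(DUMMY, list(forest_1)), Node(DUMMY, list(forest_2))
    denom = max(size(t1), size(t2)) - 1
    if denom == 0:
        return 1.0  # two empty forests are treated as identical
    return (stm(t1, t2) - 1) / denom


gold = [Node("a", [Node("b", [Node("d")]), Node("c")])]
partial = [Node("a", [Node("b", [Node("d")])])]
wrong_root = [Node("x", [Node("b", [Node("d")]), Node("c")])]

print(nstm(gold, gold), nstm(gold, partial), nstm(gold, wrong_root))
```

The last case shows a property of the root check: once the top-level elements differ, none of the correctly structured nodes underneath contribute to the score.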

Flexibility of Metrics
The assumptions made for a particular task and the desiderata defined for a metric determine the evaluation scheme that we formulate. However, this kind of effort tends to make the resulting evaluation metrics too restrictive to be reusable in other tasks. Such metrics might be adequate for that task, but we also value the flexibility of a metric that can be directly used in, or easily extended to, other tasks. To investigate the flexibility of MUC_p, BLANC_p, and STM_p, we examine each metric without making the assumptions of twinless mentions and intransitivity of partial coreference. We consider the assumption of link propagation to be more fundamental and regard it as a basic premise, and thus we continue to make that assumption.
MUC was originally designed to deal with response links spanning mentions that even key links do not reach. Thus, it is able to handle twinless mentions. If we do not assume intransitivity of partial coreference, we do not see any difficulty in changing the definition of correct links in MUC_p so that it captures transitive relations. Therefore, MUC_p requires neither the assumption of twinless mentions nor that of intransitivity.
In contrast, BLANC was originally designed to handle true mentions in the gold standard. Since BLANC_p does not make any modifications on this aspect, it cannot deal with twinless mentions either. As for intransitivity, it is possible to easily change the definition of correct and incorrect links in BLANC_p to detect transitive relations. Thus, BLANC_p does not require intransitivity but does require the assumption of no twinless mentions.
Since STM_p simply matches elements in nodes as shown in Algorithm 1, it does not require the assumption of twinless mentions. With respect to intransitivity, we can extend STM_p by adding extra edges from a parent to grandchild nodes or others and applying Algorithm 1 to the modified trees. Hence, it does not require the assumption of intransitivity.
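One way to picture the relaxation of intransitivity is to add the implied ancestor-to-descendant links before scoring. The sketch below does this for a set of directed (parent, child) links; the function name and the link representation are assumptions for illustration, and the paper does not prescribe this particular procedure.

```python
def transitive_closure(links):
    """Add a link from every ancestor to each of its descendants."""
    children = {}
    for parent, child in links:
        children.setdefault(parent, set()).add(child)

    closed = set()

    def descend(root, node):
        for child in children.get(node, ()):
            if (root, child) not in closed:  # also guards against cycles
                closed.add((root, child))
                descend(root, child)

    for parent in list(children):
        descend(parent, parent)
    return closed


links = {("A", "B"), ("B", "D"), ("A", "C")}
print(sorted(transitive_closure(links)))
```

Applying such a closure to both the gold standard and the system response before link counting would let a link-based metric reward a response that links a parent directly to a grandchild.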
with the perfect output, and then added one incorrect link $49 \xrightarrow{s} 50$ shown in System 1. In a manner similar to case (a), we added incorrect links up to the merged tree one by one in a bottom-up fashion.
The results indicate that MUC_p and BLANC_p meet the desiderata defined in Section 3.3 more adequately than NSTM_p. The curves of MUC_p and BLANC_p in Figure 4 are close to linear, which is practically useful for a metric. In contrast, NSTM_p fails to meet P4 and P5 in case (a), and fails to meet P5 in case (b). This is because STM first checks whether the root nodes of two trees have the same element, and if the root nodes have different elements, STM stops searching the rest of the nodes in the trees.

Discussion
In Section 4.4, we observed that MUC_p and STM_p are more flexible than BLANC_p because they can measure the performance of partial coreference in the case of twinless mentions as well. The experimental results in Section 5 show that MUC_p and BLANC_p are more adequate in terms of the five properties defined in Section 3.3. Putting these together, MUC_p seems the best metric for partial event coreference. However, MUC has two disadvantages: (1) it prefers systems that have more mentions per entity (event), and (2) it ignores recall for singletons (Pradhan et al., 2011). MUC_p also has these disadvantages. Thus, BLANC_p might be the best choice for partial coreference if we could assume that a system is given true mentions in the gold standard.
Although STM_p fails to satisfy P4 and P5, it has the potential to capture the structural properties of partial coreference described in Section 3.3. This is because STM's recursive node-counting can be easily extended to count the number of correct sibling relations.

Conclusion
We proposed an evaluation scheme for partial event coreference with a conceptual event hierarchy constructed from mention-based event trees. We discussed possible assumptions that one can make, and examined extensions to three existing metrics. Our experimental results indicate that the extensions to MUC and BLANC are more adequate than the extension to STM. To our knowledge, this is the first work to discuss an evaluation scheme for partial event coreference. We also believe that our scheme is generic and flexible enough to be applicable to other directed relations of events (e.g., causality and entailment) or to other related tasks that compare hierarchical data based on unordered trees (e.g., ontology comparison). Future work includes improving the metrics by incorporating structural consistency of event trees as an additional property, and implementing the metrics from the perspective of broad contexts beyond local evaluation by link-based counting.