Realistic Evaluation Principles for Cross-document Coreference Resolution

We point out that common evaluation practices for cross-document coreference resolution have been unrealistically permissive in their assumed settings, yielding inflated results. We propose addressing this issue via two evaluation methodology principles. First, as in other tasks, models should be evaluated on predicted mentions rather than on gold mentions. Doing this raises a subtle issue regarding singleton coreference clusters, which we address by decoupling the evaluation of mention detection from that of coreference linking. Second, we argue that models should not exploit the synthetic topic structure of the standard ECB+ dataset, forcing models to confront the lexical ambiguity challenge, as intended by the dataset creators. We demonstrate empirically the drastic impact of our more realistic evaluation principles on a competitive model, which yields a score 33 F1 points lower than that obtained under the prior, lenient evaluation practices.


Introduction
Cross-document (CD) coreference resolution identifies and links textual mentions that refer to the same entity or event across multiple documents. For example, Table 1 depicts different news stories involving former U.S. president Barack Obama.
While subsuming the challenges of within-document (WD) coreference, CD coreference introduces additional unique challenges. Most notably, lexical similarity is often not a good indicator when identifying cross-document links, as documents are authored independently. As shown in Table 1, the same event can be referenced using different expressions ("nominated", "approached"), while two different events can be referenced using the same expression ("name"). Despite these challenges, reported state-of-the-art results on the popular ECB+ CD coreference benchmark (Cybulska and Vossen, 2014) are relatively high, reaching up to 80 F1 (Barhom et al., 2019; Meged et al., 2020).

Table 1: Example sentences from ECB+. Underlined words represent events, and the same color represents a coreference cluster. Different documents describe the same event using different words (e.g., "name", "approached"), while the two predicates "name" in the two subtopics are not coreferring.
In this paper, we show that CD coreference models achieve these numbers using overly-permissive evaluation protocols, namely assuming gold entity and event mentions are given, rewarding singletons and bypassing the lexical ambiguity challenge. Accordingly, we present more realistic evaluation principles which better reflect model performance in real-world scenarios.
First, following well-established standards in WD coreference resolution (Pradhan et al., 2012), we propose that CD coreference models should also be evaluated on predicted mentions. While recent models unrealistically assume that event mentions are given as part of the input, practical application on new texts and domains requires performing coreference on raw text, including automatic mention detection. Using predicted mentions raises a subtle point with regard to singletons (entities which are only referenced once). In particular, we observe that ECB+'s inclusion of singletons inaccurately rewards models for predicting them, by conflating the evaluation of mention identification with that of coreference linking. To address this, we propose reporting singleton identification performance as a separate metric, while reporting coreference results without singletons.
Second, we find that ECB+ does not accurately reflect real-world scenarios, where prominent events can be referenced in documents spanning different subjects and domains. To facilitate its annotation, ECB+ mimics this phenomenon by artificially grouping documents dealing with the same event (e.g., the nomination of Sanjay Gupta in Table 1) into a subtopic, and further grouping two similar subtopics into a larger topic (e.g., different nominations of government officials in Table 1). We observe that recent works exploit ECB+'s artificially simplistic structure by practically running the coreference model at the subtopic level, thus sidestepping a major lexical ambiguity challenge (e.g., mentions of "nomination" across subtopics do not corefer). In contrast, in real-world scenarios such clustering is much harder to perform and is often not as easily delineated. For example, Barack Obama and events from his presidency can be referenced in news, literature, sports reports, and more. To address this, we propose that models also report performance at the topic level.
Finally, we show empirically that both of these evaluation practices artificially inflate results. An end-to-end model that outperforms state-of-the-art results under previous evaluation settings drops by 33 F1 points when using our proposed evaluation scheme, pointing to weaknesses that future modelling work could explore.

Background
In this work, we will examine the evaluation of CD coreference on the popular ECB+ corpus (Cybulska and Vossen, 2014), constructed as an augmentation of the EECB and ECB datasets (Lee et al., 2012;Bejan and Harabagiu, 2010). As exemplified in Table 1, ECB+ groups its annotated documents into subtopics, consisting of different reports of the same real-world event (e.g., the nomination of Sanjay Gupta), and topics, which in turn consist of two lexically similar subtopics. Full ECB+ details are presented in Appendix A.
The ECB+ evaluation protocol largely follows that of CoNLL-2012, perhaps the most popular WD benchmark (Pradhan et al., 2012), with two major distinctions. First, barring a few notable exceptions (Yang et al., 2015; Choubey and Huang, 2017), most recent CD models have unrealistically assumed that gold entity and event mentions are given as part of the input, reducing the task to finding coreference links between gold mentions (Bejan and Harabagiu, 2014; Cybulska and Vossen, 2015; Kenyon-Dean et al., 2018; Barhom et al., 2019; Meged et al., 2020). Second, while singletons are omitted in CoNLL-2012, they are exhaustively annotated in ECB+.
In the following section, we present a more realistic evaluation framework for CD coreference, taking into account the interacting distinctions of ECB+.

Realistic Evaluation Principles
In this paper, we suggest that CD coreference models should operate on, and be evaluated against, predicted mentions. To achieve this, in Section 3.1 we introduce the singleton effect on coreference evaluation and propose to decouple the evaluation of mention prediction from that of coreference resolution. In Section 3.2, we establish guidelines for better assessing how models handle the ubiquitous lexical ambiguity challenge in real-world scenarios.

Decoupling Coreference Evaluation
Our goal is to propose a more reliable evaluation methodology of a coreference system over predicted mentions when singletons are included.
We use an example to show that evaluating singleton prediction with standard coreference metrics (B3, CEAF, LEA) can lead to misleading results which are hard to interpret (henceforth, we refer to this phenomenon as the singleton effect). Assume G denotes the gold clusters for Table 1 (for brevity, we omit some mentions), and S1 and S2 denote the outputs of two systems, which differ in their mention detection and coreference linking performance. S2 predicts the coreference links better than S1, yet S1 achieves higher results when singletons are included in the evaluation, because S1 performs better on the mention detection task: S1 identified the mentions of the singleton clusters, while S2 missed them and predicted incorrect span boundaries for the first two mentions ("News that" and "Emory"). Both S1 and S2 erroneously merged the singleton mention "announcement" into the cluster {name, approached}; however, S1 further merged these mentions with the lexically-similar cluster {names, nominates, decision}, whereas S2 successfully separated them. In other words, S1 performs well on mention detection but worse on coreference linking, and S2 the opposite. Table 2 shows the results of S1 and S2 according to (1) the common CoNLL-2012 evaluation, where only non-singleton clusters are evaluated, and (2) applying the coreference metrics also to singleton prediction. With respect to (1), S2 achieves higher results according to all evaluation metrics. In (2), we see the opposite: the results of S1 are significantly higher than those of S2 w.r.t. B3 (+18.4), CEAF-e (+45.1), and LEA (+19), but not w.r.t. MUC, a link-based metric. Indeed, these evaluation metrics reward S1 in both recall and precision for all predicted singletons, while penalizing S2 for the wrong and missing singleton spans. Since singletons are abundant in natural text, they contribute greatly to the overall score.
However, as observed by Rahman and Ng (2009), a model's ability to identify that these singletons do not belong to any coreference cluster is already captured in the evaluation metrics, and additional penalty is not desired. In Appendix B, we introduce the aforementioned evaluation metrics for coreference resolution (MUC, B 3 , CEAF and LEA) and explain how singletons affect them.
To address the singleton effect, we suggest decoupling the evaluation of the two coreference subtasks, mention detection and coreference linking, allowing coreference results to be analyzed more thoroughly and systems to be compared more appropriately. Mention detection is typically a span detection task and should be evaluated using standard span metrics on all detected mentions, including singletons. In particular, we use the span F1 metric and consider a predicted mention correct if it exactly matches a gold mention, as is common in named entity recognition (Tjong Kim Sang and De Meulder, 2003). Using such evaluation in our above example, S1 achieves 100 F1 and S2 achieves 66.7 F1 (recall: 60, precision: 75).
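For concreteness, exact-match span F1 can be sketched as follows (a minimal illustration; representing mentions as (doc_id, start, end) tuples is our own convention, not part of the official evaluation code):

```python
def span_f1(pred_spans, gold_spans):
    """Exact-match span F1: a predicted mention counts as correct only if
    its (doc_id, start, end) boundaries exactly match a gold mention."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, a system that matches 3 of its 4 predicted spans against 5 gold spans obtains precision 75, recall 60, and F1 66.7, mirroring the S2 scores above.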
For the coreference evaluation, we propose to follow CoNLL-2012 and apply coreference metrics only on non-singleton (gold and predicted) clusters, as singletons are already evaluated under the mention detection evaluation. We note also that even when omitting singletons, coreference metrics still penalize models for making coreference errors involving singletons (as S2 is penalized for linking "announcement" to a cluster).
We further show empirically ( §4.2) that when evaluating using gold mentions, the singleton effect is amplified and harms the validity of the current CD evaluation protocol. Evidently, a dummy baseline that predicts no coreference links and puts each input gold mention in a singleton cluster achieves non-negligible performance (Luo, 2005), while state-of-the-art results are artificially inflated.
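The dummy baseline's score under B3, when singletons are included in the metric, can be derived in closed form: every predicted singleton is contained in its gold cluster, so per-mention precision is 1, while a mention in a gold cluster of size n contributes recall 1/n. A sketch of this derivation (our own, for illustration only):

```python
def dummy_singleton_b3(gold_cluster_sizes):
    """B3 score of a baseline that places every gold mention in its own
    singleton cluster, when singletons are included in the metric.
    Each predicted singleton lies inside its gold cluster, so per-mention
    precision is 1; per-mention recall is 1/n for a gold cluster of size n."""
    n_mentions = sum(gold_cluster_sizes)
    precision = 1.0
    # sum over clusters of n * (1/n) = number of clusters, averaged over mentions
    recall = len(gold_cluster_sizes) / n_mentions
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with gold clusters of sizes [3, 1, 1, 1], this no-link baseline already reaches 80.0 B3 F1, illustrating why singleton-inclusive scores are inflated.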

Confronting Lexical Ambiguity
As mentioned previously, the same event can be described in documents from different topics, while documents in the same topic may describe different events (e.g., different nominations as surgeon general, as shown in Table 1). Such settings pose a lexical ambiguity problem, where models encounter identical or lexically-similar words that should be assigned to different coreference clusters. Accordingly, while topical document clustering is useful for CD coreference resolution in general, it does not solve the ambiguity problem, and models still need to make subtle disambiguation distinctions (e.g., the nomination of Sanjay Gupta vs. that of Regina Benjamin). Aiming to simulate this challenge within a manageable annotation task, the ECB+ authors (Cybulska and Vossen, 2014) augmented each topic in the original ECB with an additional subtopic of the same event type, allowing models to be challenged with lexical ambiguity (as mentioned in Section 2). However, recent works (Barhom et al., 2019; Meged et al., 2020) predict coreference clusters separately on each subtopic, using simple unsupervised document clustering during preprocessing. Such clustering performs near-perfectly on ECB+ because of its synthetic structure, where each topic includes exactly two subtopics with only a few coreference links across different subtopics. Yet document clustering is not expected to perform as well in realistic settings, where coreferring events can span multiple topics. More importantly, this practice bypasses the intention behind the inclusion of subtopics in ECB+ and avoids challenging coreference models on lexical ambiguity. Indeed, the ECB+ authors, in subsequent work, did not apply topic clustering (Cybulska and Vossen, 2015).
We therefore recommend that models also report results at the topic level, i.e., without applying document clustering. This conforms to ECB+'s purpose and follows the original evaluation setup of the ECB+ corpus (Bejan and Harabagiu, 2014).

Experiments
We show empirically that each of the previous evaluation practices (using gold mentions, singleton inclusion, and subtopic clustering) artificially inflates the results (§4.2). As recent CD coreference models are designed to operate on gold mentions (§2), we cannot use them to set baseline results on predicted mentions. We therefore develop a simple and efficient end-to-end model for CD coreference resolution by combining the successful single-document e2e-coref model (Lee et al., 2017) with common CD modeling approaches.

Model
We briefly describe the general architecture of our model; further details are given in Cattan et al. (2021) and Appendix C. Given a set of documents, our model operates in four sequential steps: (1) following Lee et al. (2017), we encode all possible spans up to length n as the concatenation of four vectors: the output representations of the span boundary (first and last) tokens, an attention-weighted sum of token representations in the span, and a feature vector denoting the span length; (2) we train a mention detector on the ECB+ mentions and keep only spans with a positive score; (3) we generate positive and negative coreference pairs from the predicted mentions and train a pairwise scorer; and (4) at inference, we apply agglomerative clustering on the pairwise similarity scores to form the coreference clusters.

Results
We first evaluate our model under the current evaluation setup (gold mentions, singletons, subtopics) and compare it with two recent neural state-of-the-art models (Barhom et al., 2019; Meged et al., 2020). In addition, we test a dummy singleton baseline, which puts each gold mention in a singleton cluster, and re-evaluate all baselines while omitting singletons. The results in Table 3 show that our model surpasses current state-of-the-art results in the previous settings, supporting its relevance for setting baseline results over predicted mentions. The mention detection performance of our model is 80.1 F1 (recall: 76.0, precision: 84.7).
The results corroborate the importance of our proposed evaluation enhancements. First, performance drops dramatically when using predicted mentions (e.g., from 71.1 to 54.4 F1 at the subtopic level). Second, for all models, the results are significantly higher when including singletons in the coreference metrics because, as explained in Section 3.1, models are rewarded for singleton prediction. Indeed, our model performs better in mention detection than in coreference linking, confirming the importance of decoupling the evaluation of the two subtasks. Finally, performance is lower at the topic level than at the subtopic level (62.0 vs. 71.1 F1 using gold mentions, and 48.6 vs. 54.4 F1 using predicted mentions), indicating that models struggle with lexical ambiguity (§3.2). Taken together, evaluating over raw text, without singletons, and without clustering documents into fine-grained subtopics leads to a performance drop of 33 F1 points, indicating the vast room for improvement under realistic settings.

Conclusion
We established two realistic evaluation principles for CD coreference resolution: (1) predicting mentions and (2) facing the lexical ambiguity challenge. We also set baseline results for future work on our evaluation methodology using a SOTA model.

Acknowledgment
We thank Shany Barhom for fruitful discussion and sharing code, and Yehudit Meged for providing her coreference predictions. The work described herein was supported in part by grants from Intel Labs, Facebook, the Israel Science Foundation grant 1951/17, the Israeli Ministry of Science and Technology, the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1), and from the Allen Institute for AI.

Ethical Considerations
Model As described in the supplementary material (§C), our cross-document coreference model does not introduce any intentional bias or raise ethical issues, and our experiments were conducted on a single 12GB GPU with relatively low compute time.

A The ECB+ Dataset
Documents in ECB+ were selected from various topics in the English Google News archive, and annotation was performed separately for each topic. ECB+ statistics are shown in Table 4. As opposed to OntoNotes, only a few sentences are exhaustively annotated in each document, and the annotations include singletons. In addition, it is worth noting that the ECB+ authors kept the entities from EECB (Lee et al., 2012) only if they participate in events in the annotated sentences, while leaving out all other entities. Accordingly, "Los Angeles" and "Los Angeles hospital" are marked as coreferent in the sentences "Yesterday in Los Angeles, pin-up icon Bettie Page succumbed to complications..." and "Pinup icon Bettie Page died Thursday evening at a hospital in Los Angeles..." because they refer to the location of the same event. This differs from standard entity coreference resolution, since detecting those entities involves the additional challenge of extracting event participants, for example, using a Semantic Role Labeling system.

B Singleton Effect on Coreference Metrics
Here, we briefly introduce the different evaluation metrics for coreference resolution (MUC, B3, CEAF, and LEA) and explain how singletons affect them. As mentioned in the paper, all of these metrics penalize models for wrongly linking a singleton to a cluster or linking singletons together. However, B3, CEAF, and LEA further reward models for predicting singleton clusters, as explained below.
MUC Introduced by Vilain et al. (1995), MUC is an early link-based evaluation metric for coreference resolution. Recall and precision are measured based on the minimal number of coreference links needed to align gold and predicted clusters, as follows:

$$\text{Recall} = \frac{\sum_{i} \big(|k_i| - |p(k_i)|\big)}{\sum_{i} \big(|k_i| - 1\big)}$$

where $p(k_i)$ is the set of distinct predicted clusters that contain one or more mentions of the gold cluster $k_i$. Precision is obtained by switching the roles of the predicted and gold clusters. Since MUC scores are calculated over coreference links, singletons do not affect this metric, as observed in our illustrative example in the paper (Section 3.1).
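The MUC computation can be sketched as follows (a minimal illustration over clusters represented as sets of mention identifiers; this is our own sketch, not the official scorer):

```python
def muc(pred_clusters, gold_clusters):
    """MUC (Vilain et al., 1995): link-based recall, precision, and F1."""
    def side(key, response):
        num = den = 0
        for k in key:
            k = set(k)
            # p(k): distinct response clusters containing a mention of k
            parts = {frozenset(r) for r in response if k & set(r)}
            covered = {m for m in k if any(m in r for r in response)}
            # unaligned mentions of k count as implicit singleton partitions
            n_partitions = len(parts) + len(k - covered)
            num += len(k) - n_partitions
            den += len(k) - 1
        return num / den if den else 0.0
    recall = side(gold_clusters, pred_clusters)
    precision = side(pred_clusters, gold_clusters)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A singleton cluster contributes |k| − 1 = 0 to both numerator and denominator, which is why MUC is unaffected by singletons.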
B3 B3 (Bagga and Baldwin, 1998) is a mention-based evaluation metric, in which recall and precision correspond to the average of individual mention scores. The recall of a mention is the proportion of its gold coreferring mentions that the system links to it, as follows:

$$\text{Recall} = \frac{1}{N} \sum_{i=1}^{N} \frac{|R_{m_i} \cap K_{m_i}|}{|K_{m_i}|}$$

where $R_{m_i}$ and $K_{m_i}$ are, respectively, the predicted and the gold cluster containing the mention $m_i$, and $N$ is the number of gold mentions. Precision is obtained by switching the roles of the predicted and gold clusters.
Here, all mentions m i (including singleton mentions) are scored in Eq. 2 and participate in the overall recall and precision score. Therefore, a singleton that was successfully predicted will be rewarded 100% in both precision and recall, missing singletons will affect the recall and extra-singletons will affect the precision.
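A minimal B3 sketch over mention sets (our own illustration, not the official scorer):

```python
def b3(pred_clusters, gold_clusters):
    """B3 (Bagga and Baldwin, 1998): mention-based recall and precision,
    averaged over individual mention scores."""
    pred_of = {m: frozenset(c) for c in pred_clusters for m in c}
    gold_of = {m: frozenset(c) for c in gold_clusters for m in c}
    def side(own, other):
        # for each mention, the fraction of its own cluster that is
        # recovered by the cluster assigned on the other side
        scores = [len(own[m] & other.get(m, frozenset())) / len(own[m])
                  for m in own]
        return sum(scores) / len(scores) if scores else 0.0
    recall = side(gold_of, pred_of)
    precision = side(pred_of, gold_of)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A correctly predicted singleton scores 1.0 in both directions, while a missed singleton lowers recall and a spurious one lowers precision, exactly the reward/penalty pattern described above.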
CEAF Introduced by Luo (2005), CEAF assumes that each predicted cluster should be mapped to at most one gold cluster and vice versa. Using the Kuhn-Munkres algorithm, CEAF first finds the best one-to-one mapping $g^*(\cdot)$ of the predicted clusters to the gold clusters, according to a similarity function $\phi$. Given this mapping, predicted clusters are compared to their corresponding gold clusters, as follows:

$$\text{Recall} = \frac{\sum_{r_i \in R} \phi\big(r_i, g^*(r_i)\big)}{\sum_{k_i \in K} \phi(k_i, k_i)}$$

where $R$ is the set of predicted clusters, $K$ the set of gold clusters, $g^*(r_i)$ the gold cluster aligned to the predicted cluster $r_i$, and $\phi$ the similarity function. Precision is obtained by switching the roles of the predicted and gold clusters in the denominator. There are two variants of CEAF, based on the choice of $\phi$: (1) the mention-based CEAF-m, with $\phi(r_i, k_i) = |r_i \cap k_i|$, and (2) the entity-based CEAF-e, with $\phi(r_i, k_i) = \frac{2\,|r_i \cap k_i|}{|r_i| + |k_i|}$. Here again, a predicted singleton cluster that also appears in the gold will obviously be mapped to it and will be rewarded 100% in both recall and precision.
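A brute-force CEAF sketch (our own illustration; practical scorers use the Kuhn-Munkres algorithm rather than enumerating permutations, which is only feasible for tiny examples):

```python
from itertools import permutations

def phi_m(r, k):  # CEAF-m similarity: number of shared mentions
    return len(r & k)

def phi_e(r, k):  # CEAF-e similarity: normalized mention overlap
    return 2 * len(r & k) / (len(r) + len(k)) if r or k else 0.0

def ceaf(pred_clusters, gold_clusters, phi):
    """CEAF (Luo, 2005): score the best one-to-one alignment between
    predicted and gold clusters under the similarity function phi."""
    R = [set(c) for c in pred_clusters]
    K = [set(c) for c in gold_clusters]
    n = max(len(R), len(K))
    R += [set()] * (n - len(R))  # pad with empty clusters so every
    K += [set()] * (n - len(K))  # permutation aligns all clusters
    best = max(sum(phi(R[i], K[j]) for i, j in enumerate(perm))
               for perm in permutations(range(n)))
    recall = best / sum(phi(k, k) for k in K if k)
    precision = best / sum(phi(r, r) for r in R if r)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A gold singleton predicted verbatim aligns to itself with similarity 1 under CEAF-e, so it is fully rewarded in both recall and precision.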
LEA Proposed by Moosavi and Strube (2016), LEA is the most recent of these metrics, designed to overcome shortcomings of the earlier ones, notably the mention identification effect in B3 and CEAF. LEA is a Link-based Entity-Aware metric, which assigns a score to each coreference cluster based on all of its coreference links ($n(n-1)/2$ links for a cluster of size $n$), as follows:

$$\text{Recall} = \frac{\sum_{k_i \in K} |k_i| \cdot \frac{\sum_{r_j \in R} \mathit{link}(k_i, r_j)}{\mathit{link}(k_i)}}{\sum_{k_i \in K} |k_i|}$$

where $\mathit{link}(k_i)$ is the total number of links in the gold cluster $k_i$, $\mathit{link}(k_i, r_j)$ is the number of links of $k_i$ that appear in the predicted cluster $r_j$, and the weighting by $|k_i|$ gives higher importance to larger clusters. Precision is calculated by switching the roles of the gold clusters $K$ and the predicted clusters $R$. Singleton clusters are also rewarded because they are assigned self-links. However, since each cluster score is weighted by the size of the cluster, the singleton effect is less pronounced in LEA, as can be seen in the paper (Table 3).
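A LEA sketch restricted to non-singleton clusters, matching the evaluation protocol proposed in this paper (singleton self-links are deliberately omitted here; the full metric additionally assigns one self-link to each singleton):

```python
def lea(pred_clusters, gold_clusters):
    """LEA (Moosavi and Strube, 2016) over non-singleton clusters: each
    cluster is scored by the fraction of its links that are resolved,
    weighted by cluster size."""
    def links(n):  # number of coreference links in a cluster of size n
        return n * (n - 1) / 2
    def side(key, response):
        key = [set(k) for k in key if len(k) > 1]
        response = [set(r) for r in response if len(r) > 1]
        num = den = 0.0
        for k in key:
            resolved = sum(links(len(k & r)) for r in response)
            num += len(k) * resolved / links(len(k))
            den += len(k)
        return num / den if den else 0.0
    recall = side(gold_clusters, pred_clusters)
    precision = side(pred_clusters, gold_clusters)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```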

C Our Coreference Model
As mentioned in the paper (§4.1), our model is inspired by the single-document coreference resolver e2e-coref (Lee et al., 2017). The e2e-coref model forms coreference clusters by linking each mention to an antecedent span appearing before it in the text. However, in the CD setting there is no linear ordering between the documents. We therefore implement a new model, modifying the clustering method and the optimization function of the original e2e-coref, as elaborated below.

Span Representation Given a set of documents, the first step consists of encoding each document separately using RoBERTa-large (Liu et al., 2019). Long documents are split into non-overlapping segments of up to 512 word-piece tokens, which are encoded independently (Joshi et al., 2019). Then, following Lee et al. (2017), we represent each possible span up to length n as the concatenation of four vectors: the output representations of the span boundary (first and last) tokens, an attention-weighted sum of token representations in the span, and a feature vector denoting the span length. We use g_i to refer to the vector representation of span i.

Mention Scorer
We train a mention detector s_m(i) using a simple MLP on top of these span representations, indicating whether span i is a mention in ECB+. This is possible because singleton mentions are annotated in ECB+ (§A). Unlike e2e-coref, we further keep only the detected mentions, in both training and inference. We also tried the joint approach, but performance dropped by 0.4 CoNLL F1 and the run-time was longer.
Pairwise Scorer Given the predicted mentions, we first generate positive and negative training pairs as follows. The positive instances consist of all pairs of mentions that belong to the same coreference cluster, while the negative examples are sampled (at 20x the number of positive pairs) from all other pairs. This sampling reduces the computation time and limits the imbalance between negative and positive training pairs. Then, for each pair of mentions i and j, we concatenate three vectors: g_i, g_j, and their element-wise product g_i ∘ g_j, and feed them to a simple MLP that outputs a score s(i, j), indicating the likelihood that mentions i and j belong to the same cluster; we optimize it with a binary cross-entropy loss on the pair labels. Due to memory constraints, we freeze the output representations from RoBERTa instead of fine-tuning all parameters.
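The pair-generation step can be sketched as follows (a minimal illustration; the function name and the cluster representation are our own, and only the 20x negative ratio comes from the paper):

```python
import random

def sample_pairs(clusters, neg_ratio=20, seed=0):
    """Build pairwise training examples: positives are all mention pairs
    within a gold cluster; negatives are sampled from cross-cluster pairs
    at neg_ratio times the positive count (20x in the paper)."""
    rng = random.Random(seed)
    mentions = [(m, ci) for ci, c in enumerate(clusters) for m in c]
    pos, neg = [], []
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            (mi, ci), (mj, cj) = mentions[i], mentions[j]
            (pos if ci == cj else neg).append((mi, mj))
    neg = rng.sample(neg, min(len(neg), neg_ratio * len(pos)))
    return pos, neg
```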
Agglomerative Clustering As is common in recent CD coreference models (Yang et al., 2015; Choubey and Huang, 2017; Kenyon-Dean et al., 2018; Barhom et al., 2019; Meged et al., 2020), we apply agglomerative clustering on the pairwise scores s(i, j) to form the coreference clusters at inference time. The agglomerative clustering step merges the most similar cluster pairs until their pairwise similarity score falls below a tuned threshold τ.
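The clustering step can be sketched as follows (an average-link variant; the actual linkage criterion and implementation details may differ from this illustration, and `score(i, j)` stands in for the trained pairwise scorer s(i, j)):

```python
def agglomerative_cluster(mentions, score, tau):
    """Agglomerative clustering over pairwise coreference scores:
    repeatedly merge the two most similar clusters (average-link)
    until the best pairwise similarity falls below the threshold tau."""
    clusters = [[m] for m in mentions]
    while len(clusters) > 1:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = sum(score(i, j) for i in clusters[a] for j in clusters[b])
                s /= len(clusters[a]) * len(clusters[b])
                if best is None or s > best:
                    best, pair = s, (a, b)
        if best < tau:
            break  # no cluster pair is similar enough to merge
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```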
Technical Details We conduct our experiments on a single GeForce GTX 1080 Ti 12GB GPU. Our model has 14M parameters. On average, training takes 30 minutes and inference over the full test set takes 3 minutes.