Conundrums in Entity Coreference Resolution: Making Sense of the State of the Art

Despite the significant progress on entity coreference resolution observed in recent years, there is a general lack of understanding of what has been improved. We present an empirical analysis of state-of-the-art resolvers with the goal of providing the general NLP audience with a better understanding of the state of the art and coreference researchers with directions for future research.


Introduction
The advent of the neural NLP era has revolutionized virtually all areas of NLP research. For entity coreference, many issues that were once thought to be important no longer appear particularly relevant to the current research agenda. Specifically, while a decade ago coreference researchers focused on developing computational models that are complex (e.g., structured models (Fernandes et al., 2012; Björkelund and Kuhn, 2014; Martschat and Strube, 2015)) and knowledge-rich (e.g., those that encode world knowledge (Ponzetto and Strube, 2007; Rahman and Ng, 2011a; Hajishirzi et al., 2013)), nowadays virtually all state-of-the-art resolvers employ a simple model (i.e., the mention-ranking model, which was developed more than a decade ago (Denis and Baldridge, 2008)) and a fairly simple input representation (i.e., contextualized word embeddings) in conjunction with a mechanism for learning representations of entity mention spans such that coreferent mentions have similar representations (Lee et al., 2017, 2018; Kantor and Globerson, 2019; Joshi et al., 2019).
Despite significant progress in the past few years in terms of performance numbers, what seems to be missing is an understanding of what has been improved. This lack of understanding has long been a concern shared by coreference researchers, even before the neural revolution in NLP, and has led to several attempts to analyze coreference resolvers over the years (Stoyanov et al., 2009; Kummerfeld and Klein, 2013). With the development of neural resolvers, however, this concern has become more serious than ever: the fact that significant progress can be made via learning mention representations with a simple neural mention-ranking model that employs a fairly simple input representation for a task as challenging as coreference resolution (CR) is somewhat contrary to common wisdom.
In light of this apparent conundrum, we present an empirical analysis of state-of-the-art entity coreference resolvers through four major sets of experiments in this paper, with the goal of gaining insights into their behaviors. We believe that our analysis will not only provide the general NLP audience with a better understanding of the state of the art, but also provide coreference researchers with directions for future research.

Evaluation Setup
In this section, we describe the datasets, the evaluation metrics, the state-of-the-art resolvers and the hyperparameters used in our experiments.
Datasets. We report results on three coreference datasets. The NIST-sponsored ACE evaluations resulted in several datasets; we use ACE 2005 (Walker et al., 2006), the last in the series. The ACE 2005 organizers made only the official training set (not the official test set) publicly available, so previous work defined different train-test splits over the official training set. We employ the same train-test split as Bansal and Klein (2012).
The second dataset comes from the TAC KBP evaluations. As part of the KBP evaluations of information extraction tasks (e.g., event extraction, event CR), the organizers have made available several corpora that include entity coreference annotations. For training, we use three such corpora (LDC2015E29, LDC2015E68, and LDC2016E64). For evaluation, the KBP 2017 organizers have made available the official test set for the event CR task (LDC2017E51), which also includes entity coreference annotations; we use it as our test set.
OntoNotes (Hovy et al., 2006), which was developed circa 2006, is the most widely-used dataset for entity coreference evaluations. It has a standard train-dev-test split. Unlike in ACE and KBP, singleton clusters are not annotated in OntoNotes.
The key difference between these three corpora is that OntoNotes supports "unrestricted" CR: coreference links are annotated between entity mentions regardless of their entity types. In contrast, coreference links are only annotated between mentions belonging to one of the seven entity types in ACE and one of the five entity types in KBP. Statistics on these corpora are shown in Table 1.

Evaluation metrics. Following the convention established in the CoNLL 2011 and 2012 shared tasks (Pradhan et al., 2011, 2012), we use as our primary coreference evaluation measure the CoNLL score, the unweighted average of the F-scores provided by three popular metrics: the link-based MUC metric (Vilain et al., 1995), the mention-based B³ metric (Bagga and Baldwin, 1998), and the entity-based CEAF_e metric (Luo, 2005). We obtain these scores using the official CoNLL scorer (Pradhan et al., 2014). (LEA (Moosavi and Strube, 2016) is a coreference evaluation metric recently designed to address shortcomings of B³ and CEAF_e, but we found no difference in the performance trends in our experiments according to CoNLL and LEA; see the Appendix for the LEA results.)

Mention detection (MD) is the task of extracting the mentions in a text needed for entity CR. A key observation made in the CoNLL shared tasks was that the performance of resolvers was limited by MD, so it is important to examine the extent to which MD performance has improved over the years. We report MD performance in terms of recall, precision, and F-score, considering a system mention correctly detected if and only if its boundary exactly matches that of a gold mention.
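As a concrete illustration of these conventions, here is a minimal sketch (not the official scorer) of the CoNLL score and of exact-boundary MD scoring; spans are assumed to be (start, end) token offsets:

```python
def f1(p, r):
    # Harmonic mean of precision and recall.
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def conll_score(muc_f, b3_f, ceafe_f):
    # CoNLL score: unweighted average of the MUC, B-cubed, and CEAF_e F-scores.
    return (muc_f + b3_f + ceafe_f) / 3.0

def md_scores(system_spans, gold_spans):
    # Exact-boundary mention detection: a system mention counts as correct
    # iff its (start, end) offsets exactly match those of a gold mention.
    correct = len(set(system_spans) & set(gold_spans))
    r = correct / len(gold_spans) if gold_spans else 0.0
    p = correct / len(system_spans) if system_spans else 0.0
    return r, p, f1(p, r)
```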
Systems. We evaluate five variants of three state-of-the-art neural resolvers, all of which employ a ranking model in which all candidate antecedents are ranked against each other for a given anaphor.
The first resolver, the Stanford neural resolver (Clark and Manning, 2016), takes as input the set of entity mentions identified for a given document by a rule-based MD system and uses reinforcement learning to train a simple mention ranker consisting of three hidden layers of ReLU units and a fully-connected final layer.
The other two resolvers are developed by Lee et al. (2018) and Joshi et al. (2019). We evaluate two variants of Lee et al.'s span-based resolver that differ in the word embeddings they use: GloVe embeddings (henceforth GloVe) and ELMo embeddings (henceforth ELMo). Joshi et al.'s resolver replaces the encoder with SpanBERT (Joshi et al., 2020), which is designed to better represent text spans than BERT. The two variants of Joshi et al. differ in the size of the transformer: SpanBERT-base (henceforth SpanBERT-b) employs a smaller transformer, while SpanBERT-large (henceforth SpanBERT-l) employs a larger one. We use the publicly-available implementation of each of these resolvers.

There is one caveat, however. Recall that the span-based resolvers were all evaluated on OntoNotes. Since singleton clusters are not annotated in OntoNotes, all singleton clusters predicted by a resolver are removed from its output before it is sent to the scoring program. In contrast, singleton clusters that contain mentions belonging to one of the ACE/KBP entity types are annotated in ACE/KBP, so these mentions should not be removed from a resolver's output. However, span-based resolvers cannot distinguish between spans that correspond to entity mentions and those that do not. To address this problem, we extend the span-based models so that they are jointly trained to predict entity mention spans and coreference links. Specifically, the feedforward neural network that is responsible for scoring a span in these models does not receive direct feedback on whether a span corresponds to an entity mention. We first turn it into a mention detector by training it in a supervised manner with a cross-entropy loss, so that it predicts a positive mention score for a span if and only if the span corresponds to an entity mention. Then, to jointly learn MD and CR in the span-based resolvers, we employ a loss function that is the unweighted sum of the coreference loss and the MD loss.

Hyperparameter tuning. To ensure a fair comparison of the resolvers, we tune their hyperparameters to maximize the CoNLL score on development data.
Note, however, that the authors of Stanford, ELMo, SpanBERT-b, and SpanBERT-l reported the best hyperparameter settings on OntoNotes in the original papers (Lee et al., 2018; Joshi et al., 2019), so we simply use them in our experiments and focus on tuning the hyperparameters for the remaining cases. We adopt the set of hyperparameters to be tuned from the original papers.
For Stanford, there are three hyperparameters to tune: α_WL, α_FA, and α_FN, the weights associated with three different types of mistakes made by the coreference model (wrong link, false anaphor, and false new errors, respectively). Following Clark and Manning (2016), we fix α_WL = 1.0 and search for α_FA and α_FN over {0.1, 0.2, ..., 1.5} using a variant of grid search. For ACE, (α_WL, α_FA, α_FN) = (1.0, 0.5, 1.0) is the best configuration; for KBP, it is (1.0, 0.5, 0.8). For OntoNotes, we use the configuration found by Clark and Manning, which is (1.0, 0.5, 0.8).
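The search procedure can be sketched as follows. This is a plain exhaustive grid (the paper uses a variant of grid search), and `dev_conll_score` is a hypothetical callable that trains a model with the given weights and returns its CoNLL score on development data:

```python
import itertools

def tune_error_weights(dev_conll_score, a_wl=1.0, grid=None):
    # Grid search over the false-anaphor (FA) and false-new (FN) error
    # weights, with the wrong-link (WL) weight fixed at 1.0 as in
    # Clark and Manning (2016). `dev_conll_score(wl, fa, fn)` is assumed
    # to return the development-set CoNLL score for those weights.
    grid = grid or [round(0.1 * i, 1) for i in range(1, 16)]  # 0.1 .. 1.5
    best = max(itertools.product(grid, grid),
               key=lambda fa_fn: dev_conll_score(a_wl, *fa_fn))
    return (a_wl,) + best
```

In practice one would prune the grid (e.g., coarse-to-fine) rather than evaluate all 225 configurations, since each evaluation requires training a resolver.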

Performance across Datasets
We first provide the reader with a high-level understanding of the state of the art by analyzing the five resolvers' performance on the three datasets.
Performance across datasets. Results on the three datasets, reported in terms of the CoNLL score, are shown in the CoNLL column of Table 3. Although the five resolvers were previously evaluated only on OntoNotes, their relative performances are consistent across the datasets. In particular, the use of ELMo embeddings enables ELMo to outperform GloVe by 3.8-4.9% points, SpanBERT-b outperforms ELMo by 1.6-4.4% points, and SpanBERT-l further outperforms SpanBERT-b by 2.2-4.5% points.
Source of performance improvements. Do the above improvements stem from improved recognition of coreference links, improved recognition of singleton clusters, or both? To understand whether these resolvers have improved in terms of link prediction, we examine the MUC F-scores (see the MUC column), which are computed solely on coreference links. The MUC scores increase consistently down the table across all datasets, meaning that later systems are indeed better at identifying coreference links. To understand whether later resolvers are also better at identifying singleton clusters, we show in the Singleton column the percentage of singleton clusters that are correctly recalled. Again, the scores increase down the table, and the improvement is particularly large from GloVe to ELMo and from SpanBERT-b to SpanBERT-l.

Mention detection performance. First, MD performance has improved significantly over the years. SpanBERT-l achieves an MD F-score of 88.2 on OntoNotes, which is significantly higher than the best MD F-score achieved in the CoNLL-2012 shared task (77.7). Note that Stanford's mention detector performs substantially worse than those of the other resolvers, especially on ACE and KBP. The reason is that Stanford employs a rule-based MD system that was initially developed when the Stanford NLP Group participated in the CoNLL-2011 shared task, whereas in the other resolvers MD is jointly trained with CR. Overall, MD performance appears to have a significant impact on CR performance. In particular, joint MD and CR in the span-based resolvers seems to be a driving force behind the rapid coreference performance improvements we have seen in recent years. (Owing to space limitations, we show only the most important scores in Table 3; detailed results, e.g., the B³ and CEAF_e results, can be found in the Appendix.)

Using Oracles
Can the performance of coreference resolvers be further improved if we improve MD? Being able to answer this kind of question is important: if further improvements in MD can yield significant gains in coreference performance, then future research efforts should perhaps focus on MD.

To answer such questions, we perform oracle experiments: we provide a resolver with a particular type of perfect information (e.g., gold mentions as input) and measure how much performance improvement can be obtained.

Gold Mention Boundaries
Our first oracle experiment concerns training and testing our resolvers on gold mention boundaries. While this experiment has been conducted over the years by numerous researchers (e.g., Peng et al. (2015), Zhang et al. (2018)), we are primarily interested in understanding whether further improving an MD component that already has an F-score of more than 85% can improve coreference performance. For the four span-based models, we disable the component in the span representation layer that is responsible for proposing spans (i.e., mention boundaries) and instruct them to use gold mention spans instead. Note, however, that the representation of a span is still learned during training. In other words, although all resolvers are given gold mention spans, the span representations used during resolution will still differ across the span-based resolvers.
Table 4: CoNLL scores of the resolvers using gold mention boundaries, perfect anaphoricity, and gold entity types.

Results, expressed in terms of the CoNLL score, are shown in the Gold Mention Boundaries column in Table 4. First, despite recent significant improvement in MD, these results suggest that coreference performance can still be significantly improved just by improving MD: for the best resolver (SpanBERT-l), the CoNLL score can be improved by 8.4-12.3% points. Second, the relative performances of the resolvers are consistent across the three datasets: the CoNLL scores increase as we go down the table. Since the four span-based resolvers use essentially the same (mention-ranking) model for resolution and the same algorithm for weight updates, their performance differences can be attributed largely to differences in the pretrained embeddings and the encoder. In addition, these results suggest that the coreference performance improvements we observed in recent years can be attributed not only to improved mention (boundary) detection but also to improved resolution accuracy, presumably as a result of better span representations.

Perfect Anaphoricity
Anaphoricity determination, a.k.a. discourse-new detection (Poesio et al., 2004), is the task of determining whether a mention is coreferent with another mention that appears earlier in the text. Being able to identify non-anaphoric mentions could improve the precision of coreference resolvers, as any antecedent chosen for them is erroneous.
In this oracle experiment, we provide a resolver with perfect anaphoricity information, meaning that we know for every entity mention whether it is anaphoric or not. We use this perfect anaphoricity information during resolution: we will resolve all and only those mentions that are anaphoric.
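The way perfect anaphoricity is used during resolution can be sketched as follows; `pick_antecedent` is a hypothetical stand-in for the resolver's antecedent scorer:

```python
def resolve_with_oracle_anaphoricity(mentions, is_anaphoric, pick_antecedent):
    # With perfect anaphoricity, we resolve all and only anaphoric mentions:
    # non-anaphoric mentions are forced to the dummy (None) antecedent.
    # `pick_antecedent(i)` is a (hypothetical) callable returning the model's
    # best antecedent index for mention i.
    links = {}
    for i, _ in enumerate(mentions):
        links[i] = pick_antecedent(i) if is_anaphoric[i] else None
    return links
```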
Results are shown in the Perfect Anaphoricity column of Table 4. A few points deserve mention. First, all resolvers improved on all datasets when provided with perfect anaphoricity information. These results imply that anaphoricity determination remains an important issue in CR research, and further improvements in anaphoricity determination can improve CR. However, the gains that state-of-the-art resolvers can achieve by improving anaphoricity determination are generally smaller than those achievable by improving MD: the CoNLL scores of the span-based resolvers increase by 3.5-4.3% points on ACE, 4.4-9% points on OntoNotes, and 3.9-5% points on KBP. This is understandable, as MD is likely to improve both coreference precision and recall, whereas anaphoricity determination can only improve precision. Note that Stanford's poor performance on ACE and KBP is due to poor MD.

Gold Entity Types
In this experiment, we assume that a resolver is given gold entity types (i.e., semantic classes) such as PERSON, ORGANIZATION, and LOCATION. The set of entity types to be provided is corpus-dependent. As mentioned before, ACE and KBP only have seven and five entity types, respectively. In OntoNotes, however, only named entities are annotated with (one of 18) entity types. Consequently, we automatically derive entity types for pronouns and nominals using gold coreference chains: if a pronoun or a nominal appears in a coreference cluster that contains a name, we derive its entity type from that of the name. This method allows us to derive the entity type of 36.4% of the nominals and 70% of the pronouns. Any pronoun or nominal whose entity type cannot be derived using this method is assigned the entity type UNKNOWN. While this method does not provide full coverage, we will still be able to examine whether having access to perfect entity types on a subset of the mentions enables us to improve the performance of a resolver on OntoNotes.
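A minimal sketch of this type-derivation procedure, assuming mentions are represented by integer ids and only named mentions carry annotated types:

```python
UNKNOWN = "UNKNOWN"

def derive_entity_types(clusters, name_types, mention_kind):
    # `clusters`: gold coreference clusters (lists of mention ids).
    # `name_types`: annotated entity types of the named mentions.
    # `mention_kind`: "name", "nominal", or "pronoun" for each mention.
    # A pronoun or nominal inherits the type of a name in its cluster;
    # otherwise it is typed UNKNOWN.
    types = dict(name_types)
    for cluster in clusters:
        cluster_type = next(
            (name_types[m] for m in cluster if m in name_types), None)
        for m in cluster:
            if mention_kind[m] != "name":
                types[m] = cluster_type if cluster_type else UNKNOWN
    return types
```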
We use entity types during resolution: we disallow a candidate antecedent from being selected as the antecedent for a given anaphor if the two have different entity types.

Results are shown in the Gold Entity Types column in Table 4. As we can see, all resolvers improved on all datasets when provided with gold entity types. Compared with the gains achieved using gold mention spans or perfect anaphoricity, the gains that come with the use of gold entity types are smaller: for the span-based resolvers, the CoNLL scores increase by 1.5-1.7% points on ACE, 1.8-3.7% points on OntoNotes, and 1.9-2.5% points on KBP. In other words, state-of-the-art resolvers can be improved by improving the determination of entity types.
These results are particularly interesting in light of a conundrum in entity CR: while some researchers have reported successes with improving entity CR using automatically computed semantic information (Ng, 2007), there have also been numerous failed attempts (Kehler et al., 2004;Durrett and Klein, 2013;Sapena et al., 2013). Although the semantic information we use in this paper is restricted to gold entity types, our results suggest that hand-annotated semantic information is indeed useful, and the (non-)utility of semantics for CR reported in earlier work could be attributed to the noise inherent in computing semantic information.

Results on Resolution Classes
To gain additional insights into the state-of-the-art resolvers, we analyze their performance on different types of entity mentions. More specifically, motivated by Stoyanov et al. (2009), we partition the gold mentions into different resolution classes. While previous work has focused mainly on three coarse-grained resolution classes (namely, pronouns, names, and nominal mentions), we employ the 13 fine-grained resolution classes defined by Rahman and Ng (2011b), as discussed below.

Names. Four classes are defined for gold names.
(1) e: a name is assigned to this exact string match class if there is a preceding mention such that the two are coreferent and are the same string; (2) p: a name is assigned to this partial string match class if there is a preceding mention such that the two are coreferent and have some content words in common; (3) n: a name is assigned to this no string match class if there is no preceding mention such that the two are coreferent and have some content words in common; and (4) na: a name is assigned to this non-anaphor class if it is not coreferent with any preceding mention.

Nominal mentions. Four analogous resolution classes are defined for gold mentions whose head is a nominal: (5) e; (6) p; (7) n; and (8) na.

Pronouns. We have five pronoun classes: (9) 1/2: 1st and 2nd person pronouns; (10) G3: gendered 3rd person pronouns (e.g., she); (11) U3: ungendered 3rd person pronouns; (12) oa: any anaphoric pronouns that do not belong to (9), (10), or (11) (e.g., relative pronouns); and (13) na: non-anaphoric pronouns (e.g., pleonastic pronouns).

Table 5 shows the performance of each resolver on each resolution class. To avoid overwhelming the reader, we show only the results of ELMo and SpanBERT-l, which allow us to gain insights into what made SpanBERT-l better. Specifically, for each resolution class C, we show each resolver's MD recall (the percentage of gold mentions in C that are correctly recalled) under MD and its resolution accuracy (the percentage of correctly identified anaphors in C that are correctly resolved) under RA. Under Size we show the percentage of gold mentions belonging to each resolution class.
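The assignment of names to the four classes above can be sketched as follows (the stopword list used to identify content words is an illustrative assumption):

```python
def content_words(mention, stopwords=frozenset({"the", "a", "an", "of"})):
    # Lowercased words of the mention, minus a small stopword list.
    return {w.lower() for w in mention.split() if w.lower() not in stopwords}

def name_resolution_class(name, preceding_coreferent):
    # `preceding_coreferent`: the preceding mentions coreferent with `name`.
    # 'na': non-anaphoric; 'e': exact string match with some preceding
    # coreferent mention; 'p': partial content-word overlap; 'n': neither.
    if not preceding_coreferent:
        return "na"
    if any(m == name for m in preceding_coreferent):
        return "e"
    if any(content_words(m) & content_words(name) for m in preceding_coreferent):
        return "p"
    return "n"
```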
First, if we consider only the three coarsegrained resolution classes, the results are perhaps not surprising: name resolution is the easiest and nominal resolution is the hardest.
Second, consider the 13 fine-grained resolution classes. By design, the names and the nominals in the 'e' class should be easier to resolve than those in 'p', which in turn should be easier to resolve than those in 'n'. The results are consistent with this intuition. Results on the anaphoric pronoun classes are also consistent with our intuition: gendered 3rd person pronouns are the easiest to resolve, followed by 1st/2nd person pronouns and then ungendered 3rd person pronouns.
Third, these results reveal that the difficulty of anaphoricity determination stems primarily from pronouns: while resolution accuracies on non-anaphoric names and nominal mentions are above 89%, those on non-anaphoric pronouns are only between 65.9% and 77.6%. Note that we consider a non-anaphoric mention correctly "resolved" if it is resolved to the dummy antecedent.
Finally, SpanBERT-l achieves better resolution accuracies than ELMo for all resolution classes on all datasets. Encouragingly, the harder a resolution class is, the bigger the improvement. These results clearly show that we are making progress on resolving anaphors that are traditionally considered difficult to resolve. Note that part of this improvement can be attributed to improved MD, which increases the likelihood that the correct antecedent of an anaphor is present in its list of candidate antecedents. Additional experiments are needed, however, to determine the impact of improved MD on the improvement in resolution accuracies.

Perturbation Experiments

To probe what the resolvers rely on, we train them on perturbed training data and measure the effect on test performance. We divide the different kinds of perturbations into two broad categories: mention-internal perturbations and mention-external perturbations.

Mention-internal Perturbations
Mention-internal perturbations involve making changes to the words within an entity mention.

Perturbations to Names
We consider two kinds of perturbations to names.
Unseen names. We replace each name in a training document with a name that is highly unlikely to appear in any test set. With this replacement, all the names in the test set will be unseen w.r.t. the training set. When a resolver is trained on this perturbed training set, we can determine the extent to which the learning algorithm relies on memorizing seen names (as opposed to generalizing from their contexts) when performing MD and CR: if a learner memorizes heavily, it will likely perform poorly on MD (i.e., its recall will suffer) and subsequently on CR. We perform name replacement deterministically: we replace each word in a name with the word obtained by reversing the order of its characters. Note that person prefixes (e.g., "Mr."), organization words and suffixes (e.g., "Airlines", "Inc."), and location nouns (e.g., "River") are not replaced, as the goal is to introduce unseen names rather than change the type of a name. In addition, any word in a name that appears in a nominal mention in the training set is not replaced. For instance, the word "Church" in "Baptist Church" appears in a nominal in the training set and is therefore not replaced. This ensures that only the "name" part of a mention is changed to something not previously seen.

Names of a different type. In this experiment, we replace each name, ne1, in a training document with another name, ne2, that satisfies two conditions. First, ne2, like ne1, should appear in the training set. This ensures that the number of names in the test set that are unseen w.r.t. the training set does not change. Second, the two names should have different entity types.
Importantly, the replacement is deterministic, meaning that (1) all occurrences of ne1 are replaced with the same name (i.e., ne2), and (2) any name coreferent with ne1 (but not lexically identical to it, such as "Trump" and "President Trump") is replaced with a name coreferent with ne2. These conditions together ensure that only the names and their types change; their coreference relationships do not. The choice of ne2 is random subject to these conditions, so we repeat the experiment three times and report the average result.
With this replacement, the resulting training documents may no longer make sense to a human reader, as a PERSON name may appear in a context for an ORGANIZATION name. In particular, the contexts in which names of a certain type (e.g., PERSON) appear in the training set will differ from those in which such names appear in the test set. This experiment allows us to determine the extent to which a resolver makes use of contextual information when identifying coreference links involving names: if it makes heavy use of contextual information, we should see a considerable drop in resolver performance.
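The deterministic "unseen names" replacement described above can be sketched as follows; the `KEEP` set here stands in for the full word lists given in the appendix (Table 8):

```python
# Illustrative subset of the person prefixes, organization words/suffixes,
# and location nouns that are exempt from replacement (see Table 8).
KEEP = {"mr.", "mrs.", "dr.", "inc.", "airlines", "river"}

def perturb_name(name, nominal_vocab, keep=KEEP):
    # Deterministically replace each word of a name with its character
    # reversal, leaving exempt words and words that also occur in
    # training-set nominal mentions untouched.
    out = []
    for w in name.split():
        if w.lower() in keep or w.lower() in nominal_vocab:
            out.append(w)
        else:
            out.append(w[::-1])
    return " ".join(out)
```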

Perturbations to Nominal Mentions
We consider three kinds of perturbations to nominal mentions to determine the roles they play.
Unseen nominals. This experiment has the same setup as the "Unseen names" experiment above, except that we replace each nominal mention in the training set with another mention in which we reverse the order of the characters of each of its words. Note that this is a mention-internal perturbation, meaning that we replace all and only those nominals that are annotated as entity mentions, not all nominals in the training set.
Nominals of a different type. This experiment has the same setup as the "Names of a different type" experiment above, except that we replace nominal mentions rather than names. As in the previous experiment, we replace each nominal that is annotated as an entity mention in the training set.
Nominals of the same type. This experiment has the same setup as the "Nominals of a different type" experiment above, except that the nominal mention being replaced must have the same entity type as its replacement. This kind of perturbation is "milder" than the previous kind of perturbation, as a PERSON mention will continue to appear in a PERSON context after the replacement. In other words, if a machine learner does not pay attention to the semantic compatibility between a nominal mention and its context, then we should see little performance difference when a resolver is trained on this training set vs. the previous training set (i.e., the one from "Nominals of a different type").

Mention-external Perturbations
Mention-external perturbations involve making changes to the words outside a mention.

Perturbations to Verbs
We consider two kinds of perturbations to verbs to determine the role they play in resolution.

Unseen verbs. This experiment has the same setup as the two "Unseen" experiments above, except that we replace each verb in the training set that is not part of an entity mention.

Seen verbs. This experiment has the same setup as the "Names of a different type" experiment above, except that we replace verbs outside of entity mentions rather than names. In particular, the new verb is not constrained to have the same type as the verb being replaced: it can be any verb taken from the training set. Nevertheless, the replacement is deterministic: all occurrences of a given verb are replaced with the same verb.
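The deterministic "seen" replacement scheme can be sketched as follows; a seeded RNG builds the word-to-word mapping once, so every occurrence of a word receives the same replacement:

```python
import random

def deterministic_replacement(tokens, target_vocab, seed=0):
    # Replace every occurrence of a word in `target_vocab` (e.g., the set
    # of verbs seen in training) with another word drawn from that same
    # vocabulary; all occurrences of a given word map to one replacement.
    rng = random.Random(seed)
    vocab = sorted(target_vocab)
    mapping = {w: rng.choice(vocab) for w in vocab}
    return [mapping.get(t, t) for t in tokens]
```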

Perturbations to Adjectives & Adverbs
We consider two kinds of perturbations to adjectives and adverbs to determine the roles they play.

Unseen adjectives and adverbs. This experiment has the same setup as the three "Unseen" experiments above, except that we replace each adjective and adverb in the training set that is not part of an entity mention.

Seen adjectives and adverbs. This experiment has the same setup as the "Seen verbs" experiment above, except that we replace adjectives and adverbs outside of entity mentions rather than verbs.

Perturbation Results
Results of these experiments on the three datasets are shown in Table 7. As in Table 5, we only show the results of ELMo and SpanBERT-l. For each resolver, we show its CR CoNLL score and its MD F-score. To facilitate comparison, we show in row 1 the performance of the resolvers when the input is not perturbed.

A few points deserve mention. First, mention-internal perturbations (rows 2-6) triggered larger deterioration in CR performance than mention-external perturbations. These results suggest that the resolvers rely more on the mentions themselves than on their contexts for resolution, which should not be surprising. Among the mention-internal perturbations, the biggest CR performance drops occur with the Unseen perturbations (rows 2 and 4), particularly those involving unseen names, followed by perturbations that replace a seen name or nominal with one of a different type. A closer inspection of the results reveals a strong correlation between CR performance and MD performance: larger drops in CR performance are always accompanied by larger drops in MD performance. This sheds light on why the Unseen perturbations triggered the largest drop in CR performance: when none of the names or nominal mentions in the test set are seen in the training set, the mention detector is likely to perform poorly on the test set. In contrast, when they are replaced by mentions of a different entity type, the percentage of unseen mentions in the test set does not change, thus posing fewer problems for the mention detector.
Second, as for the mention-external perturbations, no clear patterns emerged: while verb replacement (rows 7-8) has a greater impact than adjective and adverb replacement (rows 9-10) for OntoNotes, the same observation cannot be made for the other datasets. Moreover, while replacing a word with another seen word is generally expected to cause less harm to MD (and thus CR) performance than replacing it with an unseen word, these experiments show that this is not necessarily the case. These results suggest that the mention span learner is not particularly sensitive to the verbs, adjectives, and adverbs that appear in the context. Third, it is not easy to conclude which resolver is more robust to perturbations. While the drops in CR performance on ACE are fairly mild for both resolvers, we see bigger drops on the other two datasets. In particular, ELMo suffers a bigger drop in performance than SpanBERT-l on OntoNotes, whereas the reverse is true on KBP. Finally, while the two resolvers' MD performances are similar, SpanBERT-l's CR performance is always superior to ELMo's. These results reveal once again that the mention representations learned by SpanBERT-l are indeed better than those learned by ELMo as far as resolution is concerned.

Conclusions
While space limitations preclude a reiteration of all the observations we have made, we believe the key conclusions are: (1) the relative performances of the resolvers are consistent across datasets; (2) for each resolver, higher mention detection performance always yields better coreference performance; (3) the newest resolvers perform better because of not only improved mention detection but also improved mention span representations, and they improve the resolution of both easy- and difficult-to-resolve anaphors; (4) all resolvers can be improved by improving mention detection, anaphoricity determination, and entity type detection; and (5) our perturbation results suggest that coreference performance is most sensitive to those words/phrases in the input that have the greatest impact on mention detection performance.

A Word Lists for the Perturbation Experiments

Table 8 shows the list of person prefixes, organization words and suffixes, and location nouns used in the perturbation experiments.

B Results from Different Evaluation Metrics
Recall that owing to space limitations, results of the different resolvers were expressed only in terms of the CoNLL score, the MUC F-score, the percentage of singleton clusters recalled, and the mention detection F-score. Table 9 provides detailed coreference results expressed in terms of recall (R), precision (P), and F-score (F) obtained via the different evaluation metrics (i.e., MUC, B³, CEAF_e, and LEA). In addition, mention detection performance is expressed in terms of R, P, and F. As can be seen, regardless of which coreference evaluation metric is used, F-score consistently increases down the table for each dataset. These results provide suggestive evidence that the improvements achieved by each resolver over the previous ones are robust. As for mention detection, improvements in F-score are largely accompanied by improvements in both recall and precision.