Understanding Mention Detector-Linker Interaction in Neural Coreference Resolution

Despite significant recent progress in coreference resolution, the quality of current state-of-the-art systems still considerably trails behind human-level performance. Using the CoNLL-2012 and PreCo datasets, we dissect the best instantiation of the mainstream end-to-end coreference resolution model that underlies most current best-performing coreference systems, and empirically analyze the behavior of its two components: mention detector and mention linker. While the detector traditionally focuses heavily on recall as a design decision, we demonstrate the importance of precision, calling for their balance. However, we point out the difficulty in building a precise detector due to its inability to make important anaphoricity decisions. We also highlight the enormous room for improving the linker and show that the rest of its errors mainly involve pronoun resolution. We propose promising next steps and hope our findings will help future research in coreference resolution.


Introduction
Coreference resolution identifies mentions in a document that co-refer to the same entity. It is an important task facilitating many applications such as reading comprehension (Dasigi et al., 2019) and text summarization (Azzam et al., 1999). Lee et al. (2017) proposed the first neural end-to-end architecture for coreference resolution. Most recent systems use it as a backbone and employ better scoring functions (Zhang et al., 2018), pruning procedures (Lee et al., 2018), or token representations (Joshi et al., 2019, 2020). Despite this usage, little in-depth analysis has been done to better understand the inner workings of such an influential system. Xu and Choi (2020) analyzed the effect of high-order inference, while Subramanian and Roth (2019) and Zhao et al. (2018) respectively examined its generalizability and gender bias. Little work has inspected the interaction between its components. Lu and Ng (2020) conducted oracle experiments related to ours, but without fine-grained control over confounding factors affecting oracle mentions. Such an understanding is important: for example, Kummerfeld and Klein (2013)'s dissection of the then-best classical coreference systems inspired many important follow-up works (Peng et al., 2015; Martschat and Strube, 2015; Wiseman et al., 2016, inter alia). However, it is unknown whether observations on such classical feature-based and often pipelined systems extend to current neural end-to-end models.
We consider the best instantiation of this model family, SpanBERT (Joshi et al., 2020) + c2f-coref (Lee et al., 2018), and investigate the interaction between its two components: the mention detector and the mention linker. We study how their errors independently or jointly affect the final clustering.
Using the CoNLL-2012 (Pradhan et al., 2012) and PreCo (Chen et al., 2018) datasets, we highlight the low-precision, high-recall nature of the detector. While traditionally only recall is emphasized for the detector as a design decision (Lee et al., 2011; Lee et al., 2017), we show huge degradation from noisy mentions and, perhaps surprisingly, that increasing the number of candidates considered by the baseline linker only deteriorates performance. While some classical coreference pipelines focused on detector precision (Uryupina, 2009), it is rarely emphasized for current end-to-end systems. We hence stress the importance of a precision-recall balance for the detector and demonstrate how pruning hyperparameters, in addition to reducing computational complexity, control this trade-off. However, we show the difficulty of obtaining a precise detector by demonstrating the importance of anaphoricity decisions and the detector's inability to make them. Finally, we highlight the high potential of the linker and show that its remaining errors mainly involve pronoun resolution. We hope this work sheds light on the internals of the mainstream coreference system and, with our proposed next steps, catalyzes future research. We believe some of our findings may also transfer to other tasks with a similar joint span detection and span (pair) classification architecture, such as SRL (He et al., 2018), IE (Luan et al., 2019), and entity linking (Kolitsas et al., 2018). See Jiang et al. (2020), which subsumes many other tasks under such a span-based framework.

Background
Model We study the coarse-to-fine coreference system (c2f-coref; Lee et al., 2018). It assigns an antecedent to every span in a document of length T, including a dummy antecedent that indicates non-mentions or non-anaphoric mentions. The final clustering is the transitive closure of connected spans. The system consists of a mention detector and a mention linker. The detector scores all O(T^2) spans up to length L and outputs the λT highest-scoring spans as possibly anaphoric mentions. The linker links each mention candidate with the highest-scoring antecedent among the K nearest preceding candidates. The hyperparameters L, λ, and K control the number of considered spans and antecedents, reducing computational complexity.
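As a minimal sketch of this pipeline (the scoring functions here are hypothetical placeholders for the model's learned neural scorers over SpanBERT representations):

```python
def candidate_spans(T, L):
    """All O(T^2) spans (start, end), inclusive, of width at most L."""
    return [(i, j) for i in range(T) for j in range(i, min(i + L, T))]

def detect(T, mention_score, L=30, lam=0.4):
    """Keep the lambda*T highest-scoring spans as mention candidates."""
    spans = sorted(candidate_spans(T, L), key=mention_score, reverse=True)
    return spans[: int(lam * T)]

def link(candidates, pair_score, K=50):
    """Link each candidate to its best-scoring antecedent among the K
    nearest preceding candidates; the dummy (None, score 0) marks
    non-mentions and non-anaphoric mentions."""
    candidates = sorted(candidates)
    links = {}
    for idx, m in enumerate(candidates):
        best, best_score = None, 0.0  # dummy antecedent scores 0
        for a in candidates[max(0, idx - K): idx]:
            s = pair_score(a, m)
            if s > best_score:
                best, best_score = a, s
        links[m] = best
    return links
```

The final clustering is then the transitive closure of the returned links; `mention_score` and `pair_score` stand in for the learned scorers.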
Data CoNLL-2012 is the most common dataset for testing coreference models. However, it lacks singleton mention annotation (Pradhan et al., 2012).
Singleton, or non-anaphoric, mentions do not co-refer with other spans, e.g., "The dog" in "[The dog] barks." However, they may become anaphoric in another context, e.g., "[The dog] barks at [itself]." Being a mention is a span's inherent property, while anaphoricity, whether or not a mention co-refers, is context-dependent. We use "all mentions" to refer to the union of singleton and anaphoric mentions.
To understand the effect of singleton mentions, we heuristically generate all mentions for CoNLL-12 (§B) for relevant experiments. We also experiment with PreCo, a coreference dataset with annotated singleton mentions. We do all analyses on development sets and report dataset statistics in §A.

Experiments
Settings We embed tokens with SpanBERT-large, a pre-trained transformer (Vaswani et al., 2017) with state-of-the-art performance in coreference resolution. We choose L = 30 and λ = 0.4.

Oracles We build oracle detectors where, starting from the original system's mention candidates (its detector output), we either remove all non-gold mentions (perfect precision), add all missing gold mentions (perfect recall), or both (perfect precision & recall). We give the altered, rather than the original, mention candidates to the linker. We consider both anaphoric mentions and all mentions as gold mentions, and either modify the candidates in a post-hoc manner or re-train the system with the altered candidates. To control for a non-trainable detector, we train only a linker reusing the original system's mention candidates, dubbed Fixed Detector. We use this baseline as the comparison target for the oracles.
Besides oracle detectors, we also build an oracle linker that assigns the correct antecedent (including dummy) to each of the λT mention candidates.
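In terms of set operations on the detector's output, the oracle constructions above can be sketched as follows (spans represented as hashable (start, end) pairs; the input format is an assumption):

```python
def perfect_precision(candidates, gold):
    """Remove every non-gold mention from the candidate set."""
    return candidates & gold

def perfect_recall(candidates, gold):
    """Add every missing gold mention to the candidate set."""
    return candidates | gold

def perfect_both(candidates, gold):
    """Perfect precision and recall: exactly the gold mentions."""
    return set(gold)
```

Here `gold` is either the anaphoric-mention set or the all-mention set, giving the two oracle variants; the resulting candidates replace the detector's output either post hoc or before re-training.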

Precision-Recall Trade-Off for the Mention Detector
Traditionally, coreference systems heavily favor recall over precision for the detector (Lee et al., 2011), as the linker cannot recover missed mentions. Similarly, our c2f-coref system achieves >96% anaphoric mention recall yet <40% precision (Table 1). We therefore explore whether detector recall is always more important than precision. If more spans are considered by increasing the max span width L or the number of spans considered per word λ, will system performance necessarily improve? In the extreme case, if we hypothetically had enough compute for the linker to consider all O(T^4) span-antecedent pairs, should we simply remove the pruning in the detector?
The Aggregated Importance of Precision For all oracles in Table 3, fixing precision yields a larger improvement than fixing recall, especially with anaphoric mentions. This highlights the importance of detector precision and the extent to which the linker suffers from noisy mention candidates. In Table 2, we present the F1 improvement after independently fixing categorized errors following Kummerfeld and Klein (2013). Noisy candidates result in extra mention and extra entity errors, fixing which accounts for more than half of the ≈8 F1 gap between the post-hoc perfect precision oracle and the baseline for CoNLL-12 (Table 3). Furthermore, re-training the system to leverage the distributional shift from the absence of noise leads to another ≈4 and 5 F1 increase (CoNLL-12/PreCo).
To analyze how higher detector precision helps the linker, we examine the coreference score the linker assigns to every span-antecedent pair. The anaphoric mention re-trained perfect precision oracle has an average score of -13.0 on CoNLL-12, higher than -15.1 with perfect recall. Among only correct span-antecedent pairs, these scores are 11.7 and 7.1, with the same pattern. This indicates that the noise under perfect recall prevents the linker from reliably assigning high coreference scores, even for correct links. The effect of higher coreference scores also shows in that, compared with perfect recall, the perfect precision oracle produces on average larger (4.44 vs. 4.26 entities) and longer-distance (154 vs. 152 tokens spanned) clusters.
We also see this effect by examining the amount of improvement with reduced noise in Table 2. In the anaphoric mention post-hoc oracles, as expected, fixing precision results in fewer extra mention/entity errors and more missing errors, while the perfect recall oracle behaves conversely. However, when re-trained, the perfect precision oracle has far fewer missing entities, even fewer than with perfect recall. This is surprising, as the latter considers more candidates. The reason is likely that the linker learns to leverage the absence of noise and reliably assigns high coreference scores. Despite some incorrect links leading to more conflated entities, the many correct ones drastically reduce missing mention/entity errors. On the other hand, the noise in the perfect recall oracle or the original system prevents consistently high scores, resulting in more missing mentions and entities. Hence, the improvement with perfect precision partly stems from the linker's increased confidence in assigning coreference scores when not tasked with ignoring non-mentions (and singletons) in noisy candidates.
The Average Importance of Recall The large improvement from fixing precision may be due to its larger original headroom compared to recall (Table 1). We compute the number of operations (span additions/removals) needed for each oracle and the average F1 improvement per operation in Table 4. For anaphoric mentions, recall has 5-8× the average effect of precision (CoNLL-12 with all mentions shows a different pattern, as we noisily generated singletons in a recall-oriented way). If we control the number of operations by re-training an anaphoric mention (semi-)perfect precision oracle that removes only as many top-scoring extra spans as there are missing correct spans (rather than removing all extra spans), it gets 79.08 and 85.01 F1 on CoNLL-12 and PreCo, lower than the perfect recall oracle with 79.65 and 85.22. It is therefore only due to the low-precision, high-recall nature of the original detector that precision is more important in aggregate.
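The per-operation comparison can be made concrete as follows (a sketch; the F1 values would come from evaluating each oracle):

```python
def oracle_operations(candidates, gold, kind):
    """Number of span edits an oracle performs: removals of extra spans
    for perfect precision, additions of missing spans for perfect recall."""
    if kind == "precision":
        return len(candidates - gold)
    if kind == "recall":
        return len(gold - candidates)
    raise ValueError(kind)

def avg_gain_per_op(f1_oracle, f1_baseline, n_ops):
    """Average F1 improvement contributed by each span addition/removal."""
    return (f1_oracle - f1_baseline) / n_ops
```

Because the original detector is low-precision and high-recall, the precision oracle performs many more operations, which is why its per-operation gain is smaller despite its larger aggregate gain.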

Precision-Recall Trade-Off
We return to the original question: if we had more compute, would it always be beneficial to consider more spans in the detector? From our results, while recall is important, an imprecise detector has substantial adverse effects by increasing the linker's learning burden. Indeed, Table 5 shows that increasing the max span width by up to 33% or the number of spans considered per word by up to 38% only degrades performance. As the extra low-scoring spans are mostly noise, we slightly increase recall but more heavily decrease precision, causing more harm than benefit. Hence, besides saving computation, these hyperparameters also balance the precision-recall trade-off. Future work should hence put more emphasis on precision, which is often overlooked in end-to-end systems.

The Detector's Difficulty With Anaphoricity Decisions
Despite its large aggregated improvement, i.e., ≈11.7 and 10.5 F1 for CoNLL-12 and PreCo, perfect anaphoric mention precision requires perfectly distinguishing anaphoric from singleton mentions. These anaphoricity decisions in fact account for most of the improvement, ≈10.5 and 6.7 F1 (Table 3, anaphoric vs. all mentions perfect precision). However, the detector, as a span classifier, does not explicitly model inter-span anaphoric relationships. To test this architecture's ability to distinguish anaphoric from singleton mentions, we build two span classifiers with the same structure as the detector, supervised with a sigmoid loss, that recognize all mentions and anaphoric mentions in PreCo. The former achieves 79.89 classification F1 while the latter reaches only 54.32, showing the inability of a span classifier to make anaphoricity decisions.
To better understand this difficulty, we define a confusion index as singleton recall divided by anaphoric mention recall. It correlates with the classifier's inability to identify anaphoricity. Ideally, this value should be close to 0, recalling more anaphoric mentions and fewer singletons. A random classifier incapable of distinguishing between the two has an expected confusion index of 1.
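Concretely, for a set of spans predicted as anaphoric, the confusion index can be computed as (a minimal sketch under the definition above):

```python
def confusion_index(predicted, singletons, anaphoric):
    """Singleton recall divided by anaphoric-mention recall: ~0 means the
    classifier separates the two well, ~1 means it cannot tell them apart."""
    predicted = set(predicted)
    singleton_recall = len(predicted & singletons) / len(singletons)
    anaphoric_recall = len(predicted & anaphoric) / len(anaphoric)
    return singleton_recall / anaphoric_recall
```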
The anaphoric mention classifier above has a confusion index of 0.81, showing its inability to make anaphoricity decisions even when explicitly trained with this signal. If we only consider text appearing as both a singleton and an anaphoric mention in the same document, demanding contextual reasoning by disregarding obviously anaphoric mentions such as pronouns, the confusion index degrades to 0.997. Hence, the classifier is poor at leveraging self-attentive contextual cues to make anaphoricity decisions without explicit inter-span relational modeling. In §C we also show the degradation of the confusion index with shorter spans. Given the importance of anaphoric mention precision (§4), more research into improving anaphoricity decisions in the detector would be fruitful, for example, by more explicitly attending to neighboring spans. Alternatively, as Zhong and Chen (2021) showed the benefit of disentangling the span representations for entity detection and relation extraction based on the intuition that they are disparate tasks, one may split the task of anaphoricity decision from mention linking and introduce a separately parameterized anaphoricity module, similarly considering the discrepancy between the two tasks. Recasens et al. (2013) and Moosavi and Strube (2016), inter alia, pursued similar ideas in the pre-neural era, but they have not yet been explored with deep models.

The Linker's Errors
While the detector struggles with anaphoricity decisions, the linker explicitly models anaphoricity by assigning the dummy antecedent to extra mentions. It is hence also viable to determine anaphoricity in the linker. Indeed, the current detector would suffice with a stronger linker: in Table 3, the oracle linker gets near-perfect scores with the original mention candidates (not perfect, since the candidates are not gold). To analyze the remaining non-anaphoricity linker errors, we assume a perfect anaphoric mention detector. Here, conflated entities are the single major error source (last row of Table 2). Table 6 shows 150 manually categorized conflated entities in the CoNLL-12 development set. Suboptimal pronoun resolution is the biggest issue, and the linker also tends to link spans with various degrees of text match or semantic proximity. Within pronoun errors, the most common case is a pronoun linked to an incorrect nominal, occurring 43 times (Table 6). Sometimes two pronouns, often identical, are incorrectly linked, a case that necessitates better higher-order inference. Third-person pronouns with different referents are conflated 29 times. Errors with first- or second-person pronouns occur 37 times, usually due to speaker switching.
Similar to §5.1, separately parameterizing the linker's encoder may help reduce conflation: intuitively, a span representation optimized for mention detection may promote homogeneity. Meanwhile, the failure to discern span-internal content for certain error types, including pronoun resolution and exact match, combined with current systems' tendency to rely on such cues (Lu and Ng, 2020), calls for more focus on improving their contextual reasoning.

Conclusion
We analyzed the complex interaction between the mention detector and the mention linker in the mainstream coarse-to-fine coreference system. Using oracle experiments, we showed that, while detector recall is important, higher anaphoric mention precision would lead to dramatically better linker performance, though achieving it is difficult. We also demonstrated that the oracle linker performance is near perfect and that the vast majority of remaining linker errors besides anaphoricity decisions involve pronoun resolution. We hope these findings will help future coreference research.

B Heuristically Generated CoNLL-12 All Mentions
We heuristically generate the set of all mentions for CoNLL-12 in a recall-oriented manner. We use the gold syntactic information as a proxy and consider the union of all phrases tagged NP or NML and all words tagged PRP, PRP$, WP, WDT, WRB, NNP, VB, VBD, VBN, VBG, VBZ, or VBP. This set covers 99.63% of anaphoric mentions, which constitute 20.89% of the set. We obtain the set of all mentions by merging this set with the non-singleton (anaphoric) mentions, ensuring the all-mention set is a superset of the anaphoric mentions.
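This heuristic can be sketched as follows (the gold constituents and POS tags would come from the CoNLL-12 annotations; the exact input format here is an assumption for illustration):

```python
NP_LABELS = {"NP", "NML"}
WORD_TAGS = {"PRP", "PRP$", "WP", "WDT", "WRB", "NNP",
             "VB", "VBD", "VBN", "VBG", "VBZ", "VBP"}

def all_mentions(phrases, pos_tags, anaphoric):
    """phrases: iterable of (start, end, label) gold constituents;
    pos_tags: per-token POS tag list; anaphoric: gold anaphoric spans.
    Returns the heuristic all-mention set (a superset of anaphoric)."""
    spans = {(s, e) for s, e, label in phrases if label in NP_LABELS}
    spans |= {(i, i) for i, tag in enumerate(pos_tags) if tag in WORD_TAGS}
    return spans | set(anaphoric)  # merge in gold anaphoric mentions
```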

C The Confusion Index's Variation With Span Width
In Figure 1, we plot how the confusion index of the PreCo anaphoric mention classifier (§5.1) changes with span width. The classifier's inability to make anaphoricity decisions is most pronounced for short phrases, possibly because these phrases are also more likely to appear as both singleton and anaphoric mentions, whose anaphoricity status is especially hard to determine, as discussed in §5.1.
Following Joshi et al. (2020), we use the first 110 sentences per document during training. To reduce confounding factors, we do not use speaker and genre metadata. "Original System" refers to a standard SpanBERT + c2f-coref trained baseline. Its F1 score is reported in Table 1, similar to the results in Joshi et al. (2020) considering that we disregard metadata.

Table 2: F1 improvement after fixing different types of errors on the CoNLL-12 development set. The errors are independently fixed after span errors are fixed. The categorization is from Kummerfeld and Klein (2013).

Table 3: Baseline and oracle coreference F1 for anaphoric mentions (ANA.) and all mentions (ALL) on the CoNLL-12 and PreCo development sets. "Fixed Detector" is the baseline with a non-trainable detector. The middle three sections are oracle detectors with perfect candidate precision/recall. The last row is an oracle linker that always makes correct antecedent decisions.

Table 5: CoNLL-12 development F1 with increased max span width L or number of spans considered per word λ. The first column is the original setting. Boldface indicates the best performance.

Table 6: Examples of categorized conflated entity errors in the CoNLL-12 development set with a perfect detector. Following past studies (Kummerfeld and Klein, 2013; Joshi et al., 2019), we consider all deictic terms as pronouns. Each example contains two incorrectly linked entities in bold; square brackets are added to separate mentions.