Benchmarking Scalable Methods for Streaming Cross Document Entity Coreference

Streaming cross document entity coreference (CDC) systems disambiguate mentions of named entities in a scalable manner via incremental clustering. Unlike other approaches for named entity disambiguation (e.g., entity linking), streaming CDC allows for the disambiguation of entities that are unknown at inference time. Thus, it is well-suited for processing streams of data where new entities are frequently introduced. Despite these benefits, this task is currently difficult to study, as existing approaches are either evaluated on datasets that are no longer available, or omit other crucial details needed to ensure fair comparison. In this work, we address this issue by compiling a large benchmark adapted from existing free datasets, and performing a comprehensive evaluation of a number of novel and existing baseline models. We investigate: how to best encode mentions, which clustering algorithms are most effective for grouping mentions, how models transfer to different domains, and how bounding the number of mentions tracked during inference impacts performance. Our results show that the relative performance of neural and feature-based mention encoders varies across different domains, and in most cases the best performance is achieved using a combination of both approaches. We also find that performance is minimally impacted by limiting the number of tracked mentions.


Introduction
The ability to disambiguate mentions of named entities in text is a central task in the field of information extraction, and is crucial to topic tracking, knowledge base induction, and question answering. Recent work on this problem has focused almost solely on entity linking-based approaches, i.e., models that link mentions to a fixed set of known entities. While significant strides have been made on this front, with systems that can be trained end-to-end (Kolitsas et al., 2018), on millions of entities (Ling et al., 2020), and link to entities using only their textual descriptions (Logeswaran et al., 2019), all entity linking systems suffer from the significant limitation that they are restricted to linking to a curated list of entities that is fixed at inference time. Thus they are of limited use when processing data streams where new entities regularly appear, such as research publications, social media feeds, and news articles. In contrast, the alternative approach of cross-document entity coreference (CDC) (Bagga and Baldwin, 1998; Gooi and Allan, 2004; Singh et al., 2011; Dutta and Weikum, 2015), which disambiguates mentions via clustering, does not suffer from this shortcoming. Instead, most CDC algorithms suffer from a different failure mode: lack of scalability. Since they run expensive clustering routines over the entire set of mentions, they are not well suited to applications where mentions arrive one at a time. There are, however, a subset of streaming CDC methods that avoid this issue by clustering mentions incrementally (Figure 1). Unfortunately, despite such methods' apparent fitness for streaming data scenarios, this area of research has received little attention from the NLP community.

* Work done during an internship at Google Research. Code and data available at: https://github.com/rloganiv/streaming-cdc
To our knowledge there are only two existing works on the task (Rao et al., 2010;Shrimpton et al., 2015), and only the latter evaluates truly streaming systems, i.e., systems that process new mentions in constant time with constant memory.
One crucial factor limiting research on this topic is a lack of free, publicly accessible benchmark datasets; datasets used in existing works are either small and impossible to reproduce (e.g., the dataset collected by Shrimpton et al. (2015) only contains a few hundred unique entities, and many of the annotated tweets are no longer available for download) or lack the necessary canonical ordering and are expensive to procure (e.g., the ACE 2008 and TAC-KBP 2009 corpora used by Rao et al. (2010)).

Figure 1b: Mentions are encoded as points in a vector space and incrementally clustered. As the space grows, some points are removed to ensure that the amount of memory used does not exceed a given threshold.
To remedy this, we compile a benchmark of three datasets for evaluating English streaming CDC systems along with a canonical ordering in which evaluation data should be processed. These datasets are derived from existing datasets that cover diverse subject matter: biomedical texts (Mohan and Li, 2019), news articles (Hoffart et al., 2011), and Wikia fandoms (Logeswaran et al., 2019).
We evaluate a number of novel and existing streaming CDC systems on this benchmark. Our systems utilize a two-step approach where: 1) each mention is encoded using a neural or feature-based model, and 2) the mention is then clustered with existing mentions using an incremental clustering algorithm. We investigate the performance of different mention encoders (existing feature-based methods, pretrained LMs, and encoders from entity linkers such as RELIC (Ling et al., 2020) and BLINK (Wu et al., 2020)) and incremental clustering algorithms (greedy nearest-neighbors clustering, and a recently introduced online agglomerative clustering algorithm, GRINCH (Monath et al., 2019)). Since GRINCH does not use bounded memory, which is required for scalability in the streaming setting, we introduce a novel bounded memory variant that prunes nodes from the cluster tree when the number of leaves exceeds a given size, and compare its performance to existing bounded memory approaches.
Our results show that the relative performance of different mention encoders and clustering algorithms varies across domains. We find that existing approaches for streaming CDC (e.g., feature-based mention encoding with greedy nearest-neighbors clustering) outperform neural approaches on two of the three datasets (+1-3% absolute improvement in CoNLL F1), while a RELIC-based encoder with GRINCH performs better on the last dataset (+9% absolute improvement in CoNLL F1). In cases where existing approaches perform well, we also find that better performance can be obtained by using a combination of neural and feature-based mention encoders. Lastly, we observe that by using relatively simple memory management policies, e.g., removing old and redundant mentions from the mention cache, bounded memory models can achieve performance nearly on par with unbounded models while storing only a fraction of the mentions (in one case we observe only a 2% absolute drop in CoNLL F1 while caching only 10% of the mentions).

Task Overview
The key goal of cross-document entity coreference (CDC) is to identify mentions that refer to the same entity. Formally, let M = (m_1, ..., m_|M|) denote a corpus of mentions, where each mention consists of a surface text m.surface (e.g., the colored text in Figure 1a), as well as its surrounding context m.context (e.g., the text in black). Provided M as input, a CDC system produces a disjoint clustering over the mentions C = {C_1, ..., C_|C|}, |C| ≤ |M|, as output, where each cluster C_e = {m ∈ M | m.entity = e} is the set of mentions that refer to the same entity e. In streaming CDC, there are two additional requirements: 1) mentions arrive in a fixed order (M is a list) and are clustered incrementally, and 2) memory is constrained so that only a fixed number of mentions can be stored. This can be formulated in terms of the above notation by adding a time index t: M_T = {m_t ∈ M | t ≤ T} is the set of all mentions observed at or before time T; M̃_T ⊆ M_T is a subset of "active" mentions whose size does not exceed a fixed memory bound k, i.e., |M̃_T| ≤ k; and C_T is comprised of clusters that only contain mentions in M̃_T. Due to the streaming nature, M̃_T − {m_T} ⊆ M̃_{T−1}, i.e., a mention cannot be added back to M̃_T if it was previously removed. When the memory bound is reached, mentions are removed from M̃ according to a memory management policy Φ.
An illustrative example is provided in Figure 1. Mentions arrive in left-to-right order (Figure 1a), with the clustering process depicted in Figure 1b (memory bound k = 3). At time T = 4, the mention m_1 is removed from M̃_4. Note that, even though m_1 is removed, it is still possible to disambiguate mentions of all previously observed entities, whereas this would not be possible had m_3 or m_4 been removed instead. This illustrates the effect the memory management policy can have on performance.

Background and Motivation
Cross Document Entity Coreference As we describe later, we employ a two-stage CDC pipeline where mentions are first encoded as vectors and subsequently clustered. This approach is used in most existing work on CDC (Bagga and Baldwin, 1998; Mann and Yarowsky, 2003; Gooi and Allan, 2004). In the past decade, research on CDC has mainly focused on improving scalability (Singh et al., 2011), and on jointly learning to perform CDC with other tasks such as entity linking (Dutta and Weikum, 2015) and event coreference (discussed in the next paragraph). This work similarly investigates whether entity linking is beneficial for CDC; however, we use entity linkers that are pretrained separately and kept fixed during inference.
Recently, there has been a renewed interest in performing CDC jointly with cross-document event coreference (Barhom et al., 2019;Meged et al., 2020;Cattan et al., 2020;Caciularu et al., 2021) on the ECB+ dataset (Cybulska and Vossen, 2014). Although we do not evaluate methods from this line of research in this work, we hope that the benchmark we compile will be useful for future evaluation of these systems.
Streaming Cross Document Coreference The methods mentioned in the previous paragraphs disambiguate mentions all at once, and are thus unsuitable for applications where a large number of mentions appear over time. Rao et al. (2010) propose to address this issue using an incremental clustering approach where each new mention is either placed into one of a number of candidate clusters, or into a new cluster if its similarity to the candidates does not exceed a given threshold (Allaway et al. (2021) use a similar approach for joint entity and event coreference). Shrimpton et al. (2015) note that this incremental clustering does not process mentions in constant time and memory, and thus is not "truly streaming". They present the only truly streaming approach for CDC by introducing a number of memory management policies that limit the number of stored mentions, which we describe in more detail in Section 3.3.
One of the key problems inhibiting further research on streaming CDC is a lack of suitable evaluation datasets for measuring system performance. The datasets used by Rao et al. (2010) are either small in size (a few hundred mentions), contain few annotated entities, or are expensive to procure. Additionally, they do not include any canonical ordering of the mentions, which precludes consistent evaluation of streaming systems. Meanwhile, the tweets annotated by Shrimpton et al. (2015) only cover two surface texts (Roger and Jessica) and are no longer accessible via the Twitter API. To address this, we collect a new evaluation benchmark comprised of three existing publicly available datasets, covering a diverse collection of topics (news, biomedical articles, wikias) with natural orderings (e.g., chronological, categorical). This benchmark is described in detail in Section 4.1.
Entity Linking CDC is similar to the task of entity linking (EL, Mihalcea and Csomai (2007)), which also addresses the problem of named entity disambiguation, with the key distinction that EL is formulated as a supervised classification problem (the list of entities is known at training and test time), while CDC is an unsupervised clustering problem. In particular, CDC is similar to time-aware EL (Agarwal et al., 2018), where temporal context is used to help disambiguate mentions, and zero-shot EL (Zeshel, Logeswaran et al. (2019)), where the set of entities linked to during evaluation does not overlap with the set of entities observed during training. Streaming CDC can also be considered a method for time/order-aware zero-shot named entity disambiguation; however, it is strictly more challenging, as it does not assume access to a curated list of entities at prediction time, or any supervised training data.
Although CDC is formulated as a strictly unsupervised clustering task, this does not preclude the usage of labeled data for transfer learning. One of the primary goals in this work is to investigate whether the mention encoders learned by entity linking systems provide useful representations in the first step of the CDC pipeline. Specifically, we consider mention encoders for two state-of-the-art entity linking architectures: RELIC (Ling et al., 2020) and the BLINK bi-encoder (Wu et al., 2020).
Emerging Entity Detection Streaming CDC is also related to the task of emerging entity detection (EED, Färber et al. (2016)), which, given a mention that cannot be linked, seeks to predict whether it should produce a new KB entry. Although both tasks share similar motivations, they adopt different approaches (EED is formulated as a binary classification task), and CDC does not require deciding which entities should and should not be added to a knowledge base. However, in many practical applications, it may make sense to apply streaming CDC only to emerging entities.

Building Streaming CDC Systems
Following previous work, we adopt a two-step approach to performing streaming cross-document coreference. In the first step, an encoder is used to produce a vector representation of the incoming mention, Enc(m_t). In the second step, these vectors are input to an incremental clustering algorithm to update the predicted clustering, C_t = Clust(C_{t−1}, m_t). In the following sections we describe in detail the mention encoders and clustering algorithms used in this work.
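As a concrete (if toy) illustration of this two-step loop, the sketch below pairs a placeholder encoder with greedy nearest-neighbor cluster assignment. The `encode` and `stream_cdc` functions, the character-count features, and the threshold value are illustrative stand-ins, not the actual systems evaluated in this work.

```python
from collections import Counter
import math

def encode(mention):
    """Toy stand-in for Enc(m_t): a character-count vector of the surface text."""
    return Counter(mention["surface"].lower())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def stream_cdc(mentions, tau=0.8):
    """Incrementally cluster mentions; returns the cluster id of each mention."""
    reps = []       # stored mention encodings
    labels = []     # cluster id of each stored mention
    next_id = 0
    for m in mentions:
        v = encode(m)                        # step 1: encode the mention
        best_i, best_sim = None, 0.0
        for i, r in enumerate(reps):         # step 2: compare to stored mentions
            s = cosine(v, r)
            if s > best_sim:
                best_i, best_sim = i, s
        if best_i is not None and best_sim > tau:
            cid = labels[best_i]             # join the nearest neighbor's cluster
        else:
            cid = next_id                    # otherwise start a new cluster
            next_id += 1
        reps.append(v)
        labels.append(cid)
    return labels
```

A real system would swap in one of the encoders and clustering algorithms described below, and add a memory management policy to bound the size of the stored set.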

Mention Encoders
The primary goal of mention encoders Enc(m t ) is to produce a compact representation of the mention, including both the surface and the context text.
Feature-Based Encoders Existing models for streaming cross-document coreference exclusively make use of feature-based mention encoders. While there are many feature engineering options explored in the literature, in this work we consider the mention encoding approach proposed by Shrimpton et al. (2015), which uses character skip bigram indicator vectors to encode the surface text, and tf-idf vectors to represent contexts. When using this encoding scheme, similarity scores are computed independently for the surface and context embeddings, and a weighted average is taken to produce the final similarity score. We use the same setup and parameters as Shrimpton et al. (2015).
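The following is a rough, self-contained sketch of this style of feature-based encoding. It substitutes plain term frequencies for tf-idf and an arbitrary 0.5/0.5 weighting for the tuned parameters of Shrimpton et al. (2015); the function names and feature details are illustrative.

```python
import math
from collections import Counter

def skip_bigrams(surface, max_skip=1):
    """Indicator set of character pairs (c_i, c_j) separated by at most
    max_skip intervening characters."""
    s = surface.lower()
    pairs = set()
    for i in range(len(s)):
        for j in range(i + 1, min(i + 2 + max_skip, len(s))):
            pairs.add((s[i], s[j]))
    return pairs

def bow(context):
    """Bag-of-words context vector (plain counts; tf-idf in the real system)."""
    return Counter(context.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity(m1, m2, w_surface=0.5):
    """Weighted average of surface and context similarities."""
    s_sim = jaccard(skip_bigrams(m1["surface"]), skip_bigrams(m2["surface"]))
    c_sim = cosine(bow(m1["context"]), bow(m2["context"]))
    return w_surface * s_sim + (1 - w_surface) * c_sim
```

Computing the surface and context scores separately, then averaging, is what lets the two feature families be weighted against each other.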

Masked Language Model Encoders
We also consider mention encodings produced by masked language models, particularly BERT (Devlin et al., 2019). We encode the mention by feeding the contiguous text of the mention (containing both the surrounding and surface text) into BERT and concatenating the contextualized vectors associated with the first and last word-piece of the surface text. That is, let s, e ∈ N denote the start and end of the mention surface text within the complete mention, and let M = BERT(m) denote the matrix of contextualized word-piece vectors output by BERT. Then the mention encoding is given by the concatenation Enc(m) = [M_s ; M_e].

Entity Linker-Based Encoders We consider producing mention encodings using bi-encoder-based neural entity linkers: RELIC (Ling et al., 2020) and BLINK (Wu et al., 2020). The bi-encoder architecture is comprised of two components, a mention encoder Enc_m and an entity encoder Enc_e, and is trained to maximize a similarity score (e.g., dot product) between the mention encoding and the encoding of its underlying entity, while simultaneously minimizing the score for other entities. We use Enc_m from pretrained entity linkers to encode mentions for CDC.
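A minimal sketch of the span-extraction step of the masked language model encoder: given the matrix of contextualized word-piece vectors (here a plain list of lists standing in for BERT's output) and the surface-span endpoints s and e, the encoding is the concatenation of the two corresponding vectors. The function name is our own.

```python
def mlm_mention_encoding(M, s, e):
    """Concatenate the contextualized vectors of the first (s) and last (e)
    word-piece of the mention's surface text."""
    return list(M[s]) + list(M[e])
```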

Hybrid Encoder
We also consider a hybrid encoder which combines feature-based and neural mention encoders. We retain the feature-based character skip bigram encoder for the surface text, but use one of the neural encoders from entity linkers in place of the tf-idf context representation. Similarity scores are computed by averaging the two without any weights, unlike in Shrimpton et al. (2015).

Clustering Algorithms
Here we describe incremental clustering approaches, Clust(C_{t−1}, m_t), that compute a new clustering when m_t is added to the mentions under consideration (M̃).

Greedy Nearest Neighbors Clustering A simple approach is single-linkage incremental clustering, which clusters each new mention m with its nearest neighbor m* = arg max_{m′ ∈ M̃} sim(m, m′), provided the similarity exceeds some threshold τ. We use a similar approach here; however, we cluster m with all m′ ∈ M̃ such that sim(m, m′) > τ, thus allowing previously separate clusters to be merged if m is similar to more than one of them.
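A minimal sketch of this threshold-based greedy clustering, using union-find bookkeeping so that linking a new mention to several stored mentions merges their clusters. The function names are illustrative, and the sketch stores all mentions (no memory bound).

```python
def greedy_cluster(encodings, sim, tau):
    """Incrementally cluster mention encodings; returns, for each mention,
    the index of a canonical representative of its cluster."""
    parent = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path compression
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for t, v in enumerate(encodings):
        parent.append(t)                     # new mention starts as a singleton
        for u in range(t):                   # compare to all stored mentions
            if sim(v, encodings[u]) > tau:
                union(t, u)                  # may merge previously separate clusters
    return [find(i) for i in range(len(encodings))]
```

For example, with one-dimensional encodings and sim(a, b) = 1 − |a − b|, a mention that lies midway between two distant mentions can merge their clusters.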
GRINCH Gooi and Allan (2004) find that average-link hierarchical agglomerative clustering can outperform greedy single-link approaches. However, agglomerative approaches are typically not used for streaming CDC because running the algorithm at each time step is too expensive, and incremental variants of the approach are unable to recover from incorrect choices made early on (Figure 2a). The recently introduced GRINCH clustering algorithm (Monath et al., 2019) uses rotate and graft operations that reconfigure the tree, thereby avoiding these issues (Figure 2b). We defer to the original paper for details; however, we note that, for our application, each interior node of the cluster tree is computed as a weighted average of its children's representations (where the weights are proportional to the number of leaves). Thus, at each interior node, it is possible to compute the similarity score between that node's children. This allows us to produce a flat clustering from the cluster tree by thresholding the similarity score, just as in the greedy clustering case.

Memory Management Policies
As described in Section 2.1, memory management policies decide which mentions to remove from M to prevent its size from exceeding the memory bound, providing scalable, memory-bound variants of the clustering algorithms.
Bounded Memory Greedy NN Clustering For bounded memory greedy nearest neighbors clustering, we consider the following memory management policies of Shrimpton et al. (2015):
• Window: Remove the oldest mention in M̃.
• Cache: Remove the oldest mention in the least recently updated cluster C_LRU.
• Diversity: Remove the mention most similar to the mention just added, i.e., arg max_m sim(m, m_t).
• Diversity-Cache: A combination of the diversity and cache strategies, where the diversity strategy is used if the similarity score exceeds a given threshold, sim(m, m_t) > α, and the cache strategy is used otherwise.
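The eviction rules above can be sketched as follows. The dictionary fields (`time`, `cluster`, `cluster_updated`) and the function name are simplified stand-ins for the bookkeeping an actual implementation would carry.

```python
def evict_index(store, policy, new_vec=None, sim=None):
    """store: list of dicts with 'vec', 'time', 'cluster', 'cluster_updated'.
    Returns the index of the stored mention to remove."""
    if policy == "window":
        # Oldest mention overall.
        return min(range(len(store)), key=lambda i: store[i]["time"])
    if policy == "cache":
        # Oldest mention in the least recently updated cluster.
        lru = min(store, key=lambda m: m["cluster_updated"])["cluster"]
        members = [i for i in range(len(store)) if store[i]["cluster"] == lru]
        return min(members, key=lambda i: store[i]["time"])
    if policy == "diversity":
        # Mention most similar to the mention just added.
        return max(range(len(store)), key=lambda i: sim(store[i]["vec"], new_vec))
    raise ValueError(policy)
```

The diversity-cache policy would simply dispatch between the last two branches depending on whether the maximum similarity exceeds α.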
Bounded Memory GRINCH Memory management for GRINCH is more complicated than for greedy clustering: instead of maintaining a flat clustering of mentions, GRINCH maintains a cluster hierarchy in the form of a binary cluster tree. Every time a mention is inserted into the tree, two new nodes are created: one node for the mention itself, and a new parent node linking the mention to its sibling (Figure 2a). Accordingly, when the memory bound is reached, the memory management policy for GRINCH must remove two nodes from the tree. Furthermore, in order to preserve the tree's binary structure, the removed nodes must be leaf nodes as well as siblings. Because the original GRINCH algorithm only includes routines for inserting nodes into the tree and reconfiguring the tree's structure, we modify GRINCH to include a new remove operation that prunes two nodes satisfying these criteria. The parent of these nodes then becomes a leaf node, whose vector representation is produced by combining the vector representations of its former children using a weighted average (this is conceptually similar to the collapse operation described in Kobren et al. (2017)). We consider the following policies here:
• Window: Remove the nodes whose parent was least recently added to the tree.
• Diversity: Remove the pair of nodes that are most similar to each other.
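The leaf-pruning step described above can be sketched as follows; the dictionary-based tree node is a simplified stand-in for GRINCH's actual data structures, and the function name is our own.

```python
def collapse(parent):
    """Prune two sibling leaves: the parent becomes a leaf whose vector is the
    weighted average of its former children, with weights proportional to the
    number of leaves each child represents."""
    left, right = parent["children"]
    n = left["n_leaves"] + right["n_leaves"]
    parent["vec"] = [
        (left["n_leaves"] * a + right["n_leaves"] * b) / n
        for a, b in zip(left["vec"], right["vec"])
    ]
    parent["n_leaves"] = n
    parent["children"] = []          # the parent is now itself a leaf
    return parent
```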

Benchmarking Streaming CDC
In this section, we describe our proposed benchmark for evaluating streaming CDC systems.

Analysis Statistics for the benchmark data are provided in Table 1, which lists the number of mentions and unique entities for each dataset. We also list the percentage overlap between entities in the training set and entities in the dev and test sets (% Seen), as well as the maximum active entities (MAE). MAE is a quantity introduced by Toshniwal et al. (2020) that measures the maximum number of "active entities" (i.e., entities that have been previously mentioned and will be mentioned again in the future) for a given dataset. It can alternatively be interpreted as the smallest possible memory bound ensuring that a CDC system can cluster each mention with at least one other mention of the same entity. Importantly, this number is a small fraction of the total number of mentions in each dataset, indicating that these datasets are appropriate for the streaming setting and for comparing memory management policies.
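One plausible reading of the MAE statistic can be sketched as follows. The strict-inequality convention for "active" (mentioned strictly before position t and strictly after it) is our assumption and may differ in detail from the definition of Toshniwal et al. (2020).

```python
def max_active_entities(entity_stream):
    """entity_stream: list of entity ids in mention order.
    Returns the maximum, over positions t, of the number of entities
    mentioned both before and after t."""
    first, last = {}, {}
    for t, e in enumerate(entity_stream):
        first.setdefault(e, t)   # first mention position
        last[e] = t              # last mention position
    mae = 0
    for t in range(len(entity_stream)):
        active = sum(1 for e in first if first[e] < t and last[e] > t)
        mae = max(mae, active)
    return mae
```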

Evaluation Metrics
We evaluate CDC performance using the standard evaluation metrics: MUC (Vilain et al., 1995), B^3 (Bagga and Baldwin, 1998), CEAFe (Luo, 2005), and CoNLL F1, which is the average of the previous three. In order to perform evaluation when memory is bounded, we perform the following bookkeeping to track nodes that have been removed by the memory management policy. For bounded memory greedy NN clustering, we keep track of the removed node's predicted cluster (e.g., if the node was removed from cluster C, then it is considered an element of C during evaluation).
This is similar to the evaluation used by Toshniwal et al. (2020). For bounded memory GRINCH, we keep track of the removed node's place within the tree structure, and produce a flat clustering using the thresholding approach described in Section 3.2 as if the node were never removed. Because leaf nodes (and, accordingly, removed nodes) are never updated by insertion or removal operations, nodes belonging to the same cluster before they are pruned will always remain in the same cluster during evaluation, which is the same assumption used for the greedy NN evaluation.
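Of the metrics above, B^3 is straightforward to state compactly. The following self-contained sketch (our own implementation, not the official scorer) computes mention-level B^3 precision, recall, and F1: for each mention, precision (recall) is the fraction of its predicted (gold) cluster that shares its gold (predicted) cluster, averaged over mentions.

```python
def b_cubed(predicted, gold):
    """predicted, gold: lists assigning a cluster id to each mention.
    Returns (precision, recall, f1)."""
    n = len(predicted)
    p_total = r_total = 0.0
    for i in range(n):
        pred_cluster = {j for j in range(n) if predicted[j] == predicted[i]}
        gold_cluster = {j for j in range(n) if gold[j] == gold[i]}
        correct = len(pred_cluster & gold_cluster)
        p_total += correct / len(pred_cluster)   # purity of i's predicted cluster
        r_total += correct / len(gold_cluster)   # coverage of i's gold cluster
    p, r = p_total / n, r_total / n
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```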

Hyperparameters
Vocabulary and inverse document frequency (idf) weights are estimated using each dataset's train set. For masked language model encoders, we use an unmodified BERT-base architecture, with model weights provided by the HuggingFace transformers library (Wolf et al., 2020). For BLINK, we use the released BERT-large bi-encoder weights. Our bounded memory variant of GRINCH is based on the official implementation. Note that GRINCH does not currently support sparse inputs, so we do not include results for feature-based mention encoders. RELIC model weights are initialized from BERT-base, and then finetuned to perform entity linking in the following settings:
• RELIC (Wiki): Trained on the same Wikipedia data used to train the BLINK bi-encoder.
• RELIC (In-Domain): Trained on the respective benchmark's training dataset; a separate model is trained for each benchmark.
Training is performed using the hyperparameters suggested by Ling et al. (2020). For each benchmark, the hybrid mention encoder uses the best performing RELIC variant on that benchmark. Cluster thresholds τ are chosen so that the number of predicted clusters on the dev set approximately matches the number of unique entities.

Results
In this section, we provide a comprehensive evaluation of the design choices that define the existing and proposed approaches for streaming CDC.

Choice of Encoder We include the results for CDC systems with unbounded memory on the benchmark datasets in Table 2, as well as results for two baselines: 1) a system that clusters together all mentions with the same surface form (exact match), and 2) a system that only considers gold within-document clusters and does not merge clusters across documents (oracle within-doc). We observe that, in general, neural mention encoders are not sufficient to obtain good CDC performance. With the exception of RELIC (In-Domain) on MedMentions, no neural mention encoder is able to outperform the feature-based greedy NN approach, and furthermore, the MLM and BLINK mention encoders do not even surpass the exact match baseline. However, note that for AIDA and Zeshel, the best results are obtained using a hybrid mention encoder. Thus, in these domains, we can conclude that while neural mention encoders are useful for encoding contexts, CDC systems require an additional mechanism for modeling surface texts to achieve good performance. The results on MedMentions provide an interesting contrast to this conclusion. Here, the RELIC (In-Domain) mention encoder outperforms both the feature-based and hybrid mention encoders. In the error analysis below, we find that this is mainly due to improved performance clustering mentions of entities seen when training the mention encoder.
Choice of Clustering Algorithm Comparing greedy nearest neighbors clustering to GRINCH, we do not observe a consistent trend across mention encoders or datasets. While the best performance on AIDA and Zeshel is achieved using greedy nearest neighbor clustering, the best performance on MedMentions is achieved using GRINCH. These results highlight the importance of benchmarking CDC systems on a number of different datasets; patterns observed on a single dataset do not extrapolate well to other settings. It is also interesting to observe that a much simpler approach often works better than the more complex one.
Error Analysis We characterize the errors of these models by investigating: a) the entities whose mentions are conflated (i.e., wrongly clustered together) and split (i.e., wrongly grouped into separate clusters), using the approach of Kummerfeld and Klein (2013), and b) differences in performance on entities that are seen vs. unseen during training, for models that use in-domain data. A subset of our results is provided in Table 3, with full results available in Tables 4-11 in the Appendix.
In aggregate, these error metrics closely track the results in Table 2, where better models make fewer errors of all types. We do, however, observe that in-domain training improves RELIC's performance considerably on MedMentions (+15 CoNLL F1 on seen entities, and +18 on unseen entities), and is the primary reason underlying the improved performance over feature-based encoders (72.6 vs. 60.7 CoNLL F1 on seen entities, while performance on unseen entities is comparable).
Comparing mentions of the most conflated entities provides a qualitative sense of the failure modes of each method. We note that the feature-based method tends to fail at distinguishing entities with the same surface form, e.g., world cups of different sports, while neural entity linkers tend to conflate entities with similar contexts, particularly when surface forms are split into multiple word pieces in the model's vocabulary (each surface form in the bottom of Table 3 is broken into 3+ word pieces).

Effect of Bounded Memory
Results for the bounded memory setting are illustrated in Figure 3. In these experiments we take the best neural mention encoder for each benchmark dataset (RELIC (Wiki) for AIDA and Zeshel, and RELIC (In-Domain) for MedMentions), and plot the CoNLL F1 score for each of the memory management policies described in Section 3.3. We measure performance for memory bounds at the maximum number of active entities (MAE) and the total number of unique entities (|E|) for each dataset, as well as at 1/2× and 2× multiples of these numbers. In sum, these results provide strong evidence that CDC systems can reliably cluster mentions in a truly streaming setting, even when memory is bounded to a small fraction of the number of entities encountered by the system. Most impressively, using the diversity-cache memory management policy, a greedy nearest neighbors bounded memory model achieves a CoNLL F1 score within 2% of the best performing unbounded memory model, while only storing approximately 10% (i.e., |E|/2) of the mentions.
We notice a few fairly consistent trends across datasets. The first is that increasing the memory bound has diminishing returns; while there is a large benefit from increasing the bound from MAE/2 to MAE, the difference in performance attained by increasing the bound from |E| to 2|E| is often negligible. We also find that naïve memory management policies that store recent mentions (i.e., window, W, and cache, C) tend to perform better than the policies that attempt to remove redundant mentions (i.e., diversity, D). This effect is particularly pronounced for small memory bounds. This is somewhat surprising: storing multiple mentions of the same entity is particularly harmful when memory is limited, so encouraging diversity should be a good thing. One possible explanation is that the diversity policy is actually removing mentions of distinct entities that appear within similar contexts, as we saw earlier that neural mention encoders appear to focus more on mention context than surface text. Lastly, regarding the comparison of greedy nearest neighbors clustering to GRINCH, we again see inconsistency in performance across datasets; GRINCH appears to perform better at larger cache sizes on AIDA and MedMentions, while greedy nearest neighbors clustering performs much better than GRINCH on Zeshel.

Conclusion and Future Work
Streaming cross-document coreference has a number of compelling applications, especially concerning processing streams of data, such as research publications, social media feeds, and news articles, where new entities are frequently introduced. Despite being well-motivated, this task has received little attention from the NLP community. In order to foster a more welcoming environment for research on this task, we compile a diverse benchmark dataset for evaluating streaming CDC, comprised of existing datasets that are free and publicly available. We additionally evaluate the performance of a collection of existing approaches for CDC, as well as introduce new approaches that leverage modern neural architectures. Our results highlight a number of challenges for future CDC research, such as how to better incorporate surface-level features into neural mention encoders, as well as alternative policies for memory management that improve upon the naïve baselines studied in this work. Benchmark data and materials needed to reproduce our results are provided at: https://github.com/rloganiv/streaming-cdc.

Broader Impact Statement
This paper focuses on systems that perform entity disambiguation without reliance on an external knowledge base. The potential benefit of such systems is an improved ability to track mentions of rare and emergent entities (e.g., natural disasters, novel disease variants, etc.); however, this is also relevant in digital surveillance settings, and may result in reduced privacy.

A Error Analysis
A.1 Seen vs. Unseen Performance We evaluate CoNLL F1 scores for mentions of entities that are seen vs. unseen in the AIDA and MedMentions training datasets (Zeshel is excluded since no test entities are seen in the training data). Results are provided in Table 4, with the performance of models trained using the in-domain training datasets reported in bold.

A.2 Clustering Mistakes
Kummerfeld and Klein (2013) define a system for categorizing coreference errors into a number of underlying error types. Because gold mention boundaries are provided in our task setup, the main error types of relevance are divided entities, i.e., mentions of the same entity that occur in different clusters, and conflated entities, i.e., mentions of different entities that are grouped into the same clusters. We can quantify these error types by counting the number of times clusters need to be merged together vs. split, respectively. The overall error counts are provided in Table 5.
In addition to providing the overall error counts, we also render a sample of mentions from predicted clusters containing the most conflated entities in Tables 6-11.

RELIC (In-Domain)
Protein Expression . . . by vascular endothelial growth factor (VEGF) signaling. We describe spatiotemporal expression of vegf and vegfr and experimental manipulations targeting VEGF . . . Genes, Homeobox . . . cell adhesion, and newly identified processes, including transcription and homeobox genes. We identified mutations in protein binding sites correlating with . . . Gene Expression . . . identified mutations in protein binding sites correlating with differential expression of proximal genes and experimentally validated effects of mutations . . .

BLINK (Wiki)
Robin (Friends) . . . , Sarah and Maya. Emma has a horse called Robin, dog called Lady and a cat called Jewel. . . . 41003 Olivia's Newborn Foal . . . a pet bird, Goldie. Olivia also has a new pet foal, which she takes care of frequently. She seems . . . 41007 Heartlake Pet Salon . . . its neck. Background. Joanna brings her poodle to the pet salon, where Emma pampers her up. . . .

RELIC (Wiki)
Ro Gale . . . , as was Maquis leader Macias. Ro recalled that her father made the strongest "hasperat" she'd ever . . . Unnamed shuttlepods (22nd century) . . . . " ( ) The Federation starship carried at least one shuttlepod until the time of its disappearance in the mid- . . . Founders' homeworld (2372) . . . As she is reluctant to reveal the location of the Founders' new homeworld, but respects Sisko's loyalty to Odo when . . .

BLINK (Wiki)
Astral projection . . . also possible to escape with "teleportation" spells or astral travel, though the force blocked ethereal travel. A captive . . . Krakentua (Shinkintin) . . . and force newly hatched krakentua spawn to fight. A krakentua related these events via dreams to adventurers in the Fochu . . . Generic temple guard . . . two to attempt a crossing were Father Sambar and a temple guard. Sambar died horrifically, but the guard survived as . . .