Entity Linking via Explicit Mention-Mention Coreference Modeling

Learning representations of entity mentions is a core component of modern entity linking systems for both candidate generation and making linking predictions. In this paper, we present and empirically analyze a novel training approach for learning mention and entity representations that is based on building minimum spanning arborescences (i.e., directed spanning trees) over mentions and entities across documents to explicitly model mention coreference relationships. We demonstrate the efficacy of our approach by showing significant improvements in both candidate generation recall and linking accuracy on the Zero-Shot Entity Linking dataset and MedMentions, the largest publicly available biomedical dataset. In addition, we show that our improvements in candidate generation yield higher quality re-ranking models downstream, setting a new SOTA result in linking accuracy on MedMentions. Finally, we demonstrate that our improved mention representations are also effective for the discovery of new entities via cross-document coreference.


Introduction
Natural language corpora, such as biomedical research papers (Leaman and Lu, 2016), news articles (Milne and Witten, 2008; Hoffart et al., 2011), and, more generally, web page text (Gabrilovich et al., 2013; Lazic et al., 2015a), often contain ambiguous mentions of entities. Resolving this ambiguity requires mentions to either be linked to a knowledge base (KB) of entities or discovered as a new KB concept if no suitable entry exists. Grounded entity mentions are beneficial for tasks such as question-answering (Das et al., 2019), semantic search (Leaman and Lu, 2016), recommendation ranking (Noia et al., 2016), and KB construction (Ling et al., 2015). The task is made particularly challenging in zero-shot settings, where not every entity has labeled training data (Lin et al., 2017; Logeswaran et al., 2019). In such settings, a common approach is to make use of entity descriptions, types, and aliases to form entity representations, which can then be used for making predictions.
Learned vector representations of entity mentions are an integral part of modern linking systems (Gillick et al., 2019; Wu et al., 2020, inter alia). These representations are used for (a) retrieving a short-list of entity candidates for a mention to use with a re-ranker, (b) making linking predictions directly (Liu et al., 2020; Sung et al., 2020), and (c) performing coreference by clustering mentions to form entities (Logan IV et al., 2020).
In this work, we present a new objective and training procedure for learning mention and entity representations that explicitly model mention coreference relationships. Our proposed method uses a supervised clustering training objective based on forming a directed minimum spanning tree, or arborescence, over mentions and entities. We hypothesize that such coreference links provide a useful inductive bias because the two tasks are inherently related (Angell et al., 2021; FitzGerald et al., 2021). We thoroughly analyze the performance of the proposed procedure in each of the aforementioned use cases on MedMentions (Mohan and Li, 2019) and ZeShEL (Logeswaran et al., 2019), two challenging datasets that require zero-shot generalization at inference.
Retrieving Candidates We illustrate that our approach yields mention and entity representations useful for candidate retrieval. We show improvements over baselines that use similarly parameterized models, achieving gains of at least 7.94 and 0.93 points in recall@64 over two standard dual-encoder training procedures on MedMentions and ZeShEL, respectively. We also consider the linking capacity of our learned embeddings without re-ranking and find that their performance (i.e., recall@1) indeed improves upon our baselines. Our best performing models show gains of 13.61 and 15.46 points in linking accuracy on MedMentions and 12.06 and 1.52 points on ZeShEL.
Linking Predictions We further consider the improvement in downstream training of full cross-attention re-ranker models using the higher quality candidates generated by our approach. We show consistent gains in linking accuracy on MedMentions, setting a new state-of-the-art with a 1.63 point gain over the previous best model. We also note that our proposed approach shows mixed results on ZeShEL, with one variant outperforming all compared models by at least 1.19 points, while the other two underperform the baselines. We analyze this behavior in a later section and discuss the characteristics of the data distribution sufficient to make our approach effective.
Cross-Document Coreference Finally, we illustrate that the learned representations can be used to perform coreference of mentions across documents. This indicates that our approach could be used to discover entities in settings where there is limited or no existing knowledge base of entities.

Arborescence-based Training for Mention & Entity Representations
In this section, we describe our approach for constructing training objectives for dual-encoders that model mention coreference relationships.

Problem Definition
Each document d of a corpus D contains a set of entity mention spans M_d = {m_1^d, m_2^d, ..., m_N^d}. All mentions in the corpus are given by M = ∪_{d∈D} M_d. Following Logeswaran et al. (2019) and Angell et al. (2021), we assume that these mentions are pre-identified spans of text.
Entity Linking Formally, we define the task of entity linking as follows: given a knowledge base of entities E and a set of mentions M, predict an entity e_i^d ∈ E for each mention m_i^d. We use e_i^{⋆d} to refer to the ground truth entity label for m_i^d.

Zero-Shot Linking The zero-shot task refers to the setting where there are entities in the knowledge base that do not have any labeled mentions in the training data. Linking decisions must instead rely on the provided information for entities, such as descriptions, aliases, and/or entity types.
Coreference We also consider a setting in which the KB of entities is not known in advance and entities must be discovered. For this task, we map every entity mention m_i^d to a cluster and assign a coreference label c_i^d ∈ C that is independent of the entity labels in the KB.

Coreference-based Similarity
In order to jointly train both the mention and entity encoders, we define a similarity measure and an analogous procedure for sampling positive training examples, one that intersperses the selection of coreferent mentions and gold entities based on a single-linkage structure formed by the representations generated by the current model snapshot. We construct k-nearest-neighbor graphs over coreferent mention and entity clusters, and then apply a pruning algorithm to generate arborescence (directed MST) structures rooted at entity nodes. The edges that remain after pruning represent the pairs of positive examples used for training.
Graph-based Dissimilarity Let G be a graph with nodes V = M ∪ E and directed edges E ⊂ V × V. Each edge (x, y) of the graph has an associated weight w_{x,y}. We define a dissimilarity function f between two nodes u, v ∈ V to be the weight of the minimax path between the nodes, i.e.,

f(u, v) = min_{p ∈ u⇝v} max_{(x,y) ∈ p} w_{x,y} if connected(u, v), and f(u, v) = ∞ otherwise,

where connected(u, v) is true if there exists a directed path from node u to v in G, and u ⇝ v is the set of all paths between u and v. In words, the dissimilarity between u and v is the minimum over all paths between the two nodes of the largest edge weight on the path, often referred to as the "bottleneck edge". This measure can assign a low dissimilarity to a pair of nodes even when their direct edge weight w_{u,v} is high, by connecting them through a chain of low-weight edges, providing an inductive bias well-suited for coreference, i.e., not all pairs of points in a cluster are nearby (Figure 1). This inductive bias is not achieved if we instead sum edge weights and find the shortest path.
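To make the minimax dissimilarity concrete, the following sketch (not the paper's implementation; the node names and the modified-Dijkstra formulation are our own) computes the bottleneck-path cost over a small directed graph:

```python
import heapq

def minimax_dissimilarity(edges, source, target):
    """Bottleneck-path dissimilarity: the minimum, over all directed paths
    from source to target, of the largest edge weight on the path.
    Returns float('inf') if target is unreachable."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
    best = {source: float('-inf')}  # best known bottleneck cost to each node
    heap = [(float('-inf'), source)]
    while heap:
        bottleneck, u = heapq.heappop(heap)
        if u == target:
            return bottleneck
        if bottleneck > best.get(u, float('inf')):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            cand = max(bottleneck, w)  # path cost = largest edge seen so far
            if cand < best.get(v, float('inf')):
                best[v] = cand
                heapq.heappush(heap, (cand, v))
    return float('inf')
```

For example, with edges e→m1 (weight 5), m1→m2 (weight 2), and e→m2 (weight 9), the dissimilarity f(e, m2) is 5: the chain through m1 has a smaller bottleneck than the direct edge, exactly the behavior motivated above.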
Edge Weights With this definition of dissimilarity, we now define how edge weights are calculated. We use two models: a mention-pair affinity model, ϕ : M × M → R, and a mention-entity affinity model, ψ : E × M → R. An edge between two mentions m_i and m_j has weight

w_{m_i, m_j} = −ϕ(m_i, m_j),

and the weight of the edge from entity e to m_i is

w_{e, m_i} = −ψ(e, m_i).

Each of ϕ(·, ·) and ψ(·, ·) is independently parameterized by dual-encoder transformer models (Gillick et al., 2019; Humeau et al., 2019), one for mentions (Enc_M) and one for entities (Enc_E). The affinity models are simply the inner products of the associated encoded representations:

ϕ(m_i, m_j) = Enc_M(m_i)^⊤ Enc_M(m_j),    ψ(e, m_i) = Enc_E(e)^⊤ Enc_M(m_i).

For the mention encoder, Enc_M, the transformer input is the surrounding mention context with the mention span marked by special tokens [START] and [END]:

[CLS] c_left [START] m_i [END] c_right [SEP]

where c_left and c_right are the left and right contexts of the mention m_i in the document. For the entity encoder, Enc_E, the transformer takes as input the title and description of the entity:

[CLS] e_title [TITLE] e_desc [SEP]

In this input, e_desc is the token sequence corresponding to the description of the entity, which could include natural language text related to the entity, such as a "wiki" entry, a list of entity aliases, or other available features useful in forming an entity representation. We use a special token [TITLE] to separate the title text from the description.

[Figure: example mention contexts (e.g., "...brain activity...", "...enzymatic activity...") encoded by Enc_M, and KB entities (e.g., C0443158 "Brain Activity", "Enzyme Activity") encoded by Enc_E, producing the embedding vectors that are compared via inner products.]
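The input formatting and inner-product scoring above can be sketched as follows; `toy_encode` is a hypothetical stand-in for the transformer encoders, and all function names are our own illustrations, not the authors' implementation:

```python
import numpy as np

def format_mention_input(c_left, span, c_right):
    # [CLS] c_left [START] mention [END] c_right [SEP]
    return f"[CLS] {c_left} [START] {span} [END] {c_right} [SEP]"

def format_entity_input(title, desc):
    # [CLS] e_title [TITLE] e_desc [SEP]
    return f"[CLS] {title} [TITLE] {desc} [SEP]"

def toy_encode(text, dim=16):
    """Stand-in for Enc_M / Enc_E: a deterministic pseudo-random unit
    vector derived from the input text (illustration only)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def affinity(vec_a, vec_b):
    # Both phi and psi are inner products of the encoded representations.
    return float(vec_a @ vec_b)
```

With real encoders, `toy_encode` would be replaced by the pooled transformer output for the formatted sequence; the scoring function stays a plain inner product, which is what makes nearest-neighbor candidate retrieval efficient.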

Training Procedure
We now define our approach for training the affinity models, ϕ(·, ·) and ψ(·, ·), and their associated encoders, Enc_M and Enc_E. Our objective is to optimize the dissimilarity function f(·, ·) such that the inference procedure produces a set of clusters that each contain exactly one entity, and every mention is assigned to the cluster containing its ground truth entity. We optimize f(·, ·) using mini-batch gradient descent by sequentially building batches of mentions B ⊂ M over the training data, where each m_i ∈ B has its gold entity defined by e⋆_i. We then build a graph G_B with nodes consisting of (a) each m_i ∈ B, (b) each mention coreferent to m_i ∈ B, and (c) the set of gold entities for each m_i ∈ B. For every m_i, we build a set of directed edges defined by

E_{m_i} = {(e⋆_i, m_i)} ∪ {(m, m_i) : m ∈ M_{e⋆_i} \ {m_i}},

where M_{e⋆_i} is the set of coreferent mentions with e⋆_i as the ground-truth entity. The complete set of edges in graph G_B for a mini-batch B is then given by E(G_B) = ∪_{m_i ∈ B} E_{m_i}. Observe that the resultant edges ensure that each connected component contains exactly one entity (namely, the gold entity for the mentions in that connected component).

Forming Clusters for Positive Sampling
The graph G_B is input to a constrained clustering procedure that partitions a graph G into disjoint clusters C = {C_1, ..., C_M} such that each cluster contains at most one entity. There are three constraints that every C ∈ C must satisfy:

(i) |C ∩ E| ≤ 1,
(ii) w_{x,y} ≤ λ for every edge (x, y) on a path between any pair of nodes u, v ∈ C,
(iii) every node in C is connected to every other node in C,

where λ is a hyperparameter representing the dissimilarity threshold over which edges between nodes are dropped. We set λ = ∞ during training. These constraints ensure that (i) there is at most one entity in each cluster, (ii) if u is reachable from v then every edge on the path from v to u has a weight ≤ λ, and (iii) each node in the cluster has a path connecting itself with every other node in the cluster.
We solve this constrained clustering problem, i.e., partition graph G, using a process similar to Angell et al. (2021). Specifically, we first remove all edges in graph G with weight greater than the threshold λ. We then evaluate each remaining edge (u, v) in descending order of weight and check whether its presence violates any of the three constraints defined above, removing the edge if it does. If not, we check whether there is an entity in the connected component C_u of node u, i.e., |C_u ∩ E| = 1. If so, we temporarily drop edge (u, v) and check whether v can still be reached from an entity node. If it can, we permanently drop (u, v), maintaining the validity of constraint (i) as well as our minimax dissimilarity function f(·, ·). If no entity can reach v, we retain edge (u, v), preserving the connectivity of the cluster, and iterate further. Our predicted clusters are the resultant connected components in the partitioned graph G.
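A minimal sketch of this pruning loop, assuming (as in the training graphs) that every component contains exactly one entity; this simplified version keeps an edge only when dropping it would disconnect its head node from all entities, and omits the explicit three-constraint bookkeeping:

```python
def prune_to_arborescence(entities, edges, lam=float('inf')):
    """Greedy pruning in the spirit of the constrained clustering step:
    drop the heaviest edges first, keeping an edge only when removing it
    would make its head node unreachable from every entity.
    `edges` maps directed edges (u, v) -> weight."""
    kept = {e: w for e, w in edges.items() if w <= lam}

    def reachable_from_entity(node, edge_set):
        # BFS from every entity over the surviving directed edges.
        frontier, seen = list(entities), set(entities)
        while frontier:
            u = frontier.pop()
            if u == node:
                return True
            for (a, b) in edge_set:
                if a == u and b not in seen:
                    seen.add(b)
                    frontier.append(b)
        return False

    for (u, v), w in sorted(kept.items(), key=lambda kv: -kv[1]):
        trial = set(kept) - {(u, v)}
        if reachable_from_entity(v, trial):
            del kept[(u, v)]  # safe to drop: v is still entity-reachable
    return kept
```

On a toy cluster with entity e and mentions m1, m2, where e→m1 and m1→m2 are the cheapest connections, the procedure drops the heavier direct and back edges and retains exactly the arborescence e→m1→m2 rooted at the entity.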
Using this procedure on each E m i to generate a pruned set of edges E ⋆ m i , we construct a partitioned target graph which is used to optimize the parametric encoder models. Note that each mention node in a target edge set E ⋆ m i has only one incoming edge originating from either an entity or a mention, and the selection of E ⋆ m i was done in a way to minimize the dissimilarity function f (·, ·) between mentions and entities with coreferent labels on the subgraph of the mini-batch.
For every cluster with an entity node, the edge structure is a directed analogue of the minimum spanning tree, where there exists a directed path from the entity node to every other node in the cluster. This structure is often referred to as the minimum spanning arborescence, thus lending its name to our method, i.e. ARBORESCENCE-based linking.
Negative Sampling Akin to the graph embedding objectives used by Nickel and Kiela (2018) and others, we construct our objective by sampling hard negative edges. For each mention m_i ∈ B, the set of negative edges N(m_i) consists of the k/2 lowest-weight incoming edges from E \ {e⋆_i} and the k/2 lowest-weight incoming edges from M \ M_{e⋆_i}, where k is a tuned hyperparameter. In other words, we sample the negative mention and entity edges that are most similar to the gold edge. We also experimented with the standard cross-entropy formulation, but found its performance subpar.
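Since edge weights are negated affinities, the k/2 lowest-weight incoming edges correspond to the k/2 highest-scoring non-gold candidates. A hypothetical sketch of this selection (function and argument names are ours):

```python
import numpy as np

def sample_hard_negatives(scores_entities, scores_mentions, gold_entity_idx,
                          coreferent_idxs, k=10):
    """Pick the k/2 highest-affinity non-gold entities and the k/2
    highest-affinity non-coreferent mentions as hard negatives
    (lowest-weight incoming edges == highest affinity scores)."""
    ent_order = np.argsort(-scores_entities)    # most similar entities first
    ment_order = np.argsort(-scores_mentions)   # most similar mentions first
    ent_negs = [i for i in ent_order if i != gold_entity_idx][: k // 2]
    ment_negs = [i for i in ment_order if i not in coreferent_idxs][: k // 2]
    return ent_negs, ment_negs
```

In practice the scores would be the dual-encoder inner products over the current batch or index; the point is only that both negative pools are filtered of gold/coreferent items before truncation to k/2 each.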

Experiments
We are interested in investigating the following empirical research questions: (1) does explicitly modeling mention coreference improve dual-encoder candidate retrieval? (2) do higher-quality retrieved candidates improve downstream cross-encoder re-ranking? and (3) are the learned mention representations effective for cross-document coreference and entity discovery?

Datasets
We run experiments on two entity linking datasets that both require generalization to unseen entities at test time. Each document in the datasets contains a set of entity mention spans, which are pre-defined using common mention-detection heuristics. KB entities are composed of two metadata attributes, an entity title and a description, both of which are natural language sequences of text. ZeShEL additionally contains a fine-grained type specification, which is needed due to the diverse, disjoint domains contained in the dataset. The statistics for both datasets are reported in Table 2.
MedMentions (Mohan and Li, 2019) is a collection of titles and abstracts of biomedical research papers. The KB used for this dataset is the 2017AA full version of UMLS. The validation and test sets contain both entities that are present in the training set and entities that are zero-shot (never seen at training time). We use the author-recommended ST21pv subset.
ZeShEL (Logeswaran et al., 2019) is a collection of crowd-sourced wikis, which are divided into train, validation, and test splits such that no Fandom topic overlaps across the sets. In this way, all entities that appear at validation and test time are not seen during training.

Dual-Encoder Retrieval
In order to robustly evaluate the benefit of modeling coreference relationships for learning representations, we construct three variants of our proposed dual-encoder training objective, each of which jointly trains both the mention-mention similarity function ϕ(·, ·) and the mention-entity similarity function ψ(·, ·). We compare to baselines that only explicitly train ψ(·, ·) and rely on ϕ(·, ·) sharing representations with ψ(·, ·) to provide meaningful mention-mention similarities. Our proposed objectives differ in how the positive training pairs are constructed, thus providing a way to analyze the general idea of using coreference rather than any one specific target structure for training. Our baselines are identical to each other except in how negatives are sampled.

Arborescence In the first training variant, for each mention query, we begin by constructing a fully-connected graph of the ground truth coreferent mention cluster along with the gold entity. We then apply the pruning procedure described in the previous section to compute an arborescence rooted at the entity node. From the resultant graph, each pair of a mention and its incoming-edge node (which can be either a coreferent mention or the gold entity) is treated as a positive example for training. Following previous work (Gillick et al., 2019), we use hard negative mining with k = 10 negatives composed of an equal number of mentions and entities.

1-NN Arborescence
Instead of constructing a fully-connected k-NN graph over the entire gold cluster, in this variant we approximate the arborescence structure by pruning a restricted graph of only the gold entity, the query mention, and the most similar within-cluster mention neighbor of the query. We keep all other details of the training procedure identical to the first variant.
1-Rand Arborescence A third training objective we explore modifies the initial k-NN graph construction by restricting the nodes to the gold entity, the query mention, and a random within-cluster mention neighbor of the query, instead of the nearest neighbor.
Baselines We compare to two baselines following previous work: (a) training ψ(·, ·) with random negatives (IN-BATCH NEGATIVES) where each gold entity for a mention in a training batch is treated as a negative example for all other mentions in the batch, and (b) training ψ(·, ·) with hard negatives (K-NN NEGATIVES) similar to the negative mining in our proposed methods albeit with only mention-entity positive selection.

Results
In Table 1, we report the test set recall@64 for each dual-encoder model, where a prediction is counted as a hit if the gold entity is retrieved in the top-64 candidates for a mention.
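The recall@64 metric reported in Table 1 reduces to a simple membership check over the retrieved candidate lists; a minimal sketch (names are ours):

```python
def recall_at_k(retrieved, gold, k=64):
    """Fraction of mentions whose gold entity appears among the
    top-k retrieved candidates.
    retrieved: list of candidate lists (ranked), one per mention.
    gold: list of gold entity ids, aligned with `retrieved`."""
    hits = sum(1 for cands, g in zip(retrieved, gold) if g in cands[:k])
    return hits / len(gold)
```

Setting k = 1 gives the linking accuracy of the retriever alone, i.e., the recall@1 numbers discussed above.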
On each dataset, we additionally include the performance of candidate generators used by previous works that we compare to. We find that models trained with explicit coreference relationships outperform those that incorporate this relationship only indirectly. For recall@64, our proposed methods improve over the baselines by at least 7.94 percentage points on MedMentions and 0.93 points on ZeShEL. Even at linking, or recall@1, our proposed methods show similar improvements with gains of 13.61 and 1.52 points over the next best baseline models. We perform a more comprehensive analysis of the dual-encoder linking performance and describe our inference approach and results in Appendix §A.2 and §A.3.
We posit that much of the observed gains in recall using our methods result from higher quality mention embeddings generated due to a wide array of surface forms available to mention queries at training. Since each training example evaluates not only the gold entity but also its coreferent mentions, this leads to better generalization of representations. We evaluate this improvement in representations in the clustering / coreference setting in §3.5.

Cross-Encoder Re-ranking
To answer our second research question, we compare five cross-attention models, which are trained using entity candidates generated by the dual-encoder variants discussed in the previous experiment. Training and inference batches are constructed by concatenating each mention with an entity candidate, separated by a [SEP] token. Following prior work, we use the top-64 retrieved entities as hard negatives during training and as linking candidates during inference.
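A sketch of this batch construction, with hypothetical helper names; the exact special tokens surrounding the mention and entity text are an assumption:

```python
def format_crossencoder_input(mention_ctx, entity_text):
    """Concatenate a mention (with its context) and one candidate entity,
    separated by [SEP], as a single cross-encoder input sequence."""
    return f"[CLS] {mention_ctx} [SEP] {entity_text} [SEP]"

def build_reranker_batch(mention_ctx, candidates):
    # One input per retrieved candidate (e.g., the top-64 at inference);
    # the cross-encoder scores each pair with full cross-attention.
    return [format_crossencoder_input(mention_ctx, c) for c in candidates]
```

Unlike the dual-encoder, every mention-candidate pair here requires a full forward pass, which is why retrieval quality of the top-64 shortlist matters so much downstream.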
Results We report the cross-encoder linking accuracy for MedMentions in Table 3. We additionally report the breakdown of accuracy on the subset of test mentions whose ground truth entities were never evaluated ("unseen") during training, illustrating the zero-shot capability of the models. We also include the current state-of-the-art results of Angell et al. (2021), which use an n-gram based model for candidate generation and two cross-encoder models, one each for mention-mention and mention-entity scoring, for re-ranking. We observe that each cross-encoder trained with candidates generated by an arborescence-based model outperforms the baselines, including the current SOTA by at least 0.63 points, and the best performing model, ARBORESCENCE, achieves a 1.63 point gain. We note, however, that Angell et al. (2021) does better on unseen entities by 1.91 points compared to ARBORESCENCE, which might be a result of the reduced search scope afforded by the within-document nature of their TF-IDF retriever. Table 4 contains linking results for ZeShEL, where each reported model varies only in the method used for retrieving the entity candidates, while the cross-encoder re-ranker training method is held constant (K-NN NEGATIVES with k = 64). Since ZeShEL is completely zero-shot, we do not include a seen-unseen analysis. We follow prior work and report the unnormalized accuracy, which is calculated as the percentage of successes out of the total number of query mentions in the test set, and the macro-averaged unnormalized accuracy, which is a simple average of the unnormalized accuracies over the different "worlds", or domains, in the test set. We find that the best performing model is 1-RAND ARBORESCENCE, with a 1.19 point gain in macro-averaged accuracy over the next best model.
We also note that, unlike on MedMentions, not all of our proposed models have higher accuracy than the mention-entity baselines. Since a key motivation for the arborescence-based methods is to explicitly model coreference relationships during training, we expect performance gains to be strongly correlated with the number of coreference links present within the dataset. We analyze the two datasets in terms of the number of mentions for each KB entity, which can be thought of as the size of each cluster of coreferent mentions. We report a histogram of this distribution in Figure 2 and find that the clusters in ZeShEL are typically very small (at most 3), whereas in MedMentions each cluster has many more mentions, with maximum sizes of 1256, 434, and 447 across the train, validation, and test sets.
Finally, we also provide representative examples comparing the link predictions of our best-performing ARBORESCENCE-based method to the baseline of Angell et al. (2021) on MedMentions and to the corresponding baseline on ZeShEL, in Appendix Table 7 and Table 8, respectively.

Oracle Inference
In this setting, we isolate the re-ranking capability of the cross-encoder from the quality of the candidates retrieved at inference. This setting removes the upper bound that retrieval places on re-ranking accuracy by artificially injecting the ground-truth entity into the top-64 candidates retrieved at inference for each mention where retrieval failed. An additional setting we explore holds this oracle candidate set constant across each variant of the cross-encoder by taking a union over all dual-encoder candidate sets and then injecting the ground truth. This construction disentangles the factor of candidate retrieval quality at inference, which otherwise conflates the comparison of re-ranking performance. We refer to these oracle settings as SELF and UNION, respectively.
Results As seen in column Oracle of Table 3, the baseline models show higher linking accuracy than our proposed methods when the gold entity is guaranteed to be present in the original candidate set. However, the performance of the baseline models drops significantly (≥ 32 points) when evaluated with the UNION candidate set, while the arborescence-based models show a ± 0.9 point variation. We believe this discrepancy clearly highlights the poor quality of candidates retrieved by the baseline models compared to our proposed methods. This also explains the inflation in accuracy of the baselines on the SELF set due to the trivial discrimination task presented to the cross-encoders. We further point to linking performance on the UNION set, which provides the more challenging task of differentiating between higher quality candidates that are similar. We argue that the large performance difference (≥ 26.75 points) is strongly indicative of the greater linking capacity of our proposed methods.
In Table 4, we report both the micro accuracy and macro-averaged accuracy for the two oracle sets. We observe that 1-RAND ARBO performs the best on the UNION set, but is marginally outperformed by IN-BATCH on micro accuracy on the SELF set by 0.02 points. In contrast to the fluctuation on MedMentions, the relative uniformity in results on the oracle candidate sets indicates that the candidates generated by each model have similar quality.

Mention Coreference
Next, we evaluate the quality of the learned mention representations for cross-document coreference, using the entity labels of each mention as its ground truth cluster assignment. To form clusters, we build mention-only arborescences using the clustering procedure described in §2.3, tuning the threshold value, λ, on the validation data.

[Figure 3: Mention coreference recall@64 compared with entity linking recall@64 for each dual-encoder training procedure on MedMentions and ZeShEL. There is a positive correlation when comparing coreference-based procedures with entity-only methods, which is stronger on the highly-coreferent MedMentions dataset than on ZeShEL.]

We report the Adjusted Rand Index (ARI) clustering scores in Table 5 using each of our dual-encoder representation learning objectives. For both ZeShEL and MedMentions, we report ARI on all the test mentions (denoted ALL). For MedMentions, we report two additional settings: (a) ARI when clustering mentions whose ground truth entity was not seen at training (denoted UNSEEN ONLY), and (b) clustering on all mentions but evaluating only on the UNSEEN ONLY set (denoted ALL/UNSEEN). The results show that representations learned with the ARBORESCENCE objective perform best in each setting, aligning with the inductive bias of its training procedure and indicating its utility in settings where new entities must be discovered.
We further probe the inductive bias of the arborescence-based training procedures by inspecting whether improvements in entity linking recall are accompanied by similar gains in mention coreference performance. In Figure 3, we plot entity and mention recall@64 for each training method on the test sets of the two datasets. Mention recall is calculated by retrieving the 64 nearest neighbors of each mention and counting the number of coreferent neighbors as a proportion of the total number of coreferent mentions, capped at 64. Entity recall is calculated as defined in §3.2. We find that entity recall indeed shows a positive correlation with mention recall on both datasets when the proposed coreference-based training procedures are compared with entity-only methods. We posit that this demonstrates the efficacy of using explicit mention coreference relationships to learn representations for entity linking.
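The mention recall@64 metric described above can be sketched as follows (pure NumPy with exact neighbor search; function and variable names are ours, not the authors'):

```python
import numpy as np

def mention_recall_at_k(mention_embs, entity_labels, k=64):
    """Average fraction of each mention's coreferent mentions found among its
    k nearest neighbors, with the denominator capped at k. Uses exact search
    over cosine-normalized embeddings."""
    X = mention_embs / np.linalg.norm(mention_embs, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)          # exclude self-matches
    labels = np.asarray(entity_labels)
    recalls = []
    for i in range(len(labels)):
        n_coref = int((labels == labels[i]).sum()) - 1
        if n_coref == 0:
            continue                          # no coreferent mentions to find
        topk = np.argsort(-sims[i])[:k]
        hits = int((labels[topk] == labels[i]).sum())
        recalls.append(hits / min(n_coref, k))
    return float(np.mean(recalls))
```

Capping the denominator at k ensures a mention with more than 64 coreferent mentions can still achieve perfect recall.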

Related Work
Entity Linking Entity linking has been widely studied (Milne and Witten, 2008; Cucerzan, 2007; Lazic et al., 2015b; Gupta et al., 2017; Raiman and Raiman, 2018; Kolitsas et al., 2018; Cao et al., 2021, inter alia). Dutta and Weikum (2015) combine clustering-based cross-document coreference decisions with linking, but their approach is built around sparse bag-of-words representations and is not well suited to the embedding-based representations used in this work. Other works use global or collective models (Kulkarni et al., 2009; Hoffart et al., 2011; Cheng and Roth, 2013; Ganea and Hofmann, 2017; Le and Titov, 2018, inter alia), which consider the compatibility of entity linking decisions made within the same document(s) rather than making independent predictions. Zhang and Stratos (2021) use noise contrastive estimation to mine hard negatives for the linking task.
Alternatives to Cross-Encoders Our work demonstrates how clustering-based training improves dual- and cross-encoder models for linking and discovery. If prediction efficiency, and not training efficiency, were the only concern, one could also use model distillation to improve dual-encoder performance (Hinton et al., 2015; Izacard and Grave, 2021, inter alia). One could also consider models such as poly-encoders as alternatives to dual-encoders (Humeau et al., 2020).

Conclusion
We present a novel approach for learning mention and entity representations for use in entity linking candidate generation and prediction, as well as in the discovery of new entities. Our method uses an objective that explicitly incorporates mention-to-mention coreference relationships. We demonstrate its empirical effectiveness through analysis on two datasets: MedMentions and the Zero-Shot Entity Linking dataset. As future work, we hope to further analyze these objectives through the lens of efficiency, distillation, and domain transfer.

Ethical Considerations
There are several ways in which entity linking and resolution models could be biased, and those biases have the potential to cause harmful downstream consequences. There is already a large body of work studying biases in language models (such as those used for fine-tuning in our work) and coreference models, most notably in understanding when coreference error rates differ across certain populations (e.g., genders, races, and, more broadly, other entity types that display skewed distributions in the data). For instance, if entity mentions are author names in citation data and the entities are scientific authors, aggregated statistics like h-index or citation count could be biased if the models used to disambiguate the author names are biased. If entity linking and discovery systems are used to build or populate knowledge bases, those systems may propagate these biased predictions. This can be particularly problematic if such a biased knowledge base were used to train future models, thus perpetuating and amplifying the skew. Lastly, we note that entity linking and discovery are analogous to surveillance and tracking in computer vision, which warrants giving substantial weight to ethical considerations.

A.1 Experiment Details
Each training procedure is run on a single machine using 2 NVIDIA Quadro RTX 8000 GPUs. Our dual-encoder models for ZeShEL and MedMentions have 218M and 230M parameters, respectively. Each variant is optimized with the Adam optimizer for 5 epochs, using a mini-batch size of 128 to accumulate gradients. Experiments with batch sizes < 128 performed poorly, possibly due to increased fluctuation of gradients, and sizes > 128 were computationally infeasible with our available compute resources. For ZeShEL, the dual-encoder models are trained using 192 warm-up steps and learning rates of 1e-5, 3e-5, and 3e-5 for IN-BATCH, K-NN, and ARBORESCENCE-based models, respectively. For MedMentions, each model is trained using 464 warm-up steps and a learning rate of 3e-5. All cross-encoder models are trained with a mini-batch size of 2, a learning rate of 2e-5, and an additional linear layer. Our MedMentions and ZeShEL cross-encoder models have 108M and 109M parameters, respectively. We use FAISS (Johnson et al., 2017; https://github.com/facebookresearch/faiss) for fast nearest-neighbor search during graph construction at both training and inference. For MedMentions, the execution time was 70 mins to embed and index 2M entities and 120K mentions, and 20 mins to perform exact nearest-neighbor search for the 120K mentions.
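The exact inner-product search performed here is what FAISS's flat index computes; the NumPy sketch below shows the equivalent brute-force computation (illustrative only; FAISS is used in practice for speed at this scale):

```python
import numpy as np

def exact_knn(index_embs, query_embs, k):
    """Brute-force inner-product k-NN: for each query, return the scores and
    indices of its k highest-scoring index vectors, best first."""
    scores = query_embs @ index_embs.T               # (n_query, n_index)
    # argpartition finds the top-k candidates in O(n) per row...
    idx = np.argpartition(-scores, kth=k - 1, axis=1)[:, :k]
    # ...then each row's k candidates are ordered by descending score.
    row = np.arange(len(query_embs))[:, None]
    order = np.argsort(-scores[row, idx], axis=1)
    top_idx = idx[row, order]
    return scores[row, top_idx], top_idx
```

For the corpus sizes reported above (2M entities, 120K mentions), a library like FAISS replaces this dense matrix product with optimized batched kernels.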

A.2 Dual-Encoder Inference Procedure
Building the Graph The structure of the graph G impacts the dissimilarity function by changing the paths between pairs of nodes in addition to changing which pairs of nodes are connected. We advocate for a simple, deterministic approach to constructing this graph. For each mention m, construct E_m by (a) adding edges from m's k-nearest neighbor mentions in M to m, and (b) adding an edge from m's nearest entity to m. The complete collection of edges in G is given by E(G) = ∪_{m∈M} E_m. There are other ways one could conceivably pick the pairs of mentions to be connected in the graph. For example, one could use the minimum spanning tree over the mentions. This approach, however, has two drawbacks: (a) the directionality of nearest-neighbor relationships is ignored, introducing noise into the graph, and (b) the resultant graph includes edges that cross cluster boundaries, because all pairs of mentions are forced to be connected, which is undesirable.
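The deterministic construction above can be sketched as follows (pure NumPy; the node-id encoding and function names are ours, for illustration):

```python
import numpy as np

def build_graph_edges(mention_embs, entity_embs, k):
    """Directed edge set E(G) = union over mentions m of E_m, where E_m holds
    (a) edges from m's k nearest mention neighbors to m, and (b) one edge from
    m's nearest entity to m. Node ids: mentions are 0..n_m-1, entities are
    offset by n_m."""
    n_m = len(mention_embs)
    mm = mention_embs @ mention_embs.T
    np.fill_diagonal(mm, -np.inf)                    # no self-edges
    me = mention_embs @ entity_embs.T
    edges = set()
    for m in range(n_m):
        for nbr in np.argsort(-mm[m])[:k]:           # (a) k-NN mention -> m
            edges.add((int(nbr), m))
        edges.add((n_m + int(np.argmax(me[m])), m))  # (b) nearest entity -> m
    return edges
```

Note that the edges are directed toward each mention m, preserving the directionality of nearest-neighbor relationships that an undirected spanning tree would discard.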

Forming Clusters & Making Predictions
To make linking decisions for each mention m_i^d, we assign the ID of the entity present in the mention's cluster as the linking label (or NIL if there is no entity in the cluster). Let C(m_i^d) be the predicted cluster of mention m_i^d; then the prediction is e if some entity e ∈ C(m_i^d), and NIL otherwise. Furthermore, the target clusters we aim to predict in the entity discovery setting are exactly C.
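The cluster-to-prediction rule above is simple to state in code. This sketch assumes each cluster contains at most one entity (the string prefixes "e:" and "m:" are a hypothetical node-id encoding, not the paper's):

```python
def link_predictions(clusters):
    """Map each mention to the entity in its predicted cluster, or NIL.
    `clusters` is an iterable of node-id sets; entity ids start with 'e:'
    and mention ids with 'm:'."""
    preds = {}
    for cluster in clusters:
        entities = [n for n in cluster if n.startswith("e:")]
        label = entities[0] if entities else "NIL"   # at most one entity assumed
        for n in cluster:
            if n.startswith("m:"):
                preds[n] = label
    return preds
```

Mentions in entity-free clusters receive NIL, and those clusters are exactly the candidates for newly discovered entities.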

A.3 Experiment: Dual-Encoder Linking
Each model is evaluated using three inference procedures. "Independent" refers to predictions made using only mention-entity edges; this method was used in prior work to generate candidates for a cross-encoder model trained on ZeShEL. "Clustering (UNDIRECTED)" refers to a hierarchical agglomerative clustering (HAC) procedure, following Angell et al. (2021), which is akin to the positive-sampling procedure used to train our arborescence models but without edge directionality. "Clustering (DIRECTED)" adds directed edges to the previous method. For each model, we pick the best-performing inference procedure on the validation set and report test set performance.
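For the UNDIRECTED variant, cutting a single-linkage agglomerative clustering at a threshold λ is equivalent to taking connected components over the edges whose dissimilarity falls below λ. A minimal union-find sketch of that equivalence (our names, not the authors' code):

```python
def threshold_clusters(n_nodes, weighted_edges, lam):
    """Connected components over edges (u, v, dissimilarity) with
    dissimilarity < lam -- equivalent to cutting a single-linkage HAC
    dendrogram at threshold lam."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for u, v, w in weighted_edges:
        if w < lam:
            parent[find(u)] = find(v)        # union the two components
    clusters = {}
    for x in range(n_nodes):
        clusters.setdefault(find(x), []).append(x)
    return sorted(clusters.values())
```

The threshold λ plays the same role here as in the mention-coreference evaluation: it is tuned on validation data to trade off cluster purity against completeness.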
We report the linking accuracy in Table 6 and leave out models from previous works since they do not report linking metrics of their candidate generators. We specify the inference method used in each case, chosen based on the validation set accuracy of the models. Similar to our cross-encoder results in Table 3, we also report the "seen" and "unseen" performance on MedMentions.

A.4 Qualitative Results
In Table 7 and Table 8, we provide representative examples that demonstrate the improvement in entity linking that our proposed coreference-based methodology empirically provides on MedMentions and ZeShEL, respectively. In these examples, the candidate generator of Angell et al. (2021) fails to retrieve the correct entity, so their cross-encoder cannot correctly link the mention, whereas our coreference-based dual-encoder retrieves the correct entity in the candidate set of 64 entities, and the cross-encoder then links the mention correctly.