MOLEMAN: Mention-Only Linking of Entities with a Mention Annotation Network

We present an instance-based nearest neighbor approach to entity linking. In contrast to most prior entity retrieval systems which represent each entity with a single vector, we build a contextualized mention-encoder that learns to place similar mentions of the same entity closer in vector space than mentions of different entities. This approach allows all mentions of an entity to serve as "class prototypes", as inference involves retrieving from the full set of labeled entity mentions in the training set and applying the nearest mention neighbor's entity label. Our model is trained on a large multilingual corpus of mention pairs derived from Wikipedia hyperlinks, and performs nearest neighbor inference on an index of 700 million mentions. It is simpler to train, gives more interpretable predictions, and outperforms all other systems on two multilingual entity linking benchmarks.


Introduction
A contemporary approach to entity linking represents each entity with a textual description d_e, encodes these descriptions and contextualized mentions of entities, m, into a shared vector space using dual-encoders f(m) and g(d_e), and scores each mention-entity pair as the inner-product between their encodings (Botha et al., 2020; Wu et al., 2019). By restricting the interaction between e and m to an inner-product, this approach permits the pre-computation of all g(d_e) and fast retrieval of top-scoring entities using maximum inner-product search (MIPS).
Here we begin with the observation that many entities appear in diverse contexts, which may not be easily captured in a single high-level description. For example, actor Tommy Lee Jones played football in college, but this fact is not captured in the entity description derived from his Wikipedia page (see Figure 1). Furthermore, when new entities need to be added to the index in a zero-shot setting, it may be difficult to obtain a high-quality description. We propose that both problems can be solved by allowing the entity mentions themselves to serve as exemplars. In addition, retrieving from the set of mentions can result in more interpretable predictions, since we are directly comparing two mentions, and allows us to leverage massively multilingual training data more easily, without forcing choices about which language(s) to use for the entity descriptions.
We present a new approach (MOLEMAN) that maintains the dual-encoder architecture, but with the same mention-encoder on both sides. Entity linking is modeled entirely as a mapping between mentions, where inference involves a nearest neighbor search against all known mentions of all entities in the training set. We build MOLEMAN using exactly the same mention-encoder architecture and training data as Model F (Botha et al., 2020). We show that MOLEMAN significantly outperforms Model F on both the Mewsli-9 and Tsai and Roth (2016) datasets, particularly for low-coverage languages and rarer entities.
We also observe that MOLEMAN achieves high accuracy with just a few mentions for each entity, suggesting that new entities can be added or existing entities can be modified simply by labeling a small number of new mentions. We expect this update mechanism to be significantly more flexible than writing or editing entity descriptions. Finally, we compare the massively multilingual MOLEMAN model to a much more expensive English-only dual-encoder architecture (Wu et al., 2019) on the well-studied TACKBP-2010 dataset (Ji et al., 2010) and show that MOLEMAN is competitive even in this setting.
Figure 1: Illustration of hypothetical contextualized mention (m) and multilingual description (d) embeddings for the entities 'Tommy Lee Jones (Q170587)' and 'Tom Jones (Q18152778)'. The query mention [ ] pertains to the former's college football career, which is unlikely to be captured by the high-level entity description. Retrieval against descriptions would get this query wrong, whereas retrieval against indexed mentions gets it right. Note that prior dual-encoder models that use a single vector to represent each entity are forced to contort the embedding space to solve this problem.

Overview
Task definition. We train a model that performs entity linking by ranking a set of entity-linked indexed mentions-in-context. Formally, let a mention-in-context x = [x_1, ..., x_n] be a sequence of n tokens from vocabulary V, which includes designated entity span tokens. An entity-linked mention-in-context m_i = (x_i, e_i) pairs a mention with an entity from a predetermined set of entities E. Let M_I = [m_1, ..., m_k] be a set of entity-linked mentions-in-context, let entity(·) : M_I → E be a function that returns the entity e_i ∈ E associated with m_i, and let x(·) be a function that returns the token sequence x_i.
Our goal is to learn a function φ(m) that maps an arbitrary mention-in-context token sequence m to a fixed vector h_m ∈ R^d with the property that
$$y^* = \mathrm{entity}\Big(\operatorname*{argmax}_{m' \in M_I} \; \phi(x(m'))^\top \phi(x_q)\Big)$$
gives a good prediction y* of the true entity label of a query mention-in-context x_q.
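As a minimal sketch of this prediction rule, the snippet below labels a query with the entity of its highest-scoring indexed mention, using the inner product as the score. The two-dimensional vectors are toy values standing in for encoder outputs, and the names `dot`, `predict_entity`, and `IndexedMention` are illustrative assumptions rather than anything defined in this paper.

```python
from typing import List, Sequence, Tuple

# One indexed mention: (encoded vector, entity label). The vectors below are
# toy stand-ins for the output of the mention encoder, not real embeddings.
IndexedMention = Tuple[Sequence[float], str]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict_entity(query_vec: Sequence[float], index: List[IndexedMention]) -> str:
    """Return the entity label of the highest-scoring indexed mention."""
    if not index:
        raise ValueError("index must contain at least one mention")
    _, best_entity = max(index, key=lambda item: dot(query_vec, item[0]))
    return best_entity


index = [
    ([1.0, 0.0], "Q170587"),    # Tommy Lee Jones, acting context
    ([0.0, 1.0], "Q170587"),    # Tommy Lee Jones, college football context
    ([0.6, 0.6], "Q18152778"),  # Tom Jones
]
query = [0.1, 0.9]  # toy encoding of a query mention about college football

prediction = predict_entity(query, index)
```

In this toy index the football mention of Q170587 scores highest, so the query inherits that label even though the acting mention of the same entity scores lowest.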

Model
Recent state-of-the-art entity linking systems employ a dual encoder architecture, embedding mentions-in-context and entity representations in the same space. We also employ a dual encoder architecture, but we score mentions-in-context (hereafter, mentions) against other mentions, with no consolidated entity representations. The dual encoder maps a pair of mentions (m, m′) to a score s(m, m′) computed from their encodings φ(m) and φ(m′), where φ is a learned neural network that encodes the input mention as a d-dimensional vector. As in Févry et al. (2020) and Botha et al. (2020), our mention encoder is a 4-layer BERT-based Transformer network (Vaswani et al., 2017; Devlin et al., 2019) with output dimension d = 300.

Mention Pairs Dataset
We build a dataset of mention pairs using the 104-language collection of Wikipedia mentions as constructed by Botha et al. (2020). This dataset maps Wikipedia hyperlinks to WikiData (Vrandečić and Krötzsch, 2014), a language-agnostic knowledge base. We create mention pairs from the set of all mentions that link to a given entity.
We use the same division of Wikipedia pages into train and test splits used by Botha et al. (2020) for compatibility with the TR2016 test set (Tsai and Roth, 2016). We take up to the first 100k mention pairs from a randomly ordered list of all pairs regardless of language, yielding 557M and 31M training and evaluation pairs, respectively. Of these, 69.7% of pairs involve two mentions from different languages. Our index set contains 651M mentions, covering 11.6M entities.
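The pair construction described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual data pipeline: the cap is applied per entity (the text does not spell out its scope), mentions are reduced to hypothetical `(mention_id, language, entity_id)` tuples, and all pairs are enumerated in memory, which would not scale to very frequent entities.

```python
import itertools
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Mention = Tuple[str, str, str]  # (mention_id, language, entity_id); hypothetical schema
Pair = Tuple[Mention, Mention]


def build_mention_pairs(
    mentions: Iterable[Mention],
    max_pairs_per_entity: int = 100_000,  # assumption: the 100k cap is per entity
    seed: int = 0,
) -> List[Pair]:
    """Pair up mentions of the same entity, ignoring language, with a random cap."""
    rng = random.Random(seed)
    by_entity: Dict[str, List[Mention]] = defaultdict(list)
    for mention in mentions:
        by_entity[mention[2]].append(mention)

    pairs: List[Pair] = []
    for entity_id in sorted(by_entity):
        candidates = list(itertools.combinations(by_entity[entity_id], 2))
        rng.shuffle(candidates)  # randomly ordered list of all pairs
        pairs.extend(candidates[:max_pairs_per_entity])
    return pairs


def cross_lingual_fraction(pairs: List[Pair]) -> float:
    """Fraction of pairs whose two mentions come from different languages."""
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a[1] != b[1]) / len(pairs)


mentions = [
    ("m1", "en", "Q1"),
    ("m2", "de", "Q1"),
    ("m3", "en", "Q1"),
    ("m4", "fr", "Q2"),  # a singleton entity yields no pairs
]
pairs = build_mention_pairs(mentions)
fraction = cross_lingual_fraction(pairs)
```

Because pairs are formed without regard to language, cross-lingual pairs arise naturally whenever an entity is mentioned in more than one language edition.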

Hard Negative Mining and Positive Resampling
Previous work using a dual encoder trained with in-batch sampled softmax has improved performance with subsequent training rounds using an auxiliary cross-entropy loss against hard negatives sampled from the current model (Gillick et al., 2019; Wu et al., 2019; Botha et al., 2020). We investigate the effect of such negative mining for MOLEMAN, controlling the ratio of positives to negatives on a per-entity basis. This is achieved by limiting each entity to appear as a negative example at most 10 times as often as it does in positive examples, as done by Botha et al. (2020). In addition, since MOLEMAN is intended to retrieve the most similar indexed mention of the correct entity, we experiment with using this retrieval step to resample the positive pairs used to construct our mention-pair dataset for the in-batch sampled softmax, pairing each mention m with the highest-scoring other mention m′ of the same entity in the index set. This is similar to the index refreshing that is employed in other retrieval-based methods trained with in-batch softmax (Guu et al., 2020; Lewis et al., 2020a).
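A rough sketch of the two mining steps described above is given below. It is an illustration only: the vectors are toy values, the functions `resample_positive` and `mine_hard_negatives` are hypothetical names, each mention is given a single hardest negative, and the cap treats an entity's mention count as its number of positive examples, which is an assumption about how the ratio is counted.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def resample_positive(
    i: int, vectors: List[Sequence[float]], entities: List[str]
) -> Optional[int]:
    """Index of the highest-scoring *other* mention of the same entity, if any."""
    same = [j for j in range(len(vectors)) if j != i and entities[j] == entities[i]]
    if not same:
        return None
    return max(same, key=lambda j: dot(vectors[i], vectors[j]))


def mine_hard_negatives(
    vectors: List[Sequence[float]], entities: List[str], max_neg_ratio: int = 10
) -> Dict[int, int]:
    """Map each mention to its hardest allowed negative (a mention of another entity).

    An entity may serve as a negative at most max_neg_ratio times its number of
    positive examples (assumed here to be its mention count).
    """
    positive_counts = Counter(entities)
    negative_counts: Counter = Counter()
    negatives: Dict[int, int] = {}
    for i, query in enumerate(vectors):
        # Mentions of other entities, hardest (highest-scoring) first.
        candidates = sorted(
            (j for j in range(len(vectors)) if entities[j] != entities[i]),
            key=lambda j: -dot(query, vectors[j]),
        )
        for j in candidates:
            entity = entities[j]
            if negative_counts[entity] < max_neg_ratio * positive_counts[entity]:
                negatives[i] = j
                negative_counts[entity] += 1
                break
    return negatives


vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.6], [0.0, 1.0]]
entities = ["A", "A", "B", "B"]
positives = [resample_positive(i, vectors, entities) for i in range(len(vectors))]
negatives = mine_hard_negatives(vectors, entities)
```

In a real training loop both steps would be rerun against a refreshed index after each round, since the hardest negatives and nearest positives change as the encoder changes.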

Input Representations
Following prior work (Wu et al., 2019; Botha et al., 2020), our mention representation consists of the page title and a window around the mention, with special mention boundary tokens marking the mention span. We use a total context size of 64 tokens.
Though our focus is on entity mentions, the entity descriptions can still be a useful additional source of data, and allow for zero-shot entity linking (when no mentions of an entity exist in our training set). We therefore experiment with adding the available entity descriptions as additional "pseudo-mentions". These are constructed in a similar way to the mention representations, except without mention boundaries. Organic and pseudo-mentions are fed into BERT using distinct sets of token type identifiers. We supplement our training set with additional mention pairs formed from each entity's description and a random mention, adding 38M training pairs, and add these descriptions to the index, expanding the entity set to 20M.
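One way such an input could be assembled is sketched below. The marker strings `[E]`, `[/E]`, and `[SEP]`, the even split of the context budget, and the use of pre-split word tokens are all assumptions made for illustration; the actual tokenization and special tokens are not specified here.

```python
from typing import List


def build_mention_input(
    title_tokens: List[str],
    doc_tokens: List[str],
    start: int,
    end: int,
    max_len: int = 64,
    open_marker: str = "[E]",    # hypothetical boundary token
    close_marker: str = "[/E]",  # hypothetical boundary token
    sep: str = "[SEP]",          # hypothetical title/context separator
) -> List[str]:
    """Title, then a context window with the mention span doc_tokens[start:end] marked."""
    if not 0 <= start < end <= len(doc_tokens):
        raise ValueError("invalid mention span")
    mention = doc_tokens[start:end]
    budget = max_len - (len(title_tokens) + 1 + len(mention) + 2)
    if budget < 0:
        raise ValueError("title and mention alone exceed max_len")

    left_avail, right_avail = start, len(doc_tokens) - end
    # Split the budget evenly, then hand any room unused on one side to the other.
    left = min(left_avail, budget // 2)
    right = min(right_avail, budget - left)
    left = min(left_avail, budget - right)

    return (
        title_tokens
        + [sep]
        + doc_tokens[start - left:start]
        + [open_marker]
        + mention
        + [close_marker]
        + doc_tokens[end:end + right]
    )


title = ["Tommy", "Lee", "Jones"]
long_doc = [f"w{i}" for i in range(200)]
long_input = build_mention_input(title, long_doc, start=100, end=102)
short_input = build_mention_input(title, ["He", "played", "football"], start=2, end=3)
```

A pseudo-mention built from an entity description would follow the same layout without the two boundary markers.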

Inference
For inference, we perform a distributed brute-force maximum inner product search over the index of training mentions. During this search, we can either return only the top-scoring mention for each entity, which improves entity-based recall, or else all mentions, which allows us to experiment with k-Nearest Neighbors inference (see Section 4.1).
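The two retrieval modes can be illustrated with a small sketch. The vectors are toy values, the function names are hypothetical, and the majority-vote rule with rank-based tie-breaking is only one plausible form of k-Nearest Neighbors inference, chosen here as an assumption.

```python
from collections import Counter
from typing import List, Sequence, Tuple

IndexedMention = Tuple[Sequence[float], str]  # (toy vector, entity label)


def rank_mentions(query: Sequence[float], index: List[IndexedMention]) -> List[Tuple[float, str]]:
    """Score every indexed mention by inner product with the query, best first."""
    scored = [(sum(q * x for q, x in zip(query, vec)), entity) for vec, entity in index]
    return sorted(scored, key=lambda pair: -pair[0])


def top_entities(query: Sequence[float], index: List[IndexedMention], k: int) -> List[str]:
    """Keep only the best-scoring mention per entity and return the top k entities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    seen: List[str] = []
    for _, entity in rank_mentions(query, index):
        if entity not in seen:
            seen.append(entity)
        if len(seen) == k:
            break
    return seen


def knn_predict(query: Sequence[float], index: List[IndexedMention], k: int) -> str:
    """Majority vote over the k nearest mentions; ties go to the better-ranked entity."""
    neighbors = rank_mentions(query, index)[:k]
    if not neighbors:
        raise ValueError("need k >= 1 and a non-empty index")
    votes = Counter(entity for _, entity in neighbors)
    best_count = max(votes.values())
    for _, entity in neighbors:
        if votes[entity] == best_count:
            return entity
    raise AssertionError("unreachable")


index = [
    ([1.0, 0.0], "Q170587"),
    ([0.0, 1.0], "Q170587"),
    ([0.6, 0.6], "Q18152778"),
    ([0.5, 0.7], "Q18152778"),
]
query = [0.1, 0.9]

entities_at_2 = top_entities(query, index, k=2)
one_nn = knn_predict(query, index, k=1)
three_nn = knn_predict(query, index, k=3)
```

In this toy index the single nearest mention belongs to Q170587, while two of the three nearest belong to Q18152778, so the two inference modes can disagree on the same query.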

Experiments
4.1 Mewsli-9
We also compare to the recent MGENRE system of De Cao et al. (2021), which performs entity linking using constrained generation of entity names. Note that MGENRE uses an expanded training set that results in fewer zero- and few-shot entities (see De Cao et al. (2021), Table 3). Table 2 shows per-language results for Mewsli-9.

Per-Language Results
A key motivation of Botha et al. (2020) was to learn a massively multilingual entity linking system, with a shared context encoder and entity representations between 104 languages in the Wikipedia corpus. MOLEMAN takes a step further: the indexed mentions from all languages are included in the retrieval index, and can contribute to the prediction in any language. In fact, we find that for 21.4% of mentions in the Mewsli-9 corpus, MOLEMAN's top prediction came from a different language.
Table 3 shows a breakdown in performance by entity frequency bucket, defined as the number of times an entity was mentioned in the Wikipedia training set. When indexing only mentions, MOLEMAN can never predict the entities in the 0 bucket, but it shows significant improvement in the other frequency bands, particularly in the "few shot" bucket of [1, 10). This suggests that when introducing new entities to the index, labelling a small number of mentions may be more beneficial than producing a single description. To further confirm this intuition, we retrained MOLEMAN with a modified training set which had all entities in the [1, 10) band of Mewsli-9 removed, and only added to the index at inference time. This model achieved +0.2 R@1 and +5.6 R@10 relative to Model F+ (which was trained with these entities in the train set). When entity descriptions are added to the index, MOLEMAN outperforms Model F+ across frequency bands.

Inference Efficiency
Due to the large size of the mention index, nearest neighbor inference is performed using distributed maximum inner-product search. We also experiment with approximate search using ScaNN (Guo et al., 2020). Table 4 shows throughput and recall statistics for brute force search as well as two approximate search approaches that run on a single multi-threaded CPU, showing that inference over such a large index can be made extremely efficient with minimal loss in recall.

Tsai Roth 2016 Hard
In order to compare against previous multilingual entity linking models, we report results on the "hard" subset of Tsai and Roth (2016)'s cross-lingual dataset, which links 12 languages to English Wikipedia.

TACKBP 2010
Recent work on entity linking has employed dual-encoders primarily as a retrieval step before reranking with a more expensive cross-encoder (Wu et al., 2019; Agarwal and Bikel, 2020). Table 6 shows results on the extensively studied TACKBP 2010 dataset (Ji et al., 2010). Wu et al. (2019) used a 24-layer BERT-based dual-encoder which scores the 5.9 million entity descriptions from English Wikipedia, followed by a 24-layer cross-encoder reranker. MOLEMAN does not achieve the same level of top-1 accuracy as their full model, as it lacks the expensive cross-encoder reranking step, but despite using a single, much smaller Transformer and indexing the larger set of entities from multilingual Wikipedia, it outperforms this prior work in retrieval recall at 100. We also report the accuracy of a MOLEMAN model trained only with English training data, and using an English-only index for inference. This experiment shows that although the multilingual index contributes to MOLEMAN's overall performance, the pairwise training data is sufficient for high performance in a monolingual setting.

Discussion and Future Work
We have recast the entity linking problem as an application of a more generic mention encoding task. This approach is related to methods which perform clustering on test mentions in order to improve inference (Le and Titov, 2018; Angell et al., 2020), and can also be viewed as a form of cross-document coreference resolution (Rao et al., 2010; Shrimpton et al., 2015; Barhom et al., 2019). We also take inspiration from recent instance-based language modelling approaches (Khandelwal et al., 2020; Lewis et al., 2020b).
Our experiments demonstrate that taking an instance-based approach to entity linking leads to better retrieval performance, particularly on rare entities, for which adding a small number of mentions leads to better performance than a single description. For future work, we would like to explore the application of this instance-based approach to entity knowledge related tasks (Seo et al., 2018; Petroni et al., 2020), and to entity discovery (Ji et al., 2017).

A.3 Profiling Details
The brute-force numbers we have reported are the theoretical maximum throughput for computing 300D dot-products on an AVX-512 processor running at 2.2 GHz, and are thus an overly optimistic baseline. Practical implementations, such as the one in ScaNN, must also compute the top-k and rarely exceed 70% to 80% of this theoretical limit. The brute-force latency figure is the minimum time to stream the database from RAM using 144 GiB/s of memory bandwidth. In practice, we ran distributed brute-force inference on a large cluster of CPUs, which took about 5 hours. The numbers for ScaNN are empirical single-machine benchmarks of an internal solution that uses the open-source ScaNN library on a single 24-core CPU. We use ScaNN to search a multilevel tree that has the following shape: 78,000 => 83:1 => 105:1 (687.3 million datapoints). We used a combination of several different anisotropic vector quantizations that combine 3, 6, 12, or 24 dimensions per 4-bit code, as well as re-scoring with an int8-quantization.
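The bandwidth-bound latency mentioned above is a simple division of index size by memory bandwidth, sketched below. The datapoint count, dimensionality, and bandwidth come from this section; the bytes stored per dimension is an assumption, since the storage precision of the brute-force index is not stated, so the helper takes it as a parameter. The function name is hypothetical.

```python
GIB = 2 ** 30  # bytes in one GiB


def min_scan_seconds(
    num_vectors: float,
    dims: int,
    bytes_per_dim: float,
    bandwidth_gib_per_s: float,
) -> float:
    """Lower bound on the time to stream the whole index from RAM once."""
    index_bytes = num_vectors * dims * bytes_per_dim
    return index_bytes / (bandwidth_gib_per_s * GIB)


NUM_VECTORS = 687.3e6      # datapoints in the index (from this section)
DIMS = 300                 # embedding dimension
BANDWIDTH_GIB_PER_S = 144  # memory bandwidth (from this section)

# Assumption: 4 bytes per dimension (float32) versus 1 byte (int8).
float32_seconds = min_scan_seconds(NUM_VECTORS, DIMS, 4, BANDWIDTH_GIB_PER_S)
int8_seconds = min_scan_seconds(NUM_VECTORS, DIMS, 1, BANDWIDTH_GIB_PER_S)
```

Under these assumptions a single pass over the index takes on the order of seconds per query batch, which is why the bound depends so strongly on how compactly each vector is stored.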

A.4 Expanded experimental results
Tables 7 and 8 present complete numerical comparisons between MOLEMAN
Table 8: Results on the Mewsli-9 dataset, by entity frequency in the test set.