MuVER: Improving First-Stage Entity Retrieval with Multi-View Entity Representations

Entity retrieval, which aims at disambiguating mentions to canonical entities in massive KBs, is essential for many tasks in natural language processing. Recent progress in entity retrieval shows that the dual-encoder structure is a powerful and efficient framework for nominating candidates when entities are identified only by their descriptions. However, these methods ignore that the meanings of entity mentions diverge across contexts and relate to different portions of the descriptions, which previous works treat equally. In this work, we propose Multi-View Entity Representations (MuVER), a novel approach for entity retrieval that constructs multi-view representations of entity descriptions and approximates the optimal view for each mention via a heuristic search method. Our method achieves state-of-the-art performance on ZESHEL and improves the quality of candidates on three standard Entity Linking datasets.


Introduction
Entity linking (EL) refers to the task of disambiguating mentions in textual input and retrieving the corresponding unique entities in large Knowledge Bases (KBs) (Han et al., 2011; Ceccarelli et al., 2013). The majority of neural entity retrieval approaches consist of two steps: Candidate Generation (Pershina et al., 2015; Zwicklbauer et al., 2016), which nominates a small list of candidates from millions of entities with low-latency algorithms, and Entity Ranking (Yang et al., 2018; Le and Titov, 2019), which ranks those candidates with more sophisticated algorithms to select the best match. In this paper, we focus on the Candidate Generation problem (a.k.a. first-stage retrieval). Prior works filter entities by alias tables (Fang et al., 2019) or precalculated mention-entity prior probabilities, e.g., p(entity|mention) (Le and Titov, 2018). Ganea and Hofmann (2017) and Yamada et al. (2016) build entity embeddings from the local context of hyperlinks in entity pages or from entity-entity co-occurrences. Those embedding-based methods were extended by BLINK (Wu et al., 2020) and DEER (Gillick et al., 2019) to two-tower dual-encoders (Khattab and Zaharia, 2020), which encode mentions and descriptions of entities into high-dimensional vectors, respectively. Candidates are then retrieved by nearest neighbor search (Andoni and Indyk, 2008; Johnson et al., 2019) for a given mention. Solutions that require only entity descriptions (Logeswaran et al., 2019) are scalable, as descriptions are more readily obtainable than statistical or manually annotated resources.
Although description-based dual-encoders can compensate for the weaknesses of traditional methods and generalize better to unseen domains, they aim to map mentions with divergent contexts to the same high-dimensional entity embedding. As shown in Figure 1, the description of "Kobe Bryant" mainly concentrates on his professional journey. As a result, the embedding of "Kobe Bryant" is close to contexts that describe Kobe's career but semantically distant from his helicopter accident. Dual-encoders are nonetheless trained to encode those semantically divergent contexts into representations close to the single embedding of "Kobe Bryant". Figure 2 (Section 3.2) provides evidence: the previous method (Wu et al., 2020) handles entities with short descriptions well but struggles to retrieve entities with long descriptions, which contain too much information to be encoded into a single fixed-size vector.
To tackle those issues, we propose to construct multi-view representations from descriptions.

Figure 1: We refer to each sentence as a view of the description to form a view set V (gray circles with numbers) and merge views to approximate the optimal views for mentions (points enclosed by ellipses).

The contributions of our paper are as follows:
• We propose an effective approach, MuVER, for first-stage entity retrieval, which models entity descriptions in a multi-view paradigm.
• We define a novel distance metric for retrieval, which is established upon the optimal view of each entity. Furthermore, we introduce a heuristic search method to approximate the optimal view.
• MuVER achieves state-of-the-art performance on ZESHEL and generates higher-quality candidates on AIDA-B, MSNBC and WNED-WIKI in full Wikipedia settings.

Problem Setup
Formally, given an unstructured text D with a recognized mention m, the goal of entity linking is to learn a mapping from the mention m to the entity entry e in a knowledge base E = {e_1, e_2, ..., e_N}, where N can be extremely large (for Wikipedia, N = 5.9M). In the literature, existing retrieval methods address this problem in a two-stage paradigm: (i) selecting the top relevant entities to form a candidate set C, where |C| ≪ |E|; (ii) ranking candidates to find the best entity within C. In this work, we mainly focus on first-stage retrieval, following the setting of Logeswaran et al. (2019) in assuming that for each e ∈ E, an entity title t and description d are provided as a pair.
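The two-stage paradigm can be sketched with toy vectors: a cheap dot-product scan nominates a small candidate set, which a (here deliberately trivial) second stage then re-ranks. This is a minimal illustration with random NumPy embeddings, not the paper's BERT encoders, and the scoring function in both stages is an assumption for demonstration.

```python
import numpy as np

def generate_candidates(mention_vec, entity_vecs, k):
    """First stage: nominate top-k entities by dot-product similarity."""
    scores = entity_vecs @ mention_vec          # (N,) similarity scores
    return np.argsort(-scores)[:k]              # indices of top-k entities

def rank_candidates(mention_vec, entity_vecs, candidate_ids):
    """Second stage: re-score only the small candidate set |C| << |E|."""
    scores = entity_vecs[candidate_ids] @ mention_vec
    return candidate_ids[np.argsort(-scores)]

rng = np.random.default_rng(0)
entities = rng.normal(size=(1000, 64))               # toy KB: N = 1000, dim = 64
mention = entities[42] + 0.01 * rng.normal(size=64)  # mention near entity 42

cands = generate_candidates(mention, entities, k=8)
best = rank_candidates(mention, entities, cands)[0]  # best-matching entity id
```

In practice the first stage would use an approximate nearest neighbor index (e.g., FAISS) rather than a full scan, and the second stage a heavier model such as a cross-encoder.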

Multi-View Entity Representations
Dual-encoders We tackle entity retrieval as a matching problem, where two separate encoders, an entity encoder f and a mention encoder g, are deployed. We adopt BERT as the architecture to encode textual input, which can be formulated as

f(t, d) = T_1([CLS] t [ENT] d)
g(m) = T_2([CLS] ctx_l [Ms] m [Me] ctx_r)

where t, d, m, ctx_l, ctx_r refer to the word-piece tokens of the entity title, the entity description, the mention, and the context before and after the mention, respectively. We use [Ms] and [Me] to denote the start-of-mention and end-of-mention identifiers, and the special token [ENT] serves as the delimiter between titles and descriptions. T_1 and T_2 are two independent BERT encoders, with which we estimate the similarity between mention m and entity e as sim(m, e) = f(t, d) · g(m).

Multi-view Description
Our method matches a mention to the appropriate entity by comparing it with entity descriptions. Motivated by the fact that mentions with different contexts correspond to different parts in descriptions, we propose to construct multi-view representations for each description. Specifically, we segment a description into several sentences. We refer to each sentence as a view v, which contains partial information, to form a view set V of the entity e. Figure 1 illustrates an example that constructs a view set V for "Kobe Bryant".
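Constructing the view set amounts to sentence segmentation of the description. Below is a minimal sketch using a naive regex splitter; the paper does not specify its exact sentence segmenter, so the splitting rule and `max_views` cap here are assumptions.

```python
import re

def build_view_set(description, max_views=10):
    """Split an entity description into sentence-level views v_1..v_k.

    Naive splitter: break after ., ! or ? followed by whitespace.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", description)]
    return [s for s in sentences if s][:max_views]

desc = ("Kobe Bryant was an American professional basketball player. "
        "He spent his entire career with the Los Angeles Lakers. "
        "He died in a helicopter crash in 2020.")
views = build_view_set(desc)  # three views, one per sentence
```

Each view carries only partial information about the entity: a mention about the accident should match the third view even though the first two dominate the description.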
Multi-view Matching Given a view set V = {v_1, v_2, ..., v_k} for entity e, determining whether a mention m matches the entity e requires a metric space to estimate the relation between m and V, which can be defined as

d(m, V) = -f(t, [v_1, v_2, ..., v_k]) · g(m),    (1)

where [v_1, v_2, ..., v_k] refers to an operation that concatenates the tokens in the views following the sentence order in the description, and t is the corresponding entity title for V. Note that this metric can also be applied to a subset of V to focus on partial information in the description. As mentioned before, for m in different contexts, only a part of the views are relevant. For each mention-entity pair (m, e) and the view set V of e, we define the optimal view Q* as

Q*(m, e) = argmin_{Q ⊆ V} d(m, Q),    (2)

where Q is a subset of V and Q* has the minimal distance to the current mention m. We define d(m, Q*(m, e)) as the matching distance between e and m. To find the optimal entity for mention m, we select the entity with the minimal matching distance:

e* = argmin_{e ∈ E} d(m, Q*(m, e)).    (3)

Distance Metric & Training Objectives The above retrieval process requires an appropriate metric space to estimate the similarity between views and mentions: similar inputs should be pulled together and dissimilar ones pushed apart. To achieve this, we introduce an NCE loss (van den Oord et al., 2018) to establish the metric space:

L = -log [ exp(-d(m, Q*(m, e))) / Σ_{e' ∈ E} exp(-d(m, Q*(m, e'))) ],

where E = {e} ∪ {e_1, ..., e_{n-1}}. Mention-entity pairs (m, e) are pulled together and n-1 randomly sampled negatives {e_1, ..., e_{n-1}} are pushed apart from m, based on their matching distance in the current metric space. Unfortunately, Q*(m, e) is intractable due to the non-differentiable subset operation in Equation 2. Besides, it is time-consuming to obtain the optimal view by checking all subsets exhaustively. In this work, we approximate it with a subset that contains only one view.
Specifically, we select the best single view

v*(m, e) = argmin_{v ∈ V} d(m, {v})    (4)

from V as an alternative to the optimal view Q*. Note that this approximation can be computed in time O(N), where N = |V|, by simply selecting the view with the minimal distance to the given mention.
Using Equation 4, we can rewrite the NCE loss as

L = -log [ exp(-d(m, {v*(m, e)})) / Σ_{e' ∈ E} exp(-d(m, {v*(m, e')})) ].
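The single-view approximation and the rewritten NCE loss can be sketched numerically. The sketch below works directly on precomputed view vectors (distances are negated dot products, as in Equation 1) and uses a numerically stable log-sum-exp; the vector inputs are toy assumptions standing in for encoder outputs.

```python
import numpy as np

def view_distance(mention_vec, view_vecs):
    """d(m, {v}) = -f(t, v) . g(m), for every view row in view_vecs."""
    return -(view_vecs @ mention_vec)

def best_view_distance(mention_vec, view_vecs):
    """Approximate d(m, Q*) by the single closest view v* (Equation 4)."""
    return view_distance(mention_vec, view_vecs).min()

def nce_loss(mention_vec, pos_views, neg_views_list):
    """NCE over the gold entity and n-1 negatives, using best-view distances."""
    d_pos = best_view_distance(mention_vec, pos_views)
    d_all = np.array([d_pos] + [best_view_distance(mention_vec, v)
                                for v in neg_views_list])
    logits = -d_all                      # smaller distance => larger logit
    lse = logits.max() + np.log(np.exp(logits - logits.max()).sum())
    return lse - logits[0]               # -log softmax at the gold entity
```

Minimizing this loss pulls the mention toward its closest gold view while pushing it away from the closest views of the sampled negatives.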

Heuristic Searching for Inference
The approximation in Equation 4 obviously cannot fully reveal the matching distance, because a single view v*(m, e) may contain insufficient information for retrieval. We therefore search for a better view Q' ⊂ V such that d(m, Q') < d(m, {v*(m, e)}).
Combining views (Q_1, Q_2) that contain complementary information is more likely to incorporate richer information into the newly assembled view. Consider two sets Q_1 ⊂ V and Q_2 ⊂ V and the distance metric

d(Q_1, Q_2) = -f(t, Q_1) · f(t, Q_2),

where t is the title of the entity and f is the entity encoder. The most distant pair of views (Q_1, Q_2) achieves the largest d(Q_1, Q_2) among all pairs and is interpreted as the pair of views with the least shared information. In each iteration, we search for the top-k most distant pairs (Q_1, Q_2) to form new views Q' = Q_1 ∪ Q_2, add each Q' to V, and encode the merged Q' with f(t, Q') to produce a new representation for the involved entity. Searching and merging are performed iteratively until |V| reaches the maximal allowable value or the number of iterations reaches a preset value. During inference, we precompute and cache the representations of the views and select the view with the minimal distance to m.
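The merge loop above can be sketched as follows. The `encode` function stands in for the entity encoder f(t, ·), and the toy token-averaging encoder, iteration counts, and caps are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np
from itertools import combinations

def heuristic_merge(base_views, encode, max_views=8, iters=2, top_k=1):
    """Iteratively merge the top-k most distant view pairs into new views.

    base_views: list of token lists; encode: stand-in for f(t, Q).
    d(Q1, Q2) = -enc(Q1) . enc(Q2), so the smallest dot product is
    the most distant pair (least shared information).
    """
    views = [list(v) for v in base_views]
    for _ in range(iters):
        vecs = [encode(v) for v in views]
        pairs = sorted(combinations(range(len(views)), 2),
                       key=lambda ij: vecs[ij[0]] @ vecs[ij[1]])[:top_k]
        for i, j in pairs:
            merged = views[i] + views[j]          # concatenate in order
            if len(views) < max_views and merged not in views:
                views.append(merged)
        if len(views) >= max_views:
            break
    return views

rng = np.random.default_rng(0)
emb = {}
def encode(tokens):
    # Toy encoder: average of random per-token vectors.
    return np.mean([emb.setdefault(t, rng.normal(size=16)) for t in tokens], axis=0)

base = [["kobe", "played", "basketball"],
        ["he", "won", "five", "titles"],
        ["he", "died", "in", "a", "helicopter", "crash"]]
expanded = heuristic_merge(base, encode, max_views=5)
```

At inference time only the resulting (cached) view vectors are scanned, so the extra views cost storage rather than encoder forward passes per query.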

Datasets
We evaluate MuVER with two different knowledge bases: Wikia, on which the zero-shot EL dataset (ZESHEL) is built, and Wikipedia, which contains 5.9M entities. We select one in-domain dataset, AIDA-CoNLL (Hoffart et al., 2011), and two out-of-domain datasets, WNED-WIKI (Guo and Barbosa, 2018) and MSNBC (Cucerzan, 2007), from standard EL datasets to validate MuVER in the full Wikipedia setting. Statistics of the datasets are listed in Appendix A.1. ZESHEL places particular emphasis on understanding the unstructured descriptions of entities to resolve the ambiguity of mentions in four unseen domains. Concretely, MuVER uses BERT-base for f and g to allow a fair comparison with previous works. We adopt the Adam optimizer with a small learning rate of 1e-5 and 10% warmup steps. We use batched random negatives and set the batch size to 128. The maximum number of context tokens is 128 and the maximum number of view tokens is 40. Training for 20 epochs takes one hour on 8 Tesla V100 GPUs.

KB: Wikia
We compare MuVER with previous baselines in Table 1. Since MuVER is not limited by the length of descriptions, we add another baseline that extends BLINK to 512 tokens (the maximum input length for BERT-base). As shown in the table, we exceed BLINK by 5.28% and outperform SOM by 3.22% on Recall@64. We observe that the Recall@1 of MuVER is lower than BLINK's, and that the heuristic search method alleviates this problem. Detailed results on unseen domains are listed in Appendix A.3.

Effect of Heuristic Search
We compare two distance-based merging strategies: merging closer or farther pairs of views. We find that merging views whose sentences are adjacent to each other in the original unstructured descriptions is a computationally efficient way to select the combined views. Table 3 shows that as the number of views increases, MuVER yields higher-quality candidates, while the opposite strategy struggles to provide more valuable views. Besides, our method can be regarded as a generalized form of SOM (Zhang and Stratos, 2021) and BLINK (Wu et al., 2020), which use 128 views and one view, respectively. SOM computes the similarity between mentions and tokens in descriptions, which requires storing 128 embeddings for each entity. Compared with SOM, MuVER reduces the number of views to a smaller size with improved quality, which is more efficient and effective.

Figure 2: Recall@64 differences between BLINK and MuVER on entities with 1 to 100 sentences in their descriptions. We partition the entities by the number of sentences in their descriptions and calculate metrics within each bin. The bin size is 5.
Effect on entities with long descriptions As shown in Figure 2, existing EL systems (such as BLINK) achieve passable performance on entities with short descriptions but fail to manage well-populated entities as the length of descriptions increases. For instance, the error rate of BLINK is 7.79% for entities with 5-10 sentences but 39.91% for entities with 75-80 sentences, which are more likely to cover various aspects of an entity. MuVER demonstrates its superiority on entities with long descriptions, significantly reducing the error rate to 17.65% (-22.06%) for entities with 75-80 sentences, while maintaining performance on entities with short descriptions with an error rate of 6.78% (-1.01%) for entities with 5-10 sentences.

KB: Wikipedia
We test AIDA-B, MSNBC and WNED-WIKI on the version of the Wikipedia dump provided in KILT, which contains 5.9M entities. Implementation details are listed in Appendix A.2. BLINK's performance on these datasets is reported in its official GitHub repository. We report the In-KB accuracy in Table 2 and observe that MuVER outperforms BLINK on all datasets except Recall@100 on WNED-WIKI.

Related Work
Representing each entity with a fixed-size vector has been a common approach in Entity Linking. Ganea and Hofmann (2017) define a word-entity conditional distribution and sample positive words from it; the representations of those positive words approximate the entity embeddings better than random words. Yamada et al. (2016) model the relatedness between entities in their entity representations. NTEE (Yamada et al., 2017) trains entity representations by predicting the relevant entities for a given context in the DBPedia abstract corpus. Ling et al. (2020) and Yamada et al. (2020) pre-train variants of transformer-based models by maximizing the consistency between the contexts of mentions and the corresponding entities. These entity representations suffer from a cold-start problem: they cannot link mentions to unseen entities.
Another line of work generates entity representations from entity textual information, such as entity descriptions. Logeswaran et al. (2019) introduce an EL dataset in the zero-shot scenario to place more emphasis on reading entity descriptions. BLINK (Wu et al., 2020) proposes a bi-encoder to encode the descriptions and enhances it by distilling knowledge from a cross-encoder. Yao et al. (2020) repeat the position embeddings to solve the long-range modeling problem in entity descriptions. Zhang and Stratos (2021) demonstrate that hard negatives can enhance the contrast when training an EL model.

Conclusion
In this work, we propose a novel approach to construct multi-view representations from descriptions, which shows promising results on four EL datasets. Extensive results demonstrate the effectiveness of multi-view representations and the heuristic search strategy. In the future, we will explore more reliable and efficient approaches to construct views.

A Appendix
A.1 Statistics of datasets

A.2 Implementation Details

• Batch size: [32, 64, 128, 196]

AIDA We finetune MuVER based on the EL model released by BLINK, which is pretrained on 9M annotated mention-entity pairs. Unlike the experiments on ZESHEL, which adopt in-batch random negatives, we add hard negatives to the batch. Due to the vast number of entities in Wikipedia, randomly sampled negatives are often too easy for the model to learn discriminative semantic features, thus degrading performance. We finetune our model on the AIDA-CoNLL train set for one epoch. The batch size is set to 8. We add 3 hard negatives per mention to the random in-batch negatives, precomputed using BLINK. The number of views is 5 for each entity, and we take the first 40 tokens of each of the first 5 paragraphs, which are more likely to be summarizations. Other hyperparameters are consistent with the configurations on ZESHEL.
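Mixing precomputed hard negatives with in-batch random negatives can be sketched as follows. The function names, the dictionary of hard negatives, and the batch layout are illustrative assumptions; only the recipe (other gold entities in the batch plus 3 precomputed hard negatives per mention) comes from the text above.

```python
def build_negatives(batch_entity_ids, hard_negs, n_hard=3):
    """For each mention in the batch, collect its negative entities.

    Negatives = the other gold entities in the batch (in-batch negatives)
    plus n_hard precomputed hard negatives for its own gold entity
    (e.g., top-ranked wrong candidates from a first-stage retriever).
    """
    negs = []
    for i, gold in enumerate(batch_entity_ids):
        in_batch = [e for j, e in enumerate(batch_entity_ids) if j != i]
        negs.append(in_batch + list(hard_negs[gold][:n_hard]))
    return negs

batch = [11, 22, 33]                                  # gold entity ids per mention
hard = {11: [101, 102, 103, 104],
        22: [201, 202, 203],
        33: [301, 302, 303]}
neg_lists = build_negatives(batch, hard)
```

The hard negatives supply contrast that random sampling over 5.9M entities rarely provides, at the cost of an extra precomputation pass with the first-stage retriever.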
Parameters for MuVER Since MuVER has two BERT encoders, it has twice as many parameters as BERT, as listed in Table 6.

A.3 Performance on Unseen Domains
In Table 4, we compare MuVER with BLINK on the four unseen domains of ZESHEL. We observe significant improvements on all four unseen domains, especially on Yugioh, where MuVER gains +11.35 points on Recall@64. Furthermore, MuVER can reach performance comparable to BLINK's top-64 candidates by retrieving only around 16-32 candidates, which reduces the computational cost of entity ranking.