Towards Better Entity Linking with Multi-View Enhanced Distillation

Dense retrieval is widely used for entity linking to retrieve entities from large-scale knowledge bases. Mainstream techniques are based on a dual-encoder framework, which encodes mentions and entities independently and computes their relevance via coarse interaction metrics, making it difficult to explicitly model the multiple mention-relevant parts within an entity that match divergent mentions. Aiming at learning entity representations that can match divergent mentions, this paper proposes a Multi-View Enhanced Distillation (MVD) framework, which effectively transfers knowledge of multiple fine-grained and mention-relevant parts within entities from cross-encoders to dual-encoders. Each entity is split into multiple views to avoid irrelevant information being over-squashed into the mention-relevant view. We further design cross-alignment and self-alignment mechanisms for this framework to facilitate fine-grained knowledge distillation from the teacher model to the student model. Meanwhile, we reserve a global-view that embeds the entity as a whole to prevent dispersal of uniform information. Experiments show our method achieves state-of-the-art performance on several entity linking benchmarks.


Introduction
Entity Linking (EL) serves as a fundamental task in Natural Language Processing (NLP), connecting mentions within unstructured contexts to their corresponding entities in a Knowledge Base (KB). EL usually provides the entity-related data foundation for various tasks, such as KBQA (Ye et al., 2022), Knowledge-based Language Models, and Information Retrieval (Li et al., 2022). Most EL systems consist of two stages: entity retrieval (candidate generation), which retrieves a small set of candidate entities corresponding to mentions from a large-scale KB with low latency, and entity ranking (entity disambiguation), which ranks those candidates using a more accurate model to select the best match as the target entity. This paper focuses on the entity retrieval task, which poses a significant challenge due to the need to retrieve targets from a large-scale KB. Moreover, the performance of entity retrieval is crucial for EL systems, as any recall errors in the initial stage can have a significant impact on the performance of the latter ranking stage (Luan et al., 2021).

[Figure 1: The illustration of two types of entities. Mentions in contexts are in bold, key information in entities is highlighted in color. The information in the first type of entity is relatively consistent and can be matched with a corresponding mention. In contrast, the second type of entity contains diverse and sparsely distributed information and can match with divergent mentions.]
Recent advancements in pre-trained language models (PLMs) (Kenton and Toutanova, 2019) have led to the widespread use of dense retrieval technology for large-scale entity retrieval (Gillick et al., 2019). This approach typically adopts a dual-encoder architecture that embeds the textual content of mentions and entities independently into fixed-dimensional vectors (Karpukhin et al., 2020) to calculate their relevance scores using a lightweight interaction metric (e.g., dot-product). This allows for pre-computing the entity embeddings, enabling entities to be retrieved through various fast nearest neighbor search techniques (Johnson et al., 2019; Jayaram Subramanya et al., 2019).
The primary challenge in modeling relevance between an entity and its corresponding mentions lies in explicitly capturing the mention-relevant parts within the entity. By analyzing the diversity of intra-information within the textual contents of entities, we identify two distinct types of entities, as illustrated in Figure 1. Entities with uniform information can be effectively represented by the dual-encoder; however, due to its single-vector representation and coarse-grained interaction metric, this framework may struggle with entities containing divergent and sparsely distributed information. To alleviate this issue, existing methods construct multi-vector entity representations from different perspectives (Zhang and Stratos, 2021; Tang et al., 2021). Despite these efforts, all these methods rely on coarse-grained entity-level labels for training and lack the supervised signals needed to select the most relevant representation for a specific mention from multiple entity vectors. As a result, their capability to effectively capture multiple fine-grained aspects of an entity and accurately match mentions with varying contexts is significantly hampered, ultimately leading to suboptimal performance in dense entity retrieval.
In order to obtain fine-grained entity representations capable of matching divergent mentions, we propose a novel Multi-View Enhanced Distillation (MVD) framework. MVD effectively transfers knowledge of multiple fine-grained and mention-relevant parts within entities from cross-encoders to dual-encoders. By jointly encoding the entity and its corresponding mentions, cross-encoders enable the explicit capture of mention-relevant components within the entity, thereby facilitating the learning of fine-grained elements of the entity through more accurate soft-labels. To achieve this, our framework constructs the same multi-view representation for both modules by splitting the textual information of entities into multiple fine-grained views. This approach prevents irrelevant information from being over-squashed into the mention-relevant view, which is selected based on the results of cross-encoders. We further design cross-alignment and self-alignment mechanisms for our framework to separately align the original entity-level and fine-grained view-level scoring distributions, thereby facilitating fine-grained knowledge transfer from the teacher model to the student model. Motivated by prior works (Xiong et al., 2020; Zhan et al., 2021), MVD jointly optimizes both modules and employs an effective hard negative mining technique to facilitate the transfer of hard-to-learn knowledge in distillation. Meanwhile, we reserve a global-view that embeds the entity as a whole to prevent dispersal of uniform information and better represent the first type of entities in Figure 1.
Through extensive experiments on several entity linking benchmarks, including ZESHEL, AIDA-B, MSNBC, and WNED-CWEB, our method demonstrates superior performance over existing approaches. The results highlight the effectiveness of MVD in capturing fine-grained entity representations and matching divergent mentions, which significantly improves entity retrieval performance and facilitates overall EL performance by retrieving high-quality candidates for the ranking stage.

Related Work
To accurately and efficiently acquire target entities from large-scale KBs, the majority of EL systems are designed in two stages: entity retrieval and entity ranking. For entity retrieval, prior approaches typically rely on simple methods like frequency information (Yamada et al., 2016), alias tables (Fang et al., 2019), and sparse-based models (Robertson et al., 2009) to retrieve a small set of candidate entities with low latency. For the ranking stage, neural networks have been widely used for calculating the relevance score between mentions and entities (Yamada et al., 2016; Ganea and Hofmann, 2017; Fang et al., 2019; Kolitsas et al., 2018).
Recently, with the development of PLMs (Kenton and Toutanova, 2019), PLM-based models have been widely used for both stages of EL. Logeswaran et al. (2019) and Yao et al. (2020) utilize the cross-encoder architecture that jointly encodes mentions and entities to rank candidates, while Gillick et al. (2019) employ the dual-encoder architecture to separately encode mentions and entities into high-dimensional vectors for entity retrieval. BLINK improves overall EL performance by incorporating both architectures in its retrieve-then-rank pipeline, making it a strong baseline for the task. GENRE (De Cao et al., 2020) directly generates entity names through an auto-regressive approach.
To further improve retrieval performance, various methods have been proposed. Zhang and Stratos (2021) and Sun et al. (2022) demonstrate the effectiveness of hard negatives in enhancing retrieval performance. Agarwal et al. (2022) and GER (Wu et al., 2023) construct mention/entity centralized graphs to learn fine-grained entity representations. However, being limited to single-vector representations, these methods may struggle with entities that have multiple and sparsely distributed pieces of information. Although Tang et al. (2021) and MuVER construct multi-view entity representations and select the optimal view to calculate the relevance score with the mention, they still rely on the same entity-level supervised signal to optimize the scores of different views within the entity, which limits the capacity of matching with divergent mentions.
In contrast to existing methods, MVD is primarily built upon the knowledge distillation technique (Hinton et al., 2015), aiming to acquire fine-grained entity representations from cross-encoders to handle diverse mentions. To facilitate fine-grained knowledge transfer of multiple mention-relevant parts, MVD splits the entity into multiple views to avoid irrelevant information being squashed into the mention-relevant view, which is selected by the more accurate teacher model. The framework further incorporates cross-alignment and self-alignment mechanisms to learn mention-relevant view representations from both the original entity-level and the fine-grained view-level scoring distributions, both derived from the soft-labels generated by the cross-encoder.

Task Formulation
We first describe the task of entity linking as follows. Given a mention m in a context sentence s = <c_l, m, c_r>, where c_l and c_r are the words to the left/right of the mention, our goal is to efficiently obtain the entity corresponding to m from a large-scale entity collection ε = {e_1, e_2, ..., e_N}. Each entity e ∈ ε is defined by its title t and description d, following the generic setting in neural entity linking (Ganea and Hofmann, 2017). Here we follow the two-stage retrieve-then-rank paradigm: 1) retrieving a small set of candidate entities {e_1, e_2, ..., e_K} corresponding to mention m from ε, where K ≪ N; 2) ranking those candidates to obtain the best match as the target entity. In this work, we mainly focus on the first-stage retrieval.

Encoder Architecture
In this section, we describe the model architectures used for dense retrieval. Dual-encoder is the most adopted architecture for large-scale retrieval as it separately embeds mentions and entities into high-dimensional vectors, enabling offline entity embeddings and efficient nearest neighbor search. In contrast, the cross-encoder architecture performs better by computing deeply-contextualized representations of mention tokens and entity tokens, but is computationally expensive and impractical for first-stage large-scale retrieval (Reimers and Gurevych, 2019;Humeau et al., 2019). Therefore, in this work, we use the cross-encoder only during training, as the teacher model, to enhance the performance of the dual-encoder through the distillation of relevance scores.

Dual-Encoder Architecture
Similar to prior work on entity retrieval, the retriever contains two-tower PLM-based encoders Enc_m(·) and Enc_e(·) that independently encode the mention and the entity into single fixed-dimension vectors, which can be formulated as:

E(m) = Enc_m([CLS] c_l [M_s] m [M_e] c_r [SEP])
E(e) = Enc_e([CLS] t [ENT] d [SEP])    (1)

where m, c_l, c_r, t, and d are the word-piece tokens of the mention, the context before and after the mention, the entity title, and the entity description. The special tokens [M_s] and [M_e] are separators to identify the mention, and [ENT] serves as the delimiter between titles and descriptions. [CLS] and [SEP] are special tokens in BERT. For simplicity, we directly take the [CLS] embeddings E(m) and E(e) as the representations of mention m and entity e; the relevance score s_de(m, e) is then calculated by a dot product: s_de(m, e) = E(m) · E(e).
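As a concrete illustration, the dual-encoder's interaction step reduces to a dot product over pre-computable vectors. In this sketch, the toy lists stand in for the [CLS] embeddings produced by Enc_m and Enc_e (any PLM would do); only the interaction metric itself is shown.

```python
def dot_score(mention_vec, entity_vec):
    """s_de(m, e) = E(m) . E(e): the dual-encoder's lightweight interaction metric."""
    return sum(m_i * e_i for m_i, e_i in zip(mention_vec, entity_vec))

# Toy [CLS] embeddings standing in for Enc_m / Enc_e outputs.
E_m = [0.2, 0.5, -0.1]
E_e = [0.4, 0.1, 0.3]

score = dot_score(E_m, E_e)  # 0.08 + 0.05 - 0.03 = 0.10
```

Because the entity side never sees the mention, E(e) can be computed offline for the whole KB and indexed for nearest neighbor search; only E(m) is computed at query time.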

Cross-Encoder Architecture
The cross-encoder is built upon a PLM-based encoder Enc_ce(·), which concatenates and jointly encodes mention m and entity e (removing the [CLS] token from the entity tokens), takes the [CLS] vector as their relevance representation E(m, e), and finally feeds it into a multi-layer perceptron (MLP) to compute the relevance score s_ce(m, e).

Multi-View Based Architecture
With the aim of preventing irrelevant information from being over-squashed into the entity representation, and to better represent the second type of entities in Figure 1, we construct multi-view entity representations for the entity-encoder Enc_e(·). The textual information of the entity is split into multiple fine-grained local views to explicitly capture the key information in the entity and match mentions with divergent contexts. Following the settings of MuVER, for each entity e, we segment its description d into several sentences d^t (t = 1, 2, ..., n) with the NLTK toolkit (www.nltk.org), and then concatenate each with the title t as the t-th view e^t (t = 1, 2, ..., n):

E(e^t) = Enc_e([CLS] t [ENT] d^t [SEP])    (2)

Meanwhile, we retain the original entity representation E(e) defined in Eq. (1) as the global-view e^0 at inference time, to avoid uniform information being dispersed across different views and to better represent the first type of entities in Figure 1. Finally, the relevance score s(m, e_i) of mention m and entity e_i can be calculated from their multiple embeddings.
Here we adopt a max-pooler to select the view with the highest relevance score as the mention-relevant view:

s(m, e) = max_t E(m) · E(e^t)
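The view construction and max-pooling described above can be sketched in a few lines. This is an illustrative sketch, not the paper's code: a naive period-based splitter stands in for NLTK sentence segmentation, toy vectors stand in for Enc_e outputs, and build_views / max_pool_score are hypothetical names.

```python
def build_views(title, description):
    """Split a description into per-sentence views, each paired with the title."""
    sentences = [s.strip() for s in description.split(".") if s.strip()]
    return [f"{title} [ENT] {s}" for s in sentences]

def max_pool_score(mention_vec, view_vecs):
    """s(m, e) = max_t E(m) . E(e^t); also return the winning view index."""
    scores = [sum(a * b for a, b in zip(mention_vec, v)) for v in view_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best

views = build_views("Greater ironguard",
                    "An abjuration spell. An improved ironguard.")
# Two views, each prefixed with the title and the [ENT] delimiter.

E_m = [1.0, 0.0]
view_vecs = [[0.2, 0.9], [0.7, 0.1]]          # toy embeddings for the two views
score, idx = max_pool_score(E_m, view_vecs)   # view 1 wins with score 0.7
```

The max-pooling is what lets one entity expose several independently matchable vectors: a mention only needs to be close to one view, not to the average of all of them.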

Multi-View Enhanced Distillation
The basic intuition of MVD is to accurately transfer knowledge of multiple fine-grained views from a more powerful cross-encoder to the dual-encoder to obtain mention-relevant entity representations. First, in order to provide more accurate relevance between mention m and each view e^t (t = 1, 2, ..., n) of the entity e as a supervised signal for distillation, we introduce a multi-view based cross-encoder following the formulation in Sec 3.2.3:

s_ce(m, e^t) = MLP(Enc_ce([CLS] m_enc [SEP] e^t_enc [SEP]))

where m_enc and e^t_enc (t = 1, 2, ..., n) are the word-piece tokens of the mention and entity representations defined as in Eq. (1) and (2), respectively.

We further design cross-alignment and self-alignment mechanisms to separately align the original entity-level scoring distribution and the fine-grained view-level scoring distribution, in order to facilitate fine-grained knowledge distillation from the teacher model to the student model.

Cross-alignment In order to learn the entity-level scoring distribution among candidate entities in the multi-view scenario, we calculate the relevance score s(m, e_i) for mention m and candidate entity e_i in the candidate set {e_1, e_2, ..., e_K} over all its views {e_i^1, e_i^2, ..., e_i^n}. The indexes of the relevant views i_de and i_ce for the dual-encoder and the cross-encoder are:

i_de = argmax_t s_de(m, e_i^t),  i_ce = argmax_t s_ce(m, e_i^t)

Here, to avoid a mismatch of relevant views (i.e., i_de ≠ i_ce), we align the relevant views based on the index i_ce of the max-score view in the cross-encoder. The loss can be measured by KL-divergence as:

L_cross = KL( s̃_ce(m, e_i) || s̃_de(m, e_i) )    (6)

where s̃_de(m, e_i) and s̃_ce(m, e_i) denote the probability distributions of the entity-level scores, each represented by the i_ce-th view, over all candidate entities.
Self-alignment Aiming to learn the view-level scoring distribution within each entity, so as to better distinguish the relevant view from the other, irrelevant views, we calculate the relevance score s(m, e_i^t) for mention m and each view e_i^t (t = 1, 2, ..., n) of entity e_i. The loss can be measured by KL-divergence as:

L_self = KL( s̃_ce(m, e_i^t) || s̃_de(m, e_i^t) )    (8)

where s̃_de(m, e_i^t) and s̃_ce(m, e_i^t) denote the probability distributions of the view-level scores over all views within each entity.

Joint training The overall joint training framework can be found in Figure 2. The final loss function is defined as

L_total = L_de + L_ce + α L_cross + β L_self    (10)

Here, L_cross and L_self are the knowledge distillation losses with the cross-encoder, defined as in Eq. (6) and (8) respectively, and α and β are their coefficients. Besides, L_de and L_ce are the supervised training losses of the dual-encoder and cross-encoder on the labeled data, which maximize s(m, e_k) for the golden entity e_k in the set of candidates {e_1, e_2, ..., e_K}. The loss can be defined as:

L = -log( exp(s(m, e_k)) / Σ_{j=1}^{K} exp(s(m, e_j)) )    (11)

Inference We only apply the mention-encoder to obtain the mention embeddings, and then retrieve targets directly from the pre-computed view embeddings via efficient nearest neighbor search. These view embeddings encompass both global and local views and are generated by the entity-encoder after joint training. Although the size of the entity index grows with the number of views, retrieval time can remain sub-linear in the index size thanks to mature nearest neighbor search techniques.
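To make the two alignment objectives concrete, here is a small self-contained sketch under stated assumptions: cross-alignment softmax-normalizes entity-level scores over the candidate set, with each entity represented by the view the teacher ranks highest (i_ce), while self-alignment normalizes view-level scores within a single entity; both compare student to teacher via KL-divergence with the teacher distribution as the target. The helper names and toy scores are illustrative, not the paper's implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def kl_div(p, q):
    """KL(p || q), with the teacher distribution p as the target."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_alignment_loss(student_views, teacher_views):
    """student_views / teacher_views: one list of view scores per candidate.
    Each candidate entity is scored by its teacher-selected (i_ce) view."""
    i_ce = [max(range(len(v)), key=v.__getitem__) for v in teacher_views]
    s_de = [views[i] for views, i in zip(student_views, i_ce)]
    s_ce = [views[i] for views, i in zip(teacher_views, i_ce)]
    return kl_div(softmax(s_ce), softmax(s_de))

def self_alignment_loss(student_views_one_entity, teacher_views_one_entity):
    """View-level scoring distribution within a single entity."""
    return kl_div(softmax(teacher_views_one_entity),
                  softmax(student_views_one_entity))

# Two candidate entities, three views each (toy scores).
teacher = [[2.0, 0.5, 0.1], [0.3, 1.2, 0.2]]
student = [[1.0, 0.8, 0.1], [0.2, 0.9, 0.4]]
l_cross = cross_alignment_loss(student, teacher)     # > 0 until aligned
l_self = self_alignment_loss(student[0], teacher[0])
```

Note how cross-alignment never uses the student's own argmax: forcing both models to score the i_ce-th view is what prevents the relevant-view mismatch described above.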

Hard Negative Sampling
Hard negatives are effective information carriers for difficult knowledge in distillation. Mainstream techniques for generating hard negatives include utilizing static samples or top-K dynamic samples retrieved from a recent iteration of the retriever (Xiong et al., 2020; Zhan et al., 2021), but these hard negatives may not be suitable for the current model or may be pseudo-negatives (i.e., unlabeled positives). To mitigate this issue, we adopt a simple negative sampling method that first retrieves the top-N candidates and then randomly samples K negatives from them, which reduces the probability of pseudo-negatives and improves the generalization of the retriever.
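The retrieve-then-sample scheme just described can be sketched as follows. This is an illustrative sketch: sample_hard_negatives is a hypothetical name, and the toy id list stands in for candidates ranked by the current retriever.

```python
import random

def sample_hard_negatives(ranked_ids, gold_id, n=100, k=16, seed=0):
    """Keep the top-N retrieved candidates, drop the golden entity,
    then randomly draw K negatives from the remaining pool.
    Sampling within the top-N (rather than always taking the top-K)
    lowers the chance of repeatedly training on the same near-top
    candidates, some of which may be pseudo-negatives."""
    pool = [c for c in ranked_ids[:n] if c != gold_id]
    rng = random.Random(seed)  # seeded here only for reproducibility
    return rng.sample(pool, min(k, len(pool)))

ranked = list(range(1000))   # toy candidate ids, best-ranked first
negatives = sample_hard_negatives(ranked, gold_id=3, n=100, k=16)
```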

Datasets
We evaluate MVD on two distinct types of datasets: three standard EL datasets (AIDA-CoNLL, MSNBC, and WNED-CWEB) and the zero-shot EL dataset ZESHEL.

Training Procedure
The training pipeline of MVD consists of two stages: Warmup training and MVD training. In the Warmup training stage, we separately train dualencoder and cross-encoder by in-batch negatives and static negatives. Then we initialize the student model and the teacher model with the well-trained dual-encoder and cross-encoder, and perform multiview enhanced distillation to jointly optimize the two modules following Section 3.3. Implementation details are listed in Appendix A.2.

Main Results
Compared Methods We compare MVD with previous state-of-the-art methods. These methods can be divided into several categories according to how they represent entities: BM25 (Robertson et al., 2009) is a sparse retrieval model based on exact term matching. BLINK adopts a typical dual-encoder architecture that embeds each entity independently into a single fixed-size vector. SOM (Zhang and Stratos, 2021) represents entities by their tokens and computes relevance scores via the sum-of-max operation (Khattab and Zaharia, 2020). Similar to our work, MuVER constructs multi-view entity representations to match divergent mentions and achieves the best results among these methods, so we select MuVER as the main baseline for comparison. Besides, ARBORESCENCE (Agarwal et al., 2022) and GER (Wu et al., 2023) construct mention/entity centralized graphs to learn fine-grained entity representations.
For the ZESHEL dataset, we compare MVD with all the above models. As shown in Table 1, MVD performs better than all existing methods. Compared to the previously best-performing method MuVER, MVD surpasses it significantly on all metrics, particularly R@1, which indicates the ability to directly obtain the target entity. This demonstrates the effectiveness of MVD, which uses hard negatives as information carriers to explicitly transfer knowledge of multiple fine-grained views from the cross-encoder, so as to better represent entities for matching multiple mentions and provide higher-quality candidates for the ranking stage.
For the Wikipedia datasets, we compare MVD with BLINK and MuVER. As shown in Table 2, our MVD framework also outperforms the other methods and achieves state-of-the-art performance on the AIDA-B, MSNBC, and WNED-CWEB datasets, which again verifies the effectiveness of our method on standard EL datasets.

Ablation Study
To conduct fair ablation studies and clearly evaluate the contribution of each fine-grained component and training strategy in MVD, we exclude the coarse-grained global-view so as to isolate the capability of transferring knowledge of multiple fine-grained views, and we use top-K dynamic hard negatives without random sampling to mitigate the effects of randomness on training.
Fine-grained components Ablation results are presented in Table 3. When we replace the multi-view representations in the cross-encoder with the original single vector, or remove the relevant-view selection based on the results of the cross-encoder, the retrieval performance drops, indicating the importance of providing accurate supervised signals for each view of the entity during distillation. Additionally, the removal of cross-alignment and self-alignment results in a decrease in performance, highlighting the importance of these alignment mechanisms. Finally, when we exclude all fine-grained components in MVD and employ the traditional distillation paradigm based on single-vector entity representations and entity-level soft-labels, there is a significant decrease in performance, which further emphasizes the effectiveness of learning multiple fine-grained and mention-relevant views during distillation.

Training strategies We further explore the effectiveness of joint training and hard negative sampling in distillation; Table 4 shows the results. First, we examine the effect of joint training by freezing the teacher model's parameters to perform static distillation; the retrieval performance drops due to the teacher model's limitations. Similarly, performance drops considerably when we replace the dynamic hard negatives with static negatives, which demonstrates the importance of dynamic hard negatives in making the learning task more challenging. Furthermore, when both training strategies are excluded and the student model is trained independently with static negatives, a substantial decrease in retrieval performance is observed, which validates the effectiveness of both training strategies in enhancing retrieval performance.

Comparative Study on Entity Representation
To demonstrate the capability of representing entities from multi-grained views, we carry out comparative analyses between MVD and BLINK, as well as MuVER.

[Table 6 (excerpt): Unnormalized accuracy (U.Acc.) of the large-version ranker with candidates from different retrievers.]

Candidate Retriever               | U.Acc.
BLINK                             | 63.03
SOM (Zhang and Stratos, 2021)     | 67.14
MVD (ours)                        | 67.84

These systems are founded on the principles of coarse-grained global-views and fine-grained local-views, respectively. We evaluate the retrieval performance of both entity representations and present the results in Table 5. The results clearly indicate that MVD surpasses both BLINK and MuVER in terms of entity representation performance, even exceeding BLINK's global-view performance in R@1, despite being a fine-grained training framework. Unsurprisingly, the optimal retrieval performance is attained when MVD employs both entity representations concurrently during inference.

Facilitating Ranker's Performance
To evaluate the impact of candidate quality on overall performance, we consider two aspects: candidates generated by different retrievers, and the number of candidate entities used at inference. First, we separately train BERT-base and BERT-large based cross-encoders to rank the top-64 candidate entities retrieved by MVD. As shown in Table 6, the ranker based on our framework achieves the best two-stage performance compared to rankers fed by other candidate retrievers, demonstrating MVD's ability to generate high-quality candidate entities for the ranking stage.
Additionally, we study the impact of the number of candidate entities on overall performance. As shown in Figure 3, as the number of candidates K increases, the retrieval performance grows steadily while the overall performance tends to stagnate. This indicates that it is ideal to choose an appropriate K to balance efficiency and efficacy; we observe that K = 16 is optimal on most existing EL benchmarks.

Qualitative Analysis
To better understand the practical implications of fine-grained knowledge transfer and global-view entity representation in MVD, we conduct a comparative analysis between our method and MuVER using retrieval examples from the test set of ZESHEL, as shown in Table 7.
In the first example, MVD clearly demonstrates its ability to accurately capture the mention-relevant information "Rekelen were members of this movement" and "professor Natima Lang" in the golden entity "Cardassian dissident movement". In contrast, MuVER exhibits limited ability to distinguish the golden entity from the hard negative entity "Romulan underground movement". In the second example, unlike MuVER, which solely focuses on local information within the entity, MVD can holistically model multiple mention-relevant parts of the golden entity "Greater ironguard" through the global-view entity representation, enabling it to match the corresponding mention "improved version of lesser ironguard".

[Table 7: Retrieval examples from the ZESHEL test set.]

Example 1
Mention context: Rekelen was a member of the underground movement and a student under professor Natima Lang. In 2370, Rekelen was forced to flee Cardassia prime because of her political views.
Entity retrieved by MVD — Title: Cardassian dissident movement. The Cardassian dissident movement was a resistance movement formed to resist and oppose the Cardassian Central Command and restore the authority of the Detapa Council. They believed this change was critical for the future of their people. Professor Natima Lang, Hogue, and Rekelen were members of this movement in the late 2360s and 2370s. ...
Entity retrieved by MuVER — Title: Romulan underground movement. The Romulan underground movement was formed sometime prior to the late 24th century on the planet Romulus by a group of Romulan citizens who opposed the Romulan High Command and who supported a Romulan-Vulcan reunification. Its methods and principles were similar to those of the Cardassian dissident movement which emerged in the Cardassian Union around the same time. ...

Example 2
Mention context: Known as the improved version of lesser ironguard, this spell granted the complete immunity from all common, unenchanted metals to the caster or one creature touched by the caster.
Entity retrieved by MVD — Title: Greater ironguard. Greater ironguard was an arcane abjuration spell that temporarily granted one creature immunity from all non-magical metals and some enchanted metals. It was an improved version of ironguard. The effects of this spell were the same as for "lesser ironguard" except that it also granted immunity and transparency to metals that had been enchanted up to a certain degree. ...
Entity retrieved by MuVER — Title: Lesser ironguard. ... after an improved version was developed, this spell became known as lesser ironguard. Upon casting this spell, the caster or one creature touched by the caster became completely immune to common, unenchanted metal. Metal weapons would pass through the individual without causing harm. Likewise, the target of this spell could pass through metal barriers such as iron bars, grates, or portcullises. ...

Conclusion

In this paper, we propose a novel Multi-View Enhanced Distillation framework for dense entity retrieval. Our framework enables better representation of entities through multi-grained views and, by using hard negatives as information carriers, effectively transfers knowledge of multiple fine-grained and mention-relevant views from the more powerful cross-encoder to the dual-encoder. We also design cross-alignment and self-alignment mechanisms for this framework to facilitate fine-grained knowledge distillation from the teacher model to the student model. Our experiments on several entity linking benchmarks show that our approach achieves state-of-the-art entity linking performance.

Limitations
The limitations of our method are as follows: • We find that utilizing multi-view representations in the cross-encoder is effective for MVD; however, the ranking performance of the cross-encoder itself may slightly decrease. It is therefore sub-optimal to directly use this cross-encoder for entity ranking.
• Mention detection is the predecessor task of our retrieval model, so our retrieval model is affected by mention detection errors. Designing a joint model of mention detection and entity retrieval is therefore a direction for improving our method.

A.2 Implementation Details
For ZESHEL, we use BERT-base to initialize both the student dual-encoder and the teacher cross-encoder. For the Wikipedia-based datasets, we fine-tune our model based on the model released by BLINK, which is pre-trained on 9M annotated mention-entity pairs with BERT-large. All experiments are performed on 4 A6000 GPUs, and the results are the average of 5 runs with different random seeds.

Warmup training We initially train a dual-encoder using in-batch negatives, and then train a cross-encoder as the teacher model using the top-k static hard negatives generated by the dual-encoder. Both models utilize multi-view entity representations and are optimized using the loss defined in Eq. (11); training details are listed in Table 10.

MVD training Next, we initialize the student model and the teacher model with the well-trained dual-encoder and cross-encoder obtained from the Warmup training stage. We then employ multi-view enhanced distillation to jointly optimize both modules, as described in Section 3.3. To determine the values of α and β in Eq. (10), we conduct a grid search and find that setting α = 0.3 and β = 0.1 yields the best performance. We further adopt the simple negative sampling method of Sec 3.4 that first retrieves the top-N candidates and then samples K of them as negatives. Based on the analysis in Sec 5.1 that 16 is the optimal candidate number to cover most hard negatives while balancing efficiency, we set K = 16; then, to ensure high recall and high-quality negatives, we search over the candidate list [50, 100, 150, 200, 300] and find N = 100 to be the most suitable value. The training details are listed in Table 11.

Inference MVD employs both local-view and global-view entity representations concurrently during inference; details are listed in Table 12.