Coherent Entity Disambiguation via Modeling Topic and Categorical Dependency

Previous entity disambiguation (ED) methods adopt a discriminative paradigm, where predictions are made based on matching scores between mention context and candidate entities computed by length-limited encoders. However, these methods often struggle to capture explicit discourse-level dependencies, resulting in incoherent predictions at the abstract level (e.g. topic or category). We propose CoherentED, an ED system equipped with novel designs aimed at enhancing the coherence of entity predictions. Our method first introduces an unsupervised variational autoencoder (VAE) to extract latent topic vectors of context sentences. This approach not only allows the encoder to handle longer documents more effectively and conserves valuable input space, but also maintains topic-level coherence. Additionally, we incorporate an external category memory, enabling the system to retrieve relevant categories for undecided mentions. By employing step-by-step entity decisions, this design facilitates the modeling of entity-entity interactions, thereby maintaining maximum coherence at the category level. We achieve new state-of-the-art results on popular ED benchmarks, with an average improvement of 1.3 F1 points. Our model demonstrates particularly strong performance in challenging long-text scenarios.


Introduction
Entity disambiguation (ED) is a typical knowledge-intensive task of resolving mentions in a document to their corresponding entities in a knowledge base (KB), e.g. Wikipedia. This task is of great importance due to its active presence in downstream tasks such as information extraction (Hoffart et al., 2011), question answering (Yih et al., 2015) and web search queries (Blanco et al., 2015).
To perform efficient entity disambiguation (ED), one common approach is to encode mentions and candidate entities into separate embedding spaces.
Then a simple vector dot product is used to capture the alignment between mentions and candidate entities. While this method enables quick maximum inner product search (MIPS) over all candidates and efficiently determines the linked answer, it suffers from late and simplistic interaction between mentions and entities (Barba et al., 2022; Cao et al., 2021). Recently, researchers have proposed alternative paradigms for solving the ED problem, such as formulating it as a span extraction task (Barba et al., 2022). In this approach, a Longformer (Beltagy et al., 2020) is fine-tuned to predict the entity answer span within a long sequence consisting of the document and candidate entity identifiers. Another paradigm (Cao et al., 2021; De Cao et al., 2022) reduces the ED task to an auto-regressive form in which generation models are trained to produce entity identifiers token by token.
Although these approaches offer some mitigation of the late-and-simple interaction problem, they still exhibit certain vulnerabilities. For instance, Transformer-based encoders impose inherent limitations on input length, preventing the capture of long-range dependencies for specific mentions. Also, these methods do not explicitly consider coherence constraints, while coherence is considered as important as context in early ED works (Hoffart et al., 2011; Chisholm and Hachey, 2015). We first propose to condition the model on compressed topic tokens, enabling the system to sustain topic coherence at the document level.
In addition, the relationships among entities hold significant importance in the ED task. For example, mentions in a document exhibit a high correlation at the category level, which we term category coherence. However, previous bi-encoder and cross-encoder solutions have overlooked these entity dependencies and focused solely on learning contextualized representations. Among other works, the extractive paradigm (Barba et al., 2022) neglects entity-entity relations as well; generative EL (Cao et al., 2021; De Cao et al., 2022) does capture some dependencies when linking an unknown mention, but these dependencies arise from the auto-regressive decoding process and require heavy inference compute.
To address the above coherence problem, we propose two orthogonal solutions that target topic coherence and entity coherence, respectively. Following previous works that decode masked tokens to link unknown mentions (Yamada et al., 2020, 2022), we present the overview of our coherent entity disambiguation work in Figure 1, where document words and unresolved entities are treated as input tokens of a Transformer (Vaswani et al., 2017). First, we introduce an unsupervised variational autoencoder (VAE) (Kingma and Welling, 2014) to extract topic embeddings of surrounding sentences, which are later utilized to guide entity prediction. By docking two representative language learners, BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), the variational encoder can produce topic tokens of sentences without training on labeled datasets. This promotes a higher level of coherence in model predictions at an abstract level (Li et al., 2020) (e.g. tense, topic, sentiment).
Moreover, in most KBs, categories serve as valuable sources of knowledge, grouping entities based on similar subjects. To enhance entity-entity coherence from a categorical perspective, we design a novel category memory bank for intermediate entity representations to query dynamically. As opposed to retrieving from a frozen memory layer, we introduce direct supervision from ground-truth category labels during pre-training. This enables the memory to be learned effectively even from random initialization.
Experimental results show that our proposed system, named CoherentED, surpasses previous state-of-the-art peers on six popular ED datasets by 1.3 F1 points on average. Notably, on the challenging CWEB dataset, which has an average document length of 1,700 words, our approach elevates the score from the previous neural-based state of the art of 78.9 to 81.1. Through model ablations, we verify the effectiveness of the two orthogonal solutions via both quantitative performance evaluation and visualization analysis. These ablations further affirm the superiority of our methods in generating coherent disambiguation predictions.

Related Works
Entity disambiguation (ED) is the task of determining entities for unknown mention spans within a document. Early ED works (Hoffmann et al., 2011; Daiber et al., 2013) commonly rely on matching scores between mention contexts and entities, disregarding the knowledge-intensive nature of ED. Many studies aim to infuse external knowledge into ED. Bunescu and Paşca (2006) begin to utilize hyperlinks in Wikipedia to supervise disambiguation. Yamada et al. (2020, 2022) propose massive pre-training on paired text and entity tokens to implicitly inject knowledge for disambiguation. Li et al. (2022) first leverage knowledge graphs to enhance ED performance.
Another intriguing feature that differentiates entities from each other is their type, as entities with similar surface forms (text identifiers) often possess different types. Onoe and Durrett (2020) disambiguate similar entities solely through a refined entity type system derived from Wikipedia, without using any external information. Ayoola et al. (2022) further augment the robustness of such a type system. Furthermore, many researchers have explored new paradigms for ED. Cao et al. (2021) propose using a prefix-constrained dictionary on causal language models to correctly generate entity strings. Barba et al. (2022) recast disambiguation as a machine reading comprehension (MRC) task, where the model selects a predicted entity based on the context fused with candidate identifiers and the document.
The architecture of our system blends the advantages of type systems and knowledge pre-training: by incorporating the type system as a dynamically updated neural block within the model, our design enables simultaneous learning through a multi-task schema, carrying out topic variational learning, masked disambiguation learning, and knowledge pre-training concurrently.
Prompt Compression is a commonly used technique for economizing input space in language models, closely related to soft prompt compression (Wingate et al., 2022) and context distillation (Snell et al., 2022). Their shared goal is to dynamically generate soft prompt tokens that replace original tokens without hurting downstream application performance. Our topic token design mirrors context compression to some extent but differs in compression ratio and purpose. Our design is more compact: each context sentence is converted into a single topic token using a variational encoder, and in addition to saving input space, it also retains high-level semantics to guide a more coherent ED.

Entity Disambiguation Definition
Let $X$ be a document with $N$ mentions $\{m_1, m_2, \ldots, m_N\}$, where each mention $m_i$ is associated with a set of entity candidates $C_i = \{e_{i1}, e_{i2}, \ldots, e_{i|C_i|}\}$. Given a KB with a set of triplets $G = \{(h, r, t) \mid h, t \in E, r \in R\}$, where $h$, $r$ and $t$ denote the head entity, relation and tail entity respectively, the goal of entity disambiguation is to link each mention $m_i$ to one of its corresponding entity candidates in $C_i \subseteq E$.
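For concreteness, the following minimal sketch shows one way to represent these quantities in code; the class and field names are illustrative and not part of our released implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    entity_id: str            # KB identifier, e.g. a Wikipedia title
    prior: float              # p(e | m) score used to build the candidate set

@dataclass
class Mention:
    surface: str              # mention text as it appears in the document
    start: int                # character offsets of the span
    end: int
    candidates: List[Candidate]   # C_i, e.g. the top-30 candidates by p(e | m)

@dataclass
class EDExample:
    document: str             # the document X
    mentions: List[Mention]   # m_1 ... m_N, each to be linked to one candidate
```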

Overview
We present the overview of CoherentED in Figure 1. Following Yamada et al. (2020), both words in the document and entities are considered input tokens for the BERT model. The final input representation is the sum of the following embeddings. Representation embedding denotes the topic latent, word embedding or entity embedding for topic inputs, document inputs or entity inputs accordingly. We set up two separate embedding layers for word and entity inputs: $X \in \mathbb{R}^{V_w \times H}$ denotes the word embedding matrix and $Y \in \mathbb{R}^{V_e \times H}$ denotes the entity embedding matrix, where $V_w$ and $V_e$ represent the sizes of the word vocabulary and the entity vocabulary.
Type embedding is used to discriminate token types. There are three types of tokens, each of which corresponds to a dedicated group of parameters: $C_{\mathrm{word}}$, $C_{\mathrm{entity}}$, $C_{\mathrm{topic}}$.
Position embedding marks the position of input words and entities, avoiding the permutation-invariant property of the self-attention mechanism. The entity position embedding also indicates which word tokens the entity corresponds to. This is achieved by applying absolute position embeddings to both words and entities. If a mention consists of multiple tokens, the entity position embedding is averaged over all corresponding positions.
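The input construction described above can be sketched as follows; vocabulary sizes, hidden size and type indices are illustrative assumptions rather than our actual configuration.

```python
import torch
import torch.nn as nn

WORD, ENTITY, TOPIC = 0, 1, 2      # token-type indices (illustrative)

class InputEmbeddings(nn.Module):
    def __init__(self, vocab_words=30522, vocab_entities=500_000,
                 hidden=768, max_positions=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_words, hidden)        # X in R^{V_w x H}
        self.entity_emb = nn.Embedding(vocab_entities, hidden)   # Y in R^{V_e x H}
        self.type_emb = nn.Embedding(3, hidden)                  # C_word / C_entity / C_topic
        self.pos_emb = nn.Embedding(max_positions, hidden)

    def forward(self, word_ids, word_pos, entity_ids, entity_pos, topic_latents):
        # word tokens: word + type + position embeddings
        w = (self.word_emb(word_ids)
             + self.type_emb(torch.full_like(word_ids, WORD))
             + self.pos_emb(word_pos))
        # entity tokens: entity + type + position averaged over the mention span
        e = (self.entity_emb(entity_ids)
             + self.type_emb(torch.full_like(entity_ids, ENTITY))
             + self.pos_emb(entity_pos).mean(dim=-2))
        # topic tokens: the topic latent itself acts as the representation embedding
        t = topic_latents + self.type_emb(
            torch.full(topic_latents.shape[:-1], TOPIC, dtype=torch.long,
                       device=topic_latents.device))
        return torch.cat([t, w, e], dim=1)                       # [topics; words; entities]
```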

Topic Variational Autoencoder
To preserve maximum topic coherence and optimize input space utilization, we introduce an external component that facilitates topic-level guidance in ED prediction. Among various latent variable models, variational autoencoders (VAEs) have demonstrated success in modeling high-level latent factors such as style and topic. Our setup is based on the motivation that, after being trained on a massive corpus with the variational objective, the encoder gains the capacity to encode sentences into topic latent vectors. The entire topic VAE is composed of two parts, BERT and GPT-2, which serve as a powerful Transformer-based language encoder and decoder, respectively.

Encoder
Given a BERT encoder $\mathrm{LM}_{\phi}$ and an input sentence token sequence $x$, we collect the aggregated embedding $\bar{X}$ from the last-layer hidden state corresponding to the [CLS] token:

$$\bar{X} = \mathrm{LM}_{\phi}(x)_{[\mathrm{CLS}]}$$

With this, we can construct a multivariate Gaussian distribution from which the decoder will draw samples. The following formula describes the variational distribution that approximates the posterior:

$$q_{\phi}(z \mid x) = \mathcal{N}\!\left(z;\, f^{\mu}_{\phi}(\bar{X}),\, \mathrm{diag}\!\left(f^{\sigma}_{\phi}(\bar{X})\right)\right) \qquad (1)$$

where $f^{\mu}_{\phi}$ and $f^{\sigma}_{\phi}$ denote separate linear layers for the mean and variance representations, $\mathcal{N}$ denotes a Gaussian distribution and $z$ denotes the intermediate information bottleneck.
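A minimal sketch of this encoder is given below, assuming a Hugging Face BERT checkpoint and a log-variance head; the exact head design in our implementation may differ.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TopicEncoder(nn.Module):
    def __init__(self, latent_dim=768, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)      # LM_phi
        hidden = self.bert.config.hidden_size
        self.f_mu = nn.Linear(hidden, latent_dim)               # f^mu_phi
        self.f_logvar = nn.Linear(hidden, latent_dim)            # f^sigma_phi (log-variance)

    def forward(self, input_ids, attention_mask):
        # aggregated embedding from the [CLS] position of the last layer
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.f_mu(cls), self.f_logvar(cls)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = TopicEncoder()
batch = tokenizer(["The Fed raised interest rates."], return_tensors="pt")
mu, logvar = encoder(batch["input_ids"], batch["attention_mask"])
```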

Decoder
Given a GPT-2 decoder $\mathrm{LM}_{\theta}$, we first review how such neural models generate a text sequence of length $T$. To generate word tokens $x = [x_1, x_2, \ldots, x_T]$, a language decoder utilizes all its parameters $\theta$ to predict the next token conditioned on all previously generated tokens $x_{<t}$, formulated as follows:

$$p_{\theta}(x) = \prod_{t=1}^{T} p_{\theta}(x_t \mid x_{<t})$$

When training a language generator alone, the decoder is usually learned with the maximum likelihood estimation (MLE) objective. However, in our VAE setting, the decoder conditions on the vector $z$ dynamically drawn from a Gaussian distribution, instead of purely on previously generated tokens. Specifically, our decoder generates auto-regressively via:

$$p_{\theta}(x \mid z) = \prod_{t=1}^{T} p_{\theta}(x_t \mid x_{<t}, z)$$

As stated in the encoder part above, the intractable posterior over $z$ is approximated by $q_{\phi}(z \mid x)$ in Equation 1. Now we see the difference in the VAE decoder: the generation relies on high-level semantics and has the ability to produce a compact representation.
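One plausible way to condition the GPT-2 decoder on $z$ is to project the latent into the embedding space and prepend it as a soft prefix token, as sketched below; the actual injection scheme may differ, and the module names are assumptions.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class TopicDecoder(nn.Module):
    def __init__(self, latent_dim=768, model_name="gpt2"):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained(model_name)   # LM_theta
        self.z_proj = nn.Linear(latent_dim, self.gpt2.config.n_embd)

    def forward(self, z, input_ids):
        tok_emb = self.gpt2.transformer.wte(input_ids)             # token embeddings
        prefix = self.z_proj(z).unsqueeze(1)                       # z as one soft prefix token
        inputs_embeds = torch.cat([prefix, tok_emb], dim=1)
        # label the prefix position with -100 so it is ignored by the LM loss
        pad = torch.full((input_ids.size(0), 1), -100,
                         dtype=input_ids.dtype, device=input_ids.device)
        labels = torch.cat([pad, input_ids], dim=1)
        out = self.gpt2(inputs_embeds=inputs_embeds, labels=labels)
        return out.loss                            # reconstruction term of the ELBO
```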

ELBO Training
Both the encoder and decoder need to be trained to optimize their parameters. Supported by the above approximation, the entire training objective can be interpreted as the evidence lower bound objective (ELBO):

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[\log p_{\theta}(x \mid z)\right] - \mathrm{KL}\!\left(q_{\phi}(z \mid x)\,\|\,p(z)\right)$$

Detailed derivations are omitted for clarity. In practice, we apply the reparametrization trick (Kingma and Welling, 2014) to allow back-propagation through all deterministic nodes and enable efficient learning.
Intuitively, we consider the first term as an autoencoder objective, since it requires the model to do reconstruction based on the intermediate latent.
The second term defines the KL divergence between the approximate posterior $q_{\phi}(z \mid x)$ and the prior $p(z)$. To better implement these objectives, we refer to the regularized version of the ELBO (Li et al., 2020), where the objective is taken as a linear combination of the reconstruction error and a KL regularizer:

$$\mathcal{L}_{\mathrm{variational}} = -\,\mathbb{E}_{q_{\phi}(z \mid x)}\!\left[\log p_{\theta}(x \mid z)\right] + \beta\,\mathrm{KL}\!\left(q_{\phi}(z \mid x)\,\|\,p(z)\right) \qquad (4)$$

Finally, we treat the loss term $\mathcal{L}_{\mathrm{variational}}$ as one of the minimization objectives in the following multi-task learning.
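The reparametrization trick and the regularized objective in Equation 4 can be sketched as follows (variable names are illustrative).

```python
import torch

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std            # z = mu + sigma * eps, differentiable w.r.t. phi

def variational_loss(recon_loss, mu, logvar, beta):
    # closed-form KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
    return recon_loss + beta * kl     # Eq. (4)
```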

Category Memory
We now formally define the category memory layer inserted into the intermediate Transformer layers. Let $E = \{e_1, e_2, \ldots, e_n\}$ be the set of all possible candidate entities; correspondingly, each entity $e_i$ has a predefined set of categories $C_{e_i}$. The union of the category sets of all candidate entities is denoted $\mathcal{C}$. Based on this, we construct a category vocabulary and a category embedding table $C \in \mathbb{R}^{|\mathcal{C}| \times d_{\mathrm{category}}}$. The vocabulary establishes a mapping from textual category labels to indices. As category labels can sometimes be too fine-grained, we refer readers to Appendix A for the detailed design of the category system.
The embedding table $C$ stores category representations that can be updated during massive pre-training. To be specific, we first formulate our model's forward pass (Figure 3) as follows:

$$[T; W; E] = \mathrm{Transformer}\!\left([X_{\mathrm{topic}}; X_{\mathrm{word}}; X_{\mathrm{entity}}]\right)$$

where the symbols $T$, $W$, $E$ stand for the hidden states of topics, words and entities, respectively.
Each unresolved entity in the document is assigned a [MASK] token in the input sequence. During the forward pass, every intermediate entity representation corresponding to a [MASK] token, $e^{\mathrm{masked}}_i$, is projected from $\mathbb{R}^{d_{\mathrm{entity}}}$ to $\mathbb{R}^{d_{\mathrm{category}}}$ using a linear layer without bias terms:

$$E^{\mathrm{masked}}_i = W_A\, e^{\mathrm{masked}}_i$$

Subsequently, the adapted intermediate entity representations query all entries in the category embedding table. The aggregated weighted hidden states are computed as follows:

$$M_i = W_B \sum_{j=1}^{|\mathcal{C}|} \alpha_{ij}\, C_j$$

where $\alpha_{ij} = \mathrm{sigmoid}\!\left(C_j \cdot (E^{\mathrm{masked}}_i)^{\top}\right)$ denotes the matching score between the $i$-th masked entity and the $j$-th category, and $W_B$ is a linear projection layer for dimension matching. During training, we apply direct supervision from gold category guidance via a binary cross-entropy loss:

$$\mathcal{L}_{\mathrm{category}} = -\sum_{i} \sum_{j} \left[\, \mathbb{I}^{\mathrm{oracle}}_{ij} \log \alpha_{ij} + \left(1 - \mathbb{I}^{\mathrm{oracle}}_{ij}\right) \log\!\left(1 - \alpha_{ij}\right) \right]$$

where $\mathbb{I}^{\mathrm{oracle}}_{ij}$ denotes the indicator function of the oracle category labels for the $i$-th masked entity.
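A sketch of the category memory layer under this formulation is given below; dimensions and the multi-hot label format are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CategoryMemory(nn.Module):
    def __init__(self, num_categories, d_entity=768, d_category=256):
        super().__init__()
        self.memory = nn.Embedding(num_categories, d_category)    # table C, trainable
        self.w_a = nn.Linear(d_entity, d_category, bias=False)    # projection to category space
        self.w_b = nn.Linear(d_category, d_entity)                 # W_B, project back

    def forward(self, masked_entity_states, oracle_labels=None):
        q = self.w_a(masked_entity_states)                        # (B, M, d_category)
        alpha = torch.sigmoid(q @ self.memory.weight.t())         # matching scores (B, M, |C|)
        aggregated = self.w_b(alpha @ self.memory.weight)         # weighted sum of entries
        loss = None
        if oracle_labels is not None:                              # multi-hot gold categories
            loss = nn.functional.binary_cross_entropy(alpha, oracle_labels.float())
        return aggregated, loss
```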

Multi-task Pre-training
This part discusses the pre-training stage of CoherentED. First, we define the disambiguation loss, which is analogous to the well-known masked language modeling objective. In each training step, 30% of the entity tokens are replaced with a special [MASK] token. We employ a linear decoder at the end of our model to reconstruct the masked tokens, as shown in Equation 8:

$$\hat{p}(e \mid h_k) = \mathrm{softmax}\!\left(W_E\, h_k + b\right) \qquad (8)$$

where $h_k$ denotes the final hidden state of the $k$-th masked entity token and $W_E$, $b$ parameterize the linear decoder over the entity vocabulary.
Equation 9 defines the cross-entropy loss over the entity vocabulary:

$$\mathcal{L}_{\mathrm{disambiguation}} = -\sum_{k} \sum_{e \in E} \mathbb{I}_{e_k}(e)\, \log \hat{p}(e \mid h_k) \qquad (9)$$

where $\mathbb{I}_{e_k}$ denotes the indicator function of the $k$-th masked entity's ground truth.
By incorporating all these losses, we derive our final multi-task learning objective:

$$\mathcal{L} = \mathcal{L}_{\mathrm{disambiguation}} + \alpha\, \mathcal{L}_{\mathrm{variational}} + \gamma\, \mathcal{L}_{\mathrm{category}}$$

where the coefficients $\alpha$ and $\gamma$ control the relative importance of the two auxiliary tasks.
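A sketch of one pre-training step combining the three objectives is shown below; `model`, `topic_vae` and `category_memory` are assumed interfaces rather than the actual API, and the coefficient values follow the appendix settings.

```python
import torch
import torch.nn.functional as F

ALPHA, GAMMA = 0.1, 10.0        # auxiliary-task coefficients
ENTITY_MASK_ID = 1              # id of the entity-vocabulary [MASK] token (illustrative)

def pretraining_step(model, topic_vae, category_memory, batch):
    entity_ids = batch["entity_ids"].clone()
    # randomly replace 30% of entity tokens with [MASK]; only those positions are supervised
    mask = torch.rand(entity_ids.shape, device=entity_ids.device) < 0.3
    labels = torch.where(mask, entity_ids, torch.full_like(entity_ids, -100))
    entity_ids[mask] = ENTITY_MASK_ID

    # topic VAE: variational loss (Eq. 4) and topic latents for the main encoder
    loss_variational, topic_latents = topic_vae(batch["topic_sentences"])

    # main model: entity logits plus intermediate states of the masked positions
    logits, masked_states = model(batch["word_ids"], entity_ids, topic_latents)
    loss_disamb = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                  labels.view(-1), ignore_index=-100)

    # category memory supervised by multi-hot oracle category labels
    _, loss_category = category_memory(masked_states, batch["oracle_categories"])

    return loss_disamb + ALPHA * loss_variational + GAMMA * loss_category
```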

COHERENTED Inference
Given a document $X$ with $N$ mentions $M = \{m_1, m_2, \ldots, m_N\}$, we now describe the coherent ED inference process. Considering a Transformer with an input length limit of $L$ tokens, we reserve $k$ tokens for the topic latents and $n_e$ tokens for shallow entity representations, leaving a word input window of $L - k - n_e$ tokens. Note that $n_e$ indicates the number of mentions in a certain sentence and varies among training batches, so we set $n_e$ to the maximum number of mentions within the batch. We refer interested readers to Appendix C for how we sample topic sentences and prepare input tokens.
With all input tokens ready, we predict entities over $N$ steps. Unlike a language generator that decodes the next token, at each step $i$ the model decodes all [MASK] tokens into entity predictions by selecting the maximum indices in the logits. The entity prediction at step $i$ is decided using the highest-confidence strategy, i.e., the entity with the highest log probability is resolved, while the others wait until the next step.
It is worth noting that utilizing candidate entity information can significantly reduce noisy predictions, as indicated by the green bars in Figure 1. During inference in the category memory layer, only the top-k category entries are selected for weighted aggregation. Once an entity prediction is determined, the category memory ceases to be queried at that position and instead receives a real category indicator to aggregate entries from the category memory. We refer to this as oracle category guidance because it allows for potential category-level guidance when disambiguating the remaining mentions.
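The step-by-step inference with the highest-confidence strategy can be sketched as follows, assuming the model returns per-mention logits over the entity vocabulary and candidate sets are given as index tensors; this is an illustrative sketch, not our exact implementation.

```python
import torch

def coherent_inference(model, inputs, candidate_ids, num_mentions):
    resolved = {}                                      # mention index -> entity id
    for _ in range(num_mentions):
        logits = model(inputs, resolved)               # (num_mentions, |V_e|)
        log_probs = torch.log_softmax(logits, dim=-1)
        best_score, best_pos, best_entity = float("-inf"), None, None
        for i in range(num_mentions):
            if i in resolved:
                continue
            cand = candidate_ids[i]                    # restrict to the top-30 candidate set
            score, idx = log_probs[i, cand].max(dim=-1)
            if score.item() > best_score:
                best_score = score.item()
                best_pos, best_entity = i, cand[idx].item()
        resolved[best_pos] = best_entity               # its categories become oracle guidance
    return resolved
```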

Datasets and Settings
For a fair comparison with previous works, we adopt the exact same settings used by Cao et al. (2021). Specifically, we borrow pre-generated candidate entity sets from Le and Titov (2018). Only entities with the top-30 $p(e \mid m)$ scores are considered candidates, and failure to include the oracle answer in the candidate set leads to a false negative prediction. For evaluation metrics, we report InKB micro F1 scores on the test splits of the AIDA-CoNLL dataset (Hoffart et al., 2011) (AIDA) and the cleaned versions of MSNBC, AQUAINT, ACE2004, WNED-CWEB (CWEB) and WNED-WIKI (WIKI) (Guo and Barbosa, 2018).

Main Results
We report peer comparisons in Table 1. Note that we consider the Wikipedia-only training setting, meaning no further fine-tuning or mixture training on AIDA is allowed and all evaluations are out-of-domain (OOD) tests.
In general, we achieve new state-of-the-art results on all test datasets except the CWEB and AIDA datasets, surpassing the previous best by 1.3 F1 points and eliminating 9% of errors. On the CWEB dataset, our work still shows superiority over other neural-based methods, as its lengthy samples are difficult for plain neural encoders to model globally. On the ACE2004 dataset, since the number of mentions is relatively small, many reported numbers are identical. On the other datasets, the relative improvements are consistent for CoherentED-large even though only additional cheap category labels are provided during pre-training. Such improvements also confirm the strong OOD ability of our methods, since no fine-tuning is conducted on the downstream test datasets.

Ablation Study
In the lower part of Table 1, we report ablation experiments for the proposed methods. All ablations are conducted on CoherentED-large.
Compared with the model without topic token injection, CoherentED-large improves greatly from 89.3 to 90.9 in average micro F1 score, with a particular gain on the lengthy CWEB dataset from 77.9 to 81.1. This performance gain on CWEB only falls short of Yang et al. (2018), which requires extensive feature engineering targeted at document-level representations. All other test sets also benefit from the powerful abstract modeling ability of the topic VAE.
Compared with the model without a category memory layer, CoherentED-large improves from 90.1 to 90.9, not as significant as the improvement from topic injection but still notable. Note that category memory does not bring a consistent performance gain on all test datasets, possibly because not all test samples are sensitive to category-level coherence. Furthermore, we disable the category oracle guidance strategy during evaluation, meaning that for already-predicted entities in step-by-step ED, we still query the external category memory and aggregate the retrieved entries. This ablation shows that the oracle guidance does have a positive impact on the overall performance metrics.

Case Analysis and Visualization
Besides ablations on evaluation metrics, we conduct a deeper analysis of the proposed methods by visualizing data samples. Specifically, we perform t-SNE on category memory entries after joint training and on topic vectors of sentences in the MSNBC dataset.
To validate the effectiveness of the learned category memory layer, we expect the embeddings of category entries stored in the memory to exhibit a certain degree of similarity with the category hierarchy in Wikipedia, i.e., similar category entries should lie close to each other. In Figure 4, we present a t-SNE visualization of the top 500 popular category entries together with an additional colored category group, where black cross data points represent popular category entries and blue circular points represent examples of a structured hierarchy from the Wikipedia category system [1].

[1] Economic indicators → Stock market indices → American stock market indices → Companies in the Nasdaq-100
Figure 5: T-SNE visualization of sentence vectors extracted by the topic probe strategy. Some polarized topic groups such as "Health" and "Sports" are denoted with colored dashed circles. The zoomed part reveals the intertwined nature of similar topics (e.g. "World News" and "U.S. News"). Best viewed in color.

To evaluate the topic representation ability of our jointly trained topic VAE, we design an elegant probe strategy to investigate the topic modeling ability of CoherentED. MSNBC covers 20 documents on 10 topics. By feeding each sentence in the MSNBC test set along with its predicted entity tokens, we extract the [CLS] representation of CoherentED-large and run t-SNE on these joint representations.
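A sketch of this probe is given below, assuming a helper `encode_sentence` that returns the model's [CLS] vector for a sentence together with its predicted entity tokens.

```python
import numpy as np
from sklearn.manifold import TSNE

def probe_topics(encode_sentence, sentences):
    """Project [CLS] vectors of topic-probed sentences to 2D for visualization."""
    vectors = np.stack([encode_sentence(s) for s in sentences])   # (num_sentences, H)
    return TSNE(n_components=2, init="pca").fit_transform(vectors)
```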
In Figure 5, topic latent vectors of sentences in these documents are plotted as 511 data points, whose colors denote their oracle topic labels. We see that the majority of sentences under the same topic cluster into polarized groups, apart from a few outliers, possibly because those sentences serve general purposes such as greetings or stating facts. Consistent with our expectations, similar topics are intertwined, as they share high-level semantics to some extent.

Conclusion
We propose a novel entity disambiguation method, CoherentED, which injects latent topic vectors and utilizes cheap category knowledge sources to produce coherent disambiguation predictions. Specifically, we introduce an unsupervised topic variational auto-encoder and an external category memory bank to mitigate inconsistent entity predictions. Experimental results demonstrate the effectiveness of our proposed methods in terms of accuracy and coherence. Analysis of randomly picked cases and vector visualizations further confirms this effectiveness.

Limitations
Our CoherentED still has two limitations, concerning scalability and inference cost, which future works are expected to alleviate. First, CoherentED can hardly handle emerging entities, as this requires extending both the entity embedding layer and the category memory layer; evaluation metrics will degrade if no further training is conducted after such expansion. Second, although the parameter count and FLOPs of CoherentED are quite comparable with baseline models, the advantage of coherent prediction only reveals itself in the scenario of step-by-step reasoning, i.e., when mentions are resolved one by one. This means multiple forward passes are needed for each document to achieve the most accurate results.

B Evaluation Dataset Details
The brief dataset descriptions are as follows: AIDA contains 18,448 training samples, 4,791 validation samples and 4,485 test samples, and serves as one of the largest manually annotated EL and ED datasets; the other evaluation datasets contain a test split only.

The compared systems are briefly described as follows:

• GENRE (Cao et al., 2021) is the first to formulate entity linking and disambiguation as a constrained text generation task via a predefined trie. Its auto-regressive decoding makes it hard to use in real time.
• Bootleg (Orr et al., 2021) focuses on modeling reasoning patterns for disambiguation in a self-supervised manner. Tail entities that rarely appear in the KB and documents are especially investigated in this work.
• BiBSG (Yang et al., 2018) is the first to introduce the structured gradient tree boosting (SGTB) algorithm to collective entity disambiguation, making extensive use of global information from both the past and the future to perform a better local search.
• ExtEND (Barba et al., 2022) formulates the ED problem as a span extraction task supported by a Longformer model that predicts the entity span in the input sequence.
• ReFinED (Ayoola et al., 2022) is an efficient zero-shot end-to-end entity linker using a score-based bi-encoder architecture, which seeks a trade-off between performance and efficiency.
• GlobalED (Yamada et al., 2022) considers ED as a masked token prediction problem and is also the baseline of our work.

Due to the limited computation resource, we do not run massive hyperparameter searches. The multi-task coefficients are set to $\alpha = 0.1$ and $\gamma = 10$ after a few empirical trials. During inference, the k in top-k category retrieval is set to 10, as the average number of categories of all present entities is close to this number. We train our proposed model in two stages. In stage 1, we freeze all parameters except for the fresh entity embedding layer and the category memory layer; consequently, the variational objective is disabled in stage 1. After 1 epoch, we activate all parameters and enable the three objectives in stage 2, which lasts for 6 epochs. Pre-training a VAE can be difficult due to the notorious KL vanishing issue (Bowman et al., 2016), which causes the decoder to completely ignore the topic latent $z$ during learning. As a practical mitigation, a cyclical schedule is applied to the KL regularizer coefficient $\beta$. Training takes approximately 1 day for CoherentED-base and 3 days for CoherentED-large on 8 A100-SXM4-40GB GPUs.
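One common cyclical annealing scheme for $\beta$ can be sketched as follows; the cycle length and ramp ratio below are placeholders, not our actual hyperparameters.

```python
def cyclical_beta(step, cycle_length=10000, ramp_ratio=0.5, beta_max=1.0):
    """Cyclical schedule for the KL coefficient beta (placeholder hyperparameters)."""
    pos = (step % cycle_length) / cycle_length
    if pos < ramp_ratio:                      # ramp beta up linearly within each cycle
        return beta_max * pos / ramp_ratio
    return beta_max                           # then hold it until the cycle restarts
```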

F Case Study
In Figure 6, we illustrate document samples extracted from the MSNBC test dataset, wherein the mention "The Fed" can be readily disambiguated if provided with the corresponding topic.
Besides topic coherence, the relationships among entities also matter in the ED task. In the upper part of Figure 6, mentions are highly correlated at the category level, while previous bi-encoder and cross-encoder solutions completely ignore the dependencies among entities and focus on learning representations alone.
Consider the upper document in Figure 6, where three mentions are highlighted with a gray background. The correct linked entity for the mention "New York Stock Exchange" shares the exact category "Stock exchanges in the United States" with the correct linked entity for "Nasdaq Stock Market." Moreover, these two mentions can implicitly guide the entity prediction for "The Fed" due to the high correlation between their respective categories.

Figure 2: Topic variational autoencoder injection illustration. K topic sentences are converted into topic tokens prepended to the start of the model's input.

Figure 4: T-SNE visualization of category embeddings after pre-training. The zoomed part denotes the specific category labels and corresponding sample points. The category "Stock market indices" is omitted for clear depiction.

Figure 6: ED case analysis revealing two critical motivations of our work, topic coherence and categorical coherence. Mentions are annotated with a gray background. Two documents share the same mention "The Fed". The document in the upper part is centered around the economics topic while the lower one elaborates on the music topic. The middle part lists four entity candidates for the mention "The Fed", with some corresponding category labels on a yellow background.

Table 1: ED InKB micro F1 scores on test datasets. The best value is in bold and the second best is underlined. * means results are reproduced from the official open-source code. † indicates non-comparable metrics due to an unfair experimental setting. - indicates not reported in the original paper. For direct comparison with Yamada et al. (2022), AVG-5 reports the average micro F1 score on all test datasets except AIDA.
First, the document is tokenized into $L_D$ text tokens and split into sentences. If the text tokens fit in the word input window (i.e., $L_D \le L - k - n_e$), we utilize all word tokens and sample $k$ topic sentences uniformly. Otherwise, we truncate the document with the sentence to be disambiguated as the center, and prioritize sampling the $k$ topic sentences from outside the trimmed range. Selected sentences are encoded by the pre-trained topic encoder, and their topic representations are prepended to the input sequence. Note that the topic decoder is no longer needed, as it only supports variational learning in the auxiliary branch. Lastly, $N$ [MASK] tokens are appended to the sequence, indicating that all $N$ mentions are unresolved. For training samples where $N < n_e$, we append $n_e - N$ more [PAD] tokens.
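The input preparation can be sketched as follows; the window-growing and sampling details are simplified assumptions of the procedure described above, and `tokenize` stands in for the actual subword tokenizer.

```python
import random

def prepare_inputs(sentences, center_idx, max_word_tokens, k, tokenize):
    """Pick the word-input window around the target sentence and k topic sentences."""
    lengths = [len(tokenize(s)) for s in sentences]
    if sum(lengths) <= max_word_tokens:
        kept = list(range(len(sentences)))            # whole document fits
    else:
        kept = [center_idx]
        budget = max_word_tokens - lengths[center_idx]
        left, right = center_idx - 1, center_idx + 1
        while budget > 0 and (left >= 0 or right < len(sentences)):
            for idx in (left, right):                 # grow the window around the center
                if 0 <= idx < len(sentences) and lengths[idx] <= budget:
                    kept.append(idx)
                    budget -= lengths[idx]
            left, right = left - 1, right + 1
        kept.sort()
    # prioritize sampling topic sentences from outside the trimmed range
    outside = [i for i in range(len(sentences)) if i not in kept] or kept
    topic_idx = random.sample(outside, min(k, len(outside)))
    word_tokens = [tok for i in kept for tok in tokenize(sentences[i])]
    return topic_idx, word_tokens
```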