CHOLAN: A Modular Approach for Neural Entity Linking on Wikipedia and Wikidata

In this paper, we propose CHOLAN, a modular approach to target end-to-end entity linking (EL) over knowledge bases. CHOLAN consists of a pipeline of two transformer-based models integrated sequentially to accomplish the EL task. The first transformer model identifies surface forms (entity mentions) in a given text. For each mention, a second transformer model is employed to classify the target entity among a predefined candidate list. The latter transformer is fed with an enriched context captured from the sentence (i.e., local context) and the entity description obtained from Wikipedia. Such external contexts have not been used in state-of-the-art EL approaches. Our empirical study was conducted on two well-known knowledge bases (i.e., Wikidata and Wikipedia). The empirical results suggest that CHOLAN outperforms state-of-the-art approaches on standard datasets such as CoNLL-AIDA, MSNBC, AQUAINT, ACE2004, and T-REx.


Introduction
The explicit schema, graph-based structure, and interlinking nature of information represented in publicly available knowledge graphs (KGs), e.g., DBpedia (Auer et al., 2007), Freebase (Bollacker et al., 2007), and Wikidata (Vrandecic, 2012), or knowledge bases (KBs) such as Wikipedia, introduce a new landscape of features as well as structured knowledge and embeddings. Researchers have developed several techniques to align information available in unstructured text to the concepts of these KGs (Wu et al., 2019b; Broscheit, 2019).
The end-to-end Entity Linking (hereafter EL) task follows this direction: given a sentence, EL first identifies the entity mentions in the sentence and then maps these mentions to the most likely KG/KB entities. EL comprises a three-step process. Consider the example sentence Soccer: Late Goals Give Japan Win Over Syria. The first step, called mention detection (MD), identifies the surface forms Japan and Syria. The next step is candidate generation (CG), which aims to find a list of possible entity candidates in the KG/KB for each entity mention. For example, the candidate list for the mention Japan includes Japan national football team, Japan (country), and Japan (Band), and the list for Syria includes Syria (Roman province), Syria national football team, and Greater Syria. Finally, the third step is entity disambiguation (ED), which employs coreference and contextual features to discriminate the most likely entity from the candidate list; here, Japan national football team and Syria national football team are the correct entities.
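To make the three-step decomposition concrete, the following minimal Python sketch walks the example sentence through MD, CG, and ED. The candidate lists, descriptions, and the word-overlap scorer are illustrative stand-ins for the learned components described later, not part of any actual EL system.

```python
import re

# Toy MD/CG/ED pipeline for the running example. The candidate lists and the
# overlap-based scorer are illustrative stand-ins for learned components.
CANDIDATES = {
    "Japan": ["Japan (country)", "Japan national football team", "Japan (Band)"],
    "Syria": ["Syria (Roman province)", "Syria national football team", "Greater Syria"],
}
DESCRIPTIONS = {  # short, made-up entity descriptions used as entity context
    "Japan (country)": "island country in East Asia",
    "Japan national football team": "men's national soccer team representing Japan",
    "Japan (Band)": "English new wave band",
    "Syria (Roman province)": "province of the Roman Empire",
    "Syria national football team": "men's national soccer team representing Syria",
    "Greater Syria": "historical region in the Levant",
}

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def mention_detection(sentence):
    # Stand-in for a trained BERT tagger: detect mentions by dictionary lookup.
    return [m for m in CANDIDATES if m.lower() in tokens(sentence)]

def entity_disambiguation(sentence, candidates):
    # Stand-in for a classifier: score each candidate by the word overlap
    # between its description and the sentence (the local context).
    return max(candidates, key=lambda c: len(tokens(DESCRIPTIONS[c]) & tokens(sentence)))

sentence = "Soccer: Late Goals Give Japan Win Over Syria"
for mention in mention_detection(sentence):
    print(mention, "->", entity_disambiguation(sentence, CANDIDATES[mention]))
# Japan -> Japan national football team
# Syria -> Syria national football team
```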
Entity Linking approaches broadly fall into three categories. The initial attempts (Hoffart et al., 2011; Piccinno and Ferragina, 2014) solve MD and ED as independent sub-tasks of EL (i.e., pipeline-based systems). However, in these approaches errors propagate from MD to ED and can thus degrade the overall performance of the system. The second category emerged in an attempt to mitigate these errors: researchers focused on jointly modelling MD and ED, emphasising the mutual dependency of the two sub-tasks (Kolitsas et al., 2018). Both of these EL approaches depend on an intermediate candidate generation step and rely on a pre-computed list of entity candidates. For example, Kolitsas et al. (2018) propose a joint MD and ED model that inherits the candidate list from (Ganea and Hofmann, 2017). The third category combines the three sub-steps in a joint model, illustrating that each of these tasks is interdependent (Durrett and Klein, 2014; Broscheit, 2019).
Recent EL approaches focus on jointly modelling two or three subtasks (Sevgili et al., 2020). Furthermore, the NLP research community has extensively used transformers in end-to-end models for entity linking (Broscheit, 2019; Peters et al., 2019; Févry et al., 2020). Nevertheless, these works report lower performance than (Kolitsas et al., 2018), which is a bi-LSTM based model. The observations regarding the limited performance of transformer-based models for EL motivate our work; in this paper, our focus is to understand the bottlenecks in the entity linking process. We argue that the less studied task in the literature, i.e., candidate generation, plays an essential role in EL models' performance, and it has not been a focus in recently proposed transformer-based entity linking models.
In this paper, we hypothesise that transformer models, though trained on large corpora, may require additional task-specific context. Furthermore, inducing context at the entity disambiguation step may positively impact overall performance; this has not been exploited in state-of-the-art methods due to their monolithic implementations (Kolitsas et al., 2018; Peters et al., 2019; Broscheit, 2019; Févry et al., 2020). Subsequently, we deviate from the joint modelling of two or three EL subtasks and revert to the methodology adopted by earlier EL systems (Hoffart et al., 2011), i.e., treating each subtask independently. As such, we study the research question: RQ: what is the impact of each sub-task (aka component) on the overall outcome of a transformer-based entity linking approach? We propose an intuitive novel approach named CHOLAN, comprising a modular architecture of two transformer models to solve MD and ED independently. In the first step, CHOLAN employs a BERT model (Devlin et al., 2019) to identify mentions of entities in an input sentence. The second step expands each mention with a list of KB entity candidates. Finally, the entity mention, the sentence (local context), an entity candidate, and the entity's Wikipedia description (entity context) are fed as input sequences into a second BERT-based model to predict the correct KB entity (cf. Figure 1). We train the MD and ED steps independently, and at test time we run the CHOLAN pipeline end-to-end to predict the KB entity. The following are the novel features of CHOLAN:
• The core focus of the approach is to flexibly induce external context and candidate lists in a transformer-based model to improve EL performance. CHOLAN is independent of a particular candidate list and additional background context. We study four different configurations of CHOLAN to demonstrate the impact of the candidate generation step and the background knowledge (i.e., entity and sentential context) induced in the model. CHOLAN achieves a new state-of-the-art performance on several datasets: T-REx (ElSahar et al., 2018) for Wikidata; AIDA-B, MSNBC, AQUAINT, and ACE2004 for Wikipedia (Hoffart et al., 2011; Guo and Barbosa, 2018).
• CHOLAN is the first approach empirically demonstrated to be transferable across KBs with completely different underlying structures and schemas, i.e., the semi-structured Wikipedia and the fully structured Wikidata.
The implementation is publicly available 1. The paper is structured as follows: the next section summarises the related work. Section 3 describes the problem statement and approach. Section 4 explains the experimental settings, followed by results in Section 5. We conclude in Section 6.

Related Work
Mention Detection (MD): The first attempt to organise a named entity recognition (NER) task dates back to 1996 (Grishman and Sundheim, 1996). Since then, numerous approaches have been proposed, ranging from conditional random fields (CRFs) with features constructed from dictionaries (Rocktäschel et al., 2013) to feature-inferring neural networks (Collobert and Weston, 2008). Recently, contextual-embedding-based models have achieved the state of the art for the NER/MD task (Akbik et al., 2018). We point to the survey by Yadav and Bethard (2018) for details about NER. A few early EL models performed the MD task independently (Ceccarelli et al., 2013; Cornolti et al., 2016).

Candidate Generation (CG): There are four prominent approaches for candidate generation. The first is direct matching of entity mentions against a pre-computed candidate set (Zwicklbauer et al., 2016). The second approach is dictionary lookup, where a dictionary of the associated aliases of entity mentions is compiled from several knowledge base sources (e.g., Wikipedia, WordNet) (Sevgili et al., 2020; Fang et al., 2019; Cao et al., 2017). The third approach generates entity candidates using an empirical probabilistic entity-map p(e|m), a pre-calculated prior probability of correspondence between mentions and entities (a minimal lookup sketch appears at the end of this section). A widely used entity map was built by (Ganea and Hofmann, 2017) from Wikipedia hyperlinks, Crosswikis (Spitkovsky and Chang, 2012), and YAGO (Hoffart et al., 2011) dictionaries. End-to-end EL approaches such as (Kolitsas et al., 2018; Cao et al., 2018) rely on the entity map built by Ganea and Hofmann. The fourth approach for generating candidates is proposed by (Sakor et al., 2019): the authors build a local KG by expanding entity mentions using Wikidata and DBpedia entity labels and associated aliases. The local KG can be queried using the BM25 ranking algorithm (Logeswaran et al., 2019). The modular architecture of CHOLAN gives us the flexibility to experiment with several ways of generating entity candidates. Hence, we reuse the candidate list proposed by (Ganea and Hofmann, 2017) and build a new CG approach based on (Sakor et al., 2019).

End-to-end EL: A few EL approaches accomplish the MD and ED tasks jointly. (Nguyen et al., 2016) propose joint recognition and disambiguation of named-entity mentions using a graphical model and show that it improves EL. The work in (Kolitsas et al., 2018) also proposes a joint model for MD and ED: the authors use a bi-LSTM based model for mention detection and compute the similarity between the entity mention embedding and a set of predefined entity candidates. The work in (Broscheit, 2019) employs BERT to jointly model the three subtasks of EL; the author employs an entity vocabulary of the 700K most frequent entities to train the model. The work in (Févry et al., 2020) uses a Transformer architecture with large-scale pre-training from Wikipedia links for EL; the authors train the model to predict BIO-tagged mention boundaries and disambiguate among all entities. For the Wikidata KG, OpenTapioca is an entity linking approach that relies on a heuristic-based model for disambiguating the mentions in a text to Wikidata entities (Delpeuch, 2020). Arjun (Mulang et al., 2020) is the approach most similar to CHOLAN; it trains two independent neural models for MD and ED and generates candidates on the fly using a Wikidata entity alias map. Arjun does not induce any context in the model.
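As a concrete illustration of the third CG strategy, the sketch below performs a top-k lookup in a p(e|m) prior map. The entries and probabilities are invented for illustration and are not taken from the Ganea and Hofmann (2017) resource.

```python
# Illustrative p(e|m) entity map: for each mention string, candidate entities
# paired with a prior probability. Values here are invented examples.
PEM = {
    "Japan": [
        ("Japan (country)", 0.87),
        ("Japan national football team", 0.09),
        ("Japan (Band)", 0.01),
    ],
}

def candidates(mention, k=30):
    # Return up to k candidates for a mention, ranked by the prior p(e|m).
    return sorted(PEM.get(mention, []), key=lambda ep: ep[1], reverse=True)[:k]

print(candidates("Japan", k=2))
# [('Japan (country)', 0.87), ('Japan national football team', 0.09)]
```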

Problem Statement and Approach
We formally define the EL task as follows: given an input sequence of words $W = \{w_1, w_2, w_3, \ldots, w_n\}$ and a set of entities $\mathcal{E}$ from a KG/KB, the EL task aligns the text to a subset of entities, represented as $\Theta : W \rightarrow E$ where $E \subset \mathcal{E}$. We formulate the EL task as a three-step process in which the first step is mention detection (MD). MD is a function $\theta_1 : W \rightarrow M$, where the set of mentions is denoted by $M = \{m_1, m_2, \ldots, m_k\}$ $(k \leq n)$ and each mention $m_x = (w_i, \ldots, w_j)$ is a contiguous sequence of words from start position $i$ to end position $j$. The next task is candidate generation, where for each mention $m_x$ a set of candidates $C(m_x) = \{e_{x_1}, \ldots, e_{x_n} \mid e_{x_i} \in \mathcal{E}\}$ is derived. Finally, the entity disambiguation (ED) task aims to map each mention $m_x \in M$ to the most likely entity from its list of candidates. In our case, we model the ED task as a classification task and augment the input with extra signals as context. For every candidate entity $c_i \in C(m_x)$, the model estimates a probability $p_i$; the most likely entity is the one with the highest probability:

$$\hat{e}_x = \underset{c_i \in C(m_x)}{\arg\max}\ p(c_i \mid m_x, W, C) \qquad (1)$$

where $W$ and $C$ are the input representations for the given sentence (local context) and the context derived from the KG/KB, respectively. As such, the probability score $p_i$ is conditioned not only on $m_x$ and $c_{x_i}$ but also on $W$ and $C$ as contextual parameters.
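This formulation translates directly into an argmax over classifier scores. The sketch below is a schematic rendering of Equation 1; the function and parameter names are ours, with `score` standing in for the learned probability estimator described in the ED section.

```python
from typing import Callable, List

def disambiguate(mention: str, sentence: str,
                 candidates: List[str], kb_contexts: List[str],
                 score: Callable[[str, str, str, str], float]) -> str:
    # Equation 1: e_hat = argmax over c_i in C(m) of p(c_i | m, W, C),
    # where `score` plays the role of the learned probability p_i.
    probs = [score(mention, sentence, c, ctx)
             for c, ctx in zip(candidates, kb_contexts)]
    return candidates[probs.index(max(probs))]
```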

CHOLAN Approach
The CHOLAN architecture comprises three main modules, as illustrated in Figure 1.

Mention Detection (MD)
We adapt the vanilla BERT model (Devlin et al., 2019) for the task of entity mention detection in unstructured text. For each input sentence, we append the special tokens [CLS] and [SEP] to the beginning and end of the sentence, respectively. This is then used as input to the model, which learns a representation of the tokens in the sentence. We then add a (logistic regression based) classification layer on top of the BERT model to determine named entity tags for each token following the BIO format (Sang and Meulder, 2003). Our BERT model is initialised with publicly available weights from the pre-trained BERT-Base model and is fine-tuned on the specific dataset for detecting a mention $m_i$. Note that the BERT-Base model has successfully outperformed previous approaches on various NLP tasks, including MD; thus, we reuse this model in our approach.
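A minimal sketch of this MD setup with the Hugging Face Transformers library is shown below. The checkpoint name and three-tag label set are illustrative, and the classification head is untrained here, so its predictions are meaningless until fine-tuned on an EL dataset.

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

LABELS = ["B", "I", "O"]  # BIO mention tags; the index order is our choice

# BertForTokenClassification adds a linear classification layer on top of
# BERT, matching the per-token tagging head described above.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased",
                                                   num_labels=len(LABELS))

sentence = "Soccer: Late Goals Give Japan Win Over Syria"
inputs = tokenizer(sentence, return_tensors="pt")  # prepends [CLS], appends [SEP]

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, seq_len, num_labels)
tag_ids = logits.argmax(dim=-1)[0].tolist()

for tok, tag_id in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
                       tag_ids):
    print(f"{tok}\t{LABELS[tag_id]}")  # random tags until the head is fine-tuned
```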

Candidate Generation (CG)
A critical focus of CHOLAN is to understand the bottleneck at the CG step. Hence, we reuse the DCA candidate list and propose a novel candidate list to understand the impact of candidate generation on overall EL performance.
DCA Candidates: (Yang et al., 2019) adapt the probabilistic entity-map p(e|m) created by (Ganea and Hofmann, 2017) (cf. Section 2) to calculate the prior probabilities of candidate entities for a given mention. In this probabilistic entity-map, each entity mention has 30 potential entity candidates. Yang and colleagues also provide the associated Wikipedia description of each entity. In CHOLAN, we reuse the candidate set $C(m)$ provided by (Yang et al., 2019) and further consider the associated Wikipedia entity descriptions.
Falcon Candidates: (Sakor et al., 2019) created a local index of KG items from Wikidata entities expanded with entity aliases. For example, in Wikidata the entity Q33 has the label "Finland"; Sakor and colleagues expanded the entity label with other aliases from Wikidata such as "Finlande", "Finnia", "Land of Thousand Lakes", "Suomi", and "Suomen tasavalta". We adopt this local KG index to generate entity candidates for each entity mention in the employed datasets. The local KG is queried using the BM25 algorithm (cf. Equation (2)), and the results are ranked by the calculated score. We build a predefined candidate set $C_{Falcon}(m)$ from the top 30 Wikidata entity candidates for each entity mention. We enrich the candidate set obtained from Wikidata with the corresponding Wikipedia entities, and we add the first paragraph of the corresponding Wikipedia article as the entity description (only if the Wikidata entity has a corresponding Wikipedia page). By selecting two different candidate lists, our aim is to understand the impact of the candidate generation step on end-to-end entity linking performance (a BM25 lookup sketch follows below).
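The following sketch mimics the alias-index lookup with the rank_bm25 package over a toy index. The alias entries come from the Q33 example above, and the ranking logic is a simplified approximation of the Falcon-style retrieval, not its actual implementation.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Toy local index of (Wikidata ID, label-or-alias) pairs. A real index would
# cover all Wikidata entity labels and aliases, as in (Sakor et al., 2019).
INDEX = [
    ("Q33", "Finland"), ("Q33", "Suomi"), ("Q33", "Finlande"),
    ("Q33", "Land of Thousand Lakes"), ("Q183", "Germany"),
]
bm25 = BM25Okapi([alias.lower().split() for _, alias in INDEX])

def falcon_candidates(mention, k=30):
    # Score every alias against the mention and keep the top-k distinct IDs.
    scores = bm25.get_scores(mention.lower().split())
    ranked = sorted(zip(INDEX, scores), key=lambda pair: pair[1], reverse=True)
    out, seen = [], set()
    for (qid, alias), score in ranked:
        if score > 0 and qid not in seen:
            seen.add(qid)
            out.append((qid, alias, round(score, 3)))
        if len(out) == k:
            break
    return out

print(falcon_candidates("Land of Thousand Lakes"))  # Q33 ranked first
```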

Entity Disambiguation (ED)
To leverage the power of transformers, we propose "WikiBERT" to perform the ED task. In WikiBERT, our novel methodological contribution is the induction of local sentential context and global entity context at the ED step in a transformer model, which has not been used in recent EL models. WikiBERT is derived from the vanilla BERT-Base model and fine-tuned on the two EL datasets (CoNLL-AIDA and T-REx). We view the ED task as a sequence classification task.
The input to our model is a combination of two sequences. The first sequence $S_1$ concatenates the entity mention $m \in M$ and the sentence $W$, where the sentence acts as the local context. The second sequence $S_2$ is a concatenation of an entity candidate $e \in C(m)$ or $C_{Falcon}(m)$ (obtained from Equation 2) and its corresponding Wikipedia description (entity context $ct_i$). The two sequences are paired together with special start and separator tokens: [CLS] $S_1$ [SEP] $S_2$ [SEP]. The sequences are fed into the model, which in turn learns the input representations according to the BERT architecture (Devlin et al., 2019). Any given token (local context word, entity mention, or entity context word) is represented by the summation of three embeddings:
i. Token embedding: the embedding of the corresponding token. We note that specific tokens make our input representation more specialised compared to other fine-tuning tasks: the entity mention tokens are appended at the beginning of $S_1$ and separated from the sentence context tokens by a single vertical bar token |; likewise, for the entity context sequence $S_2$, we prepend the entity title tokens from the KB before adding the description.
ii. Segment embedding: each of the sequences receives a single representation, such that the segment embedding for the local context $E_{LC}$ refers to the representation of $S_1$, whereas $E_{EC}$ is the representation of $S_2$.
iii. Position embedding: represents the position of the token in an input sequence. A token appearing at the $i$-th position in the input sequence is represented with $E_i$.
To train the model, we use a negative sampling approach similar to Yamada and Shindo (2019). The candidate list is generated for each identified mention; the correct entity candidate is labelled as one, and the remaining incorrect candidates (from the candidate list) are labelled as zero for a given mention. This process iterates over all the identified mentions using Equation 1 (a sketch of the input construction and scoring follows below).
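The sketch below assembles the two sequences and scores each candidate. The helper names are ours, the vertical-bar separator follows the $S_1$ layout described above, and the untrained classification head stands in for the fine-tuned WikiBERT.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained("bert-base-cased",
                                                      num_labels=2)

def build_pair(mention, sentence, candidate, description):
    # S1 = mention | sentence (local context); S2 = entity title followed by
    # its description (entity context). The tokenizer encodes the pair as
    # [CLS] S1 [SEP] S2 [SEP].
    s1 = f"{mention} | {sentence}"
    s2 = f"{candidate} {description}"
    return tokenizer(s1, s2, truncation=True, return_tensors="pt")

def disambiguate(mention, sentence, candidates, descriptions):
    # Score each (S1, S2) pair; the positive-class probability plays the role
    # of p_i in Equation 1, and the argmax selects the predicted KB entity.
    probs = []
    for cand in candidates:
        inputs = build_pair(mention, sentence, cand, descriptions.get(cand, ""))
        with torch.no_grad():
            logits = model(**inputs).logits          # shape: (1, 2)
        probs.append(torch.softmax(logits, dim=-1)[0, 1].item())
    return candidates[probs.index(max(probs))]

# For training, the gold candidate pair receives label 1 and the remaining
# candidates label 0, following the negative-sampling scheme described above.
```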
The training process fine-tunes BERT using the contextual input from the sentence and Wikipedia, resulting in the WikiBERT model (Equation (3)). The model predicts the relatedness of the two sequences by classifying the pair as either positive or negative.

Experimental Setup

Datasets
For Wikidata, we use the T-REx dataset (ElSahar et al., 2018). For Wikipedia, we use the CoNLL-AIDA dataset (Hoffart et al., 2011) and the MSNBC, AQUAINT, and ACE2004 datasets from (Guo and Barbosa, 2018).

Baselines over Wikidata
We now briefly explain the Wikidata baselines.
1. OpenTapioca (Delpeuch, 2020): a heuristic-based end-to-end approach that depends on topic similarity and mapping coherence for linking Wikidata entities in an input text.
2. Arjun (Mulang et al., 2020): a pipeline of two attentive neural networks employed for MD and ED. Arjun is the SotA, and we take the baseline values from Arjun's paper.

Baselines over Wikipedia
1. (Hoffart et al., 2011): build a weighted graph of entity mentions and candidate entities; the model then computes a dense subgraph that predicts the best joint mention-entity mapping.
2. DBpedia Spotlight (Mendes et al., 2011): proposes a probabilistic model and relies on the context of the text to link the entities.
3. KEA (Steinmetz and Sack, 2013): employs a linguistic pipeline coupled with metadata generated from several Web sources; the candidates are ranked using a heuristic approach.
4. Babelfy (Moro et al., 2014): a graph-based approach that uses loose identification of candidate meanings coupled with a densest-subgraph heuristic to link the entities.
5. Piccinno and Ferragina (2014): the authors focus on mention recognition and annotation pruning and propose a voting algorithm for entity candidates using PageRank.
6. Kolitsas et al. (2018): train the MD and ED tasks jointly using word- and character-level embeddings. The model reuses the candidate set from (Ganea and Hofmann, 2017) and generates a global voting score to rank the entity candidates.
7. Peters et al. (2019): induce multiple KBs into a large pretrained BERT model with a knowledge attention mechanism.
8. Broscheit (2019): trains the MD, CG, and ED tasks jointly using a BERT-based model; an entity vocabulary containing the 700K most frequent entities in English Wikipedia is utilised.
9. Févry et al. (2020): use large-scale pretraining from Wikipedia links as the context for a transformer model to predict KB entities.
In the Wikipedia-based experiments, we report values from (Févry et al., 2020) and (Kolitsas et al., 2018) for the AIDA-B test set. On the MSNBC (MSB), AQUAINT (AQ), and ACE2004 (ACE) test datasets, only (Kolitsas et al., 2018), DBpedia Spotlight (Mendes et al., 2011), KEA (Steinmetz and Sack, 2013), and Babelfy (Moro et al., 2014) report values, and we compare against them.

CHOLAN Configurations
We configure the CHOLAN model with various candidate generation approaches, detailed below.
CHOLAN-Wikidata: we train the model on the T-REx dataset and employ the $C_{Falcon}(m)$ candidate set. The ED model (WikiBERT) is fed with the sentential context but not with entity descriptions, as not all Wikidata entities have a corresponding Wikipedia entity.
CHOLAN-Wiki+FC: is trained on CoNLL-AIDA (Hoffart et al., 2011). For the CG step, we employ the Falcon candidate set $C_{Falcon}(m)$. Here, the ED model (WikiBERT) is fed only with the sentential context.
CHOLAN-Wiki+DCA: we train the MD and ED models on CoNLL-AIDA. The CG step uses the DCA candidate set $C(m)$. During the ED step (WikiBERT), the Wikipedia description associated with each entity is fed along with the sentential context.
CHOLAN: inherits CHOLAN-Wiki+FC, but in addition the Wikipedia entity description is induced into the ED model (WikiBERT).

Metrics and Hyper-parameters
In the Wikidata-based experiments, we employ the standard metrics, i.e., precision (P), recall (R), and F-score (F), the same as (Mulang et al., 2020). For the Wikipedia-based datasets, we use the Micro-F1 score in the strong matching setting (Kolitsas et al., 2018). Strong matching requires exactly predicting the gold mention (i.e., target entity mention) boundaries and its corresponding entity annotation in the KB. To compare the recall of the two CG approaches, we report the gold recall. Gold recall is the percentage of entity mentions for which the candidate set contains the ground-truth entity (Yao et al., 2019).
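A minimal sketch of the gold recall computation (the function and variable names are ours):

```python
def gold_recall(candidate_sets, gold_entities):
    # Percentage of mentions whose candidate set contains the ground-truth
    # entity; an upper bound on end-to-end recall after the CG step.
    hits = sum(gold in cands
               for cands, gold in zip(candidate_sets, gold_entities))
    return 100.0 * hits / len(gold_entities)

print(gold_recall([["Q33", "Q183"], ["Q5", "Q42"]], ["Q33", "Q1"]))  # 50.0
```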
We implemented all our models in PyTorch 3 and optimised them using Adam (Kingma and Ba, 2015). We used the pre-trained BERT models from the Transformers library (Wolf et al., 2019). We ran all the experiments on a single GeForce GTX 1080 Ti GPU with 11GB of memory. Table 1 outlines the hyper-parameters used for fine-tuning on both datasets; we followed the standard settings suggested by (Devlin et al., 2019). The average run time is 9.31 hours/epoch for CHOLAN; without descriptions, it was 7.23 hours/epoch.
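For orientation, a sketch of the fine-tuning setup is given below; the learning rate shown is a common BERT fine-tuning default, not necessarily the value in Table 1.

```python
import torch
from transformers import BertForSequenceClassification

# Illustrative optimiser wiring; lr=2e-5 is a common BERT fine-tuning
# default, not necessarily the Table 1 value.
model = BertForSequenceClassification.from_pretrained("bert-base-cased",
                                                      num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # Kingma and Ba (2015)

def training_step(inputs, labels):
    # `inputs` are tokenised [CLS] S1 [SEP] S2 [SEP] pairs; `labels` are the
    # 0/1 negative-sampling targets from the ED section.
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```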

Results
We study the following research question: what is the impact of each sub-task (aka component) on the overall outcome of the transformer-based entity linking approach? We further investigate a sub-research question: how do the external context and the candidate generation step impact the overall performance of CHOLAN? Each of our experiments systematically studies these research questions in a different setting. Table 2 summarises CHOLAN's performance on the T-REx dataset. The CHOLAN-Wikidata configuration outperforms the baselines. Digging deeper into the reported values, we observe that for the MD task, our F-score is 94.3 (compared to the 77 F-score of Arjun (Mulang et al., 2020)). However, the gold recall for the CG step is 81.2. We generate the entity candidates using an information retrieval approach (the BM25 algorithm), taking the top 30 candidates by confidence score. The Wikidata KG is challenging, and many labels share the same name; this contributes to a large loss in the F-score at the CG step. For instance, the entity mention "National Highway" matches exactly with four Wikidata ID labels, while 2,055 other entities contain the full mention in their labels. Please note that we did not retrain (Kolitsas et al., 2018) (SotA on Wikipedia EL) on the T-REx dataset, since the model is tightly coupled to, and relies on, the pre-computed Wikipedia candidate list from (Ganea and Hofmann, 2017).

Ablation Study on Wikidata
We study the impact of the local context on the performance of CHOLAN by excluding the sentence as input to the ED step at training and testing time. Hence, the inputs to the ED model are only the entity mention and the entity candidates obtained from the CG step. We observe that the performance drops when the local sentential context is not fed (cf. Table 3). Table 5 also confirms the transferability of CHOLAN in cross-domain experiments.

Ablation Study on Wikipedia
We conducted three ablation studies to understand the behaviour of CHOLAN's configurations on the Wikipedia datasets. The first study compares the two candidate generation approaches; the gold recall values are reported in Table 6. CG plays a crucial role in trading off precision and recall, and we conclude that more robust CG approaches would likely improve overall performance. The second ablation study calculates the performance of our configurations for the ED step, i.e., running WikiBERT in isolation. Here, we assume that all entities are correctly recognised; thus, the focus of the study is the ED model. We report the impact of the various candidate generation approaches on the ED model in Table 7. The significant jump in performance from CHOLAN-Wiki+FC to CHOLAN is attributable to the additional background knowledge provided in CHOLAN as entity candidate descriptions. The third ablation study tests the impact of the sentential context fed into two configurations on a Wikipedia dataset. Table 8 reports the performance after excluding the sentence as additional context: as expected, the performance decreases. The model shows similar behaviour on T-REx in Table 3. These observations confirm our hypothesis that the ED model is enhanced by additional contexts.

Conclusions
Over the last two years, the NLP research community has extensively tried transformer-based models for the EL task; however, their performance remained lower than that of Kolitsas et al. (2018). This paper combines the traditional software engineering principle of modular architecture with context-induced transformers to effectively solve the EL task. Our reason for deviating from an end-to-end architecture was to give our system full flexibility in terms of the candidate generation list, the underlying KG, and the induction of context at the ED step. We attribute CHOLAN's outperformance to the following reasons: 1) the modular architecture, which brings flexibility and interoperability, as CHOLAN can treat each task independently. Kolitsas et al. (2018) report that shifting towards joint modelling of the MD and ED tasks helps mitigate error propagation from MD to ED. However, the performance of BERT-Base on the MD task is significantly high (92.3 F1-score on AIDA-B and 94.3 on T-REx, calculated by us), remarkably reducing the errors in MD. CHOLAN leverages this capability in the MD subtask, placing more focus on the CG and ED tasks.
2) The flexibility of the architecture further permits us to induce sentences and entity descriptions as additional contexts. Furthermore, using candidate lists in a plug-and-play manner has resulted in a significant increase in performance. In earlier transformer approaches, the implementation is monolithic and context is not utilised. There is scope for improvement in our approach: Wu et al. (2019a) introduce a novel CG method that retrieves candidates in a dense space defined by a bi-encoder, which could serve as an alternative CG approach. We aim to scale CHOLAN to multilingual entity linking as a viable next step.