mReFinED: An Efficient End-to-End Multilingual Entity Linking System

End-to-end multilingual entity linking (MEL) is concerned with identifying multilingual entity mentions and their corresponding entity IDs in a knowledge base. Prior efforts assume that entity mentions are given and skip the entity mention detection step due to a lack of high-quality multilingual training corpora. To overcome this limitation, we propose mReFinED, the first end-to-end MEL model. Additionally, we propose a bootstrapping mention detection framework that enhances the quality of training corpora. Our experimental results demonstrate that mReFinED outperforms the best existing work on the end-to-end MEL task while being 44 times faster.


Introduction
End-to-end entity linking (EL) is the task of identifying entity mentions within a given text and mapping them to the corresponding entity in a knowledge base. End-to-end EL plays a crucial role in various NLP tasks, such as question answering (Nie et al., 2019; Asai et al., 2020; Hu et al., 2022) and information retrieval (Zhang et al., 2022).
To clarify the terminology used in this paper and previous work, it should be noted that when referring to EL in previous work, we are specifically referring to entity disambiguation (ED), where the entity mentions are given. We will only refer to our definition of EL (mention detection with ED) using the full name "end-to-end EL". Existing EL research has extended models to support over 100 languages in a single model using Wikipedia as the training corpus. We call this task multilingual entity linking (MEL). Recent work proposed MEL frameworks by minimizing the discrepancy between mention and entity description representations (Botha et al., 2020) or between representations of the same mention in different contexts (FitzGerald et al., 2021), based on bi-encoder pre-trained language models (Devlin et al., 2019). An alternative method is predicting the target entity's Wikipedia title in an auto-regressive manner (Cao et al., 2022) by extending the sequence-to-sequence pipeline of Cao et al. (2021). However, none of the existing works perform end-to-end MEL because of a lack of high-quality multilingual entity mention training resources. For example, we found that Wikipedia suffers from an unlabelled entity mention problem, i.e. not all entity mentions have hyperlink markups to train a reliable mention detection model. Thus, devising an end-to-end MEL system remains a challenging task. In this paper, we propose the first end-to-end MEL system. To address the unlabelled mention problem in end-to-end MEL, we propose a bootstrapping mention detection (MD) framework. Our framework leverages an existing multilingual MD model to create a bootstrapped dataset, which we use to train a new mention detection model for annotating unlabelled mentions in Wikipedia. The framework improves the detection of named and non-named entities in Wikipedia compared to previous multilingual MD approaches (Honnibal et al., 2020; Hu et al., 2020; Tedeschi et al., 2021). To construct the end-to-end MEL system, we extend ReFinED (Ayoola et al., 2022), since it is comparable to the state-of-the-art (SOTA) models in the English end-to-end EL setting and significantly faster than any other method to date. We call this new model mReFinED. Our code is released at: https://github.com/amazon-science/ReFinED/tree/mrefined.
To demonstrate mReFinED's effectiveness, we compare it with the SOTA MEL model (Cao et al., 2022) on the end-to-end MEL task across two datasets, Mewsli-9 (Botha et al., 2020) and TR2016-hard (Tsai and Roth, 2016). Experimental results show that mReFinED outperforms a two-stage model (combining SOTA MD and MEL models) on both datasets. Moreover, mReFinED's inference speed is 44 times faster than the SOTA MEL model.
Our contributions are as follows: We propose the first end-to-end MEL in a single model by extending ReFinED to multilingual ReFinED. In addition, we propose a bootstrapping mention detection framework to solve the unlabelled mention problem in end-to-end MEL.

Methodology
Overview. We first fine-tune a mention detection (MD) model, based on a multilingual pre-trained language model (PLM), with our bootstrapping MD framework, as shown in Figure 1. We then use the bootstrapped MD model to annotate unlabelled mentions in Wikipedia. Finally, we use the bootstrapped data to train mReFinED in a multi-task manner (Ayoola et al., 2022), which includes mention detection, entity type prediction, entity description scoring, and entity disambiguation.

Bootstrapping Mention Detection
Figure 1: Overview of the bootstrapping MD framework. A multilingual NER model adds new annotations to multilingual Wikipedia; the augmented dataset (existing hyperlinks plus new annotations as ground truths) is then used to fine-tune the MD model.

As shown in Figure 1, we employ an existing multilingual named entity recognition (NER) model to annotate unlabelled mentions in Wikipedia corpora. Our framework allows various choices of existing NER models (Honnibal et al., 2020; Hu et al., 2020; Tedeschi et al., 2021) without any constraints. Based on our MD experiments in Section 3.3, we adopted the XTREME NER model (fine-tuned on 40 languages) since it supports every language in both MEL datasets.
We then train a bootstrapped multilingual MD model using the same PLM as previous MD works (mBERT; Devlin et al., 2019) with the newly annotated mentions and existing markups in Wikipedia. For simplicity, we train the bootstrapped MD model similarly to BERT's token classification fine-tuning (using the BIO tagging format; Ramshaw and Marcus, 1995). When entity mentions could overlap, we keep only the longest entity mention. In addition, we found that using only 200k sentences per language as training data is enough to formulate the bootstrapped MD model (see Appendix A.3 for further details). Finally, we use the new MD model to annotate unlabelled mentions in Wikipedia. The main advantage of the new MD model over the XTREME model is its ability to detect both named and common noun entities. This is because the new MD model learned named entities from the XTREME model's annotations and common noun entities from existing markups in Wikipedia, which is useful since most entities in current MEL datasets are common noun entities. In contrast, the XTREME model is trained only on NER datasets, and using it directly on MEL datasets would harm performance.
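The longest-mention rule for overlapping candidate spans described above can be sketched as follows (a minimal illustration; function and variable names are ours, not from the mReFinED codebase):

```python
def resolve_overlaps(spans):
    """Keep only the longest mention among overlapping candidate spans.

    spans: list of (start, end) token offsets, end exclusive.
    Returns a sorted, non-overlapping subset, preferring longer spans.
    """
    # Consider longer spans first; ties are broken by start offset.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end in ordered:
        # Keep a span only if it does not overlap any already-kept span.
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)
```

For example, given candidates (0, 5), (2, 8), and (10, 12), the overlapping pair is resolved in favour of the longer span (2, 8).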

Mention Detection
We use mBERT as our mention encoder to encode the words w_i, with the output from the last layer serving as the contextualized word representations h_i. We add a linear layer to our mention encoder and train token classification (BIO tagging format) from h_i using a cross-entropy loss L_MD against the gold labels. We then obtain a mention representation m_i for each mention by average-pooling the h_i of its entity mention tokens. All words W are encoded in a single forward pass, resulting in fast inference.
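The average-pooling step can be sketched as follows (a framework-free illustration; in practice the h_i are mBERT output tensors and the pooling is a batched tensor operation):

```python
def mention_representation(hidden_states, start, end):
    """Average-pool contextualized token vectors h_i over a mention span.

    hidden_states: list of per-token vectors from the mention encoder.
    start/end: token offsets of the mention, end exclusive.
    """
    span = hidden_states[start:end]
    dim = len(span[0])
    # Mean over the mention's tokens, dimension by dimension.
    return [sum(vec[d] for vec in span) / len(span) for d in range(dim)]
```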

Entity Typing Score
Previous work in English EL (Raiman and Raiman, 2018; Onoe and Durrett, 2020) has shown that using an entity typing model to link entities in a KB can improve the accuracy of the EL task. Thus, we train a fine-grained entity typing model to predict the entity types t of each mention m_i, where t is a set of entity types t ∈ T from the KB. We add a linear layer f_θ1 with a sigmoid activation σ to map m_i to a fixed-size vector. We then calculate the entity typing score ET(e_j, m_i) using the Euclidean distance between σ(f_θ1(m_i)) and the multi-label entity type vector T′:

ET(e_j, m_i) = ‖σ(f_θ1(m_i)) − T′‖_2    (1)

We formulate T′ by assigning a value of one to the correct entity types in T and a value of zero to the rest (one entity can have multiple types). We then minimize the distance between the gold label (T′) and ET(·) using a cross-entropy loss L_ET, following the distantly-supervised type labels from Onoe and Durrett (2020).
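A minimal sketch of the entity typing score, assuming ET(e_j, m_i) is the plain Euclidean distance between the sigmoid-activated type predictions and the multi-hot vector T′ (names and the unweighted distance are our illustrative assumptions):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def entity_typing_score(type_logits, entity_type_vector):
    """Euclidean distance between predicted type probabilities
    sigma(f_theta1(m_i)) and a candidate entity's multi-hot type
    vector T' (1 = entity has the type). A smaller distance
    indicates a better type match."""
    probs = [sigmoid(z) for z in type_logits]
    return math.sqrt(sum((p - t) ** 2
                         for p, t in zip(probs, entity_type_vector)))
```

For a mention whose type logits strongly favour the candidate's actual types, the distance approaches zero; mismatched types push it toward sqrt of the number of disagreeing dimensions.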

Entity Description Score
In this module, we compute the cross-lingual similarity score between the mention m_i and the entity description d_j in the KB. We use English as the primary language of d_j, since English is the dominant language for mBERT (the language with the highest amount of data) and the model tends to perform substantially better on it compared to other languages (Arivazhagan et al., 2019; Limkonchotiwat et al., 2022). When English is unavailable, we randomly select another language. We use another mBERT to encode entity descriptions d_j and train it jointly with our mention encoder. We derive the contextualized representation d_j from the [CLS] token in the final layer embedding. We incorporate linear layers f_θ2 and f_θ3 into the mention and description encoders, respectively, with L2-normalization at the output of the linear layers.
Prior works used cosine similarity to derive the description score for each entity (Botha et al., 2020; FitzGerald et al., 2021; Ayoola et al., 2022). In contrast, we employ NT-Xent (the temperature-scaled cross-entropy loss) as our training objective, since it demonstrated better robustness in ranking results compared to cosine similarity (Chen et al., 2020):

L_desc = −log( exp(sim(m_i, d_j)/τ) / Σ_{d_k ∈ D} exp(sim(m_i, d_k)/τ) )    (2)

where D denotes the set of descriptions produced by candidate generation, τ denotes the temperature parameter, and sim(·) denotes the cosine similarity between two feature vectors.
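The NT-Xent objective in Eq. 2 can be illustrated with a small, framework-free sketch (in practice this is computed over batched similarity matrices; the function name is ours):

```python
import math

def nt_xent_loss(similarities, positive_idx, temperature=0.1):
    """NT-Xent over cosine similarities between a mention m_i and all
    candidate descriptions in D; similarities[positive_idx] is the gold
    description d_j. Equivalent to cross-entropy over temperature-scaled
    logits (Eq. 2)."""
    logits = [s / temperature for s in similarities]
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[positive_idx]
```

Lowering the temperature sharpens the softmax, so a correctly ranked gold description yields a much smaller loss than with unscaled cosine similarities.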

Entity Disambiguation Score
Prior studies demonstrated that using entity descriptions benefits entity disambiguation (Logeswaran et al., 2019; Wu et al., 2020). Therefore, we concatenate three outputs: (i) the entity typing score ET; (ii) the cross-lingual similarity score between m_i and d_j with additional temperature scaling τ (similarity score calibration; Guo et al., 2017); and (iii) the entity prior score P̂(e|m) derived from Wikipedia hyperlink count statistics and Wikidata aliases (Hoffart et al., 2011). These three outputs are passed through a linear layer f_θ4 with an output dimension of one, as shown below:

EL = f_θ4(ET(e_j, m_i); sim(m_i, d_j)/τ; P̂(e_j|m_i))    (3)

The output of EL is a score for each e_j corresponding to m_i. We train the EL score by minimizing the difference between EL and the gold label using a cross-entropy loss L_EL.
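A sketch of the scoring in Eq. 3, with the linear layer f_θ4 reduced to explicit weights and a bias for illustration (the real layer's parameters are learned end-to-end; names are ours):

```python
def entity_linking_score(et_score, desc_sim, prior,
                         weights, bias, temperature=0.1):
    """Linear layer f_theta4 over the three concatenated signals in Eq. 3:
    entity typing score ET(e_j, m_i), temperature-scaled description
    similarity sim(m_i, d_j)/tau, and the entity prior P(e_j|m_i).
    weights/bias stand in for f_theta4's learned parameters."""
    features = [et_score, desc_sim / temperature, prior]
    return sum(w * f for w, f in zip(weights, features)) + bias
```

Each candidate e_j for a mention m_i gets one scalar score; the candidate with the highest score is the predicted entity.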

Multi-task training
We train mReFinED in a multi-task manner by combining all losses in a single forward pass. During training, we use the provided entity mentions (hyperlink markups), and we simultaneously train MD (on hyperlink markups and the new annotations from Section 2.1) along with the other tasks.
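Conceptually, the combined objective is a (possibly weighted) sum of the four task losses computed in one forward pass; the sketch below assumes equal weighting, which the paper does not specify:

```python
def multitask_loss(l_md, l_et, l_desc, l_el, weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine the four task losses (mention detection, entity typing,
    description scoring, entity disambiguation) from a single forward
    pass. Equal weighting is our assumption, not stated in the paper."""
    return sum(w * l for w, l in zip(weights, (l_md, l_et, l_desc, l_el)))
```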

Experiment Setting
Setup. We used Wikipedia data and articles from 11 languages with a timestamp of 20221203 as our training data. To generate candidates, we used the top-30 candidates from Ayoola et al. (2022) and concatenated them with the top-30 candidates from Cao et al. (2022). We then select only the top-30 candidates with the highest entity prior scores for both the training and inference steps. For the full parameter, language, and candidate generation settings, please refer to Appendix A.1 and A.2.

Metric. We evaluate mReFinED on the end-to-end EL task on both MEL datasets (Mewsli-9 and TR2016-hard) using the same metric as previous MEL works, which is based on the recall score (Botha et al., 2020; FitzGerald et al., 2021; Cao et al., 2022).
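The candidate-merging step in the setup above can be sketched as follows (the function name and exact tie-breaking are our assumptions):

```python
def merge_candidates(refined_cands, mgenre_cands, prior_scores, k=30):
    """Union the top-k candidates from the two generators (ReFinED's and
    mGENRE's) and keep the k entities with the highest entity prior
    score P(e|m).
    prior_scores: dict mapping entity id -> prior probability."""
    pool = set(refined_cands) | set(mgenre_cands)
    ranked = sorted(pool, key=lambda e: prior_scores.get(e, 0.0),
                    reverse=True)
    return ranked[:k]
```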

End-to-End MEL Results
Table 1 presents the performance of mReFinED and mGENRE on both MEL datasets. The performance of mReFinED is compared with that of a two-stage model, which combines our bootstrapping MD with mGENRE.

Table 1: Recall on the Mewsli-9 and TR2016-hard datasets. We report both datasets' results for the entity disambiguation and entity linking tasks. We use the bootstrapping MD model as the mention detection step for mGENRE.

Our experimental results on Mewsli-9 demonstrate that mReFinED outperforms the two-stage model on the micro- and macro-averages by 1.8 and 1.9 points, respectively. The experimental results on TR2016-hard show the same trend as on Mewsli-9. These results highlight the value of an end-to-end system in a single model, which outperforms the two-stage model. For precision and F1 scores, see Table 8 in the Appendix. The ablations in Table 1 show that, when the bootstrapping MD framework is removed and only the Wikipedia markups are used, mReFinED produces zero scores for almost all languages in both datasets. This is because the number of entity mentions in the training data decreases from 880 million to only 180 million. These results emphasize the importance of our bootstrapping MD framework, which can effectively mitigate the unlabelled entity mention problem in Wikipedia.
On Mewsli-9, entity priors and descriptions are slightly complementary and contribute +0.5 macro- and micro-average recall when combined. Entity types are less useful and contribute +0.1 macro- and micro-average recall when added. Combining all three achieves the best macro- and micro-average recall: 58.8 and 65.5, respectively.
For Arabic (ar), removing either entity types or descriptions makes no difference and achieves the same recall (61.8) as using all information. This suggests that entity types and descriptions are redundant when either one of them is combined with entity priors. Entity priors hurt performance in Farsi/Persian (fa), as removing them gives +0.3 recall. However, there are only 535 Farsi mentions in the Mewsli-9 dataset, which is too small a sample size to draw reliable conclusions. For German (de) and Turkish (tr), removing descriptions seems to be beneficial and yields recall gains of +0.6 and +0.3, respectively. This could be a resource-specific issue (there could be fewer or lower-quality Wikidata descriptions in these two languages) or a language-related issue (both languages are morphologically rich), but we leave further investigation for future work.
On TR2016-hard, entity descriptions contribute the most, +17.1 macro-average recall when added. Entity types contribute a small amount, +0.7 macro-average recall. Entity priors turn out to be harmful when added, except for Spanish (es); macro-average recall is +0.4 when entity priors are removed. This can be explained by how mentions in the TR2016-hard dataset are selected: they are chosen so that the correct entity does not appear as the top-ranked candidate in an alias table lookup. This means entity priors are not very useful for finding the correct entity for these mentions, and the model needs to use other information, such as entity descriptions and types, to choose the correct entity. We believe this introduces a discrepancy with the training scenario, where entity priors are a very useful signal for finding the correct entity given a mention surface form. On the other hand, since the gap between using and not using entity priors is small, it also demonstrates mReFinED's ability to use the appropriate information when entity priors alone are not enough to make correct predictions.

Multilingual Mention Detection Results
This experiment compares the performance of our mention detection models with prior multilingual MD works, namely spaCy (Honnibal et al., 2020), XTREME (Hu et al., 2020), and WikiNEuRal (Tedeschi et al., 2021), on both MEL datasets. We use the exact match score for evaluation, following previous MD works (Tjong Kim Sang and De Meulder, 2003; Tsai et al., 2006; Diab et al., 2013). As shown in Table 9, our bootstrapping MD outperforms competitive methods in all languages; e.g., it outperformed XTREME by 9.9 points and 7.3 points on the Mewsli-9 and TR2016-hard datasets, respectively. In addition, mReFinED showed superior performance to the bootstrapping MD model by an average of 2.8 points on Mewsli-9. These results highlight the benefit of jointly training MD with other tasks in a single model, which outperforms a single-task model. We also ran an experiment on the XTREME NER dataset to better understand our bootstrapping MD's performance on the multilingual mention detection task. We expect our bootstrapping MD to achieve comparable results to competitive multilingual NER models in the literature when trained on NER data; please refer to Appendix A.5 for more details.
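A minimal sketch of exact-match evaluation over predicted and gold mention spans (computed here as recall over gold spans, matching the recall-based evaluation in this paper; prior MD works often report F1, and the helper name is ours):

```python
def exact_match_recall(pred_spans, gold_spans):
    """Exact-match evaluation for mention detection: a predicted mention
    counts only if both of its boundaries match a gold mention exactly."""
    pred, gold = set(pred_spans), set(gold_spans)
    if not gold:
        # No gold mentions: perfect only if nothing was predicted.
        return 1.0 if not pred else 0.0
    return len(pred & gold) / len(gold)
```

A partially overlapping prediction, e.g. (3, 4) against gold (3, 5), earns no credit under this metric.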

Analysis
Incorrect labels in MEL datasets. It is noteworthy that both the Mewsli-9 and TR2016 datasets contain incorrect labels. In particular, we identified entities that were erroneously linked to a "Disambiguation page" instead of their actual pages; e.g., the mention "imagine" in Mewsli-9 was linked to Q225777, a "Wikimedia disambiguation page". Therefore, we removed those incorrect labels from both datasets and re-evaluated mReFinED and mGENRE on the cleaned dataset in Table 2. mReFinED's performance on the cleaned Mewsli-9 dataset increases from 65.5 to 67.4 micro-average, and mGENRE's performance increases from 63.6 to 65.7. Lastly, the number of entity mentions in Mewsli-9 decreased from 289,087 to 279,428. See Appendix A.4 for the TR2016 results.

Unlabelled entity mentions in MEL datasets. It is important to note that the unlabelled entity mention problem also occurs in both MEL datasets. As mentioned in Section 2.3, most entities in MEL datasets are common nouns because these datasets use Wikipedia markups as entity mention ground truths. Thus, the MEL datasets also suffer from the unlabelled entity mention problem. For example, consider document en-106602 in Mewsli-9 (Figure 3): it was annotated with only eight entity mentions, but mReFinED found an additional 11 entity mentions in the document, including location (e.g., "Mexico"), person (e.g., "Richard A. Feely"), organization (e.g., "NOAA"), and common noun (e.g., "marine algae") mentions. These results demonstrate that mReFinED can also mitigate the unlabelled entity mention problem in MEL datasets.
This presents an opportunity for us to re-annotate the MEL datasets in the future using mReFinED as an annotator tool to detect unlabelled mentions.
Run-time Efficiency. This study measures the time per query of mReFinED compared to mGENRE on one 16 GB V100. Our findings indicate that mGENRE takes 1,280 ms ± 36.1 ms to finish a single query. In contrast, mReFinED requires only 29 ms ± 1.3 ms, making it 44 times faster than mGENRE, because mReFinED encodes all mentions in a single forward pass.

Conclusion
In this paper, we propose mReFinED, the first multilingual end-to-end EL model. We extend the monolingual ReFinED to the multilingual setting and add a new bootstrapping MD framework to mitigate the unlabelled mention problem. mReFinED outperformed the SOTA MEL model on the end-to-end EL task, with 44 times faster inference.

Limitations
We did not compare mReFinED with other MEL works (Botha et al., 2020; FitzGerald et al., 2021) since they did not release their code. However, the experimental results of other MEL works on the ED task demonstrated lower performance than mGENRE. Our reported results are based on standard MEL datasets, such as Mewsli-9 and TR2016-hard, which may not reflect mReFinED's performance in real-world applications.

A.1 Setup
We trained our model on 11 languages: ar, de, en, es, fa, fr, it, ja, ta, and tr. During training, we segment each training document into chunks of 300 tokens each, and limit each chunk to at most 30 mentions. We use two layers of mBERT as the description encoder. We train mReFinED for eight days using the AdamW optimizer, with a learning rate of 5e-4 and a batch size of 64, for two epochs on 8 A100 GPUs (40 GB). We set the hyper-parameters as shown in Table 3. In addition, we evaluate the recall score on the development set every 2,000 steps to save the best model.

A.2 Candidate Generation (CG) Results
In this experiment, we demonstrate the recall scores of various candidate generation methods on both MEL datasets and why we need to combine two CGs. We adapt ReFinED's CG from monolingual to multilingual PEM tables using multilingual Wikipedia data and articles. As shown in Table 4, mGENRE's CG outperformed ReFinED's CG by 3.4 points in the average case. This is because mGENRE's CG was built from Wikipedia in 2019, while ReFinED's CG was built in 2022, and there are many rare candidates that are not found in ReFinED's CG but appear in mGENRE's CG. On the other hand, Table 5 demonstrates that ReFinED's CG outperformed mGENRE's CG on the TR2016-hard dataset. Thus, combining the two CGs outperforms using only one CG on both MEL datasets.

A.3 Bootstrapping MD Results
In this study, we demonstrate the effect of training size on the bootstrapping MD framework. We set the training data sizes as follows: 10k, 20k, 100k, 200k, 500k, and 1M. As shown in Figure 2, training data size affects the performance of the bootstrapping MD framework. However, we found that increasing the training data beyond 200k samples does not significantly improve MD performance.

A.4 Incorrect Labels TR2016 Results
This study demonstrates the performance of mGENRE and mReFinED on the cleaned TR2016-hard dataset. As shown in Table 6, we observe a 1.5-point improvement over the original result (28.4). Lastly, the number of entity mentions decreased from 16,357 to 15,380.

A.5 XTREME MD Results
To evaluate the performance of mention detection in mReFinED, we ran an experiment on the XTREME NER dataset by converting the labels to a mention detection task with BIO tags. We chose a subset of 8 languages from the XTREME NER dataset, as our bootstrapping MD is trained on 9 languages.

Table 9: Exact match score of mention detection on the Mewsli-9 and TR2016-hard datasets. We omit unsupported languages for each model with "-".
Figure 3: Document en-106602 from Mewsli-9.

NOAA says Earth's oceans becoming more acidic

According to a study performed by the National Oceanic and Atmospheric Administration's (NOAA) Pacific Marine Environmental Laboratory, the level of acid in the world's oceans is rising, decades before scientists expected the levels to rise.
The study was performed on the coastal waters of the Pacific Ocean from Baja California, Mexico to Vancouver, British Columbia, where tests showed that acid levels in some areas near the edge of the Continental Shelf were high enough to corrode the shells of some sea creatures as well as some corals. Some areas showed excessive levels of acid less than four miles off the northern California coastline in the United States.
"What we found ... was truly astonishing. This means ocean acidification may be seriously impacting marine life on the continental shelf right now. The models suggested they wouldn't be corrosive at the surface until sometime during the second half of this century," said Richard A. Feely, an oceanographer from the NOAA.
The natural processes of the seas and oceans constantly clean the Earth's air, absorbing 1/3 to 1/2 of the carbon dioxide generated by humans. As the oceans absorb more of the gas, the water becomes more acidic, reducing the amount of carbonate which shellfish such as clams and oysters use to form their shells, and increasing the levels of carbonic acid. Although levels are high, they are not yet high enough to threaten humans directly.
"Scientists have also seen a reduced ability of marine algae and free-floating plants and animals to produce protective carbonate shells," added Feely.
Feely noted that, according to the study, the oceans and seas have absorbed more than 525 billion tons of carbon dioxide since the Industrial Revolution began.

Task Definition: End-to-End MEL

Given a sequence of words in a document W = {w_1, w_2, ..., w_|W|}, where the document can be written in multiple languages, we identify entity mentions within the document M = {m_1, m_2, ..., m_|M|} and map each one to the corresponding entity E = {e_1, e_2, ..., e_|E|} in a knowledge base (KB).

Figure 2: Effect of training size in the bootstrapping MD framework.

Table 7: F1 score of mention detection on the XTREME NER dataset.