Evaluating Entity Disambiguation and the Role of Popularity in Retrieval-Based NLP

Retrieval is a core component for open-domain NLP tasks. In open-domain tasks, multiple entities can share a name, making disambiguation an inherent yet under-explored problem. We propose an evaluation benchmark for assessing the entity disambiguation capabilities of these retrievers, which we call Ambiguous Entity Retrieval (AmbER) sets. We define an AmbER set as a collection of entities that share a name along with queries about those entities. By covering the set of entities for polysemous names, AmbER sets act as a challenging test of entity disambiguation. We create AmbER sets for three popular open-domain tasks: fact checking, slot filling, and question answering, and evaluate a diverse set of retrievers. We find that the retrievers exhibit popularity bias, significantly under-performing on rarer entities that share a name, e.g., they are twice as likely to retrieve erroneous documents on queries for the less popular entity under the same name. These experiments on AmbER sets show their utility as an evaluation tool and highlight the weaknesses of popular retrieval systems.


Introduction
Substantial progress in NLP has been made on "closed" tasks, where queries are paired with relevant documents (Rajpurkar et al., 2016; Dua et al., 2019). However, there is growing interest in "open-domain" tasks, where relevant documents need to be retrieved from a knowledge source before an NLP system can perform reasoning and produce an answer (Chen et al., 2017; Petroni et al., 2021). The open-domain setting better reflects real-world usage for tasks where relevant information is generally not provided (e.g., fact checking). Because success hinges on finding relevant documents, open-domain progress has been closely tied to improvements in retrieval systems (Lewis et al., 2020b).

Figure 1: Queries for two entities (president & musician) with the name "Abe Lincoln". Retrieving the gold document involves disambiguating which "Abe Lincoln" each query is asking about. BLINK performs sub-optimally on the second query, as it ranks the document of the president over the gold document.

* Work started during an internship at Apple.
1 The AmbER sets used in this paper and the code to generate them are available at https://github.com/anthonywchen/AmbER-Sets.

A crucial challenge when interacting with a large knowledge source (e.g., Wikipedia) is entity ambiguity, the phenomenon where a single name can map to multiple entities. Resolving this ambiguity is referred to as entity disambiguation and is an important step for effective retrieval. For example, given the query "What musical instrument does Abe Lincoln play?", documents about the musician should rank higher than documents about other entities with the same name (Figure 1). Although entity disambiguation has been extensively studied in entity linking (Hoffart et al., 2011; Rao et al., 2013; Sevgili et al., 2020) and search (Balog et al., 2010, 2011), in the context of open-domain NLP it is unclear how well retrieval systems perform when faced with queries containing ambiguous entities. Evaluating entity ambiguity is challenging because the popularity of entities follows a long tail (Figure 2) and rare entities are seldom covered in naturally occurring datasets.
In this paper we introduce AmbER sets, a benchmark for evaluating the entity disambiguation capabilities of retrievers across multiple NLP tasks. Each AmbER set is a collection of Wikidata entities that share a name, and their corresponding queries for specific NLP tasks. For each set, we define the head entity as the most popular entity and tail entities as the less popular ones. By creating queries for multiple entities that share a name, AmbER sets provide an accurate test of entity disambiguation capabilities of retrievers and help assess the role of entity popularity in disambiguation. We show examples of AmbER sets for the question answering task in Table 1. We automatically create AmbER sets by mining the Wikidata knowledge graph (Vrandecic and Krötzsch, 2014) for relevant names and entities, and leveraging task-specific templates to generate inputs for three tasks: fact checking, slot filling, and question answering (Figure 3). In total, our AmbER sets contain 80k task-specific queries which we align to the Wikipedia snapshot from KILT (Petroni et al., 2021).
We use AmbER sets to conduct a systematic study of various retrieval systems that operate under different principles, such as token overlap and dense embedding similarity. Retrievers perform very differently on AmbER sets in terms of absolute retrieval numbers, with Bootleg (Orr et al., 2020), an entity-linking-based retriever, performing best. Despite these differences, all retrievers exhibit a large degree of popularity bias, underperforming on inputs concerning tail entities. TF-IDF, a token-based retriever, performs about four times worse on tail entity inputs compared to head entity inputs. Even with Bootleg, the best performing retriever, performance on tail entities is still 1.5 times lower than on head entities. Our results on AmbER sets demonstrate that there is significant work to be done on making retrievers robust in handling entity disambiguation.

AmbER Sets
Retrieving relevant documents from large knowledge sources such as Wikipedia is an important first step in the open-domain pipeline. An inherent problem in working with such sources is entity disambiguation: resolving a name (mention) to an entity in the knowledge source. Entity disambiguation can be challenging because many entities share a name, and the popularity of entities follows a long-tail distribution (Figure 2). Despite the importance of entity disambiguation, it remains an understudied problem for open-domain NLP. We introduce AmbER sets for evaluating entity disambiguation capabilities of retrievers and analyze the role of entity popularity in disambiguation.

Figure 2: The Long Tail of Entity Popularity: Graph of the Wikipedia pageviews (in October 2019) for each Wikidata entity, ranked by popularity. Gray are 100k randomly sampled entities, while red/blue are entities with the name "Abe Lincoln".

What is an AmbER Set?
We first provide an intuition for an AmbER set before concretely defining one. Consider two entities, a president and a musician, both of which have the name "Abe Lincoln" (Figure 1). Now, consider the query "Which battle did Abe Lincoln fight in?" and assume a retriever correctly returns the article about the president for this query. Simply because the correct document was retrieved does not mean a retriever has the ability to disambiguate between the president and the musician, as the president is much more popular. We should only be confident in its ability to disambiguate entities if we also pose a query about the less popular musician and the retriever again returns the correct document (as opposed to the document about the president).
Based on this intuition, we define an AmbER set as a collection of queries that satisfy the following:
• Criteria 1: Polysemous Name: The queries in an AmbER set are all about entities that share a common name (e.g., Abe Lincoln).
• Criteria 2: Disparity in Popularity: An AmbER set contains queries about both the most popular entity for a name (the head entity), e.g., the president, and the less popular entities (the tail entities), e.g., the musician.
• Criteria 3: Resolvable Ambiguity: The content of the query should be sufficient to resolve to the correct entity. The query "Which battle did Abe Lincoln fight in?" satisfies this criterion, because there is only one Abe Lincoln that fought in a war, while "Where was Abe Lincoln born?" does not, since it applies to all Abe Lincolns.
We provide examples of AmbER sets for the task of question answering in Table 1.

Open-Domain Tasks
In this work, we create AmbER sets for three tasks: fact checking, slot filling, and question answering (Table 2). We consider these three tasks for three reasons. First, the tasks are diverse in nature. In this work, slot filling is a generation task, question answering is a span selection task, and fact checking is a classification task. Second, the training sets available for each task are quite disparate. The largest fact checking training set, FEVER (Thorne et al., 2018), has 80k instances, while the slot filling dataset, T-REx (Elsahar et al., 2018), has over 2 million instances. The final reason we study these three tasks is that their inputs are short and easy to create.

Creating AmbER Sets
While AmbER sets can be manually created, doing so can be time-consuming, requiring a human to manually scour a knowledge base for polysemous  names and related entities before manually writing queries for those entities. Instead, we present a pipeline for automatically creating AmbER sets using the Wikidata knowledge graph (Vrandecic and Krötzsch, 2014). In this section, we describe two different collections of AmbER sets, and discuss our automatic pipeline for creating AmbER sets.

Two Collections of AmbER Sets
A natural question is "How do retrievers handle entity ambiguity when two entities have the same entity type as opposed to when they have different types?". To answer this question, we create two collections of AmbER sets. The first is AmbER-H, a collection of AmbER sets where all entities are humans. The choice to restrict AmbER-H to humans is motivated by the fact that humans have properties that help distinguish them from other humans, generally based on occupation. The second is AmbER-N, a collection of AmbER sets where all entities are non-humans, and disambiguation of a name is between non-human entities with different entity types. This is because a non-human entity, like a movie, does not generally have a single property that distinguishes it from other movies. This makes it natural to compare non-human entities to other non-human entities with different types. We specify the entity types in each collection in Table 3.

Figure 3: Automated creation of AmbER sets for three tasks. We collect sets of entities from Wikipedia that share a name, where the most popular entity is the head entity (in red) and others are tail entities (in blue), along with their properties and associated values. We filter out properties that do not help distinguish entities in the set (grayed out), and remove entities that do not have any properties remaining. From the remaining properties, we instantiate queries via templates for three tasks: question answering (QA), slot filling (SF), and fact checking (FC).

Automatic Creation of AmbER Sets
We now describe a pipeline to automatically create AmbER sets for three tasks: fact checking, slot filling, and question answering. We provide a visualization of the pipeline in Figure 3.

Collecting Names and Entities
We begin by collecting all entity aliases in Wikidata. From these aliases, we filter for those that are shared by multiple Wikidata entities. Each entity in Wikidata is represented by a unique QID. The entities must have an entity type from Table 3, depending on the collection being built. Each alias and its associated entities form the basis for an AmbER set. Within each set, we define the head and tail entities based on the number of Wikipedia page views for the month of October 2019. We filter out AmbER sets where the percentage gap in popularity between the head entity and the most popular tail entity is less than 10% to account for noise in the monthly page views.
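The grouping and filtering steps described above can be sketched as follows. This is a minimal illustration, not the released pipeline: the record keys (`qid`, `aliases`, `pageviews`), the function name, and the QIDs in the usage note are hypothetical, entity-type filtering is omitted, and the sketch assumes the percentage gap is measured relative to the page views of the most popular tail entity.

```python
from collections import defaultdict


def build_candidate_sets(entities, min_gap=0.10):
    """Group entities by shared alias and split each group into head and tails.

    `entities` is a list of dicts with hypothetical keys:
    "qid" (str), "aliases" (list of str), "pageviews" (int, monthly views).
    """
    by_alias = defaultdict(list)
    for entity in entities:
        for alias in set(entity["aliases"]):
            by_alias[alias].append(entity)

    candidate_sets = {}
    for alias, group in by_alias.items():
        if len(group) < 2:
            continue  # the name is not shared, so there is nothing to disambiguate
        ranked = sorted(group, key=lambda e: e["pageviews"], reverse=True)
        head, top_tail = ranked[0], ranked[1]
        # Assumption: the gap is relative to the most popular tail entity.
        if top_tail["pageviews"] > 0:
            gap = (head["pageviews"] - top_tail["pageviews"]) / top_tail["pageviews"]
        else:
            gap = float("inf") if head["pageviews"] > 0 else 0.0
        if gap < min_gap:
            continue  # popularity is too close to call given monthly noise
        candidate_sets[alias] = {
            "head": head["qid"],
            "tails": [e["qid"] for e in ranked[1:]],
        }
    return candidate_sets
```

Calling `build_candidate_sets` on records for a president and a musician that both list the alias "Abe Lincoln" would yield a single candidate set keyed by that alias, with the more-viewed entity as the head.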

Collecting Distinguishing Properties
We gather properties and associated values for each entity from Wikidata. We only retain properties that are in a specified list (Table 3), as they are useful for resolving ambiguity (Criteria 3). We also filter out a property if more than one entity within an AmbER set has that property, ensuring that the remaining properties can be used to disambiguate between entities with the same name. These properties are used to instantiate the queries.
For slot filling, we create a single input from each Wikidata tuple by concatenating the AmbER set name with the property name, and using the value of the tuple as the answer. For question answering, we also create a single input for each tuple by filling in the template with the AmbER set name and using the value of the tuple as the answer. For fact checking, we create two inputs for each tuple, one claim that is true using the tuple value and one claim that is false. The false claim is created by finding the most popular value for the tuple property that does not match the tuple value.
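The property filtering step can be sketched as below. The function name and the input layout (a mapping from QID to a property dictionary) are assumptions made for illustration; the allowed-property list stands in for the per-type list in Table 3.

```python
from collections import Counter


def keep_distinguishing_properties(entity_props, allowed_properties):
    """Keep only allowed properties that a single entity in the set has.

    `entity_props` maps a QID to {property_name: value} for one AmbER set
    (a hypothetical layout chosen for this sketch).
    """
    # Count how many entities in the set carry each allowed property.
    counts = Counter(
        prop
        for props in entity_props.values()
        for prop in props
        if prop in allowed_properties
    )
    filtered = {}
    for qid, props in entity_props.items():
        kept = {
            prop: value
            for prop, value in props.items()
            if prop in allowed_properties and counts[prop] == 1
        }
        if kept:  # entities with no remaining properties are dropped
            filtered[qid] = kept
    return filtered
```

A property such as place of birth, which every human entity in a set is likely to have, is removed by the `counts[prop] == 1` check, which mirrors the "Where was Abe Lincoln born?" example of an unresolvable query.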

Dataset Statistics
We provide statistics for AmbER sets in Table 4. On average, each AmbER set has about three entities that share the same name. Of these three entities, on average, only two have properties after filtering. In total, our AmbER sets contain about 80k task-specific input queries.

Limitations
Since our pipeline is automated and relies on Wikipedia and Wikidata, there are a few limitations worth noting. First, AmbER sets are affected by incompleteness of the knowledge source, sometimes resulting in ambiguous queries if a property is missing from Wikidata but answerable from Wikipedia text. For this reason, we only select a few properties for each type (Table 3). Second, even though we author multiple templates for each property, the reliance on these templates limits the syntactic diversity in the queries (not a critical concern, since we are only evaluating existing models). Third, we use Wikipedia page views as a proxy for real-world popularity of entities. Defining popularity in this way may be problematic, as page views for an entity can fluctuate, and may make our pipeline difficult to generalize to other knowledge sources, where this information may not be available.
Several design choices in creating AmbER sets are worth further investigation. We limit AmbER sets to a pre-specified list of entity types and properties to ensure that entities in an AmbER set are distinguishable. This precludes other properties that may be useful in distinguishing entities, reducing the diversity in AmbER sets. Another design choice is that we allow any alias in Wikidata to form an AmbER set; however, not all aliases are canonical ways to refer to the entity. For instance, Shaquille O'Neal has the unusual alias "The Big Cactus", potentially leading to a somewhat unrealistic query "What sport did The Big Cactus play?". We plan to revisit these design choices in future work.

Evaluation Setup
Retrieval Systems
The primary focus of this work is to evaluate how retrieval systems handle entity ambiguity. We consider four retrievers based on different retrieval paradigms. The first three are TF-IDF, a token-based retriever using sparse embeddings; DPR, a dense-embedding-based retriever; and BLINK (Wu et al., 2020), a linker-based retriever which ranks documents based on input entities. These three retrievers have been thoroughly evaluated on a number of open-domain tasks in Petroni et al. (2021) with no obvious winner across tasks. Encouraged by the disambiguation success on rare entities by Orr et al. (2020), we also evaluate a retriever based on Bootleg, another entity linker. We provide additional details about these retrievers in Appendix D.

Table 6: Entity confusion measures the % of queries in which the gold document ranks worse (lower) than a document for another entity with the same name (i.e., another entity in the AmbER set). Retrievers are four times as likely to exhibit this when dealing with tail queries.

Downstream Models
The dominant approach to open-domain tasks is a two-stage process where a retriever first finds relevant documents, followed by a downstream model that processes these documents to produce an answer. We evaluate the end-to-end performance on AmbER sets by training downstream NLP models on our tasks of interest. For fact checking, we fine-tune a BERT classifier on FEVER (Thorne et al., 2018). For question answering, we fine-tune a RoBERTa model (Liu et al., 2019) on Natural Questions (Kwiatkowski et al., 2019). For slot filling, a generation task, we fine-tune a BART model (Lewis et al., 2020a) on T-REx (Elsahar et al., 2018). We provide example training instances in Table 2 and additional details on the models in Appendix E. We use the AllenNLP and HuggingFace Transformers libraries to fine-tune our downstream models (Gardner et al., 2018; Wolf et al., 2020).

Results
In this section, we evaluate existing open-domain NLP pipelines using AmbER sets. We also conduct a user study to evaluate the quality of the queries in the AmbER sets.

Figure 4: Popularity Gap vs Retrieval Gap. We bin QA queries of pairs of head and tail entities based on the popularity gap between the entities. For each bin, we calculate the retrieval accuracy@1 difference on the head and tail queries. Larger popularity gaps tend to lead to wider gaps in retrieval performance. The red line is retrievers' performance gaps between head and tail queries on the entire collection.

Top Document Retrieval
We report retrieval performance in Table 5 in terms of retriever accuracy@1 (the % of instances where the first retrieved document is the gold document). For each task, we report values on the entire AmbER set ("All"), as well as instances corresponding only to "Head" entities or to "Tail" entities. We also report a metric we call all correct (∀), the fraction of AmbER sets in which all queries had the correct document retrieved. First, all retrievers do better on head entities compared to tail entities. Since BLINK, Bootleg, and DPR are initialized using pre-trained language models, they may be predisposed to favor more popular entities. However, we find TF-IDF also does better on head entities, perhaps because more popular entities have longer Wikipedia pages, possibly increasing term-frequency scores. Second, there are large discrepancies between a retriever's performance on different tasks for an AmbER collection. For instance, DPR does substantially worse on slot filling compared to its performance on question answering. This is surprising since queries for all tasks are created from the same set of Wikidata tuples. Finally, we find that retrievers rarely retrieve the correct document for every query in a set, with some receiving a ∀ score of 0 on some tasks. Overall, we find that the Bootleg retriever on average does the best across tasks; however, there is significant scope for improvement.

Table 7: End-to-end performance on AmbER sets. We evaluate systems in an oracle setting, where the gold document is provided, and a retrieval setting, where 20 documents are provided from a retriever.
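The two retrieval metrics used in this section, accuracy@1 and all correct (∀), can be sketched as follows. The record layout (`set`, `gold`, `ranked`) and function names are hypothetical, and both functions return percentages here purely for uniformity. The `k` parameter is an addition for this sketch so that the same function also covers top-20 retrieval accuracy.

```python
def accuracy_at_k(results, k=1):
    """Percentage of queries whose gold document is among the top-k retrieved.

    Each result is a dict with hypothetical keys "gold" (document id) and
    "ranked" (list of retrieved document ids, best first).
    """
    if not results:
        return 0.0
    hits = sum(r["gold"] in r["ranked"][:k] for r in results)
    return 100.0 * hits / len(results)


def all_correct(results, k=1):
    """Percentage of AmbER sets in which every query retrieves its gold document.

    Each result additionally carries a "set" key naming its AmbER set.
    """
    per_set = {}
    for r in results:
        ok = r["gold"] in r["ranked"][:k]
        per_set[r["set"]] = per_set.get(r["set"], True) and ok
    if not per_set:
        return 0.0
    return 100.0 * sum(per_set.values()) / len(per_set)
```

Head and tail numbers follow from calling `accuracy_at_k` on the subset of results whose query targets a head or a tail entity.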

Entity Confusion
To explicitly evaluate whether retrievers get confused by entities in the same AmbER set, we compute entity confusion for retrievers, defined as the percentage of queries where the retriever ranks a document for an incorrect entity from the same AmbER set over the gold document (Table 6). We find that across retrievers, tasks, and AmbER collections, entity confusion is twice as high for tail entity inputs. This result indicates that the popularity of an entity for a given name plays a significant role in retrieval performance.
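Entity confusion as defined above can be computed along these lines. This is a sketch under stated assumptions: the record keys are hypothetical, and a query whose gold document is not retrieved at all is counted as confused whenever any document for another entity in the same AmbER set appears in the ranked list.

```python
def entity_confusion(results):
    """Percentage of queries where a same-name entity's document outranks gold.

    Each result is a dict with hypothetical keys:
    "gold" (document id), "ranked" (document ids, best first), and
    "same_name_docs" (ids of documents for other entities in the AmbER set).
    """
    if not results:
        return 0.0
    confused = 0
    for r in results:
        ranked = r["ranked"]
        # Position of the gold document; treat a missing gold as ranked last.
        gold_rank = ranked.index(r["gold"]) if r["gold"] in ranked else len(ranked)
        if any(doc in r["same_name_docs"] for doc in ranked[:gold_rank]):
            confused += 1
    return 100.0 * confused / len(results)
```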

Effect of Popularity Gap
Since the difference in popularity between the head and tail entities can vary considerably, these results obscure the effect of the size of the popularity gap. We explore how the gap in popularity between head and tail entities translates to the gaps in performance on their associated queries. For a head entity with popularity $p_h$ and a tail entity with popularity $p_t$ from the same AmbER set, we calculate the popularity gap, $\frac{p_h - p_t}{p_t}$, and bin associated head/tail inputs based on the gap. For each bin, we calculate the difference in accuracy@1 between the head and tail entity queries. Results for QA AmbER sets (Figure 4) show that there is a strong correlation between the popularity gap and the difference in performance.
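The binning analysis can be sketched as follows. The bin edges are supplied by the caller because the text does not specify them; the record keys and function names are likewise hypothetical, and the sketch assumes every tail entity has non-zero page views.

```python
from bisect import bisect_right
from collections import defaultdict


def popularity_gap(p_head, p_tail):
    """Relative popularity gap (p_h - p_t) / p_t; assumes p_tail > 0."""
    return (p_head - p_tail) / p_tail


def retrieval_gap_by_bin(pairs, bin_edges):
    """Head-minus-tail accuracy@1 difference (in points) per popularity-gap bin.

    Each pair is a dict with hypothetical keys "p_head", "p_tail",
    "head_hits", and "tail_hits" (lists of 0/1 accuracy@1 outcomes for the
    head and tail queries). `bin_edges` is a sorted list of gap thresholds.
    """
    bins = defaultdict(lambda: {"head": [], "tail": []})
    for pair in pairs:
        idx = bisect_right(bin_edges, popularity_gap(pair["p_head"], pair["p_tail"]))
        bins[idx]["head"].extend(pair["head_hits"])
        bins[idx]["tail"].extend(pair["tail_hits"])

    gaps = {}
    for idx, hits in sorted(bins.items()):
        if hits["head"] and hits["tail"]:
            head_acc = 100.0 * sum(hits["head"]) / len(hits["head"])
            tail_acc = 100.0 * sum(hits["tail"]) / len(hits["tail"])
            gaps[idx] = head_acc - tail_acc
    return gaps
```

Bin index 0 collects pairs whose gap falls below the first edge, and the last index collects pairs at or above the final edge.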

End to End Results
We evaluate end to end performance in several evaluation settings with all results provided in Table 7. The metrics used are F1 for slot filling and question answering and accuracy for fact checking. In the "oracle" setting, we directly provide the downstream NLP model the gold document, and find that the gap between head entities and tail entities is fairly small. This suggests that in closed NLP settings, where the gold document is known, entity disambiguation is not a major concern.
In the regular retrieval setting, we provide the model the top 20 documents as ranked by a retrieval system (BLINK and DPR), and find that retrievers still perform better on head entity queries (see Appendix A). The downstream systems that use retrieved documents display a noticeable gap in end-to-end performance between head and tail entity inputs. This is expected, as retrieval systems perform worse on tail entities.

User Study
AmbER sets are created in a largely automatic process, raising questions about data quality. To address these questions, we conduct a small user study on AmbER sets to evaluate whether the queries are resolvable by humans. We present a query from a QA AmbER set along with three documents for the entities from the same AmbER set, one of which is the gold document. We first ask the user to select the relevant document, then we ask the user to select an answer span from the selected document. In total, we asked 7 subjects to examine about 120 queries across AmbER-H and AmbER-N, and computed their accuracy in selecting the correct document and answer (Table 8). We also compare retrievers on this task, i.e., selecting from 3 documents for the same queries, and find that humans perform very well on the document selection task compared to retrievers on both sets. We also compare the accuracy of answer selection, and see that the closed-domain NLP model (fine-tuned BERT) is almost as accurate as humans on the same set of queries (see footnote 7). This further confirms that closed NLP models are not the source of bias towards head entities, but the retrievers are.

Table 8: User study on AmbER QA. Humans are nearly perfect in identifying the correct document for each query (Doc Acc), while existing retrievers frequently fail. When the gold document is provided to downstream NLP models (BERT), they do almost as well as humans in answering the question (EM).

Related Work
Entity Ambiguity
As previously mentioned, entity ambiguity is when a single name can match multiple entities in a knowledge source. Entity ambiguity has been most studied in the context of entity linking (Rao et al., 2013). To improve disambiguation, entity linkers have included auxiliary information such as entity types (Onoe and Durrett, 2020) and entity descriptions (Logeswaran et al., 2019). A recent thread of work aims to study how language models recall and leverage information about names and entities. Prabhakaran et al. (2019) show that names can have a measurable effect on the prediction of sentiment analysis systems. Shwartz et al. (2020) demonstrate that pre-trained language models implicitly resolve entity ambiguity by grounding names to entities based on the pretraining corpus. The problem of entity ambiguity also appears implicitly in entity-centric tasks such as determining the semantic relatedness between entities (Hoffart et al., 2012) and entity-oriented search (Balog et al., 2010, 2011). We draw inspiration from these works by studying entity ambiguity in the context of open-domain NLP.

Popularity Bias
Systems that perform worse on the long tail suffer from what is known as popularity bias. This problem has been studied extensively in the recommendation systems literature, where recommendation systems are known to often ignore the long tail of products and instead recommend very popular items (Abdollahpouri et al., 2017; Chen et al., 2020). This has the effect of unfairly hurting users who would prefer these less popular items (Abdollahpouri et al., 2019; Ciampaglia et al., 2018). We explore popularity bias from the angle of retrieval as opposed to recommendation, and find popularity bias exists in retrieval systems.

7 The relatively low answer score is due to artifacts in using EM for QA evaluation, and is consistent with human performance on span selection (Rajpurkar et al., 2016).

Open-Domain Ambiguity
Ambiguity is an inherent problem when it comes to open-domain reasoning. Prior work showed that half of the instances sampled from Natural Questions are ambiguous, with multiple correct answers. AmbER sets are similar in that the ambiguity is in terms of the entity in the query; however, in contrast to Natural Questions, AmbER set inputs have been constructed such that the ambiguity is resolvable.

Challenge Sets
There have been many evaluation sets specifically designed to assess a model's ability to handle a specific phenomenon (Naik et al., 2018; Zhao et al., 2018; McCoy et al., 2019; Richardson et al., 2020; Jeretic et al., 2020; Ribeiro et al., 2019). Some of these challenge sets, similar to AmbER sets, use templates to generate a large amount of evaluation data quickly (Richardson et al., 2020; McCoy et al., 2019; Ribeiro et al., 2020). AmbER sets can be viewed as a challenge set for assessing open-domain systems' ability to handle entity ambiguity.

Conclusion
Entity ambiguity is an inherent problem in retrieval, as many entities can share a name. For evaluating disambiguation capabilities of retrievers, we introduce AmbER sets; an AmbER set is a collection of task-specific queries about entities that share a name, but the queries have sufficient content to resolve the correct entity. We create a broad range of AmbER sets, covering many entity types, with input queries for three open-domain NLP tasks: fact checking, slot filling, and question answering. Our experiments demonstrate the struggles of current retrievers in handling entity ambiguity. In particular, we find that the popularity of an entity in relation to other entities that share a name plays a significant role during disambiguation. For instance, we find that all tested retrievers are about twice as likely to retrieve erroneous documents when dealing with less popular entities than the most popular entity with the same name. Future goals include improving entity disambiguation capabilities of retrievers, perhaps more directly incorporating ideas from entity linking and coreference resolution. The AmbER sets and the code for the generation pipeline are available at https://github.com/anthonywchen/AmbER-Sets.

We provide results for top-20 retrieval in Table 9. Top-20 retrieval is used for providing documents in the end-to-end evaluation setting. In this setting, retrieval accuracy measures whether a gold document appears in one of the top-20 retrieved documents. Similar to top-1 retrieval, retrievers continue to perform better on head queries.

Table 10 contains the templates used to instantiate the task-specific inputs. Templates were written on a per-property basis. We note that many of the properties share templates that are very similar.

C Computational Resources
All experiments (e.g., training baselines, generating AmbER sets, etc.) were conducted on a machine with 500 GB of RAM, 64 CPUs, and using an NVIDIA TitanRTX with 24 GB of RAM. Retrieval on a collection of AmbER sets takes about 12 hours for the most time-consuming retriever, BLINK. Training a downstream model takes roughly 5 hours and inference on a collection of AmbER sets takes less than 30 minutes.

D Retriever Details
For BLINK, DPR, and TF-IDF, we use the retriever code in the KILT repository released by Facebook (footnote 8). For Bootleg, we use the code provided by the Hazy Research group (footnote 9).

E Downstream Model Details
For question answering, we train a RoBERTa-Large model on Natural Questions. We use the negative documents in Natural Questions to train a "no-answer" classifier using the [CLS] token. During inference, we take the highest-scoring span where the answer is not classified as "no-answer". For slot filling, we train a BART-base model. For each slot filling instance, we train with the top non-gold document retrieved by TF-IDF as a negative document. For this negative document, we train the model to generate a "none" token, and during inference, we take the highest-scoring answer that is not "none". For fact checking, we train a three-way (i.e., SUPPORTS, REFUTES, NEUTRAL) BERT-base classifier. Similar to slot filling, we train with the top non-gold document retrieved by TF-IDF as a negative document and train the model to classify this negative document as NEUTRAL. During inference, we take the highest-scoring prediction that is not NEUTRAL. When training baseline models, we do not tune hyperparameters and train with a batch size of 32 for 3 epochs.

8 https://github.com/facebookresearch/KILT
9 https://github.com/HazyResearch/bootleg

Table 10: Templates used to instantiate the task-specific inputs, written on a per-property basis.

Fact checking: $object performs in $name. $object is the performer of $name. $name was performed by $object.

record label
Question answering: What is the record label of $name? What is the record label for $name? $name belongs to which record label?
Fact checking: $object is the record label for $name. $name's record label is $object.

tracklist
Question answering: What song appears in the album $name? What song appears on $name? What are the tracks in $name?
Fact checking: $name belongs to $object tracklist. $object is on the release of $name. $object is a song in the $name tracklist.

industry
Question answering: Which industry is $name in? In what industry is $name? What is $name's industry?
Fact checking: $name is in the industry of $object. The company $name is in the $object industry. $name's industry is $object.

population
Question answering: What is the total population of $name? What is the population of $name? How many people live in $name?
Fact checking: The population of $name is $object. $name's population is $object. $name has a population of $object.
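
The $name and $object placeholders in these templates happen to match the syntax of Python's `string.Template`, so instantiating the three task inputs from one Wikidata tuple can be sketched as below. This is an illustration rather than the released generation code: the `[SEP]` separator for slot filling, the function names, and the reading of "most popular value" as the most frequent other value of the property are all assumptions.

```python
from collections import Counter
from string import Template

# One QA and one fact checking template for the "population" property,
# taken from the template list above.
TEMPLATES = {
    "population": {
        "qa": Template("What is the population of $name?"),
        "fc": Template("The population of $name is $object."),
    },
}


def false_value(all_values, true_value):
    """Most frequent value of the property that differs from the true value."""
    counts = Counter(v for v in all_values if v != true_value)
    return counts.most_common(1)[0][0] if counts else None


def instantiate(name, prop, value, all_values, sep=" [SEP] "):
    """Build QA, slot filling, and fact checking inputs for one tuple.

    `all_values` lists the values this property takes across entities and
    is only used to pick the value for the false claim.
    """
    templates = TEMPLATES[prop]
    queries = {
        "qa": {"input": templates["qa"].substitute(name=name), "answer": value},
        # Slot filling concatenates the name with the property name.
        "sf": {"input": f"{name}{sep}{prop}", "answer": value},
        "fc": [
            {
                "input": templates["fc"].substitute(name=name, object=value),
                "label": "SUPPORTS",
            }
        ],
    }
    wrong = false_value(all_values, value)
    if wrong is not None:
        queries["fc"].append(
            {
                "input": templates["fc"].substitute(name=name, object=wrong),
                "label": "REFUTES",
            }
        )
    return queries
```

In practice one template per property would be sampled from the several listed for it; the sketch fixes a single template to keep the example short.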