An Empirical Study on Multiple Information Sources for Zero-Shot Fine-Grained Entity Typing

Auxiliary information from multiple sources has been demonstrated to be effective in zero-shot fine-grained entity typing (ZFET). However, there is still no comprehensive understanding of how to make better use of the existing information sources or how they affect the performance of ZFET. In this paper, we empirically study three kinds of auxiliary information: context consistency, type hierarchy, and background knowledge (e.g., prototypes and descriptions) of types, and propose a multi-source fusion model (MSF) targeting these sources. MSF achieves absolute gains of up to 11.42% and 22.84% in macro F1 over state-of-the-art baselines on BBN and Wiki, respectively. More importantly, we further discuss the characteristics, merits and demerits of each information source and provide an intuitive understanding of the complementarity among them.


Introduction
Fine-grained entity typing (FET) aims to detect the types of an entity mention given its context (Abhishek et al., 2017; Xu and Barbosa, 2018; Jin et al., 2019). The results of FET benefit many downstream tasks (Chen et al., 2020; Hu et al., 2019; Zhang et al., 2020a; Liu et al., 2021; Chu et al., 2020). In many scenarios, the type hierarchy is continuously evolving, which requires newly emerging types to be incorporated into FET systems. As a result, zero-shot FET (ZFET) is needed to handle new types that are unseen during the training stage (Ma et al., 2016; Ren et al., 2020; Zhang et al., 2020b).
The major challenge of ZFET is to build semantic connections between the seen types (during training) and the unseen ones (during inference). Auxiliary information has been proven essential in this regard (Xian et al., 2019), with a variety of approaches each focusing on scattered information sources (Ma et al., 2016; Zhou et al., 2018; Obeidat et al., 2019; Ren et al., 2020; Zhang et al., 2020b). However, the power of auxiliary information has not been sufficiently exploited in existing solutions. Moreover, the effect of each individual information source remains to be clearly understood.
In this paper, we propose a Multi-Source Fusion model (MSF) integrating three kinds of popular auxiliary information for ZFET, i.e., context consistency, type hierarchy, and background knowledge, as illustrated in Figure 1. (i) Context consistency means a correct type should be semantically consistent with the context if we replace the mention with the type name in the context. A type name is the surface form of a type, which is a word or a phrase; e.g., the type name of /organization/corporation is corporation. (ii) Type hierarchy is the ontology structure connecting seen and unseen types. (iii) Background knowledge provides external prior information that depicts types in detail, e.g., prototypes (Ma et al., 2016) and descriptions (Obeidat et al., 2019).
MSF is composed of three modules, each targeting a specific information source. (i) In the CA (Context-Consistency Aware) module, we measure context consistency with large-scale pre-trained language models, e.g., BERT (Devlin et al., 2019). By masking mentions and predicting the names of ground-truth types while fine-tuning on the data of seen types, CA is expected to measure the context consistency of unseen types more precisely. (ii) In the HA (Type-Hierarchy Aware) module, we use a Transformer encoder (Vaswani et al., 2017) to model the hierarchical dependency among types. There has been substantial work exploring type hierarchy in the supervised typing task (Shimaoka et al., 2017; Xu and Barbosa, 2018; Xiong et al., 2019), but only preliminary research in ZFET (Ma et al., 2016; Zhang et al., 2020b). (iii) In the KA (Background-Knowledge Aware) module, we introduce prototypes (Ma et al., 2016) and WordNet descriptions (Miller, 1995) as background knowledge of types. KA is embodied as natural language inference with a translation-based solution to better incorporate knowledge.
Extensive experiments are carried out to verify the effectiveness of the proposed fusion model. We also conduct a deep analysis of the characteristics, merits and demerits of each information source. We find that, similar to type hierarchy, background knowledge also implies some hierarchical information through the shared prototypes and the descriptions that are semantically similar to those of their parent types. Besides, context consistency is an essential clue in handling long-tail unseen types and longer contexts. Moreover, we further discuss the complementarity among different information sources and their contributions to the proposed fusion model.
In summary, our contributions are as follows:
• We propose a multi-source fusion model integrating multiple information sources for ZFET, which achieves new state-of-the-art results on BBN and Wiki.
• We are the first to conduct a comprehensive study on the strengths and weaknesses of three auxiliary information sources for ZFET. Besides, we make a deep analysis of how different information sources complement each other and how they contribute to the proposed fusion model.

Overview
Zero-shot Fine-grained Entity Typing (ZFET) is defined on a type set T = T_train ∪ T_test, which forms a hierarchy. During inference, ZFET aims to identify the correct types for a mention m based on its context c, where the target types are unseen during the training stage, i.e., T_train ∩ T_test = ∅. As shown in Figure 1, we propose a Multi-Source Fusion model (MSF) that captures information from the three auxiliary sources above and integrates them to make better predictions under the zero-shot scenario. In the following, we first describe the details of each module (Sec 2.2, 2.3 and 2.4), and then present the joint loss function and inference details (Sec 2.5).

Context-Consistency-Aware (CA) Module
We base the CA module upon the pre-trained BERT (Devlin et al., 2019) and fine-tune it for assessment of context consistency.

Fine-tuning by Masking Mentions
Vanilla BERT randomly masks some input tokens and then predicts them. In the fine-tuning stage for ZFET, however, CA only masks the entity mentions and predicts their type names instead. For instance, given the context in Figure 1, we replace the entity mention Northwest with a [MASK] token and let the CA module predict the name corporation of the target type /organization/corporation with a higher score. In more general cases, the length of a type name may exceed one token, so the number of [MASK] tokens used for replacement depends on the length of the type name (e.g., for the type name living thing, we replace the corresponding mention with [MASK] [MASK]).
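To make the masking procedure concrete, the following is a minimal sketch (not the authors' released code) of how such masked inputs could be built with the HuggingFace tokenizer for BERT-base uncased; the helper name build_masked_example and the word-level treatment of multi-word type names are our own assumptions.

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def build_masked_example(context: str, mention: str, type_name: str):
    """Replace the mention span with one [MASK] per type-name word, so BERT can be
    fine-tuned to predict the type name at the masked positions."""
    words = type_name.lower().split()                   # e.g. "living thing" -> 2 masks
    label_ids = tokenizer.convert_tokens_to_ids(words)  # type-name words are assumed to be in BERT's vocab
    masked = context.replace(mention, " ".join([tokenizer.mask_token] * len(words)), 1)
    enc = tokenizer(masked, return_tensors="pt")
    return enc, label_ids

enc, label_ids = build_masked_example(
    "Northwest and Midway are two of the five airlines with which Budget has agreements.",
    "Northwest",
    "corporation",
)
```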

Loss Function for CA Module
For each mention m in the training set, we denote its ground-truth types as T_pos. For each type t in T_pos, we replace m with l [MASK] tokens in the context of m, where l is the length of t's type name. We define the score s_t and loss ℓ_t for type t as

  s_t = (1/l) Σ_{k=1}^{l} p_{n_t,k},    ℓ_t = -(1/l) Σ_{k=1}^{l} log p_{n_t,k},    (1)

where p_{n_t,k} is the probability of the k-th token of the type name n_t predicted by BERT. Considering all the types in T_pos, the overall loss for mention m is:

  L_CA(m) = (1/|T_pos|) Σ_{t∈T_pos} ℓ_t.

Note that, since the vocabulary of BERT contains all the constituent tokens of all the type names in T_train and p_{n_t,k} is the output of the softmax function over the vocabulary, minimizing the loss above also punishes the scores of negative types in T_train.
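As an illustration of how the score and loss could be computed from BERT's masked-LM output, here is a short sketch consistent with the equations above (the averaging over name tokens follows our reconstruction and is an assumption, not the authors' exact implementation):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

def ca_score_and_loss(enc, label_ids):
    """s_t: average probability of the type-name tokens at the [MASK] positions;
    loss_t: their average negative log-probability (cf. Eq. (1))."""
    probs = torch.softmax(model(**enc).logits, dim=-1)                        # (1, seq_len, vocab)
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    p = probs[0, mask_pos, torch.tensor(label_ids)]                           # p_{n_t,k}, k = 1..l
    return p.mean(), -p.clamp_min(1e-12).log().mean()
```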

Type Hierarchy-Aware (HA) Module
In the HA module, we use a Transformer encoder (Vaswani et al., 2017) with masked self-attention to capture the hierarchical information for better type representations. Besides, we take the encoder from Lin and Ji (2019) to learn the features of mentions and contexts. A similarity function is then defined to compute the matching score between a mention and a candidate type based on the context.

Mention-Context Encoder
In the mention-context encoder, an entity mention and its context are represented as the weighted sum of their ELMo word representations. Then the mention representation r_m and the context representation r_c are concatenated as the final representation r_mc = r_m ⊕ r_c, where r_m, r_c ∈ R^{d_m}, r_mc ∈ R^{2d_m}, and ⊕ denotes concatenation.
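A minimal sketch of this pooling-and-concatenation step is shown below; the attention-based weighted sum is a simplified stand-in for the encoder of Lin and Ji (2019), and the class and parameter names are our own.

```python
import torch
import torch.nn as nn

class MentionContextEncoder(nn.Module):
    """Pool token representations of the mention and of the context, then concatenate."""
    def __init__(self, d_m: int):
        super().__init__()
        self.att = nn.Linear(d_m, 1)   # simple attention scorer for the weighted sum

    def pool(self, token_reprs: torch.Tensor) -> torch.Tensor:
        # token_reprs: (num_tokens, d_m); the weights sum to 1 over the tokens.
        weights = torch.softmax(self.att(token_reprs).squeeze(-1), dim=0)
        return weights @ token_reprs   # (d_m,)

    def forward(self, mention_tokens: torch.Tensor, context_tokens: torch.Tensor) -> torch.Tensor:
        r_m = self.pool(mention_tokens)
        r_c = self.pool(context_tokens)
        return torch.cat([r_m, r_c], dim=-1)   # r_mc in R^{2 d_m}
```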

Hierarchy-Aware Type Encoder
Given a type set T = T_train ∪ T_test and its hierarchy structure Ψ, we denote the initialized type embeddings¹ as E = [e_{t_1}, e_{t_2}, ..., e_{t_N}], which are the inputs of the Transformer encoder, where e_{t_i} is the embedding of the i-th type t_i and N is the size of T. Note that the positional embeddings are removed since the input type sequence is unordered. To inject the hierarchical information, we perform masked self-attention over the types. Specifically, when computing self-attention in the Transformer encoder, a type attends only to its parent type in the hierarchy and to itself, while the attention to the remaining types is masked. We omit other details and denote the final representation of each type t ∈ T as r_t ∈ R^{d_t}.

¹ The initialization details are presented in Appendix A.
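Below is a small sketch of how such a hierarchy-restricted attention mask could be built and fed to a standard Transformer encoder. The toy hierarchy, the helper name, and the use of the dimensions from Appendix A (GloVe-200 type embeddings, 4 heads, 2 layers, feed-forward size 2048) are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

def build_hierarchy_mask(parent_idx: list[int]) -> torch.Tensor:
    """Boolean attention mask: True = blocked. Type i may attend only to itself
    and to its parent (parent_idx[i] == i for root types)."""
    n = len(parent_idx)
    blocked = torch.ones(n, n, dtype=torch.bool)
    for i, p in enumerate(parent_idx):
        blocked[i, i] = False
        blocked[i, p] = False
    return blocked

d_t = 200
layer = nn.TransformerEncoderLayer(d_model=d_t, nhead=4, dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

type_emb = torch.randn(1, 5, d_t)              # E: initialized type embeddings, no positional encoding
mask = build_hierarchy_mask([0, 0, 0, 1, 1])   # toy hierarchy: types 3 and 4 are children of type 1
r_t = encoder(type_emb, mask=mask)             # hierarchy-aware type representations
```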

Loss Function for HA Module
Given a mention m and a candidate type t ∈ T_train, we first map the mention representation r_mc and the type representation r_t into a shared space:

  u_m = A r_mc,    u_t = B r_t,

where A ∈ R^{d_s×2d_m} and B ∈ R^{d_s×d_t} are learnable matrices. The matching score is defined as the inner product y_t = u_m · u_t. During training, we match mention m against all the types in T_train, so the loss function for m is the binary cross-entropy

  L_HA(m) = -(1/|T_train|) Σ_{t∈T_train} [ŷ_t log σ(y_t) + (1 - ŷ_t) log(1 - σ(y_t))],

where ŷ ∈ R^{|T_train|} denotes the binary vector over the ground-truth types of m, with 1 for positive and 0 for negative, |T_train| denotes the size of T_train, and y ∈ R^{|T_train|} denotes the predicted score vector. Although the HA module does not directly learn from any instances of T_test, by encoding the type hierarchy Ψ with masked self-attention, the Transformer encoder captures the semantic correlation between types in T_train and T_test, thus producing reliable representations for the types in T_test.
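A sketch of one plausible instantiation of this projection, scoring and loss follows; the dot-product score and the binary cross-entropy form mirror the reconstruction above and are assumptions, as are the class and function names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HAScorer(nn.Module):
    """Project the mention-context and type representations into a shared space
    and score each candidate type with a dot product."""
    def __init__(self, d_m: int, d_t: int, d_s: int):
        super().__init__()
        self.A = nn.Linear(2 * d_m, d_s, bias=False)   # maps r_mc into the shared space
        self.B = nn.Linear(d_t, d_s, bias=False)       # maps r_t into the shared space

    def forward(self, r_mc: torch.Tensor, type_reprs: torch.Tensor) -> torch.Tensor:
        # r_mc: (2*d_m,), type_reprs: (num_types, d_t) -> scores y: (num_types,)
        return self.B(type_reprs) @ self.A(r_mc)

def ha_loss(y: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy against the 0/1 indicator vector of ground-truth types."""
    return F.binary_cross_entropy_with_logits(y, y_true.float())
```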

Background Knowledge-Aware (KA) Module
We introduce prototypes and descriptions as two kinds of knowledge in the KA module.
Prototypes refer to carefully selected mentions of a type, chosen based on Normalized Point-wise Mutual Information (NPMI, recalled below), which provide a mention-level summary of the type (Ma et al., 2016).
Descriptions are queried from WordNet glosses (Miller, 1995) by type names, which provide a brief high-level summary for each type.
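For reference, the standard NPMI formulation used to rank a candidate prototype mention w for a type t is given below; the probabilities are estimated from co-occurrence counts in the annotated training corpus, and the formula itself is not restated in the paper.

```latex
\mathrm{NPMI}(w, t) = \frac{\mathrm{PMI}(w, t)}{-\log p(w, t)}, \qquad
\mathrm{PMI}(w, t) = \log \frac{p(w, t)}{p(w)\, p(t)} .
```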

Inference from Background Knowledge
We hope to infer whether a mention m matches a candidate type t, given the prototypes, the type description and the context. In this work, we embody the KA module as natural language inference (NLI) from multiple premises (Lai et al., 2017). An example is presented in Figure 2, with the same input as in Figure 1.

Figure 2: An example illustrating the multiple premises and the hypothesis for KA.
Multiple premises:
• Context-based premise: Northwest and Midway are two of the five airlines with which Budget has agreements.
• Prototypes-based premise: /organization/corporation has the following prototypes: western_union, …
• Description-based premise: /organization/corporation denotes a collection of business firms whose articles of incorporation have been approved in some state.
Hypothesis:
• /organization/corporation is a correct type for the mention Northwest.
We construct three premises corresponding to the context, prototypes and description respectively. The target hypothesis encodes that "the type is correct for the mention". Both the premises and the hypothesis are organized into natural language sentences. We reuse the Mention-Context Encoder in Sec 2.3.1 to obtain the representation of the context-based premise, i.e., r_mc = r_m ⊕ r_c, where r_m and r_c represent the mention and the context respectively. To encode the prototypes-based and description-based premises, we use the same encoder, where the type is aligned with the mention while the rest of the sentence is aligned with the context of the mention. We denote the premises based on prototypes and description as r_tp = r_t ⊕ r_p and r_td = r_t ⊕ r_d, where r_t, r_p, r_d ∈ R^{d_m} are the representations of the type, the prototypes-based sentence and the description-based sentence respectively. Since the hypotheses for the same mention targeting different types have the same word sequences except for the type spans, we simplify the representation of the hypothesis as r_h = r_t ⊕ r_m ∈ R^{2d_m}, where r_t and r_m are the type and mention representations directly taken from r_tp and r_mc respectively. In the KA module, the encoders for all the premises and the hypothesis share the same ELMo parameters.
Loss Function for KA Module

Motivated by TransE (Bordes et al., 2013) and TransR (Lin et al., 2015), we propose a simple translation-based solution for NLI by extending the translation operation over triples to quadruples, i.e., (context-based premise, prototypes-based premise, description-based premise, hypothesis).
Given a mention m and a candidate type t, we first use a matrix W to project all the representations into a new space for inference:

  r̃_mc = W r_mc,  r̃_tp = W r_tp,  r̃_td = W r_td,  r̃_h = W r_h,

where W ∈ R^{d_w×2d_m}. We expect r̃_mc + r̃_tp + r̃_td ≈ r̃_h when the hypothesis can be inferred from the premises, i.e., when the type t is correct for the mention m under the context c. Thus, we minimize their squared Euclidean distance

  D_t = ||r̃_mc + r̃_tp + r̃_td - r̃_h||_2^2

under the norm constraints ||r̃_mc||_2^2 = ||r̃_tp||_2^2 = ||r̃_td||_2^2 = ||r̃_h||_2^2 = 1. The score for type t is then defined as p_t = -D_t: the closer the distance, the higher the score. Finally, the loss function for mention m is defined over T_pos and T_neg, encouraging small distances D_t for the ground-truth types and large distances for the negative types, where T_pos are the ground-truth types of m in T_train with size |T_pos|, and T_neg are the negative types in T_train, i.e., T_neg = T_train \ T_pos.
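A minimal sketch of this translation-based scoring is given below; the explicit L2 normalization used to satisfy the norm constraints and the class name KAScorer are our own choices, not necessarily the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KAScorer(nn.Module):
    """Project premise/hypothesis representations with a shared matrix W and score a
    candidate type by the negated squared distance of the translation r_mc + r_tp + r_td ≈ r_h."""
    def __init__(self, d_m: int, d_w: int):
        super().__init__()
        self.W = nn.Linear(2 * d_m, d_w, bias=False)

    def project(self, x: torch.Tensor) -> torch.Tensor:
        # One way to enforce the unit-norm constraint on the projected representation.
        return F.normalize(self.W(x), p=2, dim=-1)

    def forward(self, r_mc, r_tp, r_td, r_h) -> torch.Tensor:
        d = self.project(r_mc) + self.project(r_tp) + self.project(r_td) - self.project(r_h)
        D_t = (d ** 2).sum(-1)      # squared Euclidean distance
        return -D_t                 # score p_t = -D_t
```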

Training and Inference
Overall Loss Given a training mention m, we derive the losses L_CA(m), L_HA(m) and L_KA(m) from the aforementioned modules. Finally, the overall loss to train the fusion model is:

  L = Σ_{m∈M} (L_CA(m) + L_HA(m) + L_KA(m)),    (9)

where M denotes the training mention set.
Inference Given a test mention m and a candidate type t in T_test, we first compute the scores from each module: s_t (by the CA module), y_t (by the HA module) and p_t (by the KA module). Then we normalize each of them according to

  x̃ = (x - μ_x) / σ_x,

where x is a component of the score vector x produced by a module for mention m over all types t ∈ T_test, and μ_x and σ_x denote the mean and standard deviation of the vector x. The final decision score of our fusion model for type t is:

  score_t = λ_1 s̃_t + λ_2 ỹ_t + λ_3 p̃_t,

where λ_1, λ_2, λ_3 ≥ 0 are hyper-parameters and λ_1 + λ_2 + λ_3 = 1.
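A small sketch of this inference-time fusion follows; the λ values shown are placeholders rather than the tuned values from Appendix A, and the argmax decision rule is only for illustration.

```python
import torch

def zscore(x: torch.Tensor) -> torch.Tensor:
    """Standardize a module's score vector over all candidate types in T_test."""
    return (x - x.mean()) / (x.std() + 1e-8)

def fuse(s, y, p, lambdas=(0.4, 0.3, 0.3)):
    """Weighted combination of the normalized CA (s), HA (y) and KA (p) scores."""
    l1, l2, l3 = lambdas
    return l1 * zscore(s) + l2 * zscore(y) + l3 * zscore(p)

# Example: pick the highest-scoring candidate among 10 unseen types.
scores = fuse(torch.randn(10), torch.randn(10), torch.randn(10))
pred = scores.argmax().item()
```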

Datasets and Evaluation Metrics
We evaluate our model on two widely-used datasets: BBN (Weischedel and Brunstein, 2005) and Wiki (Ling and Weld, 2012). Following prior works (Ling and Weld, 2012; Ma et al., 2016), we report all the popular metrics in our main results for a better comparison, i.e., strict accuracy (Acc), macro-averaged F1 (Ma-F1), micro-averaged F1 (Mi-F1) and micro-averaged precision (Mi-P).

Comparison Models
We abbreviate our Multi-Source Fusion model as MSF, and compare it with the following baselines: (1) Proto-HLE (Ma et al., 2016), which introduces prototype-driven hierarchical label embedding for ZFET; (2) ZOE (Zhou et al., 2018), which infers the types of a given mention according to its type-compatible Wikipedia entries; (3) DZET (Obeidat et al., 2019), which derives type representations from Wikipedia pages and leverages a context-description matching approach for type inference; (4) NZFET* (Ren et al., 2020), which employs entity type attention to make the model focus on information relevant to the entity type; (5) MZET* (Zhang et al., 2020b), which adopts a memory network to connect the seen and unseen types.
All the results are obtained from our re-implementations except those indicated by *. The implementation details and hyper-parameter settings (e.g., λ_1, λ_2, λ_3 for MSF) are presented in Appendix A. Table 2 and Table 3 present the results on BBN and Wiki, evaluated on both the unseen fine-grained types and the seen coarse-grained types.

Main Results
Zero-shot Performance From Table 2, we see that our model significantly outperforms the baselines across the metrics. MSF gains up to 11.42% over DZET on BBN and 22.84% over ZOE on Wiki in terms of Ma-F1. Compared with MSF_avg, which treats each information source as equally important, MSF considers the importance of each source and achieves better performance on both datasets. Besides, the single-source modules of MSF (i.e., CA, HA and KA) also produce relatively promising results, among which KA yields the best scores. Nevertheless, MSF still surpasses these modules by a large margin, which verifies the necessity of information fusion for the ZFET task. Table 3 demonstrates the advantage of MSF in predicting the seen types, with Ma-F1 increased by 2.01% over Proto-HLE on BBN and 2.25% over DZET on Wiki. Besides, CA, HA and KA still maintain highly competitive performance in this regard. Combined with Table 2, we find that the proposed MSF has a particular superiority on the unseen types, since auxiliary information from multiple sources tends to be more helpful when annotated training samples are scarce.

Ablation Studies
We conduct ablation studies on the single-source modules of MSF. The results are shown in Table 4.
Ablations of CA We observe that the vanilla CA (i.e., the BERT-based CA module without fine-tuning, denoted as "CA w/o finetuning") already reaches a certain level of performance. This indicates the potential of BERT for context-consistency assessment, thanks to its large-scale unsupervised pre-training. After fine-tuning with our modified masking mechanism, CA surpasses its vanilla version by 23.13% and 10.28% on BBN and Wiki respectively.

Ablations of HA We replace the Transformer encoder with averaged GloVe word embeddings to obtain type representations and denote this variant as "HA-Glove". Besides, we also implement a variation of HA that removes the Transformer encoder and simply multiplies the type embeddings by a binary hierarchical matrix as in Ma et al. (2016) to model the type hierarchy (denoted as "HA-HierMatrix"). We see that HA greatly advances its counterparts that do not use the Transformer encoder. Also note that HA-HierMatrix performs better than HA-Glove, indicating that the hierarchical constraint enforced by HierMatrix is also important for type representation learning. In addition, HA shows a strong advantage over Proto-HLE and MZET*, which also take the relationships among types into account.
Ablations of KA We remove either descriptions or prototypes from KA and denote the variants as "KA w/o Description" and "KA w/o Prototypes". The results reveal that both descriptions and prototypes consistently contribute to KA, wherein prototypes seem to play a more important role on both datasets. In fact, the prototypes used in KA are carefully selected by Ma et al. (2016), while the descriptions from WordNet only contain brief high-level summaries of types. Additionally, two baselines (i.e., Proto-HLE and DZET) which also leverage background knowledge are included for a more comprehensive comparison. We notice that KA w/o Prototypes is slightly inferior to DZET, which also uses type descriptions via a type-description matching approach. However, when prototypes and descriptions are combined, the superiority of KA with the NLI framework is obvious.

Characteristics, Merits and Demerits of Each Information Source
In this section, we focus on the impact of long-tail types and context length on ZFET. Based on these observations, we discuss the characteristics, merits and demerits of the modules targeting each information source (i.e., CA, HA and KA).

Impact of Long-tail Types
We examine the performance of each module on the test subset of long-tail unseen types (those with fewer than 200 test cases). We compute the precision, recall and F1 value for each type and report the average values over all these types in Table 5. The results show that CA obtains the best F1_avg score on the long-tail types. In fact, CA is based on the pre-trained BERT, which contains much implicit information about the unseen long-tail types. Moreover, CA masks the mentions and depends entirely on the contexts for prediction, which reduces the risk that CA simply memorizes the mentions and improves its generalization capability. KA produces a better P_avg score than HA, which verifies that background knowledge is helpful in distinguishing among easily confused types. However, KA often makes mistakes on the unseen types that share little knowledge with the seen types, which makes KA perform poorly in R_avg.
We also notice that combining different information sources brings a significant improvement to MSF in terms of P_avg, but on the contrary a drop in terms of R_avg. This inspires us, in future work, to take more advantage of CA while minimizing the disturbance from KA and HA, so as to promote the model's generalization capacity on long-tail types.

Impact of Context Length
We separate the test samples into three groups by context length and compare the Ma-F1 scores in each group, as shown in Figure 3. We see that CA, HA, KA and MSF all perform better on mentions with longer contexts, since longer contexts tend to be more informative than shorter ones. MSF outperforms the single-source modules CA, HA and KA in both the short- and medium-context settings. Nevertheless, the performance of MSF is poorer than CA in the long-context scenario. This indicates that the information from context consistency carries higher confidence when handling longer contexts, whereas introducing the HA and KA modules may hinder further gains compared with using the CA module alone in this case. Conversely, a distinct drop appears when CA is evaluated on mentions with short contexts.

Complementarity among Different Information Sources
We present the overlaps and disjoint parts of the true cases predicted by the single-source modules in Figure 4. About 31.33% of the test mentions are successfully categorized by all three modules, while the rest are misidentified by at least one module. We notice that HA and KA share the most true cases (up to 61.04%, i.e., 31.33% + 29.71%) among the pairwise intersections. A possible reason is that HA and KA use the same mention-context encoder based on ELMo. Another reason is that the premises and hypothesis constructed by KA implicitly encode some hierarchical information like HA; for example, part of the prototypes are shared between parent and child types. KA demonstrates greater capacity than HA, with 5.27% (i.e., 2.17% + 3.1%) additional true cases that HA fails to recognize, since background knowledge helps to distinguish among the confusing sibling types sharing the same parent type. However, there still exist 1.44% (i.e., 0.65% + 0.79%) of cases where HA does better than KA. This is because the hierarchy-wise information incorporated into KA is less explicit than that inside HA. Meanwhile, KA also suffers from the problem of low recall on long-tail types, as discussed in Sec 4.3.1.
Another noticeable observation is that a considerable proportion of cases (16.2%) are difficult for HA and KA to recognize, but easy for CA. This indicates that the consistency between type names and contexts is a non-negligible clue for improving ZFET performance.

Contributions of Multiple Information Sources to MSF
We also look into the intersections and differences between the true case sets of MSF and CA/HA/KA, as well as their union, in Figure 5. We see that MSF takes more advantage of HA and KA, with 57.73% and 61.5% overlaps respectively. Although CA provides lots of auxiliary information for MSF, there still exist 6.84% of CA's true cases that are wrongly predicted by MSF after fusion. Besides, the 4.76% missing part of HA and the 4.82% of KA also remain to be more fully exploited. Thus, it is worth exploring how to make the best of each information source during model fusion.

Related Work
As a zero-shot paradigm of FET, ZFET suffers from a huge information gap between the seen and unseen types due to the lack of annotated data. Beyond simply computing type representations by averaging the embeddings of the words comprising type names (Yuan and Downey, 2018), a variety of auxiliary information has been explored to fill this gap. One line of work proposes a hierarchical clustering model with a domain-specific knowledge base for unsupervised entity typing. Ma et al. (2016) first introduces prototypical information to learn type embeddings and encodes the type hierarchy by multiplying the type embeddings with a binary hierarchical matrix. Zhou et al. (2018) matches the entity mention with a set of Wikipedia entries and classifies the mention based on the Freebase types of its type-compatible entries. Obeidat et al. (2019) leverages Wikipedia descriptions of types and designs a context-description matching model. Ren et al. (2020) employs entity type attention to make the model focus on context semantically relevant to the type. Zhang et al. (2020b) transfers the knowledge from seen types to the unseen ones through a memory network. As for context consistency, Xin et al. (2018) first takes language models as a constraint in supervised typing tasks. Recently, Qian et al. (2021) studies unsupervised entity typing without using a knowledge base, where pseudo data with fine-grained labels are automatically created from a large unlabeled dataset.

Conclusion
In this paper, we explored multiple information sources for ZFET. We proposed a multi-source fusion model to better integrate these sources, which has achieved state-of-the-art performance in ZFET.
Besides, we conducted a deep analysis of the characteristics, merits and demerits of each information source, and discussed the complementarity among different sources. In particular, the context consistency information from the pre-trained language model is especially useful in complex scenarios with long-tail types or long contexts. Along this line, we will conduct more in-depth research to take full advantage of context consistency. Besides, we will also explore more reasonable methods for information fusion in ZFET.

A Implementation Details
For Proto-HLE and DZET, we employ their type representation methods but reuse our ELMo-based mention-context encoder to represent mentions and contexts. For ZOE, we remove the test mentions of the target dataset from the Wikipedia entry source and report the performance under our zero-shot setting. For CA, our implementation is based on the pre-trained BERT (BERT-base, uncased) available in the HuggingFace library. For HA, we adopt GloVe 200-dimensional word embeddings to initialize the type embeddings, which are frozen during training. The Transformer encoder is trained from scratch with 4 heads, 2 layers and a hidden dimension of 2048. For KA, the numbers of prototypes used for BBN and Wiki are 5 and 30 respectively. For the fusion of CA, HA and KA, λ_1, λ_2 and λ_3 are tuned by Macro F1 on the development set, and their values are as follows.