Fine-grained Entity Typing via Label Reasoning

Conventional entity typing approaches are based on independent classification paradigms, which makes it difficult for them to recognize inter-dependent, long-tailed and fine-grained entity types. In this paper, we argue that the extrinsic and intrinsic dependencies implicitly entailed between labels can provide critical knowledge to tackle the above challenges. To this end, we propose the Label Reasoning Network (LRN), which sequentially reasons fine-grained entity labels by discovering and exploiting the label dependency knowledge entailed in the data. Specifically, LRN utilizes an auto-regressive network to conduct deductive reasoning and a bipartite attribute graph to conduct inductive reasoning between labels, which can effectively model, learn and reason over complex label dependencies in a sequence-to-set, end-to-end manner. Experiments show that LRN achieves state-of-the-art performance on standard ultra fine-grained entity typing benchmarks and can also effectively resolve the long tail label problem.

The fundamental challenge of FET comes from its large-scale and fine-grained entity label set, which leads to significant differences between FET and conventional entity typing. First, due to the massive label set, it is impossible to independently recognize each entity label without considering their dependencies. For this, existing approaches use predefined label hierarchies (Ren et al., 2016a; Shimaoka et al., 2017; Abhishek et al., 2017; Karn et al., 2017; Xu and Barbosa, 2018; Ren, 2020) or label co-occurrence statistics from training data (Rabinovich and Klein, 2017; Xiong et al., 2019; Lin and Ji, 2019) as external constraints. Unfortunately, these label structures or statistics are difficult to obtain when transferring to new scenarios. Second, because of the fine-grained and large-scale label set, many long tail labels are provided with only several or even no training instances. For example, in the Ultra-Fine dataset (Choi et al., 2018), more than 80% of entity labels have fewer than 5 instances, and more seriously, 25% of labels never appear in the training data. The training data can therefore provide very limited direct information for these labels, and previous methods commonly fail to recognize these long-tailed labels.
Fortunately, the label dependencies implicitly entailed in the data provide critical knowledge to tackle the above challenges. Specifically, dependencies between labels exist extrinsically or intrinsically. On the one hand, extrinsic dependencies reflect direct connections between labels, which partially appear in the form of label hierarchy and co-occurrence. For example, in Figure 1(a) the labels person, musician and composer have extrinsic dependencies because they form a three-level taxonomy. Furthermore, singer and composer also have an extrinsic dependency because they often co-occur with each other. On the other hand, intrinsic dependencies entail indirect connections between labels through their underlying attributes. For the example in Figure 1(b), the labels theorist and scientist share the same underlying attribute scholar. Such intrinsic dependencies provide an effective way to tackle long tail labels, because many long tail labels are actually composed of non-long-tail attributes, which can be summarized from non-long-tail labels.
To this end, this paper proposes the Label Reasoning Network (LRN), which uniformly models, learns and reasons over both extrinsic and intrinsic label dependencies without requiring any predefined label structures. Specifically, LRN utilizes an auto-regressive network to conduct deductive reasoning and a bipartite attribute graph to conduct inductive reasoning between labels. These two mechanisms are jointly applied to sequentially generate fine-grained labels in an end-to-end, sequence-to-set manner. Figure 1(c) shows several examples. To capture extrinsic dependencies, LRN introduces deductive reasoning (i.e., drawing a conclusion based on premises) between labels, and formulates it using an auto-regressive network that predicts labels based on both the context and previous labels. For example, given the previously generated label person of the mention they, as well as the context they theorize, LRN will deduce the new label theorist based on the extrinsic dependency between person and theorist derived from the data. For intrinsic dependencies, LRN introduces inductive reasoning (i.e., gathering generalized information into a conclusion), and utilizes a bipartite attribute graph to reason about labels based on the currently activated attributes of previous labels. For example, if the attributes {expert, scholar} have been activated, LRN will induce the new label scientist based on the attribute-label relations. Consequently, by decomposing labels into attributes and associating long tail labels with frequent labels, LRN can also effectively resolve the long tail label problem by leveraging their non-long-tail attributes. By jointly leveraging extrinsic and intrinsic dependencies via deductive and inductive reasoning, LRN can effectively handle the massive label set of FET.
Generally, our main contributions are:
• We propose the Label Reasoning Network, which uniformly models, automatically learns and effectively reasons over the complex dependencies between labels in an end-to-end manner.
• To capture extrinsic dependencies, LRN utilizes deductive reasoning to sequentially reason labels via an auto-regressive network. In this way, extrinsic dependencies are discovered and exploited without predefined label structures.
• To capture intrinsic dependencies, LRN utilizes inductive reasoning to reason labels via a bipartite attribute graph. By decomposing labels into attributes and associating long-tailed labels with frequent attributes, LRN can effectively reason about long-tailed and even zero-shot labels.
We conduct experiments on the standard Ultra-Fine (Choi et al., 2018) and OntoNotes (Gillick et al., 2014) datasets. Experiments show that our method achieves new state-of-the-art performance: a 13% overall F1 improvement and a 44% F1 improvement at the ultra-fine granularity.

Related Work
One main challenge for FET is how to exploit complex label dependencies in the large-scale label set. Previous studies typically use a predefined label hierarchy or co-occurrence structures estimated from data to enhance their models. To this end, Ren et al. and López and Strube (2020) embed labels into a high-dimensional or a new space. Other studies exploit co-occurrence structures by limiting the label range during label set prediction (Rabinovich and Klein, 2017), enriching the label representation by introducing associated labels (Xiong et al., 2019), or requiring the latent label representation to reconstruct the co-occurrence structure (Lin and Ji, 2019). However, these methods require predefined label structures or statistics from training data, and are therefore difficult to extend to new entity types or domains. The ultra fine-grained label set also leads to a data bottleneck and the long tail problem. In recent years, some approaches have tried to tackle this problem by introducing zero/few-shot learning methods (Ma et al., 2016; Zhou et al., 2018; Yuan and Downey, 2018; Obeidat et al., 2019; Zhang et al., 2020b), using data augmentation with denoising strategies (Ren et al., 2016b; Onoe and Durrett, 2019; Zhang et al., 2020a; Ali et al., 2020), or utilizing external knowledge (Corro et al., 2015; Dai et al., 2019).

Figure 2: Overview of LRN, which contains an encoder, a deductive reasoning-based decoder and an inductive reasoning-based decoder. At step 1, the label person is predicted by deductive reasoning and the attribute human is activated; at step 3, the label scientist is generated by inductive reasoning.
In this paper, we propose the Label Reasoning Network, which differs significantly from previous methods because: 1) by introducing deductive reasoning, LRN can capture extrinsic dependencies between labels in an end-to-end manner without predefined structures; 2) by introducing inductive reasoning, LRN can leverage intrinsic dependencies to predict long tail labels; 3) through its sequence-to-set framework, LRN can consider both kinds of label dependencies simultaneously to jointly reason about frequent and long tail labels.

Figure 2 illustrates the framework of the Label Reasoning Network. First, we encode entity mentions through a context-sensitive encoder, then sequentially generate entity labels via two label reasoning mechanisms: deductive reasoning for exploiting extrinsic dependencies and inductive reasoning for exploiting intrinsic dependencies. In our Seq2Set framework, label dependency knowledge can be effectively modeled in the parameters of LRN, automatically learned from training data, and naturally exploited during the sequential label decoding process. In the following we describe these components in detail.

Encoding
For encoding, we form the input instance X by surrounding the entity mention with entity markers, where m denotes a mention word and x a context word. We then feed X to BERT and obtain the source hidden states H = {h_1, ..., h_n}. Finally, the hidden vector of the [CLS] token is used as the sentence embedding g.
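As a concrete illustration, the input assembly can be sketched as below. The marker tokens `<m>`/`</m>` and the helper name `build_input` are illustrative assumptions, not necessarily the paper's exact vocabulary:

```python
def build_input(context_left, mention, context_right,
                start_marker="<m>", end_marker="</m>"):
    """Assemble the marked input sequence X for the encoder.

    The mention is wrapped in entity markers so the encoder can
    distinguish mention words (m) from context words (x).  The marker
    strings here are illustrative placeholders.
    """
    return (["[CLS]"] + context_left.split()
            + [start_marker] + mention.split() + [end_marker]
            + context_right.split() + ["[SEP]"])

X = build_input("They criticized", "the composer", "for his latest album .")
# The hidden state of the leading [CLS] token is later used as the
# sentence embedding g.
```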

Deductive Reasoning for Extrinsic Dependencies
This section describes how to capture extrinsic dependencies for label prediction via a deductive reasoning mechanism. To this end, the deductive reasoning-based decoder sequentially generates labels based on both the context and previously generated labels, e.g., "for his books" + person → writer and "record an album" + person → musician. In this way, each label is decoded by considering both context-based prediction and previous-label-based prediction.
Concretely, we utilize an LSTM-based auto-regressive network as the decoder and obtain the decoder hidden states S = {s_0, ..., s_k}, where k is the number of predicted labels. We first initialize s_0 with the sentence embedding g; then at each time step, two attention mechanisms, contextual attention and premise attention, are designed to capture context and label information for the next prediction.
Contextual Attention is used to capture the context evidence for label prediction. For example, the context "they theorize" provides rich information for the label theorist. Specifically, at each time step t, contextual attention identifies relevant context by assigning a weight α_ti to each h_i in the source hidden states H:

α_ti = softmax_i( v_c^T tanh(W_c s_t + U_c h_i) ),

where W_c, U_c, v_c are weight parameters and s_t is the decoder hidden state at time step t. The context representation c_t is then obtained as the weighted sum

c_t = Σ_i α_ti h_i.

Premise Attention exploits the dependencies between labels for the next label prediction. For example, if person has been generated, its hyponym label theorist is highly likely to be generated in the context "they theorize". Concretely, at each time step t, premise attention captures the dependencies on previous labels by assigning a weight α_tj to each s_j of the previous decoder hidden states S_<t:

α_tj = softmax_j( v_p^T tanh(W_p s_t + U_p s_j) ),

where W_p, U_p, v_p are weight parameters. The previous label information u_t is then obtained as

u_t = Σ_{j<t} α_tj s_j.

Label Prediction. Given the context representation c_t and the previous label information u_t, we use m_t = [c_t + g; u_t + s_t] as input and calculate the probability distribution over the label set L:

y_t = softmax( W_o tanh(W_b m_t) + I_t ),

where W_o and W_b are weight parameters and I_t ∈ R^{L+1} is the mask vector (Yang et al., 2018) used to prevent duplicate predictions:

(I_t)_i = -∞ if l_i ∈ Y*_{t-1}, and 0 otherwise,

where Y*_{t-1} is the set of labels predicted before step t and l_i is the i-th label in the label set L. The label with the maximum value in y_t is generated and used as the input for the next time step, until [EOS] is generated.
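The two attention mechanisms share the same additive form and differ only in what they attend over (encoder states H vs. previous decoder states S_<t). A minimal sketch, assuming toy 2-d dimensions and identity weight matrices where the real model learns W, U and v:

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def additive_attention(query, keys, W, U, v):
    """Additive attention: score_i = v . tanh(W*query + U*key_i);
    weights = softmax(scores); output = weighted sum of the keys."""
    Wq = matvec(W, query)
    scores = []
    for k in keys:
        hidden = [math.tanh(a + b) for a, b in zip(Wq, matvec(U, k))]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(keys[0])
    context = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
    return weights, context

# Toy 2-d example.  H plays the role of the encoder states (contextual
# attention); applying the same routine to previous decoder states S_<t
# yields premise attention.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [0.5, 0.5]
I2 = [[1.0, 0.0], [0.0, 1.0]]   # identity weights, for illustration only
v = [1.0, 1.0]
alpha, c_t = additive_attention(s_t, H, I2, I2, v)
```

Here the third encoder state, being closest to the query in both dimensions, receives the largest attention weight.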

Inductive Reasoning for Intrinsic Dependencies
Deductive reasoning can effectively capture extrinsic dependencies. However, labels can also have intrinsic dependencies if they share attributes, e.g., theorist and scientist share the scholar attribute.
To leverage intrinsic dependencies, LRN conducts inductive reasoning by associating labels to attributes via a bipartite attribute graph. A label will be generated if most of its attributes are activated. Instead of heuristically setting the number of attributes to be activated, we select labels based on their overall activation score from all attributes. By capturing such label-attribute relations, many long tail labels can be effectively predicted because they are usually related to non-long tail attributes.
To this end, we first design a bipartite attribute graph to represent attribute-label relations. Based on the bipartite attribute graph, at each time step, attributes will be activated based on the hidden state of decoder, and new labels will be inducted by reasoning over the activated attributes. For example, in Figure 2 the predicted labels person, theorist and commander will correspondingly activate the attributes human, scholar and expert, and then the scientist label will be activated via inductive reasoning based on these attributes.
Bipartite Attribute Graph (BAG). The BAG G = {V, E} is designed to capture the relations between attributes and labels. Specifically, the nodes V contain attribute nodes V_a and label nodes V_l, and edges E exist only between attribute nodes and label nodes, with the edge weight indicating attribute-label relatedness. Attributes are represented by natural language words in the BAG; Figure 2 shows an example.

BAG Construction. Because there are many labels and many attributes, we dynamically build a local BAG during decoding for each instance. In this way the BAG is very compact and the computation is efficient (Zupan et al., 1999). In the local BAG, we collect attributes in two ways: (1) We mask the entity mention in the sentence and predict the [MASK] token using a masked language model (this paper uses BERT-base-uncased); the non-stop words whose prediction scores are greater than a confidence threshold θ_c are used as attributes, which we denote as context attributes. Since the PLM usually predicts high-frequency words, these attributes are usually not long-tailed, which facilitates modeling dependencies between head and tail labels. This mask-prediction strategy is also used in Xin et al. (2018) for collecting additional semantic evidence of entity labels. (2) We directly segment the entity mention into words using Stanza (https://pypi.org/project/stanza/), and all non-stop words are used as attributes, which we denote as entity attributes. Figure 3 shows several attribute examples. Given the attributes, we compute the attribute-label relatedness (i.e., E in G) using the cosine similarity of their GloVe embeddings (Pennington et al., 2014).
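The edge-weight computation can be sketched as follows; the toy 2-d vectors stand in for real GloVe embeddings, and the function name is ours:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def build_bag_edges(attr_emb, label_emb):
    """Edge weight E[attribute][label] = cosine similarity of the
    attribute word embedding and the label word embedding."""
    return {a: {l: cosine(va, vl) for l, vl in label_emb.items()}
            for a, va in attr_emb.items()}

# Toy 2-d vectors standing in for GloVe embeddings.
attr_emb = {"scholar": [0.9, 0.1], "human": [0.1, 0.9]}
label_emb = {"scientist": [0.8, 0.2], "person": [0.2, 0.8]}
E = build_bag_edges(attr_emb, label_emb)
```

In this toy graph, the attribute scholar is strongly connected to scientist and only weakly to person, as intended.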
Reasoning over BAG. At each time step, we activate attributes in the BAG by calculating their similarity to the current decoder hidden state s_t. For the i-th attribute node V_a^(i), its activation score is

p_t^(i) = ReLU( cos( W_s s_t , W_a^(i) ) ),

where W_s is a weight parameter and W_a^(i) is the attribute embedding (i.e., the word embedding of the attribute word). We use cosine distance to measure similarity and employ ReLU to activate attributes. Then we induce new labels by reasoning over the activated attributes:

score( V_l^(j) ) = Σ_{i=1}^{n_a} p_t^(i) E_ij ,

where n_a is the number of attributes, V_l^(j) is the j-th label node and E_ij is the edge weight between them. Finally, a label is generated if its activation score is greater than a similarity threshold θ_s.
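The induction step can be sketched as below, under the assumption that the decoder state has already been projected by W_s; the edge weights and thresholds are toy values:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def induce_labels(s_t, attr_emb, E, theta_s):
    """Activate attributes from the decoder state (ReLU of cosine
    similarity), then score each label by summing activation * edge
    weight over all attributes; keep labels scoring above theta_s."""
    activation = {a: max(0.0, cosine(s_t, va))   # ReLU
                  for a, va in attr_emb.items()}
    labels = {l for edges in E.values() for l in edges}
    scores = {l: sum(activation[a] * E[a][l] for a in attr_emb)
              for l in labels}
    return {l for l, sc in scores.items() if sc > theta_s}

attr_emb = {"scholar": [0.9, 0.1], "human": [0.1, 0.9]}
E = {"scholar": {"scientist": 0.9, "person": 0.3},
     "human":   {"scientist": 0.3, "person": 0.9}}
s_t = [0.95, 0.1]   # decoder state, assumed already projected by W_s
induced = induce_labels(s_t, attr_emb, E, theta_s=0.7)
```

With this state, scholar is fully activated and human only weakly, so only scientist clears the threshold.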
Note that our inductive reasoning and deductive reasoning are jointly modeled in the same decoder, i.e., they share the same decoder hidden states but use different label prediction processes. Once the deductive reasoning-based decoder generates [EOS], label prediction stops. Finally, we combine the predicted labels of both deductive and inductive reasoning as the final FET result.
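The combination step is a simple set union of the two decoders' outputs; a one-line sketch (the function name is ours):

```python
def final_labels(deductive_labels, inductive_labels):
    """Final FET prediction: union of the labels produced by the
    deductive decoder and those induced over the BAG."""
    return set(deductive_labels) | set(inductive_labels)

out = final_labels(["person", "theorist"], {"scientist"})
```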
Set Prediction Loss. In FET, cross-entropy loss is not appropriate because the prediction result is a label set, i.e., {y*_1, y*_2, y*_3} and {y*_3, y*_2, y*_1} should incur the same loss. Therefore we measure the similarity of two label sets using the bipartite matching loss (Sui et al., 2020). Given the golden label set Y = {y_1, ..., y_m} and the generated label set Y* = {y*_1, ..., y*_m}, the matching loss between y_i and y*_j is

L_S^(ij) = CE(y_i, y*_j),

where CE is cross-entropy. We then use the Hungarian algorithm (Kuhn, 1955) to find the specific order Ŷ = {ŷ_1, ..., ŷ_m} of the golden label set that yields the minimum matching loss:

L_S = Σ_{i=1}^{m} CE(ŷ_i, y*_i).
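The order-invariance of this loss can be sketched with a brute-force matcher; for clarity we enumerate permutations instead of running the Hungarian algorithm, which finds the same minimum in polynomial time:

```python
import math
from itertools import permutations

def cross_entropy(dist, gold_idx):
    """CE between a one-hot gold label and a predicted distribution."""
    return -math.log(dist[gold_idx])

def set_prediction_loss(pred_dists, gold_labels):
    """Order-invariant set loss: minimum total cross-entropy over all
    assignments of gold labels to prediction slots.  Brute force over
    permutations for illustration; the Hungarian algorithm (Kuhn, 1955)
    computes the same optimum efficiently."""
    best = float("inf")
    for order in permutations(gold_labels):
        total = sum(cross_entropy(p, g) for p, g in zip(pred_dists, order))
        best = min(best, total)
    return best

# Two decoding steps over a 3-label vocabulary; the gold label set is {0, 2}.
pred = [[0.1, 0.2, 0.7],   # step 1 mostly predicts label 2
        [0.8, 0.1, 0.1]]   # step 2 mostly predicts label 0
loss_a = set_prediction_loss(pred, [0, 2])
loss_b = set_prediction_loss(pred, [2, 0])  # same set, different order
```

Both gold orderings yield the same loss, realized by matching step 1 to label 2 and step 2 to label 0.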
BAG Loss. To make the model activate the correct labels, we add a supervisory loss on the bipartite attribute graph.

Final Loss. The final loss is a combination of the set loss and the BAG loss:

L = L_S + λ L_BAG,

where λ is the relative weight of the two losses.

Baselines. For the Ultra-Fine dataset, we compare with the following baselines: Onoe and Durrett (2019), which offers two multi-classifiers using BERT and ELMo as encoders respectively; Choi et al. (2018), a multi-classifier using GloVe+LSTM as the encoder; Xiong et al. (2019), a multi-classifier using GloVe+LSTM as the encoder that exploits label co-occurrence by introducing associated labels to enrich the label representation; and López and Strube (2020), a hyperbolic multi-classifier using GloVe. For the OntoNotes dataset, in addition to the baselines for Ultra-Fine, we also compare with a multi-classifier using BERT as the encoder; with Lin and Ji (2019), a multi-classifier using ELMo as the encoder that exploits label co-occurrence by requiring the latent representation to reconstruct the co-occurrence association; and with a multi-classifier using ELMo as the encoder that exploits the label hierarchy via a hierarchy-aware loss function.
Implementation. We use BERT-Base (uncased) (Devlin et al., 2019) as the encoder, and the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 5e-5 for BERT and 1e-3 for the other parameters. The batch size is 32, the encoder hidden size is 768, the decoder hidden size is 868, the label embedding size is 100, and the dropout rate of the decoder is 0.6. The confidence threshold θ_c and the similarity threshold θ_s are both optimized on the dev set and set to 0.1 and 0.2, respectively. We use GloVe embeddings (Pennington et al., 2014) to represent the nodes of the BAG and fix them during training. The loss weight λ is empirically set to 1, since auxiliary experiments show its impact to be minor. The Ultra-Fine dataset is released at https://github.com/uwnlp/open_type.

Table 1: Overall results on the Ultra-Fine test set. * means using augmented data. "Without label dependency" methods formulate FET as multi-label classification without considering associations between labels; "with label dependency" methods leverage such associations explicitly or implicitly.

Model                              P     R     F1
without label dependency
  *Choi et al. (2018)              47.1  24.2  32.0
  *ELMo (Onoe and Durrett, 2019)   51.5  33.0  40.2
  BERT (Onoe and Durrett, 2019)    51.6  33.0  40.2
  BERT [in-house]                  55.9  33.0  41.5
with label dependency
  *LABELGCN (Xiong et al., 2019)

Overall Results

Table 1 shows the main results of all baselines and our method in two settings: LRN is the full model and LRN w/o IR is the model without inductive reasoning. For fair comparison, we implement a baseline with the same settings as LRN but replace the decoder with a multi-classifier, following Choi et al. (2018), denoted BERT [in-house]. We can see that:

1) By performing label reasoning, LRN can effectively resolve the fine-grained entity typing problem. Compared with previous methods, our method achieves state-of-the-art performance, improving F1 from 40.2 to 45.4 on the test set. This verifies the necessity of exploiting label dependencies for FET and the effectiveness of our two label reasoning mechanisms. We believe this is because label reasoning makes learning more data-efficient (i.e., labels can share knowledge) and makes label prediction globally coherent.

2) Both deductive reasoning and inductive reasoning are useful for fine-grained label prediction. Compared with BERT [in-house], LRN w/o IR achieves a 4.3% F1 improvement by exploiting extrinsic dependencies via deductive reasoning. LRN further improves F1 from 43.3 to 45.4 by exploiting intrinsic dependencies via inductive reasoning. We believe this is because deductive and inductive reasoning are two fundamental but different mechanisms; modeling them simultaneously therefore better leverages label dependencies to predict labels.

Table 2: Macro P/R/F1 of each label granularity on the Ultra-Fine dev set; long tail labels are mostly in the ultra-fine layer. * means using augmented data. † We adapt the results from López and Strube (2020).

Table 4: Performance of zero-shot, shot=1 and shot=2 label prediction. "Category" means how many kinds of types are predicted; "Prediction" means how many labels are generated.
3) Seq2Set is an effective framework to model, learn and exploit label dependencies in an end-to-end manner. Compared with LABELGCN (Xiong et al., 2019), which heuristically exploits the label co-occurrence structure, LRN achieves a significant performance improvement. We believe this is because neural networks have a strong ability to represent and learn label dependencies, and the end-to-end manner enables LRN to easily generalize to new scenarios.

Effect on Long Tail Labels
As described above, another advantage of our method is that it can resolve the long tail problem by decomposing long tail labels into common attributes and modeling label dependencies between head and tail labels. Because the finer the label granularity, the more likely a label is to be long-tailed, we report the performance of each label granularity on the dev and test sets, following previous work, in Table 2 and Table 3. Moreover, we report the performance of labels with shot ≤ 2 in Table 4. Based on these results, we find that: 1) LRN can effectively resolve the long tail label problem. Compared to BERT [in-house], LRN significantly improves the F-score on ultra-fine granularity labels by 44% (22.6 → 32.5) and recalls more fine-grained labels (14.6 → 26.0).
2) Both deductive reasoning and inductive reasoning are helpful for long tail label prediction, but through different underlying mechanisms: deductive reasoning exploits the extrinsic dependencies between labels, while inductive reasoning exploits the intrinsic ones. LRN w/o IR cannot predict zero-shot labels because it resolves long tail labels by relating them to head labels, and therefore cannot predict unseen labels. By contrast, LRN can predict zero-shot labels via inductive reasoning because it decomposes labels into attributes. Furthermore, we found that LRN w/o IR has higher precision on few-shot (shot=2) labels than BERT and LRN; we believe this is because inductive reasoning focuses on recalling more labels, which inevitably introduces some incorrect labels.

Detailed Analysis
Effect of Components. To evaluate the effect of different components, we report ablation results in Table 5. We can see that: (1) the set prediction loss is effective: replacing it with cross-entropy loss leads to a significant decrease; (2) both contextual attention and premise attention contribute to the final performance.

Effect of Attributes Set. To explore the impact of entity attributes and context attributes in the BAG, Figure 4(a) shows the results of different attribute configurations. We can see that both kinds of attributes are useful: context attributes have high coverage but may be noisy, while entity attributes are the opposite. When both are introduced, the information in entity attributes can help disambiguate the context attributes, similar to the role of contextual information in word sense disambiguation. As a result, the two kinds of attributes complement each other. Figure 4(b) shows performance under different thresholds; we optimize the confidence threshold θ_c = 0.1 and the similarity threshold θ_s = 0.2 on the dev set. Notice that θ_s is the threshold for activating labels, and when θ_s = 1 the model is equivalent to LRN w/o IR.

Results of OntoNotes
To verify the generality of our method, we further conduct experiments on OntoNotes and report results with and without augmentation data in Table 6. To embed labels in OntoNotes, we use the embedding of the last word of a label; e.g., /person/artist/director is represented by the embedding of director.

Table 6: Results on the OntoNotes test set. Augmentation refers to the augmented data created by Choi et al. (2018), which contains 800K instances, so there are few few-shot labels in this setting. * indicates using additional features to enhance the label representation.
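This label-to-word mapping is a one-liner; a sketch (the function name is ours, and the resulting word would then be looked up in GloVe):

```python
def label_word(label_path):
    """Map a hierarchical OntoNotes label to the word whose embedding
    represents it: the last component of the label path."""
    return label_path.strip("/").split("/")[-1]

w = label_word("/person/artist/director")  # "director"
```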
We can see that: 1) LRN still achieves the best performance in both settings, which verifies the robustness of our method. 2) Compared with Ultra-Fine, our method achieves a smaller improvement on OntoNotes. We find this is mainly because: first, OntoNotes has weaker label dependencies, since its label set is smaller (89 vs. 2519 for Ultra-Fine) and most of its labels are coarse-grained; secondly, most labels in OntoNotes are frequent labels with many training instances, so the long tail label problem is not serious. This also explains why LRN w/o IR achieves better performance than LRN in the setting with augmentation data: the more training instances, the less the need for long tail prediction.

Case Study
To intuitively present the learned label dependencies, Figure 5 shows the label co-occurrence matrices of different models' predictions and of the ground truth. We can see that both LRN and LRN w/o IR accurately learn label dependencies. Figure 6 shows several prediction cases, demonstrating that deductive and inductive reasoning have quite different underlying mechanisms and predict quite different labels.

Figure 6: Cases of prediction results.

Conclusions
This paper proposes the Label Reasoning Network, which uniformly models, learns and reasons over complex label dependencies in a sequence-to-set, end-to-end manner. LRN designs two label reasoning mechanisms for effective decoding: deductive reasoning to exploit extrinsic dependencies and inductive reasoning to exploit intrinsic dependencies. Experiments show that LRN can effectively cope with the massive label set of FET. And because our method uses no predefined structures, it can be easily generalized to new datasets and applied to other multi-label classification tasks.