Modeling Label Correlations for Ultra-Fine Entity Typing with Neural Pairwise Conditional Random Field

Ultra-fine entity typing (UFET) aims to predict a wide range of type phrases that correctly describe the categories of a given entity mention in a sentence. Most recent works infer each entity type independently, ignoring the correlations between types, e.g., when an entity is inferred as a president, it should also be a politician and a leader. To this end, we use an undirected graphical model called pairwise conditional random field (PCRF) to formulate the UFET problem, in which the type variables are not only unarily influenced by the input but also pairwisely related to all the other type variables. We use various modern backbones for entity typing to compute unary potentials, and derive pairwise potentials from type phrase representations that both capture prior semantic information and facilitate accelerated inference. We use mean-field variational inference for efficient type inference on very large type sets and unfold it as a neural network module to enable end-to-end training. Experiments on UFET show that Neural-PCRF consistently outperforms its backbones with little cost and achieves performance competitive with the cross-encoder based SOTA while being thousands of times faster. We also find Neural-PCRF effective on a widely used fine-grained entity typing dataset with a smaller type set. We pack Neural-PCRF as a network module that can be plugged into multi-label type classifiers with ease and release it at github.com/modelscope/adaseq/examples/NPCRF.


Introduction
Entity typing assigns semantic types to entities mentioned in text. The extracted type information has a wide range of applications. It acts as a primitive for information extraction (Yang and Zhou, 2010) and spoken language understanding (Coucke et al., 2018), and assists in more complicated tasks such as machine reading comprehension (Joshi et al., 2017) and semantic parsing (Yavuz et al., 2016). Over its long history of development, the granularity of the type set for entity typing and recognition has grown from coarse (fewer than 20 types) (Tjong Kim Sang and De Meulder, 2003; Hovy et al., 2006), to fine (around 100 types) (Weischedel and Brunstein, 2005; Ling and Weld, 2012; Gillick et al., 2014; Ding et al., 2021b), to ultra-fine and free-form (around 10k types) (Choi et al., 2018). The expansion of the type set reveals the diversity of real-world entity categories and the importance of a finer-grained understanding of entities in applications (Choi et al., 2018).
The increasing number of types makes it harder to predict correct types, so a better understanding of the entity types and their correlations is needed. Most previous works solve an N-type entity typing problem as N independent binary classifications (Ling and Weld, 2012; Gillick et al., 2014; Choi et al., 2018; Onoe and Durrett, 2019; Ding et al., 2021a; Pan et al., 2022; Li et al., 2022). However, types are highly correlated and hence should be predicted jointly. As an example, when 'Joe Biden' is inferred to be a president, he should also be inferred to be a politician, but not science fiction. Type correlation is partially specified in fine-grained entity typing (FET) datasets by a two- or three-level type hierarchy, and is commonly utilized by a hierarchy-aware objective function (Ren et al., 2016; Jin et al., 2019; Xu and Barbosa, 2018). However, type hierarchies cannot encode type relationships beyond strict containment, such as similarity and mutual exclusion (Onoe et al., 2021), and may be noisily defined (Wu et al., 2019) or even unavailable (as in ultra-fine entity typing (UFET)). Many recent works handle these problems by embedding types and mentions into special spaces such as the hyperbolic space (López and Strube, 2020) or box space (Onoe et al., 2021) that can be trained to latently encode type correlations without a hierarchy. Although these methods are expressive for modeling type correlations, they are constrained by these special spaces and thus cannot be combined with modern entity typing backbones such as prompt learning (Ding et al., 2021a) and cross-encoders (Li et al., 2022), and cannot integrate prior type semantics.
In this paper, we present an efficient method that expressively models type correlations while being backbone-agnostic. We formulate the UFET and FET problems under a classical undirected graphical model (UGM) (Koller and Friedman, 2009) called the pairwise conditional random field (PCRF) (Ghamrawi and McCallum, 2005). In PCRF, types are binary variables that are not only unarily influenced by the input but also pairwisely related to all the other type variables. We formulate the unary potentials using the type logits provided by any modern backbone such as prompt learning. To compose the pairwise potentials, sized O(4N²) (N is the number of types, 10k for UFET), we use matrix decomposition, which is efficient and able to utilize prior type semantic knowledge from word embeddings (Pennington et al., 2014). Exact inference on such a large and dense UGM is intractable, and therefore we apply mean-field variational inference (MFVI) for approximate inference. Inspired by Zheng et al. (2015), we unfold the MFVI as a recurrent neural network module and connect it with the unary potential backbone to enable end-to-end training and inference. We call our method Neural-PCRF (NPCRF). Experiments on UFET show that NPCRF consistently outperforms its backbones with negligible additional cost, and achieves strong performance against cross-encoder based SOTA models while being thousands of times faster. We also find NPCRF effective on a widely used fine-grained entity typing dataset with a smaller type set. We pack NPCRF as a network module that can be plugged into multi-label type classifiers with ease and release it at github.com/modelscope/adaseq/examples/NPCRF.

Pairwise Conditional Random Field
Pairwise conditional random field (Ghamrawi and McCallum, 2005) is a classical undirected graphical model proposed for modeling label correlations.
In the deep learning era, PCRF was first found effective when combined with a convolutional neural network (CNN) (Fukushima and Miyake, 1982) for semantic segmentation in computer vision (Zheng et al., 2015; Chandra and Kokkinos, 2016; Arnab et al., 2016; Shen et al., 2017; Lê-Huu and Alahari, 2021a), in which it was used to encourage adjacent pixels to be segmented together. In contrast to its popularity in computer vision, PCRF is much less explored in natural language processing. Ghamrawi and McCallum (2005) and Wang et al. (2017) apply PCRF on an n-gram feature based non-neural sentence classifier with up to 203 classes. Besides the differences in tasks, numbers of classes, and backbones, our method differs from theirs in two main aspects: (1) We unfold mean-field approximate inference of PCRF as a recurrent neural network module for efficient end-to-end training and inference, while they use the 'supported inference method' (Ghamrawi and McCallum, 2005), which is intractable for large type sets and incompatible with neural backbones. (2) We design the parameterization of the pairwise potential function based on matrix decomposition to accelerate training and inference and to encode type semantics, which is important for entity typing (Li et al., 2022) and sentence classification (Mueller et al., 2022). Our parameterization also conforms to intrinsic properties of pairwise potentials (explained in Sec. 3.4). Hu et al. (2020) investigate different potential function variants for linear-chain CRFs for sequence labeling, while we design pairwise potential functions for PCRF.

Problem Definition
Entity typing datasets consist of M entity mentions m_i with their corresponding context sentences c_i: D = {(m_1, c_1), ..., (m_M, c_M)}, and a type set Y of size N. The task of entity typing is to predict the types y^p_i of the entity mention m_i in the given context c_i, where y^p_i is a subset of the type set. The number of gold types |y^g_i| can be larger than one in most entity typing datasets. The average number of gold types per instance avg(|y^g_i|) and the size of the type set |Y| vary across datasets.

PCRF for Entity Typing
We first describe our pairwise conditional random field for entity typing (Ghamrawi and McCallum, 2005), as shown in Fig. 1(a). x denotes a data point (m, c) ∈ D. Y_j ∈ {0, 1} denotes the binary random variable for the j-th type in type set Y. The type variables Y_1:N are unarily connected to the input to model how likely a type is given the input, and each type variable is pairwisely connected to all other type variables to model type correlations. Let y ∈ Y_1 × Y_2 × ⋯ × Y_N be an assignment of all the type variables. The probability of y given a data point x under the conditional random field factorizes into unary and pairwise terms, where θ_u, θ_p are scoring functions for unary and pairwise edges, and Z is the partition function for normalization.
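Written out, the standard form of this factorization under the definitions above is (a reconstruction of the display equation, not copied from the original):

```latex
P(y \mid x) \;=\; \frac{1}{Z(x)} \exp\!\Big( \sum_{j=1}^{N} \theta_u(y_j; x) \;+\; \sum_{j < k} \theta_p(y_j, y_k) \Big),
\qquad
Z(x) \;=\; \sum_{y' \in \{0,1\}^N} \exp\!\Big( \sum_{j=1}^{N} \theta_u(y'_j; x) \;+\; \sum_{j < k} \theta_p(y'_j, y'_k) \Big)
```

The sum over all 2^N assignments in Z(x) is what makes exact inference intractable for large N and motivates the mean-field approximation below.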
We compute unary potentials with entity typing backbones based on pretrained language models (PLMs). We introduce two of the backbones below.
Multi-label Classifier (MLC) Given a data point (m, c), we formulate the input as [CLS] c [SEP] m [SEP] and feed it into RoBERTa-large (Liu et al., 2019) to obtain the [CLS] embedding h. Then we obtain the unary type scores from h.
Prompt Learning (PL) We use a verbalizer V to map types to subwords; V_y denotes the subwords corresponding to type y ∈ Y, e.g., V_living_thing = {'living', 'thing'}. We obtain the unary score of choosing type y_j as the average of the masked language model logits of the subwords in V_{y_j}. Similarly, we set θ_u(y_j = 0; x) = 0.
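In symbols, plausible forms for these two unary scores (assumptions here, since the original display equations are not reproduced above: a linear scoring layer for MLC, and a [MASK]-template prompt for PL) are:

```latex
\text{MLC:}\quad \theta_u(y_j = 1;\, x) \;=\; \mathbf{w}_j^{\top} \mathbf{h} + b_j,
\qquad
\text{PL:}\quad \theta_u(y_j = 1;\, x) \;=\; \frac{1}{|V_{y_j}|} \sum_{w \in V_{y_j}} \mathrm{logit}_{\mathrm{MLM}}\!\left(w \mid \text{template}(m, c)\right),
```

with θ_u(y_j = 0; x) = 0 in both cases, so only the "type present" state carries input-dependent score.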

Pairwise Potential Function
The pairwise log potential θ_p(y_j, y_k) encoding type correlations should naturally satisfy the properties shown in Table 1. Specifically, Θ_11 and Θ_00 are symmetric matrices that encode co-occurrence and co-absence respectively, while Θ_10 = Θ_01^T and both are asymmetric (e.g., "not a person but a president" ≠ "not a president but a person"). Directly parameterizing these four potential matrices (Ghamrawi and McCallum, 2005; Wang et al., 2017) ignores these intrinsic properties and results in an unbearable number of model parameters for datasets with a large type set (e.g., 400M parameters for the 10331-type UFET, which is more than Bert-large).
To tackle these problems, we parameterize the pairwise potential based on matrix rank decomposition, i.e., we represent each N × N matrix as the product of two N × R matrices, where R is the number of ranks. Crucially, we use pretrained word embeddings to derive the two N × R matrices so that they can encode type semantics. Specifically, we obtain a type embedding matrix E ∈ R^{N×300} based on 300-dimensional GloVe embeddings (Pennington et al., 2014). For a type phrase consisting of multiple words (e.g., "living_thing"), we take the average of the word embeddings. We then transform E into two embedding spaces encoding "occurrence" and "absence" respectively using two feed-forward networks.
To enforce the intrinsic properties of the four matrices, we parameterize them as follows. The negative signs in defining Θ_01 and Θ_10 ensure that they encode co-exclusion, not similarity. As we show in the next subsection, we never need to actually recover these large matrices during inference, leading to lower computational complexity.
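A minimal numpy sketch of this decomposition at toy sizes (the single random tanh layers standing in for the FFNs are an assumption, since the paper's exact Eq. 1 is not reproduced above), checking the intended algebraic properties and the O(NR) matrix-vector trick:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, R = 6, 300, 4          # types, GloVe dim, rank (toy sizes)

E = rng.normal(size=(N, D))  # stand-in for averaged GloVe type embeddings

# Two feed-forward maps into "occurrence" and "absence" spaces (hypothetical
# one-layer FFNs; the real model uses FFNs with a hidden layer).
W_o = rng.normal(size=(D, R))
W_a = rng.normal(size=(D, R))
E_o = np.tanh(E @ W_o)       # N x R occurrence factors
E_a = np.tanh(E @ W_a)       # N x R absence factors

# Low-rank pairwise potential matrices; materialized here only to verify
# the properties -- at scale they are never explicitly built.
Th_11 = E_o @ E_o.T          # co-occurrence (symmetric)
Th_00 = E_a @ E_a.T          # co-absence   (symmetric)
Th_10 = -E_o @ E_a.T         # exclusion    (asymmetric)
Th_01 = -E_a @ E_o.T

assert np.allclose(Th_11, Th_11.T)     # symmetric by construction
assert np.allclose(Th_00, Th_00.T)
assert np.allclose(Th_10, Th_01.T)     # transposes of each other

# An N x N matrix-vector product via the factors costs O(NR), not O(N^2):
q = rng.random(N)
assert np.allclose(Th_11 @ q, E_o @ (E_o.T @ q))
```

The outer-product form guarantees the symmetry and transpose constraints for free, which is the point of the parameterization.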

Mean-field Variational Inference
We aim to infer the best type set given the input.
The MAP (maximum-a-posteriori) inference over this large and dense graph is NP-hard. We instead use mean-field variational inference (MFVI) to approximately infer the best assignment y^p. MFVI approximates the true posterior distribution p(y_j | x) by a variational distribution q(y_j) that factorizes into independent marginals (Wainwright et al., 2008). It iteratively minimizes the KL-divergence between p and q. We initialize the q distribution with the unary scores produced by the backbone: q_0(y_j) = softmax(θ_u(y_j; x)), and derive the MFVI iteration accordingly. We rewrite the formulas in vector form with our parameterization: θ_u^1 ∈ R^N denotes the unary score vector of θ_u(y_j = 1; x) for all j, and we define θ_u^0 similarly. q_t^0, q_t^1 are the vectors of the q distributions at the t-th iteration.
Note that by following these update formulas, we never need to recover the N × N matrices, and hence the time complexity of each iteration is O(NR). Since the iteration number T is typically a small constant (T < 7), the computational complexity of the whole MFVI is O(NR). The predicted type set of x is obtained by y^p = {y_j | q_T^1[j] > 0.5, y_j ∈ Y}. We follow the treatment of MFVI as entropy-regularized Frank-Wolfe (Lê-Huu and Alahari, 2021b) and introduce another hyper-parameter λ to control the step size of each update: letting q_t = [q_t^0; q_t^1] and q̂_{t+1} denote the raw MFVI update, we set q_{t+1} = q_t + λ(q̂_{t+1} − q_t).
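The whole iteration can be sketched in a few lines of numpy. This is a simplified reading of the update (assumptions: the self-interaction term is not excluded, and the per-variable two-state softmax is written as a sigmoid of the logit difference), with the low-rank factors keeping each step at O(NR):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mfvi(theta1, theta0, E_o, E_a, T=4, lam=0.5):
    """Damped mean-field updates with low-rank pairwise potentials.

    theta1/theta0: unary score vectors (N,) for y_j = 1 / y_j = 0.
    E_o, E_a: (N, R) factors with Th_11 = E_o E_o^T, Th_00 = E_a E_a^T,
    Th_10 = -E_o E_a^T, Th_01 = -E_a E_o^T (never materialized).
    """
    q1 = sigmoid(theta1 - theta0)          # q_0 from unary scores
    for _ in range(T):
        q0 = 1.0 - q1
        # pairwise messages in O(NR) via the factors
        s1 = E_o @ (E_o.T @ q1) - E_o @ (E_a.T @ q0)  # Th_11 q1 + Th_10 q0
        s0 = E_a @ (E_a.T @ q0) - E_a @ (E_o.T @ q1)  # Th_00 q0 + Th_01 q1
        q1_hat = sigmoid((theta1 + s1) - (theta0 + s0))
        q1 = q1 + lam * (q1_hat - q1)      # step-size-controlled update
    return q1

rng = np.random.default_rng(0)
N, R = 8, 3
q1 = mfvi(rng.normal(size=N), np.zeros(N),
          0.1 * rng.normal(size=(N, R)), 0.1 * rng.normal(size=(N, R)))
pred = q1 > 0.5                            # predicted type set
assert q1.shape == (N,) and np.all((q1 > 0) & (q1 < 1))
```

With λ = 1 this reduces to the plain mean-field fixed-point iteration; λ < 1 damps the update, which is the Frank-Wolfe-style step size mentioned above.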

Unfolding Mean-field Variational Inference as a Recurrent Neural Network
We follow Zheng et al. (2015) and treat the mean-field variational inference procedure with a fixed number of iterations as a recurrent neural network (RNN) parameterized by the pairwise potential functions. As shown in the top part of Figure 1(b), the initial hidden state of the RNN is the type distribution q_0 produced by the unary logits, and the final hidden state q_T after T iterations is used for end-to-end training and inference.

Training Objective and Optimization
We use the binary cross entropy loss for training. L is the loss of each instance x, and y^g_j ∈ {0, 1} is the gold annotation of x for type y_j. We follow previous works (Choi et al., 2018; Dai et al., 2021) and use α as a weight for the loss of positive types. We train the pretrained language model, the label embedding matrix E, and the FFNs. We use AdamW (Loshchilov and Hutter, 2018) for optimization.
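A small sketch of this positive-weighted objective over the N final marginals (the exact display equation is not reproduced above, so this is one standard reading of a BCE loss with positive weight α):

```python
import numpy as np

def typing_bce(q1, gold, alpha=2.0):
    """Positive-weighted binary cross entropy over the N types.

    q1:    (N,) final marginals q_T(y_j = 1) from MFVI.
    gold:  (N,) 0/1 gold annotations y^g_j.
    alpha: up-weights positive types, which are rare when N is large
           (a tuned hyper-parameter; 2.0 here is an illustrative value).
    """
    eps = 1e-12                       # numerical guard for log(0)
    pos = -alpha * gold * np.log(q1 + eps)
    neg = -(1.0 - gold) * np.log(1.0 - q1 + eps)
    return float(np.sum(pos + neg))

q1 = np.array([0.9, 0.2, 0.6])
gold = np.array([1.0, 0.0, 1.0])
loss = typing_bce(q1, gold)
assert loss > 0.0
```

Because the final marginals come out of the unfolded MFVI module, gradients of this loss flow back through the iterations into both the unary backbone and the pairwise factors.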

Datasets and Experimental Settings
We mainly evaluate our NPCRF method on the ultra-fine entity typing dataset UFET (Choi et al., 2018). We also conduct experiments on the augmented version (Choi et al., 2018) of OntoNotes (Gillick et al., 2014) to examine whether our method also works for datasets with smaller type sets and fewer gold types per instance. We show the dataset statistics in Table 2. Note that UFET also provides 25M distantly labeled training examples extracted by linking to a KB and parsing. We follow recent works (Pan et al., 2022; Liu et al., 2021) and only use the 2k manually annotated examples for training. We use standard metrics for evaluation: for UFET, we report macro-averaged precision (P), recall (R), and F1; for OntoNotes, we report macro-averaged and micro-averaged F1. We run each experiment three times and report the average results.
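For concreteness, one common definition of the macro-averaged metric for multi-label typing (per-example precision and recall averaged over instances, then combined into F1; a sketch, not necessarily the exact evaluation script used):

```python
def macro_prf(preds, golds):
    """Macro-averaged P/R/F1 over a list of examples.

    preds/golds: lists of sets of predicted / gold type phrases.
    """
    ps, rs = [], []
    for p, g in zip(preds, golds):
        hit = len(p & g)                       # correctly predicted types
        ps.append(hit / len(p) if p else 0.0)  # per-example precision
        rs.append(hit / len(g) if g else 0.0)  # per-example recall
    P = sum(ps) / len(ps)
    R = sum(rs) / len(rs)
    F1 = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F1

# One predicted type of two is correct, one gold type of two is found.
P, R, F1 = macro_prf([{"person", "leader"}], [{"person", "politician"}])
assert (P, R, F1) == (0.5, 0.5, 0.5)
```

This averaging explains the precision/recall trade-off discussed later: predicting more types raises per-example recall while lowering per-example precision.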

Baseline Methods
MLC-ROBERTA and PL-BERT, introduced in Sec. 3.3, are natural baselines; we compare their performance with and without NPCRF.

UFET (Choi et al., 2018) A multi-label linear classifier (MLC) with a backbone using BiLSTM, GloVe, and CharCNN to encode the mention.

LDET (Onoe and Durrett, 2019) An MLC with Bert-base-uncased and ELMo (Peters et al., 2018), trained on 727k examples automatically denoised from the distantly labeled UFET data.

LABELGCN (Xiong et al., 2019) An MLC with a BiLSTM and multi-head self-attention to encode the mention and context, and a GCN running on a fixed type co-occurrence graph to obtain better type embeddings. Type scores are dot-products of mention and type embeddings. For fair comparison, we replace their mention encoder with RoBERTa-large. To our knowledge, LabelGCN cannot be directly combined with prompt learning.

BOX4TYPE (Onoe et al., 2021) Uses Bert-large as the backbone and projects mentions and types into box space for training and inference; trained on the same 727k examples (Onoe and Durrett, 2019).

LRN (Liu et al., 2021) Generates types with Bert-base and an LSTM decoder in a seq2seq manner, and uses the 2k manually labeled examples for training.

MLMET (Dai et al., 2021) A multi-label linear classifier using Bert-base, first pretrained on the distantly labeled data augmented by masked word prediction, then finetuned and self-trained on the 2k human-annotated examples.

DFET (Pan et al., 2022) A 3-round automatic denoising method for the 2k manually labeled examples, using PL-BERT as the backbone.

LITE (Li et al., 2022) The previous SOTA system, which formulates entity typing as natural language inference: it treats x as the premise and types as hypotheses, concatenates them, and feeds them into RoBERTa-large to score types. LITE better models the correlation between the input and types and has strong zero-shot performance, but it needs to concatenate x with all N types, resulting in very slow inference. LITE is pretrained on MNLI (Williams et al., 2018) and trained on the 2k annotated examples; its authors find that performance drops when using distantly labeled data.

UFET Result
As shown in Table 3, our NPCRF can be integrated with various entity typing backbones and enhance their performance to SOTA on UFET.
MLC-ROBERTA W/ NPCRF improves the basic MLC-ROBERTA backbone by +3.2 F1 and achieves the best performance among models using the MLC architecture, except for MLMET, which uses millions of distantly labeled examples. LABELGCN (Xiong et al., 2019), which utilizes fixed type co-occurrence information to obtain type embeddings, is still effective with the replaced RoBERTa encoder and improves F1 by +1.7 over MLC-ROBERTA, while our method produces a further +1.8 F1 improvement. This is because NPCRF models not only co-occurrence (in Θ_11) but also semantic co-absence and co-exclusion in its pairwise potentials, and it explicitly updates these potentials during training rather than relying on fixed type co-occurrence statistics.
Prompt-learning based backbones (Ding et al., 2021a; Pan et al., 2022) such as PL-BERT are already strong on UFET, and our method further improves PL-BERT-BASE by +1.9 F1 to reach 49.7 F1, which is slightly better than MLMET and DFET and is the SOTA performance for Bert-base models, on par with the performance of Bert-large models. Also worth noting is that our method is trained in a simpler single-round end-to-end manner compared with other competitors requiring multi-round training (Dai et al., 2021) and denoising (Pan et al., 2022) procedures. For prompt learning powered by Bert-large models, NPCRF boosts the performance of PL-BERT-LARGE by +1.0 F1, and results in performance on par with the previous SOTA system LITE+NLI+L with much faster inference speed (discussed in Sec. 5.5).
Note that the improvement is smaller on large models, because large models are stronger at inferring types from limited and noisy contexts even without pairwise potentials. We also observe that models with NPCRF tend to predict more types and therefore have higher recall and sometimes lower precision.
This is possibly because NPCRF can use type dependencies to infer additional types that are not directly supported by the input. We discuss this in detail in Sec. 5.4.

FET Result
We present the performance of NPCRF on the augmented OntoNotes dataset in Table 4. The results show that MLC-ROBERTA W/ NPCRF outperforms MLC-ROBERTA by +1.5 macro-F1 and +1.7 micro-F1, and reaches competitive performance against DFET, a recent SOTA system focusing on denoising data. In general, we find our method still effective, but the improvement is less significant than on UFET, especially for PL-BERT. One possible reason is that the average number of types per instance in OntoNotes is 1.5, so type-type relationships are less important. Another possible reason is that some type semantics are already captured by the prompt-based method through the verbalizer, so our method fails to further boost the performance of PL-BERT.

Table 6: Performance on coarser granularity.

Ablation Study
We show the ablation study in Table 5. It can be seen that: (1) Performance drops when we randomly initialize the label embedding E (denoted by -w/o GloVe), which indicates that GloVe embeddings contain useful type semantics and help build pairwise type correlations.
(2) Performance drops when we remove the hidden layer and the tanh nonlinearity in the FFNs (Eq. 1), showing the benefit of an expressive parameterization.
(3) When we remove the FFN layers entirely and use E directly to parameterize the four matrices in Eq. 2, model training fails to converge, showing that a reasonable parameterization is important.

Performance on Coarser Granularity
We evaluate the performance of NPCRF on type sets with coarser granularity. UFET (Choi et al., 2018) splits its types into coarse-grained, fine-grained, and ultra-fine-grained types. We discard the ultra-fine-grained types to create a fine-grained setting (130 types), and further discard the fine-grained types to create a coarse-grained setting (9 types). As shown in Table 6, NPCRF still has a positive effect in these two settings, while LABELGCN does not.
Figure 2: Two MFVI cases. We show, from top to bottom, how the type probabilities q_t^1 change with iteration given the mention and its context. The reddish grids (probability > 0.5) indicate the chosen types, and gold types are colored red on the x-axis. The example mentions (spans marked in the figure) are: (a) "The culture succeeded the Daxi culture and reached southern Shaanxi, northern Jiangxi and southwest Henan."; (b) "Left fielder Carl Crawford was removed from that day's game with soreness in his shoulder..."; (c) "…he will be the first Democrat since Franklin D. Roosevelt to be elected to a second full term."

Model Performance at Each Iteration
We evaluate the per-iteration performance of PL-BERT W/ NPCRF on the test set and show the results in Table 7. We obtain the prediction y^p_t at each iteration t by binarizing q_t for each instance: y^p_t = {y_j | q_t^1[j] > 0.5, y_j ∈ Y}. The results show that the model prediction at iteration 0 (i.e., based solely on unary scores) has high recall and very low precision, while NPCRF keeps correcting wrong candidates over consecutive iterations to reach a higher F1 score. We show some concrete cases in Sec. 5.4.

MFVI Iterations
We show the per-iteration predictions of PL-BERT W/ NPCRF for two inputs in Figure 2. As can be seen, NPCRF tends to delete wrong types through the iterations, such as "history" and "nation" in Fig. 2(a) and "pitcher" and "soccer_player" in Fig. 2(b). This results in higher precision (as shown in Table 7). However, we find NPCRF is also capable of increasing the probabilities of some types, e.g., "ballplayer" in the second case. NPCRF may also erroneously discard gold types, resulting in lower recall. As shown in Fig. 2(c), NPCRF wrongly deletes gold types such as "adult" and "man", while it correctly predicts "president", which is not annotated as a gold label, and increases the score of "campaigner".
Pairwise Potentials We show the four learned pairwise potentials in Appendix A.1.

Efficiency
We compare the training and inference efficiency of different methods on the UFET dataset in Table 8. We run all these methods on one Tesla V100 GPU three times and report the average speed (number of sentences per second) during training and inference. Except for LRN, which is based on Bert-base-uncased, all the other methods are based on RoBERTa-large. The results show that NPCRF (4 iterations, using the best hyper-parameters) is the fastest method for modeling type correlations (compared with LABELGCN and LRN): it slows down training by only 15.7% and inference by only 13.8%. NPCRF is much faster in inference than LITE, which is based on the cross-encoder architecture.

Conclusion and Future Work
We propose NPCRF, a method that efficiently models type correlations for ultra-fine and fine-grained entity typing and is applicable to various entity typing backbones. In NPCRF, the unary potentials are formulated as the type logits of modern UFET backbones, and the pairwise potentials are derived from type phrase representations that both capture prior semantic information and facilitate accelerated inference. We unfold mean-field variational inference of NPCRF as a neural network for end-to-end training and inference. We find that our method consistently outperforms its backbones and reaches competitive performance against very recent baselines on UFET and FET. NPCRF is efficient and requires little additional computation cost. For future work, modeling higher-order label correlations, injecting prior knowledge into pairwise potentials, and extending NPCRF to other tasks are worth exploring.

Limitations
As shown in the experiments, the main limitation of NPCRF is that it has less positive effect on tasks that do not require understanding type correlations (e.g., tasks with small label sets and a low number of gold labels per instance) and on models that already model label semantics quite well (e.g., prompt-based methods). Another limitation is that, although NPCRF can be combined with many backbones, some backbones cannot directly use it, such as models that generate types one by one in an autoregressive way (e.g., LRN) and models that cannot efficiently compute label logits (e.g., LITE).

Acknowledgement
This work was supported by the National Natural Science Foundation of China (61976139) and by Alibaba Group through Alibaba Innovative Research Program.

Figure 1 :
Figure 1: (a) Our pairwise CRF (PCRF) for ultra-fine entity typing. X is the entity mention and its context. (b) The neural network architecture corresponding to the Neural-PCRF.

Table 2 :
|Y| denotes the size of the type set, and avg(|y^g_i|) denotes the average number of gold types per instance.

Table 3 :
Macro-averaged UFET results. LITE+L is LITE without NLI pretraining, LITE+L+NLI is the full LITE model, and LABELGCN-ROBERTA denotes our implementation of LabelGCN with RoBERTa-large as the mention encoder. Methods marked with † use either distantly labeled training data or additional pretraining tasks, and the ✓ marker denotes methods modeling label correlations.

Table 4 :
Results on the augmented OntoNotes datasets. MA and MI are abbreviations of macro and micro.

Table 7 :
Model performance at each iteration; avg(|y^p_i|) denotes the average number of predicted types.

Table 8 :
Comparison of training and inference speed of different methods in Table 3.