Calibrated Seq2seq Models for Efficient and Generalizable Ultra-fine Entity Typing

Ultra-fine entity typing plays a crucial role in information extraction by predicting fine-grained semantic types for entity mentions in text. However, this task poses significant challenges due to the massive number of entity types in the output space. The current state-of-the-art approaches, based on standard multi-label classifiers or cross-encoder models, suffer from poor generalization performance or inefficient inference. In this paper, we present CASENT, a seq2seq model designed for ultra-fine entity typing that predicts ultra-fine types with calibrated confidence scores. Our model takes an entity mention as input and employs constrained beam search to generate multiple types autoregressively. The raw sequence probabilities associated with the predicted types are then transformed into confidence scores using a novel calibration method. We conduct extensive experiments on the UFET dataset which contains over 10k types. Our method outperforms the previous state-of-the-art in terms of F1 score and calibration error, while achieving an inference speedup of over 50 times. Additionally, we demonstrate the generalization capabilities of our model by evaluating it in zero-shot and few-shot settings on five specialized domain entity typing datasets that are unseen during training. Remarkably, our model outperforms large language models with 10 times more parameters in the zero-shot setting, and when fine-tuned on 50 examples, it significantly outperforms ChatGPT on all datasets. Our code, models and demo are available at https://github.com/yanlinf/CASENT.


Introduction
Classifying entities mentioned in text into types, commonly known as entity typing, is a fundamental problem in information extraction. Earlier research on entity typing focused on relatively small
type inventories (Ling and Weld, 2012), which imposed severe limitations on the practical value of such systems, given the vast number of types in the real world. For example, WikiData, currently the largest knowledge base in the world, records more than 2.7 million entity types. As a result, a fully supervised approach will always be hampered by insufficient training data. Recently, Choi et al. (2018) introduced the task of ultra-fine entity typing (UFET), a multi-label entity classification task with over 10k fine-grained types. In this work, we make the first step towards building an efficient general-purpose entity typing model by leveraging the UFET dataset. Our model not only achieves state-of-the-art performance on UFET but also generalizes outside of the UFET type vocabulary.

Ultra-fine entity typing can be viewed as a multi-label classification problem over an extensive label space. A standard approach to this task employs multi-label classifiers that map contextual representations of the input entity mention to scores using a linear transformation (Choi et al., 2018; Dai et al., 2021; Onoe et al., 2021). While this approach offers superior inference speed, it ignores type semantics by treating all types as integer indices and thus fails to generalize to unseen types. The current state-of-the-art approach (Li et al., 2022) reformulated entity typing as a textual entailment task. They presented a cross-encoder model that computes an entailment score between the entity mention and a candidate type. Despite its strong generalization capabilities, this approach is inefficient given the need to enumerate all 10k types in the UFET dataset.
Black-box large language models, such as GPT-3 and ChatGPT, have demonstrated impressive zero-shot and few-shot capabilities in a wide range of generation and understanding tasks (Brown et al., 2020; Ouyang et al., 2022). Yet, applying them to ultra-fine entity typing poses challenges due to the extensive label space and the context length limit of these models. For instance, Zhan et al. (2023) reported that GPT-3 with few-shot prompting does not perform well on a classification task with thousands of classes. We have made similar observations in our experiments on UFET.
In this work, we propose CASENT, a Calibrated Seq2seq model for Entity Typing. CASENT predicts ultra-fine entity types with calibrated confidence scores using a seq2seq model (T5-large (Raffel et al., 2020)). Our approach offers several advantages over previous methods: (1) standard maximum likelihood training without the need for negative sampling or sophisticated loss functions; (2) efficient inference through a single autoregressive decoding pass; (3) calibrated confidence scores that align with the expected accuracy of the predictions; (4) strong generalization to unseen domains and types. An illustration of our approach is provided in Figure 2.
While the seq2seq formulation has been successfully applied to NLP tasks such as entity linking (De Cao et al., 2020, 2022), its application to ultra-fine entity typing remains non-trivial due to the multi-label prediction requirement. A simple adaptation would employ beam search to decode multiple types and use a probability threshold to select among them. However, we show that this approach fails to achieve optimal performance, as the raw conditional probabilities do not align with the true likelihood of the corresponding types. In this work, we propose to transform the raw probabilities into calibrated confidence scores that reflect the true likelihood of the decoded types. To this end, we extend Platt scaling (Platt et al., 1999), a standard technique for calibrating binary classifiers, to the multi-label setting. To mitigate the label sparsity issue in ultra-fine entity typing, we propose novel weight sharing and efficient approximation strategies. The ability to predict calibrated confidence scores not only improves task performance but also provides a flexible means of adjusting the trade-off between precision and recall in real-world scenarios. For instance, in applications requiring high precision, predictions with lower confidence scores can be discarded.
We carry out extensive experiments on the UFET dataset and show that filtering decoded types based on calibrated confidence scores leads to state-of-the-art performance. Our method surpasses previous methods in terms of both F1 score and calibration error while achieving an inference speedup of more than 50 times compared to cross-encoder methods. Furthermore, we evaluate the zero-shot and few-shot performance of our model on five specialized domains. Our model outperforms Flan-T5-XXL (Chung et al., 2022), an instruction-tuned large language model with 11 billion parameters, in the zero-shot setting, and surpasses ChatGPT when fine-tuned on 50 examples.

Fine-grained Entity Typing
Ling and Weld (2012) initiated efforts to recognize entities with labels beyond the small set of classes typically used in named entity recognition (NER). They proposed to formulate the task as a multi-label classification problem. More recently, Choi et al. (2018) extended this idea to ultra-fine entity typing and released the UFET dataset, expanding the task to an open type vocabulary with over 10k classes. Interest in ultra-fine entity typing has continued to grow over the last few years. Some research efforts have focused on modeling label dependencies and type hierarchies, for example by employing box embeddings (Onoe et al., 2021) or contrastive learning (Zuo et al., 2022). Another line of research has concentrated on data augmentation and leveraging distant supervision. For instance, Dai et al. (2021) obtained training data from a pretrained masked language model, while Zhang et al. (2022) proposed a denoising method based on an explicit noise model. Li et al. (2022) formulated the task as a natural language inference (NLI) problem with the hypothesis being an "is-a" statement. Their approach achieved state-of-the-art performance on the UFET dataset and exhibited strong generalization to unseen types, but it is inefficient at inference due to the need to enumerate the entire type vocabulary.

Probability Calibration
Probability calibration is the task of adjusting the confidence scores of a machine learning model to better align with the true correctness likelihood. Calibration is crucial for applications that require interpretability and reliability, such as medical diagnosis. Previous research has shown that modern neural networks, while achieving good task performance, are often poorly calibrated (Guo et al., 2017; Zhao et al., 2021). One common technique for calibration in binary classification tasks is Platt scaling (Platt et al., 1999), which fits a logistic regression model on the original probabilities. Guo et al. (2017) proposed temperature scaling as an extension of Platt scaling to the multi-class setting.
Although probability calibration has been extensively studied for single-label classification tasks (Jiang et al., 2020; Kadavath et al., 2022), it has rarely been explored in the context of fine-grained entity typing, which is a multi-label classification task. To the best of our knowledge, the only exception is Onoe et al. (2021), where the authors applied temperature scaling to a BERT-based model trained on the UFET dataset and demonstrated that the resulting model was reasonably well-calibrated.

Methodology
In this section, we present CASENT, a calibrated seq2seq model designed for ultra-fine entity typing. We start with the task description (§3.1), followed by an overview of the CASENT architecture (§3.2). While the focus of this paper is on entity typing, our model can be easily adapted to other multi-label classification tasks.

Task Definition
Given an entity mention e, we aim to predict a set of semantic types t = {t_1, ..., t_n} ⊂ T, where T is a predefined type vocabulary (|T| = 10331 for the UFET dataset). We assume each type in the vocabulary is a noun phrase that can be represented as a sequence of tokens t = (y_1, y_2, ..., y_k). We assume the availability of a training set D_train with annotated (e, t) pairs, as well as a development set for estimating hyperparameters.

Overview of CASENT
Figure 2 provides an overview of our system. It consists of a seq2seq model and a calibration module. At training time, we train the seq2seq model to output a ground truth type given an input entity mention by maximizing the length-normalized log-likelihood under an autoregressive formulation:

log p_θ(t | e) = (1/k) Σ_{i=1}^{k} log p_θ(y_i | y_{<i}, e)    (1)

where θ denotes the parameters of the seq2seq model.
During inference, our model takes an entity mention e as input and generates a small set of candidate types autoregressively via constrained beam search with a relatively large beam size. We then employ a calibration module to transform the raw conditional probabilities (Equation 1) associated with each candidate type into calibrated confidence scores p(t | e) ∈ [0, 1]. The candidate types whose scores surpass a global threshold are selected as the model's predictions.
The parameters of the calibration module and the threshold are estimated on the development set before each inference run (which takes place either at the end of each epoch or when training is complete). The detailed process of estimating the calibration parameters is discussed in §3.4.
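The threshold selection mentioned above can be sketched as a plain linear search over candidate thresholds on the development set. The data structures and the 0.01-step grid below are hypothetical, and this sketch uses micro-averaged F1 for brevity (the paper reports macro-averaged scores):

```python
def micro_f1(tp, fp, fn):
    """Micro-averaged F1 from aggregate true/false positive/negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def select_threshold(scores, gold):
    """Linear search for the global confidence threshold maximizing F1.

    scores: dict mention_id -> dict of {type: calibrated confidence}
    gold:   dict mention_id -> set of gold types
    """
    best_t, best_f1 = 0.5, -1.0
    for i in range(1, 100):
        t = i / 100
        tp = fp = fn = 0
        for mid, type_scores in scores.items():
            pred = {ty for ty, s in type_scores.items() if s >= t}
            tp += len(pred & gold[mid])
            fp += len(pred - gold[mid])
            fn += len(gold[mid] - pred)
        f1 = micro_f1(tp, fp, fn)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_t
```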

Training
Our seq2seq model is trained to output a type t given an input entity mention e. In the training set, each annotated example (e, t) ∈ D_train with |t| = n ground truth types is treated as n separate input-output pairs for the seq2seq model. We initialize our model with a pretrained seq2seq language model, T5 (Raffel et al., 2020), and finetune it with the standard maximum likelihood objective:

L(θ) = − Σ_{(e, t)} log p_θ(t | e)    (2)

where the sum runs over the expanded input-output pairs. Our seq2seq formulation greatly simplifies training by eliminating the need for negative sampling, which is required by previous cross-encoder approaches (Li et al., 2022; Dai et al., 2021).
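The per-type expansion and the length-normalized score of Equation 1 can be sketched in a few lines of pure Python (the mention string and type names are toy examples, not UFET data):

```python
import math

def expand_training_pairs(mention, types):
    """Turn one annotated example (e, {t_1, ..., t_n}) into n separate
    seq2seq input-output pairs, one per ground-truth type."""
    return [(mention, t) for t in sorted(types)]

def length_normalized_log_prob(token_log_probs):
    """Length-normalized log-likelihood of a decoded type (Equation 1):
    the mean of the per-token log-probabilities."""
    return sum(token_log_probs) / len(token_log_probs)
```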

Calibration
At the core of our approach is a calibration module that transforms the raw conditional log-probability log p_θ(t | e) into a calibrated confidence p(t | e). We will show in section 4 that directly thresholding p_θ(t | e) is suboptimal, as it models the distribution over target token sequences rather than the likelihood of e belonging to a certain type t. Our approach builds on Platt scaling (Platt et al., 1999) with three extensions specifically tailored to ultra-fine entity typing: 1) incorporating the model bias p_θ(t | ∅), 2) frequency-based weight sharing across types, and 3) efficient parameter estimation with sparse approximation.
Platt Scaling: We first consider calibration for each type t separately, in which case the task reduces to a binary classification problem. A standard technique for calibrating binary classifiers is Platt scaling, which fits a logistic regression model on the original outputs. A straightforward application of Platt scaling in our seq2seq setting computes the calibrated confidence score as σ(w_t · log p_θ(t | e) + b_t), where σ is the sigmoid function and the calibration parameters w_t and b_t are estimated on the development set by minimizing the binary cross-entropy loss.
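Platt scaling for a single type can be sketched as fitting σ(w · log p + b) by gradient descent on binary cross-entropy. The toy log-probabilities, labels, and optimizer settings below are illustrative, not the paper's implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(log_probs, labels, lr=0.5, steps=2000):
    """Fit sigma(w * log_prob + b) to binary labels by minimizing
    binary cross-entropy with plain gradient descent."""
    w, b = 1.0, 0.0
    n = len(labels)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(log_probs, labels):
            err = sigmoid(w * x + b) - y  # derivative of BCE w.r.t. the logit
            grad_w += err * x / n
            grad_b += err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy raw log-probabilities paired with whether the type was correct:
w, b = fit_platt([-0.1, -0.3, -2.0, -3.0], [1, 1, 0, 0])
calibrated = lambda x: sigmoid(w * x + b)
```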
Inspired by previous work (Zhao et al., 2021) that measures the bias of seq2seq models by feeding them empty inputs, we propose to learn a weighted combination of the conditional probability p_θ(t | e) and the model bias p_θ(t | ∅). Specifically, we take

σ( w^(1) · log p_θ(t | e) + w^(2) · log p_θ(t | ∅) + b )

as the calibrated confidence score. We will show in section 4 that incorporating the model bias term improves task performance and reduces calibration error.

Multi-label Platt Scaling: To extend this formulation to the multi-label setting where |T| ≫ 1, we share calibration parameters across types based on their occurrence frequency in the dataset:

p(t | e) = σ( w^(1)_{c(t)} · log p_θ(t | e) + w^(2)_{c(t)} · log p_θ(t | ∅) + b_{c(t)} )    (3)

where c(·) maps type t to its frequency category. Intuitively, rare types are more vulnerable to model bias and thus should be handled differently from frequent types. Furthermore, instead of training logistic regression models on all |D_dev| · |T| data points, we propose a sparse approximation strategy that only leverages the candidate types generated by the seq2seq model via beam search. This ensures that the entire calibration process retains the same time complexity as a regular evaluation run on the development set. The pseudocode for estimating the calibration parameters is outlined in Algorithm 1. Once the calibration parameters have been estimated, we select the optimal threshold via a simple linear search.
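The frequency-based grouping and the sparse collection of calibration data points can be sketched as follows. The category boundaries and the helper callables (`beam_search`, `log_p`, `log_p_bias`) are hypothetical stand-ins for the trained model, not the paper's actual interfaces:

```python
def frequency_category(count, boundaries=(10, 100)):
    """Map a type's training-set frequency to a coarse category index
    (e.g. rare / medium / frequent); the boundaries here are illustrative."""
    for i, limit in enumerate(boundaries):
        if count < limit:
            return i
    return len(boundaries)

def collect_calibration_data(dev_set, beam_search, log_p, log_p_bias, type_counts):
    """Sparse approximation: only the candidate types produced by beam
    search contribute calibration data points, grouped by frequency
    category. Each point pairs the features [log p(t|e), log p(t|empty)]
    with a binary label indicating whether t is a gold type."""
    grouped = {}
    for mention, gold_types in dev_set:
        for t in beam_search(mention):
            features = (log_p(t, mention), log_p_bias(t))
            label = 1 if t in gold_types else 0
            cat = frequency_category(type_counts.get(t, 0))
            grouped.setdefault(cat, []).append((features, label))
    return grouped
```

A separate logistic regression model would then be fitted on each group's data points to obtain the per-category parameters of Equation 3.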

Inference
At test time, given an entity mention e, we employ constrained beam search to generate a set of candidate types autoregressively. Following previous work (De Cao et al., 2020, 2022), we pre-compute a prefix trie based on T and force the model to select valid tokens at each decoding step. Next, we compute the calibrated confidence scores using Equation 3 and discard types whose scores fall below the threshold.
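A minimal prefix trie of this kind might look as follows. Token strings stand in for T5 subword ids, and `EOS` is a placeholder end marker, so this is a sketch of the idea rather than the paper's implementation:

```python
EOS = "</s>"  # placeholder end-of-sequence marker

class PrefixTrie:
    """Trie over the tokenized type vocabulary. During constrained beam
    search, the decoder may only emit tokens returned by allowed_tokens()
    for the current prefix, so every finished hypothesis is a valid type."""

    def __init__(self, token_sequences):
        self.root = {}
        for seq in token_sequences:
            node = self.root
            for tok in list(seq) + [EOS]:
                node = node.setdefault(tok, {})

    def allowed_tokens(self, prefix):
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node)
```

In the Hugging Face transformers library, a lookup like this is one way to back the `prefix_allowed_tokens_fn` callback of `generate()`, which restricts each decoding step to a caller-supplied token set.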
In section 4, we also conduct experiments on single-label entity typing tasks. In such cases, we directly score each valid type using Equation 3 and select the type with the highest confidence score.

Datasets
We use the UFET dataset (Choi et al., 2018), a standard benchmark for ultra-fine entity typing. The dataset contains 10331 entity types and is curated by sampling sentences from GigaWord (Parker et al., 2011), OntoNotes (Hovy et al., 2006) and web articles (Singh et al., 2012).
To test the out-of-domain generalization abilities of our model, we construct five entity typing datasets covering three specialized domains. We derive these from existing NER datasets: WNUT2017 (Derczynski et al., 2017), JNLPBA (Collier and Kim, 2004), BC5CDR (Wei et al., 2016), MIT-restaurant and MIT-movie. We treat each annotated entity mention span as an input to our entity typing model. WNUT2017 contains user-generated text from platforms such as Twitter and Reddit. JNLPBA and BC5CDR are both sourced from scientific papers in the biomedical field. MIT-restaurant and MIT-movie are customer review datasets from the restaurant and movie domains respectively. Table 1 provides the statistics and an example from each dataset.

Implementation
We initialize the seq2seq model with pretrained T5-large (Raffel et al., 2020) and finetune it on the UFET training set with a batch size of 8. We optimize the model using Adafactor (Shazeer and Stern, 2018) with a learning rate of 1e-5 and a constant learning rate schedule. Constrained beam search during calibration and inference uses a beam size of 24. We mark the entity mention span with a special token and format the input according to the template "{CONTEXT} </s> {ENTITY} is </s>". The input and the target entity type are tokenized using the standard T5 tokenizer.
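For concreteness, the template can be rendered as below. The identity of the special token marking the mention span inside the context is not specified here, so this sketch omits it:

```python
def format_input(context, entity):
    """Render the paper's input template "{CONTEXT} </s> {ENTITY} is </s>".
    (The special token marking the mention span within the context is
    omitted in this sketch.)"""
    return f"{context} </s> {entity} is </s>"
```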

Baselines
We compare our method with previous state-of-the-art approaches, including multi-label classifier-based methods such as BiLSTM (Choi et al., 2018), BERT, Box4Types (Onoe et al., 2021) and MLMET (Dai et al., 2021). In addition, we include a bi-encoder model, UniST (Huang et al., 2022), as well as the current state-of-the-art method, LITE (Li et al., 2022), which is based on a cross-encoder architecture.
We also compare with ChatGPT and Flan-T5-XXL (Chung et al., 2022), two large language models that have demonstrated impressive few-shot and zero-shot performance across various tasks. For the UFET dataset, we randomly select a small set of examples from the training set as demonstrations for each test instance. An instruction is provided before the demonstration examples to facilitate zero-shot evaluation. Furthermore, for the five cross-domain entity typing datasets, we supply ChatGPT and Flan-T5-XXL with the complete list of valid types. Sample prompts are shown in Appendix A.

UFET
In Table 2, we compare our approach with a suite of baselines and state-of-the-art systems on the UFET dataset. Our approach outperforms LITE (Li et al., 2022), the current leading system based on a cross-encoder architecture, by 0.7% in F1 score.
ChatGPT exhibits poor zero-shot performance with significantly low recall. However, it achieves performance comparable to a BERT-based classifier with a mere 8 few-shot examples. Despite this, its performance still lags behind recent fully supervised models.

Out-of-domain Generalization
We evaluate the out-of-domain generalization performance of different models on the five datasets discussed in §4.1. The results are presented in Table 3. Note that we do not compare with multi-label classifier models such as Box4Types and MLMET that treat types as integer indices, as they are unable to generalize to unseen types.
In the zero-shot setting, LITE and CASENT are trained on the UFET dataset and directly evaluated on the target test sets. Flan-T5-XXL and ChatGPT are evaluated by formulating the task as a classification problem with all valid types as candidates. As shown in Table 3, ChatGPT outperforms the other models by a large margin, highlighting its capabilities on classification tasks with a small label space. Our approach achieves results comparable to LITE and significantly outperforms Flan-T5-XXL, despite having less than 10% of its parameters.
We also conduct experiments in the few-shot setting, where either a small training set or a development set is available. We first explore re-estimating the calibration parameters of CASENT on the target development set by following the process discussed in §3.4 without weight sharing or sparse approximation (the number of calibration parameters is 3|T|, which is less than 40 on all five datasets). Remarkably, this re-calibration process, without any finetuning, yields an absolute improvement of +9.9% and performance comparable to ChatGPT on three out of five datasets. When finetuned on 50 randomly sampled examples, our approach outperforms ChatGPT and a finetuned RoBERTa model by a significant margin, highlighting the benefits of transfer learning from the ultra-fine entity typing task.


Analysis

Calibration
Table 4 presents the calibration error of different approaches. We report Expected Calibration Error (ECE) and Total Calibration Error (TCE), which measure the deviation of predicted confidence scores from empirical accuracy. Interestingly, we observe that the entailment scores produced by LITE, the state-of-the-art cross-encoder model, are poorly calibrated. Our approach achieves a slightly lower calibration error than Box4Types, which applies temperature scaling (Guo et al., 2017) to the output of a BERT-based classifier. Figure 3 shows reliability diagrams of CASENT for both rare and frequent types; as illustrated by the curve in the left diagram, high-confidence predictions for rare types are less well-calibrated.
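ECE with 10 equal-width bins can be computed as below. This is a standard formulation assuming binary correctness labels per prediction; the paper's exact TCE definition is not reproduced here:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then take the weighted average
    of |mean confidence - empirical accuracy| over the bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, y in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, y))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += len(b) / total * abs(mean_conf - accuracy)
    return ece
```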

Ablation Study
We also perform an ablation study to investigate the impact of various design choices in our proposed calibration method. Table 4 shows the results for different variants of CASENT. A vanilla seq2seq model without any calibration yields both low task performance and high calibration error, highlighting the importance of calibration. Notably, a naive extension of Platt scaling that considers each type independently leads to significant overfitting, illustrated by an absolute difference of 47.28% TCE between the development and test sets. Removing the model bias term also has a negative impact on both task performance and calibration error.

Choice of Seq2seq Model
In Table 5, we demonstrate the impact of calibration on various T5 variants. Our proposed calibration method consistently brings improvements across models ranging from 80M to 3B parameters. The most substantial improvement is achieved with the smallest T5 model.

Training and Inference Efficiency
In Table 6, we compare the efficiency of our method with previous state-of-the-art systems. Remarkably, CASENT takes only 6 hours to train on a single GPU, while previous methods require more than 40 hours. While CASENT achieves an inference speedup of over 50 times over LITE, it is still considerably slower than MLMET, a BERT-based classifier model. This can be attributed to the need for autoregressive decoding in CASENT.

Impact of Beam Size
Given that the inference process of CASENT relies on constrained beam search, we also investigate the impact of beam size on task performance and calibration error. As shown in Figure 4, a beam size of 4 results in a low calibration error but also low F1 scores, as it limits the maximum number of predictions. CASENT consistently maintains high F1 scores with minor fluctuations for beam sizes ranging from 8 to 40. On the other hand, a beam size between 8 and 12 leads to high calibration errors. This can be attributed to our calibration parameter estimation process in Algorithm 1, which approximates the full set of |D_dev| · |T| calibration data points using model predictions generated by beam search. A smaller beam size produces fewer calibration data points, resulting in a suboptimal estimation of the calibration parameters.

Conclusion
Engineering decisions often involve a tradeoff between efficiency and accuracy. CASENT improves upon the state-of-the-art in both dimensions simultaneously while also being conceptually elegant. At the heart of this approach is constrained beam search combined with a novel probability calibration method designed for seq2seq models in the multi-label classification setting. Not only does this method outperform previous methods, including ChatGPT and existing fully supervised methods, on ultra-fine entity typing, but it also exhibits strong generalization to unseen domains.

Limitations
While our proposed CASENT model shows promising results on ultra-fine entity typing tasks, it does have certain limitations. Our experiments were conducted exclusively on English-language data, and it remains unclear how well our model would perform on data in other languages. In addition, our model is trained on the UFET dataset, which only includes entity mentions identified as noun phrases by a constituency parser. Consequently, certain types of entity mentions, such as song titles, are excluded, and the performance and applicability of our model might be affected when dealing with them. Future work is needed to adapt and evaluate the proposed approach in other languages and broader scenarios.
An example prediction of our model is shown in Figure 1.

Figure 2 :
Figure 2: Overview of the training and inference process of CASENT. We present an example output from our model.

Figure 3 :
Figure 3: Reliability diagrams of CASENT on the UFET test set. The left diagram represents rare types with fewer than 10 occurrences, while the right diagram represents frequent types.

Figure 4 :
Figure 4: Test set Macro F1 score and Expected Calibration Error (ECE) with respect to the beam size on the UFET dataset.
Algorithm 1 :
Pseudocode for estimating the calibration parameters. For each mention e in D_dev, candidate types t are generated via constrained beam search; for each candidate, the feature vector X = [log p_θ(t|e), log p_θ(t|∅)] and a binary label (whether t is a gold type of e) are collected, and a logistic regression model is then fitted for each frequency group.

Table 1 :
Dataset statistics and examples. Only UFET has multiple types for each entity mention.

Table 2 :
Macro-averaged precision, recall and F1 score (%) on the UFET test set. The model with the highest F1 score is shown in bold and the second best is underlined.

Table 3 :
Test set accuracy on five specialized-domain entity typing datasets derived from existing NER datasets. The best score is shown in bold and the second best is underlined. The results of LITE are obtained by running inference with the model checkpoint provided by the authors.

Table 4 :
Macro F1, ECE (Expected Calibration Error) and TCE (Total Calibration Error) on the UFET dataset. ECE and TCE are computed using 10 bins. The best score is shown in bold. Onoe et al. (2021) only reported calibration results on the dev set, thus the results of Box4Types on the test set are not included.

Table 5 :
Macro F1 score (%) of CASENT on the UFET test set with different T5 variants.


Table 6 :
Training time, inference latency and inference-time GPU memory usage estimated on a single NVIDIA RTX A6000 GPU. Inference time statistics are estimated using 100 random UFET examples. Results marked by † are reported by Li et al. (2022).