Recall, Expand, and Multi-Candidate Cross-Encode: Fast and Accurate Ultra-Fine Entity Typing

Ultra-fine entity typing (UFET) predicts extremely free-form types (e.g., president, politician) of a given entity mention (e.g., Joe Biden) in context. State-of-the-art (SOTA) methods use a cross-encoder (CE) based architecture: CE concatenates a mention (and its context) with each type and feeds the pair into a pretrained language model (PLM) to score their relevance. This brings deeper interaction between the mention and the type and thus better performance, but it has to perform N (the type set size) forward passes to infer all the types of a single mention. CE is therefore very slow in inference when the type set is large (e.g., N = 10k for UFET). To this end, we propose to perform entity typing in a recall-expand-filter manner. The recall and expansion stages prune the large type set and generate K (typically much smaller than N) most relevant type candidates for each mention. At the filter stage, we use a novel model called MCCE to concurrently encode and score all K candidates in only one forward pass to obtain the final type prediction. We investigate different model options for each stage and conduct extensive experiments to compare them; experiments show that our method reaches SOTA performance on UFET and is thousands of times faster than the CE-based architecture. We also find our method very effective in fine-grained (130 types) and coarse-grained (9 types) entity typing. Our code is available at http://github.com/modelscope/AdaSeq/tree/master/examples/MCCE.


Introduction
Ultra-fine entity typing (UFET) (Choi et al., 2018) aims to predict extremely fine-grained types (e.g., president, politician) of a given entity mention within its context. It provides detailed semantic understanding of entity mentions and is a fundamental step in fine-grained named entity recognition (Ling and Weld, 2012). It can also be utilized to assist various downstream tasks such as relation extraction (Han et al., 2018), keyword extraction (Huang et al., 2020), and content recommendation (Upadhyay et al., 2021). Most recently, the cross-encoder (CE) based method (Li et al., 2022) achieves SOTA performance on UFET. Specifically, Li et al. (2022) propose to treat the mention with its context as a premise and each ultra-fine-grained type as a hypothesis. They then concatenate them as input and feed it into a pretrained language model (PLM) (e.g., RoBERTa (Liu et al., 2019)) to score the entailment of the mention-type pair, as illustrated in Figure 1(b).
Compared with the traditional multi-label classification method (shown in Figure 1(a)), which simultaneously scores all types using the same mention representation, CE has the advantage of incorporating type semantics into the encoding and inference process (by taking the words of type labels as input) and enabling deeper interaction between each type and the mention via cross-encoding. However, the CE-based method is slow in inference because it has to enumerate all the types (up to 10k in UFET) and score the entailment of each of them given the mention as a premise. There is also no direct interaction between types in CE for modeling type correlations (e.g., one must be a person if he or she is categorized as a politician), which have been proven useful in previous work (Xiong et al., 2019; Jiang et al., 2022).
To this end, we propose a recall-expand-filter paradigm (illustrated in Figure 2) for faster and more accurate ultra-fine entity typing. As the name suggests, we first train a multi-label classification (MLC) model to efficiently recall top candidate types, reducing the number of potential types from 10k to hundreds. As the MLC model recalls candidates based on representations learned from the training data, it may fail to recall candidates that are scarce or unseen in the training set. Consequently, we apply a type candidate expansion step that utilizes lexical information and weak supervision from masked language models (Dai et al., 2021) to improve the recall rate of the candidate set. Finally, we propose a novel model called the multi-candidate cross-encoder (MCCE) to concurrently encode and filter the expanded type candidate set. Different from CE, MCCE concatenates all the recalled type candidates with the mention and its context. The concatenated input is then fed into a PLM to obtain candidate representations and candidate scores. MCCE allows us to simultaneously encode and infer all the types in the candidate set and is thus much faster than the CE-based method, while preserving the advantages of CE in modeling interactions between the types and the mention. Concatenating all the candidates also enables MCCE to implicitly learn correlations between types. The advantages of MCCE over existing methods are summarized in Figure 3.
Experiments on two UFET datasets show that our recall-expand-filter paradigm reaches SOTA performance and MCCE is thousands of times faster than the previous SOTA CE-based method. We also comprehensively investigate the performance and efficiency of MCCE with different input formats and attention mechanisms. We find MCCE is effective in fine-grained (130 types) and coarse-grained (9 types) entity typing. Our code is available at http://github.com/modelscope/AdaSeq/tree/master/examples/MCCE.

Problem Definition
Given an entity mention m within its context sentence c, ultra-fine entity typing (UFET) aims to predict its correct types $y_g \subset \mathcal{Y}$ ($|\mathcal{Y}|$ can be larger than 10k). As $|y_g| > 1$ in most cases, UFET is a multi-label classification problem. We show statistics of two UFET datasets, UFET (Choi et al., 2018) and CFET (Lee et al., 2020), in Table 1.

Multi-label Classification Model for UFET
Multi-label classification (MLC) models are widely adopted as backbones for UFET (Choi et al., 2018; Onoe and Durrett, 2019; Onoe et al., 2021). They use an encoder to obtain the mention representation and a decoder (e.g., an MLP) to score all types simultaneously. Figure 1(a) shows a representative MLC model adopted by recent methods (Dai et al., 2021; Jiang et al., 2022). The contextualized mention representation $h_{cls}$ is obtained by feeding c and m into a pretrained language model (PLM) and taking the last hidden state of [CLS]. The mention representation is then fed into an MLP layer to concurrently obtain all type scores $s_1, \dots, s_N$. MLC Inference Types with a probability higher than a threshold $\tau$ are predicted: $y_p = \{y_j \mid \sigma(s_j) > \tau, 1 \le j \le N\}$, where $\sigma$ is the sigmoid function. $\tau$ is tuned on the development set.
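A minimal PyTorch sketch of this MLC architecture (a single linear layer stands in for the MLP; names and hyper-parameters are illustrative, not from the paper's released code):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MLCTyper(nn.Module):
    """Multi-label classifier: [CLS] representation -> scores for all N types."""
    def __init__(self, plm_name: str, num_types: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(plm_name)
        self.scorer = nn.Linear(self.encoder.config.hidden_size, num_types)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h_cls = out.last_hidden_state[:, 0]   # [CLS] hidden state
        return self.scorer(h_cls)             # logits s_1 ... s_N

# Inference: predict every type whose probability exceeds a threshold tau.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = MLCTyper("roberta-large", num_types=10331)  # 10331 = UFET type set size
enc = tokenizer("He is the son of ...", return_tensors="pt")
probs = torch.sigmoid(model(enc.input_ids, enc.attention_mask))
pred_ids = (probs[0] > 0.5).nonzero(as_tuple=True)[0]  # tau = 0.5 here; tuned on dev in practice
```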

MLC Training
The binary cross-entropy (BCE) loss over the predicted label probabilities and the gold types is used to train the MLC model. MLC is very efficient in inference. However, the interaction between the mention and the types in MLC is weak, and the correlations between types are ignored (Onoe et al., 2021; Xiong et al., 2019; Jiang et al., 2022). In addition, MLC has difficulty integrating type semantics (Li et al., 2022).
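Written out (a standard BCE formulation; the paper does not display the equation), with $t_j = \mathbb{1}[y_j \in y_g]$:

$$\mathcal{L}_{\mathrm{BCE}} = -\sum_{j=1}^{N}\Big[t_j \log \sigma(s_j) + (1-t_j)\log\big(1-\sigma(s_j)\big)\Big]$$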

Vanilla Cross-Encoder
The cross-encoder (CE) (Li et al., 2022) concatenates the mention with its context and a single type as the input to a PLM and scores the pair. The concatenation allows deeper interaction between the mention, context, and type (via the multi-head self-attention in PLMs), and also incorporates type semantics.
CE Inference Similar to MLC, types whose probability exceeds a threshold are predicted. To compute the probabilities, CE requires N forward passes to infer the types of a single mention, so its inference is very slow when N is large.
CE Training CE is typically trained with the marginal ranking loss (Li et al., 2022). A positive type $y_+ \in y_g$ and a negative type $y_- \notin y_g$ are sampled from $\mathcal{Y}$ for each training sample (m, c). The loss is computed as:

$$\mathcal{L} = \max(0, \delta - s_+ + s_-)$$

where $s_+$, $s_-$ are the scores of the sampled positive and negative types, and $\delta$ is a margin tuned on the development set that determines how far positive and negative samples should be separated.

Method
Inspired by techniques in information retrieval (Larson, 2010) and entity linking (Ledell et al., 2020), we decompose UFET inference into three stages, as illustrated in Figure 2: (1) a recall stage that reduces the number of type candidates (e.g., from N = 10k to K = 100) while maintaining a good recall rate using an efficient MLC model; (2) an expansion stage that improves the recall rate by incorporating lexical information through exact matching and weak supervision (Dai et al., 2021) from large pretrained language models such as BERT-Large (Devlin et al., 2019); (3) a filter stage that filters the expanded type candidates to obtain the final prediction. For the filter stage, we propose an efficient model, the Multi-Candidate Cross-Encoder (MCCE), which concurrently encodes and filters the type candidates of a given mention in a single forward pass.

Recall Stage
To prune the type candidate set, we train the MLC model introduced in Sec. 2.2 on the training set and tune it based on the recall rate (e.g., recall@64) on the development set. We then use it to infer the top $K_1$ (typically less than 256) candidates $C_R$ for each data point (m, c). We find that MLC significantly outperforms BM25 (Robertson and Zaragoza, 2009) as a recall model (see Sec. 5.1.1).
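Given a trained MLC model, recall is a single forward pass plus a top-K selection. A sketch (`mlc_model` is assumed to behave like the MLC sketch above):

```python
import torch

@torch.no_grad()
def recall_candidates(mlc_model, input_ids, attention_mask, k1: int = 128):
    """Return the indices of the K1 highest-scoring types as the candidate set C_R."""
    scores = mlc_model(input_ids, attention_mask)   # (batch, N) logits
    topk = torch.topk(scores, k=k1, dim=-1)
    return topk.indices                              # (batch, K1) candidate type ids
```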

Expansion Stage
In UFET, the amount of training data per type is small, especially for fine-grained and ultra-fine-grained types: 30% of the types in the development set of the UFET dataset are unseen during training. Consequently, we find that the MLC model used in the recall stage easily overfits the training set and has difficulty predicting types that only appear in the development or test set. Therefore, we utilize two methods, exact match and masked language models (MLM), to expand the recalled candidates. Both are able to recall unseen type candidates without any training.
Exact Match MLC recalls candidates using dense representations, which are known to be weak at identifying and utilizing lexical matching information between the input and the types (Tran et al., 2019; Khattab and Zaharia, 2020). However, types in UFET are extremely fine-grained (e.g., son, child) and are very likely to appear verbatim in the context or mention (e.g., mention "He" in context "He is the son and child of ..."). To this end, we first identify and normalize all nouns in the context and mention using NLTK, and then recall types that exactly match these nouns. We denote the recalled type set as $C_{EM}$.
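A sketch of this exact-match expansion with NLTK; the normalization details (lowercasing, noun lemmatization) are our assumptions, since the paper only says nouns are identified and normalized:

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger"); nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

def exact_match_candidates(mention: str, context: str, type_vocab: set) -> set:
    """Recall every type whose name exactly matches a (lemmatized) noun in the input."""
    tokens = nltk.word_tokenize(f"{mention} {context}".lower())
    tagged = nltk.pos_tag(tokens)
    nouns = {lemmatizer.lemmatize(w, pos="n") for w, t in tagged if t.startswith("NN")}
    return nouns & type_vocab  # C_EM: types that literally appear in the text

# exact_match_candidates("He", "He is the son and child of ...", {"son", "child", "person"})
# -> {"son", "child"}
```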
MLM Inspired by recent prompt-based methods for entity typing (Ding et al., 2021; Pan et al., 2022), we recall candidates by asking PLMs to fill masks in prompts. Suppose a type $y \in \mathcal{Y}$ is tokenized into l subwords $w_1, \dots, w_l$. To score y given m, c, we first formulate the input as in Figure 4, using 'such as' as the template to induce types. The input is then fed into BERT-large-uncased (obtained from https://huggingface.co) to obtain the probabilities of the subwords. The score of y is calculated as $s_{MLM} = \frac{1}{l}\sum_{n=1}^{l} \log p_n$, where $p_n$ denotes the probability of subword $w_n$ predicted by the PLM. We rank all types in descending order of their scores $s_{MLM}$.
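A sketch of this scoring with Hugging Face transformers. A type of l sub-words is scored by placing l [MASK] tokens after the 'such as' template; the exact prompt wording (shown in the paper's Figure 4) is paraphrased here and is an assumption:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-large-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-large-uncased").eval()

@torch.no_grad()
def mlm_type_score(mention: str, context: str, type_name: str) -> float:
    """s_MLM = mean log-probability of the type's sub-words at the masked positions."""
    subword_ids = tok(type_name, add_special_tokens=False).input_ids
    prompt = f"{context} {mention} such as " + " ".join([tok.mask_token] * len(subword_ids))
    enc = tok(prompt, return_tensors="pt")
    logits = mlm(**enc).logits.log_softmax(dim=-1)
    mask_pos = (enc.input_ids[0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
    scores = [logits[0, p, w] for p, w in zip(mask_pos, subword_ids)]
    return torch.stack(scores).mean().item()

# Rank all types in Y by mlm_type_score and expand with the top-ranked ones.
```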
We expand $K_2$ candidates with the following strategy: first, we add all candidates recalled by exact match ($C_{EM}$); we then fill the remaining slots with the top-ranked MLM candidates according to their scores. After expansion, we obtain $K = K_1 + K_2$ type candidates for each data point.

Filter Stage
The filter stage infers types from the candidate pool C generated by the recall and expansion stages. Let $K = |C| = K_1 + K_2$; K is typically less than 128. A trivial choice for the filter model is the CE model introduced in Sec. 2.3: we can score the candidates in C using CE with K forward passes, as introduced before. For training, the positive type $y_+$ and negative type $y_-$ are sampled from C instead of $\mathcal{Y}$ and then used to calculate the marginal ranking loss. As K is much smaller than $|\mathcal{Y}|$, inference with CE under our recall-expand-filter paradigm is much faster than with vanilla CE. However, it is still inefficient compared with the MLC model, which concurrently predicts the scores of all types in a single forward pass. For faster inference and training, we propose multi-candidate cross-encoders (MCCE) in the next section.

Multi-Candidate Cross-Encoder

Overview
As shown in Figure 5, compared with CE, which concatenates one candidate at a time, MCCE concatenates all the candidates in C with the mention and context. The concatenated input is then fed into the PLM to obtain the hidden state of each candidate. Finally, we apply an MLP over the hidden states to concurrently score all the candidates. MCCE uses only one forward pass to infer types from the candidates:
$$h_{1:K} = \mathrm{PLM}(m, c, y_{1:K}), \quad s_{1:K} = \mathrm{MLP}(h_{1:K})$$

where $y_{1:K}$ is short for $y_1, \dots, y_K \in C$, and similarly, $h_{1:K}$ and $s_{1:K}$ denote the hidden states and scores of all the candidates respectively. Similar to MLC training and inference discussed in Sec. 2.2, we use the binary cross-entropy loss as the training objective and tune a probability threshold on the development set for inference. We find that during training, all positive types are ranked very high in the candidate set produced by the first stage, which is not the case for the development and test data. To prevent the filter model from overfitting the order of training candidates and only learning to predict the highest-ranked candidates, we keep permuting the candidates during training.
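A minimal PyTorch sketch of this single-pass scoring for the variant where each candidate occupies one token (MCCE-S, described below); the input layout and the `cand_positions` bookkeeping are our assumptions, not the paper's released code:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MCCE(nn.Module):
    """Encodes [mention/context ; all K candidates] once and scores every candidate."""
    def __init__(self, plm_name: str):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(plm_name)
        self.scorer = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, cand_positions):
        # input_ids layout: [CLS] mention and context [SEP] cand_1 cand_2 ... cand_K [SEP]
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        batch_idx = torch.arange(h.size(0), device=h.device).unsqueeze(-1)
        h_cand = h[batch_idx, cand_positions]   # (batch, K, hidden): one state per candidate
        return self.scorer(h_cand).squeeze(-1)  # (batch, K): scores s_1 ... s_K in one pass
```

Training then applies BCE over these K scores, with the candidate order permuted across epochs as described above.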

Input Format of Candidates
We describe two kinds of candidate representations in this section.
Average of type sub-tokens We treat each possible type $y \in C$ as a new token u and add it to the vocabulary of the PLM. The static embedding (the layer-0 embedding of the PLM) of u is initialized with the average static embedding of all the sub-tokens in type y. The advantages of this method are: (1) compressing types into single tokens allows us to consider more candidates; (2) types in UFET are tokenized into only 2.1 sub-tokens on average (by RoBERTa's tokenizer), so averaging sub-token embeddings does not lose much of their semantics.
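A sketch of this initialization with transformers (the token naming scheme is hypothetical):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large")

def add_type_token(type_name: str) -> str:
    """Add one new token for a type; init its embedding as the mean of the type's sub-token embeddings."""
    sub_ids = tok(type_name, add_special_tokens=False).input_ids
    new_token = f"[TYPE_{type_name.replace(' ', '_')}]"  # hypothetical naming scheme
    tok.add_tokens([new_token])
    model.resize_token_embeddings(len(tok))
    emb = model.get_input_embeddings().weight            # layer-0 static embeddings
    with torch.no_grad():
        emb[-1] = emb[sub_ids].mean(dim=0)               # new row = average of sub-token rows
    return new_token
```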
Fixed-size sub-token block To preserve more type semantics, we represent each candidate type by its sub-tokens. We pad or truncate the sub-tokens to a fixed-size block to facilitate a parallel implementation of the attention mechanisms introduced next. We use the PLM hidden state of the first sub-token in each block as the output representation of the candidate.

Attention in MCCE
There are four kinds of attention in MCCE, as shown in Figure 6: sentence-to-sentence (S2S), sentence-to-candidate (S2C), candidate-to-sentence (C2S), and candidate-to-candidate (C2C). Since we score candidates based on the mention and its context, the attention from candidates to the sentence (C2S) is necessary. On the other hand, the C2C, S2S, and S2C attention are optional. We empirically find that S2C is important, S2S is useful, and C2C is only useful in some settings (see Sec. 6).
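One way to realize these choices is a custom 2-D attention mask over the concatenated input. The sketch below builds a boolean mask (True = may attend) with C2C disabled, under the single-token-per-candidate layout; how the mask is consumed (e.g., converted to an additive bias inside the PLM) is an implementation detail we gloss over:

```python
import torch

def build_attention_mask(len_sent: int, num_cand: int, c2c: bool = False) -> torch.Tensor:
    """Rows = queries, columns = keys, over [sentence tokens ; candidate tokens]."""
    L = len_sent + num_cand
    mask = torch.zeros(L, L, dtype=torch.bool)
    mask[:len_sent, :len_sent] = True      # S2S: sentence attends to itself
    mask[:len_sent, len_sent:] = True      # S2C: sentence attends to candidates
    mask[len_sent:, :len_sent] = True      # C2S: candidates attend to the sentence (required)
    if c2c:
        mask[len_sent:, len_sent:] = True  # C2C: candidates attend to each other
    else:
        idx = torch.arange(len_sent, L)
        mask[idx, idx] = True              # each candidate still attends to itself
    return mask
```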

Experiments
We conduct experiments on two ultra-fine entity typing datasets, UFET (English) and CFET (Chinese). Their statistics are shown in Table 1. We mainly report macro-averaged recall at the recall and expansion stages and macro-F1 of the final prediction. We also evaluate the MCCE models on fine-grained (130 types) and coarse-grained (9 types) entity typing.

Recall Stage
We compare recall@K on the test sets of UFET and CFET between our MLC model and a traditional BM25 model (Robertson and Zaragoza, 2009) in Figure 7. The MLC model uses RoBERTa-large as the backbone and is tuned based on recall@128 on the development set. We use the AdamW optimizer with a learning rate of $2 \times 10^{-5}$. Results show that MLC is a strong recall model, consistently outperforming BM25 on both UFET and CFET. Its recall@128 reaches over 85% on UFET and over 94% on CFET.

Expansion Stage
We show the improvement in recall from candidate expansion in Figure 8. On the UFET dataset, the recall of expanding $K_2 = 32$ additional candidates on top of $K_1 = 96$ MLC candidates is 2% (absolute) higher than the recall with $K_1 = 128, K_2 = 0$, and is comparable to the recall with $K_1 = 179, K_2 = 0$ (the red dotted line in Figure 8). Similarly on CFET, expanding 10 candidates on top of 54 MLC candidates is comparable in recall to recalling 141 candidates using MLC alone. In subsequent experiments, we expand 48 and 10 candidates for UFET and CFET respectively for the filter stage.

Filter Stage and Final Results
We report the performance of MCCE variants as the filter model and compare them with various strong baselines. We treat the numbers of candidates $K_1$ and $K_2$ recalled and expanded by the first two stages as hyper-parameters and tune them on the development set. For a fair comparison with the baselines, we conduct experiments with MCCE using different backbone PLMs. For all MCCE models, we use the AdamW optimizer with a learning rate tuned between $5 \times 10^{-6}$ and $2 \times 10^{-5}$. The batch size is 4 and we train the models for at most 50 epochs with early stopping.
Baselines The MLC model we use for the recall stage and the cross-encoder (CE) introduced in Sec. 2.3 are natural baselines. We also compare our methods with recent PLM-based methods, which we introduce in Appendix B. On CFET, similar to the results on UFET, filter models under our paradigm significantly outperform MLC-like baselines: +2.0 F1 for NEZHA-base and +1.8 F1 for BERT-base-Chinese. MCCE-B is significantly better than MCCE-S on both NEZHA-base and BERT-base-Chinese, indicating the importance of type semantics in the Chinese language. We also find that MCCE w/o C2C is generally better than MCCE w/ C2C, possibly because C2C attention distracts the candidates from attending to the mention and context. Table 4 shows the number of PLM forward passes and the empirical inference speed of different RoBERTa-large models on UFET. We conduct the speed test on an NVIDIA TITAN RTX with an inference batch size of 4. At the filter stage, the inference speed of MCCE-S is 40 times faster than CE with 128 candidates and thousands of times faster than LITE. Surprisingly, MCCE-B w/o C2C is not significantly faster than MCCE-B, possibly because the computation related to the block attention (Appendix A) is not fully optimized by the deep learning framework we use. However, we expect the speed advantage of MCCE-B w/o C2C over MCCE-B to grow with more candidates.

Fine and Coarse-grained Entity Typing
We also conduct experiments on fine-grained (130-class) and coarse-grained (9-class) entity typing; the results are shown in Table 5. Since the type candidate set is already small, it is not necessary to apply the recall and expansion stages to further prune the type set, so we only evaluate different model options for the filter stage. Results show that MCCE models are still better than MLC and CE, and MCCE-S is better than MCCE-B in the coarse-grained setting, possibly because coarse-grained types have simpler surface forms and MCCE-S therefore does not lose much type semantics.

Analysis

Importance of expansion stage
We perform an ablation study on the importance of the expansion stage by comparing the results of MCCE-S with and without it in Table 6. The expansion stage has a positive effect, improving the final recall by +1.0 and +2.2 points on UFET and CFET respectively without harming precision.

Attention
We conduct an ablation study on the S2S, C2S, S2C, and C2C attention introduced in Sec. 4.3 and show the results in Table 7. According to the results, C2C is useful but not necessary for base models: MCCE-S using BERT-base reaches 50.2 F1 on UFET without C2C. Removing S2S has a non-negligible negative effect but, surprisingly, does not destroy the model. A possible reason is that interaction between sub-tokens in the sentence can be achieved indirectly by first attending to the candidates and then being attended back by the candidates in the next layer. We also find that C2S is necessary for the task (18.7 F1 w/o C2S) because we rely on the mention and context to encode and classify candidates. Furthermore, it is important to keep the S2C attention.

A Removing C2C Attention
Let $L_S$ and $L_C$ be the numbers of sub-tokens used by the sentence and the candidates respectively. We can formulate the attention queries of the sentence as $Q^s = [q^s_1, \dots, q^s_{L_S}]$, where $q^s_i \in \mathbb{R}^D$ is the query vector of the i-th sub-token in the sentence, and D is the embedding dimension. Similarly, the queries of the candidates are formulated as $Q^c = [q^c_1, \dots, q^c_{L_C}]$. When we treat candidates as averages of sub-tokens, $q^c_i$ is a D-dimensional vector; when we use fixed-size blocks to place candidates, $q^c_i \in \mathbb{R}^{B \times D}$ is the concatenation of the query vectors in the i-th candidate block, where B is the number of sub-tokens in a block. The keys and values $K^s, V^s, K^c, V^c$ are defined similarly. The attention outputs are computed as:

$$O^s = \mathrm{softmax}\big([Q^s (K^s)^\top; Q^s (K^c)^\top]/\sqrt{D}\big)\,[V^s; V^c]$$

$$a^c_i = q^c_i (k^c_i)^\top, \quad A^{CC} = [a^c_1, \dots, a^c_{L_C}]$$

$$O^c_i = \mathrm{softmax}\big([q^c_i (K^s)^\top; a^c_i]/\sqrt{D}\big)\,[V^s; v^c_i]$$

where $A^{CC}$ is the intra-candidate (or intra-block) attention, and $a^c_i$ is a scalar when we treat candidates as averages of sub-tokens and a $B \times B$ matrix when we represent candidates as blocks. The last step (the computation of $O^c_i$) can be implemented in parallel with an Einstein summation.
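A sketch of the candidate-side computation without C2C, for the single-token-per-candidate case (single head, with names and shapes that are our assumptions): each candidate attends to the sentence keys plus only its own key, and the self-term is computed by an Einstein summation instead of a full $L_C \times L_C$ score matrix.

```python
import torch
import torch.nn.functional as F

def candidate_attention_no_c2c(Qc, Ks, Vs, Kc, Vc):
    """Qc, Kc, Vc: (Lc, D) candidate queries/keys/values; Ks, Vs: (Ls, D) sentence keys/values."""
    D = Qc.size(-1)
    scores_c2s = Qc @ Ks.T / D ** 0.5                   # (Lc, Ls): candidate-to-sentence scores
    a_cc = torch.einsum("id,id->i", Qc, Kc) / D ** 0.5  # (Lc,): each candidate's score on itself only
    attn = F.softmax(torch.cat([scores_c2s, a_cc.unsqueeze(-1)], dim=-1), dim=-1)
    out = attn[:, :-1] @ Vs + attn[:, -1:] * Vc         # weighted sentence values + own value
    return out                                          # (Lc, D)
```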

B Baselines
We introduce the recent PLM-based methods for UFET that we compare against in Sec. 5.3. LDET (Onoe and Durrett, 2019) is an MLC with BERT-base-uncased and ELMo (Peters et al., 2018), trained on 727k examples automatically denoised from the distantly labeled UFET data. GCN (Xiong et al., 2019) uses a GCN to model type correlations and obtain type embeddings; types are scored by the dot product of mention and type embeddings. The original paper uses a BiLSTM as the mention encoder, but we report the results of the re-implementation by Jiang et al. (2022) using RoBERTa-large. BOX4TYPE (Onoe et al., 2021) uses BERT-large as the backbone and box embeddings to encode mentions and types for training and inference. LRN (Liu et al., 2021) uses BERT-base as the encoder and an LSTM decoder to generate types in a seq2seq manner. MLMET (Dai et al., 2021) is an MLC with BERT-base, first pretrained on distantly labeled data augmented with masked word prediction, and then fine-tuned and self-trained on the 2k human-annotated examples. PL (Ding et al., 2021) uses prompt learning for entity typing. DFET (Pan et al., 2022) uses PL as the backbone and is a multi-round automatic denoising method on the 2k labeled examples. LITE (Li et al., 2022) is the previous SOTA system; it formulates entity typing as textual inference, uses RoBERTa-large-MNLI as the backbone, and is a cross-encoder (introduced in Sec. 2.3) with designed templates and a hierarchical loss. Jiang et al. (2022) propose NPCRF to enhance backbones such as PL and MLC by modeling type correlations, reaching performance comparable to LITE.

Figure 3: Comparison of different models. M, C, and T are abbreviations of mention, context, and type.

Figure 8: The effect of the expansion stage.

Table 1: Statistics of the UFET datasets. avg($|y_g|$) denotes the average number of gold types per instance. (Columns: dataset, $|\mathcal{Y}|$, avg($|y_g|$), train/dev/test sizes, language.)

Table 3: Macro-averaged CFET results.

Naming Conventions MCCE-S denotes the MCCE model that uses the average of sub-tokens as the candidates' input, and MCCE-B denotes the model that represents candidates as fixed-size blocks. The MCCE model without C2C attention (Sec. 4.3) is denoted as MCCE-B w/o C2C. For the PLM backbones used on UFET, we experiment with the same PLMs as the baselines for a fair comparison.
UFET Results We show the results on the UFET dataset in Table 2 and make the following observations. (1) The recall-expand-filter paradigm is effective: our methods outperform all the baselines without the paradigm by a large margin. The CE under our paradigm reaches 51.8 F1, while LITE, a more complicated CE, achieves only 50.

CFET Results On CFET, we compare MCCE models with several strong baselines: NPCRF and GCN with MLC-like architectures, and CE under our paradigm, which is shown to be better than LITE on UFET. The results are shown in Table 3.

Table 4: Inference speed comparison of models. #FP is the number of PLM forward passes required for a single inference. We also report the practical inference speed (sentences/second) and F1 scores on UFET.

Table 5: Micro-averaged results for fine- and coarse-grained entity typing on UFET.

Table 6: Ablation study of the expansion stage.

Limitation

One limitation of the MCCE models is that the number of candidates during training and inference should be the same; otherwise, performance drops severely. One simple potential solution is to divide or pad the candidates during inference to match the number used in training: for example, divide 128 candidates into two sets of 64 and apply two forward passes of a filter model that was trained on 64 candidates. We do not fully explore solutions to this limitation and leave it as future work.
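The divide-and-merge workaround could look like the following sketch, where `mcce` is the hypothetical filter model from the earlier snippet (trained with `train_k` candidates) and `input_builder` is an assumed helper that builds the concatenated input for one chunk of candidates:

```python
import torch

def filter_in_chunks(mcce, input_builder, candidates, train_k: int = 64):
    """Score more candidates than the model was trained with, train_k at a time."""
    scores = []
    for i in range(0, len(candidates), train_k):
        chunk = candidates[i:i + train_k]  # the last chunk may need padding to train_k; omitted here
        input_ids, attention_mask, cand_positions = input_builder(chunk)
        scores.append(mcce(input_ids, attention_mask, cand_positions))
    return torch.cat(scores, dim=-1)  # e.g., two forward passes to score 128 candidates
```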