Modeling Fine-Grained Entity Types with Box Embeddings

Neural entity typing models typically represent fine-grained entity types as vectors in a high-dimensional space, but such spaces are not well-suited to modeling these types’ complex interdependencies. We study the ability of box embeddings, which embed concepts as d-dimensional hyperrectangles, to capture hierarchies of types even when these relationships are not defined explicitly in the ontology. Our model represents both types and entity mentions as boxes. Each mention and its context are fed into a BERT-based model to embed that mention in our box space; essentially, this model leverages typological clues present in the surface text to hypothesize a type representation for the mention. Box containment can then be used to derive both the posterior probability of a mention exhibiting a given type and the conditional probability relations between types themselves. We compare our approach with a vector-based typing model and observe state-of-the-art performance on several entity typing benchmarks. In addition to competitive typing performance, our box-based model shows better performance in prediction consistency (predicting a supertype and a subtype together) and confidence (i.e., calibration), demonstrating that the box-based model captures the latent type hierarchies better than the vector-based model does.


Introduction
The development of named entity recognition and entity typing has been characterized by a growth in the size and complexity of type sets: from 4 (Tjong Kim Sang and De Meulder, 2003) to 17 (Hovy et al., 2006) to hundreds (Weischedel and Brunstein, 2005; Ling and Weld, 2012) or thousands (Choi et al., 2018). These types follow some kind of hierarchical structure (Weischedel and Brunstein, 2005; Ling and Weld, 2012; Gillick et al., 2014), so effective models for these tasks frequently engage with this hierarchy explicitly. Prior systems incorporate this structure via hierarchical losses (Xu and Barbosa, 2018; Chen et al., 2020) or by embedding types into a high-dimensional Euclidean or hyperbolic space (Yogatama et al., 2015; López and Strube, 2020). However, the former approach requires prior knowledge of the type hierarchy, making it unsuitable for a recent class of large type sets where the hierarchy is not explicit (Choi et al., 2018; Onoe and Durrett, 2020a). The latter approaches, while leveraging the inductive bias of hyperbolic space for representing trees, lack a probabilistic interpretation of the embedding and do not naturally capture the complex type relationships that go beyond strict containment.
In this paper, we describe an approach that represents entity types with box embeddings in a high-dimensional space. We build an entity typing model that jointly embeds each entity mention and all entity types into the same box space to determine the relations between them. Volumes of boxes correspond to probabilities, and taking intersections of boxes corresponds to computing joint distributions, which allows us to model mention-type relations (what types does this mention exhibit?) and type-type relations (what is the type hierarchy?). Concretely, we can compute the conditional probability of a type given the entity mention with straightforward volume calculations, allowing us to construct a probabilistic type classification model.
Compared to embedding types as points in Euclidean space (Ren et al., 2016a), the box space is expressive and well suited to representing entity types due to its geometric properties. Boxes can nest, overlap, or be completely disjoint, capturing subtype relations, correlated types, and mutually exclusive types, respectively.

Motivation
When predicting class labels like entity types that exhibit a hierarchical structure, we naturally want our model's output layer to be sensitive to this structure. Previous work (Ren et al., 2016a; Shimaoka et al., 2017; Choi et al., 2018; Onoe and Durrett, 2019, inter alia) has fundamentally treated types as vectors, as shown in the left half of Figure 1. As is standard in multiclass or multi-label classification, the output layer of these models typically involves taking a dot product between a mention embedding and each possible type. A type could be made more general, and thus predicted on more examples, by having a higher norm, but it is hard for these representations to capture that a coarse type like Person will have many mutually orthogonal subtypes.
By contrast, box embeddings naturally represent these kinds of hierarchies, as shown in the right half of Figure 1. A box that is completely contained in another box is a strict subtype of that box: any entity exhibiting the inner type will exhibit the outer one as well. Overlapping boxes like Politician and Author represent types that are not related in the type hierarchy but which are not mutually exclusive. The geometric structure of boxes enables complex interactions with only a moderate number of dimensions (Dasgupta et al., 2020). Box embeddings also admit a probability measure over the box space, endowing them with probabilistic semantics: if the boxes are restricted to a unit hypercube, for example, the volumes of type boxes represent priors on types and intersection volumes capture joint probabilities, which can then be used to derive conditional probabilities.
Critically, box embeddings have previously been trained explicitly to reproduce a given hierarchy such as WordNet. A central question of this work is whether box embeddings can be extended to model the hierarchies and type relationships that are implicit in entity typing data: we do not assume access to explicit knowledge of a hierarchy during training. While some datasets such as OntoNotes have orderly ontologies, recent work on entity typing has often focused on noisy type sets from crowdworkers (Choi et al., 2018) or derived from Wikipedia (Onoe and Durrett, 2020a). We show that box embeddings can learn these structures organically; in fact, they are not restricted to only tree structures, but enable a natural Venn-diagram style of representation for concepts, as with Politician and Author in Figure 1.
If we normalize the volume of the box space to be 1, we can interpret the volume of each box as the marginal probability of a mention exhibiting the given entity type.
Furthermore, the intersection volume between two boxes x and y is defined as

Vol(x ∩ y) = ∏_i max(min(x_{M,i}, y_{M,i}) − max(x_{m,i}, y_{m,i}), 0)

and can be seen as the joint probability of entity types x and y. Thus, we can obtain the conditional probability P(y | x) = Vol(x ∩ y) / Vol(x).

Soft boxes Computing conditional probabilities based on hard intersection poses some practical difficulties in the context of machine learning: sparse gradients caused by disjoint or completely contained boxes prevent gradient-based optimization methods from working effectively. To ensure that gradients always flow for disjoint boxes, Li et al. (2019) relax the hard edges of the boxes using Gaussian convolution. We follow the more recent approach of Dasgupta et al. (2020), who further improve training of box embeddings by using max and min Gumbel distributions (i.e., Gumbel boxes) to model the min and max coordinates of a box.
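Under the hard-box semantics above, volumes and conditional probabilities reduce to a few lines of array arithmetic. The following sketch (ours, not the authors' code; NumPy, toy 2-d boxes) illustrates the computation:

```python
import numpy as np

def hard_volume(box_min, box_max):
    """Volume of an axis-aligned box given its min/max corner vectors."""
    return float(np.prod(np.maximum(box_max - box_min, 0.0)))

def intersect(a_min, a_max, b_min, b_max):
    """Hard intersection: per-dimension max of min corners, min of max corners."""
    return np.maximum(a_min, b_min), np.minimum(a_max, b_max)

def cond_prob(x_min, x_max, y_min, y_max):
    """P(y | x) = Vol(x ∩ y) / Vol(x)."""
    z_min, z_max = intersect(x_min, x_max, y_min, y_max)
    return hard_volume(z_min, z_max) / hard_volume(x_min, x_max)

# Toy example: the type box y fully contains the mention box x, so P(y | x) = 1;
# a box w disjoint from x gets probability 0.
x_min, x_max = np.array([0.2, 0.2]), np.array([0.4, 0.4])
y_min, y_max = np.array([0.1, 0.1]), np.array([0.5, 0.5])
w_min, w_max = np.array([0.6, 0.6]), np.array([0.9, 0.9])
print(cond_prob(x_min, x_max, y_min, y_max))  # → 1.0
print(cond_prob(x_min, x_max, w_min, w_max))  # → 0.0
```

These two extreme cases (containment and disjointness) are exactly the configurations that motivate the soft-box relaxation discussed above, since both yield zero gradients under hard intersection.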

Box-based Multi-label Type Classifier
Let s denote a sequence of context words and m denote an entity mention span in s. Given the input tuple (m, s), the output of the entity typing model is an arbitrary number of predicted types {t_0, t_1, ...} ⊆ T, where t_k is an entity type belonging to a type inventory T. Because we do not assume an explicit type hierarchy, we treat entity typing as a multi-label classification problem, i.e., |T| independent binary classification problems for each mention. Section 3.3 will describe how to use a BERT-based model to predict a mention and context box x from (m, s). For now, we assume x is given and we are computing the probability of that mention exhibiting the kth entity type, with type box y_k. Each type t_k ∈ T has a dedicated box y_k, which is parameterized by a center vector c_y^k ∈ R^d and an offset vector o_y^k ∈ R^d. The minimum and maximum corners of a box y_k are computed as y_m^k = σ(c_y^k − softplus(o_y^k)) and y_M^k = σ(c_y^k + softplus(o_y^k)) respectively, so that any parameters c_y^k ∈ R^d and o_y^k ∈ R^d yield a valid box with nonzero volume.
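The center/offset parameterization can be sketched as follows (our illustration; function names are ours). Because the softplus makes the half-width strictly positive and the sigmoid squashes both corners into (0, 1), any real-valued parameters produce a well-formed box:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def softplus(v):
    return np.log1p(np.exp(v))

def type_box_corners(center, offset):
    """Map unconstrained (center, offset) parameters in R^d to the
    min/max corners of a box inside the unit hypercube (0, 1)^d."""
    half_width = softplus(offset)            # strictly positive
    return sigmoid(center - half_width), sigmoid(center + half_width)

# Even extreme parameter values yield a valid box with nonzero volume.
c = np.array([0.0, 3.0, -5.0])
o = np.array([-2.0, 0.0, 4.0])
y_min, y_max = type_box_corners(c, o)
assert np.all(0 < y_min) and np.all(y_max < 1) and np.all(y_min < y_max)
```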
The conditional probability of the type t_k given the mention and context (m, s) is calculated as P(t_k | m, s) = Vol(z_k) / Vol(x), where z_k is the intersection between x and y_k ((2) and (3) in Figure 2). Our final type predictions are based on thresholding these probabilities, i.e., predict the type if p > 0.5.
As mentioned in Section 3.1, we use the Gumbel box approach of Dasgupta et al. (2020), in which the box coordinates are interpreted as the location parameters of Gumbel max (resp. min) distributions with variance β. In this approach, the coordinates of the intersection box z_k become

z_{m,i} = β log(exp(x_{m,i} / β) + exp(y_{m,i}^k / β))
z_{M,i} = −β log(exp(−x_{M,i} / β) + exp(−y_{M,i}^k / β)).

Following Dasgupta et al. (2020), we approximate the expected volume of a Gumbel box using a softplus function:

Vol(z_k) ≈ ∏_i softplus(z_{M,i} − z_{m,i} − 2βγ),

where i is an index over coordinates, γ ≈ 0.5772 is the Euler–Mascheroni constant, and softplus(x) = (1/t) log(1 + exp(tx)), with t an inverse temperature value.
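The Gumbel-box computation can be sketched as follows (our NumPy rendering under the formulas above, with inverse temperature t = 1/β; the β value here is illustrative, not the paper's tuned hyperparameter). Logsumexp smooths the hard max/min of the intersection, and the softplus with the −2βγ correction approximates the expected side length:

```python
import numpy as np

GAMMA = 0.5772156649  # Euler–Mascheroni constant
BETA = 0.01           # Gumbel scale; illustrative value, not the paper's

def gumbel_intersection(x_min, x_max, y_min, y_max, beta=BETA):
    """Smooth intersection: logsumexp replaces the hard max/min so that
    gradients flow even when the boxes are disjoint."""
    z_min = beta * np.logaddexp(x_min / beta, y_min / beta)
    z_max = -beta * np.logaddexp(-x_max / beta, -y_max / beta)
    return z_min, z_max

def gumbel_volume(z_min, z_max, beta=BETA):
    """Approximate expected volume: prod_i softplus(z_{M,i} - z_{m,i} - 2*beta*gamma),
    with inverse temperature t = 1/beta (stable softplus via logaddexp)."""
    t = 1.0 / beta
    side = z_max - z_min - 2.0 * beta * GAMMA
    return float(np.prod(np.logaddexp(0.0, t * side) / t))

# With small beta, the soft volume closely tracks the hard volume.
x_min, x_max = np.array([0.2, 0.2]), np.array([0.4, 0.4])
y_min, y_max = np.array([0.1, 0.1]), np.array([0.5, 0.5])
z_min, z_max = gumbel_intersection(x_min, x_max, y_min, y_max)
vol = gumbel_volume(z_min, z_max)  # slightly below the hard volume 0.2 * 0.2 = 0.04
```

Note that the volume is always strictly positive, which is what keeps gradients alive for disjoint boxes during training.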

Mention and Context Encoder
We format the context words s and the mention span m into a single input sequence and chunk it into WordPiece tokens (Wu et al., 2016). Using pre-trained BERT (Devlin et al., 2019), we encode the whole sequence into a single vector by taking the hidden vector at the [CLS] token. A highway layer (Srivastava et al., 2015) projects the hidden vector h_[CLS] ∈ R^ℓ down to R^{2d}, where ℓ is the hidden dimension of the encoder (BERT) and d is the dimension of the box space. This highway layer transforms representations in a vector space to the box space without impeding the gradient flow. We then split the resulting hidden vector h ∈ R^{2d} into two vectors: the center point of the box c_x ∈ R^d and the offset from the minimum and maximum corners o_x ∈ R^d. The minimum and maximum corners of the mention and context box are computed as x_m = σ(c_x − softplus(o_x)) and x_M = σ(c_x + softplus(o_x)), where σ is an element-wise sigmoid function and softplus is the element-wise softplus function defined in Section 3.2 ((1) in Figure 2). The output of the softplus is guaranteed to be positive, guaranteeing that the boxes have volume greater than zero.
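A minimal sketch of this projection step (ours; the pretrained BERT encoder is replaced by a random stand-in vector, and a plain linear map stands in for the highway layer, so all dimensions and weights are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
ell, d = 1024, 64          # encoder hidden size and box dimension (illustrative)

def sigmoid(v): return 1.0 / (1.0 + np.exp(-v))
def softplus(v): return np.log1p(np.exp(v))

h_cls = rng.standard_normal(ell)              # stand-in for BERT's [CLS] vector
W = rng.standard_normal((ell, 2 * d)) * 0.02  # stand-in for the highway projection
h = h_cls @ W                                 # h ∈ R^{2d}

c_x, o_x = h[:d], h[d:]                       # split into center and offset
x_min = sigmoid(c_x - softplus(o_x))          # min corner of the mention box
x_max = sigmoid(c_x + softplus(o_x))          # max corner of the mention box
assert x_min.shape == (d,) and np.all(x_min < x_max)
```

The final two lines mirror the type-box parameterization of Section 3.2, so mention boxes and type boxes live in the same unit hypercube.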

Learning
The goal of training is to find a set of parameters θ that minimizes the sum of binary cross-entropy losses over all types and all examples in our training dataset D:

L(θ) = − Σ_{(m,s) ∈ D} Σ_k [ t̂_k log p_k + (1 − t̂_k) log(1 − p_k) ],

where p_k = P(t_k | m, s) and t̂_k ∈ {0, 1} is the gold label for the type t_k. We optimize this objective using gradient-based optimization algorithms such as Adam (Kingma and Ba, 2015).
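Concretely, the objective sums a per-type binary cross-entropy over each example's probability vector (our sketch; variable names are ours):

```python
import numpy as np

def typing_loss(probs, gold):
    """Sum of binary cross-entropy losses over all |T| types for one example.
    probs[k] is the model's P(t_k | m, s); gold[k] is 0 or 1."""
    probs = np.clip(probs, 1e-12, 1.0 - 1e-12)   # numerical safety
    return float(-np.sum(gold * np.log(probs) + (1.0 - gold) * np.log(1.0 - probs)))

p = np.array([0.9, 0.2, 0.5])
y = np.array([1.0, 0.0, 1.0])
loss = typing_loss(p, y)   # -log(0.9) - log(0.8) - log(0.5) ≈ 1.022
```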

Experimental Setup
Our focus here is to shed light on the difference between type hierarchies learned by the box-based model and the vector-based model. To this end, we first evaluate those two models on standard entity typing datasets. Then, we test models' consistency, robustness, and calibration, and evaluate the predicted types as entity representations on a downstream task (coreference resolution). See Appendix A for hyperparameters.

Baseline
Our chief comparison is between box-based and vector-based modeling of entity types. As our main baseline for all experiments, we use a vector-based version of our entity typing model. We use the same mention and context encoder followed by a highway layer, but this baseline has vector-based type embeddings (i.e., a |T | × d ′ matrix), and type predictions are given by a dot product between the type embeddings and the mention and context representation followed by element-wise logistic regression. This model is identical to that of Onoe and Durrett (2020b) except for the additional highway layer.
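The baseline's output layer can be sketched as follows (ours; toy dimensions, random weights for illustration):

```python
import numpy as np

def vector_baseline_probs(mention_vec, type_matrix):
    """Dot product between each type embedding (rows of a |T| x d' matrix)
    and the mention/context vector, followed by an element-wise sigmoid."""
    logits = type_matrix @ mention_vec          # shape (|T|,)
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
num_types, d_prime = 6, 8                       # toy sizes, not the paper's
probs = vector_baseline_probs(rng.standard_normal(d_prime),
                              rng.standard_normal((num_types, d_prime)))
predicted = probs > 0.5                         # independent per-type decisions
```

The contrast with the box model is only in this output layer: the encoder is shared, and each type decision is still an independent thresholded probability.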

Evaluation and Datasets
Entity Typing We evaluate our approach on the Ultra-Fine Entity Typing (UFET) dataset (Choi et al., 2018) with the standard splits (2k examples for each of train, dev, and test). In addition to the manually annotated training examples, we use the denoised distantly annotated training examples from Onoe and Durrett (2019). This dataset contains 10,331 entity types, and each type is marked as one of three classes: coarse, fine, or ultra-fine. Note that this classification does not provide explicit hierarchies over the types, and all classes are treated equally during training.
Additionally, we test our box-based model on three other entity typing benchmarks that have relatively simpler entity type inventories with known hierarchies, namely OntoNotes (Gillick et al., 2014), BBN (Weischedel and Brunstein, 2005), and FIGER (Ling and Weld, 2012). See Appendix B for more details on these datasets.
Consistency A model that captures hierarchical structure should be aware of the relationships between supertypes and subtypes. When a model predicts a subtype, we want it to predict the corresponding supertype as well, even when this is not explicitly enforced as a constraint or consistently demonstrated in the data, as in the UFET dataset. That is, when a model predicts artist, person should also be predicted. To check this ability, we analyze the model predictions on the UFET dev set. We select 30 subtypes from the UFET type inventory and annotate corresponding supertypes for them in cases where these relationships are clear, based on their co-occurrence in the UFET training set and human intuition. Based on these 30 pairs, we compute the accuracy of predicting supertypes and subtypes together. Table 10 in Appendix C lists the 30 pairs.
Robustness Entity typing datasets with very large ontologies like UFET are noisy; does our box-based model's notion of hierarchy do a better job of handling intrinsic noise in a dataset? To test this in a controlled fashion, we synthetically create noisy labels by randomly dropping gold labels with probability 1/3. We derive two noisy training sets from the UFET training set: 1) adding noise to the coarse types and 2) adding noise to the fine and ultra-fine types. We train on these noised datasets and evaluate on the standard UFET dev set.

Table 1: Macro-averaged P/R/F1 on the test set for the ultra-fine entity typing task of Choi et al. (2018).

Calibration Since models can produce different ranges of values for their logits depending on how long they are trained, we post-hoc calibrate each of our models using temperature scaling (Guo et al., 2017) and a shift parameter. We report the total error (i.e., the sum of the errors between mean confidence and empirical accuracy across confidence bins) on the UFET dev set and the OntoNotes dev set.

Entity Representations
We are interested in the usefulness of the trained entity typing models in a downstream task. Following Onoe and Durrett (2020b), we evaluate entity representations given by the box-based and vector-based models on the Coreference Arc Prediction (CAP) task derived from PreCo (Chen et al., 2018). This task is a binary classification problem, requiring a model to judge whether two mention spans (either in one sentence or across two sentences) refer to the same entity or not. As in Onoe and Durrett (2020b), we obtain type predictions (a vector of probabilities associated with types) for each span and use them as entity representations. The final coreference prediction for a pair of mentions is given by thresholding the cosine similarity between the two entity type probability vectors at 0.5. The original data split provides 8k examples for each of the training, dev, and test sets. We report accuracy on the CAP test set.
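The CAP decision rule described above amounts to thresholded cosine similarity between type probability vectors (our sketch; the vectors below are made up for illustration):

```python
import numpy as np

def coref_predict(probs_a, probs_b, threshold=0.5):
    """Predict coreference for a mention pair from their entity-type
    probability vectors via thresholded cosine similarity."""
    sim = float(np.dot(probs_a, probs_b) /
                (np.linalg.norm(probs_a) * np.linalg.norm(probs_b)))
    return sim >= threshold

# Two mentions with similar predicted type distributions are linked;
# dissimilar distributions are not.
a = np.array([0.9, 0.8, 0.1, 0.0])
b = np.array([0.8, 0.7, 0.2, 0.1])
c = np.array([0.0, 0.1, 0.9, 0.8])
print(coref_predict(a, b), coref_predict(a, c))  # → True False
```

Note that no CAP-specific training is involved: the type probability vectors come straight from the typing model, which is what makes this a probe of "out-of-the-box" entity representations.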

Entity Typing
Here we report entity typing performance on Ultra-Fine Entity Typing (UFET), OntoNotes, FIGER, and BBN. For each dataset, we select the best model from 5 runs with different random seeds based on the development performance.
UFET Table 1 shows the macro-precision, recall, and F1 scores on the UFET test set. Our box-based model outperforms the vector-based model and state-of-the-art systems in terms of macro-F1. The best prior system uses ELMo (Peters et al., 2018) and applies denoising to fix label inconsistency in the distantly annotated data. Note that past work on this dataset has used BERT-base (Onoe and Durrett, 2019). Work on other datasets has used ELMo and observed that BERT-based models surprisingly underperform (Lin and Ji, 2019). Some of the gain of our vector-based model can be attributed to our use of BERT-Large; however, our box-based model still achieves stronger performance than the corresponding vector-based version, which uses the same pretrained model. Table 2 breaks down performance by the coarse, fine, and ultra-fine classes. Our box-based model consistently outperforms the vector-based model in macro-recall and F1 across the three classes. The largest gap in macro-recall is in the fine class, leading to the largest gap in macro-F1 among the three classes.
We also list the numbers from prior work in Table 2. HY XLarge (López and Strube, 2020), a hyperbolic model designed to learn hierarchical structure in entity types, exceeds the performance of models of similar size such as Choi et al. (2018) and Xiong et al. (2019), especially in macro-recall. In the ultra-fine class, both our box-based model and HY XLarge achieve higher macro-F1 than their vector-based counterparts.
One possible reason for the higher recall of our model is a stronger ability to model dependencies between types: instead of failing to predict a highly correlated type, the model may be more likely to predict a complete, coherent set of types.

Lin and Ji (2019) propose an ELMo-based model with an attention layer over mention spans and train their model on the augmented data from Choi et al. (2018). Among the models trained only on the original OntoNotes training set, the box-based model achieves the highest macro-F1 and micro-F1. The state-of-the-art system on BBN, the system of Chen et al. (2020) in the "undefined" setting, uses explicit knowledge of the type hierarchy. This is particularly relevant on the BBN dataset, where the training data is noisy and features training points with obviously conflicting labels like person and organization, which appear systematically in the data. To simulate constraints like the ones they use, we apply three simple rules to modify our models' predictions: (1) dropping person if organization exists, (2) dropping location if gpe exists, and (3) replacing facility with fac, since both versions of this tag appear in the training set but only fac appears in the dev and test sets. Our box-based model and the vector-based model perform similarly, and both achieve results comparable with recent systems.

Other datasets
On FIGER, our box-based model shows lower performance than the vector-based model, though both achieve results comparable with state-of-the-art systems. We notice that some of the test examples have inconsistent labels (e.g., /organization/sports_team is present, but its supertype /organization is missing), penalizing models that predict the supertype correctly. In addition, FIGER, like BBN, has systematic shifts between the training and test distributions. We hypothesize that our model's hyperparameters (tuned on OntoNotes only) are suboptimal for this setting.
The high dev performance shown in Table 4 suggests that our model, optimized on held-out training examples, may not capture these specific shifts as well as other models whose inductive biases are better suited to this unusually mislabeled data.

Consistency
One factor we can investigate is whether our model predicts type relations in a sensible, consistent fashion independent of the ground truth for a particular example. For this evaluation, we investigate our model's predictions on the UFET dev set. We count the number of occurrences of each subtype in the 30 supertype/subtype pairs (see Table 10 in Appendix C). Then, for each subtype, we count how many times its corresponding supertype is also predicted. Although these supertype-subtype relations are not strictly defined in the training data, we believe they should nevertheless be exhibited by models' predictions. Accuracy is given by the ratio between those counts, indicating how often the supertype was correctly picked up. Table 5 lists the total and per-supertype accuracy on the supertype/subtype pairs. We report the number of subtypes grouped by their supertypes to show their frequency (the "Count" column in Table 5). Our box-based model achieves better accuracy than the vector-based model on all supertypes. The gaps are particularly large on place and organization. Note that some of the UFET training examples have inconsistent labels (e.g., a subtype team can appear with either supertype organization or group), and this ambiguity can confuse a model during training. Even in these tricky cases, the box-based model shows reasonable performance. The geometry of the box space itself gives some evidence as to why this consistency would arise (see Section 5.6 for a visualization of box edges).

Robustness

Table 6 analyzes the models' sensitivity to label noise. We list the UFET dev performance of models trained on the noised UFET training sets. When the coarse types are noised (i.e., some supertypes are omitted), the vector-based model loses 4.8 points of macro-F1 while our box-based model only loses 1.5 points. A similar trend can be seen when the fine and ultra-fine types are noised (i.e., some subtypes are omitted).
In both cases, the vector-based model shows lower recall than the same model trained on the clean data, while our box-based model is more robust. We also note that the vector-based model tends to overfit to the training data quickly. We hypothesize that the use of boxes acts as a form of regularization: moving boxes may be harder than moving points in a space, making the model less susceptible to noisy labels.

Calibration
Following Nguyen and O'Connor (2015), we split model confidence (output probability) for each typing decision of each example into 10 bins (e.g., 0-0.1, 0.1-0.2, etc.). For each bin, we compute mean confidence and empirical accuracy. We show the total calibration error (lower is better) as well as the scaling and shifting constants in Table 7. As the results on UFET and OntoNotes show, both box-based and vector-based entity typing models can be reasonably well calibrated after applying temperature scaling and shifting. However, the box-based model achieves slightly lower total error.
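The total error metric can be computed as follows (our sketch of the binned computation; the toy data below is constructed so that mean confidence matches empirical accuracy in every occupied bin):

```python
import numpy as np

def total_calibration_error(confidences, correct, n_bins=10):
    """Bin each typing decision by its confidence, then sum the absolute
    gaps between mean confidence and empirical accuracy over the bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += abs(confidences[mask].mean() - correct[mask].mean())
    return total

# Bin 0.2: accuracy 1/5 = 0.2; bin 0.8: accuracy 4/5 = 0.8.
conf = [0.2, 0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8, 0.8]
hits = [1,   0,   0,   0,   0,   1,   1,   1,   1,   0]
err = total_calibration_error(conf, hits)   # ≈ 0.0 (perfectly calibrated toy data)
```

Temperature scaling and shifting, as used in the paper, simply transform the logits before this binning step to minimize such an error on held-out data.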

Entity Representation for Coreference
This experiment evaluates whether model outputs are immediately useful in a downstream task. For this task, we use the box-based and vector-based entity typing models trained on the UFET training set (i.e., we do not train the models on the CAP training set). Table 8 shows the test accuracy on the CAP data. Our box-based model achieves slightly higher accuracy than the vector-based model, indicating that "out-of-the-box" entity representations obtained by the box-based model contain more useful features for the CAP task.

Box Edges
To analyze how semantically related type boxes are located relative to one another in the box space, we plot the edges of the person and actor boxes along the 109 dimensions one by one. Figure 3 shows how those two boxes overlap each other in the high-dimensional box space. The upper plot in Figure 3 compares the person box and the actor box learned on the UFET data. We can see that the edges of person contain the edges of actor in many dimensions but not all, meaning that the person box overlaps with the actor box but does not contain it as completely as we might expect. However, we can additionally investigate whether the actor box is effectively contained in the person box for the parts of the space actually used by the mention boxes. The lower plot in Figure 3 compares the person box and the minimum bounding box of the intersections between the actor box and the mention and context boxes obtained from the UFET dev examples where the actor type is predicted. This minimum bounding box approximates the effective region within the actor box. Now the edges of actor are contained within the edges of person in most dimensions, indicating that the person box almost entirely contains this "effective" actor box. (Note that our CAP results are not directly comparable to those of Onoe and Durrett (2020b): we train on the training set of the UFET dataset, while they train on examples from the train, dev, and test sets.)

Related Work
Embeddings Embedding concepts/words into a high-dimensional vector space (Hinton, 1986) has a long history and has been an essential part of neural networks for language (Bengio et al., 2003; Collobert et al., 2011). There is similarly a long history of rethinking the semantics of these embedding spaces, such as treating words as regions using sparse count-based vectors (Erk, 2009a,b) or dense distributed vectors (Vilnis and McCallum, 2015). Order embeddings (Vendrov et al., 2016) and their probabilistic version (POE) (Lai and Hockenmaier, 2017) are one technique suited to hierarchical modeling. However, order embeddings can only handle binary entailment decisions, and POE cannot model negative correlations between types, a critical limitation in its use as a probabilistic model; these shortcomings directly led to the development of box embeddings. Hyperbolic embeddings (Nickel and Kiela, 2017; López and Strube, 2020) can also model hierarchical relationships, as can hyperbolic entailment cones (Ganea et al., 2018); however, these approaches lack a probabilistic interpretation.
Recent work on knowledge base completion (Abboud et al., 2020) and reasoning over knowledge graphs (Ren et al., 2020) embeds relations or queries using box embeddings, but entities are still represented as vectors. In contrast, our model embeds both entity mentions and types as boxes.
Entity typing Entity typing and named entity recognition (Tjong Kim Sang and De Meulder, 2003) are old problems in NLP. Recent work has focused chiefly on predicting fine-grained entity types (Ling and Weld, 2012; Gillick et al., 2014; Choi et al., 2018), as these convey significantly more information for downstream tasks. As a result, there is the challenge of scaling to large type inventories, which has inspired work on type embeddings (Ren et al., 2016a,b).
Entity typing information has been used across a range of NLP tasks, including models for entity linking and coreference (Durrett and Klein, 2014). Typing has been shown to be useful for cross-domain entity linking specifically (Gupta et al., 2017; Onoe and Durrett, 2020a). It has also recently been applied to coreference resolution (Onoe and Durrett, 2020b; Khosla and Rose, 2020) and text generation (Dong et al., 2020), suggesting that it can be a useful intermediate layer even in pretrained neural models.

Conclusion
In this paper, we investigated a box-based model for fine-grained entity typing. By representing entity types in a box embedding space and projecting entity mentions into the same space, we can naturally capture the hierarchy of and correlations between entity types. Our experiments showed several benefits of box embeddings over the equivalent vector-based model, including typing performance, calibration, and robustness to noise.

Acknowledgments
Thanks to the members of the UT TAUR lab, Pengxiang Cheng, and Eunsol Choi for helpful discussion; Tongfei Chen and Ying Lin for providing the details of experiments. This work was also partially supported by NSF Grant IIS-1814522, NSF Grant SHF-1762299, and based on research in part supported by the Air Force Research Laboratory (AFRL), DARPA, for the KAIROS program under agreement number FA8750-19-2-1003, as well as University of Southern California subcontract no. 123875727 under Office of Naval Research prime contract no. N660011924032. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of AFRL, DARPA, or the U.S. Government.