Box Embeddings: An open-source library for representation learning using geometric structures

A fundamental component of the success of modern representation learning is the ease of performing various vector operations. Recently, objects with more geometric structure (e.g., distributions, complex or hyperbolic vectors, or regions such as cones, disks, or boxes) have been explored for their alternative inductive biases and additional representational capacity. In this work, we introduce Box Embeddings, a Python library that enables researchers to easily apply and extend probabilistic box embeddings. Fundamental geometric operations on boxes are implemented in a numerically stable way, as are modern approaches to training boxes which mitigate gradient sparsity. The library is fully open source and compatible with both PyTorch and TensorFlow, which allows existing neural network layers to be replaced with or transformed into boxes easily. We present the implementation details of the fundamental components of the library, and the concepts required to use box representations alongside existing neural network architectures.


Introduction
Much of the success of modern deep learning rests on the ability to learn representations of data compatible with the structure of the deep architectures used for training and inference (Hinton, 2007; LeCun et al., 2015). Vectors are the most common choice of representation, as linear transformations are well understood and element-wise non-linearities offer increased representational capacity while being straightforward to implement. Recently, various alternatives to vector representations have been explored, each with different inductive biases or capabilities. Vilnis and McCallum (2015) represent words using Gaussian distributions, which can be thought of as vector representations with an explicit parameterization of variance. This variance was demonstrated to be capable of capturing the generality of concepts, and KL-divergence provides a natural asymmetric operation between distributions; these ideas were expanded upon in Athiwaratkun and Wilson (2018). Nickel and Kiela (2017), on the other hand, change the embedding space itself from Euclidean to hyperbolic space, where the negative curvature has been shown to provide a natural inductive bias toward modeling tree-like graphs (Nickel and Kiela, 2018; Weber, 2020; Weber and Nickel, 2018).
A subset of these alternative approaches explores region-based representations, where entities are represented not by a single point in space but by explicitly parameterized regions whose volumes and intersections are easily calculated. Order embeddings (Vendrov et al., 2016) represent elements using infinite cones in R^n_+ and demonstrate their efficacy in modeling partial orders. Lai and Hockenmaier (2017) endow order embeddings with probabilistic semantics by integrating the space under a negative exponential measure, allowing the calculation of arbitrary marginal, joint, and conditional probabilities. Cone representations are not particularly flexible, however; for instance, the resulting probability model cannot represent negative correlation. This motivated the development of probabilistic box embeddings, where entities are represented by n-dimensional rectangles (i.e., Cartesian products of intervals) in Euclidean space.
Probabilistic box embeddings have undergone several rounds of methodological improvements.
The original model used a surrogate function to pull disjoint boxes together; subsequent work improved upon this via Gaussian convolution of box indicator functions, resulting in a smoother loss landscape and, as a result, better performance. Box training was improved further by a latent random variable approach, in which the corners of boxes are modeled using Gumbel random variables. These latter models lacked valid probabilistic semantics, however, a fact rectified in later work.
While each methodological improvement demonstrated better performance on various modeling tasks, the implementations grew more complex, bringing with them various challenges related to performance and numerical stability. Various applications of probabilistic box embeddings (e.g., modeling joint hierarchies (Patel et al., 2020), uncertain knowledge graph representation (Chen et al., 2021), or fine-grained entity typing (Onoe et al., 2021)) have relied on bespoke implementations, adding unnecessary difficulty and implementation differences when applying box embeddings to new tasks. To mitigate this issue and make applying and extending box embeddings easier, we saw the need for a reusable, unified, stable library that provides the basic functionality needed to study box embeddings. To this end, we introduce "Box Embeddings", a fully open-source Python library hosted on PyPI. The contributions of this work are as follows:
• Provide a modular and reusable library that aids researchers in studying probabilistic box embeddings. The library is compatible with both of the most popular machine learning frameworks: PyTorch and TensorFlow.
• Create extensive documentation and example code, demonstrating the use of the library and making it easy to adapt to existing code-bases.
• Rigorously unit-test the codebase with high coverage, ensuring an additional layer of reliability.

Box Embeddings
Formally, a "box" is defined as a Cartesian product of closed intervals,

Box(θ) = [z_1(θ), Z_1(θ)] × · · · × [z_n(θ), Z_n(θ)],

where θ represents some latent parameters. In the simplest case, θ ∈ R^{2n} are free parameters, and z_i, Z_i are projections onto the i-th and (n+i)-th components, respectively. In general, however, the parameterization may be more complicated; e.g., θ may be the output of a neural network. For brevity, we omit the explicit dependency on θ. The different operations (such as volume and intersection) commonly used when calculating probabilities from box embeddings can all be defined in terms of z_i and Z_i, the min and max coordinates of the interval in each dimension.
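To make the definition concrete, a box can be stored simply as its vectors of min and max coordinates, with its volume given by the product of side lengths. The following is a plain-Python sketch of this idea, not the library's implementation:

```python
def hard_volume(z, Z):
    """Volume of the box [z_1, Z_1] x ... x [z_n, Z_n]:
    the product of the per-dimension side lengths."""
    assert len(z) == len(Z)
    vol = 1.0
    for zi, Zi in zip(z, Z):
        vol *= max(Zi - zi, 0.0)  # clamp: an inverted interval yields zero volume
    return vol

# A 2-D box [0, 2] x [1, 4] has volume 2 * 3 = 6.
print(hard_volume([0.0, 1.0], [2.0, 4.0]))  # 6.0
```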

Parameterizations
The fundamental component of the library is the BoxTensor class, a wrapper around the torch.Tensor and tensorflow.Tensor classes that represents a tensor/array of boxes.
BoxTensor is an opaque wrapper, in that it exposes the operations and properties necessary to use box representations (see Table 1) irrespective of the specific way in which the parameters θ are related to z_i, Z_i. The two main properties of a BoxTensor are z and Z, which represent the min and max coordinates of the boxes it holds. Listing 1 shows how to create an instance of BoxTensor consisting of the two 2-dimensional boxes in Figure 1.

Listing 1: Manually initializing a BoxTensor consisting of the 2-D boxes depicted in Figure 1.
Given a torch.Tensor corresponding to the parameters θ of a BoxTensor, one can obtain a box representation in multiple ways, depending on the constraints on the min and max coordinates of the box representations as well as the range of values in θ. The BoxTensor class itself simply splits θ in half on the last dimension, using θ[..., 1:n] as z and θ[..., n+1:2n] as Z. Here, the Ellipsis "..." denotes any number of leading dimensions, for instance batch, sequence length, etc. To simplify notation, from here on the presence of leading dimensions will not be denoted explicitly with the Ellipsis, and all indexing operations can be assumed to operate on the last dimension unless stated otherwise.

Any box can be represented in this fashion; however, some settings of θ may lead to situations where z_i > Z_i. This scenario is invalid under conventional box models, and although it is valid for models which interpret these coordinates as parameters of a latent random variable, it is often still desirable to constrain side-lengths to be non-negative. MinDeltaBoxTensor represents boxes that are unbounded and have non-negative side-length in each dimension. That is, it outputs boxes with z, Z ∈ R^n and z_i ≤ Z_i, and furthermore any such box has a corresponding θ under this parameterization.

A valid probabilistic interpretation of box embeddings requires that the embedding space have finite measure, however. One trivial way to accomplish this is to parameterize boxes to remain within the unit hypercube, which can be accomplished via the SigmoidBoxTensor or TanhBoxTensor classes. The specific mathematical operations relating the θ variables to their z, Z coordinates are found in Table 2, and example usage can be found in Listing 2.
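The exact mappings are given in Table 2. The following plain-Python sketch illustrates the two simplest cases: the plain split used by BoxTensor and a softplus-based construction in the spirit of MinDeltaBoxTensor. This is our own reconstruction for illustration; the library's formulas may differ in detail.

```python
import math

def softplus(x):
    """Numerically stable softplus(x) = log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def plain_split(theta):
    """BoxTensor-style: first half of theta is z, second half is Z.
    Note that nothing prevents z_i > Z_i here."""
    n = len(theta) // 2
    return theta[:n], theta[n:]

def min_delta(theta):
    """MinDeltaBoxTensor-style sketch: z is free, and each side length is
    the softplus of the corresponding entry in the second half of theta,
    guaranteeing z_i <= Z_i for any theta."""
    n = len(theta) // 2
    z = theta[:n]
    Z = [zi + softplus(d) for zi, d in zip(z, theta[n:])]
    return z, Z
```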

Operations on BoxTensor
We provide a variety of modules that implement different operations on box-tensors, such as Intersection, Volume, Pooling, and Regularization. We also provide a BoxEmbedding layer that, just like a vector embedding layer, performs index lookup; unlike a vector embedding layer, however, it returns boxes instead of vectors. We discuss these layers in detail below.

Intersection
Given two instances of BoxTensor with compatible shapes, this operation performs the intersection between the two box-tensors and returns an instance of BoxTensor as the result. For two instances of BoxTensor A and B with coordinates (z_A, Z_A) and (z_B, Z_B) respectively, the (z, Z) coordinates of the resulting intersection box for the two types of intersection operations, HardIntersection and GumbelIntersection, are shown in Table 3, and corresponding code is provided in Listing 3.

Listing 3: Various approaches to computing the intersection of two box tensors.
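In plain Python, the two intersection operations of Table 3 can be sketched as follows, with β the Gumbel intersection temperature. This is a reconstruction of the math, not the library's code:

```python
import math

def hard_intersection(zA, ZA, zB, ZB):
    """Hard intersection: z is the elementwise max of the mins,
    Z is the elementwise min of the maxes."""
    z = [max(a, b) for a, b in zip(zA, zB)]
    Z = [min(a, b) for a, b in zip(ZA, ZB)]
    return z, Z

def gumbel_intersection(zA, ZA, zB, ZB, beta=1.0):
    """Smoothed intersection: max is replaced by beta * logsumexp(. / beta)
    (and symmetrically for min), approaching the hard intersection as
    beta -> 0. Naive logsumexp is used here; fine for a sketch."""
    lse = lambda a, b: beta * math.log(math.exp(a / beta) + math.exp(b / beta))
    z = [lse(a, b) for a, b in zip(zA, zB)]
    Z = [-lse(-a, -b) for a, b in zip(ZA, ZB)]
    return z, Z
```

Note that the Gumbel intersection always yields a box slightly larger than the hard intersection in z and slightly smaller in Z, which keeps gradients flowing between nearly disjoint boxes.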

Volume
Boxes (or intersections of boxes) are typically queried for their volumes. Our HardVolume layer implements the volume calculation as originally introduced in Vilnis et al. (2018), which is simply a direct multiplication of side-lengths. It is in this setting that bounded parameterizations such as SigmoidBoxTensor and TanhBoxTensor are particularly useful, as the resulting volumes can be interpreted as valid marginal or joint probabilities. Note, however, that the guarantee of positive side-lengths does not apply when taking the intersection of two disjoint boxes, in which case the resulting box should have zero volume. Our SoftVolume layer implements a smoothed volume function which mitigates the training difficulties that arise when disjoint boxes should overlap. Finally, our BesselApproxVolume layer implements a volume function that approximates the expected volume of a box whose coordinates are interpreted as location parameters of Gumbel random variables. The expressions and code snippets for the various volume operations are given in Table 4 and Listing 4, respectively.
Remark 1. Note that due to the presence of the product, the naive implementation of the volume computations shown in Table 4 will often result in numerical overflow or underflow for dimensions greater than 5. Hence, we provide an option to compute the volume in log-space, which is on by default.

Listing 4: Different proposed methods for computing box volume, of increasing "smoothness". Here, (z, Z) are the min-max coordinates of the input BoxTensor, T is the volume temperature hyperparameter, γ is the Euler-Mascheroni constant, β is the Gumbel intersection parameter, and softplus(x) = log(1 + exp x).
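Using the quantities named in the caption of Listing 4, the soft volume and its log-space variant can be sketched in plain Python as follows. This is a reconstruction for illustration; see Table 4 for the exact expressions used by the library:

```python
import math

def softplus(x):
    """Numerically stable softplus(x) = log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def soft_volume(z, Z, T=1.0):
    """Smoothed volume sketch: each side length (Z_i - z_i) is passed
    through a temperature-T softplus, so even disjoint boxes (negative
    side lengths) produce a small positive volume and hence a gradient."""
    vol = 1.0
    for zi, Zi in zip(z, Z):
        vol *= T * softplus((Zi - zi) / T)
    return vol

def log_soft_volume(z, Z, T=1.0):
    """Log-space variant (cf. Remark 1): summing log side lengths avoids
    the overflow/underflow of the naive product for large n."""
    return sum(math.log(T * softplus((Zi - zi) / T)) for zi, Zi in zip(z, Z))
```

As T shrinks, the soft volume approaches the hard product of side lengths for boxes with positive side lengths.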

Pooling
The library also provides pooling operations that take as input an instance of BoxTensor and reduce one of the leading dimensions by pooling across it. Currently, two types of pooling operations are implemented: intersection-based, which takes the intersection of all the boxes along a particular dimension, and mean-based, which takes the arithmetic mean of the min and max coordinates of the boxes along a dimension.
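Both pooling variants reduce a collection of boxes to a single box. In plain Python, over a list of (z, Z) pairs (a sketch of the idea, not the library's batched implementation):

```python
def intersection_pool(boxes):
    """Pool a list of (z, Z) boxes into one box via repeated hard
    intersection: max over mins, min over maxes, per dimension."""
    z = [max(c) for c in zip(*(b[0] for b in boxes))]
    Z = [min(c) for c in zip(*(b[1] for b in boxes))]
    return z, Z

def mean_pool(boxes):
    """Pool by averaging the min and max coordinates across boxes."""
    k = len(boxes)
    z = [sum(c) / k for c in zip(*(b[0] for b in boxes))]
    Z = [sum(c) / k for c in zip(*(b[1] for b in boxes))]
    return z, Z
```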

Regularization
A learning objective defined using containment conditions on boxes has excessive slack, which leads to large flat regions of local minima and, in turn, poor training. To mitigate this problem, Patel et al. (2020) introduce volume-based regularization for boxes, which augments the loss with a penalty if a box's volume exceeds a certain threshold. This penalty reduces the size of the flat local minima, facilitating better training of boxes.

Listing 5: Box pooling and regularization operations.
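The exact penalty of Patel et al. (2020) is not reproduced here; the following hinge-style sketch illustrates the idea only, and its functional form (and the name volume_penalty) is our own assumption:

```python
def volume_penalty(log_volumes, log_threshold=0.0, weight=1e-2):
    """Hinge-style volume regularizer sketch: penalize each box whose
    log-volume exceeds a threshold, discouraging boxes from growing
    without bound. Illustrative only; the library's form may differ."""
    return weight * sum(max(0.0, lv - log_threshold) for lv in log_volumes)
```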

Embedding
BoxTensor and its child classes do not store learnable parameters directly; they simply wrap the input tensor and provide an interface that interprets the wrapped tensor as a box representation. However, when working with a shallow (embedding-only) model, one needs an embedding layer that owns its parameters and outputs the boxes corresponding to the input indices. The library provides a BoxEmbedding layer that works like a native embedding layer in PyTorch or TensorFlow, i.e., it performs index lookup, but instead of returning an instance of the native tensor type, it returns an instance of BoxTensor.

Initializers
We also provide an abstract interface, BoxInitializer, for implementing various methods of initializing the learnable parameters of the BoxEmbedding layer. As a concrete example, we implement UniformBoxInitializer, which initializes boxes with uniformly random min coordinates and side lengths. It is used as the default initializer for the BoxEmbedding layer unless specified otherwise.
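The behavior of UniformBoxInitializer can be sketched as follows; the function name and sampling ranges here are illustrative defaults, not the library's:

```python
import random

def uniform_box_init(num_boxes, dim, min_lo=0.0, min_hi=0.5,
                     side_lo=0.1, side_hi=0.5):
    """Sketch of uniform box initialization: draw each box's min
    coordinates and side lengths uniformly at random, so every box
    starts valid (z_i < Z_i)."""
    boxes = []
    for _ in range(num_boxes):
        z = [random.uniform(min_lo, min_hi) for _ in range(dim)]
        Z = [zi + random.uniform(side_lo, side_hi) for zi in z]
        boxes.append((z, Z))
    return boxes
```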

Applications
In this section, we demonstrate the Box Embeddings library by using it to implement models for two real-world tasks: a representation learning task of hierarchical graph modeling (Nickel and Kiela, 2017), and the NLP task of natural language inference (Dagan et al., 2005; Bowman et al., 2015). We first demonstrate the intuition behind the containment-based loss function used to train these models using a toy example involving two 2-dimensional boxes.

Toy example
For the purpose of demonstration, we set up a toy example which embeds a simple graph with just two nodes, X and Y, and one edge (X, Y). We start with two non-overlapping boxes at initialization, box X and box Y, and use SGD to train the parameters to minimize the loss

L = -log P(X|Y), where P(X|Y) = Vol(box X ∩ box Y) / Vol(box Y).

Geometrically, this encourages box Y ⊆ box X. If using a box embedding with valid probabilistic semantics, this loss function can be interpreted as binary cross-entropy with target P(X|Y) = 1. The code for this example can be found in Appendix A.2. We visualize the containment training process in Figure 3. Each line represents the extent of a box in one dimension, with the left endpoint of a blue or orange line being the minimum coordinate of the box and the right endpoint being its maximum coordinate.
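A plain-Python sketch of this loss, using a softplus-smoothed volume so that initially disjoint boxes still receive gradient (the library's SoftVolume plays this role in the real pipeline; helper names here are our own):

```python
import math

def softplus(x):
    """Numerically stable softplus(x) = log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def log_soft_volume(z, Z, T=1.0):
    """Log of the softplus-smoothed box volume."""
    return sum(math.log(T * softplus((Zi - zi) / T)) for zi, Zi in zip(z, Z))

def toy_loss(zX, ZX, zY, ZY, T=1.0):
    """-log P(X|Y) = log Vol(box Y) - log Vol(box X ∩ box Y),
    minimized (loss -> 0) when box Y is contained in box X."""
    zI = [max(a, b) for a, b in zip(zX, zY)]  # hard intersection
    ZI = [min(a, b) for a, b in zip(ZX, ZY)]
    return log_soft_volume(zY, ZY, T) - log_soft_volume(zI, ZI, T)
```

When box Y lies inside box X the intersection equals box Y and the loss vanishes; for disjoint boxes the smoothed intersection volume is tiny and the loss is large, pulling the boxes together.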

Representing a hierarchical graph
Representing relations between the nodes of a hierarchy is useful for various NLP and machine learning tasks such as natural language inference (Wang et al., 2019; Sharma et al., 2019), entity typing (Onoe et al., 2021), multi-label classification (Chatterjee et al., 2021), and question answering (Jin et al., 2019; Fang et al., 2020). For example, in Figure 2, knowing the hypernym relationship between the pairs (herb, basil), (herb, thyme), and (herb, rosemary) can help paraphrase the sentence "This dish requires basil, thyme and rosemary" into "This dish requires several herbs." Additionally, knowing the relationship between (herb, banana) and (fruit, banana) can help answer questions such as "What is both a herb and a fruit?" Note that this latter example maps directly onto the notion of box intersection, as we are seeking an element contained in both "herb" and "fruit".

For demonstration, we train box embeddings to represent the hypernym graph of WordNet (Miller et al., 1990). Hypernymy, or IS-A, is a transitive relation between a pair of words, where one word (the hypernym) represents a general/broader concept, and the other (the hyponym) is a more specific sub-concept (Yu et al., 2015). The transitive reduction of the WordNet noun hierarchy contains 82,114 entities and 84,363 edges. The learning task is framed as edge classification: given a pair of nodes (h, t), the model outputs the probability of existence of an edge from h to t. Following Patel et al. (2020), we train an edge classification model using the transitive reduction edges augmented with varying percentages of the transitive closure edges (10%, 25%, 50%) as positive examples, and randomly sampled negative examples with a positive-to-negative ratio of 1:10. The BoxEmbedding layer is initialized with random boxes representing the nodes of the hypernym graph.
For each input pair x = (h_i, t_i), the probability of existence of the edge h_i → t_i is computed as

P(h_i → t_i) = Vol(Box(h_i) ∩ Box(t_i)) / Vol(Box(t_i)).

In our case, we use the MinDeltaBoxTensor parameterization, HardIntersection, and SoftVolume. Binary cross-entropy loss is used to train the model for edge classification. The test set consists of positive edges sampled from the rest of the transitive closure (not seen during training) and a fixed set of random negatives with the same positive-to-negative ratio as in training. As seen in Table 5, we are able to replicate the results of Patel et al. (2020).

Natural Language Inference (NLI)
Natural language inference (Dagan et al., 2005; Bowman et al., 2015) is a task where, given two sentences, a premise and a hypothesis, the model is required to pick whether the premise entails the hypothesis, contradicts the hypothesis, or whether neither relationship holds. The task is set up as multi-class classification, and in the two-class version the model is only required to decide whether the premise entails the hypothesis or not (Mishra et al., 2021). Although NLI deals with a pair of sentences at a time, in the space of all possible sentences the transitive relation of entailment establishes a partial order. If sentences are encoded as boxes, then we can train box containment to capture the transitive entailment relation. To demonstrate this, we choose the MNLI corpus (Williams et al., 2018) from the GLUE benchmark (Wang et al., 2018). Since the MNLI dataset presents the NLI task as a three-class problem, we collapse the contradiction and neutral labels into a single label called not-entails to obtain a two-class problem with class labels entails and not-entails.
In order to obtain box representations for the premise and hypothesis sentences, we use a neural network E to first obtain vector representations v_p and v_h for the premise and the hypothesis, respectively. Both vectors are then interpreted as the parameters θ_p := v_p and θ_h := v_h of box tensors. Finally, the probability of the entails class is computed as

P(entails) = Vol(Box(θ_p) ∩ Box(θ_h)) / Vol(Box(θ_p)).

The parameters of the encoder are trained using the Adam optimizer (Kingma and Ba, 2014) with binary cross-entropy as the loss. Table 6 shows the test accuracy with two different encoders. As seen, the performance is much higher than the random or majority-class baselines.

Conclusion
In this paper, we have introduced Box Embeddings, the first Python library focused on allowing region-based representations to be used with deep learning libraries. Our library implements proposed training methods and geometric operations on probabilistic box embeddings in a well-tested and numerically stable fashion. We described the concepts needed to understand and apply this library to novel tasks, and applied the library to graph modeling and natural language inference, demonstrating both shallow and deep contextualized box representations. We hope the release of this package will aid researchers in using region-based representations in their work, and that the well-documented codebase will facilitate additional methodological extensions to probabilistic box embedding models.

A Appendix
A.2 Listing 11: Training Pipeline for the Toy Example (3.1)