Joint Learning of Hyperbolic Label Embeddings for Hierarchical Multi-label Classification

We consider the problem of multi-label classification, where the labels lie on a hierarchy. However, unlike most existing works in hierarchical multi-label classification, we do not assume that the label-hierarchy is known. Encouraged by the recent success of hyperbolic embeddings in capturing hierarchical relations, we propose to jointly learn the classifier parameters as well as the label embeddings. Such a joint learning is expected to provide a twofold advantage: i) the classifier generalises better as it leverages the prior knowledge of existence of a hierarchy over the labels, and ii) in addition to the label co-occurrence information, the label-embedding may benefit from the manifold structure of the input datapoints, leading to embeddings that are more faithful to the label hierarchy. We propose a novel formulation for the joint learning and empirically evaluate its efficacy. The results show that the joint learning improves over the baseline that employs label co-occurrence based pre-trained hyperbolic embeddings. Moreover, the proposed classifiers achieve state-of-the-art generalization on standard benchmarks. We also present evaluation of the hyperbolic embeddings obtained by joint learning and show that they represent the hierarchy more accurately than the other alternatives.


Introduction
The problem of multi-label text classification is well known and extensively studied in the literature (McCallum, 1999; Yang et al., 2009; Liu et al., 2017). The fundamental assumption is that a document is associated with multiple labels from a fixed vocabulary of labels. Often, these labels are organised in a hierarchical structure. For example, consider a sample headline from the NYT (New York Times) corpus, "Voice Recognition Is Improving, but Don't Stop the Elocution Lessons", whose labels are "Top/News/Technology". Here, the labels are arranged in a hierarchy, hereafter referred to as the label hierarchy. We undertake the task of labelling documents with classes that are hierarchically organised; this problem is popularly known as hierarchical multi-label text classification (HMC). HMC methods have found several applications in online advertising systems (Agrawal et al., 2013), bio-informatics (Peng et al., 2016; Triguero and Vens, 2016), and text classification (Rousu et al., 2006; Mao et al., 2019).
The main challenge in HMC lies in modelling the classification of documents into a large, imbalanced and structured output space. In HMC, the label taxonomy is a partially ordered set (L, ≺), where L is a finite set of all class labels. The relation ≺ denotes the is-a relationship between labels, which is asymmetric, anti-reflexive and transitive (Silla and Freitas, 2011).
Hierarchical structures can provide important insights for learning and classification tasks. However, explicit knowledge of hierarchy is not available in several domains, for instance, extreme classification datasets (Bhatia et al., 2016). In this paper, we consider the problem of structured prediction from unstructured text, in which label hierarchy is not known apriori. We infer hierarchies from classification judgements on the outputs that are readily available. We focus on discovering relationships between the labels in a hyperbolic space, which has natural capacity to encode hierarchical structures.
In our approach, HIDDEN (HyperbolIc label embeDDings for hiErarchical multi-label classi-ficatioN), the labels are represented in a hyperbolic space to help respect their latent hierarchical organisation. We use this intuition to learn label embeddings for HMC without explicit supervision on the label hierarchy.
Apart from employing hyperbolic embeddings, another key aspect of our methodology is that the parameters of the classifier as well as of the label embedding are learnt jointly. We next explain the advantage in doing so. In the absence of any partial information regarding the hierarchy, label embeddings are typically learnt using the weak supervision available in label co-occurrences (Nickel and Kiela, 2017, 2018). This weak form of supervision can be complemented if the label embedding learning is also aware of the manifold structure of the input (documents); for example, similar documents tend to have similar labels. Such a strengthening is possible only if learning happens in a joint fashion. Moreover, the generalization of the classifier also improves because of the improved embeddings (and vice-versa). Our contributions can be summarised as follows:
1. We present an approach, HIDDEN, that models the implicit hierarchical organisation of labels for improved classification. It leverages properties of hyperbolic geometry to help learn embeddings for the hierarchically organised labels.
2. We present a novel formulation for jointly learning the parameters of the classifier as well as the label embedding, which can be trained solely using the supervision available in the training data, without any explicit information regarding the label hierarchy.
3. We evaluate HIDDEN on real-world as well as synthetic datasets and show: (a) significant improvement over classical multi-label classification methods as well as baselines that employ hyperbolic label embeddings learnt in isolation solely from label co-occurrence information; (b) HIDDEN sometimes generalizes even better than state-of-the-art hierarchical multi-label classifiers that have complete access to the true label hierarchy; (c) label embeddings learnt using the joint optimisation approach correlate better with the ground truth than the alternatives.

Related Work
Several conventional classification methods are capable of handling classification in multi-label settings. However, relatively few of these are designed to incorporate the possibly hierarchical organisation of the class labels. These include both traditional methods (Gopal and Yang, 2013; Lewis et al., 2004) as well as deep learning methods (Johnson and Zhang, 2015; Peng et al., 2018) across varied domains such as news articles, web content, etc. Some approaches (Bairi et al., 2015, 2016) have also attempted to identify a subset of class labels from the classification hierarchy that effectively represents most instances from the training dataset. Traditional or flat classification approaches typically perform prediction assuming that all the classes are independent of each other, ignoring the class hierarchy. In contrast, 'local' classification approaches (Koller and Sahami, 1997; Cesa-Bianchi et al., 2006) train a set of classifiers at each level of the hierarchy. However, it has also been argued (Cerri et al., 2011) that it is impractical to train separate classifiers at each level. On the other hand, 'global' approaches (Silla Jr and Freitas, 2009; Wang et al., 2001) train a single classifier that factors in the complete class hierarchy, while often also explicitly factoring in the label-label correlation (Kulkarni et al., 2018). Unlike the local approach, 'global' approaches do not suffer from the error propagation problem, although they are prone to under-fitting by not considering local information in the hierarchy. Some recent papers have proposed a mix of local and global approaches for HMC. Wehrmann et al. (2018) propose an objective that leverages both local and global information while introducing a global hierarchical violation penalty. Mao et al. (2019) employ a reinforcement learning framework to learn a label assignment policy.
They model HMC as a Markov decision process, wherein the agent takes label assignment actions on the tree hierarchy and receives scalar rewards as feedback. Chen et al. (2019) embed both documents and the label hierarchy in the same hyperbolic space and use interactions between these embeddings for HMC. Our approach differs from these in two important ways: (i) we embed only labels into the hyperbolic space, and (ii) the label hierarchy is not known a priori; all we assume is that some hidden hierarchy exists.
Recently, the use of hyperbolic geometry has been found promising in machine learning and network science for modelling data with latent hierarchies. Krioukov et al. (2010) showed that properties of complex networks, namely heterogeneous degree distributions and strong clustering, naturally manifest in hyperbolic geometry, and that a network with a heterogeneous degree distribution and metric structure can be mapped effectively to a hyperbolic space (Euclidean distance has limitations in approximating the distance between nodes in a tree). Gromov (1987) showed that any finite tree structure can be embedded into a finite hyperbolic space while preserving the distances between nodes. Nickel and Kiela (2017) learnt hierarchical representations of symbolic data by embedding them into an n-dimensional Poincaré ball, leveraging the distance property of hyperbolic spaces. Instead of relying on the true hierarchy to learn embeddings, Nickel and Kiela (2018) inferred hierarchies from real-valued similarity scores using the Lorentz model of hyperbolic geometry. We use a similar formulation in HIDDEN, leveraging the co-occurrence counts of labels across documents, but additionally (and more importantly) learn the parameters of the classifier in a joint manner.

Hyperbolic Geometry & the Poincaré Model
In this section, we give an overview of hyperbolic geometry and the Poincaré model for embedding in hyperbolic spaces (Nickel and Kiela, 2017). A hyperbolic space is a non-Euclidean Riemannian manifold of constant negative curvature. Though there are several fundamental differences between Euclidean and hyperbolic geometry, the most interesting characteristic of hyperbolic spaces is their ability to naturally represent hierarchical relations (Krioukov et al., 2010). In the Poincaré ball model, which is one of the standard models of hyperbolic geometry, the Euclidean distances between points that are equidistant according to the inherent manifold metric d fall exponentially as one moves from the origin towards the surface of the ball. This property is the key to learning continuous embeddings of hierarchies: one can imagine the root node of a hierarchy at the origin and the leaf nodes near the ball's surface. The model can then accommodate the exponentially growing number of equidistant siblings at deeper levels of the hierarchy, which is not possible in Euclidean geometry. Below we provide some details of this model. Let B^n = {x ∈ R^n : ||x|| < 1} be the open n-dimensional unit ball, where ||.|| is the Euclidean L2 norm. The Poincaré ball model is the Riemannian manifold (B^n, g_x), i.e., the open unit ball equipped with the Riemannian metric tensor g_x = (2 / (1 - ||x||^2))^2 g^E, where g^E is the Euclidean metric tensor. The geodesic distance between two points u, v ∈ B^n is given as

d(u, v) = arcosh( 1 + 2 ||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)) )    (1)

Euclidean vectors can be mapped into the ball via a projection Π whose image always lies in the Poincaré ball (refer to the Appendix for a detailed explanation).
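As a concrete illustration, the Poincaré geodesic distance can be computed directly from the closed-form expression above. The sketch below (an illustration, not part of the paper's implementation) also shows the characteristic behaviour near the boundary: two points that are Euclidean-close to the surface of the ball are geodesically far apart.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between u, v in the open unit ball B^n:
    d(u, v) = arcosh(1 + 2||u-v||^2 / ((1-||u||^2)(1-||v||^2)))."""
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq / (denom + eps))

# Points near the boundary: geodesically far apart, Euclidean-close.
a = np.array([0.95, 0.0])
b = np.array([0.0, 0.95])
print(poincare_distance(a, b))   # much larger than the Euclidean distance
print(np.linalg.norm(a - b))
```

The small eps guards against division by zero when a point sits exactly on the boundary.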

Problem Formulation and Approach
In this section, we present details of our model, training, as well as inference.

Problem Formulation
Here we consider an interesting special case of multi-label classification. The training data is of the form {(D_i, y_i)}_{i=1}^N, where D_i ∈ R^n is the input representation of the i-th document, y_i ∈ {0, 1}^L represents the set of active/annotated labels for it (y^l_i = 1 ⟺ D_i is labelled with l), and L is the total number of labels. Importantly, the labels are assumed to be nodes of an unknown, yet fixed, hierarchy. Using this prior knowledge and the training data, the goal is to learn a classifier that generalises well for labelling new documents.
Classical text classification methods ignore the informative prior knowledge that the set of labels forms a hierarchy. Most hierarchical multi-label classification models assume that the hierarchy over the labels is completely known, which might not be a pragmatic assumption, since constructing hierarchies is an expensive process, especially when the number of labels is large (Bhatia et al., 2016). In contrast, here we assume no explicit information regarding the hierarchy other than its existence, and the implicit information encoded in the training data. Also, in our set-up, we do not restrict the labels to be the leaf nodes of the hierarchy. As motivated earlier, we propose to jointly learn the classifier parameters as well as the label embeddings.

Our Model: HIDDEN
Our proposed model HIDDEN has two key components: one for representing the documents, which may lead to well-generalizing classifiers, and the other for embedding the labels in a hyperbolic space. Recall that hyperbolic spaces have been shown to be well-suited for data satisfying hierarchical relations.
Document Model F w accepts as input a document, D, and outputs a n-dimensional representation of it, F w (D) ∈ R n . Here, w is the set of parameters to be learnt. In this work, we use TextCNN (Kim, 2014) as the document model. But our approach remains valid irrespective of the chosen document model.
Label Embedding Model G_Θ accepts as input a label l and outputs a finite-dimensional representation G_Θ(l). Here, Θ is the set of parameters to be learnt. In this work, following Nickel and Kiela (2018), we employ the simple look-up based model defined by G_Θ(l) ≡ Θ y_l = Θ_l, where Θ ∈ R^{n×L}, y_l is the one-hot vector for label l, and Θ_l is the l-th column of Θ. These Euclidean embeddings Θ_l are then projected onto the Poincaré manifold using the transformation Π(x) = x / (1 + sqrt(1 + ||x||^2)) (derived in the Appendix). In summary, the hyperbolic embedding of label l is given by Π(Θ_l).
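The look-up and projection can be sketched as follows. The image of Π always lies in the open unit ball, since ||x|| / (1 + sqrt(1 + ||x||^2)) < 1 for every x; the dimensions below are illustrative choices.

```python
import numpy as np

def project_to_poincare(x):
    """Pi(x) = x / (1 + sqrt(1 + ||x||^2)); the result always lies in the
    open unit ball, i.e., on the Poincare manifold."""
    return x / (1 + np.sqrt(1 + np.dot(x, x)))

# Look-up label embedding model: column l of Theta, then project.
Theta = np.random.randn(300, 50)           # n = 300 dims, L = 50 labels
hyp_embedding_of_label_3 = project_to_poincare(Theta[:, 3])
assert np.linalg.norm(hyp_embedding_of_label_3) < 1
```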
We next assume that there exists some optimal set of parameters w*, Θ* such that the labels annotated/active for a document D are exactly those whose label representations are highly aligned with D's representation. Here, alignment between the representations is intended to model the natural intuition of appropriateness between label and document. Following the principle of large-margin separation, in this paper we employ the alignment model defined below:

ŷ^l_D(w, Θ) = σ( ⟨F_w(D), Π(Θ_l)⟩ )    (2)

where ŷ^l_D(w, Θ) denotes the alignment between the document D and the l-th label as per the model with parameters (w, Θ), and σ is the sigmoid activation function.
Inference: Given the learnt parameters (ŵ, Θ̂), the labels with ŷ^l_D(ŵ, Θ̂) > 0.5 are predicted to be the active ones for D. We next detail the proposed joint objective for learning the parameters.
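The inference rule can be sketched as below. The alignment score is taken here as the sigmoid of the inner product between the document representation and the projected label embedding, which is our reading of the paper's (unrendered) alignment equation; the exact form may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_labels(doc_repr, Theta, threshold=0.5):
    """Predict active labels for one document.

    doc_repr : (n,) output of the document model F_w(D)
    Theta    : (n, L) Euclidean label embeddings (column l is label l)

    Assumed alignment: sigma(<F_w(D), Pi(Theta_l)>), thresholded at 0.5.
    """
    # Project every column onto the Poincare ball (Pi, columnwise).
    proj = Theta / (1 + np.sqrt(1 + np.sum(Theta ** 2, axis=0)))
    scores = sigmoid(doc_repr @ proj)      # (L,) alignment scores
    return scores > threshold
```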

Joint Objective
The proposed objective consists of two terms: the first is an empirical multi-label loss term over the training data, and the second is a loss for ensuring that the hyperbolic label embeddings respect the pairwise label co-occurrence or any other such (pairwise) partial information regarding the underlying label hierarchy.
First Term is simply a binary cross-entropy loss that promotes high alignment scores for the annotated labels and low scores for the rest:

L_1(w, Θ) = - Σ_{i=1}^{N} Σ_{l=1}^{L} [ y^l_i log ŷ^l_i + (1 - y^l_i) log(1 - ŷ^l_i) ]    (3)

where ŷ^l_i is a short-hand for ŷ^l_{D_i}. Second Term induces a smaller geodesic distance in the hyperbolic space between label embeddings that co-occur more frequently than between label pairs that co-occur less frequently (Nickel and Kiela, 2018):

L_2(Θ) = - Σ_{(l, l')} log [ exp(-d(Π(Θ_l), Π(Θ_l'))) / Σ_{l'' ∈ N(l, l') ∪ {l'}} exp(-d(Π(Θ_l), Π(Θ_l''))) ]    (4)

where d is the metric in the hyperbolic space given by Eq. 1, Π(Θ_l) is the hyperbolic embedding of the l-th label, and N(l, l') is the set of all labels that co-occur with l less frequently than l' co-occurs with l.
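The second term is the soft ranking loss of Nickel and Kiela (2018). A minimal (unoptimised) sketch, assuming the pairwise label co-occurrence counts are available as a matrix, is given below; it is an illustration of Eq. 4, not the authors' exact code.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance in the Poincare ball (Eq. 1)."""
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq / (denom + eps))

def cooccurrence_ranking_loss(embeds, counts):
    """Soft ranking loss over label pairs (l, l').

    embeds : (L, n) label embeddings, already inside the unit ball
    counts : (L, L) label co-occurrence counts

    For each observed pair, the negatives N(l, l') are the labels that
    co-occur with l less often than l' does.
    """
    L = len(embeds)
    loss = 0.0
    for l in range(L):
        for lp in range(L):
            if lp == l or counts[l, lp] == 0:
                continue
            negs = [m for m in range(L)
                    if m != l and counts[l, m] < counts[l, lp]]
            cand = [lp] + negs      # positive first, then the negatives
            d = np.array([poincare_distance(embeds[l], embeds[m])
                          for m in cand])
            loss -= np.log(np.exp(-d[0]) / np.sum(np.exp(-d)))
    return loss
```

A gradient-based implementation would additionally re-project the embeddings after each update; that step is omitted here.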
The overall objective function is a weighted sum of the two components described above:

L(w, Θ) = L_1(w, Θ) + λ L_2(Θ)    (5)

We refer to the model corresponding to the parameters (w_jnt, Θ_jnt) that minimize this joint objective in Eq. 5 as HIDDEN jnt. Both components of our loss interact with each other to minimize the distance between document and label embeddings in the hyperbolic space. The advantage of the joint learning is well illustrated when HIDDEN jnt is compared with the following baseline, henceforth referred to as HIDDEN cas: (1) L_2 is minimized to obtain the label embeddings Θ̂_cas ∈ arg min_Θ L_2(Θ).
(2) These are then used in L_1 to obtain the document parameters: ŵ_cas ∈ arg min_w L_1(w, Θ̂_cas).
We also empirically compare with the following flat multi-label classification baseline, henceforth referred to as HIDDEN flt: (1) Θ_flt is fixed to the identity matrix.
(2) These are then used in L_1 to obtain the document parameters: ŵ_flt ∈ arg min_w L_1(w, Θ_flt). To evaluate the benefit of using hyperbolic spaces for embedding labels, we also compare with a variant of HIDDEN jnt, called HIDDEN euc, in which the geodesic distance d in L_2 is replaced by the Euclidean distance between the label embeddings. Note that none of the variants of HIDDEN assumes any explicit information regarding the underlying hierarchy. However, HIDDEN jnt, HIDDEN cas and HIDDEN euc exploit the prior knowledge that a label hierarchy exists, whereas HIDDEN flt, which is the classical multi-label classification network, completely ignores this useful information. Moreover, since the proposed model HIDDEN jnt performs joint learning, it is expected not only to achieve better generalization, but also to yield better label embeddings than HIDDEN cas. The simulation results in Section 5 confirm the same.

Training Details
In all our experiments, the initial word embedding layer of TextCNN in the document model is initialized using 300-dimensional GloVe embeddings (Pennington et al., 2014). Following Nickel and Kiela (2017), we randomly initialize Θ from the uniform distribution U(-0.001, 0.001). Both the document and label representations are of length n = 300. We randomly choose 10% of the training set as the validation set and report test set results for the best validation epoch. During training, dropout is applied to the outputs of the document model as well as the label model, with probabilities 0.1 and 0.6 respectively. We found λ = 0.1 to yield the best validation performance. The number of training epochs is set to 30 for all experiments. Both models are optimized using stochastic gradient descent with the Adam optimizer (Kingma and Ba, 2014), with a learning rate of 0.001 for TextCNN.
We run all our experiments on Nvidia RTX 2080 Ti GPUs (12 GB RAM) on a machine with an Intel Xeon Gold 5120 CPU having 56 cores and 256 GB RAM. It takes around 1, 2 and 5 hours to train the model on the RCV1, NYT and Yelp datasets respectively.

Experiments
In this section, we compare our approach against the baseline models and other state-of-the-art HMC approaches. We first describe the evaluation metrics and present illustrative results in a synthetic setting.

Evaluation Measures
Classifier Evaluation Measures: We use standard measures for evaluating any HMC method, viz., Macro-F1 and Micro-F1. Let TP, TN, FN, FP denote the true positive, true negative, false negative and false positive counts respectively. Precision is TP/(TP + FP) and recall is TP/(TP + FN). The F1-score is the harmonic mean of precision and recall. Macro-F1 assigns equal weight to each class and is computed as the average F1-score over all classes. Micro-F1 is the F1-score computed over all instances. Label Embedding Evaluation Measures: For a given application, let us say H* is provided to us as the ground truth hierarchy of labels/nodes, which was assumed to be unknown in our problem formulation in Section 4.1. Recall that none of the variants of HIDDEN has access to H*. How consistent with H* are the label embeddings learnt by these models? We assess this consistency using standard measures, namely Spearman's rank correlation coefficient (Zar, 2005) and Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen, 2002).
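The two classifier measures can be computed directly from per-class counts, using the identity F1 = 2TP / (2TP + FP + FN). A small sketch for multi-label predictions:

```python
import numpy as np

def f1_scores(y_true, y_pred):
    """Micro- and Macro-F1 for multi-label predictions.
    y_true, y_pred : (N, L) boolean arrays of gold / predicted labels."""
    tp = (y_true & y_pred).sum(axis=0).astype(float)
    fp = (~y_true & y_pred).sum(axis=0).astype(float)
    fn = (y_true & ~y_pred).sum(axis=0).astype(float)
    # Macro: F1 per class, then an unweighted average over classes.
    per_class = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-12)
    macro = per_class.mean()
    # Micro: pool the counts over all classes first.
    TP, FP, FN = tp.sum(), fp.sum(), fn.sum()
    micro = 2 * TP / max(2 * TP + FP + FN, 1e-12)
    return micro, macro
```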
Recall from Section 4.3 that the hyperbolic embedding of label l is Π(Θ_l), and likewise for l' it is Π(Θ_l'). The model parameters Θ might be learnt using any variant of HIDDEN. Given a query label l, the geodesic distance d(Π(Θ_l), Π(Θ_l')) is used to rank all other labels l' ≠ l; the smaller the distance, the higher the rank. Any two labels at the same geodesic distance from l are assigned the same rank. Next, we define a graded relevance score for labels with respect to the ground truth hierarchy H*. For any given query label l ∈ H*, we assign a graded relevance rel ∈ N to every other label l' ≠ l based on its distance (number of hops, hops(l, l')) from l in the hierarchy H*; the smaller the distance, the larger the graded relevance (we considered rel ∝ 1/hops(l, l'), for example).
Discounted Cumulative Gain (DCG) (Järvelin and Kekäläinen, 2002) is a standard measure of the quality of a ranking with respect to the graded relevance provided in the ground truth:

DCG@k = Σ_{i=1}^{k} rel_i / log2(i + 1)

where rel_i is the graded relevance of the label at position i of the ranking. This score is averaged over all query labels l ∈ H*.
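The DCG formula, and its normalized variant NDCG (divide by the DCG of the ideal, relevance-sorted ranking), can be sketched as:

```python
import numpy as np

def dcg_at_k(rels, k):
    """DCG@k = sum_{i=1..k} rel_i / log2(i + 1); rels in ranked order."""
    rels = np.asarray(rels, dtype=float)[:k]
    # Discounts log2(2), log2(3), ... for positions 1, 2, ...
    return np.sum(rels / np.log2(np.arange(2, len(rels) + 2)))

def ndcg_at_k(rels, k):
    """Normalize by the DCG of the ideal (descending-relevance) ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0
```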
Spearman's rank correlation coefficient, denoted by r, is a non-parametric measure of statistical dependence between the rankings of two variables. For each query label l, we measure the rank correlation between the predicted rank r_p of every other label and its rank r_h as per the ground truth hierarchy. The correlation coefficient is computed as r_l = cov(r_p, r_h) / (σ_{r_p} σ_{r_h}), where cov is the covariance and σ the standard deviation. The final score r is the average of r_l across all query labels.
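Since Spearman's coefficient is the Pearson correlation of the rank vectors, it reduces to a few lines once the ranks are given:

```python
import numpy as np

def spearman_r(rank_pred, rank_true):
    """Spearman's r between two rank vectors:
    r = cov(r_p, r_h) / (sigma_{r_p} * sigma_{r_h})."""
    rp = np.asarray(rank_pred, dtype=float)
    rh = np.asarray(rank_true, dtype=float)
    rp = rp - rp.mean()
    rh = rh - rh.mean()
    return (rp @ rh) / (np.linalg.norm(rp) * np.linalg.norm(rh))
```

For data with tied ranks, a library routine such as scipy.stats.spearmanr handles the tie correction.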

Validation via Synthetic Experiments
To observe the behaviour of the proposed approach HIDDEN with respect to the evaluation measures in a controlled environment, we present one such synthetic setup. The goal in this section is to illustrate the advantage of joint learning of parameters over isolated learning. Consider 2D data generated from 16 neatly separated Gaussians laid out on a grid, as illustrated in Figure 1. Each of the 16 Gaussians corresponds to a single label l_1, l_2, ..., l_16. We consider a second layer of 4 labels l_17, ..., l_20, obtained by grouping the Gaussians into 4; each quadrant of the larger square corresponds to a single label. Finally, we have a third layer consisting of a single label l_21, viz., the entire large square. This simple hierarchy is hidden from our variants of HIDDEN as well as from the flat model. The synthesized data is split randomly into train and test in the ratio 60:40. For each of the jointly optimised model HIDDEN jnt, the cascaded model HIDDEN cas and the flat model, we observe (i) the performance of the classification model F_w(D) measured in terms of Micro-F1 and Macro-F1, and (ii) the consistency of the label embedding model G_Θ(l) with respect to the hidden 3-level hierarchy over the 21 labels. We record observations in two settings. SETTING 1, in which, for each training instance, one of the annotated labels is dropped uniformly at random: in Table 1 we note the performance of the different approaches with increasing rates of label dropping. The jointly optimised model HIDDEN jnt accounts for the classification task through the loss component L_1 as well as the somewhat redundant label co-occurrence through the loss component L_2.
As expected, we observe that the performance of HIDDEN jnt is more robust to this form of label noise than that of the HIDDEN cas and HIDDEN flt models. This is because HIDDEN flt relies entirely on the training data and ignores the prior knowledge of the existence of a label hierarchy. HIDDEN cas is also less robust, as it over-relies on the label co-occurrence by minimising L_2 (which, in isolation, is sensitive to label noise) before venturing into the classification task by minimising L_1.
SETTING 2, in which the size of the training set is decreased without corrupting labels: we observe the performance of the different approaches with decreasing training set sizes and note that the jointly optimised model HIDDEN jnt falls back on the label correlation signals through the loss component L_2 and is therefore more robust to the decreasing size of the dataset than the flat classifier. We observed similar results for other synthetic settings. Owing to space constraints, the plots and other ranking results are provided in the supplementary material.
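The synthetic setup described above can be reproduced as follows; the grid spacing and standard deviation are illustrative choices, not necessarily the paper's exact values.

```python
import numpy as np

def make_grid_data(points_per_gaussian=100, std=0.3, seed=0):
    """16 Gaussians on a 4x4 grid; leaf labels l1..l16 (indices 0..15),
    quadrant labels l17..l20 (16..19), root l21 (20).
    Returns X of shape (N, 2) and multi-hot Y of shape (N, 21)."""
    rng = np.random.default_rng(seed)
    X, Y = [], []
    for row in range(4):
        for col in range(4):
            leaf = row * 4 + col                     # 0..15
            quad = 16 + (row // 2) * 2 + (col // 2)  # 16..19
            pts = rng.normal([col * 2.0, row * 2.0], std,
                             size=(points_per_gaussian, 2))
            y = np.zeros(21, dtype=int)
            y[[leaf, quad, 20]] = 1                  # leaf, quadrant, root
            X.append(pts)
            Y.append(np.tile(y, (points_per_gaussian, 1)))
    return np.vstack(X), np.vstack(Y)
```

Every point thus carries exactly three active labels, one per level of the hidden hierarchy.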

Real-world Text Datasets
We used three datasets, namely RCV1, Yelp and NYT, in our experiments: (1) RCV1 (Lewis et al., 2004) is a newswire dataset of articles collected between 1996 and 1997 from Reuters. (2) NYT (Sandhaus, 2008) contains articles from the New York Times published between January 1st, 1987 and June 19th, 2007. (3) Yelp (https://www.yelp.com/dataset/challenge) is a review dataset of restaurants, where each review is labelled with hierarchical categories of restaurants. Following the experimental design in Mao et al. (2019), we use the set of reviews for a business to predict the categories to which the business belongs. Some statistics pertaining to these datasets are presented in Table 2.

Comparison of models that do not use the true hierarchy
We compare the performance of the different models that do not use the true hierarchy. These include our flat baseline HIDDEN flt, the cascaded model HIDDEN cas, as well as our joint model HIDDEN jnt. We compare them against the baseline TextCNN-flat model reported in Mao et al. (2019). The results are presented in Table 3 for λ = 0.1. Overall, our baseline (HIDDEN flt) performs better than the previous baseline, with the exception of Macro-F1 on NYT. We observe improvements of the joint model HIDDEN jnt over the flat (HIDDEN flt) and cascaded (HIDDEN cas) models on RCV1 and NYT (for each of which the labels form a tree) in Table 3. However, on the Yelp dataset, the cascaded model (HIDDEN cas) performs somewhat worse (-2 Micro-F1 and -3.3 Macro-F1) than our baseline model (HIDDEN flt), hinting at the possibility that label co-occurrence information might not be helpful for the classification task. This could also be partly because the labels in Yelp are structured in the form of a DAG, and the constant curvature property of hyperbolic spaces makes them less suitable for learning DAG structures. However, the Macro-F1 performance of HIDDEN jnt is far better than that of the cascaded model HIDDEN cas. This illustrates that our joint model is able to recover from less reliable (or less useful) label co-occurrence information, just as illustrated in Table 1.

Comparison of Hyperbolic space and Euclidean space
To assess the utility of the hyperbolic space for embedding hierarchical labels, we compare HIDDEN jnt and HIDDEN euc. Table 4 presents this comparison on the three datasets. HIDDEN euc performs worse than HIDDEN jnt, which uses the hyperbolic space for embedding labels (except on Micro-F1 for Yelp, for the reasons stated before). This is expected, since embedding trees is much more effective in hyperbolic space than in Euclidean space: in hyperbolic space, volume grows exponentially with distance from the origin, while in Euclidean space this growth is polynomial. The number of nodes in a tree also increases exponentially with distance from the root, making hyperbolic spaces well-suited for embedding hierarchies.

Comparison with model that explicitly uses the true hierarchy
We compare the performance of our joint approach HIDDEN jnt against a state-of-the-art hierarchical multi-label classifier, HiLAP (Mao et al., 2019). However, unlike our proposed models (the variants of HIDDEN), HiLAP has access to the true hierarchy during both training and inference. Thus, HiLAP serves as some form of skyline for the HIDDEN suite of approaches proposed in this paper. HiLAP learns a label assignment policy using a reinforcement learning framework.
In Table 5, we compare the performance of HIDDEN jnt against the HiLAP model as reported in Mao et al. (2019). Interestingly, on RCV1, we obtain a better Micro-F1 score (+0.7) for the joint model HIDDEN jnt than for the HiLAP method. On NYT, our Micro-F1 score is far better (+7.1) than HiLAP's; our Macro-F1 score is also marginally better (+0.4). These results are interesting because HIDDEN jnt seems to obtain better generalisation through joint learning of the document classifier and label embeddings in a hyperbolic space, even without access to the true hierarchy. However, on Yelp, HiLAP seems to benefit over HIDDEN jnt by explicitly using the true hierarchy.

Evaluating performance of embeddings
We compare the embeddings learned using different approaches with the ground truth hierarchy to evaluate their effectiveness. Figure 2 shows the plot of NDCG scores for different values of k on the RCV1 and NYT datasets for HIDDEN cas and HIDDEN jnt (for two different values of λ). In Table 6, we compare the Spearman rank correlations. The superior performance of HIDDEN jnt strongly suggests that the embeddings learnt using the joint model are more representative of the true hierarchical organisation of the labels than those obtained using the flat and cascaded variants. This also goes to show that even the first term in our objective contributes positively towards the learning of the hyperbolic embeddings, and that joint learning is indeed beneficial.

Figure 2: Plot of NDCG versus k for assessing the quality of the learnt label embeddings with respect to the actual hierarchy on the RCV1 and NYT datasets. The better performance of HIDDEN jnt indicates that the label embeddings Θ jnt are the most representative of the true hierarchy.

Conclusion
We propose a novel approach to hierarchical multi-label classification based on joint learning of a document classifier and label embeddings in hyperbolic space. The proposed framework HIDDEN allows us to discover hierarchical relationships among labels by leveraging properties of hyperbolic geometry. Even though the label hierarchy is assumed to be unavailable, our method achieves results comparable with state-of-the-art hierarchy-aware methods. We performed extensive experiments on three datasets and demonstrated the effectiveness of the learned embeddings.

Appendix

Explanation for Π(x)
The Lorentz model is defined as the Riemannian manifold L^n = (H^n, g_l), where H^n = {x ∈ R^{n+1} : ⟨x, x⟩_L = -1, x_0 > 0} and g_l = diag([-1, 1, ..., 1]). Here ⟨x, y⟩_L, known as the Minkowski inner product, is given by

⟨x, y⟩_L = -x_0 y_0 + Σ_{i=1}^{n} x_i y_i

The Poincaré model and the Lorentz model are isometrically equivalent. Therefore, points on the Lorentz manifold can be mapped into the Poincaré ball as

p(x) = (x_1, ..., x_n) / (1 + x_0)

A point x in the Euclidean space R^n can be projected onto the Lorentz manifold H^n using the transformation Ω(x) = (sqrt(1 + ||x||^2), x). This transformation ensures that the Minkowski inner product ⟨Ω(x), Ω(x)⟩_L equals -1, and that the first component Ω(x)_0 = sqrt(1 + ||x||^2) is positive, as required for membership in the Lorentz manifold. Now, using the isometry between the Poincaré and Lorentz models (Nickel and Kiela, 2018), we have Π : R^n → B^n as

Π(x) = p(Ω(x)) = x / (1 + sqrt(1 + ||x||^2))

Since ||Π(x)|| = ||x|| / (1 + sqrt(1 + ||x||^2)) < 1 for every x, the image always lies in the Poincaré ball.
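The chain of maps Ω, p and Π described above can be checked numerically; the sketch below verifies that Ω lands on the hyperboloid and that the composition reproduces the closed form for Π:

```python
import numpy as np

def minkowski_inner(x, y):
    """<x, y>_L = -x_0 y_0 + sum_{i>=1} x_i y_i."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lift_to_lorentz(x):
    """Omega(x) = (sqrt(1 + ||x||^2), x), a point on H^n."""
    return np.concatenate([[np.sqrt(1 + np.dot(x, x))], x])

def lorentz_to_poincare(z):
    """p(z) = (z_1, ..., z_n) / (1 + z_0)."""
    return z[1:] / (1 + z[0])

x = np.random.randn(5)
z = lift_to_lorentz(x)
# On the hyperboloid: <z, z>_L = -1; composing the maps gives
# Pi(x) = x / (1 + sqrt(1 + ||x||^2)), a point inside the unit ball.
```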

Dataset Details
We describe the details of the datasets used in our experiments. For RCV1 dataset (Lewis et al., 2004), we use the original training/test split and use 10% of the training set as the validation set. We introduce an extra Root label in addition to the 103 labels present in the dataset. Each document in the dataset is labelled with this label.
The details for the other datasets used are same as in Mao et al. (2019) and we refer the readers to the same.

Remarks on Synthetic Experiments
In SETTING 1, with increasing probability of a label being randomly dropped, we observe in Figure 3 that the performance of all the models decreases, which is expected. However, it is interesting to note that HIDDEN jnt is more robust to the noisy labels and always performs better than HIDDEN flt, since it has an additional source of information about the labels via label co-occurrences. This information is also implicitly available to HIDDEN flt, but from our experiments we observe that providing it explicitly improves performance. We also observe that the performance of HIDDEN cas fluctuates quite a bit, probably because it over-relies on label co-occurrence and commits to a set of label embeddings before trying to solve the classification task.
In SETTING 2, with an increasing number of points in the training set, we observe in Figure 4 that with very small datasets HIDDEN cas performs slightly better than HIDDEN jnt, but with increasing dataset sizes HIDDEN jnt performs better. Since there is no label noise in this setting, the label co-occurrences are expected to be quite meaningful, and with small dataset sizes, jointly learning both the document embedding model F_w and the label embeddings Θ (HIDDEN jnt) is more difficult than learning Θ first and then F_w (HIDDEN cas).
HIDDEN jnt and HIDDEN cas perform better than HIDDEN flt, since the former two can fall back on the label correlation signal via L_2, which is available to them in a much more easily usable form than to HIDDEN flt. As discussed above, making the label co-occurrences explicitly available to the models helps achieve better performance.
We observe similar results in other synthetic settings.
Hierarchy of the Synthetic Data

As described in the paper, the synthetic data is generated from 16 bivariate Gaussian distributions with their means placed evenly in a 4 × 4 grid. These Gaussians are then grouped at various levels to obtain a hierarchy with 3 levels, as shown in Figure 5.

Contrasting label hierarchies across datasets
Recall that in Table 3 of the paper we observe improvements of the joint model HIDDEN jnt over the flat (HIDDEN flt) and cascaded (HIDDEN cas) models on the RCV1 and NYT datasets. Note that the labels of RCV1 as well as of NYT form trees. However, on the Yelp dataset, the cascaded model (HIDDEN cas) performs somewhat worse (-2 Micro-F1 and -3.3 Macro-F1) than our baseline, hinting at the possibility that label co-occurrence information might not be helpful for the classification task. This could also be partly because the labels in Yelp are structured in the form of a DAG, with 12 labels in the label set having more than one parent.

δ-hyperbolicity
To further investigate why our method fails to perform well on the Yelp dataset, we compute the hyperbolicity (Gromov, 1987) of each of the label hierarchies.
The hyperbolicity δ of a graph G is a measure of how tree-like the graph is: the lower the δ, the more tree-like the graph, and δ is 0 for trees.

Dataset        RCV1  NYT  Yelp
Hyperbolicity  0     1    1

Table 7: Hyperbolicity (Gromov, 1987) of the label hierarchies for the datasets used.

As shown in Table 7, RCV1 has a hyperbolicity of 0, as expected, since its label hierarchy is a tree. We would have expected NYT to also have a hyperbolicity of 0, but the label Others appears at different levels of the hierarchy, making it a DAG. However, this is the only deviation from being a tree, and thus our method is able to perform well on NYT. For Yelp, the hyperbolicity is 1 and there are multiple labels with more than one parent, making it less tree-like than the NYT hierarchy.
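Definitions of δ-hyperbolicity vary; a common one is Gromov's four-point condition, sketched below for an unweighted, connected graph given as an adjacency dictionary. We assume the values in Table 7 follow this convention (trees give δ = 0), though the paper does not state which definition was used.

```python
from collections import deque
from itertools import combinations

def bfs_dists(adj, src):
    """Shortest-path (hop) distances from src in an unweighted graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def gromov_delta(adj):
    """Gromov four-point delta: for each quadruple of nodes, sort the three
    pairwise distance sums s1 >= s2 >= s3; the quadruple contributes
    (s1 - s2) / 2.  Trees give 0; larger delta means less tree-like."""
    nodes = list(adj)
    D = {u: bfs_dists(adj, u) for u in nodes}
    delta = 0.0
    for x, y, z, w in combinations(nodes, 4):
        s = sorted([D[x][y] + D[z][w],
                    D[x][z] + D[y][w],
                    D[x][w] + D[y][z]], reverse=True)
        delta = max(delta, (s[0] - s[1]) / 2)
    return delta
```

This brute-force check is exponential in spirit (all node quadruples), so it is only practical for small label graphs such as the ones considered here.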