HyHTM: Hyperbolic Geometry based Hierarchical Topic Models

Hierarchical Topic Models (HTMs) are useful for discovering topic hierarchies in a collection of documents. However, traditional HTMs often produce hierarchies where lower-level topics are unrelated and not specific enough to their higher-level topics. Additionally, these methods can be computationally expensive. We present HyHTM, a Hyperbolic geometry based Hierarchical Topic Model, that addresses these limitations by incorporating hierarchical information from hyperbolic geometry to explicitly model hierarchies in topic models. Experimental results with four baselines show that HyHTM can better attend to parent-child relationships among topics. HyHTM produces coherent topic hierarchies that specialise in granularity from generic higher-level topics to specific lower-level topics. Further, our model is significantly faster and leaves a much smaller memory footprint than our best-performing baseline. We have made the source code for our algorithm publicly accessible.


Introduction
The topic model family of techniques is designed to solve the problem of discovering human-understandable topics from unstructured corpora (Paul and Dredze, 2014), where a topic can be interpreted as a probability distribution over words (Blei et al., 2001). Hierarchical Topic Models (HTMs), in addition, organize the discovered topics in a hierarchy, allowing them to be compared with each other. The topics at higher levels are generic and broad, while the topics lower down in the hierarchy are more specific (Teh et al., 2004).
While significant efforts have been made to develop HTMs (Blei et al., 2003; Chirkova and Vorontsov, 2016; Isonuma et al., 2020; Viegas et al., 2020), there are still certain areas of improvement.

[Figure 1: (a) A concept tree in Euclidean space. Words such as space shuttle and satellite, which belong to moderately different super-concepts such as vehicles and space, respectively, are brought closer together due to their semantic similarity. This leads to a convergence of their surrounding words, such as helicopter and solar system, creating a false distance relationship and a crowding effect in Euclidean spaces. (b) A concept tree in hyperbolic space (Poincaré ball), which inherently has more space (represented by grey circles) than Euclidean spaces. The distances here grow exponentially towards the edge of the ball, and concepts at deeper levels, such as helicopter and solar system, move apart in these growing spaces and are far from each other. The dashed blue line shows how the distances in both spaces are calculated.]
First, the ordering of topics generated by these approaches provides little to no information about the granularity of concepts within the corpus. By granularity, we mean that topics near the root should be more generic, while topics near the leaves should be more specific. Second, the lower-level topics must be related to the corresponding higher-level topics. Finally, some of these approaches, such as CluHTM (Viegas et al., 2020), are very computationally intensive. We argue that these HTMs have such shortcomings primarily because they do not explicitly account for the hierarchy of words between topics.
Most of the existing approaches use document representations that employ word embeddings from Euclidean spaces. These spaces tend to suffer from the crowding problem, which is the tendency to accommodate moderately distant words close to each other (Van der Maaten and Hinton, 2008). Several notable efforts have shown that Euclidean spaces are suboptimal for embedding concepts in hierarchies such as trees, words, or graph entities (Chami et al., 2019, 2020; Guo et al., 2022).
In Figure 1(a), we show the crowding of concepts in Euclidean spaces. Words such as space shuttle and satellite, which belong to moderately different concepts such as vehicles and space, respectively, are brought closer together due to their semantic similarity. This also leads to a convergence of their surrounding words, such as helicopter and solar system, creating a false distance relationship. As a result of this crowding, topic models such as CluHTM that use Euclidean word similarities in their formulation tend to mix words that belong to different topics.
Contrary to this, hyperbolic spaces are naturally equipped to embed hierarchies with arbitrarily low distortion (Nickel and Kiela, 2017; Tifrea et al., 2019; Chami et al., 2020). The way distances are computed in these spaces is similar to tree distances, i.e., children and their parents are close to each other, but leaf nodes in completely different branches of the tree are very far apart (Chami et al., 2019). In Figure 1(b), we visualise this intuition on a Poincaré ball representation of hyperbolic geometry (discussed in detail in Section 3). As a result of this tree-like distance computation, hyperbolic spaces do not suffer from the crowding effect, and words like helicopter and satellite are far apart in the embedding space.
Inspired by the above intuition, and to tackle the shortcomings of traditional HTMs, we present HyHTM, a Hyperbolic geometry based Hierarchical Topic Model which uses hyperbolic geometry to create topic hierarchies that better capture hierarchical relationships in real-world concepts. To achieve this, we propose a novel method of incorporating the semantic hierarchy among words from hyperbolic spaces and encoding it explicitly into topic models. This encourages the topic model to attend to parent-child relationships between topics.

Experimental results and qualitative examples show that incorporating hierarchical information guides the lower-level topics and produces coherent, specialised, and diverse topic hierarchies (Section 6). Further, we conduct ablation studies with different variants of our model to highlight the importance of using hyperbolic embeddings for representing documents and guiding topic hierarchies (Section 7). We also compare the scalability of our model with different sizes of datasets and find that our model is significantly faster and leaves a much smaller memory footprint than our best-performing baseline (Section 6.1). We also present qualitative results (Section 6.2), where we observe that HyHTM topic hierarchies are much more related, diverse, and specialised. Finally, we discuss and perform in-depth ablations to show the role of hyperbolic spaces and the importance of every choice we made in our algorithm (See Section 7).

[Figure 2: For similar document labels (comp.graphics, comp.os.ms-windows.misc), HyHTM is better at discriminating between the labels. CluHTM's root-level topics are not related to computer concepts, and it cannot separate these labels at lower levels. HyHTM groups them in the same root-level topic and separates them into different lower-level topics, showing the advantage of using hyperbolic embeddings over Euclidean ones to avoid the crowding problem. We show the top words with the highest probability for the topics.]

Related Work
To the best of our knowledge, HTMs can be classified into three categories. (I) Bayesian generative models like hLDA (Blei et al., 2003) and its variants (Paisley et al., 2013; Kim et al., 2012; Tekumalla et al., 2015) utilize Bayesian methods like the Gibbs sampler for inferring the latent topic hierarchy. These are not scalable due to the high computational requirements of posterior inference. (II) Neural topic models like TSNTM (Isonuma et al., 2020) and others (Wang et al., 2021; Pham and Le, 2021) use neural variational inference for faster parameter inference and some heuristics to learn topic hierarchies, but lack the ability to learn appropriate semantic embeddings for topics. Along with these methods, there are (III) non-negative matrix factorization (NMF) based topic models, which decompose a term-document matrix (like bag-of-words) into low-rank factor matrices to find latent topics. The hierarchy is learned using heuristics (Liu et al., 2018a,b) or regularisation methods (Chirkova and Vorontsov, 2016) based on topics at the previous level.
However, the sparsity of the BoW representation for all these categories leads to incoherent topics, especially for short texts. To overcome this, some approaches have resorted to incorporating external knowledge from knowledge bases (KBs) (Duan et al., 2021b; Wang et al.) or leveraging word embeddings (Meng et al., 2020). Pre-trained word embeddings are trained on a large corpus of text data and capture relationships between words such as semantic similarities and concept hierarchies. These are used to guide the topic hierarchy learning process by providing a semantic structure to the topics. Viegas et al. (2020) utilize Euclidean embeddings for learning the topic hierarchy. However, Tifrea et al. (2019); Nickel and Kiela (2017); Chami et al. (2020); Dai et al. (2021) have shown how the crowding problem in Euclidean spaces makes such spaces suboptimal for representing word hierarchies. These works show how hyperbolic spaces can model more complex relationships better while preserving structural properties like the concept hierarchy between words. Recently, shi Xu et al. made an attempt to learn topics in hyperbolic embedding spaces. Contrary to the HTMs above, this approach adopts a bottom-up training regime where it learns topics at each layer individually, starting from the bottom, and during training leverages a topic-linking approach from Duan et al. (2021a) to link topics across levels. They also have a supervised variant that incorporates the concept hierarchy from KBs.
Our approach uses latent word hierarchies from pretrained hyperbolic embeddings to learn the hierarchy of topics that are related, diverse, specialized, and coherent.

Preliminaries
We will first review the basics of Hyperbolic Geometry and define the terms used in the remainder of this section.We will then describe the basic building blocks for our proposed solution, followed by a detailed description of the underlying algorithm.

Hyperbolic Geometry
Hyperbolic geometry is a non-Euclidean geometry with a constant negative Gaussian curvature. Hyperbolic geometry does not satisfy the parallel postulate of Euclidean geometry: given a line and a point not on it, there are at least two lines parallel to it. There are many models of hyperbolic geometry, and we direct the interested reader to an excellent exposition of the topic by Cannon et al. (1997). We base our approach on the Poincaré ball model, where all the points in the geometry are embedded inside an n-dimensional unit ball equipped with a metric tensor (Nickel and Kiela, 2017). Unlike Euclidean geometry, where the distance between two points is defined as the length of the line segment connecting them, given two points u ∈ D^n and v ∈ D^n, the distance between them in the Poincaré model is defined as follows:

d_P(u, v) = arcosh( 1 + 2 ∥u − v∥² / ((1 − ∥u∥²)(1 − ∥v∥²)) )    (1)

Here, arcosh is the inverse hyperbolic cosine function, and ∥.∥ is the Euclidean norm. Figure 1 shows an exemplary visualization of how words get embedded in hyperbolic spaces using the Poincaré ball model. As illustrated in Figure 1(b), distances in hyperbolic space follow a tree-like path, and hence they are informally also referred to as tree distances. As can be observed from the figure, the distances grow exponentially larger as we move toward the boundary of the Poincaré ball. This alleviates the crowding problem typical of Euclidean spaces, making hyperbolic spaces a natural choice for the hierarchical representation of data.
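As a concrete illustration, the Poincaré distance of Equation (1) can be computed directly with NumPy (a minimal sketch; the function and point names are ours, purely for illustration):

```python
import numpy as np

def poincare_distance(u, v):
    """Distance between two points inside the unit (Poincare) ball:
    arcosh(1 + 2*||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))."""
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_diff / denom)

# Two points with a Euclidean gap of only 0.05, but near the boundary:
a = np.array([0.90, 0.0])
b = np.array([0.95, 0.0])
# poincare_distance(a, b) is an order of magnitude larger than 0.05,
# illustrating how distances blow up toward the edge of the ball.
```

Note how two points that are Euclidean-close near the boundary are pushed far apart hyperbolically, which is exactly the behaviour that keeps deep leaf concepts in different branches well separated.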

Matrix Factorization for Topic Models
A topic can be defined as a ranked list of strongly associated terms representative of the documents belonging to that topic. Let us consider a document corpus D consisting of n documents d_1, d_2, ..., d_n, and let V be the corpus vocabulary consisting of m distinct words w_1, w_2, ..., w_m. The corpus can also be represented by a document-term matrix A ∈ R^{n×m} such that A_ij represents the relative importance of word w_j in document d_i (typically the TF-IDF weight of w_j in d_i).
A popular way of inferring topics from a given corpus is to factorize the document-term matrix. Typically, non-negative matrix factorization (NMF) is employed to decompose the document-term matrix A into two non-negative approximate factors: W ∈ R^{n×N} and H ∈ R^{N×m}. Here, N can be interpreted as the number of underlying topics. The factor matrix W can then be interpreted as the document-topic matrix, providing the topic memberships for documents, and H, the topic-term matrix, describes the probability of a term belonging to a given topic. This basic algorithm can also be applied recursively to obtain a hierarchy of topics by performing NMF on the set of documents belonging to each topic produced at a given level to get more fine-grained topics (Chirkova and Vorontsov, 2016; Viegas et al., 2020).
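The generic recursive-NMF scheme described above can be sketched as follows (a simplified illustration using scikit-learn's NMF; the function and parameter names are our own, not from the HyHTM codebase):

```python
import numpy as np
from sklearn.decomposition import NMF

def recursive_nmf(A, vocab, n_topics=3, depth=2, min_docs=5):
    """Recursively factorize a document-term matrix into a topic tree.

    Returns a list of nodes; each node holds its top terms and child nodes.
    """
    if depth == 0 or A.shape[0] < min_docs:
        return None
    model = NMF(n_components=n_topics, init="nndsvda", max_iter=400)
    W = model.fit_transform(A)          # document-topic memberships
    H = model.components_               # topic-term weights
    assignment = W.argmax(axis=1)       # hard-assign each document to a topic
    tree = []
    for t in range(n_topics):
        top_terms = [vocab[j] for j in H[t].argsort()[::-1][:5]]
        docs = np.where(assignment == t)[0]
        child = recursive_nmf(A[docs], vocab, n_topics, depth - 1, min_docs)
        tree.append({"terms": top_terms, "children": child})
    return tree
```

Each recursion factorizes only the documents assigned to one parent topic, yielding progressively finer topics, which is the baseline that HyHTM's hierarchy-aware re-weighting improves upon.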

Hierarchical Topic Models Using Hyperbolic Geometry
We now describe HyHTM -our proposed Hyperbolic geometry-based Hierarchical Topic Model.
We first describe how we capture semantic similarity and hierarchical relationships between terms in hyperbolic space. We then describe the step-by-step algorithm for utilizing this information to generate a topic hierarchy.

Learning Document Representations in Hyperbolic Space and Root Level Topics
As discussed in Section 3.2, the first step in inferring topics from a corpus using NMF is to compute the document-term matrix A. A typical way to compute A is by using the TF-IDF weights of terms in a document, which provides representations of the documents in the term space. However, usage of TF-IDF (and its variants) results in sparse representations and ignores the semantic relations between different terms by considering only the terms explicitly present in a given document. Viegas et al. (2019) proposed an alternative formulation for document representations that utilizes pre-trained word embeddings to enrich the document representations by incorporating weights for words that are semantically similar to the words already present in the document. The resulting document representations are computed as follows:

A = (TF · M_S) ⊙ IDF    (2)
Here, ⊙ indicates the Hadamard product (with the IDF vector broadcast across documents). A is the n × m document-term matrix. TF is the term-frequency matrix such that TF_{i,j} = tf(d_i, w_j), and M_S is the m × m term-term similarity matrix that captures the pairwise semantic relatedness between the terms, defined as M_S(i, j) = sim(w_i, w_j), where sim(w_i, w_j) represents the similarity between terms w_i and w_j and can be computed using typical word representations such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). Finally, IDF is the m × 1 inverse-document-frequency vector representing the corpus-level importance of each term in the vocabulary. Note that Viegas et al. (2019) used the following modified variant of IDF in their formulation, which we also adopt in this work:

IDF(w_i) = log( n / Σ_{d ∈ D} µ(w_i, d) )    (3)
Here, µ(w_i, d) is the average of the similarities between term w_i and all the terms w in document d such that M_S(w_i, w) ≠ 0. Thus, unlike the traditional IDF formulation, where the denominator is the document frequency of a term, the denominator in the above formulation captures the semantic contribution of w_i to all the documents.
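A minimal sketch of the enriched document representation described above, reading the combination as a matrix product of TF with M_S, Hadamard-scaled by the IDF vector (all names are illustrative):

```python
import numpy as np

def enriched_doc_term(TF, M_S, idf):
    """Enriched document-term matrix: each document also receives weight
    for terms that are semantically similar to the terms it contains.

    TF  : (n_docs, m) term-frequency matrix
    M_S : (m, m) term-term similarity matrix (zeros below a threshold)
    idf : (m,) corpus-level term weights, broadcast over documents
    """
    return (TF @ M_S) * idf
```

For instance, a document that mentions only "dog" still receives a nonzero weight for "puppy" whenever M_S links the two terms, which is exactly the densification that plain TF-IDF lacks.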
In our work, we adapt the above formulation to obtain document representations in hyperbolic spaces by using Poincaré GloVe embeddings (Tifrea et al., 2019), an extension of the traditional Euclidean-space GloVe (Pennington et al., 2014) to hyperbolic spaces. Due to the nature of the Poincaré ball model, the resulting embeddings arrange the corresponding words in a hierarchy such that sub-concept words are closer to their parent words than to the sub-concept words of other parents.
There is one final missing piece of the puzzle before we can obtain suitable document representations in hyperbolic space. Recall that due to the nature of the Poincaré ball model, despite all the points being embedded in a unit ball, the hyperbolic distances between points, i.e., tree distances (Section 3.1), grow exponentially as we move towards the boundary of the ball (see Figure 1). Consequently, the distances are not bounded between 0 and 1. As NMF requires all terms in the input matrix to be non-negative, we cannot directly use these distances to compute the term-term similarity matrix M_S in Equation (2), as 1 − d_P(w, w′) can be negative. To overcome this limitation, we introduce the notion of Poincaré Neighborhood Similarity (s_pn), which uses a neighborhood normalization technique. The k-neighborhood of a term w is defined as the set of top k-nearest terms w_1, ..., w_k in the hyperbolic space and is denoted as n_k(w). For every term in the vocabulary V, we first calculate the pairwise Poincaré distances to other terms using Equation (1). Then, for every term w ∈ V, we compute similarity scores with all the other terms in its k-neighborhood n_k(w) by dividing each pairwise Poincaré distance between the term and its neighbor by the maximum pairwise distance in the neighborhood. This can be represented by the following equation, where w′ ∈ n_k(w):

s_pn(w, w′) = 1 − d_P(w, w′) / max_{w″ ∈ n_k(w)} d_P(w, w″)    (4)

With this, we can now compute the term-term similarity matrix M_S as follows:

M_S(w, w′) = s_pn(w, w′) if w′ ∈ n_{k_S}(w) and s_pn(w, w′) ≥ α, and 0 otherwise    (5)
Note that there are two hyperparameters to control the neighborhood: (i) the neighborhood size, k_S; and (ii) the quality of words, α, which keeps weights only for pairs of terms whose similarity crosses the pre-defined threshold α, thereby reducing noise in the matrix. Without α, words with very low similarity may get included in the neighborhood, eventually leading to noisy topics.
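The neighborhood-normalized similarity can be sketched as follows, assuming a precomputed pairwise Poincaré distance matrix. This is our reading of the normalization described in the text (similarity = 1 minus the distance divided by the neighborhood maximum, thresholded at α); the helper name is illustrative:

```python
import numpy as np

def poincare_neighborhood_similarity(D, k, alpha):
    """Build a sparse term-term similarity matrix from an (m, m) Poincare
    distance matrix D: for each term keep only its k nearest neighbours,
    normalize distances by the neighbourhood maximum, threshold at alpha."""
    m = D.shape[0]
    M_S = np.zeros((m, m))
    for w in range(m):
        order = np.argsort(D[w])
        nbrs = order[1:k + 1]            # k nearest terms, excluding w itself
        d_max = D[w, nbrs].max()
        for w2 in nbrs:
            s = 1.0 - D[w, w2] / d_max   # in [0, 1] within the neighbourhood
            if s >= alpha:               # drop low-quality pairs
                M_S[w, w2] = s
        M_S[w, w] = 1.0                  # a term is fully similar to itself
    return M_S
```

A real implementation would vectorize the loops and use a sparse matrix, but the normalization logic is the same.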
We now have all the ingredients to compute the document-representation matrix A in the hyperbolic space, and NMF can be performed to obtain the first set of topics from the corpus as described in Section 3.2. This gives us the root-level topics of our hierarchy. Next, we describe how we can discover topics at subsequent levels.

Building the Topic Hierarchy
In order to build the topic hierarchy, we can iteratively apply NMF to the topics discovered at each level, as is typically done in most NMF-based approaches. However, working in hyperbolic space additionally allows us to utilize the hierarchical information encoded in the space to better guide the discovery of topic hierarchies. Observe that the notion of similarity in hyperbolic space as defined in Equation (4) relies on the size of the neighborhood. A large neighborhood of a term will include not only its immediate children and ancestors but also other semantically similar words that may not be hierarchically related. On the other hand, a small neighborhood will include only the immediate parent-child relationships between words, since sub-concept words are close to their concept words. HyHTM uses this arrangement of words in hyperbolic space to explicitly guide the lower-level topics to be more related and specific to higher-level topics. In order to achieve this, we construct a term-term hierarchy matrix M_H, defined analogously to M_S but with a neighborhood of size k_H:

M_H(w, w′) = s_pn(w, w′) if w′ ∈ n_{k_H}(w), and 0 otherwise    (6)

Here, k_H is a hyperparameter that controls the neighborhood size. M_H is a crucial component of our algorithm, as it encodes the hierarchy information and helps guide the lower-level topics to be related and specific to the higher-level topics. Without loss of generality, let us assume we are at the i-th topic node t_i at level l in the hierarchy. We begin by computing A_0 = A, as outlined in Equation (2), at the root node (representing all topics) and subsequently obtaining the first set of topics (at level l = 1). Also, let the number of topics at each node in the hierarchy be N (a user-specified parameter). Every document is then assigned to the one topic with which it has the highest association in the document-topic matrix W_{l−1}. Once all the documents are collected into disjoint parent topics, we use the subset of A_0 containing only the set of documents (D_{t_i}) belonging to the i-th topic, and denote this by A_{l−1}. We then branch out to N lower-level topics at the i-th node, using the following steps.

Parent-Child Re-weighting for Topics in the Next Level: We use the term-term hierarchy matrix M_H to assign more attention to words hierarchically related to the terms in topic node t_i, and to guide the topic hierarchy so that the lower-level topics are consistent with their parent topics. We take the product of the topic-term distribution of t_i with the hierarchy matrix M_H, which re-weights M_H with respect to the associations in the topic-term matrix:

M_{t_i} = (1_i H_{l−1}) ⊙ M_H    (7)

Here, 1_i is the one-hot vector for topic i, and H_{l−1} is the topic-term factor obtained by factorizing the document representations A_{l−1} of the parent level; the row vector 1_i H_{l−1} is broadcast over the rows of M_H.

Document representations for computing next-level topics: We now compute the updated document representations for documents in topic node t_i, which infuse semantic similarity between terms with hierarchical information, as follows:

A_l = A_{l−1} M_{t_i}    (8)

Using the updated document representations A_l, we perform NMF as usual and obtain topics for level l + 1. The algorithm then continues to discover topics at subsequent levels and stops exploring the topic hierarchy under two conditions: (i) if it reaches a topic node such that the number of documents in the node is less than a threshold (D_min); or (ii) when the maximum hierarchy depth (L_max) is reached. We summarize the whole process as pseudocode in Algorithm 1.
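One branching step (parent-child re-weighting followed by NMF) can be sketched as below. The exact matrix forms are paraphrased from the text, reading the re-weighting as scaling the columns of M_H by the parent topic's term distribution; all names are illustrative:

```python
import numpy as np
from sklearn.decomposition import NMF

def next_level_topics(A_parent, H_parent, topic_idx, M_H, n_topics):
    """One branching step of the hierarchy: re-weight the term-term
    hierarchy matrix M_H by the parent topic's term distribution, update
    the document representations, and factorize them into child topics."""
    topic_terms = H_parent[topic_idx]      # parent topic-term weights, shape (m,)
    M_ti = topic_terms[None, :] * M_H      # emphasize terms important to the parent
    A_child = A_parent @ M_ti              # updated document representations
    model = NMF(n_components=n_topics, init="nndsvda", max_iter=400)
    W = model.fit_transform(A_child)       # child document-topic memberships
    return W, model.components_            # and child topic-term factors
```

Because M_H keeps only small hyperbolic neighborhoods, documents in A_child gain mass only on terms that are both hierarchically near their own terms and relevant to the parent topic, which is what keeps the children specific to their parent.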

Algorithm 1: The HyHTM Algorithm
Input: Max depth level (L_max); Min # of documents (D_min); Default # of topics (N)
Output: Hierarchy of topics

1  Compute A using Eq (2) & (5)
2  GetHier(A, 1)
3  def GetHier(A, L):
4      if L > L_max: return
5      W_{l−1}, H_{l−1} ← NMF(A, N)
6      for each topic t_j at this level:
7          Get parent topic using H_{l−1}
8          Add topic to hierarchy
9          Get docs of topic t_j using W_{l−1}
10         if # docs < D_min: continue
11         Compute parent-child re-weighting M_{t_i} using Eq (7)
12         Compute next-level A_l from M_{t_i} & A_{l−1} using Eq (8)
13         GetHier(A_l, L + 1)

Experimental Setup
Datasets: To evaluate our topic model, we consider 8 well-established public benchmark datasets.
In Table 1, we report the number of words and documents, as well as the average number of words per document. We have used datasets with varying numbers of documents and average document lengths. We provide preprocessing details in the Appendix (See C.1).

Baseline Methods: Our model is a parametric topic model, which requires a fixed number of topics to be specified. This is different from non-parametric models, which automatically learn the number of topics during training. For the sake of completeness, we also compare our model to various non-parametric models such as hLDA (Blei et al., 2003), a Bayesian generative model, and TSNTM (Isonuma et al., 2020), which uses neural variational inference. We also compare with NMF-based parametric models like hARTM (Chirkova and Vorontsov, 2016), which learns a topic hierarchy from the bag-of-words representation of documents, and CluHTM (Viegas et al., 2020), which uses Euclidean pre-trained embeddings (Mikolov et al., 2017) to provide semantic-similarity context to topic models. We provide the implementation details of these baselines in the Appendix (See C).
Number of topics: hARTM only allows fixing the total number of topics at a level and cannot specify the number of child topics for every parent topic. CluHTM, on the other hand, has a method to learn the optimal number of topics, but it is highly inefficient. We use the same number of topics in hARTM, CluHTM, and HyHTM for a fair comparison.
We fix the number of topics at the top level as 10, with 10 sub-topics under each parent topic. The total number of topics at each level is thus 10, 100, and 1000. The non-parametric models hLDA and TSNTM learn the number of topics, and we report these numbers in the Appendix (See E). We select the best values for the hyperparameters k_H, k_S, and α by tuning them for the model with the best empirical results. We report these in Appendix C.

Experimental Results
In this section, we compare our model's performance on well-established metrics to assess the coherence, specialisation, and diversity of topics. We present a qualitative comparison for selected topics in Figure 2 and in Section 6.2. We discuss and perform ablations to show the role of hyperbolic spaces and the effectiveness of our algorithm (See Section 7).

RQ1: Does HyHTM produce coherent topics? Topic coherence is a measure of how much the words within a topic co-occur in the corpus. The more the terms co-occur, the easier it is to understand the topic. We employ the widely used coherence measure from Aletras and Stevenson (2013) and report the average across the top 5 and 10 words for every topic in Table 2. We observe that for the majority of the datasets, HyHTM consistently ranks highest or second highest in terms of coherence. We also observe that in some cases hLDA and TSNTM, which have very few topics (See E) compared to HyHTM, have higher coherence values. To this end, we conclude that incorporating neighborhood properties of words from hyperbolic spaces can help topic models produce topics that are comprehensible and coherent. Coherence is mathematically defined as

Coherence = (1 / n²) Σ_{i=1}^{n} Σ_{j=1}^{n} log [ P(w_i, w_j) / (P(w_i) P(w_j)) ]    (9)

where w_i and w_j are words in the topic, while P(w_i, w_j) and P(w_j) are the probabilities of co-occurrence of w_i and w_j and of occurrence of w_j in the corpus, respectively.

RQ2: Does HyHTM produce related and diverse hierarchies? To assess the relationships between higher-level parent topics and lower-level child topics, we use two metrics: (i) hierarchical coherence and (ii) hierarchical affinity.

Hierarchical Coherence: We build upon the coherence metric above to compute the coherence between parent-topic words and child-topic words.
For every parent-topic and child-topic pair, we calculate the average across the top 5 words and top 10 words and report this in Table 3. We observe that HyHTM outperforms the baselines across datasets, and we attribute this result to our parent-child re-weighting framework, which incorporates the hierarchy of higher-level topics. In most cases, hLDA and TSNTM have very low hierarchical coherence because the topics generated by these models are often too generic across levels and contain multiple words from different concepts, whereas hARTM and CluHTM have reasonable scores and are often better than these. From this observation, we conclude that adding hierarchies from hyperbolic spaces to topic models produces a hierarchy where lower-level topics are related to higher-level topics. Hierarchical coherence is defined as

HCoherence = (1 / n²) Σ_{i=1}^{n} Σ_{j=1}^{n} log [ P(w_i, w_j) / (P(w_i) P(w_j)) ]    (10)

where w_i and w_j represent words from the parent topic and child topic, while P(w_i, w_j) and P(w_j) are the probabilities of co-occurrence of w_i and w_j and of occurrence of w_j in the corpus, respectively.

Hierarchical Affinity: We employ this metric from Isonuma et al. (2020), which considers the topics at level 2 as parent topics and the topics at level 3 as candidate children, to compute (i) child affinity and (ii) non-child affinity. The respective affinities are measured by the average cosine similarity of topic-term distributions between parent & child and parent & non-child topics. When child affinity is higher than non-child affinity, it implies that (i) the topic hierarchy has a good diversity of topics, and (ii) the parents are related to their children. We present the hierarchical affinities in Figure 3.
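The co-occurrence-based coherence scores above can be sketched as follows. This is a simplified PMI estimator over document co-occurrences, not the exact implementation of Aletras and Stevenson (2013); real implementations smooth the counts and use sliding windows:

```python
import numpy as np
from itertools import combinations

def pmi_coherence(topic_words, docs):
    """Average pointwise mutual information over word pairs in a topic,
    estimated from document co-occurrence counts. `docs` is a list of sets
    of words; pairs that never co-occur are skipped (unsmoothed sketch)."""
    n_docs = len(docs)
    def p(*words):
        return sum(all(w in d for w in words) for d in docs) / n_docs
    scores = []
    for wi, wj in combinations(topic_words, 2):
        p_ij, p_i, p_j = p(wi, wj), p(wi), p(wj)
        if p_ij > 0:
            scores.append(np.log(p_ij / (p_i * p_j)))
    return float(np.mean(scores)) if scores else 0.0
```

Hierarchical coherence follows the same pattern, except that the pairs are drawn across a parent topic and a child topic rather than within one topic.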
We observe that HyHTM has the largest child affinities across all the datasets. We also observe that the difference between child and non-child affinities is larger than that for any other baseline. hLDA and TSNTM have very similar child and non-child affinities, which indicates how generic their topics are across the hierarchy. In hARTM, we observe high child affinity and negligible non-child affinity. From these observations, we conclude that HyHTM produces related and diverse topics.

[Figure 3: Analysis of hierarchical topic affinities. A higher child-affinity value indicates stronger relatedness between parent and child topics. The larger the difference between child and non-child affinities, the more diverse the topics in the hierarchy. Note that some affinities appear to be missing in the visualization due to their significantly lower magnitudes compared to the highest affinity value.]

RQ3: Does HyHTM produce topics with varying granularity across levels? We use the topic specialisation metric from Kim et al. (2012) to understand the granularity of topics in the hierarchy. Topic specialisation is the cosine distance between the term distribution of a topic and the term distribution of the whole corpus. According to the metric, the root-level topics are trained on the whole corpus, so they are very generic, while the lower-level topics are trained on a subset of documents, so they specialise. A higher specialisation value means that the topic vector is not similar to the whole-corpus vector, and hence the topic is more specialised. With increasing depth in the hierarchy, the specialisation of a topic should increase, and its distance from the corpus vector should grow, to model the reasonable topic hierarchies described above.
As the resulting topic proportions and range of topic specialisation of CluHTM and HyHTM are similar, we first focus on these models to effectively underscore the advantages of employing hyperbolic spaces. As depicted in Figure 4, unlike CluHTM, our HyHTM model consistently exhibits an increasing trend in topic specialisation across the majority of the datasets. We attribute this result to our use of hyperbolic spaces in our algorithm, which groups together documents of similar concepts from the root level itself.
Additionally, we present the topic specialisation of other models in Appendix Table 5. We find that TSNTM usually scores low, suggesting generic topics at all levels. Although hLDA shows increasing specialisation, it seemingly fails to generate related topic hierarchies, as evidenced by quantitative metrics and qualitative topics (See Section 6.2). Despite hARTM showing an increase in granularity, it often lumps unrelated concepts under a single topic hierarchy, akin to CluHTM, as illustrated in the qualitative examples (See Section 6.2).

To evaluate how our model scales with the size of the dataset, we measure the training time and memory footprint by randomly sampling different numbers of documents (5k to 125k) from the AGNews dataset. From Figure 5, we observe that as the number of documents increases, the training time of our model does not change considerably, whereas that of CluHTM increases significantly. HyHTM can be trained approximately 15 times faster than the CluHTM model, even with 125k documents. CluHTM works inefficiently by keeping the document representations of all the topics at a level in working memory, a result of developing the topic hierarchy in a breadth-first manner. We have optimized the HyHTM code to train one branch from root to leaf in a depth-first manner, which makes our model more memory- and time-efficient. hLDA took approximately 1.32 hours for training on the complete dataset, and hARTM and TSNTM took more than 6 hours.

Quality of Topics
To intuitively demonstrate the ability of our model to generate better hierarchies, we present topic hierarchies of all models for some selected 20News target labels in the Appendix in Figure 6. Across various topic categories, unlike HyHTM, other models tend to struggle with delineating specific sub-concepts, maintaining relatedness, and ensuring specialisation within their topics, which highlights HyHTM's improved comprehensibility. For the sci.space 20News label, we observe that topics from CluHTM across all the levels are related to space concepts, but it is challenging to label them as specific sub-concepts. The hARTM topics for space have a reasonable hierarchy, but they include documents of different concepts such as sci.space, sci.med, and rec.sport.baseball. For hLDA and TSNTM, the lack of relatedness and specialisation makes it difficult to identify these topics as space-themed. A similar trend can be observed for the comp.os.ms-windows.misc and sci.med 20News categories in the figure, where the models exhibit similar struggles.

Ablation
Do Hyperbolic embeddings represent documents better than Euclidean ones?
To investigate this, we consider a variant of our model called Ours (Euc), which uses pretrained FastText embeddings (Bojanowski et al., 2017) (trained in Euclidean space) instead of Poincaré embeddings in M s (w, w ′ ), keeping all other steps unchanged. From Table 4, we observe that using hyperbolic embeddings to guide parent-child relationships in A l is the better choice, as it produces topics that are more coherent and hierarchies in which lower-level topics are related to higher-level topics.
Does enforcing hierarchy between parent-child topics in equation 8 result in a better hierarchy? We examine this by comparing the Ours (Euc) variant with the CluHTM baseline. Both models use identical underlying document representations, yet they differ in how they guide their hierarchies, particularly in equation 8 of our model. As demonstrated in Table 4, Ours (Euc), which accounts for word hierarchies between higher-level and lower-level topics, generates topic hierarchies that are nearly twice as effective in terms of hierarchical topic coherence and hierarchical affinity.
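The distance behaviour that motivates this choice is the standard Poincaré-ball metric, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). A minimal illustration with made-up points (independent of any trained embeddings) shows how the same Euclidean gap grows hyperbolically near the boundary of the ball:

```python
import math

def poincare_distance(u, v):
    """Distance in the Poincaré ball:
    d(u, v) = arcosh(1 + 2*||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))."""
    sq = lambda x: sum(xi * xi for xi in x)
    diff = [ui - vi for ui, vi in zip(u, v)]
    return math.acosh(1 + 2 * sq(diff) / ((1 - sq(u)) * (1 - sq(v))))

# A Euclidean gap of 0.1 near the origin vs. the same gap near the edge:
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])   # ~0.20
near_edge = poincare_distance([0.85, 0.0], [0.95, 0.0])   # ~1.15
assert near_edge > near_origin
```

This exponential growth toward the boundary is what lets deep sub-concepts such as helicopter and solar system stay far apart instead of crowding together.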
In the Appendix (See Section B), we also examine the importance of our approach by replacing the underlying algorithm with hierarchical clustering methods.

Conclusion
In this paper, we have proposed HyHTM, which uses hyperbolic spaces to distill word hierarchies of higher-level topics in order to refine lower-level topics. Both quantitative and qualitative experiments have demonstrated the effectiveness of HyHTM in creating hierarchies in which lower-level topics are related to and more specific than higher-level topics. HyHTM is also much more efficient than our best-performing baseline. A major limitation of HyHTM is that it is parametric and therefore requires empirical analysis to find the optimal number of topics at each level. We plan to investigate this shortcoming in the future.

Limitations
In this paper, we propose a method to effectively incorporate the inherent word hierarchy in topic models for hierarchical topic mining. We use Poincaré embeddings, trained on Wikipedia, to compute the hierarchical relatedness between words. Hence, our model relies on how well these embeddings are trained and whether they effectively capture the word hierarchy. Moreover, any bias in the embeddings is translated into our model. The second major limitation of our model is that, since these embeddings are trained on Wikipedia, they may not perform well on datasets that are very different from Wikipedia, or on datasets where the relation between two words is very different from their relation in Wikipedia. For example, topic and hierarchy have a very different relation in scientific journals from the one they have in Wikipedia. Our model is a parametric HTM, and we plan to investigate methods to induce the number of topics using hyperbolic spaces.

Ethics Statement
• The dataset used to train the Poincaré embeddings is the Wikipedia corpus, a publicly available dataset standardized for research.
• We have added references for all the papers, open-source code repositories and datasets.
• In terms of dataset usage for topic modeling, we have used only publicly available datasets. We also ensure that the datasets used in our research do not perpetuate any harmful biases.
• We also plan to make our models publicly available, in order to promote transparency and collaboration in the field of natural language processing.

A Additional Results
A.1 Topic Specialisation
In Section 6 we report the topic specialisation for CluHTM and HyHTM. In this section we present the topic specialisation results for the remaining models in Table 5.

B Additional Ablation Study
Hierarchical clustering with Hyperbolic Embeddings: We replace the underlying topic-model algorithm with BERTopic (Grootendorst, 2022), which uses HDBSCAN hierarchical clustering under the hood and does not take into account the hierarchy between words in higher-level and lower-level topics. Both our model and BERTopic employ hyperbolic document embeddings as A 0 , followed by their respective approaches to generate a hierarchy of topics. As seen in Table 6, our model outperforms BERTopic on coherence and hierarchical coherence measures. While the lower-level topics in BERTopic are related to their higher-level topics, the (parent, child) topic pairs were not as unique as in our model.

Investigating the need for post-processing in HyHTM to ensure uniqueness across topic levels: BERTopic (Grootendorst, 2022) employs a class-based TF-IDF approach for topic-word representation, treating all documents in a cluster as one. Inspired by this, we examined the impact of applying a similar class-based TF-IDF to topics generated by our model as an additional post-processing step. In theory, this should ensure unique topics at each level. However, as reported in Table 6 under HyHTM c-TFIDF, we found no noticeable improvement in topic coherence or hierarchy. This affirms that HyHTM inherently organizes documents into diverse and coherent themes at every level, obviating the need for additional post-processing.
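A sketch of the class-based TF-IDF idea (one common formulation; not necessarily the exact weighting used by BERTopic or in our post-processing experiment):

```python
import math
from collections import Counter

def class_tfidf(class_docs):
    """Class-based TF-IDF sketch: all documents assigned to a topic are
    concatenated into one pseudo-document, then scored as
        score(w, c) = tf(w, c) * log(1 + A / f(w))
    where A is the average word count per class and f(w) is the frequency
    of w across all classes."""
    class_counts = [Counter(" ".join(docs).split()) for docs in class_docs]
    total = Counter()
    for c in class_counts:
        total.update(c)
    avg_words = sum(sum(c.values()) for c in class_counts) / len(class_counts)
    return [{w: tf * math.log(1 + avg_words / total[w]) for w, tf in c.items()}
            for c in class_counts]

# Toy example: the dominant in-class word gets the highest score.
scores = class_tfidf([["orbit satellite orbit", "shuttle orbit"],
                      ["windows driver", "windows install driver"]])
assert max(scores[0], key=scores[0].get) == "orbit"
```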

C.1 Preprocessing
We remove numeric tokens, punctuation, and non-ASCII codes, and convert the document tokens to lowercase. In addition to NLTK's stopwords, we also remove SMART stopwords. Next, we lemmatise each token using NLTK's WordNetLemmatizer. We filter the vocabulary by removing tokens whose ratio of total occurrence count to the number of training documents in which the token appears is less than 0.8.
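These preprocessing steps can be sketched as follows; the stopword set and the identity `lemmatize` below are illustrative stand-ins for NLTK's stopword lists and WordNetLemmatizer:

```python
import re

# Tiny illustrative stopword set; the paper uses NLTK's list plus the
# SMART stopword list.
STOPWORDS = {"the", "a", "of", "and", "to", "in"}

def lemmatize(token):
    # Placeholder for nltk.stem.WordNetLemmatizer().lemmatize
    return token

def preprocess(doc):
    """Lowercase, strip punctuation, drop numeric/non-ASCII tokens and
    stopwords, then lemmatise."""
    tokens = [re.sub(r"[^\w]", "", t) for t in doc.lower().split()]
    return [lemmatize(t) for t in tokens
            if t and t.isascii() and not t.isnumeric()
            and t not in STOPWORDS]

print(preprocess("The Space Shuttle, launched in 1981!"))
# ['space', 'shuttle', 'launched']
```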

C.2 Computing Infrastructure
The experiments were run on a machine with an NVIDIA GeForce RTX 3090 GPU with 24 GB of G6X memory. However, these experiments can also be replicated on a CPU. The CUDA version used is 11.4.

C.3 HyHTM
All experiments were performed with three runs per dataset. We use the implementation provided by Stražar et al. (2016).

C.3.1 Varying k H : Neighbourhood of a word defined in the hierarchical matrix
The term k H in equation (6) defines a neighborhood around words which helps us extract concept and sub-concept relations from hyperbolic geometry. If very large values of k H are considered, every word would be in the neighborhood of every other word; for very small values of k H , even though some very similar words will be included in the neighborhood, the overall document representation becomes very sparse, and many concept and sub-concept relations are discarded. We empirically tested k H in the range [500, 3000] and show our findings in Figure 7. We observe that when k H is 500, the hierarchical coherence, along with the other metrics, is the highest, and after that it drops.
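The neighborhood truncation controlled by k H can be sketched as follows; this illustrates only the truncation step, with a random matrix standing in for pairwise hyperbolic distances, not the full construction of the hierarchical matrix:

```python
import numpy as np

def truncate_to_neighborhood(dist, k_h):
    """Keep, for each word, only its k_h nearest words by (hyperbolic)
    distance; pairs outside the neighborhood are masked out with inf."""
    keep = np.zeros_like(dist, dtype=bool)
    for i in range(dist.shape[0]):
        nearest = np.argsort(dist[i])[:k_h]  # indices of k_h closest words
        keep[i, nearest] = True
    return np.where(keep, dist, np.inf)

# Toy 6-word vocabulary; each row ends up with exactly k_h finite entries.
rng = np.random.default_rng(0)
d = rng.random((6, 6))
masked = truncate_to_neighborhood(d, k_h=3)
assert (np.isfinite(masked).sum(axis=1) == 3).all()
```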

C.3.2 Varying α: Similarity threshold in the similarity matrix
The similarity threshold α in equation (5) is a hyperparameter that controls which pairs of words are considered similar and used to create the document representation. When the value is very high, only the most similar words are included in the term-similarity matrix, which results in a very sparse matrix and defeats the purpose of adding more context about words from pretrained embeddings. If the value is very low, words which are not very similar can be picked up by the topic model as similar words. It is also important to note that while the vocabulary of terms can be controlled depending on the corpus used for topic modeling, the embeddings are pretrained on large corpora, which can result in biases from those corpora seeping into the arrangement of words in the embedding space. We test our model with values of α ranging from 0.1 to 0.5. In Figure 8, we observe that α = 0.4 gives the maximum hierarchical coherence for 20NG, and α = 0.3 for Amazon Reviews. We similarly fine-tuned for all other datasets and report the results in Table 7.
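The thresholding itself is simple to sketch; `threshold_similarity` is a hypothetical helper with illustrative values, not our released code:

```python
import numpy as np

def threshold_similarity(sim, alpha):
    """Zero out word pairs whose similarity falls below alpha, a sketch of
    the thresholding in equation (5). Higher alpha -> sparser matrix."""
    return np.where(sim >= alpha, sim, 0.0)

# Toy 3-word term-similarity matrix (symmetric, unit diagonal).
sim = np.array([[1.0, 0.45, 0.10],
                [0.45, 1.0, 0.35],
                [0.10, 0.35, 1.0]])
sparse = threshold_similarity(sim, alpha=0.4)
# Only the diagonal and the single pair with similarity >= 0.4 survive.
assert np.count_nonzero(sparse) == 5
```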

C.4 CluHTM
We use the implementation provided by Viegas et al. (2020) for the CluHTM baseline. While this implementation does provide a method to learn the optimal number of topics, it is highly inefficient, taking O(n^3) time. The training time for this model was ≈ 32 hours on 20NG and ≈ 22 hours on AR. Additionally, the number of topics differs in every branch, which makes comparison across models difficult.

C.5 hARTM
For the hARTM baseline model, we use the BigARTM package, version 0.10.1. For this model, we cannot choose the number of subtopics explored for each parent, but we can control the total number of subtopics across all parents at a given level. In our other parametric models, since each parent has n subtopics, we obtain a total of n^l topics at level l. Thus for hARTM, we specify that the model choose n^l topics at level l, from l = 1 to a depth of l = 3.

C.6 hLDA
We use the following implementation for hLDA.

C.8 BERTopic
We use the official implementation provided by Grootendorst (2022) for BERTopic. We use the default parameters set by BERTopic for HDBSCAN clustering.

D Number of topics for parametric models
For the parametric models like hARTM, CluHTM, and our model HyHTM, we use the same number of topics at every level for a fair comparison.We explain how the topic hierarchy grows when the number of topics at each node of the tree is N = 10.
1. At the root level (level 1), we train the model on the entire corpus of documents D and set the number of topics to N = 10. As a result, we get 10 topics at the root level.
2. For every topic in the previous level, each parametric model organizes how documents are distributed across topics. For CluHTM and HyHTM, a document is assigned to the topic with which it has the maximum association, so each document is assigned exactly one topic at a given level. Once the documents are categorized, we perform NMF on these documents and produce 10 topics for every parent topic.
In this way, we obtain 10 topics at the root level, 10^2 = 100 at level 2, and 10^3 = 1000 at level 3. hARTM follows a different procedure, using regularisers for categorizing documents and exploring lower-level topics. After level 1, hARTM produces flat topics at level 2 and learns the association between every lower-level topic and the higher-level topics. We set the number of topics at level 2 to 10^2, the same as the total number of topics at level 2 for CluHTM and HyHTM, and similarly for level 3.
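The resulting topic counts can be computed directly; a trivial sketch of the N = 10 example:

```python
def topics_per_level(n, depth):
    """Number of topics at each level when every parent spawns n children,
    as in the N = 10 example: [10, 100, 1000] for depth 3."""
    return [n ** level for level in range(1, depth + 1)]

counts = topics_per_level(10, 3)
print(counts)       # [10, 100, 1000]
print(sum(counts))  # 1110 topics in the whole tree
```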

E Number of topics for Non-Parametric models
The number of topics for non-parametric models is listed in Table 8.

Figure 1 :
Figure 1: In figure (a) we see a concept tree in Euclidean spaces. Words such as space shuttle and satellite, which belong to moderately different super-concepts such as vehicles and space, respectively, are brought closer together due to their semantic similarity. This leads to a convergence of their surrounding words, such as helicopter and solar system, creating a false distance relationship and a crowding effect in Euclidean spaces. In figure (b), we see a concept tree in Hyperbolic spaces (Poincaré ball), which inherently has more space (represented by grey circles) than Euclidean spaces. The distances here grow exponentially towards the edge of the ball, and concepts at deeper levels such as helicopter and solar system move apart in these growing spaces and are far from each other. The dashed blue line shows how the distances in both spaces are calculated.

Figure 4 :
Figure 4: Comparison between Topic Specialisation of CluHTM and HyHTM for different datasets. An increasing trend from Level 1 (L1) to Level 3 (L3) indicates that topics are becoming more specific, diverging from a more generic corpus-word distribution.

Figure 5 :
Figure 5: Comparing runtime and memory footprint for HyHTM (our model) and CluHTM on the AGNews dataset.

Figure 6 :
Figure 6: Comparing topic hierarchies for 20News documents. Every topic is represented by its most probable words.

Figure 7 :
Figure 7: k H = 500 performs the best out of all the choices on Hierarchical Coherence. A similar trend is observed on other metrics as well.

Figure 8 :
Figure 8: α = 0.4 performs the best out of all the choices on Hierarchical Coherence. A similar trend is observed on other metrics as well.

Table 2 :
Comparing topic coherence, where higher coherence is better.Bold represents the best-performing metric and underline represents the second-best metric.

Table 3 :
Comparing Hierarchical Coherence.Bold represents the best-performing metric and underline represents the second-best metric.

Table 4 :
Analysis of the role of hyperbolic embeddings.

Table 5 :
Topic Specialisation for other models

Table 6 :
Ablation Study analyzing the effectiveness of our approach using the 20News dataset.

Figure 7 panels: comparing Hierarchical Coherence (y-axis) for different values of k H (x-axis) on (a) the 20NG dataset and (b) the Amazon Reviews dataset.

Table 8 :
Number of topics for non-parametric models