Seeded Hierarchical Clustering for Expert-Crafted Taxonomies

Practitioners from many disciplines (e.g., political science) use expert-crafted taxonomies to make sense of large, unlabeled corpora. In this work, we study Seeded Hierarchical Clustering (SHC): the task of automatically fitting unlabeled data to such taxonomies using only a small set of labeled examples. We propose HierSeed, a novel weakly supervised algorithm for this task that is both data and computationally efficient. HierSeed assigns documents to topics by weighing document density against topic hierarchical structure. It outperforms both unsupervised and supervised baselines for the SHC task on three real-world datasets.


Introduction
Practitioners across a diverse set of domains that include web mining, political science and social network analysis rely on machine learning techniques to understand large, unlabeled corpora (Alfonseca et al., 2012; Grimmer, 2010; Yin and Wang, 2014). In particular, they often need to fit this data to taxonomies (i.e., hierarchies) constructed by nontechnical domain experts using only a few labeled examples. In this work, we formalize this task, Seeded Hierarchical Clustering (SHC), and propose a novel algorithm, HIERSEED, for it.
Consider a researcher analyzing social media to track public feeling around a hierarchy of well-being indicators (see Figure 1). Working with such taxonomies can be challenging. Since they are hand-crafted by domain experts to explore a particular area of focus, they may be unbalanced (with subtopics that over-represent one aspect of their parent topic) or incomplete (with subtopics that are only partially enumerated). Moreover, these hierarchies may not fully explain every document in large, diverse corpora. Finally, given their domain specificity, producing many labeled examples for each topic in such taxonomies can be expensive.
SHC incorporates these challenges as constraints: given only a user-defined topic hierarchy and a few labeled examples, the task is to assign documents from a much larger corpus to the individual topics.
While many unsupervised techniques and their hierarchical extensions automatically discover latent structure within text corpora, they are difficult to integrate with user-defined taxonomies (e.g., Blei et al. (2003); Lloyd (1982); Campello et al. (2013)). Moreover, as these methods often rely on centroids, density metrics and maximum likelihood objectives to discover dataset partitions, they may produce clusters that favor the denser, semantically more coherent regions of an unbalanced taxonomy at the expense of the sparser but more diverse regions. Although supervised hierarchical methods avoid these issues, they are usually very data intensive.
To address these challenges, we propose HIERSEED, a weakly supervised hierarchical method for fitting a large unlabeled corpus to a user-defined taxonomy. It assigns documents to topics by weighing document density against a topic's local hierarchical structure. To accommodate imbalance or incompleteness, HIERSEED constructs and uses topic representations that account for subtopic density (degree of semantic coherence) and spread (degree of semantic divergence) around each topic. As it uses only a few labeled seed examples to optimize its objective in a non-parametric fashion, it is both data and computationally efficient. We evaluate HIERSEED on three real-world newswire and scientific datasets and show that it outperforms state-of-the-art unsupervised and supervised baselines on this new, difficult task.
Our contributions are: (1) we formalize the task of Seeded Hierarchical Clustering, (2) we present HIERSEED, a novel algorithm that uses only a few labeled examples to efficiently fit a large corpus to a user-defined hierarchy (even if it is unbalanced or incomplete), (3) we show it outperforms existing methods on three real-world datasets from different domains and (4) we release an implementation of HIERSEED1 for the broader research community.

Figure 1: A researcher wants to track public well-being using a large, unlabeled social media corpus. She creates a taxonomy of relevant topics (0); it does not cover every document in her dataset. Moreover, as it is hand-crafted, it is unbalanced and incomplete. She can't annotate a large number of examples. With only a few labeled seed examples for each topic (1) and her large unlabeled fitting set (2), HIERSEED efficiently identifies the documents related to every topic via an iterative discriminative algorithm that balances their density against their spread (3).

Related Work
Unsupervised methods like LDA and K-Means (Blei et al., 2003; Lloyd, 1982; MacQueen et al., 1967) are flat clustering techniques that have been successfully extended to hierarchies (Griffiths et al., 2003; Isonuma et al., 2020). While both we and Chen et al. (2005) apply K-Means iteratively, they rely on hierarchical clustering to discover the number of topics at each level. None of these methods can detect a user-defined hierarchy. There is work on taxonomy construction and expansion (Hearst, 1992; Wang and Cohen, 2007; Shen et al., 2018; Lee et al., 2022), though it cannot be used for document assignment.
Supervised hierarchical classification techniques can be categorized into flat, local and global approaches (Silla and Freitas, 2011). Flat approaches (Hayete and Bienkowska, 2005; Barbedo and Lopes, 2006) ignore the hierarchy, whereas local approaches (Koller and Sahami, 1997; Shimura et al., 2018) rely on multiple local classifiers, propagating errors down the hierarchy. Global approaches (Zhou et al., 2020; Huang et al., 2021) encode the entire hierarchy and predict all labels at once. Wang et al. (2022) generate their own text embeddings while directly embedding the hierarchy information into their encoder using contrastive learning for downstream classification. This leads to better performance and has become common for this task. Nevertheless, these approaches require a large number of labeled training examples for good performance, making them difficult to use in data-scarce scenarios.

1 Code can be found at https://github.com/anishsaha12/HierSeed
Semi-supervised hierarchical methods require much less labeled data (Mao et al., 2012; Gallagher et al., 2017; Xiao et al., 2019). JoSH (Meng et al., 2020) uses a tree and text embedding model to jointly embed a taxonomy and a corpus in a spherical space, using category names to mine words relevant to each topic. Weakly-supervised classifiers like TaxoClass (Shen et al., 2021) and HClass (Meng et al., 2019) leverage provided keywords and documents for each topic to generate a set of pseudo documents for pretraining, then self-train on unlabeled data. Despite their strengths, these methods require more labeled data and expensive fine-tuning. Moreover, they work best if the labeled dataset represents all categories and documents. However, in some domains (e.g., Figure 1) knowledge of relevant categories is incomplete.
In contrast to this prior work, seeded clustering may be preferable as it first uses a small labeled seed set to bias the search space towards a desirable region and then leverages the representative latent structure of a larger unlabeled fitting set to improve the clustering (Basu et al., 2002). We propose one such technique, HIERSEED, which takes any taxonomy (unbalanced or incomplete) and a few labeled seeds and learns a discriminative hierarchical representation of its topics.

Methodology
We propose HIERSEED, a weakly-supervised algorithm for Seeded Hierarchical Clustering that uses document embeddings and their latent structure to represent topics in the same embedding space (§3.1). We initialize the representation for each topic using its seed documents (§3.2) and then update the representation in a bottom-up manner by considering both a topic's children and the density of documents around it (§3.3). Finally, we balance the hierarchy and assign documents to each topic in a top-down manner (while also updating its representation). We repeat this iteratively until convergence (see Algorithm 1, Figure 1 (3)).

Definitions
Problem Formulation Given an unlabeled corpus D (the fitting set), a hierarchy of topics T of height N and a set of seed documents for each topic, the aim of Seeded Hierarchical Clustering is to assign documents to their relevant topics in T. Let d_i ∈ D be unlabeled fitting documents, c_i ∈ T topics, and let S_i be a set of labeled seed documents for topic c_i. The aim is to find the set of documents δ_i ⊂ D most relevant to each c_i. Here, a document may belong to multiple topics.
Note that we use C_l to denote all topics at level l. We denote the children of a topic c_i as c_j^(i) ∈ ch(c_i).

Background The Largest Empty Sphere (LES) (Schuster, 2008) of a set of points P is the largest d-dimensional hypersphere that contains no points from P but is centered within their convex hull. In HIERSEED, for each topic c_i, we calculate LES(ch(c_i)), the center of the LES of c_i's subtopics (i.e., ch(c_i)) (see Figure 2). LES(ch(c_i)) has a particularly desirable property: since it is as far as possible from all of the subtopics, but not too far from any particular one, while also lying inside the subtopic convex hull, it helps ensure a more evenly spread surrounding document density; the main topic is thus desensitized to any particularly dense subtopic cluster. Recall the well-being taxonomy from Figure 1. The safety subtree is unbalanced: 3 of its subtopics are semantically related (guns, assaults, robberies). Using the centroid to represent safety would therefore overly favor documents related to violence at the expense of documents related to car crashes (an enumerated subtopic) or workplace accidents (an unenumerated but relevant subtopic). As LES(ch(safety)) is informed by its subtopic spread, it is less sensitive to this imbalance.

Topic Initialization
We obtain document representations by passing each document through a word-embedding model. The representation of each topic c_i is initialized as the mean of the embeddings of all seed documents corresponding to that topic.
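The initialization step above can be sketched as follows. This is a minimal illustration assuming precomputed document embeddings; the function name and data layout are ours, not the paper's released code.

```python
import numpy as np

def init_topic_representations(seed_embeddings):
    """Initialize each topic as the mean of its seed document embeddings.

    seed_embeddings: dict mapping topic id -> (n_seeds, dim) array of
    precomputed document embeddings (hypothetical layout).
    """
    return {topic: embs.mean(axis=0) for topic, embs in seed_embeddings.items()}

# Toy usage with 4-dimensional embeddings.
seeds = {
    "safety":     np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]),
    "healthcare": np.array([[0.0, 0.0, 1.0, 1.0]]),
}
reps = init_topic_representations(seeds)
print(reps["safety"])  # [0.5 0.5 0.  0. ]
```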
Let the level of a topic c_i be λ(c_i). We choose a pivot level ρ in the hierarchy such that our algorithm computes representations and performs clustering only for topics at the pivot level or below (i.e., λ(c_i) ≥ ρ). This hyperparameter may be set experimentally or through domain knowledge.
The pivot is useful because user-generated hierarchies tend to become imbalanced or incomplete at lower levels. Therefore, ρ lets us choose an intermediate level such that all topics with λ(c_i) < ρ are considered to be "complete", i.e., fully represented by their children, and can be derived without seeds. For example, in the taxonomy in Figure 1, the topics above level k are well defined. However, further down the hierarchy, the topics start getting sparse or unbalanced. Thus, level k serves as a good ρ.
As a result, our proposed algorithm only uses seed documents for topics with λ(c_i) ≥ ρ, further reducing the labeled data required.

Learning Topic Representations
Generally, topics lower in the hierarchy are specific and fine-grained, with cohesive seeds. In contrast, top-level topics are coarser, with seeds that are often scattered. Thus, we need to obtain better representations at these top levels while ensuring representativeness of descendant topics.
To do so, we update each non-leaf topic representation in a bottom-up fashion as a function of itself, its children and their "spread". Let C_l be all topics at level l. Then, for each level l from the penultimate level up to ρ, for each c_i ∈ C_l, we update:

c_i ← ( w_t · c_i + w_c · centroid(ch(c_i)) + w_s · LES(ch(c_i)) ) / ( w_t + w_c + w_s )   (1)

where centroid(ch(c_i)) is the centroid of c_i's children, LES(ch(c_i)) is the center of the Largest Empty Sphere formed by these children (§3.1), and w_t, w_c, w_s are the weights of the weighted mean (see Appendix D). Since the dimension of our embedding space d is large compared to the number of child topics, computing the LES is intractable. Therefore, we propose an approximate method to estimate its center (see Appendix B).
This serves as a good updated representation of c_i. Though informed by the centroid of its subtopics, it avoids favoring topics that happen to be close to each other (i.e., denser) in an unbalanced taxonomy. It weighs finding a space that is relatively empty (the LES) against a space that is relatively dense (the centroid) (see Figure 2). This produces topic representations robust to hierarchy imbalances.
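The bottom-up update can be sketched as a weighted mean of the topic itself, its children's centroid, and the LES center. This is a sketch only: the exact form of Equation 1 is not reproduced in this excerpt, the LES center is assumed precomputed (Appendix B), and the default weights follow Appendix D (5 for the topic, 1 for the centroid, 4 for the LES on WOS/RCV1).

```python
import numpy as np

def update_topic(topic_rep, child_reps, les_center,
                 w_self=5.0, w_centroid=1.0, w_les=4.0):
    """Bottom-up topic update: a weighted mean of the topic itself,
    the centroid of its children, and the center of their Largest
    Empty Sphere (a sketch of Equation 1; weights as in Appendix D).
    """
    centroid = np.mean(child_reps, axis=0)
    total = w_self + w_centroid + w_les
    return (w_self * topic_rep + w_centroid * centroid + w_les * les_center) / total

parent = np.array([0.0, 0.0])
children = np.array([[1.0, 0.0], [0.0, 1.0]])
les = np.array([0.5, 0.5])  # assumed precomputed (see Appendix B)
print(update_topic(parent, children, les))  # [0.25 0.25]
```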

Extending Taxonomy -Other Category
A topic c_i may also be unbalanced at a particular level l if its children c_j^(i) ∈ ch(c_i) are unevenly distributed around it. Alternatively, it may just be incomplete due to partial enumeration. Since topic representation updates are propagated up the taxonomy, such a topic would result in a bad representation and not only degrade the performance at that level, but propagate the imbalance upward.
One reason for this imbalance could be "incompleteness" of T (i.e., the set of subtopics is not fully enumerated). Therefore, we introduce an "Other" topic c_other as a subtopic of c_i which accounts for its missing subtopics and balances out the hierarchy. In particular, we calculate the density of the original subtopics with respect to the main topic and then define c_other such that it pulls the density in the direction opposite to the centroid of the original subtopics, in order to obtain a more even distribution of subtopics (see Figure 3).
The magnitude of c_other (i.e., ∥c_other∥) is approximated by the magnitude of the centroid of the subtopics. Its representation is given by Equation 2 (see Appendix A for the derivation); it depends on both its sibling subtopics and its parent topic. We extend each c_i in the taxonomy with c_other^(i) in a bottom-up manner from the leaf level and stop at ρ (as we define the topics above ρ to be complete).
In our well-being taxonomy from Figure 1, we can see why expansion via the addition of the "Other" category is desirable. Both the healthcare and safety subtrees are only partially enumerated. Automatically expanding their subtopic sets avoids having to fully and painstakingly enumerate them.
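One plausible reading of the construction above can be sketched as follows. Note this is speculative: Equation 2's derivation lives in Appendix A (not reproduced here), so we only encode the two properties the text states, namely a direction opposite the subtopic centroid (relative to the parent) and a magnitude matching the centroid's.

```python
import numpy as np

def other_topic(parent_rep, child_reps):
    """Sketch of an "Other" subtopic (one plausible reading of Eq. 2):
    point away from the children's centroid, relative to the parent,
    with magnitude equal to the centroid's magnitude.
    """
    centroid = np.mean(child_reps, axis=0)
    direction = parent_rep - centroid            # opposite the children's centroid
    direction = direction / np.linalg.norm(direction)
    return direction * np.linalg.norm(centroid)  # magnitude matches the centroid's

parent = np.array([0.0, 0.0])
children = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])  # skewed to one side
print(other_topic(parent, children))  # points away from the skew
```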

Topic Threshold and Assignment
To assign documents to topics, we initially perform distance-based classification independently for each subtree at level ρ. We perform classification only at level ρ because, since we define all topics c_i with λ(c_i) < ρ as complete and balanced, there is a greater degree of confidence that the discovered documents actually belong to that topic.
Let a root topic c_r ∈ C_{ρ−1} be a topic at level ρ − 1. For each child topic c_i^(r) ∈ ch(c_r) with topic threshold τ(c_i^(r)) (see Appendix C), the assigned set is:

δ_i^(r) = { d ∈ D : dist(d, c_i^(r)) ≤ τ(c_i^(r)) }   (3)

If there are no documents within the topic threshold for a topic c_i^(r), we adapt to the document density by updating the topic's threshold to be at least equal to α ≥ 1 times the distance of the nearest document from c_i^(r), if that document is within twice the original threshold. Finally, we update the assigned sets δ_i^(r) by re-performing the assignment. We control the degree of overlap between the assigned sets using an additional hyperparameter (see Appendix C for details).
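The thresholded assignment with the density-adaptive fallback can be sketched as below. The function name is illustrative; α = 1.1 follows Appendix D.

```python
import numpy as np

def assign_with_threshold(docs, topic_rep, tau, alpha=1.1):
    """Assign documents within the topic threshold tau (a sketch of Eq. 3).
    If no document falls within tau, relax the threshold to alpha times
    the nearest document's distance, provided that distance is within
    twice the original threshold.
    """
    dists = np.linalg.norm(docs - topic_rep, axis=1)
    assigned = np.where(dists <= tau)[0]
    if len(assigned) == 0:
        nearest = dists.min()
        if nearest <= 2 * tau:
            assigned = np.where(dists <= alpha * nearest)[0]
    return assigned

docs = np.array([[1.5, 0.0], [1.6, 0.0], [5.0, 0.0]])
topic = np.array([0.0, 0.0])
# tau=1.0 matches nothing, but the nearest doc (dist 1.5) is within 2*tau,
# so the threshold relaxes to 1.1 * 1.5 = 1.65 and picks the two nearby docs.
print(assign_with_threshold(docs, topic, tau=1.0))  # [0 1]
```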

Cluster Assignment and Optimization
Given the corpus D, the updated taxonomy T′ (balanced, extended), the pivot level ρ and the assigned sets, we propose an EM-style algorithm iterating between the assignment of the document sets δ_i (E-Step) and the recomputation of the topic representations c_i (M-Step). We maximize the expectation (i.e., the likelihood of assigning a document to a topic, assuming uniform probability) by minimizing the objective:

Σ_{c_i ∈ T′} Σ_{d ∈ δ_i} dist(d, c_i)²   (4)

E-Step The topic thresholds are used to determine the assigned set for each topic at ρ (E-Step, line 6 in Algo. 1). Next, for each topic c_i ∈ C_l at level l with assigned set δ_i, we solve a K-Means formulation (by Voronoi iterations (Lloyd, 1982)) in a top-down manner from l = ρ to N. The iterations are performed over the set δ_i, with cluster centers initialized to the child topic representations c_k^(i) ∈ ch(c_i). The obtained clusters correspond to the sets of assigned documents δ_k (E-top down, line 12). The process continues top-down for all sibling and successor topics until the leaves of T′.

M-Step
As the cluster centers are updated (in E-top down), we set topic representations to the corresponding cluster centroids (M-top down, line 9 in Algo. 1). Now, as topic representations are a function of themselves and their children, we compute bottom-up updates of topics (parents, up to level ρ) as discussed in §3.3 (M-bottom up, line 3). Finally, as each topic has only one parent in the taxonomy, we complete the taxonomy by directly deriving the topics above the pivot level ρ for each document set from their pivot-level assignments.
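The per-subtree E/M steps above amount to Lloyd iterations over a topic's assigned documents, with centers seeded by (and updating) its child representations. A minimal sketch, with illustrative names; the full algorithm additionally balances the hierarchy and propagates bottom-up updates.

```python
import numpy as np

def split_among_children(docs, child_reps, n_iter=10):
    """One top-down E/M step (a sketch of the loop in Algorithm 1):
    K-Means over a topic's assigned documents, with cluster centers
    initialized at the child topic representations.
    """
    centers = np.array(child_reps, dtype=float)
    for _ in range(n_iter):
        # E: each document joins its nearest child topic.
        d = np.linalg.norm(docs[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # M: each child representation becomes its cluster centroid.
        for k in range(len(centers)):
            members = docs[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return labels, centers

docs = np.array([[0.0, 0.1], [0.0, -0.1], [3.0, 0.1], [3.0, -0.1]])
kids = np.array([[0.5, 0.0], [2.5, 0.0]])
labels, centers = split_among_children(docs, kids)
print(labels)  # [0 0 1 1]
```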
Inference At inference time, we use our learned topic representations to assign documents to each topic. We use Eq. 3 to obtain assigned sets δ_i using the learned topic threshold τ(c_i) for each topic c_i at level ρ. Documents not within any pivot topic threshold are assigned to a None category. Then, each δ_i is split up among its children c_j^(i) ∈ ch(c_i) by assigning each document to its closest topic c_j^(i) to obtain the sets δ_j^(i). This process is repeated in a top-down manner to the leaf topics.

Complexity Each topic's representation is updated using Eq. 1, which relies on its children and their LES, with a complexity of O(|δ|B³) where δ is the set of documents assigned to it and B is the hierarchy's maximal branching factor (see Appendix B). Each topic is then extended using Eq. 2 with complexity O(B), and its topic threshold and assigned set are obtained with Eq. 3 with complexity O(B + D), where D is the size of the unlabeled corpus. The objective Eq. 4 identifies a topic's cluster in O(D). We do these at most n times, once for each of our n topics, until convergence. In practice we found that convergence is achieved within 4 iterations. Overall, HIERSEED scales linearly with taxonomy size n and corpus size D.
Experiment Details

Datasets
We use three publicly available datasets for evaluation: RCV1-V2 (Lewis et al., 2004), NYTimes (NYT) (Sandhaus, 2008) and Web-of-Science (WOS) (Kowsari et al., 2017). RCV1-V2 and NYT are news categorization corpora, while WOS includes categorization of published scientific paper abstracts. All documents in WOS belong to a single leaf topic, while documents in NYT and RCV1 may belong to multiple leaf and non-leaf topics. Data statistics are shown in Table 1.
We use the benchmark train/test split for RCV1; for NYT and WOS we randomly split the data. For each dataset, the training set is also split into the seed (S) and fitting (D) sets by randomly sampling a fixed number of documents |S_i| per topic c_i as seeds. We keep the labels only for the much smaller seed sets and discard them for the fitting sets.

Metrics
We evaluate our algorithm using B³ (Bagga and Baldwin, 1998) and V-Measure (Rosenberg and Hirschberg, 2007). B³ is a cluster evaluation metric that measures the precision and recall of a topic distribution. V-Measure is a conditional-entropy-based metric that measures cluster homogeneity and completeness. For both metrics, we average across all levels, weighted equally. For a fair comparison, we report the same metrics for both our method and the baselines (instead of classification F1).
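For readers unfamiliar with B³, a minimal flat, single-label implementation is sketched below (the paper's setting additionally handles multi-label assignments and averages across hierarchy levels, which this sketch omits).

```python
def b_cubed(pred, gold):
    """Minimal flat, single-label B-cubed (Bagga and Baldwin, 1998).
    pred, gold: lists of cluster / class labels, one per document.
    Returns per-document-averaged (precision, recall, f1).
    """
    n = len(pred)
    p = r = 0.0
    for i in range(n):
        # Precision: fraction of i's predicted cluster sharing i's gold class.
        same_pred = [j for j in range(n) if pred[j] == pred[i]]
        p += sum(1 for j in same_pred if gold[j] == gold[i]) / len(same_pred)
        # Recall: fraction of i's gold class sharing i's predicted cluster.
        same_gold = [j for j in range(n) if gold[j] == gold[i]]
        r += sum(1 for j in same_gold if pred[j] == pred[i]) / len(same_gold)
    p, r = p / n, r / n
    return p, r, 2 * p * r / (p + r)

print(b_cubed([0, 0, 1, 1], [0, 0, 0, 1]))  # (0.75, 0.666..., 0.705...)
```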

Hyperparameters and Baselines
We use a pretrained RoBERTa-base (Liu et al., 2019) model to obtain a 768-dimensional embedding for each document by taking the mean across the final hidden states of all tokens. Other hyperparameters are listed in Appendix D.
We compare HIERSEED to hierarchical classification, clustering and topic modeling baselines. As a baseline, we also compare it to HIERSEED trained without the unlabeled fitting data (Unfit-HIERSEED). That is, Unfit-HIERSEED uses just the seed set for initial topic representations, followed by lines 3-5 of Algo. 1 to update them and lines 6-8 to assign documents. We use only the labeled seed set (S) for baselines requiring seeds or supervision, and the unlabeled fitting set (D) for unsupervised baselines. We evaluate each model 5 times and report the averages.
For weakly-supervised and unsupervised baselines we use: hLDA (Griffiths et al., 2003), an unsupervised non-parametric hierarchical topic model, and TSNTM (Isonuma et al., 2020), an unsupervised generative neural topic model, both trained on the unlabeled fitting set; HClass (Meng et al., 2019), a hierarchical classification model that uses keywords from the seed set for pretraining and the unlabeled fitting set for self-training; and JoSH (Meng et al., 2020), a generative hierarchical topic mining model that uses the taxonomy for supervision, trained on the unlabeled fitting set. JoSH is the only seeded hierarchical method used for comparison.
We additionally compare to a number of supervised approaches, including HDLTex.

Table 3: Results of training the classification baselines with (▲) the full training set with their labels, compared to HIERSEED trained using only (▼) the labeled seed and unlabeled fitting sets, using 4 seeds per topic. Table 1 mentions the sizes of these sets for each dataset.

Main Results
Results are shown in Table 2. Our method HIERSEED, trained with labeled seeds and unlabeled fitting sets, outperforms all baselines on the SHC task when restricted to the same training data. Its best score is on the WOS corpus, which we hypothesize is due to its simpler taxonomy compared to NYT and RCV1. Additionally, even without fitting on the unlabeled data, our method (Unfit-HIERSEED) demonstrates strong performance, outperforming most baselines. In particular, Unfit-HIERSEED (which does not use fitting data) is only outperformed by the two baselines that do use the fitting data (HClass and JoSH), and marginally by HFT(M) on RCV1. In fact, HIERSEED with fitting data outperforms these methods by a large margin across all corpora. This shows the effectiveness of using a small labeled seed set to fit to a taxonomy.
Although Unfit-HIERSEED outperforms most baselines, there is still a large performance drop compared to HIERSEED (with fitting on unlabeled data). Since Unfit-HIERSEED does not use the unlabeled data (it stops the training after the E-Step, line 6 of Algo. 1, of the first iteration), it does not estimate the LES or update the topic thresholds. Therefore, it may inadvertently kill a branch of the hierarchy and so is limited in its ability to fit to the data. The performance drop shows the importance of the LES in computing better topic representations and fitting to the provided taxonomy.
Comparing baselines, we see that the unsupervised methods (hLDA and TSNTM) perform poorly compared to the (weakly-)supervised classification methods. Although unsupervised approaches are good at discovering latent hierarchies, they are not capable of generating topics that match a predefined structure. Additionally, most supervised classification methods still perform poorly compared to HIERSEED, since these models do not make use of unlabeled data and have only a small set of seed examples for supervision.
We also examine the effect of taxonomy depth on performance. Figure 4 shows B³ F1 at each level. Although performance degrades at deeper levels of the hierarchy, our method is consistently better than the others at these levels, highlighting our system's ability to learn better topic representations.
We note that although supervised classification baselines trained on the entire labeled training set outperform our method (Table 3), we achieve competitive results using a substantially smaller labeled set. The strength of our method is in the weakly-supervised nature of the training procedure; it is therefore better suited to real-world data-scarce settings than fully supervised approaches. Thus, we see the advantages of weakly-supervised approaches, and especially HIERSEED, which can both adhere to a predefined structure (i.e., a labeled taxonomy) and make good use of the much easier to obtain unlabeled fitting set.

System Analysis
We conduct analysis on the number of seeds per topic and the document representation method.

Number of Seed Examples
We experiment with using different numbers of seeds for each topic in the taxonomy (see Table 4). We see a general increasing trend in performance with an increasing number of seeds, as the topics become more representative. However, the trend approaches saturation when going from 6 to 8 seeds, showcasing how little annotated data is required to be effective.

Extending Taxonomy with Other Category
Here we evaluate the effectiveness of introducing an "Other" category to balance out the "incompleteness" of the hierarchy (see Table 5). To do so, we drop a few topics from the benchmark datasets, observe the changes in the evaluation metrics, and compare them with the improvement expected from the introduction of the "Other" topics in the modified taxonomy. We drop both leaf and pivot-level topics, in two separate experiments, and compare the performance of HIERSEED with and without extending the taxonomy. The performance drops on deleting topics, and more so when they are dropped from the pivot level (as an entire subtree is deleted). However, the performance improves drastically on extending the taxonomy with our computed "Other" category in both experiments, showing its ability to balance out incomplete hierarchies.

Table 6: Performance of HIERSEED using different embeddings, trained using the labeled seed and unlabeled fitting sets, using 4 seeds per topic.

Embeddings for Document Representations
To test HIERSEED's dependence on the nature (contextual vs. static) and quality of embeddings, we switch out the document embeddings. In Table 6, we compare the performance of HIERSEED using RoBERTa (used in all other experiments), 300-dimensional GloVe (Pennington et al., 2014), and fastText skip-gram (Joulin et al., 2016), while using 4 seeds per topic. The document embeddings are obtained by taking the mean over all tokens. We see that HIERSEED performs comparably well with each embedding, with GloVe performing better on RCV1 and fastText on NYT. However, the performance differences are small, and consistently better than the baselines, showing that HIERSEED is embedding-agnostic and is able to identify a good representation for the taxonomy regardless.

Error Analysis
An analysis of HIERSEED's hierarchical assignments highlights some important shortcomings and modes of failure. First, mistakes are more likely at the pivot level than at subsequent levels. This is intuitive, since taxonomies get more specific (i.e., easier to fit to) down the hierarchy and HIERSEED assigns documents top-down starting from the pivot level.
In addition, a small set of seeds for a topic may not cover all subtopics, especially if there are many semantically diverse subtopics (e.g., 'Vaccines', 'Enzymes', and 'Cancer' are diverse subtopics of 'Molecular Biology'). Furthermore, if topics are semantically similar (e.g., 'consumer finance' vs. 'government finance'), then the seed documents (and topic representations) may also be similar, making it difficult to distinguish between the topics. Additionally, errors come from a lack of domain-specific embeddings or informative document representations. For example, corpus-specific artifacts such as jargon, equations, numeric data (e.g., in WOS) and tables and figures (e.g., in NYT) can lead to uninformative document embeddings that result in incorrect topic assignment.
Finally, our method assumes an incomplete taxonomy (i.e., it always adds an Other category) and therefore cannot distinguish between None and Other below the pivot level. For example, an author biography α from NYT is assigned as "Feature (level 1, the pivot level) - Books (level 2) - Other (level 3)" instead of "Feature - Books - None" (in a taxonomy consisting of just book genres). This is because, once α is assigned to the topic "Feature", it can no longer be assigned None at the following levels. However, in general we find our assumption of incompleteness is valid.

Conclusion and Future Work
In this paper, we formalize the task of Seeded Hierarchical Clustering: fitting a large, unlabeled corpus to a user-defined taxonomy (that may be unbalanced or incomplete) using only a small number of labeled examples.We propose a novel, discriminative weakly supervised algorithm, HIERSEED, for it which outperforms both unsupervised and (weakly) supervised state-of-the-art techniques on three real-world datasets from different domains.
In the future, we aim to jointly learn and fine-tune task-specific embeddings, develop a generative variant of HIERSEED, and explore non-Euclidean representation spaces.

Limitations
While HIERSEED is non-parametric and outperforms both unsupervised and supervised baselines for the SHC task on the three real-world datasets evaluated here, it relies on a few choices and assumptions.
First, the algorithm requires selecting a pivot level below which all computations are performed.The pivot level is determined by identifying a level above which the taxonomy is well defined.For complex or incomplete taxonomies this can be hard to recognize, making the root (or a higher level) the easier choice.However, since most clustering mistakes occur at the pivot level, having a high pivot level will decrease recall at the following levels (due to error propagation) while a low pivot level will decrease precision at preceding levels as the lower levels tend to be more specific.
Another limitation is that since HIERSEED depends on external document embeddings (used to compute initial topic representations from the seeds), the clusters are sensitive to both the informativeness of the topic seeds and the richness of the embeddings.Additionally, computing representations of topics having a diverse set of subtopics is intrinsically more difficult than computing representations of topics and subtopics in a specific domain (owing to their semantic closeness).The top-down nature of topic assignment makes it crucial to start off with informative document representations to avoid error propagation.However, semantically close topics also pose a challenge as their representations become difficult to distinguish, leading to errors in topic assignment.
The form of Expectation Maximization used by HIERSEED assumes that all clusters are similarly sized and have the same variance. In practice, this may not always be the case. The use of Euclidean distance as the similarity metric, and variance as a measure of cluster scatter (as in K-Means), limits usability in the more general non-Euclidean cases. Additionally, HIERSEED may not be suitable for identifying clusters with non-convex shapes at each level of the hierarchy, as it relies on K-Means, which cannot separate non-convex clusters. However, the overall identified hierarchy may be non-convex, as it is a union of multiple convex sets.

B Approximating the Largest Empty Sphere

Ideally, the subtopics P and the assigned subset of documents δ_i for topic c_i would be distributed such that a distance-based formulation such as K-Means would converge with centers equal to P, and the decision boundaries would construct the Voronoi diagram. We know that the documents d_k ∈ δ_i represent a subset of the set of infinite spatial points Φ_i around c_i, i.e., δ_i ⊂ Φ_i. Also, the set of Voronoi vertices v_k ∈ V_i for topic c_i satisfies V_i ⊂ Φ_i, and is bounded by δ_i.
The duality of Voronoi diagrams and Delaunay triangulation states that Voronoi vertices are the circumcenters of Delaunay triangles, where the vertices of the triangles are from P. That is, for every v_k ∈ V_i there are three points p_1, p_2, p_3 ∈ P such that ∥v_k − p_1∥ = ∥v_k − p_2∥ = ∥v_k − p_3∥. One can iterate through all points in Φ_i to find all v_k that satisfy this equality. However, since Φ_i is infinite, we approximate V_i from δ_i instead (see Figure 6).
The approximate set of Voronoi vertices is the set of documents d ∈ δ_i such that, for some combination of three points p_1, p_2, p_3 ∈ P,

∥d − p_1∥ ≈ ∥d − p_2∥ ≈ ∥d − p_3∥   (9)

This approximation is good if the set δ_i is dense. Additionally, the error threshold for the approximate equality in Equation 9 may be determined based on the statistics (mean and standard deviation) of all errors. Finally, LES(ch(c_i)) is the center of the sphere with the largest radius (Figure 6c).
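The approximation above can be sketched as follows. This is a simplified illustration with illustrative names: we keep documents roughly equidistant from some triple of subtopics (approximate Voronoi vertices) and return the one farthest from its nearest subtopic; the convex-hull restriction and the error-threshold statistics of Equation 9 are omitted for brevity.

```python
import numpy as np
from itertools import combinations

def approx_les_center(candidates, child_reps, tol=0.2):
    """Approximate the Largest-Empty-Sphere center (Appendix B sketch):
    candidate documents roughly equidistant from some triple of child
    topics act as approximate Voronoi vertices; the one farthest from
    its nearest child defines the largest empty sphere.
    """
    best, best_radius = None, -1.0
    for x in candidates:
        d = np.linalg.norm(child_reps - x, axis=1)
        for i, j, k in combinations(range(len(child_reps)), 3):
            if abs(d[i] - d[j]) < tol and abs(d[j] - d[k]) < tol:
                radius = d.min()  # empty-sphere radius at this vertex
                if radius > best_radius:
                    best, best_radius = x, radius
                break
    return best

children = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
docs = np.array([[0.0, 0.0], [0.9, 0.1], [0.05, -0.02]])
print(approx_les_center(docs, children))  # [0. 0.]
```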

C Topic Threshold and Assigned Set Overlap
Topic Threshold - For each topic c_i, if ch(c_i) ≠ ∅, the topic threshold τ(c_i) is given by Equation 10 (twice the distance of the furthest child); otherwise, it is given by Equation 11 (the distance of the nearest sibling).
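A minimal sketch of the two threshold rules, assuming topics are represented as Euclidean vectors (the function and argument names are hypothetical):

```python
import numpy as np

def topic_threshold(c_i, children, siblings):
    """Topic threshold tau(c_i):
    - if c_i has children: twice the distance to its furthest child (Eq. 10),
    - otherwise: the distance to its nearest sibling (Eq. 11)."""
    c_i = np.asarray(c_i, dtype=float)
    if len(children) > 0:
        dists = [np.linalg.norm(c_i - np.asarray(ch, dtype=float))
                 for ch in children]
        return 2.0 * max(dists)
    dists = [np.linalg.norm(c_i - np.asarray(s, dtype=float))
             for s in siblings]
    return min(dists)
```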
Eccentricity - There may be overlap between the assigned sets δ_i of different topics c_i. To control the degree of overlap between these sets, we introduce a parameter called eccentricity, e_i ∈ [0, 1], for each topic c_i (see Figure 7).
For a topic c^(r)_i ∈ ch(c_r) (where c_r ∈ C_{ρ−1}) and each sibling topic c^(r)_j, the most common setting for e_i is given by Equation 12. This is done for all topics c^(r)_i ∈ ch(c_r) to obtain the final assigned sets δ_i.

D Hyperparameter Settings
The topic representations ( §3.3) are computed by choosing a pivot level ρ = 2 for all three datasets.
The weights for the weighted mean (Equation 1) are set to 1 for the centroid term centroid(ch(c_i)), 5 for the main topic c_i, and, for the LES(ch(c_i)) term, 4 for WOS and RCV1 and 1 for the NYT dataset.
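Assuming Equation 1 is a convex combination of the three terms (the exact normalization is not restated here, so this is only a sketch), the update with the WOS/RCV1 weights could look like:

```python
import numpy as np

def weighted_mean_update(c_i, children, les_center,
                         w_topic=5.0, w_centroid=1.0, w_les=4.0):
    """Recompute the representation of topic c_i as a weighted mean of
    (a) the topic itself, (b) the centroid of its children, and
    (c) the center of the largest empty sphere LES(ch(c_i)).
    Default weights follow the WOS/RCV1 setting reported here."""
    centroid = np.mean(np.asarray(children, dtype=float), axis=0)
    num = (w_topic * np.asarray(c_i, dtype=float)
           + w_centroid * centroid
           + w_les * np.asarray(les_center, dtype=float))
    return num / (w_topic + w_centroid + w_les)
```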
The value of α is set to 1.1 when recomputing the topic thresholds in case the originally discovered assigned sets are empty.
For each topic pair c_i, c_j ∈ T, we set the eccentricity as in Equation 12. Finally, we allow the algorithm to extend the taxonomy at all levels with the "Other" category, for all three datasets.
We also report performance without fitting on the unlabeled fitting set, i.e., training on the seed set alone. In Table 2, we set c = 4 for all three datasets and mitigate sampling randomness by repeating the seed-sampling process 5 times and reporting the average metrics.

E Baselines
We compare our method with multiple baselines spanning unsupervised and seed-guided hierarchical topic models, unsupervised and seed-guided text embedding models, and weakly-supervised and supervised hierarchical text classification models.
• hLDA (Griffiths et al., 2003): a non-parametric hierarchical topic model based on the nested Chinese restaurant process with collapsed Gibbs sampling, which assumes documents are generated from the word distributions along a path of topics.
Since it is unsupervised, we use the training set (seed + fitting set) without any labels and obtain the dynamic topic clusters.
• TSNTM (Isonuma et al., 2020): a generative neural topic model that uses VAE inference to detect topic hierarchies in documents. Being unsupervised, it is treated as a clustering method that determines the hierarchical clusters dynamically by fitting on the unlabeled training set.
• JoSH (Meng et al., 2020): a weakly-supervised generative hierarchical topic mining model that uses a joint tree and text embedding method to simultaneously model the category tree structure and the corpus generative process in spherical space. We use it as a text classifier, trained on the unlabeled training set with the topic taxonomy as supervision.
• WeSHClass (Meng et al., 2019): a weakly-supervised hierarchical classification model that leverages the provided keywords of each topic to generate pseudo documents for pretraining and then self-trains on unlabeled data, using Word2Vec embeddings. We use the keywords from the seed set for pretraining and the unlabeled fitting set for self-training.
• HDLTex (Kowsari et al., 2017): a supervised method that combines multiple deep learning approaches in a top-down manner to produce hierarchical classifications, creating a specialized architecture for each level of the hierarchy. Of the variants presented, we use the RNN-RNN combination and train the model on the labeled seed set.
• HiAGM (Zhou et al., 2020): an end-to-end hierarchy-aware global model that learns hierarchy-aware label and structure embeddings, formulated as a directed graph, which are then fused with text features to produce hierarchical text classifications. We use the HiAGM-TP variant with the GCN structure encoder and train on the labeled seeds.
• HiLAP-RL (Huang et al., 2021): a top-down reinforcement-learning approach to hierarchical classification, in which label assignment is modeled as a Markov decision process and the model learns a label assignment policy. We use HiLAP with bow-CNN as the policy model.
• HFT (Shimura et al., 2018): a hierarchical CNN fine-tuning approach to text classification in which the model learns a classifier for the upper-level labels and uses transfer learning for the lower levels, directly exploiting the parent/child dependency between adjacent levels. We train the HFT-CNN model with the recommended scoring function (MSF) on the seed set.

Figure 2 :
Figure 2: The topic c_i, its children c^(i)_j ∈ ch(c_i), their centroid centroid(ch(c_i)), the center of LES(ch(c_i)), and their weighted mean WM.
Figure 3: (a) Extending the taxonomy by expanding the children of c_i with the "Other" topic c^(i)_other; η_i is the centroid of its sub-topic unit vectors. (b, c) Topic threshold τ(c^(r)_i) for topic c^(r)_i, equal to (b) twice the distance of its furthest child when ch(c^(r)_i) ≠ ∅ and (c) the distance of the nearest sibling, where c^(r)_i, c^(r)_j, c^(r)_k ∈ ch(c_r), when ch(c^(r)_i) = ∅.
HDLTex (Kowsari et al., 2017) - a hierarchical classification model trained with the labeled seed set, HiAGM (Zhou et al., 2020) - a hierarchical text classification model trained with the labeled seeds, HiLAP-RL (Huang et al., 2021) - a hierarchical classification technique trained with reinforcement learning, and HFT (Shimura et al., 2018) - a hierarchical CNN-based text classifier trained on the seed set. Further details about the baselines can be found in Appendix E.

Figure 4 :
Figure 4: Performance (B^3 F1) at each level of the hierarchy for all three evaluation datasets, for the best-performing model from each category.

Figure 6 :
Figure 6: Approximating the set of Voronoi vertices V_i from the set of sub-topic points P = ch(c_i) and the data points d_k ∈ δ_i for a topic c_i. (a) Data points colored orange and P colored blue. (b) Approximate Voronoi vertices v_k ∈ V_i marked with a black ×; V_i ⊂ δ_i. (c) Circumcircles (spheres) drawn with v_k as center and p ∈ P on the circumference.
Since there are O(|P|^3) such combinations and we iterate through δ_i, the time complexity of this approximate algorithm is O(|P|^3 |δ_i|) for each topic.

Figure 7 :
Figure 7: Three eccentricity settings e_i for the topic c_i w.r.t. c_j to resolve assigned-set overlap.
Algorithm 1: HIERSEED
Input: A corpus D; seeds S and taxonomy T of height N; pivot level ρ.
Output: Learned topic representations for all c_i ∈ T; set of relevant documents δ_i ⊂ D for each c_i.
1 Initialize each topic c_i ∈ T with the mean of its seed documents' (S_i) embeddings
2 while Equation 4 is minimized do
3   M-step, bottom-up, l from N − 1 to ρ:
4     update c_i ∈ C_l with Eq. 1
5   Extend taxonomy: ∀c_i, add "Other" topic to ch(c_i) (Eq. 2)
6   E-step at level ρ, for each c_i ∈ C_ρ:
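The outer loop of Algorithm 1 can be sketched as follows. This is a deliberately simplified stand-in, not the full method: the Equation 1 update is replaced by a plain child-centroid update and Equation 4 by a K-Means-style objective, and only the pivot-level E-step is shown.

```python
import numpy as np

def hierseed_sketch(docs, children, pivot_topics, seeds,
                    max_iters=20, tol=1e-4):
    """Schematic EM loop of Algorithm 1 (simplified stand-in).
    docs: (n, d) document embeddings.
    children: dict topic -> list of child topic ids (leaves map to []).
    pivot_topics: topic ids at the pivot level rho.
    seeds: dict topic -> list of labeled seed embeddings."""
    # line 1: initialize every topic with the mean of its seed embeddings
    reps = {t: np.mean(np.asarray(s, dtype=float), axis=0)
            for t, s in seeds.items()}
    prev_obj = np.inf
    for _ in range(max_iters):
        # lines 3-4: M-step -- here a plain centroid of child representations
        for t in pivot_topics:
            if children.get(t):
                reps[t] = np.mean([reps[c] for c in children[t]], axis=0)
        # line 6: E-step at the pivot level -- nearest topic per document
        centers = np.stack([reps[t] for t in pivot_topics])
        d2 = ((docs[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d2.argmin(1)
        obj = d2.min(1).sum()          # stand-in for the Eq. 4 objective
        if prev_obj - obj < tol:       # line 2: stop once minimized
            break
        prev_obj = obj
    return reps, assign
```

The taxonomy-extension step (line 5, Eq. 2) and the top-down re-assignment below the pivot level are omitted here for brevity.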

Table 1 :
Dataset statistics. |T| is the number of topics in the taxonomy. The training and test sizes are the main data splits. The seed S (labeled) and fitting D (unlabeled) sets are subsets of the training set.

Table 4 :
HIERSEED with different numbers of seeds per topic in the taxonomy.The models are trained on their respective seeds and fitted on the fitting sets.

Table 5 :
Performance of HIERSEED with (Ext) and without (NoExt) extending the taxonomy with the "Other" topic. Leaf: all leaf topics dropped for one penultimate parent topic. Pivot: one topic dropped at the pivot level (entire subtree deleted).