Topic Taxonomy Expansion via Hierarchy-Aware Topic Phrase Generation

Topic taxonomies display the hierarchical topic structure of a text corpus and provide topical knowledge to enhance various NLP applications. To dynamically incorporate new topic information, several recent studies have tried to expand (or complete) a topic taxonomy by inserting emerging topics identified in a set of new documents. However, existing methods focus only on frequent terms in documents and the local topic-subtopic relations in a taxonomy, which leads to limited topic term coverage and fails to model the global topic hierarchy. In this work, we propose a novel framework for topic taxonomy expansion, named TopicExpan, which directly generates topic-related terms belonging to new topics. Specifically, TopicExpan leverages the hierarchical relation structure surrounding a new topic and the textual content of an input document for topic term generation. This approach encourages newly-inserted topics to cover important but less frequent terms as well as to maintain relation consistency within the taxonomy. Experimental results on two real-world text corpora show that TopicExpan significantly outperforms other baseline methods in terms of the quality of output taxonomies.


Introduction
A topic taxonomy is a tree-structured representation of the hierarchical relationships among multiple topics found in a text corpus (Zhang et al., 2018; Shang et al., 2020; Meng et al., 2020). Each topic node is defined by a set of semantically coherent terms related to a specific topic (i.e., a topic term cluster), and each edge implies the "general-specific" relation between two topics (i.e., topic-subtopic). With the knowledge of hierarchical topic structures, topic taxonomies have been successfully utilized in many text mining applications, such as text summarization (Petinot et al., 2011; Bairi et al., 2015) and categorization (Meng et al., 2019; Shen et al., 2021). Recently, automated expansion (or completion) of an existing topic taxonomy has been studied (Huang et al., 2020; Lee et al., 2022), which helps people incrementally manage the topic knowledge within fast-growing document collections. This task has two technical challenges: (1) identifying new topics by collecting topic-related terms that have novel semantics, and (2) inserting the new topics at the right position in the hierarchy. In Figure 1, for example, a new topic node painter, consisting of the topic-related terms [baroque painter, realist painter, portraitist, ...], is inserted at the child position (i.e., as a subtopic) of the existing topic node artist, without breaking the consistency of topic relations with the neighbor nodes.
The existing methods for topic taxonomy expansion, however, suffer from two major limitations: (1) Limited term coverage - They identify new topics from a set of candidate terms, relying on entity extraction tools (Zeng et al., 2020) or phrase mining techniques (Liu et al., 2015; Shang et al., 2018; Gu et al., 2021) to obtain high-frequency candidate terms from a corpus. Such extraction techniques miss many topic-related terms that have low frequency, and thus lead to an incomplete set of candidate terms (Zeng et al., 2021). (2) Inconsistent topic relations - As they insert new topics by considering only the first-order relation between two topics (i.e., a topic and its subtopic), the newly-inserted topics are likely to have inconsistent relations with other existing topics. An expansion strategy based on the first-order topic relation is inadequate for capturing the holistic structural information of the existing topic taxonomy.
As a solution to both challenges, we present TopicExpan, a new framework that expands the topic taxonomy via hierarchy-aware topic term generation. The key idea is to directly generate topic-related terms from documents by taking the topic hierarchy into consideration. From the perspective of term coverage, this generation-based approach can identify more multi-word terms even if they have low frequency in the given corpus (Zeng et al., 2021), compared to the extraction-based approach that works only on extracted candidate terms that frequently appear in the corpus. To combat the challenge of relation inconsistency, we utilize graph neural networks (GNNs) to encode the relation structure surrounding each topic (Kipf and Welling, 2017; Shen et al., 2021) and generate topic-related terms conditioned on these relation structure encodings. This allows us to accurately capture the hierarchical structure beyond the first-order relation between two topics.
To be specific, TopicExpan consists of a training step and an expansion step. The training step optimizes a neural model that topic-conditionally generates a term from an input document. Technically, for topic-conditional term generation, the model utilizes the relation structure of a topic node as well as the textual content of an input document. The expansion step discovers novel topics and inserts them into the topic taxonomy. To this end, TopicExpan places a virtual topic node underneath each existing topic node, and then generates topic terms conditioned on the virtual topic by using the trained model. In the end, it performs clustering on the generated terms to identify multiple novel topics, which are inserted at the position of the virtual topic node.
Contributions. The main contributions of this paper can be summarized as follows: (1) We propose a novel framework for topic taxonomy expansion, which tackles the challenges of topic term coverage and topic relation consistency via hierarchy-aware topic term generation. (2) We present a neural model that topic-conditionally generates a topic-related term from an input document by capturing the hierarchical relation structure surrounding each topic with GNNs. (3) Our comprehensive evaluation on two real-world datasets demonstrates that the output taxonomies of TopicExpan show better relation consistency as well as term coverage, compared to those of other baseline methods.

Related Work
Topic Taxonomy Construction. To build a topic taxonomy of a given corpus from scratch, state-of-the-art methods have focused on finding discriminative term clusters in a hierarchical manner (Zhang et al., 2018; Meng et al., 2020; Shang et al., 2020). Several recent studies have started to enrich and expand an existing topic taxonomy by discovering novel topics from a corpus and inserting them into the taxonomy (Huang et al., 2020; Lee et al., 2022). They leverage the initial topic taxonomy as supervision for learning the hierarchical relations among topics. To be specific, they discover new subtopics that should be inserted as children of each topic, by using a relation classifier trained on (parent, child) topic pairs (Huang et al., 2020) or by performing novel subtopic clustering (Lee et al., 2022). However, all these methods rely on candidate terms extracted from a corpus and consider only the first-order relation between two topics, which degrades the term coverage and relation consistency of the output topic taxonomies.

GNN-based Taxonomy Expansion. Recently, there have been several attempts to employ GNNs for expanding a given entity taxonomy (Mao et al., 2020; Shen et al., 2020; Zeng et al., 2021). Their goal is to figure out the correct position where a new entity should be inserted, by capturing the structural information of the taxonomy with GNNs. They mainly focus on entity taxonomies that represent hierarchical semantic relations among fine-grained entities (or terms), requiring plenty of nodes and edges in the given taxonomy to effectively learn the inter-entity relations. In contrast, a topic taxonomy represents coarse-grained topics (or high-level concepts) that encode discriminative term meanings as well as term co-occurrences in documents (Figure 1), which allows each node to correspond to a topic class of documents. That is, it is not straightforward to apply such methods to a topic taxonomy with far fewer nodes and edges, and thus how to enrich a topic taxonomy
with GNNs remains an important research question.

Keyphrase Generation. The task of keyphrase prediction aims to find condensed terms that concisely summarize the primary information of an input document (Liu et al., 2020). The state-of-the-art approach to this problem models it as a text generation task, which sequentially generates the word tokens of a keyphrase (Meng et al., 2017; Zhou et al., 2021). These methods adopt neural architectures as text encoders and decoders, such as an RNN/GRU (Meng et al., 2017; Wang et al., 2019) or a transformer (Zhou et al., 2021). Furthermore, several methods have incorporated a neural topic model into the generation process (Wang et al., 2019; Zhou et al., 2021) to fully utilize topic information extracted in an unsupervised way. Despite their effectiveness, none of them has focused on topic-conditional generation of keyphrases from a document, or on hierarchical modeling of topic relations.

Problem Formulation
Notations. A topic taxonomy T = (C, R) is a tree structure over topics, where each node c_j ∈ C represents a single conceptual topic and each edge ∈ R implies the hierarchical relation between a topic and its subtopic. A topic node c_j ∈ C is described by a set of topic-related terms, denoted by P_j (i.e., the term cluster for the topic c_j), where the most representative term (i.e., the center term) serves as the topic name. Each document d_i = [v_{i1}, ..., v_{iL}] and each term p_k = [v_{k1}, ..., v_{kT}] in a given corpus D is a sequence of L and T word tokens, respectively, where each token v belongs to the vocabulary set V. Here, each term is regarded as a phrase that consists of one or more word tokens, so "phrase" and "term" are used interchangeably in this paper.

Problem Definition. Given a text corpus D and an initial topic taxonomy T, the task of topic taxonomy expansion aims to discover novel topics by collecting topic-related terms from D and inserting them at the right positions in T (Figure 1).
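For concreteness, the taxonomy T = (C, R) can be represented as a simple tree of term clusters. The sketch below is an illustrative data structure under our reading of the notation; the class and method names are our own, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class TopicNode:
    """A topic node c_j: its name (center term), term cluster P_j, and children."""
    name: str
    terms: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def insert_subtopic(self, child: "TopicNode") -> None:
        # An edge (c_j, child) in R encodes the topic -> subtopic relation.
        self.children.append(child)

# Example fragment of the taxonomy in Figure 1.
artist = TopicNode("artist", terms=["artist", "musician"])
painter = TopicNode("painter",
                    terms=["baroque painter", "realist painter", "portraitist"])
artist.insert_subtopic(painter)
```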
4 TopicExpan: Proposed Framework

Training Step. TopicExpan optimizes the parameters of its neural model to maximize the total likelihood of the initial taxonomy T given the corpus D:

    P(T | D) = ∏_{(c_j, d_i, p_k)} P(d_i | c_j) · P(p_k | d_i, c_j).    (1)

That is, the total likelihood is factorized into the topic-conditional likelihoods of a document and a phrase, i.e., P(d_i | c_j) and P(p_k | d_i, c_j), for all positive triples (c_j, d_i, p_k) collected from T and D.
That is, each triple satisfies the condition that its phrase p_k belongs to the topic c_j (i.e., p_k ∈ P_j) and also appears in the document d_i.
To maximize Equation (1), we propose a unified model for estimating P(d_i | c_j) and P(p_k | d_i, c_j) via multi-task learning. The details will be presented in Section 4.4.

Encoder Architectures
For modeling the two likelihoods P(d_i | c_j) and P(p_k | d_i, c_j), we introduce a topic encoder and a document encoder, which compute the representations of a topic c_j and a document d_i, respectively.

Topic Encoder
There are two important challenges in designing the architecture of the topic encoder: (1) the topic encoder should be hierarchy-aware, so that the representation of each topic accurately encodes the hierarchical relations with its neighbor topics, and (2) the representation of each topic needs to be discriminative, so that it encodes semantics distinguishable from those of its sibling topics. Hence, we adopt graph convolutional networks (GCNs) (Kipf and Welling, 2017) to capture the semantic relation structure surrounding each topic.
We first construct a topic relation graph G by enriching the edges of the given hierarchy T to model heterogeneous relations between topics, as shown in Figure 3. The graph contains three different types of inter-topic relations: (1) downward, (2) upward, and (3) sideward. The downward and upward edges respectively capture the top-down and bottom-up relations (i.e., hierarchy-awareness). We additionally insert sideward edges between sibling nodes that share the same parent node. Unlike the downward and upward edges, the sideward edges pass information in a negative way to make topic representations discriminative among sibling topics. The topic representation of c_j at the m-th GCN layer is computed by

    h_j^(m) = φ( Σ_{i ∈ N(j)} α_{r(i,j)} · W_{r(i,j)}^(m) h_i^(m-1) ),

where φ is the activation function, N(j) is the set of neighbors of c_j in G, r(i, j) ∈ {down, up, side} denotes the relation type of an edge (i, j), W_r^(m) is the relation-specific weight matrix, and α indicates either positive or negative aggregation according to the relation type, i.e., α_down = α_up = +1 and α_side = −1. The GloVe word vectors (Pennington et al., 2014) for each topic name, averaged over all tokens in the name, are used as the base node features h_j^(0). Using a stack of M GCN layers, we finally obtain the representation of a target topic node c_j (i.e., the topic node whose representation we want to obtain) as c_j = h_j^(M).

The topic encoder should also be able to obtain the representation of a virtual topic node, whose topic name is not yet determined, during the expansion step. For this reason, we mask the base node features of the target topic node regardless of whether the node is virtual or not, as depicted in Figures 3(a) and (b). In other words, with the name of the target topic masked, the topic representation encodes the relation structure of its M-hop neighbor topics.
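A minimal numpy sketch of one such signed, relation-typed GCN layer, under our reading of the text: per-relation weight matrices, sign +1 for downward/upward edges and −1 for sideward edges, followed by a nonlinearity. All function and variable names are illustrative, not from the paper's implementation:

```python
import numpy as np

def gcn_layer(h, edges, weights, phi=np.tanh):
    """One signed, relation-typed GCN layer.

    h       : (n, d) node features h^(m-1)
    edges   : list of (i, j, rel) meaning node i sends a message to node j,
              with rel in {"down", "up", "side"}
    weights : dict rel -> (d, d) relation-specific weight matrix W_r^(m)
    """
    sign = {"down": +1.0, "up": +1.0, "side": -1.0}  # alpha_r in the text
    out = np.zeros_like(h)
    for i, j, rel in edges:
        out[j] += sign[rel] * (h[i] @ weights[rel])
    return phi(out)

# Tiny example: a parent (node 0) with two sibling children (nodes 1 and 2).
rng = np.random.default_rng(0)
d = 4
h0 = rng.normal(size=(3, d))
W = {r: np.eye(d) for r in ("down", "up", "side")}  # identity weights for clarity
edges = [(0, 1, "down"), (0, 2, "down"), (1, 0, "up"), (2, 0, "up"),
         (1, 2, "side"), (2, 1, "side")]
h1 = gcn_layer(h0, edges, W)
```

With identity weights, each sibling receives its parent's features positively and its sibling's features negatively, which is exactly the discriminative effect the sideward edges are meant to produce.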

Document Encoder
For the document encoder, we employ a pretrained language model, BERT (Devlin et al., 2019). It models the interactions among tokens via the self-attention mechanism, thereby obtaining each token's contextualized representation, denoted by [v_{i1}, ..., v_{iL}]. The document representation d_i is obtained by mean pooling over these token representations.

Learning Topic Taxonomy
In the training step, TopicExpan optimizes the model parameters by using the positive triples X = {(c_j, d_i, p_k) | p_k ∈ P_j ∩ d_i, ∀c_j ∈ C, ∀d_i ∈ D} as training data, via multi-task learning of topic-document similarity prediction and topic-conditional phrase generation (Sections 4.3.1 and 4.3.2).

Topic-Document Similarity Prediction
The first task is to learn the similarity between a topic and a document. We define the topic-document similarity score by a bilinear interaction between their representations, i.e., c_j^⊤ M d_i, where M is a trainable interaction matrix. The topic-conditional likelihood of a document in Equation (1) is modeled with this topic-document similarity score,

    P(d_i | c_j) = exp(c_j^⊤ M d_i / γ) / Σ_{d_n ∈ D} exp(c_j^⊤ M d_n / γ).    (2)

The loss function is defined based on InfoNCE (Oord et al., 2018), which pulls positively-related documents toward the topic while pushing negatively-related documents away from it,

    L_sim = − Σ_{(c_j, d_i)} log [ exp(c_j^⊤ M d_i / γ) / Σ_{d_n} exp(c_j^⊤ M d_n / γ) ],    (3)
where γ is the temperature parameter. For each triple (c_j, d_i, p_k), we use its document d_i as the positive and regard the documents from all other triples in the current mini-batch as negatives.
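The in-batch InfoNCE objective described above can be sketched as follows, with the bilinear similarity c^⊤ M d and positives on the diagonal of the batch score matrix. This is an illustrative numpy sketch, not the paper's implementation:

```python
import numpy as np

def info_nce_loss(C, D, M, gamma=0.1):
    """In-batch InfoNCE for topic-document similarity (a sketch).

    C : (B, dt) topic representations c_j, one per triple in the batch
    D : (B, dd) document representations d_i (D[b] is the positive for C[b])
    M : (dt, dd) bilinear interaction matrix
    """
    scores = (C @ M @ D.T) / gamma               # (B, B) similarity matrix
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # -log p(positive) averaged

rng = np.random.default_rng(1)
B, dt, dd = 8, 6, 5
loss = info_nce_loss(rng.normal(size=(B, dt)),
                     rng.normal(size=(B, dd)),
                     rng.normal(size=(dt, dd)))
```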

Topic-Conditional Phrase Generation
The second task is to generate phrases from a document conditioned on a topic. For the phrase generator, we employ the transformer decoder architecture (Vaswani et al., 2017).
For topic-conditional phrase generation, the context representation Q(c_j, d_i) needs to be modeled by fusing the textual content of a document d_i with the relation structure of a topic c_j. To leverage the textual features while focusing on topic-relevant tokens, we compute topic-attentive token representations and pass them as the input context of the transformer decoder. Precisely, the topic-attention score of the l-th token in the document d_i is defined by its similarity with the topic:

    β_l(c_j, d_i) = softmax_l( c_j^⊤ M v_{il} ),    (4)
where the interaction matrix M is weight-shared with the one in Equation (3). Then, the sequential generation process of a token v̂_t is described by

    s_t = TransformerDecoder( [v̂_1, ..., v̂_{t−1}]; Q(c_j, d_i) ),    v̂_t ∼ softmax( FFN(s_t) ),    (5)

where FFN is the feed-forward network that maps a state vector s_t into vocabulary logits. Starting from the first token [BOP], the phrase is acquired by sequentially decoding the next token v̂_t until the last token [EOP] is obtained; the two special tokens indicate the beginning and the end of the phrase.
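The topic-attention step can be sketched in numpy as below: the shared matrix M scores each token against the topic, a softmax yields β, and the reweighted tokens form the decoder context. Names and shapes are our own illustrative choices:

```python
import numpy as np

def topic_attentive_context(c, V, M):
    """Topic-attention over document tokens (illustrative sketch).

    c : (dt,)   topic representation c_j
    V : (L, dd) contextualized token representations [v_1, ..., v_L]
    M : (dt, dd) interaction matrix, weight-shared with the similarity predictor
    Returns the attention weights beta (L,) and the reweighted tokens (L, dd).
    """
    logits = V @ (M.T @ c)            # c^T M v_l for every token l
    beta = np.exp(logits - logits.max())
    beta /= beta.sum()                # softmax over the L tokens
    return beta, beta[:, None] * V    # topic-attentive context Q(c, d)

rng = np.random.default_rng(2)
c = rng.normal(size=6)
V = rng.normal(size=(10, 5))
M = rng.normal(size=(6, 5))
beta, Q = topic_attentive_context(c, V, M)
```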
The loss function is defined by the negative log-likelihood,

    L_gen = − Σ_{(c_j, d_i, p_k)} Σ_{t=1}^{T} log P( v_{kt} | v_{k,<t}, c_j, d_i ),    (6)

where the phrase p_k = [v_{k1}, ..., v_{kT}] in a positive triple (c_j, d_i, p_k) is used as the target sequence of word tokens.
To sum up, the joint optimization of Equations (3) and (6) updates all the model parameters in an end-to-end manner, including the similarity predictor, the phrase generator, and both encoders.

Expanding Topic Taxonomy
In the expansion step, TopicExpan expands the topic taxonomy by using the trained model to generate phrases for a virtual topic, which is assumed to be located at a valid insertion position in the hierarchy. For thorough expansion, it considers the child position of every existing topic node as a valid position. That is, for each virtual topic node c*_j (referring to a new child of a topic node c_j), one at a time, it performs topic phrase generation and clustering (Sections 4.4.1 and 4.4.2) to discover multiple novel topic nodes at that position.

Novel Topic Phrase Generation
Given a virtual topic node c*_j and each document d_i ∈ D, the trained model computes the topic-document similarity score and generates a topic-conditional phrase p* = [v̂_1, ..., v̂_T], where v̂_t ∼ P(v_t | v_{<t}, c*_j, d_i). Here, the generated phrase p* is less likely to belong to the virtual topic c*_j if its source document d_i is less relevant to the virtual topic. Thus, we utilize the topic-document similarity score as the confidence of the generated phrase. To collect only qualified topic phrases, we filter out non-confident phrases whose normalized topic-document similarity score is smaller than a threshold. In addition to this confidence-based filtering, we exclude phrases that do not appear in the corpus at all, since they are likely to be implausible; this substantially reduces the hallucination problem of a generation model.
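The two filters above (confidence threshold on the normalized score, plus corpus-presence check) can be sketched as follows. Min-max normalization is our assumption; the paper says only that the score is normalized:

```python
import numpy as np

def filter_generated_phrases(cands, corpus_vocab, threshold=0.5):
    """Keep only confident, corpus-attested phrases (a sketch).

    cands        : list of (phrase, similarity_score) pairs for one virtual topic
    corpus_vocab : set of phrases observed in the corpus D
    threshold    : cutoff on the min-max normalized similarity (our assumption)
    """
    scores = np.array([s for _, s in cands], dtype=float)
    lo, hi = scores.min(), scores.max()
    norm = (scores - lo) / (hi - lo + 1e-12)
    return [p for (p, _), n in zip(cands, norm)
            if n >= threshold and p in corpus_vocab]

# Toy example echoing the case study in Section 5.3.3.
cands = [("nail lacquer", 0.85), ("metallic black", 0.02), ("gel polish", 0.70)]
vocab = {"nail lacquer", "gel polish", "metallic black"}
kept = filter_generated_phrases(cands, vocab, threshold=0.5)
```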

Novel Topic Phrase Clustering
To identify multiple novel topics at the position of the virtual topic node c*_j, we perform clustering on the phrases collected for the virtual topic. We obtain semantic features for each phrase by averaging the GloVe vectors (Pennington et al., 2014) of its word tokens, and then run k-means clustering with the initial number of clusters k set manually. Among the resulting clusters, we selectively identify new topics based on cluster size, and the center phrase of each cluster is used as the topic name.
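The clustering step is plain k-means over the averaged phrase vectors. A self-contained numpy sketch (the farthest-point initialization is our own choice, made only to keep the toy example deterministic):

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain k-means for clustering averaged phrase vectors (a sketch)."""
    # Greedy farthest-point initialization for a deterministic toy example.
    centers = [X[0]]
    for _ in range(k - 1):
        dists = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each phrase vector to its nearest cluster center.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers

# Two well-separated blobs standing in for two groups of novel-topic phrases.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 50)),
               rng.normal(5.0, 0.1, size=(20, 50))])
labels, centers = kmeans(X, k=2)
```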

Experimental Settings
Datasets. We use two real-world document corpora, each with a three-level topic taxonomy: Amazon (McAuley and Leskovec, 2013) contains product reviews collected from Amazon, and DBPedia (Lehmann et al., 2015) contains Wikipedia articles. All documents in both datasets are tokenized by the BERT tokenizer (Devlin et al., 2019) and truncated to a maximum of 512 tokens. The statistics are listed in Table 1.
Baseline Methods. We consider methods for building a topic taxonomy from scratch, hLDA (Griffiths et al., 2003) and TaxoGen (Zhang et al., 2018). We also evaluate the state-of-the-art methods for topic taxonomy expansion, CoRel (Huang et al., 2020) and TaxoCom (Lee et al., 2022). Both of them identify and insert new topic nodes based on term embedding and clustering, with the initial topic taxonomy leveraged as supervision.
Experimental Settings. To evaluate the performance of novel topic discovery, we follow the previous convention of randomly deleting half of the leaf nodes from the original taxonomy and asking each expansion method to reproduce them (Shen et al., 2020; Lee et al., 2022). Considering the deleted topics as ground truth, we measure how completely new topics are discovered and how accurately they are inserted into the taxonomy.

Quantitative Evaluation

Topic Taxonomy Expansion
First of all, we assess the quality of the output topic taxonomies. Following previous topic taxonomy evaluations (Huang et al., 2020; Lee et al., 2022), we recruit 10 doctoral researchers and use their domain knowledge to examine three different aspects of a topic taxonomy. Term coherence indicates how strongly the terms in a topic node are relevant to each other. Relation accuracy computes how accurately a topic node is inserted into the topic taxonomy (i.e., precision for novel topic discovery). Subtopic integrity measures the completeness of the subtopics of a topic node (i.e., recall for novel topic discovery). For exhaustive evaluation, we divide the output taxonomy of each expansion method into three disjoint parts T_1, T_2, and T_3, so that each of them covers some first-level topics (and their subtrees), as listed in Table 6 in Section A.5.

In Table 2, TopicExpan achieves the highest scores for all aspects. For all the baseline methods, the term coherence is not good enough because they assign candidate terms to a new topic according to topic-term relevance mostly learned from term co-occurrences. In contrast, TopicExpan effectively collects coherent terms relevant to a new topic (i.e., term coherence ≥ 0.90) by directly generating topic-conditional terms from documents. TopicExpan also shows significantly higher relation accuracy and subtopic integrity than the other expansion methods, with the help of its GNN-based topic encoder that captures the holistic topic structure beyond first-order topic relations.

Topic-Conditional Phrase Generation
We investigate the topic phrase prediction performance of our framework and other keyphrase extraction/generation models. We leave out 10% of the positive triples (c_j, d_i, p_k) from the training set X and use them as the test set. We measure perplexity (PPL) and accuracy (ACC) by comparing each generated phrase with the target phrase at the token level and phrase level, respectively.

In Table 3, TopicExpan achieves the best PPL and ACC scores. We observe that TopicExpan generates topic-related phrases from input documents more accurately than the state-of-the-art keyphrase generation methods, which cannot consider a specific topic as the condition for generation. In addition, ablation analyses validate that each component of our framework contributes to accurate generation of topic phrases. In particular, the hierarchical (i.e., upward and downward) and sideward relation modeling of the topic encoder improves the quality of generated phrases.

Comparison of Topic Terms
We qualitatively compare the topic terms found by each method. In the case of TopicExpan, we sort all confident topic terms by their cosine distances to the topic name (i.e., the center term), using the global embedding features (Pennington et al., 2014).
Table 4 shows that the topic terms of TopicExpan are superior to those of the baseline methods in terms of expressiveness as well as topic relevance. In detail, some of the terms retrieved by CoRel and TaxoCom are either off-topic or too general (marked with a strikethrough); this indicates that their topic relevance score for each term is not good at capturing the hierarchical topic knowledge of a text corpus. On the contrary, TopicExpan generates strongly topic-related terms by capturing the relation structure of each topic. Furthermore, TopicExpan is effective in finding infrequently-appearing multi-word terms (underlined), which all the extraction-based methods fail to obtain.

Comparison of Novel Topics
Next, we examine the novel topics inserted by each expansion method. To show the effectiveness of the sideward relation modeling adopted by our topic encoder (Section 4.2.1), we additionally present the results of TopicExpan+sr and TopicExpan-sr, which compute topic representations with and without capturing the sideward topic relations, respectively.
In Table 5, TopicExpan+sr successfully discovers new topics that should be placed at a target position. Notably, the new topics are clearly distinguishable from their sibling topics (i.e., the known topics given in the initial topic hierarchy), which reduces the redundancy of the output topic taxonomy. On the other hand, CoRel and TaxoCom show limited performance for novel topic discovery; some new topics are redundant (✂) while others do not preserve the hierarchical relation with the existing topics (⊗). Some of the new topics found by TopicExpan-sr semantically overlap with their sibling topics, even though they are at the correct position in the hierarchy; this implies that our topic encoder with sideward relation modeling makes the representation of a virtual topic node discriminative from its sibling topic nodes, which eventually helps to discover new conceptual topics with novel semantics.

Case Study of Topic Phrase Generation
To study how the generated phrases and their topic-document similarity scores (i.e., confidences) vary depending on the topic condition, we provide examples of topic-conditional phrase generation. The input document in Figure 4 contains a review of nail care products. When the relation structure of the target topic implies a nail product (Figure 4 Left), TopicExpan obtains the desired topic-relevant phrase "nail lacquer" along with a high topic-document similarity of 0.8547. On the other hand, given the relation structure of a target topic inferred to be a kind of meat food (Figure 4 Right), it generates the topic-irrelevant phrase "metallic black" from the document along with a low topic-document similarity of 0.0023. That is, TopicExpan fails to get a qualified topic phrase when the textual content of an input document is obviously irrelevant to the target topic. In this sense, TopicExpan filters out non-confident phrases having a low topic-document similarity score, to collect only the phrases relevant to each virtual topic.

Analysis of Topic-Document Similarity
Finally, we investigate how the generated phrases change with respect to the topic-document similarity scores, in two aspects. The first aspect is the ratio of three categories of generated phrases, which have been studied in the keyphrase generation literature (Meng et al., 2017; Zhou et al., 2021): (1) present phrases appearing in the input document, (2) absent phrases not appearing in the input document but in the corpus at least once, and (3) unseen (i.e., totally new) phrases that are not observed in the corpus at all. The second aspect is the average semantic distance among the phrases, measured using the semantic features. For the plots in Figure 5, the horizontal axis represents 10 bins of normalized topic-document similarity scores over all generated phrases.

Interestingly, TopicExpan hardly generates absent phrases (about 0.7% for Amazon, 1.7% for DBPedia) or unseen phrases (about 0.1% for Amazon, 0.2% for DBPedia) regardless of the topic-document similarity; instead, it generates present phrases in most cases (Figure 5 Left). In other words, if the input document is not relevant to the target topic, it tends to generate an irrelevant-but-present phrase rather than a relevant-but-absent phrase, as shown in Section 5.3.3. One potential risk of TopicExpan is generating unseen phrases that are nonsensical or implausible, also known as hallucinations in neural text generation; such unseen phrases can degrade the quality and credibility of output topic taxonomies. This result supports that we can simply exclude all unseen phrases, which account for less than 0.2% of generated phrases, to effectively address this issue.
Moreover, the negative correlation between the topic-document similarity score and the inter-phrase semantic distance (Figure 5 Right) provides empirical evidence that the similarity score can serve as the confidence of a generated topic phrase. There is a clear tendency for the average semantic distance to decrease as the topic-document similarity score increases; this implies that the phrases generated from topic-relevant documents are semantically coherent with each other and, accordingly, are likely to belong to the same topic.

Conclusion
In this paper, we study the problem of topic taxonomy expansion, pointing out that existing approaches show limited term coverage and inconsistent topic relations. Our TopicExpan framework introduces hierarchy-aware topic term generation, which generates a topic-related term by using both the textual content of an input document and the relation structure of a topic as the conditions for generation. The quantitative and qualitative evaluation demonstrates that our framework obtains much higher-quality topic taxonomies in various aspects, compared to other baseline methods.
For future work, it would be promising to incorporate an effective measure of the topic relevance of multi-word terms (i.e., phrases) into our framework. Learning and utilizing representations of multi-word terms remains challenging and worth exploring, and such representations could be widely applied to many other text mining tasks.

Limitations
Despite the remarkable performance of TopicExpan on the tested corpora, there is still room for improvement in how to better handle topics, documents, and phrases for effective mining of topic knowledge. First, TopicExpan uses only the topic names (i.e., center terms) as the base node features in the topic relation graph, which makes it difficult for our topic encoder to capture the collective meaning of each topic from its set of topic-related phrases. Second, the confidence of each generated phrase considers only the topic relevance of its source document, instead of all the documents in which the phrase appears. Finally, the clustering process does not leverage the contextualized textual features computed by our BERT-based document encoder, which makes it hard to consolidate the context of a phrase within its source document.

A Supplementary Material
A.1 Pseudo-code of TopicExpan

Algorithm 1 describes the detailed process of our framework, including the training step (Lines 1-9) and the expansion step (Lines 10-23). The final output is the expanded topic taxonomy (Line 24).
Algorithm 1: The process of TopicExpan.

Training Step (Lines 1-9). TopicExpan first collects all positive triples (c_j, d_i, p_k) from the initial topic taxonomy T and the text corpus D (Line 1; Section 4.1), and constructs a topic relation graph G from the topic hierarchy (Line 2; Section 4.2.1). Then, it updates all trainable parameters via gradient back-propagation (Lines 5-9) to minimize the losses of the topic-document similarity prediction task (Line 6; Section 4.3.1) and the topic-conditional phrase generation task (Line 7; Section 4.3.2).

Expansion Step (Lines 10-23). Using the trained model, TopicExpan discovers new topics that need to be inserted at each valid position in the topic hierarchy (Line 11). For a virtual topic node c*_j, introduced as a new child of each topic node c_j (Line 13), it constructs a topic relation graph G* from the topic hierarchy augmented with the virtual topic node (Lines 14-15). Then, it collects all pairs of a topic-document similarity score and a generated topic phrase (ŝ, p̂), obtained by applying the trained model to the augmented topic relation graph and all the documents (Lines 16-20; Section 4.4.1). Next, it filters out non-confident (i.e., irrelevant) phrases according to the normalized score (Line 21), and then performs clustering to find multiple phrase clusters, each of which is considered a new topic node with novel topic semantics (Line 22; Section 4.4.2). In the end, it inserts the identified new topic nodes at the target position (i.e., as children of the topic node c_j) to expand the current topic taxonomy (Line 23).
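The expansion step (Lines 10-23) can be sketched end-to-end as below. The model, the clustering routine, and all names here are toy stubs of our own invention, standing in for the trained neural model and the k-means step; only the control flow mirrors Algorithm 1:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)

def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)

class StubModel:
    """Toy stand-in for the trained model: 'generates' the first word of a
    document as the phrase, scored by whether the document mentions the parent."""
    def generate(self, parent, doc):
        score = 0.9 if parent.name in doc else 0.1
        return score, doc.split()[0]

def cluster_phrases(phrases, k):
    # Trivial grouping stub in place of k-means (Section 4.4.2).
    groups = {}
    for p in phrases:
        groups.setdefault(p, []).append(p)
    return list(groups.values())[:k]

def expand_taxonomy(root, corpus, model, threshold=0.5, k=3):
    # Lines 10-23 of Algorithm 1: one virtual child per existing node.
    for node in list(iter_nodes(root)):
        pairs = [model.generate(node, d) for d in corpus]           # Lines 16-20
        scores = [s for s, _ in pairs]
        lo, hi = min(scores), max(scores)
        confident = [p for s, p in pairs
                     if (s - lo) / (hi - lo + 1e-12) >= threshold]  # Line 21
        for cluster in cluster_phrases(confident, k):               # Line 22
            node.children.append(Node(cluster[0]))                  # Line 23
    return root

root = Node("artist")
corpus = ["painter artist studio", "drummer concert tour"]
expand_taxonomy(root, corpus, StubModel())
```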

A.2 Baseline Methods
For the baselines, we employ the official author codes, following the parameter settings provided by Lee et al. (2022). For all methods that optimize a Euclidean or spherical embedding space (i.e., TaxoGen, CoRel, and TaxoCom), we fix the number of negative terms (for each positive term pair) to 2 during optimization.
• hLDA (Griffiths et al., 2003) performs hierarchical latent Dirichlet allocation. It models the document generation process as sampling words along a path selected from the root to a leaf. We set the smoothing parameters α = 0.1 and η = 1.0 for the document-topic and topic-word distributions, respectively, and the concentration parameter of the Chinese restaurant process γ = 1.0.
• TaxoGen (Zhang et al., 2018) is an unsupervised framework for topic taxonomy construction. To identify hierarchical term clusters, it optimizes the term embedding space with SkipGram (Mikolov et al., 2013). We set the maximum taxonomy depth to 3 and the number of child nodes to 5, as done in (Zhang et al., 2018; Shang et al., 2020).
• CoRel (Huang et al., 2020) is the first topic taxonomy expansion method. It trains a topic relation classifier using the initial taxonomy, then recursively transfers the relation to find candidate terms for novel subtopics. Finally, it identifies novel topic nodes based on term embeddings induced by SkipGram (Mikolov et al., 2013).
• TaxoCom (Lee et al., 2022) is the state-of-the-art method for topic taxonomy expansion. For each node from the root to the leaves, it recursively optimizes term embeddings and performs term clustering to identify both known and novel subtopics. We set β = 1.5, 2.5, 3.0 (for each level) in the novelty threshold τ_nov, and fix the significance threshold τ_sig = 0.3.

A.3 Implementation Details
Model Architecture. For the topic encoder, we use two GCN layers to avoid the over-smoothing problem, and fix the dimensionality of all node representations to 300. For the document encoder, we employ bert-base-uncased provided by HuggingFace (Devlin et al., 2019) as the initial checkpoint of a pretrained model. It contains 12 layers of transformer blocks with 12 attention heads, yielding 768-dimensional contextualized token representations [v_i1, ..., v_iL] (and a final document representation d_i = mean-pooling(v_i1, ..., v_iL)) for an input document d_i. Consequently, the size of the interaction matrix M in our topic-document similarity predictor (Equation (3)) becomes 300 × 768. For the phrase generator, we adopt a single layer of the transformer decoder with 16 attention heads and train its parameters from scratch, without using the checkpoint of a pretrained text decoder. We limit the maximum length of a generated phrase to 10 tokens. Figure 6 shows the phrase generator architecture. In total, our neural model contains 540K (topic encoder), 110M (document encoder), 230K (similarity predictor), and 30M (phrase generator) parameters.
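The interaction matrix M matches a 300-dimensional topic representation against 768-dimensional token representations. A minimal sketch of this bilinear interaction, with toy dimensions (3 and 4) in place of 300 and 768, is shown below; the exact scoring form of Equation (3) is not reproduced here, so the temperature-scaled mean over tokens is an assumption, and only the shapes follow the text.

```python
def bilinear_score(topic, token, M):
    """topic^T · M · token, where M has shape len(topic) x len(token)."""
    return sum(topic[i] * M[i][j] * token[j]
               for i in range(len(topic)) for j in range(len(token)))

def doc_similarity(topic, tokens, M, gamma=0.1):
    """A guessed aggregation: temperature-scaled mean over the
    token-level interaction scores (gamma as in the Training Step)."""
    scores = [bilinear_score(topic, v, M) for v in tokens]
    return (sum(scores) / len(scores)) / gamma
```

In the real model, topic comes from the GCN topic encoder and tokens are the BERT token representations [v_i1, ..., v_iL].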

Training Step. For the optimization of model parameters, we use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 5e-5 and a weight decay of 5e-6. The batch size is set to 64, and the temperature parameter γ in Equation (3) is set to 0.1. The best model is chosen by the perplexity of generated topic phrases on the validation set of positive triples (c_j, d_i, p_k), evaluated every epoch.
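The perplexity-based model selection above is the standard definition (exp of the mean per-token negative log-likelihood); a small sketch, with hypothetical helper names not taken from the paper's code:

```python
import math

def perplexity(token_nlls):
    """Standard perplexity: exp of the mean per-token negative
    log-likelihood of the generated phrase tokens."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def select_best(checkpoints):
    """checkpoints: list of (epoch, validation token NLLs); pick the
    epoch whose validation perplexity is lowest."""
    return min(checkpoints, key=lambda c: perplexity(c[1]))[0]
```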

Expansion Step. To filter out non-confident phrases (Section 4.4.1), we set the threshold value τ to 0.8 after applying min-max normalization to all topic-document similarity scores computed for each virtual topic node. To perform k-means clustering on the collected topic phrases (Section 4.4.2), we set the initial number of clusters k to 10, then select the top-5 clusters by cluster size (i.e., the number of phrases assigned to each cluster). The center phrase of each cluster is used as the final topic name of the new topic node.
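The post-processing in this step reduces to: min-max normalize the scores, keep phrases at or above τ = 0.8, and retain the largest clusters. The sketch below is a simplification: the exact-match grouping stands in for k-means over phrase embeddings (k = 10, top-5 by size), and the function names are illustrative.

```python
from collections import Counter

def confident_phrases(scored_phrases, tau=0.8):
    """Min-max normalize topic-document similarity scores, then keep
    the phrases whose normalized score is at least tau."""
    scores = [s for s, _ in scored_phrases]
    lo, hi = min(scores), max(scores)
    if hi == lo:  # degenerate case: no spread, nothing is confident
        return []
    return [p for s, p in scored_phrases if (s - lo) / (hi - lo) >= tau]

def top_clusters(phrases, top_k=5):
    """Rank phrase groups by size and keep the top_k largest; the most
    representative phrase of each group serves as the topic name."""
    return [p for p, _ in Counter(phrases).most_common(top_k)]
```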

A.4 Computing Platform
All the experiments are carried out on a Linux server with an Intel Xeon Gold 6130 CPU @ 2.10GHz and 128GB RAM, using a single RTX 3090 GPU. In this environment, training TopicExpan takes around 2 hours for Amazon and 6 hours for DBPedia.

A.5 Quantitative Evaluation Protocol
For exhaustive evaluation on a large-scale topic taxonomy with hundreds of topic nodes, the output taxonomy of the topic taxonomy expansion methods (i.e., CoRel, TaxoCom, and TopicExpan) is divided into three parts T1, T2, and T3, so that each part covers some of the first-level topics (and their subtrees) listed in Table 6. In the case of hLDA and TaxoGen, the first-level topics in their output taxonomies do not match the ground-truth topics (in Table 6), because they build a topic taxonomy from scratch. For this reason, in Table 2, their output taxonomies are evaluated as a whole, without partitioning. In addition, the two metrics for novel topic discovery (i.e., relation accuracy and subtopic integrity) are designed to evaluate topic taxonomy expansion methods, so it is infeasible to measure these aspects on the output taxonomies of hLDA and TaxoGen. Thus, we report only the metric for topic identification (i.e., term coherence) for them in Table 2.

Term Coherence. It indicates how strongly the terms in a topic node are relevant to each other. Evaluators count the number of terms relevant to the common topic (or topic name) among the top-5 terms found for each topic node.

Relation Accuracy. It computes how accurately a topic node is inserted into a given topic hierarchy (i.e., precision for novel topic discovery). For each valid position, evaluators count the number of newly-inserted topics that are in the correct relationship with the surrounding topics.

Subtopic Integrity. It measures the completeness of subtopics for each topic node (i.e., recall for novel topic discovery). Evaluators investigate how many ground-truth novel topics, which were deleted from the original taxonomy, match one of the newly-inserted topics.
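All three aspects reduce to simple ratios over evaluator counts. A hypothetical helper making that explicit (the function names and the aggregation by averaging per-node fractions are our assumptions, not the paper's published evaluation script):

```python
def term_coherence(relevant_counts, top_k=5):
    """Mean fraction of topic-relevant terms among the top-k terms,
    averaged over topic nodes."""
    return sum(c / top_k for c in relevant_counts) / len(relevant_counts)

def relation_accuracy(correct_insertions, total_insertions):
    """Precision of novel topic discovery: correctly related
    newly-inserted topics over all newly-inserted topics."""
    return correct_insertions / total_insertions

def subtopic_integrity(matched_deleted, total_deleted):
    """Recall of novel topic discovery: deleted ground-truth topics
    recovered by some newly-inserted topic, over all deleted topics."""
    return matched_deleted / total_deleted
```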

A.6 Examples of Topic Phrase Generation
We provide additional examples of topic-conditional phrase generation obtained by TopicExpan. Figure 7 illustrates a confident phrase (Left) and a non-confident phrase (Right), generated from each input document and the given relation structure of a target topic, for both datasets. As discussed in Section 5.3.3, when the target topic is relevant to the document (i.e., a high topic-document similarity score), TopicExpan successfully generates a phrase relevant to the target topic. Conversely, when the target topic is irrelevant to the document (i.e., a low topic-document similarity score), TopicExpan produces a phrase irrelevant to the target topic.

Figure 1: An example of topic taxonomy expansion. The known (i.e., existing) topics and novel topics are in single-line and double-line boxes, respectively.

Figure 2: The overall process of TopicExpan. (Left) It trains a unified model via multi-task learning of topic-document similarity prediction and topic-conditional phrase generation. (Right) It selectively collects the phrases conditionally generated for a virtual topic node, and then identifies multiple novel topics from phrase clusters.

4.1 Overview
TopicExpan consists of (1) the training step, which trains a neural model for generating phrases topic-conditionally from documents (Figure 2, Left), and (2) the expansion step, which identifies novel topics for each new position in the taxonomy using the trained model (Figure 2, Right). The detailed algorithm is described in Section A.1.

Figure 3: The topic encoder architecture. It computes topic representations by encoding a topic relation graph.

Figure 4: Examples of topic-conditional phrase generation, given a document and its relevant/irrelevant topic.

Figure 5: The ratio of three categories of generated phrases (Left) and the average semantic distance among generated phrases (Right). The horizontal axis shows 10 bins of normalized topic-document similarity scores.

Figure 6: The phrase generator architecture. It generates the token sequence given a topic and a document, using topic-attentive token representations as the context.

Figure 7: Examples of topic-conditional phrase generation, given a document and its relevant/irrelevant topic.

Table 1: The statistics of the datasets.

Table 2: Quantitative evaluation on output topic taxonomies. The average and standard deviation for the three aspects are reported. Relation accuracy and subtopic integrity are considered only for the expansion methods, whose identified new topic nodes can be clearly compared with the ground-truth ones at each valid position.

Table 3: Performance for topic phrase generation.

Table 4: Top-5 topic terms included in each topic node. Off-topic (or too general) terms are marked with a strikethrough, and multi-word terms that are not obtainable by the extraction-based methods are underlined.

Table 5: Novel topics identified at each target position. The center term (i.e., topic name) of each identified topic is presented. Correct topics (✓), incorrect topics (⊗), and redundant topics (✂) are annotated.

Table 6: Three disjoint parts of the topic taxonomy.