Tree-Structured Topic Modeling with Nonparametric Neural Variational Inference

Topic modeling has been widely used for discovering the latent semantic structure of documents, but most existing methods learn topics with a flat structure. Although probabilistic models can generate topic hierarchies by introducing nonparametric priors such as the Chinese restaurant process, such methods have data scalability issues. In this study, we develop a tree-structured topic model by leveraging nonparametric neural variational inference. In particular, the latent components of the stick-breaking process are first learned for each document; the affiliations of these latent components are then modeled by dependency matrices between network layers. Utilizing this network structure, we can efficiently extract a tree-structured topic hierarchy with a reasonable structure, low redundancy, and adaptable widths. Experiments on real-world datasets validate the effectiveness of our method.


Introduction
Topic models (Blei et al., 2003; Griffiths et al., 2004) are important tools for discovering latent semantic patterns in a corpus. These models can be grouped into flat models and hierarchical models. In many domains, topics can be naturally organized into a tree, where the hierarchical relationships among topics are valuable for data analysis and exploration. The tree-structured topic model (Griffiths et al., 2004) was thus developed to learn coherent topics from text without disrupting the inherent hierarchical structure. Such methods have proven useful in various downstream applications, including hierarchical categorization of Web pages (Ming et al., 2010), extraction of aspect hierarchies in reviews (Kim et al., 2013), and discovery of research topic hierarchies in academic repositories (Paisley et al., 2014).
Despite their practical importance and potential advantages, tree-structured topic models still face the following challenges. Firstly, the hierarchical structure of topics should be reasonable (Viegas et al., 2020). Typically, topics near the root are more general, while those close to the leaves are more specific. Moreover, child topics should be coherent with their corresponding parent topics. Secondly, the extracted topics should have low redundancy, so that the word distributions of parent topics and their children do not become extremely similar (Griffiths et al., 2004). Thirdly, the number of topics at each level of the hierarchy should be determined automatically by the model, because it is usually unknown and cannot be fixed to a predefined value in advance (Kim et al., 2012). Finally, probabilistic models have difficulty scaling to large datasets (Isonuma et al., 2020). Several tree-structured topic models (Griffiths et al., 2004; Kim et al., 2012; Isonuma et al., 2020) have been developed previously, but none of them fully overcomes the aforementioned challenges.
In this paper, we focus on grouping topics into a reasonable tree structure, based on the neural variational inference (NVI) framework (Kingma and Welling, 2014; Rezende et al., 2014) with a nonparametric prior. Owing to their excellent function-fitting ability, neural networks have been widely introduced into topic modeling. Nonetheless, few neural methods explicitly model the dependencies among different layers to obtain explainable hierarchical topics, largely due to the weak interpretability of neural networks. Furthermore, the inflexibility of neural networks also makes it difficult to learn an unbounded number of topics at each level. To address these limitations, we propose a novel nonparametric neural method to generate tree-structured topic hierarchies, namely the nonparametric Tree-Structured Neural Topic Model (nTSNTM). By connecting the network layers with dependency matrices, the model is able to extract an explainable tree-structured hierarchy. Firstly, the topic affiliations among hierarchy levels can be determined by the discrete vectors of the dependency matrices. Secondly, to control redundancy among topics, we allow the model to freely generate topics without duplicating their corresponding parent topics. Thirdly, we couple a stick-breaking process with NVI to equip the topic tree with self-determined widths, which helps the model determine the number of topics automatically. Finally, owing to the advantages of neural networks, our model scales conveniently to larger datasets. Experiments indicate that our model outperforms baselines on several widely adopted metrics and on two new measurements developed for tree-structured topic models.
The rest of this paper is organized as follows. We describe related work in Section 2. Then, we detail the proposed nTSNTM in Section 3. Section 4 presents our experimental results and discussions. Finally, we draw conclusions in Section 5.

Related Work
In (Griffiths et al., 2004), a tree-structured topic model called hLDA was first proposed by introducing a nested Chinese restaurant process (nCRP). In hLDA, a topic tree of a given depth is constructed through Gibbs sampling. Based on hLDA, Xu et al. (2018) proposed a knowledge-based HTM to generate topic hierarchies from corpora of multiple domains, but the hierarchical relation between an ancestor topic and its offspring may be unclear, because a document is generated by the topics along a single path of the tree. To overcome this issue, Kim et al. (2012) proposed a recursive CRP (rCRP), in which a document possesses a distribution over the entire tree. Although rCRP has shown remarkable competitiveness in hierarchical topic modeling, it suffers from the major limitation of data scalability (Isonuma et al., 2020). Several other methods focused on hierarchical text clustering. For instance, Ghahramani et al. (2010) applied nested stick-breaking processes to cluster data into a tree structure. Unfortunately, this method models each document by a single node of the tree. Liu et al. (2014) developed a model named HLTA for topic detection, in which words and topics are clustered by applying the Bridged-Islands algorithm iteratively. However, HLTA is unable to cope with polysemous words, which is quite important for topic models.
To couple nonparametric processes with NVI, Miao et al. (2017) used Gaussian distributions to generate stick-breaking fractions. Nalisnick and Smyth (2017) first described how to use stochastic gradient variational Bayes for posterior inference of the weights in stick-breaking processes. Experiments indicated that the latent representations of their model were more discriminative than those of the Gaussian variant. Later, Ning et al. (2020) developed two nonparametric neural topic models by treating topics as trainable parameters. Unfortunately, all of these methods can only learn topics with a flat structure.
For tree-structured neural topic modeling, a feasible approach is to decompose the distribution over the topic tree into a path distribution and a level distribution. Following (Wang and Blei, 2009), where a tree-based stick-breaking construction of nCRP was first derived to draw topic paths and a level distribution was then learned to sample topics along each path, Isonuma et al. (2020) proposed a tree-structured neural topic model (TSNTM) by parameterizing unbounded ancestral and fraternal topic distributions. TSNTM applies a doubly-recurrent neural network (DRNN) to obtain topic embeddings via ancestral and fraternal edges, and then generates breaking fractions by the dot product between document embeddings and topic embeddings. However, TSNTM fails to learn a reasonable topic tree for the following reasons. Firstly, the breaking fractions do not obey the Beta distributions adopted in the stick-breaking process (SBP). Secondly, the structure of the DRNN in TSNTM is simplified: the topic embeddings are generated directly from an initialized root embedding and two parameter matrices (i.e., ancestral and fraternal connections), which prevents the model from learning appropriate semantic embeddings for topics. Finally, TSNTM relies on heuristic rules to update the tree structure.
Another stream of work generates a document by a directed acyclic graph (DAG) structured topic hierarchy. For instance, Li and McCallum (2006) introduced the pachinko allocation model (PAM) to capture correlations between topics using a DAG. Mimno et al. (2007) proposed the hierarchical PAM by connecting the root topic to lower-level topics through multinomial distributions. Non-probabilistic matrix factorization has also been used to extract the topic structure. Liu et al. (2018) used non-negative matrix factorization (NMF) with three optimization constraints, namely global independence, local independence, and information consistency, to preserve topic coherence and a reasonable structure. Viegas et al. (2020) incorporated pre-trained word embeddings into NMF to further improve topic coherence. The main limitation of NMF-based methods, however, is that a time-consuming process (e.g., measuring the stability of results over multiple random samplings) is necessary to determine the number of topics at each level, because nonparametric priors are intractable to include in these models.

Tree-Structured Neural Topic Model with Nonparametric Prior
In this section, we first describe the stick-breaking process. Then, we introduce the modeling of the tree-structured topic hierarchy. Finally, we detail the inference method of our nTSNTM.

Stick-breaking Process
For nonparametric models, the stick-breaking prior is a random measure of the form $G = \sum_{k=1}^{\infty} \pi_k \delta_{\zeta_k}$, where $\delta_{\zeta_k}$ is a discrete measure concentrated at $\zeta_k \sim G_0$ (Ishwaran and James, 2001)$^2$, i.e., a draw from the base measure. The $\pi_k$'s are random weights independent of $G_0$ (Nalisnick and Smyth, 2017). This constructive definition is known as the SBP (Sethuraman, 1994), which implies that the weights $\pi = (\pi_k)_{k=1}^{\infty}$ can be drawn by iteratively breaking off segments from a unit stick.

Figure 1: Stick-breaking construction.
$^2$ In topic models, $\zeta_k$ represents the $k$-th topic and $G_0$ represents the topic space.
As shown in Figure 1, we break the unit stick and obtain the first component with length $v_1$. If a fraction $v_2$ of the remaining stick is broken off, we obtain the second component with length $v_2(1 - v_1)$ and a remaining stick of length $(1 - v_1)(1 - v_2)$. The following breaks apply the same operation to the remaining stick. Given a truncation level $T$, the length of the last component is $\prod_{j=1}^{T-1}(1 - v_j)$. Formally, the length of each component is defined as:

$$\pi_k = v_k \prod_{j=1}^{k-1}(1 - v_j),$$

where $v_k \sim \mathrm{Beta}(\alpha_0, \beta_0)$, with $\alpha_0$ and $\beta_0$ being the prior parameters. Note that the component weights $\pi$ satisfy $0 \le \pi_k \le 1$ and $\sum_{k=1}^{\infty} \pi_k = 1$, so we can interpret $\pi$ as random probabilities. In particular, when $v_k \sim \mathrm{Beta}(1, \beta_0)$, the joint distribution of $\pi$ is the GEM distribution (Pitman, 2006) with concentration parameter $\beta_0$, and the corresponding SBP is one of the constructions of the Dirichlet process, a popular nonparametric random process for topic modeling (Teh et al., 2005).
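As an illustration, the stick-breaking computation above can be sketched in a few lines of NumPy (the function and variable names are ours, not from the paper):

```python
import numpy as np

def stick_breaking(v):
    """Turn breaking fractions v_1..v_{T-1} into component weights pi_1..pi_T.

    pi_k = v_k * prod_{j<k} (1 - v_j); the last weight absorbs the remaining
    stick, prod_{j=1}^{T-1} (1 - v_j), so the weights sum to 1 exactly.
    """
    v = np.asarray(v, dtype=float)
    # Length of stick remaining before each break: 1, (1-v_1), (1-v_1)(1-v_2), ...
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)])
    pi = np.concatenate([v, [1.0]]) * remaining
    return pi

rng = np.random.default_rng(0)
alpha0, beta0 = 1.0, 5.0
v = rng.beta(alpha0, beta0, size=9)   # T - 1 = 9 fractions, truncation T = 10
pi = stick_breaking(v)
assert abs(pi.sum() - 1.0) < 1e-12
```

With `alpha0 = 1`, this is exactly a (truncated) draw of GEM($\beta_0$) weights.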
In our method, we take the component weights $\pi$ as the path distribution of a document. We assume that the words of a document come from several topic paths. Due to the sequential nature of the stick-breaking operation, paths with smaller indices are more likely to be activated to represent documents, while paths with larger indices tend to remain inactive. The number of activated paths is adjusted by the SBP automatically.

Tree-Structured Topic Hierarchy
To conveniently describe our method, we compare the sampling processes of different tree-structured topic models for an example document. As shown in Figure 2, hLDA (Griffiths et al., 2004) assumes that a document is generated by the topics of a single path, which violates the multi-topic assumption of topic models (i.e., a document may span several topics). Considering this issue, rCRP (Kim et al., 2012) and TSNTM (Isonuma et al., 2020) assume that a document can be generated by any topic in the tree. We follow this assumption of rCRP and TSNTM to model a tree-structured topic hierarchy, but with the difference that our model performs the sampling from the bottom up rather than from the top down as in rCRP and TSNTM. In particular, rCRP samples topics from the root using a recursive CRP. TSNTM samples paths from the root by applying a DRNN (Alvarez-Melis and Jaakkola, 2017), and it needs to update the tree structure frequently by heuristic rules. In contrast, our model directly samples the leaf topics, and the paths toward the root are determined automatically. Specifically, we use a common stick-breaking construction to infer the distribution over leaf topics, which corresponds to the path distribution. Besides, we use dependency matrices to keep track of the affiliations among topics, so the tree structure can be updated through back-propagation.

Figure 2: Sampling process of an example document for hLDA (Griffiths et al., 2004), rCRP (Kim et al., 2012), TSNTM (Isonuma et al., 2020), and our nTSNTM. Each node represents a topic z with its distribution over words w. The active topics and paths are highlighted in boldface.

Figure 3 shows the graphical representation of nTSNTM. In our model, the number of leaf topics is determined by the SBP, and the numbers of non-leaf topics are adjusted through the dependency matrices $\mathbf{M}$ between network layers.
The $l$-th element of $\mathbf{M}$, i.e., $\mathbf{M}_l \in [0, 1]^{K_l \times K_{l+1}}$, is the dependency matrix between layers $l$ and $l+1$, where $K_l$ and $K_{l+1}$ denote the maximum numbers of topics at levels $l$ and $l+1$, respectively. In particular, $M_{l,k,j}$ is the probability of topic $j$ at level $l$ being the parent of topic $k$ at level $l+1$, with $\sum_j M_{l,k,j} = 1$. As mentioned in (Griffiths et al., 2004), a clear tree structure requires each sub-topic to be related to no more than one super-topic. We therefore apply a softmax function with a low temperature (Hinton et al., 2015) to ensure that $\mathbf{M}_{l,k}$ approximates a discrete one-hot vector. In this way, the topic tree can be built from the bottom up through the introduced $\mathbf{M}$, and the topic hierarchy is updated automatically as $\mathbf{M}$ is updated.
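A minimal sketch of how a low-temperature softmax pushes the rows of a dependency matrix toward one-hot parent assignments (the logits below are hypothetical, not learned parameters):

```python
import numpy as np

def low_temp_softmax(logits, tau=0.1):
    """Softmax with temperature tau; tau << 1 sharpens each row toward one-hot."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits: 4 child topics at level l+1, 3 candidate parents at level l.
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.8, 0.3],
                   [0.1, 0.4, 2.2],
                   [1.5, 1.4, 0.1]])
M_l = low_temp_softmax(logits, tau=0.1)
# Each row sums to 1; rows with a clear winner are nearly one-hot parent choices.
assert np.allclose(M_l.sum(axis=1), 1.0)
```

The last row (logits 1.5 vs. 1.4) illustrates that when two parents score closely, the assignment stays soft, which keeps the operation differentiable for back-propagation.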
After determining the topic hierarchy by $\mathbf{M}$, the generative process of each word in nTSNTM can be described as follows. For each document $x_d$ ($d = 1, \dots, D$):

1. Draw breaking fractions $v_{d,k} \sim \mathrm{Beta}(1, \beta_0)$ and form the path distribution $\pi_d$ by the stick-breaking operation;
2. Draw a Gaussian sample $g_d \sim \mathcal{N}(0, I)$ and set the level distribution $\eta_d = f_\eta(g_d)$;
3. For each word $w_{d,n}$ ($n = 1, \dots, N_d$):
   (a) Draw a path: $c_{d,n} \sim \mathrm{Multi}(\pi_d)$;
   (b) Draw a level: $r_{d,n} \sim \mathrm{Multi}(\eta_d)$;
   (c) Draw a word: $w_{d,n} \sim \mathrm{Multi}(\phi_{c_{d,n}[r_{d,n}]})$.

In the above, $D$ is the number of documents and $N_d$ is the number of words in $x_d$. $\phi_{c_{d,n}[r_{d,n}]} \in \Delta^{V-1}$ is the word distribution of the topic at level $r_{d,n}$ of path $c_{d,n}$, where $V$ is the vocabulary size. $f_\eta(\cdot)$ is a neural perceptron with softmax activation that transforms a Gaussian sample into a level distribution.
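The per-word generative process can be mimicked by a small forward simulator. All sizes and the Dirichlet-initialized word distributions below are illustrative stand-ins for the learned $\phi$; this is not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: K leaf paths, L levels, vocabulary V.
K, L, V = 5, 3, 50
phi = rng.dirichlet(np.ones(V), size=(K, L))   # word distribution of each (path, level) topic

def generate_document(pi_d, eta_d, n_words):
    """For each token: draw a path c ~ Multi(pi_d), a level r ~ Multi(eta_d),
    then the word w ~ Multi(phi[c, r])."""
    words = []
    for _ in range(n_words):
        c = rng.choice(K, p=pi_d)       # path toward a leaf topic
        r = rng.choice(L, p=eta_d)      # level along that path
        w = rng.choice(V, p=phi[c, r])  # word from the selected topic
        words.append(w)
    return words

pi_d = rng.dirichlet(np.ones(K))   # stands in for the stick-breaking path distribution
eta_d = rng.dirichlet(np.ones(L))  # stands in for f_eta of a Gaussian sample
doc = generate_document(pi_d, eta_d, n_words=20)
assert len(doc) == 20
```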

Parameter Inference
Since the Beta distribution does not have a differentiable non-centered parametrization as required by NVI (Kingma and Welling, 2014), we choose the Kumaraswamy distribution (Kumaraswamy, 1980) to approximate $\mathrm{GEM}(\beta_0)$, i.e., the conjunction of $\mathrm{Beta}(1, \beta_0)$ and a stick-breaking operation (Nalisnick and Smyth, 2017). The probability density function of the Kumaraswamy distribution on the unit interval is $\mathrm{Kumaraswamy}(x; a, b) = abx^{a-1}(1 - x^a)^{b-1}$ for $x \in (0, 1)$ and $a, b > 0$. Samples can be drawn via the inverse transform $x = (1 - (1 - u)^{1/b})^{1/a}$ with $u \sim \mathrm{Uniform}(0, 1)$. The KL-divergence between the Kumaraswamy distribution and the Beta distribution can then be closely approximated in closed form. We describe the parameter inference process of our nTSNTM as follows.
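The inverse-transform sampler for the Kumaraswamy distribution is straightforward; the sketch below (our own, not the authors' code) also checks that Kumaraswamy(1, b) matches the Beta(1, b) mean:

```python
import numpy as np

def sample_kumaraswamy(a, b, size, rng):
    """Inverse-CDF sampling: the CDF is F(x) = 1 - (1 - x^a)^b, so
    x = (1 - (1 - u)^(1/b))^(1/a) with u ~ Uniform(0, 1).
    The transform is differentiable in (a, b), which is what NVI requires."""
    u = rng.uniform(0.0, 1.0, size=size)
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

rng = np.random.default_rng(0)
x = sample_kumaraswamy(a=1.0, b=5.0, size=100_000, rng=rng)
# Kumaraswamy(1, b) coincides with Beta(1, b), whose mean is 1 / (1 + b).
assert abs(x.mean() - 1.0 / 6.0) < 0.01
```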
First, we estimate the component weights of document $x_d$, i.e., $\hat{\pi}_d$, by a stick-breaking operation over fractions $v_d$, where each fraction is drawn as $v_{d,k} \sim \mathrm{Kumaraswamy}(f_\alpha(x_d)_k, f_\beta(x_d)_k)$ and the bag-of-words representation is used for $x_d$. To ensure positive outputs, $f_\alpha(\cdot)$ and $f_\beta(\cdot)$ are neural perceptrons with softplus activation. Second, we infer the level distribution $\hat{\eta}_d = f_\eta(\hat{g}_d)$, where $f_\mu(\cdot)$ and $f_\sigma(\cdot)$ are linear transformations and, in practice, we reparameterize $\hat{g}_d = \mu_d + \hat{\epsilon} * \sigma_d$ with $\mu_d = f_\mu(x_d)$, $\sigma_d = f_\sigma(x_d)$, and $\hat{\epsilon} \sim \mathcal{N}(0, I)$ (Rezende et al., 2014). Third, we obtain the topic distributions of $x_d$, i.e., $\hat{\theta}_d = \{\hat{\theta}_{d,1}, \dots, \hat{\theta}_{d,L}\}$, by combining the path distribution $\hat{\pi}_d$, the level distribution $\hat{\eta}_d$, and the dependency matrices $\mathbf{M}$, where $L$ denotes the depth of the topic tree and $\sum_l \sum_k \hat{\theta}_{d,l,k} = 1$. Then, we follow (Miao et al., 2017) to explicitly model the topic-word distributions by $\phi = \mathrm{softmax}(u \cdot t^\top)$, where $u \in \mathbb{R}^{V \times H}$ and $t \in \mathbb{R}^{(\sum_l K_l) \times H}$ are word vectors and topic vectors, and $H$ denotes the dimension of the word/topic vectors. Given the topic-word distributions $\phi$ and the topic distributions $\hat{\theta}_d$ obtained from Eq. (13), our model reconstructs each document $x_d$ by $p(w_{d,n} | \phi, \hat{\theta}_d) = \sum_{z_{d,n}} p(w_{d,n} | \phi_{z_{d,n}}) p(z_{d,n} | \hat{\theta}_d) = \hat{\theta}_d \cdot \phi$, where $z_{d,n}$ is the topic assignment for $w_{d,n}$.
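The topic-word construction and the mixture reconstruction can be sketched as follows; all shapes and random initializations are hypothetical placeholders for trained parameters:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
V, H, K_total = 100, 16, 12           # hypothetical vocabulary, embedding, and topic counts
u = rng.normal(size=(V, H))           # word vectors
t = rng.normal(size=(K_total, H))     # topic vectors

# phi = softmax(u t^T): each column is one topic's distribution over the vocabulary.
phi = softmax(u @ t.T, axis=0)        # shape (V, K_total)
theta_d = rng.dirichlet(np.ones(K_total))  # flattened topic distribution of one document

# Mixture reconstruction: p(w | phi, theta_d) = sum_z p(w | phi_z) p(z | theta_d).
p_w = phi @ theta_d
assert abs(p_w.sum() - 1.0) < 1e-9
```

Because every column of `phi` and the vector `theta_d` each sum to 1, the reconstructed word distribution `p_w` is automatically a valid probability vector.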
Finally, the variational lower bound of $x_d$ is:

$$\mathcal{L}_d = \mathbb{E}_{q}\big[\log p(x_d | \pi_d, \eta_d)\big] - \mathrm{KL}\big(q(\pi_d | x_d) \,\|\, p(\pi_d)\big) - \mathrm{KL}\big(q(\eta_d | x_d) \,\|\, p(\eta_d)\big),$$

where $q(\pi_d | x_d)$ and $q(\eta_d | x_d)$ are posteriors modeled by the inference network, $p(\pi_d)$ is the prior for $\pi_d$, i.e., $\mathrm{GEM}(\beta_0)$, and $p(\eta_d)$ is the prior for $\eta_d$, i.e., the standard Gaussian transformed by $f_\eta(\cdot)$.
The parameter inference method for nTSNTM is presented in Algorithm 1. We use the variational lower bound to calculate gradients and apply Adam (Kingma and Ba, 2015) to update the parameters.

Experimental Setup
For tree-structured topic models, we adopt hLDA (Griffiths et al., 2004), rCRP (Kim et al., 2012), and TSNTM (Isonuma et al., 2020) as our baselines. For all these models, the maximum depth of the topic tree is set to 3, following (Isonuma et al., 2020). For nonparametric or flat topic models, we adopt HDP (Teh et al., 2005), GSM & GSB (Miao et al., 2017), NB-NTM & GNB-NTM (Wu et al., 2020), and iTM-VAE & HiTM-VAE (Ning et al., 2020) as baselines. HDP is a classical nonparametric topic model that allows a potentially infinite number of topics. GSM & GSB are two NVI-based models using Gaussian priors; in particular, GSB uses Gaussian distributions to generate stick-breaking fractions. NB-NTM & GNB-NTM are two flat neural topic models based on the Negative Binomial and Gamma Negative Binomial processes, respectively. iTM-VAE & HiTM-VAE extend the method of (Nalisnick and Smyth, 2017) to introduce nonparametric processes into the NVI framework by extracting potentially infinite topics.
We directly use the publicly available code of hLDA, rCRP, TSNTM, HDP, NB-NTM & GNB-NTM, and iTM-VAE & HiTM-VAE. Besides, we implement GSM & GSB based on the original paper. For all parametric models, the number of topics is set to 50 and 200, as in (Miao et al., 2017). For nonparametric models based on the SBP, the truncation level is set to 200, and the concentration parameter β_0 of the GEM distribution is chosen from {5, 10, 15, 20, 25, 30} using each validation set. In particular, we sequentially select topics until their cumulative probability over the whole corpus exceeds 95%, and treat these as the active ones. For neural baselines and our proposed model, we set the size of hidden layers to 256 and use one sample for NVI, following (Miao et al., 2017).
All experiments are conducted on a workstation with 40 GB of memory in a Python/Java environment. In the following, we do not report the results of hLDA and rCRP on Rcv1-v2, since they failed to converge within 48 hours.

Topic Hierarchy Analysis
As mentioned in (Viegas et al., 2020), a reasonable topic hierarchy means that topics near the root should be more general while those close to the leaves should be more specific. To this end, we adopt topic specialization (Kim et al., 2012) as an indicator for evaluating the topic hierarchy. The specialization of a topic is the cosine distance between the word distribution of the topic and the term frequency vector of the entire corpus; a higher specialization score implies a more specialized topic. Figure 4 presents the average topic specialization scores at each level for different tree-structured models. The results indicate that nTSNTM and rCRP achieve a reasonable pattern of topic specialization across levels, i.e., the scores increase as the level becomes deeper. We also observe that the baseline TSNTM generates more specific topics at the second level than at the third level, which indicates an unreasonable topic hierarchy. For hLDA, there is a leap in topic specialization from level 2 to level 3, especially on 20NEWS. The reason may be that each document is generated by topics along a single path in hLDA, which yields highly specialized topics at level 3, since they are all restricted to a single topic from level 2.

A reasonable topic hierarchy also requires that child topics be coherent with their corresponding parent topics (Viegas et al., 2020). To measure the relation between two connected topics, we develop a new metric named cross-level NPMI (CLNPMI), which calculates the average NPMI value over every pair of different topic words from a parent topic and its child. Here, NPMI was proposed by Lau et al. (2014) to evaluate the relation between two words $w_i$ and $w_j$ as follows:

$$\mathrm{NPMI}(w_i, w_j) = \frac{\log \frac{P(w_i, w_j)}{P(w_i) P(w_j)}}{-\log P(w_i, w_j)}.$$
Based on NPMI, we define CLNPMI as:

$$\mathrm{CLNPMI}(W_p, W_c) = \frac{1}{|W'_p| \, |W'_c|} \sum_{w_i \in W'_p} \sum_{w_j \in W'_c} \mathrm{NPMI}(w_i, w_j),$$

where $W'_p = W_p - W_c$ and $W'_c = W_c - W_p$, in which $W_p$ and $W_c$ denote the top $N$ words of a parent topic and one of its children, respectively. To avoid degenerating into NPMI when the parent and child topics are highly similar, CLNPMI is estimated over the distinct words between the two topics.
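A reference implementation of CLNPMI under a simple document co-occurrence estimate of the probabilities (our reading of the definition; the probability estimation and smoothing choices may differ from the authors'):

```python
import math

def npmi(wi, wj, doc_sets, eps=1e-12):
    """NPMI from document co-occurrence: log(P(i,j) / (P(i)P(j))) / -log P(i,j).
    Returns -1 (the minimum) when the pair never co-occurs."""
    n = len(doc_sets)
    p_i = sum(wi in d for d in doc_sets) / n
    p_j = sum(wj in d for d in doc_sets) / n
    p_ij = sum(wi in d and wj in d for d in doc_sets) / n
    if p_ij == 0 or p_i == 0 or p_j == 0:
        return -1.0
    return math.log(p_ij / (p_i * p_j)) / (-math.log(p_ij) + eps)

def clnpmi(parent_words, child_words, doc_sets):
    """Average NPMI over pairs of *distinct* words from a parent and its child:
    W'_p = W_p - W_c and W'_c = W_c - W_p, so shared words cannot inflate the score."""
    wp = [w for w in parent_words if w not in set(child_words)]
    wc = [w for w in child_words if w not in set(parent_words)]
    if not wp or not wc:
        return 0.0
    return sum(npmi(a, b, doc_sets) for a in wp for b in wc) / (len(wp) * len(wc))

# Toy corpus of document word sets.
docs = [{"ball", "game", "team"}, {"ball", "bat", "game"}, {"vote", "law"}]
score = clnpmi(["game", "team"], ["game", "ball"], docs)  # positive: "team"/"ball" co-occur
```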
To evaluate the topic redundancy of a tree, we introduce a new measurement named the average overlap rate (OR) and adopt the widely used topic uniqueness (TU) (Nan et al., 2019). OR measures the average repetition ratio of the top $N$ words between parent topics and their children, i.e., the mean of $|W_p \cap W_c| / N$ over all parent-child pairs. TU calculates the uniqueness of all topics by $\mathrm{TU} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{TU}(k)$, where $K$ is the number of topics and

$$\mathrm{TU}(k) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{\mathrm{cnt}(n, k)},$$

in which $\mathrm{cnt}(n, k)$ is the total number of times the $n$-th top word of topic $k$ appears in the top $N$ words across all topics.

Table 2: CLNPMI, TU, and OR scores of tree-structured topic models; higher CLNPMI and TU with lower OR indicate better performance. The best value on each metric is highlighted in boldface.
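The two redundancy measures can be computed directly from each topic's top-N word list; the sketch below follows our reading of the definitions above:

```python
def topic_uniqueness(topics):
    """TU(k) = (1/N) * sum_n 1 / cnt(n, k), where cnt(n, k) counts how many
    topics' top-N lists contain the n-th top word of topic k; TU averages over k."""
    n_top = len(topics[0])
    counts = {}
    for words in topics:
        for w in words:
            counts[w] = counts.get(w, 0) + 1
    tu_k = [sum(1.0 / counts[w] for w in words) / n_top for words in topics]
    return sum(tu_k) / len(topics)

def overlap_rate(parent_child_pairs):
    """OR: average fraction of top-N words a child topic shares with its parent."""
    ratios = [len(set(p) & set(c)) / len(p) for p, c in parent_child_pairs]
    return sum(ratios) / len(ratios)

topics = [["a", "b", "c"], ["a", "d", "e"], ["f", "g", "h"]]
pairs = [(["a", "b", "c"], ["a", "d", "e"])]
tu = topic_uniqueness(topics)   # "a" appears in two topics, every other word in one
or_score = overlap_rate(pairs)  # 1 shared word out of 3
```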
For each of the aforementioned metrics, we calculate the average scores over the top 5, 10, and 15 words. Table 2 shows the performance of different models, where each method is run 5 times and the average values are reported. The results indicate that our model significantly outperforms the baselines in most cases, with p-values less than 0.05. For hLDA and our nTSNTM on the 20NEWS dataset, the difference on the OR metric is not statistically significant, with a p-value of 0.391. These results validate the effectiveness of the bottom-up structure of nTSNTM, in which non-leaf topics are activated when their offspring are chosen. We also report the hierarchical affinity (Kim et al., 2012) of each model, which measures whether a parent topic is more similar to its own children than to the children of other parent topics. The average cosine similarities of a parent topic's word distribution to child and non-child topics are shown in Figure 5. Both rCRP and nTSNTM clearly show stronger affinities between parent topics and their children than between parents and non-children. However, rCRP suffers from high redundancy, as indicated by the high similarities (0.73 to 0.82) between parent topics and their sub-topics. To intuitively demonstrate the ability of our model to generate a topic tree, we present several topics extracted from 20NEWS by our nTSNTM and the existing NVI-based TSNTM in Figures 6 and 7, respectively. The results indicate that our model is able to learn a reasonable tree-structured topic hierarchy with low redundancy. For TSNTM, in contrast, we notice a low degree of discrimination between topics at the second and third levels. In addition, topics of the same group at the third level are highly repetitive, including "rec.sport.baseball" and "talk.politics.misc". For completeness, we further examine topics extracted from 20NEWS by hLDA and rCRP.
The results indicate that, for hLDA, each topic at the second level is too general to represent a topic branch, and the affiliations are unclear. Although rCRP can generate meaningful topics with appropriate affiliations between levels, it suffers from high topic redundancy.

Comparison on Topic Interpretability
In this part, we use the widely adopted NPMI (Miao et al., 2017; Liu et al., 2019; Wu et al., 2020; Ning et al., 2020; Isonuma et al., 2020) to evaluate topic interpretability. As mentioned in (Lau et al., 2014), NPMI is a measure of topic coherence that closely corresponds to the ranking of topic interpretability by human annotators. We do not estimate perplexity for two reasons: first, the perplexity of sampling-based and NVI-based models is difficult to compare directly (Isonuma et al., 2020); second, the prior of NVI-based methods has a large influence on perplexity, since the KL-divergence may vary greatly for different priors (Burkhardt and Kramer, 2019). Table 3 shows the NPMI of 50 and 200 topics for parametric topic models, and of the automatically induced topics for nonparametric topic models. We run each model 5 times and report the average results. Firstly, nTSNTM outperforms all tree-structured baselines, and the difference is statistically significant at the 0.05 level (except for TSNTM on the Rcv1-v2 dataset). Secondly, nTSNTM shows competitive performance compared with the best flat baselines; in particular, except for HiTM-VAE on the Reuters dataset, the results of the other top-performing baselines are not significantly better than those of our model.

Evaluating Data Scalability
To evaluate data scalability, we randomly sample different numbers of documents (12.5k, 25k, 50k, 100k, 200k, 400k, and all) from the training set of Rcv1-v2 to run our model and the other tree-structured baselines. The sampling-based models (i.e., hLDA and rCRP) are run on an Intel Xeon Skylake 6133 CPU with 8 cores, and the NVI-based models (i.e., TSNTM and nTSNTM) are tested on an Nvidia Tesla V100 GPU. Figure 8 shows the training times of these topic models. Our nTSNTM shows an advantage in data scalability compared with the baselines. Although TSNTM is also scalable to a large corpus through GPU acceleration, it applies a doubly-recurrent network that largely slows the model down. hLDA and rCRP spend considerable computation time on path sampling, which becomes much more serious on large-scale datasets. Additionally, these two sampling-based models are serial, which means they can only utilize one core of the CPU.

Impact of the Concentration Parameter
We further validate the nonparametric property of our model. Figure 9 shows the impact of β_0 on the number of active topics. Firstly, the topic numbers of all models grow as β_0 increases. The reason is that β_0 controls the smoothness of the SBP: a larger value leads to smoother stick-breaking, i.e., more topics. Secondly, compared with iTM-VAE and HiTM-VAE, the number of topics found by nTSNTM is closer to that extracted by HDP, which demonstrates that our model is able to approximate the nonparametric behavior of HDP.
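The effect of β_0 on the number of active topics can be reproduced in isolation by sampling GEM(β_0) weight vectors and applying the 95% coverage rule from the experimental setup (a Monte Carlo sketch of the prior's behavior, not the trained model itself):

```python
import numpy as np

def expected_active_topics(beta0, truncation=200, n_draws=500, coverage=0.95, seed=0):
    """Draw truncated GEM(beta0) weight vectors and count how many leading
    components are needed to cover 95% of the mass -- a proxy for the number
    of active topics under the prior."""
    rng = np.random.default_rng(seed)
    counts = []
    for _ in range(n_draws):
        v = rng.beta(1.0, beta0, size=truncation - 1)
        # Stick-breaking: pi_k = v_k * prod_{j<k} (1 - v_j); last weight takes the rest.
        pi = np.concatenate([v, [1.0]]) * np.concatenate([[1.0], np.cumprod(1.0 - v)])
        counts.append(int(np.searchsorted(np.cumsum(pi), coverage)) + 1)
    return float(np.mean(counts))

# A larger beta0 spreads the stick over more components, so more topics stay active.
few = expected_active_topics(beta0=5.0)
many = expected_active_topics(beta0=30.0)
assert few < many
```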

Conclusion
In this paper, we propose a nonparametric tree-structured neural topic model named nTSNTM. Our method explicitly models the dependencies of latent variables across different layers and combines them to reconstruct the input text. By coupling the SBP with dependency matrices, we can update the tree structure automatically. Extensive experiments validate the effectiveness of our nTSNTM in generating a reasonable topic tree with low topic redundancy. Furthermore, our model can be trained 2 times faster than the existing NVI-based TSNTM on approximately 800k documents. In the future, we plan to apply our method to aspect extraction.