Neural Attention-Aware Hierarchical Topic Model

Neural topic models (NTMs) apply deep neural networks to topic modelling. Despite their success, NTMs generally ignore two important aspects: (1) only document-level word count information is utilized for training, while more fine-grained sentence-level information is ignored, and (2) external semantic knowledge regarding documents, sentences and words is not exploited for training. To address these issues, we propose a variational autoencoder (VAE) NTM that jointly reconstructs the sentence and document word counts using combinations of bag-of-words (BoW) topical embeddings and pre-trained semantic embeddings. The pre-trained embeddings are first transformed into a common latent topical space to align their semantics with the BoW embeddings. Our model also features a hierarchical KL divergence that leverages the embeddings of each document to regularize those of its sentences, paying more attention to semantically relevant sentences. Both quantitative and qualitative experiments show the efficacy of our model in (1) lowering the reconstruction errors at both the sentence and document levels, and (2) discovering more coherent topics from real-world datasets.


Introduction
Topic models are a family of powerful techniques that can effectively discover human-interpretable topics from unstructured corpora for text analysis purposes. Among them, Bayesian topic models, based on latent Dirichlet allocation (LDA) (Blei et al., 2003), have been the mainstream for nearly two decades. They usually adopt count or bag-of-words (BoW) representations for text content and model the generation of the BoW data with a probabilistic structure of latent variables. These variables follow pre-specified distributions under Bayes' theorem, and are learned through Bayesian inference such as variational inference (VI) and Markov chain Monte Carlo (MCMC) sampling. Despite their success, conventional Bayesian topic models are known to lack flexibility in their model structures and scalability to large volumes of data.
To address the above limitations, increasing effort has been made to leverage deep neural networks (DNNs) for topic modelling, which has led to the so-called neural topic models (NTMs) (Zhao et al., 2021a). Most of these models follow the framework of variational auto-encoders (VAEs) (Kingma and Welling, 2014; Rezende et al., 2014) and adopt an encoder-decoder architecture, in which the encoder transforms the BoW data of each document into the corresponding document-topical embeddings, and the decoder attempts to map these embeddings back to the same data. With a moderate increase in model complexity, NTMs have largely outperformed conventional topic models on BoW data reconstruction and topic interpretability/coherence (Miao et al., 2016; Srivastava and Sutton, 2017; Ding et al., 2018; Zhou et al., 2020; Zhao et al., 2021b).
That said, most NTMs only exploit the document-level BoW information of the internal corpus while ignoring (1) the sentence-level BoW information of these documents, and (2) the external (semantic) information regarding the documents, sentences and words (e.g., extracted from other larger relevant corpora). These limitations have hindered further performance improvements for NTMs. Therefore, in this paper, we propose a new NTM that addresses these limitations. It jointly reconstructs the BoW data of each document and its sentences with combinations of both internal BoW topical embeddings and external pre-trained semantic embeddings. To do this, we design an internal BoW data encoder and an external knowledge encoder to respectively transform the BoW data and the pre-trained embeddings of the same documents and sentences into a shared latent topical space. The resulting internal and external topical embeddings are then combined to decode the BoW data.
To address the BoW data sparsity (Zhao et al., 2019) at the sentence level, which has been a problem for many topic models, our model enforces a hierarchical KL divergence on pairs of sentences and corresponding documents with respect to their topical embeddings derived from both the BoW data and the external knowledge. The intent is that both types of topical embeddings for each sentence should be governed by the same types of embeddings of its parent document. Furthermore, the hierarchical KL for a sentence is weighted by its semantic relation to its parent document, so that a document's topical information is more influential on more semantically representative sentences. Our contributions can be summarized as follows:
• We propose a VAE-based neural topic model, which encodes internal BoW information and external semantic knowledge specific to the word, sentence and document levels into the same latent topical space to refine topic quality.
• Our model imposes an attention-weighted hierarchical KL divergence on pairs of sentences and documents to smooth the learning of the topical embeddings from the sparse BoW data of sentences.
• We demonstrate that our model is effective in BoW data reconstruction at both the document and sentence levels. It also improves the internal and external coherence of the discovered topics.

Related Work
To the best of our knowledge, most state-of-the-art NTMs, like ours, are based on the VAE framework. For a detailed discussion of these NTMs, we refer readers to Zhao et al. (2021a). Here, we only discuss the two lines of research that are most relevant to ours.
NTMs with pre-trained language models. Recently, pre-trained transformer-based language models such as BERT (Devlin et al., 2019) have started to draw the attention of the topic modelling community thanks to their capability of generating contextualized word embeddings with rich semantic information that is absent from the BoW data. Thus, an emerging trend focuses on incorporating these contextualized word embeddings into NTMs. Based on the popular VAE framework of Srivastava and Sutton (2017), Bianchi et al. (2021) proposed a contextualized topic model (CTM) that incorporates pre-trained document embeddings generated by sentence-transformers (Reimers and Gurevych, 2019). However, unlike ours, CTM ignores the other levels of pre-trained knowledge and BoW information. Chaudhary et al. (2020) proposed to combine an NTM with a fine-tuned BERT model by concatenating the topic distribution and the learned BERT embedding of a document as the features for document classification. Hoyle et al. (2020) proposed BAT, an NTM framework "taught" by external knowledge distilled from a pre-trained BERT model. BERT predicts probabilities for each word of a document, which are then averaged to generate a pseudo BoW vector for the document. The BAT framework can be instantiated with various existing NTMs such as Scholar (Card et al., 2018) (i.e. BAT+Scholar), which imposes a logistic Normal as the variational posterior for document embeddings in the VAE, and W-LDA (Nan et al., 2019) (i.e. BAT+W-LDA), which replaces the KL divergence with the maximum mean discrepancy (Gretton et al., 2012).
NTMs for modelling document structures. Although document structures, i.e., the structured relationships between documents, paragraphs, and sentences, have been modelled in conventional Bayesian topic models (e.g., in Du et al. (2012); Balikas et al. (2016a); Jiang et al. (2019)), to our knowledge they have not been carefully studied in NTMs. The closest work to our idea is Nallapati et al. (2017), which proposes an NTM that samples a topic for each sentence of an input document and then generates the word sequence of the sentence with an RNN conditioned on the sentence's topic. However, this work focuses on sentence generation instead of topic modelling.

Preliminaries
In this section, we introduce the VAE framework of neural variational topic models, based on which our model will be developed. Table 1 details the notations and symbols used throughout the paper.
Problem Formulation. Consider a corpus of $I$ documents, where the $i$-th document ($i = 1, ..., I$) is represented as a $V$-dimensional vector of word counts, $w^D_i$, also known as the bag-of-words (BoW) data. Here, $V$ is the size of the vocabulary of the corpus, and $w^D_{iv}$ is the number of times the $v$-th word ($v = 1, ..., V$) occurs in the $i$-th document. Topic modelling assumes that there exist $K$ topics that can be used to describe each document. Its goal is to recover these topics from the BoW data $W^D$ of the corpus. In this paper, we further formulate the topic modelling problem at the level of sentences; that is, the $i$-th document comprises a total of $J_i$ sentences. In this case, the document can alternatively be represented as a word count matrix $W^S_i$, an additional level of BoW information we aim to leverage. The $j$-th row ($j = 1, ..., J_i$) of this matrix holds the $V$-dimensional BoW vector $w^S_{ij}$ of the $j$-th sentence of the $i$-th document.

Neural Variational Topic Model. Most traditional topic models are graphical models with exponential family model parameters. Therefore, they yield tractable inference for the posterior distributions of the model parameters. However, the inference is limited in expressiveness, and is thereby less capable of capturing the true underlying generative process and distributions. To solve this problem, neural variational inference is introduced into topic models, yielding neural variational topic models (NVTMs), where more expressive posterior distributions are constructed for the model parameters using neural networks.
A typical NVTM is learned by maximizing the evidence lower bound (ELBO) of the marginal likelihood of the BoW data $W^D$ of the documents with respect to their topical embeddings $Z^B$:

$$\mathbb{E}_{q(Z^B \mid W^D)}\left[\log p(W^D \mid Z^B)\right] - \mathrm{KL}\left[q(Z^B \mid W^D) \,\|\, p(Z^B)\right], \quad (1)$$

where $\log p(W^D \mid Z^B)$, $q(Z^B \mid W^D)$ and $p(Z^B)$ are respectively the BoW data log-likelihood, the variational posterior distribution and the prior distribution of $Z^B$. A key characteristic of NVTM is to model the first two terms as decoder and encoder networks. The encoder network $q^B(W^D)$ models the posterior distribution $q(Z^B \mid W^D)$ by taking in the BoW data $W^D$ to parameterize diagonal Gaussian distributions over $Z^B$. Specifically, for document $i$, its encoding process is formulated as:

$$\mu^B_i = l^B_1(f^B(w^D_i)), \quad (\sigma^B_i)^2 = l^B_2(f^B(w^D_i)), \quad z^B_i \sim \mathcal{N}(\mu^B_i, (\sigma^B_i)^2 I). \quad (2)$$

Here, $f^B: \mathbb{R}^V \to \mathbb{R}^L$ is a multi-layer perceptron (MLP) that computes a shared hidden layer output from $w^D_i$, where $L$ is the number of hidden neurons. The functions $l^B_1, l^B_2: \mathbb{R}^L \to \mathbb{R}^K$ are two linear layers that respectively predict the posterior mean and variance vectors, $\mu^B_i$ and $(\sigma^B_i)^2$, which generate the document-topical embedding $z^B_i$ for document $i$. The symbol $I \in \mathbb{R}^{K \times K}$ denotes an identity matrix used to create the diagonal Gaussian. Finally, it is common for NTMs to transform the topical embedding $z^B_i$ into a probability distribution over topics for each document using Softmax.
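The encoding step described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the single softplus hidden layer, the log-variance output of the second linear layer (chosen here so that the variance stays positive), and all names (`encode_document`, `params`, `W_f`, `W_mean`, `W_logvar`) are hypothetical choices made for the example.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map; `weights` holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def softmax(v):
    """Map a real vector onto the probability simplex."""
    peak = max(v)
    exps = [math.exp(a - peak) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def encode_document(bow, params, rng):
    """Return (mean, log-variance, sampled embedding, topic proportions)."""
    # Shared hidden layer f: R^V -> R^L (softplus is an assumed activation).
    hidden = [softplus(a) for a in linear(bow, params["W_f"], params["b_f"])]
    # Two linear heads l1, l2: R^L -> R^K for the posterior mean and,
    # by assumption in this sketch, the log of the posterior variance.
    mean = linear(hidden, params["W_mean"], params["b_mean"])
    log_var = linear(hidden, params["W_logvar"], params["b_logvar"])
    # Reparameterization: z = mean + std * eps with eps ~ N(0, I).
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
         for m, lv in zip(mean, log_var)]
    return mean, log_var, z, softmax(z)


# Toy usage: V = 4 vocabulary words, L = 3 hidden units, K = 2 topics.
rng = random.Random(0)
V, L, K = 4, 3, 2
params = {
    "W_f": [[rng.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(L)],
    "b_f": [0.0] * L,
    "W_mean": [[rng.uniform(-0.1, 0.1) for _ in range(L)] for _ in range(K)],
    "b_mean": [0.0] * K,
    "W_logvar": [[rng.uniform(-0.1, 0.1) for _ in range(L)] for _ in range(K)],
    "b_logvar": [0.0] * K,
}
mean, log_var, z, theta = encode_document([2, 0, 1, 3], params, rng)
```

The sampled `z` is the unnormalized topical embedding, while `theta` is its Softmax-normalized counterpart, i.e. the document's distribution over the K topics.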
As for the NVTM's decoder network, it reconstructs $W^D$ from the document-topic matrix $Z^B$ and the word-topic matrix $\Phi^B$ as specified in eq. (3), where $N$ is a vector of document lengths and $\Phi^B$, capturing the latent topics of each word in the vocabulary, can also be viewed as the weight matrix of the output layer of the decoder network.

Our model, NAHTM, builds on this framework with two main components: (1) two types of encoders, an internal BoW data encoder and an external knowledge encoder, that capture latent topics of documents, sentences and words from internal and external sources, respectively; (2) an attention-aware hierarchical KL divergence that regularizes the topical embeddings of sentences with those of their documents. Figure 1 shows the basic architecture of NAHTM.

Internal BoW Data Encoder
This encoder, $q^B(\cdot)$, aims to capture the document- and sentence-level BoW information. The encoding process of the former has been specified in eq. (2), which yields the posterior distributions $q^B(W^D)$ for the document-topic matrix $Z^B$. The encoding of the sentence-level BoW data is motivated by the fact that sentences, like documents, convey complete logical statements organized by topics. The difference, as argued by past research on LDA models, is that sentences are more concise, with shorter text and more focused topics (Balikas et al., 2016a,b; Amoualian et al., 2017). NAHTM encodes the sentence-level BoW data in the same way as it encodes the document-level data except for the final activation function. More specifically, each document is now viewed as a corpus, while each sentence is viewed as a (short) document. For document $i$, its BoW data $W^S_i$ over its sentences is encoded as $q^B(W^S_i)$ with the same encoding process as in eq. (2). Then, the sentence-topical embedding matrix $S^B_i$ for document $i$ is generated as $S^B_i \sim q^B(W^S_i)$. Based on the well-grounded argument that a sentence should be focused in its topics, NAHTM forces $S^B_i$ to be sparse over topics by using Sparsemax (Martins and Astudillo, 2016; Lin et al., 2019), which projects real-valued embeddings into sparse probability vectors: $S^B_i := \mathrm{Sparsemax}(S^B_i)$. Specifically, for the $j$-th sentence of document $i$, its embedding $s^B_{ij}$ is converted by Sparsemax as:

$$\mathrm{Sparsemax}(s^B_{ij}) = \arg\min_{c \in \Delta^{K-1}} \| c - s^B_{ij} \|_2^2,$$

where $c$ lies on the $(K-1)$-dimensional probability simplex $\Delta^{K-1}$. In other words, Sparsemax projects $s^B_{ij}$ from the Euclidean space onto the probability simplex.
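The Sparsemax projection has a closed-form solution based on sorting (Martins and Astudillo, 2016). The following plain-Python sketch illustrates that projection for a single sentence embedding; the function name and the toy inputs are ours and are not taken from the NAHTM code.

```python
def sparsemax(scores):
    """Euclidean projection of a real vector onto the probability simplex."""
    ordered = sorted(scores, reverse=True)
    running_sum = 0.0
    threshold = 0.0
    for j, value in enumerate(ordered, start=1):
        running_sum += value
        # The support consists of the largest entries satisfying this test;
        # the last j that passes determines the threshold tau.
        if 1.0 + j * value > running_sum:
            threshold = (running_sum - 1.0) / j
    # Entries at or below the threshold are clipped to exactly zero.
    return [max(s - threshold, 0.0) for s in scores]


# A peaked embedding collapses onto a single topic, while a flatter one
# keeps several topics with non-zero probability.
peaked = sparsemax([2.0, 1.0, 0.1])
flatter = sparsemax([1.0, 0.8, 0.1])
```

Unlike Softmax, which always assigns strictly positive mass to every topic, the result here contains exact zeros, which is what makes the sentence-level topic vectors sparse.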

External Knowledge Encoder
External semantic knowledge, extracted by pre-trained language models (e.g. BERT (Devlin et al., 2019)) from large general corpora, provides rich prior information regarding the contexts and semantic relatedness of instances of each entity (i.e. document, sentence and word) modelled by NTMs. The language models account for the ordering patterns of the entities (i.e. sentence and word orderings), which are complementary to the orderless BoW information captured by the NTMs. Incorporating such knowledge into the NTMs can potentially help them better infer the true topics of sentences in scenarios that cannot be distinguished by the BoW information. For example, a pair of sentences with the same word counts might have very different topics due to different word orderings. Meanwhile, another pair without any word overlap might still have strongly correlated topics due to a next-sentence or entailment relationship. Hence, NTMs can be guided to better infer the topics of documents as well as the entire set of topics underlying the corpus.
Another advantage of external knowledge is that it can potentially alleviate the data sparsity problem under BoW modelling, especially at the sentence level. Since sentences are much shorter than documents, their BoW data is also much sparser, with significantly fewer word co-occurrences within each sentence and fewer word overlaps between sentences. To make the learning less affected by the sparse data, external knowledge can be leveraged to calibrate it with prior information regarding the sentences. NAHTM incorporates three levels of external knowledge in the form of the following pre-trained embeddings for words, sentences and documents.
External word embeddings $X^W$ are output by the embedding layer of the pre-trained transformer model for each word in the vocabulary. Here, $X^W$ are non-contextualized and untrainable embeddings whose dimension is predefined by the pre-trained transformer and is thus not necessarily equal to the number of topics. The reason for using non-contextualized word embeddings is that the word-level external information is expected to be "injected" correspondingly into $\Phi^B$, the factorized and non-contextualized word-topical embeddings in our topic model.
External sentence embeddings $X^S_i$ for the sentences of document $i$ are the aggregated outputs of the last transformer encoder layer of the pre-trained model. In this case, the inputs to the pre-trained model are the word sequences of each sentence, and the outputs are the contextualized embeddings of the words in these sequences. In this paper, we use sentence-transformers, a programmable framework that provides a variety of pre-trained transformer models for computing sentence embeddings. We adopt the default aggregation strategy of sentence-transformers, which is to average all the (output) contextualized word embeddings over the sentence.
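The averaging step can be illustrated with a small stand-alone sketch. It only shows the arithmetic of mean pooling over toy token vectors; in practice the pooling is performed inside the sentence-transformers library, and the function name `mean_pool` is hypothetical.

```python
def mean_pool(token_embeddings):
    """Average equal-length token vectors into a single sentence vector."""
    if not token_embeddings:
        raise ValueError("a sentence needs at least one token embedding")
    dim = len(token_embeddings[0])
    count = len(token_embeddings)
    return [sum(token[d] for token in token_embeddings) / count
            for d in range(dim)]


# Two contextualized token vectors of dimension 2 for one toy sentence.
sentence_embedding = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```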
External document embeddings $X^D$ can be constructed as either the unweighted or the weighted average of all the sentence embeddings of the same document. Specifically, for the latter case, $x^D_i$ can be obtained as follows:

$$x^D_i = \sum_{j=1}^{J_i} \alpha^S_{ij} \, x^S_{ij}, \qquad \alpha^S_i = \sigma(y^S_i), \quad (5)$$

where $\alpha^S_i = [\alpha^S_{i1}, ..., \alpha^S_{iJ_i}]^\top$ is the attention (weight) vector over the sentences. Its $j$-th element $\alpha^S_{ij}$ is the normalized weight of the $j$-th sentence in representing document $i$, i.e. the result of normalizing the unnormalized attention weights $y^S_i$ with a normalizing function $\sigma$.

Mapping $X^W$, $X^D$, $X^S_i$ to the topical space is done subsequently by the external knowledge encoder to align the dimension and semantic meaning of the external embeddings with the topical embeddings. More specifically, each level of the external knowledge data $X \in \{X^W, X^D, X^S_i\}$ is encoded into the corresponding posterior distribution $q^E(X)$, analogously to eq. (2). Here, all the symbols have the same meanings as those in eq. (2) except that they are dedicated to the external knowledge encoder. Correspondingly, we denote the topical embeddings generated by this encoder for the word-, sentence- and document-level external knowledge respectively as $\Phi^E$, $S^E_i$ and $Z^E$. NAHTM combines the internal and external topical embedding matrices as specified in eq. (6), where $\gamma_1$, $\gamma_2$ and $\gamma_3$ are the hyper-parameters that control the influence of the external knowledge in calibrating the internal one at the different levels.
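The two operations in this subsection, attention-weighted averaging of sentence embeddings and the calibration of internal embeddings by external ones, can be sketched as follows. Two points are assumptions of this sketch rather than statements about NAHTM: Softmax is used as the normalizer, and the combination is taken to be additive (internal plus gamma times external), a form the text above does not spell out. All function names are hypothetical.

```python
import math


def softmax(values):
    """Normalize raw attention weights into a probability vector."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_document_embedding(sentence_embeddings, raw_weights):
    """Attention-weighted average of the sentence embeddings of one document."""
    alpha = softmax(raw_weights)
    dim = len(sentence_embeddings[0])
    document = [sum(a * x[d] for a, x in zip(alpha, sentence_embeddings))
                for d in range(dim)]
    return document, alpha


def combine(internal, external, gamma):
    """ASSUMED additive calibration: internal + gamma * external."""
    return [b + gamma * e for b, e in zip(internal, external)]


# Equal raw weights reduce the weighted average to the unweighted one.
doc_embedding, alpha = weighted_document_embedding(
    [[0.0, 2.0], [2.0, 4.0]], [0.0, 0.0])
# A small gamma lets the external embedding nudge the internal one.
combined = combine([1.0, 2.0], [10.0, 20.0], gamma=0.01)
```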

Attention-Aware KL Divergence
NAHTM makes use of the hierarchical structure of documents by setting the posterior distributions of the document topical embeddings as the priors of their sentences' topical embeddings. More specifically, each document $i$ has a dedicated document-level KL term $Q^D_i$ and sentence-level KL terms $Q^S_{ij}$ for its sentences. Here, $\beta_0$ and $\beta_1$ are the hyper-parameters that control the regularization strengths of the corresponding KL terms in the ELBO. Note that all the embeddings involved in the above KL terms are unnormalized; in other words, they have not been transformed by the Softmax/Sparsemax function.
The assumption behind the above hierarchical structure of KL divergence terms is straightforward: the topics of sentences should be somewhat similar to those of their documents. In this case, the degree of the topical similarity constraint imposed on the learning of NAHTM is controlled by $\beta_1$. For example, as $\beta_1$ becomes smaller, the similarity constraint is weakened accordingly.
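To make the hierarchical regularizer concrete, the sketch below uses the standard closed-form KL divergence between two diagonal Gaussians and applies it twice: once between the document posterior and a prior, and once between each sentence posterior and the document posterior acting as its prior. The standard Normal document prior, the way the two terms are summed, and the function names are assumptions of this sketch, not the exact NAHTM objective.

```python
import math


def kl_diag_gaussians(mean_q, var_q, mean_p, var_p):
    """KL( N(mean_q, diag(var_q)) || N(mean_p, diag(var_p)) ) in closed form."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mean_q, var_q, mean_p, var_p)
    )


def hierarchical_kl(doc, sentences, beta0, beta1):
    """`doc` and every entry of `sentences` are (mean, variance) pairs."""
    k = len(doc[0])
    standard_normal = ([0.0] * k, [1.0] * k)  # assumed document-level prior
    doc_term = beta0 * kl_diag_gaussians(*doc, *standard_normal)
    # Each sentence posterior is pulled towards its parent document posterior.
    sent_term = beta1 * sum(kl_diag_gaussians(*s, *doc) for s in sentences)
    return doc_term + sent_term


doc_posterior = ([0.0, 0.0], [1.0, 1.0])
sentence_posteriors = [([1.0, 0.0], [1.0, 1.0])]
penalty = hierarchical_kl(doc_posterior, sentence_posteriors,
                          beta0=1.0, beta1=0.1)
```

Shrinking `beta1` in this sketch scales down only the sentence-to-document term, which matches the role the text assigns to that hyper-parameter.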
Customising $\beta_1$ with attention weights is an alternative method we propose that allows for adaptive control of the regularization strengths of the KL divergence terms specific to individual sentence-document pairs. The intent of this method is that the semantic relevance of sentences towards the document, as revealed by the external knowledge, should also be indicative of their topical relevance in the corpus. There are two strategies for implementing this method.
In strategy 1, NAHTM integrates each unnormalized attention weight $y^S_{ij}$, computed from eq. (5), into the sentence-document KL terms with respect to the pre-trained sentence embedding, as specified in eq. (7). There, $\hat{y}^S_{ij}$ is a latent variable learned to control the KL terms specific to sentence $j$ of document $i$, and is constrained to be close to $y^S_{ij}$; $\lambda_0$ is a hyper-parameter that controls the strength of this constraint; and $\sigma(\cdot)$ denotes normalization by either the Softmax or the Sparsemax function. Eq. (7) is essentially a "soft" version of the strategy that directly uses the normalized attention weight as the controlling parameter, i.e. $\beta_1 \alpha^S_{ij}$, for the KL terms across the $J_i$ sentences.
In strategy 2, instead of learning the attention weights $y^S_i$, they are pre-computed from the pre-trained sentence and document embeddings by

$$y^S_{ij} = -\| x^S_{ij} - x^D_i \|_2.$$

In this case, the unnormalized attention weight $y^S_{ij}$ for sentence $j$ is the negative Euclidean distance between its embedding and the document embedding, which is the centroid of all the sentence embeddings. The further away the embedding $x^S_{ij}$ is from the centroid $x^D_i$, the smaller the weight. Therefore, the control over the strengths of the KL constraints is now adaptive only towards the external knowledge. The rest of the KL computation follows exactly eq. (7). For a "hard" version of this strategy, we can subsequently normalize the pre-computed $y^S_i$ to obtain $\alpha^S_i$ and then directly use them as the controlling parameters (same as in strategy 1).
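Strategy 2 is simple enough to sketch directly. The code below computes the centroid of the sentence embeddings, the negative Euclidean distances to it, and a normalized version of these weights for the "hard" variant. Softmax is used as the normalizer here (Sparsemax is the other option the text mentions), and the function names are hypothetical.

```python
import math


def centroid(vectors):
    """Unweighted mean of the sentence embeddings (the document embedding)."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def precomputed_attention(sentence_embeddings):
    """Return (unnormalized weights y, Softmax-normalized weights alpha)."""
    document = centroid(sentence_embeddings)
    raw = [-euclidean(x, document) for x in sentence_embeddings]
    peak = max(raw)
    exps = [math.exp(r - peak) for r in raw]
    total = sum(exps)
    return raw, [e / total for e in exps]


# The middle sentence sits exactly on the centroid and gets the largest weight.
raw, alpha = precomputed_attention([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
```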

Training Objective
In summary, the training objective of NAHTM combines the document- and sentence-level reconstruction likelihoods with the KL and regularization terms introduced above. The likelihood $p_{\Phi^{Comb}}$ is modelled by the decoder network as in eq. (3), with the weight matrix now being $\Phi^{Comb}$; $\gamma_4$ controls the influence of the sentence-level training loss; $Q^D_i$ and $Q^S_{ij}$ are respectively the document- and sentence-level KL terms specified in Section 4.3; $Q^W_v$ consists of the regularization and KL terms respectively for the internal and external embeddings of each word $v$ in the vocabulary; and $\lambda_1$ and $\beta_2$ are the controlling hyper-parameters for the respective terms.

Experimental Setup
Data. We evaluate the efficacy of NAHTM using four real-world corpora from a variety of domains: Wikitext-103 (Nan et al., 2019), 20NewsGroup (Srivastava and Sutton, 2017), the COVID-19 open research dataset (CORD) and NIPS. Table 2 summarises the key statistics of these datasets. For the Wikitext-103 dataset, we only use the introduction part of each document, which we name the WikiIntro dataset, to examine the efficacy of our model on short text. For the CORD dataset, we randomly sampled 20,000 documents from its original corpus for our experiments, and name the resulting dataset CORD20K.
As for the CORD20K and NIPS datasets, we set the split ratio to be 60%-20%-20%, and use the most frequent 10,000 words (after lemmatization and stop-word removal) as the vocabulary. Specifically, we apply the WordNet lemmatizer and the English stop-word list (both from the NLTK toolkit) to preprocess their text. Finally, we use the first 50 sentences of each document from the two datasets for the experiments.
Furthermore, to extract the sentence embeddings from the external pre-trained transformer, we do not apply any pre-processing (neither stemming/lemmatization nor stop-word removal) to the sentence text for any of the datasets, as required by the sentence-transformers library. Doing so guarantees that the pre-trained language model can capture the full contextual information within each sentence.

Evaluation Metrics. We perform two types of evaluation on our model. The first one concerns the ability to discover a set of latent topics that are meaningful and useful to humans. To assess this, we measure topic coherence with the normalized pointwise mutual information (NPMI) (Aletras and Stevenson, 2013; Lau et al., 2014), which is positively correlated with human judgments on topic quality. Specifically, we calculate the NPMI scores on both the internal corpora from Table 2 and a large external corpus. For each calculation, we first select the top 10 words under each topic based on the (sorted) values in each column of the word embedding matrix $\Phi^{Comb}$. Then, we calculate and average the internal NPMI scores of each topic as in Bouma (2009), with a sliding window of size 10 and the training data as the reference corpus. The external NPMI is calculated using Palmetto (https://aksw.org/Projects/Palmetto.html) over a large external English Wikipedia dump. The second metric we use is perplexity, a popular criterion that evaluates how well topic models fit the BoW data, which is calculated over the testing data as:

$$\exp\left( -\frac{1}{I} \sum_{i=1}^{I} \frac{1}{N_i} \log p(w^D_i) \right),$$

where $I$ here is the number of test documents, $N_i$ is the length of document $i$, and $\log p(w^D_i)$ is the log-likelihood of the model on the document's BoW data. It is approximated as $p(w^D_i) \approx p_{\Phi^{Comb}}(w^D_i \mid z^{Comb}_i)$. Here, $\Phi^{Comb}$ combines the decoder weights $\Phi^B$ and the posterior mean for $\Phi^E$, while $z^{Comb}_i$ combines the posterior means for $z^B_i$ and $z^E_i$, as shown in eq. (6), at the first training step after the model has converged with respect to the validation perplexity. Our perplexity calculation follows that of Miao et al. (2016, 2017) and Card et al. (2018), but for fair comparisons, the KL divergence is not included in the perplexity calculation, as different models weight the KL terms differently.
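The perplexity computation can be sketched as follows. The sketch assumes a decoder that applies a softmax over the vocabulary to the product of the word-topic matrix and a document's topical embedding, and it drops the count-only multinomial constant, as is common for this metric; the function names and toy inputs are hypothetical.

```python
import math


def log_softmax(values):
    """Numerically stable log of the softmax of a real vector."""
    peak = max(values)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in values))
    return [v - log_norm for v in values]


def doc_log_likelihood(bow, z, phi):
    """Multinomial log-likelihood of one BoW vector, up to a constant.

    `phi` is a V x K word-topic matrix and `z` a K-dimensional embedding.
    """
    logits = [sum(p * zk for p, zk in zip(row, z)) for row in phi]
    log_probs = log_softmax(logits)
    return sum(count * lp for count, lp in zip(bow, log_probs))


def perplexity(bows, embeddings, phi):
    """exp of the negative average per-word log-likelihood over documents."""
    total = 0.0
    for bow, z in zip(bows, embeddings):
        total += doc_log_likelihood(bow, z, phi) / sum(bow)
    return math.exp(-total / len(bows))


# With an all-zero word-topic matrix the decoder is uniform over V = 3 words,
# so the perplexity equals the vocabulary size.
phi = [[0.0, 0.0] for _ in range(3)]
ppl = perplexity([[2, 0, 1], [1, 1, 1]], [[0.3, -0.2], [1.0, 0.5]], phi)
```

The uniform case is a useful sanity check: any trained model should achieve a perplexity below the vocabulary size.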
Baselines. We compare NAHTM with various state-of-the-art baselines, which can be broadly categorised into (1) BoW-based autoencoder models without external knowledge, which include Scholar (without meta-data) (Card et al., 2018), W-LDA (Nan et al., 2019), GSM (Miao et al., 2017), ETM (Dieng et al., 2020) and RRT (Tian et al., 2020); and (2) models with external knowledge learned by pre-trained language models, including CTM (Bianchi et al., 2021) and BAT (Hoyle et al., 2020). GSM is similar to Scholar but with a simpler encoder-decoder structure. ETM further factorizes the topic-word distribution matrix into the product of topic and word embedding vectors. RRT proposes a new reparameterization trick for Dirichlet distributions over the document embeddings.
As for the implementations and settings of the baselines, we use their official code and default settings obtained from their official GitHub repositories, except for GSM, whose original code is unavailable. In this case, we re-implement GSM by referring to other credible sources. For BAT, we use its enhanced versions with Scholar (i.e. BAT+Scholar) and W-LDA (i.e. BAT+W-LDA) from its implementation. To allow for fair comparisons, we tune the major parameters for all the models, including the numbers of hidden layers {1, 2, 3} and hidden neurons {100, 300, 600}, the learning rate {0.001, 0.002, 0.005, 0.01} and the batch size {8, 20, 200}. These ranges are the ones most commonly set by the baselines in their own implementations. We generally found that 1 hidden layer with 300 neurons, a learning rate of 0.002 and a batch size of 20 yield the best overall perplexity and NPMI performance across the models. CTM and BAT, which incorporate external knowledge as NAHTM does, use the same pre-trained transformers as NAHTM, including "bert-base-uncased", "distilbert-base-uncased" and "roberta-base" from the Huggingface Transformers models.
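For reference, the search space listed above can be written down and enumerated as follows. The dictionary keys and the helper name are hypothetical, and the exhaustive enumeration is only one way such tuning could be organised; the text does not state which search procedure was used.

```python
from itertools import product

# Candidate values as listed in the text; the key names are illustrative.
SEARCH_SPACE = {
    "hidden_layers": [1, 2, 3],
    "hidden_neurons": [100, 300, 600],
    "learning_rate": [0.001, 0.002, 0.005, 0.01],
    "batch_size": [8, 20, 200],
}


def iter_configs(space):
    """Yield every combination of the candidate values as a dict."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(SEARCH_SPACE))

# The setting reported to work best overall across the models.
reported_best = {
    "hidden_layers": 1,
    "hidden_neurons": 300,
    "learning_rate": 0.002,
    "batch_size": 20,
}
```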

NAHTM Settings
The hyper-parameters of NAHTM control the influence of its different components, and we use the validation dataset to optimize their values in terms of the validation perplexity. We find that $\gamma_1 = 0.01$ and $\gamma_2 = 0.001$, which control the effects of the external embeddings for words and documents respectively, generally work well across the datasets. On the other hand, $\gamma_3$ and $\gamma_4$ were respectively tuned over the candidate value sets {0.001, 0.01, 0.1, 1} and {0.01, 0.1, 1, 5}, with different optimal values for different datasets; $\beta_0$, $\beta_1$ and $\beta_2$ were all optimized over the value set {0.001, 0.01, 0.1, 1}, which, compared with the KL annealing approach (Bowman et al., 2016), is much more efficient albeit sub-optimal; $\lambda_0$ and $\lambda_1$ were both tuned over the value set {0.01, 0.1, 1, 5, 10}. To allow for fair comparisons with the baselines, we set the number of hidden layers for both the internal BoW and the external knowledge encoders to 1, the number of neurons to 300, the batch size to 20, and the learning rate to 0.002 for all the experiments.

Results and Discussion
Following the settings from the previous section, we conduct both quantitative and qualitative experiments to evaluate NAHTM. For the quantitative experiments, we report the average external and internal NPMI and the average test perplexity over 5 runs with different random seeds for initialization. Moreover, within each run, all the models were trained twice, with 50 and 200 topics respectively, and the corresponding metric scores are summarized in Table 3.
It can be observed that NAHTM, equipped with either the $\beta_1$-customising strategy 1 or 2 from Section 4.3 (i.e. $\mathrm{NAHTM}_{S1}$ and $\mathrm{NAHTM}_{S2}$) or with the uncustomised hierarchical KL (i.e. $\mathrm{NAHTM}_{HKL}$), yields overall more coherent topics, in terms of both types of NPMI, than the baseline models and NAHTM with independent KL terms for sentences and documents (i.e. $\mathrm{NAHTM}_{KL}$). In particular, compared with BAT+{Scholar, W-LDA} and CTM, which also leverage external pre-trained knowledge, NAHTM generally achieves not only higher topic coherence but also lower perplexity on the test data. The only exception is the NIPS dataset, where NAHTM follows BAT+Scholar closely on perplexity.
In addition to its efficacy on the document-level BoW reconstruction, NAHTM is also able to achieve state-of-the-art reconstruction performance at the sentence level. We illustrate this by treating each sentence (from the first 50 sentences) as a short document, and applying all the NTMs to reconstruct their BoW data. In this case, we focus on the test perplexity of the different models on the sentences, reporting the sentence-level perplexity from NAHTM with its inference jointly performed over the document- and sentence-level log-likelihood under the hierarchical KL constraint. Table 5 shows that at 50 topics, $\mathrm{NAHTM}_{S2}$, the best variant with respect to the document-level BoW reconstruction, remains competitive on the sentence-level reconstruction task. This suggests that NAHTM, with its attention-aware hierarchical KL regularization, can effectively infer both the document- and sentence-level neural topic models it contains, and enables the latter model to be robust against the sparse sentence-level BoW data.

Furthermore, we conduct an ablation study on the importance of the different external knowledge components in contributing to the robustness of NAHTM when dealing with the sparse sentence data. We find from Table 6 that the external word embeddings are the most important components for enhancing the performance of NAHTM, while the external sentence embeddings are the least important. Despite this, we can still see that the performance of NAHTM is significantly influenced by all three types of external knowledge.
Finally, to gain a more intuitive view of how well NAHTM has discovered the underlying topics, we show in Table 4 the top 10 words under each of four example topics extracted by NAHTM from the NIPS and CORD20K datasets. These four topics are Optimization and Neural Networks from the NIPS dataset, and COVID and Quarantine from the CORD20K dataset. It can be observed that the topics discovered by NAHTM tend to be more coherent and less likely to contain common and irrelevant words which can be found from the topic …

Table 6 (surviving rows only; values in parentheses as in the source):
… (142) | 981 (119) | 1,035 (182) | 3,238 (165)
$\mathrm{NAHTM}_{S2} - X^D$ | 976 (67) | 711 (79) | 890 (86) | 2,894 (111)
$\mathrm{NAHTM}_{S2} - X^S$ | 950 (42) | 683 (60) | 848 (62) | …

Conclusion
In this paper, we have proposed NAHTM, a VAE-based neural topic model with an attention-aware hierarchical KL divergence imposed on pairs of documents and sentences. NAHTM incorporates both the internal BoW information and the external pre-trained knowledge for refining the topical embeddings of words, sentences and documents. Both quantitative and qualitative experiments have shown the effectiveness of NAHTM in (1) recovering the BoW data at different levels of granularity and (2) discovering coherent topics, by making use of the hierarchical KL constraints on the sentence-document pairs and the external knowledge. As for future work, we would like to investigate the possibility of combining NAHTM with neural language models for topic-aware language understanding and content generation.