Enhancing Neural Topic Model with Multi-Level Supervisions from Seed Words

Efforts have been made to apply topic seed words to improve the topic interpretability of topic models. However, due to the semantic diversity of natural language, supervisions from seed words can be ambiguous, making them hard to incorporate into current neural topic models. In this paper, we propose SeededNTM, a neural topic model enhanced with supervisions from seed words on both the word and document levels. We introduce a context-dependency assumption to alleviate these ambiguities with context document information.

Many works on conventional topic models incorporate seed words as guidance. Some extend Latent Dirichlet Allocation (LDA) into seeded models (Andrzejewski and Zhu, 2009; Jagarlamudi et al., 2012; Li et al., 2016; Eshima et al., 2020), and some draw inspiration from information theory (Gallagher et al., 2017) or word embeddings (Meng et al., 2020a). While most conventional topic models struggle with a growing number of topics and documents, the recent development of neural topic models (NTM) has led to keyETM (Harandizadeh et al., 2022), which incorporates seed words into an NTM to retain the scalability of NTMs on large datasets.
However, keyETM only focuses on regularizing word-topic relations with seed words and fails to incorporate document-level topic information, which is essential because the semantics of words may vary across context documents. As shown in Figure 1(a), under different contexts the word 'apple' has different semantic meanings and may belong to different topics, even if it co-occurs with the seed word 'company'. This inspires us to incorporate supervisions from seed words into the NTM on both the word and document levels and to balance information from both levels for better inference of topics, thus achieving better topic interpretability.
There remain challenges in effectively combining multi-level supervisions from seed words within the current framework of NTMs. Firstly, the mean-field assumption made in current NTMs prevents the model from combining the topic preferences of words and documents because they are assumed to be conditionally independent. Secondly, as shown in Figure 1(b), document-level supervisions from seed words can be noisy due to the semantic ambiguity of natural language. Previous work (Li et al., 2018) tried to tackle this problem via a neighbor consistency regularization. However, the neighbor-based method can be time-consuming, limiting scalability on large datasets.

Our contributions are summarized as follows:

• We propose SeededNTM, a novel neural topic model that leverages supervisions from seed words on both the word and document levels.
• We propose a reasonable context-dependency assumption and develop an auto-adaptation mechanism to automatically balance between word level and document level information.
• We propose an intra-sample consistency regularizer to deal with noise from document-level supervisions by encouraging both perturbation and semantic consistency.

Word Encoder
The word encoder encodes words into local word-topic preferences ϕ. A word w_n is first encoded into an embedding vector e_n, which is then passed through a feed-forward network activated with a softmax function.
The above procedure is denoted as ϕ_n = F_w(w_n).
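A minimal sketch of such a word encoder in PyTorch, assuming a single linear layer as the feed-forward network; layer sizes and names are illustrative assumptions, not details from the paper:

import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    """Maps each word id to a local word-topic preference phi_n (a distribution over K topics)."""
    def __init__(self, vocab_size, embed_dim, num_topics):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # w_n -> e_n
        self.ff = nn.Linear(embed_dim, num_topics)             # feed-forward layer
    def forward(self, word_ids):                               # word_ids: (batch, n_words)
        e = self.embedding(word_ids)                           # (batch, n_words, embed_dim)
        return torch.softmax(self.ff(e), dim=-1)               # phi: (batch, n_words, K)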

Topic Decoder
The decoder contains the topic-word distributions and reconstructs documents from topic mixtures. Inspired by (Eisenstein et al., 2011), we decompose topics in log-space into three parts: a background term m, a regular topic η^r, and a seed topic η^s. The background term is estimated from the overall log frequencies of words in the corpus, and both the regular and seed topics act as additional deviations on m. The probability β_kv of word w_v in topic k is

β_k = softmax(m + η^r_k + η^s_k),    (3)

where η^r_k is a V-dimensional parameter vector whose elements at the positions corresponding to S_k are fixed to zero, and η^s_k is defined as

η^s_kv = κ · 1(w_v ∈ S_k),

where κ is a hyperparameter of seeding strength. For a document d, its corresponding θ is computed with the document encoder F_d. In previous work (Rezaee and Ferraro, 2020), the inferred posterior distribution q(θ, z|w) is decomposed with a mean-field assumption as

q(θ, z|w) = q(θ|w) ∏_n q(z_n|w_n).

We use the symmetric KL divergence to measure the distance between two distributions in our consistency regularizer.
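Returning to the decoder's topic-word computation described at the start of this subsection, here is a minimal sketch of how β could be assembled from the background term, the regular deviations, and the seed deviations. The exact parameterization is not fully recoverable from the text, so the masking of seed positions and the constant boost κ are assumptions:

import torch

def topic_word_distribution(m, eta_r, seed_mask, kappa):
    """m:         (V,)   background log word frequencies from the corpus
       eta_r:     (K, V) regular topic deviations (learnable)
       seed_mask: (K, V) 1.0 where word v is a seed word of topic k, else 0.0
       kappa:     seeding strength hyperparameter
       Returns beta: (K, V) topic-word probabilities."""
    eta_r = eta_r * (1.0 - seed_mask)     # elements at seed positions fixed to zero
    eta_s = kappa * seed_mask             # constant boost kappa on each topic's seed words
    return torch.softmax(m.unsqueeze(0) + eta_r + eta_s, dim=-1)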

Training Objectives
With the new assumption in Eq. 9, L_rec and L_kl in Eq. 1 can be further derived; detailed derivations can be found in Appendix A.
Our final training objective combines the above losses, where λ_0 is a KL annealing factor that gradually increases to 1 during training and λ_1, λ_2, λ_3 are hyperparameters. The overall structure of SeededNTM is shown in Figure 2, and the training procedure is described in Algorithm 1.

5 Experiments

We conduct our experiments on three datasets: 20 Newsgroups, UIUC Yahoo Answers, and DBPedia.

Evaluation of Topic Coherence
Evaluation Metrics. We use Normalized Pointwise Mutual Information (NPMI) to evaluate the coherence of the learned topics. NPMI between words w_i and w_j is defined as

NPMI(w_i, w_j) = log( p(w_i, w_j) / (p(w_i) p(w_j)) ) / ( −log p(w_i, w_j) ).

Baselines. We compare SeededNTM with the following baselines. For unsupervised topic models, we compare with LDA (Blei et al., 2003), a representative conventional topic model, and CombinedTM (Bianchi et al., 2021a), which enhances prodLDA (Srivastava and Sutton, 2017) with contextualized embeddings from BERT.
For seed-guided topic models, we compare with SeededLDA (Jagarlamudi et al., 2012), STM (Li et al., 2016), Anchored CorEx (Gallagher et al., 2017), CatE (Meng et al., 2020a), keyATM (Eshima et al., 2020), and keyETM (Harandizadeh et al., 2022). Among the baseline methods, SeededLDA, STM, and keyATM achieve better performances on the three datasets, as they incorporate information from seed words on both levels.

We can find that SeededNTM achieves the best results on both metrics, which further demonstrates the quality of our learned topics.
Case Study: Here we present part of the topics learned by SeededNTM on the UIUC Yahoo Answers dataset, along with topics learned by baseline methods using the same seed words as in the aforementioned experiments, in Table 4. We can find that some baselines, like Anchor CorEx, keyATM, and keyETM, tend to put high weights on several commonly used words like 'play', 'great', and 'good', while SeededNTM tends to pay attention to more specific words such as 'nintendo' (the Japanese video game company that releases the 'Pokemon' games), 'rowling' (the author of Harry Potter), and 'ttc' (meaning 'trying to conceive').

B More Details of Datasets

B.1 Dataset Descriptions

Three datasets are used in our experiments: 20 Newsgroups, UIUC Yahoo Answers, and DBPedia. 20 Newsgroups (Lang, 1995) is a collection of newsgroup documents containing 11,000 training samples and 7,000 test samples in 20 classes. It is a common dataset that is widely used in the topic modeling field. The UIUC Yahoo Answers dataset (Chang et al., 2008) contains 150,000 question-answer pairs belonging to 15 categories. It is a classification dataset and has been used for topic modeling in (Card et al., 2018). DBPedia (Zhang et al., 2015) is extracted from Wikipedia and contains 560,000 training samples and 70,000 test samples belonging to 14 ontology classes. DBPedia is a classification dataset, and to the best of our knowledge, this is the first time DBPedia has been used for topic modeling, though similar (much smaller) datasets from Wikipedia have been adopted to test topic models (Nguyen and Luu, 2021).

B.2 Preprocess Procedures for Datasets
We preprocess the documents in each dataset by tokenizing, filtering out stop words, words with document frequency above 70%, and words appearing in fewer than around 100 documents (the exact threshold depends on the dataset).
The final vocabulary sizes after preprocessing vary from 2,000 to 20,000 across datasets. We then remove documents shorter than two words.
Specifically, for the UIUC Yahoo Answers dataset, we follow the approach used in (Card et al., 2018): we drop the Cars and Transportation and Social Science classes and merge Arts and Arts and Humanities into one class, producing 15 categories, each with 10,000 documents.
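A rough sketch of this preprocessing pipeline using scikit-learn's CountVectorizer; the exact tokenizer and the per-dataset frequency thresholds are assumptions:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def build_bow(raw_docs, min_df=100, max_df=0.7):
    """Tokenize, drop English stop words, and filter the vocabulary by document
    frequency; then drop documents with fewer than two in-vocabulary tokens."""
    vectorizer = CountVectorizer(stop_words="english", min_df=min_df, max_df=max_df)
    counts = vectorizer.fit_transform(raw_docs)             # (n_docs, V) bag-of-words matrix
    keep = np.asarray(counts.sum(axis=1)).ravel() >= 2      # documents with >= 2 kept tokens
    return counts[keep], vectorizer.get_feature_names_out()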
As for the augmentation functions A, we use the word-level augmentation method proposed in (Xie et al., 2020), randomly replacing words with lower tf-idf scores. Around 10% of the words are replaced in our experiments.
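A hedged sketch of such an augmentation function; the exact sampling distribution over low tf-idf words and the choice of replacement words are assumptions, not details taken from (Xie et al., 2020) or the paper:

import numpy as np

def augment(doc_tokens, tfidf, vocab, replace_ratio=0.1, rng=None):
    """Word-level augmentation: replace roughly `replace_ratio` of the tokens,
    preferring tokens with low tf-idf so topic-bearing words are mostly kept."""
    rng = rng or np.random.default_rng()
    doc = list(doc_tokens)
    n_replace = max(1, int(replace_ratio * len(doc)))
    scores = np.array([tfidf.get(w, 0.0) for w in doc])
    probs = np.exp(-scores)                    # lower tf-idf -> higher replacement probability
    probs /= probs.sum()
    idx = rng.choice(len(doc), size=min(n_replace, len(doc)), replace=False, p=probs)
    for i in idx:
        doc[i] = rng.choice(vocab)             # replacement sampled uniformly from the vocabulary
    return doc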

B.3 Statistics of Datasets
We summarize the statistics for the three datasets after preprocessing in Table B.1.

C.2 Baselines
We give detailed descriptions of our baselines here.
• LDA (Blei et al., 2003): LDA is one of the most popular unsupervised conventional topic models; it infers the posterior distribution via Gibbs sampling or variational inference.
• CombinedTM (Bianchi et al., 2021a): CombinedTM enhances the topic model with contextualized embeddings from a pretrained language model to improve the model's semantic expressiveness, leading to more coherent topics. CombinedTM is an extension of prodLDA (Srivastava and Sutton, 2017), one of the most representative neural topic models, which uses black-box neural variational inference and optimizes the model with stochastic gradient descent, increasing the model's scalability.

C.7.4 Exploration of the various aspects of a single concept
Due to the ambiguity of natural language, a single word or concept may relate to various topics with different meanings, especially common words such as 'apple', 'doctor', or 'card'. In this case, we assume that users want to use topic models to understand the different topics in the corpus related to a single word. We start with a single word, 'card': we set only one topic with the single seed word 'card' and leave the other topics unsupervised, then use the topic model to generate one supervised topic about 'card' and several unsupervised topics. Iteratively, we treat the most related word in the 'card' topic as the seed word for a new topic and train another topic model under the new setting; a sketch of this loop is given below. The results are shown in Table C.9. Due to space limitations, we only list the topic 'card' in rounds 4 and 5. From the results, SeededNTM shows its ability to distinguish different semantic topics related to the same word, which can assist users in understanding complex concepts.
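A sketch of this iterative seeding loop; `train_seeded_ntm` and `top_words` are hypothetical helpers standing in for the training and topic-inspection steps described above:

def explore_concept(corpus, start_word, num_topics, rounds=5):
    """Iteratively grow the seed sets around a single concept: after each round,
    the most related non-seed word of the seeded topic becomes the seed of a new topic."""
    seed_sets = [[start_word]]                                    # one supervised topic to begin with
    for _ in range(rounds):
        model = train_seeded_ntm(corpus, num_topics, seed_sets)   # hypothetical trainer
        words = top_words(model, topic=0)                         # hypothetical: top words of the 'card' topic
        new_seed = next(w for w in words if all(w not in s for s in seed_sets))
        seed_sets.append([new_seed])                              # seed a fresh topic with it and retrain
    return seed_sets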

D Limitations and Potential Risks of SeededNTM
Though SeededNTM achieves good performance in our experiments, there are still some limitations.
Firstly, supervisions from seed words, though flexible, are also very weak and vulnerable to noise.
Though we introduce several ways to improve the model's robustness, it is still possible that the model breaks down under intentional attacks. Secondly, seed words in our model are used as pseudo supervisions; a more elegant way would be to incorporate them into the generative story. As for potential risks, seeded topic models can be used to trace a specific topic, so they could be used to track someone's information from texts collected from the internet, violating personal privacy.

Figure 1: Examples from the UIUC Yahoo Answers dataset. (a) Multiple semantic meanings of the word 'apple' under different contexts. (b) Seed words from three different topics bring noise to each other when estimating document topic preferences.

Figure 2: The overall structure of SeededNTM. The grey boxes indicate the training losses in SeededNTM, and the dashed boxes indicate the variables used in the loss computations.
We regularize the document-topic distribution θ with a pseudo distribution θ̃, which is estimated via the tf-idf scores of the seed words appearing in the document. Similarly, the word-topic preferences ϕ can also be regularized by seed words. We estimate the pseudo word-topic distribution φ̃ from co-occurrence, measured by the conditional probability p(w|s) estimated over documents containing both the seed word s and the word w. The pseudo probability φ̃_nk of word w_n belonging to topic k is then obtained by normalizing these scores with a temperature factor that sharpens the distribution. We again use the KL divergence to minimize the distance between φ̃_n and ϕ_n:

L_w(ϕ_n, φ̃_n) = KL(φ̃_n ∥ ϕ_n) = Σ_k φ̃_nk log( φ̃_nk / ϕ_nk ).
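A sketch of how the two pseudo distributions could be computed; the exact normalizations (summing tf-idf over each topic's seed words, and a temperature-sharpened softmax over co-occurrence scores) are assumptions where the original equations are not recoverable:

import numpy as np

def pseudo_doc_topic(doc_tfidf, seed_sets):
    """theta_tilde: pseudo document-topic distribution from tf-idf scores of seed words."""
    scores = np.array([sum(doc_tfidf.get(s, 0.0) for s in seeds) for seeds in seed_sets])
    return scores / scores.sum() if scores.sum() > 0 else np.full(len(seed_sets), 1.0 / len(seed_sets))

def pseudo_word_topic(word, cooc_prob, seed_sets, temperature=0.5):
    """phi_tilde_n: pseudo word-topic distribution from co-occurrence p(word | seed),
    sharpened with a temperature factor."""
    scores = np.array([sum(cooc_prob.get((word, s), 0.0) for s in seeds) for seeds in seed_sets])
    logits = scores / temperature
    e = np.exp(logits - logits.max())
    return e / e.sum()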

However, as we mentioned before, per-word topic preferences can be ambiguous without context document information. Therefore, instead of the mean-field assumption, we introduce a context-dependency assumption by taking the document topic distribution θ into consideration:

q(θ, z|w) = q(θ|w) ∏_n q(z_n|w_n, θ).    (9)

As z_n is now conditioned on both w_n and θ, how to properly balance the information from the word and the document remains unsolved. Inspired by the idea of product of experts (Hinton, 2002), we propose an auto-adaptation mechanism that automatically combines the local word-topic preference ϕ_n and the global document-topic preference θ, implementing the combination as a product of the two distributions.
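A minimal sketch of the product-of-experts combination, assuming the combined distribution is simply the renormalized element-wise product of the two experts:

import torch

def combine_poe(phi_n, theta, eps=1e-10):
    """phi_n: (n_words, K) local word-topic preferences; theta: (K,) document-topic distribution.
    Returns q(z_n | w_n, theta) as the renormalized element-wise product of the two experts."""
    joint = phi_n * theta.unsqueeze(0) + eps
    return joint / joint.sum(dim=-1, keepdim=True)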
IG(w) = H(c) − H(c|w), where H(c) is the entropy of class c and H(c|w) denotes the conditional entropy of c given w. For each class, we choose the top L words with the highest IG scores as seed words.
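A sketch of the information-gain computation for a single candidate word, assuming the word's occurrence in a document is treated as a binary feature; how words are then assigned to classes is an assumption:

import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def information_gain(presence, labels, num_classes):
    """presence: boolean numpy array, whether the word occurs in each document.
    labels: integer numpy array of document class labels.
    IG(w) = H(c) - H(c | w), with H(c | w) averaged over the word being present / absent."""
    p_c = np.bincount(labels, minlength=num_classes) / len(labels)
    h_c = entropy(p_c)
    h_cw = 0.0
    for mask in (presence, ~presence):
        if mask.sum() == 0:
            continue
        p_cw = np.bincount(labels[mask], minlength=num_classes) / mask.sum()
        h_cw += mask.mean() * entropy(p_cw)
    return h_c - h_cw

# For each class, seed words would then be the top-L words (among words frequent in that
# class) ranked by this score.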

As we are dealing with topic models with seed words, we take the top N non-seed words and the L predefined seed words for each topic and measure NPMI among the N + L words. For unsupervised methods, we pick the top N + L words. By considering both seed and non-seed words, the NPMI scores measure how well the learned topics fit the predefined aspects of interest. The score also implicitly reflects topic diversity, as topics with a high coherence score with their seed words are likely to be diverse as long as their seed words are distinct. We report NPMI with N = 10, L = 5 on both train and test sets. Results with different numbers of seed words can be viewed in Appendix C.
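A sketch of the NPMI computation over a topic's top N + L words, assuming simple document-level co-occurrence probabilities (the reference corpus and any windowing details are not specified here, so they are assumptions):

import math
from itertools import combinations

def npmi(docs, wi, wj, eps=1e-12):
    """Normalized pointwise mutual information of two words, with probabilities
    estimated from document-level co-occurrence over a list of tokenized documents."""
    n = len(docs)
    p_i = sum(wi in d for d in docs) / n
    p_j = sum(wj in d for d in docs) / n
    p_ij = sum(wi in d and wj in d for d in docs) / n
    if p_ij == 0 or p_i == 0 or p_j == 0:
        return -1.0                     # convention: no co-occurrence -> minimum score
    return math.log(p_ij / (p_i * p_j)) / (-math.log(p_ij) + eps)

def topic_npmi(docs, topic_words):
    """Average NPMI over all word pairs among a topic's top (seed and non-seed) words."""
    pairs = list(combinations(topic_words, 2))
    return sum(npmi(docs, a, b) for a, b in pairs) / len(pairs)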

where 1(w_ki ∈ S_k) indicates whether word w_ki belongs to the topic with seed word set S_k. We conduct human evaluation on the UIUC Yahoo Answers dataset and take the top 5 words from each topic for evaluation. For each metric, we invite 10 graduate students to independently fulfil the task, and the participants in the two groups do not overlap, to avoid information leakage. More details can be viewed in Appendix C. The results are shown in Table 3. Besides the presentation of topics, we conduct further qualitative experiments under different settings to verify the generalization ability of our model. Please refer to Appendix C.3 for more information.

6 Conclusions

In this paper, we propose SeededNTM to improve topic interpretability together with scalability. We leverage supervisions from seed words on both the word and document levels and propose a context-dependency assumption. An auto-adaptation mechanism is designed to balance word and context document information. Moreover, we propose an intra-sample consistency regularizer to deal with noisy document-level supervisions. Perturbation consistency and semantic consistency are encouraged to improve the model's robustness to noise. Through quantitative and qualitative experiments on three datasets, we demonstrate that SeededNTM can derive semantically meaningful topics and outperforms state-of-the-art baselines.

Figure C.1: Examples of the survey used in our human evaluation.
Algorithm 1: The SeededNTM training procedure.
Input: corpus D, topic number K, seed word sets S = {S_1, S_2, ..., S_K}, initial KL annealing factor λ_0, hyperparameters λ_1, λ_2, λ_3, max iteration number T.
for t from 1 to T do
    randomly sample a batch of B documents; L_batch ← 0; λ_0 ← min(λ_0 + 1/T, 1.0);
    compute β_k for each topic k by Eq. 3;
    for each document d in the batch do
        compute θ with encoder F_d;
        compute ϕ_n for each w_n with encoder F_w;
        compute φ̃_d = {φ̃_1, ..., φ̃_n} by Eq. 10;
        L_batch ← L_batch + L_tr by Eq. 13
    end for
    update model parameters with ∇L_batch
end for
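A condensed Python rendering of Algorithm 1; `compute_beta`, `F_d`, `F_w`, `pseudo_word_topic`, and `training_loss` are hypothetical stand-ins for Eq. 3, the two encoders, Eq. 10, and Eq. 13, not actual APIs from the paper:

def train(model, corpus, optimizer, T, batch_size, lambda0, lambdas):
    """Sketch of Algorithm 1: anneal the KL weight, refresh beta each iteration,
    and accumulate the per-document training loss over a sampled batch."""
    for t in range(T):
        batch = corpus.sample(batch_size)
        lambda0 = min(lambda0 + 1.0 / T, 1.0)             # KL annealing factor
        beta = model.compute_beta()                        # topic-word distributions (Eq. 3)
        loss = 0.0
        for doc in batch:
            theta = model.F_d(doc)                         # document-topic distribution
            phi = model.F_w(doc.word_ids)                  # local word-topic preferences
            phi_tilde = model.pseudo_word_topic(doc)       # pseudo supervision (Eq. 10)
            loss = loss + model.training_loss(doc, theta, phi, phi_tilde, beta,
                                              lambda0, *lambdas)   # L_tr (Eq. 13)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()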

Table 1: The NPMI and F1 scores on the three datasets. Results are averaged over multiple runs with different random seeds. Standard deviations can be found in the Appendix.

Table 2: Results of different variants of SeededNTM on the 20 Newsgroups dataset.
(Meng et al., 2018), showing the necessity of incorporating document-level information. Anchor CorEx and CatE are strong baselines on some occasions, as Anchor CorEx has an information-theory-based objective similar to NPMI, and CatE takes the order of words as additional information when learning embeddings.

5.4 Evaluation of Classification

Evaluation Metrics. Besides evaluating the coherence of the learned topics, we also evaluate the models on document classification.

Baselines. We compare SeededNTM on classification with the aforementioned baselines except for the unsupervised ones. Specifically, we follow CatE's original paper and use the WeSTClass model (Meng et al., 2018) to classify its outputs.

Results. Table 1 summarizes the F1 scores.

The results of SeededNTM-w.o.doc and SeededNTM-w.o.word in Table 2 indicate the importance of information on both the document and word levels. Moreover, the performance drop of SeededNTM-mean demonstrates the effectiveness of our proposed context-dependency assumption.

Table 3: Human evaluation results on the word intrusion task and MACC for different models on the UIUC Yahoo Answers dataset.

Table B.1: Statistics of the three datasets after preprocessing.

We present our choices of hyperparameters in Table C.1. Hyperparameters are determined by grid search on the smallest dataset, 20 Newsgroups, and fine-tuned on the other two larger datasets.

Table C.2: Standard deviations of the results in Table 1.

Although these datasets are collected from entirely different sources, some semantically meaningful topics can still be discovered with the transferred seed words, and some lead to slightly different concepts from the originals. Moreover, the results indicate that topic-wise supervisions are flexible and bear less bias than sample-wise supervisions.

Table C.8: The top words of topics learned with seed words transferred from 20 Newsgroups and DBPedia.
• SeededLDA (Jagarlamudi et al., 2012): SeededLDA pairs each regular topic with a topic containing only seed words and biases documents' topic preferences in Gibbs sampling if they contain seed words.
• STM (Li et al., 2016): STM is a topic-model-based weakly-supervised text classification method that incorporates both document-level and word-level supervisions to improve classification accuracy.
• Anchored CorEx (Gallagher et al., 2017): Anchored CorEx is based on an information-theoretic framework and tries to derive maximally informative topics based on seed words.
• CatE (Meng et al., 2020a): CatE aims at deriving topics with a single seed word for each topic. It uses a word embedding method and tries to learn a discriminative embedding space for both topics and words.
• keyATM (Eshima et al., 2020): keyATM improves upon SeededLDA by allowing some seed-word-

Table C.9: The top five words of topics learned on the UIUC Yahoo Answers dataset with iteratively-given seed words.