vONTSS: vMF based semi-supervised neural topic modeling with optimal transport

Recently, Neural Topic Models (NTM), inspired by variational autoencoders, have attracted a lot of research interest; however, these methods have limited applications in the real world due to the challenge of incorporating human knowledge. This work presents a semi-supervised neural topic modeling method, vONTSS, which uses von Mises-Fisher (vMF) based variational autoencoders and optimal transport. When a few keywords per topic are provided, vONTSS in the semi-supervised setting generates potential topics and optimizes topic-keyword quality and topic classification. Experiments show that vONTSS outperforms existing semi-supervised topic modeling methods in classification accuracy and diversity. vONTSS also supports unsupervised topic modeling. Quantitative and qualitative experiments show that vONTSS in the unsupervised setting outperforms recent NTMs on multiple aspects: vONTSS discovers highly clustered and coherent topics on benchmark datasets. It is also much faster than the state-of-the-art weakly supervised text classification method while achieving similar classification performance. We further prove the equivalence of optimal transport loss and cross-entropy loss at the global minimum.


Introduction
Topic modeling methods such as (Blei et al., 2003) is an unsupervised approach for discovering latent structure in documents and achieving great performance (Blei et al., 2009).Topic modeling methods take a list of documents as input.It generates the defined number of topics.It can further produce keywords and related documents for each topic.In recent years, topic modeling methods have been widely used in many fields such as finance (Aziz et al.), healthcare (Bhattacharya et al., 2017), education (Zhao et al., 2020b), marketing (Reisenbichler, 2019) and social science (Roberts et al., 2013).With the development of Variational Autoencoder (VAE) (Kingma and Welling, 2013), Neural Topic Model (Miao et al., 2018;Dieng et al., 2020) has attracted attention as it enjoys better flexibility and scalability.However, recent research (Hoyle et al., 2021) shows that the topics generated by these methods are not aligned with human perceptions.
To incorporate users' domain knowledge into the model, semi-supervised topic modeling methods become an active area of research (Mao et al., 2012;Jagarlamudi et al., 2012;Gallagher et al., 2018) and applications (Choi et al., 2017;Cao et al., 2019;Kim et al., 2013).Semi-supervised topic modeling methods take a few keywords as input and generate topics based on these keywords.Poeple use semisupervised topic modeling methods because they want each topic include certain keywords and incorporate their domain expertise in their generated topics.Traditional semi-supervised topic modeling methods fail to utilize semantic information of the corpus, causing low classification accuracy and high variance (Chiu et al., 2022a).
To solve these problems, we propose a von Mises-Fisher(vMF) based semi-supervised neural topic modeling method using optimal transport (vONTSS).We use the encoder-decoder framework for our model.The encoder uses modified vMF priors for latent distributions.The decoder uses a word-topic similarity matrix based on spherical embeddings.We use optimal transport to extend it to a semi-supervised version.vONTSS has the following enhancements: 1. We introduce the notion of temperature and make the spread of vMF distribution (κ) learnable, which leads to strong coherence and clusterinducing properties.
2. vONT (In the rest of the paper, we use vONT to refer to the unsupervised topic model and vONTSS to semi-supervised version.)achieves the best coherence and clusterability compared to the state-of-the-art approaches on benchmark datasets.
3. We perform the human evaluation of the re- sults for intrusion and rating tasks, and vONT outperforms other techniques.4. Use of optimal transport to extend the stability of the model in the semi-supervised setting.The semi-supervised version is fast to train and achieves good alignment between keywords sets and topics.We also prove its theoretical properties.
5. In the semi-supervised scenario, we demonstrate the vONTSS achieves the best classification accuracy and lowest variance compared to other semi-supervised topic modeling methods.
6.We also show that vONTSS achieves similar performance as the state-of-the-art weakly text classification method while being much more efficient.
2 Related Methods and Challenges NTM Variational Autoencoders (VAE) (Kingma and Welling, 2013) enable efficient variational inference.NTM (Miao et al., 2015) uses Z ∈ R M as topic proportions over M topics and X ∈ R V to represent word count for the dataset with V unique words.NTM assumes that for any document, Z is generated from a document satisfying the prior distribution p(Z) and X is generated by the conditional distribution p θ (X|Z) where θ denotes a decoder.Ideally, we want to optimize the marginal likelihood p θ (X) = p(Z)p θ (X|Z)dZ.Due to the intractability of integration, NTM introduces q ϕ (Z|X), a variational approximation to the poste-rior p(Z|X).The loss function of NTM is: NTM usually utilizes a neural network with softmax to approximate p θ (X|Z) := sof tmax(W z) (Srivastava and Sutton, 2017).NTM selects Gaussian (Miao et al., 2016), Gamma (Zhang et al., 2020) and Dirichlet distribution (Burkhardt and Kramer, 2019) to approximate p(Z).The second term Kullback-Leibler (KL) divergence regularizes q ϕ (Z|X) to be close to p(Z).NTM has several problems in practice.Firstly, it does not capture the semantic relationship between words.Secondly, the generated topics are not aligned with human interpretations.(Hoyle et al., 2021).Thirdly, using Gaussian prior may risk gravitating latent space toward the center and produce tangled representations among classes of documents.This is due to the fact that gaussian density presents a concentrated mass around the origin in low dimensional settings (Dümbgen and Del Conte-Zerial, 2013) and resembles a uniform distribution in high dimensional settings.
Extending NTM to semi-supervised version is also challenging.L θ,ϕ is not always aligned with classification-related loss such as cross-entropy loss as identified by existing research (Chiu et al., 2022b).To be specific, cross-entropy makes keywords sets align with assigned topics, while reconstruction loss(−E q ϕ (Z|X) [log p θ (X|Z)]) makes latent space as representative as possible.Thus, existing semi-supervised NTM methods either are not stable (Wang et al., 2021a;Harandizadeh et al., 2022) or need certain adaptions (Gemp et al., 2019).
Embedding Topic Model (ETM) Pre-trained word embeddings such as Glove (Pennington et al., 2014a) and word2vec (Mikolov et al., 2013) have the ability to capture semantic information, which is missing from basic bag-of-word (BoW) representations.They can serve as additional information to guide topic discovery.Dieng (Dieng et al., 2020) proposes ETM to use a vocabulary embedding matrix e V ∈ R V ×D where D represents the dimension of word embeddings.The decoder ϕ learns a topic embedding matrix e T ∈ R M ×D .We denote topic to word distribution sof tmax(e T e T V ) as E However since there exists some common words that are related to many other words, these common words' embeddings may be highly correlated with few topics' embeddings.Thus, ETM does not produce diverse topics (Zhao et al., 2020a).Besides, using pre-trained embeddings cannot help the model identify domain-specific topics.For example, topics related to COVID-19 are more likely to be expressed by a few topics instead of one single topic using pre-trained Glove embeddings (Pennington et al., 2014b) since COVID-19 is not in the embeddings.
von Mises-Fisher In low dimensions, the Gaussian density presents a concentrated probability mass around the origin.This is problematic when the data is partitioned into multiple clusters.An ideal prior should be non-informative and uniform over the parameter space.Thus, the von Mises-Fisher(vMF) is used in VAE.vMF is a distribution on the (M-1)-dimensional sphere in R M , parameterized by µ ∈ R M where ||µ|| = 1 and a concentration parameter κ ∈ R ≥0 .The probability density function of the vMF distribution for z ∈ R D is defined as: where I v denotes the modified Bessel function of the first kind at order v.The KL divergence with vMF(., 0) (Davidson et al., 2018) is vMF based VAE has better clusterability of data points especially in low dimensions (Guu et al., 2018).However, vMF distribution has limited expressibility when its sample is translated into a probability vector.Due to the unit constraint, sof tmax of any sample of vMF will not result in high probability on any topic even under strong direction µ.For example, when topic dimension M equals to 10, the highest topic proportion of a certain topic is 0.23.Most of vMF-based topic modeling methods are not VAE based and very slow to train as summarized in Appendix M. From the heatmap, we observe a white hole in the middle, which denotes the unreachable probability vector from each distribution.Gaussian is mean-centered, while basic vMF tends to cluster around a small rounded triangular area due to its unity constraint.vMF with radius equals to 10 is even more expressive than Gaussian while still retaining edge weights, inducing separability among different topics.

Proposed Methods
The architecture of vONTSS is shown in Figure 3.At a high level, our encoder network ϕ transforms the BoW representation of the document X d into a latent vector generated by vmf distribution and generates a sample η d .We then apply a temperature function τ and softmax on this sample to get a probabilistic topic distribution z d .Lastly, our decoder uses a modified topic-word matrix E to reconstruct X d 's BoW representation.To extend into semi-supervised setting, we leverage optimal transport to match keywords' set with topics.The encoder network ϕ and generative model parameter θ are learned jointly during the training process.
To overcome entangled topic latent space introduced by Gaussian distribution and limited expressibility of vMF distribution, we make two improvements: 1. Introduce a temperature function τ (η i ) prior to sof tmax() to modify the radius of vMF distribution.2. Set κ to a learnable parameter to flexibly infer the confidence of particular topics during training.
Encoder Network Temperature Function To alleviate concerns regarding expressibility while inducing separability among topics, we modify the radius of vMF distribution.We use a temperature function to represent the radius.As shown in Figure 1, unmodified vMF distribution has limited expressiveness.For instance, Gaussian posteriors can express a topic probability vector of [0.98, 0.01, 0.005, 0.0003, 0.0002], while vMF can't due to the unity constraint.In practice, if we change the ra- Learnable κ To further improve the clusterability, we convert κ from a fixed value to a learnable parameter.The KL divergence of vMF distribution makes the distribution more concentrated while not influencing the direction of latent distribution.This makes the result more clustered.For Gaussian distribution, KL divergence penalizes the polarization of latent distribution (Appendix K).This makes the Gaussian distribution less clustered.To illustrate this, we randomly sampled encoded documents' latent distributions from AgNews Dataset (Zhang et al., 2016) after training with both latent distributions, as shown in Figure 2.For the Gaussian distribution, we see that documents belonging to different topics are entangled around the center, causing the inseparability of topics during both the training and inference stage.vMF distribution, on the hand, repels four document classes into different quadrants, presents more structures when compared to Gaussian distribution, and creates better separable clusters.Detailed ablation study can be found in Appendix O Decoder Network Our decoder follows ETM's construction and uses the embedding e V and e T to generate a topic-word matrix E. One distinction between our decoder and ETM's decoder is that we generate the word embeddings by training a spherical embedding on the dataset.Spherical We also keep word embeddings fixed during the topic modeling training process for two reasons.Firstly, keeping word embeddings fixed can alleviate sparsity issues (Zhao et al., 2018).Additionally, vMF based VAE tends to be less expressive in high dimensions due to limited variance freedom (Davidson et al., 2018).Keeping the embedding fixed can make topics more separable in higher-dimension settings and improve topic diversity.
Loss Function for vONTSS In semi-supervised settings, the user specifies sets of keywords S associated with topics T .Let (s, t) represent a keyword set and a topic pair, where each keyword x ∈ s is labeled by topic t.Instead of training a separate neural network for the semi-supervised extension of NTM, we use the topic-word matrix (decoder θ) to represent the probability of a word x given topic t.
M1 + M2 is a semi-supervised model used in VAE.We adapt the M1 + M2 model framework (Kingma et al., 2014).Under the assumption that p θ (x, t, z) = p θ (x|z)p θ (t|x)p(z), our loss function can be approximated as For topic i and word j, we let q θ (x j |t i ) = E i,j where E is the topic-word matrix.H[q ϕ (X|T )] is entropy of q θ (X|T ).We can consider it as a regularization term.
Optimizing the current model is hard because we have 3 objectives to minimize(cross-entropy, KL Divergence, and reconstruction loss) and they are not aligned with each other.To validate our point, we find out that if we make radius parameters learnable, the classification metric performs worse even if it decreases the reconstruction loss(Appendix D).If we apply cross-entropy at the beginning, topic embeddings get stuck into the center of selected keywords' embeddings, which makes the model overfitting.If we first train an unsupervised vONT, we need to find a way to match keywords and trained topics.If we match them based on their cosine similarity, different keywords may match to the same topics.This makes performance unstable.To deal with these challenges, we decide to use a two-stage training process and do not specify labeled keywords to topics at the beginning.vONTSS first optimizes L θ,ϕ (X) − αH[q ϕ (X|T )] till convergence, then jointly optimizes L(X, T ) for few epochs.This makes our method easier to optimize, less time-consuming, and suitable for interactive topic modeling (Hu et al., 2014).To optimize L ce after stage 1, we need to pair topics and keyword sets.Existing methods such as Gumbel softmax prior (Jang et al., 2016) often lead to instability, while naive matching by q ϕ (x|t) may give us redundant topics.
Optimal Transport for vONTSS Optimal Transport (OT) distances (Chen et al., 2019;Torres et al., 2021) have been widely used for comparing the distribution of probabilities.Specifically, let U (r, c) be the set of positive m × n matrices for which the rows sum to r and the sum of the column to c: For each position t, s in the matrix, it comes with a cost C t,s .Our goal is to solve d C (r, c) = min P ∈U (r,c) t,s P t,s C t,s .To make distribution homogeneous (Cuturi, 2013), we let OT has achieved good robustness and semantic invariance in NLP related tasks (Chen et al., 2019).
Optimal transport has been used in topic modeling to replace KL divergence (Zhao et al., 2020a;Huynh et al., 2020;Wang et al., 2022) or create topic embeddings (Xu et al., 2018) as discussed in Appendix M. It has not been used for extending topic modeling to semi-supervised cases.
To better match topic and keywords set, we approximate L ce using optimal transport.We choose sinkhorn distance since it has an entropy term, which makes our trained topics more coherent and stable.Our goal is to design the loss function that is aligned with derived cross-entropy loss at the global minimum.To be specific, the raw dimension of our cost matrix is equal to the dimension of topics and the column dimension of the cost matrix equals to the dimension of keywords group.
We denote each entry in the M matrix in optimal transport as, where t is the topic and x is the word in a keywords group s.The model uses sinkhorn distance and restricts the sum of each column and row of P to 1.We give the model an entropy penalty term to make sure each topic is only related to one group of keywords.Thus, where λ controls the entropy penalty.The first term is similar to L ce approximation, and the second term makes the result homogeneous.To lower the second term, each keyword should be highly correlated to one topic while not/negatively correlated with others.This further separates the topics and improves the topic diversity.We further show that L OT = L ce when L(X, T ) is minimized.
Lemma 3.1 When L(X, T) reaches the global minimal.For any (s, t), (s ′ , t ′ ) ∈ (S, T ): .41 ± 0.01 0.03 ± 0.01 0.61 ± 0.05 0.05 ± 0.01 0.55 ± 0.04 0.07 ± 0.03 ETM 0.41 ± 0.04 0.02 ± 0.002 0.35 ± 0.02 -0.04 ± 0.01 0.51 ± 0.02 0.06 ± 0.01 vNVDM 0.44 ± 0.02 0.028 ± 0.008 0.74 ± 0.02 0.08 ± 0.007 0.52 ± 0.01 0.03 ± 0.01 ProdLDA 0.32 ± 0.04 -0.22 ± 0.04 0.59 ± 0.06 0.01 ± 0.003 0.35 ± 0.02 -0.18 ± 0.03 NSTM 0.37 ± 0.02 -0.04 ± 0.02 0.61 ± 0.01 -0.08 ± 0.007 0.38 ± 0.01 0.06 ± 0.04 vONT 0.49 ± 0.02 0.054 ± 0.02 0.70 ± 0.03 0.10 ± 0.03 0.69 ± 0.03 0.16 ± 0.02 datasets have ground truth labels.Average document length varies from 5.4 to 155.We preprocess all the datasets by cleaning and tokenizing texts.We remove stop words, words that appear more than 15 percent of all documents and words that appear less than 20 time.For semi-supervised experiments, we use the same labels in DBLP and Ag-News.We sample 4 similar classes from 20News to see how our method performs in datasets with similar labels.For unsupervised settings, we keep the number of topics equal to the number of classes plus one.I keep the unit of the length to 10 for all experiments.For semi-supervised settings, we set the number of topics equal to the number of classes in semi-supervised cases, and we provide 3 keywords for each class.We use 20% as the training set to get our keywords with the top tfidf score for each class.We use 80% data as the test set.Additional details and provided keywords on the dataset are available in Appendix H Settings In our experiment setting, we do not utilize any external information beyond the dataset itself.The embedding is trained on the test set.We do not compare methods that rely on transfer learning or language models such as (Bianchi et al., 2021;Yu et al., 2021;Wang et al., 2021b) because of reasons mentioned in appendix Q.The hyperparameter setting used for all baseline models and vONT is similar to (Burkhardt and Kramer, 2019).We use a fully-connected neural network with two hidden layers of [256,64] unit and ReLU as the activation function followed by a dropout layer (rate = 0.5).We use Adam (Kingma and Ba, 2017) as the optimizer with learning rate 0.002 and use batch size 256.We use (Smith and Topin, 2018) as scheduler and use learning rate 0.01 for maximally iterations equal to 50.We use spherical embeddings (Meng et al., 2019) trained on the dataset for NVTM, ETM, GSM and NSTM.For vONT, we set the radius of vMF distribution equal to 10.We fix α = δ = 1 in L(X, T ) .We keep λ = 0.01 in L OT .Our code is written in PyTorch and all the models are trained on AWS using ml.p2.8xlarge (NVIDIA K80). 1

Unsupervised vONT experiments
Evaluation Metrics We measure the topic coherence and clusterability of the model.Most of unsupervised topic coherence metrics are inconsistent with human judgment, based on a recent study (Hoyle et al., 2021).Thus, we have done a qualitative study where we ask crowdsource to perform rating and intrusion task on 4 models trained on AgNews.In rating task (Aletras and Stevenson, 2013;Newman et al., 2010;Mimno et al., 2011), raters see a topic and then give the topic a quality score on a three-point scale.The rating score is between 1 and 3. A rating score close to 3 means that users can see a topic from provided words.Chang (Chang et al., 2009) devise the intrusion task, where each topic is represented as its top words plus one intruder word which has a low probability belonging to that topic.Topic coherence is then judged by how well human annotators detect the intruder word.The intrusion score is between 0 and 1.An intrusion score close to 1 means that users can easily identify the intruder word.We use mechanical turk and sagemaker groundtruth to do the labeling work.To measure clusterability, we assign 1 Details on codebases used for baselines and fine-tuning are provided in Appendix E every document the topic with the highest probability as the clustering label and compute Top-Purity and Normalized Mutual Information(Top-NMI) as metrics (Nguyen et al., 2018) to evaluate alignment.Both of them range from 0 to 1.A higher score reflects better clustering performance.We further apply the KMeans algorithm to topic proportions z and use the clustered documents to report purity(Km-Purity) and NMI Km-NMI (Zhao et al., 2020a).We varied the number of topics from 10 to 50.We set the number of clusters to be the number of topics for KMeans algorithm.Models with higher clusterability are more likely to perform well in semi-supervised extension.Furthermore, we run all these metrics 10 times.We report mean and standard deviation.Detailed metric implementations are in Appendix G.We also analyze topic diversity in P and unsupervised topic coherence in F.
Baseline Methods We compare with the stateof-the-art NTM methods that do not rely on a large neural networks to train.These methods include: GSM (Miao et al., 2018), an NTM replaces the Dirichlet-Multinomial parameterization in LDA with Gaussian Softmax; ProdLDA (Srivastava and Sutton, 2017), an NTM model which keeps the Dirichlet Multinomial parameterization with a Laplace approximation; ETM (Dieng et al., 2020), an NTM model which incorporates word embedding to model topics; vNVDM (Xu and Durrett, 2018), a vMF based NTM as mentioned in section 2. NSTM (Zhao et al., 2020a), optimal transport based NTM, as mentioned in section 3.All baselines are implemented carefully with the guidance of their official code. 2 For qualitative study, we choose ProdLDA, ETM and LDA as a comparison to align with previous study (Hoyle et al., 2021).
Results i) In Table 2, vONT performs significantly better than other methods in all datasets for cluster quality metrics.This means vMF distribution induces good clusterability.ii) vONT has 2 Some methods we tested had lower TC scores compared to other benchmarks.This may be because we have less complicated layers, small epochs to train, and we keep fewer words.The ranking of these metrics is mostly in alignment with the paper that has a benchmark.We exclude methods that need to rely on large neural networks and a lot of finetune such as (Duan et al., 2021a,b).We also exclude methods similar to existing methods such as (Wang et al., 2022).We exclude methods that do not perform well in previous papers' experiments (Duan et al., 2021a) such as (Burkhardt and Kramer, 2019).We also exclude methods that are relevant but work on different use cases, such as short text.(Wu et al., 2020) the lowest variance in clusterability-related metrics.(iii) In Appendix F, vONT outperforms other models in TC metrics C v and NPMI.This means that our model is coherent.We believe the introduction of the temperature function helps our method perform better than the existed method in coherence.iv) In Appendix P, vONT performs well on diversity and has the lowest variance.
Human Evaluation To evaluate human interpretability, we use intrusion test and ratings test.Details of the experiment are provided in Appendix J.We select AgNews as our dataset, we generate 10 topics each from 4 models.In the word intrusion task, we sample five of the ten topic words plus one intruder randomly sampled from the dataset; for the rating task, we present the top ten words in order.Figure 4 summarizes the results.
vONT performs significantly better than ProdLDA, ETM, and LDA qualitatively.In intrusion test, vONT has the highest score 0.4.The second-best method is LDA, which has score 0.29.The two sample test between the two methods has the p-value equal to 0.014.In rating test, vONT has the highest score 2.51 while ProdLDA has the second-highest score 2.42.The two sample test between the two methods has a p-value equal to 0.036.Based on this study, we conclude that humans find it easier to interpret topics produced by vONT.

Semi-Supervised vONTSS experiments
Evaluation Metric diversity aims to measure how diverse the discovered topic is.diversity is defined as the percentage of unique words in the top 25 words from all topics.(Dienget al., 2020) diversity close to 0 means redundant and TD close to 1 means varied topics.We measure the classification accuracy of the model.Thus, we measure accuracy.Similar to other semi-supervised paper (Meng et al., 2018a), we also measure micro f1 score, since this metric gives more information in semi-supervised cases with unbalanced data.We do not include any coherence metric since we already have ground truth.
Baseline methods CatE (Meng et al., 2020) retrieves category representative terms according to both embedding similarity and distributional specificity.It uses WeSTClass (Meng et al., 2018b) for all other steps in weakly-supervised classification.If we do not consider methods with transfer learning or external knowledge, it achieves the best clas-sification performance.GuidedLDA (Jagarlamudi et al., 2012): incorporates keywords by combining the topics as a mixture of a seed topic and associating each group of keywords with a multinomial distribution over the regular topics.Correlation Explanation CorEx (Gallagher et al., 2018) is an information theoretic approach to learning latent topics over documents by searching for topics that are "maximally informative" about a set of documents.We fine-tune on the training set and choose the best anchor strength parameters for our reporting.We also created semi-supervised ETM by using gaussian distribution and adding the same optimal transport loss as vONTSS.We call it gONTSS.We also train all objectives instead of using two-stage training and call it vONTSS with all loss.Instead of applying optimal transport, we apply cross entropy directly after stage 1 and match topics by keywords set with the highest similarity.We call this method vONTSS with CE.To get Best Unsupervised method, we train the unsupervised models (ETM, vNVDM, vONT, ProdLDA) and consider all potential matching between topics and seed words.We report the method with the highest accuracy for each dataset across all different matching.Guided BERTopic We evaluate the guided version of BERTopic (Grootendorst, 2022) method.They create seeded embeddings to find the most similar document.It then takes seed words and assigns them a multiplier larger than 1 to increase the IDF value. 3esults Table 3 shows that i) vONTSS outperforms all other semi-supervised topic modeling methods in classification accuracy and micro F1 score, especially for large datasets with lengthy texts such as AgNews.ii) vONTSS has a lower standard deviation compared to other models.This advantage makes our model more stable and practical in real-world applications.iii) To compare methods with/without optimal transport, methods with optimal transport vONTSS achieve much better accuracy, diversity, and lower variance compared to vONTSS with CE and vONTSS with all loss.This means optimal transport does increase the classification accuracy, stability, and diversity of generated topics.iv) In benchmark datasets, vONTSS is comparable to CatE in quality metrics.As can be seen in Table 5 in the appendix, vONTSS is 15 times faster than CatE.v) Unsupervised methods cannot produce comparable results even if we use the best topic seed word matching.This shows that semi-supervised topic modeling methods are necessary.vi) Guided Bertopic does not produce good results.It is also not very stable.In Guided Bertopic, the assigned multiplier is increased across all topics, which makes their probability less representative.vi) If we change vONTSS to gONTSS,

Conclusions
In this paper, we propose a new semi-supervised neural topic modeling method vONTSS, which leverages vMF, the temperature function, optimal transport, and VAEs.Its unsupervised version exceeds state-of-the-art in topic coherence through both unsupervised and human evaluations while inducing high clusterability among topics.We show that optimal transport loss is equivalent to crossentropy loss under the optimal condition and induces one-to-one mapping between keywords sets and topics.vONTSS achieves competitive classification performance, maintains top topic diversity, trains fast, and possesses the least variance among diverse datasets.Xian-Ling Mao, Zhaoyan Ming, Tat-Seng Chua, Si Li, Hongfei Yan, and Xiaoming Li. 2012.Sshlda: a semisupervised hierarchical topic model.In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pages 800-809.

Appendix A Additional Experimental Results
Figure 5 shows the variation of cluster purity as the number of topics changes.This expands the information provided in Figure 2.
Figure 6 provides box plots for the metrics in Table 3.
B Proof of Lemma 3.1 Lemma B.1 When L(X, T) reaches the global minimum.For any (s, t), (s ′ , t ′ ) ∈ (S, T ): If the reverse is true, then, we can just switch position of topic t and t ′ in the topic-word matrix and also switch the position on latent space z using temperature function.This will not change reconstruction process, since for every input, get the same reconstruction.Thus, reconstruction loss does not change.Assume this new neural network structure has loss L ′ (X, T ) and cross entropy loss is The last step is based on (9).This contradicts that L(X, T ) is global minimal.Thus, lemma holds.

C Proof of Theorem 3.2
Theorem C.1 When L(X, T) reaches the global minimal, vONT is the left most.We mark its performance in skyblue.
Step 1 show that p t,s = 1 when (t, s) ∈ (T, S) and equal to 0 in all other cases.∃p t,s = γ < 1 when (t, s) ∈ (T, S).Without loss of generality, we assume p t,s ′ = 1 − γ, p t ′ ,s ′ = γ and p t ′ ,s = 1 − γ.Consider related term in L OT , for the first term: Lemma 3.1 and Equation (7) For the second term in L OT , −p t,s log p t,s = 0 when p t,s = 1 or 0. Otherwise, it is larger than 0. This means that p t,s = p t ′ ,s ′ = 1 achieve smaller L OT compare to current settings.This contradicts the definition of L OT which is the min in the space.Thus, p t,s = 1 when (t, s) ∈ (T, S).Since the raw sum and column sum equal to |T |.This means p t,s = 0 when (t, s) / ∈ (T, S) 10) and ( 11), we have

D Effect of learn-able distribution temperature
In this study, we make it a learnable parameter and implement it in two ways.The first way is setting temperature variable as one parameter that can be learned (1-p model).All topics share the same parameter.The second way is setting the temperature variable as a vector with dimension equal to the number of topics (n-p model).This means each topic has its own temperature.The initialization value for both the vectors is 10.
After training, the 1-p model has value 4.99 and n-p model has values [-0.45,4.88,5.91,3.47,4.19](values are rounded to 2 decimals).The accuracy for 1-p model is 78.9 and n-p model is 80.5.This means that vONTSS cannot further improve with learnable temperature.This means that our loss function is not fully aligned with accuracy metric.This is due to the fact that we optimize reconstruction loss as well as KL divergence during the training procedure.This makes our objective less aligned with cross entropy loss.For p dimensional latent space, vMF is parameterized by p+1 variables while Gaussian is parameterized by 2*p variables assuming conditional independence or up to p(p+1)/2 + p variables assuming interdependence.In the extreme setting when labelled documents are less than O(p 2 ), our encoder and decoder may overfit, learning identity mapping.

E Code
In the topic modelling space, a softmax transformation σ is applied to η to extract a probabilistic mixture of topics.In the independent Gaussian posterior case, we view affinity and confidence of the document to topic 1 is encoded in the first entry of µ and, σ 2 respectively.Ideally, we would want the encoder to offer variability in the sampling process to regularize, defined as difference in topic probability with initial training epochs; however, we will show through an example 7, that Gaussian may learn identity mapping by predicting variance to be near 0.
In the figure below, we define misaligned document as those documents such argmax(ς)!= argmax(η).This can be viewed as a measure of regularization.In the Gaussian case, our encoder network learns identity mapping within the first epoch.Out of 120000 documents, only 200 or so documents were able to explore different spaces.vMF allows 1/6th of documents to vary and stabilizes after KL divergence kicks in.In trained latent spaces representation, we clearly see vMF learning more nuanced and structured data when comparing to Gaussian as you can see in 8

J Human Evaluation
We use the ratings and word intrusion tasks as human evaluations of topic quality.We recruit crowdworkers using Amazon Mechanical Turk inside Amazon Sagemaker.We pay workers 0.024 per ratings task and 0.048 per intrusion tasks.We select enough crowdworkers per task so that p value for two sample t test between the best method and the second-best method is less than 0.05, resulting in a minimal of 18 crowd workers per topic for both tasks.Overall, we ask crowdsources to perform 1641 tasks and create 223 We select AgNews as our dataset, we generate 10 topics each from 4 models.In the word intrusion task, we sample five of the ten topic words plus one intruder randomly sampled from the dataset; for the ratings task, we present the top ten words in order.
We also document the confidence per task generated by Amazon Mechanical Turk tool and average time per task for each task as can be seen below.For time spent, crowdsources spend 100 115 seconds per intrusion task and 70 80 seconds per rating task.Crowdsources spent 102.7 seconds on intrusion task generated by vONT which is lower than all other tasks.This means that it is easier for users to find intrusion word for topics generated by vONT.The confidence per rating task is between 0.88 to 0.94, where vONT has highest confidence 0.938 while LDA has lowest confidence 0.886.The confidence per intrusion task is between 0.74 to 0.86, where vONT has highest confidence 0.858 while ETM has lowest confidence 0.747.This means the crowdsources are in general more confident in their answer to questions that is generated by vONT.

K Theoretical Analysis of vMF clusterability
In this section, we present theoretical intuition behind cluster inducing property of vMF distribution comparing to the normal distribution.
In the normal VAE set up, the encoder network learns mean parameter µ i and variance parameter σ i for each document i.During the training process, we sample one data point, η i from the learned distribution and pass into the softmax function to represent a probability distribution of topics.To introduce high clusterability, we need sampled η to have the ability to induce high confidence assignment to a topic under some form of regularization.In other words, with p number of topics, model can increase argmax(sof tmax(η)) ∈ (1/p, 1) without additional penalty.
We prove that under normal distribution and in the two dimensional case, it is impossible to increase argmax(sof tmax(η)) without increase KL divergence loss with respect to the prior N (0, I).The KL divergence with p = 2 is KL If we denote p 1 and p 2 to be expected distribution of topics, then p 1 = e µ 1 e µ 1 +e µ 2 and p 2 = e µ 2 e µ 1 +e µ 2 .Without loss of generality, we assume that the document i is more aligned with the first topic, the model will learn and output µ 1 > µ 2 .To minimize KL defined above, µ 1 and µ 2 will be centered be around 0 with µ 1 = −µ 2 ; however, in order to increase propensity of argmax(sof tmax(η)) or p 1 , µ 1 and µ 2 have to increase and decrease respectively, forcing the KL divergence penalty to increase.distribution to model corpus µ, word embeddings and entity embeddings.To compare, we use modified vMF to generate topic distributions over documents and adapt spherical word embeddings instead of modeling it using vMF.Our method scales well, optimizes fast and offers highly stable performance.The choice of spherical word embeddings also alleviates the sparsity issue among words.vNVDM (Xu and Durrett, 2018) is the only other method that combines vMF with variational autoencoders.(Xu and Durrett, 2018) proposes using vMF(.,0) in place of Gaussian as p(Z), avoiding entanglement in the center.They also approximate the posterior q ϕ (Z|X) = vM F (Z; µ, κ) where κ is fixed to avoid posterior collapse.The above approach does not work well for two reasons.Firstly, fixing κ causes KL divergence to be constant, which reduces the regularization effect and increases the variance of the encoder.Another concern with vMF distribution is its limited expressability when its sample is translated into a probability vector.Due to the unit constraint, sof tmax of any sample of vMF will not result in high probability on any topic even under strong direction µ.For example, when topic dimension M equals to 10, the highest topic proportion of a certain topic is 0.23.We also have a different decoder.
NSTM (Zhao et al., 2020a) uses optimal transport to replace KL divergence.Row and column represent topics and words.Instead, our method represents row and column as topics and keywords with M matrix also defined differently.(Xu et al., 2018) uses optimal transport for topic embeddings, but with wasserstein distances as metric and jointly learns word embeddings.Instead, our algorithm keeps word embedding fixed during the training process to maintain stability.

N Ablation Study on Radius
Ablation study for radius parameter on AG-News where we set topics equal to 10: as we sweep temperature from 1 to 20, nmi increases and diversity decreases.Radius=10 has the best average rank over coherence based metrics in this temperature range.It has good diversity while has good coherence based metric.Temperature = 10 also has the best pruity score which make it useful for semi-supervised learning

O Ablation Study on κ
Ablation study for Kappa on AG-News: we check kappa = 10, 50, 100, 500, 1000.Kappa=100 has highest purity and nmi, kappa = 50 has highest NPMI and C v .Kappa = 500 has highest diversity.Our version of kappa has highest diversity, purity and NPMI compare to all fixed kappa.

P Diversity Evaluation on vONT
vONTSS has high diversity by design.As you can see in the table, vONT achieves the best diversity on R8 and AgNews.vONT is the second best on 20News dataset.It also has the lowest standard deviation compare to other methods.

Q Why not use language modeling based methods?
Most language modeling methods are time-consuming to train and need a lot of transfer learning.They also need finetune in most of our use cases.Without fine-tuning, (Bianchi et al., 2021) makes it harder to be used in domain-specific datasets.We have tried (Yu et al., 2021;Wang et al., 2021b) to compare, but both takes too much time to run.On AG-News, (Yu et al., 2021) takes 108 minutes to run, while (Wang et al., 2021b) takes more than 2.5 hours.It also occurs in other models in footnote 2. vONTSS takes 8 minutes to run and 50 seconds to fine-tune.We also tried some methods which only leverage embeddings of language modeling such as On AgNews and we set topics equal to 20, For (Wang et al., 2020), diversity 0.71, C v 0.396, NPMI:-0.1089.For (Bianchi et al., 2021), diversity 1, C v 0.435, NPMI:-0.1073.Except diversity in (Bianchi et al., 2021), all other metric perform worse than vONT.
For semi-spervised cases, we take keywords as input.It is really different from other weakly supervised learning formulations, and how to incorporate keywords into a language model is not straight forward.We have tried few methods, but it does take a lot of time to run and change their code is not easy since their effectiveness do rely on the specific version of language model.Thus, we exclude language modeling methods in our paper.Also, in our use case, each topic model is designed for a specific user or use case.It will be very hard to be interactive or store the model on user's side when the number of parameters is too large for every single model.

R Limitations and Risks
vMF distribution has a unit constraint.This limits the variability of latent space, which in turn reduces the gains as the number of topics increase.We can try other distributions with richer variability, such as Bivariate von Mises distribution and Kent distribution.
Also, in weakly supervised cases, vONTSS may not perform as well as those methods that leverage pretraining language models in classification.In the future, we can combine the structure of this model with existed language modeling to further improve its classification performance.
Lastly, in semi-supervised cases version, our formulation of vONTSS requires each topic to have at least one keyword.This limits its practical usage to some extent.To solve it, we can first preselect topics before doing the topics and keywords mapping, or we can modify the optimal transport loss using Gumbel distributions.

Figure 1
Figure 1: 2-D PCA projection of empirical CDF sof tmax(η) where from left to right η ∼ N (0, I), η ∼ vM F (., 0) and η ∼ 10 * vM F (., 0)) respectively.From the heatmap, we observe a white hole in the middle, which denotes the unreachable probability vector from each distribution.Gaussian is mean-centered, while basic vMF tends to cluster around a small rounded triangular area due to its unity constraint.vMF with radius equals to 10 is even more expressive than Gaussian while still retaining edge weights, inducing separability among different topics.

Figure 2
Figure 2: 2-D TSNE projection of randomly sampled η from latent spaces under different posteriordistributions.From left to right are Gaussian, vMF with fixed k, vONT Each color represents a different topic.All encoders are trained on AgNews dataset with the same network structure.

Figure 3 :
Figure 3: Architecture of the model.Purple represents the part of the network that can be trained.L recon represents reconstruction loss.L KL represents KL divergence.L OT represents optimal transport loss

Figure 4 :
Figure 4: Comparison of intrusion and rating task performance on AgNews.

Figure 5 :
Figure 5: Each column represents a metric and each row represents a dataset.The error bar represents the standard deviation that is created by running the same model for 10 times with different random seeds.

Figure 6 :
Figure 6: Each row represents a metric and each column represent a dataset.The boxplot is created by running the same model for 10 times with different random seeds.Mean and variance values are presented in the boxplot.vONT is the left most.We mark its performance in skyblue.

Figure
Figure 7: Sample Exploration

Figure 9 :
Figure 9: User interface of intrusion task

Figure 10 :
Figure 10: User interface of rating task

Table 1 :
Description of the notations used in this work

Table 2 :
Clusterability metrics for vONT.The number of topics is 20.The best and second-best scores of each dataset are highlighted in boldface and with an underline, respectively.Figure5in the appendix shows the variation of the metrics as a function of a number of topics.It is hard to get Km-Purity for ProdLDA.Since it does not perform well for Top-purity, we do not think it will perform well on Km-purity.Thus, we ignore that result.

Table 4 :
Coherence metrics for vONT.Number of topics is 20.Figure6in the appendix shows details of the result.

Table 6 :
Evaluate the influence of radius on coherence and clusterability related metric in Dataset AgNews.Temperature is from 1 to 20.The best scores of is highlighted in boldface.The number of topis is 10

Table 7 :
kappa diversity Top-Purity Top-Nmi Km-Purity Km-Nmi NPMI C v Evaluate the influence of learnable on coherence and clusterability related metric in Dataset AgNews.The best scores is highlighted in boldface.The number of topic is 20.

Table 8 :
Evaluate the diversity of vONT compare to other methods in all 3 datasets.The number of topic is 20.