A Joint Learning Approach for Semi-supervised Neural Topic Modeling

Topic models are some of the most popular ways to represent textual data in an interpretable manner. Recently, advances in deep generative models, specifically auto-encoding variational Bayes (AEVB), have led to the introduction of unsupervised neural topic models, which leverage deep generative models as opposed to traditional statistics-based topic models. We extend upon these neural topic models by introducing the Label-Indexed Neural Topic Model (LI-NTM), which is, to the extent of our knowledge, the first effective upstream semi-supervised neural topic model. We find that LI-NTM outperforms existing neural topic models in document reconstruction benchmarks, with the most notable results in low labeled-data regimes and for datasets with informative labels; furthermore, our jointly learned classifier outperforms baseline classifiers in ablation studies.


Introduction
Topic models are one of the most widely used and studied text modeling techniques, both because of their intuitive generative process and their interpretable results (Blei, 2012). Though topic models are mostly used on textual data (Rosen-Zvi et al., 2012; Yan et al., 2013), use cases have since expanded to areas such as genomics modeling (Liu et al., 2016) and molecular modeling (Schneider et al., 2017).
Recently, neural topic models, which leverage deep generative models, have been used successfully to learn these probabilistic models. Much of this success is due to the development of variational autoencoders, which allow for inference of intractable distributions over latent variables through back-propagation over an inference network. Furthermore, recent research shows promising results for neural topic models compared to traditional topic models due to the added expressivity of neural representations; specifically, we see significant improvements in low-data regimes (Srivastava and Sutton, 2017; Iwata, 2021).
Joint learning of topics and other tasks has been researched in the past, specifically through supervised topic models (Blei and McAuliffe, 2010; Huh and Fienberg, 2012; Cao et al., 2015; Wang and Yang, 2020). These works are centered around the idea of a prediction task using a topic model as a dimensionality reduction tool. Fundamentally, they follow a downstream task setting (Figure 1), where the label is assumed to be generated from the latent variable (topics). On the other hand, in an upstream setting the input (document) is generated from a combination of the latent variable (topics) and the label, which has the benefit of more directly modeling how the label affects the document, resulting in topics with additional information injected from the label. Upstream variants of supervised topic models are much less common (Ramage et al., 2009; Lacoste-Julien et al., 2008), with, to the extent of our knowledge, no neural architectures to this date.
Our model, the Label-Indexed Neural Topic Model (LI-NTM), stands uniquely with respect to all existing topic models. We combine the benefits of an upstream generative process (Figure 1), label-indexed topics, semi-supervised learning, and neural topic modeling to jointly learn a topic model and a label classifier. Our main contributions are:

1. The introduction of the first upstream semi-supervised neural topic model.

2. A label-indexed topic model that allows more cohesive and diverse topics by allowing the label of a document to supervise the learned topics in a semi-supervised manner.
3. A joint training framework that lets users tune the trade-off between document classification and topic quality, which for certain hyperparameters yields a classifier that outperforms the same classifier trained in isolation.
Related Work

Neural Topic Models
Most past work on neural topic models has focused on designing inference networks with better model specification in the unsupervised setting. One line of recent research attempts to improve topic model performance by modifying the inference network through changes to the topic priors or regularization over the latent space (Miao et al., 2016; Srivastava and Sutton, 2017; Nan et al., 2019). Another line of research looks toward incorporating the expressivity of word embeddings into topic models (Dieng et al., 2019a,b).
In contrast to existing work on neural topic models, our approach does not mainly focus on model specification; rather, we create a broader architecture into which neural topic models of all specifications can be trained in an upstream, semi-supervised setting. We believe that our architecture will enable existing neural topic models to be used in a wider range of real-world scenarios where we leverage labeled data alongside unlabeled data and use the knowledge present in document labels to further supervise topic models. Moreover, by directly tying our topic distributions to the labels through label-indexing, we create topics that are specific to labels, making these topics more interpretable, as users are directly able to glean what types of documents each topic is summarizing.

Downstream Supervised Topic Models
Most supervised topic models follow the downstream supervised framework introduced in s-LDA (Blei and McAuliffe, 2010). This framework assumes a two-stage setting in which a topic model is trained and then a predictive model for the document labels is trained independently of the topic model. Neural topic models following this framework have also been developed, with the predictive model being a discriminative layer attached to the learned topics, essentially treating topic modeling as a dimensionality reduction tool (Wang and Yang, 2020; Cao et al., 2015; Huh and Fienberg, 2012).
In contrast to existing work, LI-NTM is an upstream generative model (Figure 1) following a prediction-constrained framework. The upstream setting allows us to implicitly train our classifier and topic model in a one-stage, end-to-end setting. This has the benefit of allowing us to tune the trade-off between classifier and topic model performance in a prediction-constrained framework, which has been shown to achieve better empirical results when latent variable models are used as dimensionality reduction tools (Hughes et al., 2018; Sharma et al., 2021). Furthermore, the upstream setting allows us to introduce the document label classifier as a latent variable, enabling our model to work in semi-supervised settings.

Background
LI-NTM extends upon two core ideas: Latent Dirichlet Allocation (LDA) and deep generative models. For the rest of the paper, we assume a setting with a corpus of D documents, a vocabulary of V unique words, and each document having one of L possible labels. Furthermore, let w_dn denote the n-th word of the d-th document.

Latent Dirichlet Allocation (LDA)
LDA is a probabilistic generative model for topic modeling (Blei et al., 2003; Blei and McAuliffe, 2010). Through the process of estimation and inference, LDA learns K topics β_1:K. The generative process of LDA posits that each document is a mixture of topics, with the topics being global to the entire corpus. For each document, the generative process is listed below:

1. Draw topic proportions θ_d ∼ Dirichlet(α_θ).
2. For each word w_dn in the document:
   (a) Draw a topic assignment z_dn ∼ Categorical(θ_d).
   (b) Draw the word w_dn ∼ Categorical(β_{z_dn}).

The topic assignments z_n and the parameters η, σ² are estimated during inference. α_θ is a hyperparameter that serves as a prior for the topic mixture proportions. In addition, we have a hyperparameter α_β that we use to place a Dirichlet prior on our topics, β_k ∼ Dirichlet(α_β).

Figure 2: Generative process for LI-NTM. The label y indexes into our label-topic-word matrix β, which is "upstream" of the observed words in the document w.
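The LDA generative process above can be sketched as follows; the vocabulary size, topic count, hyperparameter values, and function names are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 8                      # illustrative topic and vocabulary sizes
alpha_theta, alpha_beta = 0.5, 0.1

# Global topics: each row beta[k] is a distribution over the V words.
beta = rng.dirichlet(np.full(V, alpha_beta), size=K)

def generate_document(n_words=20):
    # Draw per-document topic proportions, then draw each word by first
    # picking a topic assignment z and then a word from that topic.
    theta = rng.dirichlet(np.full(K, alpha_theta))
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)
        words.append(rng.choice(V, p=beta[z]))
    return words
```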

Deep Generative Models
Deep generative models serve as the bridge between probabilistic models and neural networks. Specifically, deep generative models treat the parameters of the distributions within probabilistic models as outputs of neural networks. Deep generative models fundamentally work because of the re-parameterization trick, which allows for back-propagation through Monte-Carlo samples of distributions from the location-scale family. Specifically, for any distribution g(·) from the location-scale family, a sample z ∼ g(µ, σ²) can be rewritten as z = µ + σ · ε with ε ∼ g(0, 1), thus allowing differentiation with respect to µ and σ².
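A minimal sketch of the re-parameterization trick for a Gaussian, a member of the location-scale family; the function name and shapes are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma, eps=None):
    # Rewrite z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1):
    # the randomness lives entirely in eps, so given the sample the partial
    # derivatives dz/dmu = 1 and dz/dsigma = eps are deterministic,
    # which is what makes back-propagation through the sample possible.
    if eps is None:
        eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps
```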
The variational auto-encoder (VAE) is the simplest deep generative model (Kingma and Welling, 2014), and its generative process is as follows:

z ∼ N(0, I),    x ∼ N(µ_θ(z), Σ_θ(z)),

where µ_θ(z) and Σ_θ(z) are both parameterized by neural networks with variational parameters θ. Inference on a variational auto-encoder is done by approximating the true posterior p(z|x), which is often intractable, with an approximation q_φ(z|x) that is parameterized by a neural network.

The M2 model is the semi-supervised extension of the variational auto-encoder, where the input is modeled as being generated by both a continuous latent variable z and the class label y treated as a latent variable. It follows the generative process below:

y ∼ Categorical(π),    z ∼ N(0, I),    x ∼ N(µ_θ(y, z), Σ_θ(y, z)),

where π parameterizes the distribution on y and µ_θ(y, z), Σ_θ(y, z) are both parameterized by neural networks. We then approximate the true posterior p(y, z|x) with the factorization

q_φ(y, z|x) = q_φ(y|x) q_φ(z|y, x),

where q_φ(y|x) is a classifier used in the unlabeled case and q_φ(z|y, x) is a neural network that takes in the true labels if available and the labels predicted by q_φ(y|x) otherwise.

Figure 3: Architecture for LI-NTM in the unlabeled setting; y is used instead of obtaining a probability distribution π from the classifier in the labeled setting. The q(·|x) are distributions parameterized by neural networks. Note that we can optimize the classifier, encoder, and decoder in one backwards pass.
The Label-Indexed Neural Topic Model
Notationally, let us denote the bag-of-words representation of a document as x ∈ R^V and the one-hot encoded document label as y ∈ R^L. Furthermore, we denote our latent topic proportions as θ_d ∈ R^K, and our topics are represented using a three-dimensional array β ∈ R^{L×K×V}.
Under LI-NTM, the generative process (also depicted in Figure 2) of the d-th document x_d is the following:

1. Draw topic proportions θ_d ∼ LN(0, I).
2. (Unlabeled case only) Draw a label y_d ∼ Categorical(π), where π is the output of our classifier.
3. For each word w_dn in the document:
   (a) Draw a topic assignment z_dn ∼ Categorical(θ_d).
   (b) Draw the word w_dn ∼ Categorical(β_{y_d, z_dn}).

In Step 1, we draw from the Logistic-Normal LN(·) to approximate the Dirichlet distribution while remaining in the location-scale family necessary for re-parameterization (Blei et al., 2003). This is obtained through

δ_d ∼ N(0, I),    θ_d = softmax(δ_d).

Note that since we sample from the Logistic-Normal, we do not require the Dirichlet prior hyperparameter α.
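The logistic-normal draw in Step 1 might be sketched as follows, assuming a diagonal Gaussian over untransformed proportions that are then mapped onto the simplex; names and shapes are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample_topic_proportions(mu, sigma):
    # Logistic-normal approximation to a Dirichlet draw: sample the
    # untransformed proportions delta via the reparameterization trick
    # (staying in the location-scale family), then softmax onto the simplex.
    delta = mu + sigma * rng.standard_normal(np.shape(mu))
    return softmax(delta)
```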
Step 2 is unique to LI-NTM: in the unlabeled case, we sample a label y_d from π, the output of our classifier. In the labeled scenario, we skip Step 2 and simply pass in the document label as y_d.
Step 3 is typical of traditional LDA, but one key difference is that in Step 3b we index β by y_d in addition to z_dn. This step is motivated by how the M2 model extended variational autoencoders to a semi-supervised setting.
A key contribution of our model is the idea of label-indexing. We introduce the supervision of the document labels by having different topics for different labels. Specifically, we have L × K different topics and we denote the k-th topic for label l as the V dimensional vector, β l,k . Under this setting, we can envision LI-NTM as running a separate LDA for each label once we index our corpus by document labels.
Label-indexing allows us to effectively train our model in a semi-supervised setting. In the unlabeled data setting, our jointly-learned classifier, q_φ(y|x), outputs a distribution over the labels, π. By computing the dot-product between π and our topic matrix β, we can partially index into each label's topics in proportion to the classifier's confidence and update the topics based on the unlabeled examples we are currently training on.

Algorithm 1: Topic Modeling with LI-NTM
    Initialize model and variational parameters
    for each training iteration do
        Compute the ELBO and its gradient (backprop.)
        Update model parameters β
        Update variational parameters (φ_µ, φ_Σ, ν)
    end for
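Soft label-indexing via the dot product with π can be sketched as below; the shapes and function names are assumptions for illustration (β as an L × K × V array, π over L labels, θ over K topics):

```python
import numpy as np

def mix_label_topics(pi, beta):
    # Soft label-indexing: weight each label's K x V topic matrix by the
    # classifier's confidence pi over the L labels. A one-hot pi recovers
    # hard indexing into a single label's topics.
    return np.einsum('l,lkv->kv', pi, beta)

def expected_word_dist(theta, pi, beta):
    # Mix the label-indexed topics by the topic proportions theta (length K)
    # to obtain a distribution over the V vocabulary words.
    return theta @ mix_label_topics(pi, beta)
```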

Inference and Estimation
Given a corpus of normalized bag-of-words representations of documents x_1, x_2, ..., x_D, we aim to fit LI-NTM using variational inference in order to approximate the intractable posteriors that arise in maximum likelihood estimation (Jordan et al., 1999). Furthermore, we amortize inference to allow for joint learning of the classifier and the topic model.

Variational Inference
We begin by considering a family of variational distributions, q_φ(δ_d|x_d) modeling the untransformed topic proportions and q_ν(y_d|x_d) modeling the classifier. More specifically, q_φ(δ_d|x_d) is a Gaussian whose mean and variance are parameterized by neural networks with parameters φ, and q_ν(y_d|x_d) is a distribution over the labels parameterized by an MLP with parameters ν (Kingma and Welling, 2014). We use this family of variational distributions alongside our classifier to lower-bound the marginal likelihood. The evidence lower bound (ELBO) is a function of the model and variational parameters and provides a lower bound on the complete data log-likelihood. We derive two ELBO-based loss functions, one for the unlabeled case and one for the labeled case, and compute a linear interpolation of the two for our overall loss function:

L_U(x_d) = −E_q[log p(x_d | θ_d, y_d)] + τ · KL(q_φ(δ_d|x_d) ‖ p(δ_d))    (1)

L_L(x_d, y_d) = −E_q[log p(x_d | θ_d, y_d)] + τ · KL(q_φ(δ_d|x_d) ‖ p(δ_d)) + ρ · H(q_ν(y_d|x_d), y_d)    (2)

Equation 1 serves as our unlabeled loss and Equation 2 serves as our labeled loss. H(·, ·) is the cross-entropy function. τ and ρ are hyper-parameters on the KL and cross-entropy terms in the loss, respectively. These hyper-parameters are well motivated. τ tempers our posterior distribution, which has been well studied and shown to increase robustness to model mis-specification (Mandt et al., 2016; Wenzel et al., 2020); lower values of τ result in posterior distributions with higher probability density around the modes of the posterior. Furthermore, the ρ hyper-parameter in our labeled loss is the core hyper-parameter that makes our model fit the prediction-constrained framework, essentially allowing us to trade off between classifier and topic modeling performance (Hughes et al., 2018). Increasing values of ρ correspond to emphasizing classifier performance over topic modeling performance.
We treat our overall loss as a combination of the labeled and unlabeled losses, with λ ∈ (0, 1) a hyper-parameter weighing the two. λ allows us to control how heavily we want our unlabeled data to influence the model. Example cases where we may want high values of λ are when we have poor classifier performance or a disproportionate amount of unlabeled data compared to labeled data, causing the unlabeled loss to completely outweigh the labeled loss.
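The interpolated objective might be assembled as in the sketch below; the exact per-term quantities follow the roles of τ, ρ, and λ described above, and the direction in which λ weighs the two losses is our assumption:

```python
def labeled_loss(recon, kl, xent, tau=1.0, rho=1.0):
    # Negative ELBO for a labeled document: reconstruction term, KL tempered
    # by tau, and the supervised cross-entropy H(q(y|x), y) weighted by rho.
    return recon + tau * kl + rho * xent

def unlabeled_loss(recon, kl, tau=1.0):
    # Negative ELBO for an unlabeled document, where the label enters the
    # reconstruction term through the classifier's distribution pi.
    return recon + tau * kl

def overall_loss(l_labeled, l_unlabeled, lam=0.5):
    # Linear interpolation of the two losses, lam in (0, 1).
    return lam * l_labeled + (1.0 - lam) * l_unlabeled
```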
We optimize our loss with respect to both the model and variational parameters and leverage the reparameterization trick to perform stochastic optimization (Kingma and Welling, 2014). The training procedure is shown in Algorithm 1 and a visualization of a forward pass is given in Figure 3. This loss function allows us to jointly learn our classification and topic modeling elements and we hypothesize that the implicit regularization from joint learning will increase performance for both elements as seen in previous research studies (Zweig and Weinshall, 2013).

Experimental Setup
We perform an empirical evaluation of LI-NTM with two corpora: a synthetic dataset and AG News.

Baselines
We compare our topic model to the Embedded Topic Model (ETM), the current state-of-the-art neural topic model, which leverages word embeddings alongside variational autoencoders for unsupervised topic modeling (Dieng et al., 2019a). Further details about ETM are given in the appendix (subsection A.2). Furthermore, the baseline for our jointly trained classifier is a classifier with the same architecture trained outside of our joint setting.

Synthetic Dataset
We constructed our synthetic data to evaluate LI-NTM in ideal and worst-case settings.
• Ideal Setting: An ideal setting for LI-NTM consists of a corpus with similar word distributions for documents with the same label and very dissimilar word distributions for documents with different labels.
• Worst-Case Setting: A worst-case setting for LI-NTM consists of a corpus where the label has little to no correlation with the distribution of words in a document.
Since labels are a fundamental aspect of LI-NTM, we wanted to investigate how robust LI-NTM is in a real-world setting, specifically looking at how robust it is to certain types of mis-labeled data points. By jointly training our classifier with our topic model, we hope that, by properly trading off topic quality and classification quality, our model will be more robust to mis-labeled data, since we are able to manually tune how much we depend on the data labels.
We use the same distributions to generate the documents for both the ideal and worst-case data. In particular, we consider a vocabulary with V = 20 words, and a task with L = 2 labels. Documents are generated from one of two distributions, D 1 and D 2 . D 1 generates documents which have many occurrences of the first 10 words in the vocabulary (and very few occurrences of the last 10 words), while D 2 does the opposite, generating documents which have many occurrences of the last 10 words in the vocabulary (and very few occurrences of the first 10 words). The distributions D 1 and D 2 have parameters which are generated randomly for each trial, although the shape of the distributions is largely the same from trial to trial.
In the ideal case, the label corresponds directly to the distribution from which the document was generated. For the worst-case data, the label is 0 if the number of words in the document is an even number, and 1 otherwise, ensuring there is little to no correlation between label and word distributions in a document. Note that in our synthetic data experiments, all of the data is labeled. The effectiveness of LI-NTM in semi-supervised domains is evaluated in our AG News experiments.
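The two document distributions can be sketched as follows; the exact occurrence probabilities and document length are illustrative assumptions, since the paper randomizes the parameters per trial:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 20  # vocabulary size

def make_doc(first_half_heavy, n_words=50):
    # D1 puts most probability mass on the first 10 words; D2 on the last 10.
    # Returns a bag-of-words count vector drawn from a multinomial.
    probs = np.full(V, 0.01)
    block = slice(0, 10) if first_half_heavy else slice(10, 20)
    probs[block] = 0.09
    probs /= probs.sum()
    return rng.multinomial(n_words, probs)

doc1 = make_doc(True)   # mostly occurrences of words 0-9
doc2 = make_doc(False)  # mostly occurrences of words 10-19
```

In the ideal case, the label would simply record which of the two distributions generated the document; in the worst case, it would be the parity of the word count.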

AG News Dataset
The AG News dataset is a collection of news articles collected from more than 2,000 news sources by ComeToMyHead, an academic news search engine. This dataset includes 118,000 training samples and 7,600 test samples. Each sample is a short text with a single four-class label (one of world, business, sports and science/technology).

Evaluation Metrics
To evaluate our models, we use accuracy to gauge the quality of the classifier and perplexity to gauge the quality of the model as a whole. We opted for perplexity as it measures how well the model generalizes to unseen test data.
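Perplexity here is the standard exponentiated negative average per-word log-likelihood; a minimal sketch, with the function name our own:

```python
import math

def perplexity(total_log_likelihood, n_words):
    # exp(-(1/N) * sum log p(w)): lower is better; a model assigning every
    # word probability p yields perplexity 1/p.
    return math.exp(-total_log_likelihood / n_words)
```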

Synthetic Data Experimental Results
We used our synthetic dataset to examine the performance of LI-NTM relative to ETM in a setting where the label strongly partitions our dataset into subsets that have distinct topics to investigate the effect and robustness of label indexing.
LI-NTM was trained on the fully labeled version of both the ideal and worst-case label synthetic datasets, and ETM was trained on the same data with the labels excluded, as ETM is an unsupervised method. We varied the number of topics in both LI-NTM and ETM, exploring the realistic settings K = 2, 8 and the extreme setting K = 20.

Figure 4: Topic-word probability distribution visualization for LI-NTM on the ideal case synthetic dataset with one topic per label. We observe that we learn topics that are strongly label-partitioned.

Effect of Number of Topics
Takeaway: More topics lead to better performance, especially when the label is uninformative.
First, we note that as we increase the number of topics, the performance of LI-NTM on ideal case labels, LI-NTM on worst case labels, and ETM improves as shown in Table 1. This is expected as having more topics gives the model the capacity to learn more diverse topic-word distributions which leads to an improved reconstruction. However, we note that LI-NTM trained on the worst-case labels benefits most from the increase in the number of topics.

Informative Labels
Takeaway: Label Indexing is highly effective when labels partition the dataset well.
Next, we note that LI-NTM trained on the ideal-case label synthetic dataset outperforms ETM with respect to perplexity (see Table 1). This result can be attributed to the fact that LI-NTM leverages label indexing to learn the label-topic-word distribution. Since the ideal-case label version of the dataset was constructed such that the label strongly partitions the dataset into two groups (each of which has a very distinct topic-word distribution), and since we had perfect classifier accuracy (the ideal-case label dataset was constructed such that the classification problem was trivial), LI-NTM is able to use the output from the classifier to index into the topic-word distribution with 100% accuracy. If we denote the topic-word distribution corresponding to label 0 by β_0 and the topic-word distribution corresponding to label 1 by β_1, we note that LI-NTM is able to leverage β_0 to specialize in generating the words for documents with label 0 while using β_1 to specialize in generating the words for documents with label 1 (see Figure 4). Overall, this result suggests that LI-NTM performs well in settings where the dataset exhibits strong label partitioning.

Table 2: Accuracies of the classifier in LI-NTM (V2) on worst-case and ideal-case labels. LI-NTM (V2) is trained only on worst-case labels but evaluated on both worst-case and ideal-case label test sets. Note that even though ρ = 0 and the training set has only worst-case labels, the reconstruction loss distantly supervises the classifier to learn the true ideal-case labels.

    Worst-case labels: 50.4 ± 0.2
    Ideal-case labels: 93.7 ± 6.2

Uninformative Labels
Takeaway: With proper hyperparameters, LI-NTM is able to achieve good topic model performance even when we have uninformative labels.
We now move to examining the results produced by LI-NTM trained on the worst-case labels. In this data setting, we investigated the robustness of the LI-NTM architecture. Specifically, we looked at a worst-case dataset, where we have labels that are uninformative and are thus not good at partitioning the dataset into subsets that have distinct topics.
In the worst-case setting, we define two instances of the LI-NTM model, LI-NTM (V1) and LI-NTM (V2). For LI-NTM (V1), we did see decreases in performance; namely, V1 has a worse perplexity than both ETM and ideal-case LI-NTM. This aligns with our expectation that a label with very low correlation to the topic-word distributions of the documents results in poor LI-NTM performance, which can be attributed to the failure of LI-NTM to adequately label-index in such cases.
However, for LI-NTM (V2) we found that we were actually able to achieve lower perplexity than ETM when the model was told to produce more than 2 topics, even with uninformative labels. To understand why this was happening, we analyzed the accuracy of the original classifier in LI-NTM (V2) on both the worst-case labels (which it was trained on) and the ideal-case labels (which it was not trained on). We report our results in Table 2.
The key takeaway is that we observed a much higher accuracy on the ideal labels compared to the worst-case labels. This suggests that when ρ = 0, the classifier implicitly learns the ideal labels that are necessary to learn a good reconstruction of the data, even when the provided labels are heavily uninformative or misspecified. This shows the benefit of label-indexing and of jointly learning our topic model and classifier in a semi-supervised fashion. Even in cases with uninformative data points, by setting ρ = 0, the joint learning of our classifier and topic model pushes the classifier, through the need for successful document reconstruction, to generate a probability distribution over labels that is close to the true, ideal-case labels despite only being given uninformative or mis-labeled data.

AG News Experimental Results
We used the AG News dataset to evaluate the performance of LI-NTM in the semi-supervised setting. Specifically, we aimed to analyze the extent to which unlabeled data can improve the performance of both the classifier and the topic model in the LI-NTM architecture. Ideally, in the unlabeled case, the distant supervision provided to the classifier by the reconstruction loss would align with the task of predicting correct labels. We ran four experiments on ETM and LI-NTM in which the amount of unlabeled data was gradually increased while the amount of labeled data was kept fixed. In each experiment, 5% of the dataset was considered labeled, while 5%, 15%, 55%, and 95% of the whole dataset was considered unlabeled, respectively.

Semi-Supervised Learning: Topic Model Performance
Takeaway: Combining label-indexing with semi-supervised learning increases topic model performance.
In Table 3 we observe that perplexity decreases as the model sees more unlabeled data. We also note that LI-NTM has a lower perplexity than ETM in higher data settings, supporting the hypothesis that guiding the reconstruction of a document exclusively via label-specific topics makes reconstruction an easier task. In the lowest data regime (5% labeled, 5% unlabeled), LI-NTM performs worse than ETM. This suggests that while in high-data settings, LI-NTM is able to effectively leverage L = 4 sets of topics, in low-data settings there are not enough documents to learn sufficient structure.

Semi-Supervised Learning: Classifier Performance
Takeaway: Topic modeling supervises the classifier, resulting in better classification performance.
Jointly learning the classifier and topic model also seems to benefit the classifier; Table 3 shows that classification performance increases linearly with the amount of unlabeled data. The accuracy increase suggests that the task of reconstructing the bag of words is helpful for news article classification.
Select topics learned from LI-NTM on the AG News Dataset are presented in Table 4 and the distributions are visualized in the appendix Figure A1.

Conclusion
In this paper, we introduced LI-NTM, which, to the extent of our knowledge, is the first upstream neural topic model applicable to a semi-supervised data setting. Our results show that, when applied to both a synthetic dataset and AG News, LI-NTM outperforms ETM with respect to perplexity. Furthermore, we found that the classifier in LI-NTM was able to outperform a baseline that does not leverage any unlabeled data. Even more promising is the fact that the classifier in LI-NTM continued to experience gains in accuracy as the proportion of unlabeled data increased. While we aim to iterate upon our results, our current findings indicate that LI-NTM is comparable with current state-of-the-art models while being applicable in a wider range of real-world settings.
In future work, we hope to further experiment with the idea of label-indexing. While in LI-NTM every topic is label-specific, real datasets have some common words and topics that are label-agnostic. Future work could augment the existing LI-NTM framework with additional label-agnostic global topics, which would prevent identical topics from being learned across multiple labels. We are also interested in extending our semi-supervised, upstream paradigm to a semi-parametric setting in which the number of topics is not a predefined hyperparameter but rather something that is learned.