Adaptive Mixed Component LDA for Low Resource Topic Modeling

Probabilistic topic models in low-resource scenarios yield less reliable estimates due to the sparsity of discrete word co-occurrence counts, and do not have the luxury of retraining word or topic embeddings with neural methods. In this challenging resource-constrained setting, we explore mixture models that interpolate between discrete and continuous topic-word distributions, utilising pre-trained embeddings to improve topic coherence. We introduce an automatic trade-off between the discrete and continuous representations via an adaptive mixture coefficient, which places greater weight on the discrete representation when the corpus statistics are more reliable. The adaptive mixture coefficient takes into account global corpus statistics and the uncertainty in each topic's continuous distribution. Our approach outperforms the fully discrete, fully continuous, and static mixture models on topic coherence in low-resource settings. We additionally demonstrate the generalisability of our method by extending it to handle multilingual document collections.


Introduction
In topic modeling, the goal is to learn key themes in a corpus for exploratory document analysis (Boyd-Graber et al., 2017). Latent Dirichlet Allocation (LDA; Blei et al., 2003) has been the bedrock of topic modeling and remains a hard-to-beat baseline for the general scenario, which models only words and documents.
We examine topic modeling in a low-resource data setting (Hao et al., 2018), which has seen little attention but is commonly encountered in the digital humanities, where document collections are potentially small (Jockers and Mimno, 2013; Schöch, 2017; Navarro-Colorado, 2018). In such scenarios, word co-occurrence statistics are less reliable due to the sparsity of discrete counts.
With the rise of neural word embeddings (Mikolov et al., 2013), the de facto approach to improving over discrete models has been to utilise continuous representations (regardless of whether the setting is low resource). Early work by Liu et al. (2015) introduced topic-dependent word embeddings, while others subsequently used embeddings to influence the discrete topic-word distribution (Zhao et al., 2017; Dieng et al., 2019). However, the low-resource scenario constrains us to existing pre-trained embeddings, as the number of training documents is limited to several thousand and is thus prohibitively small for training neural models (Srivastava and Sutton, 2017; Zhu et al., 2018; Liu et al., 2019; Hu et al., 2020; Zhu et al., 2020).
We therefore consider approaches that do not require further tuning of embeddings and that operate in the continuous space within the well-established LDA probabilistic inference framework. There have been multiple attempts to replace discrete words with pre-trained embeddings (replacing the multinomial topic-word distribution with a continuous topic-word distribution), doing away with discrete words completely (Das et al., 2015; Batmanghelich et al., 2016; Xun et al., 2017). Given the dominance of pre-trained word embeddings in modern NLP, would continuous representations outperform discrete representations even in low-resource settings? Surprisingly, we find that discrete LDA outperforms its fully continuous counterpart on topic coherence measures which correlate with human judgement (Lau et al., 2014).
How then can we utilise pre-trained continuous representations to learn better topics? A natural direction is hybrid models based on statistical counts and pre-trained neural representations (Neubig and Dyer, 2016). Early work by Nguyen et al. (2015) used a mixture of discrete and continuous topic-word distributions with static mixture coefficients. However, we find that this does not improve over discrete LDA, which motivates a more nuanced treatment of the mixture coefficient.
In this work, we introduce an adaptive mixture coefficient specific to each word and each topic, which is updated at every step of Collapsed Gibbs Sampling (Griffiths and Steyvers, 2004). The intuition is as follows: topic anchor words (Lund et al., 2017), which have a stronger signal from corpus statistics, should rely more on the discrete distribution, while infrequent words should rely more on their embeddings (pre-trained on a large external corpus). Crucially, we do not assume any prior knowledge of the corpus used to train the word embeddings, and our parameterisation depends on the uncertainty of the continuous topic distributions at the current state of the Markov Chain during Gibbs Sampling. Our contributions are as follows:
1. By using adaptive mixed representations for the observed word with a data-dependent parameterisation, we provide an automatic trade-off between continuous and discrete representations during inference. Our method requires no additional tuning and relies purely on corpus statistics and statistics gathered from the current state of the Markov Chain.
2. We illustrate the extensibility of our approach to LDA variants with a combined topic model, Cross-lingual Adaptive LDA, and show that adaptive mixing can balance between discrete and continuous representations for better topic coherence on both monolingual and multilingual datasets.
Background

Unsupervised Learning with LDA

Discrete LDA (Blei et al., 2003) describes a generative probabilistic model of a corpus with latent topics. Formally, we define a corpus with D documents and K topics, where each document has a multinomial distribution over topics, Θ = {θ_1, ..., θ_D}, and each topic has a multinomial distribution over words, Φ = {φ_1, ..., φ_K}. Θ and Φ are the sets of document-topic and topic-word distributions respectively. LDA relies on discrete counts and co-occurrence statistics, and therefore has poorer estimates in low-resource scenarios due to data sparsity.
Gaussian LDA (Das et al., 2015) is a variant of LDA which operates in a continuous vector space rather than on discrete words. Each word is represented by an M-dimensional vector v ∈ R^M drawn from a topic-specific multivariate Gaussian; that is, for K topics there are K Gaussian distributions. While there have been extensions to more complex continuous distributions such as the von Mises-Fisher (Batmanghelich et al., 2016; Li et al., 2016b), we opted to work with a simpler distribution to demonstrate the approach, which can subsequently be extended in future work.
Polylingual LDA (Mimno et al., 2009) studies LDA across multiple languages using parallel corpora. The model assumes that the document-topic distribution θ_d is shared across languages, and that each language-specific topic has its own multinomial topic-word distribution, Φ^1, Φ^2, due to the discrete nature of words. Mimno et al. (2009) and Ni et al. (2009) showed that polylingual topic models can infer topic structure in multilingual corpora.
Latent Feature Topic Models A natural extension to discrete-only or continuous-only representations is to model a word as being sampled with some probability from either its discrete or its continuous component. Nguyen et al. (2015) introduced an interpolation between the continuous and discrete representations, but converted the continuous representations back into a discrete probability over word types by learning latent feature weights.

Discrete-Continuous Mixture LDA
We first establish an incremental extension to the Latent Feature Topic Model using a mixture of discrete categorical and continuous Gaussian distributions. We adopt a mixture model where each word has some probability of coming from either its categorical (discrete) or its Gaussian (continuous) distribution. The generative process for this model with K topics is as follows:

1. For each topic k: draw φ_k ∼ Dir(λ), Σ_k ∼ W^{-1}(Ψ, ν_0), and µ_k ∼ N(µ_0, Σ_k / κ).
2. For each document d: draw θ_d ∼ Dir(η).
3. For each word position i in document d: draw z_{d,i} ∼ Cat(θ_d); with probability π emit the discrete word w_{d,i} ∼ Cat(φ_{z_{d,i}}), otherwise emit the continuous vector v_{d,i} ∼ N(µ_{z_{d,i}}, Σ_{z_{d,i}}),

where W^{-1} is the Inverse Wishart distribution, Ψ is the normalised precision matrix, ν_0 is the degrees of freedom, µ_0 is the prior mean for each Gaussian topic, and π is the mixture coefficient.
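As a concrete illustration, the generative process above can be sketched in a few lines; the toy sizes and hyperparameter values below are ours, not the paper's, and the Inverse Wishart prior is drawn with SciPy for convenience.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
K, V, M = 4, 50, 8                  # toy sizes: topics, vocab, embedding dim
eta, lam, pi = 1.0, 0.01, 0.5       # Dirichlet priors and mixture coefficient

# Topic parameters: a categorical phi_k over word types and a Gaussian (mu_k, Sigma_k)
phi = rng.dirichlet(np.full(V, lam), size=K)
Sigma = invwishart.rvs(df=M + 2, scale=np.eye(M), size=K)
mu = np.stack([rng.multivariate_normal(np.zeros(M), S) for S in Sigma])

def generate_doc(n_words):
    theta = rng.dirichlet(np.full(K, eta))       # document-topic proportions
    doc = []
    for _ in range(n_words):
        k = rng.choice(K, p=theta)               # topic assignment z_{d,i}
        if rng.random() < pi:                    # discrete component
            doc.append(("word", rng.choice(V, p=phi[k])))
        else:                                    # continuous component
            doc.append(("vec", rng.multivariate_normal(mu[k], Sigma[k])))
    return doc

doc = generate_doc(20)
```

Each emitted token is either a word id (discrete) or an embedding-like vector (continuous), mirroring the two-component emission step.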

Gibbs Sampling for Posterior Inference
Given a corpus, our goal is to infer the posterior distribution over Θ and Φ and latent topic assignments z, given the observations x. We perform inference with collapsed Gibbs sampling (Griffiths and Steyvers, 2004) which can be derived by analytically integrating out Θ and Φ.
The key step in Gibbs sampling draws a new topic assignment z_{d,i} for each word w_{d,i} at index i in document d from the conditional distribution in which the previous assignment is excluded (denoted with \):

p(z_{d,i} = k | z_{\d,i}, x, ϕ, η) ∝ p(x_{d,i} | z_{d,i} = k, z_{\d,i}, x_{\d,i}, ϕ) · (N_{d,k}^{\d,i} + η_k)    (1)

η is the parameter of the Dirichlet prior for the document-topic distribution θ, and ϕ are the parameters associated with the topic-word distribution: either λ, the Dirichlet prior for the multinomial φ, or µ_0, Σ_0, ν_0, κ for the Gaussian. In our proposed model (section 5), ϕ consists of both Dirichlet and Gaussian parameters. The first term on the right of Equation 1 expresses the probability of the i-th word in document d under topic k, while the second term expresses the probability of topic k in document d (Griffiths and Steyvers, 2004). Gaussian LDA modifies the first term to use continuous rather than discrete representations, while cross-lingual models focus on the second term, which reflects document-level sharing.
We focus on the first term to incorporate adaptive mixed representations in section 5.
Mixture Models Let f_1 be a discrete probability mass function with parameters ϕ_1 and f_2 a continuous density function with parameters ϕ_2. The density can be expressed as a convex combination:

p(x | ϕ_1, ϕ_2, π) = π f_1(x; ϕ_1) + (1 − π) f_2(x; ϕ_2)    (2)

Substituting this mixture into the conditional of Equation 1 gives the sampling distribution

p(z_{d,i} = k | z_{\d,i}, x, ϕ, η) ∝ (N_{d,k}^{\d,i} + η_k) [ π · (N_{k,w_{d,i}}^{\d,i} + λ_{w_{d,i}}) / Σ_{w∈V} (N_{k,w}^{\d,i} + λ_w) + (1 − π) · t_{ν_k}(v_{d,i}; µ_k, ((κ_k + 1)/κ_k) Σ_k) ]    (3)

The second term in Equation 3 is the density of v_{d,i} under the multivariate t distribution parameterised by mean µ_k and covariance ((κ_k + 1)/κ_k) Σ_k, with ν_k degrees of freedom; κ is a prior confidence on µ_k and Σ_k (Murphy, 2012). ϕ = {λ, ν_0, µ_0, Σ_0, κ} includes the parameters of both the Dirichlet and Gaussian priors, with the subscript 0 indicating parameters of the conjugate prior. N indicates counts: for the first term on the RHS of Equation 3, N_{k,w_{d,i}} is the count of the word type of token w_{d,i} assigned to topic k, and N_{k,w} is the count of word type w assigned to topic k, with V being the vocabulary.

There are several ways to interpret the mixture coefficient π, which interpolates between the discrete and continuous representations. Both discrete and Gaussian LDA can be viewed as special cases of a two-component mixture model, where π is 1 or 0 respectively. π can also be viewed as a tunable hyperparameter that emphasises either representation depending on the availability of discrete word units or the quality of the embeddings (Nguyen et al., 2015).
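As an illustration of this two-component likelihood, the sketch below evaluates π f_1 + (1 − π) f_2 for one topic, with the continuous component as a multivariate t as described above. All parameter values are toy choices of ours, not fitted quantities.

```python
import numpy as np
from scipy.stats import multivariate_t

# Toy parameters for a single topic k (hypothetical values, not fitted)
V, M = 10, 3
phi_k = np.full(V, 1.0 / V)              # discrete topic-word pmf f1
mu_k, Sigma_k = np.zeros(M), np.eye(M)   # posterior-predictive t parameters for f2
nu_k, kappa_k = 5.0, 2.0
pi = 0.5

def mixed_likelihood(word_id, word_vec):
    """Convex combination pi * f1(w) + (1 - pi) * f2(v) for one topic."""
    f1 = phi_k[word_id]
    f2 = multivariate_t.pdf(word_vec, loc=mu_k,
                            shape=(kappa_k + 1) / kappa_k * Sigma_k, df=nu_k)
    return pi * f1 + (1 - pi) * f2

val = mixed_likelihood(0, np.zeros(M))
```

In the sampler, this quantity plays the role of the bracketed term in the topic-conditional for a single candidate topic k.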

Perspectives on π as a Static Random Variable Informed by Observations
From a Bayesian perspective, the mixture coefficient π ∈ [0, 1] can be modelled as a random variable following a Beta distribution. This provides a distribution over component proportions (discrete or continuous) with useful conjugate properties. By Bayes' rule, posterior inference of π is proportional to the prior times the likelihood: p(π | o) ∝ p(π) p(o | π). Here the observations o correspond to the discrete and continuous representations.

[Table 1 ("LDA types" vs. their "Topic-Word" terms; notation): N_{k,w_{d,i}} is the count of the word type of the i-th word in document d, w_{d,i}, in topic k. j indexes the word type of the token w_{d,i} or v_{d,i}, and t_{ν_k} is the probability density function of the multivariate t distribution parameterised by ν_k degrees of freedom, mean µ_k and covariance Σ_k. κ_k = κ + Σ_{w∈V} N_{k,w}, where κ represents the strength of belief in the prior of the multivariate Gaussian. For the cross-lingual model, ν_k and κ_k sum counts over languages 1 and 2. "\" denotes counts excluding that variable. λ ∈ R^{|V|} and η ∈ R^K are hyperparameters of the Dirichlet priors on the topic-word and document-topic distributions respectively.]
It can be shown, due to the conjugacy of the Beta-binomial distribution, that when the prior p(π) is Beta(α_0, β_0), the posterior p(π | o) is also a Beta distribution, where α′ and β′ are the counts of words that have a discrete and a continuous representation available, and α_0 and β_0 are set to 1 in the absence of any information.
π ∼ Beta(α_0 + α′, β_0 + β′)    (4)

Note that with modern word embeddings such as FastText (which can generate a representation for previously unseen vocabulary words from their character n-grams) and byte-pair-encoding methods, both discrete and continuous representations are almost always observed together, so α′ ≈ β′ ≈ |V|. When |V| is large, E[π] = 0.5 with Var[π] ≈ 0 (for a vocabulary size of just 1000, Var[π] = 0.00026). Unfortunately, this view is overly "naive", as the continuous representations are not true observations but learned representations, which should not constitute full observation counts. We refer to this setting as "Static Mixing (SMIX)" in section 8, where we directly adopt π = 0.5.

Adaptive Mixture Coefficient π_{k,j}

We recommend a more pragmatic view for balancing between (noisy) learned word embeddings and discrete counts by modelling the mixture coefficient as a random variable specific to each topic k and word type j. At inference time, we sample

π_{k,j} ∼ Beta(α_{k,j}, β_k)    (5)

where the α parameter is specific to each word type and each topic, and the β parameter is topic specific, to compute Equation 3.
The parameter α_{k,j} represents the concentration on the discrete representation, while β_k represents the concentration on the continuous representation. As we do not assume any knowledge of the external corpus used to train the continuous representations, the β parameter is agnostic to the word type. On each Gibbs sampling update, α_{k,j} is updated from the discrete counts of the categorical distribution, and β_k is updated based on the uncertainty in the t distribution, as measured by the trace of the covariance matrix Σ_k.
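Once α_{k,j} and β_k are available, the per-topic, per-word draw of the mixture coefficient can be sketched as follows; the numeric values are hypothetical, chosen only to contrast a frequent anchor word with a rare word.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pi(alpha_kj, beta_k):
    """Draw the adaptive mixture coefficient for topic k and word type j."""
    return rng.beta(alpha_kj, beta_k)

# A frequent anchor word (large alpha) should lean on the discrete component,
# while a rare word (small alpha) should lean on its embedding.
pi_anchor = np.mean([sample_pi(50.0, 1.2) for _ in range(2000)])
pi_rare = np.mean([sample_pi(0.4, 1.2) for _ in range(2000)])
print(pi_anchor > pi_rare)   # True: anchor words weight the discrete side more
```

Since E[Beta(a, b)] = a/(a + b), a large count-driven α pushes π towards 1 (discrete) while the topic-level β pulls it back towards the continuous component.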

Adaptive α k,j Based on Counts
We specify corpus-specific α priors, α^0_j, for each word type indexed by j in the vocabulary, as the number of word counts N_{k,j} normalised by K, the number of topics, and by the relative proportion of the number of documents D to the number of unique vocabulary words |V|.
Intuitively, we expect that if a word has a higher frequency in the corpus, its statistics based on discrete counts are more reliable. However, if |V| ≫ D, count statistics become less reliable. The α_{k,j} parameter at each step, where N_{k,j} is the number of times the word type at vocabulary index j was assigned to topic k, is

α_{k,j} = α^0_j + N_{k,j}    (6)

which takes a similar form to the regular closed-form conjugate posterior update in Equation 4 for discrete counts.

Adaptive β k Based on Topic Uncertainty
Recall that while counts are appropriate for the discrete case, continuous representations are learned from an external corpus and should not constitute full observation counts. Hence there is no closed form update for the membership of the continuous representations (Koller and Friedman, 2009). Instead we let β k be a random variable which reflects our current confidence in the multivariate t distribution indexed by topic k.
We approximate the uncertainty of the k-th topic distribution by the sum of the eigenvalues of the square root of the topic covariance matrix Σ_k, equivalently written tr(√Σ_k). We formulate β_k in terms of the constants M (the number of dimensions of the multivariate Gaussian) and K, and the (non-constant) Σ_k, which is updated at every step of Gibbs Sampling:

β_k = M / (K · tr(√Σ_k)) ≈ M / (K · Σ_i L_{k,(i,i)})    (8)

The intuition for the inverse relationship between tr(√Σ_k) and β_k is as follows: if the topic has high variance, then β_k should be smaller, as we have less confidence in its density function. The square root is a computational convenience for working with the Cholesky decomposition L_k^T L_k = Σ_k, where the last step assumes most of the variance is contained along the diagonal of L_k. In the following, we simplify the notation of L_k to L.
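Assuming the form β_k = M / (K · tr(√Σ_k)) described above, the computation reduces to an O(M) read off the diagonal of the Cholesky factor the sampler already maintains. A minimal sketch (toy diagonal covariance, so the approximation is exact):

```python
import numpy as np

def beta_k(L, K):
    """beta_k for one topic: M / (K * tr(sqrt(Sigma_k))), approximating
    tr(sqrt(Sigma_k)) by the sum of the diagonal of the Cholesky factor L."""
    M = L.shape[0]
    return M / (K * np.sum(np.diag(L)))

Sigma = np.diag([4.0, 9.0])          # toy covariance for one topic
L = np.linalg.cholesky(Sigma)        # diag(L) = [2, 3], so tr(sqrt(Sigma)) = 5
print(beta_k(L, K=20))               # 2 / (20 * 5) = 0.02
```

For a diagonal Σ_k the diagonal of L holds exactly the σ_i, so tr(L) = tr(√Σ_k); for the near-diagonal factors observed in practice it is the approximation the text describes.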
We elaborate on the interpretation of β_k in Appendix C. (We verified the diagonal-dominance assumption by inspecting L_k, and found that the off-diagonal entries tended to be smaller by a factor of 3.)

[Figure 1: Cross-lingual Adaptive LDA, with shared continuous parameters µ_k, Σ_k across languages and adaptive π_{k,j} for every word type j and topic k. The word type j corresponds to the i-th token of document d; w_{d,i} indicates a token when it is not being used as a subscript.]

Computational Complexity
We consider the computational cost for every Gibbs Sampling step. The main source of computational complexity comes from inverting Σ k which takes O(M 3 ) when computing the probability density of v d,i in row 2 of Table 1.
Since the covariance matrix Σ_k is symmetric and positive semi-definite, we can utilise the Cholesky decomposition, in which Σ_k is decomposed as a product of upper and lower triangular matrices, Σ_k = L_k^T L_k. Although this takes O(M^3), we pay this cost only once during initialisation. L_k is maintained by performing rank-1 updates and downdates (Seeger, 2004) at every step of Collapsed Gibbs Sampling.
As shown in Das et al. (2015), calculating the probability density then takes O(M^2) instead of O(M^3). Our proposed prior for β_k sums the diagonal of L_k, which takes O(M) with little to no constant overhead.
Therefore each Gibbs Sampling step takes O(KM^2), where K is the number of topics whose p(v_{d,i} | z_{d,i} = k, ϕ, x) we need to compute. This is parallelisable to O(M^2) as each term can be computed independently.
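The rank-1 update that maintains the factor can be sketched with the textbook algorithm Seeger (2004) describes; this is our own implementation, not the authors' code, and it uses NumPy's lower-triangular convention Σ = L L^T (the paper writes the transpose convention).

```python
import numpy as np

def chol_update(L, x):
    """Rank-1 update: given lower-triangular L with Sigma = L @ L.T, return the
    Cholesky factor of Sigma + x x^T in O(M^2) rather than re-factorising in O(M^3)."""
    L, x = L.copy(), x.copy()
    M = len(x)
    for i in range(M):
        r = np.hypot(L[i, i], x[i])          # new diagonal entry
        c, s = r / L[i, i], x[i] / L[i, i]   # rotation coefficients
        L[i, i] = r
        L[i+1:, i] = (L[i+1:, i] + s * x[i+1:]) / c
        x[i+1:] = c * x[i+1:] - s * L[i+1:, i]
    return L

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
Sigma = A @ A.T + 5.0 * np.eye(5)            # a positive-definite "covariance"
L = np.linalg.cholesky(Sigma)
x = rng.normal(size=5)
L_new = chol_update(L, x)                    # factor of Sigma + outer(x, x)
```

The corresponding downdate (subtracting x x^T when a word leaves a topic) follows the same loop with flipped signs, subject to the result remaining positive definite.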

Cross-lingual Adaptive LDA
The following section describes the extension of our work from the monolingual to the cross-lingual setting. To test the robustness of our proposed model and its extensibility to other models, we study topic coherence in multilingual settings, where the quality of word embeddings is thought to be worse than for monolingual embeddings. We introduce a new topic model for continuous multilingual representations building on our adaptive sampling scheme: Cross-lingual Adaptive LDA (Figure 1). Following Mimno et al. (2009), we assume that the document-topic distribution θ_d is shared across paired language documents, and follow a bag-of-words assumption, i.e., documents need not be sentence- or word-aligned. We additionally assume that the multilingual word embeddings v^1, v^2 have been mapped into the same embedding space, and adopt a shared Gaussian mean µ_k and covariance Σ_k across languages. This reduces the number of parameters and, importantly, ensures a continuous mapping across languages. Although parameter sharing does not necessarily affect topic coherence when measured within each language, omitting it results in very poor cross-lingual document-topic representations. We checked this assumption by inspecting the learned topics without parameter sharing and found that topic indexes were mismatched across languages: Topic 5 in English may be about sports while Topic 5 in French may be about medicine.

Adaptive Mixing for Cross-lingual LDA
For the cross-lingual setting, our parameterisation of Equation 5 takes into account the language ℓ ∈ L for word type j and topic k:

π^ℓ_{k,j} ∼ Beta(α^ℓ_{k,j}, β_k)

where α^ℓ_{k,j} is computed from the language-specific discrete counts, while β_k is shared across languages through the shared Gaussian parameters. Similar to the low-resource monolingual setting, our approach relies on existing pre-trained multilingual word embeddings. Note that each language may have a different vocabulary size.

Experimental Setup
We conduct experiments on a standard monolingual dataset and a multilingual Wikipedia dataset, reflecting a resource-constrained setting by reducing the number of training documents. Our experiments investigate the following:
• Does an adaptive mixture coefficient perform better than the fully continuous, fully discrete, and static mixture coefficients?
• How do the various mixture models perform across different numbers of training documents?
We compare the following models in Table 2 ("-" indicates the cross-lingual case in Table 3); SMIX is as described in subsection 4.2:

Datasets
We use the 20 Newsgroups dataset (20NG), a common text-analysis dataset containing around 18,000 documents in 20 categories. We perform stratified shuffled sampling, holding out 7000 documents as a test set and varying the number of training documents from 1000 to 8000. For each model and each training size, we present results averaged across 5 random splits of the dataset.
Since the goal is to model the corpora at hand, our main results are evaluated on a held-out test set drawn from the same corpora. We additionally evaluate following Röder et al. (2015); GAUSS performs better in that setting, and we discuss possible reasons in Appendix F.
Wikipedia paired-document corpus. For the multilingual scenario, we utilised a Wikipedia dataset (Sasaki et al., 2018) constructed automatically by following inter-language links to the most relevant foreign-language document. For the multilingual setting, 1000 test pairs were standardised across all languages, and the training data consisted of 8000 randomly selected document pairs for each language. We performed shuffled sampling on the training data for 5 random splits of 1000 and 7000 training document pairs.
Note that low-resource topic modeling is not equivalent to low-resource languages: a language can be high resource while the collection of documents being modeled is small.

Preprocessing Standard text preprocessing steps were applied. Stopwords, digits, punctuation, words that appeared fewer than 5 times, and the 10 most frequent words were removed for efficiency. Wikipedia articles were restricted to their first 200 words, and document titles were removed.
Model Settings All experiments (both 20NG and the multilingual experiments) were conducted with pre-trained multilingual word embeddings from the MUSE library (Conneau et al., 2017). We trained for up to 100 iterations and checked for convergence by inspecting mixing of the posterior topic-word distributions.
Hyperparameters We initialised the prior mean µ_0 and covariance Σ_0 to the empirical mean and sample covariance respectively, based on a random assignment of words to topics. Following Das et al. (2015), we initialise κ to 1 and ν_0 to the embedding size M = 300. The parameters of the Dirichlet priors η and λ are set to 1 and 0.01 respectively, and K = 20. The same hyperparameter settings were used in the multilingual setting. All parameters of our proposed approach are based on corpus statistics and existing parameters such as the number of topics and the embedding size.

Topic Coherence Measure
Topic models are often evaluated on the likelihood of held-out documents. However, likelihoods from a discrete probability mass function and a continuous probability density function are not directly comparable. Instead, we compute the coherence score S_k of topic k using normalised pointwise mutual information (NPMI; Bouma, 2009), which has been found to correlate with human judgement of topic quality (Lau et al., 2014). We also evaluate on the closely related C_v metric (see Appendix F) of Röder et al. (2015).
NPMI ranges over [−1, 1], where −1 indicates a pair of words never co-occurs and 1 indicates it always co-occurs. The score of each topic S_k is computed from the word-pair combinations of the top T words returned by that topic:

NPMI(w_i, w_j) = log( p(w_i, w_j) / (p(w_i) · p(w_j)) ) / ( − log(p(w_i, w_j) + ϵ) )    (12)

We extract word co-occurrence statistics from the held-out documents to compute p(w_i) and p(w_i, w_j), and set ϵ to 1e−12 to avoid taking the logarithm of 0. NPMI averaged across all topics, (1/K) Σ_k S_k, is reported in Tables 2 and 3. Note that the standard metric in Equation 12 encounters division by zero when p(w_i) · p(w_j) = 0, a case which frequently occurs in our low-resource setting. We elaborate on this in Appendix D.
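The scoring rule for a single word pair, including the zero-probability conventions detailed in Appendix D, can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def npmi(p_wi, p_wj, p_joint, eps=1e-12):
    """NPMI for one word pair under the paper's conventions: skip the pair when a
    marginal is 0 (train/test vocabulary mismatch); score -1 when both marginals
    are non-zero but the pair never co-occurs; eps guards the log of the joint."""
    if p_wi == 0 or p_wj == 0:
        return None
    if p_joint == 0:
        return -1.0
    pmi = np.log(p_joint / (p_wi * p_wj))
    return pmi / -np.log(p_joint + eps)

print(npmi(0.1, 0.1, 0.1))   # perfectly co-occurring pair: ~1.0
print(npmi(0.2, 0.2, 0.0))   # never co-occurs despite frequent words: -1.0
```

A topic's score S_k is then the average of this quantity over the top-T word pairs, and the reported metric averages S_k over topics.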

Results and Discussion
Finding 1: Adaptive Mixing performs best in resource-constrained settings. We see in Table 2 that the adaptive mixture coefficient performs better in more resource-scarce settings and, after a certain point, is nearly equivalent to discrete LDA. These results are in the direction we expect: the discrete model performs increasingly well with larger corpus sizes. Gaussian LDA (GAUS) performs poorly with an increasing number of training documents. Its authors report better performance using pointwise mutual information (PMI), which assigns high scores to rare words such as human names ("scott, graham, walker, ...") that are not representative of themes; in this work, we evaluate using normalised PMI (Bouma, 2009), which corrects for this. GAUS's poor performance is somewhat surprising given the dominance of neural methods in modern NLP, and motivates our analysis (see Observation 1 and Observation 2) in the next section.
Interestingly, even with a less optimal continuous distribution, the adaptive method is able to balance between both representations with low number of training documents, and has a 'jump-start' using embeddings. We note that ADAP performs slightly less convincingly in the multilingual setting in terms of achieving statistical significance (not poorer in absolute terms), which could be due to poorer quality of multilingual embeddings.
Finding 2: A static mixture coefficient of π = 0.5 performs poorly, and while this could potentially be tuned for better performance, our adaptive method requires no tuning at all. This is discussed further in subsection 8.6.

[Table 4: Top topic words on 20NG; bolded words are common across both topics. ADAP (adaptive π) is able to construct topics with little training data (1000 docs), and correctly assigns human names to their ground-truth topic.]
One might expect that SMIX should be no worse than DISC or GAUSS, since it has access to both discrete and continuous distributions. However, the results suggest that weighting the continuous and discrete topic representations equally prevents the model from learning effectively when they conflict: e.g., if the continuous component prefers topic 15 and the discrete component prefers topic 3, weighting them in equal proportions hinders the Gibbs Sampling updates.

Analysis
Observation 1: GAUS produces oddly narrow topics based on names (Table 4), American cities, directions (north, south, east, west), etc. This phenomenon is present in both the monolingual and multilingual models. While these groups of words may be semantically close, they are not representative 'themes' of a corpus.
This may be attributed to pre-training via the skip-gram loss to predict neighbouring words (Mikolov et al., 2013). Words used in similar contexts have similar embeddings, and the more unique the context, the narrower the word clusters. To verify this, we compared word clusters from a Gaussian mixture model with the same K (Bishop, 2006), which uses no corpus information. We observe a high word overlap with the topics from GAUS (see Appendix E), indicating that the continuous representations dominate the corpus co-occurrence statistics.
Observation 2: GAUS exhibits a rich-get-richer phenomenon. Figure 2 shows the size of topics produced by different models on 20NG. With the exception of very narrow clusters of words, most words collapse onto a single topic for GAUS (Figure 2b). Once many words have been assigned to one topic, that topic's covariance Σ_k becomes much larger than the others', so subsequent v_i have a higher relative density under that topic during Gibbs Sampling.
Our proposed adaptive π (Figure 2d) counteracts this effect better than the static π (Figure 2c). If Σ k is large, to balance the effect of words having a higher density under topic k, the algorithm samples a larger π, thereby placing less weight on the continuous representation.
Observation 3: ADAP is flexible and produces reasonable topics. Discrete LDA does not perform well with little training data due to the sparsity of word co-occurrences. Table 4 shows that ADAP does not suffer from this and can make up for the lack of training data, producing a topic about 'government' and 'law'. Next, while GAUS clusters all human names together based on the embedding space, ADAP is not overly reliant on embeddings and correctly assigns 'Paul' and 'Mary' to their ground-truth topic of Christianity. Additional topics and NPMI coherence scores are available in Appendix G.

Stability of Mixture Coefficient
As our experiments were conducted with a fixed number of topics, we study the expectation of α, β, and π under a varying number of topics (K from 20 to 200).

[Figure 3: Stability of the adaptive mixture coefficient π_{k,j} with an increasing number of topics on 20NG using 7000 documents. α, β, π are expected values across all vocabulary words and all topics. The expected values vary smoothly with increasing K.]
We approximate the expectation by the arithmetic average, e.g. E[π] ≈ (1 / (K|V|)) Σ_k Σ_j π_{k,j}; note that α and β take on different values for each word and topic during Gibbs Sampling. We observe that while E[π] is close to 0.5 for K = 20 for ADAP, it significantly outperforms SMIX (π = 0.5) in Table 2. This lends confidence to the interpretation that the adaptive mixture coefficient π_{k,j} itself contributes to the better performance, as opposed to simply providing a better static π.

Conclusion
Low resource scenarios present an interesting challenge to topic modeling due to sparsity of counts and a lack of data to train neural models. Our work proposes an automatic trade-off between externally trained continuous representations and traditional co-occurrence count-based statistics that is specific to each word and topic. The method accounts for variations in number of topics and embedding dimensions, and requires no additional tuning beyond existing methods.
Importantly, it requires no additional retraining of word embeddings or learning of topic embeddings, allowing us to rely solely on pre-trained representations and existing corpus statistics. We showed the efficacy and extensibility of our approach on a monolingual and a multilingual dataset, while introducing a new Cross-lingual Adaptive LDA topic model in the process. In future work, we aim to study different low-resource scenarios (e.g., when there are many infrequent words such as named entities) and their interaction with different embedding methods.

The full inference algorithm is given in Algorithm 1. For details on the parameterisation of the multivariate t, the update of µ_k, and the computation of Ψ_k, we refer readers to Murphy (2012). For updates and downdates of L_k, we refer readers to Seeger (2004) and Das et al. (2015).

B Accounting for Uncertainty in the Multivariate t Distribution
We present a small modification to the calculation of the density of the word vector v_{d,i} for each topic (row 2 of Table 1). At each step of Gibbs Sampling, the model samples a topic based on the relative likelihood of v_{d,i} being drawn from the t distribution of topic k. We observe that in Equation 1 the second term is dominated by the first, where x_{d,i} is a word vector representation v_{d,i}. In high dimensions, p(v_{d,i} | z_{d,i} = k, ϕ, z_{\d,i}, x) becomes highly skewed towards a certain topic, such that the influence of the document structure becomes negligible. This motivates a correction to the first term, as the embeddings are pre-trained rather than a true signal. We correct the degrees of freedom ν_k to better account for uncertainty in the embedding representations.

B.1 Rescaling the Degrees of Freedom ν_k
As given by Murphy (2012), ν_k = ν_0 + N_k, where N_k is the number of words assigned to topic k and M is the embedding dimension. Upon initialisation, under a random assignment of words to topics, E[N_k] = |Ṽ|/K, where |Ṽ| counts all the (non-unique) words in the corpus. Since |Ṽ| is very large for a typical corpus and |Ṽ|/K ≫ M, the degrees of freedom ν_k are very large, resulting in an approximately normal distribution that is over-confident in its posterior predictions. This effectively overrides the priors Σ_0, ν_0, and µ_0. Hence, we rescale ν_k to a value ν̃_k between 1 and 30 to account for the inherent uncertainty over v_i belonging to any particular topic.
Rescaling ν_k results in a heavier-tailed distribution, which assigns higher density to vectors v further from µ_k. This encourages better mixing during Gibbs Sampling.
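The effect of a small ν̃_k can be checked directly: under a heavy-tailed t, a point some distance from the topic mean receives more log-density than under a near-Gaussian (very large ν) t. The dimension and query point below are arbitrary choices of ours.

```python
import numpy as np
from scipy.stats import multivariate_t

M = 300
loc, shape = np.zeros(M), np.eye(M)
v_far = np.full(M, 0.3)              # a vector some distance from the topic mean

# Near-Gaussian (very large df) vs. rescaled heavy-tailed t (df = 30)
lp_gauss_like = multivariate_t.logpdf(v_far, loc=loc, shape=shape, df=10_000)
lp_heavy = multivariate_t.logpdf(v_far, loc=loc, shape=shape, df=30)
print(lp_heavy > lp_gauss_like)      # heavier tails assign this point more density
```

In the sampler this translates to less extreme relative likelihoods across topics, and hence better mixing of the Markov Chain.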
Comparison with the fully Bayesian treatment. We found this heuristic to be numerically and empirically more stable than a fully Bayesian treatment which encodes higher variance in the t distribution by placing a larger prior on the covariance matrix Σ_0.
First, re-estimating the covariance matrix at every step of Gibbs Sampling is numerically unstable with a large Σ_0. Second, rescaling ν_k guarantees that we maintain a heavy-tailed t distribution at every iteration of Gibbs Sampling, resulting in better mixing of the Markov Chain. By adopting the rescaling heuristic, we can directly set the prior covariance Σ_0 to the sample covariance, removing one ad hoc parameter choice. Since both setting a large prior Σ_0 and scaling ν_k are modeling decisions, we adopt the approach that is numerically and empirically more stable.

C Interpretation of β k
Note that β_k can be interpreted as a random variable drawn from a Gamma distribution with shape parameter 1/K and rate parameter tr(√Σ_k)/M. Equation 8 is then the point estimate of β_k obtained from the expectation of this Gamma distribution, β_k = E[Gamma(1/K, tr(√Σ_k)/M)] = M / (K · tr(√Σ_k)), where β_k ∈ (0, ∞) can be interpreted as real-valued 'counts' for observing the continuous representation. The rate parameter is scaled by 1/M to make the estimate robust to the dimension size: since Σ_k is positive semi-definite and the square root is monotonically increasing, as M increases the trace Σ_i^M σ_i (σ_i ≥ 0, ∀i) increases and β_k decreases.
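This point-estimate view can be checked numerically using E[Gamma(shape, rate)] = shape/rate; the trace value below is hypothetical, and note that NumPy's Gamma sampler is parameterised by scale = 1/rate.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 20, 300
tr_sqrt_Sigma = 12.0                          # hypothetical tr(sqrt(Sigma_k))

shape, rate = 1.0 / K, tr_sqrt_Sigma / M
beta_point = shape / rate                     # = M / (K * tr) = 1.25

# Monte Carlo check of the Gamma expectation
samples = rng.gamma(shape, scale=1.0 / rate, size=200_000)
print(beta_point, samples.mean())
```

The Monte Carlo mean agrees with the closed-form point estimate, confirming the Gamma reading of β_k.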
D NPMI when p(w_i) · p(w_j) = 0

In our implementation of NPMI, we do not consider a pair if either p(w_i) or p(w_j) is 0, as this simply indicates a mismatch between the training and test corpus. However, if both are non-zero and p(w_i, w_j) = 0, then the model has predicted a poor word pair that never co-occurs despite the words individually appearing in the test corpus, and NPMI(w_i, w_j) = −1. This differs from many online implementations of NPMI, which simply set NPMI(w_i, w_j) = 0 when p(w_i) · p(w_j) = 0 and thus do not penalise very poor word pairs of this nature.

We believe the main reason for GAUSS scoring highly on the C_v measure is the scoring of word pairs described above. This is supported by the observation that with some very rare words, the effect of ϵ on the NPMI score within C_v is large, resulting in higher scores than expected. This is described in https://github.com/dice-group/Palmetto/issues/12.

G Topics for 20NG
[Appendix G table: top topic words for Adaptive LDA (ADAP) on 20NG, flattened during extraction. Recoverable themes include government/law (government, rights, law, court, police, guns, congress, freedom), encryption/security (key, chip, clipper, encryption, nsa, security, algorithm, des), cars (cars, engine, miles, ford, oil, tires, road, rear), Armenian/Turkish history (armenians, turkish, armenian, genocide, turkey, muslims), and health (msg, pain, disease, cancer, patients, studies, drug).]