Are Neural Topic Models Broken?

Recently, the relationship between automated and human evaluation of topic models has been called into question. Method developers have staked the efficacy of new topic model variants on automated measures, and their failure to approximate human preferences places these models on uncertain ground. Moreover, existing evaluation paradigms are often divorced from real-world use. Motivated by content analysis as a dominant real-world use case for topic modeling, we analyze two related aspects of topic models that affect their effectiveness and trustworthiness in practice for that purpose: the stability of their estimates and the extent to which the model's discovered categories align with human-determined categories in the data. We find that neural topic models fare worse in both respects compared to an established classical method. We take a step toward addressing both issues in tandem by demonstrating that a straightforward ensembling method can reliably outperform the members of the ensemble.


Introduction
Topic models provide an unsupervised way both to discover implicit categories in text corpora and to estimate the extent to which any given category applies to a specific text item. As such, topic modeling can be viewed as an automated variety of content analysis for text: those two capabilities directly correspond to the practice of developing an emergent coding system via examination of a text collection, and then coding the text units in the collection (Stemler, 2000; Smith, 2000). This form of content analysis is a dominant use case for topic models and therefore it is our focus here.
Explicitly identifying topic models as a tool for content analysis allows us to characterize what makes topic models good: we can measure the extent to which a model achieves the goals of content analysis. This careful consideration of the criteria for "good" topic models is essential because recent results have challenged the validity of the prevailing model evaluation paradigm (Hoyle et al., 2021; Doogan and Buntine, 2021; Doogan, 2022; Harrando et al., 2021). In particular, Hoyle et al. identified a validation gap in the automatic evaluation of topic coherence: metrics like the widely used normalized pointwise mutual information (NPMI) were never validated using human experimentation for the newer neural models once they emerged, and the authors demonstrated that such metrics exaggerate differences between models relative to human judgments. Given that the majority of claimed advances in topic modeling are predicated on these metrics (per Hoyle et al.'s meta-analysis), it would appear that much of the topic model development literature now rests on uncertain ground. Doogan and Buntine also challenged the validity of current automated evaluation measures, and highlighted the disconnect between these measures and the use of topic models in real-world settings.
In this paper, we begin with the needs of content analysis, and we use those needs to argue for specific choices of how to measure topic model performance. We then report on comprehensive experimentation using two English-language datasets, four neural topic models that are representative of the current state of the art, and classical LDA with Gibbs sampling as implemented in MALLET (McCallum, 2002). The results indicate that MALLET is a more reliable choice than the more recent neural models from a content analysis perspective. Taking a step toward addressing these issues, we use a straightforward ensemble method that combines the output of models across runs, which reliably yields better results than the usual practice of running a single model.
To summarize our argument and contributions:
• Automated comparison of topic models should be grounded in a use case, and content analysis is a dominant use case for topic models (§2.2).
• Stability and reliability are necessary, although not sufficient, criteria to ensure the value of a content analysis (§2.3).
• Stability and reliability can be directly measured from model outputs, unlike automated coherence, which prior work has shown is an unreliable proxy for human judgment (§2.4).
• On these metrics, we show that LDA with Gibbs sampling (as implemented in MALLET) is significantly more stable and reliable than newer neural models (§4).
• We present a straightforward ensembling method to mitigate the stability problem (§6).
We release all code and data at github.com/ahoho/topics.

2 What makes a topic model "good"?
In considering how to characterize a topic model that works well, we focus on text content analysis as a dominant use case for topic modeling.

Traditional content analysis
Although content analysis is an extremely broad concept (Krippendorff, 2018), a very widely used paradigm across many disciplines is a manual process of inductive discovery of codesets via emergent coding (Stemler, 2000), which "allows categories to emerge from the material without the influence of preconceptions" (Smith, 2000). Weber (1990) describes a "data-reduction process by which the many words of texts are classified into much fewer content categories," and this invocation of data reduction in a manual setting provides a sense of why topic modeling, a dimensionality-reduction technique, can be a good fit when considering ways to automate the process. Typically the inductive process involves multiple researchers independently reading samples of the text units being analyzed, and proposing categories (usually called "codes") that they see as present and relevant; they then reconcile their independent proposed categories to produce a candidate codeset
with associated definitions and coding guidelines. The candidate codeset is then used by two or more people to independently code (i.e., manually label) a sample of the data, and inter-coder reliability is measured using a chance-corrected agreement measure like Krippendorff's α (cf. Artstein and Poesio, 2008). If an acceptable level of reliability has not yet been achieved, the codeset and coding guidelines are revisited and revised, and another iteration of independent coding and reliability measurement takes place. Once reasonable reliability has been achieved, the final set of categories is considered to reflect true structure underlying the text collection. Sometimes the texts in the collection are then manually coded using those categories in order to support quantitative analysis, possibly with further inter-coder reliability measurement for quality control, although sometimes the set of categories itself is the intended result, not item-level coding.

This process of inductive category discovery contrasts with the use of pre-existing categories, e.g., those coming from relevant theory, and with the use of "manifest" or directly observable characteristics of text. Discussions in the literature often distinguish "quantitative" from "qualitative" content analysis, with the inductive process we describe being associated with the latter category. This terminological distinction may be overly sharp, however; see Schreier (2012) for useful discussion of relationships and differences among quantitative content analysis, qualitative content analysis, and other forms of qualitative research.
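To make the reliability check above concrete, here is a minimal sketch of computing chance-corrected agreement between two coders. It uses Cohen's κ from scikit-learn as a readily available chance-corrected measure; Krippendorff's α, mentioned above, additionally handles more than two coders and missing data and typically requires a dedicated package. The coder labels are hypothetical.

```python
# Minimal sketch (not from the paper): chance-corrected agreement between two coders.
# Cohen's kappa stands in for Krippendorff's alpha because it ships with scikit-learn.
from sklearn.metrics import cohen_kappa_score

# Hypothetical codes assigned independently by two coders to the same eight text units.
coder_a = ["ART", "ART", "SPORT", "POLITICS", "ART", "SPORT", "POLITICS", "ART"]
coder_b = ["ART", "SPORT", "SPORT", "POLITICS", "ART", "SPORT", "ART", "ART"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values near 1 indicate strong agreement beyond chance
```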

Topic modeling for content analysis
The models of interest in this paper are exemplified by Latent Dirichlet Allocation (LDA, Blei et al., 2003), within which each of N documents d is represented as an admixture $\theta_d$ of K topics, and each topic is itself represented as a distribution $\beta_k$ over the vocabulary V. Topics can thus be viewed in two complementary ways, as ranking either the words in the vocabulary or the documents in the collection. These views can be interpreted as corresponding closely to two central elements of a traditional text content analysis. First, the rows in the topic-word distribution matrix $B \in \mathbb{R}^{K \times |V|}$ constitute an inductively determined set of categories analogous to a human-determined codeset; for example, the presence of a topic with top (most probable) words artist, museum, exhibition might correspond to a human analyst identifying the code ART. Second, the columns of the document-topic matrix $\Theta \in \mathbb{R}^{N \times K}$ constitute a soft coding of documents using the categories in B. To help illustrate the first step, Table 1 shows the top words from inferred topic-word distributions $\beta_k$ for two model types over multiple runs.
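As a minimal illustration of these two views, the sketch below builds toy B and Θ matrices and extracts the "codeset" (top words per topic) and a hard "coding" (most probable topic per document). All names and values are hypothetical; in practice a fitted model supplies B, Θ, and the vocabulary.

```python
# Minimal sketch: the two complementary views of a topic model's estimates.
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 6, 4  # toy numbers of topics, vocabulary terms, and documents
vocab = np.array(["artist", "museum", "exhibition", "game", "team", "season"])
B = rng.dirichlet(np.ones(V), size=K)      # K x |V| topic-word distributions (rows of B)
Theta = rng.dirichlet(np.ones(K), size=N)  # N x K document-topic distributions (rows of Theta)

# "Codeset" view: top words per topic, analogous to human-proposed codes.
for k in range(K):
    top_words = vocab[np.argsort(-B[k])[:3]]
    print(f"topic {k}: {' '.join(top_words)}")

# "Coding" view: Theta gives a soft coding of each document; argmax gives a hard code.
print("hard topic assignment per document:", Theta.argmax(axis=1))
```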
Reviewing the use of topic models. Bearing these correspondences in mind, we reviewed the literature to confirm our subjective impression that text content analysis is indeed the dominant use case for topic modeling. Using Semantic Scholar (semanticscholar.org), we collected research studies outside the field of computer science published in 2019-2022 that cite Blei et al. (2003), and selected 50 at random. We excluded studies that cite Blei et al. but do not actually use any topic model, as well as studies that do not involve language data. We retain those that employ topic model variants, such as STM (Roberts et al., 2013). Using Semantic Scholar's reported field of study, disciplines represented include medicine, sociology, business, political science, psychology, economics, and history. We find that 94% of the papers use a topic model for inductive discovery of categories for human consumption, 68% of which go on to actually assign human-readable code labels to topics; and 64% of papers use document-topic probabilities as a form of coding for individual text units. We interpret these results as strongly suggesting that, outside topic model development, the primary use of topic models is an automated form of text content analysis as characterized in §2.1.

Criteria for good content analysis
Having established text content analysis as a central topic modeling use case, we consider criteria for "good" analysis motivated by that use case. These then inform the selection of topic model evaluation metrics in §2.4, helping to ensure a correspondence between the way topic models are evaluated and the reasons people are using them.
One key issue in content analysis is stability or intra-coder reliability: if the same coder were to look at the same data again (say, separated by a long interval to achieve some degree of independence), would they produce the same results? When an individual coder cannot produce stable output, this calls into question the quality of the results they have produced any one of those times.
A second central concern in content analysis is reproducibility or inter-coder reliability: do two or more independent coders looking at the same data agree with each other? In the absence of externally provided coding to compare against, what establishes trust in categories or coding is consensus, what Weber (1990) refers to as "the consistency of shared understandings" between coders.
A third concern that is often discussed is validity: do categories or measurements actually correspond to whatever they are intended to measure (Rubio, 2005)? As Weber (1990) notes, in content analysis this often goes only as far as face validity, i.e., a subjective perception that a measure (or category) appears to be valid. In contrast, Shapiro and Markoff (1997) argue that content analysis "is only valid and meaningful to the extent that the results are related to other measures".
Research in content analysis typically focuses on these three issues (stability, reproducibility, and validity) as necessary considerations when deciding whether a content analysis should be used as the basis for inferences about a dataset. Validity, however, is challenging to assess outside the context of specific research questions (see Grimmer and Stewart, 2013, for an example in political science). We therefore focus on stability and reproducibility as the basis for developing metrics to assess topic models for the automated content analysis use case.
Note that the criteria we emphasize, stability and reproducibility, are necessary to ensure the value of a content analysis, but not sufficient: topic coherence is a complementary and crucial concern (Newman et al., 2010) that requires further investigation, since prior work has shown automated coherence measurements are an unreliable proxy for human judgment (Hoyle et al., 2021).

Operationalizing the criteria
Because topic models are generative models, the development community initially evaluated them using held-out perplexity, i.e., their ability to predict unseen text. However, focusing on the goal of producing categories that humans can understand, Chang et al. (2009) established that perplexity actually correlated negatively with human determinations of coherence as estimated using behavioral measures. Lau et al. (2014) went on to introduce NPMI as an automated coherence metric positively correlated with human preferences. Since then, NPMI has been the most prevalent way to establish that a new topic modeling method is better than the old ones, including the new generation of neural topic models. However, Hoyle et al. (2021) recently identified a validity gap for NPMI: its correspondence to human judgments was never validated for neural topic models, and although recent neural topic models can attain relatively high NPMI, human annotators fail to meaningfully distinguish them from a classical LDA baseline.
That result suggests taking a fresh, well-motivated look at topic model evaluation. Any model evaluation should be grounded in consideration of the model's intended purpose, which leads us to suggest grounding formal evaluation metrics in the content analysis use case. It should be noted that we are focusing on only the most central part of the content analysis use case: Smith (2000) situates codeset discovery and coding within a broader process that begins with identifying the research problem, selecting appropriate materials, etc., and ends with actually using the codeset and coding to generate research findings. Bayard de Volo et al. (2020) situate topic model creation within a corresponding end-to-end workflow; see also Boyd-Graber et al. (2014) for practical discussion of topic modeling, including discussion of other use cases.

2.4.1 Stability

§2.3 notes stability as an important criterion in content analysis. Whether codes are being produced by a human coder or a topic model, if there is meaningful latent structure in the text collection, one would expect either humans or models to consistently uncover that structure.
To ground our evaluation in our use case, we measure the stability of models across hyperparameter settings (for a fixed topic number K). In the absence of an unsupervised metric to optimize or reliable "default" values, a practitioner is forced to explore different hyperparameter settings. All else equal, a topic model that is less sensitive to changes in hyperparameter settings is preferable to one that is more sensitive (we also evaluate results for fixed hyperparameters with different random seeds; see Appendix A.1).
Translating these ideas into a formal measurement, we follow Greene et al. (2014) in operationalizing model stability by measuring the total distances between the topic-word estimates for each run, extending their method to measure stability of both the sorted rows of the topic-word estimates B and the sorted columns of the document-topic estimates Θ; the smaller these distances, the more stable the estimates.
Without loss of generality, we focus on the topic-word distributions to operationalize stability as total topic distance. We collect a set of estimates from m model runs on the same dataset. For each of the $\binom{m}{2}$ pairs of runs, we compute the pairwise distance d between all K topics in each run. We use the Rank-Biased Overlap distance (RBO, Webber et al., 2010), which measures the distance between two rankings while giving more importance to similarity among the top-ranked items, i.e., the measure is top-weighted, making it ideal for measuring the distance between topics (Mantyla et al., 2018). Within a pair of runs $B^{(i)}, B^{(j)}$, the goal is to find a permutation of rows $\pi(\cdot)$ that minimizes the total distance

$$TD\big(B^{(i)}, B^{(j)}\big) = \sum_{k=1}^{K} d\big(\beta^{(i)}_k,\ \beta^{(j)}_{\pi(k)}\big).$$

This problem is an instance of bipartite matching distance minimization, which we solve with the modified Jonker-Volgenant algorithm of Crouse (2016). If the set of $\binom{m}{2}$ total distances $TD$ (i.e., the minimized costs) for one model is significantly smaller than that of a second model, the first model is more stable.
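The following sketch shows one way this computation might look in code: a truncated rank-biased overlap between two topics' ranked word lists, and a minimum-cost matching of topics between two runs using scipy's linear_sum_assignment (a Jonker-Volgenant-style solver). The toy topics are hypothetical, and this is an illustration of the idea rather than the paper's exact implementation.

```python
# Sketch: total topic distance between two runs via RBO and bipartite matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def rbo(list_a, list_b, p=0.9):
    """Truncated rank-biased overlap of two ranked lists (after Webber et al., 2010)."""
    depth = min(len(list_a), len(list_b))
    score, seen_a, seen_b = 0.0, set(), set()
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        score += (p ** (d - 1)) * len(seen_a & seen_b) / d  # top-weighted agreement at depth d
    return (1 - p) * score  # higher means more similar

def total_topic_distance(run_i, run_j):
    """Minimum total RBO distance between two runs, each a list of ranked top-word lists."""
    K = len(run_i)
    cost = np.array([[1.0 - rbo(run_i[a], run_j[b]) for b in range(K)] for a in range(K)])
    rows, cols = linear_sum_assignment(cost)  # optimal permutation pi over topics
    return cost[rows, cols].sum()

# Toy usage with two hypothetical runs of K = 2 topics (top-5 words each).
run1 = [["storm", "hurricane", "winds", "tropical", "cyclone"],
        ["artist", "museum", "exhibition", "painting", "gallery"]]
run2 = [["museum", "artist", "painting", "exhibition", "sculpture"],
        ["tropical", "storm", "cyclone", "hurricane", "rainfall"]]
print(total_topic_distance(run1, run2))
```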
Prior topic modeling work has identified stability as a crucial concern for robust application to the social sciences (Koltcov et al., 2014; Ballester and Penner, 2022), for better incorporation of topic models in downstream automated NLP tasks (Miller and McCoy, 2017), and as a criterion for tuning LDA parameters (Greene et al., 2014), and it has offered ways to improve stability for LDA estimates (Agrawal et al., 2018; Mantyla et al., 2018). Chuang et al. (2015) introduced an interactive tool to help humans assess a topic model's stability. However, in a meta-analysis of 35 papers proposing new "state-of-the-art" neural topic models over the past three years (2019-2022), we find that none of them compared the models on stability.

2.4.2 Inter-coder reliability

§2.3 notes that reproducibility or inter-coder reliability is also a central consideration in content analysis. Going beyond intra-coder consistency, if a set of codes cannot be applied consistently by multiple coders, this also calls into question whether it is doing a good job capturing meaningful content categories.
We treat a topic model ⟨B, Θ⟩ as a coder, and approach inter-coder reliability from the perspective of reproducing categories from other coders who are human, instantiated as a set of human-assigned "ground truth" labels for the documents in the collection. Since what we care about here are the categories discovered by a topic model, not actual labels, we measure the extent to which categories induced by the model align with that ground truth. Intuitively, for example, if documents that are assigned to a topic by the model all have the same ground-truth label, the topic is a good fit for human categorization of the data (and this can be determined just using documents assigned to the topic, without any generation or evaluation of labels). Conversely, if documents all assigned to the same topic in the model have a wide variety of ground truth labels, this mismatch suggests that the topic is missing something important relative to the underlying category structure in the collection.
By taking the most probable topic for a document, $\ell_d = \arg\max_{k'} \theta_{d,k'}$, as its assigned topic or "code", we can apply standard metrics of cluster quality. We borrow exposition of cluster quality metrics from Poursabzi-Sangdeh et al. (2016), with all metrics using the predicted clustering from a model, $L = \{\ell_d : d = 1, \ldots, n\}$, and a given set of gold labels $L^*$.
Adjusted Rand Index. The Rand Index compares all pairs of the two labelings over documents, counting the proportion of pairs that have the same (TP) or different (TN) assignments (Rand, 1971):

$$\mathrm{RI} = \frac{TP + TN}{TP + FP + TN + FN}.$$

The adjusted Rand index further corrects for chance (Steinley, 2004).
Normalized Mutual Information (NMI) measures the mutual information I between two clusterings, normalized by the entropies H of each clustering, and is invariant to cluster permutations (Strehl and Ghosh, 2002).
Purity takes all documents contained in a single predicted cluster and measures the number of associated gold labels that appear in it; it is roughly akin to precision (Zhao and Karypis, 2002). A small number of gold labels present in a predicted cluster means that there is high alignment between the discovered "concept" and the true one.
Purity is not symmetrical, so we define inverse purity as $P^{-1} = P(L^*, L)$, and $P_1$ as their harmonic mean (analogous to $F_1$).
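A small sketch of these alignment metrics follows, using scikit-learn for ARI and NMI and a direct implementation of purity; the toy Θ matrix and gold labels are hypothetical, and the exact metric variants used in our experiments may differ in detail.

```python
# Sketch: alignment between model "codes" (argmax topics) and gold labels.
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def purity(pred, gold):
    """Sum over predicted clusters of the majority gold-label count, divided by n."""
    pred, gold = np.asarray(pred), np.asarray(gold)
    total = 0
    for c in np.unique(pred):
        _, counts = np.unique(gold[pred == c], return_counts=True)
        total += counts.max()
    return total / len(pred)

def alignment_metrics(theta, gold_labels):
    pred = theta.argmax(axis=1)        # hard "code" per document
    p = purity(pred, gold_labels)
    p_inv = purity(gold_labels, pred)  # inverse purity: roles of the clusterings swapped
    return {
        "ARI": adjusted_rand_score(gold_labels, pred),
        "NMI": normalized_mutual_info_score(gold_labels, pred),
        "Purity": p,
        "InversePurity": p_inv,
        "P1": 2 * p * p_inv / (p + p_inv),  # harmonic mean, analogous to F1
    }

# Toy usage: five documents, three topics, hypothetical gold labels.
theta = np.array([[.8, .1, .1], [.7, .2, .1], [.1, .8, .1], [.2, .2, .6], [.1, .1, .8]])
gold = np.array(["ART", "ART", "SPORT", "WEATHER", "WEATHER"])
print(alignment_metrics(theta, gold))
```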
Prior topic modeling work has looked at how well topics discovered by a model align with reference codes (Chuang et al., 2013; Korenčić et al., 2021). However, in the same meta-analysis discussed above, only six of the 35 neural topic modeling development papers compared models on a version of alignment. This suggests that even though stability and alignment have been identified as important and useful criteria in the prior topic modeling literature, especially when using and examining LDA and its variants, they have seen precious little uptake. We hope that our strong use-case motivations and experimental results will change this.
(An alternative would be to use the document-topic estimates as features for predicting labels on held-out training data and calculating a held-out F1 score, i.e., to train a classifier, but this process does not correspond to any common real-world use of topic models.)

Experiments
Having argued that topic models should be subject to evaluations designed with real-world uses in mind, and having motivated specific ways to operationalize evaluative measurements based on criteria that matter for text content analysis, we evaluate nominally "state-of-the-art" topic models to understand how well they perform relative to those criteria.

Datasets
We use two standard English datasets of varying characteristics: 14,000 "good" articles from Wikipedia (Wiki, Merity et al., 2017) and 32,000 bill summaries from the 110-114th U.S. congresses (Bills). ("Featured" Wikipedia articles have an incompatible labeling scheme and are therefore excluded; raw bill data was extracted from https://www.govtrack.us/data/us/.) The documents in both datasets have hierarchical labels, which serve as ground truth when evaluating the quality of the document-topic posteriors (§2.4). The Wiki dataset has 45 high-level and 279 low-level labels; the Bills dataset has 21 high-level and 114 low-level labels. We process each with the standardized setup of Hoyle et al. (2021), setting the vocabulary size to either 5,000 or 15,000 terms, limiting by term frequency (Blei and Lafferty, 2006).
Prior evidence suggests that neural topic models may produce topics with narrower scope than classical models (e.g., agnes_martin, sol_lewitt, minimalism rather than art, painting, museum; cf. Hoyle et al., 2021). We therefore generate held-out sets for both datasets to facilitate exploration of this phenomenon. Namely, we ensure that both the training and held-out sets contain documents from all high-level categories, but partition the low-level categories into seen and unseen labels. For example, Wikipedia articles about television are present in both subsets, but those about 30 Rock episodes are exclusively in the training set whereas Simpsons episodes are unseen. Although not an emphasis of the present work, our high-level conclusions remain the same for the held-out data (i.e., MALLET is better-aligned; Appendix A.1); we leave further analysis to future efforts.
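The sketch below illustrates the kind of split logic described above, assuming each document carries a high-level and a low-level label: low-level categories are assigned wholesale to either the training or held-out side, so every high-level category remains represented in training. The field names and held-out fraction are assumptions, not our exact procedure.

```python
# Sketch: partition low-level categories into seen (training) and unseen (held-out) labels.
import random
from collections import defaultdict

def partition_by_low_level(docs, heldout_frac=0.3, seed=0):
    """docs: list of dicts with 'high' and 'low' label fields (assumed format)."""
    rng = random.Random(seed)
    lows_by_high = defaultdict(set)
    for doc in docs:
        lows_by_high[doc["high"]].add(doc["low"])

    heldout_lows = set()
    for high, lows in lows_by_high.items():
        lows = sorted(lows)
        rng.shuffle(lows)
        n_held = int(heldout_frac * len(lows)) if len(lows) > 1 else 0
        heldout_lows.update(lows[:n_held])  # remaining low-level labels stay in training

    train = [d for d in docs if d["low"] not in heldout_lows]
    heldout = [d for d in docs if d["low"] in heldout_lows]
    return train, heldout
```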

Models and experimental contexts
Classical topic models use Gibbs sampling or variational inference to infer the posteriors over the latent variables; more recent neural topic models use contemporary techniques that involve neural networks, such as variational auto-encoders (Kingma and Welling, 2014).
We evaluate one classical topic model and four neural topic models. Each model is evaluated in one of 16 experimental contexts: a tuple of dataset (Bills, Wiki), vocabulary size (5k, 15k), and number of topics (25, 50, 100, 200).
In light of the finding that automated coherence cannot meaningfully reproduce human judgments (Hoyle et al., 2021), there is no unsupervised metric that we can optimize to avoid the problem of instability, while optimizing for K remains an open research problem. Therefore we vary K and, for all contexts, we train the models ten times using a different set of randomly selected hyperparameters, where value ranges are based on prior literature (§A.3).
Although "optimal" hyperparameters will often change depending on context, we also report results with fixed hyperparameters and varying seeds in Appendix A.1.
MALLET. Given its prevalence among practitioners and strong qualitative human ratings in prior work (Hoyle et al., 2021), as a classical model we use LDA estimated with Gibbs sampling (Griffiths, 2002), implemented in MALLET (McCallum, 2002). While LDA is a common baseline in the topic model development literature, it is often estimated with variational methods, which anecdotally produce lower-quality topics (Goldberg, 2020).

SCHOLAR. A popular neural alternative to the structural topic model (Roberts et al., 2014), flexibly incorporating supervised signals and external covariates into the model (Card et al., 2018).

SCHOLAR+KD. Hoyle et al. (2020) apply knowledge distillation (KD) to improve on SCHOLAR using a BERT-based autoencoder. Gao et al. (2021) show that domain experts prefer the outputs of an adapted SCHOLAR+KD over other models (MALLET, ETM; Dieng et al., 2020).

Dirichlet-VAE. The Dirichlet-VAE (D-VAE, Burkhardt and Kramer, 2019) is a variant of LDA that (a) uses a VAE to approximate the posterior over the latent document-topic distribution, and (b) follows PRODLDA by using unnormalized estimates of the topic-word values β, as opposed to a proper distribution. Annotators rate D-VAE's topics similarly to MALLET (Hoyle et al., 2021).

Contextualized Topic Model. Typically, VAE-based neural topic models encode the bag-of-words representation of a document with a neural network to parameterize that document's distribution over topics. The popular model introduced in Bianchi et al. (2021a) extends this representation with a contextualized document embedding from a large pretrained language model.

Results
Recall that measuring stability is motivated by intra-coder reliability in content analysis: producing the same result every time increases confidence that the analysis reflects actual latent structure in the data. MALLET is significantly more stable than other models across the vast majority of contexts, often by a large margin (Table 2). Most striking are the topic-word distributions B: none of the neural models even approach its consistent level of stability. CTM sometimes achieves comparable stability for Θ; this may be due to its use of pretrained document embeddings, which are transformed in order to parameterize the estimate.
Recall also that alignment is motivated by inter-coder reliability in content analysis: is the model, in the role of analyst, agreeing with human-derived categorization of the data? MALLET shows strong consistency in providing the numerically best alignment with human categorization across datasets (Table 3). Among neural models, D-VAE and SCHOLAR sometimes achieve statistically indistinguishable performance, but they do not approach MALLET's consistency across datasets, numbers of topics, and metrics. Now, we argued in §2.4.1 that practitioners do not have access to optimal hyperparameters for a given model, because what is optimal will depend on the dataset, number of topics, preprocessing, and other experimental decisions. The above results show that model estimates can be very sensitive to different hyperparameter settings, and they clearly favor MALLET on our metrics. However, in many real-world scenarios, a practitioner may simply rely on some "default" settings. We therefore also evaluate models for fixed hyperparameters using reasonable default values. To generate the defaults, for each dataset and model we find the hyperparameter settings that yield the best alignment performance across experimental contexts (vocabulary size, number of topics, alignment metric, and label hierarchy). Specifically, within each context, we first rescale the alignment metric values over the 10 runs for that model so that different metrics are comparable; we then select the hyperparameters that have the largest average values across all contexts, for a given dataset. Finally, to approximate a common use case and to avoid overfitting to the dataset, we use the hyperparameters obtained from one dataset to train models on the other dataset (e.g., we select defaults based on the Bills alignment metrics and use them for new models run on Wiki; defaults in Appendix A.3).
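A compact sketch of this selection procedure is below, assuming a table with one row per (context, hyperparameter setting, run) and an alignment score column; the column names and the min-max rescaling are illustrative assumptions.

```python
# Sketch: choose "default" hyperparameters by rescaling scores within each context
# and averaging across contexts.
import pandas as pd

def select_default_hyperparameters(runs: pd.DataFrame) -> str:
    """runs columns (assumed): 'context', 'hyperparams', 'score'."""
    def rescale(scores):
        span = scores.max() - scores.min()
        return (scores - scores.min()) / span if span > 0 else scores * 0.0

    runs = runs.assign(rescaled=runs.groupby("context")["score"].transform(rescale))
    # Average the rescaled scores per hyperparameter setting and pick the best one.
    return runs.groupby("hyperparams")["rescaled"].mean().idxmax()
```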
Results are in Appendix A.1. Unsurprisingly, fixing "good default" hyperparameters for the neural models improves their stability and alignment. In particular, D-VAE has competitive alignment metrics in the |V| = 5k case, although it is hampered by its relatively poor stability. MALLET's stability is marginally affected: while it is no longer as consistently dominant, it remains more stable and better-aligned in the majority of contexts.

A close reading of model stability
Table 1 illustrates corresponding versions of a topic from different runs of D-VAE and MALLET. For a given context (here, K = 50, |V| = 15,000), we collect the topic-word estimates B across the 10 runs for each of the two models, each run using a different set of randomly selected hyperparameters. One weather-related topic was chosen manually from one run as the "base" topic, and then the closest corresponding topics in the other nine runs for the same model were ranked by their RBO distance to that topic. The nearest, median, and most-distant topics in that ranking, shown in the table, therefore capture the range of variation across different hyperparameter settings.
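The selection of the nearest, median, and farthest matches could be done roughly as in the sketch below, reusing the rbo helper from the stability sketch above; the base topic and runs are whatever the analyst chooses, and this is illustrative rather than the exact procedure used to build the table.

```python
# Sketch: for a chosen base topic, find its closest topic in each other run (by RBO
# distance), then sort those matches to identify the nearest, median, and farthest.
def rank_matches_to_base(base_topic, other_runs):
    """other_runs: list of runs, each a list of ranked top-word lists."""
    matches = []
    for run in other_runs:
        dists = [1.0 - rbo(base_topic, topic) for topic in run]
        best_k = min(range(len(run)), key=lambda k: dists[k])
        matches.append((dists[best_k], run[best_k]))
    matches.sort(key=lambda pair: pair[0])
    return matches[0], matches[len(matches) // 2], matches[-1]  # nearest, median, farthest
```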
It is immediately clear that even the nearest topic for D-VAE has fewer words in common with the base topic, compared to MALLET. And as distance increases, the top words for MALLET stay consistent, whereas those for D-VAE change dramatically, even if they relate to the same weather concept. Note that in this example, consistent with anecdotal reports from other practitioners and our own experience, the neural model tends toward less frequent or more specific words. The idea that neural models may be capturing topics that are in some sense narrower, with instability leading to different such topics in each run, leads directly to the idea that a cross-run ensemble might be expected to perform better than the individual runs, which is important in the absence of a reliable automated method for optimizing hyperparameters.

Ensembling estimates
We have highlighted lack of stability as a serious problem for neural topic models, but neural models can also have desirable properties. How can we increase the odds of obtaining a good neural topic model in the face of extreme variation? The distance metrics we use to measure instability offer one solution: clustering to aggregate similar estimates over runs to form an ensemble. We adopt an approach similar to prior work (Miller and McCoy, 2017; Mantyla et al., 2018), going further by accounting for the document-topic estimates Θ and by evaluating ensembles' alignment against human categorization. Specifically, we concatenate run estimates over the m runs, $\tilde{B} = [B^{(i)}]_{i=1}^{m}$ and $\tilde{\Theta} = [\Theta^{(i)\top}]_{i=1}^{m}$, where each row in the concatenated matrix is a topic. We then compute pairwise distances between topics, $D(\tilde{B})$ and $D(\tilde{\Theta})$, and cluster based on a linear interpolation of the two distances, $\lambda D(\tilde{B}) + (1 - \lambda) D(\tilde{\Theta})$, where λ is a hyperparameter. The estimates of each topic k from each run i are assigned to a cluster, and to infer new document-topic or topic-word scores for the ensemble, we take the element-wise mean over the estimates assigned to each cluster. To evaluate this method, we compare the alignment score of the ensemble (§2.4.2) combining the m = 10 runs, versus the alignment score of each individual member run. We do so across each of the 400 contexts (model, dataset, K, high versus low label granularity, and metric). Figure 1 illustrates a summary of results for the purity alignment metric on the Wikipedia dataset. Across the full range of our experimentation, the ensemble improves on the median member in 97% of all the contexts, and it is always better than the worst member (full results in Appendix A.5).
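The sketch below gives one possible realization of this ensembling procedure. The stacked matrices, interpolated precomputed distances, and agglomerative clustering follow the description above, but the particular distance functions and linkage are illustrative choices rather than the exact configuration used in our experiments.

```python
# Sketch: cluster topics pooled across runs and average estimates within clusters.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import AgglomerativeClustering

def ensemble_runs(B_runs, Theta_runs, n_topics, lam=0.5):
    """B_runs: list of (K x V) arrays; Theta_runs: list of (N x K) arrays."""
    B_all = np.vstack(B_runs)                          # (m*K) x V, one row per topic
    Theta_all = np.vstack([T.T for T in Theta_runs])   # (m*K) x N, one row per topic

    D_B = cdist(B_all, B_all, metric="jensenshannon")  # illustrative topic-word distance
    D_T = cdist(Theta_all, Theta_all, metric="cosine") # illustrative document-topic distance
    D = lam * D_B + (1 - lam) * D_T                    # interpolated distance matrix

    # Note: older scikit-learn versions use affinity="precomputed" instead of metric=.
    labels = AgglomerativeClustering(
        n_clusters=n_topics, metric="precomputed", linkage="average"
    ).fit_predict(D)

    # Element-wise mean of the member estimates in each cluster.
    B_ens = np.vstack([B_all[labels == c].mean(axis=0) for c in range(n_topics)])
    Theta_ens = np.vstack([Theta_all[labels == c].mean(axis=0) for c in range(n_topics)]).T
    return B_ens, Theta_ens
```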

Conclusions
A tool can be considered broken when it doesn't work well for its intended use. In this paper we have focused on a widespread use case for topic models, their application in text content analysis; we have carefully motivated criteria for measuring the extent to which a topic model is serving those needs; and we have demonstrated through comprehensive and replicable experimentation that, when measured on those criteria, recent and representative neural topic models fail to improve on the classical implementation of LDA in MALLET. In particular, MALLET is much more stable, reducing concerns from the content analysis perspective that different runs could yield very different codesets. Equally important, across the vast majority of contexts, its discovered categories are reliable as measured via alignment with ground-truth human categories. For people seeking to use topic modeling in content analysis, therefore, MALLET may still be the best available tool for the job.
That said, there are still good reasons to investigate neural topic models. Foremost among these is the fact that they can benefit from pretraining on vast, general samples of language (e.g., Hoyle et al., 2020; Bianchi et al., 2021a; Feng et al., 2022). Neural realizations of topic models can also be integrated smoothly for joint modeling within larger neural architectures (e.g., Lau et al., 2017; Wang et al., 2019, 2020), and hold the promise of being more straightforward to use multilingually (e.g., Wu et al., 2020; Bianchi et al., 2021b; Mueller and Dredze, 2021) or multimodally (e.g., Zheng et al., 2015). We therefore introduced one possible way to address the shortcomings we identified using a straightforward ensemble technique.
Perhaps the most important take-away we would suggest is that development of new topic models, indeed of all NLP models, should be done with use cases firmly in mind. Some models are enabling technologies, without a direct user-facing purpose, and others are intended to produce results directly for human consumption. But whatever the goal, the driving question for methodological development and evaluation should not be how to demonstrate an improvement in the "state of the art"; it should be why the model is being created in the first place and what measurements will demonstrate improved performance for that intended purpose.

Limitations
Our studies used only English datasets, while topic modeling has been used to characterize texts in many languages. While theoretically we see no reason why our results and findings should not generalize beyond the English language, empirical generalizability across languages remains to be determined.
Our method for measuring alignment of model-induced categories with human-determined categories relies on ground-truth human labels, potentially limiting its broader applicability. In addition, the categories in the Wikipedia data were not, to our knowledge, produced via a traditional human content analysis process. We are currently designing a follow-up study in which human subject matter experts perform traditional content analysis from scratch on the same dataset used for topic modeling, in order to provide a head-to-head comparison between automated and traditional methods and to establish human upper bounds on inter-coder reliability.
Our literature review of topic modeling use cases was not a formal systematic review (Moher et al., 2009). It relied on Semantic Scholar's content and its discipline categorization, and potentially excluded papers in computer science that were about the use of topic models rather than method development. It seems clear that text content analysis is a dominant use case for topic modeling, if not the dominant use case. In the social sciences, we also note frequent use of the Structural Topic Model (Roberts et al., 2014), which, like SCHOLAR, can incorporate metadata into model estimation; we leave an evaluation of this use case to future work.

A Appendix
A.1 Additional Results

Fixed Hyperparameters. In Tables 4 and 5, we report the equivalents of Tables 2 and 3 when holding hyperparameters fixed, rather than letting them vary. We identify the hyperparameters for each model that achieve the highest average alignment metrics across experimental contexts for one dataset, then use those hyperparameters to estimate models on the other dataset (hyperparameter values in Appendix A.3). In this way, we follow a common paradigm in practical application of machine learning models: hyperparameters are determined based on an initial experimental context, then used in another. Broadly, MALLET is more stable and better-aligned than its neural counterparts in this setup, although the difference is not as stark as when hyperparameters are allowed to vary.
Held-out data. In Table 6, we report the alignment metrics for unseen category labels. To form the held-out data, we keep all high-level categories consistent between the training and held-out sets, but partition the low-level categories such that some are never seen during training (e.g., although documents from the high-level architecture category will be included in both splits, documents on bridges are only seen in training while those on lighthouses are held out). Here too, MALLET generally has the highest alignment metrics over experimental contexts.

A.2 Details of LDA applications meta-analysis
Summary statistics of our meta-analysis of studies using LDA outside computer science are shown in Table 7. The major results were discussed in Section 2.2. We find that about half of the papers did not specify the exact LDA implementation they used in their study, which raises larger reproducibility concerns for scientific research. Note that one paper can be assigned multiple subjects or fields of study by Semantic Scholar. All the papers used for the meta-analysis are shown in Tables 9 and 10.

A.3 Hyperparameters
Hyperparameters are included in the supplementary materials as <model name>.yml files. The full range of hyperparameters can also be found in Table 11.


Figure 1:
Ensembling performance on alignment (purity metric, §2.4.2) for the Wiki dataset. Each box represents a context: the columns identify model type and the label granularity used in evaluation (e.g., top left is MALLET with high-level categories), and the rows correspond to different values K for the number of topics. Dots are alignment scores for individual runs; the horizontal line is the alignment score for that ensemble of runs using our method. Shading indicates when the ensemble method has beaten the score of the best individual run (darkest), the median (middle), or has outperformed the worst individual run (lightest). The ensemble is typically in the top quartile of the component runs. Ensembles virtually always outdo the median, and frequently outperform the best individual run.
Table 1:
Sets of WEATHER topics for two model types across different runs with different hyperparameters on a Wikipedia dataset, represented in conventional fashion using the most probable ten words per topic. The table visually illustrates MALLET's dramatically greater stability: the top words from the base topic appear in corresponding topics across the full range of the other nine runs (overlap with base topic in orange), while for D-VAE, a neural topic model, consistency with the base topic begins to show a significant drop-off even with the nearest topic (overlap in blue). See §4 and §5 for discussion.

MALLET
base: winds depression mph september damage cyclone system
nearest: storm tropical hurricane winds depression mph september damage cyclone system
median: tropical storm hurricane depression winds september cyclone mph system august
farthest: tropical storm depression hurricane cyclone system season winds september mph

D-VAE
base: tropical mph storm hurricane winds cyclone extratropical utc rainfall
nearest: tropical cyclone hurricane storm winds landfall depression dissipated convection extratropical
median: convection landfall shear nhc utc tropical mbar northwestward cyclone extratropical
farthest: dvorak southwestward depressions dissipation intensifying conventionally southeastward

Table 2:
Stability for topic-word B and document-topic Θ estimates, over 10 runs. Smallest per-column values are bolded and are significantly smaller than unbolded values (two-sided t-test, p < 0.05); underlined values have p > 0.05.

Table 3:
Average alignment metrics across 10 runs, measured against gold labels at the lowest hierarchy level, |V| = 15,000. Largest values in each column are bolded, which are significantly greater than unbolded values in a two-sided t-test (p < 0.05); underlined values have p > 0.05.

Table 11:
Hyperparameter settings for MALLET, D-VAE, CTM, SCHLR+KD, and SCHOLAR. *: best setting for Bills; †: best setting for Wiki; based on the best average alignment metrics across experimental contexts.