DynaMiTE: Discovering Explosive Topic Evolutions with User Guidance

Introduction
Dynamic topic models (DTMs) seek to capture the evolution of topics in time-stamped documents (Blei and Lafferty, 2006). These models can be applied to many downstream tasks, including studying breakthroughs in scientific research (Uban et al., 2021), discovering global issues in parliamentary debates (Müller-Hansen et al., 2021; Guldi, 2019), and tracking evolving news stories (Li et al., 2020; Vaca et al., 2014; Yoon et al., 2023b). As information and language continuously evolve, DTMs are important tools for communicating these changes to users (Vosecky et al., 2013; Dieng et al., 2019). Existing DTMs are either fully supervised or fully unsupervised, both of which have their own limitations. To uncover topic evolutions in document collections, supervised DTMs (Park et al., 2015; Jiang, 2015) require each document to have a topic label. However, obtaining such topic labels requires annotating the document collection, which can be expensive and time-consuming. Hence, unsupervised DTMs (Blei and Lafferty, 2006; Wei et al., 2007; Zhang and Lauw, 2022; Grootendorst, 2022) are a more practical and popular approach, as they can be applied to unlabeled document collections. Despite their widespread usage, we observe two drawbacks of unsupervised DTMs that limit their effectiveness in downstream applications.
First, unsupervised DTMs fail to consider their users' needs, such as specific topics or categories of interest. Hence, the discovered topics may not be completely interpretable or relevant to the user (Chang et al., 2009). For example, in Table 1 (red), the unsupervised DTM retrieves generic terms like "learn" and "results" which are not distinctly related to the desired topic of NNs. These terms also overlap with NLP, another topic of interest to the user. As shown in Table 1 (blue), it would be more informative to return specific models ("tnn") and techniques ("ntk") discussed primarily in the context of NNs. These category indicative terms promote a deeper understanding of the topics of interest, increase the likelihood that the retrieved outputs satisfy a user's needs, and enhance downstream tasks such as content discovery and corpus summarization (Wang et al., 2009; Boyd-Graber et al., 2017; Yoon et al., 2023a).
Second, unsupervised DTMs fail to distinguish between terms that are generic and terms that are distinct to each time step. For example, in Table 1 (red), the unsupervised DTM retrieves "languages" for NLP at each time step, which is redundant and does not capture the field's evolution from 2013 to 2021 (Sun et al., 2022). As shown in Table 1 (blue), a user would be more informed by terms that uniquely characterize NLP in each year, such as "stance detection" in 2017 and "mbert" in 2021. Such time indicative terms provide clearer insights into how a topic has changed, and they can aid users in downstream tasks, such as associating concepts with specific time steps (§5.4) and identifying key shifts in successive years (§6.4).
To address the above shortcomings, we introduce a new task, discriminative dynamic topic discovery, which aims to create informative topic evolutions suited to a user's needs. We minimally represent a user's interests as a set of provided category names or seeds, i.e., terms present in the input corpus. A discriminative dynamic topic discovery framework must produce evolving topics for each seed that are distinctly relevant to the category and time step.
For this task, we develop DynaMiTE, an iterative framework to Dynamically Mine Topics with Category Seeds. Avoiding the pitfalls of existing DTMs, DynaMiTE combines three scores to ensure that candidate terms are (1) semantically similar to a user's interests, (2) popular in documents indicative of the user-specified category, and (3) indicative of the corresponding time step. We briefly describe these scores as follows:
(1) Semantic Similarity Score: Combining the strengths of category-guided and temporal embedding spaces, we propose a discriminative dynamic word embedding model to compare the semantics of candidate terms and user-provided seeds (§4.1).
(2) Category Indicative Score: We assume that high-quality candidate terms related to a user-provided category name are likely to be found in documents that discuss the category name. Thus, we calculate a term's distinct popularity in a set of retrieved category indicative documents (§4.2).
(3) Time Indicative Score: To discover candidate terms that uniquely capture time steps, we introduce a time indicative score based on topic burstiness. We seek candidate terms whose popularity rapidly explodes and defuses (§4.3).
DynaMiTE ensembles these three scores after every training iteration to mine a single term for each time step and each category (§4.4). These terms are used to refine the discriminative dynamic word embeddings and category indicative document retrieval, resulting in informative topic evolutions. We present DynaMiTE as a fast, simple, and effective tool for aiding trend and evolution exploration.
Our contributions can be summarized as follows:
• We propose a new task, discriminative dynamic topic discovery, which aims to produce informative topic evolutions relevant to a set of user-provided seeds.
• We develop DynaMiTE, which iteratively learns from discriminative dynamic embeddings, document retrieval, and topic burstiness to discover high-quality topic evolutions suited to a user's needs.
• We design a new human evaluation experiment to evaluate discriminative dynamic topic discovery. We find that users prefer DynaMiTE due to its retrieval of category and time indicative terms.
• Through experiments on three diverse datasets, we observe that DynaMiTE outperforms state-of-the-art DTMs in terms of topic quality and speed.

Related Work
We outline two variations on topic mining which incorporate time and user guidance, respectively.

Dynamic Topic Modeling
Many popular unsupervised DTMs (Blei and Lafferty, 2006; Churchill and Singh, 2022) build upon LDA (Blei et al., 2003), where each document in a corpus is drawn from a generative process.
A drawback common to all aforementioned approaches is the inability to incorporate user guidance. We address this limitation by enabling users to specify seeds for each topic evolution. Further, there does exist a small family of supervised DTMs (Park et al., 2015; Jiang, 2015), but these models can only be used on labeled document collections. Hence, if the user specifies seeds that are not included in the document labels, or the document collection is unlabeled, supervised DTMs cannot be directly applied to our setting.

User-guided Topic Discovery
Varying forms of guidance have been integrated into non-dynamic topic models. SeededLDA (Jagarlamudi et al., 2012) generates topics with user-given "seed topics". Later methods allow users to specify whether pairs of words should be generated by the same topics (Andrzejewski and Zhu, 2009) and anchor specific words to topics (Gallagher et al., 2017). Recently, user queries have been used to guide topic models (Fang et al., 2021).
More relevant to our task are models that iteratively expand upon a set of user-provided seeds. GTM (Churchill et al., 2022) uses Generalized Polya Urn sampling (Mimno et al., 2011) to learn topics based on user-given seeds. Embedding-based approaches such as CatE (Meng et al., 2020) learn discriminative embeddings for user-provided categories. Recent seed-guided topic mining works (Zhang et al., 2022a,b) use language model representations and topical sentences to improve CatE.
These works assume a non-dynamic corpus and thus cannot discover topic evolutions from temporal corpora, which is the main focus of this paper.

Problem Definition
We define discriminative dynamic topic discovery as follows: Given a corpus of time-stamped document collections $\mathcal{D} = \{D_1, D_2, \ldots, D_T\}$ and a set of user-provided seeds $\mathcal{C} = \{c_1, c_2, \ldots, c_n\}$, discriminative dynamic topic discovery aims to retrieve topic evolutions $\{S_{tj}\}_{t=1}^{T}$ for each category $c_j$. The topic $S_{tj}$ contains a list of terms $\{w_1, w_2, \ldots, w_m\}$ that are discriminatively relevant to time $t$ and category $c_j$. The time steps $\mathcal{T} = \{1, \ldots, T\}$ are any ordinal measure of time and can vary depending on the granularity required.

Methodology
To solve discriminative dynamic topic mining, we propose DynaMiTE, which iteratively populates each topic $S_{tj}$. Each topic $S_{tj}$ initially contains just the category name $c_j$, and after every training iteration of DynaMiTE, we expand each $S_{tj}$ with a single term $w$. For a term $w$ to be added to $S_{tj}$, we require three conditions to be satisfied: (1) $w$ must be semantically similar to $S_{tj}$; (2) $w$ must be prevalent in documents which discuss $S_{tj}$; (3) $w$ must be a time indicative word of time $t$.
We achieve these three goals by calculating three respective scores for candidate terms, namely semantic similarity scores with discriminative dynamic word embeddings (§4.1), category indicative scores from retrieved category indicative documents (§4.2), and time indicative scores based on topic burstiness (§4.3). Combining these scores (§4.4), we can iteratively mine terms and use this information to further enrich our framework, as illustrated in Figure 1 and detailed in Algorithm 1.

Semantic Similarity Score
Static word embeddings (Mikolov et al., 2013; Pennington et al., 2014) are one option to compute the semantic similarity between candidate terms and user-provided categories. However, static embeddings do not consider the category and time dimensions, thus losing the ability to model category distinctive information (Meng et al., 2020) and capture evolving semantics (Bamler and Mandt, 2017). Hence, we combine the category and time dimensions into a single discriminative dynamic word embedding model based on Yao et al. (2018).
Given a temporal corpus $\mathcal{D}$, we seek to model the semantics of every word $w \in \mathcal{D}$ at every time step $t$. To do so, we wish to find a word embedding matrix $U(t) \in \mathbb{R}^{V \times d}$ for each time $t$, where $V$ is the vocabulary size and $d$ is the word embedding dimension. We assume that $U(t)$ is affected by local contexts, temporal contexts, and user guidance.

Local Contexts: To learn accurate word semantics for topic discovery, it is essential to go beyond the bag-of-words assumption of LDA (Meng et al., 2020). Thus, we follow skip-gram (Mikolov et al., 2013) and assume that the semantics of surrounding words $w_j$ in a local context window of size $h$ (i.e., $|i - j| \le h$) are influenced by the semantics of the center word $w_i$. To learn semantics from local contexts for matrix $U(t)$, we leverage the fact that skip-gram word embeddings can be obtained by factoring the $V \times V$ pointwise mutual information (PMI) matrix of $D_t$ (Levy and Goldberg, 2014). In our case, we factor the positive normalized PMI (PNPMI) matrix $Y(t)$, with entries $Y(t)_{xy} = \mathrm{PNPMI}(x, y) = \max\left(\frac{\mathrm{PMI}(x, y)}{-\log p(x, y)},\ 0\right)$, by minimizing $\lambda_{\text{local}}(t) = \|Y(t) - U(t)U(t)^\top\|_F^2$ (Eq. 3). We choose PNPMI over PMI because it is bounded between 0 and 1, allowing us to easily modify the similarity of specific word embeddings when we later add user guidance. Specifically, manually setting $\mathrm{PNPMI}(x, y) = 0$ (or 1) implies that $x$ and $y$ have independent (or complete) co-occurrences in local context windows of size $h$, in turn causing $x$ and $y$ to have dissimilar (or similar) embeddings.
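To make this construction concrete, the following is a minimal sketch of building the PNPMI matrix $Y(t)$ from co-occurrence counts; it is an illustration under the definitions above, not the released implementation, and all variable names are our own:

```python
import numpy as np

def pnpmi_matrix(cooc, word_count, total_windows):
    """Build the V x V positive normalized PMI matrix Y(t) for one time step.

    cooc[i, j]    : number of context windows where words i and j co-occur
    word_count[i] : number of windows containing word i
    total_windows : total number of context windows in D_t

    Assumes PNPMI(x, y) = max(NPMI(x, y), 0), which is bounded in [0, 1].
    """
    p_xy = cooc / total_windows                  # joint probabilities
    p_x = word_count / total_windows             # marginal probabilities
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_xy / np.outer(p_x, p_x))  # pointwise mutual information
        npmi = pmi / -np.log(p_xy)               # normalize to [-1, 1]
    npmi = np.nan_to_num(npmi, nan=0.0, posinf=0.0, neginf=0.0)
    return np.clip(npmi, 0.0, 1.0)               # keep only positive values
```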
Temporal Contexts: As words change meaning over time, so should their embedding space representations (Bamler and Mandt, 2017). Hence, we follow the assumption that semantics drift slightly between successive time steps and control the distance between neighboring embeddings by minimizing $\lambda_{\text{temporal}}(t) = \|U(t-1) - U(t)\|_F^2$ (Eq. 4). With temporally aligned embeddings, DynaMiTE can address issues of data sparsity by borrowing semantics from neighboring time steps. This process also allows us to identify significant shifts in category semantics between successive time steps, which we explore in our experiments section (§6.4).
User Guidance: Separating categories in the embedding space will enforce a stronger understanding of category names, as categories will become clusters surrounded by category distinct terms (Meng et al., 2020). For example, representing the categories NLP and NNs as separated clusters in the embedding space will cause overlapping, generic terms like "results" to fall between these clusters. Thus, overlapping terms will no longer be semantically similar to either category. To form these clusters at each time $t$, we adjust the embedding space so words in the same topic have similar embeddings and words in different topics have dissimilar embeddings. As discussed above, we can do this by forming a category discriminative matrix $Z(t) \in \mathbb{R}^{V \times V}$ that modifies specific PNPMI values (Eq. 5): entries for word pairs in the same topic are set to 1, entries for pairs in different topics are set to 0, and all other entries retain their PNPMI values. By minimizing the distance between $U(t)U(t)^\top$ and $Z(t)$, i.e., $\lambda_{\text{guidance}}(t) = \|Z(t) - U(t)U(t)^\top\|_F^2$ (Eq. 6), we form category distinct clusters which become more refined as every topic $S_{tj}$ grows.

Discriminative Dynamic Word Embeddings: By combining the loss terms of local contexts (Eq. 3), temporal contexts (Eq. 4), and user guidance (Eq. 6), we can jointly capture a category discriminative and temporal embedding space for $\mathcal{D}$ (Eq. 7). We also add a loss term $\gamma \sum_{t=1}^{T} \|U(t)\|_F^2$ to encourage low-rank data fidelity. Here, $\alpha, \tau, \kappa, \gamma$ are hyperparameters. We efficiently minimize $\lambda$ with Block Coordinate Descent (Tseng, 2001), as detailed in Appendix A.
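A minimal sketch of how $Z(t)$ could be assembled from the current topics, under the assumption stated above (same-topic pairs set to 1, cross-topic pairs set to 0, all other entries left at their PNPMI values); the helper names are illustrative:

```python
def build_Z(Y_t, topics, vocab_index):
    """Form the category discriminative matrix Z(t) from the PNPMI matrix Y(t).

    topics      : list of term lists, one per category (the current S_tj's)
    vocab_index : dict mapping a term to its row index in Y(t)
    """
    Z = Y_t.copy()
    for a, topic_a in enumerate(topics):
        ids_a = [vocab_index[w] for w in topic_a]
        # Same-topic pairs -> force similar embeddings.
        for i in ids_a:
            for j in ids_a:
                Z[i, j] = 1.0
        # Cross-topic pairs -> force dissimilar embeddings.
        for topic_b in topics[a + 1:]:
            ids_b = [vocab_index[w] for w in topic_b]
            for i in ids_a:
                for j in ids_b:
                    Z[i, j] = Z[j, i] = 0.0
    return Z
```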
We calculate the semantic similarity score between candidate term $w$ and topic $S_{tj}$ by computing the cosine similarity of their embeddings, $\cos(u_{tw}, u_{ts})$. We obtain $u_{tw}$, the embedding of $w$, directly from the matrix $U(t)$. To obtain $u_{ts}$, the embedding of topic $S_{tj}$, we average the embeddings of the terms that have been assigned to the topic, i.e., $u_{ts} = \frac{1}{|S_{tj}|} \sum_{w' \in S_{tj}} u_{tw'}$.
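With the embeddings in hand, this score reduces to a cosine similarity against the topic centroid; a short sketch with our own helper names:

```python
import numpy as np

def semantic_similarity(U_t, w_idx, topic_ids):
    """Cosine similarity between candidate term w and topic S_tj at time t."""
    u_tw = U_t[w_idx]                   # embedding of the candidate term
    u_ts = U_t[topic_ids].mean(axis=0)  # mean embedding of current topic terms
    return float(u_tw @ u_ts / (np.linalg.norm(u_tw) * np.linalg.norm(u_ts)))
```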

Category Indicative Score
Skip-gram embeddings treat local contexts equally, regardless of whether the context is indicative of the category. However, a topic evolution that is distinctly relevant to its respective category should prioritize terms discussed in category indicative contexts. For example, "Chernobyl," a high-quality term for the category of disaster, is more likely to be discussed when the focus of the discourse is on disasters. To achieve this outcome, we follow previous works (Tao et al., 2016; Zhang et al., 2022b) and leverage the current topic evolution output to iteratively retrieve and quantify a candidate term's distinct popularity in category indicative contexts. We assume that the category indicative contexts of time step $t$ and category $c_j$ can be represented as a set of documents $\Theta_{tj} \subseteq D_t$. To obtain $\Theta_{tj}$, we search $D_t$ and select documents which contain any of the terms in $S_{tj}$. Thus, $\Theta_{tj}$ is updated iteratively as $S_{tj}$ grows. We calculate the relevance of candidate term $w$ to $\Theta_{tj}$ through popularity (how often term $w$ appears in $\Theta_{tj}$) and distinctiveness (how unique term $w$ is to $\Theta_{tj}$ compared to other category indicative documents). Popularity deprioritizes hyper-specific terms, such as models uniquely introduced in an abstract, while distinctiveness deprioritizes generic terms. For popularity, we choose the logarithm of term frequency (TF), and for distinctiveness, we choose the softmax of BM-25 (Robertson et al., 1995): $e^{\text{BM-25}(w, \Theta_{tj})} / \sum_{i=1}^{n} e^{\text{BM-25}(w, \Theta_{ti})}$ (Eq. 10). We also experimented with TF-IDF (Ramos, 2003) and Dense Passage Retrieval (Karpukhin et al., 2020) instead of BM-25, but selected BM-25 due to its balance of efficiency and performance. Combining popularity and distinctiveness, weighted by a hyperparameter $0 \le \beta \le 1$, we form a category indicative score for candidate term $w$ (Eq. 11).
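As an illustration, the sketch below computes both signals using the rank_bm25 package (our choice for the sketch; the paper does not name a BM-25 implementation). The exact $\beta$-weighted combination of Eq. 11 is not reproduced in the extracted text, so the convex combination at the end is our assumption:

```python
import math
import numpy as np
from rank_bm25 import BM25Okapi

def category_indicative_score(w, theta_docs, j, beta=0.2):
    """Score candidate term w for category j at time t.

    theta_docs : list of n token lists; theta_docs[i] concatenates all
                 category indicative documents Theta_ti (the paper treats
                 each Theta_tj as a single document, see Appendix B.3).
    beta       : trade-off hyperparameter, 0 <= beta <= 1.

    NOTE: the final combination (beta-weighted sum of log-TF popularity and
    softmax-BM25 distinctiveness) is our assumption, not the paper's Eq. 11.
    """
    popularity = math.log(1 + theta_docs[j].count(w))  # log term frequency
    bm25 = BM25Okapi(theta_docs)
    scores = bm25.get_scores([w])                      # BM-25 of w vs. each Theta_ti
    distinctiveness = np.exp(scores[j]) / np.exp(scores).sum()  # softmax (Eq. 10)
    return beta * popularity + (1 - beta) * distinctiveness
```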

Time Indicative Score
Previous works have demonstrated that topic evolutions can uniquely capture time steps when they contain a strong temporal ordering of burst topics (Kleinberg, 2002; Leskovec et al., 2009). For example, "ELMo" is a high-quality term that uniquely captures NLP in 2018, since it abruptly spiked in popularity when it was released that year. Thus, to improve the informativeness of our retrieved terms at each time $t$, we focus on terms that explode in popularity at $t$ but are not popular before and after $t$. We combine these burstiness metrics to calculate a time indicative score (Eq. 12), where $\mathbb{1}$ is the indicator function.
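Since the exact formulation of Eq. 12 is not recoverable from the extracted text, we include only an illustrative sketch of the underlying idea, rewarding terms whose frequency at time $t$ dominates their frequency in a surrounding temporal window; the windowing and ratio below are our own simplifications:

```python
def time_indicative_score(freq_by_time, t, r=5):
    """Illustrative burstiness score for one term (simplified, not Eq. 12).

    freq_by_time : list of relative frequencies of the term at each time step
    r            : temporal window size around t (cf. the BIDF window in B.3)
    """
    window = freq_by_time[max(0, t - r):t] + freq_by_time[t + 1:t + 1 + r]
    baseline = max(window) if window else 0.0
    # High when the term explodes at t relative to its surrounding popularity.
    return freq_by_time[t] / (baseline + 1e-9)
```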

The Iterative DynaMiTE Framework
We ensemble the three scores by ranking all candidate terms under each score and computing each term's mean rank. The term with the lowest mean rank that does not exist in any topic at time $t$ is added to each topic $S_{tj}$. To obtain $N$ unique terms for each topic $S_{tj}$, we repeat the process of semantic modeling, document retrieval, and term ranking for $N$ iterations.
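Concretely, one term selection step could look like the sketch below, where the three score arrays come from §4.1-§4.3; the use of scipy's rankdata (which averages tied ranks) is our implementation assumption:

```python
import numpy as np
from scipy.stats import rankdata

def select_next_term(sem, cat, time_ind, vocab, already_mined):
    """Pick the term with the lowest mean rank across the three scores.

    sem, cat, time_ind : score arrays over the candidate vocabulary
                         (higher is better for all three)
    already_mined      : terms already assigned to any topic at time t
    """
    # rankdata ranks ascending, so negate scores: the best term gets rank 1.
    mean_rank = (rankdata(-sem) + rankdata(-cat) + rankdata(-time_ind)) / 3
    for idx in np.argsort(mean_rank):
        if vocab[idx] not in already_mined:
            return vocab[idx]
    return None
```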

Experimental Setup
We present a detailed setup in Appendix B.

Datasets
We conduct experiments on three datasets from different domains: Arxiv, UN, and Newspop (detailed in Appendix B.1 and Table 5).

Quantitative Metrics
We evaluate all models quantitatively using normalized pointwise mutual information (NPMI), a standard measure of topic coherence (Lau et al., 2014). We calculate the NPMI of 5 terms in each time $t$ with respect to $D_t$ and report their mean as a percentage (mean of 25 runs).

Human Evaluation Metrics

We conduct two human experiments to qualitatively evaluate topic evolutions. For both experiments, we design an interface using PrairieLearn (West et al., 2015) and invite three graduate students with knowledge of the three domains to annotate. We encourage them to use Google or any other resources to aid them. We provide a detailed human evaluation setup and screenshots in Appendix B.6.
(1) Term Accuracy: Term accuracy measures whether users are satisfied by the discovered topics of DTMs. We evaluate term accuracy by asking annotators if each term in the topic evolution uniquely "belongs" to its category and does not "belong" to other categories. We define "belongs" as any non-synonym relation (to avoid low-quality terms such as "tragedy" for disaster) between the term and the category. For reference, we provide annotators with relations from ConceptNet (Speer et al., 2017). We average the labeling of annotators and report the final results as mean accuracy (MACC). We find high inter-annotator agreement for MACC, with Fleiss' kappa (Fleiss, 1971) scores of 88, 86, and 84 for Arxiv, UN, and Newspop, respectively.
(2) Temporal Quality: NPMI and MACC do not evaluate if topic evolutions capture interpretable trends. Thus, motivated by the definitions of interpretability for non-dynamic topic models proposed by Doogan and Buntine (2021), we propose that an interpretable topic evolution is one that can be ordered chronologically. To evaluate this property, we remove the label that indicates which time step each set of terms belongs to, as well as terms that reveal the time step of the set. We shuffle these sets and ask annotators to order them chronologically.
We use Spearman's rank correlation coefficient (Rank) (Zar, 2005) to measure how similar the annotator's order is to the true order of the topic evolution, and we ask annotators to rate their confidence (Conf) on a scale from 1 to 5 using Mean Opinion Score (Streijl et al., 2016), where 5 indicates total confidence. We report Rank and Conf averaged over seeds and annotators. To our knowledge, this is the first work with human experiments to evaluate the temporal quality of topic evolutions.

Performance Comparison
Quantitative Results: In Table 2, we find that DynaMiTE produces high-quality topic evolutions, almost always achieving superior quantitative results. The only exception is NPMI on the Newspop dataset, where CatE and BERTopic obtain higher scores than DynaMiTE. The Newspop dataset contains short headlines, where category names do not co-occur frequently with the high-quality terms mined by DynaMiTE, reducing NPMI. We contend that DynaMiTE still mines more informative terms, as demonstrated by the human evaluation metrics in Table 2. Overall, our strong quantitative results suggest that DynaMiTE (1) directly addresses a user's search needs (MACC, NPMI) and (2) captures interpretable trends (Rank, Conf), making it a preferred choice for exploring temporal corpora.

Qualitative Results: In Table 3, we observe two desirable properties of the topic evolutions produced by DynaMiTE: (1) While other models retrieve generic terms weakly related to disaster and leader (e.g., "demise" and "coordinator"), DynaMiTE mines terms which are distinctly and directly related to each category name. We believe that the use of category discriminative embeddings and category indicative document retrieval helps DynaMiTE avoid this pitfall and achieve higher MACC scores. (2) While other models contain similar sets of terms over time, DynaMiTE uses topic burstiness to find terms that uniquely capture each time step. This explains why annotators performed the best and were most confident when ordering the shuffled outputs of DynaMiTE. For example, a quick Google search will show that Hurricane Hugo occurred in 1989, Iraq invaded Kuwait in 1990, and Hurricane Luis was recorded in 1995 (Wikipedia contributors, 2023a,b). We show all qualitative results of our model in Appendix C.1.

Ablation Study
We perform an ablation study (Table 4) to observe how users perceive the outputs of DynaMiTE when its different components are removed. To directly measure user preferences, we use MACC. We observe the following: (1) DynaMiTE outperforms all ablations in most cases, implying that all components of the model complement each other. (2) It is interesting to note that removing the time indicative score causes, on average, a 46.7% drop in MACC. This observation suggests a strong association between a term's distinct popularity within a temporal window and its perceived relevance to a category name.

Figure 2: Runtime comparison (in seconds) for 5-term topic evolution retrieval on Arxiv and Newspop over ten runs. The right plot (Newspop) has a logarithmic y-axis scale. We omit DNLDA due to its poor performance (e.g., an average runtime of 5,117 seconds on Newspop).
(3) After the time indicative score, removing the semantic similarity score leads to the next largest drop in MACC, at 29.9% on average. Combining this observation with (2), we can infer that users prefer the full version of DynaMiTE due to its retrieval of terms both directly relevant to their interests and unique to each time step.

Runtime Comparison
DTMs are most often applied to rapidly changing domains, such as news and research, and thus benefit from running in real time. Further, efficient NLP frameworks greatly improve user experience (Telner, 2021). Hence, we study the runtime of DynaMiTE in Figure 2. We find that due to the combination of matrix factorization and Block Coordinate Descent to learn the embedding space, DynaMiTE achieves the fastest runtime on Arxiv and Newspop (UN follows the same trend). In addition, DynaMiTE operates entirely on CPUs, while BERTopic and Dynamic Bernoulli Embeddings require GPUs, making DynaMiTE a highly practical and resource-efficient solution for users.

Category Shift Analysis
We employ a discriminative dynamic embedding space with smoothness constraints over successive time steps to capture semantic shifts (Eq. 4). To study this property, we analyze the largest semantic shifts of our user-provided category names. First, we find the adjacent time steps $t$ and $t-1$ where the embeddings of the category name are the most dissimilar. To pinpoint one contributor to this large semantic shift, we identify the term whose embedding distance to the category name changed the most between $t$ and $t-1$, using cosine similarity.
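A sketch of this probe, assuming the learned matrices $U(t)$ are stacked into a $T \times V \times d$ array with L2-normalized rows (so dot products equal cosine similarities); the function name is our own:

```python
import numpy as np

def largest_category_shift(U, cat_idx):
    """Find the adjacent time steps where a category's embedding shifts most,
    and the term whose similarity to the category changed most across them.

    U : array of shape (T, V, d) with L2-normalized embedding rows.
    """
    # Cosine similarity of the category embedding between adjacent time steps.
    sims = np.einsum("td,td->t", U[1:, cat_idx], U[:-1, cat_idx])
    t = int(np.argmin(sims)) + 1                   # most dissimilar adjacent pair
    # Per-term similarity to the category at t-1 and t.
    before = U[t - 1] @ U[t - 1, cat_idx]
    after = U[t] @ U[t, cat_idx]
    term = int(np.argmax(np.abs(after - before)))  # largest similarity change
    return t, term
```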
For the category of natural language processing on Arxiv, the largest semantic shift occurred between 2021 and 2022, with the main cause being "GPT-3." Our findings align with recent studies (Bommasani et al., 2021; Sun et al., 2022; Goyal et al., 2022) which suggest that GPT-3 has led to a paradigm shift in NLP, in turn changing the semantics of the category NLP. This phenomenon is visualized in Figure 3. We present more category shift experiments in the Appendix (Table 9).

Conclusion
We propose the new task of discriminative dynamic topic discovery and develop DynaMiTE to solve the task. Through experiments on three diverse datasets, including the design of a new human evaluation experiment, we demonstrate that DynaMiTE produces high-quality topic evolutions and outperforms state-of-the-art DTMs. Ablation studies show that DynaMiTE effectively addresses a user's needs by retrieving category and time indicative terms. Through runtime analyses, we find that DynaMiTE is a computationally efficient and practical tool. Finally, we probe the discriminative dynamic embedding space of DynaMiTE to identify key shifts in computer science, politics, and news.

Limitations
Time Granularity: The granularity of time we test DynaMiTE on ranges from spans of four years to months. After testing multiple ways to bucket our temporal corpora, we observed that the granularity of time only affected DynaMiTE when there were insufficient documents in each time step. Specifically, we found that there must be at least 100 documents per time step to expect reasonably good results.

Runtime: One drawback of DynaMiTE is that its runtime depends on the number of terms required at each time step. However, this can be avoided by mining more than one term during each iteration of the framework. We also observed that DynaMiTE, along with all other dynamic topic mining baselines, ran more slowly on datasets with longer text documents.

Risks: DynaMiTE is intended to be used as a tool to discover topic evolutions in temporal corpora suited to a user's interests, represented as category seeds. We only experimented with DynaMiTE in domains with trustworthy information. If DynaMiTE were used on document collections that contain misinformation, it could have the potential to mine inaccurate terms.

A Discriminative Dynamic Word Embeddings Optimization
In this section, we detail the exact optimization process for Eq. 7, which follows similar steps as Yao et al. (2018). We first add an extra parameter designating the embedding matrix to the loss terms for local contexts, temporal contexts, and user preferences (e.g., $\lambda_{\text{local}}(t)$ becomes $\lambda_{\text{local}}(t, U)$, where $U$ is the embedding matrix we seek to populate). Minimizing Eq. 7 jointly for every $U(t)$ would require a large amount of memory to store all arrays. Hence, the first step is to decompose the objectives by time step, and instead solve the resulting per-time-step problem for each $\lambda(t)$ using alternating minimization. Minimizing each of these equations with gradient descent is computationally expensive. Instead, we introduce a second embedding matrix $W$ to minimize a more relaxed problem (Eq. 17), which contains mirrored loss terms for both embedding matrices $U$ and $W$. The final term of Eq. 17 ensures that $U$ and $W$ have identical embeddings, which can be accomplished by setting $\rho$ to a very large value (in our case, we choose 100). By formulating the equation in this way, which breaks the symmetry of factoring $Y(t)$, Yao et al. (2018) find that minimizing $\lambda(t)$, for both $U(t)$ and $W(t)$, is the solution of a ridge regression problem. For optimizing $U(t)$ (and equivalently, $W(t)$), taking the derivative of Eq. 17 leaves us with an equation of the form $U(t)A = B$, where $A$ and $B$ collect the PNPMI, temporal, guidance, and regularization terms (we omit the $\frac{1}{2}$ scalar). Solving $U(t)A = B$ for every $t$ can be accomplished efficiently by using Block Coordinate Descent (Tseng, 2001).
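For intuition, here is a minimal sketch of one Block Coordinate Descent sweep. The exact forms of $A$ and $B$ are not recoverable from the extracted text, so the update below follows the analogous derivation in Yao et al. (2018) without the category guidance term; it should be read as an approximation, not the paper's exact equations:

```python
import numpy as np

def bcd_step(U, W, Y, alpha, tau, gamma, rho):
    """One Block Coordinate Descent sweep over time steps, updating U(t).

    Approximate forms of A and B following Yao et al. (2018), WITHOUT the
    category guidance term (which would add a kappa-weighted Z(t) W(t) piece).
    U, W : lists of (V x d) embedding matrices, one per time step
    Y    : list of (V x V) PNPMI matrices
    """
    d = U[0].shape[1]
    for t in range(len(U)):
        A = alpha * W[t].T @ W[t] + (gamma + 2 * tau + rho) * np.eye(d)
        B = alpha * Y[t] @ W[t] + rho * W[t]
        if t > 0:
            B += tau * U[t - 1]  # pull toward the previous time step
        if t < len(U) - 1:
            B += tau * U[t + 1]  # pull toward the next time step
        # Solve U(t) A = B, a ridge-regression-style linear system.
        U[t] = np.linalg.solve(A.T, B.T).T
    return U
```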

B Experimental Setup B.1 Dataset Description
We provide thorough summary statistics of the Arxiv, UN, and Newspop datasets in Table 5. All datasets (Arxiv, UN, Newspop) were obtained from publicly available sources. The original Arxiv dataset contains research papers from all scientific fields, so we select a subset of these papers by finding those which are categorized solely as "machine learning," "computer vision," or "natural language processing". The original UN dataset contains very long documents (around 4,000 words), so we treat each paragraph as a document instead. The documents from the Newspop dataset were not modified.
On the UN dataset, the speaker name was present, but these speakers are public figures who are part of the United Nations General Assembly, and their speeches have been released to the public. Given the informative nature of each dataset, we did not find any other personal data or offensive content.
To check this, we analyzed a random sample of 50 documents from each dataset. Apart from what was mentioned in the paper, we also modify the datasets by filtering noisy symbols with Regex and converting all characters to ASCII with Unidecode. To our knowledge, all datasets are entirely in English. We did not split any of the datasets into training, testing, or validation sets, since we did not perform any tasks which require inference and validation.
After this pre-processing, we perform phrase chunking with AutoPhrase (Shang et al., 2018) on all datasets, treating each phrase as a single embedding, and remove phrases that appear in fewer than 1/5000 of the documents. After these two steps, the vocabulary sizes for Arxiv, UN, and Newspop are 16,073, 26,184, and 8,199, respectively. Models are trained on the pre-processed datasets to retrieve 5-term topic evolutions.

B.2 Model Inputs
For the Arxiv dataset, the inputs to each model were the pre-processed corpus and user-provided seeds (1) natural language processing, (2) vision, and (3) neural network. For the UN dataset, the inputs to each model were the pre-processed corpus and user-provided seeds (1) disaster and (2) leader. For the Newspop dataset, the inputs to each model were the pre-processed corpus and user-provided seeds (1) technology, microsoft, and (2) politics, president barack obama. We include microsoft and president barack obama as additional seeds because the documents discussing technology and politics in the Newspop dataset mostly surround these two topics.

B.3 Training Setup
We release the Python code implementation of DynaMiTE. DynaMiTE is initialized with word2vec for faster convergence and trained with $\alpha = 100$, $\gamma = \kappa = \tau = 50$. We set $\beta = 0.2, 0.05, 0.4$ and BIDF window size $r = 5, 7, 5$ for Arxiv, UN, and Newspop, respectively. The only hyperparameter tuned was $\beta$, which was done by qualitatively assessing topic evolutions produced with different $\beta$ values on a subset of the corpus.
In practice, we train DynaMiTE by combining Eq. 3 and Eq. 6 into a single loss term and treat each $\Theta_{tj}$ as one document. Both of these steps result in equivalent performance and help DynaMiTE run more efficiently. DynaMiTE considers local context window sizes of 7 for Arxiv and UN, and the entire text for Newspop (as headlines are short). The embedding size of DynaMiTE is set to 50. When retrieving topic evolutions for qualitative experiments, to avoid redundancy, we also add a condition that any added term must not have a cosine similarity above 0.9 with any of the terms currently in the topic evolution, calculated through our discriminative dynamic word embeddings. As mentioned in the paper, DynaMiTE is trained entirely on CPUs and is limited to using only 10 CPUs.

B.4 Baseline Implementations
We implement DNLDA using the official Python Georgetown DataLab Topic Modeling package (https://github.com/GU-DataLab/gdtm) uploaded by the authors of the paper. We set most of the parameters to be the default values of the model. The only parameter we change is the number of topic evolutions outputted by the model, which we set to 200 to ensure that topic evolutions existed for each of our specified seeds. DNLDA was trained entirely on CPUs. To select topic evolutions, we manually search through the outputs, prioritizing those which contain any of our user-provided seeds.
We implement BERTopic using the official Python bertopic package uploaded by the authors of the paper. We set all of the parameters to be the default values of the model. BERTopic was trained using multiple GPUs. We follow the same process as DNLDA to retrieve topic evolutions.
We implement Bernoulli using the PyTorch implementation. We choose this one over the official implementation because it is computationally efficient. When testing both versions, we found no noticeable difference in performance, and thus elected for the PyTorch implementation. We set all parameters to be the default values of the model, with the exception of the word embedding size, which is set to 50. The Bernoulli model was trained using multiple GPUs. To select topic evolutions, we first find the embeddings of the user-provided seeds (averaging them if there are multiple seeds for a single topic evolution). Then, we find each seed's nearest neighbors for each time step using cosine similarity and retrieve these as the outputs for the topic evolution.
We implement DW2V using the official Python code uploaded by the authors of the paper. We set all of the parameters to be the default values of the model and warm up DW2V with global word2vec embeddings. DW2V considers the same local window sizes as DynaMiTE to calculate PMI. The word embedding size is set to 50. DW2V was trained entirely on CPUs. We follow the same process as Dynamic Bernoulli Embeddings to retrieve topic evolutions.
We implement CatE using the official C code uploaded by the authors of the paper. We set all of the parameters to be the default values of the model. CatE is a user-guided topic mining framework, so we did not have to retrieve terms through our own implementation. To make CatE dynamic, we run it recursively on each time-stamped document collection with the same parameters.

B.5 Quantitative Metrics
As stated in the paper, we report NPMI averaged over 25 runs. The standard errors of these runs for Arxiv, UN, and Newspop were 0.0437, 0.0395, and 0.0188, respectively. We found that the outputs of DynaMiTE were consistent on most occasions. To obtain the topic evolutions for human evaluation (term accuracy and temporal ordering), we only consider a single run chosen at random.
We also report the detailed formulas for NPMI, MACC, and Rank, as well as the statistical tests we used to determine significance, below.

NPMI, or normalized pointwise mutual information, is a standard measure of topic coherence. To calculate the NPMI for a topic evolution, we first calculate the normalized pointwise mutual information for each pair of terms at each time $t$, defined as follows:

$$\mathrm{NPMI}(w_j, w_k) = \frac{\log \frac{P(w_j, w_k)}{P(w_j)\, P(w_k)}}{-\log P(w_j, w_k)}$$

$P(w_j, w_k)$ is the probability that $w_j$ and $w_k$ co-occur in a document, while $P(w_j)$ is the probability that $w_j$ occurs in any document. Letting $\mathrm{NPMI}(t)$ denote the mean over all term pairs at time $t$, we then calculate our NPMI metric as the sum of all $\mathrm{NPMI}(t)$ divided by the total number of time steps in $\mathcal{T}$, i.e., $\mathrm{NPMI} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathrm{NPMI}(t)$. We calculate the statistical significance of the NPMI values produced by each baseline with an approximate randomization test, using the list of NPMI values over 25 runs as the distribution.
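A sketch of this computation over one time step's documents (tokenized as sets of unique terms); the helper names are our own:

```python
import math
from itertools import combinations

def npmi_at_t(terms, docs):
    """Mean NPMI over all pairs of the retrieved terms at one time step.

    docs : list of sets, each containing the unique terms of one document in D_t
    """
    n = len(docs)
    def p(*ws):
        # Fraction of documents containing all of the given terms.
        return sum(all(w in doc for w in ws) for doc in docs) / n
    scores = []
    for wj, wk in combinations(terms, 2):
        p_jk = p(wj, wk)
        if p_jk == 0:
            scores.append(-1.0)  # never co-occur: minimum NPMI
            continue
        if p_jk == 1:
            scores.append(1.0)   # co-occur in every document: maximum NPMI
            continue
        pmi = math.log(p_jk / (p(wj) * p(wk)))
        scores.append(pmi / -math.log(p_jk))
    return sum(scores) / len(scores)
```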
MACC, or mean accuracy, measures term accuracy, defined as the proportion of retrieved terms that "belong" to the category name. To adapt MACC for dynamic topic mining, we flatten all terms retrieved by the dynamic topic mining frameworks and do not consider the temporal aspect. The exact formula for a single annotator, over the flattened term sets $W_i$ of the $n$ categories, is as follows:

$$\mathrm{MACC} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|W_i|} \sum_{w_j \in W_i} \mathbb{1}(w_j \text{ belongs to } c_i)$$

$\mathbb{1}$ is the indicator function which denotes whether $w_j$ belongs to category $c_i$, according to the annotator. We report our final results as these MACC scores averaged over all annotators.
To conduct a pairwise t-test for significance, we construct a list $M$ for each model which contains the MACC scores for every dataset, seed, and annotator. We have 7 total seeds and 3 annotators, so $M$ has a length of 21 for each baseline. As our sample size is small, we conduct Wilcoxon signed-rank tests using each list $M$.
Rank, or Spearman's rank correlation coefficient, is a value ranging between -1 and 1 that compares an annotator's ordering $x_i$ and the ground truth ordering $y_i$ for category $i$, where 1 is a perfect match and -1 means the annotator's ordering is the ground truth order in reverse. We represent $y_i$ as the list $\{t \mid 0 < t \le |\mathcal{T}|\}$, while $x_i$ will be some permutation of the ground truth order. Using $x_i$ and $y_i$, Spearman's rank correlation coefficient is calculated as:

$$\mathrm{Rank}_i = 1 - \frac{6 \sum_{t=1}^{|\mathcal{T}|} \left(x_i(t) - y_i(t)\right)^2}{|\mathcal{T}|\left(|\mathcal{T}|^2 - 1\right)}$$

where $x_i(t)$ denotes the $t$-th element of list $x_i$. We report our final results as these Spearman's rank correlation coefficients averaged over all annotators.
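In practice this is a one-liner with scipy, shown here on a toy ordering of five time steps:

```python
from scipy.stats import spearmanr

true_order = [1, 2, 3, 4, 5]       # ground truth ordering y_i
annotator_order = [1, 3, 2, 4, 5]  # annotator's ordering x_i

rank, _ = spearmanr(annotator_order, true_order)
print(rank)  # 0.9: one adjacent swap costs 0.1 with 5 time steps
```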
Since our orderings contain a maximum of 12 elements, we cannot conduct the usual significance test for Spearman's rank correlation, as it requires at least 500 samples. Thus, we use a permutation test to compute the statistical significance, and mark models which obtain a significant human ordering (that is, a human ordering significantly close to the true ordering) for all seeds and annotators.
Conf measures the annotator's confidence during ranking, which is a discrete value from 1 to 5, based on Mean Opinion Score. The exact criteria for Conf can be viewed in Figure 5. We report the confidence values averaged over all annotators and seeds. For determining if Conf values were significant, we follow the same approach as for MACC described above.

B.6 Human Experiments
We provide details on the term accuracy (Figure 4) and temporal quality (Figure 5) human evaluation experiments below.

Term Accuracy: First, we compile the topic evolutions of all baselines and ablation models of DynaMiTE (including our full version). We flatten the terms contained within each topic evolution and upload them to the tool. To avoid any positional biases, the order of terms is randomly shuffled for each annotator. Using a checkbox for each term, annotators are instructed to select terms that they believe belong to the category name, where "belong" is defined as a non-synonym relationship between the category and term. To effectively complete the task, annotators are provided with all category names considered in the experiment, the relevant time steps, the dataset (or context) of the experiment, resources and examples for types of non-synonym relations, and a sample Google search query for ascertaining whether a term and category are related.
Temporal Quality: For each topic evolution, we remove the label that indicates which time step each set of terms belongs to. We present annotators with these terms in a randomized order, where each annotator sees a different randomized order. Annotators are instructed to order these sets of terms chronologically by using a drag-and-drop functionality integrated into the PrairieLearn interface. To effectively complete the task, annotators are provided with the dataset (or context) of the experiment, the relevant time steps, and a sample Google search query for ascertaining whether a set of terms precedes or succeeds another set of terms. After annotators have completed ordering the terms, they are asked to rate their confidence on a scale of 1 to 5 based on Mean Opinion Score (Streijl et al., 2016), using a multiple choice question.
Both tools displayed in the figures were created using the PrairieLearn (West et al., 2015) interface, which is traditionally used in classroom settings. Annotators can submit their results at any time by pressing "Save and Grade". By pressing "Save," annotators can save their current results and choose to come back to the experiment at a later time. We find that PrairieLearn's easy-to-use interface and integration of Python make it an ideal tool for setting up human evaluation experiments. We received no complaints from our annotators indicating that PrairieLearn was a difficult tool to navigate. We hope to work with the creators of PrairieLearn to make it publicly available for all types of human evaluations.

C.1 Topic Evolutions
We display the full 5-term topic evolution outputs produced by DynaMiTE on the Arxiv (Table 6), UN (Table 7), and Newspop (Table 8) datasets.

C.2 Category Shift Analysis
We display all category shift analyses on the seeds and datasets from our experiments in Table 9.

Figure 1 :
Figure 1: Overview of DynaMiTE. Given a temporal collection of documents and user-provided seeds, DynaMiTE first calculates semantic similarity scores with discriminative dynamic word embeddings, category indicative scores with document retrieval, and time indicative scores based on topic burstiness. Ensembling these scores, DynaMiTE iteratively mines topic evolutions and uses this information to further enrich its outputs.

Figure 4 :
Figure 4: Screenshot from the human evaluation experiment for Term Accuracy (MACC).

Figure 5 :
Figure 5: Screenshot from the human evaluation experiment for Temporal Ordering (Rank and Conf).

Table 1 :
Evolution from unsupervised DTM DNLDA (Churchill and Singh, 2022) for topics natural language processing (NLP) and neural networks (NNs) on Arxiv machine learning papers, compared to our output.

Table 2 :
Topic coherence (NPMI), term accuracy (MACC), and temporal quality (Rank and Conf) comparison. Models with metrics marked with * significantly outperform all non-marked baselines (p < 0.05, approximate randomization test).

Table 5 :
Detailed description of the Arxiv, UN, and Newspop datasets used in our experiments.

Table 6 :
Full DynaMiTE topic evolution output on the Arxiv dataset.

Table 7 :
Full DynaMiTE topic evolution output on the UN dataset.