A Query-Driven Topic Model

Topic modeling is an unsupervised method for revealing the hidden semantic structure of a corpus. It has been increasingly widely adopted as a tool in the social sciences, including political science, digital humanities and sociological research in general. One desirable property of topic models is to allow users to find topics describing a specific aspect of the corpus. A possible solution is to incorporate domain-specific knowledge into topic modeling, but this requires a specification from domain experts. We propose a novel query-driven topic model that allows users to specify a simple query in words or phrases and return query-related topics, thus avoiding tedious work from domain experts. Our proposed approach is particularly attractive when the user-specified query has a low occurrence in a text corpus, making it difficult for traditional topic models built on word cooccurrence patterns to identify relevant topics. Experimental results demonstrate the effectiveness of our model in comparison with both classical topic models and neural topic models.


Introduction
Topic modeling aims to infer topics from a collection of documents, where a topic is a salient pattern of the collection and is represented by a distribution over words. The availability of large volumes of new sources of unstructured data, such as social media, has presented a challenge to conventional qualitative research methods in the social sciences and humanities and encouraged the exploration of topic modeling as a potential solution (Melville et al., 2019; Hu et al., 2019; Yao and Wang, 2020). In these studies, topic modeling has been applied to questions centered on interpretation and meaning. By analyzing the word distributions of the learnt topics, researchers can apply inductive reasoning to specific topics and perform a more in-depth study of related documents, allowing them to identify underlying topical trends and conduct a more thorough analysis of the data.
One limitation of conventional topic modeling approaches in these studies is that they can only learn topics from the whole corpus. However, in some cases, researchers may be interested in topics describing specific concepts or aspects of the corpus. To identify these topics, researchers have to analyze the word distributions of all topics, which is very time consuming. Moreover, the target topics may have too small a presence in the data to be detected directly by a topic model. For instance, given a set of posts about health, researchers may wish specifically to analyze the impact of food on health. If the words related to food have a relatively low frequency of occurrence in the posts, then conventional topic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) may not find any food-related topics at all. This is caused by the phenomenon of higher order co-occurrence in conventional topic models (Heinrich, 2009), which prevents infrequent words from being sampled under the correct topic. While an information retrieval method could be used to find relevant documents, identifying the key subtopics discussed in these documents would still be a daunting process.
To handle this limitation, weakly-supervised approaches (Nikolenko et al., 2017; Chen et al., 2013; Andrzejewski et al., 2011; Yang et al., 2015) have been proposed as a solution and different types of domain-specific prior knowledge, such as word correlation (Yang et al., 2015) and document and word labels, have been introduced. By adding these to the unsupervised topic model, a set of topics describing the domain knowledge can be generated. However, this still requires experts to define the domain knowledge, which may not always be feasible. In addition, the aforementioned approaches can only generate one topic relevant to the target concept. It is desirable to distinguish between different contexts about the same concept: for instance, for the concept 'Middle East', there might be subtopics relating to Middle East conflicts and Middle East resorts, respectively. In our work, we propose a novel approach that automatically generates all subtopics relevant to the target concept.
Figure 1: Step 1: the user uses a query to define the concept of interest. Step 2: a query expansion technique is used to expand the input query to a set of concept words. Step 3: the concept words are utilized to generate a single topic. Step 4: the single topic is expanded to a set of subtopics. The retrieved concept-topic and subtopic results allow the user to do inductive reasoning and have a more in-depth study of related documents. In Step 3 and Step 4, we present the top weighted words of the topic and their corresponding weights.
In our query-driven topic model, a query phrase is used to define the concept of interest. As illustrated in Figure 1, a query expansion technique is first employed to expand the input query to a set of concept words, which are then used to generate a single topic about the concept, and this topic is subsequently expanded to a set of subtopics automatically. In summary, our contributions are fourfold: (1) We propose a novel approach which allows users without expert knowledge to use a short query rather than predefined keywords to detect topics of interest; (2) Our model is novel in its ability to identify rare topics in text, which would not be possible using existing topic modeling approaches; (3) Our model is built on the Hierarchical Dirichlet Process (HDP) and can therefore automatically infer all subtopics describing the target concept without having to determine the optimal number of topics beforehand; (4) We evaluate our approach on three datasets and achieve superior performance compared to both traditional hierarchical topic models and neural topic models, both quantitatively and qualitatively.

Related Work
Earlier work has attempted to solve the problem of identifying specific topics by using prior knowledge. Domain knowledge has been expressed with two primitives on word pairs, called Must-Links and Cannot-Links, which are encoded using a Dirichlet Forest prior. Topic-in-set knowledge defines 'z-labels' as prior knowledge, and a similar idea was introduced by Nikolenko et al. (2017). First-Order Logic has been proposed as a way to incorporate richer forms of prior knowledge (Andrzejewski et al., 2011). Yang et al. (2015) proposed an efficient method for incorporating domain knowledge and demonstrated significant speed improvement with large datasets. El-Assady et al. (2019) presented a framework that allows users to incorporate the semantics of their domain knowledge in topic models interactively. Gemp et al. (2019) incorporated informative priors in a neural topic model for the purpose of semi-supervised topic modeling. All these approaches require experts to provide domain-specific prior knowledge, which is problematic for two reasons: different corpora in the same domain may contain different information; and it may be costly to specify all prior knowledge. We take advantage of a query expansion technique and propose an automatic concept words extractor to help users extract prior knowledge.
Our work is also similar to the Hierarchical Topic Model (HTM) (Blei et al., 2004). HTM is a non-parametric topic model that generates topics in a hierarchical structure. In our work, we also propose to generate subtopics from a parent topic. A key difference is that we propose a novel solution to incorporate domain-specific prior knowledge, making it possible to generate desirable topics. This is not the case with HTM. Although attempts were made to introduce prior knowledge in HTM, Perotte et al. (2011) focused on out-of-sample label prediction which is not the focus of our work while Xu et al. (2018) still required experts to define word pairs which is problematic as mentioned earlier.

Proposed Framework
In outline, our model expands an input query to a set of concept words using a concept words extractor. These concept words are then fed into a two-phase framework based on a variant of the Hierarchical Dirichlet Process (HDP) to model all topics relevant to the concept.

Concept Words Extractor
Given an input query q, we retrieve a list of documents d according to the query likelihood score (Ceri et al., 2013):

$$p(q \mid d) = \prod_{i=1}^{n} p(q_i \mid d) \qquad (1)$$

where n is the number of tokens in the query and $p(q_i \mid d)$ is the probability of query term $q_i$ in document d. We define two extraction rules, "AND" and "OR", to constrain whether all query terms must appear in the same document or not. We then extract concept words from the retrieved documents, for which we adopt three approaches.
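The retrieval step can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: it ranks documents by the query likelihood, taken here as the sum of log term probabilities for numerical stability, and the additive smoothing constant `mu`, the function names and the `top_n` cut-off are assumptions introduced for the example.

```python
import math
from collections import Counter


def log_query_likelihood(query_tokens, doc_tokens, vocab_size, mu=1.0):
    """Sum of log p(q_i | d); p(q_i | d) uses additive smoothing (an assumption)."""
    counts = Counter(doc_tokens)
    denom = len(doc_tokens) + mu * vocab_size
    return sum(math.log((counts[t] + mu) / denom) for t in query_tokens)


def retrieve(query_tokens, docs, rule="AND", top_n=10):
    """Return indices of the top_n documents that satisfy the extraction rule."""
    vocab_size = len({t for d in docs for t in d})
    if rule == "AND":
        # every query term must appear in the same document
        keep = lambda d: all(t in d for t in query_tokens)
    else:
        # "OR": at least one query term must appear
        keep = lambda d: any(t in d for t in query_tokens)
    candidates = [i for i, d in enumerate(docs) if keep(d)]
    candidates.sort(
        key=lambda i: log_query_likelihood(query_tokens, docs[i], vocab_size),
        reverse=True,
    )
    return candidates[:top_n]
```

With the "AND" rule only documents containing every query term are candidates, whereas "OR" also admits documents that match a single term and leaves the ranking to the likelihood score.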
Frequency based extraction (FRE). The first approach simply extracts words with high frequency in the retrieved documents as our concept words:

$$\mathrm{score}_{\mathrm{FRE}}(w) = \sum_{i=1}^{n} TF(w \mid d_i)$$

where n is the number of retrieved documents and $TF(w \mid d_i)$ is the term frequency of word w in document $d_i$.

KL-Divergence based extraction (KLD)
The second approach is inspired by the query expansion technique of Carpineto et al. (2001). Intuitively, words relevant to the input query have a high probability in the retrieved sub-corpus but a low probability in the whole corpus. The score can be defined as:

$$\mathrm{score}_{\mathrm{KLD}}(w) = P_R(w) \log \frac{P_R(w)}{P_C(w)}$$

where $P_R(w)$ is the probability of word w in the retrieved sub-corpus and $P_C(w)$ is the probability of word w in the whole corpus. We extract words with high scores as our concept words.
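The two scoring rules described so far can be sketched in a few lines. The code below is illustrative only: it assumes maximum-likelihood unigram estimates for $P_R$ and $P_C$, assumes the retrieved documents are a subset of the corpus (so that every retrieved word has non-zero corpus probability), and uses hypothetical function names.

```python
import math
from collections import Counter


def fre_scores(retrieved_docs):
    """FRE: total term frequency of each word over the retrieved documents."""
    scores = Counter()
    for doc in retrieved_docs:
        scores.update(doc)
    return scores


def kld_scores(retrieved_docs, corpus_docs):
    """KLD: P_R(w) * log(P_R(w) / P_C(w)) for every word in the retrieved set."""
    r_counts = Counter(t for d in retrieved_docs for t in d)
    c_counts = Counter(t for d in corpus_docs for t in d)
    r_total = sum(r_counts.values())
    c_total = sum(c_counts.values())
    scores = {}
    for w, n in r_counts.items():
        p_r = n / r_total
        p_c = c_counts[w] / c_total  # > 0 because retrieved docs come from the corpus
        scores[w] = p_r * math.log(p_r / p_c)
    return scores


def top_concept_words(scores, k=10):
    """Keep the k highest-scoring words as concept words."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A word that is as frequent in the whole corpus as in the retrieved sub-corpus receives a KLD score of zero, which is why KLD tends to drop generic high-frequency words that FRE would keep.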

Relevance model with word embedding (REL)
This approach extracts concept words from a word-embedding enhanced relevance model (Diaz et al., 2016). The probability assigned to word w by the relevance model (Lavrenko and Croft, 2017) is:

$$p(w \mid q) = \sum_{d \in R} p(w \mid d)\, p(d \mid q)$$

where R is the set of retrieved documents, $p(w \mid d)$ is the probability of word w in document d and $p(d \mid q)$ is d's query likelihood from equation (1). We integrate this model with word embeddings through a similarity term $sim(w, q)$ weighted by a hyperparameter λ, where $sim(w, q)$ is the normalized similarity between word w and the input query q. For each term in the vocabulary, we calculate its similarity with the input query. We then take the top k most similar terms and normalize their similarity values. If w is among the top k similar terms, $sim(w, q)$ takes the normalized similarity value; otherwise, $sim(w, q) = 0$.
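The top-k similarity term can be illustrated with a short sketch. Several details here are assumptions made for the example and are not taken from the description above: the query is represented by a single vector `query_vec` (for instance, the mean of its term embeddings), normalization divides by the sum of the top-k similarities, and the combination with the relevance model is a linear interpolation in which λ weights the relevance-model term.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def normalized_topk_sim(query_vec, embeddings, k):
    """sim(w, q): normalized similarity for the top-k words, 0 for all others."""
    sims = {w: cosine(vec, query_vec) for w, vec in embeddings.items()}
    top = set(sorted(sims, key=sims.get, reverse=True)[:k])
    total = sum(sims[w] for w in top)
    return {w: (sims[w] / total if w in top and total > 0 else 0.0) for w in embeddings}


def rel_scores(p_rm, sim, lam=0.5):
    """Assumed linear interpolation: lam * p(w|q) + (1 - lam) * sim(w, q)."""
    words = set(p_rm) | set(sim)
    return {w: lam * p_rm.get(w, 0.0) + (1 - lam) * sim.get(w, 0.0) for w in words}
```

With λ = 0.5, the value used in the experiments, both terms contribute equally, so the direction of the interpolation does not affect the ranking in that setting.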

Query-Driven Topic Model
We propose a two-phase framework based on HDP, which is a nonparametric Bayesian model that can automatically infer the number of topics in a corpus (Teh et al., 2005). It assumes a restaurant (i.e., a document) has a set of tables and serves dishes (i.e., topics) from a global menu. A single dish is only served at a single table for all customers (i.e., words) who sit at that table.
In the first phase, the model infers one topic for each concept, along with other irrelevant topics. We define this topic as the "parent topic" in later sections and denote the parent topic of the concept corresponding to the input query q as $z_q$. We incorporate prior knowledge into HDP by fixing the topic index for concept words in all documents. For words from the concept words $W_q$ corresponding to the input query q, the topic index z is known and remains fixed as $z_q$. In the Gibbs sampling process, the probability of sampling an existing table t for a word $w_{ji}$ at document j and position i is defined in terms of the following quantities: $k_{jt}$ is the topic assignment of table t at document j; $f_{k_{jt}}^{-w_{ji}}(w_{ji})$ is the probability of $w_{ji}$ being assigned to topic $k_{jt}$ after removing the current word; $1_1(w_{ji}, k_{jt})$ is an indicator function that takes the value 0 if $w_{ji} \in W_q$ and $k_{jt} \neq z_q$, and 1 otherwise; and $n_{jt}^{-ji}$ denotes the number of words in document j at table t, excluding the current word. The probability of sampling a new table $t^{new}$ additionally involves $m_k$, the number of tables of topic k, and $m_\cdot$, the total number of tables; the hyperparameters γ and α of the model; and the prior density $f_{k^{new}}^{-w_{ji}}(w_{ji}) = 1/|V|$ of $w_{ji}$, where $|V|$ is the vocabulary size of the dataset. If the sampled table is a new table, we sample for it either an existing topic $k_{jt^{new}}$ or a new topic $k^{new}$.
In the second phase, the model expands the parent topic of each concept produced in the first phase to a set of subtopics. Let $W_{z_q}$ be the words assigned to the parent topic $z_q$ in the first phase. The probability of sampling an existing table t for a word $w_{ji}$ in the Gibbs sampling process uses an indicator function $1_2(w_{ji}, k_{jt})$ that takes the value 1 if $w_{ji} \in W_{z_q}$ and $k_{jt} = z_q$, and 0 otherwise. For a new table $t^{new}$, the prior density is $1/|W_{z_q}|$, where $|W_{z_q}|$ is the vocabulary size of $W_{z_q}$. For a new table, we sample either an existing topic k or a new topic $k^{new}$ subordinate to the parent topic $z_q$. The model automatically decides the number of subtopics, and we treat the subtopics produced as the final topics relevant to the target concept.
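For orientation, the table-sampling step of the first phase can be sketched with the standard Chinese restaurant franchise weights of HDP (Teh et al., 2005), masked so that concept words can only join tables serving the parent topic. This is a sketch under stated assumptions, not a reproduction of the sampling equations of our model: the unnormalized form of the weights, the placement of the mask inside the new-table mixture, and all function and argument names are assumptions.

```python
def table_weights(word, tables, f, m, alpha, gamma, vocab_size,
                  concept_words, parent_topic):
    """Unnormalized sampling weights for existing tables and for a new table.

    tables : list of (n_jt, k_jt) pairs for the current document, with the
             current word already removed from the counts
    f      : callable f(k, word) giving the density of `word` under topic k
    m      : dict mapping each topic k to its number of tables m_k
    """
    def mask(k):
        # Assumed indicator: a concept word may only sit at parent-topic tables.
        return 0.0 if (word in concept_words and k != parent_topic) else 1.0

    # Existing table t: mask * n_jt * f_{k_jt}(w)
    existing = [mask(k) * n * f(k, word) for n, k in tables]

    # New table: alpha * [ sum_k m_k / (m. + gamma) * f_k(w)
    #                      + gamma / (m. + gamma) * 1 / |V| ]
    m_total = sum(m.values())
    mix = sum(mask(k) * m_k / (m_total + gamma) * f(k, word) for k, m_k in m.items())
    mix += gamma / (m_total + gamma) * (1.0 / vocab_size)
    return existing, alpha * mix
```

Normalizing the returned weights and drawing from the resulting categorical distribution yields the table assignment; the second phase would apply the same scheme restricted to the words assigned to the parent topic.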

Incorporating Generalized Pólya Urn scheme
To make topics more interpretable, we incorporate word embeddings through the Generalized Pólya Urn (GPU) scheme (Li et al., 2016). The Pólya Urn scheme is formulated in terms of colored balls and urns. In the Generalized Pólya Urn scheme, when we draw a ball of a particular color, two balls of the same color are put back along with a certain number of balls of similar colors. In the topic modeling context, a topic can be viewed as an urn, a word as a ball of a certain color, and its semantically related words as balls of similar colors. Every time we sample a word w under a parent topic $z_q$, we increase the probability of sampling w under $z_q$, as well as that of its semantically related concept words. Given pre-trained word embeddings, we calculate the cosine similarity between word $w_i$ and concept word $w_q \in W_q$. We then construct a word semantic relatedness matrix M (Li et al., 2016), consisting of all word pairs whose cosine similarity is greater than a predefined threshold. We then construct a promotion matrix A whose elements are defined as:

$$A_{w, w'} = \begin{cases} 1, & w = w' \\ u, & (w, w') \in M,\ w' \neq w \\ 0, & \text{otherwise} \end{cases}$$

where $u \in (0, 1)$ is a predefined promotion weight. When we sample a word w under topic $z_q$, we also promote all its semantically related concept words by the amount of promotion given in A.
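The promotion step can be sketched as follows. The code is an illustration under assumptions: it stores the promotion matrix sparsely as one dictionary per word, gives each word a self-promotion of 1 and each related concept word a promotion of u, and applies the promotion by adding these amounts to the topic-word counts. The function names are hypothetical; the default threshold of 0.5 and weight u = 0.3 follow the experimental settings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_promotion_rows(embeddings, concept_words, threshold=0.5, u=0.3):
    """For each word, map itself to 1.0 and each related concept word to u."""
    rows = {}
    for w, vec in embeddings.items():
        row = {w: 1.0}
        for c in concept_words:
            if c != w and c in embeddings and cosine(vec, embeddings[c]) > threshold:
                row[c] = u
        rows[w] = row
    return rows


def promote(topic_word_counts, word, rows):
    """Add the promotion amounts for `word` to the parent topic's word counts."""
    for other, amount in rows[word].items():
        topic_word_counts[other] = topic_word_counts.get(other, 0.0) + amount
```

Because the counts become fractional, the topic-word probabilities are computed from these promoted pseudo-counts rather than from raw integer counts.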
Word filtering. Inspired by prior work, we propose a word filtering strategy that prevents words with weak ties to the sampled topic from being promoted. For a word w at the i-th Gibbs sampling iteration, its semantic cohesion to topic k, $CV[k, w]$, is computed from $p_i(k, m)$, the probability of the m-th representative word in topic k at the i-th iteration, and $\cos(w_i, RW_i(k, m))$, the cosine similarity between word $w_i$ and the m-th representative word of topic k at the i-th iteration, where M is the predefined number of representative words. The representative words of a topic $k \neq z_q$ at the i-th Gibbs sampling iteration are the top words ranked by topic-word probability in descending order; the representative words of $z_q$ are simply its concept words. For the semantic cohesion of word w with the different topics, $CV[\cdot, w]$, we map $CV[\cdot, w]$ into an arithmetic progression $\widetilde{CV}[\cdot, w]$ ranging from 0 to 1.0. We then decide whether the GPU is applied to w through $S_{j,w}$, which indicates whether the GPU is applied to word w given document j, using $\widetilde{CV}_{max}[k, w]$, the maximal semantic cohesion among all topics.
We present the details of the Gibbs sampling process of the first phase of our model in Algorithm 1. We omit the details of the second phase of our model since it is similar to the first phase. The details of the functions Initialize(·) and UpdateCounter(·) can be found in Appendix C.

Experiments
We conducted our experiments in two steps. In the first step, we evaluated the quality of the parent topics from the first phase of our model. In the second step, we evaluated the quality of the subtopics from the second phase of our model.

Setup
Datasets. We conducted our experiments on three datasets: 20Newsgroup contains around 18k newsgroup posts on 20 topics; TagMyNews contains around 32k short English news items from 7 categories; SearchSnippets (Xu et al., 2017) contains 12k short web search snippets from 8 categories. We also compared our model with BERT (Devlin et al., 2018), a well-known neural language model, to test its document retrieval ability. To evaluate the quality of our subtopics with respect to the target concepts, we compared our model with HTM (Blei et al., 2004), which is also a nonparametric Bayesian model that can generate subtopics from higher level topics.
Parameterization We set α = 0.5, β = 0.1 for DF-LDA and α = 0.1, β = 0.01 for SCLDA as suggested by the original papers. We set α = 1/K, β = 1/K for LDA and ISLDA, where K is the number of topics pre-set for the models, and found it outperforms the original settings. We set α = 0.1, γ = 0.1 and η = 0.01 for HTM as suggested by the original paper and set the topic hierarchy depth to 3, to make it easier to compare with our model since topics from the second level of HTM can be considered as the parent topics and those from the third level as the subtopics of the parent topics. We set α = 1.0, β = 0.5 and γ = 1.5 as in the original HDP paper for our query-driven model, and set the threshold for the cosine similarity used for the Generalized Pólya Urn scheme to 0.5 and the promotion weight u to 0.3. The number of representative words M for the word filtering strategy was set to 10. λ for the REL query expansion technique was set to 0.5 and k was set to 100. In our experiments, we treated each category as a concept and determined the number of topics for each baseline model based on the number of categories in the datasets. For example, if a dataset had 16 categories, we set the number of topics to 17, using an extra one representing irrelevant information.
For all baseline models, we asked an expert to provide prior knowledge. Each category in a dataset was associated with 10 keywords provided by the expert. For DF-LDA, we converted keywords to must-links. Since LDA, DF-LDA, AVITM and HTM cannot reveal the relationship between a concept and the generated topics directly, we need a further step to find the relationship between them. We calculated the average pairwise cosine similarity of the keywords and the top-10 word embeddings of each topic, and chose the topic with the highest similarity as the target topic of the concept. For HTM, we use the topics from the second level of its generated topic hierarchy. For our model, we used query phrases to represent the main concept of each category. Query phrases were interpreted directly from category names, e.g., we used "computer graphics" to represent the category "comp.graphics" in the 20Newsgroup dataset. We removed categories that do not have meaningful names due to the difficulty of defining the query phrases for these categories, e.g., "talk.politics.misc" in the 20Newsgroup dataset. We then selected the top 10 concept words of each query based on the scores from the concept words extractor. We list expert-defined keywords and query phrase for each category in Appendix A. All models were trained until convergence. For BERT, we simply used the query phrases to retrieve relevant documents. We ran each model five times and present their average performance.

Parent Topic Evaluation
We evaluated the quality of the parent topics of our model in terms of document classification, topic coherence and document retrieval performance. For document classification, a logistic regression classifier with default parameter settings was used. We used the topic distribution of each document, which represents the probability of each topic in the document, as the input and conducted five-fold cross-validation. The quality of the topics can be assessed by the accuracy of text classification using this topic-level representation. A better classification accuracy means better latent semantic representations of the topics, indicating that the learnt topics are more discriminative and representative. For the topic coherence measure, we followed Röder et al. (2015) and used the best performing topic coherence measure, $C_V$, based on an external corpus (Wikipedia). We focused on the top 10 words of our parent topics and used the Palmetto library (Röder et al., 2015). Higher coherence indicates better topic interpretability. For document retrieval, we adopted the metric "precision@K" (P@K), which corresponds to the proportion of relevant results among the top K documents. We retrieved the documents of each topic based on the probability of the topic in the documents, p(z|d). If a topic describes the target concept well, then the top retrieved documents should be relevant to the concept. In our experiments, we know the ground-truth number of documents belonging to each category; we therefore set K for each concept to the actual number of documents in the corresponding category. We only considered the parent topics and reported the average results. A higher score indicates that the model retrieves more concept-relevant documents, which is important when a researcher wants to do a more in-depth study of related documents.
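The retrieval metric can be made concrete with a small sketch. It assumes that `topic_probs` maps each document identifier to p(z|d) for the parent topic under evaluation and that `relevant` is the set of documents in the corresponding category; both names are hypothetical.

```python
def precision_at_k(topic_probs, relevant, k):
    """Proportion of relevant documents among the top-k ranked by p(z|d)."""
    ranked = sorted(topic_probs, key=topic_probs.get, reverse=True)[:k]
    return sum(1 for doc_id in ranked if doc_id in relevant) / k


# K is set to the ground-truth number of documents in the category.
topic_probs = {"d1": 0.9, "d2": 0.7, "d3": 0.2, "d4": 0.1}
relevant = {"d1", "d3"}
score = precision_at_k(topic_probs, relevant, k=len(relevant))
```

Because K equals the category size, a perfect topic would score 1.0, and the metric coincides with recall at the same cut-off.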
Table 1 shows the performance of our models using three different query expansion techniques as well as using only expert-defined keywords as prior knowledge. It can be observed that our models using the FRE and KLD query expansion techniques outperform all baselines except AVITM on almost all measures, though our model using the FRE query expansion technique has slightly worse document retrieval performance on the TagMyNews dataset. Although the model using REL is not as competitive as the models using FRE and KLD on the 20Newsgroup dataset, it has the highest coherence scores on all the datasets, despite not using any expert-defined keywords as prior knowledge. This shows that combining word embeddings in query expansion can help produce more coherent prior knowledge. Although AVITM has better coherence scores than our model on the 20Newsgroup and TagMyNews datasets, its poor document retrieval performance indicates that it is unable to find documents relevant to the target concept. Comparing our query-driven model with and without expert-defined keywords, it achieves better coherence scores on the 20Newsgroup dataset without expert-defined keywords, though it performs slightly worse on the other two datasets.
As for document classification and document retrieval, our KLD-based model performs better than expert-defined keywords on 20Newsgroup and TagMyNews, but does not work well on SearchSnippets. This may be because our concept words extractor does not work well on short texts. The concept words extracted from TagMyNews and SearchSnippets are not as competitive as expert-defined keywords. In addition, different interpretations of the same concept word may also compromise the performance. Interestingly, we also observed that BERT does not work well on these datasets. Possibly this is because we are using short query phrases to represent the concepts and BERT only works well for long queries. A short query may not give enough information about the concept, which is why we adopted query expansion and topic modeling approaches. Table 2 shows example concept-specific topics extracted from the TagMyNews dataset. It can be seen that the extracted topics are closely related to their respective concept phrases. Topic extraction results on the other two datasets are shown in Appendix B.
Ablation study. We also studied the effectiveness of two major components of the proposed model: 1) GPU to incorporate word embeddings; 2) word filtering to remove unimportant words. The last two rows in Table 1 show the performance of our model using the KLD query expansion technique without the GPU and word filtering components. These show that GPU has a large impact on coherence and can help improve the other measures to some extent, while removing word filtering reduces performance on all measures.

Subtopics Evaluation
We used our model with the KLD-query expansion technique in this evaluation. We dropped subtopics that have prevalence of less than 0.5% in the corpus as these subtopics usually are not of interest.
We evaluated the quality of subtopics in terms of topic diversity and topic cohesion. Topic diversity measures how much the subtopics overlap with one another. We define it to be the percentage of unique words in the top 25 words of all subtopics subordinate to the same parent topic (Dieng et al., 2020). Higher diversity indicates more varied topics, while lower diversity indicates more redundant topics. Topic cohesion measures the relevance between the subtopics and the parent topic. We define it to be the cosine similarity between the parent topic embedding and the subtopic embedding, where a topic embedding is the weighted sum of the embeddings of its top 10 associated words. We combine these two metrics and define the overall quality of a subtopic as their product.
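The two subtopic metrics can be sketched as follows. The sketch assumes that each topic is given as a list of its top words in descending order of weight, that word embeddings are available as plain lists of floats, and that the weights in the topic embedding are the topic-word weights; the function names are hypothetical.

```python
import math


def topic_diversity(subtopic_top_words, top_n=25):
    """Fraction of unique words among the top_n words of all sibling subtopics."""
    words = [w for topic in subtopic_top_words for w in topic[:top_n]]
    return len(set(words)) / len(words)


def topic_embedding(top_words, weights, embeddings):
    """Weighted sum of the embeddings of a topic's top words."""
    dim = len(embeddings[top_words[0]])
    vec = [0.0] * dim
    for word, weight in zip(top_words, weights):
        for i, x in enumerate(embeddings[word]):
            vec[i] += weight * x
    return vec


def topic_cohesion(parent_vec, sub_vec):
    """Cosine similarity between parent-topic and subtopic embeddings."""
    dot = sum(a * b for a, b in zip(parent_vec, sub_vec))
    norm_p = math.sqrt(sum(a * a for a in parent_vec))
    norm_s = math.sqrt(sum(b * b for b in sub_vec))
    return dot / (norm_p * norm_s) if norm_p and norm_s else 0.0


def overall_quality(diversity, cohesion):
    """Overall subtopic quality: the product of the two metrics."""
    return diversity * cohesion
```

Taking the product means that a set of subtopics scores well only if it is both varied and close to the parent topic; a high value on one metric cannot compensate for a near-zero value on the other.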
We report the results in Table 3. It shows that our model outperforms HTM on the topic cohesion measure on all datasets, though with lower topic diversity scores. The high cohesion score indicates that the subtopics of our model are highly relevant to the target concept. Taking both measures into account, our model achieves relatively better performance. This is expected, since we incorporate domain prior knowledge into our model.

Qualitative Evaluation
We present the qualitative evaluation results in this section. We show a set of topics produced by our model from the 20Newsgroup dataset in Table 4. The input concept phrases are shown on the left side of the table. For the concept phrase "atheism", which means the absence of belief in the existence of deities, we can see that our parent topic is highly relevant to it. Topic words such as "question", "belief", "god" and "lack" clearly indicate that the topic is related to arguments about God. Looking at the subtopics, we can easily see that the first subtopic is about atheist morality, the second is about the arguments between atheism and theism, and the third is about scientific explanations of atheism. For the concept phrase "for sale", which means selling an item at a cheaper price, our parent topic includes many relevant words, such as "sell", "sale", "sold" and "price". The inclusion of "box" is less easy to explain, but could be related to product packaging. As expected, our subtopic reveals a sub-aspect of the concept that cannot be identified directly from the parent topic: email subscription for the latest news. This is reasonable, since merchants usually use email to provide customers with information about the latest products. Words such as "interested", "mail", "send", "information" and "original" provide more information about the concept.
The last row of Table 4 presents the topics of the low-occurrence query "business", which appears only 294 times in the corpus, an extremely low count compared with the majority of other words in the corpus. LDA and HTM are unable to generate relevant topics due to the aforementioned "higher order co-occurrence" issue, but our model can produce reasonable topics. The topic words "encryption", "key", "phone", "company" and "business" in the parent topic show that the topic is related to data encryption for business. Looking at the subtopics of the topic, we can get a rough idea that the first subtopic is about a Japanese phone company, since the top weighted words include "japanese", "company", "phone" and "technology". This makes sense, since phone companies usually have a strong requirement for encryption. The second and third subtopics are about encryption algorithms, since the top weighted words include "chip", "nsa", "key" and "algorithm". We further verified that our interpretation is correct by looking at the top weighted documents of the topics. This confirms that our model has potential for use in real world applications.
Table 4: Parent topic and the subtopics of the concepts "atheism", "for sale" and "business" for the 20Newsgroup dataset.

Conclusions and Future Work
We presented a novel, query-driven topic model to help identify topics of interest in large datasets. Instead of asking experts to define keywords for these topics, we implemented a concept words extractor to automatically extract concept words and used the GPU model, incorporating word-filtering, to improve interpretability and performance. To distinguish between different contexts for the same concept, we further introduced a subtopic modeling procedure. The procedure can automatically infer all subtopics without having to determine the optimal number of subtopics beforehand. Experimental results on three benchmark datasets demonstrate the model's promise. In the future, we plan to evaluate our model's performance using real-world, qualitative analysis use cases.