Seed-Guided Topic Discovery with Out-of-Vocabulary Seeds

Discovering latent topics from text corpora has been studied for decades. Many existing topic models adopt a fully unsupervised setting, and their discovered topics may not cater to users' particular interests because they cannot leverage user guidance. Although there exist seed-guided topic discovery approaches that leverage user-provided seeds to discover topic-representative terms, they are less concerned with two factors: (1) the existence of out-of-vocabulary seeds and (2) the power of pre-trained language models (PLMs). In this paper, we generalize the task of seed-guided topic discovery to allow out-of-vocabulary seeds. We propose a novel framework, named SeeTopic, in which the general knowledge of PLMs and the local semantics learned from the input corpus mutually benefit each other. Experiments on three real datasets from different domains demonstrate the effectiveness of SeeTopic in terms of topic coherence, accuracy, and diversity.


Introduction
Automatically discovering informative and coherent topics from massive text corpora is central to text analysis: it helps users efficiently digest a large collection of documents (Griffiths and Steyvers, 2004) and advances downstream applications such as summarization (Wang et al., 2009, 2022), classification (Chen et al., 2015; Meng et al., 2020b), and generation (Liu et al., 2021).
Unsupervised topic models have been the mainstream approach to topic discovery since the proposal of pLSA (Hofmann, 1999) and LDA (Blei et al., 2003). Despite their encouraging performance in finding informative latent topics, these topics may not reflect user preferences well, mainly due to their unsupervised nature. For example, given a collection of product reviews, a user may be specifically interested in product categories (e.g., "books", "electronics"), but unsupervised topic models may generate topics containing different sentiments (e.g., "good", "bad"). To consider users' interests and needs, seed-guided topic discovery approaches (Jagarlamudi et al., 2012; Gallagher et al., 2017; Meng et al., 2020a) have been proposed to find representative terms for each category based on user-provided seeds or category names. However, two factors remain less explored in these approaches.

Table 1: Three datasets (Cohan et al., 2020; McAuley and Leskovec, 2013; Zhang et al., 2017) from different domains and their topic categories (i.e., seeds). Red: seeds never seen in the corpus (i.e., out-of-vocabulary). In all three datasets, a large proportion of seeds are out-of-vocabulary.
The Existence of Out-of-Vocabulary Seeds. Previous studies (Jagarlamudi et al., 2012; Gallagher et al., 2017; Meng et al., 2020a) assume that all user-provided seeds are in-vocabulary (i.e., appear at least once in the input corpus), so that they can utilize occurrence statistics or Skip-Gram embedding methods (Mikolov et al., 2013) to model seed semantics. However, user-interested categories can have specific or composite descriptions, which may never appear in the corpus. Table 1 shows three datasets from different domains: scientific papers, product reviews, and social media posts. In each dataset, documents can belong to one or more categories, and we list the category names provided by the dataset collectors. These seeds should reflect their particular interests. In all three datasets, a large proportion of seeds (45% in SciDocs, 60% in Amazon, and 78% in Twitter) never appear in the corpus. Some category names are too specific to be exactly matched (e.g., "chronic respiratory diseases", "nightlife spot"); others are compositions of multiple entities (e.g., "hepatitis a/b/c/e", "neoplasms (cancer)", "clothing, shoes and jewelry").
The Power of Pre-trained Language Models. Techniques used in previous studies are mainly based on LDA variants (Jagarlamudi et al., 2012) or context-free embeddings (Meng et al., 2020a). Recently, pre-trained language models (PLMs) such as BERT (Devlin et al., 2019) have achieved significant improvements in a wide range of text mining tasks. In topic discovery, the generic representation power of PLMs, learned from web-scale corpora (e.g., Wikipedia or PubMed), can complement the information a model obtains from the input corpus. Moreover, out-of-vocabulary seeds usually have meaningful in-vocabulary components (e.g., "night" and "life" in "nightlife spot", "health" and "care" in "health and personal care"). The optimized tokenization strategies of PLMs (Sennrich et al., 2016; Wu et al., 2016) can segment seeds into such meaningful components (e.g., "nightlife" → "night" and "##life"), and the contextualization power of PLMs can infer the correct meaning of each component (e.g., "##life" and "care") in the category name. Therefore, PLMs are much needed for handling out-of-vocabulary seeds and effectively learning their semantics.
Contributions. Being aware of these two factors, in this paper, we study seed-guided topic discovery in the presence of out-of-vocabulary seeds. Our proposed SEETOPIC framework consists of two modules: (1) The general representation module uses a PLM to derive the representation of each term (including out-of-vocabulary seeds) based on the general linguistic knowledge acquired through pre-training. (2) The seed-guided local representation module learns in-vocabulary term embeddings specific to the input corpus and the given seeds. To optimize the learned representations for topic coherence, which is commonly reflected by pointwise mutual information (PMI) (Newman et al., 2010), our objective implicitly maximizes the PMI between each word and its context, the documents it appears in, and the category it belongs to. The learning of the two modules is connected through an iterative ensemble ranking process, in which the general knowledge of PLMs and the term representations learned from the target corpus conditioned on the seeds complement each other.

One possible way to deal with composite seeds is to split them into multiple seeds. However, there are many possible ways to express conjunctions (e.g., "/", "()", "," and "and" in Table 1), which may require manual tuning. Besides, simple chunking rules can induce splits that break the semantics of the original composition (e.g., "professional and other places" may be split into "professional" and "other places"). Moreover, even after the split, some seeds are still out-of-vocabulary. Therefore, we propose to use PLMs to tackle out-of-vocabulary seeds in a unified way. In experiments, we show that our model is able to tackle composite seeds: for example, given the seed "hepatitis a/b/c/e", we can find terms relevant to "hepatitis b" and "hepatitis c" (see Table 4).
To summarize, this study makes three contributions. (1) Task: we propose to study seed-guided topic discovery in the presence of out-of-vocabulary seeds. (2) Framework: we design a unified framework that jointly models general knowledge through PLMs and local corpus statistics through embedding learning. (3) Experiment: extensive experiments on three datasets demonstrate the effectiveness of SEETOPIC in terms of topic coherence, accuracy, and diversity.

Problem Definition
As shown in Table 1, we assume a seed can be either a single word or a phrase. Given a corpus D, we use V_D to denote the set of terms appearing in D. In accordance with this assumption on category names, each term can also be a single word or a phrase. In practice, given a raw corpus, one can use existing phrase chunking tools (Manning et al., 2014; Shang et al., 2018) to detect phrases in it. After phrase chunking, if a category name is still not in V_D, we define it as out-of-vocabulary.

Problem Definition. Given a corpus D = {d_1, ..., d_|D|} and a set of category names C = {c_1, ..., c_|C|}, where some category names are out-of-vocabulary, the task is to find a set of in-vocabulary terms S_i = {w_1, ..., w_{|S_i|}} ⊆ V_D for each category c_i such that each term in S_i is semantically close to c_i and far from all other categories c_j (∀ j ≠ i).

The SEETOPIC Framework
In this section, we first introduce how we model general and local text semantics using a PLM module and a seed-guided embedding learning module, respectively. Then, we present the iterative ensemble ranking process and our overall framework.

Modeling General Text Semantics using a PLM
PLMs such as BERT (Devlin et al., 2019) aim to learn generic language representations from web-scale corpora (e.g., Wikipedia or PubMed) that can be applied to a wide variety of text-related applications. To transfer such general knowledge to our topic discovery task, we employ a PLM to encode each category name and each in-vocabulary term into a vector. Specifically, given a term w ∈ C ∪ V_D, we input the sequence "[CLS] w [SEP]" into the PLM. Here, w can be a phrase containing multiple words, and each word can be out of the PLM's vocabulary. To deal with this, most PLMs use a pre-trained tokenizer (Sennrich et al., 2016; Wu et al., 2016) to segment each unseen word into frequent subwords. The contextualization power of the PLM then helps infer the correct meaning of each word/subword, so as to provide a more precise representation of the whole category name.
After LM encoding, following (Sia et al., 2020; Thompson and Mimno, 2020; Li et al., 2020), we take the output representations of all tokens from the last layer and average them to get the term embedding e_w. In this way, even if a seed c_i is out-of-vocabulary, we can still obtain its representation e_{c_i}.
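As a minimal sketch of this pooling step, the function below mean-pools per-token vectors for the sequence "[CLS] w [SEP]". The `toy_tokenize` and `toy_encode` helpers are hypothetical stand-ins for a real PLM tokenizer and encoder (in practice one would use a library such as HuggingFace Transformers); only the pooling logic reflects the text above.

```python
import numpy as np

def term_embedding(term, tokenize, encode):
    """Mean-pool the last-layer token vectors of "[CLS] term [SEP]"."""
    tokens = ["[CLS]"] + tokenize(term) + ["[SEP]"]
    vectors = encode(tokens)  # shape: (num_tokens, hidden_dim)
    return vectors.mean(axis=0)

# Hypothetical stand-in for a WordPiece-style tokenizer, which would split
# an unseen word into subwords, e.g., "nightlife" -> ["night", "##life"].
def toy_tokenize(term):
    return term.replace("nightlife", "night ##life").split()

# Hypothetical stand-in for the PLM encoder: a fixed random vector per token.
rng = np.random.default_rng(0)
vocab_vectors = {}
def toy_encode(tokens):
    return np.stack([vocab_vectors.setdefault(t, rng.standard_normal(768))
                     for t in tokens])

# Even an out-of-vocabulary seed gets a 768-dimensional embedding.
e = term_embedding("nightlife spot", toy_tokenize, toy_encode)
print(e.shape)  # (768,)
```

Swapping the two stand-ins for a real tokenizer and encoder yields the actual e_w used in the framework.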

Modeling Local Text Semantics in the Input Corpus
The motivation of topic discovery is to discover latent topic structures from the input corpus. Therefore, purely relying on general knowledge in the PLM is insufficient because topic discovery results should adapt to the input corpus D. Now, we introduce how we learn another set of embeddings {u_w | w ∈ V_D} from the input corpus and the given seeds. Previous studies on embedding learning assume that the semantics of a term are similar to those of its local context (Mikolov et al., 2013), the document it appears in (Tang et al., 2015; Xun et al., 2017a), and the category it belongs to (Meng et al., 2020a). Inspired by these studies, we propose the following embedding learning objective:

$$\mathcal{J} = \sum_{d \in \mathcal{D}} \sum_{w_i \in d} \Big( \sum_{w_j \in \mathcal{C}(w_i, h)} \log p(w_j \mid w_i) + \log p(d \mid w_i) \Big) + \sum_{c_i \in \mathcal{C}} \sum_{w \in \mathcal{S}_i} \log p(c_i \mid w), \quad (1)$$

where

$$p(z \mid w_i) = \frac{\exp(u_{w_i}^\top v_z)}{\sum_{z'} \exp(u_{w_i}^\top v_{z'})} \quad (z \text{ can be } w_j, d, \text{ or } c_i). \quad (2)$$

In this objective, u_{w_i} (and v_{w_j}), v_d, and v_{c_i} are the embedding vectors of terms, documents, and categories, respectively.
Here, h is the context window size.
Note that the last term in Eq. (1) encourages the similarity between each category c_i and its representative terms S_i. Here, we adopt an iterative process to gradually update category-representative terms. Initially, S_i consists of just a few in-vocabulary terms similar to c_i according to the PLM. At each iteration, the size of S_i increases to contain more category-discriminative terms (the selection criterion of these terms will be introduced in the next section), and we encourage their proximity to c_i in the next iteration.
Directly optimizing the full softmax in Eq. (2) is costly. Therefore, we adopt the negative sampling strategy (Mikolov et al., 2013) for efficient approximation.
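A minimal sketch of the negative-sampling approximation is shown below, assuming a single (term, positive target) pair with b sampled negatives. It is an illustrative gradient-ascent step, not the authors' exact implementation; the positive target z stands for a context word, a document, or a category, as in Eq. (2).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_loss(u, v_pos, v_negs):
    # Negated negative-sampling objective for one (term, positive target) pair:
    # log sigma(u . v_pos) + sum over negatives of log sigma(-u . v_neg)
    return -(np.log(sigmoid(u @ v_pos)) +
             sum(np.log(sigmoid(-u @ v)) for v in v_negs))

def ns_step(u, v_pos, v_negs, lr=0.1):
    """One gradient-ascent step on the negative-sampling objective."""
    g_pos = 1.0 - sigmoid(u @ v_pos)      # d/ds log sigma(s) at s = u.v_pos
    grad_u = g_pos * v_pos
    new_vp = v_pos + lr * g_pos * u
    new_negs = []
    for v in v_negs:
        g_neg = -sigmoid(u @ v)           # d/ds log sigma(-s)
        grad_u = grad_u + g_neg * v
        new_negs.append(v + lr * g_neg * u)
    return u + lr * grad_u, new_vp, new_negs

rng = np.random.default_rng(1)
u = rng.standard_normal(8) * 0.1
vp = rng.standard_normal(8) * 0.1
negs = [rng.standard_normal(8) * 0.1 for _ in range(5)]
before = ns_loss(u, vp, negs)
for _ in range(10):
    u, vp, negs = ns_step(u, vp, negs)
after = ns_loss(u, vp, negs)
```

A few steps pull u toward the positive target and away from the sampled negatives, so the loss decreases.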
Interpreting the Objective. In topic modeling studies, pointwise mutual information (PMI) (Newman et al., 2010) is a standard evaluation metric for topic coherence (Lau et al., 2014; Röder et al., 2015). Levy and Goldberg (2014) prove that the Skip-Gram embedding model implicitly factorizes the PMI matrix. Following their proof, we can show that maximizing Eq. (1) implicitly performs the following factorization:

$$u_{w_i}^\top v_{w_j} = X_{ww}[i][j], \qquad u_{w_i}^\top v_d = X_{wd}[i][d], \qquad u_{w_i}^\top v_{c_j} = X_{wc}[i][j], \quad (3)$$

where X_ww, X_wd, and X_wc are shifted PMI matrices; for example, X_ww[i][j] = log(#_D(w_i, w_j) λ_D / (#_D(w_i) #_D(w_j))) − log b, and the document- and category-level matrices are defined analogously.
Here, #_D(w_i, w_j) denotes the number of co-occurrences of w_i and w_j within a context window in D; #_D(w) denotes the number of occurrences of w in D; λ_D is the total number of terms in D; #_d(w) denotes the number of times w occurs in d; λ_d is the total number of terms in d; b is the number of negative samples. (For the derivation of Eq. (3), please refer to Appendix A.) To summarize, the learned local representations u_w are implicitly optimized for topic coherence, where term co-occurrences are measured at the context, document, and category levels.
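To make the count-based quantities concrete, the sketch below computes the shifted word-word PMI entry, PMI(w_i, w_j) − log b, from a toy corpus using the definitions above (window co-occurrence counts, unigram counts, and total term count λ_D). It is an illustration under those definitions, not the paper's code.

```python
import math
from collections import Counter

def shifted_pmi_ww(docs, h=2, b=5):
    """Return a lookup for log(#(wi,wj) * lambda_D / (#(wi) * #(wj))) - log b,
    counting co-occurrences inside a forward context window of size h."""
    pair, single, total = Counter(), Counter(), 0
    for doc in docs:
        for i, w in enumerate(doc):
            single[w] += 1
            total += 1  # running value of lambda_D
            for j in range(i + 1, min(i + 1 + h, len(doc))):
                pair[tuple(sorted((w, doc[j])))] += 1
    def entry(wi, wj):
        c = pair[tuple(sorted((wi, wj)))]
        if c == 0:
            return float("-inf")  # terms that never co-occur
        return math.log(c * total / (single[wi] * single[wj])) - math.log(b)
    return entry

docs = [["hepatitis", "b", "virus"],
        ["hepatitis", "c", "infection"],
        ["computer", "virus", "software"]]
x = shifted_pmi_ww(docs)
```

Here x("hepatitis", "b") is finite because the pair co-occurs, while x("hepatitis", "software") is −∞ because it never does.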

Ensemble Ranking
We have obtained two sets of term embeddings that model text semantics from different angles: {e_w | w ∈ C ∪ V_D} carries the PLM's knowledge, while {u_w | w ∈ V_D} models the input corpus as well as the user-provided seeds. We now propose an ensemble ranking method that leverages information from both sides to select more discriminative terms for each category.
Given a category c_i and its current term set S_i, we first calculate two scores for each term w ∈ V_D:

$$\text{score}_G(w \mid \mathcal{S}_i) = \frac{1}{|\mathcal{S}_i|} \sum_{w' \in \mathcal{S}_i} \cos(e_w, e_{w'}), \quad (4)$$

$$\text{score}_L(w \mid \mathcal{S}_i) = \frac{1}{|\mathcal{S}_i|} \sum_{w' \in \mathcal{S}_i} \cos(u_w, u_{w'}). \quad (5)$$

The subscript "G" here means "general", while "L" means "local". Then, we sort all terms by these two scores, respectively, so each term w gets two rank positions rank_G(w) and rank_L(w). We propose the following ensemble score based on the reciprocal ranks:

$$\text{score}(w \mid \mathcal{S}_i) = \Big( \frac{\text{rank}_G(w)^{-\rho} + \text{rank}_L(w)^{-\rho}}{2} \Big)^{1/\rho}. \quad (6)$$

Here, 0 < ρ ≤ 1 is a constant. In practice, instead of ranking all terms in the vocabulary, we only check the top-M results in the two ranking lists. If a term w is not among the top-M according to score_G(w) (resp., score_L(w)), we set rank_G(w) = +∞ (resp., rank_L(w) = +∞). When ρ = 1, Eq. (6) becomes the arithmetic mean of the two reciprocal ranks, (1/2)(1/rank_G(w) + 1/rank_L(w)). This is essentially the mean reciprocal rank (MRR) commonly used in ensemble ranking, where a high position in one ranking list can largely compensate for a low position in the other. In contrast, when ρ → 0, Eq. (6) becomes the geometric mean of the two reciprocal ranks (see Appendix B), where both ranking lists have "veto power" (i.e., a term needs to be ranked among the top-M in both lists to obtain a non-zero ensemble score). In experiments, we set ρ = 0.1 and show that it outperforms MRR (i.e., ρ = 1) in our topic discovery task.
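The ensemble score above is a power mean of the two reciprocal ranks, with ranks beyond the top-M treated as +∞ (reciprocal 0). The function below is an illustrative implementation of this idea, a sketch rather than the paper's code.

```python
def ensemble_score(rank_g, rank_l, rho=0.1, M=100):
    """Power mean of two reciprocal ranks.

    rho = 1 recovers the arithmetic mean (MRR); rho -> 0 approaches the
    geometric mean, so a term outside either top-M list is heavily penalized.
    """
    rg = 1.0 / rank_g if rank_g <= M else 0.0  # rank beyond top-M -> 0
    rl = 1.0 / rank_l if rank_l <= M else 0.0
    if rho < 1e-12:                            # geometric-mean limit
        return (rg * rl) ** 0.5
    return ((rg ** rho + rl ** rho) / 2.0) ** (1.0 / rho)

# A term ranked 1st in one list but absent from the other's top-M:
print(ensemble_score(1, 200, rho=1.0))   # MRR still gives 0.5
print(ensemble_score(1, 200, rho=0.1))   # near-veto: ~0.001
```

With ρ = 0.1 the missing list multiplies the score by (1/2)^10 ≈ 0.001, which matches the "veto power" behavior described above.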
After computing the ensemble score score(w | S_i) for each w, we update S_i. To guarantee that each S_i is category-discriminative, we do not allow any term to belong to more than one category. Therefore, we expand each S_i gradually, by turns. At the beginning, we reset S_1 = ... = S_|C| = ∅. When it is S_i's turn, we add one term to S_i according to the following criterion:

$$w^* = \arg\max_{w \in \mathcal{V}_{\mathcal{D}} \setminus (\mathcal{S}_1 \cup \cdots \cup \mathcal{S}_{|\mathcal{C}|})} \text{score}(w \mid \mathcal{S}_i). \quad (7)$$
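The round-robin expansion that keeps S_1, ..., S_|C| mutually exclusive can be sketched as follows; the toy similarity table stands in for the ensemble score and is purely hypothetical.

```python
def expand_by_turns(categories, vocab, score, turns=3, n_per_turn=1):
    """Round-robin expansion keeping the term sets mutually exclusive:
    each turn, every category greedily takes its best unclaimed term."""
    sets = {c: [] for c in categories}
    taken = set()
    for _ in range(turns):
        for c in categories:
            for _ in range(n_per_turn):
                candidates = [w for w in vocab if w not in taken]
                if not candidates:
                    return sets
                best = max(candidates, key=lambda w: score(c, w, sets[c]))
                sets[c].append(best)
                taken.add(best)
    return sets

# Hypothetical similarity table standing in for score(w | S_i):
sim = {("books", "novel"): .9, ("books", "fiction"): .8,
       ("books", "laptop"): .1, ("books", "camera"): .2,
       ("electronics", "novel"): .1, ("electronics", "fiction"): .2,
       ("electronics", "laptop"): .9, ("electronics", "camera"): .8}
vocab = ["novel", "fiction", "laptop", "camera"]
result = expand_by_turns(["books", "electronics"], vocab,
                         lambda c, w, S: sim[(c, w)], turns=2)
print(result)  # {'books': ['novel', 'fiction'], 'electronics': ['laptop', 'camera']}
```

Because every claimed term goes into `taken`, no term can appear in two categories, mirroring the mutual-exclusivity constraint.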

Overall Framework
We summarize the entire SEETOPIC framework in Algorithm 1. To deal with out-of-vocabulary category names, we first utilize a PLM to find their nearest in-vocabulary terms as the initial category-discriminative term set S_i (Lines 1-7). Note that for an in-vocabulary category name c_i ∈ V_D, the name itself will be added to the initial S_i as the top-1 similar in-vocabulary term. After initialization, we update S_i for T iterations (Lines 8-16). At each iteration, according to the up-to-date S_1, S_2, ..., S_|C|, we relearn the embeddings u_w, v_w, v_d, and v_{c_i} using Eq. (1) (Line 10). The two sets of embeddings, {e_w | w ∈ C ∪ V_D} (computed at Line 1) and {u_w | w ∈ V_D} (updated at Line 10), are then leveraged to perform ensemble ranking (Lines 11-12). Based on the ensemble score score(w | S_i), we update S_i using Eq. (7) (Lines 13-16). After the t-th iteration, |S_i| = (t + 1)N (∀ 1 ≤ i ≤ |C|).

Each review belongs to one or more product categories. We use the subset sampled by Zhang et al. (2020, 2022), which contains 10 categories and 100K reviews. (3) Twitter (Zhang et al., 2017) is a crawl of geo-tagged tweets in New York City from August 2014 to November 2014. The dataset collectors link these tweets with Foursquare's POI database and assign them to 9 POI categories. We take these category names as input seeds.
Seeds used in the three datasets are shown in Table 1. Dataset statistics are summarized in Table 2. For all three datasets, we use AutoPhrase (Shang et al., 2018) to perform phrase chunking in the corpus, and we remove words and phrases occurring fewer than 3 times.
Previous studies (Jagarlamudi et al., 2012; Meng et al., 2020a) have used other datasets (e.g., RCV1, 20 Newsgroups, NYT, and Yelp). However, the category names they use in these datasets are all picked from in-vocabulary terms. Therefore, we do not consider these datasets for evaluation in our task setting. Following (Sia et al., 2020), we adopt a 60-40 train-test split for all three datasets. The training set is used as the input corpus D, and the test set is used for calculating topic coherence metrics (see evaluation metrics for details).
Compared Methods. We compare our SEETOPIC framework with the following methods, including seed-guided topic modeling methods, seed-guided embedding learning methods, and PLMs.
(1) SeededLDA (Jagarlamudi et al., 2012) is a seed-guided topic modeling method. It improves LDA by biasing topics to produce input seeds and by biasing documents to select topics relevant to the seeds they contain. (2) Anchored CorEx (Gallagher et al., 2017) is a seed-guided topic modeling method. It incorporates user-provided seeds by balancing between compressing the input corpus and preserving seed-related information. (3) Labeled ETM (Dieng et al., 2020) is an embedding-based topic modeling method. It incorporates a distributed representation of each term. Following (Meng et al., 2020a), we retrieve representative terms according to their embedding similarity with the category name. (4) CatE (Meng et al., 2020a) is a seed-guided embedding learning method for discriminative topic discovery. It takes category names as input and jointly learns term embedding and specificity from the input corpus. Category-discriminative terms are then selected based on both embedding similarity with the category and specificity. (5) BERT (Devlin et al., 2019) is a PLM. Following Lines 1-7 in Algorithm 1, we use BERT to encode each input category name and each term into a vector, and then perform similarity search to directly find all representative terms. (6) BioBERT (Lee et al., 2020) is a PLM used in the same way as BERT. Since BioBERT is specifically trained for biomedical text mining tasks, we report its performance on the SciDocs dataset only. (7) SEETOPIC-NoIter is a variant of our SEETOPIC framework. In Algorithm 1, after initialization (Lines 1-7), it executes Lines 9-16 only once (i.e., T = 1) to find all representative terms.
Here, all seed-guided topic modeling and embedding baselines (i.e., SeededLDA, Anchored CorEx, CatE, and Labeled ETM) can only take in-vocabulary seeds as input. For a fair comparison, we run Lines 1-7 in Algorithm 1 to get the initial representative in-vocabulary terms for each category and input these terms as seeds into the baselines. In other words, all compared methods use BERT/BioBERT to initialize their term sets.
Evaluation Metrics. We evaluate topic discovery results from three different angles: topic coherence, term accuracy, and topic diversity.
(1) NPMI (Lau et al., 2014) is a standard metric in topic modeling to measure topic coherence. Within each topic, it calculates the normalized pointwise mutual information for each pair of terms in S_i:

$$\text{NPMI} = \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} \frac{1}{\binom{|\mathcal{S}_i|}{2}} \sum_{\substack{w_j, w_k \in \mathcal{S}_i \\ j < k}} \frac{\log \frac{P(w_j, w_k)}{P(w_j) P(w_k)}}{-\log P(w_j, w_k)}, \quad (8)$$

where P(w_j, w_k) is the probability that w_j and w_k co-occur in a document, and P(w_j) is the marginal probability of w_j. When calculating Eqs. (8) and (9), to avoid log 0, we replace P(w_j, w_k) and P(w) with P(w_j, w_k) + ε and P(w) + ε, respectively, where ε = 1/|D|.

(2) LCP (Mimno et al., 2011) is another standard metric to measure topic coherence. It calculates the pairwise log conditional probability of top-ranked terms:

$$\text{LCP} = \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} \frac{1}{\binom{|\mathcal{S}_i|}{2}} \sum_{\substack{w_j, w_k \in \mathcal{S}_i \\ j < k}} \log \frac{P(w_j, w_k)}{P(w_k)}. \quad (9)$$
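A small sketch of document-level NPMI with the ε = 1/|D| smoothing described above; the toy documents and terms are hypothetical.

```python
import math
from itertools import combinations

def topic_npmi(topic_terms, documents):
    """Average NPMI over term pairs; document-level co-occurrence
    probabilities are smoothed with eps = 1/|D| to avoid log 0."""
    n = len(documents)
    eps = 1.0 / n
    docsets = [set(d) for d in documents]
    def p(*ws):
        return sum(all(w in d for w in ws) for d in docsets) / n
    vals = []
    for wj, wk in combinations(topic_terms, 2):
        pjk = p(wj, wk) + eps
        pj, pk = p(wj) + eps, p(wk) + eps
        vals.append(math.log(pjk / (pj * pk)) / (-math.log(pjk)))
    return sum(vals) / len(vals)

docs = [["a", "b"], ["a", "b"], ["c", "d"], ["c", "d"], ["a", "d"]]
coherent = topic_npmi(["a", "b"], docs)     # terms that co-occur often
incoherent = topic_npmi(["a", "c"], docs)   # terms that never co-occur
print(round(coherent, 3), round(incoherent, 3))
```

Frequently co-occurring terms score above zero while non-co-occurring terms score below it, which is the intended ordering for a coherence metric.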
Note that PMI (Newman et al., 2010) is also a standard metric for topic coherence. We do observe that SEETOPIC outperforms baselines in terms of PMI in most cases. However, since our local embedding step implicitly optimizes a PMI-like objective, we do not use PMI as an evaluation metric.
(3) MACC (Meng et al., 2020a) measures term accuracy. It is defined as the proportion of retrieved terms that actually belong to the corresponding category according to the category name:

$$\text{MACC} = \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} \frac{1}{|\mathcal{S}_i|} \sum_{w_j \in \mathcal{S}_i} \mathbb{1}(w_j \in c_i), \quad (10)$$

where 1(w_j ∈ c_i) is the indicator function of whether w_j is relevant to category c_i. MACC requires human evaluation, so we invite five annotators to perform independent annotation. The reported MACC score is the average MACC of the five annotators. A high inter-annotator agreement is observed, with Fleiss' kappa (Fleiss, 1971) being 0.856, 0.844, and 0.771 on SciDocs, Amazon, and Twitter, respectively.
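Fleiss' kappa can be computed from an item-by-category count table; below is a self-contained sketch (the toy rating tables are hypothetical, not the paper's annotation data).

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of annotators assigning item i to category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Per-category proportion of all assignments
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_cats)]
    # Per-item observed agreement
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    P_bar = sum(P_i) / n_items       # mean observed agreement
    P_e = sum(p * p for p in p_j)    # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Five annotators, two categories (e.g., relevant / not relevant):
perfect = fleiss_kappa([[5, 0], [0, 5], [5, 0]])   # full agreement
mixed = fleiss_kappa([[3, 2], [2, 3]])             # near-chance agreement
print(perfect, mixed)
```

Full agreement yields kappa = 1, while near-chance splits drive it toward (or below) zero.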
(4) Diversity (Dieng et al., 2020) measures the mutual exclusivity of discovered topics. It is the percentage of unique terms across all topics, which corresponds to our task requirement that each retrieved term is discriminatively close to one category and far from the others.
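The Diversity metric reduces to the fraction of unique terms across all topic sets; a minimal sketch with hypothetical topics:

```python
def diversity(topic_sets):
    """Percentage of unique terms across all discovered topics;
    1.0 means the topics are fully mutually exclusive."""
    all_terms = [w for s in topic_sets for w in s]
    return len(set(all_terms)) / len(all_terms)

print(diversity([["novel", "fiction"], ["laptop", "camera"]]))   # 1.0
print(diversity([["novel", "fiction"], ["fiction", "camera"]]))  # 0.75
```

A method that guarantees mutually exclusive term sets, like SEETOPIC's round-robin expansion, attains the maximum value of 1.0 by construction.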
Experiment Settings. We use BioBERT as the PLM on SciDocs and BERT-base-uncased as the PLM on Amazon and Twitter. The embedding dimension of u_w is 768 (the same as e_w); the number of negative samples is b = 5. In ensemble ranking, the length of the general/local ranking lists is M = 100; the hyperparameter ρ in Eq. (6) is set to 0.1; the number of iterations is T = 4; after each iteration, we increase the size of S_i by N = 3. We use the top-10 ranked terms in each topic for final evaluation (i.e., |S_i| = 10 in Eqs. (8)-(11)).

Performance Comparison
Table 3 shows the performance of all methods. We run each experiment 3 times and report the average score. To show statistical significance, we conduct a two-tailed unpaired t-test to compare SEETOPIC with each baseline. (The performance of BERT and BioBERT is deterministic given our usage, so when comparing SEETOPIC with them, we conduct a two-tailed Z-test instead.) The significance levels are also marked in Table 3.
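The significance test over repeated runs can be sketched with SciPy's two-tailed unpaired t-test; the run scores below are made up for illustration and are not the paper's results.

```python
from scipy.stats import ttest_ind

# Hypothetical NPMI scores from 3 repeated runs of two methods:
seetopic_runs = [0.30, 0.31, 0.29]
baseline_runs = [0.20, 0.21, 0.19]

# ttest_ind performs the classic two-tailed unpaired (independent-samples) t-test.
t_stat, p_value = ttest_ind(seetopic_runs, baseline_runs)
print(p_value < 0.05)
```

A p-value below the 0.05 significance level marks the baseline as significantly worse, matching the asterisk convention in Table 3.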
We have the following observations from Table 3. (1) Our SEETOPIC model performs consistently well. In fact, it achieves the highest score in 8 columns and the second highest in the remaining 4 columns. (2) Classical seed-guided topic modeling baselines (i.e., SeededLDA and Anchored CorEx) do not perform well in terms of NPMI (topic coherence) and MACC (term accuracy). Embedding-based topic discovery approaches (i.e., Labeled ETM and CatE) make some progress, but they still significantly underperform the PLM-empowered SEETOPIC model on SciDocs and Amazon. (3) SEETOPIC consistently performs better than SEETOPIC-NoIter on all three datasets, indicating the positive contribution of the proposed iterative process. (4) SEETOPIC guarantees the mutual exclusivity of S_1, ..., S_|C|. In comparison, SeededLDA, Labeled ETM, and BERT cannot guarantee such mutual exclusivity.

In-vocabulary vs. Out-of-vocabulary. Figure 1 compares the MACC scores of different seed-guided topic discovery methods on in-vocabulary categories and out-of-vocabulary categories. We find that the performance improvement of SEETOPIC over the baselines is larger on out-of-vocabulary categories than on in-vocabulary ones. For example, on Amazon, SEETOPIC underperforms CatE on in-vocabulary categories but outperforms CatE on out-of-vocabulary ones; on Twitter, the gap between SEETOPIC and the baselines becomes much more evident on out-of-vocabulary categories. Note that none of the baselines in Figure 1 utilize the power of PLMs, so this observation validates our claim that PLMs are helpful in tackling out-of-vocabulary seeds.
Figure 2 shows the change of model performance, measured by NPMI and LCP, under different hyperparameter settings.

Table 4: Top-5 representative terms retrieved by different algorithms for three out-of-vocabulary categories from SciDocs, Amazon, and Twitter. ✓: at least 3 of the 5 annotators judge the term as relevant to the seed. ✗: at most 2 of the 5 annotators judge the term as relevant to the seed.

As shown in Figure 2, in most cases, the performance of SEETOPIC deteriorates as ρ increases from 0.1 to 0.9. Thus, setting ρ = 0.1 always leads to competitive NPMI and LCP scores on the three datasets. Although ρ = 1 is better than ρ = 0.9, its performance is still suboptimal in comparison with ρ = 0.1. This finding indicates that replacing the mean reciprocal rank (i.e., ρ = 1) with our proposed Eq. (6) is reasonable. According to Figures 2(c) and 2(d), SEETOPIC usually performs better with more iterations. On SciDocs and Twitter, the scores start to converge after T = 4. Besides, more iterations result in longer running time. Overall, we believe setting T = 4 strikes a good balance.

Case Study
Finally, we show the terms retrieved by different methods as a case study. From each of the three datasets, we select an out-of-vocabulary category and show its topic discovery results in Table 4. We mark a retrieved term as correct (✓) if at least 3 of the 5 annotators judge the term as relevant to the seed. Otherwise, we mark the term as incorrect (✗).
For the category "hepatitis a/b/c/e" from SciDocs, SeededLDA and Anchored CorEx can only find very general medical terms, which are relevant to all seeds in SciDocs and thus inaccurate; Labeled ETM and CatE find terms about "alanine aminotransferase", whose elevation suggests not only hepatitis but also other diseases such as diabetes and heart failure, and which are thus not discriminative either; BioBERT and SEETOPIC, with the power of a PLM, can accurately pick terms relevant to "hepatitis b" and "hepatitis c". For the category "sports and outdoors" from Amazon, SeededLDA and Anchored CorEx again find very general terms, most of which are not category-discriminative; Labeled ETM and CatE are able to pick more specific terms such as "cars and tracks", but they still make mistakes; BERT, as a PLM, can accurately find terms that have lexical overlap with the category name (e.g., "outdoorsmen", "sporting events"), but such terms are less diverse; SEETOPIC-NoIter starts to discover more concrete terms than BERT (e.g., "indoor soccer", "bike riding") by leveraging local text semantics; the full SEETOPIC model, with its iterative updating process, finds even more specific and informative terms (e.g., "canoeing", "picnics", and "rafting"). For the category "travel and transport" from Twitter, both BERT and CatE make mistakes by including the term "natural history"; SEETOPIC-NoIter, without an iterative update process, also includes this error; the full SEETOPIC model finally excludes this error and achieves the highest accuracy among the retrieved top-5 terms of all compared methods.

Related Work
Seed-Guided Topic Discovery. Seed-guided topic models aim to leverage user-provided seeds to discover underlying topics according to users' interests. Early studies take LDA (Blei et al., 2003) as the backbone and incorporate seeds into model learning. For example, Andrzejewski et al. (2009) impose must-link and cannot-link constraints among seeds as priors. SeededLDA (Jagarlamudi et al., 2012) encourages topics to contain more seeds and encourages documents to select topics relevant to the seeds they contain. Anchored CorEx (Gallagher et al., 2017) extracts maximally informative topics by jointly compressing the corpus and preserving seed-relevant information. Recent studies utilize embedding techniques to learn better word semantics. For example, CatE (Meng et al., 2020a) explicitly encourages distinction among retrieved topics via category-name-guided embedding learning. However, all these models require the provided seeds to be in-vocabulary, mainly because they focus on the input corpus only and are not equipped with the general knowledge of PLMs.
Embedding-Based Topic Discovery. A number of studies extend LDA to incorporate word embeddings. The common strategy is to adapt the distributions in LDA to generate real-valued data (e.g., Gaussian LDA (Das et al., 2015), LFTM (Nguyen et al., 2015), Spherical HDP (Batmanghelich et al., 2016), and CGTM (Xun et al., 2017b)). Other studies move beyond the LDA backbone. For example, TWE (Liu et al., 2015) uses topic structures to jointly learn topic embeddings and improve word embeddings. CLM (Xun et al., 2017a) collaboratively improves topic modeling and word embedding by coordinating global and local contexts. ETM (Dieng et al., 2020) models word-topic correlations via word embeddings to improve the expressiveness of topic models. More recently, Sia et al. (2020) show that directly clustering word embeddings (e.g., word2vec or BERT) also generates good topics; Thompson and Mimno (2020) further find that BERT and GPT-2 discover high-quality topics, but RoBERTa does not. These models are unsupervised and hard to apply in seed-guided settings. In contrast, our SEETOPIC framework jointly leverages PLMs, word embeddings, and seed information.

Conclusions and Future Work
In this paper, we study seed-guided topic discovery in the presence of out-of-vocabulary seeds. To understand and make use of the in-vocabulary components in each seed, we utilize the tokenization and contextualization power of PLMs. We propose a seed-guided embedding learning framework inspired by the goal of maximizing PMI in topic modeling, and an iterative ensemble ranking process to jointly leverage the general knowledge of the PLM and local signals learned from the input corpus. Experimental results show that SEETOPIC outperforms seed-guided topic discovery baselines and PLMs in terms of topic coherence, term accuracy, and topic diversity. A parameter study and a case study further validate the design choices in SEETOPIC.
In the future, it would be interesting to extend SEETOPIC to seed-guided hierarchical topic discovery, where parent and child information in the input category hierarchy may help infer the meaning of out-of-vocabulary nodes.

A The Embedding Learning Objective
In Section 3.2, we propose the following embedding learning objective:

$$\mathcal{J} = \mathcal{J}_{\text{context}} + \mathcal{J}_{\text{document}} + \mathcal{J}_{\text{category}}, \quad (12)$$

where, in particular,

$$\mathcal{J}_{\text{context}} = \sum_{d \in \mathcal{D}} \sum_{w_i \in d} \sum_{w_j \in \mathcal{C}(w_i, h)} \log \frac{\exp(u_{w_i}^\top v_{w_j})}{\sum_{w' \in \mathcal{V}_{\mathcal{D}}} \exp(u_{w_i}^\top v_{w'})},$$

and J_document and J_category are the corresponding document-level and category-level softmax terms in Eq. (1). Now we prove that maximizing J implicitly performs the factorization in Eq. (3). Levy and Goldberg (2014) have proved that maximizing J_context (with negative sampling) implicitly performs the following factorization:

$$u_{w_i}^\top v_{w_j} = \log \frac{\#_{\mathcal{D}}(w_i, w_j)\, \lambda_{\mathcal{D}}}{\#_{\mathcal{D}}(w_i)\, \#_{\mathcal{D}}(w_j)} - \log b.$$
We follow their approach to handle the other two terms, J_document and J_category, in Eq. (12). Using the negative sampling strategy to rewrite J_document, we get

$$\mathcal{J}_{\text{document}} = \sum_{w \in \mathcal{V}_{\mathcal{D}}} \sum_{d \in \mathcal{D}} \#_d(w) \Big( \log \sigma(u_w^\top v_d) + b\, \mathbb{E}_{d'} \big[ \log \sigma(-u_w^\top v_{d'}) \big] \Big),$$

where σ(·) is the sigmoid function. Following (Levy and Goldberg, 2014; Qiu et al., 2018), we assume the negative sampling distribution over documents is ∝ λ_d. (In practice, the negative sampling distribution is ∝ λ_d^{3/4}, but related studies (Levy and Goldberg, 2014; Qiu et al., 2018) usually assume a linear relationship in their derivations.) For a specific term-document pair (w, d), considering its effect in the objective and setting the derivative with respect to u_w^⊤ v_d to zero yields the shifted-PMI entry of X_wd in Eq. (3); the derivation for J_category is analogous.
Complexity Analysis. The time complexity of using the PLM is O((|C| + |V_D|) α_PLM), where α_PLM is the complexity of encoding one term via the PLM. The total complexity of local embedding is O(T λ_D (h + |C|) b) because in each iteration 1 ≤ t ≤ T, every term occurrence w ∈ D interacts with every other term in its context window of size h, with its containing document, and with each category c_i ∈ C, and each update involves b negative samples. The total complexity of ensemble ranking is O(T |V_D| |C| |S_i|) because in each iteration 1 ≤ t ≤ T, we compute scores between each w ∈ V_D and each w' ∈ S_i (∀ c_i ∈ C).

Experiments

Experimental Setup

Datasets. We conduct experiments on three public datasets from different domains: (1) SciDocs (Cohan et al., 2020) is a large collection of scientific papers supporting diverse evaluation tasks. For the MeSH classification task (Coletti and Bleich, 2001), about 23K medical papers are collected, each of which is assigned to one of 11 common disease categories derived from the MeSH vocabulary. We use the title and abstract of each paper as documents and the 11 category names as seeds. (2) Amazon (McAuley and Leskovec, 2013) contains product reviews from May 1996 to July 2014.

Figure 1: MACC of seed-guided topic discovery methods on in-vocabulary categories and out-of-vocabulary categories.

Table 3: NPMI, LCP, MACC, and Diversity of compared algorithms on three datasets. NPMI and LCP measure topic coherence; MACC measures term accuracy; Diversity (abbreviated to Div.) measures topic diversity. Bold: the highest score. Underline: the second highest score. *: significantly worse than SEETOPIC (p-value < 0.05).