Lifelong Learning of Topics and Domain-Specific Word Embeddings

Lifelong topic models mainly focus on in-domain text streams in which each chunk only contains documents from a single domain. To overcome the limited data diversity of the in-domain corpus, most existing methods exploit information from limited sources in a separate and heuristic manner. In this study, we develop a lifelong collaborative model (LCM) based on non-negative matrix factorization to accurately learn topics and domain-specific word embeddings. LCM particularly investigates: (1) developing a knowledge graph based on the semantic relationships among words in the lifelong learning process, so as to accumulate global context information discovered by topic models and local context information reflected by context word embeddings from previous domains, and (2) developing a subword graph based on byte pair encoding and pairwise word relationships to exploit subword information of words in the current in-domain corpus. To the best of our knowledge, we are the first to collaboratively learn topics and word embeddings via lifelong learning. Experiments on real-world in-domain text streams validate the effectiveness of our method.


Introduction
Lifelong learning (Silver, 2011; Mitchell et al., 2015), which accumulates and maintains past knowledge to help future learning in an endless manner, has attracted considerable attention in topic modeling (Chen et al., 2020b; Gupta et al., 2020). Most lifelong topic models (Chen and Liu, 2014b; Chen, 2015; Wang et al., 2016) focus on corpora that contain text from only a single domain, dubbed in-domain corpora (Xu et al., 2018). This is because in-domain corpora are widespread in real-world applications, such as breaking news and tweets related to a specific topic (domain). The key to the success of a lifelong topic model on in-domain corpora rests on the precondition that prior topical information from previous domains can be fully exploited to guide meaningful learning in the newly arriving domain (Chen and Liu, 2014b). However, because the in-domain corpus is typically of limited size (Xu et al., 2018), it is insufficient for the existing methods to train coherent topics.
To alleviate the lack of global context in a corpus, one simple solution for topic models is to incorporate general-purpose pre-trained word embeddings (Das et al., 2015; Xun et al., 2016, 2017b; Dieng et al., 2020). Although general-purpose embeddings can provide some useful information for words within the in-domain corpus, their representations may not be ideal for the target domain, and in some cases they may even conflict with the meanings of the words in the task domain because words often have multiple senses (Xu et al., 2018). Another solution trains topics and word embeddings jointly in the one-shot learning scenario (Xun et al., 2017a; Dieng et al., 2020). Such a unified method avoids relying on an external embedding corpus that is not always closely aligned with the domain task, because the model can learn domain-specific word embeddings by itself. Unfortunately, the aforementioned models are conducted on the collected documents without the guidance of any prior knowledge. Besides, they all treat words as atomic units, which may not work well on an in-domain corpus with relatively few words.
In light of these considerations, we aim to generate coherent topics and domain-specific word embeddings jointly through a lifelong process. On the one hand, domain-specific word embeddings tend to offer more accurate complementary information to lifelong topic modeling than pre-trained embeddings. On the other hand, we alleviate the lack of global and local context information within in-domain corpora by exploiting subwords (Pinter et al., 2017). Both the topical and subword information are leveraged in our knowledge-based learner to generate better domain-specific word embeddings. To achieve this, we propose a lifelong collaborative model (LCM) that coordinates global context, local context, and subword information. First, our LCM maintains a knowledge graph based on word relationships to accumulate the past knowledge learned from previous domains, which exploits both the global word-document matrix and the local word co-occurrence matrix. Second, we develop a subword graph from the current in-domain corpus to capture extra information about words. We use non-negative matrix factorization (NMF) as our framework, which is an effective method for mining latent text semantics with great flexibility in transforming prior knowledge into regularizations (Lee and Seung, 1999; Chen et al., 2015, 2020b), and which yields sparse, interpretable matrices (Hoyer, 2004). The main contributions of this study can be summarized as follows:
• We propose a lifelong learning method to jointly generate topics and word embeddings over in-domain text streams. To the best of our knowledge, we are the first to collaboratively learn topics and domain-specific word embeddings through a lifelong process.
• We incorporate local context information and subword information into lifelong topic modeling, which can alleviate the lack of global context information when the target dataset is relatively small.
• In lifelong word embedding learning, we leverage the topical and subword information to help generate better domain-specific embeddings for down-stream learning tasks.

Related Work
Topic modeling (Deerwester et al., 1990; Hofmann, 1999; Blei et al., 2003) and word embedding learning (Mikolov et al., 2013a,b) are two of the most important tasks in natural language processing. The former aims to discover the latent semantic structure of documents based on the global context, while the latter follows the distributional hypothesis that words occurring in similar local contexts tend to have similar syntactic and semantic properties (Harris, 1954). Traditional topic and word embedding learning models are based on isolated learning, i.e., one-shot task learning, and thus lack the ability to continually learn from incrementally available data.
Lifelong Topic Modeling. A lifelong topic model (Chen and Liu, 2014b; Chen, 2015; Wang et al., 2016), as a typical example of lifelong machine learning, is gaining more research interest than the traditional one-shot approach that runs a topic model on collected documents only once (Chen et al., 2020b). Lifelong topic models inherit three key characteristics of lifelong machine learning, i.e., continuous tasks, knowledge accumulation and maintenance, and a knowledge-based learner that can leverage the past knowledge to help future learning in a never-ending manner. Furthermore, lifelong topic modeling is mainly applied to in-domain corpora where each chunk only contains text from a single domain. Based on NMF, Chen et al. (2020b) proposed a lifelong topic model named NMF-LTM. However, this method only considers the 10 most important words under every topic while ignoring the other non-top words, i.e., most words in the vocabulary. The problem becomes more serious as the vocabulary grows. Besides, NMF-LTM may perform poorly on other downstream tasks, because it can only capture the information of a limited number of words in sentences. Finally, NMF-LTM only mines word relationships from the perspective of global (topical) information, which is inadequate within in-domain corpora. Considering the limited global context in the newly arriving corpus, Gupta et al. (2020) incorporated general-purpose pre-trained word embeddings into the knowledge base as a complement to topics for lifelong learning. Unfortunately, this method requires the dimension of the word embeddings to be equal to the number of topics, with each dimension of a word embedding corresponding to a topic. This violates the complementary but different points of view, i.e., the global viewpoint and the local viewpoint, of topic models and word embedding models (Xun et al., 2017a). The model optimizes a topic-word matrix in which each row represents the word distribution of a topic and each column represents the embedding of a word; each dimension of a word embedding learned by this model thus reflects the probability of the word occurring in the corresponding topic. However, word embeddings contain many other features that cannot be captured by global (context) information, e.g., syntactic features.
Word Embedding Learning. Lifelong learning has also been adopted to train domain-specific word embeddings, which fills the gap between general-purpose embeddings trained on large-scale corpora and the topic (domain) of the down-stream task. For example, Xu et al. (2018) first developed a meta-learner to expand the new in-domain corpus by measuring the content similarity between past domains and the new domain. Then, they generated word embeddings for the new domain using the combined data. However, this method only considers the local context information from past domains, which is inadequate to capture the polysemous nature of words. As an illustration, apple is a polysemous word that is topically contextualized by several domains, i.e., product line, operating system, and fruit (Gupta et al., 2020).

Lifelong Collaborative Model
In this section, we detail the proposed LCM for jointly learning topics and domain-specific word embeddings in a lifelong process. The topical information and local context from previous domains, and a subword graph constructed from the current in-domain corpus are exploited in LCM to guide future tasks.

Problem Formalization
Given a stream of document chunks {DOC_t}_{t=1}^{T} accumulated in an endless manner (T = +∞), we aim to jointly generate topics and domain-specific word embeddings when each chunk only contains text from a single domain. At any time point, our LCM deals with the current document chunk, e.g., DOC_t, by leveraging the past knowledge learned from the previous document chunks, i.e., DOC_1, DOC_2, . . . , DOC_{t-1}. Table 1 lists the notations used in this paper. We use bold uppercase letters such as D_t to represent matrices, regular uppercase letters such as M to represent scalar constants, and regular lowercase letters such as λ_v to represent scalar variables.

Table 1: Notations used in this paper.

Notation          Description
D_t ∈ R^{M×N}     Word-document matrix at the current moment
U_t ∈ R^{M×K}     Word-topic matrix at the current moment
V_t ∈ R^{K×N}     Topic-document matrix at the current moment
X_t ∈ R^{M×M}     Word co-occurrence matrix at the current moment
B_t ∈ R^{M×E}     Word embedding matrix at the current moment
C_t ∈ R^{M×E}     Context word embedding matrix at the current moment
M                 The number of words
N                 The number of documents
K                 The number of topics
E                 The dimension of word embeddings

Figure 1 illustrates the architecture of our LCM, which processes in-domain text streams through a knowledge-based learner. Formally, the objective function of LCM is defined as follows:
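A sketch of this objective, assembled from the terms described in the following subsections (the relative weighting of the individual terms is an assumption here), is:

\[
\mathcal{L} = \|\mathbf{D}_t - \mathbf{U}_t\mathbf{V}_t\|_F^2 + \|\mathbf{X}_t - \mathbf{B}_t\mathbf{C}_t^{\top}\|_F^2 + \Upsilon(\mathbf{V}_t) + \Psi(\mathbf{U}_t) + \Phi(\mathbf{C}_t) + \Omega(\mathbf{B}_t), \quad \text{s.t.}\ \mathbf{U}_t, \mathbf{V}_t, \mathbf{B}_t, \mathbf{C}_t \ge 0.
\]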

Objective Function
It is noteworthy that we constrain the non-negativity of B_t and C_t to learn sparse, interpretable word embeddings (Murphy et al., 2012; Luo et al., 2015), so as to capture the polysemous nature of words (Panigrahi et al., 2019). With non-negativity constraints, words are represented by a limited number of dimensions (Murphy et al., 2012). All words that have positive values in a specific dimension may share a common characteristic, which enhances the interpretability of word embeddings and helps capture the polysemous nature of words.
The first term of our objective function aims to factorize the global word-document matrix D_t into the word-topic matrix U_t and the topic-document matrix V_t, and the interpretability of U_t and V_t is ensured by their non-negativity. For the local context information, Levy and Goldberg (2014) proved that the skip-gram model with negative sampling (SGNS) implicitly factorizes a positive pointwise mutual information word co-occurrence matrix shifted by a constant offset. Accordingly, we use the shifted positive pointwise mutual information matrix as our word co-occurrence matrix X_t and decompose it into the word embedding matrix B_t and the context word embedding matrix C_t, as presented in the second term. Given a hyperparameter λ_v, the sparsity constraint on V_t is introduced as the third term Υ(V_t) = λ_v ||V_t||_1. This ensures that each document covers a limited number of topics (Chen et al., 2020b). The sparsity of topics encourages interpretable topics (Card et al., 2018), which corresponds with the intuition that a document usually focuses on several salient topics instead of covering a wide variety of topics (Lin et al., 2019). Although NMF already induces some sparseness in V_t, a more direct control over this property of the representation is still needed (Hoyer, 2004). The remaining terms Ψ(U_t), Φ(C_t), and Ω(B_t) are the constraints on the matrices U_t, C_t, and B_t, which will be described in Sections 3.2.3-3.2.5, respectively.
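Concretely, following Levy and Goldberg (2014), the entries of this shifted positive PMI matrix take the standard form (the shift constant k, i.e., the number of negative samples, is a hyperparameter):

\[
(\mathbf{X}_t)_{ij} = \max\!\left(\log \frac{\#(w_i, w_j)\,|\mathcal{D}|}{\#(w_i)\,\#(w_j)} - \log k,\ 0\right),
\]

where \#(w_i, w_j) counts co-occurrences of w_i and w_j within the context window, \#(w_i) and \#(w_j) are the marginal counts, and |\mathcal{D}| is the total number of word-context pairs.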

Knowledge Graph (KG)
LCM uses relationships between words as the representation of our KG to maintain knowledge of past domains and help with the current in-domain task. KG accumulates the knowledge of past domains from two sources of information, i.e., the global context information mined by topic models and the local context information reflected by context word embeddings. As shown in Figure 1, the output of LCM contains the word-topic matrix U_t, the topic-document matrix V_t, the word embedding matrix B_t, and the context word embedding matrix C_t. KG fuses the global context and local context information with the help of U_t and C_t, as follows.
For the global context information, we use the inner product to measure similarities between the topic distributions of words in U_t. For each word w_i in the current vocabulary of D_t, we find the top-T words w_j (j = 1, 2, ..., T) whose topic distributions are most similar to that of w_i. Each such pair (w_i, w_j) reflects a relationship derived from the global context information, i.e., the topical information. After finding the top-T related words of each word, all the word pairs are accumulated and de-duplicated. Following (Chen et al., 2020b), we set the weight of each word pair (w_i, w_j) to 1.
Regarding the local context information, we use the inner product to measure similarities between the context word embeddings of words in C_t. For each word w_i in the current vocabulary of D_t, we find the top-T words whose context word embeddings are most similar to that of w_i. All such word pairs represent relationships from the perspective of local context information, and their weights are set to η after de-duplication, where η balances the relative importance of global and local context information.
Then, we accumulate the word pairs from these two sources to fuse the global context and local context information. It is worth noting that if a word pair (w_i, w_j) appears simultaneously in the two kinds of word pairs, its weight is recorded as 1 + η. De-duplication is not required here, because two words related in both global and local context are more closely related than two words related in only one kind of source information. We use J_t to denote the set of all related word pairs in the current in-domain corpus, where (W_k)_{ij} represents the weight of the word pair (w_i, w_j) in KG. Finally, KG is updated by merging J_t into KG_{t-1}, accumulating the weights of word pairs that are already present in the graph.
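The following sketch illustrates how the two kinds of word pairs could be extracted and merged into KG; the function and variable names are ours, and the additive merge into the knowledge graph is an assumption consistent with the growing maximum weight discussed in the section on the constraint on U_t.

```python
import numpy as np
from collections import defaultdict

def top_pairs(matrix, top_T):
    """Return word pairs (i, j) where j is among the top-T rows most
    similar to row i under the inner product."""
    sim = matrix @ matrix.T                      # inner-product similarities
    np.fill_diagonal(sim, -np.inf)               # exclude self-pairs
    pairs = set()
    for i in range(sim.shape[0]):
        for j in np.argsort(-sim[i])[:top_T]:
            pairs.add((min(i, j), max(i, j)))    # de-duplicate (i, j) / (j, i)
    return pairs

def build_J(U_t, C_t, top_T, eta):
    """Weighted pairs J_t: 1 for global (U_t) pairs, eta for local (C_t)
    pairs, and 1 + eta for pairs found by both sources."""
    J_t = defaultdict(float)
    for p in top_pairs(U_t, top_T):
        J_t[p] += 1.0
    for p in top_pairs(C_t, top_T):
        J_t[p] += eta
    return J_t

def update_KG(KG, J_t):
    """Merge J_t into the knowledge graph (assumed additive accumulation)."""
    for pair, w in J_t.items():
        KG[pair] = KG.get(pair, 0.0) + w
    return KG
```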

Subword Graph (SG)
To incorporate subword information, LCM uses a subword graph (SG) to store relationships between words in D_t from the perspective of subword information. The motivation for introducing SG is to capture more information from the structures of words themselves as a complement to the global and local contexts (Bojanowski et al., 2017; Pinter et al., 2017). Many in-domain text streams contain a high proportion of rare words with low word frequencies, e.g., proper nouns in specific domains, which cannot be adequately reflected by context due to their low frequencies. Note that SG only mines the subword information of the current in-domain corpus, since subwords are only related to domain-independent structures of words. A typical English word is composed of three kinds of subword units, i.e., the word root, the prefix, and the suffix. Word roots and prefixes determine the meaning of a word, while suffixes determine the syntax-related part of speech. We adopt byte pair encoding (BPE) (Sennrich et al., 2016), which can implicitly match these morpheme boundaries, to conduct subword segmentation. We also compare this segmentation method with character n-gram features (Bojanowski et al., 2017) in the experiments. For every word pair, the number of shared subword units between the two words is recorded as the weight. In the current in-domain corpus, SG_t is defined as the set of such weighted word pairs, where (W_s)_{ij} represents the weight of the word pair (w_i, w_j) in SG_t.
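A minimal sketch of SG construction, assuming the subword units of each word have already been produced by an off-the-shelf BPE implementation (the segmentation step itself is not shown):

```python
from itertools import combinations

def build_SG(subword_units):
    """subword_units: dict mapping each word in the current vocabulary to
    the set of its BPE subword units.
    Returns SG_t as a dict {(w_i, w_j): weight}, where the weight is the
    number of subword units shared by the two words."""
    SG_t = {}
    for w_i, w_j in combinations(sorted(subword_units), 2):
        shared = len(subword_units[w_i] & subword_units[w_j])
        if shared > 0:                       # only connected word pairs are stored
            SG_t[(w_i, w_j)] = shared
    return SG_t

# Toy example (real units would come from a BPE model):
units = {"preprocess": {"pre", "process"},
         "processing": {"process", "ing"},
         "printing":   {"print", "ing"}}
print(build_SG(units))
# {('preprocess', 'processing'): 1, ('printing', 'processing'): 1}
```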

Constraint on U t
Before introducing Ψ(U_t), we first construct the word-word relationship matrix K_{t-1} ∈ R^{M×M} from KG_{t-1} to represent the closeness of relationships between words. In KG_{t-1}, we select all of the word pairs in which both words occur in the current vocabulary of D_t; only these words contribute to the current in-domain task on D_t, and all diagonal elements of K_{t-1} are set to 1. R_k, the threshold ratio for KG, is used to select the "close" relationships between words. For two words w_i and w_j, if the corresponding pair (w_i, w_j) occurs among the selected word pairs in D_t mentioned above, the value of k_ij is determined by the threshold ratio R_k: if the weight of (w_i, w_j), i.e., (W_k)_{ij}, is greater than or equal to the maximum weight of all the word pairs in D_t multiplied by R_k, then k_ij = 1; if it is less than the maximum weight multiplied by R_k, or if w_i and w_j are not connected in KG_{t-1}, then k_ij = 0. In the above, R_k helps select word pairs with relatively large weights. Although wrong connections between some word pairs are kept in our KG, their weights cannot become large enough, because the maximum weight in KG grows larger and larger over time; such pairs will therefore not be chosen to participate in the constraints on the matrices. The word-word relationship regularization based on KG for Ψ(U_t) holds that the topic distributions of words that are closely related in KG should be more similar than those of words that have no connection in KG. We use the Graph Laplacian (Dai et al., 2020) as the first part of Ψ(U_t) to express that under each topic in U_t, the more closely two words are connected in KG, the closer their probabilities are. The second part of Ψ(U_t) is a diversity regularization that reduces the overlap of topics, i.e., improves topic uniqueness (Nan et al., 2019). The two parts are weighted by the hyperparameters λ_u1 and λ_u2, respectively. H_{t-1} = diag(K_{t-1} · 1) − K_{t-1} represents the Graph Laplacian of K_{t-1}, where 1 represents a column vector in which all of the elements are 1, and diag(K_{t-1} · 1) represents the diagonal matrix with the vector K_{t-1} · 1 on its diagonal. I_K is the identity matrix of order K × K.
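A reconstruction consistent with the description above, in which retained pairs are given the binary value 1 and a trace form of the Graph Laplacian term is assumed, is:

\[
(\mathbf{K}_{t-1})_{ij} =
\begin{cases}
1, & i = j \ \text{or}\ (W_k)_{ij} \ge R_k \cdot \max_{p,q}(W_k)_{pq},\\
0, & \text{otherwise},
\end{cases}
\qquad
\Psi(\mathbf{U}_t) = \lambda_{u1}\,\mathrm{tr}\!\left(\mathbf{U}_t^{\top}\mathbf{H}_{t-1}\mathbf{U}_t\right) + \lambda_{u2}\,\|\mathbf{U}_t^{\top}\mathbf{U}_t - \mathbf{I}_K\|_F^2 .
\]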
Constraint on C t
KG, which fuses the global context and local context information from previous domains, is constructed with the help of U_t and C_t. It also contributes to the constraints on both of these matrices. For Φ(C_t), we use K_{t-1} to introduce the word-word relationship regularization. It expresses that the context embeddings of words that are closely related in KG should be more similar than those of words with weaker connections in KG: under each dimension of C_t, the more closely two words are connected in KG, the closer their representations are. The regularization is weighted by a hyperparameter λ_c.
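Under the same assumptions as for Ψ(U_t), the corresponding regularizer can be sketched as:

\[
\Phi(\mathbf{C}_t) = \lambda_c\,\mathrm{tr}\!\left(\mathbf{C}_t^{\top}\mathbf{H}_{t-1}\mathbf{C}_t\right).
\]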

Constraint on B t
The first part of Ω(B t ) is similar to the word-word relationship regularization for Ψ(U t ) and Φ(C t ).
It holds that the embeddings of words that are closely related in SG should be more similar than those of words with weaker connections in SG. We also use the Graph Laplacian to express that under each dimension of B_t, the more closely two words are connected in SG, the closer their representations are. A word-word relationship matrix S_t ∈ R^{M×M} is constructed from SG_t by thresholding the pair weights, where (W_s)_{ij} represents the weight of the pair (w_i, w_j) in SG_t and R_s denotes the threshold ratio for SG. R_s helps exclude wrong connections in SG. In addition, the cooperation of KG and SG can further reduce the influence of unimportant edges; for example, a connection that is only selected by R_k may not be as important as a connection selected by both R_k and R_s simultaneously. The second part of Ω(B_t) is a sparsity constraint on B_t, which expresses that each word has only a limited number of active features, because we aim to learn sparse representations so that the generated domain-specific word embeddings are more interpretable.
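Analogously, assuming the same thresholding scheme as K_{t-1} and an L1 sparsity term (constant factors are omitted), S_t and Ω(B_t) can be sketched as:

\[
(\mathbf{S}_t)_{ij} =
\begin{cases}
1, & i = j \ \text{or}\ (W_s)_{ij} \ge R_s \cdot \max_{p,q}(W_s)_{pq},\\
0, & \text{otherwise},
\end{cases}
\qquad
\Omega(\mathbf{B}_t) = \lambda_{b1}\,\mathrm{tr}\!\left(\mathbf{B}_t^{\top}\big(\mathrm{diag}(\mathbf{S}_t\cdot\mathbf{1}) - \mathbf{S}_t\big)\mathbf{B}_t\right) + \lambda_{b2}\,\|\mathbf{B}_t\|_1 .
\]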

Alternately Iterative Algorithm
We develop an alternately iterative algorithm to achieve a good compromise between ease of implementation and speed. Taking B_t as an example, we first calculate the derivative of the objective function L with respect to B_t; based on this derivative, a multiplicative updating rule for B_t is derived, in which J(B_t) = λ_b1 · diag(S_t · 1) B_t + (λ_b2 / 2) · 1 · 1^T. Note that U_t, V_t, C_t, and B_t always satisfy non-negativity because they are updated in this multiplicative form. Due to limited space, we provide the updating rules for the matrices U_t, V_t, and C_t, the parameter inference process, the theoretical proof of algorithmic convergence, and the time complexity analysis in Appendices A-D.
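The non-negativity-preserving nature of such updates can be illustrated with the generic multiplicative scheme of Lee and Seung (1999) for plain NMF below; this is the general pattern rather than the exact update rule of LCM, whose full derivation is given in the appendices.

```python
import numpy as np

def multiplicative_step(factor, numerator, denominator, eps=1e-12):
    """Generic multiplicative update: factor <- factor * numerator / denominator.
    If factor, numerator, and denominator are element-wise non-negative,
    the updated factor stays non-negative."""
    return factor * numerator / np.maximum(denominator, eps)

# Plain NMF example (D ~ U V), showing the alternating scheme:
rng = np.random.default_rng(0)
D = rng.random((50, 30))
U, V = rng.random((50, 5)), rng.random((5, 30))
for _ in range(200):
    U = multiplicative_step(U, D @ V.T, U @ V @ V.T)   # update U with V fixed
    V = multiplicative_step(V, U.T @ D, U.T @ U @ V)   # update V with U fixed
```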

Model Scalability
Our model scales well thanks to a "divide-and-conquer" strategy. First, we partition the large corpus into several small document chunks that belong to different domains, and we only decompose the matrices of one chunk at a time. Second, we use sparse matrices to store KG and SG, so they are scalable and can be processed quickly. Third, when facing a large single-domain corpus in a text stream, we can partition it into small sub-domain corpora and process one of them at a time.

Dataset
We evaluate our LCM on the real-world Amazon Review dataset (McAuley et al., 2015; He and McAuley, 2016), which covers 28 departments (i.e., first-level categories). Following (Xu et al., 2018), we consider all the reviews under each second-level category as a domain. Each domain has several third-level categories, which are used for all down-stream tasks and model evaluation. We randomly select 9 in-domain corpora to carry out experiments. Table 2 summarizes the characteristics of the selected 9 corpora, i.e., domain names, numbers of reviews and labels, average text lengths, and vocabulary sizes. The selected corpora are preprocessed by eliminating stop-words and words whose frequency (in the total reviews from the 9 domains) is lower than 15. Also, reviews with fewer than 20 words are removed.
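A sketch of these preprocessing steps (function names are ours, and applying the length filter after word removal is an assumption):

```python
from collections import Counter

def preprocess(domain_reviews, stop_words, min_freq=15, min_len=20):
    """domain_reviews: dict mapping each domain name to a list of tokenized reviews.
    Removes stop-words, removes words whose corpus-wide frequency is below
    min_freq, and drops reviews with fewer than min_len remaining words."""
    freq = Counter(w for reviews in domain_reviews.values()
                     for review in reviews for w in review)
    keep = lambda w: w not in stop_words and freq[w] >= min_freq
    cleaned = {}
    for domain, reviews in domain_reviews.items():
        filtered = [[w for w in review if keep(w)] for review in reviews]
        cleaned[domain] = [r for r in filtered if len(r) >= min_len]
    return cleaned
```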
Note that we choose the above dataset instead of other datasets for lifelong topic modeling (Gupta et al., 2020) because we focus on in-domain corpora with rare overlaps, in which documents share little common information. For completeness, we also shuffle the 7 training domains randomly and show the results under different permutations in Appendix E.

Experimental Setting
We simulate an endless lifelong learning process using the 9 corpora. Word-word relationships in the knowledge graph are gradually accumulated over the first 7 domains. With the help of this knowledge graph, we conduct experiments on the last 2 domains and compare model performance on both lifelong topic modeling and domain-specific word embedding learning. We randomly select 5% of the reviews from "Cult Movies (CM)" for validation. To build the knowledge graph for the validation set, 5% of the reviews from the first 7 corpora are sampled, and the reviews from these 8 small-scale in-domain corpora also form a stream of document chunks. The idea of grid search (Fayed and Atiya, 2019) is used to select the best parameters for each metric; we provide the hyperparameter search space for grid search in Appendix F. The remaining 95% of the reviews from the first 7 domains constitute the training set. All reviews from "Science Education (SE)" and the remaining 95% of the documents from CM are used as the testing set.

Baselines
For lifelong topic modeling, we compare our method with the following baselines: LDA-LTM (Chen and Liu, 2014b), NMF-LTM (Chen et al., 2020b), and LNTM (Gupta et al., 2020). For domain-specific word embedding learning, we compare with FastText, SPINE, Word2Sense, and L-DEM (Xu et al., 2018); FastText, SPINE, and Word2Sense are run in two ways for evaluation, i.e., only on the new in-domain corpus, and on the total document set obtained by fusing the new corpus and all corpora from the previous domains. We implement NMF-LTM and L-DEM in Python according to the original papers. For the sake of fairness, the parameters of all baselines are selected on the validation set under the same experimental setting as LCM.

Evaluation Metrics
As suggested by (Lau et al., 2014; Chen and Liu, 2014a,b; Wang et al., 2016; Isonuma et al., 2020), we use the normalized pointwise mutual information (NPMI) score (Aletras and Stevenson, 2013), which closely matches human judgments, to measure the coherence of the representative words of topics generated by lifelong topic models. Following (Chen et al., 2020b), the top 20 words of each topic are used for the calculation. Considering that it is important to discover discriminative topics, we also adopt the topic uniqueness (TU) score (Nan et al., 2019) to measure the diversity of topics. In addition, the sparsity scores of the document-topic distribution (TS-U) and the topic-word distribution (TS-V) proposed by Lin et al. (2019) are further used to measure topic sparsity quantitatively. Particularly, we use 1e-20 as the threshold to count the number of zero values in the document-topic and topic-word distributions; only values smaller than 1e-20 are set to zero. Although several studies (Chang et al., 2009; Newman et al., 2010) have stated that perplexity is unable to reflect the real semantic coherence of topics and is even negatively correlated with human judgments, we report this metric for each model for completeness.
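For reference, the NPMI score of a word pair takes the standard form, with probabilities estimated from document co-occurrences; topic coherence is then obtained by averaging over all pairs among the top 20 words of each topic:

\[
\mathrm{NPMI}(w_i, w_j) = \frac{\log \dfrac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)} .
\]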
As domain-specific dictionaries are relatively small and may contain some uncommon words, it is inappropriate to evaluate domain-specific word embeddings in traditional ways, e.g., by calculating word similarity. Following (Xu et al., 2018), we build a down-stream text classification task to evaluate the domain-specific word embeddings generated by different models. The two testing sets, i.e., SE and CM, are used for text classification with their third-level categories as classification labels. For each review, we use the average embedding of all of its words as its feature vector to train an SVM classifier (Bayot and Gonçalves, 2016; Qin and Wang, 2009), and we use accuracy to evaluate the effectiveness of word embeddings on the down-stream text classification task as in (Xu et al., 2018).
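A minimal sketch of this evaluation protocol with scikit-learn (the embedding lookup and the use of cross-validation are illustrative choices):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def review_feature(review, embeddings, dim):
    """Average the embeddings of all words in a review (zero vector if no
    word of the review is in the embedding vocabulary)."""
    vecs = [embeddings[w] for w in review if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def classification_accuracy(reviews, labels, embeddings, dim):
    X = np.stack([review_feature(r, embeddings, dim) for r in reviews])
    y = np.array(labels)                 # third-level categories as labels
    clf = SVC()                          # SVM classifier
    return cross_val_score(clf, X, y, scoring="accuracy").mean()
```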

Lifelong Topic Modeling

Table 3: Performance comparison of lifelong topic models. For all metrics, "↓" after the metric indicates smaller is better while "↑" indicates larger is better. The best performance on each measure is highlighted in boldface.

Table 3 shows the performance of different models on the lifelong topic discovery task, from which we can observe that LCM performs the best or the second best on each measure. Among the baselines, NMF-LTM achieves the best perplexity but almost the poorest TU. As mentioned in (Burkhardt and Kramer, 2019), there is a trade-off between perplexity and TU in some cases, which means that models generating many redundant topics may have a meaninglessly low perplexity. The reason NMF-LTM obtains a low TU may be that it enforces documents within the same class to have more similar topic distributions, which is unsuitable for in-domain text streams since all documents in the in-domain corpus come from one class. This also hurts its sparsity. For example, one document may have non-zero values for 10 topics while another document from the same class has non-zero values for another 10 topics; to become similar, both may end up with non-zero values for 20 topics. LNTM faces the same problem because it does not explicitly constrain the diversity among topics. The coherence score of LNTM is also unsatisfactory. A possible reason is that it treats word embeddings as topic distributions of words, which deteriorates the local semantic information captured by context word embeddings. LNTM entirely neglects the sparsity of document-topic and topic-word distributions, thus its TS-U and TS-V are zero.
LDA-LTM differs from the other models in that it does not construct the in-domain text stream based on time series, but instead fuses all of the previous domains together and accumulates knowledge from this large corpus to help with the current in-domain corpus. Even though it is difficult to compare LDA-LTM with the other lifelong topic models fairly, our LCM performs better than LDA-LTM in most cases, because we can exploit the information from local context and subwords.
The qualitative analysis of topics generated by these models is provided in Appendix G.

Word Embedding Learning

Figure 2: Classification performance comparison of SVM classifiers with word embeddings generated by different methods. Models marked with "total" are conducted on the combination of the 7 training sets and the current testing set. W2S represents Word2Sense.

Figure 2 reports the accuracy of text classification using word embeddings with different dimensions. Note that sparse interpretable word embeddings always perform better when the embedding dimension is relatively large (Murphy et al., 2012). LCM performs the best on each testing set under each dimension, although some baselines (i.e., FastText, SPINE, and Word2Sense) are trained on the total 9 corpora and thus access more information without considering time series. L-DEM cannot learn high-quality embeddings because it needs a large number of in-domain corpora to train its meta-learner, which is not available in most applications. Compared to FastText, which also incorporates subword information, our word embeddings perform better on classification in all cases. One possible reason is that the global context provides LCM with extra information. Two sparse word embedding models, i.e., SPINE and Word2Sense, overemphasize sparsity while ignoring quality on downstream tasks. LCM balances sparsity and quality well with the help of global context and subwords. We evaluate the sparsity and interpretability of word embeddings in Appendix H.
For completeness, we also replace the SVM classifier with a neural network classifier (Chen et al., 2020a) consisting of 3 fully-connected layers. The results are shown in Appendix I.

Ablation Experiments
Taking SE as an example, we report the results of ablation experiments on NPMI, TU, and Accuracy in Table 4. "LCM-KG_G", "LCM-KG_L", and "LCM-SG" represent LCM without the participation of global context in KG, local context in KG, and subwords in SG, respectively. Deleting each part leads to performance degradation, which validates the effectiveness of global context, local context, and subwords. Compared to the subword information, KG contributes more to topic discovery, and the local context information plays the most important role in word embedding learning. We replace BPE in SG with character n-gram features (Bojanowski et al., 2017) in "LCM-SG_BPE", which indicates the effectiveness of BPE in capturing subwords.

Analysis on Catastrophic Forgetting
Catastrophic forgetting (Robins, 1995; Kirkpatrick et al., 2017), which is a major challenge for lifelong topic models, is not a serious problem for LCM. The learning process of LCM only accumulates knowledge in KG, and the model is trained on each in-domain corpus independently, following (Chen et al., 2020b). To further investigate the ability of LCM to avoid catastrophic forgetting, we use the final KG updated after CM to "go back" and help train the model on the 7 training sets one by one. As LDA-LTM does not construct the in-domain text stream based on time series, we only take NMF-LTM and LNTM for comparison. In terms of NPMI and TU, Figure 3 shows that LCM has the best ability to alleviate catastrophic forgetting. For all domains, the latest KG does not have a significant negative impact on LCM (i.e., the catastrophic forgetting is limited), and sometimes it even helps with the new task. For example, both NMF-LTM and LCM achieve better NPMI scores with the latest KG in "SIM Cards & Prepaid Minutes". One possible reason is that later domains provide valuable information through KG.

Conclusions
In this work, we propose a lifelong collaborative model (LCM) for learning topics and domain-specific word embeddings. LCM deals with the new in-domain corpus by coordinating global and local context information from previous domains with subword information from the current corpus. A knowledge graph based on word-word relationships is leveraged during the learning process. Experiments on real-world in-domain text streams demonstrate the superior performance of LCM.
In the future, we plan to incorporate contextualized word representations into topic models (Bianchi et al., 2020, 2021) to alleviate the collapsing of word senses and learn more coherent topics.

Appendices
A Updating Rules for U t, V t, and C t
First, we rewrite the objective function L. In order to show that G is an auxiliary function, we have to show that G(x, z) ≥ F(x). To prove this inequality, we first verify how it holds on the first term; the remaining terms can be bounded similarly. Since λ_b1, λ_b2, and each element of S_t are non-negative, the inequality holds. This establishes that G is an auxiliary function for F.
Proof. To show that Algorithm 1 converges (i.e., Theorem C.1), we need to show that the update rule for B_t follows Eq. (12). Computing ∂G(x, z)/∂x and solving ∂G(x, z)/∂x = 0 for x, we obtain the update rule mentioned in Eq. (11). Since G is the auxiliary function for F, the value of F is non-increasing. We can prove the convergence of the update rules for U_t, V_t, and C_t similarly. Thus, Algorithm 1 is guaranteed to converge to a local minimum.

D Time Complexity Analysis
In this section, we analyze the time complexity of Algorithm 1. For updating the matrices B_t, U_t, V_t, and C_t in one iteration, the overall time complexity of our method is O(4MNK + (6E + 3)M^2 + (4E^2 + 5E)M + 2M^2 K + 3MK^2 + NK^2 + 2MK + 3KN), which spends an extra time cost of O((6E + 2)M^2 + (4E^2 + 5E)M − (2K + 1)N^2) to learn word embeddings compared with the previous NMF-based lifelong topic model, i.e., NMF-LTM (Chen et al., 2020b). Although the time complexity is proportional to M, we can easily alleviate the scalability issue. For example, if M (i.e., the vocabulary size) of a single domain is too large, we can partition this domain into several small sub-domain corpora; at each time step, we only process the matrices of one sub-domain, so M of each small sub-domain will not be too large.
As an illustration, training LCM costs about 30 seconds per iteration on a workstation equipped with an Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50 GHz (8 cores) and 128 GB of memory. To achieve convergence, LCM takes about 1 hour to update B_t, U_t, V_t, and C_t for all domains in order, while NMF-LTM takes about half an hour.

E Different Permutations of Training Domains
We shuffle the training domains randomly 5 times and show the results under these different permutations in Table 5, which indicates that LCM is robust to domain permutations.

F Hyperparameter Settings
We list the best hyperparameters on the validation set for different metrics in Table 7, and we provide the search space of hyperparameters for grid search for all of our baselines, together with their corresponding best hyperparameters, in our code. For the hyperparameters of LCM, we first varied the search spaces for sensitivity analysis and observed that LCM was robust to most hyperparameters; thus, we used the final search spaces in Table 6. For completeness, we also show the variances of the results under different hyperparameter values in Table 8. Taking some hyperparameters in CM as examples, we vary each parameter while the others are fixed and compute the variances of NPMI and Accuracy, which indicates that LCM is robust to most hyperparameters.

G Qualitative Analysis of Topics
Following (Chen et al., 2020b), we map the topics learned by LCM to those learned by LDA-LTM (Chen and Liu, 2014b), LNTM (Gupta et al., 2020), and NMF-LTM (Chen et al., 2020b), respectively. Particularly, we represent each topic by its top 20 words and compute the cosine similarity between every two topics. Taking SE as an example, we randomly show 3 topics, as listed in Table 9, where irrelevant words are marked in italics. LDA-LTM, NMF-LTM, and LCM learn topics well in most cases. However, NMF-LTM captures some high-frequency words of the corpus (i.e., amazon, anatomy, anatomical) that are not related to the topic "Design and production of handicrafts". LDA-LTM also assigns the irrelevant word "print" to the topic "Machinery and industrial manufacturing technology", and the irrelevant word "spring" to the topic "Body structure and anatomy".
The ability of LNTM to generate coherent topics is poor. It is noteworthy that the topics generated by LNTM seem unrelated to those of the other models, because we use cosine similarity to map topics; if the cosine similarities are equal (for LNTM, they are sometimes 0), the topics with smaller IDs are chosen. For the sake of fairness, we also show a relatively coherent topic generated by LNTM separately in Table 10. The result of LNTM is still worse than that of the other models, containing 7 irrelevant words in the top-10 word list.

H Evaluating Interpretability
To evaluate the interpretability of our domain-specific word embeddings, we follow (Murphy et al., 2012) and show the top 5 words for 5 randomly chosen dimensions of word embeddings generated from CM. Although some noisy words exist, the dimensions are generally semantically coherent and interpretable. We also choose one polysemous word, "cell", in SE to measure the ability of our method to capture the polysemous nature of words. We select the two highest values of the word vector learned by our LCM for "cell" and find the top 5 words in these two dimensions. From Table 11, we can observe that the two dimensions focus on cells in biology and on cell phones, which reflect the two different meanings of "cell".

Table 12 shows the classification accuracy and the sparsity (i.e., the proportion of zeros) of word embeddings generated by different methods, where the dimension is 1500. Compared with other models, our LCM generates domain-specific word embeddings that balance sparsity and quality well.

I Neural Network Classifier
To compare different word embedding learning models comprehensively, we replace the SVM classifier with a neural network classifier consisting of 3 fully-connected layers. As shown in Figure 4, LCM also performs the best on each testing set under each dimension.