Modeling Text using the Continuous Space Topic Model with Pre-Trained Word Embeddings

In this study, we propose a model that extends the continuous space topic model (CSTM), which flexibly controls word probability in a document, using pre-trained word embeddings. To develop the proposed model, we pre-train word embeddings that capture the semantics of words and plug them into the CSTM. Intrinsic experimental results show that the proposed model outperforms the CSTM in terms of perplexity and convergence speed. Furthermore, extrinsic experimental results show that the proposed model is useful for a document classification task compared with the baseline model. We also qualitatively show that the latent coordinates obtained by training the proposed model are better than those of the baseline model.


Introduction
Topic models are statistical models that automatically extract latent topics in documents from a text corpus. Topic models have been used in various applications within and outside of natural language processing. Such applications include information retrieval (Wei and Croft, 2006), collaborative filtering (Marlin, 2003), author identification (Rosen-Zvi et al., 2012), and opinion extraction (Lin et al., 2011).
The latent Dirichlet allocation (LDA) (Blei et al., 2003), a representative topic model, assumes that each document has latent topics. It uses an unobservable random variable, the latent topic, to formulate the factors that produce sets of words that are statistically likely to co-occur. Unlike the LDA, the continuous space topic model (CSTM) (Mochihashi et al., 2013) models documents without intermediate variables such as latent topics. Specifically, the CSTM is formulated by introducing latent coordinates of words and a function that follows a Gaussian process in the same space to represent the importance of each word in a document. In the LDA, the probability distribution over words within each topic is fixed, and word probabilities are controlled only through the topic distribution. Therefore, the word distribution cannot be changed according to each document, and text cannot be modeled in a fine-grained way. By contrast, the CSTM controls word probabilities based on the latent coordinates of the words and the function representing the meaning of the document. Hence, the CSTM can dynamically change the word distribution according to the document. Additionally, the CSTM outperforms conventional topic models, such as the LDA, in terms of perplexity.
As mentioned above, the CSTM models documents using word embeddings; however, in the CSTM the word embeddings (latent coordinates) are free parameters. Therefore, estimating the model is time-consuming because of the large number of parameters. In addition, the only information used to estimate the word embeddings is word frequency, which makes it difficult to capture the semantics of words.
In this study, we propose a new method in which the latent coordinates of words, which are free parameters of the CSTM, are learned in advance using word2vec (Mikolov et al., 2013), and the learned distributed representations of the words are introduced into the CSTM. As in the Gaussian LDA (Das et al., 2015), by using word embeddings that capture the semantics of words as prior information for the model, we can expect improved performance and faster convergence. In the experiments, we use English and Japanese corpora to compare the proposed method with the baseline CSTM in terms of perplexity and convergence speed. We also perform a document classification task to evaluate the quality of the document representations learned by our model. In the discussion, we use the trained model to investigate the importance of words in documents and evaluate the trained model qualitatively. Additionally, we visualize the latent coordinates of words and documents in the same space.
The main contributions of this study are as follows:
• We propose a CSTM-based model that estimates parameters faster and obtains useful document representations using pre-trained word embeddings.
• Intrinsic experiments using English and Japanese corpora show that the proposed model exhibits a superior performance over the baseline model in terms of perplexity and convergence speed.
• Extrinsic experimental results show that document embeddings obtained by the proposed model are useful for document classification.
2 Related Work

Word Embeddings and Topic Models
Several studies have aimed to improve the performance of topic models by using distributed representations of words. Das et al. (2015) proposed the Gaussian LDA (G-LDA), which places a multivariate Gaussian distribution in the word embedding space to estimate topics in that space. Compared with the LDA, it has higher coherence (Chang et al., 2009) because it introduces prior knowledge of the semantics of words through pre-trained word embeddings. Recently, Dieng et al. (2020) proposed the embedded topic model (ETM). The ETM models each word with a categorical distribution whose natural parameter is the inner product between the embedding of the word and the embedding of its assigned topic. It outperformed traditional topic models, including the LDA. However, both models use latent topics to model documents: the G-LDA defines latent topics as multivariate Gaussian distributions, and the ETM formulates the word probability through topic embeddings. Therefore, these topic models can hardly control word probabilities directly for each document. In Section 2.2, we introduce the CSTM, which can directly control word probabilities within a document.
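As a concrete illustration, the ETM's per-topic word distribution can be sketched as a softmax over inner products between the word embeddings and a topic embedding; the dimensions and random vectors below are purely illustrative:

```python
import numpy as np

def etm_topic_word_probs(word_emb, topic_emb):
    # Softmax over inner products between each word embedding and the topic
    # embedding: the ETM's categorical word distribution for one topic.
    logits = word_emb @ topic_emb
    logits = logits - logits.max()   # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
word_emb = rng.normal(size=(1000, 50))   # V x d word embeddings (toy)
topic_emb = rng.normal(size=50)          # one topic embedding (toy)
beta_k = etm_topic_word_probs(word_emb, topic_emb)
```

Note that this distribution depends only on the topic, not on the document, which is exactly why the ETM cannot reshape word probabilities per document.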

Continuous Space Topic Model
In the CSTM, the probability of a word is modeled through the Polya distribution, a compound of the Dirichlet and multinomial distributions, to account for the burstiness of language (Doyle and Elkan, 2009). We denote y = (y_1, y_2, ..., y_V) as the frequencies of the words in a document. The Polya distribution is defined as follows:

p(y | \alpha) = \frac{\Gamma(\sum_v \alpha_v)}{\Gamma(\sum_v (y_v + \alpha_v))} \prod_{v=1}^{V} \frac{\Gamma(y_v + \alpha_v)}{\Gamma(\alpha_v)},  (1)

where α = (α_1, ..., α_V) represents the concentration parameters of the Polya distribution. We assume that each word w_v has a latent coordinate ψ(w_v) ∈ R^d. To increase the probability of semantically related words in each document, we generate a function that follows a Gaussian process with a mean of zero in the same latent space:

f ∼ GP(0, K),  (2)

where K is the kernel matrix, in this case an inner-product kernel: K_{ij} = k(w_i, w_j) = ψ(w_i)^T ψ(w_j). A Gaussian process (Rasmussen and Williams, 2006) is a stochastic process that generates a random regression function: the larger k(w_i, w_j) is, the closer the corresponding outputs f(w_i) and f(w_j) will be. Intuitively, f represents "what we want to say in this document." The concentration parameter α_v of the Polya distribution is then modeled to be larger according to its function value:

α_v = α_0 G_0(w_v) exp(f(w_v)),  (3)

where α_0 ∼ Ga(a_0, b_0) is a free parameter, and Ga(a_0, b_0) denotes the gamma distribution. Additionally, G_0(w_v) ∼ PY(β, γ) represents the "default" probability of the word w_v, and PY(β, γ) denotes the Pitman-Yor process. In practice, the maximum likelihood estimator Ĝ_0(w_v) = n(w_v) / Σ_{v'} n(w_{v'}) is used, where n(w_v) is the frequency of the word w_v in all documents. Based on this, the generative process of the CSTM for N documents is as follows:
1. Draw α_0 ∼ Ga(a_0, b_0).
2. Draw G_0 ∼ PY(β, γ).
3. Draw the latent word coordinates ψ(w_1), ..., ψ(w_V).
4. For n = 1 . . . N,
• Draw f_n ∼ GP(0, K).
• Set α_{nv} = α_0 G_0(w_v) exp(f_n(w_v)) for v = 1, ..., V.
• Draw y_n ∼ Polya(α_n).
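The Polya likelihood and the concentration rule α_v = α_0 G_0(w_v) exp(f(w_v)) can be sketched in a few lines of numpy; the toy vocabulary, counts, and default probabilities below are illustrative, and the multinomial coefficient is omitted because it cancels in likelihood ratios:

```python
import numpy as np
from scipy.special import gammaln

def polya_log_likelihood(y, alpha):
    # Log-likelihood of word counts y under the Polya (Dirichlet-multinomial)
    # distribution, omitting the constant multinomial coefficient.
    return (gammaln(alpha.sum()) - gammaln(y.sum() + alpha.sum())
            + np.sum(gammaln(y + alpha) - gammaln(alpha)))

def concentration(alpha0, g0, f):
    # alpha_v = alpha_0 * G_0(w_v) * exp(f(w_v)): words with larger f-values
    # become more probable in this document.
    return alpha0 * g0 * np.exp(f)

# Toy 5-word vocabulary; f boosts words 0 and 1 and suppresses words 3 and 4.
g0 = np.array([0.3, 0.3, 0.2, 0.1, 0.1])   # "default" word probabilities
f = np.array([1.0, 1.0, 0.0, -1.0, -1.0])  # document-specific GP values
alpha = concentration(10.0, g0, f)
y = np.array([4, 3, 2, 0, 1])              # observed word counts
ll = polya_log_likelihood(y, alpha)
```

A document whose counts concentrate on the boosted words receives a higher likelihood than one concentrated on the suppressed words, which is the mechanism the CSTM uses to adapt the word distribution per document.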
3 Proposed Method

Word Embeddings
Word2vec (Mikolov et al., 2013) is a probabilistic model for learning distributed representations that capture the semantics of words based on the distributional hypothesis (Harris, 1954). The continuous bag-of-words (CBOW) model, one of the learning methods of word2vec, obtains word embeddings by maximizing the predicted probability of the target word w_t:

p(w_t | C_{w_t}) = \frac{\exp(\eta(w_t)^T \bar{\eta}(C_{w_t}))}{\sum_{w'} \exp(\eta(w')^T \bar{\eta}(C_{w_t}))},  (4)

where C_{w_t} = {w_{t±i} | 1 ≤ i ≤ δ} represents the set of nearby context words, δ is the context window width, and \bar{\eta}(C_{w_t}) := |C_{w_t}|^{-1} \sum_{w ∈ C_{w_t}} \eta(w) denotes the average vector of all context word vectors.
We use the CBOW model to learn word embeddings. In this study, we used a relatively large context window of δ = 10 to capture topical information (Bansal et al., 2014). In general, it has been shown that the quality of word embeddings improves with centering (Hara et al., 2015; Mu and Viswanath, 2018). Accordingly, the acquired distributed representations η(w_1), η(w_2), ..., η(w_V) are centered and normalized as follows:

ψ(w_v) = \frac{\tau (\eta(w_v) - \bar{\eta})}{S},   \bar{\eta} = V^{-1} \sum_{v=1}^{V} \eta(w_v),  (5)

where S is a normalization constant, defined as the root-mean-square norm of the centered vectors:

S = \sqrt{ V^{-1} \sum_{v=1}^{V} \| \eta(w_v) - \bar{\eta} \|^2 }.  (6)

In addition, τ is a hyperparameter that controls the variance of the word embeddings; in this study, we simply set τ = d^{−1/2}.
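A minimal numpy sketch of this centering-and-normalization step, assuming the RMS norm of the centered vectors as the normalization constant S and the default τ = d^{-1/2}; the random embeddings are illustrative:

```python
import numpy as np

def center_and_scale(E, tau=None):
    # Center pre-trained embeddings (rows = words) and rescale them so that
    # tau controls their variance; tau defaults to d**-0.5 as in the paper.
    V, d = E.shape
    if tau is None:
        tau = d ** -0.5
    centered = E - E.mean(axis=0)                    # subtract the mean vector
    S = np.sqrt((centered ** 2).sum(axis=1).mean())  # assumed: RMS norm as S
    return tau * centered / S

rng = np.random.default_rng(0)
E = rng.normal(loc=2.0, size=(100, 16))  # toy embeddings with a nonzero mean
Psi = center_and_scale(E)
```

After this step, the embeddings have zero mean and an RMS norm of exactly τ, so the scale of the inner-product kernel is controlled by a single hyperparameter.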

Modeling Text with Pre-trained Word Embeddings
Next, as in Mochihashi et al. (2013), we define a function that follows a Gaussian process with mean zero and kernel function k(w_i, w_j) = ψ(w_i)^T ψ(w_j) in the latent space consisting of the normalized word embeddings obtained above:

f ∼ GP(0, K).  (7)

However, because f is, in principle, infinite-dimensional and difficult to estimate directly, we introduce an auxiliary variable representing the latent coordinates of the document in the word latent space, similar to the discrete infinite logistic normal distribution (Paisley et al., 2011), which introduces latent coordinates to correlate topics in the LDA framework:

u_n ∼ N(0, I_d).  (8)

We summarize the latent coordinates of the words as Ψ = (ψ(w_1), ψ(w_2), ..., ψ(w_V))^T, and we can obtain the distribution of f = Ψu by marginalizing u as follows:

f = Ψu ∼ N(0, ΨΨ^T) = N(0, K),  (9)

i.e., f follows the same Gaussian process as expressed in Eq. (7). Therefore, in the proposed method, we define the Gaussian process representing the meaning of the document using the document vector u_n, which lies in the same latent space as the word vectors:

f_n(w_v) = ψ(w_v)^T u_n.  (10)

Next, we define α_v as in Eq. (3):

α_{nv} = α_0 G_0(w_v) exp(ψ(w_v)^T u_n),  (11)

and model the probability of a word using the Polya distribution in Eq. (1).
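To make the formulation concrete, here is a small numpy sketch, with a hand-picked toy Ψ, G_0, and u, of how a document vector shifts the expected word probabilities under the Polya model:

```python
import numpy as np

def word_alphas(Psi, u, alpha0, g0):
    # alpha_v = alpha_0 * G_0(w_v) * exp(psi(w_v)^T u): the document vector u
    # lives in the same latent space as the word coordinates.
    return alpha0 * g0 * np.exp(Psi @ u)

def expected_word_probs(alpha):
    # Expected word probabilities under the Polya distribution.
    return alpha / alpha.sum()

Psi = np.array([[1.0, 0.0],     # toy coordinates for three words
                [0.0, 1.0],
                [-1.0, 0.0]])
g0 = np.array([0.5, 0.3, 0.2])  # toy default word probabilities
u = np.array([1.0, 0.0])        # a document aligned with word 0
p = expected_word_probs(word_alphas(Psi, u, 10.0, g0))
```

Words whose coordinates point in the same direction as u receive boosted probability, while words pointing away are suppressed, which is how the model reshapes the word distribution per document.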

Bayesian Markov Chain Monte Carlo (MCMC) Estimation
By combining the N documents as D = (y_1, y_2, ..., y_N), we can obtain the joint distribution of α_0 and α as follows:

p(α_0, α | D) ∝ \prod_{n=1}^{N} p(y_n | α_n) p(α_0).  (12)

However, because α changes only through the document vector u in Eq. (10), in the proposed model the joint distribution of the estimated parameters α_0 and u = (u_1, u_2, ..., u_N) is denoted as follows:

p(α_0, u | D) ∝ \prod_n p(y_n | α_0, G_0, ψ, u_n) p(α_0) p(u_n).  (13)

For model estimation, we use the random walk Metropolis-Hastings (MH) algorithm to avoid the problem of local optima, as demonstrated by Mochihashi et al. (2013). 1 We show the MCMC algorithm of the proposed model in Figure 1. The parameters to be estimated are α_0 and the document vectors u in Eq. (11). Candidates for each parameter are generated using the following random walk proposal distributions:

α_0' ∼ N(α_0, σ_{α_0}^2),   u_n' ∼ N(u_n, σ_u^2 I).  (14)

A candidate is accepted according to the acceptance probability given by the following likelihood ratio:

A = min( 1, p(α_0', u' | D) / p(α_0, u | D) ).  (15)

In this study, we set σ_{α_0} = 0.2 and σ_u = 0.01, which are the random walk widths that control the efficiency of training, based on the results of preliminary experiments.
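A minimal sketch of one random-walk MH update for a single document vector u_n, assuming a standard normal prior on u_n; the toy word coordinates, counts, and α_0 below are illustrative:

```python
import numpy as np
from scipy.special import gammaln

def polya_ll(y, alpha):
    # Log Polya (Dirichlet-multinomial) likelihood, up to the multinomial
    # coefficient, which cancels in MH acceptance ratios.
    return (gammaln(alpha.sum()) - gammaln(y.sum() + alpha.sum())
            + np.sum(gammaln(y + alpha) - gammaln(alpha)))

def mh_update_u(u, y, Psi, alpha0, g0, rng, sigma_u=0.01):
    # One random-walk MH step for a document vector u_n, assuming a
    # standard normal prior p(u_n) = N(0, I).
    def log_post(u_):
        alpha = alpha0 * g0 * np.exp(Psi @ u_)
        return polya_ll(y, alpha) - 0.5 * u_ @ u_   # likelihood + log prior
    proposal = u + rng.normal(scale=sigma_u, size=u.shape)
    log_ratio = log_post(proposal) - log_post(u)
    if np.log(rng.uniform()) < log_ratio:
        return proposal, True
    return u, False

rng = np.random.default_rng(0)
Psi = rng.normal(size=(5, 3))          # toy word coordinates
g0 = np.full(5, 0.2)                   # toy default word probabilities
y = np.array([5, 3, 0, 1, 1])          # toy word counts for one document
u, accepts = np.zeros(3), 0
for _ in range(100):
    u, accepted = mh_update_u(u, y, Psi, 10.0, g0, rng)
    accepts += accepted
```

With the small random walk width σ_u = 0.01 used in the paper, proposals change the posterior only slightly, so the acceptance rate stays high at the cost of slower exploration.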

Corpora
In the experiments, we used the Neural Information Processing Systems (NIPS) corpus 2, which is in English, and two Japanese corpora: the Corpus of Spontaneous Japanese (CSJ) and the Mainichi Newspaper corpus (10,000 randomly selected articles from 2013). For Japanese, we preprocessed the texts using MeCab 3 with IPADic. In all corpora, words with a frequency of less than five were excluded from the training data. The statistics of each corpus are listed in Table 1.

Intrinsic Evaluation
To evaluate the performance of the topic models, we computed the perplexity of the proposed model, the CSTM, and the ETM. Following Wallach et al. (2009), we randomly selected 80% of the words in each document as training data and calculated the perplexity on the remaining 20%. For the proposed model and the CSTM, we varied the latent dimension size over 10, 20, 50, and 100 and report the best score on the test data. For the ETM, we set the learning rate to 0.002 and the weight decay parameter to 1.2 × 10^−6, and then selected the model with the best validation score by varying the number of topics over 10, 20, 50, and 100.
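The held-out protocol can be sketched in two small helpers: split each document's tokens 80/20, then exponentiate the negative mean per-token log-probability; the token list below is illustrative:

```python
import numpy as np

def split_document(tokens, rng, train_frac=0.8):
    # Randomly assign train_frac of a document's tokens to training
    # and the rest to the held-out test portion.
    idx = rng.permutation(len(tokens))
    cut = int(train_frac * len(tokens))
    return [tokens[i] for i in idx[:cut]], [tokens[i] for i in idx[cut:]]

def perplexity(log_probs):
    # Perplexity = exp of the negative mean per-token log-probability;
    # a uniform model over 10 words therefore scores exactly 10.
    return float(np.exp(-np.mean(log_probs)))

rng = np.random.default_rng(0)
train_toks, test_toks = split_document(["a", "b", "c", "d", "e"], rng)
```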
Perplexity The perplexity of the proposed model, the CSTM, and the ETM computed for each corpus is shown in Table 2. The proposed method outperforms the CSTM and the ETM in terms of perplexity on all three corpora. Compared with the CSTM, the proposed method naturally performs better because it incorporates topical information from the pre-trained word embeddings. The ETM cannot directly control the word probability in a document because it formulates the word probability through topic embeddings, so the proposed model, which can control the word probability flexibly, achieves better predictive power. Figure 2 shows the perplexity convergence of the proposed model and the CSTM. The proposed model takes fewer than ten iterations to converge, whereas the CSTM takes fifty to a hundred iterations. The proposed model also outperforms the CSTM in terms of convergence speed on all corpora because it has topical information as prior knowledge from the pre-trained word embeddings.

Extrinsic Evaluation
To evaluate the quality of the document representations learned by our model, we perform a document classification task. We compare the performance of the proposed model with that of the CSTM and word2vec.
Settings In this experiment, we use the one-versus-one support vector machine implemented in scikit-learn 4. The data was split into 90% for training and 10% for testing. For the penalty parameter C and the RBF-kernel parameter γ, we perform a grid search with 10-fold cross-validation on the training data and select the best models in terms of accuracy. For all other parameters, we use the scikit-learn defaults. We define the features as follows: for the CSTM, we use the document vectors; for word2vec, we use the mean of the word vectors in the document; for the proposed model, we use the document vector (denoted "Ours") and the concatenation of the mean word vector and the document vector (denoted "Ours w/ word2vec"). We also apply a paired t-test to compare the proposed models with the baseline models, using a 95% confidence level to identify significant differences between two compared models.
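A sketch of this classification setup in scikit-learn, using the iris dataset as a stand-in for the document-vector features; the grid values are illustrative, not the ones tuned in the paper:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in features: the real experiment uses learned document vectors.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y)   # 90% / 10% split

grid = GridSearchCV(
    SVC(kernel="rbf", decision_function_shape="ovo"),  # one-versus-one SVM
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    cv=10, scoring="accuracy")                         # 10-fold CV, best accuracy
grid.fit(X_train, y_train)
test_acc = grid.score(X_test, y_test)
```

`GridSearchCV` refits the best (C, γ) pair on the full training split before scoring on the held-out 10%, mirroring the model selection procedure described above.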
Results Table 3 shows the classification accuracy on the CSJ corpus using each feature. Using only the document vector obtained from the proposed model, we can see that it significantly (p < 0.05) outperformed the CSTM but was slightly inferior to word2vec. However, when we combine the document vector from the proposed model with the average word vector from word2vec, the accuracy exceeds that of word2vec, although the difference is not statistically significant. We analyze the classification results in detail in Section 5.3.

Visualizing Word and Document Embeddings
In the proposed model and the CSTM, word vectors and document vectors are located in the same space, so we can observe the relationships between words and documents simultaneously by visualizing the embedding space. We apply PCA to the vectors of high-frequency words and all documents to reduce the dimensionality. The reduced word and document vectors obtained by the proposed model are shown in Figures 3 and 4, and we additionally show the visualization of the full embedding space, including those documents, in Figure 5 in the Appendix. In these figures, two representative documents are shown: a neuroscience article titled "The Role of Activity in Synaptic Competition at the Neuromuscular Junction," and a computer science article titled "Bayesian Model Comparison by Monte Carlo Chaining." Figure 3 enlarges the reduced embedding space around the neuroscience article, which shows words such as "signal," "neurons," and "Cortex." Figure 4 enlarges the reduced embedding space around the computer science article, which shows words such as "Bayesian," "iterations," "optimized," and "parameters." From these figures, we can see that words related to the topics of each article are correctly located nearby. Therefore, the proposed model can place document vectors appropriately in the word embedding space, which enhances the performance of the model.
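The PCA projection used for this visualization can be sketched via SVD, stacking word and document vectors (random toy vectors here) before projecting them into two dimensions:

```python
import numpy as np

def pca_project(X, k=2):
    # Project the rows of X onto their top-k principal components via SVD.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(50, 16))   # toy high-frequency word coordinates
doc_vecs = rng.normal(size=(5, 16))     # toy document coordinates (same space)
projected = pca_project(np.vstack([word_vecs, doc_vecs]))
words_2d, docs_2d = projected[:50], projected[50:]
```

Stacking both sets before the projection is what keeps words and documents comparable in the same 2-D plot.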

Analyzing the Importance of Words in a Document
In the proposed model and the CSTM, the document vectors are defined in the same space as the word vectors. Therefore, based on the inner product of a document vector and a word vector, we can quantitatively measure the importance of words in a document, that is, which words are likely to appear in the document and which are not. For the calculation, we used the document vector and the word vectors of all words in the training vocabulary, including words that do not actually appear in the document. As an example, for both the proposed model and the CSTM, we used the neuroscience article in the NIPS corpus to compute the ranking of topic-related and topic-unrelated words in the document. Tables 4 and 5 show the results of the proposed model and the CSTM, respectively. Words that actually appear in the document are shown in bold. Although the results of both the CSTM and the proposed model contain words appearing in the document, the proposed model captures the topic of the document comparatively better and gives high scores to topic-related words. The topic-related words obtained using the CSTM included only a few words related to the topic of the document, whereas those obtained using the proposed model included many, such as "axon," "synapses," and "nervous." This means that such words receive a correspondingly higher probability in the document. Moreover, the topic-unrelated words obtained by the proposed model, such as "Euclidean," "gradient," and "regression," were indeed unrelated to the topic of the document. We believe this is because, unlike the CSTM, the proposed model has prior knowledge of the topical information of words, thereby facilitating the estimation of document vectors that capture sets of topically similar words.
Table 6 shows the classification accuracy for the eight category labels using each feature. The proposed model outperforms the CSTM substantially in all categories. For example, in the classification of "Speech Processing," the CSTM misclassified some of the documents as "Linguistics," "Psychology," and "Artificial Intelligence," while the proposed model classified almost all of the documents as "Speech Processing," except for some of the documents labeled "Linguistics." We find that the CSTM misclassified one of the documents in "Speech Processing," which discusses statistical methods in detail, as "Psychology," while the proposed model classified it correctly. The CSTM models word co-occurrence on a document-by-document basis, as in Eq. (3), even though multiple topics might exist in a document. Therefore, the document vectors obtained by the CSTM do not carry the information needed to separate the semantics of psychology from those of statistics. In contrast, the proposed model models word co-occurrence based on local context windows, within which topics are considered to be relatively consistent, so it can distinguish the word set typical of psychology from that of statistics in the embedding space. Hence, because the document vectors are estimated in a space where the word vectors encode this semantic difference, the proposed model can distinguish those documents.

Conclusion and Future Work
In this study, we introduced the learned distributed representation of words into the CSTM to provide prior knowledge on the semantics of words.
In the experiments, we showed that the proposed model outperformed the baseline method in terms of perplexity and convergence speed. We also showed that the proposed model is useful for a document classification task compared with the baseline model. Additionally, we showed, through visualization of the embedding space and analysis of the importance of words in a document, that the document vectors obtained by training the model are superior.
In the future, we would like to investigate better ways of estimating the model, including optimization with Hamiltonian Monte Carlo, which was not used in this study. Furthermore, we would like to use contextualized word embeddings obtained from ELMo (Peters et al., 2018) or BERT (Devlin et al., 2019) in the proposed model.