Extracting Topics with Simultaneous Word Co-occurrence and Semantic Correlation Graphs: Neural Topic Modeling for Short Texts

Short texts have become a popular form of text data, e.g., Twitter posts, news titles, and product reviews. Extracting semantic topics from short texts plays a significant role in a wide spectrum of NLP applications, and neural topic modeling is now a major tool to achieve it. Motivated by learning more coherent and semantic topics, in this paper we develop a novel neural topic model named Dual Word Graph Topic Model (DWGTM), which extracts topics from simultaneous word co-occurrence and semantic correlation graphs. To be specific, we learn word features from the global word co-occurrence graph, so as to ingest rich word co-occurrence information; we then generate text features with word features, and feed them into an encoder network to get per-text topic proportions; finally, we reconstruct the texts and the word co-occurrence graph with topical distributions and word features, respectively. Besides, to capture the semantics of words, we also apply the word features to reconstruct a word semantic correlation graph computed by pre-trained word embeddings. Upon those ideas, we formulate DWGTM in an auto-encoding paradigm and efficiently train it in the spirit of neural variational inference. Empirical results validate that DWGTM can generate more semantically coherent topics than baseline topic models.


Introduction
The topic modeling family aims to learn latent topic representations from collections of text documents (Blei, 2012). During the past decades, it has been extensively applied in many natural language processing tasks, e.g., sentiment analysis (Lin and He, 2009), summarization (Ma et al., 2012), and classification (Zeng et al., 2018), to name just a few. Conventional topic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) are often inferred by approximate inference methods, e.g., mean-field variational inference (Jordan et al., 1999) and collapsed Gibbs sampling (Griffiths and Steyvers, 2004), which require model-specific derivations. Recent inference methods built on neural networks, such as the Variational Auto-Encoder (VAE) (Kingma and Welling, 2014), work in a black-box manner, providing a more generic and flexible solution to topic models beyond traditional approximate inference. Broadly speaking, models inferred with neural networks are referred to as neural topic models, and they have recently drawn much attention from the natural language processing community (Zhu et al., 2018; Burkhardt and Kramer, 2019; Dieng et al., 2020).

Unfortunately, both conventional and neural topic models tend to perform poorly on short texts, an increasingly prevalent and significant form of text data, e.g., Twitter posts, news titles, and product reviews. The main reason is that short texts lack document-level word co-occurrences, known as the sparsity problem, which hinders models from capturing coherent word patterns. Many conventional topic models have been developed to handle short texts. For example, given the very few words per text, the Dirichlet Multinomial Mixture (DMM) (Nigam et al., 2000; Yin and Wang, 2014) constrains each text to cover a single topic. The Biterm Topic Model (BTM) (Yan et al., 2013; Cheng et al., 2014) directly learns topics from corpus-level word co-occurrence patterns.
Recently, there have also been a few attempts at neural topic models aiming to address the sparsity problem of short texts. GraphBTM (Zhu et al., 2018) extracts topics from word graphs of randomly drawn mini-corpora. The Negative sampling and Quantization Topic Model (NQTM) applies a topic distribution quantization method to pursue peakier topic proportions of texts. As reported in their original papers, these neural topic models can empirically induce coherent topics from short texts.
Motivated by learning more coherent and semantic topics, in this paper we develop a novel neural topic model for short texts, namely the Dual Word Graph Topic Model (DWGTM). As the name suggests, DWGTM applies two word graphs: the word co-occurrence graph, constructed by aggregating the word co-occurrence patterns of each text to alleviate the sparsity problem, and the word semantic correlation graph, generated with pre-trained word embeddings to capture the semantic information of words. Specifically, we formulate DWGTM in an auto-encoding paradigm with four main components: (1) We encode the word co-occurrence graph into word features by applying a Graph Convolutional Network (GCN) module (Kipf and Welling, 2016a). (2) For each text, we construct its feature from the corresponding word features, and encode it as a topic proportion.
(3) We reconstruct the texts with topical distributions. (4) We reconstruct the two word graphs with word features. With the word semantic correlation graph, DWGTM can output topics that are associated with the semantic information of words. Besides, we propose a novel topic quality metric to measure the semantic coherence of learned topics, namely Topical Semantics Coherence (TSC). We conduct extensive experiments to evaluate DWGTM, and the empirical results indicate that DWGTM can learn more semantically coherent topics than existing baseline models.
In a nutshell, the major contributions of this paper are listed below:
• We propose a novel neural topic model DWGTM for short texts, extracting topics from simultaneous word co-occurrence and semantic correlation graphs.
• We propose a novel topic quality metric called TSC, which measures the semantic coherence of learned topics.
• On three benchmark datasets of short texts, DWGTM empirically outputs more semantically coherent topics than strong baseline models.

Related Work
In this section, we briefly review related work on conventional topic modeling of short texts and neural topic modeling.

Topic Modeling for Short Texts
Short texts lack document-level word co-occurrence information, making conventional topic models such as LDA (Blei et al., 2003) much less effective. To resolve this issue, existing models mainly adopt the methodology of word co-occurrence enrichment (Yan et al., 2013; Yin and Wang, 2014; Quan et al., 2015; Zuo et al., 2016a,b; Li et al., 2016; Shi et al., 2018; Li et al., 2019a,b, 2020a). First, one straightforward way is to generate long pseudo-texts by adaptively aggregating short texts, and then learn topics from them by applying LDA. Several representatives (Quan et al., 2015; Zuo et al., 2016a) jointly estimate long pseudo-texts and topics; however, they are often time-consuming as well as sensitive to the number of long pseudo-texts. Second, another mainstream approach is to extract more word co-occurrences at the corpus level. BTM (Yan et al., 2013; Cheng et al., 2014) directly induces topics from all word co-occurrence patterns of the corpus. Semantics-assisted Non-negative Matrix Factorization (SeaNMF) (Shi et al., 2018) regards each word type as a pseudo-text consisting of the words that co-occur with it in the same short text, and learns topics from those auxiliary word-type pseudo-texts. Additionally, other attempts (Li et al., 2016, 2019a) upgrade existing models, e.g., DMM and BTM, by further leveraging auxiliary knowledge or techniques such as word semantic correlations measured by pre-trained word embeddings (Mikolov et al., 2013; Pennington et al., 2014). In contrast to the aforementioned models, our DWGTM is built on the framework of neural variational inference with a GCN (Kipf and Welling, 2016a), enabling it to effectively extract topics from word co-occurrence patterns.

Neural Topic Modeling
Along the new research line of integrating VAE (Kingma and Welling, 2014), a number of neural topic models have been proposed. Generally, the basic idea of neural topic modeling is to apply neural networks as topic encoders to induce topic representations of texts, and to reconstruct the texts with topical distributions. Benefiting from the effectiveness and flexibility of neural networks in unsupervised representation learning, neural topic models can induce more meaningful topics from texts. Representatives include the Neural Variational Document Model (NVDM) (Miao et al., 2016), Product of experts LDA (ProdLDA) (Srivastava and Sutton, 2017), and the Embedded Topic Model (ETM) (Dieng et al., 2020), among others. Besides these "naive" neural variants of LDA, many other models have been investigated by applying (1) various neural modules to the topic encoder, e.g., recurrent modules (Rezaee and Ferraro, 2020), attention mechanisms (Li et al., 2020b), and graphical connections (Zhu et al., 2018; Yang et al., 2020), and (2) new learning paradigms, e.g., adversarial training (Wang et al., 2019), reinforcement learning (Gui et al., 2019), and lifelong learning (Gupta et al., 2020). However, despite their effectiveness on normal long texts, those models suffer from the sparsity problem of short texts (Zeng et al., 2018). To our knowledge, there are only a few neural topic models addressing the sparsity problem of short texts (Zeng et al., 2018; Zhu et al., 2018). Inspired by BTM (Yan et al., 2013), the GraphBTM method (Zhu et al., 2018) directly learns topics from the aggregated word co-occurrence patterns of randomly generated mini-corpora. The NQTM method builds on the assumption that peakier topic proportions of texts are more appropriate for modeling short texts, as demonstrated in DMM (Yin and Wang, 2014). To achieve this, it applies a topic distribution quantization method, and meanwhile adopts a negative sampling step to avoid repetitive topics.
Orthogonal to those models, our DWGTM further employs the pre-trained word embeddings to capture the semantic information of words, so as to output more semantically coherent topics.

The Proposed DWGTM Model
In this section, we introduce the proposed Dual Word Graph Topic Model (DWGTM). For convenience, the important notations used in this paper are summarized in Table 1.

Overview of DWGTM
The topic modeling family, such as LDA (Blei et al., 2003), refers to probabilistic models that describe the generative process of documents. Basically, such a model posits k topics φ_1:k, each of which is a multinomial distribution over the vocabulary, and each document is represented by a topic proportion θ. Given a corpus D consisting of n documents x_1:n, the main goal of topic modeling is to estimate the topics φ_1:k and topic proportions θ_1:n from D. However, exact posterior inference is commonly intractable.

To effectively handle short texts, we propose a novel neural topic model named DWGTM, which not only leverages the corpus-level word co-occurrence information to address the sparsity problem (Yan et al., 2013), but also captures word semantic correlations measured by pre-trained word embeddings (Mikolov et al., 2013; Pennington et al., 2014). Specifically, as depicted in Fig.1, DWGTM consists of four components in an auto-encoding manner. (1) WCG-Encoder: We construct a global word co-occurrence graph G^c, and then encode G^c as word features z^w_{1:v}, where v denotes the vocabulary size.
(2) TP-Encoder: We construct latent text features z^t_{1:n} by using z^w_{1:v}, and then encode z^t_{1:n} as topic proportions θ_{1:n}.
(3) Text-Decoder: We reconstruct the texts x_1:n with θ_{1:n} and topics φ_1:k. (4) DualWG-Decoder: We reconstruct G^c with z^w_{1:v}. Meanwhile, to further capture the semantic information of words, we construct a word semantic correlation graph G^s by using pre-trained word embeddings, and reconstruct G^s also with z^w_{1:v}. In the following parts, we introduce each component of DWGTM in more detail.

WCG-Encoder
Given a corpus D, we first construct a word co-occurrence graph G^c = (V, E^c), where V and E^c denote the sets of word nodes and word co-occurrence edges, respectively. The graph can be represented by the co-occurrence adjacency matrix A^c ∈ R^{v×v}, where each element A^c_{ij} denotes the count of words w_i and w_j co-occurring in the same text. The WCG-Encoder aims to encode A^c as word features Z^w = [z^w_1, ..., z^w_v], so that more frequently co-occurring words tend to have more similar word features. This is achieved by applying a GCN module parameterized by W_c. Following (Kipf and Welling, 2016a), each layer of the GCN module is formulated below:

Z^{w(l+1)} = ψ(Ã^c Z^{w(l)} W_c^{(l)}),   l = 0, ..., l_c − 1,

where l_c is the number of layers; W_c^{(0)}, ..., W_c^{(l_c−1)} are the learnable parameters; Z^{w(0)} is initialized by the v × v identity matrix I_v; ψ(·) denotes the Tanh activation function; Ã^c = D^{−1/2}(A^c + I_v)D^{−1/2} is the symmetrically normalized adjacency matrix; and D denotes the degree matrix of A^c + I_v.
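For illustration, the WCG-Encoder can be sketched as follows in NumPy. This is a minimal sketch under our own assumptions, not the authors' implementation: the helper names are invented, the learnable weights are passed in as plain matrices, and a deep-learning framework would normally replace the raw matrix products.

```python
import numpy as np

def cooccurrence_adjacency(bows):
    """Aggregate per-text word co-occurrences into A^c.
    `bows` is an n x v bag-of-words count matrix."""
    presence = (bows > 0).astype(float)
    A = presence.T @ presence        # words appearing in the same text
    np.fill_diagonal(A, 0.0)         # drop self co-occurrence edges
    return A

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_word_features(A_c, weights):
    """Stacked GCN layers: Z^{(l+1)} = tanh(A_norm Z^{(l)} W^{(l)}),
    with Z^{(0)} = I_v (one-hot node features)."""
    A_norm = normalize_adjacency(A_c)
    Z = np.eye(A_c.shape[0])
    for W in weights:
        Z = np.tanh(A_norm @ Z @ W)
    return Z
```

Each row of the returned matrix is a word feature z^w_i; words with similar co-occurrence neighborhoods receive similar rows.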

TP-Encoder
Naturally, the resulting word features Z^w learned from G^c are rich in global word co-occurrence information. Accordingly, we can use Z^w to generate latent text features z^t_{1:n}, which alleviates the sparsity problem of short texts. For each short text, the latent text feature is obtained by aggregating its corresponding word features:

z^t_d = (1 / |x_d|) x_d Z^w,

where x_d and |x_d| denote the word frequency vector of the d-th document and its total number of word tokens, respectively. The TP-Encoder aims to encode z^t_{1:n} as topic proportions θ_{1:n}. Inspired by (Miao et al., 2016; Dieng et al., 2020), we apply the VAE-like paradigm with a logistic-normal prior distribution. Specifically, suppose that for each short text the topic proportion is drawn from a logistic-normal prior as follows:

δ_d ∼ N(µ_0, Σ_0),   θ_d = softmax(δ_d),

where δ_d can be regarded as the unnormalized topic proportion, and N(µ_0, Σ_0) denotes a Gaussian prior. We apply a fully-connected module, a.k.a. a variational inference network (Dieng et al., 2020), parameterized by W_t, which ingests each latent text feature z^t_d and outputs the mean µ_d and covariance Σ_d of the unnormalized topic proportion δ_d:

[µ_d; Σ_d] = MLP_{W_t}(z^t_d),

where l_t denotes the number of layers of the module, and ρ(·) denotes the Tanh activation function of its hidden layers. We then compute the topic proportion θ_d by leveraging the reparameterization trick (Kingma and Welling, 2014):

θ_d = softmax(µ_d + Σ_d^{1/2} ⊙ ε),   (Eq.8)

where ⊙ denotes the element-wise product, and ε is a sample drawn from the Gaussian N(0, I_k). Due to space limitations, we omit background descriptions of this VAE-like paradigm and reparameterization, and refer the readers to (Kingma and Welling, 2014; Mnih and Gregor, 2014; Rezende et al., 2014; Miao et al., 2016) for more details.
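The two encoding steps above can be sketched in NumPy as follows. This is an illustrative sketch, not the paper's code: we assume a diagonal covariance parameterized by log-variances (a common VAE convention), and the inference network itself is omitted, with its mean and log-variance outputs passed in directly.

```python
import numpy as np

def text_features(bows, Z_w):
    """z^t_d = x_d Z^w / |x_d|: length-normalized sum of the word
    features of the words occurring in document d."""
    lengths = np.clip(bows.sum(axis=1, keepdims=True), 1.0, None)
    return (bows @ Z_w) / lengths

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def reparameterized_theta(mu, log_var, rng):
    """theta = softmax(mu + sigma * eps), eps ~ N(0, I_k),
    with a diagonal covariance (an assumption of this sketch)."""
    eps = rng.standard_normal(mu.shape)
    delta = mu + np.exp(0.5 * log_var) * eps
    return softmax(delta)
```

Sampling through the softmax yields a valid topic proportion per text: each row is nonnegative and sums to one.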
Remark. Strictly speaking, the reparameterization trick (i.e., Eq.8) is meant for forming the Monte Carlo approximation of the variational objective (Kingma and Welling, 2014), and it properly belongs to the decoding process. We emphasize that we introduce Eq.8 as a step of encoding θ only for the sake of a more intuitive presentation of the TP-Encoder.

Text-Decoder
In this component, we reconstruct the texts x_1:n with topic proportions θ_{1:n} and topics φ_1:k. Following the spirit of the VAE derivation (Kingma and Welling, 2014), the reconstruction loss of x_1:n consists of a log marginal likelihood term and a KL-divergence regularizer:

L_x = − Σ_{d=1}^{n} E_q[ log p(x_d | θ_d, φ_1:k) ] + Σ_{d=1}^{n} KL( q(δ_d | x_d) ‖ p(δ_d) ).

We adhere to the generative assumption of LDA-like models, therefore the marginal likelihood term of texts can be formulated below:

log p(x_d | θ_d, φ_1:k) = Σ_i x_{di} log ( Σ_{t=1}^{k} θ_{dt} φ_{ti} ),

where θ_{1:n} are computed by Eq.8. Second, the KL-divergence regularizer admits a closed-form expression as follows:

KL( q(δ_d | x_d) ‖ p(δ_d) ) = (1/2) ( Tr(Σ_d) + µ_d^T µ_d − k − log det(Σ_d) ),

where Tr(·) denotes the trace of a matrix.
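The two terms can be sketched as follows in NumPy; this is a simplified illustration, assuming a diagonal covariance stored as log-variances and a row-stochastic k × v topic-word matrix `phi` (the prior is the standard normal, as in the closed form above).

```python
import numpy as np

def text_nll(bows, theta, phi):
    """-sum_d sum_i x_di log( (theta_d phi)_i ): the negative log
    marginal likelihood under the LDA-like mixture assumption.
    `phi` is k x v with rows summing to one."""
    p = theta @ phi                      # per-document word distributions
    return float(-(bows * np.log(p + 1e-10)).sum())

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over documents and topic dimensions."""
    return float(0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var).sum())
```

At the prior (zero mean, unit variance) the KL term vanishes, which is a quick sanity check on the closed form.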

DualWG-Decoder
As its name suggests, the aim of the DualWG-Decoder is two-fold: applying the word features z^w_{1:v} to reconstruct the word co-occurrence graph G^c, and also an auxiliary word semantic correlation graph G^s.
Reconstruction of G^c. Following (Kipf and Welling, 2016b), we apply an inner-product decoder with word features. Accordingly, the reconstruction loss is formulated as the cross-entropy between the decoded edge probabilities and the observed edges:

L_{G^c} = − Σ_{i,j} [ 1[A^c_{ij} > 0] log σ(z^w_i · z^w_j) + 1[A^c_{ij} = 0] log(1 − σ(z^w_i · z^w_j)) ],

where σ(·) denotes the Sigmoid function.
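A minimal NumPy sketch of this inner-product decoder follows. It treats every word pair as a training example and binarizes the co-occurrence counts into edge indicators; both choices are assumptions of this sketch (the original graph auto-encoder of Kipf and Welling reweights positive edges in practice).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_recon_loss(A, Z):
    """Binary cross-entropy between sigmoid(Z Z^T), the decoded edge
    probabilities, and the binarized adjacency matrix A."""
    A_hat = sigmoid(Z @ Z.T)
    A_bin = (A > 0).astype(float)
    eps = 1e-10
    return float(-(A_bin * np.log(A_hat + eps)
                   + (1.0 - A_bin) * np.log(1.0 - A_hat + eps)).sum())
```

Minimizing this loss pushes the inner products of co-occurring word features up and those of non-co-occurring pairs down.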
Reconstruction of G^s. Besides extracting topics by applying the word co-occurrence statistics, we expect to take the semantic information of words into consideration, so as to generate more semantically coherent topics (Li et al., 2016, 2019a). To achieve this, we construct a word semantic correlation graph G^s = (V, E^s), where E^s denotes the set of word semantic correlation edges. Let A^s ∈ R^{v×v} be the corresponding adjacency matrix, where each element A^s_{ij} reflects the cosine similarity between the pre-trained GloVe embeddings of words w_i and w_j. To be specific, it is formulated as follows:

A^s_{ij} = γ_{ij} if γ_{ij} ≥ γ*, and 0 otherwise,

where γ_{ij} = cos(g_i, g_j) denotes the cosine similarity; the notation g specifies the pre-trained GloVe word embedding; and γ* is a word semantic correlation threshold. We reconstruct G^s by encouraging the resulting word features to capture word semantic correlations. Accordingly, the reconstruction loss of G^s can be formulated below:

L_{G^s} = ‖ A^s − Z^w (Z^w)^T ‖_2^2,

where ‖·‖_2 denotes the ℓ2 norm.
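The thresholded graph construction and the squared-norm reconstruction loss can be sketched as follows in NumPy; zeroing the diagonal (a word's similarity to itself) is an assumption of this sketch, and `G` stands in for any matrix of pre-trained embeddings.

```python
import numpy as np

def semantic_adjacency(G, threshold):
    """A^s_ij = cos(g_i, g_j) if it reaches the threshold, else 0.
    `G` is a v x d matrix of pre-trained word embeddings."""
    Gn = G / np.linalg.norm(G, axis=1, keepdims=True)
    S = Gn @ Gn.T                       # pairwise cosine similarities
    A_s = np.where(S >= threshold, S, 0.0)
    np.fill_diagonal(A_s, 0.0)          # assumption: no self edges
    return A_s

def semantic_recon_loss(A_s, Z_w):
    """Squared l2 distance between A^s and the word-feature inner
    products Z^w (Z^w)^T."""
    return float(((A_s - Z_w @ Z_w.T) ** 2).sum())
```

With a higher threshold γ*, the graph keeps only strongly correlated word pairs, so the loss supervises the word features with the most reliable semantic evidence.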

Full Objective of DWGTM
We now outline the full objective of DWGTM. Besides the reconstruction losses of x_1:n, G^c, and G^s, we also incorporate the following entropy regularization term to encourage peakier topic proportions:

L_e = − Σ_{d=1}^{n} Σ_{t=1}^{k} θ_{dt} log θ_{dt}.   (Eq.15)

Finally, we reach the full objective with respect to the learnable parameters {W_c, W_t, φ} as follows:

L = L_x + λ_1 L_{G^c} + λ_2 L_{G^s} + λ_3 L_e,

where λ_1, λ_2, and λ_3 are the scale parameters.
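The combination can be sketched as follows in NumPy; the function names and default scale values are illustrative (the defaults mirror the settings reported later), and the component losses are passed in as precomputed scalars.

```python
import numpy as np

def entropy_regularizer(theta):
    """-sum_d sum_t theta_dt log theta_dt; minimizing this entropy
    pushes each text toward a peakier topic proportion."""
    return float(-(theta * np.log(theta + 1e-10)).sum())

def full_objective(L_x, L_gc, L_gs, theta,
                   lam1=0.1, lam2=0.1, lam3=1.0):
    """L = L_x + lam1 * L_Gc + lam2 * L_Gs + lam3 * L_entropy."""
    return L_x + lam1 * L_gc + lam2 * L_gs + lam3 * entropy_regularizer(theta)
```

A one-hot proportion has (near-)zero entropy while a uniform proportion has the maximum log k, so minimizing the regularizer does favor peaked assignments.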
Experiments

Datasets. In the experiments, we select three benchmark short-text datasets: Trec, Google-News, and YahooAnswer. For all datasets, we remove digits and words with term frequency below 20. Stop words and non-English words are filtered out with NLTK. For clarity, the statistics of those datasets are listed in Table 2.
Baseline Topic Models. We select 8 existing baselines, including 4 conventional topic models and 4 neural topic models. Following their original papers, the important implementation details of all baselines are described below.
• LDA (Blei et al., 2003): The model is inferred by variational inference, and the Dirichlet priors for topic proportions and topic distributions are set to 0.1 and 0.01, respectively.
• GraphBTM (Zhu et al., 2018): The model applies a 3-layer GCN encoder with 100 hidden neurons, and samples 3 documents as a mini-corpus.
• NQTM: The model applies a 3-layer MLP encoder with 100 hidden neurons, and the word sample size for negative sampling is set to 20.
For DWGTM, we apply a 2-layer GCN WCG-Encoder and a 3-layer MLP TP-Encoder, where the hidden neurons of the two encoders are set as 100-300-400-300-k. To avoid posterior collapse, we adopt 0.4 dropout, batch normalization, and a shallower 1-layer Text-Decoder. The threshold γ* is set to 0.6 for Trec, and 0.8 for Google-News and YahooAnswer. The scale parameters are set as λ1 = 0.1, λ2 = 0.1, λ3 = 1. The number of epochs is 900 and the mini-batch size is 200. To construct G^s, we employ the 300-dimensional GloVe embeddings trained on Wikipedia 2014 and Gigaword 5. For fair comparisons, the baselines requiring word embeddings use the same GloVe embeddings.
Evaluation Metrics. To evaluate the topic quality, we adopt two metrics: Topic Coherence (TC) and Topical Semantics Coherence (TSC).
First, TC is the most popular topic quality metric, measuring the co-occurrence statistics between the top-m words of topics. We compute the TC scores with the public TC project Palmetto, where especially the C_V setting is applied. Second, we propose a novel metric named TSC to measure the semantic coherence of topics. Analogous to TC, we suppose that higher similarities between the top-m words of topics imply better semantic coherence. Accordingly, TSC can be defined as follows:

TSC = (1/k) Σ_{t=1}^{k} ( 2 / (m(m−1)) ) Σ_{w_i, w_j ∈ Ω_t, i < j} cos(e_{w_i}, e_{w_j}),

where Ω_t is the set of top-m words of the t-th topic; and e_{w_i} and e_{w_j} denote the pre-trained word embeddings of w_i and w_j, respectively. Specially, we describe several details of the metrics.
(1) For both metrics, higher scores indicate better performance.
(2) We fix m to 10 in all evaluations.
(3) For fair comparisons, we employ the pre-trained word2vec embeddings (https://wikipedia2vec.github.io/wikipedia2vec/pretrained/) to compute TSC, instead of the GloVe embeddings that have been used in some of the comparing models.
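The TSC metric above can be computed with a short NumPy sketch; the function name and data layout are our own, and the embedding matrix stands in for the pre-trained word2vec vectors.

```python
import numpy as np
from itertools import combinations

def tsc(top_words, embeddings):
    """Mean pairwise cosine similarity among each topic's top-m words,
    averaged over all topics. `top_words` is a list of word-index lists
    (one per topic); `embeddings` is a v x d matrix."""
    def cos(i, j):
        a, b = embeddings[i], embeddings[j]
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    per_topic = []
    for words in top_words:
        sims = [cos(i, j) for i, j in combinations(words, 2)]
        per_topic.append(np.mean(sims))   # 2/(m(m-1)) * sum over pairs
    return float(np.mean(per_topic))
```

A topic whose top words all share one embedding direction scores 1.0; unrelated (orthogonal) top words pull the score toward 0.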

Topic Quality Results
We independently run each comparing model 5 times, and report the average TC and TSC scores in Tables 3 and 4. In terms of TC, it can be clearly seen that DWGTM achieves higher scores than the baseline models in most cases. First, DWGTM outperforms the neural competitors GraphBTM and NQTM, which also focus on handling short texts. Second, the TC scores of DWGTM are higher than those of the conventional topic models in all settings; these results demonstrate that the GCN WCG-Encoder can better capture word co-occurrence information from the corpora. In terms of TSC, DWGTM obtains competitive scores and ranks first on average. Compared with the neural topic models, DWGTM achieves higher scores in most cases; more importantly, it beats the state-of-the-art NQTM. Surprisingly, conventional short text topic models, e.g., DMM, BTM, and GPUDMM, achieve TSC scores competitive with DWGTM, and even perform better than NQTM. The possible reason is that those shallow models capture semantic information similar to word2vec, i.e., the word embeddings used to compute the TSC scores in the experiments. Specially, we note that a potential problem of DWGTM is the TSC degradation with more topics compared to conventional topic models. We will further investigate this problem.

Topic Visualization
For qualitative evaluation, we show the top-10 words of two selected topics, about politics and credit, from YahooAnswer. As presented in Table 5, we observe that DWGTM can effectively learn informative word patterns from corpora, where the top topical words are closely associated with politics and credit, consistent with human judgment to some extent. In contrast to the baseline models, the topics learned by DWGTM appear more coherent; some baselines often generate less informative words for the topic of credit, e.g., {"best", "bad", "long"} in LDA, {"old", "weight", "stomach"} in NVDM, and {"salt", "water", "ice"} in NQTM. Besides, we also observe that the top-10 words of GPUDMM and DWGTM contain semantically related words. This implies that leveraging pre-trained word embeddings helps group semantically related words under the same topic.

Ablative Study
We conduct an ablative study to evaluate whether the two reconstruction losses of the DualWG-Decoder and the entropy regularization of θ (i.e., Eq.15) have positive effects on topic extraction. To this end, we examine three simplified versions that independently remove the loss of G^c (λ1 = 0), the loss of G^s (λ2 = 0), and the entropy regularization term (λ3 = 0). We show the topic quality results of the different versions of DWGTM on Trec when k = 20. As shown in Table 6, the full DWGTM outperforms all three simplified versions, indicating that all three components have positive effects on topic extraction. Specifically, removing the losses of G^c and G^s (i.e., λ1 = 0 and λ2 = 0) leads to a TSC drop of over 0.01, indicating that the two reconstruction processes in the DualWG-Decoder help capture the semantic information of words. Besides, the gain of DWGTM over the version without the entropy regularization (i.e., λ3 = 0) is even more pronounced. This coincides with the fact that the entropy regularization tends to compute peakier topic proportions, which are beneficial for extracting topics from short texts with extremely limited words.
Specially, we have also evaluated different values of {λ1, λ2, λ3} and of the threshold γ*. The results show that {λ1, λ2} perform better with smaller values and λ3 with larger ones, while γ* is relatively stable across different values. Due to space limitations, we omit the detailed results and will present them in an extended version.

Conclusion
In this paper, we develop a novel neural topic model for short texts, called DWGTM. The proposed DWGTM model extracts topics by simultaneously applying the word co-occurrence graph and the word semantic correlation graph. Specifically, it consists of four main components: (1) Encode the word co-occurrence graph as word features.
(2) Generate text features with word features, and encode them as topic proportions.
(3) Reconstruct the texts with topical distributions. (4) Reconstruct both graphs with word features. We also propose a novel metric to evaluate the semantic coherence of topics, called TSC. Empirically, the effectiveness of DWGTM was validated on three benchmark datasets of short texts. We show that the topics learned by DWGTM can simultaneously capture meaningful patterns and semantic correlations of words.