Graph Relational Topic Model with Higher-order Graph Attention Auto-encoders

Learning low-dimensional representations of networked documents is a crucial task for documents linked in network structures. Relational Topic Models (RTMs) have shown their strengths in modeling both document contents and relations to discover the latent topic semantic representations. However, higher-order correlation structure information among documents is largely ignored in these methods. Therefore, we propose a novel graph relational topic model (GRTM) for document networks, to fully explore and mix neighborhood information of documents at each order, based on the Higher-order Graph Attention Network (HGAT) with a log-normal prior in the graph attention. The proposed method addresses the aforementioned issue via information propagation between documents based on the HGAT probabilistic encoder, to learn efficient networked document representations in the latent topic space that fully reflect document contents along with document connections. Experiments on several real-world document network datasets show that, through fully exploring information in documents and document networks, our model achieves better performance on unsupervised representation learning and outperforms existing competitive methods in various downstream tasks.
Introduction
Document networks, such as hyperlink networks of Web pages, citation networks of academic documents, and user profiles in social networks, have long been an intensively studied research subject due to their wide applications. Finding low-dimensional representations of networked documents that simultaneously preserve document contents and connections among documents is a crucial research task. Inspired by the wide application of topic models such as latent Dirichlet allocation (LDA) (Blei et al., 2003) to discovering the latent semantic structure of unconnected documents, a series of Relational Topic Models (RTMs) have been proposed to explore the latent topic semantic structure of documents and the links among them, based on probabilistic graphical models (Nallapati et al., 2008; Chang and Blei, 2009; Le and Lauw, 2014; Chen et al., 2014; Yang et al., 2016), deep generative models (Acharya et al., 2015; Wang et al., 2017; Bai et al., 2018), auto-encoders (AEs) (Zhang and Lauw, 2020), and graph auto-encoders (GAEs) (Wang et al., 2020a).
However, most RTMs consider only the pairwise correlation or the first-order neighbor correlation (Zhang and Lauw, 2020) among documents. Although the recently proposed deep relational topic model GPFA (Wang et al., 2020a), based on graph neural networks (GNNs), can consider low-order indirect neighborhood information via stacked GNN (Kipf and Welling, 2016b) layers, it still struggles to exploit the deep (higher-order) interactions between indirectly connected documents due to the over-smoothing problem. Such higher-order correlation structure has been proved effective on various tasks (Abu-El-Haija et al., 2019) such as link prediction and recommendation (Zhang and McAuley, 2020).
To address the aforementioned issue, we propose the graph relational topic model (GRTM) for modeling the latent topic structure of document contents and links, based on higher-order graph attention auto-encoders (HGTAEs), aiming to fully explore and fuse proximity at each order (both low-order and higher-order) of the document network. Specifically, we propose to extract the higher-order document proximity network (HDPN) from the adjacency matrix of the document network via shortest-path calculation. The higher-order graph attention network (HGAT) is presented to efficiently model neighborhood information propagation on the HDPN by introducing the log-normal prior into the graph attention. We finally build our GRTM with higher-order graph attention auto-encoders (HGTAEs) based on HGAT and the HDPN. The main contributions of our paper are as follows: 1. We propose a novel unsupervised deep relational topic model, GRTM, to fully explore multiple sources of information: the higher-order document relations and the latent topic semantics of document contents and networks.
2. We propose a novel graph attention network HGAT to efficiently explore each order correlations among networked documents.
3. Experimental results on document network datasets show that our model outperforms existing competitive methods on unsupervised representation learning, through fully exploring multi-granularity information in document networks.

Related Work
In this section, we briefly review existing Relational Topic Models (RTMs), Graph Auto-Encoders (GAEs), and Graph Topic Models. RTMs generally extend LDA-based topic models to further model the links between documents in networks. Chang and Blei (2009) first proposed to introduce additional binary conditional variables in the generation process to model the document links. Chen et al. (2014) proposed discriminative relational topic models (DRTMs) to learn discriminative latent representations of document networks. Le and Lauw (2014) proposed PLANE, which can jointly extract topics and visualization coordinates. To apply neural network based inference to RTMs, Bai et al. (2018) utilized a Stacked Variational Auto-Encoder (SVAE) to derive more representative topic distributions of documents. However, these models only consider pairwise document correlations, failing to model the full structural information (low-order and higher-order) embedded in the document network.
To model the block correlation structure of the document network, Yang et al. (2016) incorporated a weighted stochastic block model into relational topic models. Most recently, Zhang and Lauw (2020) proposed AdjEnc to reconstruct both documents and their neighborhoods in the network. However, it can only capture the first-order correlation structure with the adjacent-encoder. Wang et al. (2020a) proposed the deep relational topic model GPFA based on GNNs to explore hierarchical relationships of interconnected documents. Still, it can only capture low-order hierarchical relationships of interconnected documents due to the well-known over-smoothing problem of GNNs, while long-range relations among documents are also critical for learning latent representations in document networks. To address this issue, we calculate the higher-order proximity network, which allows considering long-range topological information among documents rather than merely pairwise or few-order relations.
Recently, GAEs have attracted a lot of attention; they incorporate GNNs into auto-encoders for unsupervised graph embedding learning, motivated by the successful applications of GNNs in modeling graph topological structure. The earliest attempt, VGAE (Kipf and Welling, 2016a), extended the variational auto-encoder (VAE) to graph-structured data for learning network embeddings. Inspired by the advantage of GNNs, some works have explored VGAE for topic modeling, including the deep relational topic model GPFA mentioned before, and GraphBTM (Zhu et al., 2018), which improved the biterm topic model (Yan et al., 2013) with a word co-occurrence graph encoded by GCNs. Besides studies based on VGAE, there are also works combining topic models with graph neural networks in a different manner, such as the graph attention topic network (GATON) (Yang et al., 2020) proposed for unconnected documents, the dynamic hierarchical topic graph model DHTG (Wang et al., 2020b) used for unconnected document classification, the topic variational graph auto-encoder (TVGAE) (Xie et al., 2021b) for document classification, and the graph topic neural network (GTNN) (Xie et al., 2021a) proposed for representation learning of both connected and unconnected documents. Different from them, we target connected documents. Moreover, to fully explore the deep topological structure of document networks, we propose the novel higher-order graph attention network and then introduce it into relational topic modelling based on variational graph auto-encoders.

Method
In this section, we present our graph relational topic model (GRTM) for the document network. We first introduce the construction of the higher-order proximity networks: HDPN from document adjacency matrices, then we present the novel graph attention network HGAT to fuse the information of HDPN. We end this section by introducing the variational graph auto-encoder structure for building GRTM.

Higher-order Proximity Network
Formally, we define a given document network as G = (D, A, X). D = {d_1, ..., d_n} is the set of document nodes with n documents and a vocabulary V with m words. Relations between documents are represented as a 0-1 adjacency matrix A ∈ R^{n×n}, and X ∈ R^{n×m} is the document-word index matrix, in which X_ij represents the weight (e.g., TF-IDF) of word j in document i. For the given document set D, based on the given adjacency matrix A of the document network, the key problem is to discover and preserve arbitrary-order neighborhood relations beyond first-order or few-order ones (including other higher orders). Intuitively, two nodes have a proximity correlation if and only if we can find at least one path between them. Thus, we can calculate the order of the proximity correlation between two nodes according to the length of the shortest path between them based on the adjacency matrix, and directly preserve arbitrary-order information in the same matrix. Denoting the adjacency matrix of the HDPN as Â ∈ R^{n×n}, the link of proximity correlation between two documents (d_i, d_j) is defined as:

Â_ij = p_ij if there exists a path between d_i and d_j, and Â_ij = 0 otherwise,     (1)

where p_ij is the length of the shortest path between d_i and d_j. According to the above definition, Â can be calculated in advance during the data pre-processing step. The length of the shortest path between two nodes is calculated using classical search algorithms such as Dijkstra's algorithm or the Bellman-Ford algorithm. Compared with existing methods that calculate higher-order proximity with powers of the adjacency matrix or steps in a probabilistic transition process (Abu-El-Haija et al., 2019), explicitly calculating the length of the shortest path is more suitable for our purpose: the k-th power of the adjacency matrix overlaps with the proximity information of lower powers, while a k-step walk may return to neighbors of order less than k rather than reach k-th order neighbors (Zhang and Xu, 2020).
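As a concrete sketch, Â can be precomputed from the 0-1 adjacency matrix with an off-the-shelf shortest-path routine; `build_hdpn` below is a hypothetical helper name for illustration, not code from the paper:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def build_hdpn(adj):
    """Higher-order document proximity network: entry (i, j) holds the
    shortest-path length p_ij between documents i and j, or 0 when no
    path exists, so arbitrary-order proximity lives in a single matrix."""
    dist = shortest_path(adj, method="D", directed=False, unweighted=True)
    dist[np.isinf(dist)] = 0.0  # unreachable pairs carry no proximity link
    return dist

# A 4-document chain 0-1-2-3: document 3 is a third-order neighbor of 0.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(build_hdpn(adj)[0])  # -> [0. 1. 2. 3.]
```

Since the 0-1 graph is unweighted, this reduces to breadth-first search per node, which matches the paper's point that Â can be computed once during pre-processing.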

Higher-order Graph Attention Network
In this section, we focus on how to better fuse information of neighbors at different orders on an HDPN to efficiently learn node representations. Intuitively, for a given node representation, the contributions of its neighbors vary according to their distances. However, directly utilizing GNNs such as graph convolutional networks (GCNs) and graph attention networks (GATs) on the HDPN will treat neighbors of nodes at different orders equally. Therefore, we present the higher-order graph attention network (HGAT) to solve the problem via introducing the log-normal prior into the graph attention.
Instead of the uniform prior used in GATs, we exploit the log-normal distribution to model the decaying importance of the current node's neighbors at different orders. For simplicity's sake, we use the log-normal distribution with zero mean, and calculate the attention coefficient between two nodes i, j as:

e_ij^l = ρ(W^l [h_i^l ‖ h_j^l] + b^l) + λ φ(p_ij; σ),     (2)

α_ij^l = exp(e_ij^l) / Σ_{k∈N(i)} exp(e_ik^l),

where φ is the probability density function of the log-normal distribution, σ is the variance of the log-normal prior, p_ij is the length of the shortest path between nodes i, j in the HDPN, h_i^l, h_j^l are the representations of nodes i, j in the l-th layer, ρ is the activation function, λ is the parameter controlling the influence of the log-normal prior, N(i) is the set of neighbors of node i in the HDPN, and W^l, b^l are the weight matrix and bias of the l-th layer. The attention mechanism based on the log-normal prior allows nodes to select their neighbors at arbitrary order with different importance. The calculated attention coefficients are further exploited to propagate information from the neighbors of each node at arbitrary order:

h_i^{l+1} = ρ( Σ_{j∈N(i)} α_ij^l W^l h_j^l + b^l ).     (3)

Although there are other distributions, such as the normal distribution used in the Gaussian transformer (Guo et al., 2019), the log-normal is more suitable for modeling the importance calculation in our case. This is because the density function φ(p) of the zero-mean log-normal prior is always strictly positive, whereas the normal distribution assigns density to negative values and the Poisson distribution is defined only on integer values. Moreover, φ(d) decreases monotonically for all d > exp(−σ²), and exp(−σ²) < 1, while the path length in our HDPN is always greater than or equal to 1. Therefore, the log-normal prior naturally models the decay of importance weight as the path length between two nodes increases, as shown in Figure 1. When conducting HGAT on the vanilla adjacency matrix (with only first-order proximity), HGAT degenerates into the GAT method, since all shortest paths in the vanilla adjacency matrix have length 1.
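A minimal numpy sketch of this attention follows. The exact combination of the content score and the prior was lost in extraction, so we assume the log-normal density of the path length enters as an additive bias λ·φ(p_ij) on the attention logits before the softmax over neighbors; `hgat_attention` and that additive form are assumptions for illustration:

```python
import numpy as np

def lognormal_pdf(d, sigma):
    """Zero-mean log-normal density: strictly positive, and monotonically
    decreasing for d > exp(-sigma**2), i.e. over all path lengths >= 1."""
    return np.exp(-np.log(d) ** 2 / (2 * sigma ** 2)) / (
        d * sigma * np.sqrt(2 * np.pi))

def hgat_attention(scores, paths, lam=np.sqrt(np.pi), sigma=1 / np.sqrt(2)):
    """Softmax attention over one node's HDPN row.
    scores: raw content-based attention scores e_ij for each candidate.
    paths:  shortest-path lengths p_ij from the HDPN (0 = not a neighbor)."""
    neighbor = paths > 0
    safe_paths = np.where(neighbor, paths, 1.0)  # avoid log(0) on non-neighbors
    logits = np.where(neighbor,
                      scores + lam * lognormal_pdf(safe_paths, sigma),
                      -np.inf)  # non-neighbors get zero attention
    weights = np.exp(logits - logits[neighbor].max())
    return weights / weights.sum()

# Equal content scores: closer neighbors receive larger attention weights.
alpha = hgat_attention(np.zeros(4), np.array([1.0, 2.0, 3.0, 0.0]))
```

With the paper's settings λ = √π and σ = 1/√2, the prior term shrinks monotonically with path length, so first-order neighbors dominate while higher-order ones still contribute.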

Graph Relational Topic Modelling
To fully explore the long-range document relations so as to model the latent topic semantics of document contents and networks, we present the GRTM model with higher-order graph attention auto-encoders (HGTAEs). Let K be the number of topics, θ the document topic proportions, and β the topics, namely the topic-word proportions. Firstly, we present the generative process of GRTM in Algorithm 1. Similar to previous RTMs, we assume the document topic proportion is generated from a Dirichlet prior. However, the Dirichlet prior makes it difficult to perform neural variational inference for GRTM, due to the challenge of reparameterizing the Dirichlet distribution. Thus, to simplify the inference process, we approximate the Dirichlet distribution with its Laplace approximation, the logistic normal distribution, following many previous works (Srivastava and Sutton, 2017).
To incorporate the higher-order relations among documents, we generate the document topic proportion from a logistic normal distribution parameterized by the HGAT probabilistic encoder. Specifically, for each document d, we draw the mean and covariance of a multivariate normal variable and then transform it with the softmax function:

μ_d = HGAT_μ(h⁰, Â),  log σ_d = HGAT_σ(h⁰, Â),  θ_d = softmax(μ_d + σ_d ⊙ ε_d),  ε_d ∼ N(0, I),

where HGAT is the message passing process as in Equation 3 and ε_d is the noise variable. For the HGAT encoders of the mean and covariance, the input feature h_d⁰ is set to the normalized document-word index feature X_d following previous methods (Kipf and Welling, 2016a). The message passing based on HGAT makes the latent topic proportion of each document influenced by its neighbors at different orders with different importance.
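The reparameterized draw of θ_d can be sketched as follows, assuming (as is standard in VAE practice) a diagonal covariance whose log is produced by the second encoder head; the function names are illustrative stand-ins for the HGAT encoder outputs:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_theta(mu, log_sigma, rng=None):
    """theta_d = softmax(mu_d + sigma_d * eps), eps ~ N(0, I).
    mu and log_sigma stand in for the outputs of the two HGAT encoder
    heads; the reparameterization keeps the draw differentiable."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(mu.shape)
    return softmax(mu + np.exp(log_sigma) * eps)

# A K=64 topic proportion: nonnegative entries summing to 1.
theta = sample_theta(np.zeros(64), np.full(64, -2.0))
```

The softmax at the end is what turns the Gaussian draw into a point on the topic simplex, i.e., the logistic normal approximation of the Dirichlet.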
In the decoding process, each word is generated from a multinomial distribution based on the topic proportion θ_d of the document it belongs to and the topics β:

w_d ∼ Multinomial(softmax(θ_d β)).

The links between two documents are modeled as Bernoulli binary variables, conditionally generated from the latent topic proportions of the two documents:

Â_ij ∼ Bernoulli(sigmoid(f_y(θ_i, θ_j))).

Following the auto-encoding variational Bayes inference method (Kingma and Welling, 2014), we can derive the evidence lower bound (ELBO) on the marginal log-likelihood from the above generative process:

L(Θ) = −KL(q(θ | w, Â) ‖ p(θ | α)) + E_{q(θ)}[log p(w | θ, β)] + E_{q(θ)}[log p(Â | θ)],

where Θ is the parameter set of the whole process, q(θ|w, Â) is the approximate Dirichlet variational posterior as parameterized in Equation 3, p(θ|α) is the assumed true Dirichlet prior, and α is the prior parameter. We again approximate the latter with its Laplace approximation, the softmax of a multivariate normal with mean and covariance matrix:

μ_{1,k} = log α_k − (1/K) Σ_i log α_i,   Σ_{1,kk} = (1/α_k)(1 − 2/K) + (1/K²) Σ_i 1/α_i.

We seek to minimize the KL divergence between the variational posterior and the true posterior in the first term, while the second and third terms aim to reconstruct the document contents and links. Based on the stochastic gradient variational Bayes (SGVB) estimator (Kingma and Welling, 2014), we can further derive the detailed formulation of each term, where X̂ = softmax(θβ) is the reconstructed document contents and Ã = sigmoid(f_y(θ)) is the reconstructed document links. Based on these, we can optimize the ELBO with stochastic gradient descent to infer the whole model end to end.
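For one document with a diagonal posterior covariance, the three ELBO terms can be sketched numerically as below; the closed-form KL follows the Laplace approximation of Srivastava and Sutton (2017) with a symmetric Dirichlet prior, and all names are illustrative, not the paper's code:

```python
import numpy as np

def elbo(x, x_hat, a, a_hat, mu, log_var, alpha=1.0 / 64):
    """ELBO for one document (to be maximized):
    word reconstruction + link reconstruction - KL(q(theta) || p(theta)).
    x: bag-of-words counts, x_hat: reconstructed word distribution,
    a: observed 0-1 link row, a_hat: reconstructed link probabilities,
    mu, log_var: logistic-normal posterior parameters from the encoder,
    alpha: symmetric Dirichlet prior parameter (1/K in the paper)."""
    K = mu.shape[-1]
    # Laplace approximation of the symmetric Dirichlet prior.
    prior_mu = 0.0
    prior_var = (1.0 / alpha) * (1.0 - 2.0 / K) + 1.0 / (K * alpha)
    var = np.exp(log_var)
    kl = 0.5 * np.sum(var / prior_var + (prior_mu - mu) ** 2 / prior_var
                      - 1.0 + np.log(prior_var) - log_var)
    rec_words = np.sum(x * np.log(x_hat + 1e-10))
    rec_links = np.sum(a * np.log(a_hat + 1e-10)
                       + (1 - a) * np.log(1 - a_hat + 1e-10))
    return rec_words + rec_links - kl
```

A posterior matching the prior (mu = 0, var = prior_var) incurs zero KL penalty; moving mu away from the prior mean only lowers the bound, which is the regularizing role of the first term.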

Experiments
We conduct experiments on several real-world document network datasets, whose statistics are reported in Table 1. Four datasets are subsets extracted from Cora: Data Structure (DS), Hardware and Architecture (HA), Machine Learning (ML), and Programming Language (PL), as in (Zhang and Lauw, 2020), where Cora is a scientific article citation dataset collected from scholar websites. To evaluate the unsupervised representation learning capability of our method, we infer the latent topic proportions θ of documents with our model, and then use them in three types of downstream tasks, namely document classification, document clustering, and link prediction. We compare our method against baselines from the following three categories: We follow the settings for all baselines as in (Zhang and Lauw, 2020) and also compare methods in both transductive and inductive learning settings. For inductive learning, we randomly select a subset of 70% of documents as the training set, a subset of 10% of documents as the validation set, and use the remaining 20% of documents as the testing set. For transductive learning, all documents are involved in the training process. All experimental results are averaged over 10 independent runs. Following (Zhang and Lauw, 2020), we set the topic number K to 64 and the number of message passing layers L in HGAT to 1. The hidden size of the weight matrices in HGAT is equal to the topic number of 64. For the log-normal prior in HGAT, we set the parameter λ = √π and the variance σ = 1/√2. We use the tanh activation function in HGAT. We use the Dirichlet distribution with parameter α = 1/K for the logistic normal approximation. The learning rate on all datasets is 0.065, the maximum number of training epochs with Adam is 40000, and the early stop patience is 500 epochs. The parameter settings of all baseline models are the same as in (Zhang and Lauw, 2020).
We infer document topic representations with the trained GRTM model for both training and test documents, which are then used in three downstream tasks to evaluate effectiveness: 1) Document classification: we adopt K-Nearest Neighbors to predict each document's label based on the Euclidean distance between generated representations, using classification accuracy as the metric. 2) Document clustering: we also compare our method with baselines in clustering documents via K-means, to investigate whether our method can generate similar representations for documents in the same category. In this case, the ground truth labels are only utilized to calculate normalized mutual information (NMI) in evaluation. 3) Link prediction: the generated representations are used to predict the links between documents. We use Mean Average Precision (MAP) as the evaluation metric following the previous method (Zhang and Lauw, 2020). To better understand the semantic information captured in the generated representations, we also conduct experiments presenting a detailed analysis of our generated topics in inductive learning: 1) Topic Coherence: as in previous work (Zhang and Lauw, 2020), we adopt PMI, PMI(w_i, w_j) = log [ p(w_i, w_j) / (p(w_i) p(w_j)) ], to evaluate topic coherence. We calculate the average pairwise PMI of the top 10 words in each topic; better topics should produce higher PMI.
2) Visualization: We apply t-SNE to project text representations generated by different models into a 2-dimensional space.

As shown in Tables 2 and 3, our method achieves the best performance in all tasks on the four datasets in both inductive and transductive settings. Compared with auto-encoder based methods (AE, DAE, CAE, VAE, KSAE, and KATE), which only consider document contents without document networks, ADE, the relational topic model methods (ProdLDA, RTM, PLANE, NRTM), and our method achieve better performance by considering the links among documents, which benefits the downstream tasks of document classification/clustering and link prediction. Compared with the relational topic model methods (ProdLDA, RTM, PLANE, NRTM), we find that the graph embedding method VGAE and the adjacent auto-encoder method ADE perform better than the other baselines, which demonstrates the advantage of using high-order proximity information. But they are still inferior to our proposed GRTM, which proves the benefits of fully exploring information of various orders in document networks. A similar relative performance among these methods can also be observed in the topic coherence results in Table 4.
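The PMI coherence metric described above can be sketched as follows; `topic_pmi` is a hypothetical helper that estimates the probabilities from document-level co-occurrence counts (one common convention; the paper's exact estimation corpus is not specified here):

```python
import numpy as np
from itertools import combinations

def topic_pmi(top_words, docs):
    """Average pairwise PMI of a topic's top words, with probabilities
    estimated from document-level co-occurrence. docs: list of word sets."""
    n = len(docs)
    p = lambda *ws: sum(all(w in d for w in ws) for d in docs) / n
    scores = [np.log(p(wi, wj) / (p(wi) * p(wj)))
              for wi, wj in combinations(top_words, 2) if p(wi, wj) > 0]
    return float(np.mean(scores)) if scores else 0.0

docs = [{"graph", "topic"}, {"graph", "topic"}, {"graph", "model"}, {"model"}]
# "graph" and "topic" co-occur more often than chance -> positive PMI
print(round(topic_pmi(["graph", "topic"], docs), 3))  # -> 0.288
```

Higher average PMI over the top-10 words of each topic indicates a more coherent topic, which is how Table 4 is scored.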
The previous methods (Zhang and Lauw, 2020; Chen and Zaki, 2017) do not report mean and standard deviation, so we only report these results for our method in Table 5 to illustrate the statistical reliability of our model. These results are obtained in both transductive and inductive settings by repeating each experiment 10 times.

Effect of Topic Number
To investigate the sensitivity of our method to the number of topics, we present the classification accuracy of our model for different topic numbers in the inductive setting. As shown in Figure 2, the test accuracy on the four datasets generally improves as the number of topics increases and peaks when the topic number is around 64. From these curves, we can see that the performance of our model is not very sensitive to the topic number, and that the best topic number does not appear to be strongly related to the ground-truth number of classes in the datasets. In Figure 3, we further show the transductive test classification accuracy of different models under different topic numbers. Our model consistently outperforms all baselines under different topic numbers on the four datasets.

Effect of Log-normal Prior
We vary the variance σ of the log-normal prior to explore its impact in HGAT.
We present the results of our model for σ ∈ {1/2, 1/√2, 1} in Table 7. A larger value of σ yields a slower decay of the importance of higher-order proximity information, and a smaller value a faster decay. From the table, we can see that our model generally achieves the best performance at σ = 1/√2. We suspect that too much noisy higher-order information is introduced when σ is set too large, while insufficient higher-order information can be used when it is too small; hence both cases yield poorer performance.
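The effect of σ on the decay speed can be checked directly from the density: with a larger σ, a second-order neighbor retains more weight relative to a first-order one. A small numeric check (illustrative only, not the paper's code):

```python
import numpy as np

def lognormal_density(d, sigma):
    """Zero-mean log-normal density used as the attention prior."""
    return np.exp(-np.log(d) ** 2 / (2 * sigma ** 2)) / (
        d * sigma * np.sqrt(2 * np.pi))

# Relative importance of a 2nd-order neighbor vs. a 1st-order one:
# larger sigma -> slower decay over path length.
for sigma in (0.5, 1 / np.sqrt(2), 1.0):
    ratio = lognormal_density(2.0, sigma) / lognormal_density(1.0, sigma)
    print(f"sigma={sigma:.3f}  phi(2)/phi(1)={ratio:.3f}")
```

This matches the intuition in the paragraph above: σ trades off how much higher-order (and potentially noisy) information leaks into each node's attention.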

Different layer numbers
Although a one-layer message passing process of HGAT is able to capture arbitrary-order proximity information in document networks, we still report the results of our model under different numbers of HGAT layers in Table 6. We can see that our model with one-layer HGAT achieves the best performance under both settings. When the number of HGAT layers is set to 0, the HGAT encoder of our model degenerates into a feed-forward neural network. With two-layer HGAT, document representations may be disturbed by the information of their noisy neighbors.

Ablation Study
We also perform an ablation study to verify the effectiveness of each module in the inductive setting. We compare our model with its variants obtained by removing the HDPN and HGAT components respectively, as shown in Table 8, from which we can see that each component makes a certain contribution to the overall performance. In the case of removing the HDPN module (W/HDPN), our model directly takes the document adjacency matrix as input, degenerating into a relational topic model based on a graph attention auto-encoder that does not consider higher-order proximity. In the case of removing HGAT (W/HGAT), although our model takes the higher-order information into consideration, it does not perform importance selection over correlation information at different orders. We can also see that missing the higher-order proximity has a more significant negative influence than missing the HGAT-based encoder module, illustrating the relative effectiveness of the higher-order information in improving the discrimination of latent representations.

Visualization
Finally, to intuitively demonstrate the effectiveness of our model, we visualize the learned representations of the test documents on the ML dataset in Figure 4. It shows that documents are better grouped by our model than by ADE (with first-order correlations) and VGAE (with few-order correlations), due to the incorporation of indirect correlation information among documents.

Conclusion
In this paper, we propose a novel graph relational topic model, GRTM, for document networks to fully explore relations at each order among documents, efficiently fused by a proposed novel graph attention network, HGAT, equipped with a log-normal attention prior. Experimental results show that fully considering proximity information at each order on the document-document graph is beneficial for improving the learned document representations. In future work, we would like to explore better-suited methods and more elegant prior distributions for discovering and fusing higher-order proximity in document networks.