Deep Attention Diffusion Graph Neural Networks for Text Classification

Text classification is a fundamental task with broad applications in natural language processing. Recently, graph neural networks (GNNs) have attracted much attention due to their powerful representation ability. However, most existing GNN-based methods for text classification consider only one-hop neighborhoods and low-frequency information within texts, and thus cannot fully exploit the rich contextual information of documents. Moreover, these models suffer from over-smoothing when many graph layers are stacked. In this paper, a Deep Attention Diffusion Graph Neural Network (DADGNN) model is proposed to learn text representations, bridging the gap that prevents a word from interacting with its distant neighbors. Experimental results on various standard benchmark datasets demonstrate the superior performance of the proposed approach.


Introduction
Text classification, as an important and fundamental task in the field of natural language processing, has attracted extensive attention from scholars for many years and has been widely used in various practical applications, such as topic labeling (Wang and Manning, 2012), computational phenotyping (Che et al., 2015), question answering and dialog act classification (Lee and Dernoncourt, 2016). The performance of text classification relies heavily on the ability of the model to extract textual features from raw text. Previous shallow learning-based text classification approaches mainly use hand-crafted sparse lexical features, such as bag-of-words (BoW) or n-grams, to represent texts. Since these features are predefined, such models do not take full advantage of the large amount of training data. Deep learning architectures represented by convolutional neural networks (CNNs) (Kim, 2014) and recurrent neural networks (RNNs) are becoming more popular due to their strong performance in text mining. These models can capture semantic and syntactic information in local consecutive word sequences well.
Recently, graph neural networks (GNNs) have attracted increasing attention due to their superiority in dealing with complex structured data and relations (Kipf and Welling, 2017; Klicpera et al., 2019a). GNNs have achieved promising results in text classification tasks when modeling texts with graph structures due to their powerful expressiveness (Wu et al., 2020). Despite the success of the mentioned models, several serious limitations of prevalent GNNs hinder their performance, which is mainly attributed to the following factors: (I) Restricted Receptive Fields. Most previous approaches allow a word in the graph to access only its direct neighbors. In other words, the embedding of a word depends solely on the representations of its neighboring words at each layer. Moreover, the sliding window used to build word-word edges is typically small. This makes long-range word interactions impossible. Therefore, it is critical to increase the receptive field of target words to obtain precise text representations. (II) Shallow Layers. Most current graph-based models for text classification adopt fairly shallow settings, as they achieve their best performance with two layers. Two-layer graph models aggregate nodes in two-hop neighborhoods and thus cannot extract information beyond two-hop neighbors. Theoretically, we could capture long-range dependencies between words with a large number of layers. However, a common challenge faced by most GNNs is that performance degrades severely when stacking multiple layers to exploit larger receptive fields. Some researchers attribute this phenomenon to over-smoothing (representations of nodes from different classes become indistinguishable). It is still necessary to explore deeper GNNs to obtain more latent features in text classification tasks, especially for short texts, where the available contextual information is limited. (III) Imprecise Document-Level Representations. Most graph-based models for text classification leverage simple pooling operations, such as summing or averaging all nodes in the graph, to obtain document-level representations. This weakens the effect of key nodes and significantly reduces the expressiveness of the model, since different words play distinct roles in the text. (IV) Low-Pass Filters. Essentially, existing graph-based methods for text classification act as fixed-coefficient low-pass filters (Nt and Maehara, 2019). It has been confirmed that the low-pass filter in GNNs mainly preserves the commonality of node features while ignoring the differences among them (Oono and Suzuki, 2020). Therefore, the learned representations of connected nodes become indistinguishable when only a low-pass filter is adopted. Meanwhile, several studies have demonstrated the usefulness of high-frequency information in graph signals for enhancing the discriminative power of the model, especially when the network exhibits disassortativity (Zhu et al., 2020; Chien et al., 2021). Hence, it is necessary to design a filter that does not exhibit only low-pass properties for learning word embeddings.
To overcome the limitations above, we propose a novel model named Deep Attention Diffusion Graph Neural Network (DADGNN) for text classification, which learns effective text representations. Specifically, we use the attention diffusion technique to widen the receptive field of each word in the document, capturing long-range word interactions at each layer. Moreover, to extract the deep hidden semantics of words, we decouple the propagation and transformation processes of GNNs to train deeper networks. Finally, we calculate the weight of each node to obtain precise document-level representations. Our work's key contributions are as follows: (1) We introduce a novel model, DADGNN, based on attention diffusion and a decoupling technique, which has excellent expressive power in modeling documents and overcomes several limitations of conventional graph-based models.
(2) We theoretically prove that the attention diffusion operation is equivalent to a polynomial graph filter that can utilize both high- and low-frequency graph signals.
(3) We conduct extensive experiments on a series of benchmark datasets, and the state-of-the-art performance of DADGNN illustrates its superiority compared to other competitive baseline models.
Related Work

Deep Learning for Text Classification

Compared to traditional text classification methods, deep learning models can automatically learn high-dimensional textual features without tedious feature engineering. Two representative deep learning models, CNNs (Kim, 2014) and RNNs (Liu et al., 2016), have shown powerful capabilities in text classification. To further improve the expressive power of the model, the attention mechanism has been introduced as a model component, as in hierarchical attention networks (Yang et al., 2016) and attention-based long short-term memory (LSTM) networks (Wang et al., 2016). However, word-to-word dependencies often extend beyond the local sliding window in these sequence-based models, resulting in lower performance when encoding long sentences.

Graph Neural Networks
GNNs, as a special type of neural network, have achieved remarkable success in citation networks, social networks and other research areas (Wu et al., 2020). Most of the models above are message-passing GNNs, including graph convolutional networks (GCNs) (Kipf and Welling, 2017), GraphSAGE (Hamilton et al., 2017) and graph attention networks (GATs) (Velickovic et al., 2018), which learn node embeddings by aggregating information from direct (one-hop) neighbors and stacking multiple layers to reach distant (multi-hop) neighbors. More recently, a collection of works introduce a diffusion mechanism in graphs to aggregate information from a larger neighborhood rather than relying only on immediate neighbors, capturing more complex graph properties and showing powerful performance (Klicpera et al., 2019b). Due to the powerful representation capabilities of GNNs, some recent works have used them for text classification tasks (Yao et al., 2019; Wu et al., 2019; Ding et al., 2020). For example, TextGCN (Yao et al., 2019) employs standard GCNs (Kipf and Welling, 2017) on a heterogeneous graph constructed from an entire corpus, obtaining competitive performance. Subsequently, SGC (Wu et al., 2019) removes additional complexity by iteratively eliminating the nonlinear transformations between GCN layers and collapsing the resulting function into a linear transformation, which yields better results. HyperGAT (Ding et al., 2020) proposes to learn text embeddings by applying hypergraphs over documents. However, the aforementioned models utilize only the immediate neighbors in the graph and suffer from over-smoothing issues. Additionally, these models cannot yield high-quality document-level representations using sum/mean pooling operations. To the best of our knowledge, our model is the first attempt to utilize the graph attention diffusion method to address the difficulties of long-range word interactions and achieve better performance in text classification.

Methods
The overall architecture of DADGNN is shown in Fig. 1. Next, we elaborate on text graph construction, the key components, and the graph-level representation in turn.

Text Graph Construction
The initial task of text classification based on GNNs is to represent the serialized text as a graph. In our work, we denote a text as $G = (V, E)$, where $V = \{v_1, \ldots, v_n\}$ is the set of distinct words and $E = \{e_{1,1}, \ldots, e_{n,n}\}$ is the set of edges formed between words. Each node in the graph is initialized with a $d$-dimensional word embedding vector (i.e., word2vec or GloVe). We build an edge starting from a target node and ending at each of its $p$-hop adjacent nodes, formalized as $e_{i,j}, j \in [i-p, i+p]$, as shown in Fig. 2(a). The advantage of constructing edges in this way is that the graph is directed and its transition matrix is symmetric.
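As a concrete illustration, below is a minimal sketch of this construction in Python; the function name and the row normalization are our own choices, since the paper does not prescribe an implementation:

```python
import numpy as np

def build_text_graph(tokens, p=2):
    """Build a text graph where each word links to its p-hop sequence
    neighbors (a sketch of Section "Text Graph Construction").

    Returns the list of unique word nodes and a row-normalized
    transition matrix over them.
    """
    nodes = list(dict.fromkeys(tokens))            # unique words, order kept
    idx = {w: i for i, w in enumerate(nodes)}
    n = len(nodes)
    adj = np.zeros((n, n))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - p), min(len(tokens), i + p + 1)):
            adj[idx[w], idx[tokens[j]]] = 1.0      # edge to a p-hop neighbor
    trans = adj / adj.sum(axis=1, keepdims=True)   # transition matrix
    return nodes, trans

nodes, T = build_text_graph("graph neural network is powerful".split(), p=1)
```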

Key Components
To obtain discriminative feature representations of nodes in deeper networks, we decouple the propagation and transformation processes of GNNs. Concretely, this is formulated as:

$$H^{(0)} = f_\theta(X), \tag{1}$$

where $X \in \mathbb{R}^{n \times d}$ represents the original word vectors and $H^{(0)} \in \mathbb{R}^{n \times c}$ represents the vectors obtained after the feature transformation $f_\theta$ (e.g., an MLP), in which $c$ is the number of text categories. The forward propagation process of previous graph-based models can be formulated as:

$$H^{(l+1)} = \sigma\!\left(\tilde{A} H^{(l)} W^{(l)}\right), \tag{2}$$

where $\tilde{A}$ is the adjacency matrix, representing the propagation process of GNNs, $W^{(l)}$ is a layer-specific learnable weight matrix, and $\sigma$ is a nonlinear function; the latter two represent the feature transformation process of GNNs.
The entanglement of propagation and representation transformation in Eq. 2 can significantly degrade the performance of GNNs. In traditional GNNs, the original features are propagated and then multiplied by the transformation matrix at every layer. Intuitively, such a network is difficult to train when the number of layers becomes large, because the many parameters of the transformation are intertwined with the receptive field of the propagation. Additionally, propagation and representation transformation affect the graph network in terms of structure and features, respectively, so there is no need to intertwine the two processes. After performing Eq. 1, the two fundamental processes of GNNs are decoupled.
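The following sketch contrasts this decoupled design with the entangled form of Eq. 2; it assumes a precomputed diffusion matrix T, and the layer sizes and MLP shape are illustrative rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DecoupledGNN(nn.Module):
    """Hedged sketch of the decoupling in Eq. 1: features are first
    transformed once by an MLP (H^(0) = f_theta(X)), then propagated for
    several parameter-free steps, instead of interleaving a weight
    matrix with every propagation as in Eq. 2."""

    def __init__(self, d_in, n_classes, n_layers=8):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, n_classes))
        self.n_layers = n_layers

    def forward(self, X, T):
        H = self.transform(X)           # feature transformation, done once
        for _ in range(self.n_layers):  # propagation, no per-layer weights
            H = T @ H                   # T: (attention) diffusion matrix
        return H
```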
Then, we calculate the normalized attention weights between directly connected nodes using Eqs. 3 and 4:

$$s_{i,j}^{(l)} = \sigma\!\left(a^{(l)\top}\left[W^{(l)} h_i^{(l)} \,\Vert\, W^{(l)} h_j^{(l)}\right]\right), \quad j \in \mathcal{N}(i), \tag{3}$$

$$A_{i,j}^{(l)} = \frac{\exp\!\left(s_{i,j}^{(l)}\right)}{\sum_{k \in \mathcal{N}(i)} \exp\!\left(s_{i,k}^{(l)}\right)}, \tag{4}$$

where $W^{(l)}$ is a weight matrix and $a^{(l)}$ is a weight vector, both trainable parameters shared within the $l$-th layer, $A^{(l)}$ is the graph attention matrix in the $l$-th layer, and $\sigma$ is the ReLU activation function. In Eq. 3, the target node $i$ does not consider the potential impact of neighbors beyond one hop. We therefore compute the attention between nodes that are not directly connected in a complex network via the diffusion mechanism.
The graph attention diffusion matrix $T$ is obtained from the attention matrix $A$ as follows:

$$T = \sum_{n=0}^{\infty} \zeta_n A^n, \tag{5}$$

where $\zeta_n$ are learnable coefficients that depend on the properties exhibited by the constructed graph network.

[Fig. 2: (a) The text graph constructed for "graph neural network is powerful" with $p = 1$. (b) Our model captures the information of disconnected nodes by considering all paths between nodes via an attention diffusion procedure in a single layer; for example, with target node 'graph' (irrelevant edges of (a) removed for brevity), $y_{CA} = \sigma([y_{CB}, y_{BA}])$ along C→B→A.]

As illustrated in Fig. 2(b), $A^n$ is the $n$-th power of the attention matrix, which accounts for the influence on target node $i$ of all neighboring nodes $j$ with path lengths up to $n$, analogous to powers of the graph adjacency matrix. This operation effectively increases the attentional receptive field within a single layer of the neural network. The mechanism establishes attentional links between unconnected nodes, yielding attention coefficients whose magnitudes depend on $\zeta_n$ and the path length. In practice, according to the "four/six degrees of separation" observed in real-world graphs (Backstrom et al., 2012), which means that the shortest path distance between a pair of nodes is at most four or six, we do not sum to $n = \infty$. In our experiments, we truncate the infinite sum in Eq. 5 at a natural number $N \in [4, 7]$, i.e., $T = \sum_{n=0}^{N} \zeta_n A^n$, which yields impressive model performance.
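A direct (if not the most memory-efficient) way to realize the truncated diffusion of Eq. 5 is to accumulate powers of the attention matrix, as sketched below; in practice one would typically apply the recursion to the feature matrix instead of materializing each $A^n$:

```python
import torch

def attention_diffusion(A, zeta):
    """Truncated attention diffusion T = sum_{n=0}^{N} zeta_n A^n (Eq. 5),
    a sketch with coefficients zeta (length N + 1), assumed learnable."""
    T = zeta[0] * torch.eye(A.size(0), device=A.device)
    A_pow = torch.eye(A.size(0), device=A.device)
    for n in range(1, len(zeta)):
        A_pow = A_pow @ A              # A^n: aggregates n-hop paths
        T = T + zeta[n] * A_pow
    return T

# e.g. N = 6, matching the truncation range N in [4, 7] reported above:
# zeta = torch.nn.Parameter(torch.full((7,), 1.0 / 7))
```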
To further improve the expressiveness of the attention diffusion layer, we deploy a multi-head attention diffusion mechanism. Specifically, we first compute the attention diffusion for each head $k$ independently and then aggregate the heads. The output feature representation is:

$$H^{(l+1)} = \left( \big\Vert_{k=1}^{K} T^{(k)} H^{(l)} \right) W_a,$$

where $\Vert$ is the concatenation operation and $W_a$ denotes a weight matrix for transforming the dimensions.
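A minimal sketch of this multi-head aggregation, assuming the per-head diffusion matrices $T^{(k)}$ have already been computed; the class and parameter names are our own:

```python
import torch
import torch.nn as nn

class MultiHeadDiffusion(nn.Module):
    """Sketch of multi-head attention diffusion: each head k diffuses
    features with its own T_k; heads are concatenated, then projected
    back to the model dimension by W_a."""

    def __init__(self, n_heads, d_model):
        super().__init__()
        self.W_a = nn.Linear(n_heads * d_model, d_model, bias=False)

    def forward(self, H, T_heads):       # T_heads: list of (n, n) matrices
        outs = [T @ H for T in T_heads]  # per-head diffusion
        return self.W_a(torch.cat(outs, dim=-1))
```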

Graph-Level Representation
After propagating through the $L$ layers of our model, we are able to compute the final representations of all nodes in each text graph. To measure the different roles of the nodes in the graph, in contrast to graph-based text classification models that use generic pooling, we employ a node-level attention mechanism. Specifically, it can be expressed as:

$$\Psi_i = \frac{\exp\!\left(W_b h_i^{(L)}\right)}{\sum_{j=1}^{n} \exp\!\left(W_b h_j^{(L)}\right)}, \qquad X_{\mathrm{out}} = \sum_{i=1}^{n} \Psi_i h_i^{(L)},$$

where $W_b$ is a trainable weight matrix and $\Psi_i$ denotes the attention coefficient of node $i$ in the graph.
To obtain the probability of each category, we further perform $X_{\mathrm{out}} = \mathrm{softmax}(X_{\mathrm{out}})$.
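A sketch of this node-level attention readout, under the assumption that $W_b$ scores each node with a single value before the softmax (the paper does not spell out the exact shape of $W_b$):

```python
import torch
import torch.nn as nn

class AttentionReadout(nn.Module):
    """Sketch of the node-level attention readout: score each node,
    softmax over nodes, and take the weighted sum as the document
    representation."""

    def __init__(self, d_model):
        super().__init__()
        self.W_b = nn.Linear(d_model, 1, bias=False)  # assumed shape

    def forward(self, H):                         # H: (n_nodes, d_model)
        psi = torch.softmax(self.W_b(H), dim=0)   # attention per node
        return (psi * H).sum(dim=0)               # graph-level vector
```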
Finally, we use the cross-entropy loss as the objective function to optimize the neural network for text classification. Concretely:

$$\mathcal{L} = -\sum_{d \in D} \sum_{j=1}^{c} \Phi_{d,j} \log X_{\mathrm{out}}^{(d,j)},$$

where $D$ is the set of training data and $\Phi$ is the indicator matrix. Note that our model is directly applicable to inductive learning tasks: for unseen test documents, the corresponding constructed graph can be fed directly into the trained model for prediction. Moreover, the model is trained in an end-to-end manner, which means that all learnable parameters are optimized simultaneously.

Spectral Analysis
In this section, we investigate the effectiveness of our model from the perspective of graph spectra.

Graph Fourier Transform. The graph Fourier transform relies on the spectral decomposition of the graph Laplacian. A common symmetric positive semidefinite graph Laplacian matrix is $L = I - D^{-1/2} \hat{A} D^{-1/2}$ with eigendecomposition $L = V \Lambda V^\top$, where $V$ and $\Lambda$ are real-valued, $D$ is the diagonal degree matrix and $\hat{A}$ is the adjacency matrix. According to (Shuman et al., 2013), the eigenvectors of $L$ can be treated as the bases of the graph Fourier transform. The graph Fourier transform acting on the graph signal $x$ is defined as $\hat{x} = V^\top x$, and the inverse graph Fourier transform is $x = V \hat{x}$. Hence, the graph convolution between the signal $x$ and a filter $g$ on $G$ is defined as

$$g \ast x = V\, \tilde{G}_\beta(\Lambda)\, V^\top x,$$

where $\tilde{G}_\beta(\Lambda) = \mathrm{diag}\!\left(g_{\beta,M}(\lambda_1), \ldots, g_{\beta,M}(\lambda_M)\right)$.
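The following toy example verifies the transform pair $\hat{x} = V^\top x$, $x = V\hat{x}$ on a three-node path graph (the graph and signal are illustrative):

```python
import numpy as np

# Symmetric normalized Laplacian of a 3-node path graph.
A_hat = np.array([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])
d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(1)))
L = np.eye(3) - d_inv_sqrt @ A_hat @ d_inv_sqrt

lam, V = np.linalg.eigh(L)       # eigendecomposition L = V diag(lam) V^T
x = np.array([1.0, 2.0, 3.0])    # a graph signal
x_hat = V.T @ x                  # graph Fourier transform
x_rec = V @ x_hat                # inverse transform recovers x
assert np.allclose(x, x_rec)
```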
The attention matrix $A$ is essentially a random-walk transition matrix, satisfying $D_{ii} = \sum_{j=1}^{N} A_{ij} = 1$ and $A_{ij} > 0$. Therefore, the symmetric normalized graph Laplacian matrix can be written as $L = I - D^{-1/2} A D^{-1/2} = I - A$. A filter $g_\beta$ acting on a Laplacian matrix can be regarded as a polynomial filter of order $M$, i.e., $g_\beta(L) = \sum_{m=0}^{M} \beta_m L^m$, since it is localized and shift-invariant (Defferrard et al., 2016). Since $T = \sum_{n=0}^{N} \zeta_n A^n$, inspired by (Klicpera et al., 2019b), we can obtain the relationship between the graph filter and graph attention diffusion via the binomial theorem:

$$g_\beta(L) = \sum_{m=0}^{M} \beta_m (I - A)^m = \sum_{m=0}^{M} \beta_m \sum_{n=0}^{m} \binom{m}{n} (-1)^n A^n = \sum_{n=0}^{N} \zeta_n A^n = T, \quad \zeta_n = \sum_{m=n}^{M} \binom{m}{n} (-1)^n \beta_m,$$

and a good choice of $\zeta_n$ can capture both the node information and the graph structure. Since $\zeta_n$ is a learnable parameter, the ideal $\zeta_n$ value represents the optimal graph filter.
Given the graph attention matrix $A = U \Lambda U^\top$, via the eigenvalue decomposition of $A$, we can define the attention diffusion matrix $T$ acting on the graph signal $x$ as

$$T x = U\, g_{\zeta,N}(\Lambda)\, U^\top x,$$

where $g_{\zeta,N}(\Lambda) = \mathrm{diag}\!\left(g_{\zeta,N}(\lambda_1), \ldots, g_{\zeta,N}(\lambda_n)\right)$ and the corresponding filter coefficients are $g_{\zeta,N}(\lambda) = \sum_{n=0}^{N} \zeta_n \lambda^n$, where $\lambda$ ranges over the eigenvalues of $A$. By constraining the parameters $\zeta_n$, we can obtain graph filters with different characteristics, such as low-pass, high-pass and even more complex filters.
In recent works, GDC (Klicpera et al., 2019b) and MAGNA let $g_{\zeta,N}(\cdot)$ act as a low-pass graph filter by imposing stricter conditions, i.e., $\zeta_n = \alpha(1-\alpha)^n$, $\sum_{n=0}^{N} \zeta_n = 1$ and $\alpha \in (0, 1)$. This enhances the low-frequency information corresponding to the large-scale graph structure and suppresses the high-frequency information corresponding to fine-grained messages. However, as mentioned in Section 1, the high-frequency information of a graph is not always useless, as it can capture the differences between nodes; eliminating it completely may limit the expressiveness of the model. When $\zeta_n = (-\alpha)^n$ and $\alpha \in (0, 1)$, $g_{\zeta,N}(\cdot)$ is a high-pass filter (Chien et al., 2021). Since our model can adaptively learn the optimal weights $\zeta_n$ rather than behaving only as a low-pass filter, DADGNN is theoretically more expressive than fixed-weight filters, and the results in Table 2 strongly support this inference.
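To make the contrast concrete, the snippet below evaluates the spectral response $g_{\zeta,N}(\lambda) = \sum_{n=0}^{N} \zeta_n \lambda^n$ for the two fixed coefficient schemes above; $\alpha$ and $N$ are illustrative values, not tuned settings:

```python
import numpy as np

N, alpha = 6, 0.15
lam = np.linspace(-1.0, 1.0, 5)   # eigenvalues of A lie in [-1, 1]

zeta_low  = np.array([alpha * (1 - alpha) ** n for n in range(N + 1)])
zeta_high = np.array([(-alpha) ** n for n in range(N + 1)])

def response(zeta, lam):
    """Polynomial filter response g_{zeta,N}(lam) = sum_n zeta_n lam^n."""
    return sum(z * lam ** n for n, z in enumerate(zeta))

print(response(zeta_low, lam))    # largest near lam = 1: low-pass behavior
print(response(zeta_high, lam))   # largest near lam = -1: high-pass behavior
```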

Experiments

Datasets. For a fair and comprehensive evaluation, we select two types of text corpora, including both long and short documents from different domains. Detailed summary statistics of the datasets are presented in Table 1.
Long Corpus.
(1) AG news (Zhang et al., 2015) is a subdataset of AG's corpus of news articles constructed by assembling titles and description fields from the 4 largest classes.
(2) IMDB (Maas et al., 2011) is a movie review dataset for sentiment classification. We follow the data processing method used in (Tang et al., 2015). (3) WebKB (Craven et al., 1998) is a dataset of web pages from the computer science departments of various universities, categorized into 6 imbalanced categories. (4) Reuters is a collection of news documents. We use two of its subdatasets, R8 and R52, in our experiments, following the settings of (Yao et al., 2019).
Short Corpus.
(1) MR (Pang and Lee, 2005), a movie review dataset, is specifically for binary sentiment analysis, and each review contains only one sentence. (2) SST-1 (Socher et al., 2013) is a fine-grained sentiment analysis dataset in which each sentence is annotated with fine-grained labels (from very positive to very negative). (3) SST-2 (Socher et al., 2013) is the same as SST-1 except that neutral reviews are removed and the data are labeled as positive or negative. (4) TREC (Li and Roth, 2002) is a question dataset; the task involves classifying questions into 6 question types. (5) DBLP is a computer science bibliography dataset that contains the titles of papers; six different research areas are selected. We use the same data as (Tang et al., 2015).
Implementation Details. In our experiments, we use the same data preprocessing as (Yao et al., 2019). Concretely, we clean and tokenize the raw documents and remove words that occur fewer than 5 times. Moreover, we randomly sample 10% of the training data as the validation set. The optimal hyperparameter values are selected when the model reaches the highest accuracy on the validation samples. Pre-trained GloVe word embeddings (Pennington et al., 2014) are adopted for models that require word initialization. We train DADGNN for 200 epochs with an early-stopping strategy using the Adam optimizer (Kingma and Ba, 2015). For the baseline models, we use the default hyperparameters described in the original papers. All experiments are conducted ten times to obtain the average accuracy and standard deviation.
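For reference, a hedged sketch of this training setup; the model, data loaders and the evaluate helper are assumed and hypothetical, as the paper does not publish its training loop:

```python
import torch

# Adam with early stopping on validation accuracy, per the setup above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
best_val, patience, wait = 0.0, 10, 0

for epoch in range(200):
    model.train()
    for X, T, y in train_loader:            # one graph per document
        optimizer.zero_grad()
        logits = model(X, T)
        loss = torch.nn.functional.cross_entropy(logits, y)
        loss.backward()
        optimizer.step()

    val_acc = evaluate(model, val_loader)   # assumed helper
    if val_acc > best_val:
        best_val, wait = val_acc, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        wait += 1
        if wait >= patience:                # early stopping
            break
```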

Results and Analysis

Table 2 shows that DADGNN achieves state-of-the-art performance on all datasets compared to other competitive baselines, which demonstrates the superiority of our model. We additionally obtain the following insights and analysis.
One crucial reason our model achieves significant improvements is that the receptive field of the target node is enlarged by attention diffusion, which incorporates more informative messages (i.e., both low-frequency and high-frequency information) from the text. Furthermore, decoupling feature transformation and propagation makes it possible to obtain distinguishable hidden features in deep graph networks, which is extremely useful for short texts.
Most graph-based approaches achieve outstanding performance compared to the other two groups of baseline models in the topic classification datasets, such as R52, R8 and AG news. This phenomenon demonstrates the critical role of capturing non-consecutive and long-distance semantics in the document for text classification. However, for the sentiment classification task, the sequence-based models have more robust performance, which can be explained by the fact that word order modeling is crucial for capturing sentiment. Since our model constructs directed graphs, it can obtain sequential context information ignored by most graph-based methods and achieves excellent results in sentiment classification.
Ablation Study. To verify the role of each module in DADGNN, we perform a series of ablation experiments, and the results are shown in Table 3. From the results, we can see that removing any part of our proposed model leads to a decrease in accuracy. In particular, removing the attention diffusion module degrades performance most significantly, indicating the effectiveness of attention diffusion for increasing the receptive field to learn more expressive word representations. Moreover, using the node-level attention mechanism can further enhance the performance of our approach because of the ability to obtain more precise graph-level representations. It is clear that stacking multiple layers captures long-distance and non-consecutive semantics and thus performs better than the one-layer model.
Memory Consumption. To verify the computational efficiency of our model, we select two representative models for GPU memory consumption analysis: TextGCN is transductive, while the others are inductive. From the results in Table 4, we conclude that inductive models consume significantly less memory than transductive ones. Our approach consumes less memory and achieves better performance than HyperGAT, previously the most computationally efficient model. An important reason is that we map representations into a low-dimensional space via feature transformation at an early stage, making DADGNN computationally efficient and memory-saving.
Hyperparameter Sensitivity. We investigate the impact of different hyperparameters on model performance. Fig. 3 (a) and (b) illustrate the effect of the diffusion distance $n$ in Eq. 5 on the R52 and TREC datasets, respectively. We observe that the two datasets share an almost identical trend; i.e., the test accuracy first increases and then decreases, with the optimum at $n \in [4, 6]$, which is consistent with the analysis in Section 3.2. Fig. 3 (c) and (d) show the test accuracy when constructing the text graph with different $p$-hop adjacent neighbors (Section 3.1) on the R52 and TREC datasets, respectively. We find that our model achieves optimal performance when connecting two neighbors. This implies that, when connected only to direct neighbors, nodes cannot capture dependencies that span multiple words of context; however, when connecting to more distant neighbors, the graphs of different documents become similar and local features are overlooked.
Deeper Layers. To verify whether DADGNN suffers from the over-smoothing problem common to GNNs when stacking multiple layers, we conduct experiments under different model depths for our model and for classical GNN models such as TextGCN and SGC. The results are presented in Fig. 4. Overall, the accuracy of our model is stable and does not decrease significantly as the number of layers increases, which suggests that DADGNN has a good ability to mitigate the over-smoothing issue. In contrast, the performance of the classical GNNs drops rapidly when stacking multiple layers, indicating a serious over-smoothing issue. One important reason is that, as mentioned in Section 3.2, representation transformation and propagation are decoupled in DADGNN, which benefits a deep model architecture. The other reason is that we design the graph filter to exhibit more than a low-pass property, so the additional high-frequency information prevents node features from becoming indistinguishable. Visualization. To show the superiority of the representations learned by our model, we use the t-SNE method (Van der Maaten and Hinton, 2008) to visualize the learned embeddings of test documents. Concretely, we select three powerful baselines (i.e., HyperGAT, Text-level GCN and TextGCN) to compare with DADGNN on the TREC dataset. From Fig. 5, we observe that our model learns more discriminative and distinguishable document embeddings than the other methods.

Conclusion
In this work, we propose a new graph-based model, named DADGNN, for text classification. Our model decouples the necessary procedures of GNNs (i.e., representation transformation and propagation) to train a deep neural network and introduces an attention diffusion mechanism to capture non-direct-neighbor context information in a single layer. Furthermore, a node-level attention technique is introduced to obtain more precise document-level representations. With the techniques above, our model enlarges the receptive field and increases the depth without suffering from over-smoothing problems. The theoretical analysis shows that DADGNN can learn the optimal filter to adapt to the dataset; i.e., it can preserve more relevant high-frequency information of nodes to further improve the expressiveness of the model. Extensive experiments show that our model outperforms other competitive approaches. More importantly, our work not only provides a powerful baseline model for text classification but also contributes to graph representation learning.