Multi-hop Graph Convolutional Network with High-order Chebyshev Approximation for Text Reasoning

Graph convolutional networks (GCNs) have become popular in various natural language processing (NLP) tasks owing to their strength in modeling long-term and non-consecutive word interactions. However, the single-hop graph reasoning in existing GCNs may miss some important non-consecutive dependencies. In this study, we define a spectral graph convolutional network with a high-order dynamic Chebyshev approximation (HDGCN), which augments multi-hop graph reasoning by fusing messages aggregated from direct and long-term dependencies into one convolutional layer. To alleviate the over-smoothing in the high-order Chebyshev approximation, a multi-vote-based cross-attention (MVCAttn) mechanism with linear computation complexity is also proposed. The empirical results on four transductive and inductive NLP tasks and the ablation study verify the efficacy of the proposed model.


Introduction
Graph neural networks (GNNs) are usually used to learn node representations in Euclidean space from graph data, and they have developed into one of the hottest research topics in recent years (Zhang, 2020). The primitive GNNs relied on recursive propagation on graphs, which takes a long time to train (Zhang et al., 2019b). One major variant of GNNs, graph convolutional networks (GCNs) (Kipf and Welling, 2017; Yao et al., 2019), uses spectral filtering to replace recursive message passing and needs only a shallow network to converge, and it has been used in various NLP tasks. For example, Yao et al. (2019) constructed the text as a graph and input it to a GCN; this method achieved better results than conventional deep learning models in text classification. Afterward, GCNs became popular in more tasks, such as word embedding (Zhang et al., 2020b), semantic analysis (Zhang et al., 2019a), document summarization, knowledge graphs, etc.

(* corresponding author: Qingcai Chen)
The spectral graph convolution in Yao's GCN is a localized first-order Chebyshev approximation. It is equal to a stack of a 1-step Markov chain (MC) layer and a fully connected (FC) layer. Unlike multi-step Markov chains, the message propagation in the vanilla GCN lacks node probability transitions. As a result, multi-hop graph reasoning is very tardy in GCN and easily causes the suspended animation problem (Zhang and Meng, 2019). However, the probability transition on the graph is useful for improving the efficiency of learning contextual dependencies. In many NLP tasks (like question answering (QA) systems and entity relation extraction), the features of two nodes need to be aligned. As an example, Figure 1 shows a simple graph where node n4 is a pronoun of node n1. In this example, the adjacency matrix is masked on nodes n2, n3, n5 to demonstrate the message passing between n1 and n4. Figure 1 (c) and (d) plot the processes of feature alignment on the two nodes without and with probability transitions respectively. In this example, the feature alignment process without probability transition needs 10 more steps than that with probability transition. This shows that encoding multi-hop dependencies through spectral graph filtering in GCN usually requires a deep network. However, it is well known that deep neural networks (DNNs) are tough to train and easily cause the over-fitting problem (Rong et al., 2019).
Some of the newest studies to improve multi-hop graph reasoning include graph attention networks (GATs) (Veličković et al., 2018), the graph residual neural network (GRESNET) (Zhang and Meng, 2019), the graph diffusive neural network (DIFNET) (Zhang, 2020), TGMC-S (Zhang et al., 2020c), and Graph Transformer Networks (Yun et al., 2019).

Figure 1: (a): A simple graph with 5 nodes and weighted edges, in which node n4 is a pronoun of n1 and the two nodes need to align features. (b): The masked adjacency matrix on this graph. (c) and (d): The processes of feature alignment on nodes n1 and n4 without and with transition probability respectively.

GATs enhance graph reasoning by implicitly re-defining the graph structure with attention on the 1-hop neighbors, but the optimization is equilibrated over the whole graph. GRESNET solves the suspended animation problem by creating extensively connected highways that carry raw node features and intermediate representations throughout all the model layers; however, multi-hop dependencies are still reasoned at a slow pace. DIFNET introduces a new neuron unit, the gated diffusive unit (GDU), to model and update the hidden node states at each layer. DIFNET replaces spectral filtering with a recursive module and realizes neural gate learning and graph residual learning, but its time cost is aggravated compared with GCN. TGMC-S stacks GCN layers on adjacency matrices with different hops of traffic networks. Different from the ground-truth traffic network in TGMC-S, it is hard to construct multi-hop word-word relationships objectively from text, and TGMC-S does not give a way to improve multi-hop message passing in GCN.
Transformers (Vaswani et al., 2017) and the corresponding pre-trained models can be thought of as fully-connected graph neural networks that contain multi-hop dependencies. They figure out contextual dependencies on a fully-connected graph with the attention mechanism. The message propagation in Transformers follows relations self-adaptively learned from the input sequence instead of fixed graph structures. Publications have shown that Transformers outperform GCNs in many NLP tasks. Graph Transformer (Dwivedi and Bresson, 2020) generalizes the Transformer to arbitrary graphs and improves inductive learning from Laplacian eigenvectors on graph topology. However, because the number of connections grows quadratically with the node number N, things get out of hand for very large N. Additionally, the fully-connected graph is not an interpretable architecture in practical tasks; for example, it is unclear whether Transformers are the best choice to model text under linguistic theory.

To improve the efficiency and performance of multi-hop graph reasoning in spectral graph convolution, we propose a new graph convolutional network with a high-order dynamic Chebyshev approximation (HDGCN). A prime ChebNet and a high-order dynamic (HD) ChebNet are first applied to implement this Chebyshev approximation. These two sub-networks work like a trade-off between low-pass signals (direct dependencies) and high-pass signals (multi-hop dependencies) respectively. The prime ChebNet takes the same form as the convolutional layer in the vanilla GCN; it mainly extracts information from direct neighbors in local contexts. The HD-ChebNet aggregates messages from multi-hop neighbors following the transition direction adaptively learned by the attention mechanism. The standard self-attention (Vaswani et al., 2017) has O(N^2) computation complexity and is hard to apply to long sequences.
Even though existing sparse attention methods, like the Star-Transformer (Guo et al., 2019) and the Extended Transformer Construction (ETC) (Ainslie et al., 2020), have reduced the quadratic dependence on sequence length to a linear one, the fully-connected graph structure cannot be kept. We instead design a multi-vote-based cross-attention (MVCAttn) mechanism, which scales the O(N^2) computation complexity of self-attention down to O(N).
The main contributions of this paper are listed below:
• To improve the efficiency and performance of multi-hop reasoning in spectral graph convolution, we propose a novel graph convolutional network with a high-order dynamic Chebyshev approximation (HDGCN).
• To avoid the over-smoothing problem in the HD-ChebNet, we propose a multi-vote-based cross-attention (MVCAttn) mechanism, which adaptively learns the direction of node probability transition. MVCAttn is a variant of the attention mechanism with linear computation complexity.
• The experimental results show that the proposed model outperforms the compared SOTA models on four transductive and inductive NLP tasks.

Related Work
Our work draws on the vanilla GCN and the attention mechanism, so we first briefly review the paradigms of these models in this section.

Graph Convolutional Network
The GCN model proposed by Kipf and Welling (2017) is the one we are interested in. It is defined on a graph G = {V, E}, where V is the node set and E is the edge set. The edge (v_i, v_j) ∈ E represents a link between nodes v_i and v_j. The graph signals are attributed as X ∈ R^{|V|×d}, and the graph relations E can be encoded as an adjacency matrix A ∈ R^{|V|×|V|} (binary or weighted). Each convolutional layer in GCN is a first-order Chebyshev approximation of the spectral graph convolution, and its layer-wise propagation rule is defined as

H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)}\big),   (1)

where H^{(0)} = X, \hat{A} is the normalized adjacency matrix, and σ is a non-linear activation function. The node embeddings output from the last convolutional layer are fed into a softmax classifier for node or graph classification, and the loss function L can be defined as the cross-entropy error. The weight set {W^{(l)}}_{l=0}^{L} can be jointly optimized by minimizing L via gradient descent.
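The propagation rule above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes ReLU as the activation σ and the common renormalization Â = D̃^{-1/2}(A + I)D̃^{-1/2} with self-loops; the toy graph and all variable names are illustrative.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalize A with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)                       # degree vector of A + I
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(A_norm, H, W):
    """One GCN layer: sigma(A_norm @ H @ W), with ReLU as sigma."""
    return np.maximum(0.0, A_norm @ H @ W)

# Toy graph: 4 nodes on a path, 3-d input features, 2-d output features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                     # H^{(0)} = X
W = rng.normal(size=(3, 2))                     # W^{(0)}
H1 = gcn_layer(normalize_adjacency(A), X, W)    # H^{(1)}
```

Each layer thus mixes every node's features only with its 1-hop neighbors, which is why multi-hop dependencies require stacking many layers.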

Self-Attention Is a Dynamic GCN
The attention mechanism is an effective way to extract task-relevant features from inputs, and it helps the model make better decisions. There are various approaches to computing attention scores from features, and the scaled dot-product attention proposed in Transformers (Vaswani et al., 2017) is the most popular one:
A_d = \mathrm{softmax}\!\left(\frac{(X W_q)(X W_k)^\top}{\sqrt{d_k}}\right), \quad \mathrm{Attn}(X) = A_d X W_v,   (2)

where X ∈ R^{N×d} is the input sequence, and the weights W_q ∈ R^{d×d_k}, W_k ∈ R^{d×d_k}, W_v ∈ R^{d×d_v} transform the sequence into queries, keys, and values. As shown in Equation 2, the attention score matrix A_d can be viewed as a dynamic adjacency matrix on the sequence X. This process in self-attention is similar to the graph convolutional layer defined in Equation 1; the only difference is that the adjacency matrix in Equation 2 is adaptively learned from the input instead of prior graph structures.
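The "dynamic adjacency" reading of Equation 2 can be sketched in NumPy; the random inputs and weight shapes below are illustrative, and each row of the attention matrix sums to 1, just like a row-normalized transition matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention; A_d acts as a dynamic adjacency matrix."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A_d = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (N, N), rows sum to 1
    return A_d @ V, A_d

rng = np.random.default_rng(1)
N, d, dk, dv = 5, 8, 4, 6
X = rng.normal(size=(N, d))
out, A_d = self_attention(X,
                          rng.normal(size=(d, dk)),   # W_q
                          rng.normal(size=(d, dk)),   # W_k
                          rng.normal(size=(d, dv)))   # W_v
```

Note that forming A_d costs O(N^2), which motivates the linear-complexity MVCAttn introduced later.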

Method
In our model, the input graph G = (V, E) takes the same form as the one in GCN. The nodes are attributed as X ∈ R |V|×d , and the adjacency matrix A ∈ R |V|×|V| (binary or weighted) is defined on graph edges E.
The spectral graph convolution in the Fourier domain is defined as

g_\theta \star x = U g_\theta(\Lambda) U^\top x,   (3)

where x ∈ R^d is the signal on a node, U is the matrix of eigenvectors of the normalized graph Laplacian L = I_N − D^{-1/2} A D^{-1/2} = U Λ U^⊤, and the filter g_θ(Λ) is a function of the eigenvalues of the normalized L in the Fourier domain.
The K-th order (K > 2) truncation of the Chebyshev polynomials for this spectral graph convolution is

g_\theta \star x \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{L}) x,   (4)

where \tilde{L} = 2L/\lambda_{max} − I_N is the rescaled Laplacian and T_k denotes the k-th Chebyshev polynomial, with T_0(x) = 1, T_1(x) = x, and T_k(x) = 2xT_{k-1}(x) − T_{k-2}(x). With \lambda_{max} ≈ 2, the K-th order Chebyshev polynomials in Equation 4 are approximated as

g_\theta \star x \approx \theta_0 \hat{A} x + \sum_{k=1}^{K-1} \theta_k \hat{A}^{2k} x,   (5)

where \hat{A} is the normalized adjacency matrix (as defined in Equation 1). As the node state transition \hat{A}^{2k} causes the over-smoothing problem (Li et al., 2018; Nt and Maehara, 2019), we take the dynamic pairwise relationship A_d, self-adaptively learned by the attention mechanism, to turn the direction of node state transition: the powers \hat{A}^{2k} in Equation 5 are replaced with \hat{A}^k A_d^k. In our implementation, the first-order and higher-order Chebyshev polynomials in Equation 5 are approximated with a prime Chebyshev network (ChebNet) and high-order dynamic Chebyshev networks (HD-ChebNets) respectively. We generalize the graph convolution with the K-th order dynamic Chebyshev approximation (Equation 5) to the layer-wise propagation

H^{(1)} = \sigma\big(\hat{A} X W^{(0)}\big) + \sum_{k=1}^{K-1} Z^{(k)},   (6)

where k is the order, Z^{(k)} is the output of the k-th order HD-ChebNet, and W^{(0)}, W^{(k)}, W^{(k)}_d are non-linear filters on node signals. For convenience of writing, we only define the first layer of HDGCN.
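The Chebyshev recursion underlying Equation 4 can be sketched in NumPy. This is an illustrative computation of the terms T_k(\tilde{L})x on a toy path graph, assuming λ_max ≈ 2 so that \tilde{L} = L − I; it is not the HDGCN layer itself.

```python
import numpy as np

def cheb_terms(L_tilde, X, K):
    """Chebyshev terms T_k(L_tilde) @ X for k = 0..K,
    via the recursion T_k = 2 * L_tilde * T_{k-1} - T_{k-2}."""
    terms = [X, L_tilde @ X]                    # T_0 X = X, T_1 X = L_tilde X
    for _ in range(2, K + 1):
        terms.append(2.0 * (L_tilde @ terms[-1]) - terms[-2])
    return terms[:K + 1]

# Toy normalized Laplacian of a 3-node path graph.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
L = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt
L_tilde = L - np.eye(3)          # 2L/lambda_max - I, assuming lambda_max = 2
T = cheb_terms(L_tilde, np.eye(3), K=4)
```

A filter then combines these terms as Σ_k θ_k T_k(\tilde{L})x; higher k reaches further-away neighbors in a single layer.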

Prime ChebNet
We adopt the same convolutional architecture as the one in GCN to implement the prime ChebNet, which mainly aggregates messages from direct dependencies:
Z^{(0)} = \sigma\big(\hat{A} X W^{(0)}\big),   (7)

where W^{(0)} ∈ R^{d×d} and \hat{A} is the normalized symmetric adjacency matrix.

High-order Dynamic (HD) ChebNet
As the multi-hop neighbors can interact via the 1-hop neighbors, we take the Z^{(0)} output from the prime ChebNet as the input of the HD-ChebNet. The multi-vote-based cross-attention (MVCAttn) mechanism first adaptively learns the direction of node probability transition A_d^{(k)}; its schematic is shown in Figure 2 (b). MVCAttn has two phases: graph information aggregation and graph information diffusion.
Graph Information Aggregation coarsens the node embeddings Z^{(k-1)} into a small supernode set S^{(k)} ∈ R^{M×d}, M ≪ |V|. The first step is the multi-vote projection (MVProj), in which the node embeddings Z^{(k-1)} are projected to multiple votes V^{(k)} ∈ R^{|V|×M×d}, and these votes are aggregated into the supernode set:

V^{(k)}_{v,m} = Z^{(k-1)}_{v} W^{V}_{m}, \quad S^{(k)}_{m} = \mathrm{norm}\!\left(\frac{1}{|V|}\sum_{v=1}^{|V|} V^{(k)}_{v,m}\right),   (8)

where |V| ≥ v ≥ 1, M ≥ m ≥ 1, W^V_m ∈ R^{d_k×d_k} is the projection weight, and norm(·) denotes the LayerNorm operation.
Next, the forward cross-attention (FCAttn) updates the supernode values as

S^{(k)} = \mathrm{softmax}\!\left(\frac{(S^{(k)} W_{fq})(Z^{(k-1)} W_{fk})^\top}{\sqrt{d_k}}\right) Z^{(k-1)}.   (9)

Graph Information Diffusion feeds the supernodes S^{(k)} back to update the node set Z^{(k)}. With the node embeddings Z^{(k-1)} and supernode embeddings S^{(k)}, the backward cross-attention (BCAttn) is defined as

A_b^{(k)} = \mathrm{softmax}\!\left(\frac{(Z^{(k-1)} W_{bq})(S^{(k)} W_{bk})^\top}{\sqrt{d_k}}\right), \quad \tilde{Z}^{(k)} = A_b^{(k)} S^{(k)}.   (10)

The last step adds the probability transition with \hat{A}; the output of the k-th order HD-ChebNet is

Z^{(k)} = \sigma\big(\hat{A}^{k} \tilde{Z}^{(k)} W^{(k)}\big).   (11)

Finally, the outputs from the prime ChebNet and the HD-ChebNets are integrated as the node embeddings

H = Z^{(0)} + \sum_{k=1}^{K-1} Z^{(k)}.   (12)
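The aggregation-diffusion idea can be sketched as follows. This is a simplified NumPy illustration under stated assumptions, not the paper's exact parameterization: the multi-vote projection is stood in for by a random node-to-supernode pooling matrix, and all weights are random placeholders. The point it demonstrates is the complexity: each cross-attention is an N×M (or M×N) matrix with M ≪ N, so the cost is O(N·M) rather than the O(N^2) of full self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mvc_attention(Z, M, Wfq, Wfk, Wbq, Wbk, rng):
    """Two-phase cross-attention through M supernodes (M << N):
    aggregation gathers node messages into supernodes, diffusion feeds
    them back; both attention maps are only N x M in size."""
    # Crude stand-in for the multi-vote projection: random pooling to supernodes.
    P = softmax(rng.normal(size=(M, Z.shape[0])))          # (M, N)
    S = P @ Z                                              # initial supernodes (M, d)
    # Aggregation: forward cross-attention (FCAttn), supernodes attend to nodes.
    A_f = softmax((S @ Wfq) @ (Z @ Wfk).T / np.sqrt(Wfk.shape[1]))   # (M, N)
    S = A_f @ Z
    # Diffusion: backward cross-attention (BCAttn), nodes read supernodes back.
    A_b = softmax((Z @ Wbq) @ (S @ Wbk).T / np.sqrt(Wbk.shape[1]))   # (N, M)
    return A_b @ S, A_b

rng = np.random.default_rng(2)
N, M, d = 12, 3, 8
Z = rng.normal(size=(N, d))
W = [rng.normal(size=(d, 4)) for _ in range(4)]            # Wfq, Wfk, Wbq, Wbk
Z_out, A_b = mvc_attention(Z, M, *W, rng)
```

With M fixed (M = 10 in the experiments), the total cost grows linearly in the number of nodes N.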

Classifier Layer
Node Classification The node representations H output from the last graph convolutional layer are fed directly into a softmax classifier for node classification.
Graph Classification The representation of the whole graph is constructed via a readout layer on the outputs H:

h_g = \sum_{v \in V} f_1(h_v) \odot f_2(h_v),   (13)

where ⊙ denotes the Hadamard product and f_1(·), f_2(·) are two non-linear functions. The graph representation h_g ∈ R^d is fed into the softmax classifier to predict the graph label.
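As a sketch of such a gated readout, the snippet below assumes f_1 is a linear map followed by a sigmoid (acting as a per-node gate) and f_2 is a linear map followed by tanh; the paper's exact choice of f_1 and f_2 is not reproduced here, so these are illustrative assumptions.

```python
import numpy as np

def graph_readout(H, W1, W2):
    """Gated readout: h_g = sum_v sigmoid(H W1)_v * tanh(H W2)_v,
    where * is the elementwise (Hadamard) product."""
    gate = 1.0 / (1.0 + np.exp(-(H @ W1)))   # sigmoid gate per node, in (0, 1)
    feat = np.tanh(H @ W2)                   # bounded node features, in (-1, 1)
    return (gate * feat).sum(axis=0)         # pool over the node axis

rng = np.random.default_rng(3)
H = rng.normal(size=(6, 5))                  # 6 nodes, 5-d embeddings
h_g = graph_readout(H, rng.normal(size=(5, 5)), rng.normal(size=(5, 5)))
```

The gate lets the readout weight each node's contribution before summing, so salient nodes dominate the graph representation.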
All parameters are optimized by minimizing the cross-entropy loss:

\mathcal{L} = -\sum_{i} \sum_{j} Y_{ij} \ln \hat{Y}_{ij},   (14)

where Y is the ground-truth label indicator and \hat{Y} is the predicted label distribution.

Experiments
In this section, we evaluate HDGCN on transductive and inductive NLP tasks: text classification, aspect-based sentiment classification, natural language inference, and node classification. In the experiments, each layer of HDGCN is fixed with a K = 6 order Chebyshev approximation, and the model stacks L = 1 layer. The dimension of the input node embeddings is d = 300 for GloVe or d = 768 for pre-trained BERT, and the hyper-parameters are d_k = 64, d_a = 64. So the weights are W^{(0)} ∈ R^{d×64}, W^l_d, W^{(k)} ∈ R^{64×64}, and W_{fk}, W_{fq}, W_{bq}, W_{bk} ∈ R^{64×64}. The number of supernodes is set to M = 10. Our model is optimized with AdaBelief (Zhuang et al., 2020) with a learning rate of 1e−5. The schematic of HDGCN is shown in Figure 2.
To analyze the effectiveness of MVCAttn in avoiding over-smoothing, we report the results of the ablation study (HDGCN-static) in Tables 1, 2, and 5. The ablation model HDGCN-static is an implementation of Equation 5, in which the node state transition is determined by the static adjacency matrix powers \hat{A}^{2k}.

Text Classification
The first experiment is designed to evaluate the performance of HDGCN on text graph classification. Four small-scale text datasets 2 (MR, R8, R52, Ohsumed) and four large-scale text datasets (AG's News 3, SST-1, SST-2 4, Yelp-F 5) are used in this task. The graph structures are built on word-word co-occurrences in a sliding window (width=3 and unweighted) over individual documents. HDGCN is initialized with word embeddings pre-trained by 300-d GloVe and 768-d BERT-base on the small- and large-scale datasets respectively.

Table 1 shows the test accuracies on the four small-scale English datasets, in which HDGCN ranks top with accuracies of 86.50%, 98.45%, 96.57%, and 73.97% respectively. HDGCN beats the best baselines achieved by TextING (the newest GNN model) and the fine-tuned BERT-base model. Our ablation model HDGCN-static also achieves higher accuracies than the newest GNN models, TextING and minCUT. Therefore, the outperformance of HDGCN verifies that (1) the node probability transition in the high-order Chebyshev approximation improves the spectral graph convolution; (2) the MVCAttn mechanism in the high-order ChebNet further raises the effectiveness by avoiding the over-smoothing problem.

Table 2 shows the test accuracies of HDGCN and other SOTA models on the large-scale English datasets. HDGCN achieves the best results of 95.5%, 53.9%, and 69.6% on AG, SST-1, and Yelp-F respectively, and shows only a slight gap of 0.3% to the top-1 baseline (TinyBERT) on SST-2. These results support that HDGCN outperforms the fully-connected graph module in Transformers and the corresponding pre-trained models. Additionally, these comparisons also demonstrate that the combination of prior graph structures and self-adaptive graph structures in graph convolution is able to improve multi-hop graph reasoning.

In the second experiment, we make a case study on the MR dataset to visualize how HDGCN improves multi-hop graph reasoning.
Here, we take the positive comment "inside the film's conflict powered plot there is a decent moral trying to get out, but it's not that , it's the tension that keeps you in your seat Affleck and Jackson are good sparring partners" as an example.
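The sliding-window graph construction described above (width=3, unweighted) can be sketched as follows; the short token sequence is taken from the example comment, and the function name is illustrative.

```python
import numpy as np

def cooccurrence_graph(tokens, window=3):
    """Unweighted word-word co-occurrence adjacency: two words are linked
    if they appear together inside a sliding window of the given width."""
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        # words at positions i+1 .. i+window-1 fall inside the window of w
        for j in range(i + 1, min(i + window, len(tokens))):
            u, v = idx[w], idx[tokens[j]]
            if u != v:
                A[u, v] = A[v, u] = 1.0
    return A, vocab

A, vocab = cooccurrence_graph("the tension that keeps you in your seat".split())
```

Such a graph links mostly consecutive words, which is exactly why the prior structure alone struggles with long-range, non-consecutive dependencies.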

Multi-hop Graph Reasoning in Text Graph
First, the word interactions on the prior graph structure \hat{A} (word-word co-occurrence in a sliding window with width=3) are shown in Figure 3. We can see that each word mainly interacts with its consecutive neighbors. It is hard for the vanilla GCN to encode multi-hop and non-consecutive word-word interactions, as in the example shown in Figure 1.

Figure 4 shows the node interactions from node embeddings Z^{(0)} to supernodes S^{(1)} and the graph diffusion from S^{(1)} to node embeddings Z^{(1)}. Here, the supernode S4 puts greater attention on the segment "it's the tension that keeps you in your seat", which significantly determines the positive polarity. The other supernodes S1, S2, S3, S5 just aggregate messages from the global context evenly. Next, the messages aggregated in supernodes S1-S5 are mainly diffused to four tokens: conflict, decent, moral, you. This verifies that the graph structure A_b^{(1)} self-adaptively learned by the MVCAttn improves the multi-hop graph reasoning on the nodes conflict, decent, moral, you. From the perspective of semantics, these four words significantly determine the positive sentiment of this comment.

Table 3: Test accuracy (%) and macro-F1 score on aspect-based sentiment classification.

Figure 5 shows the message aggregation from node embeddings Z^{(1)} to supernodes S^{(2)} and the message diffusion from S^{(2)} to node embeddings Z^{(2)}. We can see that the supernode S4 puts greater attention on another segment, "there is a decent moral trying to get out", which also contributes to the sentiment polarity. Then the messages aggregated in supernodes S1-S5 are diffused to all words evenly; the backward interactions from supernodes S1-S5 to the graph nodes show no visible differences. These results demonstrate that the multi-hop graph reasoning in HDGCN needs only one graph convolutional layer to attain the stationary state.

Aspect-based Sentiment Classification
The third experiment evaluates HDGCN's performance on the task of aspect-based sentiment classification. This task aims to identify the sentiment polarity of an aspect explicitly given in a sentence. The datasets used in this task are TWITTER, LAP14, REST14, REST15, and REST16; the statistics of these datasets are shown in Figure 6. The SOTA comparison models include AOA, TNet-LF, ASCNN, ASGCN-DT, ASGCN-DG, AEN-BERT, BERT-PT, and SDGCN-BERT.
Each sample in this task includes a sentence pair, an aspect, and a label. The sentence pair and the aspect are concatenated into one long sentence, and the text graph is built from the dependency tree of this sentence. HDGCN is tested twice, with word embeddings initialized by pre-trained 300-d GloVe and 768-d BERT-base respectively. Table 3 shows the test accuracies and macro-F1 scores on the five datasets, where HDGCN achieves new state-of-the-art results on TWITTER, REST14, REST15, and REST16, and a top-3 result on LAP14. As shown in Figure 6, LAP14 has the maximum percentage of long sentences among all datasets, and the shallow network in HDGCN does not outperform the SOTA result on it. Additionally, compared with the newest ASGCN and the attention-based AOA, HDGCN achieves the best results on TWITTER, LAP14, REST15, and REST16 (accuracy) and performs very close to the highest accuracy on REST14 and the highest macro-F1 score on REST16. The above comparison supports that the matching between aspect and sentence pair in HDGCN is more accurate than in the newest GNN and attention-based models, which verifies that the multi-hop graph reasoning is improved in HDGCN.

Natural Language Inference
The fourth experiment evaluates HDGCN's performance on the Stanford natural language inference (SNLI) task (Bowman et al., 2015). This task aims to predict whether the semantic relationship between a premise sentence and a hypothesis sentence is entailment, contradiction, or neutral. The comparison methods include fine-tuned BERT-base, MT-DNN, SMART, and CA-MTL (Pilault et al., 2021).
In this task, the premise and hypothesis sentences are concatenated into a long sentence, which is preprocessed into a text graph with the dependency tree. The word embeddings in HDGCN were initialized from the pre-trained 768-d BERT-base.
All test accuracies are shown in  on the 10% data. As MT-DNN, SMART, and CA-MTL are all fine-tuned with multi-task learning, they perform better than HDGCN in low-resource regimes (0.1% and 1.0% of the data). HDGCN uses just 0.02× more parameters than BERT-base, and it outperforms the latter model on all scales of data. These results verify that the combination of prior graph structure and self-adaptive graph structure in HDGCN performs comparably with the fully-adaptive graph structures in Transformer- and BERT-based multi-task learning models.

Graph Node Classification
The fifth experiment evaluates the effectiveness of HDGCN on the node classification task. We use three standard citation network benchmark datasets, Cora, Citeseer, and Pubmed, to compare test accuracies on transductive node classification. In these three datasets, nodes represent documents and (undirected) edges represent citations; the node features correspond to elements of a bag-of-words representation of a document (Veličković et al., 2018). We also use the PPI dataset, which consists of graphs corresponding to different human tissues, to compare results on inductive node classification. The baselines for comparison include GCN, GAT, Graph-Bert, GraphNAS, LoopyNet, HGCN, GRACE, and GCNII. The results of our evaluation are recorded in Table 5. HDGCN achieves new state-of-the-art results on Cora, Citeseer, and Pubmed, and performs on par with the newest GCNII on PPI. Our ablation model, HDGCN-static, also achieves results close to the newest GNNs on Cora, Citeseer, and Pubmed, but it performs poorly on PPI. This verifies that the high-order Chebyshev approximation of spectral graph convolution suffers a more serious over-smoothing problem in inductive node classification than in transductive node classification. All comparisons in this experiment demonstrate the effectiveness of MVCAttn in avoiding the over-smoothing problem.

Conclusions
This study proposes a multi-hop graph convolutional network based on a high-order dynamic Chebyshev approximation (HDGCN) for text reasoning. To improve multi-hop graph reasoning, each convolutional layer in HDGCN simultaneously fuses low-pass signals (direct dependencies saved in fixed graph structures) and high-pass signals (multi-hop dependencies adaptively learned by MVCAttn). We also propose the multi-vote-based cross-attention (MVCAttn) mechanism to alleviate the over-smoothing in the high-order Chebyshev approximation, and it has only linear computation complexity. Our experimental results demonstrate that HDGCN outperforms the compared SOTA models on multiple transductive and inductive NLP tasks.