Unleashing the Power of Language Models in Text-Attributed Graph



Introduction
Text-attributed graphs, characterized by the association of nodes with text attributes (Yang et al., 2021), are prevalent in diverse real-world contexts. For instance, in paper citation networks, each paper is accompanied by textual content, while in social networks, each user can be described through a text description. The investigation of learning techniques on text-attributed graphs has garnered considerable attention in domains such as graph learning, information retrieval, and natural language processing, reflecting the growing importance of understanding and analyzing textual information within the context of graph-based structures. Existing research on learning from text-attributed graphs mainly falls into three lines: (1) Language models (LMs) only: in these works, LMs (Kim, 2014; Vaswani et al., 2017) are applied to leverage the local textual information of individual nodes and generate representations for them (Howard and Ruder, 2018).
However, the structural relationships between nodes are ignored in this way.
To leverage the relationships between nodes, the self-supervised learning framework GIANT (Chien et al., 2021) proposes to guide the training of the LM with graph structure. Nevertheless, LM-based methods ignore message passing among nodes. (2) Graph neural networks (GNNs) only (Kipf and Welling, 2017; Zhang and Chen, 2018), which are used to capture the structural properties of text-attributed graphs. In most previous studies, the raw text of each node is transformed into numerical features used as node attributes (Hu et al., 2020a; Liu et al., 2020; Hu et al., 2020c) by graph-irrelevant methods such as bag-of-words or a pre-trained BERT. Clearly, the information contained in the raw text is compressed and underutilized. Additionally, relying solely on a fixed representation of text may not be appropriate for certain scenarios. For instance, the term 'Transformer' can refer to a device used for adjusting the voltage of an electric power supply, while in an academic context it signifies a specific model. (3) Combination of LMs and GNNs (Bi et al., 2021; Zhu et al., 2021a), which boosts text embeddings with graph structure. However, this line suffers from severe scalability issues when facing large-scale graphs and the huge parameter counts of LMs. To address this, GLEM (Zhao et al., 2023) leverages a variational EM framework to iteratively update both the LM and GNN modules, enabling scalability to real-world graphs. Nevertheless, GLEM relies on task-specific labels, resulting in node representations that are constrained to the specific task at hand. Generally, prior research encounters issues such as overlooking the relationships between nodes or words, scalability limitations, and a lack of generalizability.
In this paper, we introduce a general text-attributed graph pre-training framework that can fully utilize the relationship between the graph structure and the textual information. The main contributions of our proposed research are as follows.
First, to enhance the modeling of textual information within nodes of text-attributed graphs, we construct a hierarchical text-attributed graph that incorporates both initial nodes and word nodes. More specifically, we decouple the word nodes from the corpus consisting of the textual information of all nodes. Then we construct edges among nodes based on word occurrence in nodes (node-word edges) and word co-occurrence in the whole corpus (word-word edges), as shown in Figure 1. This enables us to capture the finer nuances of the text at a more granular level.
Second, to generate effective representations adapted to various scenarios, we introduce a multi-task graph pre-training framework. This framework encompasses several self-supervised tasks, namely link reconstruction, node attribute reconstruction, important word reconstruction, and important word identification. The objective of link reconstruction is to capture the underlying structural patterns in a general sense, while node attribute reconstruction aims to uncover the semantic relationships among nodes. Furthermore, the tasks of important word reconstruction and important word identification are specifically designed to access distinctive semantics and paper-word correlations, respectively.
Third, to mutually boost the representations of nodes and words, we employ a relational graph neural network (R-GNN) as the foundational model for acquiring knowledge from the hierarchical text-attributed graph. Furthermore, within our framework, we introduce two aggregators that iteratively refine the features of both nodes and words, leveraging the progressively optimized embeddings of papers/words after a designated number of training epochs.

Method
In this section, we present the entire training framework for learning paper and word representations simultaneously without supervision based on R-GNNs, including the modeling of the text-attributed graph, the self-supervised tasks, and the pre-training architecture; see the overall framework in Figure 2.

Hierarchical Text-attributed Graph
To better establish the relationship between the raw text and the graph, we propose to construct a hierarchical text-attributed graph encompassing initial nodes and word nodes, inspired by Yao et al. (2019).
First, we tokenize the raw text contained in the nodes, thereby acquiring all the individual words. Second, we construct a hierarchical text-attributed graph that incorporates both initial nodes and word nodes.
Third, we construct edges among nodes for the hierarchical text-attributed graph. The relationships between initial nodes constitute the edges between them; we build paper-word edges based on word occurrence in papers; and for edges between word nodes, we employ point-wise mutual information (PMI) to measure the co-occurrence frequency of words and decide whether to build an edge. The PMI value is computed as follows:

$$\mathrm{PMI}(i, j) = \log \frac{p(i, j)}{p(i)\, p(j)}, \qquad p(i, j) = \frac{W(i, j)}{W}, \qquad p(i) = \frac{W(i)}{W}$$
where $i$ and $j$ represent two word nodes; $W(i)$ is the number of sliding windows in the nodes that contain word $i$; $W(i, j)$ is the number of sliding windows that contain both word $i$ and word $j$; and $W$ is the total number of sliding windows over the corpus from all nodes. We construct edges for word pairs with a positive PMI value, which suggests a high semantic correlation between the words.
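For concreteness, the sketch below shows one way to derive the word-word edges from sliding-window counts; the function name, window size, and data layout are illustrative assumptions rather than the paper's implementation.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_word_edges(node_texts, window_size=10):
    """Build word-word edges from positive PMI over sliding windows.
    node_texts: list of token lists, one per initial node (hypothetical layout)."""
    word_windows = Counter()   # W(i): number of windows containing word i
    pair_windows = Counter()   # W(i, j): number of windows containing both i and j
    total_windows = 0          # W: total number of sliding windows in the corpus

    for tokens in node_texts:
        for start in range(max(len(tokens) - window_size + 1, 1)):
            window = set(tokens[start:start + window_size])
            total_windows += 1
            word_windows.update(window)
            pair_windows.update(frozenset(p) for p in combinations(sorted(window), 2))

    edges = []
    for pair, w_ij in pair_windows.items():
        i, j = tuple(pair)
        pmi = math.log((w_ij / total_windows)
                       / ((word_windows[i] / total_windows) * (word_windows[j] / total_windows)))
        if pmi > 0:  # only positive PMI indicates a strong semantic correlation
            edges.append((i, j))
    return edges
```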
Thus far, we have constructed the structure of the hierarchical text-attributed graph. Next, we need to assign corresponding attributes to the different types of nodes. Initially, we use a pre-trained BERT model to generate a word embedding for each word node as its attribute; subsequently, we simply average the embeddings of all word nodes belonging to one node to obtain the feature of that initial node. Finally, we obtain a hierarchical text-attributed graph in which semantic and structural information coexist.
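A minimal sketch of this attribute assignment, assuming the word embeddings are stacked in a tensor and each initial node keeps an index list of its words (both names are hypothetical):

```python
import torch

def init_node_features(word_embeddings, node_word_ids):
    """Word nodes take pre-trained BERT word embeddings as attributes;
    each initial node is the mean of the embeddings of its words."""
    Z = word_embeddings                                               # (m, d) word-node features
    X = torch.stack([Z[ids].mean(dim=0) for ids in node_word_ids])    # (n, d) initial-node features
    return X, Z
```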
Let $G = (V, E, J, K, \phi, \varphi, X, Z)$ denote the hierarchical text-attributed graph we built, where $V$ and $E$ represent the sets of nodes and edges respectively; $J$ and $K$ represent the sets of node types and edge types respectively; $\phi: V \to J$ is the node type mapping function, while $\varphi: E \to K$ is the edge type mapping function; $X \in \mathbb{R}^{n \times d}$ and $Z \in \mathbb{R}^{m \times d}$ represent the feature matrices of paper and word nodes respectively; $n$ and $m$ denote the numbers of initial nodes and words respectively, and $d$ denotes the feature dimension. The number of nodes $|V|$ is the sum of the number of initial nodes and the number of individual words.
Intuitively, by constructing the hierarchical text-attributed graph with initial nodes and words, we build a bridge for information interaction. Consequently, as we incorporate individual words into the training process, the initial nodes are endowed with knowledge from both their interconnected nodes and their respective textual components.

Self-supervised Tasks
Appropriate tasks drive R-GNNs to continuously mine the latent structural and semantic information in the heterogeneous graph. In our pre-training architecture, we apply four tasks to fully mine the information at different levels of the hierarchical text-attributed graph, as shown in Figure 1.

Link Reconstruction
Briefly speaking, link reconstruction predicts existing edges between node pairs. Task Process: We regard link reconstruction as a binary classification problem and train the model with negative sampling.
First, we treat the edges existing in the graph as positive samples, and sample non-existent edges as negative samples.
Second, for each node pair $(u, v)$ in the graph, we calculate a score $e_{u,v} = \psi(h_u, h_v)$ based on the representations $h_u$ and $h_v$, where $\psi$ is a dot product but can be any other function that computes similarity.
Third, labeling the positive samples as 1 and the negative samples as 0, we can optimize the R-GNN with the following loss function:

$$\mathcal{L}_{LR} = -\log \sigma(e_{u,v}) - \sum_{i=1}^{Q} \mathbb{E}_{v_i \sim P_n(v)} \log \sigma(-e_{u,v_i})$$

where $\sigma$ denotes an activation function, $v_i \sim P_n(v)$ denotes the negative sampling distribution, and $Q$ is the number of negative samples.
We perform link reconstruction for edges among initial nodes.
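As an illustration, here is a minimal PyTorch-style sketch of this objective, assuming the dot-product score and a plain binary cross-entropy over sampled edges; the function and tensor names are hypothetical:

```python
import torch
import torch.nn.functional as F

def link_reconstruction_loss(h, pos_edges, neg_edges):
    """Binary classification over node pairs with negative sampling.
    h: (n, d) node embeddings; pos_edges / neg_edges: (2, E) index tensors."""
    def score(edges):
        u, v = edges
        return (h[u] * h[v]).sum(dim=-1)   # dot-product score e_{u,v}

    pos_logits = score(pos_edges)          # existing edges, labeled 1
    neg_logits = score(neg_edges)          # sampled non-edges, labeled 0
    logits = torch.cat([pos_logits, neg_logits])
    labels = torch.cat([torch.ones_like(pos_logits), torch.zeros_like(neg_logits)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```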

Node Attribute Reconstruction
In our work, we use the node attribute reconstruction task to maximize the semantic information of the hierarchical text-attributed graph; it is also known as feature reconstruction in homogeneous graphs (Hou et al., 2022). Task Process: First, we sample a subset $\tilde{V}_{initial} \subset V_{initial}$ and mask their features with a mask token [MASK], i.e., a learnable vector $x_{[M]} \in \mathbb{R}^{d}$, while the feature matrix of word nodes remains unchanged. Thus the node feature $\tilde{x}_i$ can be defined as:

$$\tilde{x}_i = \begin{cases} x_{[M]} & v_i \in \tilde{V}_{initial} \\ x_i & v_i \notin \tilde{V}_{initial} \end{cases}$$

Second, we input the feature matrices $X$, $Z$ and the graph $G$ into a graph encoder $f_e$ to obtain the latent code $H$. Then we replace $H$ at the masked node indices with another mask token [DMASK].
Third, we input the re-masked code matrix $H$ into a decoder $f_d$ to obtain the reconstructed feature matrix $W$, and optimize with the scaled cosine error:

$$\mathcal{L}_{NAR} = \frac{1}{|\tilde{V}_{initial}|} \sum_{v_i \in \tilde{V}_{initial}} \left(1 - \frac{x_i^{\top} w_i}{\|x_i\| \cdot \|w_i\|}\right)^{\gamma}, \quad \gamma \geq 1$$

which is the loss averaged over all masked initial nodes. The scaling factor $\gamma$ is a hyper-parameter.
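A compact sketch of the masking step and the scaled cosine error, following the GraphMAE-style formulation above; the helper names and mask ratio are assumptions:

```python
import torch
import torch.nn.functional as F

def mask_initial_nodes(x, mask_token, mask_ratio=0.5):
    """Replace a random subset of initial-node features with the learnable [MASK] vector."""
    num_masked = int(mask_ratio * x.size(0))
    masked_idx = torch.randperm(x.size(0))[:num_masked]
    x_tilde = x.clone()
    x_tilde[masked_idx] = mask_token
    return x_tilde, masked_idx

def scaled_cosine_error(x, recon, masked_idx, gamma=2.0):
    """Scaled cosine error averaged over the masked initial nodes."""
    cos = F.cosine_similarity(x[masked_idx], recon[masked_idx], dim=-1)
    return ((1.0 - cos) ** gamma).mean()
```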

Important Word Reconstruction
Intuitively, we believe there are important words in the raw text of an initial node that largely reflect the semantic information of that node. Motivated by this, we design the important word reconstruction task to reconstruct the semantic information of important words. Task Process: We first need to define which words are important to an initial node. For instance, the title of a paper contains key information, so we tokenize the title, remove the stop words, and take the remaining words as important words. We then mask the features of these important words and reconstruct them; the reconstruction loss $\mathcal{L}_{IWR}$ takes the same scaled cosine error form as node attribute reconstruction, averaged over the masked important-word nodes.

Important Word Identification
Each node contains distinctive important words. Therefore, we design the important word identification task with the objective of identifying these important words.
Task Process: First, we label each paper-word edge according to whether the word is important to the node: 1 for important and 0 for unimportant.
Second, we concatenate the representations $h_u$ and $h_m$ of each node-word pair $(u, m)$ to form the edge representation $h_{u,m}$.
Third, we input the edge representation into a projection head to predict the edge label, and optimize with the following loss function:

$$\mathcal{L}_{IWI} = -\sum_{(u,m)} \left[ y_{u,m} \log y'_{u,m} + (1 - y_{u,m}) \log\left(1 - y'_{u,m}\right) \right]$$

where $y'_{u,m}$ denotes the predicted probability that the edge connects an important word.
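A short sketch of this task as a binary edge classification; the projection head and tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def important_word_identification_loss(h_paper, h_word, pairs, labels, proj_head):
    """Classify each paper-word edge as important (1) or unimportant (0).
    pairs: (2, E) tensor of paper/word indices; proj_head: any module mapping 2d -> 1 logit."""
    u, m = pairs
    edge_repr = torch.cat([h_paper[u], h_word[m]], dim=-1)   # concatenated edge representation
    logits = proj_head(edge_repr).squeeze(-1)
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```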

Pre-training Framework
The representations of initial nodes and word nodes can be optimized simultaneously on the hierarchical text-attributed graph. Motivated by this, we propose a pre-training framework: the Node Representation Update Pre-training architecture (NRUP).
In our framework, we use the constructed graph as the input, select an R-GNN model as our base model, and use the combination of self-supervised tasks as the objectives of the pre-training stage. Furthermore, we design two aggregators to update the features of both initial nodes and words. In the aggregators, MEAN-AGG denotes a mean aggregator that averages the embeddings of the aggregated nodes; $neigh^{initial}_v$ denotes the 1-hop paper neighbors of initial node $v$, while $neigh^{word}_n$ denotes the 1-hop word neighbors of word node $n$; $c$ and $d$ are adaptive parameters based on the average neighbor counts of papers and words respectively; $|neigh^{word}_{initial}|$ denotes the number of word neighbors of an initial node, and $|neigh^{initial}_{word}|$ denotes the number of initial-node neighbors of a word node.
After aggregation, we normalize the aggregated initial-node and word embeddings separately to obtain the updated embedding matrices $U_{initial} \in \mathbb{R}^{n \times d}$ and $U_{word} \in \mathbb{R}^{m \times d}$. We then replace the original node features with the updated matrices and continue to train the same R-GNN with the self-supervised tasks until convergence.
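Since the exact NRUP aggregation rule is not reproduced here, the sketch below only illustrates the general shape of the update: a mean over cross-type neighbours mixed with the current feature via adaptive weights, followed by separate normalization of paper and word features. The cross-type interpolation form and all names are assumptions and may differ from the paper's rule.

```python
import torch

def update_features(x_paper, z_word, paper_to_words, word_to_papers, c=0.5, d=0.5):
    """Illustrative update step: mix each feature with the mean of its cross-type
    neighbours (c, d stand in for the adaptive parameters), then normalize
    papers and words separately before the next training phase."""
    x_new = torch.stack([(1 - c) * x_paper[v] + c * z_word[paper_to_words[v]].mean(dim=0)
                         for v in range(x_paper.size(0))])
    z_new = torch.stack([(1 - d) * z_word[w] + d * x_paper[word_to_papers[w]].mean(dim=0)
                         for w in range(z_word.size(0))])
    # separate normalization brings the updated matrices back to a standard distribution
    x_new = (x_new - x_new.mean(dim=0)) / (x_new.std(dim=0) + 1e-6)
    z_new = (z_new - z_new.mean(dim=0)) / (z_new.std(dim=0) + 1e-6)
    return x_new, z_new
```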

Multi-Task Pre-training
The link reconstruction task tends to restore structural features, while the node attribute reconstruction task focuses on semantic information; the IWI and IWR tasks focus on deep mining of important words. Therefore, we combine these tasks to guide the representation learning of the R-GNN model. The loss function $\mathcal{L}$ can be denoted as:

$$\mathcal{L} = \mathcal{L}_{NAR} + \lambda_1 \mathcal{L}_{LR} + \lambda_2 \mathcal{L}_{IWR} + \lambda_3 \mathcal{L}_{IWI}$$

where $\lambda_1, \lambda_2, \lambda_3$ are hyper-parameters.

Experiment Setup
In this section, we apply our entire pre-training method to a real-world citation network, ogbn-arxiv (Hu et al., 2020a), and report performance on the downstream tasks. To demonstrate the generalization of our method, we consider two settings on the same dataset: Transductive Learning and Inductive Learning. Meanwhile, we select several baseline models to verify the validity of our method.

Dataset
Data Profiling: The ogbn-arxiv dataset is a benchmark node classification dataset representing the citation network between all Computer Science (CS) arXiv papers indexed by MAG. Each node is an arXiv paper with the raw text of its title and abstract, and each directed edge indicates that one paper cites another. In addition, each paper is associated with its publication year. Downstream Tasks: We evaluate our model on two types of tasks, namely subject prediction and important words identification.
• Subject Prediction: This task predicts the subject areas of arXiv CS papers, which are manually labeled by the papers' authors and arXiv moderators. Formally, the task can be formulated as a 40-class classification problem.
• Important Words Identification: This task is designed to identify important words based on paper-word correspondence.Formally, it can be regarded as a binary classification problem.

Pre-Training and Fine-Tuning Setup
Basically, pre-training methods are designed to obtain transferable knowledge from unlabeled datasets, so pre-trained models yield better representations for downstream tasks. Therefore, to evaluate the effectiveness of our method, we propose two different setups.
The first setting is called Transductive Learning: we pre-train and fine-tune on the same graph, meaning all nodes are visible during both the pre-training and fine-tuning stages. The second is called Inductive Learning: we pre-train on one graph and fine-tune on another. Having to generate representations for unseen nodes in the fine-tuning stage makes this setting more challenging. For the evaluation protocol, we conduct the same experimental process under both settings. First, we train an R-GNN encoder with the proposed NRUP on the pre-training graph. Then we freeze the parameters of the encoder and generate embeddings for all nodes of the fine-tuning graph. For evaluation, we train a linear classifier and report the mean and standard deviation of performance on the test nodes over 10 random initializations.
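This evaluation protocol amounts to a standard linear probe on frozen embeddings; a sketch follows, with scikit-learn as an illustrative choice of classifier rather than necessarily the one used in the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(train_emb, train_y, test_emb, test_y, n_runs=10):
    """Freeze the encoder, train a linear classifier on the frozen embeddings,
    and report mean/std accuracy over several random initializations."""
    scores = []
    for seed in range(n_runs):
        clf = LogisticRegression(max_iter=1000, random_state=seed)
        clf.fit(train_emb, train_y)
        scores.append(accuracy_score(test_y, clf.predict(test_emb)))
    return float(np.mean(scores)), float(np.std(scores))
```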

Implementation Details
Construction of Hierarchical Text-attributed Graph: We construct pre-training and fine-tuning hierarchical text-attributed graphs based on ogbn-arxiv according to the two settings, as shown in Table 2. The construction time can be found in Appendix E. Basic Settings: In NRUP, R-GAT is selected as the base model. We update the features of both papers and words with the aggregators after training for 2000 epochs, then feed the normalized updated features into the same R-GAT. Throughout the entire process, we train the model to minimize the loss $\mathcal{L}$ using the AdamW optimizer with cosine learning-rate decay and no warmup. We explain the hyper-parameter settings for the different losses in Appendix D. More details and hyper-parameters can be found in Appendix A.
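The optimization setup described above corresponds to a standard PyTorch configuration; a sketch, where the learning rate, weight decay, and epoch count are placeholders rather than the paper's hyper-parameters:

```python
import torch

def build_optimizer(model, lr=1e-3, weight_decay=0.01, total_epochs=4000):
    """AdamW with cosine learning-rate decay and no warmup."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_epochs)
    return optimizer, scheduler
```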

Baseline Models
To verify the effectiveness of our method, we select several baseline models for comparison.
• Feat: Fixed representations of papers generated by the BERT model.

Experiment Results

Main Results
Table 3 presents the performance of different pre-training methods applied to the same pre-training dataset and fine-tuning test set.
In the Subject Prediction task we predict the subject of a paper based on its paper representation, and in the Important Words Identification task we identify important words based on the concatenated representation of paper and word. More details can be found in Appendix B.
In both the Transductive Learning and Inductive Learning settings, our NRUP achieves better or competitive performance compared to the selected baseline models, demonstrating the effectiveness and transferability of our method.

Effect of Different Tasks
The experimental results demonstrate that using a multi-task loss for optimization helps the model capture both semantic and structural information in the heterogeneous graph simultaneously.
We further investigate the performance on the test dataset using different tasks under the Inductive Learning setting without the embedding update. Table 4 shows that scenario-specific tasks can benefit the basic tasks, and our NRUP with the multi-task loss achieves the best performance.

Optimized Word Embedding
The optimization of word embeddings is a characteristic of our co-modeling method based on the hierarchical text-attributed graph. We conduct experiments under the Inductive Learning setting using the two basic self-supervised tasks to verify that the embeddings of words have indeed been optimized. We average the optimized word representations obtained by training for 2000 epochs with a given task to form the representations of downstream papers, then train a linear classifier on the downstream data to predict the field of each paper. The experimental results in Table 5 show that the word node embeddings are optimized along with the paper nodes, which is why our update framework works.

Effect of Important Word Reconstruction
In the important word reconstruction task, it is true that a word may be "important" to some papers but not to others. However, even if a word decoupled from the title of a particular paper has little impact on another paper, it is still part of the overall content and meaning of the former.
In this task we focus on reconstructing the important words, which contribute to the overall semantic understanding of a particular paper. In other words, the less informative words which are not "important" to any paper can be disregarded.

Method | Accuracy
NAR + AWR | 65.49 ± 0.07
NAR | 66.65 ± 0.14
NAR + IWR | 66.85 ± 0.04

Further, we conducted experiments on reconstructing all words (AWR) instead of only important words (IWR) under the Inductive Learning setting. The findings in Table 6 indicate that it is more effective not to reconstruct the word nodes unless we have specific preferences or criteria for word selection.

Ablation Studies
To verify the effects of the main components in NRUP, we further conduct several ablation studies under the Inductive Learning setting.

Effect of Update Architecture: We explore the influence of the update architecture; Table 7 shows the results of pre-training with and without it. Without the update component, we use the self-supervised tasks for end-to-end pre-training and keep the best pre-training model to generate the embeddings for the downstream dataset. In our architecture, we update the features halfway through pre-training and retain the best model from the later stage. The performance on the downstream dataset indicates that our framework is effective.

Effect of Normalization: Normalization plays a crucial role in the update pre-training framework: it brings the updated feature matrix back to a normal distribution, eliminating the effect of distribution shift. Table 7 also reports the results of updating features without normalization and with different normalization schemes. We find that normalizing the feature matrix is significantly better than not doing so. Meanwhile, normalizing the feature matrices of the papers and the words separately is better than normalizing them jointly. In brief, normalization brings improvements.

Effect of Adaptive Parameter: The aggregators in NRUP are in charge of updating node features. We further explore the manner and degree of aggregation by using fixed hyper-parameters instead of the adaptive ones. Figure 3 shows that performance is best when the hyper-parameter values are close to the adaptive parameters.

Related Work

Representation Learning on Text-attributed Graphs: Text-attributed graphs (Yang et al., 2021) are rich in semantic and structural information. Previous studies on text-attributed graphs can be divided into three lines: LMs only, GNNs only, and combinations of LMs and GNNs.
Early works leverage language models (Kim, 2014; Vaswani et al., 2017) to learn word representations based on sentence sequences. However, neglecting the relationships between nodes leads to underutilization of structural information. To leverage the interrelation of nodes more effectively, the graph structure has been employed as a complementary resource alongside textual data to augment the training of language models (Yang et al., 2020; Mou et al., 2023). Besides, GIANT (Chien et al., 2021) proposes to train the LM with graph structure, but message passing among nodes is still ignored in LM-based methods.
The development of GNNs (Kipf and Welling, 2017; Zhang and Chen, 2018) brings new ideas for studying this data format. In these works (Hu et al., 2020a), the raw text of nodes is transformed into numerical features used as node attributes by graph-irrelevant methods (Mikolov et al., 2013; Devlin et al., 2019). Nevertheless, the text representations are fixed in this setting, which undermines the textual information.
Co-training approaches combining LMs and GNNs (Bi et al., 2021; Zhu et al., 2021a) enjoy the advantages of both models. However, they suffer from scalability issues due to the size of the graph and the parameter counts of LMs. Recently, a variational EM framework (Zhao et al., 2023) proposed to alternately update the LM and GNN, but it relies on task-specific labels, so the learned representations cannot be applied to other scenarios.

Heterogeneous Graph Pre-training: There are related pre-training studies in the field of heterogeneous graphs, which lead to more general representations generated by R-GNN encoders. Jiang et al. (2021a,b) proposed two heterogeneous graph pre-training frameworks, PT-HGNN and CPT-HG: PT-HGNN introduces two pre-training tasks at the node level and the pattern level, while CPT-HG introduces two pre-training tasks at the relation level and the subgraph level, and both achieve good results. These pre-training methods help the model acquire representations with generalization and effectiveness.

Conclusion
In this work, we propose to learn representations of papers and words simultaneously by co-modeling the raw text and the graph through a hierarchical text-attributed graph. We design a pre-training framework and corresponding self-supervised tasks for this scenario. Extensive experiments on the benchmark dataset ogbn-arxiv demonstrate the effectiveness and generality of our method.

Limitations
In our work, we propose to construct a hierarchical text-attributed graph to realize connections between nodes and words. However, the size of the constructed heterogeneous graph is proportional to the number of initial nodes, and the memory complexity is proportional to the size of the graph structure. Therefore, as the number of paper-word edges grows, the memory cost of NRUP may become unaffordable, which limits the scalability of our method. Meanwhile, the hyper-parameters of the loss function may vary across datasets. We plan to address these issues in future work.

A Pre-training Details

Self-supervised Tasks Setting: In the framework, we select two basic self-supervised tasks, Feature Reconstruction and Link Prediction, and two scenario-specific tasks, Identify Important Words and Reconstruct Important Words, and we optimize the R-GAT model with the multi-task loss of these tasks. Table 9 outlines the hyper-parameters of the self-supervised tasks.
In the node attribute reconstruction task, we follow GraphMAE (Hou et al., 2022) and select a one-layer R-GNN as the decoder, because the R-GNN decoder recovers the input feature of a node based on a set of paper nodes and word nodes rather than the node itself, which helps the encoder learn high-level latent codes. The scaled cosine error removes the impact of dimensionality and vector norms, thus improving the training stability of representation learning.

B Fine-tuning Details
This section illustrates some details of the fine-tuning stage.

Subject Prediction Setting: In this downstream task, we leverage the embeddings output by R-GAT for prediction. We train a linear classifier on the fixed representations of the downstream dataset. Table 10 outlines the hyper-parameters of this task.

Important Words Identification Setting: In this downstream task, we leverage the concatenation of the paper and word embeddings output by R-GAT for prediction. For baseline models, we use the concatenation of the paper embedding output by the GNN and the initial BERT embedding. We train a linear classifier on the concatenated representations of the downstream dataset. Table 11 outlines the hyper-parameters of this task.

Due to our limited computing resources, we only sampled 0.5% of the nodes and the corresponding edges for the hierarchical text-attributed graph construction (850,000 nodes) and pre-training. For the fine-tuning step, we consider the edge prediction of Paper-Field and Paper-Venue as two downstream tasks. Table 12 shows our results.

D The hyper-parameters for the coefficients of different losses
During the pre-training process on ogbn-arxiv, we consider node attribute reconstruction as the fundamental self-supervised task, because semantic information plays a crucial role in text-attributed graphs. In our experiments, we observed that setting the values of $\lambda_1, \lambda_2, \lambda_3$ (the hyper-parameters controlling the relative importance of the different loss terms) to around 0.1 to 0.2 yielded good results.
Empirical results indicate that sufficient mining of semantic information and appropriate learning of structural information can enable graph neural networks to acquire more useful transferable knowledge that is beneficial for downstream tasks.

E Cost of Graph Construction
We run the construction process of the hierarchical text-attributed graph and record the running time. The running times under the two settings are as follows:
• Transductive Learning: 7.5 hr
• Inductive Learning: 3.5 hr
These results indicate that the construction step is affordable in terms of time cost.

Figure 1: An illustration of the hierarchical text-attributed graph and the corresponding self-supervised tasks at different levels.

Figure 2: An illustration of the entire process of NRUP. 1) Co-modeling the text and graph based on a heterogeneous graph. 2) Self-supervised tasks as training objectives. 3) The pre-training architecture updates the features with aggregators.

Figure 3: Effect of the adaptive parameters (when exploring the hyper-parameter c, we set d to its adaptive value, and vice versa).

Table 1: Data Split of ogbn-arxiv

We propose to split the dataset into four parts, as shown in Table 1, based on the publication dates of the papers to adapt to the pre-training settings. • Transductive Learning: Under this setting, we select all papers for the pre-training stage; in the fine-tuning stage, we propose to

Table 2: Constructed Hierarchical Text-attributed Graphs

Table 3: Main results of Subject Prediction and Important Words Identification; we report Accuracy for subject prediction and ROC-AUC for important word identification (bolded numbers are the best in each column).

Table 4: Effect of Different Tasks

Table 5: Optimized Word Embedding (word embedding means the average representation of word nodes; paper embedding means the node representation output by R-GAT)

Table 6: Effect of Important Word Reconstruction

Table 7: Effect of Update Architecture

Table 9: Hyper-parameters of Self-supervised Tasks

Table 10: Hyper-parameters of Subject Prediction

Table 11: Hyper-parameters of Important Words Identification

C Other Experiments

In addition to the ogbn-arxiv dataset used in this paper, we have conducted an extra experiment on the Open Academic Graph (OAG) dataset, which contains more than 178 million paper nodes and 2.236 billion edges. Each paper is labeled with a set of research topics/fields (e.g., Physics and Medicine), and the publication dates range from 1900 to 2019.

Table 12: Hyper-parameters of Update Architecture