Patton: Language Model Pretraining on Text-Rich Networks

A real-world text corpus sometimes comprises not only text documents, but also semantic links between them (e.g., academic papers in a bibliographic network are linked by citations and co-authorships). Text documents and semantic connections form a text-rich network, which empowers a wide range of downstream tasks such as classification and retrieval. However, pretraining methods for such structures are still lacking, making it difficult to build one generic model that can be adapted to various tasks on text-rich networks. Current pretraining objectives, such as masked language modeling, purely model texts and do not take inter-document structure information into consideration. To this end, we propose our PretrAining on TexT-Rich NetwOrk framework Patton. Patton includes two pretraining strategies: network-contextualized masked language modeling and masked node prediction, to capture the inherent dependency between textual attributes and network structure. We conduct experiments on four downstream tasks in five datasets from both academic and e-commerce domains, where Patton outperforms baselines significantly and consistently.


Introduction
Texts in the real world are often interconnected through links that can indicate their semantic relationships. For example, papers connected through citation links tend to be of similar topics; e-commerce items connected through co-viewed links usually have related functions. The texts and links together form a type of network called a text-rich network, where documents are represented as nodes and the edges reflect the links among documents. Given a text-rich network, people are usually interested in various downstream tasks (e.g., document/node classification, document retrieval, and link prediction) (Zhang et al., 2019; Wang et al., 2019; Jin et al., 2023a). For example, given a computer science academic network as context, it is intuitively appealing to automatically classify each paper (Kandimalla et al., 2021), find the authors of a new paper (Schulz et al., 2014), and provide paper recommendations (Küçüktunç et al., 2012). In such cases, pretraining a language model on a given text-rich network, which can benefit a great number of downstream tasks inside this network, is in high demand (Hu et al., 2020b). Code is available at https://github.com/PeterGriffinJin/Patton.
While there have been abundant studies on building generic pretrained language models (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020), they are mostly designed for modeling texts exclusively and do not consider inter-document structures. Along another line of research, various network-based pretraining strategies have been proposed in the graph learning domain to take structure information into account (Hu et al., 2020a,b). Yet, they focus on pretraining graph neural networks rather than language models and cannot easily model the rich textual semantic information in the networks. To empower language model pretraining with network signals, LinkBERT (Yasunaga et al., 2022) is a pioneering study that puts two linked text segments together during pretraining so that they can serve as each other's context. However, it simplifies the complex network structure into node pairs and does not model higher-order signals (Yang et al., 2021). Overall, both existing language model pretraining methods and graph pretraining methods fail to capture the rich contextualized textual semantic information hidden inside the complex network structure.
To effectively extract the contextualized semantic information, we propose to view the knowledge encoded inside the complex network structure from two perspectives: the token level and the document level. At the token level, neighboring documents can help facilitate the understanding of tokens. For example, in Figure 1, based on the text information of neighbors, we can know that the "Dove" at the top refers to a personal care brand, while the "Dove" at the bottom is a chocolate brand. At the document level, two connected nodes can have quite related overall textual semantics. For example, in Figure 1, the chocolate from "Hershey's" should have some similarity with the chocolate from "Ferrero".
Absorbing such two-level hints in pretraining can help language models produce more effective representations which can be generalized to various downstream tasks.
To this end, we propose PATTON, a method to continuously pretrain language models on a given text-rich network. The key idea of PATTON is to leverage both textual information and network structure information to consolidate the pretrained language model's ability to understand tokens and documents. Building on this idea, we propose two pretraining strategies: 1) Network-contextualized masked language modeling: we randomly mask several tokens within each node and train the language model to predict those masked tokens based on both in-node tokens and network neighbors' tokens. 2) Masked node prediction: we randomly mask some nodes inside the network and train the language model to correctly identify the masked nodes based on the neighbors' textual information.
We evaluate PATTON on both academic domain networks and e-commerce domain networks. To comprehensively understand how the proposed pretraining strategies influence different downstream tasks, we conduct experiments on classification, retrieval, reranking, and link prediction.
In summary, our contributions are as follows:
• We propose the problem of language model pretraining on text-rich networks.
• We design two strategies, network-contextualized MLM and masked node prediction, to train the language model to extract both token-level and document-level semantic correlations hidden inside the complex network structure.
• We conduct experiments on four downstream tasks in five datasets from different domains, where PATTON outperforms pure text/graph pretraining baselines significantly and consistently.

Preliminaries
Definition 2.1 (Text-Rich Networks) (Yang et al., 2021; Jin et al., 2023b). A text-rich network can be denoted as G = (V, E, D), where V, E, and D are the node set, edge set, and text set, respectively. Each v_i ∈ V is associated with some textual information d_i ∈ D.

Model Architecture
To jointly leverage text and network information in pretraining, we adopt the GNN-nested Transformer architecture (called GraphFormers) proposed in Yang et al. (2021). In this architecture, GNN modules are inserted between Transformer layers. The forward pass of each GraphFormers layer is as follows.
$$z_x^{(l)} = \mathrm{GNN}\big(\{H_{x',[\mathrm{CLS}]}^{(l)}\}_{x' \in \{x\} \cup N_x}\big), \quad \tilde{H}_x^{(l)} = \mathrm{LN}\big(H_x^{(l)} + \mathrm{MHA}_{\mathrm{asy}}\big([z_x^{(l)}; H_x^{(l)}]\big)\big), \quad H_x^{(l+1)} = \mathrm{LN}\big(\tilde{H}_x^{(l)} + \mathrm{MLP}(\tilde{H}_x^{(l)})\big),$$
where $H_x^{(l)}$ is the token hidden states in the $l$-th layer for node $x$, $z_x^{(l)}$ is the network-aggregated node embedding, $N_x$ is the network neighbor set of $x$, $\mathrm{LN}$ is the layer normalization operation, and $\mathrm{MHA}_{\mathrm{asy}}$ is the asymmetric multi-head attention operation, whose queries come from the node's own tokens while keys and values additionally attend to $z_x^{(l)}$. For more details, one can refer to Yang et al. (2021).
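For concreteness, the layer above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the exact GraphFormers implementation: the mean pooling over [CLS] states stands in for the real GNN module, and all class/variable names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class GraphFormersLayer(nn.Module):
    """Sketch of one GNN-nested Transformer layer (illustrative names)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (num_nodes, seq_len, d_model); position 0 holds [CLS].
        # 1) Graph aggregation over the nodes' [CLS] states (mean pooling
        #    here stands in for a real GNN over the neighbor set).
        z = H[:, 0, :].mean(dim=0, keepdim=True).expand(H.size(0), 1, -1)
        # 2) Asymmetric MHA: queries are the node's own tokens; keys and
        #    values are the graph token z prepended to the token sequence.
        kv = torch.cat([z, H], dim=1)
        attn_out, _ = self.mha(H, kv, kv)
        H = self.ln1(H + attn_out)
        # 3) Token-wise feed-forward with residual connection.
        return self.ln2(H + self.mlp(H))

layer = GraphFormersLayer(d_model=64, n_heads=4)
out = layer(torch.randn(3, 8, 64))  # center node + 2 neighbors, 8 tokens each
print(out.shape)  # torch.Size([3, 8, 64])
```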

Pretraining PATTON
We propose two strategies to help the language model understand text semantics at both the token level and the document level collaboratively from the network structure. The first strategy focuses on token-level semantics learning, namely network-contextualized masked language modeling; the second strategy emphasizes document-level semantics learning, namely masked node prediction.

Strategy 1: Network-contextualized Masked Language Modeling (NMLM). Masked language modeling (MLM) is a commonly used strategy for language model pretraining (Devlin et al., 2019; Liu et al., 2019) and domain adaptation (Gururangan et al., 2020). It randomly masks several tokens in the text sequence and utilizes the surrounding unmasked tokens to predict them. The underlying assumption is that the semantics of each token can be reflected by its contexts. Trained to conduct masked token prediction, the language model learns to understand the semantic correlation between tokens and to capture contextualized semantic signals. The MLM objective is
$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{w_i \in M_t} \log p(w_i \mid H_i),$$
where $M_t$ is the subset of tokens that are replaced by a special [MASK] token and $p(w_i \mid H_i)$ is the output probability of a linear head $f_{\mathrm{head}}$, which predicts $w_i$ (from the vocabulary $W$) based on the contextualized token hidden states $H_i$.
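The masking-and-prediction loop can be sketched as follows. The toy encoder and head below are stand-ins (the real model uses the network-contextualized encoder of Section 3.1), and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlm_step(token_ids, encoder, head, mask_id, mask_prob=0.15):
    """One MLM training step (sketch with illustrative names)."""
    # 1) Sample positions to mask; keep originals as labels.
    mask = torch.rand(token_ids.shape) < mask_prob
    mask[0, 0] = True                              # ensure >= 1 masked token
    labels = token_ids.masked_fill(~mask, -100)    # -100 = ignored by loss
    corrupted = token_ids.masked_fill(mask, mask_id)
    # 2) Encode the corrupted sequence and predict p(w_i | H_i) per position.
    logits = head(encoder(corrupted))              # (batch, seq, vocab)
    # 3) Cross-entropy on masked positions only.
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                           ignore_index=-100)

vocab, d = 100, 32
encoder = nn.Sequential(nn.Embedding(vocab, d), nn.Linear(d, d))  # toy encoder
head = nn.Linear(d, vocab)                                        # f_head
loss = mlm_step(torch.randint(1, vocab, (4, 16)), encoder, head, mask_id=0)
loss.backward()
```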
Such token correlation and contextualized semantic signals also exist, and are even stronger, in text-rich networks. Text from adjacent nodes in the network can provide auxiliary context for text semantics understanding. For example, given a paper about "Transformers" and its neighboring (cited) papers in an academic network on machine learning, we can infer that "Transformers" here is a deep learning model rather than an electrical engineering component by reading the text within both the given paper and the neighboring papers. In order to fully capture the textual semantic signals in the network, the language model needs to not only understand the in-node token correlation but also be aware of the cross-node semantic correlation.
We extend the original in-node MLM to network-contextualized MLM, so as to help the language model understand both in-node token correlation and network-contextualized textual semantic relatedness. The training objective is
$$\mathcal{L}_{\mathrm{NMLM}} = -\sum_{w_i \in M_t} \log p(w_i \mid H_x, z_x), \qquad p(w_i \mid H_x, z_x) = \frac{\exp(q_{w_i}^{\top} h_i)}{\sum_{w' \in W}\exp(q_{w'}^{\top} h_i)},$$
where $z_x$ denotes the network-contextualized token hidden states from Section 3.1, $L$ is the number of layers, $h_i = H^{(L)}_{x,i}$ is the final-layer hidden state of token $w_i$, and $q_{w_i}$ refers to the MLM prediction head for $w_i$. Since the calculation of $h_i$ is based on $H_x$ and $z_x$, the likelihood is conditioned on $H_x$ and $z_x$.

Strategy 2: Masked Node Prediction (MNP). While network-contextualized MLM focuses more on token-level semantics understanding, we propose a new strategy called "masked node prediction", which helps the language model understand the underlying document-level semantic correlation hidden in the network structure.
Concretely, we dynamically hold out a subset of nodes from the network ($M_v \subseteq V$), mask them, and train the language model to predict the masked nodes based on the adjacent network structure:
$$\mathcal{L}_{\mathrm{MNP}} = -\sum_{v_j \in M_v} \log p\big(v_j \mid \{h_{v_k}\}_{v_k \in N_{v_j}}\big),$$
where $\{h_{v_k}\}$ are the hidden states of the neighbor nodes in the network and $N_{v_j}$ is the set of neighbors of $v_j$. In particular, we treat the last-layer hidden state of [CLS] as the node-level representation, i.e., $h_v = H^{(L)}_{v,[\mathrm{CLS}]}$. By performing this task, the language model will absorb document-level semantic hints hidden inside the network structure (e.g., the contents of cited papers in an academic network can be quite semantically related, and the texts of co-viewed items in an e-commerce network can be highly associated).
However, directly optimizing masked node prediction can be computationally expensive, since one prediction requires calculating the representations of all neighboring nodes and all candidate nodes. To ease the computational overhead, we prove that the masked node prediction task can be theoretically transferred to a computationally cheaper pairwise link prediction task.

Theorem 3.2.1. Masked node prediction is equivalent to pairwise link prediction.

Proof: Given a set of masked nodes $M_v$, the likelihood of predicting the masked nodes is
$$\prod_{v_j \in M_v} p\big(v_j \mid \{h_{v_k}\}_{v_k \in N_{v_j}}\big) \propto \prod_{v_j \in M_v} p\big(\{v_k\}_{v_k \in N_{v_j}} \mid v_j\big) = \prod_{v_j \in M_v} \prod_{v_k \in N_{v_j}} p(v_k \mid v_j).$$
The first step relies on Bayes' rule together with the assumption that all nodes appear uniformly in the network, i.e., $p(v_j) = 1/|V|$; the second step uses the conditional independence assumption that neighboring nodes are generated independently given the center node. As a result, the masked node prediction objective can be simplified into a pairwise link prediction objective:
$$\mathcal{L}_{\mathrm{MNP}} = -\sum_{v_j \in M_v} \sum_{v_k \in N_{v_j}} \log \frac{\exp(h_{v_j}^{\top} h_{v_k})}{\exp(h_{v_j}^{\top} h_{v_k}) + \sum_{u} \exp(h_{v_j}^{\top} h_{v'_u})},$$
where $v'_u$ stands for a random negative sample. In our implementation, we use "in-batch negative samples" (Karpukhin et al., 2020) to reduce the encoding cost.
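With in-batch negatives, the pairwise objective reduces to a cross-entropy loss over a batch similarity matrix whose diagonal holds the positive pairs. A minimal PyTorch sketch (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def link_prediction_loss(h_center, h_neighbor):
    """Pairwise link prediction with in-batch negatives (sketch).

    h_center[i] and h_neighbor[i] are the [CLS] representations of the two
    endpoints of the i-th positive edge; every other row in the batch
    serves as a negative, as in Karpukhin et al. (2020).
    """
    scores = h_center @ h_neighbor.t()       # (B, B) similarity matrix
    targets = torch.arange(scores.size(0))   # positives sit on the diagonal
    return F.cross_entropy(scores, targets)

B, d = 8, 32
loss = link_prediction_loss(torch.randn(B, d), torch.randn(B, d))
```

When the two endpoint representations agree perfectly, the diagonal dominates and the loss approaches zero.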
Joint Pretraining. To pretrain PATTON, we optimize the NMLM objective and the MNP objective jointly:
$$\mathcal{L} = \mathcal{L}_{\mathrm{NMLM}} + \mathcal{L}_{\mathrm{MNP}}.$$
This joint objective unifies the effects of NMLM and MNP, encouraging the model to conduct network-contextualized token-level understanding and network-enhanced document-level understanding, facilitating the joint modeling of texts and network structures. We will show in Section 4.6 that the joint objective achieves superior performance compared with using either objective alone.

Finetuning PATTON
Last, we describe how to finetune PATTON for downstream tasks that involve encoding text in the network and text not in the network. For text in the network (and thus with neighbor information), we feed both the node text sequence and the neighbor text sequences into the model; for text not in the network (where neighbor information is unavailable), we feed the text sequence into the model and leave the neighbor text sequences blank. In both cases, the final-layer hidden state of [CLS] is used as the text representation, following Devlin et al. (2019) and Liu et al. (2019).

Experiments

Datasets. We conduct experiments on academic networks from the Microsoft Academic Graph (MAG) (Sinha et al., 2015) and e-commerce networks from Amazon (McAuley et al., 2015). In academic networks, nodes are papers and there is an edge between two papers if one cites the other; in e-commerce networks, nodes correspond to items, and item nodes are linked if they are frequently co-viewed by users. Since MAG and Amazon both cover multiple domains, we select three domains from MAG and two from Amazon, for five datasets in total (MAG-Mathematics, MAG-Geology, MAG-Economics, Amazon-Clothes, and Amazon-Sports). The statistics of all the datasets can be found in Table 1. Fine classes are all the categories in the network-associated node category taxonomy (the MAG taxonomy and the Amazon product catalog), while coarse classes are the categories at the first layer of the taxonomy.
Pretraining Setup. The model is trained for 5/10/30 epochs (depending on the size of the network) on 4 Nvidia A6000 GPUs with a total batch size of 512. We set the peak learning rate to 1e-5. NMLM pretraining uses the standard 15% [MASK] ratio. For our model and all baselines, we adopt a 12-layer architecture. More details can be found in Appendix A.
Baselines. We mainly compare our method with two kinds of baselines: off-the-shelf pretrained language models and language model continuous pretraining methods. The first category includes BERT (Devlin et al., 2019), SciBERT (Beltagy et al., 2019), SPECTER (Cohan et al., 2020), SimCSE (Gao et al., 2021), LinkBERT (Yasunaga et al., 2022), and vanilla GraphFormers (Yang et al., 2021). BERT is a language model pretrained with masked language modeling and next sentence prediction objectives on Wikipedia and BookCorpus. SciBERT utilizes the same pretraining strategies as BERT but is trained on 1.14 million paper abstracts and full texts from Semantic Scholar. SPECTER is a language model continuously pretrained from SciBERT with a contrastive objective on 146K scientific papers. SimCSE is a contrastive learning framework; we experiment with models pretrained in both the unsupervised setting (Wikipedia) and the supervised setting (NLI). LinkBERT is a language model pretrained with masked language modeling and document relation prediction objectives on Wikipedia and BookCorpus. GraphFormers is a GNN-nested Transformer, and we initialize it with the BERT checkpoint for a fair comparison. The second category includes several continuous pretraining methods (Gururangan et al., 2020; Gao et al., 2021). We perform continuous masked language modeling starting from the BERT checkpoint (denoted as BERT.MLM) and the SciBERT checkpoint (denoted as SciBERT.MLM) on our data, respectively. We also perform in-domain supervised contrastive pretraining with the method proposed in Gao et al. (2021) (denoted as SimCSE.in-domain).

Ablation Setup. For academic networks, we pretrain our model starting from the BERT-base checkpoint (PATTON) and the SciBERT checkpoint (SciPATTON), respectively; for e-commerce networks, we pretrain our model from BERT-base only (PATTON). Furthermore, we conduct ablation studies to validate the effectiveness of both the NMLM and the MNP strategies. The pretrained model with NMLM removed and that with MNP removed are called "w/o NMLM" and "w/o MNP", respectively. In academic networks, the ablation study is done on SciPATTON, while in e-commerce networks, it is done on PATTON.
We demonstrate the effectiveness of our framework on four downstream tasks, including classification, retrieval, reranking, and link prediction.

Classification
In this section, we conduct experiments on 8-shot coarse-grained classification for nodes in the networks. We use the final-layer hidden state of the [CLS] token from the language model as the node representation and feed it into a linear classifier to obtain the prediction. Both the language model and the classifier are finetuned. The experimental results are shown in Table 2.
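The classification setup can be sketched as follows; the toy embedding stands in for the pretrained encoder, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NodeClassifier(nn.Module):
    """Linear classifier over the final-layer [CLS] state (sketch)."""
    def __init__(self, encoder, d_model, num_classes):
        super().__init__()
        self.encoder = encoder          # pretrained model; also finetuned
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)   # (batch, seq, d_model)
        cls = hidden[:, 0, :]              # [CLS] is the first position
        return self.head(cls)              # (batch, num_classes)

toy_encoder = nn.Embedding(100, 32)        # stand-in: ids -> (batch, seq, 32)
model = NodeClassifier(toy_encoder, d_model=32, num_classes=7)
logits = model(torch.randint(0, 100, (4, 16)))
print(logits.shape)  # torch.Size([4, 7])
```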

Retrieval
The retrieval task corresponds to 16-shot fine-grained category retrieval: given a node, we want to retrieve category names for it from a very large label space. We follow the widely used DPR (Karpukhin et al., 2020) pipeline to finetune all the models. In particular, the final-layer hidden states of the [CLS] token are utilized as dense representations for both nodes and label names. Negative samples retrieved by BM25 are used as hard negatives. The results are shown in Table 3. From the results, we make the following observations: 1) PATTON and SciPATTON consistently outperform all baseline methods; 2) continuously pretrained models can be better than off-the-shelf PLMs in many cases (SciBERT and SPECTER perform well on Mathematics and Economics since their pretraining corpora include a large number of Computer Science papers, which are semantically close to Mathematics and Economics papers) and can largely outperform traditional BM25. More detailed information on the task can be found in Appendix C.
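The DPR-style finetuning step can be sketched as follows, assuming precomputed [CLS] embeddings for the query nodes, their positive label names, and the BM25 hard negatives (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpr_loss(q_emb, pos_emb, hard_neg_emb):
    """DPR-style training step (sketch).

    Each query scores its own positive label name against the other
    queries' positives (in-batch negatives) plus BM25 hard negatives.
    """
    # Candidate pool: all batch positives + hard negatives.
    candidates = torch.cat([pos_emb, hard_neg_emb], dim=0)  # (B + H, d)
    scores = q_emb @ candidates.t()                         # (B, B + H)
    targets = torch.arange(q_emb.size(0))                   # own positive
    return F.cross_entropy(scores, targets)

B, H, d = 4, 16, 32
loss = dpr_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(H, d))
```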

Reranking
The reranking task corresponds to 32-shot fine-grained category reranking. We first adopt BM25 (Robertson et al., 2009) and exact matching as the retriever to obtain a candidate category name list for each node. Then, the models are asked to rerank all the categories in the list based on their similarity to the given node text. The way to encode the node and category names is the same as that in retrieval. Unlike retrieval, reranking tests the ability of the language model to distinguish among candidate categories at a fine-grained level. The results are shown in Table 4. From the results, we find that PATTON and SciPATTON consistently outperform all baseline methods, demonstrating that our pretraining strategies allow the language model to better understand fine-grained semantic similarity. More detailed information on the task can be found in Appendix D.
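Once node and category-name embeddings are computed, reranking reduces to sorting the candidate list by similarity to the node representation. A minimal sketch (names and toy vectors are illustrative):

```python
import torch

def rerank(query_emb, candidate_embs, candidate_names):
    """Rerank a retrieved candidate list by dot-product similarity (sketch)."""
    scores = candidate_embs @ query_emb              # (num_candidates,)
    order = torch.argsort(scores, descending=True)   # best candidate first
    return [candidate_names[i] for i in order.tolist()]

q = torch.tensor([1.0, 0.0])
cands = torch.tensor([[0.1, 0.9], [0.9, 0.1], [0.5, 0.5]])
print(rerank(q, cands, ["cat_a", "cat_b", "cat_c"]))
# ['cat_b', 'cat_c', 'cat_a']
```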

Link Prediction
In this section, we perform 32-shot link prediction for nodes in the network. Language models are asked to predict whether there should be an edge between two nodes. It is worth noting that the edge semantics here ("author overlap" for academic networks and "co-purchased" for e-commerce networks) are different from those in pretraining ("citation" for academic networks and "co-viewed" for e-commerce networks). We utilize the final-layer [CLS] token hidden state as the node representation and conduct in-batch evaluations. The results are shown in Table 5. From the results, we find that PATTON and SciPATTON outperform baselines and ablations in most cases, which shows that our pretraining strategies help the language model extract knowledge from the pretraining text-rich network and apply it to the new link type. More detailed information on the task can be found in Appendix E.

Table 4: Experiment results on Reranking. We show the mean ± std of three runs for all the methods.
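The in-batch evaluation above can be sketched as follows: each node's true endpoint competes against every other endpoint in the batch, and the ranking is scored with, e.g., mean reciprocal rank (the metric choice and names here are illustrative):

```python
import torch

def in_batch_mrr(h_src, h_dst):
    """In-batch link prediction evaluation (sketch).

    Node i's true endpoint is h_dst[i]; the other rows act as distractors.
    Returns the mean reciprocal rank of the true endpoint.
    """
    scores = h_src @ h_dst.t()                               # (B, B)
    # Rank of the diagonal entry within each row (1 = best).
    ranks = (scores >= scores.diag().unsqueeze(1)).sum(dim=1)
    return (1.0 / ranks.float()).mean().item()

ident = torch.eye(4)                # perfectly matched pairs
print(in_batch_mrr(ident, ident))   # 1.0
```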

Ablation Study
We perform ablation studies to validate the effectiveness of the two strategies in Tables 2-5. The full method is better than each ablation version in most cases, except for R@100 on Economics retrieval, NDCG@10 on Sports reranking, and link prediction on the Amazon datasets, which indicates the importance of both strategies.

Pretraining Step Study
We conduct an experiment on the Sports dataset to study how checkpoints at different pretraining steps perform on downstream tasks. The result is shown in Figure 3. From the figure, we find that: 1) The downstream performance on retrieval, reranking, and link prediction generally improves as the number of pretraining steps increases. This means that, with more pretraining steps, the language model learns more knowledge from the pretraining text-rich network that benefits these downstream tasks.
2) The downstream performance on classification first increases and then decreases. The reason is that, when pretrained for too long, the language model may overfit the given text-rich network, which hurts downstream classification performance.

Scalability Study
We run an experiment on Sports to study the time complexity and memory complexity of the proposed pretraining strategies. The model is pretrained for 10 epochs on four Nvidia A6000 GPUs with a total training batch size of 512.
We show the result in Table 6. From the result, we find that: 1) pretraining with the MNP strategy is faster and cheaper in memory than pretraining with the NMLM strategy; 2) combining the two strategies does not increase the time or memory complexity much compared with NMLM pretraining alone.
Further model studies on finetuning data size can be found in Appendix F.

Attention Map Study
We conduct a case study by showing attention maps of PATTON and the model without pretraining on the four downstream tasks on Sports. We randomly pick a token from a random sample and plot the self-attention probabilities of how different tokens (x-axis), including the neighbor virtual token ([n_CLS]) and the first eight original text tokens ([tk_x]), contribute to the encoding of this token at different layers (y-axis). The result is shown in Figure 4. From the result, we find that the neighbor virtual token is largely deactivated for the model without pretraining, which means that information from neighbors is not fully utilized during encoding. After pretraining, the neighbor virtual token becomes more activated, bringing more useful information from neighbors to enhance center-node text encoding.

Figure 3: Pretraining step study on Amazon-Sports. The downstream performance on retrieval, reranking, and link prediction generally improves when pretrained for longer, while the performance on classification improves and then drops.
Related Work

Pretrained Language Models
Pretrained language models have been very successful in natural language processing since they were introduced (Peters et al., 2018; Devlin et al., 2019). Follow-up research has made them stronger by scaling them up from millions of parameters (Yang et al., 2019; Lewis et al., 2020; Clark et al., 2020) to even trillions (Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020). These models have also been improved through different training objectives, including masked language modeling (Devlin et al., 2019), auto-regressive causal language modeling (Brown et al., 2020), permutation language modeling (Yang et al., 2019), discriminative language modeling (Clark et al., 2020), correcting and contrasting (Meng et al., 2021), and document relation modeling (Yasunaga et al., 2022). However, most of them are designed for modeling texts exclusively and do not consider inter-document structures.
In this paper, we design strategies to capture the semantic hints hidden inside complex document networks.

Domain Adaptation in NLP
Large language models have demonstrated their power in various NLP tasks. However, their performance under domain shift is quite constrained (Ramponi and Plank, 2020). To overcome the negative effect of domain shift, continuous pretraining has been proposed in recent work (Gururangan et al., 2020), which can be further categorized into domain-adaptive pretraining (Han and Eisenstein, 2019) and task-specific pretraining (Howard and Ruder, 2018). However, existing works mainly focus on continuous pretraining based on textual information, while our work conducts pretraining utilizing the textual signal and the network structure signal simultaneously.

Pretraining on Graphs
Inspired by the recent success of pretrained language models, researchers have started to explore pretraining strategies for graph neural networks (Hu et al., 2020b; Qiu et al., 2020; Hu et al., 2020a). Well-known strategies include graph autoregressive modeling (Hu et al., 2020b), masked component modeling (Hu et al., 2020a), graph context prediction (Hu et al., 2020a), and contrastive pretraining (Qiu et al., 2020; Velickovic et al., 2019; Sun et al., 2020). These works pretrain graph neural networks using network structure information and do not consider the associated rich textual signal. In contrast, our work pretrains the language model, adopting both textual information and network structure information.

Conclusions
In this work, we introduce PATTON, a method to pretrain language models on text-rich networks. PATTON consists of two objectives: (1) a network-contextualized MLM pretraining objective and (2) a masked node prediction objective, to capture the rich semantic information hidden inside the complex network structure. We conduct experiments on four downstream tasks and five datasets from two different domains, where PATTON outperforms baselines significantly and consistently.

Limitations
In this work, we mainly focus on language model pretraining on homogeneous text-rich networks and explore how pretraining can benefit classification, retrieval, reranking, and link prediction. Interesting future studies include 1) researching how to conduct pretraining on heterogeneous text-rich networks and how to characterize edges of different semantics; and 2) exploring how pretraining can benefit broader task spaces, including summarization and question answering.

Ethics Statement
While it has been shown that PLMs are powerful in language understanding (Devlin et al., 2019; Lewis et al., 2020; Raffel et al., 2020), there are studies highlighting their drawbacks, such as the presence of social bias (Liang et al., 2021) and misinformation (Abid et al., 2021). In our work, we focus on pretraining PLMs with information from inter-document structures, which could be a way to mitigate bias and eliminate contained misinformation.

A Pretrain Settings
To facilitate the reproduction of our pretraining experiments, we provide the hyperparameter configuration in Table 7. All reported continuous pretraining and in-domain pretraining methods use exactly the same set of hyperparameters for pretraining for a fair comparison. All GraphFormers-based (Yang et al., 2021) methods have the neighbor sampling number set to 5. Paper titles and item titles are used as the text associated with the nodes in the two kinds of networks, respectively. (For some items, we concatenate the item title and description, since the title alone is too short.) Since most paper titles (88%) and item titles (97%) are within 32 tokens, we set the max length of the input sequence to 32. The models are trained for 5/10/30 epochs (depending on the size of the network) on 4 Nvidia A6000 GPUs with a total batch size of 512. The total time cost is around 24 hours for each network. Code is available at https://github.com/PeterGriffinJin/Patton.

B Classification
Task. The coarse-grained category names for academic networks and e-commerce networks are the first-level category names in the network-associated category taxonomy. We train all the methods in the 8-shot setting (8 labeled training samples and 8 labeled validation samples per class) and test the models with hundreds of thousands of new query nodes (220,681, 215,148, 85,346, 477,700, and 129,669 for Mathematics, Geology, Economics, Clothes, and Sports, respectively). Detailed information on all category names can be found in Tables 8-12.

C Retrieval
Task. The retrieval task corresponds to fine-grained category retrieval. Given a node in the network, we aim to retrieve its fine-grained labels from a large label space. We train all the compared methods in the 16-shot setting (16 labeled queries in total) and test the models with tens of thousands of new query nodes (38,006, 33,440, 14,577, 95,731, and 34,979 for Mathematics, Geology, Economics, Clothes, and Sports, respectively). The fine-grained label spaces for both academic networks and e-commerce networks are constructed from all the labels in the network-associated taxonomy. The statistics of the label space for all networks can be found in Table 1.
Finetuning Settings. We finetune the models with the widely used DPR pipeline (Karpukhin et al., 2020). All reported methods use exactly the same set of hyperparameters for finetuning for a fair comparison. The median results of three runs with the same set of three different random seeds are reported. For all the methods, we finetune the model for 1,000 epochs with the training data. The peak learning rate is 1e-5, with the first 10% of steps as warm-up steps. The training batch size is 128.
The number of hard BM25 negative samples is set as 4. We utilize the faiss library to perform an approximate search for nearest neighbors. The experiments are carried out on one Nvidia A6000 GPU.
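For illustration, the retrieval step amounts to inner-product search over label-name embeddings. The NumPy brute-force version below shows the retrieval semantics that faiss's indexes provide at scale (function and variable names are illustrative):

```python
import numpy as np

def top_k_inner_product(query, corpus, k=2):
    """Brute-force inner-product nearest-neighbor search (sketch).

    In the paper's setup, faiss performs this search approximately over a
    large label space; this exact NumPy version has the same semantics.
    """
    scores = corpus @ query            # (num_labels,) similarity scores
    idx = np.argsort(-scores)[:k]      # indices of the top-k scores
    return idx, scores[idx]

corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx, sc = top_k_inner_product(np.array([1.0, 0.2]), corpus, k=2)
print(idx)  # [0 2]
```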

D Reranking
Task. The reranking task corresponds to fine-grained category reranking. Given a retrieved category list for the query node, we aim to rerank all categories within the list. We train all the methods in the 32-shot setting (32 training queries and 32 validation queries) and test the models with 10,000 new query node and candidate list pairs. The category space in reranking is the same as that in retrieval. In our experiment, the retrieved category list is constructed with BM25 and exact matching of category names.
Finetuning Settings. All reported methods use exactly the same set of hyperparameters for finetuning for a fair comparison.

Figure 1:
Figure 1: An illustration of a text-rich network (a product item co-viewed network). At the token level, from network neighbors, we can know that the "Dove" at the top is a personal care brand and the "Dove" at the bottom is a chocolate brand. At the document level, referring to the edge in the middle, we can learn that the chocolate from "Hershey's" should have some similarity with the chocolate from "Ferrero".
For example, in an academic citation network, v ∈ V are papers, e ∈ E are citation edges, and d ∈ D are the contents of the papers. In this paper, we mainly focus on networks where the edges provide semantic correlation between texts (nodes). For example, in a citation network, connected (cited) papers are likely to be semantically similar.

Problem Definition (Language Model Pretraining on Text-Rich Networks). Given a text-rich network G = (V, E, D), the task is to capture the self-supervised signal on G and obtain a G-adapted language model M_G. The resulting language model M_G can be further finetuned on downstream tasks in G, such as classification, retrieval, reranking, and link prediction, with only a few labels.

Figure 2:
Figure 2: Overall pretraining and finetuning procedures for PATTON. We have two pretraining strategies: network-contextualized masked language modeling (NMLM) and masked node prediction (MNP). Apart from the output layers, the same architecture is used in both pretraining and finetuning (12 layers in our experiments). The same pretrained model parameters are used to initialize models for different downstream tasks. During finetuning, all parameters are updated.

Figure 4:
Figure 4: Attention map study on Amazon-Sports. [n_CLS] refers to the network neighbor virtual token and [tk_x]s refer to word tokens. [n_CLS] is more activated after pretraining (PATTON), which means that more useful information from network neighbors is extracted to enhance center-node text encoding.

Table 2: Experiment results on Classification.
From the results, we can find that continuously pretrained models (BERT.MLM, SciBERT.MLM, SimCSE.in-domain, PATTON, and SciPATTON) can have better performance than off-the-shelf PLMs, which demonstrates that domain shift exists between the pretrained PLM domain and the target domain, and that adaptive pretraining on the target domain is necessary. More detailed information on the task can be found in Appendix B.

Table 5: Experiment results on Link Prediction. We show the mean ± std of three runs for all the methods.

Table 6: Time scalability and memory scalability study on Amazon-Sports.
Finetuning Settings. All reported methods use exactly the same set of hyperparameters for finetuning for a fair comparison. The median results of three runs with the same set of three different random seeds are reported. For all the methods, we finetune the model for 500 epochs in total. The peak learning rate is 1e-5, with the first 10% of steps as warm-up steps. The training batch size and the validation batch size are both 256. During training, we validate the model every 25 steps, and the best checkpoint is used to perform prediction on the test set. The experiments are carried out on one Nvidia A6000 GPU.