LinkBERT: Pretraining Language Models with Document Links

Language model (LM) pretraining captures various knowledge from text corpora, helping downstream tasks. However, existing methods such as BERT model a single document, and do not capture dependencies or knowledge that span across documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our new proposal, document relation prediction. We show that LinkBERT outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and biomedical domain (pretrained on PubMed with citation links). LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets new states of the art on various BioNLP tasks (+7% on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, as well as code and data.


Introduction
Pretrained language models (LMs), like BERT and GPTs (Devlin et al., 2019; Brown et al., 2020), have shown remarkable performance on many natural language processing (NLP) tasks, such as text classification and question answering, becoming the foundation of modern NLP systems (Bommasani et al., 2021). By performing self-supervised learning, such as masked language modeling (Devlin et al., 2019), LMs learn to encode various knowledge from text corpora and produce informative representations for downstream tasks (Petroni et al., 2019; Bosselut et al., 2019; Raffel et al., 2020).
Figure 1: Document links (e.g. hyperlinks) can provide salient multi-hop knowledge. For instance, the Wikipedia article "Tidal Basin" (left) describes that the basin hosts "National Cherry Blossom Festival". The hyperlinked article (right) reveals that the festival celebrates "Japanese cherry trees". Taken together, the link suggests new knowledge not available in a single document (e.g. "Tidal Basin has Japanese cherry trees"), which can be useful for various applications, including answering a question "What trees can you see at Tidal Basin?". We aim to leverage document links to incorporate more knowledge into language model pretraining.
However, existing LM pretraining methods typically consider text from a single document in each input context (Liu et al., 2019; Joshi et al., 2020) and do not model links between documents. This can pose limitations because documents often have rich dependencies (e.g. hyperlinks, references), and knowledge can span across documents. As an example, in Figure 1, the Wikipedia article "Tidal Basin, Washington D.C." (left) describes that the basin hosts "National Cherry Blossom Festival", and the hyperlinked article (right) reveals the background that the festival celebrates "Japanese cherry trees". Taken together, the hyperlink offers new, multi-hop knowledge "Tidal Basin has Japanese cherry trees", which is not available in the single article "Tidal Basin" alone. Acquiring such multi-hop knowledge in pretraining could be useful for various applications including question answering. In fact, document links like hyperlinks and references are ubiquitous (e.g. web, books, scientific literature), and guide how we humans acquire knowledge and even make discoveries (Margolis et al., 1999).
Figure 2: Overview of our approach, LinkBERT. Given a pretraining corpus, we view it as a graph of documents, with links such as hyperlinks (§4.1). To incorporate the document link knowledge into LM pretraining, we create LM inputs by placing a pair of linked documents in the same context (linked), besides the existing options of placing a single document (contiguous) or a pair of random documents (random) as in BERT. We then train the LM with two self-supervised objectives: masked language modeling (MLM), which predicts masked tokens in the input, and document relation prediction (DRP), which classifies the relation of the two text segments in the input (contiguous, random, or linked) (§4.2).
In this work, we propose LinkBERT, an effective language model pretraining method that incorporates document link knowledge. Given a text corpus, we obtain links between documents such as hyperlinks, and create LM inputs by placing linked documents in the same context, besides the existing option of placing a single document or random documents as in BERT. Specifically, as in Figure 2, after sampling an anchor text segment, we place either (1) the contiguous segment from the same document, (2) a random document, or (3) a document linked from the anchor segment, as the next segment in the input. We then train the LM with two joint objectives. We use masked language modeling (MLM) to encourage learning multi-hop knowledge of concepts brought into the same context by document links (e.g. "Tidal Basin" and "Japanese cherry" in Figure 1). Simultaneously, we propose a Document Relation Prediction (DRP) objective, which classifies the relation of the second segment to the first segment (contiguous, random, or linked). DRP encourages learning the relevance and bridging concepts (e.g. "National Cherry Blossom Festival") between documents, beyond the ability learned in the vanilla next sentence prediction objective in BERT.
Viewing the pretraining corpus as a graph of documents, LinkBERT is also motivated as self-supervised learning on the graph, where DRP and MLM correspond to link prediction and node feature prediction in graph machine learning (Yang et al., 2015; Hu et al., 2020). Our modeling approach thus provides a natural fusion of language-based and graph-based self-supervised learning.
We train LinkBERT in two domains: the general domain, using Wikipedia articles with hyperlinks (§4), and the biomedical domain, using PubMed articles with citation links (§6). We then evaluate the pretrained models on a wide range of downstream tasks such as question answering, in both domains.
LinkBERT consistently improves on baseline LMs across domains and tasks. For the general domain, LinkBERT outperforms BERT on the MRQA benchmark (+4% absolute in F1-score) as well as the GLUE benchmark. For the biomedical domain, LinkBERT exceeds PubmedBERT (Gu et al., 2020) and sets new states of the art on the BLURB biomedical NLP benchmark (+3% absolute in BLURB score) and the MedQA-USMLE reasoning task (+7% absolute in accuracy). Overall, LinkBERT attains notably large gains for multi-hop reasoning, multi-document understanding, and few-shot question answering, suggesting that LinkBERT internalizes significantly more knowledge than existing LMs by pretraining with document link information.

Related work
Retrieval-augmented LMs. Several works (Lewis et al., 2020b; Karpukhin et al., 2020; Oguz et al., 2020; Xie et al., 2022) introduce a retrieval module for LMs, where given an anchor text (e.g. question), retrieved text is added to the same LM context to improve model inference (e.g. answer prediction). These works show the promise of placing related documents in the same LM context at inference time, but they do not study the effect of doing so in pretraining. Guu et al. (2020) pretrain an LM with a retriever that learns to retrieve text for answering masked tokens in the anchor text. In contrast, our focus is not on retrieval, but on pretraining a general-purpose LM that internalizes knowledge that spans across documents, which is orthogonal to the above works (e.g., our pretrained LM could be used to initialize the LM component of these works). Additionally, we focus on incorporating document links such as hyperlinks, which can offer salient knowledge that common lexical retrieval methods may not provide (Asai et al., 2020).
Pretraining LMs with related documents. Several concurrent works use multiple related documents to pretrain LMs. Caciularu et al. (2021) place documents (news articles) about the same topic into the same LM context, and Levine et al. (2021) place sentences of high lexical similarity into the same context. Our work provides a general method to incorporate document links into LM pretraining, where lexical or topical similarity can be one instance of document links, besides hyperlinks. We focus on hyperlinks in this work, because we find they can bring in salient knowledge that may not be obvious via lexical similarity, and yield a more performant LM (§5.5). Additionally, we propose the DRP objective, which improves modeling multiple documents and relations between them in LMs (§5.5).
Hyperlinks and citation links for NLP. Hyperlinks are often used to learn better retrieval models. Chang et al. (2020); Asai et al. (2020); Seonwoo et al. (2021) use Wikipedia hyperlinks to train retrievers for open-domain question answering. Ma et al. (2021) study various hyperlink-aware pretraining tasks for retrieval. While these works use hyperlinks to learn retrievers, we focus on using hyperlinks to create better context for learning general-purpose LMs. Separately, Calixto et al. (2021) use Wikipedia hyperlinks to learn multilingual LMs. Citation links are often used to improve summarization and recommendation of academic papers (Qazvinian and Radev, 2008; Yasunaga et al., 2019; Bhagavatula et al., 2018; Khadka et al., 2020; Cohan et al., 2020). Here we leverage citation networks to improve pretraining general-purpose LMs.
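Graph-augmented LMs. Several works augment LMs with graphs, typically knowledge graphs (KGs), where the nodes capture entities and the edges their relations. Zhang et al. (2019); He et al. (2020); Wang et al. (2021b) combine LM training with KG embeddings. Sun et al. (2020); Yasunaga et al. (2021); Zhang et al. (2022) combine LMs and graph neural networks (GNNs) to jointly train on text and KGs. Different from KGs, we use document graphs to learn knowledge that spans across documents.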

Preliminaries
A language model (LM) can be pretrained from a corpus of documents, $\mathcal{X} = \{X^{(i)}\}$. An LM is a composition of two functions, $f_{\text{head}}(f_{\text{enc}}(X))$, where the encoder $f_{\text{enc}}$ takes in a sequence of tokens $X = (x_1, x_2, \ldots, x_n)$ and produces a contextualized vector representation for each token, $(h_1, h_2, \ldots, h_n)$. The head $f_{\text{head}}$ uses these representations to perform self-supervised tasks in the pretraining step and to perform downstream tasks in the fine-tuning step. We build on BERT (Devlin et al., 2019), which pretrains an LM with the following two self-supervised tasks.
Masked language modeling (MLM). Given a sequence of tokens $X$, a subset of tokens $Y \subseteq X$ is masked, and the task is to predict the original tokens from the modified input. $Y$ accounts for 15% of the tokens in $X$; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are kept unchanged.
Next sentence prediction (NSP). The NSP task takes two text segments $(X_A, X_B)$ as input, and predicts whether $X_B$ is the direct continuation of $X_A$. Specifically, BERT first samples $X_A$ from the corpus, and then either (1) takes the next segment $X_B$ from the same document, or (2) samples $X_B$ from a random document in the corpus. The two segments are joined via special tokens to form an input instance, where the prediction target of [CLS] is whether $X_B$ indeed follows $X_A$ (contiguous or random).
In this work, we will further incorporate document link information into LM pretraining. Our approach (§4) will build on MLM and NSP.
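The MLM corruption rule described above (select 15% of positions; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged) can be sketched as follows. This is a minimal illustration, not BERT's actual implementation: the function name `mask_tokens`, the `vocab` argument, and the choice to select exactly `round(0.15 * n)` positions (at least one) are assumptions made for the example.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_rate=0.15, rng=None):
    """Return (corrupted_tokens, labels) under the 15% / 80-10-10 rule.

    labels[i] is the original token at selected positions and None elsewhere.
    Assumption: exactly round(mask_rate * n) positions are selected (min 1).
    """
    rng = rng or random.Random()
    if not tokens:
        return [], []
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    num_selected = max(1, round(len(tokens) * mask_rate))
    for i in rng.sample(range(len(tokens)), num_selected):
        labels[i] = tokens[i]          # prediction target is the original token
        roll = rng.random()
        if roll < 0.8:                 # 80%: replace with [MASK]
            corrupted[i] = MASK_TOKEN
        elif roll < 0.9:               # 10%: replace with a random token
            corrupted[i] = rng.choice(vocab)
        # remaining 10%: keep the token unchanged
    return corrupted, labels
```

Positions whose label is None contribute nothing to the MLM loss; only the selected positions are predicted.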

LinkBERT
We present LinkBERT, a self-supervised pretraining approach that aims to internalize more knowledge into LMs using document link information. Specifically, as shown in Figure 2, instead of viewing the pretraining corpus as a set of documents $\mathcal{X} = \{X^{(i)}\}$, we view it as a graph of documents, $\mathcal{G} = (\mathcal{X}, \mathcal{E})$, where $\mathcal{E} = \{(X^{(i)}, X^{(j)})\}$ denotes links between documents (§4.1). The links can be existing hyperlinks, or could be built by other methods that capture document relevance. We then consider pretraining tasks for learning from document links (§4.2): we create LM inputs by placing linked documents in the same context window, besides the existing options of a single document or random documents. We use the MLM task to learn concepts brought together in the context by document links, and we also introduce the Document Relation Prediction (DRP) task to learn relations between documents. Finally, we discuss strategies for obtaining informative pairs of linked documents to feed into LM pretraining (§4.3).

Document graph
Given a pretraining corpus, we link related documents so that the links can bring together knowledge that is not available in single documents. We focus on hyperlinks, e.g., hyperlinks of Wikipedia articles (§5) and citation links of academic articles (§6). Hyperlinks have a number of advantages. They provide background knowledge about concepts that the document writers deemed useful: the links are likely to have high precision of relevance, and can also bring in relevant documents that may not be obvious via lexical similarity alone (e.g., in Figure 1, while the hyperlinked article mentions "Japanese" and "Yoshino" cherry trees, these words do not appear in the anchor article). Hyperlinks are also ubiquitous on the web and easily gathered at scale (Aghajanyan et al., 2021). To construct the document graph, we simply make a directed edge $(X^{(i)}, X^{(j)})$ if there is a hyperlink from document $X^{(i)}$ to document $X^{(j)}$.
For comparison, we also experiment with a document graph built by lexical similarity between documents. For each document $X^{(i)}$, we use the common TF-IDF cosine similarity metric (Chen et al., 2017; Yasunaga et al., 2017) to obtain the top-$k$ documents $X^{(j)}$ and make edges $(X^{(i)}, X^{(j)})$. We use $k = 5$.
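Both ways of building the document graph can be sketched in a few lines of pure Python. This is an illustrative sketch, not the code used for the experiments: the function names, the smoothed IDF variant, the rule of keeping only hyperlinks whose target is in the corpus, and the rule of dropping zero-similarity pairs are all assumptions.

```python
import math
from collections import Counter


def hyperlink_edges(outlinks):
    """outlinks: dict doc_id -> iterable of hyperlinked doc_ids.

    Returns the set of directed edges (src, dst). Assumption: links that
    point outside the corpus, and self-links, are dropped.
    """
    return {(src, dst) for src, dsts in outlinks.items()
            for dst in dsts if dst in outlinks and dst != src}


def tfidf_vectors(docs):
    """docs: dict doc_id -> list of tokens. Returns doc_id -> {term: weight}."""
    n = len(docs)
    df = Counter(term for toks in docs.values() for term in set(toks))
    # Smoothed IDF (an assumed variant; the text only says "TF-IDF").
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return {d: {t: c * idf[t] for t, c in Counter(toks).items()}
            for d, toks in docs.items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tfidf_edges(docs, k=5):
    """Directed edges from each document to its top-k most similar documents."""
    vecs = tfidf_vectors(docs)
    edges = set()
    for i, vi in vecs.items():
        scored = sorted(((cosine(vi, vj), j) for j, vj in vecs.items() if j != i),
                        reverse=True)
        edges.update((i, j) for score, j in scored[:k] if score > 0)
    return edges
```

The all-pairs comparison is quadratic in the number of documents, so at Wikipedia scale a sparse index or approximate nearest-neighbor search would be needed; the sketch only shows the edge definition.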

Pretraining tasks
Creating input instances. Several works (Gao et al., 2021; Levine et al., 2021) find that LMs can learn stronger dependencies between words that were shown together in the same context during training than between words that were not. To effectively learn knowledge that spans across documents, we create LM inputs by placing linked documents in the same context window, besides the existing option of a single document or random documents. Specifically, we first sample an anchor text segment from the corpus (Segment A; $X_A \subseteq X^{(i)}$). For the next segment (Segment B; $X_B$), we either (1) use the contiguous segment from the same document ($X_B \subseteq X^{(i)}$), (2) sample a segment from a random document ($X_B \subseteq X^{(j)}$ where $j \neq i$), or (3) sample a segment from one of the documents linked from Segment A ($X_B \subseteq X^{(j)}$ where $(X^{(i)}, X^{(j)}) \in \mathcal{E}$). We then join the two segments via special tokens to form an input instance: [CLS] $X_A$ [SEP] $X_B$ [SEP].
Training objectives. To train the LM, we use two objectives. The first is the MLM objective to encourage the LM to learn multi-hop knowledge of concepts brought into the same context by document links. The second objective, which we propose, is Document Relation Prediction (DRP), which classifies the relation $r$ of segment $X_B$ to segment $X_A$ ($r \in \{\text{contiguous}, \text{random}, \text{linked}\}$). By distinguishing linked from contiguous and random, DRP encourages the LM to learn the relevance and existence of bridging concepts between documents, besides the capability learned in the vanilla NSP objective.
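The three ways of choosing Segment B, together with the DRP label they produce, can be sketched as below. This is a simplified illustration under stated assumptions: documents are pre-split into token-list segments, links are stored per document rather than per anchor segment, every link target is in the corpus, and the names `make_instance`, `corpus`, and `links` are hypothetical.

```python
import random

RELATIONS = ("contiguous", "random", "linked")


def make_instance(corpus, links, rng, relation=None):
    """Build one input instance and its DRP label.

    corpus: dict doc_id -> list of segments (each a list of tokens).
    links:  dict doc_id -> list of linked doc_ids (assumed to be in corpus).
    Returns (tokens, relation) with tokens = [CLS] X_A [SEP] X_B [SEP].
    """
    relation = relation or rng.choice(RELATIONS)  # uniform over the three options
    if relation == "contiguous":
        # Anchor must have a following segment in the same document.
        doc = rng.choice([d for d, segs in corpus.items() if len(segs) >= 2])
        idx = rng.randrange(len(corpus[doc]) - 1)
        seg_a, seg_b = corpus[doc][idx], corpus[doc][idx + 1]
    elif relation == "random":
        doc = rng.choice(list(corpus))
        other = rng.choice([d for d in corpus if d != doc])
        seg_a, seg_b = rng.choice(corpus[doc]), rng.choice(corpus[other])
    else:  # "linked"
        doc = rng.choice([d for d in corpus if links.get(d)])
        other = rng.choice(links[doc])
        seg_a, seg_b = rng.choice(corpus[doc]), rng.choice(corpus[other])
    tokens = ["[CLS]", *seg_a, "[SEP]", *seg_b, "[SEP]"]
    return tokens, relation
```

A full pipeline would also truncate the pair to the context length and apply MLM corruption to the joined sequence; those steps are omitted here.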
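To predict $r$, we use the representation of the [CLS] token, as in NSP. Taken together, we optimize:
$$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{DRP}} = -\sum_i \log p(x_i \mid h_i) - \log p(r \mid h_{\texttt{[CLS]}})$$
where $x_i$ is each token of the input instance, [CLS] $X_A$ [SEP] $X_B$ [SEP], and $h_i$ is its representation.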
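As a numerical illustration of the joint objective, the sketch below adds an MLM cross-entropy term over the masked positions to a three-way DRP cross-entropy term computed from the [CLS] logits. It assumes an unweighted sum of the two terms and a sum (not a mean) over masked positions; the function names and the plain-list logit format are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def joint_loss(token_logits, mlm_labels, cls_logits, relation_id):
    """MLM loss plus DRP loss for one input instance.

    token_logits: per-position lists of vocabulary logits.
    mlm_labels:   per-position target vocab id, or None if not masked.
    cls_logits:   3 logits for (contiguous, random, linked).
    relation_id:  index of the true relation.
    """
    mlm = -sum(log_softmax(logits)[y]
               for logits, y in zip(token_logits, mlm_labels)
               if y is not None)
    drp = -log_softmax(cls_logits)[relation_id]
    return mlm + drp
```

With uniform logits, one masked position over a 4-word vocabulary contributes log 4 and the relation classifier contributes log 3, which gives a quick sanity check for the function.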
Graph machine learning perspective. Our two pretraining tasks, MLM and DRP, are also motivated as graph self-supervised learning on the document graph. In graph self-supervised learning, two types of tasks, node feature prediction and link prediction, are commonly used to learn the content and structure of a graph. In node feature prediction (Hu et al., 2020), some features of a node are masked, and the task is to predict them using neighbor nodes. This corresponds to our MLM task, where masked tokens in Segment A can be predicted using Segment B (a linked document on the graph), and vice versa. In link prediction (Bordes et al., 2013; Wang et al., 2021a), the task is to predict the existence or type of an edge between two nodes. This corresponds to our DRP task, where we predict if the given pair of text segments are linked (edge), contiguous (self-loop edge), or random (no edge). Our approach can be viewed as a natural fusion of language-based (e.g. BERT) and graph-based self-supervised learning.

Strategy to obtain linked documents
As described in §4.1, §4.2, our method builds links between documents, and for each anchor segment, samples a linked document to put together in the LM input.Here we discuss three key axes to consider to obtain useful linked documents in this process.
Relevance. Semantic relevance is a requisite when building links between documents. If links were randomly built without relevance, LinkBERT would be the same as BERT, with simply two options of LM inputs (contiguous or random). Relevance can be achieved by using hyperlinks or lexical similarity metrics, and both methods yield substantially better performance than using random links (§5.5).
Salience. Besides relevance, another factor to consider (salience) is whether the linked document can offer new, useful knowledge that may not be obvious to the current LM. Hyperlinks are potentially more advantageous than lexical similarity links in this regard: LMs are shown to be good at recognizing lexical similarity (Zhang et al., 2020), and hyperlinks can bring in useful background knowledge that may not be obvious via lexical similarity alone (Asai et al., 2020). Indeed, we empirically find that using hyperlinks yields a more performant LM (§5.5).
Diversity. In the document graph, some documents may have a very high in-degree (e.g., many incoming hyperlinks, like the "United States" page of Wikipedia), and others a low in-degree. If we uniformly sample from the linked documents for each anchor segment, we may include documents of high in-degree too often in the overall training data, losing diversity. To adjust so that all documents appear with a similar frequency in training, we sample a linked document with probability inversely proportional to its in-degree, as done in graph data mining literature (Henzinger et al., 2000). We find that this technique yields better LM performance (§5.5).
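The diversity adjustment amounts to weighting each candidate linked document by the reciprocal of its in-degree. A minimal sketch, assuming the graph is given as a list of directed `(src, dst)` pairs and using hypothetical function names:

```python
import random
from collections import Counter


def in_degrees(edges):
    """Count incoming edges per document from (src, dst) pairs."""
    return Counter(dst for _, dst in edges)


def sample_linked_doc(anchor, edges, indeg, rng):
    """Sample a document linked from `anchor`, with probability inversely
    proportional to the target's in-degree. Returns None if no out-links."""
    candidates = sorted({dst for src, dst in edges if src == anchor})
    if not candidates:
        return None
    weights = [1.0 / indeg[d] for d in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

For example, if an anchor links to a hub page with in-degree 3 and a rarely linked page with in-degree 1, the weights are 1/3 and 1, so the rare page is drawn with probability 0.75 instead of 0.5 under uniform sampling.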

Experiments
We experiment with our proposed approach in the general domain first, where we pretrain LinkBERT on Wikipedia articles with hyperlinks (§5.1) and evaluate on a suite of downstream tasks (§5.2). We compare with BERT (Devlin et al., 2019) as our baseline. We experiment in the biomedical domain in §6.

Pretraining setup
Data. We use the same pretraining corpus used by BERT: Wikipedia and BookCorpus (Zhu et al., 2015). For Wikipedia, we use the WikiExtractor to extract hyperlinks between Wiki articles. We then create training instances by sampling contiguous, random, or linked segments as described in §4, with the three options appearing uniformly (33%, 33%, 33%). For BookCorpus, we create training instances by sampling contiguous or random segments (50%, 50%) as in BERT. We then combine the training instances from Wikipedia and BookCorpus to train LinkBERT. In summary, our pretraining data is the same as BERT, except that we have hyperlinks between Wikipedia articles.
For -tiny, we pretrain from scratch with random weight initialization. We use the AdamW (Loshchilov and Hutter, 2019) optimizer with $(\beta_1, \beta_2) = (0.9, 0.98)$, warm up the learning rate for the first 5,000 steps and then linearly decay it. We train for 10,000 steps with a peak learning rate of 5e-3, weight decay of 0.01, and a batch size of 2,048 sequences with 512 tokens. Training took 1 day on two GeForce RTX 2080 Ti GPUs with fp16.
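The -tiny learning rate schedule (warm up for 5,000 steps, then linear decay over the remaining 5,000 steps, peak 5e-3) can be written as a small function. The text does not state the start and end values, so this sketch assumes the rate rises linearly from 0 and decays linearly to 0 at the final step; the function name is hypothetical.

```python
def learning_rate(step, peak_lr=5e-3, warmup_steps=5_000, total_steps=10_000):
    """Linear warmup to peak_lr, then linear decay.

    Assumptions: warmup starts at 0 and decay reaches 0 at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

Under these assumptions the rate is 2.5e-3 at steps 2,500 and 7,500 and peaks at 5e-3 at step 5,000.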
For -base, we initialize LinkBERT with the BERT base checkpoint released by Devlin et al. (2019) and continue pretraining. We use a peak learning rate of 3e-4 and train for 40,000 steps. Other training hyperparameters are the same as -tiny. Training took 4 days on four A100 GPUs with fp16.
For -large, we follow the same procedure as -base, except that we use a peak learning rate of 2e-4. Training took 7 days on eight A100 GPUs with fp16.
Baselines. We compare LinkBERT with BERT. Specifically, for the -tiny scale, we compare with BERT tiny, which we pretrain from scratch with the same hyperparameters as LinkBERT tiny. The only difference is that LinkBERT uses document links to create LM inputs, while BERT does not.
For the -base scale, we compare with BERT base, for which we take the BERT base release by Devlin et al. (2019) and continue pretraining it with the vanilla BERT objectives on the same corpus for the same number of steps as LinkBERT base.
For -large, we follow the same procedure as -base.

Evaluation tasks
We fine-tune and evaluate LinkBERT on a suite of downstream tasks.
As the MRQA shared task does not have a public test set, we split the dev set in half to make new dev and test sets. We follow the fine-tuning method BERT (Devlin et al., 2019) uses for extractive QA. More details are provided in Appendix B.

Results
Table 1 shows the performance (F1 score) on MRQA datasets. LinkBERT substantially outperforms BERT on all datasets. On average, the gain is +4.1% absolute for the BERT tiny scale, +2.6% for the BERT base scale, and +2.5% for the BERT large scale. Table 2 shows the results on GLUE, where LinkBERT performs moderately better than BERT. These results suggest that LinkBERT is especially effective at learning knowledge useful for QA tasks (e.g. world knowledge), while keeping performance on sentence-level language understanding.

Analysis
We further study when LinkBERT is especially useful in downstream tasks.
Improved multi-hop reasoning. In Table 1, we find that LinkBERT obtains notably large gains on QA datasets that require reasoning with multiple documents, such as HotpotQA (+5% over BERT tiny), TriviaQA (+6%) and SearchQA (+8%), as opposed to SQuAD (+1.4%), which just has a single document per question. To further gain qualitative insights, we studied in what QA examples LinkBERT succeeds but BERT fails. Figure 3 shows a representative example from HotpotQA.
Answering the question needs 2-hop reasoning: identify "Roden Brothers were taken over by Birks Group" from the first document, and then "Birks Group is headquartered in Montreal" from the second document. While BERT tends to simply predict an entity near the question entity ("Toronto" in the first document, which is just 1-hop), LinkBERT correctly predicts the answer in the second document ("Montreal"). Our intuition is that because LinkBERT is pretrained with pairs of linked documents rather than purely single documents, it better learns how to flow information (e.g., do attention) across tokens when multiple related documents are given in the context. In summary, these results suggest that pretraining with linked documents helps for multi-hop reasoning on downstream tasks.

Figure 3: HotpotQA example. LinkBERT prediction: "Montreal" (✓); BERT prediction: "Toronto" (✗). Answering the question needs to identify "Roden Brothers were taken over by Birks Group" from the first document, and then "Birks Group is headquartered in Montreal" from the second document. While BERT tends to simply predict an entity near the question entity ("Toronto" in the first document), LinkBERT correctly predicts the answer in the second document ("Montreal").
Improved understanding of document relations. While the MRQA datasets typically use ground-truth documents as context for answering questions, in open-domain QA, QA systems need to use documents obtained by a retriever, which may include noisy documents besides gold ones (Chen et al., 2017; Dunn et al., 2017). In such cases, QA systems need to understand the document relations to perform well (Yang et al., 2018). To simulate this setting, we modify the SQuAD dataset by prepending or appending 1-2 distracting documents to the original document given to each question. Table 3 shows the result. While BERT incurs a large performance drop (-2.8%), LinkBERT is robust to distracting documents (-0.5%). This result suggests that pretraining with document links improves the ability to understand document relations and relevance. In particular, our intuition is that the DRP objective helps the LM to better recognize document relations like (anchor document, linked document) in pretraining, which helps to recognize relations like (question, right document) in downstream QA tasks. We indeed find that ablating the DRP objective from LinkBERT hurts performance (§5.5). The strength of understanding document relations also suggests the promise of applying LinkBERT to various retrieval-augmented methods and tasks (e.g. Lewis et al. 2020b), either as the main LM or the dense retriever component.
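The distractor setting described above can be simulated with a few lines of code. The sketch below prepends or appends 1-2 distracting documents to a SQuAD-style context and shifts the answer's character offset accordingly. How distractors were actually chosen and joined is not specified in the text, so the distractor pool, the blank-line separator, and the 50/50 prepend/append choice are assumptions.

```python
import random


def add_distractors(context, answer_start, distractor_pool, rng):
    """Surround `context` with 1-2 distracting documents.

    Returns (new_context, new_answer_start), where the offset is shifted
    by the length of whatever was prepended. The pool needs >= 2 documents.
    """
    num = rng.choice([1, 2])
    before, after = [], []
    for doc in rng.sample(distractor_pool, num):
        # Assumption: each distractor is prepended or appended with equal odds.
        (before if rng.random() < 0.5 else after).append(doc)
    prefix = "".join(doc + "\n\n" for doc in before)
    suffix = "".join("\n\n" + doc for doc in after)
    return prefix + context + suffix, answer_start + len(prefix)
```

Keeping the answer offset consistent matters for extractive QA, where the gold span is identified by its character position in the context.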
Improved few-shot QA performance. We also find that LinkBERT is notably good at few-shot learning. Concretely, for each MRQA dataset, we fine-tune with only 10% of the available training data, and report the performance in Table 4. In this few-shot regime, LinkBERT attains more significant gains over BERT, compared to the full-resource regime in Table 1 (on NaturalQ, 5.4% vs 1.8% absolute in F1, or 15% vs 7% in relative error reduction). This result suggests that LinkBERT internalizes more knowledge than BERT during pretraining, which supports our core idea that document links can bring in new, useful knowledge for LMs.

Ablation studies
We conduct ablation studies on the key design choices of LinkBERT.
What linked documents to feed into LMs? We study the strategies discussed in §4.3 for obtaining linked documents: relevance, salience, and diversity.
Table 5 shows the ablation result on MRQA datasets. First, if we ignore relevance and use random document links instead of hyperlinks, we get the same performance as BERT (-4.1% on average; "random" in Table 5). Second, using lexical similarity links instead of hyperlinks leads to a 1.8% performance drop ("TF-IDF"). Our intuition is that hyperlinks can provide more salient knowledge that may not be obvious from lexical similarity alone. Nevertheless, using lexical similarity links is substantially better than BERT (+2.3%), confirming the efficacy of placing relevant documents together in the input for LM pretraining. Finally, removing the diversity adjustment in document sampling leads to a 1% performance drop ("No diversity"). In summary, our insight is that to create informative inputs for LM pretraining, the linked documents must be semantically relevant and ideally be salient and diverse.
Effect of the DRP objective. Table 6 shows the ablation result on the DRP objective (§4.2). Removing DRP in pretraining hurts downstream QA performance. The drop is large on tasks with multiple documents (HotpotQA, TriviaQA, and SQuAD with distracting documents). This suggests that DRP helps LMs learn document relations.

Biomedical LinkBERT (BioLinkBERT)
Pretraining LMs on biomedical text is shown to boost performance on biomedical NLP tasks (Beltagy et al., 2019; Lee et al., 2020; Lewis et al., 2020a; Gu et al., 2020). Biomedical LMs are typically trained on PubMed, which contains abstracts and citations of biomedical papers. While prior works only use their raw text for pretraining, academic papers have rich dependencies with each other via citations (references). We hypothesize that incorporating citation links can help LMs learn dependencies between papers and knowledge that spans across them.
With this motivation, we pretrain LinkBERT on PubMed with citation links (§6.1), which we term BioLinkBERT, and evaluate on biomedical downstream tasks (§6.2). As our baseline, we follow and compare with the state-of-the-art biomedical LM, PubmedBERT (Gu et al., 2020), which has the same architecture as BERT and is trained on PubMed.

Pretraining setup
Data. We use the same pretraining corpus used by PubmedBERT: PubMed abstracts (21GB). We use the Pubmed Parser to extract citation links between articles. We then create training instances by sampling contiguous, random, or linked segments as described in §4, with the three options appearing uniformly (33%, 33%, 33%). In summary, our pretraining data is the same as PubmedBERT, except that we have citation links between PubMed articles.
Implementation. We pretrain BioLinkBERT of -base size (110M params) from scratch, following the same hyperparameters as PubmedBERT base (Gu et al., 2020). Specifically, we use a peak learning rate of 6e-4, a batch size of 8,192, and train for 62,500 steps. We warm up the learning rate in the first 10% of steps and then linearly decay it. Training took 7 days on eight A100 GPUs with fp16.
Additionally, while the original PubmedBERT release did not include the -large size, we pretrain BioLinkBERT of the -large size (340M params) from scratch, following the same procedure as -base, except that we use a peak learning rate of 4e-4 and warm up over the first 20% of steps. Training took 21 days on eight A100 GPUs with fp16.

Evaluation tasks
For downstream tasks, we evaluate on the BLURB benchmark (Gu et al., 2020), a diverse set of biomedical NLP datasets, and MedQA-USMLE (Jin et al., 2021), a challenging biomedical QA dataset.
BLURB consists of five named entity recognition tasks, a PICO (population, intervention, comparison, and outcome) extraction task, three relation extraction tasks, a sentence similarity task, a document classification task, and two question answering tasks, as summarized in Table 7. We follow the same fine-tuning method and evaluation metric used by PubmedBERT (Gu et al., 2020).
MedQA-USMLE is a 4-way multi-choice QA task that tests biomedical and clinical knowledge. The questions are from practice tests for the US Medical License Exams (USMLE). The questions typically require multi-hop reasoning, e.g., given patient symptoms, infer the likely cause, and then answer the appropriate diagnosis procedure (Figure 4). We follow the fine-tuning method in Jin et al. (2021). More details are provided in Appendix B.
MMLU-professional medicine is a multi-choice QA task that tests biomedical knowledge and reasoning, and is part of the popular MMLU benchmark (Hendrycks et al., 2021) that is used to evaluate massive language models. We take the BioLinkBERT fine-tuned on the above MedQA-USMLE task, and evaluate on this task without further adaptation.
Table 9: BioLinkBERT significantly outperforms the largest general-domain LM or QA model, despite having just 340M parameters.

Figure 4 (MedQA-USMLE example): Answering the question needs 2-hop reasoning: from the patient symptoms described in the question (leg swelling, pancreatic cancer), infer the cause (deep vein thrombosis), and then infer the appropriate diagnosis procedure (compression ultrasonography). While the existing PubmedBERT tends to simply predict a choice that contains a word appearing in the question ("blood" for choice D), BioLinkBERT correctly predicts the answer (B). Our intuition is that citation links bring relevant documents together in the same context in pretraining (right), which readily provides the multi-hop knowledge needed for the reasoning (center).
tasks such as question answering (+7% on BioASQ and PubMedQA). This result is consistent with the general domain (§5.3) and confirms that LinkBERT helps to learn document dependencies better.
Figure 4 shows a representative example. Answering the question (left) needs 2-hop reasoning (center): from the patient symptoms described in the question (leg swelling, pancreatic cancer), infer the cause (deep vein thrombosis), and then infer the appropriate diagnosis procedure (compression ultrasonography). We find that while the existing PubmedBERT tends to simply predict a choice that contains a word appearing in the question ("blood" for choice D), BioLinkBERT correctly predicts the answer (B). Our intuition is that citation links bring relevant documents and concepts together in the same context in pretraining (right),[6] which readily provides the multi-hop knowledge needed for the reasoning (center). Combined with the analysis on HotpotQA (§5.4), our results suggest that pretraining with document links consistently helps for multi-hop reasoning across domains (e.g., general documents with hyperlinks and biomedical articles with citation links).
[6] For instance, as in Figure 4 (right), Ansari et al. (2015) in PubMed mention that pancreatic cancer can induce deep vein thrombosis in the leg, and the paper cites another paper in PubMed, Piovella et al. (2002), which mentions that deep vein thrombosis is tested by compression ultrasonography. Placing these two documents in the same context yields the complete multi-hop knowledge needed to answer the question ("pancreatic cancer" → "deep vein thrombosis" → "compression ultrasonography").

MedQA-USMLE.
MMLU-professional medicine. Table 9 shows the performance. Despite having just 340M parameters, BioLinkBERT large achieves 50% accuracy on this QA task, significantly outperforming much larger general-domain LMs and QA models such as GPT-3 (175B parameters, 39% accuracy) and UnifiedQA (11B parameters, 43% accuracy). This result shows that with an effective pretraining approach, a small domain-specialized LM can outperform language models that are orders of magnitude larger on QA tasks.
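To make the size gap concrete, the following short calculation uses only the parameter counts quoted above; the dictionary layout and variable names are illustrative.

```python
import math

# Parameter counts as quoted in the text.
params = {
    "BioLinkBERT large": 340e6,
    "UnifiedQA": 11e9,
    "GPT-3": 175e9,
}

base = params["BioLinkBERT large"]
# How many times larger each model is than BioLinkBERT large.
ratios = {name: count / base for name, count in params.items()}
# The same gap expressed in orders of magnitude (powers of ten).
orders = {name: math.log10(r) for name, r in ratios.items()}

for name in ("UnifiedQA", "GPT-3"):
    print(f"{name}: {ratios[name]:.0f}x larger, {orders[name]:.1f} orders of magnitude")
# UnifiedQA: 32x larger, 1.5 orders of magnitude
# GPT-3: 515x larger, 2.7 orders of magnitude
```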

Conclusion
We presented LinkBERT, a new language model (LM) pretraining method that incorporates document link knowledge such as hyperlinks. In both the general domain (pretrained on Wikipedia with hyperlinks) and biomedical domain (pretrained on PubMed with citation links), LinkBERT outperforms previous BERT models across a wide range of downstream tasks. The gains are notably large for multi-hop reasoning, multi-document understanding and few-shot question answering, suggesting that LinkBERT effectively internalizes salient knowledge through document links. Our results suggest that LinkBERT can be a strong pretrained LM to be applied to various knowledge-intensive tasks.

Reproducibility
); He et al. (2020); Wang et al. (2021b) combine LM training with KG embeddings. Sun et al. (2020); Yasunaga et al. (2021); Zhang et al. (2022) combine LMs and graph neural networks (GNNs) to jointly train on text and KGs. Different from KGs, we use document graphs to learn knowledge that spans across documents.

uses for extractive QA. More details are provided in Appendix B.

Table 1: Performance (F1) on MRQA question answering datasets. LinkBERT consistently outperforms BERT on all datasets across the -tiny, -base, and -large scales. The gain is especially large on datasets that require reasoning with multiple documents in the context, such as HotpotQA, TriviaQA, and SearchQA.
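For readers unfamiliar with the F1 metric reported for extractive QA, the sketch below computes a token-overlap F1 between a predicted and a reference answer span. It assumes the common SQuAD-style definition and simplifies it: standard evaluation scripts also strip punctuation and articles before comparing, which is omitted here, and `token_f1` is an illustrative name.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between two answer strings (simplified, assumed definition)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Montreal", "Montreal"))  # 1.0
print(token_f1("Toronto", "Montreal"))   # 0.0
```

A partially correct span such as "Montreal Quebec" against the reference "Montreal" has precision 0.5 and recall 1.0, giving an F1 of about 0.67.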

Table 3: Performance (F1) on SQuAD when distracting documents are added to the context. While BERT incurs a large drop in F1, LinkBERT does not, suggesting its robustness in understanding document relations.

Table 4: Few-shot QA performance (F1) when 10% of finetuning data is used. LinkBERT attains large gains, suggesting that it internalizes more knowledge than BERT in pretraining.
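The few-shot setting can be mimicked by subsampling the fine-tuning set. The sketch below draws a reproducible 10% subset; the helper name `subsample`, the fixed seed, and uniform sampling are assumptions made for illustration, since the text does not specify how the subset was drawn.

```python
import random

def subsample(examples, fraction=0.1, seed=0):
    """Return a reproducible random subset of the fine-tuning examples."""
    rng = random.Random(seed)
    # Keep at least one example so tiny datasets remain usable.
    k = max(1, int(len(examples) * fraction))
    return rng.sample(examples, k)

# Stand-in for a fine-tuning set of 1,000 QA examples.
train = list(range(1000))
few_shot = subsample(train)
print(len(few_shot))  # 100
```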
et al., 2007), and report the average score. More fine-tuning details are provided in Appendix B.
Question: Roden Brothers were taken over in 1953 by a group headquartered in which Canadian city?
Doc A: Roden Brothers was founded June 1, 1891 in Toronto, Ontario, Canada by Thomas and Frank Roden. In the 1910s the firm became known as Roden Bros. Ltd. and were later taken over by Henry Birks and Sons in 1953. ... In 1974 Roden Bros. Ltd. published the book, "Rich Cut Glass" with Clock House Publications in Peterborough, Ontario, which was a reprint of the 1917 edition published by Roden Bros., Toronto.

Table 7: Performance on the BLURB benchmark. BioLinkBERT attains improvement on all tasks, establishing a new state of the art on BLURB. Gains are notably large on document-level tasks such as PubMedQA and BioASQ.
Table 7 shows the results on BLURB. BioLinkBERT base outperforms PubmedBERT base on all task categories, attaining a performance boost of +2% absolute on average. Moreover, BioLinkBERT large provides a further boost of +1%. In total, BioLinkBERT outperforms the previous best by +3% absolute, establishing a new state of the art on the BLURB leaderboard. We see a trend that gains are notably large on document-level

Question: Three days after undergoing a laparoscopic Whipple's procedure, a 43-year-old woman has swelling of her right leg. ... She was diagnosed with pancreatic cancer 1 month ago. ... Her temperature is 38°C (100.4°F), pulse is 90/min, and blood pressure is 118/78 mm Hg. Examination shows mild swelling of the right thigh to the ankle; there is no erythema or pitting edema. ... Which of the following is the most appropriate next step in management?
Doc A: ... Pancreatic cancer can induce deep vein thrombosis in leg ... (e.g. Ansari et al. 2015)
Doc B: ... Deep vein thrombosis is tested by compression ultrasonography ...
Table 8 shows the results. BioLinkBERT base obtains a 2% accuracy boost over PubmedBERT base, and BioLinkBERT large provides an additional +5% boost. In total, BioLinkBERT outperforms the previous best by +7% absolute, setting a new state of the art. To gain further qualitative insights, we studied the QA examples on which BioLinkBERT succeeds but the baseline PubmedBERT fails.