Modular Self-Supervision for Document-Level Relation Extraction

Extracting relations across large text spans has been relatively underexplored in NLP, but it is particularly important for high-value domains such as biomedicine, where obtaining high recall of the latest findings is crucial for practical applications. Compared to conventional information extraction confined to short text spans, document-level relation extraction faces additional challenges in both inference and learning. Given longer text spans, state-of-the-art neural architectures are less effective and task-specific self-supervision such as distant supervision becomes very noisy. In this paper, we propose decomposing document-level relation extraction into relation detection and argument resolution, taking inspiration from Davidsonian semantics. This enables us to incorporate explicit discourse modeling and leverage modular self-supervision for each sub-problem, which is less noise-prone and can be further refined end-to-end via variational EM. We conduct a thorough evaluation in biomedical machine reading for precision oncology, where cross-paragraph relation mentions are prevalent. Our method outperforms prior state of the art, such as multi-scale learning and graph neural networks, by over 20 absolute F1 points. The gain is particularly pronounced among the most challenging relation instances whose arguments never co-occur in a paragraph.


Introduction
Prior work on information extraction tends to focus on binary relations within sentences. However, practical applications often require extracting complex relations across large text spans. This is especially important in high-value domains such as biomedicine, where obtaining high recall of the latest findings is crucial. For example, Figure 1 shows a ternary (drug, gene, mutation) relation signifying that a tumor with MAP2K1 mutation K57T is sensitive to cobimetinib, yet the entities never co-occur in any single paragraph. Such precision oncology knowledge is key for determining personalized treatment for cancer patients, but it is scattered across a vast biomedical literature of more than 30 million papers, with over 1 million added each year.

* Work done as an intern at Microsoft Research.
Recently, there has been increasing interest in cross-sentence relation extraction, but most existing work still focuses on short text spans. Peng et al. (2017) restrict extraction to three consecutive sentences and Verga et al. (2018) to abstracts. DocRED (Yao et al., 2019), a popular document-level relation extraction dataset, consists of Wikipedia introduction sections, each with only eight sentences on average. Further, half of its relation instances reside in a single sentence, and all effectively in a single paragraph.
To the best of our knowledge, Jia et al. (2019) is the first to consider relation extraction in full-text articles, which is considerably more challenging. They use the CKB dataset (Patterson et al., 2016) for evaluation, where each document contains, on average, 174 sentences, spanning 39 paragraphs. Additionally, while prior work focuses mainly on binary relations, Jia et al. (2019) follow Peng et al. (2017) to extract ternary relations. However, while Jia et al. (2019) admit relation instances for which the three arguments never co-occur in a paragraph, they still require that the two arguments of each binary subrelation co-occur at least once in a paragraph, leaving one fifth of findings out of reach for their method.
All this prior work considers document-level relation extraction as a single monolithic problem, which presents major challenges in both inference and learning. Despite recent progress, there are still significant challenges in modeling long text spans using state-of-the-art neural architectures, such as LSTMs and transformers. Moreover, direct supervision is scarce, and task-specific self-supervision, such as distant supervision, becomes extremely noisy when applied beyond short text spans.
In this paper, we explore an alternative paradigm by decomposing document-level relation extraction into local relation detection and global reasoning over argument resolution. Specifically, we represent n-ary relation using Davidsonian semantics and combine paragraph-level relation classification with discourse-level argument resolution using global reasoning rules (e.g., transitivity over argument resolution). Each component problem resides in short text spans and their corresponding self-supervision is much less error-prone. Our approach takes inspiration from modular neural networks (Andreas et al., 2016) and neural logic programming (Rocktäschel and Riedel, 2017) in decomposing a complex task into local neural learning and global structured integration. However, instead of learning from end-to-end direct supervision, we admit modular self-supervision for the component problems, which is more readily available. Our method can thus be viewed as applying deep probabilistic logic (Wang and Poon, 2018) to combine modular self-supervision and joint inference with global reasoning rules.
This modular approach enables us to not only handle long text spans such as full-text articles like Jia et al. (2019), but also expand extraction to the significant portion of cross-paragraph relations that are out of reach to all prior methods. We conduct a thorough evaluation in biomedical machine reading for precision oncology, where such cross-paragraph relations are especially prevalent. Our method outperforms prior state of the art such as multiscale learning (Jia et al., 2019) and graph neural networks (Zeng et al., 2020) by over 20 absolute F1 points. The gain is particularly pronounced among the most challenging relations whose arguments never co-occur in a paragraph.

Document-Level Relation Extraction
Let E_1, …, E_n be entities that co-occur in a document D. Relation extraction amounts to classifying whether a relation R holds for E_1, …, E_n in D. For example, in Figure 2, R is the relation of precision cancer drug response, and E_1, E_2, E_3 represent the drug cobimetinib, gene MAP2K1, and mutation K57T, respectively. The relation mention spans multiple paragraphs and dozens of sentences. Direct extraction is challenging and ignores the elaborate underlying linguistic phenomena. The drug-response relation is explicitly mentioned in the last paragraph, though it is between "MEK inhibitors" and "MAP2K1 mutations". Meanwhile, the top paragraph states the ISA relation between "cobimetinib" and "MEK inhibitors", as apparent from the apposition. From the middle paragraph, one can infer the ISA relation between "K57T" and "MAP2K1 mutations". Finally, "MAP2K1 mutants" in the last paragraph can be resolved with "MAP2K1 mutations" in the middle based on semantic similarity. Combining these, we can conclude that the drug-response relation holds for (cobimetinib, MAP2K1, K57T) in this document (Figure 2), even though cobimetinib never co-occurs with MAP2K1 or K57T in any paragraph.
Formally, we represent n-ary relation extraction by neo-Davidsonian semantics (Parsons, 1990):

R(E_1, …, E_n) ≐ ∃T ⊆ D. ∃r. [R_T(r) ∧ A_1(r, E_1) ∧ ⋯ ∧ A_n(r, E_n)]

Here, T is a text span in D, r is a reified event variable introduced to represent relation R, and the arguments are represented by binary relations with the event variable. The distributed nature of this representation makes it suitable for arbitrary n-ary relations and does not require drastic changes when arguments are missing or when new arguments are added. Given this representation, document-level relation extraction naturally decomposes into local relation detection (e.g., classifying whether R_T(r) holds for some paragraph T) and global argument resolution (e.g., classifying A_i(r, E_i)). Entity-level argument resolution can be reduced to mention-level argument resolution: A(r, E) ≐ ∃e. [Mention(e, E) ∧ A(r, e)], where Mention(e, E) signifies that e is an entity mention of E. Additionally, the transitivity rule applies:

A(r, e) ∧ Resolve(r, e, e′) =⇒ A(r, e′)

Here, Resolve(r, e, e′) signifies that the mentions e, e′ are interchangeable in the context of relation mention r. For brevity, in this paper we drop the relation context r and simply consider Resolve(e, e′). If e and e′ are coreferent or semantically equivalent, as in (MAP2K1 mutations, MAP2K1 mutants), Resolve obviously holds. More generally, ISA (e.g., K57T and MAP2K1 mutations) and PartOf (e.g., a mutation and a cell line containing it) may also signify resolution:

Coref(e, e′) ∨ ISA(e, e′) ∨ PartOf(e, e′) =⇒ Resolve(e, e′)
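The transitivity rule and the mention-to-entity reduction can be sketched as a simple closure computation. This is an illustrative sketch only, not the authors' implementation (the actual system performs this reasoning probabilistically via belief propagation); the function names and the toy mentions are assumptions.

```python
def resolution_closure(seed_args, resolve_pairs):
    """Propagate argument attachments A(r, e) over Resolve(e, e') links,
    i.e., the transitivity rule A(r, e) ∧ Resolve(e, e') ⇒ A(r, e')."""
    # Resolve is symmetric here: build an undirected adjacency over mentions.
    adj = {}
    for e1, e2 in resolve_pairs:
        adj.setdefault(e1, set()).add(e2)
        adj.setdefault(e2, set()).add(e1)
    args = set(seed_args)          # pairs (r, mention) known to hold
    frontier = list(args)
    while frontier:
        r, e = frontier.pop()
        for e2 in adj.get(e, ()):
            if (r, e2) not in args:
                args.add((r, e2))
                frontier.append((r, e2))
    return args

def entity_level_arguments(args, mentions_of):
    """A(r, E) ≐ ∃e. Mention(e, E) ∧ A(r, e): lift mention-level
    attachments to entity-level ones."""
    return {(r, ent) for r, e in args
            for ent, ms in mentions_of.items() if e in ms}
```

With the Figure 2 example, seeding A(r, "MEK inhibitors") and Resolve(cobimetinib, MEK inhibitors) derives A(r, cobimetinib) at the entity level, even though the seed attachment never mentioned the drug name.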

Modular Self-Supervision
Our problem formulation makes it natural to introduce modular self-supervision for relation detection and argument resolution (Table 1).

Relation Detection
The goal is to train a classifier for R_T(r). In this paper, we consider paragraphs as candidates for T and use distant supervision (Mintz et al., 2009) to generate training examples from entity tuples that co-occur in existing knowledge bases. Here, we leverage the fact that paragraph-level distant supervision is much less noise-prone, while document-level relation mentions still exhibit textual patterns similar to paragraph-level ones, as can be seen in Figure 2.
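Paragraph-level distant supervision can be sketched as follows. This is a minimal illustration, not the authors' pipeline; the paragraph record format (`text`/`mentions` keys) and the toy KB triple are assumptions.

```python
def distant_supervision(paragraphs, kb):
    """Label a paragraph positive for R_T(r) when all arguments of some
    KB triple co-occur in it (distant supervision applied per paragraph,
    rather than per document, to reduce noise)."""
    examples = []
    for t in paragraphs:
        mentions = set(t["mentions"])   # entity-linked mentions in paragraph T
        for triple in kb:
            if set(triple) <= mentions:  # all arguments co-occur in T
                examples.append((t["text"], triple, 1))
    return examples
```

Applying the same KB triple to a whole document would label many spurious paragraphs positive; restricting the match to a single paragraph is what keeps this signal comparatively clean.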

Argument Resolution
The goal is to train a classifier for Resolve(e, e′) based on the local context of entity mentions e, e′. As stated in the previous section, Resolve is strictly more general than coreference and may involve ISA and PartOf relations. For self-supervision, we introduce data programming rules that capture identical mentions and appositives. These serve as seed self-supervision to annotate high-precision resolution instances. In turn, additional instances in the same document can be generated by applying the transitivity rule. E.g., in Figure 2, by deriving Resolve(cobimetinib, MEK inhibitor) in the top paragraph based on the apposition, we can annotate additional Resolve instances between "cobimetinib" and "MEK inhibitors" in the bottom paragraph. As in distant supervision, there will be noise, but on balance, such joint inference helps learn more general resolution patterns.
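The seed rules can be sketched as simple labeling functions. This is an illustrative simplification, not the authors' exact rule set: the appositive patterns below cover only two surface forms (e.g., "the MEK inhibitor cobimetinib" and "cobimetinib, a MEK inhibitor") and are assumptions.

```python
import re

def identical_mentions(e1, e2):
    """Seed rule 1: string-identical mentions (case-insensitive) resolve."""
    return e1.strip().lower() == e2.strip().lower()

def appositive(sentence, e1, e2):
    """Seed rule 2: simplified appositive patterns suggesting Resolve."""
    pat_head = re.escape(e1) + r"\s+" + re.escape(e2)        # "MEK inhibitor cobimetinib"
    pat_comma = re.escape(e2) + r",\s+an?\s+" + re.escape(e1)  # "cobimetinib, a MEK inhibitor"
    return bool(re.search(pat_head, sentence) or re.search(pat_comma, sentence))

def seed_resolve(sentence, e1, e2):
    """High-precision seed annotation for Resolve(e1, e2)."""
    return identical_mentions(e1, e2) or appositive(sentence, e1, e2)
```

Instances annotated this way seed the classifier; the transitivity rule then propagates Resolve labels to other mention pairs in the same document during joint inference.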

Figure 3: Our approach for document-level relation extraction applies deep probabilistic logic to incorporate modular self-supervision and joint inference for relation detection and argument resolution.

Figure 3 shows our document-level relation extraction system, which uses deep probabilistic logic (Wang and Poon, 2018) to incorporate modular self-supervision and joint inference.

Prediction Module
The prediction module comprises transformer-based neural models (Vaswani et al., 2017) for local relation detection and global argument resolution. For relation detection, let (m_1, …, m_n) be a candidate co-occurring mention tuple in a paragraph T. We input T to a transformer-based relation classifier, with the mentions m_i dummified. For argument resolution, let (m, m′) be a candidate mention pair. We compute a contextual representation for each mention using a transformer model and classify the pair using a comparison network, whose input concatenates the contextual representations of the two mentions as well as their element-wise multiplication. For the detailed neural architectures, see Appendix A.
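Mention "dummification" can be sketched as a preprocessing step that replaces candidate argument mentions with typed placeholder tokens before encoding. This is an illustrative sketch; the placeholder format (`[DRUG]`, `[GENE]`, etc.) and the character-offset mention format are assumptions, not the authors' exact scheme.

```python
def dummify(paragraph, mentions):
    """Replace each mention span with a typed placeholder token.

    mentions: list of (start, end, entity_type) character offsets,
    assumed non-overlapping.
    """
    out, prev = [], 0
    for start, end, etype in sorted(mentions):
        out.append(paragraph[prev:start])   # text before the mention
        out.append(f"[{etype.upper()}]")    # typed placeholder
        prev = end
    out.append(paragraph[prev:])            # trailing text
    return "".join(out)
```

Dummification prevents the classifier from memorizing specific entity names, forcing it to learn from the surrounding textual pattern instead.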
Supervision Module As described in the previous section, the supervision module incorporates the relation KBs and resolution data programming rules as seed self-supervision, as well as reasoning rules such as resolution transitivity for joint inference. Note that these self-supervision rules can be noisy, but for simplicity we still treat them as hard constraints. Deep probabilistic logic offers a principled way to soften them and model their noisiness, which can be investigated in future work.
Learning The prediction and supervision modules define a joint probability distribution:

P(K, Y | X) ∝ Φ(K, Y) · ∏_i Ψ(X_i, Y_i)

Here, K represents the self-supervision and (X_i, Y_i) the input-output pairs of relation detection and argument resolution; Φ, Ψ are the supervision and prediction modules, respectively. Learning is done via variational EM. In the E-step, we compute a variational approximation q(Y) ∝ P(Y | K, X) using loopy belief propagation, based on the current Φ, Ψ. In the M-step, we treat q(Y) as probabilistic labels and refine the parameters of Φ, Ψ. As mentioned above, we treat the self-supervision in Φ as hard constraints, so the M-step reduces to fine-tuning the transformer-based models for relation detection and argument resolution against these probabilistic labels.
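The EM loop can be sketched as follows. This is a toy stand-in, not the actual system: the real E-step runs loopy belief propagation over the factor graph and the real M-step fine-tunes the transformer modules; the `e_step` and `m_step` bodies below are illustrative assumptions.

```python
def e_step(X, K, predict):
    # Toy E-step: hard self-supervision constraints in K override the
    # model's soft scores (stand-in for loopy belief propagation).
    return [K.get(i, predict(x)) for i, x in enumerate(X)]

def m_step(X, q, predict):
    # Toy M-step: "refit" the predictor to the probabilistic labels q(Y)
    # (stand-in for fine-tuning the transformer modules).
    mean = sum(q) / len(q)
    return lambda x: mean

def variational_em(X, K, predict, iterations=8):
    q = None
    for _ in range(iterations):
        q = e_step(X, K, predict)        # q(Y) ∝ P(Y | K, X)
        predict = m_step(X, q, predict)  # treat q(Y) as probabilistic labels
    return predict, q
```

Even in this toy form, the key dynamic is visible: constrained labels stay fixed across iterations, while unconstrained labels drift toward the model refit on the full label distribution.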
Inference After learning, given a test document and candidate entities and mentions, it is straightforward to run the neural modules for relation detection and argument resolution. Additionally, we incorporate joint inference for argument resolution (e.g., transitivity), as in self-supervision, using loopy belief propagation.

Experiments
In this section, we study how our modular self-supervision approach performs in document-level relation extraction. A popular dataset is DocRED (Yao et al., 2019), which features Wikipedia introduction sections and general-domain relations. However, upon close inspection, DocRED does not have many truly long-range relation instances: per its evidence annotation, half of the relation instances can be extracted from a single sentence. Consequently, there is very little room to explore the more challenging scenario where relations span multiple paragraphs in large text spans. In fact, Huang et al. (2021) find that over 95% of instances in DocRED require no more than three sentences as supporting evidence, and 87% require no more than two sentences. Ye et al. (2020) show that a simple BERT-based system (a special case of our approach with just local relation detection) yields 60.06% F1 on the DocRED test set (Table 3 in their paper), very close to the state-of-the-art result of 62.76% by GAIN (Zeng et al., 2020). We thus focus on biomedical machine reading, where there is a pressing need for comprehensive extraction of the latest findings from full-text articles and cross-paragraph relation mentions are prevalent. Following Peng et al. (2017) and Jia et al. (2019), we consider the problem of extracting precision oncology knowledge from PubMed Central full-text articles, which is critical for molecular tumor boards and other precision health applications. Concretely, the goal is to extract drug-gene-mutation relations as shown in Figure 1: given a drug, gene, mutation, and a document in which they are mentioned, determine whether the document asserts that the mutation in the gene affects response to the drug.

Datasets
Self-Supervision For training and development, we use unlabeled documents from the PubMed Central Open Access Subset (PMC-OA). For relation detection, we derive distant supervision from three knowledge bases (KBs) with manually-curated drug-gene-mutation relations: CIVIC, GDKD (Dienstmann et al., 2015), and OncoKB (Chakravarty et al., 2017). We randomly split the generated examples into training and development sets, ensuring no overlap of documents.

Evaluation We use CKB as our gold-standard test set. CKB contains high-quality document-level annotations of drug-gene-mutation interactions, which are manually curated from PubMed articles by The Jackson Laboratory (JAX), an NCI-designated cancer center. CKB has minimal overlap with the three KBs used in training and development. To avoid contamination, we remove CKB entries whose documents are used in our training and development sets. See Table 2 for statistics. Note that, compared to the version used in Jia et al. (2019), the latest dataset (accessed in Oct. 2020) contains substantially more relations from recent findings. For about one fifth of the annotated relations (17.4%), key entities such as the drug and mutation never co-occur in the same paragraph. These relations are out of scope in Jia et al. (2019). We denote this subset as CKB HARD; it comprises particularly challenging instances requiring cross-paragraph discourse modeling.

Systems
We implemented our modular self-supervision method (Modular) using PyTorch (Paszke et al., 2019). We ran variational EM for eight iterations, which appeared to work well in preliminary experiments. In the M-step, we use early stopping based on development performance to identify the best checkpoint when fine-tuning the relation detection and argument resolution neural modules. We initialized the encoding layers of both modules with PubMedBERT (Gu et al., 2021), which was pretrained from scratch on PubMed articles and has demonstrated superior performance in a wide range of biomedical NLP applications.
We follow Wang and Poon (2018) and Jia et al. (2019) in conducting standard data preprocessing and entity linking. We used the AdamW optimizer (Loshchilov and Hutter, 2019). For training, we set the mini-batch size to 32 and the learning rate to 5e-5, with 100 warm-up steps and 0.01 weight decay. The dropout rate is 0.1 for transformer-based encoders and 0.5 for other layers. The hidden size is 768 for transformer-based encoders and 128 for all other feed-forward networks. We generate checkpoints every 4096 steps. Three random seeds are used in our experiments: 7, 12, 17.

Table 4: Comparison of test results on CKB and CKB HARD. Relations in CKB HARD are particularly challenging, as key entity pairs such as drug and mutation never co-occur in a paragraph. All systems were trained using the same three KBs for distant supervision, with no overlap with CKB. We report results from three random runs.
For self-supervised relation detection, following Jia et al. (2019), we further decompose it into classifying drug-mutation relations and then augmenting them with high-precision gene-mutation associations. As stated in Section 3, at training time, only named entities such as drugs, genes, and mutations are considered, whereas at inference time, in principle, any co-occurring noun phrases within a paragraph would be considered (see Figure 2, bottom paragraph). In practice, however, this would incur too much computation, most of it wasted on irrelevant candidates. We therefore employ the following heuristics to leverage argument resolution results for filtering candidates. In argument resolution, we focus on resolving candidate mentions with drugs, genes, and mutations. We also stipulate that a candidate mention must contain some relevant biomedical entity mention (e.g., a cell line, gene, etc., as in "MEK inhibitors", which contains the gene reference "MEK"). In relation detection, we only consider candidate mentions that are classified as resolving with drug, gene, or mutation entities, based on the current prediction module.
We compare Modular with the following baselines: AllPositive is a recall-friendly baseline that always predicts positive; Multiscale (Jia et al., 2019) is a state-of-the-art approach that combines local mention-level representations into an entity-level representation over the entire document; GAIN (Zeng et al., 2020) is another state-of-the-art approach that constructs mention-level graphs and applies a graph convolutional network (Kipf and Welling, 2017) to model interdependencies among intra- and inter-sentence mentions, attaining top performance on DocRED.
For fair comparison, we replaced the encoders in Multiscale and GAIN with the state-of-the-art PubMedBERT as in our approach, which helped improve their performance (Appendix B). The original Multiscale encodes each paragraph separately using an LSTM, so it is straightforward to replace that with PubMedBERT. GAIN, on the other hand, encodes the entire input text at once. This is feasible in DocRED, where each "document" is actually a Wikipedia introduction section, thus more like a paragraph (only eight sentences long on average). But it does not work in CKB, where each document is a full-text article: even the minimal text span covering the given entities is often too long to encode with a transformer. Therefore, we ran the encoder on individual paragraphs. Note that the original version of Multiscale cannot make predictions for instances where key entity pairs such as drug and mutation never co-occur in a paragraph. We implemented a natural extension that generates local mention-level representations even for singleton mentions (i.e., when only one relevant entity shows up in a paragraph).

Main Results

Table 4 compares various approaches for document-level relation extraction on CKB. Our modular self-supervision approach (Modular) substantially outperforms all other methods, gaining more than 20 absolute F1 points over prior state of the art such as multiscale learning and graph neural networks. This demonstrates the benefit of leveraging less noise-prone modular self-supervision as well as fine-grained discourse modeling in argument resolution. Note that the results for Multiscale differ from those in Jia et al. (2019), as we used the latest CKB, which contains considerably more cross-paragraph relations.

Compared to multiscale learning (Multiscale), the graph-neural-network approach (GAIN) attains significantly better precision, as it incorporates more elaborate graph-based reasoning among entities across sentences. However, this comes at a substantial expense in recall. Our approach outperforms both substantially in precision and recall.
On the most challenging subset CKB HARD , the contrast is particularly pronounced, as all other systems could only attain single-digit precision. The graph-neural-network approach also suffers heavily in recall. Our approach attains much better precision and recall, and more than triples the F1.

Ablation Study
To understand the impact of our modular self-supervision, we conducted an additional ablation study on CKB HARD. See Table 5 for the results.
To assess the limitations of our current argument resolution module, we replace self-supervised relation detection with a baseline that always predicts positive for candidate tuples whose components have been resolved with some drug, gene, or mutation entities. This yields a maximum recall of 66.1%, which means that about a third of the especially hard cases of cross-paragraph relations are still out of reach for our method. In some cases, this is because the only hint at the relation resides in figures or the appendix, which are currently not in scope for extraction. In other cases, the argument resolution module fails to make the correct resolution with the corresponding entities. We leave further investigation and improvement to future work. With self-supervised relation detection, our full model improves both F1 and precision for the end extraction on CKB HARD.
Next, we investigate the impact of our self-supervision for argument resolution by replacing it with state-of-the-art coreference systems: Multi-Sieve Pass (Raghunathan et al., 2010; Lee et al., 2011) and SpanBERT Coreference (Joshi et al., 2020). Multi-Sieve Pass is a rule-based system that incorporates a series of resolution rules with increasing recall but lower precision. SpanBERT Coreference is a state-of-the-art transformer-based system fine-tuned on OntoNotes (Pradhan et al., 2012), an annotated corpus with diverse text. Both result in significant performance drops. Multi-Sieve suffers a substantial drop in precision, indicating that coreference heuristics suitable for general domains are less effective in biomedicine. SpanBERT, on the other hand, suffers a catastrophic drop in recall. This suggests that argument resolution for document-level relation extraction may involve more general anaphoric phenomena, such as ISA and PartOf, which are out of scope in standard coreference annotations. Remarkably, bootstrapping from the simple data programming rules of identical mentions and apposition, our self-supervised module performs much better argument resolution than these state-of-the-art systems for document-level relation extraction.

Given that our evaluation is in the biomedical domain, it is natural to initialize our self-supervised neural modules with PubMedBERT (Gu et al., 2021). Table 6 shows that this is indeed advantageous, with domain-specific pretraining attaining significant gains over general-domain pretraining.

Discussion
Interpretability We envision that machine reading is used not as standalone automation, but as assisted curation to help expert curators attain significant speed-ups (Peng et al., 2017). For extraction within short text spans, human experts can validate the results by simply reading through the provenance text. For document-level relation extraction, as in Jia et al. (2019), this can be challenging, as the intervening text span is long and validation may require a significant amount of reading that is not much faster than curation from scratch. Our modular approach not only enables us to tackle the harder cases of cross-paragraph relations, but also yields a natural explanation for each extraction result, namely the local relation and the chains of argument resolution, all of which can be quickly validated by curators. We leave studying the impact on assisted curation to future work.

Error Analysis We focus our error analysis on CKB HARD, which is particularly challenging. An immediate opportunity lies in the significant recall misses by argument resolution. As shown in the previous subsection, the maximum recall for our current argument resolution module is 66%. From preliminary sample analysis, there are three main types of errors. Some relation instances are hinted at in figures, tables, or supplements, which are beyond the current scope of extraction. In other cases, the relation statement is vague and scattered, and relation detection requires inference piecing together multiple pieces of evidence. However, the bulk of recall errors stem simply from argument resolution failures. Likewise, we found that in the majority of precision errors, relation detection appears to make the right call, but argument resolution is mistaken.

Variational EM One direction to improve argument resolution is to augment the self-supervision used in the resolution module.
Figure 4 shows that argument resolution did improve during learning, thanks to the global reasoning rules, at least in terms of expanding maximum recall. However, we notice that our current mention filtering rules may be overly strict, which limits the room for growth in recall. Additionally, we treat reasoning rules such as transitivity as hard constraints, whereas in practice they can be noisy. (E.g., qualifiers like "some MAP2K1 mutations" and negation are currently not considered in resolution.)

Related Work

Document-Level Relation Extraction Due to the significant challenges in modeling long text spans and obtaining high-quality supervision signals, document-level relation extraction has been relatively underexplored (Surdeanu and Ji, 2014). Prior work often focuses on simple extensions of sentence-level extraction (e.g., by incorporating coreference annotations or considering special cases where document-level relations reduce to sentence-level attribute classification) (Wick et al., 2006; Gerber and Chai, 2010; Swampillai and Stevenson, 2011; Yoshikawa et al., 2011; Koch et al., 2014; Yang and Mitchell, 2016). Recently, cross-sentence relation extraction has seen increasing interest (Li et al., 2016; Peng et al., 2017; Verga et al., 2018; Christopoulou et al., 2019; Wu et al., 2019; Yao et al., 2019), but most efforts are still limited to short text spans, such as consecutive sentences or abstracts. A notable exception is Jia et al. (2019), which considers full-text articles that comprise hundreds of sentences. However, they still model local text units in isolation and cannot effectively handle relations whose arguments never co-occur in a paragraph. In contrast, we provide the first attempt to systematically explore cross-paragraph relation extraction. Most prior work focuses on binary relations. We instead follow Peng et al. (2017) and Jia et al. (2019) in studying general n-ary relation extraction, using precision oncology treatment as a case study.
Discourse Modeling Given the focus of standard information extraction on short text spans, discourse modeling has not featured prominently in prior work. An exception is coreference resolution, though its focus tends to be improving sentence-level extraction, as in Koch et al. (2014). Here, we show that document-level relation extraction often requires modeling more general anaphoric phenomena. As discussed in the experiment section, many remaining errors lie in argument resolution, which offers an exciting opportunity to study discourse modeling for an important end application.
Self-Supervision Task-specific self-supervision alleviates the annotation bottleneck by leveraging freely available domain knowledge (as in distant supervision (Craven and Kumlien, 1999; Mintz et al., 2009)) and expert-derived labeling rules (as in data programming (Ratner et al., 2017)). Unfortunately, such self-supervision becomes extremely noisy when applied to full-text documents, prompting many prior efforts to focus on short text spans (Peng et al., 2017; Verga et al., 2018; Yao et al., 2019). We instead decompose end-to-end document-level extraction into relation detection and argument resolution modules, for each of which we leverage modular self-supervision that is much less error-prone.
Neural-Symbolic NLP In the past few decades, the dominant paradigm in NLP has swung from logical approaches (rule-based or relational systems) to statistical and neural approaches. However, given the prevalence of linguistic structures and domain knowledge, there has been increasing interest in synergizing the contrasting paradigms to improve inference and learning. Neural logic programming replaces logical operators with neural representations to leverage domain-specific constraints in end-to-end differentiable learning (Rocktäschel and Riedel, 2017). Similarly, modular neural networks integrate component neural learning along a structured scaffold (e.g., the syntactic parse of a sentence for visual question answering) (Andreas et al., 2016). On the other hand, deep probabilistic logic (Wang and Poon, 2018) combines probabilistic logic with neural networks to incorporate diverse self-supervision for deep learning. We take inspiration from modular neural networks and neural logic programming, and use deep probabilistic logic to combine relation detection and argument resolution via global reasoning rules for document-level relation extraction.

Conclusion
We propose to decompose document-level relation extraction into local relation detection and global argument resolution, and apply modular self-supervision and discourse modeling using deep probabilistic logic. On the challenging problem of biomedical machine reading, where cross-paragraph relations are prevalent, our approach substantially outperforms prior state of the art such as multiscale learning and graph neural networks, gaining over 20 absolute F1 points. Future directions include: improving discourse modeling for argument resolution; studying the impact on assisted curation; and applications to other domains.

A Neural Architectures

Figure 5 shows the neural architectures for relation detection and argument resolution. For relation detection, we input a paragraph to a transformer-based encoder, with mentions dummified. The hidden state h_[CLS] in the last layer is then passed to a simple feed-forward classifier, softmax(FFNN_1(h_[CLS])), where FFNN_1 is a two-layer feed-forward network using ReLU as activation functions.
For argument resolution, given a candidate mention pair (m_i, m_j) and their context, we first compute their contextual representations using a transformer-based encoder. If a mention span contains multiple tokens, we use average pooling to combine their contextual representations (hidden states in the last layer). The pair of mention representations (h_mi, h_mj) is then passed to a classifier that scores the pair via s(h_mi, h_mj), where s(x, y) is a scoring function over the concatenation of the two representations and their element-wise product:

s(x, y) = FFNN_3([FFNN_2(x); FFNN_2(y); FFNN_2(x) ∘ FFNN_2(y)])

where ∘ denotes element-wise multiplication, and FFNN_2 and FFNN_3 are two-layer feed-forward networks using ReLU as activation functions.
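The comparison-network input can be sketched in pure Python. This is a minimal stand-in for the tensor operations described above (the actual modules are PyTorch networks); the function name and the plain-list representation are assumptions.

```python
def comparison_features(h_i, h_j):
    """Build the comparison-network input for a mention pair:
    concatenation of the two mention representations together with
    their element-wise product."""
    assert len(h_i) == len(h_j), "mention representations must share a dimension"
    prod = [a * b for a, b in zip(h_i, h_j)]   # element-wise multiplication
    return list(h_i) + list(h_j) + prod        # fed to the pair scorer
```

The element-wise product term gives the scorer a direct similarity signal between the two mentions, which a plain concatenation would have to learn from scratch.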

B Multiscale with PubMedBERT
Replacing the LSTM with PubMedBERT (Gu et al., 2021) as the encoder generally leads to comparable or better performance for the Multiscale system (Jia et al., 2019). Table 7 shows the results on the original CKB test set as used in Jia et al. (2019). Note that these results should not be compared with the main results in Table 4, as the latter are obtained on the latest CKB, which contains considerably more cross-paragraph relations.