SAIS: Supervising and Augmenting Intermediate Steps for Document-Level Relation Extraction

Stepping from the sentence level to the document level, research on relation extraction (RE) confronts increasing text length and more complicated entity interactions. Consequently, it is more challenging to encode the key information sources, namely relevant contexts and entity types. However, existing methods only implicitly learn to model these critical information sources while being trained for RE. As a result, they suffer from ineffective supervision and uninterpretable model predictions. In contrast, we propose to explicitly teach the model to capture relevant contexts and entity types by supervising and augmenting intermediate steps (SAIS) for RE. Based on a broad spectrum of carefully designed tasks, our proposed SAIS method not only extracts relations of better quality due to more effective supervision, but also retrieves the corresponding supporting evidence more accurately, thereby enhancing interpretability. By assessing model uncertainty, SAIS further boosts the performance via evidence-based data augmentation and ensemble inference while reducing the computational cost. Eventually, SAIS delivers state-of-the-art RE results on three benchmarks (DocRED, CDR, and GDA) and outperforms the runner-up by a relative 5.04% in evidence-retrieval F1 score on DocRED.


Introduction
Playing a crucial role in the continuing effort of transforming unstructured text into structured knowledge, RE (Bach and Badaskar, 2007) seeks to identify relations between an entity pair based on a given piece of text. Earlier studies mostly focus on sentence-level RE (Zhang et al., 2017; Hendrickx et al., 2019), where the target entity pair co-occurs within a single sentence, and achieve promising results. Based on an extensive empirical analysis, recent work reveals that textual contexts and entity types are the major information sources behind the success of prior approaches.
Given that more complicated relations are often expressed by multiple sentences, recent focus of RE has been largely shifted to the document level (Yao et al., 2019;Cheng et al., 2021). Existing document-level RE methods (Zeng et al., 2020;Zhou et al., 2021) utilize advanced neural architectures such as heterogeneous graph neural networks (Yang et al., 2020) and pre-trained language models (Xu et al., 2021b). However, although documents typically include longer contexts and more intricate entity interactions, most prior methods only implicitly learn to encode contexts and entity types while being trained for RE. As a result, they deliver inferior and uninterpretable results.
On the other hand, it has been a trend that many recent datasets support the training of more powerful language models by providing multi-task annotations such as coreference and evidence (Yao et al., 2019;Li et al., 2016;Wu et al., 2019). Therefore, in contrast to existing methods, we advocate for explicitly guiding the model to capture textual contexts and entity type information by Supervising and Augmenting Intermediate Steps (SAIS) for RE. More specifically, we argue that, from the input document with annotated entity mentions to the ultimate output of RE, there are four intermediate steps involved in the reasoning process. Consider the motivating example in Figure 1: (1) Coreference Resolution (CR): Although Sentence 0 describes the "citizenship" of "Carl Linnaeus the Younger" and Sentence 1 discusses the "father" of "Linnaeus filius", the two names essentially refer to the same person. Hence, given a document, we need to first resolve various contextual roles represented by different mentions of the same entity via CR.
(2) Entity Typing (ET): Within an entity pair, the type information of the head and tail entities can be used to filter out impossible relations; for instance, the relation "year_of_birth" can never hold between two entities of type "PER". (3) Pooled and (4) Fine-grained Evidence Retrieval (PER and FER): A unique task for locating the relevant contexts within a document for an entity pair with any valid relation is to retrieve the evidence sentences supporting the relation. Nonetheless, some entity pairs may not express valid relations within the given document (e.g., Entities D and B in the example). Meanwhile, some entity pairs possess multiple relations (e.g., Entity A is both "educated_at" and an "employee" of Entity D), each with a different evidence set. Therefore, we use PER to distinguish entity pairs with and without valid supporting sentences and FER to output more interpretable evidence unique to each valid relation of an entity pair. In addition, based on the predicted evidence, we filter relevant contexts by augmenting specific intermediate steps with pseudo documents or attention masks. By assessing model confidence, we apply these two kinds of evidence-based data augmentation, together with ensemble inference, only when the model is uncertain about its original predictions. In this way, we further boost the performance with negligible computational cost.
Altogether, our SAIS method achieves state-of-the-art RE performance on three benchmarks (DocRED (Yao et al., 2019), CDR (Li et al., 2016), and GDA (Wu et al., 2019)) due to more effective supervision and enhances interpretability by improving the evidence retrieval (ER) F1 score on DocRED by a relative 5.04% compared to the runner-up.

Problem Formulation

Consider a document d containing a set of sentences S_d and a set of entities E_d, where each entity e ∈ E_d is assigned an entity type c ∈ C and appears at least once in d via its mentions M_e. For a pair of head and tail entities (e_h, e_t), document-level RE aims to predict whether any relation r ∈ R exists between them, based on whether r is expressed by some pair of e_h's and e_t's mentions in d. Here, C and R are pre-defined sets of entity and relation types, respectively. Moreover, for (e_h, e_t) and each of their valid relations r ∈ R_{h,t}, ER aims to identify the subset V_{h,t,r} of S_d that is sufficient to express the triplet (e_h, e_t, r).
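To make the task structure concrete, the following is a minimal, hypothetical data layout for one document and its predictions; all field names and the example content are illustrative, not from the SAIS codebase or DocRED's actual schema.

```python
# A hypothetical document: a list of sentences plus typed entities,
# each with (sentence index, surface form) mention records.
doc = {
    "sentences": ["Carl Linnaeus the Younger was born in Sweden.",
                  "Linnaeus filius was the son of Carl Linnaeus."],
    "entities": [
        {"id": "A", "type": "PER",
         "mentions": [(0, "Carl Linnaeus the Younger"),
                      (1, "Linnaeus filius")]},
        {"id": "B", "type": "LOC", "mentions": [(0, "Sweden")]},
    ],
}

# Document-level RE predicts relations for (head, tail) entity pairs;
# ER attaches the supporting-sentence subset V_{h,t,r} to each triplet.
predictions = {("A", "B", "country_of_citizenship"): {"evidence": [0]}}
```

Each predicted triplet thus carries its own evidence set, which is exactly what FER (Section 3.5) supervises.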

Related Work
Early research efforts on RE (Bach and Badaskar, 2007; Pawar et al., 2017) center around predicting relations for entity pairs at the sentence level (Zhang et al., 2017; Hendrickx et al., 2019). Many pattern-based (Califf and Mooney, 1999; Qu et al., 2018) and neural network-based (Cai et al., 2016; Feng et al., 2018) models have shown impressive results. A recent study attributes the success of these models to their ability to capture textual contexts and entity type information.
Nevertheless, since more complicated relations can only be expressed by multiple sentences, there has been a shift of focus lately towards document-level RE (Yao et al., 2019; Li et al., 2016; Cheng et al., 2021; Wu et al., 2019). According to how an approach models contexts, there are two general trends within the domain. Graph-based approaches (Nan et al., 2020; Zeng et al., 2020; Zeng et al., 2021; Xu et al., 2021c,d; Sahu et al., 2019; Guo et al., 2019) typically infuse contexts into heuristic-based document graphs and perform multi-hop reasoning via advanced neural techniques. Transformer-based approaches (Tang et al., 2020; Huang et al., 2020; Xu et al., 2021a; Zhou et al., 2021; Zhang et al., 2021; Xie et al., 2022) leverage the strength of pre-trained language models (Devlin et al., 2019) to encode long-range contextual dependencies. However, most prior methods only implicitly learn to capture contexts while being trained for RE. Consequently, they experience ineffective supervision and uninterpretable model predictions. On the contrary, we propose to explicitly teach the model to capture textual contexts and entity type information via a broad spectrum of carefully designed tasks. Furthermore, we boost the RE performance by ensembling the results of evidence-augmented inputs. Compared to EIDER (Xie et al., 2022), we leverage the more precise and interpretable FER for retrieving evidence and present two different kinds of evidence-based data augmentation. We also save computational cost by applying ensemble learning only to the uncertain subset of relation triplets. As a result, our SAIS method not only enhances the RE performance due to more effective supervision, but also retrieves more accurate evidence for better interpretability.

Supervising Intermediate Steps
This section describes the tasks that explicitly supervise the model's outputs in the four intermediate steps. Together, they improve the quality of RE.

Document Encoding
Given the promising performance of pre-trained language models (PLMs) on various downstream tasks, we resort to a PLM for encoding the document. More specifically, for a document d, we insert a classifier token "[CLS]" and a separator token "[SEP]" at the start and end of each sentence s ∈ S_d, respectively. Each mention m ∈ M_d is wrapped with a pair of entity markers "*" (Zhang et al., 2017) to indicate the position of entity mentions. Then we feed the document, with alternating segment token indices for each sentence (Liu and Lapata, 2019), into the PLM:

[H, A] = PLM(d),

to obtain the token embeddings H ∈ R^{N_d×H} and the cross-token attention A ∈ R^{N_d×N_d}, where A is the average of the attention heads in the last transformer layer (Vaswani et al., 2017) of the PLM, N_d is the number of tokens in d, and H is the embedding dimension of the PLM. We take the embedding of the "*" or "[CLS]" token before each mention or sentence as the corresponding mention or sentence embedding, respectively.
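The marker-insertion step described above can be sketched as follows; `mark_document` and its input layout are hypothetical helpers, not from the SAIS codebase.

```python
def mark_document(sentences, mentions):
    """Insert "[CLS]"/"[SEP]" around each sentence and wrap each mention
    with "*" entity markers, as described in the text.

    `sentences` is a list of token lists; `mentions` maps hypothetical
    (sent_idx, start, end) spans to entity ids.
    """
    marked = []
    for i, toks in enumerate(sentences):
        toks = list(toks)
        # Wrap mentions right-to-left so earlier offsets stay valid.
        spans = sorted([s for s in mentions if s[0] == i],
                       key=lambda s: s[1], reverse=True)
        for _, start, end in spans:
            toks[start:end] = ["*"] + toks[start:end] + ["*"]
        marked += ["[CLS]"] + toks + ["[SEP]"]
    return marked
```

The "*" tokens then serve as the positions whose embeddings are read off as mention embeddings, and each "[CLS]" as the sentence embedding.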

Coreference Resolution (CR)
As a case study, it is reported by Yao et al. (2019) that around 17.6% of relation instances in DocRED require coreference reasoning. Hence, after encoding the document, we resolve the repeated contextual mentions of an entity via CR. In particular, considering a pair of mentions (m_i, m_j), we determine the probability that m_i and m_j refer to the same entity by passing their corresponding embeddings m_i and m_j through a group bilinear layer (Zheng et al., 2019). The layer splits the embeddings into K equal-sized groups ([m_i^1, ..., m_i^K] = m_i; similarly for m_j) and applies a bilinear transformation with parameter W_m^k ∈ R^{H/K×H/K} within each group:

P^CR_{i,j} = σ(Σ_{k=1}^{K} (m_i^k)^T W_m^k m_j^k + b_m),

where b_m ∈ R and σ is the sigmoid function.
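A minimal NumPy sketch of the group bilinear layer above; the function name and parameter shapes are our own conventions, not the paper's implementation.

```python
import numpy as np

def group_bilinear(x, y, W, b):
    """Group bilinear layer: split H-dim embeddings x, y into K groups of
    size H/K, sum the per-group bilinear forms, and apply a sigmoid.
    W is assumed to have shape (K, H/K, H/K) and b is a scalar bias.
    """
    K, g, _ = W.shape
    xg = x.reshape(K, g)          # [x^1, ..., x^K]
    yg = y.reshape(K, g)
    logit = sum(xg[k] @ W[k] @ yg[k] for k in range(K)) + b
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid probability
```

Grouping reduces the parameter count of a full H×H bilinear map to K blocks of size (H/K)², which is why the paper can afford one such layer per task.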
Since most mention pairs refer to distinct entities (each entity has only 1.34 mentions on average in DocRED), we adopt the focal loss (Lin et al., 2017) on top of the binary cross-entropy to mitigate this extreme class imbalance:

L_CR = - Σ_{i≠j} w^CR_{i,j} (1 - P̃^CR_{i,j})^{γ_CR} log P̃^CR_{i,j},

where P̃^CR_{i,j} = P^CR_{i,j} if y^CR_{i,j} = 1 and 1 - P^CR_{i,j} otherwise; y^CR_{i,j} = 1 if m_i and m_j refer to the same entity, and 0 otherwise. The class weight w^CR_{i,j} is inversely proportional to the frequency of y^CR_{i,j}, and γ_CR is a hyperparameter.
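The class-weighted focal binary cross-entropy can be sketched as follows; the vectorized formulation is ours, assuming probabilities, 0/1 labels, and per-example weights as arrays.

```python
import numpy as np

def focal_bce(p, y, w, gamma):
    """Focal loss on top of binary cross-entropy (Lin et al., 2017).
    p: predicted probabilities, y: 0/1 labels, w: per-example class
    weights, gamma: focusing hyperparameter. With gamma = 0 and w = 1
    this reduces to the plain binary cross-entropy.
    """
    p_true = np.where(y == 1, p, 1.0 - p)   # probability of the true class
    return float(np.sum(-w * (1.0 - p_true) ** gamma * np.log(p_true)))
```

Larger gamma down-weights well-classified examples, so the abundant "different entity" mention pairs contribute little to the gradient.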

Entity Typing (ET)
Within an entity pair, the type information can be used to filter out impossible relations. Therefore, we regularize entity embeddings via ET. More specifically, we first derive the embedding of an entity e by integrating the embeddings of its mentions M_e via logsumexp pooling (Jia et al., 2019): e = log Σ_{m∈M_e} exp(m). Since entity e could appear either at the head or the tail of an entity pair, we distinguish between the head entity embedding e_h and the tail entity embedding e_t via two separate linear layers:

e_h = W_h e + b_h,   e_t = W_t e + b_t.

However, no matter where e appears in an entity pair, its head and tail embeddings should always preserve e's type information. Hence, we calculate the probability of each entity type for e by passing e_ν, for ν ∈ {h, t}, through a linear layer followed by the multi-class cross-entropy loss:

P^ET_{e,ν} = ς(W_c e_ν + b_c),   L_ET = - Σ_e Σ_{ν∈{h,t}} Σ_{c∈C} y^ET_{e,c} log P^ET_{e,ν,c},

where W_c ∈ R^{|C|×H}, b_c ∈ R^{|C|}, and ς is the softmax function. y^ET_{e,c} = 1 if e is of entity type c, and 0 otherwise.
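The logsumexp pooling step can be sketched as follows; the numerically stable max-shift is a standard implementation detail, not something the paper specifies.

```python
import numpy as np

def logsumexp_pool(mention_embs):
    """Pool mention embeddings into one entity embedding per dimension:
    e = log(sum_m exp(m)), computed stably by shifting by the max.
    mention_embs: array-like of shape (num_mentions, H).
    """
    M = np.asarray(mention_embs, dtype=float)
    mx = M.max(axis=0)
    return mx + np.log(np.exp(M - mx).sum(axis=0))
```

Unlike mean pooling, logsumexp acts as a smooth maximum, so a single strongly informative mention can dominate the entity embedding.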

Pooled Evidence Retrieval (PER)
To further capture textual contexts, we explicitly guide the attention in the PLM to the supporting sentences of each entity pair via PER. That is, we want to identify the pooled evidence set V_{h,t} = ∪_{r∈R_{h,t}} V_{h,t,r} in d that is important to an entity pair (e_h, e_t), regardless of the specific relation expressed by a particular sentence s ∈ V_{h,t}. In this case, given (e_h, e_t), we first compute a unique context embedding c_{h,t} based on the cross-token attention from Equation 1:

c_{h,t} = H^T (A_h ⊗ A_t) / (1^T (A_h ⊗ A_t)).

Here, ⊗ is the element-wise product, and A_h is e_h's attention to all the tokens in the document (i.e., the average of e_h's mention-level attention); similarly for A_t. Then we measure the probability that a sentence s ∈ S_d is part of the pooled supporting evidence V_{h,t} by passing (e_h, e_t)'s context embedding c_{h,t} and sentence s's embedding s through a group bilinear layer:

P^PER_{h,t,s} = σ(Σ_{k=1}^{K} (c_{h,t}^k)^T W_p^k s^k + b_p),

where W_p^k ∈ R^{H/K×H/K} and b_p ∈ R. Again, we face a severe class imbalance here, since most entity pairs (97.1% in DocRED) do not have valid relations or supporting evidence. As a result, similar to Section 3.2, we also use the focal loss with the binary cross-entropy:

L_PER = - Σ_{h,t,s} w^PER_{h,t,s} (1 - P̃^PER_{h,t,s})^{γ_PER} log P̃^PER_{h,t,s},

where P̃^PER_{h,t,s} is the predicted probability of the true label y^PER_{h,t,s} = 1{s ∈ V_{h,t}}, the class weight w^PER_{h,t,s} is inversely proportional to the frequency of y^PER_{h,t,s}, and γ_PER is a hyperparameter.
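The context-embedding computation above can be sketched in NumPy as follows; the function name and the single-vector attention inputs are our simplifications of the paper's batched formulation.

```python
import numpy as np

def context_embedding(H, A_h, A_t):
    """Entity-pair context embedding from token embeddings H (N x d) and
    the head/tail entities' attention distributions A_h, A_t (length N):
    c = H^T (A_h * A_t) / sum(A_h * A_t), as in the PER step.
    """
    q = A_h * A_t                 # element-wise product of attentions
    q = q / q.sum()               # normalize to a distribution over tokens
    return H.T @ q                # attention-weighted average of tokens
```

Tokens attended to by both the head and the tail entity dominate q, so c_{h,t} summarizes exactly the context shared by the pair.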

Fine-grained Evidence Retrieval (FER)
In addition to PER, we would like to further refine V_{h,t}, since an entity pair could have multiple valid relations and, correspondingly, multiple sets of evidence. As a result, we explicitly train the model to recover the contextual evidence unique to a triplet (e_h, e_t, r) via FER for better interpretability. More specifically, given (e_h, e_t, r), we first generate a triplet embedding l_{h,t,r} by merging e_h, e_t, c_{h,t}, and r's relation embedding r via a linear layer:

l_{h,t,r} = W_l [e_h; e_t; c_{h,t}; r] + b_l,

where W_l ∈ R^{H×4H}, b_l ∈ R^H, [·; ·] represents concatenation, and r is initialized from the embedding matrix of the PLM.
Similarly, we use a group bilinear layer to assess the probability that a sentence s ∈ S_d is included in the fine-grained evidence set V_{h,t,r}:

P^FER_{h,t,r,s} = σ(Σ_{k=1}^{K} (l_{h,t,r}^k)^T W_f^k s^k + b_f),

where W_f^k ∈ R^{H/K×H/K} and b_f ∈ R. Since FER only involves entity pairs with valid relations, the class imbalance is milder here than in PER. Hence, letting y^FER_{h,t,r,s} = 1{s ∈ V_{h,t,r}}, we deploy the standard binary cross-entropy loss:

L_FER = - Σ_{h,t,r,s} [ y^FER_{h,t,r,s} log P^FER_{h,t,r,s} + (1 - y^FER_{h,t,r,s}) log(1 - P^FER_{h,t,r,s}) ].

Relation Extraction (RE)
Based on the four complementary tasks introduced above, for an entity pair (e_h, e_t), we encode relevant contexts in c_{h,t} and preserve entity type information in e_h and e_t. Ultimately, we acquire the contexts needed by the head and tail entities from c_{h,t} via two separate linear layers:

c_h = W_{c_h} c_{h,t} + b_{c_h},   c_t = W_{c_t} c_{h,t} + b_{c_t},

where W_{c_h}, W_{c_t} ∈ R^{H×H} and b_{c_h}, b_{c_t} ∈ R^H, and then combine them with the type information to generate the head and tail entity representations:

z_h = tanh(e_h + c_h),   z_t = tanh(e_t + c_t).

Next, a group bilinear layer is utilized to calculate the logit of how likely a relation r ∈ R exists between e_h and e_t:

L^RE_{h,t,r} = Σ_{k=1}^{K} (z_h^k)^T W_r^k z_t^k + b_r,

where W_r^k ∈ R^{H/K×H/K} and b_r ∈ R. As discussed earlier, only a small portion of entity pairs have valid relations, among which multiple relations could co-exist within a pair. Therefore, to deal with this multi-label imbalanced classification problem, we follow Zhou et al. (2021) by introducing a threshold relation class TH and adopting an adaptive threshold loss:

L_RE = - Σ_{h,t} [ Σ_{r∈P_{h,t}} log( exp(L^RE_{h,t,r}) / Σ_{r'∈P_{h,t}∪{TH}} exp(L^RE_{h,t,r'}) ) + log( exp(L^RE_{h,t,TH}) / Σ_{r'∈N_{h,t}∪{TH}} exp(L^RE_{h,t,r'}) ) ].

In essence, we aim to increase the logits of valid relations P_{h,t} and decrease the logits of invalid relations N_{h,t}, both relative to TH. Overall, with the goal of improving the model's RE performance by better capturing entity type information and textual contexts, we have designed four tasks to explicitly supervise the model's outputs in the corresponding intermediate steps. We visualize the entire pipeline SAIS O All in Appendix A and integrate all the tasks by minimizing the multi-task learning objective

L = L_RE + Σ_Task η_Task L_Task,

where Task ∈ {CR, ET, PER, FER}, and the η_Task's are hyperparameters balancing the relative task weights.
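The adaptive threshold loss of Zhou et al. (2021) for a single entity pair can be sketched as follows; the NumPy formulation and the convention of placing the TH class at a fixed index are our own.

```python
import numpy as np

def adaptive_threshold_loss(logits, labels, th_idx=0):
    """Adaptive threshold loss for one entity pair. logits: (R,) relation
    logits including the TH class at th_idx; labels: 0/1 array with
    labels[th_idx] = 0. Valid relations are pushed above TH, TH is pushed
    above all invalid relations.
    """
    pos = np.where(labels == 1)[0]                                   # P_{h,t}
    neg = np.where((labels == 0) & (np.arange(len(labels)) != th_idx))[0]

    def log_softmax_at(idx, target):
        z = logits[idx]
        return logits[target] - (z.max() + np.log(np.exp(z - z.max()).sum()))

    # each valid relation competes against {P, TH}
    l1 = -sum(log_softmax_at(np.append(pos, th_idx), r) for r in pos)
    # TH competes against {N, TH}
    l2 = -log_softmax_at(np.append(neg, th_idx), th_idx)
    return float(l1 + l2)
```

At inference, a relation is predicted valid exactly when its logit exceeds the TH logit, which is the decision rule used in the next paragraph.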
During inference with the current pipeline SAIS O All , we predict if a triplet (e h , e t , r) is valid (i.e., if relation r exists between entity pair (e h , e t )) by checking if its logit is larger than the corresponding threshold logit (i.e., L RE h,t,r > L RE h,t,TH ). For each predicted triplet (e h , e t , r), we assess if a sentence s belongs to the evidence set V h,t,r by checking if P FER h,t,r,s > α FER where α FER is a threshold.

Augmenting Intermediate Steps
We further improve RE after training the pipeline SAIS O All by augmenting its intermediate steps with the evidence retrieved by FER.

When to Augment Intermediate Steps
The evidence predicted by FER is unique to each triplet (e_h, e_t, r). However, considering the total number of all possible triplets (around 40 million in the development set of DocRED), it is computationally prohibitive to augment the inference result of each triplet with individually predicted evidence. Instead, following the idea of selective prediction (El-Yaniv et al., 2010), we identify the triplet subset U for which the model is uncertain about its relation predictions with the original pipeline SAIS O All. More specifically, we define the model's confidence for (e_h, e_t, r) as L^O_{h,t,r} = L^RE_{h,t,r} - L^RE_{h,t,TH}. The uncertain set U then consists of the triplets with the lowest θ% absolute confidence |L^O_{h,t,r}|. Consequently, we reject the original relation predictions for (e_h, e_t, r) ∈ U and apply evidence-based data augmentation to enhance the performance (more details in Section 4.2).
To determine the rejection rate θ% (note that θ% is NOT a hyperparameter), we first sort all the triplets in the development set by their absolute confidence |L^O_{h,t,r}|. As θ% increases, the risk (i.e., the error rate) of the remaining triplets that are not in U is expected to decrease, and vice versa. On the one hand, we wish to reduce the risk for more accurate relation predictions; on the other hand, we want a low rejection rate so that data augmentation on a small rejected set incurs little computational cost. To balance this trade-off, we set θ% as the rate that achieves the minimum of risk² + rejection rate². As shown in Figure 2 (which plots the risk of non-rejected triplets against the rejection rate), this yields θ% ≈ 4.6% on the development set of DocRED. In practice, we can further limit the maximum number of rejected triplets per entity pair. By setting this limit to 10 in our experiments, we reduce the size of U to only 1.5% of all the triplets in the DocRED development set.
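The selection of θ% can be sketched as follows, assuming access to development-set correctness labels for the original predictions; the function and its inputs are illustrative.

```python
def choose_rejection_rate(confidences, correct):
    """Pick the rejection rate minimizing risk^2 + rejection_rate^2.
    confidences: absolute confidence |L^O| per triplet; correct: whether
    the original prediction for that triplet is right.
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    n = len(order)
    best, best_obj = 0.0, float("inf")
    for k in range(n):                      # reject the k least confident
        kept = order[k:]
        risk = 1.0 - sum(correct[i] for i in kept) / len(kept)
        rej = k / n
        obj = risk ** 2 + rej ** 2
        if obj < best_obj:
            best, best_obj = rej, obj
    return best
```

Geometrically, this picks the point on the risk-coverage curve closest to the origin, balancing accuracy of kept predictions against the size of the rejected set.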

How to Augment Intermediate Steps
Consider a triplet (e_h, e_t, r) ∈ U. We first assume its validity and calculate the probability P^FER_{h,t,r,s} of a sentence s being part of V_{h,t,r} as in Section 3.5. Then, analogous to how L^O_{h,t,r} is generated with SAIS O All, we design two types of evidence-based data augmentation as follows. Pseudo Document-based (SAIS D All): Construct a pseudo document using the sentences with P^FER_{h,t,r,s} > α_FER and feed it into the original pipeline to obtain the confidence L^D_{h,t,r}. Attention Mask-based (SAIS M All): Form a mask P^FER_{h,t,r} ∈ R^{N_d} from the sentence-level probabilities P^FER_{h,t,r,s} and modify the context embedding to

c'_{h,t} = H^T (A_h ⊗ A_t ⊗ P^FER_{h,t,r}) / (1^T (A_h ⊗ A_t ⊗ P^FER_{h,t,r})).

Maintain the rest of the pipeline and obtain the confidence L^M_{h,t,r}. Following Xie et al. (2022), we ensemble L^D_{h,t,r}, L^M_{h,t,r}, and the original confidence L^O_{h,t,r} with a blending parameter τ_r ∈ R (Wolpert, 1992) for each relation r ∈ R. These parameters are trained by minimizing the binary cross-entropy loss between the sigmoid of the blended confidence L^B_{h,t,r} and the label on U of the development set:

min_τ - Σ_{(e_h,e_t,r)∈U} [ y^RE_{h,t,r} log σ(L^B_{h,t,r}) + (1 - y^RE_{h,t,r}) log(1 - σ(L^B_{h,t,r})) ],

where y^RE_{h,t,r} = 1 if (e_h, e_t, r) is valid, and 0 otherwise. When making relation predictions for (e_h, e_t, r) ∈ U, we check whether its blended confidence is positive (i.e., L^B_{h,t,r} > 0). In this way, we improve the RE performance when the model is uncertain about its original predictions and save computational cost when the model is confident. The overall procedure for evidence-based data augmentation and ensemble inference, SAIS B All, is summarized in Appendix B. These steps are executed only after the training of SAIS O All and therefore add negligible computational cost.
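One plausible form of the per-relation blending and the resulting decision rule can be sketched as follows; the additive blend with a single τ per relation is our assumption, since the exact combination is fit on the development set.

```python
import numpy as np

def blended_confidence(L_O, L_D, L_M, tau):
    """Blend the original confidence with the two evidence-augmented
    confidences using a per-relation weight tau (scalar here for one
    relation type). The additive form is an illustrative assumption.
    """
    return L_O + tau * (L_D + L_M)

def predict_valid(L_B):
    """A triplet is predicted valid iff its blended confidence is positive."""
    return L_B > 0
```

In practice, τ would be fit by logistic regression (binary cross-entropy on the sigmoid of L^B) over the uncertain set U, one weight per relation type.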

Experiment Setup
We evaluate the proposed SAIS method on three document-level RE benchmarks. DocRED (Yao et al., 2019) is a large-scale crowdsourced dataset based on Wikipedia articles. It consists of 97 relation types, seven entity types, and 5,053 documents in total, where each document contains 19.5 entities on average. CDR (Li et al., 2016) and GDA (Wu et al., 2019) are two biomedical datasets: CDR studies binary interactions between disease and chemical concepts with 1,500 documents, and GDA studies binary relationships between genes and diseases with 30,192 documents. We follow prior work for splitting the train and development sets.
We run our experiments on one Tesla A6000 GPU and carry out five trials with different random seeds to report the mean and one standard error. Based on Huggingface (Wolf et al., 2019), we apply cased BERT-base (Devlin et al., 2019) and RoBERTa-large for DocRED, and cased SciBERT (Beltagy et al., 2019) for CDR and GDA. The embedding dimension H of BERT or SciBERT is 768, and that of RoBERTa is 1,024. The number of groups K in all group bilinear layers is 64.
For the general hyperparameters of language models, we follow the setting in (Zhou et al., 2021). The learning rate for fine-tuning BERT is 5e−5, that for fine-tuning RoBERTa or SciBERT is 2e−5, and that for training the other parameters is 1e−4. All the trials are optimized by AdamW (Loshchilov and Hutter, 2019) for 20 epochs with early stopping and a linearly decaying scheduler (Goyal et al., 2017) whose warm-up ratio = 6%. Each batch contains 4 documents and the gradients of model parameters are clipped to a maximum norm of 1.

Quantitative Evaluation
Besides RE, DocRED also requires predicting the supporting evidence for each relation instance. Therefore, we apply SAIS B All to both RE and ER. We report the results of SAIS B All as well as existing graph-based and transformer-based baselines in Table 1 (full details in Appendix C). Generally, thanks to PLMs' strength in modeling long-range dependencies, transformer-based methods perform better on RE than graph-based methods. Moreover, most earlier approaches are not capable of ER, despite the interpretability ER adds to the predictions. In contrast, our SAIS B All method not only establishes a new state-of-the-art result on RE, but also outperforms the runner-up significantly on ER.
Since neither CDR nor GDA annotates evidence sentences, we apply SAIS O RE+CR+ET here: it is trained with RE, CR, and ET, and performs inference without data augmentation. As shown in Table 2 (full details in Appendix C), our method improves the prior best RE F1 scores by 2.7% and 1.8% absolutely on CDR and GDA, respectively. This indicates that our proposed method can still improve upon the baselines even when only a subset of the four complementary tasks is annotated and operational.

Ablation Study
To investigate the effectiveness of each of the four complementary tasks proposed in Section 3, we carry out an extensive ablation study on the DocRED development set by training SAIS with all possible combinations of those tasks. As shown in Table 3, without any complementary tasks, the RE performance of SAIS is comparable to ATLOP (Zhou et al., 2021) due to similar neural architectures. When only one complementary task is allowed, PER is the most effective single task, followed by ET. Although FER is functionally analogous to PER, since FER only involves the small subset of entity pairs with valid relations, the performance gain brought by FER alone is limited. When two tasks are used jointly, the pair of PER and ET, which combines textual contexts and entity type information, delivers the most significant improvement. The pair of PER and FER also performs well, which reflects the earlier finding that context is the most important source of information. The version with all tasks except CR sees the least drop in F1, indicating that CR's supervision signals on capturing contexts can be covered in part by PER and FER. Last but not least, the SAIS pipeline with all four complementary tasks achieves the highest F1 score. Similar trends are also recognized on CDR and GDA in Table 2, where SAIS trained with both CR and ET (besides RE) scores higher than its single-task counterparts.

Table 2 excerpt (RE F1, %):

Model                                  CDR    GDA
LSR (Nan et al., 2020)                 64.8   82.2
SciBERT (Beltagy et al., 2019)         65.1   82.5
DHG                                    65.9   83.1
SSAN-SciBERT (Xu et al., 2021a)        68.7   83.7
ATLOP-SciBERT (Zhou et al., 2021)      69.4   83.9
SIRE-BioBERT (Zeng et al., 2021)       70.8   84.7
DocuNet-SciBERT (Zhang et al., 2021)   76
Moreover, as compared to the original pipeline SAIS O All, pseudo document-based data augmentation SAIS D All acts as a hard filter by directly removing predicted non-evidence sentences, while attention mask-based data augmentation SAIS M All distills the context more softly. Therefore, we observe in Table 4 that SAIS D All earns a higher precision, whereas SAIS M All attains a higher recall. By ensembling SAIS O All, SAIS D All, and SAIS M All, we improve the RE F1 score by 0.57% absolutely on the DocRED development set.

Qualitative Analysis
To obtain a more insightful understanding of how textual contexts and entity type information help with RE, we present a case study in Figure 3 (a). Here, SAIS O RE+ET is trained with the task (i.e., ET) related to entity type information, while SAIS O RE+CR+PER+FER is trained with the tasks (i.e., CR, PER, and FER) related to textual contexts.

Table 4: Ablation study (%) using BERT base to assess the effectiveness of data augmentation (i.e., original (SAIS O All), pseudo document-based (SAIS D All), and attention mask-based (SAIS M All)) for RE based on the DocRED development set.

Compared to SAIS O All, which is trained with all four complementary tasks, they both exhibit drawbacks qualitatively. In particular, SAIS O RE+ET can easily infer the relation "country" between Entities E and C based on their respective types "ORG" and "LOC", whereas SAIS O RE+CR+PER+FER may misinterpret Entity E as of type "PER" and wrongly infer the relation "citizenship". On the other hand, SAIS O RE+CR+PER+FER can directly predict the relation "place_of_birth" between Entities A and B by pattern matching, while overemphasizing the type "LOC" of Entity B may cause SAIS O RE+ET to deliver the wrong relation prediction "location". Last but not least, SAIS O All effectively models contexts spanning multiple sentences and regularizes them with entity type information. As a result, it is the only SAIS variant that correctly predicts the relation "country_of_origin" between Entities D and C.

Figure 3: (a) Case study on the effectiveness of textual contexts and entity type information based on models' extracted relations from the DocRED development set. Excerpt: "[1] He wrote the novels that formed the basis of two … films, Kiss of Death … and The People Against O'Hara … [3] Lipsky, …, was an assistant district attorney … and served as legal counsel to the Mystery Writers of America." By capturing contexts across sentences and regularizing them with entity type information, SAIS O All extracts relations of better quality. (b) Case study on the difference between FER and PER based on retrieved evidence from the DocRED development set. FER considers evidence unique to each relation for better interpretability. Irrelevant sentences are omitted here.
Furthermore, to examine why SAIS (which uses FER for retrieving evidence) significantly outperforms EIDER (Xie et al., 2022) (which uses PER) on ER in Table 1, we compare the performance of FER and PER in the case study in Figure 3 (b). More specifically, PER identifies the same set of evidence for both relations between Entities A and B, among which Sentence 2 describes "place_of_birth" while Sentence 6 discusses "place_of_death". In contrast, FER considers an evidence set unique to each relation and outputs more interpretable results.

Conclusion
In this paper, we propose to explicitly teach the model to capture the major information sources of RE, namely textual contexts and entity types, by Supervising and Augmenting Intermediate Steps (SAIS). Based on a broad spectrum of carefully designed tasks, SAIS extracts relations of enhanced quality due to more effective supervision and retrieves more accurate evidence for improved interpretability. SAIS further boosts the performance with evidence-based data augmentation and ensemble inference while limiting the computational cost by assessing model uncertainty. Experiments on three benchmarks demonstrate the state-of-the-art performance of SAIS on both RE and ER.
If given a plain document, we can utilize existing tools (e.g., spaCy) to obtain noisy annotations and apply our method afterward. It is also interesting to investigate how other tasks (e.g., named entity recognition) could be incorporated into the multi-task learning pipeline of SAIS. We plan to explore these extensions in future work.

A Multi-Task Learning Pipeline by Supervising Intermediate Steps (SAIS O All )
To explicitly teach the model to capture relevant contexts and entity type information for RE, we design four tasks to supervise the model's outputs in the corresponding intermediate steps. We illustrate the overall multi-task pipeline SAIS O All in Figure 4.

B Ensemble Inference Algorithm with Evidence-based Data Augmentation (SAIS B All )
After training the multi-task pipeline SAIS O All proposed in Section 3, we further boost the model performance by evidence-based data augmentation and ensemble inference as discussed in Section 4. The detailed steps are explained in Algorithm 1 below.

C Experiment Details
We compare the proposed SAIS method against existing baselines based on three benchmarks: CDR (Li et al., 2016) and GDA (Wu et al., 2019) in Table 5, and DocRED (Yao et al., 2019) in Table 6. The details are explained in Section 5.
In particular, DocRED uses the MIT License, CDR is freely available for the research community, and GDA uses the GNU Affero General Public License. DocRED is constructed from Wikipedia and Wikidata and therefore contains information that names people. However, since our research focuses on identifying relations among real-world entities (including public figures) based on a given document, it is impossible to fully anonymize the dataset. We ensure that we only use publicly available information in our experiments, and our use of these datasets is consistent with their intended use. Although our method achieves state-of-the-art performance on RE and ER, using the predicted relations and evidence directly for downstream tasks without manual validation may propagate errors from incorrect predictions. The experiments in this paper focus on English documents from the biomedical and general domains, but our proposed framework can be easily extended to documents in other languages.