Entity and Evidence Guided Document-Level Relation Extraction

Document-level relation extraction is a challenging task, requiring reasoning over multiple sentences to predict a set of relations in a document. In this paper, we propose a novel framework, E2GRE (Entity and Evidence Guided Relation Extraction), that jointly extracts relations and the underlying evidence sentences by using a large pretrained language model (LM) as the input encoder. First, we propose to guide the pretrained LM's attention mechanism to focus on relevant context by using attention probabilities as additional features for evidence prediction. Furthermore, instead of feeding the whole document into the pretrained LM to obtain entity representations, we concatenate the document text with head entities to help the LM concentrate on the parts of the document that are most related to each head entity. Our E2GRE jointly learns relation extraction and evidence prediction effectively, showing large gains on both tasks, which we find are highly correlated. Our experimental results on DocRED, a large-scale document-level relation extraction dataset, are competitive with the top of the public leaderboard for relation extraction and top-ranked on evidence prediction, which shows that our E2GRE is both effective and synergistic for relation extraction and evidence prediction.

Introduction
Relation Extraction (RE), the problem of predicting relations between pairs of entities from text, has received increasing research attention in recent years [Zhang et al., 2017; Zhao et al., 2019; Guo et al., 2019]. This problem has important downstream applications in numerous tasks, such as automatic knowledge acquisition from web documents for knowledge graph construction [Trisedya et al., 2019], question answering [Yu et al., 2017], and dialogue systems [Young et al., 2018]. While most previous work focuses on relation extraction at the sentence level, in real-world applications, e.g., predicting relations from web articles, the majority of relations are expressed across multiple sentences. Figure 1 shows an example from the recently released DocRED dataset [Yao et al., 2019], which requires reasoning over three evidence sentences to predict the relational fact that "Link" is present in the work "The Legend of Zelda". In this paper, we focus on the more challenging task of document-level relation extraction and design a method to facilitate document-level reasoning.
Aside from extracting entity relations from a document, it is often useful to also highlight the evidence that a system uses to predict them, so that a human or a second system can verify them for consistency. Moreover, evidence prediction can potentially improve RE performance by restricting the model's focus to the correct context. In preliminary experiments, we find that current models can achieve around 87% RE F1 on DocRED when trained and evaluated only on the gold evidence sentences, a significant improvement over current leaderboard DocRED RE F1 numbers (∼63% RE F1). However, evidence prediction is a challenging task, and most existing relation extraction approaches ignore it entirely.
Most recent approaches for relation extraction fine-tune large pretrained Language Models (LMs) (e.g., BERT [Devlin et al., 2019], RoBERTa) as the input encoder. However, naively adapting pretrained LMs to document-level RE faces an issue that limits performance. Due to the length of a given document, many more entities and relations exist in document-level RE than in intra-sentence RE. A pretrained LM has to simultaneously encode information regarding all pairs of entities for relation extraction, making the task more difficult and limiting the pretrained LM's effectiveness.
In this paper we propose a new framework: Entity and Evidence Guided Relation Extraction (E2GRE), which jointly solves relation extraction and evidence prediction. For evidence prediction, we take a pretrained LM as the input encoder and use its internal attention probabilities as additional features to predict evidence sentences. As a result, the supporting evidence sentences provide direct supervision on which tokens the LM should attend to during finetuning, which in turn helps improve relation extraction in a joint training framework. To further help the LM focus on a smaller set of relevant context in a long document, we also introduce entity-guided input sequences, formed by appending each head entity to the document text, one at a time. This allows the LM encoder to explicitly model relations involving a specific head entity while ignoring all other entity pairs, thus simplifying the task for the encoder. The joint training framework helps the model locate the correct semantics required for each relation prediction. To the best of our knowledge, we are the first to present an effective joint training framework for relation extraction and evidence prediction.
Each of these ideas gives a significant boost in performance, and by combining them, we achieve highly competitive results on the DocRED leaderboard. We obtain 62.5 relation extraction F1 and 50.5 evidence prediction F1 with our E2GRE-trained RoBERTa LARGE model, which is the current state-of-the-art performance on evidence prediction (based on published papers on DocRED). Our proposed E2GRE framework is a simple joint training approach that effectively incorporates information from evidence prediction to guide the pretrained LM encoder, boosting performance on both relation extraction and evidence prediction.
Our main contributions are summarized as follows: • We propose to generate multiple new entity-guided inputs to a pretrained language model: for every document, we concatenate every entity with the document and feed the result as an input sequence to a pretrained LM encoder.
• We propose to use internal attention probabilities of the pre-trained LM encoder as additional features for the evidence prediction.
• Our E2GRE joint training framework, which receives guidance from both entities and evidence, improves performance on both relation extraction and evidence prediction, showing that the two tasks are mutually beneficial.

Related Work
Early work attempted to solve RE with statistical methods and various forms of feature engineering [Zelenko et al., 2003; Bunescu and Mooney, 2005]. Later, neural models showed better performance at capturing semantic relationships between entities. These methods include CNN-based approaches [Zeng et al.] and LSTM-based approaches [Cai et al., 2016]. On top of CNN/LSTM encoders, previous models added layers to capture semantic interactions. For example, Han et al. [2018] introduced hierarchical attention to generate relational information from coarse-to-fine semantics; Zhang et al. [2017] applied GCNs over pruned dependency trees; and Guo et al. [2019] introduced Attention Guided Graph Convolutional Networks (AG-GCNs) over dependency trees. These models perform well on intra-sentence relation extraction, but are not easily adapted for document-level RE.
Many approaches for document-level RE are graph-based neural network methods. Quirk and Poon [2017] first introduced a document graph for document-level RE. In [Jia et al., 2019], an entity-centric, multi-scale representation over entity-, sentence-, and document-level LSTMs was proposed for document-level n-ary RE. Christopoulou et al. [2019] recently proposed a novel edge-oriented graph model that deviates from existing graph models. Nan et al. [2020] proposed an induced latent graph, and Li et al. [2020] used an explicit heterogeneous graph for DocRED. These graph models generally focus on constructing unique nodes and edges, and have the advantage of connecting and aggregating different granularities of information. Zhou et al. [2021] pointed out the multi-entity and multi-label issues of document-level RE, and proposed two techniques, adaptive thresholding and localized context pooling, to address them.
Pretrained Language Models [Radford et al., 2019; Devlin et al., 2019] are powerful NLP tools trained on enormous amounts of unlabelled data. To take advantage of the large amounts of text that these models have seen, finetuning large pretrained LMs has been shown to be effective for relation extraction [Wadden et al., 2019]. Generally, large pretrained LMs are used to encode a sequence and then generate the representation of a head/tail entity pair for classification [Eberts and Ulges, 2019; Yao et al., 2019]. Baldini Soares et al. [2019] introduced a pretraining concept similar to BERT, called "matching-the-blanks", and pretrained a Transformer-like model for relation learning; the models were finetuned on SemEval-2010 Task 8 and TACRED and achieved state-of-the-art results. Our framework aims to improve the effectiveness of pretrained LMs for document-level relation extraction with our entity- and evidence-guided approaches.

Method
In this section, we introduce our E2GRE framework. First, we describe how to generate entity-guided inputs. Then we present how to jointly train RE with evidence prediction, and finally we show how to combine this with our evidence-guided attentions. We use BERT as the pretrained LM when describing our framework.

Entity-Guided Input Sequences
The goal of relation extraction is to predict the relation label between every head/tail (h/t) pair of given entities in a given document. Most standard models approach this problem by feeding in an entire document and then extracting all of the head/tail pairs to predict relations.

Figure 2: Diagram of our E2GRE framework. As shown in the diagram, we pass an input sequence consisting of an entity and a document into BERT. We extract heads and tails for relation extraction. We show the learned relation vectors in grey. We extract sentence representations and BERT attention probabilities for evidence prediction.
Instead, we design entity-guided inputs to give BERT more guidance towards the entities during training. Each training input is organized by concatenating the tokens of the first mention of a head entity, denoted by H, with the document tokens D, to form: "[CLS]" + H + "[SEP]" + D + "[SEP]", which is then fed into BERT. We generate these input sequences for each entity in the given document. Therefore, for a document with N_e entities, N_e new entity-guided input sequences are generated and fed into BERT separately.
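As a concrete illustration, the entity-guided input construction can be sketched as follows. This is a minimal sketch with our own toy document and function names, not the paper's implementation; in practice the token lists come from the BERT tokenizer.

```python
# Build one entity-guided sequence per entity:
# "[CLS]" + head mention + "[SEP]" + document + "[SEP]".
# Plain string tokens stand in for real BERT wordpiece tokens.
def build_entity_guided_inputs(doc_tokens, entity_mentions):
    sequences = []
    for head in entity_mentions:
        sequences.append(["[CLS]"] + head + ["[SEP]"] + doc_tokens + ["[SEP]"])
    return sequences

doc = ["Link", "is", "the", "hero", "of", "The", "Legend", "of", "Zelda", "."]
entities = [["Link"], ["The", "Legend", "of", "Zelda"]]  # first mentions
inputs = build_entity_guided_inputs(doc, entities)  # N_e = 2 sequences
```

Each of the N_e sequences is then encoded by BERT independently, so the encoder only has to model relations involving one head entity at a time.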
Our framework predicts N_e − 1 different sets of relations for each training input, corresponding to its N_e − 1 head/tail entity pairs.
After passing a training input through BERT, we extract the head entity embedding and a set of tail entity embeddings from the BERT output. After obtaining the head entity embedding $h \in \mathbb{R}^d$ and all tail entity embeddings $\{t_k \mid t_k \in \mathbb{R}^d\}$ in an entity-guided sequence, where $1 \leq k \leq N_e - 1$, we feed them into a bilinear layer with the sigmoid activation function to predict the probability of the i-th relation between the head entity $h$ and the k-th tail entity $t_k$, denoted by $\hat{y}_{ik}$, as follows:

$$\hat{y}_{ik} = \delta(h^{\top} W_i t_k + b_i) \quad (1)$$

where $\delta$ is the sigmoid function, and $W_i$ and $b_i$ are the learnable parameters corresponding to the i-th relation, where $1 \leq i \leq N_r$ and $N_r$ is the number of relations. Finally, we finetune BERT with a multi-label cross-entropy loss. During inference, we group the $N_e - 1$ predicted relations of each entity-guided input sequence from the same document to obtain the final set of predictions for the document.
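The bilinear relation classifier described above can be sketched in NumPy as follows. Shapes and function names are ours for illustration; the actual model uses learned parameters trained jointly with BERT.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relation_probs(h, tails, W, b):
    """h: (d,) head embedding; tails: (N_e-1, d) tail embeddings;
    W: (N_r, d, d) per-relation bilinear weights; b: (N_r,) biases.
    Returns probs[k, i] = sigmoid(h^T W_i t_k + b_i)."""
    scores = np.einsum("d,ide,ke->ki", h, W, tails) + b
    return sigmoid(scores)
```

With zero-initialized weights every probability is sigmoid(0) = 0.5; training fits one W_i and b_i per relation type under the multi-label cross-entropy loss.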

Evidence Prediction
Evidence sentences are sentences which contain important facts for predicting the correct relationships between head and tail entities. Therefore, evidence prediction is a very important auxiliary task to relation extraction and also provides explainability for the model. We build our evidence prediction upon the baseline introduced by Yao et al. [2019], which we will describe next.
Let $N_s$ be the number of sentences in the document. We first obtain the sentence embeddings $s \in \mathbb{R}^{N_s \times d}$ by averaging the embeddings of the words in each sentence (i.e., Sentence Extraction in Fig. 2). These word embeddings are taken from the BERT output embeddings.
Let $r_i \in \mathbb{R}^d$ be the relation embedding of the i-th relation ($1 \leq i \leq N_r$), which is learnable and randomly initialized in our model. We employ a bilinear layer with the sigmoid activation function to predict the probability of the j-th sentence $s_j$ being an evidence sentence w.r.t. the given i-th relation $r_i$ as follows:

$$F^{i}_{jk} = \delta(s_j W^{i}_{r} r_i + b^{i}_{r}) \quad (2)$$
$$\hat{y}^{j}_{ik} = \delta(F^{i}_{jk} W^{o}_{r} + b^{o}_{r}) \quad (3)$$

where $s_j$ represents the embedding of the j-th sentence, and $W^{i}_{r}/b^{i}_{r}$ and $W^{o}_{r}/b^{o}_{r}$ are the learnable parameters w.r.t. the i-th relation. We define the loss of evidence prediction under the given i-th relation as follows:

$$L_{Evi} = -\sum_{j=1}^{N_s} \left[ y^{j}_{ik} \log \hat{y}^{j}_{ik} + (1 - y^{j}_{ik}) \log(1 - \hat{y}^{j}_{ik}) \right]$$

where $y^{j}_{ik} \in \{0, 1\}$, and $y^{j}_{ik} = 1$ means that sentence j is an evidence sentence for the i-th relation. Note that in the training stage, we use the embedding of the true relation in Eq. 2; in the testing/inference stage, we use the embedding of the relation predicted by the relation extraction model.
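A minimal sketch of this evidence-prediction baseline, under our own assumed shapes (a bilinear interaction between each sentence embedding and the relation embedding, followed by an output layer); the real model learns these parameters jointly with BERT:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sentence_embeddings(token_embs, sent_spans):
    """Average the BERT output embeddings over each sentence span [start, end)."""
    return np.stack([token_embs[s:e].mean(axis=0) for s, e in sent_spans])

def evidence_probs(S, r, Wr, br, Wo, bo):
    """S: (N_s, d) sentence embeddings; r: (d,) relation embedding.
    Wr: (d, h, d) bilinear tensor and Wo: (h,) output weights (assumed shapes).
    Returns one evidence probability per sentence."""
    F = sigmoid(np.einsum("jd,dhe,e->jh", S, Wr, r) + br)  # fused features
    return sigmoid(F @ Wo + bo)                            # (N_s,)
```

At inference time, r would be the embedding of the relation predicted by the relation extraction head rather than the gold relation.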

Baseline Joint Training
In [Yao et al., 2019], the baseline relation extraction loss $L_{RE}$ and the evidence prediction loss $L_{Evi}$ are combined as the final objective function for joint training:

$$L = L_{RE} + \lambda L_{Evi}$$

where $\lambda > 0$ is a weight factor that trades off the two losses and is data dependent. To compare with our models, we use a BERT baseline that jointly optimizes the relation extraction loss and the evidence prediction loss.
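The combined objective is a simple weighted sum of the two losses; a toy sketch (the numeric values are ours, purely for illustration):

```python
# L = L_RE + lambda * L_Evi, with lambda trading off the two losses.
def joint_loss(loss_re, loss_evi, lam):
    return loss_re + lam * loss_evi

total = joint_loss(0.8, 2.0, 0.5)  # 0.8 + 0.5 * 2.0 = 1.8
```

In practice loss_re and loss_evi are computed per batch and lam is tuned on the development set.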

Guiding BERT Attention with Evidence Prediction
Pretrained language models have been shown to be able to implicitly model semantic relations internally. By looking at internal attention probabilities, Clark et al.
[2019] showed that BERT learns coreference and other semantic information in its later layers. To take advantage of this inherent property, our framework gives BERT more guidance on where the correct semantics for RE are located. For each pair of head $h$ and tail $t_k$, we introduce the idea of using internal attention probabilities extracted from the last $l$ internal BERT layers for evidence prediction. Let $Q \in \mathbb{R}^{N_h \times L \times (d/N_h)}$ be the query and $K \in \mathbb{R}^{N_h \times L \times (d/N_h)}$ be the key of the multi-head self-attention layer, $N_h$ the number of attention heads as described in [Vaswani et al., 2017], $L$ the length of the input sequence, and $d$ the embedding dimension. We first extract the attention probabilities of the multi-headed self-attention (MHSA), $A \in \mathbb{R}^{N_h \times L \times L}$, from a given BERT layer as follows:

$$A = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d/N_h}}\right) \quad (7)$$

These extraction outputs are shown as the Attention Extractor in Fig. 2.
For a given pair of head $h$ and tail $t_k$, we extract the attention probabilities corresponding to the head and tail tokens to help relation extraction. Specifically, we concatenate the MHSAs of the last $l$ BERT layers extracted by Eq. 7 to form an attention probability tensor $\tilde{A}_k \in \mathbb{R}^{l \times N_h \times L \times L}$.
Then, we calculate the attention probability representation of each sentence under the given headtail entity pair (h, t k ) as follows.
1. We first apply a max pooling layer along the attention-head dimension (i.e., the second dimension) of $\tilde{A}_k$. The max values help show where a specific attention head might be looking. Afterwards, we apply mean pooling over the last $l$ layers. From these two steps we obtain $\tilde{A}_s = \frac{1}{l}\sum_{i=1}^{l} \mathrm{maxpool}(\tilde{A}_{ki})$, with $\tilde{A}_s \in \mathbb{R}^{L \times L}$.
2. We then extract the attention probabilities of the head and tail entity tokens according to their start and end positions in the document. We average the attention probabilities over all the tokens of the head and tail mentions to obtain $\tilde{A}_{sk} \in \mathbb{R}^{L}$.
3. Finally, we generate sentence representations from $\tilde{A}_{sk}$ by averaging over the attentions of the tokens in each sentence of the document, obtaining $a_{sk} \in \mathbb{R}^{N_s}$.

Once we have the attention probabilities $a_{sk}$, we pass the sentence embeddings $F^{i}_{k}$ from Eq. 2 through a transformer layer to encourage inter-sentence interactions and form the new representation $\hat{Z}^{i}_{k}$. We combine $a_{sk}$ with $\hat{Z}^{i}_{k}$ and feed them into a bilinear layer with sigmoid ($\delta$) for evidence sentence prediction as follows:

$$\hat{y}^{ia}_{k} = \delta(\hat{Z}^{i}_{k} W_{a} a_{sk} + b_{a}) \quad (8)$$

Finally, we define the loss of evidence prediction under a given i-th relation based on the attention probability representation as follows:

$$L^{a}_{Evi} = -\sum_{j=1}^{N_s} \left[ y^{j}_{ik} \log \hat{y}^{ia}_{jk} + (1 - y^{j}_{ik}) \log(1 - \hat{y}^{ia}_{jk}) \right]$$

where $\hat{y}^{ia}_{jk}$ is the j-th value of $\hat{y}^{ia}_{k}$ computed by Eq. 8.
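The three pooling steps above can be sketched as follows. This is a NumPy sketch under our own assumed indexing (token positions for the head/tail mentions and [start, end) token spans per sentence); in the real model the attention tensor comes from BERT's internal layers.

```python
import numpy as np

def sentence_attention_features(att, head_pos, tail_pos, sent_spans):
    """att: (l, N_h, L, L) attention probabilities from the last l layers.
    head_pos / tail_pos: token indices of the head and tail mentions.
    sent_spans: [(start, end), ...] token spans of each sentence.
    Returns a_sk in R^{N_s}: one attention score per sentence."""
    # Step 1: max over attention heads, then mean over the l layers -> (L, L)
    A_s = att.max(axis=1).mean(axis=0)
    # Step 2: average the attention rows of the head and tail tokens -> (L,)
    A_sk = A_s[head_pos + tail_pos].mean(axis=0)
    # Step 3: average within each sentence span -> (N_s,)
    return np.array([A_sk[s:e].mean() for s, e in sent_spans])
```

The resulting per-sentence scores are then combined with the transformed sentence representations for the final bilinear evidence predictor.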

Joint Training with Evidence Guided Attention Probabilities
Here we combine the relation extraction loss and the attention-guided evidence prediction loss as the final objective function for joint training:

$$L = L_{RE} + \lambda_{a} L^{a}_{Evi}$$

where $\lambda_{a} > 0$ is a weight factor that trades off the two losses and is data dependent.

Dataset
DocRED [Yao et al., 2019] is a large document-level dataset for the tasks of relation extraction and evidence prediction. It consists of 5053 documents, 132375 entities, and 56354 relations mined from Wikipedia articles. For each (head, tail) entity pair, there are 97 candidate relation types to predict. The first relation type is an "NA" relation between two entities, and the rest correspond to WikiData relation names. Each head/tail pair with a valid relation also includes a set of evidence sentences. We follow the same setting as [Yao et al., 2019] and split the data into Train/Development/Test for fair comparison; the number of documents in Train/Development/Test is 3000/1000/1000, respectively. The dataset is evaluated with relation extraction F1 (RE F1) and evidence prediction F1 (Evi F1). Some relational facts occur in both the train and development sets, so we also evaluate Ign RE F1, which excludes these relational facts.

Experimental Setup
Hyper-parameter Setting. The configuration of the BERT BASE model follows the setting in [Devlin et al., 2019]. We set the learning rate to 1e-5 and λ_a to 1e-4, set the hidden dimension of the relation vectors to 108, and extract internal attention probabilities from the last three BERT layers.
We conduct our experiments by fine-tuning the BERT BASE model. The implementation is based on the HuggingFace [Wolf et al., 2020] PyTorch [Paszke et al., 2017] implementation of BERT. The DocRED baseline and our E2GRE model have 115M parameters. We implement a RoBERTa-large model for the public leaderboard.

Baseline models. We compare our framework with the following published models.
1. Context-Aware BiLSTM. [Yao et al., 2019] introduced the original baseline for DocRED. They used a context-aware BiLSTM (with additional features such as entity type, coreference, and distance) to encode the document. Head and tail entities are then extracted for relation extraction.
2. BERT Two-Step. [Wang et al., 2019] introduced finetuning BERT in a two-step process, where the model first predicts whether a relation is NA, and then predicts the rest of the relations.
3. HIN. [Tang et al., 2020] introduced a hierarchical inference network that aggregates information from the entity level to the sentence level and further to the document level, in order to obtain semantic reasoning over an entire document.
4. CorefBERT. CorefBERT introduced a way of pretraining BERT that encourages the model to attend more to relations between the coreferences of different noun phrases.
5. BERT+LSR. [Nan et al., 2020] introduced an induced latent graph structure to help learn how information should flow between entities and sentences within a document.
6. ATLOP. [Zhou et al., 2021] introduced adaptive thresholding and localized context pooling to alleviate the multi-label and multi-entity issues in document-level RE.

Table 1: Main results (%) on the development and test set of DocRED. We report the official test score of the best checkpoint on the development set. Our E2GRE framework is competitive with the top of the current DocRED leaderboard, and is the best on the public leaderboard for evidence prediction.

• Our RE result is highly competitive with the best published models using the BERT BASE model. Our proposed framework is also the only one that solves the dual task of evidence prediction, while taking advantage of evidence sentences for relation extraction.

Main Results
• By replacing BERT BASE with RoBERTa LARGE, we obtain state-of-the-art performance on evidence prediction on the DocRED leaderboard. Our test result ranks in the top 3 on the public leaderboard for relation extraction, and first for evidence prediction, which shows that our E2GRE is both effective and mutually beneficial for relation extraction and evidence prediction.
We see that our framework significantly boosts F1 scores on both relation extraction and evidence prediction compared to previous BERT BASE models. Even though we do not have the state-of-the-art performance on relation extraction, we are the first to show that appropriate joint training of RE and evidence prediction can effectively improve performance on both. Table 2 compares our proposed E2GRE with the joint-training BERT baseline, as described in our method section, on evidence prediction. We examine the comparison under two challenging scenarios in the dev set: 1) entity pairs that consist of multiple mentions in a document; and 2) entity pairs with multiple evidence sentences for evidence prediction.
From Table 2, we observe that E2GRE shows consistent F1 improvement in both settings. This is due to the evidence-guided attention probabilities from the pretrained LM, which help extract relevant contexts from the document. These relevant contexts further benefit relation extraction and thus result in a significant F1 improvement compared to the baseline. In summary, our formulation of evidence prediction enhances the performance of relation extraction, and utilizing a pretrained LM's internal attention probabilities is a more effective way to perform joint training.

Ablation Study
To explore the contribution of the different components of E2GRE, we conduct an ablation study in Table 3. We start with our full E2GRE, and consecutively remove the evidence-guided attention and the entity-guided sequences. From this table, we observe that both entity-guided sequences and evidence-guided attentions play a significant role in improving F1 on relation extraction and evidence prediction: entity-guided sequences improve RE by about 2 F1 and evidence prediction by about 3.5 F1, while evidence-guided attentions improve RE by about 1.7 F1 and evidence prediction by about 1 F1.
We also observe that entity-guided sequences tend to help more on precision in both RE and evidence prediction. They ground the model on the correct entities, allowing it to be more precise in its information extraction. In contrast, evidence-guided attentions tend to help more on recall in both tasks. These attentions provide more guidance for locating relevant contexts, thereby increasing the recall of RE and evidence prediction.

Table 3: Ablation study on the evidence-guided attention and entity-guided input sequence components, obtained by consecutively removing the attention extraction module in Figure 2 and the entity-guided input sequences, on the dev set.

Table 4 shows how the number of BERT layers from which the attention probabilities are extracted affects evidence prediction and relation extraction. We observe that using the last 3 layers is better than using the last 6 layers. This is because later layers in pretrained LMs tend to focus more on semantic information, whereas earlier layers focus more on syntactic information [Clark et al., 2019]. We hypothesize that the last 6 layers may include noisy information related to syntax.

Analysis on Evidence/Relation Interdependence
In Fig. 3, we plot the change in Evi F1 against the change in RE F1; the correlation between the two is 0.7923, showing that when Evi F1 improves, RE F1 also improves. We observe that the centroid of the points lies in the first quadrant (2.7%, 5.8%), showing the overall improvement of our model. Furthermore, we analyze the effectiveness of our E2GRE model with smaller amounts of training data. Table 5 shows that our model achieves much larger gains in RE F1 when training with 10%, 30%, and 50% of the data. E2GRE-BERT BASE achieves bigger improvements with less data, as the attention probabilities used for evidence prediction provide effective guidance for relation extraction.

Conclusion
In this paper we propose a simple yet effective joint training framework, E2GRE (Entity and Evidence Guided Relation Extraction), for relation extraction and evidence prediction on DocRED. To exploit pretrained LMs more effectively for document-level RE, we first generate new entity-guided sequences to feed into the LM, focusing the model on the relevant areas of the document. We then utilize the internal attentions extracted from the last few layers to help guide the LM toward relevant sentences for evidence prediction. Our E2GRE method improves performance on both RE and evidence prediction, and achieves state-of-the-art performance on evidence prediction on the DocRED public leaderboard. We show that evidence prediction is an important auxiliary task that helps RE models perform better.