Three Sentences Are All You Need: Local Path Enhanced Document Relation Extraction

Document-level Relation Extraction (RE) is a more challenging task than sentence RE as it often requires reasoning over multiple sentences. Yet, human annotators usually use a small number of sentences to identify the relationship between a given entity pair. In this paper, we present an embarrassingly simple but effective method to heuristically select evidence sentences for document-level RE, which can be easily combined with BiLSTM to achieve good performance on benchmark datasets, even better than fancy graph neural network based methods. We have released our code at https://github.com/AndrewZhe/Three-Sentences-Are-All-You-Need.


Introduction
The task of relation extraction (RE) focuses on extracting relations between entity pairs in texts, and has played an important role in information extraction. While earlier works focus on extracting relations within a sentence (Lin et al., 2016; Zhang et al., 2018), recent studies have begun to explore RE at the document level (Peng et al., 2017; Zeng et al., 2020a; Nan et al., 2020a), which is more challenging as it often requires reasoning across multiple sentences.
Compared with sentence-level extraction, documents are significantly longer, with useful information scattered over a larger span. However, given a pair of entities, one may need only a few sentences, not the entire document, to infer their relationship; reading the whole document may be unnecessary and inevitably introduces unrelated information. As we can see in Figure 1, S[1] is sufficient to recognize Finland as the country of Espoo, and recognizing the other two instances requires just 2 sentences as supporting evidence as well. Although the document contains 6 sentences and evidence may span S[1] ∼ S[6], each relation instance can be identified by reading just 1 or 2 related sentences. This naturally leads us to consider a question: given an entity pair, how many sentences are required to identify a relationship between them? We perform a pilot study across 3 widely-used document RE datasets, DocRED (Yao et al., 2019), CDR (Li et al., 2016) and GDA (Wu et al., 2019). As shown in Table 1, we find that more than 95% of instances require no more than 3 sentences as supporting evidence, and 87% require only 2 or fewer.
Figure 1 (caption fragment): While the document has 6 sentences, only 1 or 2 sentences form the evidence for each relation instance.
Our preliminary finding suggests that, instead of taking the entire document as context, a case-specific selection may be more useful to help a model focus on the most relevant and informative evidence. Previous studies apply graph neural networks (GNNs) for this filtering process (Christopoulou et al., 2019; Zeng et al., 2020b). Here, GNNs are used to collect relevant information from the entire context through an aggregation scheme (Nan et al., 2020a) and achieve great performance, but the selection of crucial evidence from documents is still implicit and lacks interpretability. If, as indicated by our pilot study, most entity relationships can be decided with just 1 ∼ 3 evidence sentences, is there a simpler method that can filter the document explicitly while maintaining the crucial information?
We take a closer look at how entity pairs are contextually related in the annotated supporting evidence, and find that annotators tend to select sentences that can connect the two entities. We therefore design three heuristic rules to extract a small set of paths from the document, which can be seen as an approximation of the supporting evidence. Specifically, the Consecutive Paths consider the scenario where the head and tail entities are close in the context: if they are within 3 consecutive sentences, we regard these sentences as one path. The Multi-Hop Paths correspond to entity pairs in distant sentences, which can be bridged via other entities that co-occur with the head entity and tail entity in different sentences. As the third relation in Figure 1 shows, Finland co-occurs with The Espoo Cathedral in S[1] and with the EC Parish in S[6], which makes it a bridge to connect The Espoo Cathedral and the EC Parish. In this case, S[1] and S[6] compose a multi-hop path. When neither of the above rules applies, we collect all the pairs of sentences where one contains the head entity and the other contains the tail entity as Default Paths.
By comparing our path set with human-annotated supporting evidence, we find that up to 87.5% of the supporting evidence can be fully covered by our heuristically selected paths. In other words, our straightforward and interpretable rules serve as an effective proxy to select supporting evidence from documents. We further feed our selected paths to a simple neural network model and obtain surprisingly good performance on DocRED, showing that our selected evidence can retain sufficient information from the entire document to support document-level relation extraction.
2 Do we need the entire document?
For document RE, the major challenge is that the subject and object involved in a relationship may appear in different sentences. Thus, more than one sentence is required to capture the relations. Nonetheless, how many sentences from the entire document are required to identify the relationship between an entity pair? To address this question, we analyze the supporting evidence presented in DocRED. The supporting evidence for a relation instance refers to all the sentences that can be used to decide whether this relation holds between the entity pair, labeled by human annotators (Yao et al., 2019). Table 1 shows the proportions of entity relation instances with different numbers of supporting sentences. As can be seen, more than 96% of the DocRED instances are associated with at most 3 supporting sentences. These take up only 37.5% of a document, since the average document length is 8 sentences. This means that reading a small part of a document is adequate for one to identify an entity relation instance.
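As a rough illustration of the statistic reported in Table 1, the sketch below computes the share of relation instances whose annotated evidence fits within k sentences. The function name `share_within` and the toy evidence lists are hypothetical and are not taken from the released code; only the idea of counting supporting sentences per instance comes from the text.

```python
def share_within(evidence_sets, k):
    """Fraction of relation instances whose evidence has at most k sentences.

    evidence_sets: one list of sentence indices per relation instance
    (hypothetical input format chosen for this sketch).
    """
    if not evidence_sets:
        return 0.0
    hits = sum(len(set(ev)) <= k for ev in evidence_sets)
    return hits / len(evidence_sets)


# Toy data: four relation instances with 1, 2, 3 and 4 evidence sentences.
toy_evidence = [[0], [0, 5], [1, 2, 3], [0, 2, 4, 6]]
toy_share = share_within(toy_evidence, 3)

# The 37.5% figure in the text is 3 supporting sentences out of an
# average document length of 8 sentences.
document_fraction = 3 / 8
```

On the toy data, three of the four instances need at most 3 sentences, so `toy_share` is 0.75.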
We further extend our study to two widely used document RE datasets, CDR (Li et al., 2016) and GDA (Wu et al., 2019), where CDR is manually constructed and GDA is distantly supervised. In order to find the minimal number of sentences required, we ask annotators to label a minimal set of sentences that are exactly sufficient to identify an entity relation instance, instead of including all relation-associated sentences as in the original DocRED pattern. We randomly select 100 instances each from CDR and GDA for this further annotation, and the results are shown at the bottom of Table 1. Although the average length of documents in GDA and CDR is longer than in DocRED, it turns out that one can still use no more than 3 supporting sentences to identify over 95% of the entity relation instances. The results on CDR and GDA confirm our previous finding that a very small number of sentences (more precisely, no more than 3) is sufficient for human annotators to recognize almost all entity relation instances in a document in widely-used benchmark datasets.

Which sentences are decisive?
Now our question is how to select the supporting sentences that are sufficient to identify an entity relation instance. Intuitively, the supporting evidence should be the sentences that build up the connection between a pair of entities. Thus, we aim to extract sentence paths from the head entity to the tail entity to describe how they are connected. In the simplest case, if there exists one sentence that contains both the head and tail entities, the sentence itself can be seen as a path (the intra-sentence case). For more complex situations where the head and tail entities do not co-occur in one sentence, we define the following 3 types of paths which indicate how the head and tail entities can possibly be related in the context. Figure 2 provides a visualization of the three types of paths.

Consecutive Paths
Previous studies have shown that the majority of inter-sentence relations are often in nearby text (Swampillai and Stevenson, 2010). We thus select the consecutive sentences to form a path when the head and tail entities are in nearby sentences. Formally, if one mention of the head entity appears in sentence $S_i$ and one mention of the tail entity is in sentence $S_j$, these two sentences along with the sentences in between, i.e., $S_{i+1}, \dots, S_{j-1}$ (or $S_{j+1}, \dots, S_{i-1}$ when $i \ge j$), form a possible path that connects the two entities. Given that no more than 3 sentences would suffice for inference, we limit the length of these Consecutive Paths to at most 3, which means $|j - i| \le 2$. Note that this definition extends naturally to the intra-sentence case where $j = i$. We thus consider the intra-sentence case as a type of Consecutive Path. A pair of entities can correspond to multiple Consecutive Paths since they can be mentioned more than once.
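The Consecutive Path rule can be sketched in a few lines. This is a minimal illustration, not the released implementation: the function name `consecutive_paths`, the representation of an entity as a set of sentence indices, and the choice to return each path as a tuple of sentence indices in document order are all assumptions made for this example.

```python
def consecutive_paths(head_sents, tail_sents, max_len=3):
    """Enumerate Consecutive Paths for one entity pair.

    head_sents / tail_sents: sentence indices in which the head / tail
    entity is mentioned (hypothetical input format).
    A path is the run of sentences from S_i to S_j, kept only when it
    spans at most max_len sentences, i.e. |j - i| <= max_len - 1.
    """
    paths = set()
    for i in head_sents:
        for j in tail_sents:
            if abs(j - i) <= max_len - 1:
                lo, hi = min(i, j), max(i, j)
                # Include both end sentences and everything in between.
                paths.add(tuple(range(lo, hi + 1)))
    return sorted(paths)


# Head mentioned in sentence 2; tail mentioned in sentences 2 and 4.
example_paths = consecutive_paths({2}, {2, 4})
```

Here `example_paths` contains the intra-sentence path `(2,)` and the three-sentence path `(2, 3, 4)`, showing how one entity pair can yield several paths when it is mentioned more than once.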
Multi-Hop Paths
Another typical case for inter-sentence relation instances is the multi-hop relation (Yao et al., 2019; Zeng et al., 2020a). In such cases, the head and tail entities are far from each other in the document but can be connected through bridge entities, just like the entity The Espoo Cathedral in Figure 1 bridges the EC Parish and Finland in sentences 1 and 6.
For these cases, we start from the head entity, go through all the bridge entities, arrive at the tail entity, and select all the corresponding sentences along this route as a path. Formally, for the head entity $e_h$ and the tail entity $e_t$, the multi-hop relation indicates that there exists a list of bridge entities $e_{b_1}, \dots, e_{b_k}$ such that $(e_h, e_{b_1}), (e_{b_1}, e_{b_2}), \dots, (e_{b_k}, e_t)$ form $k + 1$ intra-sentence relations in sentences $S_{p_1}, \dots, S_{p_{k+1}}$, respectively. Following this route, we choose these $k + 1$ sentences as the Multi-Hop Path. Given the finding in §2 that most instances need only 3 sentences, we restrict $k$ to be at most 2, i.e., with only 1 or 2 bridge entities. It is possible to have several multi-hop paths for a given pair with different lists of bridge entities.
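The route-following procedure above can be sketched as a small depth-limited search over entity co-occurrence. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name `multi_hop_paths`, the `mentions` mapping from entity name to sentence indices, the requirement that the sentences on a route be distinct, and the entity names in the toy example are all hypothetical.

```python
def multi_hop_paths(mentions, head, tail, max_bridges=2):
    """Enumerate Multi-Hop Paths with 1..max_bridges bridge entities.

    mentions: dict mapping each entity to the set of sentence indices
    where it is mentioned (hypothetical input format).
    Two entities form an intra-sentence link if they share a sentence.
    """
    paths = set()

    def extend(current, used, sents):
        bridges = len(used) - 1  # entities on the chain besides the head
        if bridges >= 1:
            # Try to close the chain: last bridge and tail share a sentence.
            for s in mentions[current] & mentions[tail]:
                if s not in sents:  # assumption: route sentences are distinct
                    paths.add(tuple(sorted(sents + [s])))
        if bridges < max_bridges:
            for b in mentions:
                if b in used or b == tail:
                    continue
                for s in mentions[current] & mentions[b]:
                    if s not in sents:
                        extend(b, used | {b}, sents + [s])

    extend(head, {head}, [])
    return sorted(paths)


# Toy document: H and B1 share sentence 0, B1 and B2 share sentence 3,
# B2 and T share sentence 6, so two bridges connect H to T.
toy_mentions = {"H": {0}, "B1": {0, 3}, "B2": {3, 6}, "T": {6}}
two_hop = multi_hop_paths(toy_mentions, "H", "T")
```

With two bridge entities the path has $k + 1 = 3$ sentences, so `two_hop` is `[(0, 3, 6)]`. Paths are stored in document order here because the segments are later concatenated in that order.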
Default Paths
If neither of the aforementioned rules applies, we consider a rough estimate for the evidence with the most relevant sentences. We collect all pairs of sentences where one contains the head entity and the other contains the tail entity as Default Paths. Formally, let $\{S_{h_1}, \dots, S_{h_p}\}$ and $\{S_{t_1}, \dots, S_{t_q}\}$ denote the sets of sentences that contain the head entity $e_h$ and the tail entity $e_t$, respectively. For this entity pair, we will have $p \times q$ Default Paths $\{S_{h_1}, S_{t_1}\}, \dots, \{S_{h_p}, S_{t_q}\}$. Note that this type of path is extracted only when no paths are found with the previous two patterns.
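The Default Path rule and its fallback behaviour can be sketched as follows. The names `default_paths` and `select_paths`, and the convention that paths found by the other two rules are passed in as plain lists, are assumptions made for this illustration.

```python
from itertools import product


def default_paths(head_sents, tail_sents):
    """All p x q sentence pairs: one with the head, one with the tail."""
    pairs = {tuple(sorted({h, t})) for h, t in product(head_sents, tail_sents)}
    return sorted(pairs)


def select_paths(consecutive, multi_hop, head_sents, tail_sents):
    """Use Default Paths only when the first two rules found nothing."""
    found = list(consecutive) + list(multi_hop)
    return found if found else default_paths(head_sents, tail_sents)


# Head in sentences 0 and 1, tail in sentences 5 and 7: p = q = 2.
fallback = default_paths({0, 1}, {5, 7})
```

In this toy case `fallback` holds the four pairs `(0, 5)`, `(0, 7)`, `(1, 5)` and `(1, 7)`, matching the $p \times q$ count in the text.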

Comparing with Annotated Evidence
To demonstrate the effectiveness of our heuristic rules, we check the size of our path sets on DocRED and their consistency with the gold supporting evidence. As mentioned in §2, the gold annotation acts as a collection of all related evidence, while each of our extracted paths represents one possible and minimal sentence set. Ideally, if the path set is sufficient, all connecting sentences between the entity pair should be successfully captured. In other words, they would be presented via various paths in our path set. Therefore, the union of paths is expected to be a superset of the supporting evidence. We use the Coverage of the supporting evidence to measure the sufficiency of our path set, which stands for the percentage of instances whose supporting evidence is fully covered by the union of our paths. Meanwhile, the total number of paths (#Path) and the union size of the paths (#Sent) should also remain low, so as to avoid redundancy. Table 2 shows the statistics of the path sets extracted via our rules. The Consecutive Paths form a strong baseline that covers 71.7% of instances. Combining the three types (C+M+D), up to 87.5% of instances have their supporting evidence fully covered by our path sets. The main reason that C+M+D cannot cover all the instances is that the supporting evidence annotated in DocRED includes all associated sentences, while C+M+D only finds a sufficient set to identify the relation.
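The three statistics described above (Coverage, #Path and #Sent) can be computed with a short routine. This is a sketch under assumptions: the function name `path_set_stats`, the input format of (paths, evidence) pairs, and the toy data are hypothetical and do not reproduce any figure from Table 2.

```python
def path_set_stats(instances):
    """Compute Coverage, average #Path and average #Sent.

    instances: list of (paths, evidence) pairs, where paths is a list of
    sentence-index tuples and evidence is a list of gold sentence indices
    (hypothetical input format).
    """
    covered = 0
    total_paths = 0
    total_sents = 0
    for paths, evidence in instances:
        union = set().union(*paths)  # all sentences appearing in any path
        # An instance counts only if every gold sentence is in the union.
        covered += set(evidence) <= union
        total_paths += len(paths)
        total_sents += len(union)
    n = len(instances)
    return {
        "coverage": covered / n,
        "avg_paths": total_paths / n,
        "avg_sents": total_sents / n,
    }


toy_instances = [
    ([(0, 1), (0, 5)], [0, 5]),  # covered: {0, 1, 5} contains {0, 5}
    ([(2,)], [2, 3]),            # not covered: sentence 3 is missing
    ([(1, 2, 3)], [1, 3]),       # covered
]
toy_stats = path_set_stats(toy_instances)
```

On the toy data two of three instances are fully covered, with 4/3 paths and 7/3 distinct sentences per instance on average.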
Meanwhile, notice that the union of the three types contains only 2.69 different sentences on average, which means that our methods can filter out up to 2/3 of the original text. Also, our method is computationally efficient since only 2.27 paths need to be modeled on average. This demonstrates that our methods form a sufficient and nonredundant estimate for the gold supporting evidence, drastically alleviating the negative impact of irrelevant information.

Experiments
To further validate the sufficiency of our selected paths, we perform evaluation on DocRED by feeding the paths to an RE model. While previous works take entire documents as input, we replace the document with our selected paths regarding a given entity pair. Intuitively, if the paths can cover all crucial information in the document, we would expect comparable or better performance with identical model architecture, as our paths contain little irrelevant information and may help focus on a few key sentences.
Setup
Given a pair of entities, all paths are first extracted as described in §3. Since each path corresponds to one possible connection of the head and tail entities, we predict the relations with each path independently and aggregate the results afterwards. For every single path $c$, we concatenate all sentences in it as one segment $[w^c_1, \dots, w^c_m]$, where the order of sentences is the same as in the original document. The segment is fed to a BiLSTM to obtain the contextual embeddings $[h^c_1, \dots, h^c_m]$. The representation of an entity mention, which spans from the $s$-th word to the $t$-th word, is defined as $m^c_k = \frac{1}{t-s+1} \sum_{j=s}^{t} h^c_j$. The representation of an entity $e^c_i$ with $K$ mentions is computed as the average of the representations of its mentions: $e^c_i = \frac{1}{K} \sum_{k} m^c_k$. Then, we use a two-layer perceptron to calculate the probability of each relation $r$ based on the current path $c$: $P^c_{ij}(r) = \sigma(F([e^c_i; e^c_j; |e^c_i - e^c_j|; e^c_i * e^c_j]))$, where $\sigma(\cdot)$ is the Sigmoid function and $F(\cdot)$ stands for the two-layer perceptron.
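The pooling and feature construction in the setup can be illustrated in plain Python, without a deep learning framework. This sketch assumes the contextual embeddings are already available as a list of vectors; the function names and toy values are hypothetical, and the two-layer perceptron $F(\cdot)$ and the BiLSTM itself are deliberately left out.

```python
def mean(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def mention_repr(hidden, s, t):
    """Average of hidden states h_s..h_t (inclusive) for one mention."""
    return mean(hidden[s:t + 1])


def entity_repr(hidden, spans):
    """Average of the mention representations of one entity."""
    return mean([mention_repr(hidden, s, t) for s, t in spans])


def pair_features(e_i, e_j):
    """Concatenate [e_i; e_j; |e_i - e_j|; e_i * e_j] as fed to F."""
    diff = [abs(a - b) for a, b in zip(e_i, e_j)]
    prod = [a * b for a, b in zip(e_i, e_j)]
    return e_i + e_j + diff + prod


# Toy 2-dimensional "contextual embeddings" for a 4-word segment.
hidden = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
e_head = entity_repr(hidden, [(0, 1)])          # one two-word mention
e_tail = entity_repr(hidden, [(2, 2), (3, 3)])  # two one-word mentions
features = pair_features(e_head, e_tail)
```

For these toy values `e_head` is `[2.0, 1.0]`, `e_tail` is `[1.0, 3.0]`, and the feature vector has four times the embedding dimension, which is the input size the perceptron would need.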
After obtaining the prediction of every path between a given entity pair, we aggregate the predicted results by selecting the most likely predictions: $P_{ij}(r) = \max_c P^c_{ij}(r)$. We use the GloVe-100 (Pennington et al., 2014) embedding for the BiLSTM encoder with hidden size 256. Following previous works (Nan et al., 2020b), we report the F1 for intra- and inter-sentence entity pairs along with the overall F1 score as evaluation metrics.
Results
We compare our methods with previous sequence-based models and graph-based models. All these models take the entire document as input. As shown in Table 3, our selected paths with BiLSTM achieve 56.23% F1 on the test set, which outperforms the sequence-based models. Compared with the baseline BiLSTM, our model brings 5.68% and 5.62% improvement on intra- and inter-sentence entity pairs on the dev set, respectively.
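For reference, the max-over-paths aggregation rule $P_{ij}(r) = \max_c P^c_{ij}(r)$ used in the setup can be sketched as below. The function name `aggregate_max`, the dictionary format for per-path probabilities, and the relation labels in the toy example are hypothetical.

```python
def aggregate_max(path_probs):
    """Keep, for each relation, the highest probability over all paths.

    path_probs: one dict per path, mapping relation label to the
    probability predicted from that path (hypothetical input format).
    A relation missing from a path's dict is treated as probability 0.
    """
    relations = set().union(*path_probs)  # union of all relation labels
    return {r: max(p.get(r, 0.0) for p in path_probs) for r in relations}


# Two paths for the same entity pair, each scored independently.
per_path = [
    {"country": 0.2, "located_in": 0.7},
    {"country": 0.9, "located_in": 0.1},
]
pair_probs = aggregate_max(per_path)
```

Taking the maximum means a relation is predicted as soon as a single path supports it strongly, which matches the view of each path as one possible connection between the two entities.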
Surprisingly, our simple method achieves a higher performance compared with graph-based models, which are more complex and also possess the ability to filter out irrelevant information. Combined with our path-selection scheme, a BiLSTM can perform 1.25% and 1.15% better on the dev and test set, respectively, compared to the SOTA graph-based model in the same situation. This may indicate that, while graph-based models have shown excellent abilities to focus on important information in a self-adaptive manner, it is more helpful to explicitly select from the document than to fully rely on graph-based models. With a simple filtering scheme inspired by human annotations, we can better explore the potentials of existing models and produce better results.

Discussion
So far, our experiments have shown that only a limited number of sentences is required to deduce a relation instance. While these results may seem unconventional for document RE, which features complex inter-sentence relations, possible explanations exist in current works in related fields. These interdisciplinary outlooks may provide helpful insights for community members to understand the causes of the three-sentences phenomenon and revisit the problem of document-level relation extraction.
Linguistic Perspective
One likely cause of the discussed phenomenon is that the seemingly distant relations are not so difficult given their linguistic form. Stevenson (2006) mentions that a majority of inter-sentence relation instances are in fact due to co-references (anaphoric expressions or alternative descriptions). In these cases, relations could be considered to be described entirely within one sentence but with head or tail entities being referred to indirectly. Considering that anaphoric expressions are likely to appear in the sentences surrounding the candidate mentions (Chowdhury and Zweigenbaum, 2013), these findings are directly in line with our observation that Consecutive Paths could support more than 70% of relation instances, and provide evidence for the three-sentences phenomenon.
Cognitive Perspective
Another possible explanation is that the RE task is naturally defined within a limited number of entities and a limited amount of context, given the nature of the human brain. It is widely believed that Working Memory (WM) (Baddeley, 1992) plays a vital role in storing and manipulating information in inference tasks (Barreyro et al., 2012), but the capacity of WM is often limited to 4 separate information chunks (Cowan, 2001). As we need to memorize all the separate entities in the inference chain along with their relations, it is natural that we tend to describe a relation within a limited number of sentences, since rendering a relationship with more sentences may cause our WM to exceed its capacity. Daneman and Carpenter (1980) show that the success rate of completing a reading task drops drastically if the task requires more information than the subject's WM capacity allows. Therefore, as the datasets are constructed from natural language, the three-sentences phenomenon in the data may be a common pattern that we (unconsciously) follow for mutual understanding.

Conclusion
In this paper, we perform an analysis over 3 document RE benchmark datasets, and find that human annotators often use a small number of sentences to extract entity relations at the document level. This motivates us to consider which sentences are critical for document RE. We carefully design heuristic rules to select informative path sets from entire documents, which can be further combined with a simple BiLSTM to achieve competitive performance on a benchmark dataset, even better than complex graph-based methods.