SIRE: Separate Intra- and Inter-sentential Reasoning for Document-level Relation Extraction

Document-level relation extraction has attracted much attention in recent years. It is usually formulated as a classification problem that predicts relations for all entity pairs in the document. However, previous works indiscriminately represent intra- and inter-sentential relations in the same way, confounding the different patterns for predicting them. Besides, they create a document graph and use paths between entities on the graph as clues for logical reasoning. However, not all entity pairs can be connected with a path and have the correct logical reasoning paths in their graph. Thus many cases of logical reasoning cannot be covered. This paper proposes an effective architecture, SIRE, to represent intra- and inter-sentential relations in different ways. We design a new and straightforward form of logical reasoning module that can cover more logical reasoning chains. Experiments on the public datasets show SIRE outperforms the previous state-of-the-art methods. Further analysis shows that our predictions are reliable and explainable. Our code is available at https://github.com/DreamInvoker/SIRE.


Introduction
Relation Extraction (RE) is an important way of obtaining knowledge facts from natural language text. Many recent advancements (Sahu et al., 2019; Yao et al., 2019b; Nan et al., 2020) manage to tackle document-level relation extraction (doc-level RE), which extracts semantic relations among entities across multiple sentences. Due to its strong correlation with real-world scenarios, doc-level RE has attracted much attention in the field of information extraction.
The doc-level RE task is usually formulated as a classification problem that predicts possible relations for all entity pairs, using the information from the entire document. It involves two kinds of relations: intra-sentential relations and inter-sentential relations. We show examples of these two kinds of relations in Figure 1. When two entities have mentions that co-occur in the same sentence, they may express intra-sentential relations. Otherwise, they may express inter-sentential relations.

Figure 1: Two examples from DocRED (Yao et al., 2019b) for illustration of intra- and inter-sentential relations. Sentence numbers, entity mentions, and supporting evidence involved in these relation instances are colored. Other mentions are underlined for clarity.
Previous methods do not explicitly distinguish these two kinds of relations in the design of the model and use the same method to represent them. However, from the perspective of linguistics, intra-sentential relations and inter-sentential relations are expressed in different patterns. For two intra-sentential entities, their relations are usually expressed through local patterns within their co-occurred sentences. As shown in the first example in Figure 1, (Polar Music, country of origin, Swedish) and (Wembley Arena, located in, London) can be inferred based solely on the sentences they reside in, i.e., sentences 1 and 6 respectively. Unlike intra-sentential relations, inter-sentential relations tend to be expressed through global interactions across multiple related sentences, also called supporting evidence. Moreover, cross-sentence relations usually require complex reasoning skills, e.g., logical reasoning. As shown in the second example in Figure 1, (São Paulo, continent, South America) can be inferred from the other two relation facts expressed in the document: (São Paulo, country, Brazil) and (Brazil, continent, South America). The different patterns of intra- and inter-sentential relations suggest that a model should treat them differently. However, previous works usually use the information from the whole document to represent all relations, e.g., all 13 sentences for predicting (Polar Music, country of origin, Swedish) in the first example in Figure 1. We argue that this brings in noise from unrelated sentences that misguides the learning of relational patterns.
Besides, previous methods treat logical reasoning as a representation learning problem. They construct a document graph from the input document using entities as nodes, and the paths between two entities on the graph, usually passing through other entities, can be regarded as clues for logical reasoning. However, since not all entity pairs can be connected with a path and have the correct logical reasoning paths available on the graph, many cases of logical reasoning cannot be covered. Their methods are therefore somewhat limited, and a new form of logical reasoning is needed to better model and cover all possible reasoning chains.
In this paper, we propose a novel architecture called Separate Intra- and inter-sentential REasoning (SIRE) for doc-level RE. Unlike previous works on this task, we introduce two different methods to represent intra- and inter-sentential relations respectively. For an intra-sentential relation, we utilize a sentence-level encoder to represent it in every co-occurred sentence. Then we obtain the final representation by aggregating the relational representations from all co-occurred sentences. This encourages intra-sentential entity pairs to focus on the local patterns in their co-occurred sentences. For an inter-sentential relation, we utilize a document-level encoder and a mention-level graph from prior work to capture the document information and the interactions among entity mentions, document, and local context. Then, we apply an evidence selector to encourage inter-sentential entity pairs to selectively focus on the sentences that may signal their cross-sentence relations, i.e., to find supporting evidence. Finally, we develop a new form of logical reasoning module in which one relation instance can be modeled by attentively fusing the representations of other relation instances in all possible logical chains. This form of logical reasoning can cover all possible cases of logical reasoning in the document.
Our contributions can be summarized as follows:

• We propose an effective architecture called SIRE that utilizes two different methods to represent intra-sentential and inter-sentential relations for doc-level RE.
• We come up with a new and straightforward form of logical reasoning module to cover all cases of logical reasoning chains.
We evaluate SIRE on three public doc-level RE datasets. Experiments show SIRE outperforms the previous state-of-the-art models. Further analysis shows SIRE produces more reliable and explainable predictions, which further proves the significance of the separate encoding.

Intra- and Inter-sentential Relation Representation Module
As discussed in Sec. 1, for two intra-sentential entities, their relations are usually determined by the local patterns in their co-occurred sentences, while for two inter-sentential entities, their relations are usually expressed across multiple related sentences that can be regarded as the supporting evidence for their relations. So in this module, we utilize two different methods to represent intra-sentential and inter-sentential relations separately.

Figure: Our model uses different methods to represent intra- and inter-sentential relations and the self-attention mechanism to model the logical reasoning process. We use the logical reasoning chain e_A → e_B → e_C for illustration.
Our methods encourage intra-sentential entity pairs to focus on their co-occurred sentences as much as possible and encourage inter-sentential entity pairs to selectively focus on the sentences that may express their cross-sentence relations. We use three parts to represent the relation between two entities: head entity representation, tail entity representation and context representation.

Intra-sentential Relation Representation Module
Encoding. We use a sentence-level encoder to capture the context information for intra-sentential relations and produce a contextualized word embedding for each word. Formally, the i-th sentence S_i contains n_i words. For each word w in S_i, we first concatenate its word embedding with an entity type embedding and a co-reference embedding:¹

x_w = [E_w(w); E_t(t); E_c(c)]

where E_w(·), E_t(·) and E_c(·) denote the word embedding layer, entity type embedding layer and co-reference embedding layer, respectively, and t and c are the named entity type and entity id. Then the vectorized word representations are fed into the sentence-level encoder to obtain a sentence-level context-sensitive representation for each word:

h_w = f^S_enc(x_w)

where f^S_enc denotes the sentence-level encoder, which can be any sequential encoder. We also obtain the sentence representation s^{S_i} for sentence S_i from this encoder: for LSTM, s^{S_i} is the hidden state of the last time step; for BERT, s^{S_i} is the output representation of the special marker [CLS].

¹ The existing doc-level RE datasets annotate which mentions belong to the same entity. So each word in the document may belong to the i-th entity or to no entity. We embed this co-reference information between entity mention (surface words) and entity (an abstract concept) into the initialized representation of a word.

Representing. For the i-th entity pair (e_{i,h}, e_{i,t}) that expresses intra-sentential relations, where e_{i,h} is the head entity and e_{i,t} is the tail entity, their mentions co-occur in C sentences S_co-occur = {S_{i_1}, S_{i_2}, . . . , S_{i_C}} once or many times. In the j-th co-occurred sentence S_{i_j}, we use the entity mentions in S_{i_j} to represent the head and tail entities, and we define the context representation of this relation instance in S_{i_j} as the top K words correlated with the relations of these two mentions.
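As a concrete illustration, the embedding step above (concatenating word, entity-type, and co-reference embeddings per word) can be sketched in NumPy. All table sizes and ids below are toy values, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: vocabulary, entity-type inventory, max entities per document.
V, T, C = 100, 7, 10          # vocab size, number of entity types, number of entity ids
d_w, d_t, d_c = 8, 3, 3       # embedding dimensions (toy values)

E_w = rng.normal(size=(V, d_w))   # word embedding table E_w
E_t = rng.normal(size=(T, d_t))   # entity-type embedding table E_t
E_c = rng.normal(size=(C, d_c))   # co-reference (entity-id) embedding table E_c

def embed(word_ids, type_ids, coref_ids):
    """Concatenate word, entity-type and co-reference embeddings per word."""
    return np.concatenate([E_w[word_ids], E_t[type_ids], E_c[coref_ids]], axis=-1)

# A 3-word toy sentence: word ids, entity-type ids, entity ids (0 = non-entity here).
x = embed(np.array([4, 17, 2]), np.array([0, 3, 0]), np.array([0, 5, 0]))
```

The resulting vectors x would then be fed to the sentence-level (or document-level) encoder.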
Specifically, a head entity mention ranging from the s-th to the t-th word is represented as the average of the words it contains:

e^{S_{i_j}}_{i,h} = (1 / (t − s + 1)) Σ_{k=s}^{t} h_k

and likewise for the tail entity mention e^{S_{i_j}}_{i,t}. Then, we concatenate the representations of the head and tail entity mentions and use the result as a query to attend to all words in S_{i_j}, computing a relatedness score for each word:

a_k = σ( h_k^T W_intra [e^{S_{i_j}}_{i,h}; e^{S_{i_j}}_{i,t}] )

where [·; ·] is a concatenation operation, W_intra ∈ R^{d×2d} is a parameter matrix, and σ is an activation function (e.g., ReLU). Then, we average the representations of the top K related words to represent the context information c_i for the intra-sentential entity pair (e_{i,h}, e_{i,t}) in S_{i_j}. To keep W_intra trainable during gradient computation, we also add a term that is the weighted average representation of all words:

c_i = β · (1/K) Σ_{k ∈ topK} h_k + (1 − β) Σ_k ā_k h_k

where ā_k is the normalized relatedness score and β is a hyperparameter. We use β = 0.9 here to force the model to focus on the top K words while still considering the subtle influence of the other words.
Next, we concatenate the three parts obtained above to form the relational representation of the intra-sentential entity pair (e_{i,h}, e_{i,t}) in S_{i_j}, and further average the representations over all co-occurred sentences S_co-occur to get the final relation representation r_i for the intra-sentential entity pair (e_{i,h}, e_{i,t}):

r_i = (1/C) Σ_{j=1}^{C} [e^{S_{i_j}}_{i,h}; e^{S_{i_j}}_{i,t}; c^{S_{i_j}}_i]

This way, we force intra-sentential entity pairs to focus on the semantic information from their co-occurred sentences and ignore the noise from other sentences.
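A minimal NumPy sketch of the intra-sentential context step for one co-occurred sentence: score words against the entity-pair query, pick the top K, and mix in the β-weighted average of all words. The scoring form and every dimension here are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, K, beta = 4, 6, 2, 0.9          # hidden size, sentence length, top-K, mixing weight

H = rng.normal(size=(n, d))           # encoder outputs for the words of one sentence
e_h, e_t = H[1], H[4]                 # toy mention representations (single-word spans)
W_intra = rng.normal(size=(d, 2 * d)) # trainable scoring matrix (assumed shape)

q = np.concatenate([e_h, e_t])                     # entity-pair query [e_h; e_t]
scores = np.maximum(H @ (W_intra @ q), 0.0)        # ReLU relatedness score per word
weights = np.exp(scores) / np.exp(scores).sum()    # normalized scores over all words

top_k = np.argsort(scores)[-K:]                    # indices of the top-K related words
c = beta * H[top_k].mean(axis=0) + (1 - beta) * (weights[:, None] * H).sum(axis=0)

r = np.concatenate([e_h, e_t, c])     # relational representation for this sentence (3d)
```

In the full model, r would additionally be averaged over every co-occurred sentence of the pair.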

Inter-sentential Relation Representation Module

Encoding. Given the nature of inter-sentential relations, we use a document-level encoder to capture the global interactions for inter-sentential relations and produce a contextualized word embedding for each word. Formally, a document D contains m words. As with the embedding for intra-sentential relations, we use Equation 1 to embed each word in the document. Then the vectorized word representations are fed into the document-level encoder to obtain a document-level context-sensitive representation for each word:

h_w = f^D_enc(x_w)

where f^D_enc denotes the document-level encoder. We also obtain the document representation d^D from this encoder.
To further enhance the document-level interactions, we utilize the mention-level graph (MG) proposed in prior work. The MG contains two different node types: mention nodes and a document node. Each mention node denotes one particular mention of an entity, and the single document node aims to model the document information. We argue that this graph only contains nodes directly concerned with prediction, i.e., the mentions of the entities and the document information. However, it does not contain the local context information, which is crucial for the interaction among entity mentions and the document. So we introduce a new type of node, the sentence node, and its corresponding new edges, to infuse the local context information into MG.
So there are four types of edges in MG:

Intra-Entity Edge: Mentions referring to the same entity are fully connected. This models the interactions among mentions of the same entity.

Inter-Entity Edge: Mentions co-occurring in the same sentence are fully connected. This models the interactions among different entities via co-occurrences of their mentions.

Sentence-Mention Edge: Each sentence node connects with all entity mentions it contains. This models the interactions between mentions and their local context information.

Sentence-Document Edge: All sentence nodes are connected to the document node. This models the interactions between local context information and document information, acting as a bridge between mentions and document.
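The four edge types can be derived directly from the mention annotations. Here is a small sketch on a toy document; the mention ids, entity ids, and sentence ids are hypothetical:

```python
from itertools import combinations

# Toy annotation: mention_id -> (entity_id, sentence_id).
mentions = {0: (0, 0), 1: (0, 2), 2: (1, 0), 3: (2, 2)}
num_sentences = 3

intra_entity, inter_entity = [], []
for a, b in combinations(mentions, 2):
    if mentions[a][0] == mentions[b][0]:
        intra_entity.append((a, b))        # mentions of the same entity
    elif mentions[a][1] == mentions[b][1]:
        inter_entity.append((a, b))        # different entities co-occurring in a sentence

# Each sentence node links to the mentions it contains;
# every sentence node links to the single document node.
sentence_mention = [(sent, m) for m, (_, sent) in mentions.items()]
sentence_document = [(s, "doc") for s in range(num_sentences)]
```

In practice these edge lists would be handed to a graph library such as DGL to build the typed graph.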
Next, we apply a Relational Graph Convolutional Network (R-GCN, Schlichtkrull et al., 2017) on MG to aggregate features from the neighbors of each node. Given node u at the l-th layer, the graph convolutional operation can be defined as:

h^{(l+1)}_u = σ( Σ_{t∈T} Σ_{v∈N^t_u} (1/c_{u,t}) W^{(l)}_t h^{(l)}_v + W^{(l)}_0 h^{(l)}_u )

where N^t_u denotes the set of neighbors of node u connected by the t-th edge type, and c_{u,t} = |N^t_u| is a normalization constant. We then aggregate the outputs of all R-GCN layers to form the final representation of node u:

h_u = [h^{(0)}_u; h^{(1)}_u; . . . ; h^{(L)}_u]

where h^{(0)}_u is the initial representation of node u. For a mention ranging from the s-th word to the t-th word in the document, its node is initialized with the average of the corresponding word representations from the document-level encoder; the i-th sentence node is initialized with s^{S_i} from the sentence-level encoder; and the document node is initialized with d^D from the document-level encoder.
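To make the per-edge-type aggregation concrete, here is a sketch of one R-GCN layer over a toy typed graph in NumPy. The graph, sizes, and weights are illustrative; a real implementation would use a library layer (e.g., DGL's RelGraphConv) rather than explicit loops:

```python
import numpy as np

rng = np.random.default_rng(2)
d, num_nodes = 4, 5

h = rng.normal(size=(num_nodes, d))                  # node states h^(l)
# neighbors[t][u] = neighbors of node u under edge type t (toy graph, two edge types)
neighbors = {0: {0: [1], 1: [0]}, 1: {2: [3, 4], 3: [2], 4: [2]}}
W = {t: rng.normal(size=(d, d)) for t in neighbors}  # per-edge-type weights W_t
W0 = rng.normal(size=(d, d))                         # self-loop weight W_0

def rgcn_layer(h):
    out = h @ W0                                     # self connection for every node
    for t, adj in neighbors.items():
        for u, nbrs in adj.items():
            c = len(nbrs)                            # normalization constant c_{u,t}
            out[u] += sum(h[v] @ W[t] for v in nbrs) / c
    return np.maximum(out, 0.0)                      # ReLU as the activation sigma

h1 = rgcn_layer(h)
final = np.concatenate([h, h1], axis=-1)             # concatenate layer outputs per node
```

The final per-node representation concatenates the initial state with the outputs of each layer, as described above.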
Representing. We argue that inter-sentential relations can be inferred from the following information sources: 1) the head and tail entities themselves; 2) the related sentences that signal their cross-sentence relations, namely the supporting evidence; 3) reasoning information such as logical reasoning, co-reference reasoning, world knowledge, etc. Here we only consider the first two sources and leave the last to Sec. 2.2.
Different from intra-sentential relations, inter-sentential relations tend to be expressed through global interactions. So for the i-th entity pair (e_{i,h}, e_{i,t}) that expresses an inter-sentential relation, the head entity representation e_{i,h} and the tail entity representation e_{i,t} are defined as the averages of their entity mention nodes from MG:

e_{i,h} = (1 / |M(e_{i,h})|) Σ_{m ∈ M(e_{i,h})} h_m

where M(e_i) is the mention set of e_i. We then apply an evidence selector with an attention mechanism (Bahdanau et al., 2015) to encourage the inter-sentential entity pair to selectively focus on the sentences that express their cross-sentence relations. This process can be regarded as finding supporting evidence for their relations. The context representation c_i for the inter-sentential entity pair (e_{i,h}, e_{i,t}) is therefore a weighted average of the sentence node representations from MG:

c_i = Σ_j a_{i,j} s_j

where a_{i,j} is the relatedness score between the entity-pair query [e_{i,h}; e_{i,t}] and the j-th sentence node, computed with a trainable parameter matrix W_k ∈ R^{1×2d} and a sigmoid function σ. Finally, the relation representation for the inter-sentential entity pair (e_{i,h}, e_{i,t}) is:

r_i = [e_{i,h}; e_{i,t}; c_i]
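The evidence selector can be sketched as scoring each sentence node against the entity-pair query and taking a sigmoid-weighted average. The exact scoring function and shapes below are assumptions for illustration, not the paper's precise formulation:

```python
import numpy as np

rng = np.random.default_rng(3)
d, num_sent = 4, 3

e_h, e_t = rng.normal(size=d), rng.normal(size=d)   # entity reps (mention averages from MG)
S = rng.normal(size=(num_sent, d))                  # sentence-node representations
W_k = rng.normal(size=(d, 2 * d))                   # selector parameters (assumed shape)

q = W_k @ np.concatenate([e_h, e_t])                # project the entity-pair query
a = 1.0 / (1.0 + np.exp(-(S @ q)))                  # sigmoid relatedness per sentence
c = (a[:, None] * S).sum(axis=0)                    # evidence-weighted context vector

r = np.concatenate([e_h, e_t, c])                   # inter-sentential relation rep (3d)
```

The scores a double as an interpretable ranking of candidate supporting-evidence sentences for the pair.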

Logical Reasoning Module
In this module, we focus on modeling logical reasoning. As mentioned in Sec. 1, previous works usually use the paths between each entity pair as clues for logical reasoning, concatenating the path representations with entity pair representations to predict relations. However, since not all entity pairs are connected with a path and have the correct logical reasoning paths in their graph, many cases of logical reasoning cannot be covered, so their methods are somewhat limited.
In this paper, we utilize the self-attention mechanism (Vaswani et al., 2017) to model logical reasoning. Specifically, we can obtain the relational representations for all entity pairs from the sections above. For the i-th entity pair (e_h, e_t), we can assume there is a two-hop logical reasoning chain e_h → e_k → e_t in the document, where e_k can be any entity in the document other than e_h and e_t. So (e_h, e_t) can attend to the relational representations of all other entity pairs, including (e_h, e_k) and (e_k, e_t), termed R_att. Finally, the weighted sum of R_att can be treated as a new relational representation for (e_h, e_t) that considers all possible two-hop logical reasoning chains in the document:

α_{ij} = softmax_j( r_i W_att r_j^T ),  r̂_i = Σ_{r_j ∈ R_att} α_{ij} r_j

where W_att ∈ R^{3d×3d} is a parameter matrix. In this way, the paths in previous works are converted into individual attention over every entity pair in the logical reasoning chains. We argue that this form of logical reasoning is simpler and more scalable because it considers all possible logical reasoning chains without the connectivity constraints of a graph structure.
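This attention over entity-pair representations can be sketched as one bilinear self-attention step. Sizes are toy values; masking a pair's attention to itself is our assumption about how "other entity pairs" is enforced:

```python
import numpy as np

rng = np.random.default_rng(4)
num_pairs, d3 = 6, 12                      # number of entity pairs, representation size 3d

R = rng.normal(size=(num_pairs, d3))       # relational representations r_i for all pairs
W_att = rng.normal(size=(d3, d3)) * 0.1    # bilinear attention parameters (3d x 3d)

logits = R @ W_att @ R.T                   # pairwise attention logits between pair reps
np.fill_diagonal(logits, -np.inf)          # a pair does not attend to itself
logits -= logits.max(axis=1, keepdims=True)               # numerical stability
alpha = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
R_reasoned = alpha @ R                     # fuse all possible two-hop chains per pair
```

Row i of alpha spreads attention over every other entity pair, so chains through any intermediate entity e_k are covered without requiring graph connectivity.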

Classification Module
We formulate the doc-level RE task as a multi-label classification task:

P(r | e_h, e_t) = sigmoid( W_2 σ( W_1 r_i + b_1 ) + b_2 )

where W_1, W_2, b_1, b_2 are trainable parameters and σ is an activation function (e.g., ReLU). We use binary cross entropy as the objective to train SIRE:

L = − Σ_{D ∈ C} Σ_{(e_h, e_t) ∈ D} Σ_{r ∈ R} [ I(r) log P(r | e_h, e_t) + (1 − I(r)) log(1 − P(r | e_h, e_t)) ]

where C denotes the whole corpus, R denotes the relation type set, and I(·) is the indicator function (1 if the entity pair holds relation r, and 0 otherwise).
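A minimal sketch of the classifier and the binary cross-entropy objective for one entity pair, with independent sigmoids per relation type (toy sizes and random weights, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
d3, hidden, num_rel = 12, 8, 5                 # 3d pair rep, hidden width, relation types

r = rng.normal(size=d3)                        # final representation of one entity pair
W1, b1 = rng.normal(size=(hidden, d3)) * 0.1, np.zeros(hidden)
W2, b2 = rng.normal(size=(num_rel, hidden)) * 0.1, np.zeros(num_rel)

z = np.maximum(W1 @ r + b1, 0.0)               # ReLU activation
probs = 1.0 / (1.0 + np.exp(-(W2 @ z + b2)))   # one independent sigmoid per relation

labels = np.array([1.0, 0.0, 0.0, 1.0, 0.0])   # multi-label target I(r)
eps = 1e-12                                    # numerical floor inside the logs
loss = -np.mean(labels * np.log(probs + eps) + (1 - labels) * np.log(1 - probs + eps))
```

Because each relation gets its own sigmoid, an entity pair can be assigned several relation types at once, matching the multi-label formulation.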

Dataset
We evaluate our proposed model on three document-level RE datasets: DocRED (Yao et al., 2019b), CDR, and GDA.

Experimental Settings
In our SIRE implementation, we use 3 layers of GCN, ReLU as the activation function, a dropout rate of 0.3, and a learning rate of 0.001. We train SIRE using AdamW (Loshchilov and Hutter, 2019) as the optimizer with weight decay 0.0001, and implement SIRE with the PyTorch (Paszke et al., 2017) and DGL (Wang et al., 2019b) frameworks. We implement two settings of SIRE. SIRE-GloVe uses GloVe (100d, Pennington et al., 2014) and a BiLSTM (512d, Schuster and Paliwal, 1997) as the word embedding and encoder, respectively. SIRE-BERT uses BERT-base (Devlin et al., 2019) as the encoder on DocRED and cased BioBERT-Base v1.1 as the encoder on CDR/GDA; the learning rate for BERT parameters is set to 1e-5 while the learning rate for other parameters remains 1e-3. Detailed hyperparameter settings are in the Appendix.

Baselines and Evaluation Metrics
We use the following models as our baselines. Yao et al. (2019b) propose BiLSTM (Schuster and Paliwal, 1997) as the encoder on DocRED and use the output from the encoder to represent all entity pairs for relation prediction. Wang et al. (2019a) propose BERT to replace the BiLSTM as the encoder on DocRED; they also propose BERT-Two-Step, which first predicts whether two entities have a relation and then predicts the specific target relation. Tang et al. (2020) propose the hierarchical inference networks HIN-GloVe and HIN-BERT, which make full use of multi-granularity inference information, including entity level, sentence level, and document level, to infer relations. Similar to Wang et al. (2019a), Ye et al. (2020) propose a language representation model called CorefBERT as the encoder on DocRED, which can capture coreferential relations in context. Nan et al. (2020) propose LSR-GloVe and LSR-BERT, which dynamically induce a latent dependency tree structure to better model document interactions for prediction. Another baseline is the global-to-local network GLRE, which encodes the document in terms of entity global and local representations as well as context relation representations. A further baseline is the graph aggregation-and-inference networks GAIN-GloVe and GAIN-BERT.

Table 1: Performance on DocRED. Models above the double line do not use a pre-trained model. LR Module is the logical reasoning module. context denotes the context representations in Eq. 6 and Eq. 14. inter4intra denotes using the inter-sentential module also for intra-sentential entity pairs.
Following previous works (Yao et al., 2019b), we use F1 and Ign F1 as the evaluation metrics for the overall performance of a model. The Ign F1 metric calculates F1 excluding the relation facts shared between the training and dev/test sets. We also use the intra-F1 and inter-F1 metrics to evaluate a model's performance on intra-sentential and inter-sentential relations on the dev set.
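One simple way to compute F1 and Ign F1 over sets of (head, tail, relation) triples is sketched below; the official DocRED evaluation script differs in details, so this is illustrative only, with made-up triples:

```python
def f1_scores(pred, gold, train_facts):
    """F1 over triple sets, plus an Ign F1 variant that drops
    triples already seen in the training set from both sides."""
    def prf(p, g):
        tp = len(p & g)
        prec = tp / len(p) if p else 0.0
        rec = tp / len(g) if g else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

    f1 = prf(pred, gold)
    ign_f1 = prf(pred - train_facts, gold - train_facts)
    return f1, ign_f1

gold = {("A", "B", "r1"), ("A", "C", "r2"), ("B", "C", "r3")}
pred = {("A", "B", "r1"), ("B", "C", "r3"), ("B", "D", "r1")}
seen = {("A", "B", "r1")}          # fact already present in the training set
f1, ign = f1_scores(pred, gold, seen)
```

Here the ordinary F1 credits the memorized fact ("A", "B", "r1") while Ign F1 does not, which is why Ign F1 is the stricter measure of generalization.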

Results
The performance of SIRE and the baseline models on the DocRED dataset is shown in Table 1. Among the models not using BERT encoding, SIRE outperforms the previous state-of-the-art model by 0.88/1.38 F1/Ign F1 on the test set. Among the models using BERT encoding, SIRE outperforms the previous state-of-the-art models by 1.18/0.81 F1/Ign F1 on the test set. The improvement on Ign F1 is larger than that on F1, which shows SIRE has a stronger generalization ability on unseen relation instances. On intra-F1 and inter-F1, we observe that SIRE is better than the previous models that indiscriminately represent intra- and inter-sentential relations in the same way. This demonstrates that representing intra- and inter-sentential relations with different methods is better than representing them in the same way. The improvement on intra-F1 is greater than the improvement on inter-F1, which shows that SIRE mainly improves the performance on intra-sentential relations. The performance of SIRE and the baseline models on the CDR/GDA datasets is shown in Table 2, consistent with the improvement on DocRED.

Ablation Study
To further analyze SIRE, we conduct ablation studies to illustrate the effectiveness of the different modules in SIRE. We show the results in Table 1. 1) The importance of the logical reasoning module: When we discard the logical reasoning module, the performance of SIRE-GloVe decreases by 0.41 F1 on the DocRED test set. This shows the effectiveness of our logical reasoning module, which can better model the reasoning information in the document. Moreover, the score drops significantly on inter-F1 and drops fewer points on intra-F1, which shows our logical reasoning module mainly improves the performance on inter-sentential relations, which usually require reasoning skills. 2) Ablation on the context representations in Eq. 6 and Eq. 14: When we remove the context representations from the intra- and inter-sentential relational representations, the performance of SIRE-GloVe on the DocRED test set drops by 1.81 F1. This shows that context information (top K words for intra, evidence sentences for inter) is important for both intra- and inter-sentential relation representation.

3) Using the inter-sentential module also for intra-sentential entity pairs: In this experiment, we do not distinguish the two types of relations, using the inter-sentential encoding method for all entity pairs while keeping the logical reasoning module unchanged. The performance of SIRE-GloVe drops by 2.66/2.13 F1/intra-F1 on the DocRED test set. This confirms our motivation that global information cannot be used to learn the local patterns of intra-sentential relations.

Reasoning Performance
Furthermore, we evaluate the reasoning ability of our model on the development set in Table 3. We use infer-F1 as the metric, which considers only two-hop positive relation instances in the dev set; it naturally excludes the many cases that do not belong to a two-hop logical reasoning process, strengthening the evaluation of reasoning performance. As Table 3 shows, SIRE is superior to previous models in handling the two-hop logical reasoning process. Moreover, after removing the logical reasoning module, our SIRE drops significantly on infer-F1. This shows that our logical reasoning module plays a crucial role in modeling the logical reasoning process.

Figure 3 shows prediction cases of SIRE. For intra-sentential relations, the top 4 words related to the relations of three entity pairs conform with our intuition: our model correctly finds the words, via Eq. 5, that trigger the relations of these entity pairs. For inter-sentential relations, the supporting evidence that the model finds, i.e., sentences 1 and 2, indeed expresses the relation between São Paulo and South America. We also conduct logical reasoning in terms of the logical reasoning chain São Paulo → other-entity → South America. SIRE focuses on the correct logical reasoning chain: São Paulo → Brazil → South America. These cases show that the predictions of SIRE are explainable.
Related Work

Most existing doc-level RE methods use graph-based models, such as Graph Convolutional Networks (GCNs, Schlichtkrull et al., 2017), which have been used in many natural language processing tasks (Marcheggiani and Titov, 2017; Yao et al., 2019a). They construct a graph structure from the input document, using words, mentions, or entities as nodes and heuristic rules and semantic dependencies as edges. They use this graph to model document information and interactions and to predict possible relations for all entity pairs. Nan et al. (2020) proposed latent structure induction to dynamically induce the dependency tree of the document. Another work proposed a double-graph aggregation-and-inference network that constructs two graphs, a mention-level graph and an entity-level graph, using the former to capture the document information and the interactions among entity mentions and the document, and the latter to conduct path-based logical reasoning. However, these works do not explicitly distinguish intra- and inter-sentential relation instances in the design of the model and encode them in the same way. The most significant difference between our model and previous models is therefore that we treat intra-sentential and inter-sentential relations differently, conforming with the relational patterns needed for their prediction.
Reasoning in relation extraction. The reasoning problem has been extensively studied in the field of question answering (Dhingra et al., 2018). However, few works manage to tackle this problem in the document-level relation extraction task. One prior work was the first to propose an explicit form of relational reasoning for doc-level RE, mainly focusing on logical reasoning and using the paths on its entity-level graph to provide clues for logical reasoning. However, since not all entity pairs are connected with a path and have the correct logical reasoning paths in the graph, such methods are somewhat limited. In this work, we design a new form of logical reasoning to cover more cases of logical reasoning.

Conclusion
Intra- and inter-sentential relations are the two types of relations in doc-level RE. In this work, we propose a novel architecture, SIRE, to represent these two kinds of relations in different ways. We introduce a new form of logical reasoning module that models logical reasoning as self-attention among the representations of all entity pairs. Experiments show that SIRE outperforms the previous state-of-the-art methods, and detailed analysis demonstrates that its predictions are explainable. We hope this work will have a positive effect on future research toward new encoding schemas and more generalizable and explainable models.