Discriminative Reasoning for Document-level Relation Extraction

Document-level relation extraction (DocRE) models generally use graph networks to implicitly model the reasoning skills (e.g., pattern recognition, logical reasoning, and coreference reasoning) related to the relation between one entity pair in a document. In this paper, we propose a novel discriminative reasoning framework to explicitly model the paths of these reasoning skills between each entity pair in the document. A discriminative reasoning network is then designed to estimate the relation probability distributions of the different reasoning paths based on the constructed graph and the vectorized document contexts for each entity pair, thereby recognizing their relation. Experimental results show that our method outperforms the previous state-of-the-art performance on the large-scale DocRE dataset. The code is publicly available at https://github.com/xwjim/DRN.


Introduction
Document-level relation extraction (DocRE) aims to extract relations among entities within a document, which requires multiple reasoning skills (i.e., pattern recognition, logical reasoning, coreference reasoning, and common-sense reasoning) (Yao et al., 2019). Generally, the input document is constructed as a structural graph based on syntactic trees, coreference, or heuristics to represent the relation information between all entity pairs (Nan et al., 2020; Zeng et al., 2020; Xu et al., 2021). Graph neural networks are then applied to the constructed structural graph to model these reasoning skills. After performing multi-hop graph convolution, the feature representations of two entities are concatenated and fed to a classifier to recognize their relation, achieving state-of-the-art performance on the DocRE task (Zeng et al., 2020; Xu et al., 2021). However, it remains unclear whether modeling these reasoning skills implicitly is competitive with explicitly modeling the intuitive reasoning skills between an entity pair in the document. Figure 1 shows four kinds of reasoning skills for entity pairs in the DocRE dataset (Yao et al., 2019).
First, take the two entity pairs {"Me Musical Nephews", "1942"} and {"William", "Adelaide"} as examples: intra-sentence reasoning concerns the mentions inside a single sentence, for example, "Me Musical Nephews" and "1942" for pattern recognition, and "William" and "Adelaide" for common-sense reasoning. Also, logical reasoning for the entity pair {"U.S.", "Baltimore"} requires the reasoning path "U.S."→"Maryland" (bridge entity)→"Baltimore", while coreference reasoning for the entity pair {"Dwight Tillery", "University of Michigan Law School"} attends to the reasoning path "Dwight Tillery"→"He" (reference word)→"University of Michigan Law School".
However, advanced DocRE models generally use universal multi-hop graph convolution networks to model these reasoning skills implicitly and do not consider the above intuitive reasoning skills explicitly, which may hinder further improvement of DocRE.
To this end, we propose a novel discriminative reasoning framework to explicitly model the reasoning process of these reasoning skills, namely intra-sentence reasoning (covering pattern recognition and common-sense reasoning), logical reasoning, and coreference reasoning. Specifically, inspired by Xu et al.'s meta-path strategy, we extract the reasoning paths of the three reasoning skills discriminatively from the input document. A discriminative reasoning network is then designed to estimate the relation probability distributions of the different reasoning paths based on the constructed graph and the vectorized document contexts for each entity pair, thereby recognizing their relation.
In particular, each candidate relation between one entity pair is scored under multiple reasoning skills, to ensure that all potential reasoning skills can be considered in the inference. In summary, our main contributions are as follows:
• We propose a discriminative reasoning framework to model the reasoning skills between two entities in a document. To the best of our knowledge, this is the first work to model different reasoning skills explicitly for enhancing DocRE.
• Also, we introduce a discriminative reasoning network to encode the reasoning paths based on the constructed heterogeneous graph and the vectorized original document, thereby recognizing the relation between two entities by the classifier.
• Experimental results on the large-scale DocRE dataset show the effectiveness of the proposed method, which in particular outperforms the recent state-of-the-art DocRE models.

Discriminative Reasoning Framework
In this section, we propose a novel discriminative reasoning framework to model different reasoning skills explicitly to recognize the relation between each entity pair in the input document. The discriminative reasoning framework contains three parts: definition of reasoning paths, modeling reasoning discriminatively, and multi-reasoning based relation classification.

Definition of Reasoning Path
Formally, given one unstructured document comprised of $N$ sentences $D = \{s_1, s_2, \cdots, s_N\}$, each sentence is a sequence of words $s_n = \{s_n^1, s_n^2, \cdots, s_n^{J_n}\}$ with length $J_n = |s_n|$. The annotations include concept-level entities $\varepsilon = \{e_i\}_{i=1}^{P}$ together with the multiple occurrences of each entity under the same phrase or alias, $e_i = \{m_i^{s_k}\}_{k=1}^{Q}$ (where $m_i^{s_k}$ denotes a mention of $e_i$ occurring in sentence $s_k$), and their entity types (e.g., locations, organizations, and persons). DocRE aims to extract the relation between two entities in $\varepsilon$, namely $P(r|e_i, e_j, D)$. To simplify the reasoning skills, we first merge pattern recognition and common-sense reasoning into intra-sentence reasoning, because both generally perform reasoning inside a single sentence. Consequently, the original four kinds of reasoning skills (Yao et al., 2019) are refined into three: intra-sentence reasoning, logical reasoning, and coreference reasoning. Inspired by Xu et al.'s work, we also use the meta-path strategy to extract a reasoning path for each reasoning skill, thereby representing the above three reasoning skills explicitly. Specifically, the meta-paths for the different reasoning skills are defined as follows:

1) Intra-sentence reasoning path: It is formally denoted as $P_{ij}^{I} = m_i^{s_1} \circ s_1 \circ m_j^{s_1}$ for one entity pair $\{e_i, e_j\}$ inside the same sentence $s_1$ of the input document $D$, where $m_i^{s_1}$ and $m_j^{s_1}$ are the mentions of the two entities in $s_1$, and "$\circ$" denotes one reasoning step on the reasoning path from $e_i$ to $e_j$.
2) Logical reasoning path: The relation between one entity pair $\{e_i, e_j\}$ occurring in sentences $s_1$ and $s_2$ is indirectly established by a bridge entity $e_l$ that co-occurs with both of them. The reasoning path can be formally written as $P_{ij}^{L} = m_i^{s_1} \circ s_1 \circ m_l^{s_1} \circ m_l^{s_2} \circ s_2 \circ m_j^{s_2}$.

3) Coreference reasoning path: A reference word refers to one of the two entities $e_i$ and $e_j$ and occurs in the same sentence as the other entity. We simplify this condition and assume that there is a coreference reasoning path whenever the two entities occur in different sentences. The reasoning path can be formally written as $P_{ij}^{C} = m_i^{s_1} \circ s_1 \circ s_2 \circ m_j^{s_2}$.

Note that there are no entity nodes in the defined reasoning paths, in contrast to the meta-paths defined in Xu et al.'s work. This difference is mainly due to the following considerations: i) the reasoning path pays more attention to the mentions and the referred sentences; ii) entities are generally covered by their mentions; iii) it makes the modeling of path reasoning simpler.
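To make the three meta-paths concrete, the following minimal Python sketch enumerates them from mention annotations under the simplified coreference condition above; the data layout and the helper name `extract_paths` are our own illustration, not part of the released DRN code.

```python
from collections import defaultdict
from itertools import product

def extract_paths(mentions, e_i, e_j):
    """Enumerate intra-sentence (I), logical (L), and coreference (C)
    reasoning paths between entities e_i and e_j.
    `mentions` is a list of (entity_id, sentence_id) pairs."""
    sents = defaultdict(set)                 # entity -> sentences it occurs in
    for ent, sent in mentions:
        sents[ent].add(sent)

    paths = {"I": [], "L": [], "C": []}
    for s1, s2 in product(sents[e_i], sents[e_j]):
        if s1 == s2:
            # P^I_ij = m_i^{s1} o s1 o m_j^{s1}
            paths["I"].append((e_i, s1, e_j))
        else:
            # P^C_ij = m_i^{s1} o s1 o s2 o m_j^{s2} (different sentences)
            paths["C"].append((e_i, s1, s2, e_j))
            # P^L_ij: a bridge entity e_l co-occurring with e_i and e_j
            for e_l, occ in sents.items():
                if e_l not in (e_i, e_j) and s1 in occ and s2 in occ:
                    paths["L"].append((e_i, s1, e_l, s2, e_j))
    return paths
```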

Modeling Reasoning Discriminatively
Based on the defined reasoning paths, we decompose the DocRE problem into three reasoning sub-tasks: intra-sentence reasoning (IR), logical reasoning (LR), and coreference reasoning (CR). Next, we introduce the modeling of the three sub-tasks in detail.

Modeling Intra-Sentence Reasoning. Given one entity pair $\{e_i, e_j\}$ and its reasoning path $P_{ij}^{I}$ in sentence $s_1$, intra-sentence reasoning is modeled to recognize the relation between this entity pair as follows:

$R_{P^I}(r) = P(r|e_i, e_j, P_{ij}^{I}, D)$.     (1)

Modeling Logical Reasoning. Given one entity pair $\{e_i, e_j\}$ and its reasoning path $P_{ij}^{L}$, logical reasoning is modeled to recognize the relation between this entity pair as follows:

$R_{P^L}(r) = P(r|e_i, e_j, P_{ij}^{L}, D)$.     (2)
Since $e_l$ co-occurs with $e_i$ and $e_j$ respectively, logical reasoning can be further formalized as:

$R_{P^L}(r) = P(r|e_i, e_j, P_{il}^{I} \circ P_{lj}^{I}, D)$,     (3)

where "$\circ$" denotes the connection of the two paths.

Modeling Coreference Reasoning. Similarly, given one entity pair $\{e_i, e_j\}$ and its reasoning path $P_{ij}^{C}$, coreference reasoning is modeled to recognize the relation between this entity pair as follows:

$R_{P^C}(r) = P(r|e_i, e_j, P_{ij}^{C}, D)$.     (4)

Multi-reasoning Based Relation Classification
In the DocRE task, one entity pair usually involves multiple relationships which rely on different reasoning types. Thus, the relation between one entity pair may be reasoned by multiple types of reasoning rather than a single one. Based on the proposed three reasoning sub-tasks, relation reasoning between one entity pair is regarded as a multi-reasoning classification problem. Formally, we select the reasoning type with the maximum probability to recognize the relation between each entity pair:

$P(r|e_i, e_j, D) = \max\{R_{P^I}(r), R_{P^L}(r), R_{P^C}(r)\}$.     (5)

In addition, there are often multiple reasoning paths between two entities for one reasoning type. Thus, the classification probability in Eq. (5) can be rewritten as:

$P(r|e_i, e_j, D) = \max\{R_{P_1^I}(r), \cdots, R_{P_K^I}(r), R_{P_1^L}(r), \cdots, R_{P_K^L}(r), R_{P_1^C}(r), \cdots, R_{P_K^C}(r)\}$,     (6)

where $K$ is the number of reasoning paths for one reasoning skill, set to the same value for each reasoning skill for simplicity. Note that every entity pair has at least one reasoning path from one of the three defined reasoning sub-tasks. When the number of reasoning paths is greater than $K$ for one reasoning sub-task, we choose the first $K$ reasoning paths; otherwise, we use all the actual reasoning paths.
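As a rough illustration of Eqs. (5)-(6), the sketch below pools relation scores by taking the maximum first over at most $K$ paths per reasoning type and then over the three reasoning types; the function name and tensor layout are assumptions for exposition, not the released implementation.

```python
import torch

def multi_reasoning_scores(path_scores, K=3):
    """path_scores maps a task in {"I", "L", "C"} to a tensor of shape
    (num_paths, num_relations) holding per-path relation scores.
    Returns a (num_relations,) tensor of pooled scores."""
    pooled = []
    for task in ("I", "L", "C"):
        scores = path_scores.get(task)
        if scores is None or scores.numel() == 0:
            continue  # every pair has at least one path in some sub-task
        pooled.append(scores[:K].max(dim=0).values)  # max over <= K paths
    # Eq. (5): max over the reasoning types that actually have paths.
    return torch.stack(pooled).max(dim=0).values
```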

Discriminative Reasoning Network
In this section, we design a discriminative reasoning network (DRN) to model the three defined reasoning sub-tasks for recognizing the relation between two entities in a document. Following Zeng et al.'s and Zhou et al.'s work, we use two kinds of context representations (the heterogeneous graph context representation and the document-level context representation) to model the different reasoning paths discriminatively in Eqs. (1)-(4).

Heterogeneous Graph Context Representation
Formally, the embedding of each word $w_e$ is concatenated with the embedding of its entity type $w_t$ and the embedding of its coreference $w_c$ as the word representation $b = [w_e : w_t : w_c]$. These sequences of word representations are in turn fed into a bidirectional long short-term memory (BiLSTM) to vectorize the input document $D$ as hidden states $H = (h_1^1, \cdots, h_{J_N}^N)$, where $h_i^j$ denotes the hidden representation of the $i$-th word of the $j$-th sentence in the document.

[Figure 2: The overall architecture of DRN. First, a context encoder consumes the input document (e.g., "[0] The Eminem Show is the fourth studio album by American rapper Eminem, released on May 26, 2002 ...") to get a contextualized representation of each word. Then the heterogeneous graph context representation and the document-level context representation are prepared as the input of the discriminative reasoning framework. The intra-sentence reasoning (IR) task, the logical reasoning (LR) task, and the coreference reasoning (CR) task are modeled explicitly and compute their classification scores respectively. Finally, the maximal score is selected as the output.]

Similar to Zeng et al.'s work, we construct a heterogeneous graph which contains sentence nodes and mention nodes. There are four kinds of edges in the heterogeneous graph: sentence-sentence edges (all sentence nodes are connected), sentence-mention edges (between a sentence node and the mention nodes residing in that sentence), mention-mention edges (between mention nodes in the same sentence), and co-reference edges (between mention nodes that refer to the same entity). We then apply the graph-based DocRE method (Zeng et al., 2020) to encode the heterogeneous graph, based on which the heterogeneous graph context representation (HGCRep) is learned. The HGCRep $g_n$ of each mention node and sentence node is formally denoted as:

$g_n = [v_n : p_n^1 : p_n^2 : \cdots : p_n^{l-1}]$,     (7)

where $g_n \in \mathbb{R}^{d_1}$, ":" is the concatenation of vectors, each of $\{p_n^1, p_n^2, \cdots, p_n^{l-1}\}$ is learned by the multi-hop graph convolutional network (Zeng et al., 2020), and $v_n$ is the initial representation of the $n$-th node extracted from $D$. Finally, we obtain a heterogeneous graph representation $G = \{g_1, g_2, \cdots, g_N\}$ covering all mention nodes and sentence nodes.
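For illustration, a minimal sketch of how the four edge types could be assembled into adjacency matrices is given below; the node ordering, tensor layout, and helper name `build_hetero_adj` are hypothetical simplifications of the graph construction in Zeng et al. (2020), not the released implementation.

```python
import torch
from itertools import combinations

def build_hetero_adj(num_sents, mention_sent, mention_ent):
    """Build one adjacency slice per edge type. Sentence nodes come first,
    then mention nodes; mention_sent[m] is the sentence a mention resides
    in, mention_ent[m] the entity it refers to."""
    n = num_sents + len(mention_sent)
    adj = torch.zeros(4, n, n)
    for s1, s2 in combinations(range(num_sents), 2):
        adj[0, s1, s2] = adj[0, s2, s1] = 1          # sentence-sentence
    for m, s in enumerate(mention_sent):
        u = num_sents + m
        adj[1, u, s] = adj[1, s, u] = 1              # sentence-mention
    for m1, m2 in combinations(range(len(mention_sent)), 2):
        u, v = num_sents + m1, num_sents + m2
        if mention_sent[m1] == mention_sent[m2]:
            adj[2, u, v] = adj[2, v, u] = 1          # mention-mention
        if mention_ent[m1] == mention_ent[m2]:
            adj[3, u, v] = adj[3, v, u] = 1          # co-reference
    return adj
```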

Document-level Context Representation
In the DocRE task, these reasoning skills heavily rely on the original document context information rather than on the heterogeneous graph context information alone. Existing advanced DocRE models use syntactic trees or heuristic rules to extract the context information (i.e., entities, mentions, and sentences) that is directly related to the relation between entity pairs. However, this approach destroys the original document structure, which weakens the modeling of the reasoning between two entities for the DocRE task. Therefore, we use the self-attention mechanism (Vaswani et al., 2017) to learn a document-level context representation (DLCRep) $c_n$ for one mention based on the vectorized input document $D$:

$c_n = \mathrm{softmax}\left(\frac{h_n \mathbf{K}^{\top}}{\sqrt{d_2}}\right) \mathbf{V}$,     (8)

where $c_n \in \mathbb{R}^{d_2}$ and $\{\mathbf{K}, \mathbf{V}\}$ are the key and value matrices transformed from the vectorized input document $D$ using a linear layer. Here, inspired by relation learning (Baldini Soares et al., 2019), we use the hidden state of the head word of a mention or a sentence to represent it for simplicity.
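A minimal single-head sketch of Eq. (8) is given below, assuming the head-word hidden state $h_n$ serves as the query; the module name `DLCRep` and the single-head form are our own simplification for exposition.

```python
import math
import torch
import torch.nn as nn

class DLCRep(nn.Module):
    """Single-head attention of a head-word state over the whole document."""
    def __init__(self, d_model, d2):
        super().__init__()
        self.query = nn.Linear(d_model, d2)
        self.key = nn.Linear(d_model, d2)
        self.value = nn.Linear(d_model, d2)

    def forward(self, h_n, H):
        # h_n: (d_model,) head-word hidden state; H: (doc_len, d_model)
        q = self.query(h_n)                           # (d2,)
        K, V = self.key(H), self.value(H)             # (doc_len, d2)
        attn = torch.softmax(K @ q / math.sqrt(K.size(-1)), dim=0)
        return attn @ V                               # c_n: (d2,)
```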

Modeling of Reasoning Paths
In this section, we use the concatenation operation to model each reasoning step on a reasoning path, thereby turning the reasoning paths defined in Section 2.1 into the corresponding reasoning representations as follows (see the sketch after this list):

1) For the intra-sentence reasoning path, both the HGCReps and the DLCReps of the two mentions are concatenated in turn as the reasoning representation:

$\alpha_{ij} = [g_{m_i^{s_1}} : c_{m_i^{s_1}} : g_{m_j^{s_1}} : c_{m_j^{s_1}}]$,     (9)

where $\alpha_{ij} \in \mathbb{R}^{2d_1+2d_2}$ and ":" is the concatenation of vectors.
2) For the logical reasoning path, the HGCReps of the mentions $m_i^{s_1}$ and $m_j^{s_2}$ and the DLCReps of the two mention pairs $(m_i^{s_1}, m_l^{s_1})$ and $(m_j^{s_2}, m_l^{s_2})$ are concatenated as the reasoning representation:

$\beta_{ij} = [g_{m_i^{s_1}} : c_{il} : c_{lj} : g_{m_j^{s_2}}]$,     (10)

where $\beta_{ij} \in \mathbb{R}^{2d_1+2d_2}$, and $c_{il}$ and $c_{lj}$ denote the DLCReps of the two mention pairs.
3) For the coreference reasoning path, the HGCReps of the two mentions and the DLCReps of the two sentences are concatenated in turn as the reasoning representation:

$\gamma_{ij} = [g_{m_i^{s_1}} : c_{s_1} : c_{s_2} : g_{m_j^{s_2}}]$,     (11)

where $\gamma_{ij} \in \mathbb{R}^{2d_1+2d_2}$, and $c_{s_1}$ and $c_{s_2}$ denote the DLCReps of the two sentences $s_1$ and $s_2$.
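The concatenations in Eqs. (9)-(11) reduce to a few lines of code. The sketch below assumes the HGCReps ($g$, dimension $d_1$) and DLCReps ($c$, dimension $d_2$) have already been gathered for the relevant mentions, mention pairs, and sentences; the function names are illustrative.

```python
import torch

def intra_repr(g_mi, c_mi, g_mj, c_mj):
    # Eq. (9): HGCReps and DLCReps of the two mentions, size 2*d1 + 2*d2.
    return torch.cat([g_mi, c_mi, g_mj, c_mj])        # alpha_ij

def logical_repr(g_mi, c_il, c_lj, g_mj):
    # Eq. (10): c_il / c_lj are DLCReps of the mention pairs that share
    # the bridge entity e_l.
    return torch.cat([g_mi, c_il, c_lj, g_mj])        # beta_ij

def coref_repr(g_mi, c_s1, c_s2, g_mj):
    # Eq. (11): c_s1 / c_s2 are DLCReps of the two sentences.
    return torch.cat([g_mi, c_s1, c_s2, g_mj])        # gamma_ij
```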
The learned reasoning representations $\alpha_{ij}$, $\beta_{ij}$, and $\gamma_{ij}$ are then fed into a classifier to compute the probabilities of the relation between entities $e_i$ and $e_j$ by a multilayer perceptron (MLP), respectively:

$P(r|e_i, e_j, D) = \max\{\mathrm{MLP}(\alpha_{ij}), \mathrm{MLP}(\beta_{ij}), \mathrm{MLP}(\gamma_{ij})\}$.     (12)

Similarly, when there are multiple reasoning paths between two entities for one reasoning type as in Eq. (6), Eq. (12) is rewritten as:

$P(r|e_i, e_j, D) = \max\{\mathrm{MLP}(\alpha_{ij}^1), \cdots, \mathrm{MLP}(\alpha_{ij}^K), \mathrm{MLP}(\beta_{ij}^1), \cdots, \mathrm{MLP}(\beta_{ij}^K), \mathrm{MLP}(\gamma_{ij}^1), \cdots, \mathrm{MLP}(\gamma_{ij}^K)\}$.     (13)

Also, binary cross-entropy is used as the training objective, the same as in the advanced DocRE model (Yao et al., 2019).

Experiments

The proposed methods were evaluated on DocRED, a large-scale human-annotated dataset for document-level relation extraction (Yao et al., 2019). DocRED contains 3,053 documents for the training set, 1,000 documents for the development set, and 1,000 documents for the test set, in total with 132,375 entities, 56,354 relational facts, and 96 relation types. More than 40% of the relational facts require reading and reasoning over multiple sentences. For more detailed statistics about DocRED, we recommend readers refer to the original paper (Yao et al., 2019).
Following the settings of Yao et al.'s work, we used GloVe embeddings (100d) and a BiLSTM (128d) as the word embedding and encoder. The number of reasoning paths for each task was set to 3. The learning rate was set to 1e-3, and we trained the model using AdamW (Loshchilov and Hutter, 2019) as the optimizer with weight decay 0.0001 under PyTorch (Paszke et al., 2017). For the BERT representations, we used the uncased BERT-base model (768d) as the encoder and set the learning rate to 1e-5. For evaluation, we used F1 and Ign F1 as the evaluation metrics, where Ign F1 denotes the F1 score excluding relational facts shared by the training and development/test sets. The hyper-parameter for the number of reasoning paths was tuned on the development set, and the results on the test set were evaluated through CodaLab. Once a model is trained, we obtain the confidence score for every triple example (subject, object, relation) as in Eq. (12). We rank the predicted results by their confidence, traverse this list from top to bottom computing the F1 score on the development set, and pick the score value corresponding to the maximum F1 as the threshold θ. This threshold is used to control the number of extracted relational facts on the test set.
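For reference, a minimal sketch of this threshold-selection procedure is given below; `preds` and `num_gold` are illustrative names, and the routine assumes a non-empty list of scored predictions.

```python
def pick_threshold(preds, num_gold):
    """preds: list of (confidence, is_correct) for dev-set predictions;
    num_gold: number of gold relational facts on the dev set.
    Returns the cutoff theta that maximizes F1 when keeping all
    predictions with confidence >= theta."""
    preds = sorted(preds, key=lambda x: -x[0])   # rank by confidence
    best_f1, theta, tp = 0.0, preds[0][0], 0
    for k, (score, correct) in enumerate(preds, start=1):
        tp += int(correct)
        precision, recall = tp / k, tp / num_gold
        if precision + recall > 0:
            f1 = 2 * precision * recall / (precision + recall)
            if f1 > best_f1:
                best_f1, theta = f1, score
    return theta, best_f1
```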

Baseline Systems
We reported the results of recent graph-based DocRE methods as comparison systems, such as GAT (Veličković et al., 2018). We also reported state-of-the-art graph-based DocRE models with the pre-trained BERT-base model, including Two-Phase+BERT-base, LSR+BERT-base (Nan et al., 2020), GAIN+BERT-base (Zeng et al., 2020), HeterGSAN-Rec+BERT-base (Xu et al., 2021), and ATLOP+BERT-base (Zhou et al., 2021).

Main Results
Table 2 presents the detailed results on the development and test sets of the DocRE dataset. First, the proposed DRN model significantly outperformed the existing graph-based DocRE systems. Second, the proposed DRN model was superior to all the existing graph-based DocRE systems on the test set, validating that modeling reasoning discriminatively is more beneficial to DocRE than the original universal neural network approach. Meanwhile, it also outperformed the best HeterGSAN-Rec model by 1.10 points in terms of F1, validating the effectiveness of our discriminative reasoning method.
Third, for the comparisons with a pre-trained language model (BERT-base), the F1 score of the proposed DRN+BERT-base model was higher than those of the existing graph-based DocRE systems with BERT-base on the test set. In particular, our method (F1 61.37) was superior to the existing best ATLOP+BERT-base model (F1 61.30), which is a new state-of-the-art result on the DocRE dataset.

To evaluate the effect of the number of reasoning paths $K$ in Eq. (6), we reported the results for different numbers of reasoning paths, as shown in Table 3. When $K$ increased from 1 to 3, the F1 score of the proposed DRN model gradually improved from 55.81 to 56.33 on the test set, and the percentage of covered reasoning paths reached 90.40%. As the hyper-parameter $K$ continued to increase, F1 scores began to drop on the dev and test sets. The reason may be that the reasoning information provided by too many reasoning paths is duplicated, and the remaining 9.60% of reasoning paths may even introduce noise, while $K$=3 makes the proposed DRN gain the highest F1 score on the dev and test sets. Therefore, we set the hyper-parameter $K$ to three in our main results in Table 2.

Ablation Experiments
In the proposed DRN model, we model the different reasoning tasks discriminatively using HGCRep and DLCRep, and we choose the highest score as the final result. Instead of using the discriminative reasoning framework, previous work averaged the mention representations (HGCRep or DLCRep) to get the entity representation and concatenated the two entity representations to classify the relation, which we denote as the Uniform model. Table 4 shows ablation experiments of the framework and of the different reasoning contexts on the test set. Note that the Uniform model equipped with the discriminative reasoning framework is our DRN model. First, the DocRE models benefit from our discriminative reasoning framework no matter which reasoning context is used; specifically, the F1 score of the model with the framework was on average about 1 point higher than that of its counterpart without it. Second, when either DLCRep or HGCRep was removed, F1 scores drastically decreased on the test sets, confirming the necessity of learning both DLCRep and HGCRep for modeling reasoning discriminatively.

Analysis of the Reasoning Tasks
In this section, we first show the percentages of all entity pairs (396,790) and of entity pairs with relations (12,332) on the dev set that are assigned to the three defined reasoning tasks through the max operation in Eq. (12), as shown in Figure 3(a). Here, IR, LR, and CR denote the intra-sentence reasoning task, the logical reasoning task, and the coreference reasoning task, respectively. The percentages of IR, LR, and CR are 19.12%, 19.17%, and 61.71% for all entity pairs, respectively. This indicates that our three defined reasoning skills can completely cover all entity pairs, regardless of whether these entity pairs have relations or not. Also, the percentages of IR, LR, and CR are 47.58%, 13.91%, and 38.51% for entity pairs with relations, respectively. This is consistent with the statistical result in Yao et al.'s work that more than 40.7% of relational facts can only be extracted from multiple sentences, validating that our method can model different reasoning skills discriminatively on the DocRE dataset.

Moreover, Figure 3(b) shows the results of the HeterGSAN-Rec (abbreviated as Rec), GAIN, and our DRN models on the three different reasoning tasks. As seen, the F1 scores of the proposed DRN model are higher than those of the Rec and GAIN models over all three tasks. This means that modeling reasoning types explicitly can effectively advance DocRE. For all DocRE models, the F1 scores of the LR and CR tasks were far inferior to those of the IR task, which is consistent with the intuitive perception that inter-sentence reasoning is more difficult than intra-sentence reasoning.

To further examine the reasoning types selected in Eq. (12), we randomly sampled 72 documents from the dev set, which contain 916 relation instances, and asked three human annotators to label the reasoning types of all the entity pairs with relations in the sampled documents according to the three defined reasoning types: intra-sentence reasoning, logical reasoning, and coreference reasoning (the annotation data can be found at https://github.com/xwjim/DRN). Table 5 shows the number and F1 scores of each selected reasoning type on the sampled 72 documents. As seen, the F1 scores of IR, LR, and CR are 79.95%, 38.77%, and 44.87%, respectively, indicating that modeling reasoning discriminatively works well during the selection of reasoning paths in Eq. (12). Also, our method is capable of recognizing not only intra-sentence reasoning but also inter-sentence reasoning.

Analysis of the Reasoning Type
In addition, there is a certain percentage of mistakenly selected reasoning types, indicating that our method may still have room for improvement in the future.

[Figure 4: Relation classification for two entity pairs in a document about The Eminem Show, e.g., "[0] The Eminem Show is the fourth studio album by American rapper Eminem, released on May 26, 2002 ...", with its singles, Grammy nominations, and RIAA certification described in sentences [1]-[3].]

Figure 4 shows the relation classification of two entity pairs by our DRN model. For the first entity pair {"Superman", "May 26, 2002"}, there are reasoning paths for Task2 and Task3, with scores of 1.7604 and 0.2841, respectively. As a result, Task2 was used to correctly predict the relation "publication date" between "Superman" and "May 26, 2002". Meanwhile, the selection of Task2 is consistent with the ground-truth logical reasoning type. Moreover, the above reasoning process is similar for the entity pair {"The Eminem Show", "Eminem"} with three reasoning types.

Related Work
Early research efforts on relation extraction concentrated on predicting the relation between two entities within a sentence (Zeng et al., 2014, 2015; Wang et al., 2016; Sorokin and Gurevych, 2017; Feng et al., 2018; Song et al., 2019). These approaches do not consider interactions across mentions and ignore relations expressed across sentence boundaries, although the semantics of a document context is coherent and a part of relations can only be extracted across sentences. As large amounts of relationships are expressed by multiple sentences, recent work has started to explore document-level relation extraction, for example considering the relations between diseases and chemicals in entire documents in the biomedical domain (Gupta et al., 2019; Zhang et al., 2018; Christopoulou et al., 2019; Zhu et al., 2019). A large-scale general-purpose dataset for DocRE constructed from Wikipedia articles was proposed by Yao et al. (2019), which has greatly advanced DocRE. Most approaches to DocRE are based on document graphs, which were introduced by Quirk and Poon. Specifically, they use words as nodes, construct a homogeneous graph using syntactic parsing tools, and apply a graph neural network to capture the document information. This document graph provides a unified way of extracting features for entity pairs. Later work extends the idea by improving neural architectures (Peng et al., 2017; Verga et al., 2018; Gupta et al., 2019) or adding more types of edges (Christopoulou et al., 2019). In Christopoulou et al.'s work, the authors construct a graph containing nodes of different granularities (sentence, mention, entity) through co-occurrence and heuristic rules, without external tools. More recently, most approaches (Christopoulou et al., 2019; Zeng et al., 2020; Xu et al., 2021) construct heterogeneous graphs in this way. Zeng et al. (2020) constructed double graphs of different granularities to capture document-aware features and the interactions between entities. Xu et al. (2021) introduced a reconstructor to reconstruct the paths in the graph, guiding the model to learn good node representations. Other attempts focus on the multi-entity and multi-label problems (Zhou et al., 2021): Zhou et al. proposed two techniques, adaptive thresholding and localized context pooling, to address them.

Conclusion
In this paper, we propose a novel discriminative reasoning framework to consider different reasoning types explicitly. We use the meta-path strategy to extract the reasoning paths for the different reasoning types. Based on this framework, we propose a Discriminative Reasoning Network (DRN), in which we use both the heterogeneous graph context and the document-level context to represent the different reasoning paths. The ablation study validates the effectiveness of our discriminative framework and of its different modules on the large-scale human-annotated DocRE dataset.
In particular, our method achieves a new state-of-the-art performance on the DocRE dataset. In the future, we will explore more diverse structural information (Chen et al., 2018; Chen et al., 2020; Cohen et al., 2020) from the input document for the discriminative reasoning framework, and apply the proposed approach to other NLP tasks (Zhang et al., 2020a; Chen et al., 2020; Zhang et al., 2020b).