MRN: A Locally and Globally Mention-Based Reasoning Network for Document-Level Relation Extraction

Document-level relation extraction aims to detect relations within a document, which is challenging because it requires complex reasoning over mentions, entities, and local and global contexts. Few previous studies have explicitly distinguished local and global reasoning, which may be problematic because the two play different roles in intra- and inter-sentence relations. Moreover, based on our observation, the interactions between local and global contexts should also be considered, since they can aid relation reasoning. In this paper, we propose a novel Mention-based Reasoning Network (MRN) built on explicit and collaborative local and global reasoning. On top of MRN, we design a co-predictor module that jointly predicts entity relations from local and global entity and relation representations. We evaluate MRN on three widely-used benchmark datasets, namely DocRED, CDR, and GDA. Experimental results show that our model outperforms previous state-of-the-art models by a large margin.


Introduction
Relation extraction (RE), which identifies the semantic relations among target entities in text, has long been a fundamental task in the natural language processing (NLP) community (Zeng et al., 2014; Xu et al., 2015). Prior efforts largely focus on sentence-level RE (Lin et al., 2016; Zhang et al., 2018). However, recent studies reveal that a large number of relations are actually expressed across multiple sentences, which necessitates document-level RE. Compared with sentence-level RE, the entities involved in document-level relations may be mentioned in multiple sentences across a document. Therefore, document-level RE requires capturing the complex interactions between all entities in the entire document (Nan et al., 2020; Zeng et al., 2020).

[Figure 1: An example of document-level RE from the DocRED dataset. The same color denotes the mentions of the same entity.]

It is well known that local and global contexts are two key performance enhancers for the task. Intuitively, the former benefits the identification of nearer (e.g., intra-sentence) relations, while the latter is more useful for distant (e.g., inter-sentence) relations. For document-level RE, such context information is closely related to the ubiquitous mentions in a document, so mention-based reasoning with different context granularities is highly important for the task. However, most previous studies have not explicitly distinguished local and global reasoning (Peng et al., 2017; Sahu et al., 2019; Nan et al., 2020; Zhou et al., 2020; Zeng et al., 2020). Recent work investigates local and global contexts for document-level RE by performing global and local reasoning consecutively; however, such a pipeline method can be problematic because it ignores the interactions and communications between local and global contexts, which limits performance on the task.

In this paper, we aim to address the above issues by presenting a novel Mention-based Reasoning Network (MRN). As shown in Figure 2, MRN consists of several innovative modules for modeling relation reasoning locally and globally, including (1) a two-dimensional windowed convolution for capturing the local mention-to-mention interactions between the subject and object arguments of relations, and (2) a co-attention module for capturing the global interaction between each pair of entity mentions. Note that the two modules also provide mutual information to each other, in order to capture the interactions between local and global contexts.
Moreover, different from previous work that expresses entities with just one kind of representation (Peng et al., 2017; Sahu et al., 2019; Nan et al., 2020; Zhou et al., 2020; Zeng et al., 2020), our method distinguishes mentions into subjects and objects, and generates entity representations from both local and global mention-based reasoning. Specifically, we design a novel module, called the co-predictor, to utilize both local and global entity representations for jointly reasoning over the relations between close and distant entities. We conduct extensive experiments on three widely-used benchmarks, including DocRED, CDR (Li et al., 2016), and GDA (Wu et al., 2019). The results show that our MRN model outperforms the current best model by a large margin, demonstrating its effectiveness. In summary, we make the following contributions:

• We propose a mention-based reasoning network (MRN) to distinguish the impacts of close and distant entity mentions in relation extraction, while also considering the interactions between local and global contexts, which we call locally and globally mention-based reasoning.

• We propose a co-predictor that works in concert with the mention reasoning block and predicts the relation of a pair of entities using local and global features simultaneously.

• Our model achieves state-of-the-art performance on three benchmark datasets for document-level RE. We also conduct extensive analyses of our model to better understand its working mechanism.


Related Work

Relation extraction (RE), including sentence-level RE and document-level RE, plays a crucial role in a wide variety of knowledge-based applications, such as question answering (Hixon et al., 2015) and dialogue generation (He et al., 2017).
Recent studies largely focus on sentence-level RE using various neural network methods, such as CNNs (Zeng et al., 2014; dos Santos et al., 2015), BiLSTMs (Zhang et al., 2015; Cai et al., 2016), attention mechanisms (Lin et al., 2016), and neural graph models (Zhang et al., 2018; Zhu et al., 2019). However, in practice, many relational facts need to be inferred across multiple sentences in a document, so researchers have shown a growing interest in document-level RE. Compared with sentence-level RE, document-level RE needs to consider the complicated interactions between entities across multiple sentences. With this in mind, researchers have begun to use graph neural networks to reason over intra- and inter-sentence relations, making notable progress in extracting inter-sentence relations with document-level graph convolutional neural networks (Peng et al., 2017; Velickovic et al., 2018; Sahu et al., 2019). For example, Zhou et al. (2020) use entities as nodes and the contexts between entity pairs as edges to construct graphs. Nan et al. (2020) treat the graph as a latent structure and perform relational reasoning over it. However, most existing approaches only use entity-level information and ignore mention-level information.
Some studies also take mentions into account by adding mention nodes to the graph. For instance, some works put mentions and entities in the same graph, while others utilize a dual-tier heterogeneous graph to propagate relational information among entity mentions and then summarize it into the corresponding entities. More recently, multi-head attention has been used to aggregate the multiple mentions of a specific entity. Different from the above methods, Zeng et al. (2020) propose a graph aggregation and inference network that includes two graphs, one for capturing complex interactions among different mentions, and the other for integrating mention-level information of the same entities. Although these methods introduce mention-level nodes or graphs, none of them considers local mention-based contextual information or mention-level relative distances, both of which are considered in our model.

Methodology
Task Formulation. Given an input document consisting of N tokens, the task aims to extract a subset of relations from a pre-defined relation type set R between entity pairs (e_i^s, e_j^o), where e_i^s and e_j^o are identified as the subject and object entities, respectively. An entity e_i can appear multiple times in the document via K_{e_i} mentions M_{e_i} = {m_j}_{j=1}^{K_{e_i}}.

Framework. As illustrated in Figure 2, the overall architecture consists of three tiers. An encoder first yields contextual representations. Then, the mention-based reasoning block (stacked in multiple layers) performs local-level feature extraction with a two-dimensional (2D) convolution and global-level feature retrieval with a co-attention module. Afterward, a co-predictor layer containing 1D and 2D dynamic pooling aggregates entity-level features from the mention-based reasoning block. Finally, a multi-layer perceptron and a biaffine classifier are leveraged to jointly reason over the relations between subject and object entities.

Encoding Layer
We first map each word w_i into a vector and concatenate it with the embedding of its corresponding entity type t_i, where t_i denotes the type of the entity mention containing the word (e.g., if an entity's type is Person, each word of its mentions also receives the type Person).
Then, we adopt a BiLSTM to encode the vectorial word representations into contextualized word representations h_i, where h_i is the hidden representation of the i-th token. Note that BERT (Devlin et al., 2019) can also be used as an alternative encoder to improve performance. Based on h_i, we obtain the representation of the i-th mention by pooling the token representations between its start and end positions a_i and b_i.
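As a rough sketch of this encoding layer, one might write the following; all dimensions, vocabulary sizes, and the choice of mean-pooling for mention spans are our assumptions for illustration, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch: word + entity-type embeddings concatenated, then a BiLSTM."""
    def __init__(self, vocab_size=1000, n_types=7, word_dim=100, type_dim=20, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.type_emb = nn.Embedding(n_types, type_dim)
        self.bilstm = nn.LSTM(word_dim + type_dim, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, words, types):
        # Concatenate word and entity-type embeddings per token.
        x = torch.cat([self.word_emb(words), self.type_emb(types)], dim=-1)
        h, _ = self.bilstm(x)          # (batch, seq_len, 2 * hidden)
        return h

def mention_repr(h, start, end):
    """Pool token states h[:, start:end] into one mention vector (mean here)."""
    return h[:, start:end].mean(dim=1)

enc = Encoder()
words = torch.randint(0, 1000, (1, 12))   # 12 dummy token ids
types = torch.randint(0, 7, (1, 12))      # matching entity-type ids
h = enc(words, types)                     # (1, 12, 256)
m = mention_repr(h, 3, 6)                 # mention spanning tokens 3..5
```

Replacing the BiLSTM with a BERT encoder, as the paper notes, would only change how `h` is produced; the mention pooling stays the same.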

Interactive Mention-based Reasoning Layer
As we argued earlier, local and global context information is closely related to the ubiquitous mentions in a document. We thus propose a mention-based module for multi-hop reasoning over the relationships among all mentions. Considering that there exist overlapping relations where multiple relations share the same mention, we distinguish mentions into subjects and objects by their directions. Moreover, near neighbors are more informative than distant ones for determining relations. To this end, we adopt a two-dimensional (2D) windowed convolution to capture local interactions between close subject and object mentions.
As shown in Figure 3, we perform multi-hop reasoning by stacking multiple layers (i.e., L) of our interactive mention-based reasoning blocks. The inputs of the l-th block include a set of subject mention representations M^{l-1,s}, a set of object mention representations M^{l-1,o}, and the mention relation representation Q^{l-1}. We first build a grid C^l of mention pairs from the subject and object mention representations. Next, we concatenate C^l and Q^{l-1} and adopt a feedforward network (FFN) to reduce their dimensionality. We then employ a 2D convolution followed by a LeakyReLU activation σ to capture local contextual interactions, which can be regarded as extracting subgraph representations from a fully-connected bipartite graph containing subject and object nodes.
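A minimal sketch of this local-reasoning step follows, assuming a kernel size of 3 and illustrative dimensions; the grid construction and the FFN shape are our guesses at the design, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

# Illustrative sizes: 4 subject mentions, 5 object mentions, hidden size 64.
n_subj, n_obj, d = 4, 5, 64

subj = torch.randn(n_subj, d)   # subject mention representations
obj = torch.randn(n_obj, d)     # object mention representations

# Build the 2D grid: cell (i, j) concatenates subject i and object j.
grid = torch.cat([
    subj.unsqueeze(1).expand(n_subj, n_obj, d),
    obj.unsqueeze(0).expand(n_subj, n_obj, d),
], dim=-1)                                          # (n_subj, n_obj, 2d)

ffn = nn.Linear(2 * d, d)                           # reduce concatenated features
conv = nn.Conv2d(d, d, kernel_size=3, padding=1)    # windowed 2D convolution

x = ffn(grid).permute(2, 0, 1).unsqueeze(0)         # (1, d, n_subj, n_obj)
q = torch.nn.functional.leaky_relu(conv(x))         # local mention-pair features
```

The 3x3 window means each mention pair only interacts with neighboring pairs in the grid, which is what makes the reasoning "local".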
Global Reasoning. We introduce a co-attention mechanism to compute attention coefficients that indicate the importance of subjects to objects and vice versa, so that the interaction between each pair of mentions is considered. Inspired by the success of the graph attention network (GAT) (Velickovic et al., 2018), we apply two learnable linear transformations to map the subject and object mention representations M^{l-1,s} and M^{l-1,o} into higher-level features. Then, we leverage the mention relation representation Q^l to calculate the attention coefficients α^{l,ψ}_{ij} = Softmax(FFN(Q^l)) and inject them into the co-attention process, where W^{l,ψ} is a learnable parameter matrix and φ, ψ ∈ {s, o}. Afterward, the mention representation of the next layer, m^{l,φ}_i, is generated by adding the residual of the last layer, m^{l-1,φ}_i, to a non-linear transformation of the co-attention output.

[Figure 4: The 1D and 2D dynamic pooling.]
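The co-attention step might be sketched as follows; how the logits are derived from Q^l, the tanh non-linearity, and all dimensions are assumptions chosen for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_subj, n_obj, d = 4, 5, 64
subj = torch.randn(n_subj, d)          # M^{l-1,s}
obj = torch.randn(n_obj, d)            # M^{l-1,o}
Q = torch.randn(n_subj, n_obj, d)      # mention relation representation Q^l

W_s = nn.Linear(d, d, bias=False)      # learnable transform for subjects
W_o = nn.Linear(d, d, bias=False)      # learnable transform for objects
score_ffn = nn.Linear(d, 1)            # maps Q^l cells to attention logits

logits = score_ffn(Q).squeeze(-1)      # (n_subj, n_obj)
alpha_s = F.softmax(logits, dim=1)     # each subject attends over all objects
alpha_o = F.softmax(logits, dim=0)     # each object attends over all subjects

# Aggregate transformed features from the other side, then add a residual
# connection to the previous layer's representation.
new_subj = subj + torch.tanh(alpha_s @ W_o(obj))      # (n_subj, d)
new_obj = obj + torch.tanh(alpha_o.t() @ W_s(subj))   # (n_obj, d)
```

Because every subject attends to every object (and vice versa), this step is global: distant mention pairs interact in a single hop, complementing the windowed convolution above.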

Co-Predictor Layer
After the last reasoning block (i.e., the L-th block), we obtain the final mention representations M^{L,s} and M^{L,o}, as well as the mention relation representation Q^L. Since different mentions may belong to different entities, we apply 1D and 2D max-pooling to aggregate mention-level features to the entity level (Figure 4). Then, we apply two predictors to calculate two relation distributions for each entity pair (e_i, e_j) and combine them to obtain the final prediction.
Local Predictor. Based on the mention relation representation Q^L generated by the mention-based reasoning block, we adopt 2D dynamic max-pooling (cf. Figure 4, right) to aggregate mention-level features into entity-level features, where M_{e_i} and M_{e_j} are the mention sets corresponding to the i-th and j-th entities, respectively.
Then we employ an FFN to generate prediction scores for the entity pair.

Global Predictor. Based on the mention representations M^{L,s} and M^{L,o}, the representations of the i-th and j-th entities can also be generated by 1D dynamic max-pooling (Figure 4, left) over the mention set M_{e_i} of each entity, where φ ∈ {s, o} and m^{L,φ} ∈ M^{L,φ}. Then, a biaffine classifier (Dozat and Manning, 2017) is used to compute the relation scores between a pair of subject and object entities, where U, W, and b are trainable parameters.
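The two predictors could be sketched as below, assuming hypothetical mention-index sets and 97 relation types (the size of the DocRED label set); the exact pooling and scoring details may differ from the paper's implementation:

```python
import torch
import torch.nn as nn

n_rel, d = 97, 64
Q = torch.randn(4, 5, d)     # Q^L: mention-pair relation features
subj = torch.randn(4, d)     # M^{L,s}
obj = torch.randn(5, d)      # M^{L,o}

# Suppose entity e_i owns subject mentions {0, 2} and entity e_j owns
# object mentions {1, 4} (hypothetical index sets for illustration).
mi, mj = [0, 2], [1, 4]

# Local predictor: 2D dynamic max-pooling over the mention-pair block of
# (e_i, e_j) in Q^L, followed by an FFN scoring layer.
pair_feat = Q[mi][:, mj].reshape(-1, d).max(dim=0).values   # (d,)
local_scores = nn.Linear(d, n_rel)(pair_feat)               # (n_rel,)

# Global predictor: 1D max-pooling per entity, then a biaffine classifier
# with trainable U, W, b.
e_i = subj[mi].max(dim=0).values                            # (d,)
e_j = obj[mj].max(dim=0).values                             # (d,)
U = torch.randn(n_rel, d, d)
W = torch.randn(n_rel, 2 * d)
b = torch.randn(n_rel)
global_scores = torch.einsum('i,rij,j->r', e_i, U, e_j) \
    + W @ torch.cat([e_i, e_j]) + b                         # (n_rel,)
```

The "dynamic" aspect is that each entity pair pools over a different, variable-sized block of mentions, so the pooling region is determined per pair at run time.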

Joint Prediction
The final relation probability for entity pair (e_i, e_j) with regard to relation r is obtained by combining the scores from the local and global predictors.

Learning
Considering the imbalance of positive and negative samples in document-level RE, we use the asymmetric loss (ASL) (Ben-Baruch et al., 2020) instead of the binary cross-entropy loss:

L+ = (1 − P(r|e_i, e_j))^{γ+} log(P(r|e_i, e_j)),
L− = (P_n(r|e_i, e_j))^{γ−} log(1 − P_n(r|e_i, e_j)),

where γ+ and γ− are the focusing hyper-parameters for positive and negative samples (γ− > γ+), which emphasize the contribution of positive samples while down-weighting the contribution of easy negative samples. P_n(r|e_i, e_j) = max(P(r|e_i, e_j) − n, 0) is a probability shift mechanism that further filters out easy negative samples, where the probability margin n ≥ 0 is a hyper-parameter. The final loss sums these terms over the whole dataset S, with the indicator function I(·) selecting positive and negative samples.
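A sketch of ASL under these definitions follows; the hyper-parameter values and the clamping epsilon are illustrative choices, not the paper's tuned settings:

```python
import torch

def asymmetric_loss(probs, targets, gamma_pos=1.0, gamma_neg=4.0, margin=0.1):
    """Sketch of ASL (Ben-Baruch et al., 2020) for multi-label relation scores.

    probs:   predicted relation probabilities in [0, 1]
    targets: binary relation labels (1 = positive, 0 = negative)
    """
    # Probability shift: p_n = max(p - n, 0) filters out easy negatives.
    probs_neg = (probs - margin).clamp(min=0)
    # Focusing terms: gamma_neg > gamma_pos down-weights easy negatives more.
    loss_pos = targets * (1 - probs) ** gamma_pos * torch.log(probs.clamp(min=1e-8))
    loss_neg = (1 - targets) * probs_neg ** gamma_neg \
        * torch.log((1 - probs_neg).clamp(min=1e-8))
    return -(loss_pos + loss_neg).sum()

probs = torch.tensor([0.9, 0.05, 0.3])
targets = torch.tensor([1.0, 0.0, 0.0])
loss = asymmetric_loss(probs, targets)
```

Note how the easy negative (p = 0.05) contributes zero loss after the shift, while the harder negative (p = 0.3) still contributes, which is exactly the filtering behavior the text describes.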

Results on DocRED
Furthermore, we observe that performance can be substantially boosted with the help of BERT: Ign-F1 and F1 increase by 3.33% and 3.28% on the test set, respectively. Notably, MRN with GloVe embeddings achieves better results than several BERT-based models, such as CorefBERT and GLRE. This suggests that our model is more effective at capturing complex interactions between close and distant mentions, even without the help of pre-trained embeddings.


Results on CDR and GDA

Table 3 shows the results on the two biomedical datasets. [Table 3 (excerpt), CDR dataset, F1 / Intra-F1 / Inter-F1: ME-CNN (Gu et al., 2017) 61.3 / 57.2 / 11.7; BRAN (Verga et al., 2018) 62.1 / – / –; C-CHAR (Nguyen and Verspoor, 2018) 62.3 / – / –; GCNN (Sahu et al., 2019) 58.6 / – / –; EoG 63.6 / 68.2 / 50.9; DHG 64.7 / 68.6 / 54.1; LSR (Nan et al., 2020) 64.– (truncated).] Here, the baselines are divided into sequence-based models (ME-CNN, BRAN, and C-CHAR) and graph-based models (GCNN, EoG, DHG, and LSR). Similar to DocRED, the graph-based models generally outperform the sequence-based models on CDR, which reveals the effectiveness of incorporating structural information and reasoning mechanisms in document-level RE. Besides, our MRN model achieves better performance than the state-of-the-art models on the CDR and GDA datasets, outperforming LSR by 1.1% and DHG by 0.7%, respectively.

[Table 4: Ablation studies on the DocRED dataset.]

Intra- and Inter-sentence Relation Extraction
According to recent work, identifying about 40.7% of relations requires information from multiple sentences, which indicates that a model's reasoning ability plays an important role in document-level RE. Thus, we also report the performance of intra- and inter-sentence relation extraction on the three datasets in Tables 2 and 3. We find that our model outperforms the current best models on all datasets in terms of both intra- and inter-F1. For example, MRN improves intra-F1 by 3.57% and inter-F1 by 2.14% compared with GAIN on the development set of DocRED. This shows that mention-level reasoning is highly effective at capturing the complex interactions between subject and object mentions, especially when both local contexts and long-range dependencies are considered.

Ablation Studies
We ablate each part of our MRN model on the development set of DocRED, as shown in Table 4. First, without entity type embeddings at the encoding layer, we observe slight performance drops. Removing relative distance information also decreases performance to a small degree. Furthermore, after removing the interactive mention-based reasoning layer, or the global or local predictor, performance drops significantly. In particular, the decrease in inter-F1 after removing the global predictor (3.92% = 50.91% − 46.99%) is clearly larger than that for the local predictor (1.90% = 50.91% − 49.01%), which verifies the usefulness of global features for long-dependency relation reasoning. A significant drop is also found when replacing dynamic max-pooling with average pooling. Finally, sharing object and subject representations, or using the binary cross-entropy loss, also degrades the model to a certain degree.

Effect Analysis for Co-Predictor
In this section, we investigate the effect of the global and local predictors of MRN. We divide the relation instances in the development set of DocRED into three groups: one where both the subject and object arguments have a single mention (s-s), one where either the subject or the object argument has a single mention (s-m), and one where both the subject and object arguments have multiple mentions (m-m). We also evaluate our model using different predictor configurations. As shown in Figure 5, the model with both local and global predictors consistently outperforms the other configurations. F1 increases as the subjects and objects are mentioned more times (from s-s to m-m), indicating that multiple mentions appearing in various positions can provide more information to the model. When removing the local predictor, we observe a large drop for the s-s group, especially in intra-F1. This demonstrates that the s-s group consists mostly of intra-sentence relations, which depend mostly on local reasoning. Moreover, if the global predictor is discarded, the inter-F1 scores of the groups where subjects and objects are mentioned multiple times (s-m and m-m) drop the most. This reveals that the global predictor is more beneficial for extracting relations involving multiple-mention entities or inter-sentence entities.

Effect Analysis for Inter- and Intra-sentence Training Data
As shown in Figure 6, we analyze the variation of inter- and intra-F1 scores when increasing or decreasing the proportions of intra- and inter-sentence training instances on the DocRED dataset. Note that the proportions of intra- and inter-sentence relation instances in the training set are 54.5% and 45.5%, respectively. The experimental setting is as follows: first, we use 5% of the inter-sentence training instances and observe the intra-F1; we then gradually increase this percentage to 10%, 20%, 50%, and 100%. During this process, all intra-sentence training instances are used. The objective of these steps is to observe the effect of inter-sentence training instances on the intra-F1. In addition, we conduct a similar experiment to observe the inter-F1 by gradually increasing the proportion of intra-sentence training instances.
As the red line in Figure 6 shows, the intra-F1 is only slightly influenced by the number of inter-sentence training instances. In contrast, the number of intra-sentence training instances has a significant impact on the inter-F1, since the inter-F1 grows dramatically (the blue line) when more intra-sentence training instances are added. This suggests that interactions between intra- and inter-sentence relations indeed exist, and that one may be helpful for reasoning about the other.

Case Study
As shown in Figure 7, we present a case study to better understand the effect of our proposed MRN, in comparison with the previous state-of-the-art baseline GAIN. We can observe that Monticello is the object of the intra-sentence relation triple ('Moore', educated at, 'Monticello') and also the subject of the inter-sentence triple ('Monticello', country, 'U.S.'). GAIN fails to identify the relation between 'U.S.' and 'Monticello', while MRN deduces it successfully. This demonstrates the effectiveness of distinguishing subjects and objects at the inference stage, and that MRN has a strong capability for inter-sentence reasoning. Meanwhile, GAIN makes a wrong prediction between 'Monticello' and 'Illinois' in the 4th sentence, indicating that our model also has better local inference ability.

Conclusion
We propose a novel mention-based reasoning network (MRN) for document-level relation extraction. Our model is capable of capturing local and global contextual information as well as close and distant mention interactions, via multiple mechanisms such as a multi-hop mention-level reasoning block and collaborative predictors. Experimental results show that our proposed model achieves a new state of the art on three widely-used datasets. Through empirical analyses, we find that it is reasonable for document-level RE models to pay more attention to local contexts and close mentions. Meanwhile, global contexts and distant mention interactions are also highly important for document-level RE. Last but not least, joint reasoning with local and global context information is a reasonable and effective method for the task.

A Effect Analysis for Sentence Number
As shown in Figure 8, we display the F1 scores of MRN and GAIN on the development set of DocRED with regard to document length. We measure document length by the number of sentences in a document, which varies from 3 to 13. The results show that MRN attains better performance than GAIN regardless of document length. In addition, the performance gap between MRN and GAIN becomes larger as documents become longer; for instance, MRN outperforms GAIN by about 10% on the group where the document length is 13. This demonstrates that our model is more robust on long documents.

B Architecture Analysis for MRN Block
We conduct experiments on the interactive mention-based reasoning block using the development set of DocRED, to understand which configuration works better. As shown in the left part of Figure 9, our model performs best with a kernel size of 3 in terms of all evaluation metrics. Meanwhile, the right part of Figure 9 shows that 3 is also a reasonable choice for the number of MRN blocks.

C Analysis of Loss Functions
To compare the ASL and BCE losses, we compare their learning curves on DocRED while keeping the other settings of our model the same, as shown in Figure 10. Over 80 epochs, we observe that the ASL loss helps our model converge to better performance than the BCE loss, and at a faster speed, demonstrating the effectiveness of the asymmetric strategies for positive and negative samples.

D Implementation Details
In this section, we provide more details of our experiments. We implemented MRN with PyTorch