Learning Logic Rules for Document-Level Relation Extraction

Document-level relation extraction aims to identify relations between entities in a whole document. Prior efforts to capture long-range dependencies have relied heavily on implicitly powerful representations learned through (graph) neural networks, which makes the model less transparent. To tackle this challenge, in this paper we propose LogiRE, a novel probabilistic model for document-level relation extraction that learns logic rules. LogiRE treats logic rules as latent variables and consists of two modules: a rule generator and a relation extractor. The rule generator generates logic rules potentially contributing to final predictions, and the relation extractor outputs final predictions based on the generated logic rules. The two modules can be efficiently optimized with the expectation-maximization (EM) algorithm. By introducing logic rules into neural networks, LogiRE can explicitly capture long-range dependencies and enjoys better interpretability. Empirical results show that LogiRE significantly outperforms several strong baselines in terms of relation extraction performance and logical consistency. Our code is available at https://github.com/rudongyu/LogiRE.


Introduction
Extracting relations from a document has attracted significant research attention in information extraction (IE). Recently, instead of focusing on the sentence level (Socher et al., 2012; dos Santos et al., 2015; Han et al., 2018; Wang et al., 2021a,b), researchers have turned to modeling directly at the document level (Wang et al., 2019; Ye et al., 2020; Zhou et al., 2021), which provides longer context and requires more complex reasoning. Early efforts focus mainly on learning a powerful relation (i.e., entity pair) representation, which implicitly captures long-range dependencies. According to the input structure, existing document-level relation extraction work can be divided into two categories: sequence-based models and graph-based models.
* Corresponding authors.
† Work done while at ByteDance.
Sequence-based models first leverage a sequence encoder (e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019)) to obtain token representations, and then compute relation representations with various pooling operations, e.g., average pooling (Yao et al., 2019; Xu et al., 2021) or attentive pooling (Zhou et al., 2021). To further capture long-range dependencies, graph-based models were proposed. By constructing a graph, words or entities that are far apart can become neighboring nodes. On top of the sequence encoder, a graph encoder (e.g., a GNN) can aggregate information from all neighbors, thus capturing longer dependencies. Various forms of graphs have been proposed, including dependency trees (Peng et al., 2017), co-reference graphs (Sahu et al., 2019), mention-entity graphs (Zeng et al., 2020), entity-relation bipartite graphs, and so on. Despite their great success, there is still no comprehensive understanding of the internal representations, which are often criticized as mysterious "black boxes".
Learning logic rules can discover and represent knowledge in explicit symbolic structures that can be understood and examined by humans. At the same time, logic rules provide another way to explicitly capture interactions between entities and output relations in a document. For example in Fig. 1, the identification of royalty_of(Kate,UK) requires information in all three sentences. The demonstrated logic rule can be applied to directly obtain this relation from the three relations locally extracted in each sentence. Reasoning over rules bypasses the difficulty of capturing long-range dependencies and interprets the result with intrinsic correlations. If the model could automatically learn rules and use them to make predictions, then we would get better relation extraction performance and enjoy more interpretation.
In this paper, we propose LogiRE, a novel probabilistic model that captures intrinsic interactions among relations via logic rules. Inspired by RNNLogic (Qu et al., 2021), we treat logic rules as latent variables. Specifically, LogiRE consists of a rule generator and a relation extractor, which are simultaneously trained to enhance each other. The rule generator provides logic rules that are used by the relation extractor for prediction, and the relation extractor provides supervision signals to guide the optimization of the rule generator, which significantly reduces the search space. In addition, the proposed relation extractor is model-agnostic, so it can be used as a plug-and-play technique on top of any existing relation extractor. The two modules can be efficiently optimized with the EM algorithm. By introducing logic rules into neural networks, LogiRE can explicitly capture long-range dependencies between entities and output relations in a document and enjoys better interpretability. Our main contributions are listed below:
• We propose a novel probabilistic model for relation extraction by learning logic rules. The model can explicitly capture dependencies between entities and output relations, while enjoying better interpretability.
• We propose an efficient iterative optimization method for LogiRE based on the EM algorithm.
• Empirical results show that LogiRE significantly outperforms several strong baselines in terms of relation extraction performance (∼1.8 F1 score) and logical consistency (over 3.3 logic score).

Related Work
For document-level relation extraction, prior efforts on capturing long-range dependencies mainly followed two directions: pursuing stronger sequence representations (Nguyen and Verspoor, 2018; Verga et al., 2018; Zheng et al., 2018) or encoding priors about interactions among entities as graphs. For more powerful representations, they introduced pre-trained language models (Wang et al., 2019; Ye et al., 2020), leveraged attention for context pooling (Zhou et al., 2021), or integrated scattered information hierarchically (Tang et al., 2020). Aiming to model the intrinsic interactions among entities and relations, they utilized implicit reasoning structures by carefully designing graphs connecting mentions to entities, mentions in the same sentence, mentions of the same entities (Zeng et al., 2020), etc. Nan et al. (2020) and Xu et al. (2021) directly integrated similar structural dependencies into the attention mechanisms of the encoder. These approaches helped obtain powerful representations for distinguishing various relations but lacked interpretability of the implicit reasoning. Another line of work that can capture dependencies between relations is globally normalized models (Andor et al., 2016). In this work, we focus on how to learn and use logic rules to capture long-range dependencies between relations.

Another category of related work is logical reasoning. Many studies have been conducted on learning or applying logic rules for reasoning. Most of them (Qu and Tang, 2019) concentrated on reasoning over knowledge graphs, aiming to deduce new knowledge from existing triples. Neural symbolic systems (Hu et al., 2016; Wang and Poon, 2018) combined logic rules and neural networks to benefit from regularization of deep learning approaches. These efforts demonstrated the effectiveness of integrating neural networks with logical reasoning.
Despite doc-RE providing a suitable scenario for logical reasoning (with relations serving as predicates and entities as variables), no existing work has attempted to learn and utilize rules in this field. Using hand-crafted rules, Wang and Pan (2020) and Wu et al. (2020) achieved great success on sentence-level information extraction tasks. However, their rules were predefined and limited to low-level operations, restricting their applicability.

Figure 2: The overview of LogiRE. LogiRE consists of two modules: a rule generator and a relation extractor. For a given document D and a query triple q, we treat the required logic rules as a latent variable z, aiming to identify the corresponding truth value y. During inference, we sample the latent rule set z from the rule generator and use the relation extractor to predict y given the rules. The overall objective (maximizing the likelihood) is optimized by the EM algorithm. In the E-step, we estimate the approximate posterior q(z); in the M-step, we maximize a lower bound of the likelihood w.r.t. θ and φ.

Method
In this section, we describe the proposed method LogiRE that learns logic rules for document-level relation extraction. We first define the task of document-level relation extraction and logic rules.

Document-level Relation Extraction
Given a set of entities E with mentions scattered in a document D, we aim to extract a set of relations R. A relation is a triple (h, r, t) ∈ R (also denoted by r(h, t)), where h ∈ E is the head entity, t ∈ E is the tail entity, and r is the relation type describing the semantic relation between the two entities. Let T be the set of possible relation types (including reverse relation types). For simplicity, we define a query q = (h, r, t) and aim to model the probability distribution p(y | q, D), where y ∈ {−1, 1} is a binary variable indicating whether (h, r, t) is valid or not, and h, t ∈ E, r ∈ T. In this paper, bold letters indicate variables.
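The task setup can be summarized with minimal data structures (a sketch; all class and field names are our own, not from the authors' code):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Query:
    """A candidate relation q = (h, r, t), to be classified as valid (y = 1) or not (y = -1)."""
    head: str   # head entity h in E
    rel: str    # relation type r in T (possibly a reverse type, e.g. "capital-1")
    tail: str   # tail entity t in E

@dataclass
class Document:
    tokens: List[str]                         # tokenized document D
    mentions: List[Tuple[str, int, int]]      # (entity, start, end) spans per mention

def enumerate_queries(entities: List[str], rel_types: List[str]) -> List[Query]:
    """Every ordered entity pair and relation type yields one query;
    the model estimates p(y | q, D) for each."""
    return [Query(h, r, t)
            for h in entities for t in entities if h != t
            for r in rel_types]

qs = enumerate_queries(["Kate", "UK"], ["royalty_of"])
```

With two entities and one relation type this enumerates two queries, one per ordered pair.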
Logic Rule We extract relations from the document by learning logic rules, where logic rules in this work have the conjunctive form

r(h, t) ← r1(h, z1) ∧ r2(z1, z2) ∧ · · · ∧ rl(z(l−1), t),

where ri ∈ T and l is the rule length. This form can express a wide range of common logical relations such as symmetry and transitivity. For example, transitivity can be expressed as r(x, z) ∧ r(z, y) → r(x, y). Inspired by RNNLogic (Qu et al., 2021), to infer high-quality logic rules in the large search space, we separate rule learning from weight learning and treat the logic rules as a latent variable. LogiRE consists of two main modules: the rule generator and the relation extractor, which are simultaneously trained to enhance each other. Given the query q = (h, r, t) in the document D, on the one hand, the rule generator adopts an auto-regressive model to generate a set of logic rules based on q, which is used to help the relation extractor make the final decision; on the other hand, the relation extractor provides supervision signals to update the rule generator via posterior inference, which greatly reduces the search space to high-quality rules.
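To make the conjunctive form concrete, the sketch below grounds a length-2 rule body over a set of locally extracted triples; all relation names here are illustrative, not taken from the paper's rule set:

```python
def ground_rule(head_rel, body, facts):
    """Ground a conjunctive rule body [r1, ..., rl] against a fact set and
    emit the implied head_rel(h, t) triples.
    facts: set of (rel, head, tail) triples."""
    paths = None                          # list of (start, current_end) pairs
    for rel in body:
        edges = [(h, t) for (r, h, t) in facts if r == rel]
        if paths is None:
            paths = edges
        else:
            # extend each partial path with an edge whose head matches its end
            paths = [(s, t2) for (s, c) in paths for (h2, t2) in edges if h2 == c]
    return {(head_rel, s, t) for (s, t) in (paths or [])}

# Toy facts standing in for sentence-level extractions (illustrative names):
facts = {("member_of", "Kate", "royal_family"), ("family_of", "royal_family", "UK")}
new = ground_rule("royalty_of", ["member_of", "family_of"], facts)
# new == {("royalty_of", "Kate", "UK")}
```

This is exactly the "shortcut" behavior described above: the document-level relation follows from two locally extracted relations without any long-range representation.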
Unlike existing methods that capture the interactions among relations in a document by learning powerful representations, we introduce a novel probabilistic model, LogiRE (Sec. 3.1, Fig. 2), which explicitly enhances these interactions by learning logic rules. LogiRE uses neural networks to parameterize the rule generator and the relation extractor (Sec. 3.2), which are optimized by the EM algorithm in an iterative manner (Sec. 3.3).

Overview
We formulate document-level relation extraction in a probabilistic way, where a set of logic rules is treated as a latent variable z. Given a query variable q = (h, r, t) in the document D, we define the target distribution p(y | q, D) as:

p(y | q, D) = Σ_z pθ(z | q) · pφ(y | q, z, D),

where pθ is the rule generator, which defines a prior over the latent variable z conditioned on a query q (we assume the distribution of z is independent of the document D), and pφ is the relation extractor, which gives the probability of y conditioned on the query q, the latent z, and the document D. Given the gold label y* of the query q in the document D, the objective is to maximize the log-likelihood:

L(θ, φ) = log p_{θ,φ}(y = y* | q, D).   (1)

Due to the latent variable z in the objective L, we use the EM algorithm for optimization (Sec. 3.3).
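For a toy latent space, the marginalization over z can be written out directly (a numerical sketch with assumed probabilities, not the paper's parameterization):

```python
import math

def log_likelihood(prior, likelihood, y_true):
    """log p(y* | q, D) = log sum_z p_theta(z | q) * p_phi(y* | q, z, D).
    prior: dict z -> p_theta(z | q)
    likelihood: dict z -> p_phi(y = 1 | q, z, D)"""
    p = sum(prior[z] * (likelihood[z] if y_true == 1 else 1.0 - likelihood[z])
            for z in prior)
    return math.log(p)

# Two candidate rule sets with assumed prior and extractor probabilities:
prior = {"z1": 0.7, "z2": 0.3}
lik = {"z1": 0.9, "z2": 0.2}
ll = log_likelihood(prior, lik, y_true=1)   # log(0.7*0.9 + 0.3*0.2) = log(0.69)
```

The sum over z is what makes direct optimization hard once z ranges over all rule sets, motivating the EM treatment below.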

Parameterization
We use neural networks to parameterize the rule generator and the relation extractor.

Rule Generator
The rule generator defines the distribution pθ(z | q). For a query q, the rule generator generates a set of logic rules, denoted by z, for predicting the truth value of the query. Formally, given a query q = (h, r, t), we generate logic rules of the form r ← r1 ∧ · · · ∧ rl. Such relation sequences [r1, . . . , rl] can be effectively modeled by an autoregressive model. In this work, we employ a Transformer-based autoregressive model AutoRegθ to parameterize the rule generator, which sequentially generates each relation ri. In this process, the probabilities of the generated rules are computed simultaneously. We assume that the rule set z obeys a multinomial distribution with N rules independently sampled from the distribution AutoRegθ(rule | q):

z ∼ Multi(N, AutoRegθ(rule | q)),

where Multi denotes the multinomial distribution, N is a hyperparameter for the size of the set, and AutoRegθ defines a distribution over logic rules conditioned on the query q.

Relation Extractor

The relation extractor defines pφ(y | q, z). It utilizes the set of logic rules z to estimate the truth value y corresponding to the query q. For each query q, a rule in z may match different grounding paths in the document D. Following the product t-norm fuzzy logic (Cignoli et al., 2000), we score each rule as follows:

score_φ(rule) = Σ_{path ∈ P(rule)} Π_i φ(e_{i−1}, r_i, e_i),

where P(rule) is the set of grounding paths that start at h and end at t following the rule, and φ(e_{i−1}, r_i, e_i) is the confidence score obtained from any existing relation extraction model. To get the probability (fuzzy truth value) of y, we synthesize the evaluation results of the rules in the latent rule set z. The satisfaction of any rule body implies the truth of y, so we take the disjunction over all rules in z as the target truth value. Following the principled sigmoid-based fuzzy logic function for disjunction (Sourek et al., 2018), we define the fuzzy truth value as:

pφ(y = 1 | q, z) = σ( bias(r) + Σ_{rule ∈ z} w(r, rule) · score_φ(rule) ),

where bias(r) and w(r, rule) are learnable scalar weights. bias(r) is a bias term for balancing the scores of positive and negative cases; w(r, rule) estimates the quality of a specific rule; and score_φ(rule) evaluates the accessibility from the head entity h to the tail entity t through the meta path defined by the rule's body. Applying logic rules and reasoning over them enables the relation extractor to explicitly model long-range dependencies as interactions among entities and relations.
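The extractor's scoring scheme can be sketched as follows: each rule is scored by the product t-norm over its grounding paths, and the rules in z are combined through a sigmoid-based soft disjunction. This is illustrative code; phi, the path sets, and the learnable weights are placeholders for whatever backbone model and parameters are actually used.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rule_score(paths, phi):
    """Product t-norm: sum over grounding paths of the product of
    per-edge confidences phi(e_prev, rel, e_next) from the backbone extractor."""
    total = 0.0
    for path in paths:                        # path: list of (e_prev, rel, e_next)
        conf = 1.0
        for (e_prev, rel, e_next) in path:
            conf *= phi(e_prev, rel, e_next)
        total += conf
    return total

def truth_value(query_rel, rules_with_paths, phi, weight, bias):
    """Soft disjunction over the rule set z:
    p_phi(y = 1 | q, z) = sigmoid(bias(r) + sum_rule w(r, rule) * score(rule))."""
    s = bias(query_rel)
    for rule, paths in rules_with_paths:
        s += weight(query_rel, rule) * rule_score(paths, phi)
    return sigmoid(s)
```

With a constant per-edge confidence of 0.8 and a single two-edge path, the rule score is 0.64 and the truth value is the sigmoid of the weighted sum of such scores.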

Optimization
To optimize the likelihood L(θ, φ) (Eq. 1), we update the rule generator and the relation extractor alternately in an iterative manner, i.e., the EM algorithm. The classic EM algorithm estimates the posterior of the latent variable according to the current parameters in the E-step; in the M-step, the parameters are updated assuming the latent variable obeys the estimated posterior. However, in our setting, it is difficult to compute the exact posterior p_{θ,φ}(z | y*, q) due to the large space of z. To tackle this challenge, we seek an approximate posterior q(z) via a second-order Taylor expansion. This approximation yields a lower bound on log p_{θ,φ}(y* | q), since the difference between them is a KL divergence and hence non-negative:

log p_{θ,φ}(y* | q) ≥ E_{q(z)}[ log p_{θ,φ}(y*, z | q) − log q(z) ].

Once we have q(z), we can maximize this lower bound of log p_{θ,φ}(y* | q).

E-step Given the current parameters θ, φ, the E-step computes the posterior of z. Since the exact posterior p_{θ,φ}(z | y*, q) is intractable (its partition function sums over the large space of z), we instead seek an approximate posterior q(z).
By approximating the likelihood with a second-order Taylor expansion, we obtain a conjugate form of the posterior as a multinomial distribution; the detailed derivation is given in Appendix A. Formally, we first define H(rule) as the score function estimating the quality of each rule. H(rule) combines two terms: one based on the rule generator, which serves as the prior probability of the rule, and the other based on the relation extractor, which accounts for the contribution of the current rule to the final correct answer y*. Next, we use q̂(rule | q) to denote the posterior distribution of a rule given the query q, obtained by normalizing H(rule) over the candidate rules. Thus the approximate posterior also obeys a multinomial distribution:

q(z) ∼ Multi(N, q̂(rule | q)).

Algorithm 1: Optimization of LogiRE
1: while not converged do
2:   For each instance, use the rule generator to generate a set of logic rules ẑ (|ẑ| = N).
3:   Calculate the rule score H(rule) of each rule to approximate the posterior q̂(rule | q). ⊲ E-step
4:   For each instance, update the rule generator AutoRegθ based on rules sampled from q̂(rule | q).
5:   For each instance, update the relation extractor based on logic rules generated from the updated rule generator. ⊲ M-step
6: end while
M-step After obtaining q(z), the M-step maximizes the lower bound of log p_{θ,φ}(y* | q) with respect to both θ and φ. Formally, given each data instance (y*, q, D) and the estimated q(z), the objective is to maximize

E_{q(z)}[ log pθ(z | q) + log pφ(y* | q, z, D) ] = L_G(θ) + L_R(φ),

where L_G and L_R are the objectives of the rule generator and the relation extractor, respectively. To compute the expectation in L_G, we sample from the current prior pθ(z | q) to obtain a sample ẑ, evaluate the score H(rule) of each rule, and regard the normalized scores as the approximate q̂(rule | q). We then use rules sampled from q̂(rule | q) to update AutoRegθ(rule | q). Intuitively, we update the rule generator pθ(z | q) to make it consistent with the high-quality rules identified by the approximate posterior.
For the objective L_R, we update the relation extractor according to the logic rules sampled from the updated rule generator. The logic rules, which explicitly capture more interactions between relations, are fused as input to the relation extractor, yielding better empirical results and better interpretability. Finally, we summarize the optimization procedure in Algorithm 1.
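Algorithm 1 can be compressed into a training-loop sketch; the generator and extractor are abstract objects here, and their sample/score/fit interfaces are our own stand-ins for the actual gradient updates of θ and φ:

```python
import random

def em_train(instances, generator, extractor, n_rules, n_iters):
    """Alternating E- and M-steps, following the structure of Algorithm 1.
    instances: list of (y_true, q, doc)
    generator.sample(q, n) -> n candidate rules; generator.fit(...) updates theta
    extractor.score(rule, q, doc, y) -> rule score H(rule); extractor.fit(...) updates phi"""
    posterior = {}
    for _ in range(n_iters):
        for (y_true, q, doc) in instances:
            z_hat = generator.sample(q, n_rules)          # candidate rule set
            scores = [extractor.score(rule, q, doc, y_true) for rule in z_hat]
            total = sum(scores)
            probs = [s / total for s in scores] if total > 0 else None
            # E-step: sample rules from the approximate posterior q_hat(rule | q)
            posterior[q] = random.choices(z_hat, weights=probs, k=n_rules)
        generator.fit(posterior)                           # M-step: update theta
        extractor.fit(instances, generator)                # M-step: update phi
    return posterior
```

In the real model both `fit` calls are gradient-based updates; here they are left as hooks so the control flow of the EM loop stays visible.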

Experiments
We conduct experiments on two multi-relational document-level relation extraction datasets: DocRED (Yao et al., 2019) and DWIE (Zaporojets et al., 2020). The statistics of the two datasets are listed in Table 1. Pre-processing details for DWIE are described in Appendix B.
Evaluation Besides the commonly used F1 metric for relation extraction, we include two other metrics for a comprehensive evaluation: ign F1 and logic. ign F1 was proposed by Yao et al. (2019) to exclude triples appearing in the training set from evaluation, avoiding information leakage from the training set. We propose logic for evaluating the logical consistency of the prediction results. Specifically, we use 41 pre-defined rules on the DWIE dataset to evaluate whether the predictions satisfy these gold rules. The rules have a similar form to the logic rules defined in Sec. 3. We define the logic score as the precision of these rules on the predictions. Note that these rules are independent of the rule learning and utilization in Sec. 3 and are used only for logic evaluation.
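A simplified version of the proposed logic score, restricted to length-2 rule bodies and a hypothetical rule format, might look like:

```python
def logic_score(predictions, rules):
    """Precision of gold rules on predictions: among rule bodies that fire,
    the fraction whose conclusion is also predicted.
    predictions: set of (rel, head, tail) triples
    rules: list of ((r1, r2), head_rel), read as r1(h, z) ^ r2(z, t) -> head_rel(h, t)"""
    fired = satisfied = 0
    for (r1, r2), head_rel in rules:
        for (ra, h, z) in predictions:
            if ra != r1:
                continue
            for (rb, z2, t) in predictions:
                if rb == r2 and z2 == z:
                    fired += 1
                    satisfied += (head_rel, h, t) in predictions
    return satisfied / fired if fired else 1.0

preds = {("father", "A", "B"), ("spouse", "B", "C"), ("mother", "A", "C")}
rules = [(("father", "spouse"), "mother")]
# Every fired grounding is satisfied here, so the score is 1.0.
```

The actual metric additionally covers rules of other lengths and inverse relations, but the precision-over-fired-groundings structure is the same.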
Experimental Settings The rule generator is implemented as a Transformer with a two-layer encoder and a two-layer decoder, with hidden size 256. We empirically find that this tiny structure is sufficient for modeling the required rule set. We set the size of the latent rule set to 50 and limit the maximum rule length to 3.

Baselines
We compare our LogiRE with the following baselines on document-level RE; the baselines also serve as backbone models in our framework. Yao et al. (2019) proposed applying four state-of-the-art sentence-level RE models to document-level relation extraction: CNN, LSTM, BiLSTM, and Context-Aware. Zeng et al. (2020) proposed GAIN, which leverages both a mention-level graph and an aggregated entity-level graph to simulate the inference process in document-level RE using graph neural networks. Zhou et al. (2021) proposed ATLOP, which uses adaptive thresholding to learn a better adjustable threshold and enhances entity-pair representations with localized context pooling. Implementation details of the baselines are given in Appendix B.

Main Results
Our LogiRE outperforms the baselines on all three metrics. (We mainly analyze the results on DWIE, on which all three metrics can be evaluated; the results on DocRED are shown in Table 3 and discussed in Sec. 4.3.) LogiRE consistently outperforms various backbone models on the DWIE dataset, as shown in Table 2. We achieve improvements of 2.02 test ign F1 and 1.84 test F1 over the current SOTA, ATLOP. The compatibility of LogiRE with various backbone models shows its generalization ability. The consistent improvements on both sequence-based and graph-based models empirically verify the benefits of explicitly injecting logic rules into document-level relation extraction.
The improvements on graph-based models indicate the effectiveness of modeling interactions among multiple relations and entities. Although graph-based models provide graphs consisting of connections among mentions, entities, and sentences, they seek more powerful representations that implicitly model the intrinsic connections. Our LogiRE instead builds explicit interactions among entities and relations through the meta paths determined by the rules. The improvements over the current graph-based SOTA empirically demonstrate the superiority of such explicit modeling.

Our model achieves better logical consistency than the baselines. The results show that LogiRE achieves up to an 18.78-point gain on the logic metric. Even on the graph-based model GAIN, we obtain a significant improvement of 5.03 in logical consistency. The improved logic score shows that the predictions of LogiRE are more consistent with the regular logic patterns in the data. These numbers are evidence of the strength of our iterative optimization approach with logic rules as latent variables.

Analysis & Discussion
We analyze the results on DocRED data and discuss the superiority of our LogiRE on capturing long-range dependencies and interpretability. The capability of capturing long-range dependencies is studied by inspecting the inference performance on entity pairs of various distances. The interpretability is verified by checking the logic rules learned by our rule generator and the case study on predictions.

Analysis on DocRED Results
Compared with the significant improvements on DWIE, the gains of LogiRE on DocRED are smaller. Our analysis suggests two reasons.

a) Shorter dependencies in DocRED. Shorter dependencies in DocRED lower the demand for capturing long-range correlations among entities and relations. We show the distribution of distances between entity pairs in Fig. 3: 79.26% of entity pairs in DocRED have distances of less than 100 tokens, so DocRED examples are less demanding in terms of capturing long-range dependencies (more analysis and comparison can be found in Zaporojets et al. (2020)). Representation-based approaches can already perform well in such cases, so the benefit of modeling long-range dependencies through logical reasoning is smaller.
b) Logical inconsistency in DocRED. The justification of predictions after reasoning may be inaccurate because of missing annotations. We calculated the error rates of a few easy-to-verify logic rules, as shown in Table 4. The 7 rules, selected by case study, involve a considerable portion (12.96%) of labeled relations as atoms, yet all 7 have error rates above 10%. These numbers indicate that a notable portion of true relations is missing: results obtained by reasoning over logic rules may be judged wrong simply because the data is not exhaustively annotated, and inconsistent patterns between training and test sets may lead to unfair evaluation of relation extraction performance. According to this analysis, LogiRE has greater potential than its overall performance on DocRED suggests.

implication rule | error rate
father(h, z) ∧ spouse(z, t) → mother(h, t) | 24.07%
replaces−1(h, t) → replaced_by(h, t) | 22.22%
capital−1(h, t) → capital_of(h, t) | 28.24%
father−1(h, t) → child(h, t) | 10.26%
followed−1(h, t) → follows(h, t) | 22.40%
capital−1(h, t) → capital_of(h, t) | 28.24%
P150−1(h, t) → P131(h, t) | 19.71%

Table 4: Logical inconsistency in DocRED (for conciseness, P150 represents 'contains administrative territorial entity' and P131 represents 'located in the administrative territorial entity'). The shown easy-to-verify gold rules have high error rates in DocRED, while a considerable portion of relations (12.96%) participate in these rules as atoms. The missing annotations make learning logic rules difficult.
Logic rules are shortcuts for comprehension. The performance gain of LogiRE becomes more prominent as the distance between entity pairs grows. We plot the performance of ATLOP and ATLOP-based LogiRE on the DWIE dataset for four buckets of entity-pair distances in Fig. 4. The distance is calculated as the number of tokens between the nearest mentions of an entity pair. The results indicate that LogiRE is better at capturing long-range dependencies.
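The distance measure used in this analysis, and the bucketing into distance groups, can be sketched as follows (spans are assumed to be end-exclusive token offsets):

```python
def pair_distance(mentions_a, mentions_b):
    """Number of tokens between the nearest mentions of two entities.
    mentions_*: list of (start, end) token spans; overlapping spans give 0."""
    best = None
    for (s1, e1) in mentions_a:
        for (s2, e2) in mentions_b:
            gap = max(s2 - e1, s1 - e2, 0)   # 0 if the spans overlap or touch
            best = gap if best is None else min(best, gap)
    return best

def bucket(dist, edges=(100, 200, 400)):
    """Map a distance to its analysis bucket: <=100, 101-200, 201-400, >400."""
    for i, e in enumerate(edges):
        if dist <= e:
            return i
    return len(edges)
```

For example, an entity mentioned at tokens 0-3 and another at tokens 10-12 are 7 tokens apart and fall into the first bucket.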
Relation extraction for entity pairs with longer distances in between generally performs worse. As shown in the figure, the performance starts to drop as the distance surpasses 100 tokens, indicating the difficulty of modeling long-range dependencies. The redundant information in a long context impedes accurate semantic mapping through powerful representations. This issue increases the complexity of modeling and limits the potential of representation-based approaches.
Our framework with latent logic rules injected can effectively alleviate this problem. The performance drop of our LogiRE is smaller when the distance between entities gets larger. For entity pairs of distances larger than 400, our LogiRE achieves up to 4.47 enhancement on test ign F1. By reasoning over local logic units (atoms in rules), we ignore the noisy background information in the text but directly integrate high-level connections among concepts to get the answer.
The reasoning process of our LogiRE is in line with how we humans comprehend long text. We build basic concepts and the connections between them (local logic atoms) for each local part of the text. When the collected information fits some prior knowledge (logic rules), we deduce new conclusions from the existing knowledge. LogiRE thus provides shortcuts for modeling long-text semantics by adding logical reasoning to naive semantic mapping.
Interpretability by Generating Rules Our LogiRE enjoys better interpretability thanks to the generated latent rule set. After EM optimization, we can sample high-quality rules from the rule generator that may contribute to the final predictions. Besides the gold rules previously used for evaluating logic, LogiRE mines additional logic rules from the data, as shown in Table 5. These logic rules explicitly reveal the interactions among entities and relations in the same document as regular patterns. LogiRE is more transparent, exhibiting the latent rules through the rule generator.

Figure 5: Inference cases of our LogiRE on DWIE using ATLOP as the backbone model. The grey arrows are relations extracted by the backbone model: solid lines represent true relations and dashed lines represent false relations. The green arrows are new relations correctly extracted by logical reasoning. The blue arrows indicate the potential reasoning paths. We also demonstrate a negative case: in the third example, the red arrow represents a wrong relation extracted by reasoning over wrongly estimated atoms.
Case Study Fig. 5 shows a few inference cases of our LogiRE, including two positive examples and a negative one. As shown in the first two examples, LogiRE can complete missing relations in the backbone model's outputs by utilizing logic rules. Soft logical reasoning can thus remedy the defects of representation-based approaches under specific circumstances. However, the extra reasoning may also propagate errors by reasoning over wrongly estimated logic units, as the third example shows: the wrongly estimated atom in0(Vega, Germany) leads to one more wrong relation being extracted. Fortunately, such errors are more controllable in LogiRE because the logical reasoning part is transparent.