From What to Why: Improving Relation Extraction with Rationale Graph

Which type of information enables existing neural relation extraction (RE) models to make correct decisions is an important question. In this paper, we observe that entity type and trigger are the most indicative information for RE in each instance. Moreover, these indicative clues are always constrained to co-occur with specific relations at the corpus level. Motivated by this, we propose a novel RAtionale Graph (RAG) to organize such co-occurrence constraints among entity types, triggers, and relations in a holistic graph view. By introducing two subtasks of entity type prediction and trigger labeling, we build the connection between each instance and RAG, and then leverage the relevant global co-occurrence knowledge stored in the graph to improve the performance of neural RE models. Extensive experimental results indicate that our method significantly outperforms strong baselines and achieves state-of-the-art performance on document-level and sentence-level RE benchmarks.


Introduction
Relation extraction (RE), which aims to identify the semantic relation between two entities in plain text, is one of the fundamental tasks in information extraction (IE). In the deep learning era, many approaches have been proposed, including models based on the attention mechanism (Lin et al., 2016; Zhang et al., 2017), graph neural networks (Zhang et al., 2018; Guo et al., 2019), and pre-trained language models (Joshi et al., 2020). While these neural RE models have achieved state-of-the-art results, little is known about which type of information leads the models to their decisions. Recently, an empirical study showed that understanding the two main information sources, entity type and textual context, is necessary and effective for training an RE model (Peng et al., 2020). Entity type is always important side information for RE (Vashishth et al., 2018). In the textual context, some words play an indicative role in relation expression; prior work annotated the minimal contiguous indicative word spans and named them triggers. For example, in Figure 1, when we notice that both the subject and object entities are persons, and the trigger children appears in the context, our immediate reaction is that they probably hold a parent-child relation; we then make a further judgment by reading the complete text.

Figure 1: "Augustus is the youngest of five children of Hawkins."
What supports such rapid and accurate decision-making by human beings? In RE, if we look at the entire corpus from a global view, we find a common phenomenon: a certain entity type or trigger is constrained to co-occur with specific relations. Taking entity type as an example, two entities of type person can only participate in person-related relations (e.g., per:parents, per:siblings). Such global co-occurrence, induced by multiple seen instances, serves as crucial prior knowledge in the process of human cognition (Chater et al., 2006), and naturally forms a bipartite graph in which the nodes on the two sides are entity types and relations, respectively. The same logic also goes for triggers.

Inspired by the above observation, in this paper we propose a RAtionale Graph (RAG) to organize the global co-occurrence statistics aggregated from the corpus. Specifically, nodes in the graph are constructed from the relations and patterns (i.e., entity type pairs and triggers). In total, four types of directed edges exist between different kinds of nodes. For example, the edge between a trigger node and a relation node depicts the probability of a text expressing the relation when the trigger appears in the text. This probabilistic knowledge, together with the involved nodes, is collectively referred to as a rationale. In the end, RAG is expected to present a holistic view of all patterns and relations and thereby facilitate relation prediction.

We then incorporate RAG with neural networks to improve RE performance. Given an instance with a text and two entities, we first predict the entity types and label the trigger, then establish the link between the input instance and the known patterns in RAG, and finally enhance the instance representation with the attended relation node features in the graph.
Meanwhile, we introduce a gate mechanism and graph neural networks (GNNs) to propagate information from the input instance to the relation nodes. Hence, this workflow makes full use of all the aforementioned rationale knowledge to guide the processing of new instances by linking them to the seen patterns stored in the graph, much like humans recognizing new things by intuitively associating them with knowledge they have memorized. In the training phase, the model simultaneously learns (1) the relation along with (2) the entity type and trigger for each instance. This means that we care not only about the final relation label (what), but also about the intermediate results, i.e., whether the entity type and trigger are correctly predicted (why). By doing so, during testing we can retrieve the relevant global pattern knowledge from the graph with the predicted trigger and entity types.
To evaluate our approach, we first conduct experiments on the document-level RE dataset DialogRE. Experimental results show the benefits of the proposed method, leading to state-of-the-art performance. An exciting discovery is that our method is very effective in small-scale annotation scenarios: using only half (2,584 positive instances) of the pattern-annotated instances results in performance comparable to using all conventionally annotated instances. To further validate this advantage, we manually annotate patterns for 20% (2,585 positive instances) of the sentence-level RE benchmark TACRED (Zhang et al., 2017), and empirically reach conclusions similar to those on DialogRE.

Related Work
Extracting relational facts between entities from text is an essential and classical problem in natural language processing. Popular research methods have gone through iterations from pattern-based methods (Mooney, 1999; Chang and Lui, 2001) to feature-based methods (Kambhatla, 2004; Zhou et al., 2005), and then to neural methods (Zeng et al., 2014; Zhang et al., 2017). Nowadays, most state-of-the-art work develops powerful neural models based on pre-trained language models or graph neural networks (Soares et al., 2019; Guo et al., 2019). There have long been two main consensuses in the community. First, when extracting a relation, entity types are important side indicators, which are often used to enhance the input or output layer (Vashishth et al., 2018; Kuang et al., 2020). Second, not all the words in the text are beneficial to RE, so there have also been efforts on heuristic or implicit selection of the key clues related to relation expression (Zhang et al., 2018; Yu et al., 2019); such clue words were first annotated in text and named triggers.
However, most previous studies rely only on local features; in other words, models are trained on individual instances, which limits their ability to capture the connection between textual indicative information and relations globally. Conversely, Su et al. (2018) emphasized the importance of the global view and embedded textual relations with global statistics to combat the wrong-labeling problem of distant supervision. Wang et al. (2020) proposed an interpretable network embedding model based on a corpus-level entity graph to rationalize medical relation prediction. Unfortunately, their methods are not suitable for the supervised RE task in the general domain. The most related work collected a global type-relation mapping as prior knowledge to guide optimization with knowledge distillation. One major difference is that we systematically consider both entity type and textual trigger to collect all indicative knowledge in a holistic view. Another unique aspect of this work is that we perform the prediction of entity type and trigger as two subtasks, whereas previous studies focus only on the final relation labels.

Rationale Graph (RAG)
Different from existing work that uses only raw text for RE, we assume the global co-occurrence statistics among relations, triggers, and entity types are given. They are pre-constructed from the whole corpus and denoted as a graph G = (V, E), where each vertex v ∈ V refers to a relation, trigger, or entity type pair extracted from the corpus, and each edge e ∈ E is associated with the global co-occurrence count of the connected nodes. We organize the global co-occurrence counts between two kinds of nodes as a bipartite rationale mapping and pack all bipartite mappings together to obtain the rationale graph (RAG). Figure 2 shows the schematic diagram for clarity.

Bipartite Rationale Mapping
Here we take type (short for entity type pair) and relation as an example to describe the construction of a bipartite rationale mapping. Specifically, for an instance with a text x and two entities (s, o), we combine the two entity types to obtain a type pattern. From this step, we obtain the pattern set T = {t_i} and formulate a support set S(t_i) for each t_i, which contains all instances with pattern t_i. Besides, we also collect the set of relations R = {r_j}, with the support set S(r_j) denoting the set of instances holding relation r_j. The co-occurrence count of pattern t_i and relation r_j is defined as w_ij = |S(t_i) ∩ S(r_j)|. In other words, every instance (x, s, o) with pattern t_i and relation r_j is counted as one co-occurrence of t_i and r_j.
However, it is inappropriate to take the raw co-occurrence count directly as the mapping weight. The relation distribution in reality typically has a power-law tail (Zhang et al., 2017), meaning that the counts span several orders of magnitude across relations. To meet this challenge, for each pattern, we normalize its co-occurrence counts to form a valid probability distribution over relations. In the end, the bipartite mapping M_tp2re is constructed, with one node set being the types and the other being the relations, and the weighted edges w̃_ij = p(r_j | t_i) = w_ij / Σ_{j'} w_{ij'} representing the normalized global co-occurrence probability.
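As a concrete sketch, the count-then-normalize construction above can be written in a few lines of NumPy. The function and variable names here are purely illustrative, not from a released implementation:

```python
import numpy as np

def build_bipartite_mapping(instances):
    """Build the normalized type-to-relation mapping from (type_pair, relation) pairs."""
    types = sorted({t for t, _ in instances})
    rels = sorted({r for _, r in instances})
    t_idx = {t: i for i, t in enumerate(types)}
    r_idx = {r: j for j, r in enumerate(rels)}

    # w_ij = |S(t_i) ∩ S(r_j)|: raw co-occurrence counts.
    counts = np.zeros((len(types), len(rels)))
    for t, r in instances:
        counts[t_idx[t], r_idx[r]] += 1

    # Row-normalize so each pattern yields p(r_j | t_i).
    mapping = counts / counts.sum(axis=1, keepdims=True)
    return types, rels, mapping

types, rels, M = build_bipartite_mapping([
    ("person-person", "per:parents"),
    ("person-person", "per:siblings"),
    ("person-person", "per:parents"),
    ("person-org", "per:employee_of"),
])
```

Here the pattern "person-person" co-occurs twice with per:parents and once with per:siblings, so its row becomes the distribution (2/3, 1/3) regardless of how skewed the raw counts are.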

Graph Construction
Considering that trigger and type are two kinds of information sources for RE (Peng et al., 2020), we first introduce the bipartite rationale mapping from type to relation M_tp2re and the mapping from trigger to relation M_tg2re in RAG. In this way, we assume the graph reflects the prior probability of a relation when some indicative information appears in the text. Furthermore, triggers are actually relations in the form of natural language (Hu et al., 2020), and entity types are tightly bound to certain trigger words within the context. In other words, type and trigger are mutually related and constrained. Therefore, we introduce a pair of bidirectional mappings, from type to trigger M_tp2tg and from trigger to type M_tg2tp. Finally, we place four kinds of edges in the graph: type-to-relation (M_tp2re), trigger-to-relation (M_tg2re), type-to-trigger (M_tp2tg), and trigger-to-type (M_tg2tp).
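A minimal sketch of packing the four normalized mappings into one graph structure (a plain dict here, purely for illustration; the trigger-to-type counts are the transpose of the type-to-trigger counts, normalized on their own rows):

```python
import numpy as np

def normalize_rows(counts):
    # Convert raw co-occurrence counts into conditional probabilities per row.
    s = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, s, out=np.zeros_like(counts, dtype=float), where=s > 0)

def build_rag(c_tp2re, c_tg2re, c_tp2tg):
    """Pack the four bipartite mappings of RAG into one heterogeneous graph."""
    return {
        "tp2re": normalize_rows(c_tp2re),   # type -> relation
        "tg2re": normalize_rows(c_tg2re),   # trigger -> relation
        "tp2tg": normalize_rows(c_tp2tg),   # type -> trigger
        "tg2tp": normalize_rows(c_tp2tg.T), # trigger -> type
    }

rag = build_rag(np.array([[2., 1.], [0., 3.]]),
                np.array([[1., 0.], [1., 1.]]),
                np.array([[2., 0.], [1., 2.]]))
```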

Relation Extraction with RAG
In this section, we exemplify how to incorporate existing RE models with RAG. Given a text, a subject entity, and an object entity, the model aims to identify the semantic relation between the two entities with the aid of RAG. Moreover, we also require the model to predict the entity type pair and label the trigger (if present) as two auxiliary subtasks. For the example in Figure 3, we build a unified model that not only accurately predicts the relation per:parents, but also provides meaningful rationales for how the prediction is made: the subject and object entities are both persons, and the key clue children appears in the context.

Encoding Module
We utilize BERT (Devlin et al., 2019) as the feature encoder to extract token representations due to its effectiveness in representation learning. Theoretically, the encoding module can easily be replaced by other advanced models. The encoder receives a BERT-style packed sequence and outputs a context representation matrix H ∈ R^{n×d} together with an overall vector h_cls ∈ R^d (the representation of the [CLS] token in BERT), where d is the vector dimension of the last layer of BERT. Typically, existing BERT-based RE solutions first concatenate the target entities with the text or mark them in the input sequence with special tokens, and then directly take h_cls as the input of the final classification module (Joshi et al., 2020).

Figure 3: The overall architecture of the proposed model. The rationale enhancing module is the core component of our approach, which enhances the instance representation by retrieving pertinent rationales stored in RAG.

Rationale Enhancing Module
The rationale enhancing module consists of two enhancing branches and one rationale integration unit. In each branch, we first predict the pattern (type or trigger) of the input instance and then calculate the probability that the instance belongs to each pattern in RAG. The integration unit collects rationale enhancing features for the final relation extraction based on the pattern probabilities and the rationales in the graph.

Type Enhancing Branch
In this branch, we predict the types of the subject and object entities at the same time. Similar to RE, type prediction is regarded as a closed-world classification problem whose class space is all seen entity type pairs, that is, all type nodes in RAG. Following the classification paradigm of BERT (Devlin et al., 2019), we project the overall vector h_cls into a new space for type prediction:

p_tp = softmax(MLP_{d,n_tp}(h_cls)),    (1)

where MLP_{d,n_tp}(·) denotes a multi-layer perceptron with input dimension d and output dimension n_tp, and p_tp ∈ R^{n_tp} is the probability that the given instance belongs to each type pair, n_tp being the number of all known type pairs.
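A NumPy sketch of this type prediction head; the two-layer MLP shape and the random stand-in for h_cls are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    # A two-layer perceptron standing in for MLP_{d, n_tp}(x).
    return np.tanh(x @ w1 + b1) @ w2 + b2

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d, n_tp = 8, 5                      # illustrative sizes
h_cls = rng.normal(size=d)          # stand-in for BERT's [CLS] vector
w1, b1 = rng.normal(size=(d, d)), np.zeros(d)
w2, b2 = rng.normal(size=(d, n_tp)), np.zeros(n_tp)

# p_tp: probability of the instance belonging to each known type pair.
p_tp = softmax(mlp(h_cls, w1, b1, w2, b2))
```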

Trigger Enhancing Branch
Different from entity type prediction, triggers are flexible and can be any word or phrase in the text. We formulate trigger recognition as a labeling problem with two label sequences. Given the representation matrix H output by BERT, the model predicts, for each token, the probability of being the start index and the end index of a trigger, respectively. To handle instances without a clear trigger (about half of them), we concatenate H with h_cls to form H̃ = [H; h_cls], and set the boundary index to point to the [CLS] token. The two probability distributions over the entire sequence, p_sta, p_end ∈ R^{n+1}, can be obtained by

p_sta = softmax(MLP_sta(H̃)),  p_end = softmax(MLP_end(H̃)),    (2)

To align the labeling result with the triggers in RAG, we first weight each token in H̃ by the two index probabilities to get the representation of the predicted trigger h_tg^pre ∈ R^d, then calculate and normalize the similarity between h_tg^pre and all known triggers V_tg ∈ R^{n_tg×d}:

p_tg = softmax(sim(h_tg^pre, V_tg)),    (3)

where p_tg ∈ R^{n_tg} is the probability of the given instance corresponding to each known trigger, n_tg is the number of all triggers, and sim(·) is a similarity function:

sim(h_tg^pre, v_tg^i) = MLP_{d,1}(h_tg^pre • v_tg^i),    (4)

where v_tg^i ∈ R^d is the i-th trigger in V_tg and • denotes the element-wise product. In that case, even if we run into a new trigger that we have never seen before, we can still estimate the correlation between the new trigger and the known triggers via semantic similarity, and then absorb global statistics from similar triggers. This makes rationale enhancing possible on the trigger branch.
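The soft alignment above can be sketched as follows. The random matrices stand in for learned representations, and the particular forms chosen for the boundary-weighted trigger vector and for sim(·) (a weighted element-wise product) are our assumptions, not the paper's released code:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

n, d, n_tg = 6, 8, 4
H_ext = rng.normal(size=(n + 1, d))      # H̃ = [H; h_cls]; last row handles "no trigger"
p_sta = softmax(rng.normal(size=n + 1))  # start-index distribution (illustrative)
p_end = softmax(rng.normal(size=n + 1))  # end-index distribution (illustrative)

# Weight tokens by the two boundary probabilities to get the predicted trigger vector.
h_tg_pre = 0.5 * (p_sta @ H_ext + p_end @ H_ext)

# Similarity of the predicted trigger against every known trigger in RAG,
# normalized into a probability over known triggers.
V_tg = rng.normal(size=(n_tg, d))        # embeddings of known triggers
w_sim = rng.normal(size=d)               # learned projection (hypothetical)
p_tg = softmax((V_tg * h_tg_pre) @ w_sim)
```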

Rationale Integration
For each type node in RAG, we update its embedding with the instance type feature h_cls^tp. Intuitively, the higher the probability of an instance belonging to a type, the more it contributes to the updating of that type. Specifically, we first compute the update representation for each type node based on the pattern probability p_tp, and then aggregate information from the text side V_h^tp ∈ R^{n_tp×d} and the graph side V_tp ∈ R^{n_tp×d} via a gate mechanism:

V_h^tp = p_tp (h_cls^tp)^T,  g = σ(W_g [V_h^tp; V_tp]),  Ṽ_tp = g • V_h^tp + (1 − g) • V_tp,    (5)

where σ is the sigmoid function and W_g is a learnable matrix. Similarly, we perform the same computation on the trigger branch to reconstruct the trigger node embeddings in RAG, resulting in Ṽ_tg ∈ R^{n_tg×d}.
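A sketch of this gate update, assuming the text-side features are the outer product of the pattern probability and the instance feature; all weights are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_tp, d = 5, 8
p_tp = np.full(n_tp, 1.0 / n_tp)    # type probability from the enhancing branch
h_tp_cls = rng.normal(size=d)       # instance type feature
V_tp = rng.normal(size=(n_tp, d))   # type node embeddings stored in RAG

# Text-side update: each node receives the instance feature scaled by its probability.
V_h_tp = np.outer(p_tp, h_tp_cls)

# Gate fuses text-side and graph-side features per node, element-wise.
W_g = rng.normal(size=(2 * d, d)) * 0.1
g = sigmoid(np.concatenate([V_h_tp, V_tp], axis=1) @ W_g)
V_tp_new = g * V_h_tp + (1.0 - g) * V_tp
```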
Next, we execute a GNN-based algorithm on RAG to update the representations of the relation nodes. R-GCN (Schlichtkrull et al., 2018) is chosen as the message propagation strategy here because RAG is naturally a heterogeneous graph:

h_i^{(l+1)} = σ( Σ_{e∈E} Σ_{j∈N_i^e} (1/|N_i^e|) W_e^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)} ),    (6)

where N_i^e denotes the neighbors of node i under edge type e, and W_e^{(l)}, W_0^{(l)} are learnable matrices. After that, for the type enhancing branch, we first calculate the mapping probability from an instance to each relation based on the type probability p_tp and the corresponding bipartite rationale mapping M_tp2re ∈ R^{n_tp×n_re} (i.e., the normalized edge weights), and then weight the updated relation embeddings by the mapping probability to obtain the type enhancing vector h_tp ∈ R^d. Similar operations are performed in the trigger branch:

h_tp = (p_tp^T M_tp2re) Ṽ_re,  h_tg = (p_tg^T M_tg2re) Ṽ_re,    (7)

where Ṽ_re ∈ R^{n_re×d} denotes the updated relation node embeddings.
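A simplified single-layer propagation in the spirit of R-GCN, using the probabilistic mappings as weighted adjacency and one transform per edge type. This is a sketch under our assumptions, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(3)
n_tp, n_tg, n_re, d = 4, 3, 2, 8

V_tp = rng.normal(size=(n_tp, d))   # (gate-updated) type node embeddings
V_tg = rng.normal(size=(n_tg, d))   # (gate-updated) trigger node embeddings
M_tp2re = np.abs(rng.normal(size=(n_tp, n_re)))
M_tp2re /= M_tp2re.sum(axis=1, keepdims=True)
M_tg2re = np.abs(rng.normal(size=(n_tg, n_re)))
M_tg2re /= M_tg2re.sum(axis=1, keepdims=True)

# One R-GCN-style layer on relation nodes: an edge-type-specific transform,
# messages weighted by the probabilistic edges, plus a self-loop term.
W_tp, W_tg, W_self = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
V_re = rng.normal(size=(n_re, d))
V_re_new = np.tanh(M_tp2re.T @ (V_tp @ W_tp)
                   + M_tg2re.T @ (V_tg @ W_tg)
                   + V_re @ W_self)

# Enhancing vector: map the instance's pattern probability to relations,
# then take the weighted sum of updated relation node embeddings.
p_tp = np.full(n_tp, 1.0 / n_tp)
h_tp = (p_tp @ M_tp2re) @ V_re_new
```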

Classification Module
The output module combines the overall vector and the two enhancing features into the final representation, which is fed into a multi-layer perceptron followed by a softmax function for relation classification:

p_re = softmax(MLP_{3d,n_re}([h_cls; h_tp; h_tg])),    (8)

where n_re is the number of relations.
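A small sketch of this classification step, with a single linear layer standing in for the multi-layer perceptron and random vectors standing in for the three learned features:

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d, n_re = 8, 5
h_cls = rng.normal(size=d)   # overall [CLS] vector
h_tp = rng.normal(size=d)    # type enhancing vector
h_tg = rng.normal(size=d)    # trigger enhancing vector

# Concatenate the three features and classify over relations.
W = rng.normal(size=(3 * d, n_re)) * 0.1
p_re = softmax(np.concatenate([h_cls, h_tp, h_tg]) @ W)
```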

Training Objectives
Recall that there are three tasks in our model: relation extraction, type prediction, and trigger (start and end index) labeling, all of which are reduced to classification problems. In optimization, we train the model end-to-end in a multi-task manner and adopt cross-entropy as the loss function for each task:

L_task = CrossEntropy(y_task, p_task),    (9)

where y_task denotes the ground truth, represented as a one-hot vector, and p_{task∈{re,tp,sta,end}} is the estimated probability for each class.

To learn to perceive the strong signal that a known trigger appears exactly in the text, we utilize the contrastive loss (Hadsell et al., 2006). The intuition is that the trigger in the text h_tg^pre and the matched trigger in RAG v_tg^mat should have similar representations (i.e., a small distance d in vector space); for a mismatched trigger, we expect a margin m between their embeddings. The contrastive loss of trigger matching is as follows, where 1_mat is 1 if a trigger is originally in the text and 0 if it is not:

L_mat = 1_mat d^2 + (1 − 1_mat) max(0, m − d)^2,    (10)

The joint loss of trigger labeling is thus

L_tg = L_sta + L_end + L_mat,    (11)

Finally, the losses from the main RE task and the two subtasks are aggregated to form the training objective, with two weight factors λ_tp and λ_tg:

L = L_re + λ_tp L_tp + λ_tg L_tg,    (12)

Extension. Here, we introduce a simple extension to simultaneously make full use of all data with relation labels and any amount of data with pattern annotations. Specifically, when an instance has intact pattern annotations, we set 1_ext to 1 and calculate the losses of type prediction and trigger labeling; otherwise, we set 1_ext to 0 and do not calculate them. In this way, the training objective (Equation 12) is modified as follows:

L = L_re + 1_ext (λ_tp L_tp + λ_tg L_tg),    (13)
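The trigger-matching term follows the standard contrastive formulation of Hadsell et al. (2006) and can be sketched directly; the Euclidean distance and default margin here are illustrative choices:

```python
import numpy as np

def contrastive_loss(h_pre, v_mat, is_match, margin=1.0):
    """Hadsell-style contrastive loss for trigger matching.

    Pulls the predicted trigger toward its matched RAG trigger when the
    trigger truly appears in the text (is_match=1), and pushes mismatched
    pairs at least `margin` apart otherwise.
    """
    dist = np.linalg.norm(h_pre - v_mat)
    if is_match:
        return dist ** 2
    return max(0.0, margin - dist) ** 2

# A nearby matched pair incurs a small pull-together penalty...
close = contrastive_loss(np.zeros(4), np.full(4, 0.1), is_match=1)
# ...while a mismatched pair already beyond the margin incurs none.
far = contrastive_loss(np.zeros(4), np.full(4, 2.0), is_match=0)
```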

Document-Level Relation Extraction
DialogRE is a human-annotated document-level RE dataset constructed from transcripts of the American television situation comedy Friends. It is also the first RE dataset with both entity type and trigger annotations.

Experimental Setup
We employ BERT and BERT_S as the encoding module of RARE in this task. BERT_S is a speaker-aware version of BERT that achieves the best performance on this dataset. For completeness, we include all official baselines: the Majority strategy and CNN/LSTM/BiLSTM-based models.

Results and Analysis
Main Results. Comparing the performance of different models in Table 1, the first conclusion we draw is that RARE_BERTs outperforms all baseline models on all evaluation metrics, which demonstrates the effectiveness of our rationale enhanced approach, as well as the motivation of using global pattern co-occurrence statistics to boost the performance of RE models. Secondly, RARE_BERTs improves by a clear margin over RARE_BERT, strong evidence that RARE is flexible enough to adapt to various encoders. We thus have reason to believe that a more powerful encoding module could bring further performance gains for RARE. Lastly, TypeKD-based models show a similar trend, but their performance is relatively worse than that of models based on RARE, which shows that trigger and type are two non-overlapping information sources, and considering only one of them is not enough to capture the complete indicative knowledge.
We report the performance of RARE_BERTs on the two subtasks in Table 2. From the results, we find that type prediction is relatively simpler than trigger labeling. We attribute this to entity type being a kind of shallow linguistic feature, whereas labeling the trigger requires a full understanding of the context semantics. We also notice that trigger labeling performance is even worse than that of RE, since about half of the positive instances have no explicit trigger, meaning that trigger recognition faces a more serious data imbalance problem than RE. Overall, there is still a long way to go in improving the performance of these two subtasks, which we leave as a possible future direction.
Ablation Study. To investigate the effectiveness of each module in RARE, we conduct an ablation study on the DialogRE dev set. From the ablations in Table 3, we observe that: (1) Rationale graph is a necessary component that contributes 2.3% F1. The performance superiority of this ablation over BERT also shows that the two auxiliary subtasks of type prediction and trigger labeling are beneficial to RE.
(2) Without the type or trigger enhancing branch, the performance degradation suggests that both type and trigger are necessary for our RARE.
(3) The ablation of removing the trigger matching loss hurts the final result by 0.6% F1, which justifies the design philosophy of entrusting the model with the ability to perceive whether the trigger is exactly present in the text. (4) We also try removing the probabilistic edge weights in RAG so that it degenerates into a standard heterogeneous graph. In that case, the performance drops by 0.9% F1. We believe such probabilistic weights are capable of carrying more global information than one-hot constraints.
(5) The information propagation (i.e., the gate mechanism and GNNs) brings an improvement of 1.1% F1, as it provides a channel for integrating the features of the input instance in the output layer.
Labor-Efficiency Study. Considering that most RE datasets have no trigger annotation, we study the cost-effectiveness of adding patterns as additional annotation in this experiment. Accordingly, we explore the performance of RARE_BERT and BERT for various fractions of the training data. From Figure 4, we can see that RARE_BERT with pattern annotations delivers competitive or even better performance than BERT with twice the traditional training data. The drastic performance gain justifies the slight additional cost incurred in annotating patterns. Furthermore, we also introduce RARE-Ext, the extension of RARE, to fully use the partial data with pattern annotations and the remaining data with only relation labels in training, which provides a plug-and-play manner of utilizing pattern annotations. The results show that as annotations increase, the performance improvement becomes less significant. When using 50% (2,584 positive instances) of the pattern annotations, the performance of the model is comparable to that with 100% of the annotations.

Figure 4: Performance of models on the DialogRE dev set with partial training data. The number of positive instances with pattern annotation is shown in brackets.

Sentence-Level Relation Extraction
In this section, we evaluate RARE on the sentence-level RE task with two datasets, TACRED (Zhang et al., 2017) and TACREV (Alt et al., 2020). TACRED is the most widely used sentence-level RE dataset, constructed from New York Times text. The recent TACREV (a.k.a. TACRED-Revised) dataset has the same training set as TACRED but corrects the wrong labels in the dev and test sets.

Experimental Setup
To our knowledge, SpanBERT (Joshi et al., 2020) is the best-performing model without external knowledge on TACRED. We employ it as another encoder (besides BERT) for RARE. For completeness, we also include two official baselines, LSTM and PA-LSTM (Zhang et al., 2017), as well as two recent graph-based models, AG-GCN (Guo et al., 2019) and LST-AGCN. Different from DialogRE, TACRED/V annotates only entity types. Inspired by the results of the labor-efficiency study on DialogRE, we annotate triggers for 2,585 positive instances, accounting for about 20% of all positive instances in the training set of TACRED/V, to verify whether RARE maintains its excellent labor-efficient performance on the sentence-level RE task. We repeat our experiments with five random seed initializations, and the results are statistically significant with a p-value of less than 0.05.

Figure 5: Performance of models on the TACREV dev set with partial training data.

Results and Analysis
Main Results. With 20% pattern annotations, we compare RARE-Ext against several representative baselines and summarize the results in Table 4. Similar observations hold: RARE is capable of achieving superior performance with advanced encoding modules. Moreover, RARE-Ext matches or even surpasses the performance of TypeKD, which uses 100% of the type annotations. Although RARE does not always make significant improvements on TACRED, it outperforms the baselines on TACREV, a more accurate evaluation set, and leads to state-of-the-art performance. Overall, the performance gain of RARE on this task is not as striking as on the document-level task. We believe this is because a sentence is much shorter than a document and involves fewer relations, so BERT-based models are sufficient for capturing the key semantic clue for decision-making, and the benefits of global knowledge are somewhat limited.

Labor-Efficiency Study. Following the approximate number of positive instances in DialogRE, we split the pattern-annotated data to perform the labor-efficiency study on TACREV (see Figure 5). The results indicate that when both use partial data, RARE_BERT consistently outperforms BERT. This encourages us to fully exploit the potential knowledge in a dataset, including local annotations and global statistics, to improve RE performance, especially in low-resource scenarios. The considerable progress of RARE_BERT-Ext demonstrates that RARE can improve RE by annotating patterns on any part of an existing dataset. Considering the differences between DialogRE and TACREV (e.g., the number of relations, domain and style, and the ratio of positive to negative instances), whether further improvements could be made by increasing annotations on TACREV remains under investigation, and we leave it as future work.

Case Study
In Figure 6, we select two representative cases to demonstrate the working principle of RARE. The first case is a short snippet from a DialogRE document in which the two entities are scattered across different sentences and the context semantics are complex and changeable; BERT fails to capture the relation between them. Conversely, RARE predicts the trigger engaged, aligns it with the known trigger engagement, and then highlights this strong signal to identify the relation correctly. In the second case, from TACREV, BERT mistakenly regards Jackson Hewitt as a person, leading to the wrong answer of a person-related relation. With the help of type prediction and the global type-relation constraints in RAG, RARE avoids this error and makes the right decision.

Figure 6: Two cases. Case 1 (DialogRE): "Speaker 2 [SUBJ]: Phoebe, I'm engaged! Speaker 1: I'm just saying, get his number just in case. But no Chandler [OBJ] is in an accident ..." BERT: unanswerable. Case 2 (TACREV): "Jackson Hewitt [SUBJ], based in Parsippany [OBJ], NJ, is the nation's second-largest tax preparation chain after H&R Block." BERT: per:cities_of_residence.

Conclusion
In this paper, we propose a novel rationale graph to organize the global co-occurrence statistics among entity types, triggers, and relations. By introducing the two subtasks of entity type prediction and trigger labeling, we build the connection between the input instance and the known patterns in the rationale graph, which gives the model the possibility of benefiting from the global co-occurrence knowledge stored in the graph, thereby improving RE performance. Experimental results on two public datasets prove the effectiveness of our method. We also highlight two directions for future work: the first is to improve the performance of the two subtasks, especially trigger labeling; the other is to apply the proposed approach in more RE scenarios.