PRGC: Potential Relation and Global Correspondence Based Joint Relational Triple Extraction

Joint extraction of entities and relations from unstructured texts is a crucial task in information extraction. Recent methods achieve considerable performance but still suffer from some inherent limitations, such as redundancy of relation prediction, poor generalization of span-based extraction and inefficiency. In this paper, we decompose this task into three subtasks, Relation Judgement, Entity Extraction and Subject-object Alignment, from a novel perspective, and then propose a joint relational triple extraction framework based on Potential Relation and Global Correspondence (PRGC). Specifically, we design a component to predict potential relations, which constrains the following entity extraction to the predicted relation subset rather than all relations; then a relation-specific sequence tagging component is applied to handle the overlapping problem between subjects and objects; finally, a global correspondence component is designed to align the subject and object into a triple with low complexity. Extensive experiments show that PRGC achieves state-of-the-art performance on public benchmarks with higher efficiency and delivers consistent performance gains in complex scenarios of overlapping triples.


Introduction
Identifying entity mentions and their relations, which form a triple (subject, relation, object), from unstructured texts is an important task in information extraction. Some previous works addressed the task with pipelined approaches comprising two steps: named entity recognition (Tjong Kim Sang and De Meulder, 2003; Ratinov and Roth, 2009) and relation prediction (Zelenko et al., 2002, 2005; Pawar et al., 2017; Wang et al., 2020b). Recent end-to-end methods, which are based on either multi-task learning (Wei et al., 2020) or a single-stage framework (Wang et al., 2020a), achieved promising performance and proved their effectiveness, but lacked an in-depth study of the task.
To better comprehend the task and advance the state of the art, we propose a novel perspective that decomposes the task into three subtasks: i) Relation Judgement, which aims to identify relations in a sentence; ii) Entity Extraction, which aims to extract all subjects and objects in the sentence; and iii) Subject-object Alignment, which aims to align each subject-object pair into a triple. On this basis, we review two end-to-end methods in Table 1. For the multi-task method named CasRel (Wei et al., 2020), relational triple extraction is performed in two stages, applying object extraction to all relations. Obviously, this way of identifying relations is redundant and contains numerous invalid operations, and the span-based extraction scheme, which only pays attention to the start/end positions of an entity, leads to poor generalization. Meanwhile, it is restricted to processing one subject at a time due to its subject-object alignment mechanism, which is inefficient and difficult to deploy. For the single-stage framework named TPLinker (Wang et al., 2020a), in order to avoid the exposure bias in subject-object alignment, it exploits a rather complicated decoder which leads to sparse labels and a low convergence rate, while the problems of relation redundancy and poor generalization of span-based extraction remain unsolved.
To address aforementioned issues, we propose an end-to-end framework which consists of three components: Potential Relation Prediction, Relation-Specific Sequence Tagging and Global Correspondence, which fulfill the three subtasks accordingly as shown in Table 1.
For Relation Judgement, we predict potential relations with the Potential Relation Prediction component rather than preserving all redundant relations, which reduces computational complexity and achieves better performance, especially when there are many relations in the dataset. For Entity Extraction, we use a more robust Relation-Specific Sequence Tagging component (Rel-Spec Sequence Tagging for short) to extract subjects and objects separately, which naturally handles overlapping between subjects and objects. For Subject-object Alignment, unlike TPLinker which uses a relation-based token-pair matrix, we design a relation-independent Global Correspondence matrix to determine whether a specific subject-object pair is valid in a triple.
Given a sentence, PRGC first predicts a subset of potential relations and a global matrix which contains the correspondence scores between all subjects and objects; it then performs sequence tagging to extract subjects and objects for each potential relation in parallel; finally, it enumerates all predicted entity pairs, which are then pruned by the global correspondence matrix. It is worth noting that the experiment described in Section 5.2.1 shows that the Potential Relation Prediction component of PRGC is beneficial overall, even though it introduces the exposure bias that prior single-stage methods usually cite to prove their advantages.
Experimental results show that PRGC outperforms the state-of-the-art methods on public benchmarks with higher efficiency and fewer parameters. Detailed experiments on complex scenarios such as various overlapping patterns, which include the Single Entity Overlap (SEO), Entity Pair Overlap (EPO) and Subject Object Overlap (SOO) types, show that our method has consistent advantages. The main contributions of this paper are as follows:

1. We tackle the relational triple extraction task from a novel perspective which decomposes the task into three subtasks: Relation Judgement, Entity Extraction and Subject-object Alignment, and previous works are compared on the basis of the proposed paradigm as shown in Table 1.
2. Following our perspective, we propose a novel end-to-end framework and design three components with respect to the subtasks, which greatly alleviate the problems of redundant relation judgement, poor generalization of span-based extraction and inefficient subject-object alignment, respectively.
3. We conduct extensive experiments on several public benchmarks, which indicate that our method achieves state-of-the-art performance, especially for complex scenarios of overlapping triples. Further ablation studies and analyses confirm the effectiveness of each component in our model.
4. In addition to higher accuracy, experiments show that our method has significant advantages in complexity, number of parameters, floating point operations (FLOPs) and inference time compared with previous works.

Related Work
Traditionally, relational triple extraction has been studied as two separate tasks: entity extraction and relation prediction. Early works (Zelenko et al., 2002; Chan and Roth, 2011) applied pipelined methods that perform relation classification between entity pairs after extracting all the entities. To establish the correlation between these two tasks, joint models have attracted much attention. Prior feature-based joint models (Yu and Lam, 2010; Li and Ji, 2014; Miwa and Sasaki, 2014; Ren et al., 2017) require a complicated process of feature engineering and rely on various NLP tools with cumbersome manual operations.
Recently, neural network models, which reduce manual involvement, have dominated the research. Zheng et al. (2017) proposed a novel tagging scheme that unified the roles of the entity and the relation between entities in the annotations; thus the joint extraction task was converted to a sequence labeling task, but it failed to solve the overlapping problems. Bekoulis et al. (2018) proposed to first extract all candidate entities, then predict the relation of every entity pair as a multi-head selection problem, which shared parameters but did not decode jointly. Nayak and Ng (2020) employed an encoder-decoder architecture and a pointer-network-based decoding approach where an entire triple was generated at each time step.

Figure 1: The overall structure of PRGC. Given a sentence S, PRGC predicts a subset of potential relations R_pot and a global correspondence M which indicates the alignment between subjects and objects. Then for each potential relation, a relation-specific sentence representation is constructed for sequence tagging. Finally we enumerate all possible subject-object pairs and get four candidate triples for this particular example, but only two triples are left (marked red) after applying the constraint of global correspondence.
To handle the problems mentioned above, Wei et al. (2020) presented a cascade framework which first identifies all possible subjects in a sentence, then, for each subject, applies span-based taggers to identify the corresponding objects under each relation. This method leads to redundancy in relation judgement, and is not robust due to the span-based scheme for entity extraction. Meanwhile, its subject-object alignment scheme limits parallelization. In order to represent the relation of a triple explicitly, another line of work presented a relation-specific attention to assign different weights to the words in context under each relation, but it applied a naive heuristic nearest-neighbor principle to combine the entity pairs, meaning that the nearest subject and object entities are combined into a triple. This is obviously not in accordance with intuition and fact, and it is also redundant in relation judgement. The state-of-the-art method named TPLinker (Wang et al., 2020a) employs a token-pair linking scheme which performs two O(n²) matrix operations for extracting entities and aligning subjects with objects under each relation of a sentence, causing extreme redundancy in relation judgement and complexity in subject-object alignment, respectively. It also suffers from the disadvantages of the span-based extraction scheme.

Method
In this section, we first introduce our perspective of relational triple extraction task with a principled problem definition, then elaborate each component of the PRGC model. An overview illustration of PRGC is shown in Figure 1.

Problem Definition
The input is a sentence S = {x_1, x_2, ..., x_n} with n tokens. The desired outputs are relational triples T(S) = {(s, r, o) | s, o ∈ E, r ∈ R}, where E and R are the entity and relation sets, respectively. In this paper, the problem is decomposed into three subtasks:

Relation Judgement For the given sentence S, this subtask predicts the potential relations it contains. The output of this task is Y_r(S) = {r_1, r_2, ..., r_m | r_i ∈ R}, where m is the size of the potential relation subset.
Entity Extraction For the given sentence S and a predicted potential relation r_i, this subtask identifies the tag of each token with the BIO (i.e., Begin, Inside and Outside) tag scheme (Tjong Kim Sang and Veenstra, 1999; Ratinov and Roth, 2009). Let t_j denote the tag. The output of this task is Y_e(S, r_i | r_i ∈ R) = {t_1, t_2, ..., t_n}.
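As a minimal illustration of the BIO tag scheme above (the sentence, relation name and helper function are hypothetical, not from the paper), the following sketch shows how subject and object spans would be recovered from two tag sequences under one relation:

```python
# Illustrative only: BIO tags for a toy sentence under a hypothetical
# relation "birth_place"; subjects and objects get separate tag sequences.
tokens = ["Obama", "was", "born", "in", "Honolulu", "Hawaii"]
sub_tags = ["B", "O", "O", "O", "O", "O"]   # subject span: "Obama"
obj_tags = ["O", "O", "O", "O", "B", "I"]   # object span: "Honolulu Hawaii"

def decode_entities(tokens, tags):
    """Collect the spans tagged B(I)* from a BIO tag sequence."""
    spans, cur = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":
            if cur:
                spans.append(" ".join(cur))
            cur = [tok]
        elif tag == "I" and cur:
            cur.append(tok)
        else:
            if cur:
                spans.append(" ".join(cur))
            cur = []
    if cur:
        spans.append(" ".join(cur))
    return spans

print(decode_entities(tokens, sub_tags))  # ['Obama']
print(decode_entities(tokens, obj_tags))  # ['Honolulu Hawaii']
```

Tagging whole spans (rather than only start/end positions, as in span-based schemes) is what lets the model output Y_e directly per relation.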
Subject-object Alignment For the given sentence S, this subtask predicts the correspondence score between the start tokens of subjects and objects. That means only the pair of start tokens of a true triple has a high score, while the other token pairs have low scores. Let M denote the global correspondence matrix. The output of this task is Y_s(S) = M ∈ R^{n×n}.

PRGC Encoder
The output of the PRGC encoder is Y_enc(S) = {h_1, h_2, ..., h_n | h_i ∈ R^{d×1}}, where d is the embedding dimension and n is the number of tokens. We use a pre-trained BERT model (Devlin et al., 2019) to encode the input sentence for a fair comparison, but theoretically it can be replaced by other encoders, such as GloVe (Pennington et al., 2014) and RoBERTa (Liu et al., 2019).

PRGC Decoder
In this section, we describe the instantiation of PRGC decoder that consists of three components.

Potential Relation Prediction
This component is shown as the orange box in Figure 1, where R^pot is the set of potential relations. Different from previous works (Wei et al., 2020; Wang et al., 2020a) which redundantly perform entity extraction for every relation, given a sentence, we first predict a subset of potential relations that possibly exist in the sentence, and then entity extraction only needs to be applied to these potential relations. Given the embedding h ∈ R^{n×d} of a sentence with n tokens, each element of this component is obtained as:

h^avg = Avgpool(h),    P^i_rel = σ(W_r · h^avg + b_r),

where Avgpool is the average pooling operation (Lin et al., 2014), W_r ∈ R^{d×1} is a trainable weight, b_r is a bias term and σ denotes the sigmoid function.
We model it as a multi-label binary classification task: a relation is assigned tag 1 if its probability exceeds a certain threshold λ_1 and tag 0 otherwise (as shown in Figure 1), so we only need to apply relation-specific sequence tagging to the predicted relations rather than all relations.
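The pooling, sigmoid and thresholding steps above can be sketched as follows. This is a simplified numpy illustration, not the paper's implementation: it stacks the per-relation weights into one matrix W_r of shape (d, n_r) (the text defines a per-relation weight in R^{d×1}), and the bias b_r and function names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_potential_relations(h, W_r, b_r, threshold=0.5):
    """Sketch of Potential Relation Prediction: one sigmoid score per relation."""
    h_avg = h.mean(axis=0)                 # Avgpool over tokens -> (d,)
    probs = sigmoid(h_avg @ W_r + b_r)     # (n_r,) multi-label probabilities
    return np.where(probs > threshold)[0]  # indices of potential relations

rng = np.random.default_rng(0)
h = rng.normal(size=(6, 8))                # toy encoding: 6 tokens, d = 8
W_r = rng.normal(size=(8, 4))              # 4 relations in the full set
rels = predict_potential_relations(h, W_r, b_r=np.zeros(4))
```

Only the relations in `rels` would be passed to the sequence tagging stage, which is where the reduction in redundant computation comes from.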

Relation-Specific Sequence Tagging
As shown in Figure 1, we obtain several relation-specific sentence representations for the potential relations described in Section 3.3.1. Then, we perform two sequence tagging operations to extract subjects and objects, respectively. The reason why we extract subjects and objects separately is to handle the special overlapping pattern named Subject Object Overlap (SOO). We can also simplify it to one sequence tagging operation with two types of entities if there are no SOO patterns in the dataset. For the sake of simplicity and fairness, we abandon the traditional LSTM-CRF network (Panchendrarajan and Amaresan, 2018) and adopt a simple fully connected neural network. Detailed operations of this component on each token are as follows:

P^sub_{i,j} = Softmax(W_sub (h_i + u_j) + b_sub),
P^obj_{i,j} = Softmax(W_obj (h_i + u_j) + b_obj),

where u_j ∈ R^{d×1} is the j-th relation representation in a trainable embedding matrix U ∈ R^{d×n_r} where n_r is the size of the full relation set, h_i ∈ R^{d×1} is the encoded representation of the i-th token, and W_sub, W_obj ∈ R^{d×3} are trainable weights where the size of the tag set {B, I, O} is 3.
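A minimal numpy sketch of the tagging operation under one potential relation j follows. The weight shapes mirror the text (a relation embedding u_j in R^d fused with each token representation h_i, then a 3-way tag classifier); the function and variable names, the random toy inputs and the greedy argmax decoding are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

TAGS = ["B", "I", "O"]

def tag_sequence(h, u_j, W, b):
    """Rel-Spec Sequence Tagging sketch: fuse token and relation
    representations, then predict a BIO tag per token."""
    logits = (h + u_j) @ W + b           # (n, 3)
    probs = softmax(logits, axis=-1)
    return [TAGS[i] for i in probs.argmax(axis=-1)]

rng = np.random.default_rng(1)
n, d = 5, 8
h = rng.normal(size=(n, d))              # toy token encodings
u_j = rng.normal(size=(d,))              # toy relation embedding
sub_tags = tag_sequence(h, u_j, rng.normal(size=(d, 3)), np.zeros(3))
obj_tags = tag_sequence(h, u_j, rng.normal(size=(d, 3)), np.zeros(3))
```

Separate weight matrices for subjects and objects (W_sub, W_obj) are what allow the same token to be tagged as both subject and object, handling the SOO pattern.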

Global Correspondence
After sequence tagging, we acquire all possible subjects and objects with respect to each relation of the sentence, then we use a global correspondence matrix to determine the correct pairs of subjects and objects. It should be noted that the global correspondence matrix can be learned simultaneously with potential relation prediction since it is independent of relations. The detailed process is as follows: first we enumerate all possible subject-object pairs; then we check the corresponding score in the global matrix for each pair, retaining it if the value exceeds a certain threshold λ_2 and filtering it out otherwise. As shown in the green matrix M in Figure 1, given a sentence with n tokens, the shape of the global correspondence matrix is R^{n×n}. Each element of this matrix concerns the start positions of a paired subject and object and represents the confidence of the subject-object pair: the higher the value, the higher the confidence that the pair belongs to a triple. For example, the value for "Tom" and "Jerry" at row 1, column 3 will be high if they are in a correct triple such as "(Tom, like, Jerry)". The value of each element in the matrix is obtained as follows:

P^{sub,obj}_{i,j} = σ(W_g [h^sub_i ; h^obj_j] + b_g),

where h^sub_i, h^obj_j ∈ R^{d×1} are the encoded representations of the i-th token and j-th token in the input sentence forming a potential pair of subject and object, [·;·] denotes concatenation, W_g ∈ R^{2d×1} is a trainable weight, b_g is a bias term, and σ is the sigmoid function.
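The scoring and pruning described above can be sketched in numpy as follows. The W_g shape follows the text (R^{2d×1}, applied to the concatenation of two token representations); the bias b_g, the function names and the toy inputs are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def correspondence_matrix(h, W_g, b_g=0.0):
    """Global Correspondence sketch: score every (i, j) token pair
    as sigma(W_g [h_i ; h_j] + b_g)."""
    n = h.shape[0]
    pairs = np.concatenate(
        [np.repeat(h[:, None, :], n, axis=1),   # h_i broadcast over columns
         np.repeat(h[None, :, :], n, axis=0)],  # h_j broadcast over rows
        axis=-1)                                # (n, n, 2d)
    return sigmoid(pairs @ W_g + b_g)           # (n, n) pair scores

def prune_pairs(sub_starts, obj_starts, M, threshold=0.5):
    """Keep only the enumerated (subject, object) start pairs whose
    score in M exceeds the threshold (lambda_2 in the paper)."""
    return [(i, j) for i in sub_starts for j in obj_starts
            if M[i, j] > threshold]

rng = np.random.default_rng(2)
h = rng.normal(size=(6, 8))                     # toy encoding: n = 6, d = 8
M = correspondence_matrix(h, W_g=rng.normal(size=(16,)))
kept = prune_pairs([0], [2, 4], M)              # candidate start positions
```

Because M is relation-independent, one n×n matrix serves all predicted relations, which is the source of the complexity advantage over relation-specific token-pair matrices.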

Training Strategy
We train the model jointly, optimizing the combined objective function during training and sharing the parameters of the PRGC encoder. The total loss can be divided into three parts as follows:

L_rel = −(1/n_r) Σ_{i=1}^{n_r} [ y^rel_i log P^i_rel + (1 − y^rel_i) log(1 − P^i_rel) ],

L_seq = −(1/(2 × n × n^pot_r)) Σ_{j=1}^{n^pot_r} Σ_{t ∈ {sub, obj}} Σ_{i=1}^{n} log P^t_{i,j}(y^t_{i,j}),

L_global = −(1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} [ y^g_{i,j} log M_{i,j} + (1 − y^g_{i,j}) log(1 − M_{i,j}) ],

where n_r is the size of the full relation set and n^pot_r is the size of the potential relation subset of the sentence. The total loss is the weighted sum of these three parts:

L_total = α L_rel + β L_seq + γ L_global.

Performance might be better by carefully tuning the weight of each sub-loss, but we simply assign equal weights (i.e., α = β = γ = 1).

Datasets and Experimental Settings
For fair and comprehensive comparison, we follow Yu et al. (2019) and Wang et al. (2020a) to evaluate our model on two public datasets, NYT (Riedel et al., 2010) and WebNLG (Gardent et al., 2017), each of which has two versions. We denote the different versions as NYT*, NYT and WebNLG*, WebNLG. Note that NYT* and WebNLG* annotate the last word of entities, while NYT and WebNLG annotate the whole entity span. The statistics of the datasets are described in Table 2. Following Wei et al. (2020), we further characterize the test set w.r.t. the overlapping patterns and the number of triples per sentence. Following prior works mentioned above, an extracted relational triple is regarded as correct only if it is an exact match with the ground truth, which means that the last word of entities or the whole entity span (depending on the annotation protocol) of both subject and object and the relation are all correct. Meanwhile, we report the standard micro Precision (Prec.), Recall (Rec.) and F1-score for all the baselines. The implementation details are shown in Appendix B.
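The exact-match evaluation described above can be sketched as follows; the function name and the toy triples are illustrative, and a corpus-level micro average would aggregate the counts over all sentences before computing the ratios.

```python
# Exact-match sketch: a predicted triple counts as correct only if the
# (subject, relation, object) tuple matches the gold annotation exactly.
def micro_prf(pred_triples, gold_triples):
    pred, gold = set(pred_triples), set(gold_triples)
    correct = len(pred & gold)
    prec = correct / len(pred) if pred else 0.0
    rec = correct / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

pred = [("Obama", "birth_place", "Honolulu"), ("Obama", "job", "lawyer")]
gold = [("Obama", "birth_place", "Honolulu")]
print(micro_prf(pred, gold))  # (0.5, 1.0, 0.666...)
```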
We compare PRGC with eight strong baseline models and the state-of-the-art models CasRel (Wei et al., 2020) and TPLinker (Wang et al., 2020a). All the experimental results of the baseline models are directly taken from Wang et al. (2020a) unless specified.

Experimental Results
In this section, we present the overall results and the results on complex scenarios; the results on the different subtasks corresponding to the components of our model are described in Appendix C.

Overall Results
It is important to note that even though TPLinker_BERT has more parameters than CasRel_BERT, it only obtains a 0.1% improvement on the WebNLG* dataset, and the authors attributed this to problems with the dataset itself. However, our model achieves an improvement 10× larger than TPLinker's on the WebNLG* dataset and a significant boost on the WebNLG dataset. The reason behind this is that the relation judgement component of our model greatly reduces redundant relations, particularly in the versions of WebNLG which contain hundreds of relations. In other words, the reduction in negative relations provides an additional boost compared to the models that perform entity extraction under every relation.

Table 6: Comparison of model efficiency on both NYT* and WebNLG* datasets. Results other than F1-score (%) of the other methods are obtained with the official implementations under default configurations, and bold marks the best result. Complexity is the computational complexity; FLOPs and Params_decoder are both calculated on the decoder; we measure inference time (ms) with batch sizes of 1 and 24, respectively.

Detailed Results on Complex Scenarios
Following previous works (Wei et al., 2020; Wang et al., 2020a), to verify the capability of our model in handling different overlapping patterns and sentences with different numbers of triples, we conduct further experiments on the NYT* and WebNLG* datasets. As shown in Table 4, our model exceeds all the baselines in all overlapping patterns in both datasets except the SOO pattern in the NYT* dataset. Actually, the observation on the latter scenario is not reliable due to the very low proportion of SOO in NYT* (i.e., 45 out of 8,110 as shown in Table 2). As shown in Table 5, the performance of our model is better than the others in almost every subset regardless of the number of triples. In general, these two further experiments adequately show the advantages of our model in complex scenarios. As shown in Table 6, we evaluate model efficiency with respect to Complexity, floating point operations (FLOPs) (Molchanov et al., 2017), parameters of the decoder (Params_decoder) and inference time of CasRel, TPLinker and PRGC on two datasets which have quite different characteristics in the size of the relation set, the average number of relations per sentence and the average number of subjects per sentence. All experiments are conducted with the same hardware configuration. Because the number of subjects in a sentence varies, it is difficult for CasRel to predict objects in a heterogeneous batch, and it is restricted to a batch size of 1 in the official implementation (Wang et al., 2020a). For the sake of fair comparison, we set the batch size to 1 and 24 to verify single-thread decoding speed and parallel processing capability, respectively.

Model Efficiency
The results indicate that the single-thread decoding speed of PRGC is 2× that of CasRel and 3× that of TPLinker, and our model is significantly better than TPLinker in terms of parallel processing. Note that the model efficiency of CasRel and TPLinker decreases as the size of the relation set increases, but our model is not affected by the size of the relation set; thus PRGC overwhelmingly outperforms both models on all efficiency indicators on the WebNLG* dataset. Compared with the state-of-the-art model TPLinker, PRGC is an order of magnitude lower in Complexity and its FLOPs are even 200 times lower; thus PRGC has fewer parameters and obtains a 3× speedup in the inference phase while improving the F1-score by 1.1%. Even though CasRel has lower Complexity and FLOPs on the NYT* dataset, PRGC still has significant advantages, obtaining a 5× speedup in inference time and a 3% improvement in F1-score. Meanwhile, Figure 2 demonstrates our advantage in convergence rate. These all confirm the efficiency of our model.

Figure 3: Case study for the ablation study of Rel-Spec Sequence Tagging. Examples are from WebNLG*, and we supplement the whole entity span through WebNLG to facilitate viewing. The red cross marks bad cases, the correct entities are in bold and the correct relations are colored.

Ablation Study
In this section, we conduct ablation experiments to demonstrate the effectiveness of each component of PRGC, with results reported in Table 7.

Effect of Potential Relation Prediction
When we remove the Potential Relation Prediction component to avoid the exposure bias, we perform sequence tagging under every relation in the relation set. As shown in Table 7, precision significantly decreases without this component, because the number of predicted triples increases due to relations not present in the sentences, especially on the WebNLG* dataset, whose relation set is much larger and brings tremendous relation redundancy. Meanwhile, as the number of relations in sentences increases, training and inference time increase three to four times. This experiment proves the validity of this component, which aims to predict a potential relation subset and benefits not only model accuracy but also efficiency.

Effect of Rel-Spec Sequence Tagging
As a comparison to the sequence tagging scheme, following Wei et al. (2020) and Wang et al. (2020a), we perform binary classification to detect the start and end positions of an entity with the span-based scheme. As shown in Table 7, the span-based scheme brings a significant decline in performance.
Through the case study shown in Figure 3, we observe that the span-based scheme tends to extract long entities and to identify the correct subject-object pairs while ignoring their relation. That is because the model is inclined to remember the position of an entity rather than understand the underlying semantics. In contrast, the sequence tagging scheme used by PRGC performs well in both cases, and the experimental results prove that our tagging scheme is more robust and generalizable.

Effect of Global Correspondence
For comparison, we exploit the heuristic nearest-neighbor principle used by Zheng et al. (2017), among others, to combine the subject-object pairs. As shown in Table 7, precision also significantly decreases without Global Correspondence, because the number of predicted triples increases with many mismatched pairs once the model loses the constraint imposed by this component. This experiment proves that the Global Correspondence component is effective and greatly outperforms the heuristic nearest-neighbor principle on the subject-object alignment task.

B Implementation Details
We implement our model with PyTorch and optimize the parameters with Adam (Kingma and Ba, 2015), using a batch size of 64/6 for NYT/WebNLG. The learning rate for the BERT encoder is set to 5 × 10^{-5}, and the decoder learning rate is set to 0.001 in order to converge rapidly. We also apply weight decay (Loshchilov and Hutter, 2017) with a rate of 0.01.
For fair comparison, we use the BERT-Base-Cased English model (available at https://huggingface.co/bert-base-cased) as our encoder, and set the max length of an input sentence to 100, which is the same as previous works (Wei et al., 2020; Wang et al., 2020a). Our experiments are conducted on a workstation with an Intel Xeon E5 2.40 GHz CPU, 128 GB memory, an NVIDIA Tesla V100 GPU, and CentOS 7.2. We train the model for 100 epochs and choose the last checkpoint. Performance improves with a higher threshold for Potential Relation Prediction (λ_1), but tuning the threshold for Global Correspondence (λ_2) does not help, which is consistent with the analysis in Appendix C.

C Results on Different Subtasks
To further verify the results of the three subtasks in our new perspective and the performance of each component of our model, we present more detailed evaluations on the NYT* and WebNLG* datasets in Table 8.

Relation Judgement
We evaluate the outputs of the Potential Relation Prediction component, i.e., the potential relations contained in a sentence. Recall is more important for this task because a missed true relation cannot be recovered in the following steps. We achieve high recall on this task, and the results show that the effectiveness of the Potential Relation Prediction component is not affected by the size of the relation set.
Entity Extraction

This task is related to the Relation-Specific Sequence Tagging component, and we evaluate it as a Named Entity Recognition (NER) task with two types of entities: subjects and objects. The predicted entities come from all potential relations of a sentence, and recall is more important for this task because most false positives can be filtered out by Subject-object Alignment. Experimental results show that we extract almost all correct entities, which further proves that the influence of the exposure bias is negligible.
Subject-object Alignment

This task is related to the Global Correspondence component, and we evaluate only the entity pairs in triples, ignoring the relation. Both recall and precision are important for this component. Experimental results indicate that our alignment scheme is useful but can still be improved, especially in recall.
Overall, the combination of the three components in our model accomplishes the relational triple extraction task from a fine-grained perspective and achieves better and more solid results.