BiSPN: Generating Entity Set and Relation Set Coherently in One Pass



Introduction
Extracting entities and relation triples from text is a fundamental task of Information Extraction. Many efforts decompose the problem into two separate tasks, namely named entity recognition (NER) and relation triple extraction (RE), and solve them separately. Among these efforts, Set Prediction Networks (SPNs) have demonstrated state-of-the-art performance on NER (Tan et al., 2021; Shen et al., 2022) and RE (Sui et al., 2020; Tan et al., 2022).
Typically, SPNs leverage a set of learnable queries to model the interaction among instances (entities or relation triples) via an attention mechanism and generate the set of instances naturally. The success of SPNs on NER and RE inspires us to explore the possibility of jointly solving the extraction of entities and relation triples with SPNs, which is a promising but unexplored direction.
In this paper, we propose the Bipartite Set Prediction Network (BiSPN), a variant of SPNs that generates the target entity set and relation set in one pass. It not only avoids the negative effect of cascading errors but also enjoys the benefit of high inference speed. However, its parallel design makes it challenging to maintain coherence between the generated entity set and relation set.
As illustrated in Figure 1, the head/tail entities of the generated relation triples should be included in the generated entity set. The significance of this coherence is two-fold: 1) by requiring the generated entity set to contain the head/tail entities, the recall of the generated entities is better guaranteed; 2) by restricting the head/tail entities to the generated entity set, the precision and recall of the generated triples are better guaranteed.
Despite that, it is difficult to maintain such consistency when generating the two sets in parallel, since all instance queries are assumed to be of equal status and their hidden representations are updated via bidirectional attention without any further restriction.
To overcome this challenge, we come up with two novel solutions. The first is a Bipartite Consistency Loss function. It works by looking for a reference entity in the generated entity set for each relational subject/object and forcing the subject/object to simulate the reference entity. Symmetrically, it also finds a reference subject/object for each entity classified as involved in a relation and forces the entity to simulate the reference subject/object. Our second solution is an Entity-Relation Linking Loss function, which works in the hidden semantic space. By computing linking scores between the projected representations of entity queries and relation queries, it encourages the model to learn the interaction between entity instances and relation triple instances.
To sum up, our main contributions include: 1) we propose BiSPN, a bipartite set prediction network that jointly generates the entity set and relation set in one pass; 2) we design a bipartite consistency loss and an entity-relation linking loss to maintain coherence between the two generated sets; 3) extensive experiments on four datasets verify the effectiveness and efficiency of BiSPN.

Related Work

Joint Entity and Relation Extraction

Existing approaches to joint entity and relation extraction include, among others: table filling-based methods (Wang et al., 2020, 2021; Yan et al., 2021) that fill out a table for each entity/relation type via token pair classification; machine reading comprehension (MRC)-based methods (Li et al., 2019) that cast the task as a multi-turn question answering problem via manually designed templates; and autoregressive generation-based methods (Zeng et al., 2018; Lu et al., 2022) that reformulate the task as a sequence-to-sequence problem by linearizing target entities/relations into a pointer sequence or augmented natural language. Among these, only the table filling-based methods extract entities and relation triples in one stage; all the other methods perform multi-step prediction, suffering from cascading errors and low inference speed. In this paper, we provide a new choice for one-stage joint entity and relation extraction, based on set prediction networks.

Set Prediction Networks
Set prediction networks were originally proposed for object detection in computer vision (Carion, et al., 2020) and have since been successfully extended to information extraction (Sui et al., 2020; Tan et al., 2021; Shen et al., 2022; Tan et al., 2022).
Generally, these methods employ a set of learnable queries as additional input and model the interaction among instances (entities/relations) via self-attention among the queries and one-way attention between the queries and the textual context. However, all these methods can only perform named entity recognition (Tan et al., 2021; Shen et al., 2022) or relation triple extraction (Sui et al., 2020; Tan et al., 2022), rather than jointly solving both.

Problem Formulation
Given an input sentence $x = x_1, x_2, \ldots, x_L$, the aim of joint entity and relation extraction is to predict the set of entities $\{e_i\}_{i=1}^{N_e}$ and the set of relation triples $\{r_j\}_{j=1}^{N_r}$ mentioned in the sentence. Here, $L$ is the length of the input sentence, and $N_e$ and $N_r$ are the numbers of target entities and target relation triples respectively. The $i$-th entity $e_i$ is denoted as $(start_i, end_i, t^e_i)$, where $start_i$ and $end_i$ are the start and end token indices of the entity, and $t^e_i \in \mathcal{T}^e$ is the entity type label. The $j$-th relation triple $r_j$ is denoted as $(e^h_j, t^r_j, e^t_j)$, where $e^h_j = (start^h_j, end^h_j, t^{e,h}_j)$ and $e^t_j = (start^t_j, end^t_j, t^{e,t}_j)$ are the head and tail entities of the triple, and $t^r_j \in \mathcal{T}^r$ is the relation type label. We additionally define a null label $\varnothing$ for the set of entity types $\mathcal{T}^e$ and the set of relation types $\mathcal{T}^r$.
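For concreteness, the target output can be represented with simple data structures. Below is a minimal sketch in Python; the class and field names are our own illustration, not part of the paper:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Entity:
    start: int      # index of the entity's first token
    end: int        # index of the entity's last token
    ent_type: str   # label from the entity type set T^e

@dataclass
class RelationTriple:
    head: Entity    # head entity e^h_j
    rel_type: str   # label from the relation type set T^r
    tail: Entity    # tail entity e^t_j

# Target output for one sentence: an entity set and a relation set.
entities: List[Entity] = [Entity(0, 1, "PER"), Entity(5, 6, "ORG")]
triples: List[RelationTriple] = [
    RelationTriple(entities[0], "WORK_FOR", entities[1]),
]
```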

BiSPN
As illustrated in Figure 2, the proposed model mainly consists of a shared encoder, a shared decoder, an entity decoding module (in blue) and a relation decoding module (in red).

Shared Encoder
The encoder of BiSPN is essentially a bidirectional pretrained language model (Devlin et al., 2019) with a modified input style and attention design.
We first transform the input sentence $x$ into input embeddings $X \in \mathbb{R}^{L \times d}$, and then concatenate $X$ with a series of learnable entity queries $Q^e \in \mathbb{R}^{M_e \times d}$ and relation queries $Q^r \in \mathbb{R}^{M_r \times d}$ to form the model input $\tilde{X}$:

$$\tilde{X} = [X; Q^e; Q^r] \in \mathbb{R}^{(L + M_e + M_r) \times d}$$

where $d$ is the model dimension, and $M_e$ and $M_r$ are hyperparameters controlling the numbers of entity queries and relation queries ($M_e \gg N_e$, $M_r \gg N_r$).
To prevent the randomly initialized queries from negatively affecting the contextual token encodings, we follow Shen et al. (2022) and modify the bidirectional self-attention into one-way self-attention. Concretely, the upper-right $L \times (M_e + M_r)$ sub-matrix of the attention mask is filled with negative infinity so that the entity/relation queries become invisible to the token encodings, while the queries can still attend to each other and to the token encodings.
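A minimal PyTorch sketch of this one-way attention mask (names and shapes are our own; the actual implementation may differ in detail):

```python
import torch

def one_way_attention_mask(L: int, M_e: int, M_r: int) -> torch.Tensor:
    """Additive attention mask of shape (L + M_e + M_r, L + M_e + M_r).

    Rows are queries, columns are keys. The upper-right L x (M_e + M_r)
    block is -inf so that token positions cannot attend to the
    entity/relation queries, while the queries can attend to everything.
    """
    S = L + M_e + M_r
    mask = torch.zeros(S, S)
    mask[:L, L:] = float("-inf")  # tokens are blind to the queries
    return mask

# Example: 10 tokens, 4 entity queries, 3 relation queries.
mask = one_way_attention_mask(10, 4, 3)
# mask can be added to the attention logits before the softmax.
```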
After multiple one-way self-attention layers and feed-forward layers, the encoder outputs the contextual token encodings as well as the contextual entity/relation queries.

Shared Decoder
The shared decoder consists of $N$ decoding blocks. Each decoding block includes a one-way self-attention layer (as described above), a bidirectional self-attention layer and feed-forward layers. The one-way self-attention layer here functions as the cross-attention layer of the Transformer decoder (Vaswani et al., 2017), aggregating textual context for decoding. (The main difference between one-way self-attention and cross-attention is that the contextual token encodings also get updated by one-way self-attention.) The bidirectional self-attention layer updates the entity/relation queries by modeling the interaction among them.
After shared decoding, the decoder outputs the updated token representations $H^x$, entity queries $H^e$ and relation queries $H^r$.

Entity Decoding Module
The entity decoding module consists of an entity-view projection layer, an entity decoder and an entity predictor.
The entity-view projection layer first linearly transforms the token encodings $H^x$ into the entity view:

$$H^x_e = \mathrm{Linear}_e(H^x)$$

The entity decoder, which includes multiple layers of cross-attention and bidirectional self-attention, receives the transformed token encodings $H^x_e$ as decoding context and the entity queries $H^e$ as decoder input, and outputs the final representations of the entity queries $\tilde{H}^e$:

$$\tilde{H}^e = \mathrm{EntityDecoder}(H^e, H^x_e)$$

The entity predictor is responsible for predicting the boundary and entity type of each entity query.
For each entity query, it first fuses the query representation with the transformed token encodings, and then calculates the probability of each token in the sentence being the start/end token of the corresponding entity:

$$P^\delta_i = \mathrm{softmax}(S^\delta_i), \quad \delta \in \{start, end\}$$

where $S^\delta_i \in \mathbb{R}^L$ is a vector of logits of each token being the start/end token of the entity associated with the $i$-th entity query, and $P^\delta_i$ is the corresponding probability distribution.
An MLP-based classifier is leveraged to predict the type of the entity associated with the $i$-th entity query:

$$P^{t^e}_i = \mathrm{softmax}(\mathrm{MLP}(\tilde{H}^e_i))$$

During inference, the predicted boundary and entity type corresponding to the $k$-th entity query are calculated as:

$$\widehat{start}_k = \arg\max(P^{start}_k), \quad \widehat{end}_k = \arg\max(P^{end}_k), \quad \hat{t}^e_k = \arg\max(P^{t^e}_k)$$

Note that the entity predictor filters out any entity whose predicted type label is $\varnothing$.
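A simplified sketch of the entity predictor described above; the exact fusion function and layer sizes are not fully specified in the text, so the choices here (concatenation followed by a linear scorer, and a two-layer MLP for typing) are assumptions:

```python
import torch
import torch.nn as nn

class EntityPredictor(nn.Module):
    def __init__(self, d: int, num_ent_types: int):
        super().__init__()
        # one boundary scorer per side (start / end)
        self.fuse = nn.ModuleDict({
            side: nn.Linear(2 * d, 1) for side in ("start", "end")
        })
        self.type_clf = nn.Sequential(  # MLP classifier incl. the null label
            nn.Linear(d, d), nn.GELU(), nn.Linear(d, num_ent_types + 1)
        )

    def forward(self, H_e: torch.Tensor, H_x: torch.Tensor):
        # H_e: (M_e, d) final entity queries; H_x: (L, d) entity-view tokens
        M_e, L = H_e.size(0), H_x.size(0)
        pair = torch.cat(  # fuse each query with each token: (M_e, L, 2d)
            [H_e.unsqueeze(1).expand(M_e, L, -1),
             H_x.unsqueeze(0).expand(M_e, L, -1)], dim=-1)
        P = {side: self.fuse[side](pair).squeeze(-1).softmax(-1)  # (M_e, L)
             for side in ("start", "end")}
        P_type = self.type_clf(H_e).softmax(-1)  # (M_e, |T^e| + 1)
        return P["start"], P["end"], P_type
```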

Relation Decoding Module
The relation decoding module consists of a relation-view projection layer, a relation decoder, a head-tail predictor and a relation type predictor.
The relation-view projection layer and relation decoder work in the same manner as the entity-view projection layer and entity decoder, except that the relation decoder splits the relation queries into head/tail queries before decoding:

$$H^h, H^t = \mathrm{Split}(H^r)$$

The head-tail predictor then predicts the boundary and entity type of the head/tail entity associated with each relation query. This process is similar to the entity prediction process (Equations 4-10), except that the entity queries become the head/tail queries $\tilde{H}^{h/t}$ and the token encodings are now in the relation view $H^x_r$. The relation type predictor classifies the category of the $i$-th relation query according to $\tilde{H}^r_i$:

$$P^{t^r}_i = \mathrm{softmax}(\mathrm{MLP}(\tilde{H}^r_i))$$
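One plausible realization of the head/tail split is a pair of learned projections; this is an assumption on our part, as the text only states that the split happens before decoding:

```python
import torch.nn as nn

class HeadTailSplit(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.to_head = nn.Linear(d, d)  # head-entity view of a relation query
        self.to_tail = nn.Linear(d, d)  # tail-entity view of a relation query

    def forward(self, H_r):             # H_r: (M_r, d) relation queries
        return self.to_head(H_r), self.to_tail(H_r)
```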

Prediction Loss
To train the model, we need to find the optimal assignment between the gold entity set and the generated entity set, as well as the optimal assignment between the gold relation set and the generated relation set. Both are calculated in the same way as in (Tan et al., 2021; Shen et al., 2022), using the Hungarian algorithm (Kuhn, 1955).
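The assignment itself can be computed with the Hungarian algorithm as implemented in SciPy. A minimal sketch (the cost matrix is assumed to be the negative log-likelihood of the gold labels under each query's predictions, following common practice in set prediction):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(cost: np.ndarray):
    """cost[i, j]: cost of assigning gold instance j to query i.

    Returns phi, mapping each matched query index to a gold index.
    With M queries >> N gold instances, unmatched queries are later
    trained to predict the null label.
    """
    query_idx, gold_idx = linear_sum_assignment(cost)  # Hungarian algorithm
    return dict(zip(query_idx.tolist(), gold_idx.tolist()))

# Example: 4 queries, 2 gold entities.
cost = np.array([[0.9, 0.2],
                 [0.1, 0.8],
                 [0.7, 0.6],
                 [0.3, 0.4]])
phi = match(cost)  # {0: 1, 1: 0}, the lowest-total-cost assignment
```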
After the optimal assignments are obtained, we calculate the following prediction loss $\mathcal{L}_{pred}$ for each sample:

$$\mathcal{L}_{pred} = \mathcal{L}_{ent} + \mathcal{L}_{rel}$$

$$\mathcal{L}_{ent} = -\sum_{i=1}^{M_e} \Big( \log P^{start}_i[start_{\phi(i)}] + \log P^{end}_i[end_{\phi(i)}] + \log P^{t^e}_i[t^e_{\phi(i)}] \Big)$$

$$\mathcal{L}_{rel} = \sum_{j=1}^{M_r} \Big( -\log P^{t^r}_j[t^r_{\sigma(j)}] + \mathcal{L}^{h}_{ent}(j) + \mathcal{L}^{t}_{ent}(j) \Big) \tag{15}$$

where $\phi(i)$ is the index of the gold entity assigned to the $i$-th generated entity, $\sigma(j)$ is the index of the gold relation triple assigned to the $j$-th generated relation triple, and $\mathcal{L}^{h/t}_{ent}(j) = -\big( \log P^{start}_{j,h/t}[start^{h/t}_{\sigma(j)}] + \log P^{end}_{j,h/t}[end^{h/t}_{\sigma(j)}] + \log P^{t^e}_{j,h/t}[t^{e,h/t}_{\sigma(j)}] \big)$ represents the loss of head/tail entity prediction. Queries left unassigned are trained against the null label $\varnothing$.

Bipartite Consistency Loss
To calculate the bipartite consistency loss, we first find a reference entity from the generated entity set for each head/tail entity. A reference entity is defined as the generated entity most similar to the referring head/tail entity. Concretely, the similarity between $e_a$, the $a$-th generated entity, and $e^{h/t}_b$, the head/tail entity of the $b$-th generated relation triple, is measured by the KL divergence between the start/end/type probability distributions of $e_a$ and $e^{h/t}_b$:

$$\mathrm{sim}(e_a, e^{h/t}_b) = -\sum_{\delta \in \{start,\, end,\, t^e\}} \mathrm{KL}\big(P^\delta_a \,\|\, P^{\delta}_{b,h/t}\big)$$

We want every head/tail entity to simulate its reference entity, which is equivalent to maximizing the similarity. Hence, the consistency loss in the relation → entity direction is computed as:

$$\mathcal{L}_{rel \to ent} = -\sum_{i=1}^{M_r} \Big( \max_j \mathrm{sim}(e_j, e^h_i) + \max_j \mathrm{sim}(e_j, e^t_i) \Big)$$

Symmetrically, we also find a reference head/tail entity for each generated entity that is classified as having a relation. This classification is conducted by a binary classifier, which is trained with a binary cross-entropy loss:

$$\mathcal{L}_{has\text{-}rel} = -\sum_{i=1}^{M_e} \Big( y^{has\text{-}rel}_i \log P^{has\text{-}rel}_i + (1 - y^{has\text{-}rel}_i) \log(1 - P^{has\text{-}rel}_i) \Big)$$

where $y^{has\text{-}rel}_i = 1$ only if the gold entity assigned to the $i$-th entity query is involved in a relation.
The consistency loss in the entity → relation direction is then calculated as:

$$\mathcal{L}_{ent \to rel} = -\sum_{i \in \Omega} \max_{j,\, h/t} \mathrm{sim}(e^{h/t}_j, e_i) \tag{21}$$

where $\Omega$ is the set of indices of the entities classified as involved in a relation.
We sum up $\mathcal{L}_{ent \to rel}$, $\mathcal{L}_{rel \to ent}$ and $\mathcal{L}_{has\text{-}rel}$ to obtain the overall bipartite consistency loss $\mathcal{L}_{ent \leftrightarrow rel}$.
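The following sketch illustrates the relation → entity direction under our reading of the text: similarity is the negative KL divergence summed over the start/end/type distributions, the reference is the most similar generated entity, and a stop-gradient on the reference (our assumption) makes the head/tail distributions move toward it rather than vice versa:

```python
import torch

def neg_kl_sim(P_ent: dict, P_ht: dict) -> torch.Tensor:
    """Similarity matrix between generated entities and head/tail entities.

    P_ent[key]: (M_e, K) distributions of the generated entities.
    P_ht[key]:  (M_r, K) distributions of the head (or tail) entities.
    Returns (M_r, M_e): negative KL divergence summed over start/end/type.
    """
    sim = 0.0
    for key in ("start", "end", "type"):
        p = P_ent[key].detach().clamp_min(1e-9)   # reference: stop-gradient
        q = P_ht[key].clamp_min(1e-9)
        # KL(p_j || q_i) for every pair (i, j)
        kl = (p.unsqueeze(0) *
              (p.log().unsqueeze(0) - q.log().unsqueeze(1))).sum(-1)
        sim = sim - kl
    return sim

def rel_to_ent_loss(P_ent: dict, P_head: dict, P_tail: dict) -> torch.Tensor:
    """Pull each head/tail entity toward its most similar generated entity."""
    loss = 0.0
    for P_ht in (P_head, P_tail):
        sim = neg_kl_sim(P_ent, P_ht)              # (M_r, M_e)
        loss = loss - sim.max(dim=1).values.sum()  # maximize best similarity
    return loss
```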

Entity-Relation Linking Loss
While the bipartite consistency loss softly aligns the predicted distributions between the generated entity set and relation set during training, the entity-relation linking loss encourages BiSPN to model the interaction between entity queries and relation queries.
To this end, we first project the intermediate representations of the entity queries and relation queries and then compute linking scores between them via a biaffine layer:

$$S^{link} = \mathrm{Biaffine}(\tilde{H}^e, \tilde{H}^r) \in \mathbb{R}^{M_e \times M_r} \tag{25}$$

With the linking scores, we calculate the following binary cross-entropy loss:

$$\mathcal{L}_{link} = -\frac{1}{M_e M_r} \sum_{i=1}^{M_e} \sum_{j=1}^{M_r} \Big( y^{link}_{i,j} \log \sigma(S^{link}_{i,j}) + (1 - y^{link}_{i,j}) \log\big(1 - \sigma(S^{link}_{i,j})\big) \Big)$$

where $y^{link}_{i,j} = 1$ only if the gold entity assigned to the $i$-th entity query appears in the gold relation triple assigned to the $j$-th relation query.
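A minimal sketch of the biaffine scorer and linking loss; the exact biaffine parameterization is not given in the text, so the bias-augmented bilinear form below is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Biaffine(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d + 1, d + 1) * 0.02)

    def forward(self, H_e: torch.Tensor, H_r: torch.Tensor) -> torch.Tensor:
        # append a bias feature so linear terms are absorbed into U
        He = torch.cat([H_e, H_e.new_ones(H_e.size(0), 1)], dim=-1)  # (M_e, d+1)
        Hr = torch.cat([H_r, H_r.new_ones(H_r.size(0), 1)], dim=-1)  # (M_r, d+1)
        return He @ self.U @ Hr.t()  # (M_e, M_r) linking logits

def linking_loss(S_link: torch.Tensor, y_link: torch.Tensor) -> torch.Tensor:
    # y_link[i, j] = 1 iff the gold entity of entity query i appears in
    # the gold triple assigned to relation query j
    return F.binary_cross_entropy_with_logits(S_link, y_link.float())
```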

Experiments
Datasets. We evaluate BiSPN on ACE05, BioRelEx, ADE and Text2DT; see Table 1 for detailed statistics of the datasets. We additionally experiment on the SciERC dataset, where we follow the same setting as (Wang et al., 2021; Ye et al., 2022). See Appendix B for the results.
Evaluation Metrics. Strict evaluation metrics are applied: an entity is counted as correct only if its boundary and entity type are correctly predicted; a relation triple is counted as correct only if its relation type and head/tail entities are correctly predicted. For ACE05 and BioRelEx, we report Micro F1 scores averaged over 3 random seeds. For ADE, we follow (Ji et al., 2020; Lai et al., 2021) and conduct 10-fold cross-validation, reporting averaged Macro F1 scores. For Text2DT, we follow the top-1 system of the evaluation task and ensemble 5 models trained with different random seeds, reporting Micro F1 scores.

Implementation Details
We implement BiSPN with PyTorch (Paszke et al., 2019) and run experiments on NVIDIA Tesla V100 GPUs. For ACE05, we follow (Wang et al., 2021; Ye et al., 2022) and initialize the shared encoder with BERT-base (Devlin et al., 2019). For BioRelEx and ADE, we follow (Haq et al., 2023; Lai et al., 2021) and initialize the shared encoder with BioBERT-base. For Text2DT, we initialize the shared encoder with Chinese-bert-wwm-ext (Cui et al., 2021). The decoding modules are randomly initialized. Following (Shen et al., 2022), we freeze the encoder during the first 5 epochs and unfreeze it for the remaining epochs. The learning rate of the decoding modules is set larger than that of the encoder. We adopt the AdamW optimizer (Loshchilov and Hutter, 2017) with a linear warm-up scheduler to tune the model. See Appendix A for details of hyperparameter tuning.
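The two learning rates and warm-up schedule described above can be set up with parameter groups; a sketch with hypothetical values (the actual values are listed in Table 5, and `model.encoder` is an assumed attribute name):

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, enc_lr=2e-5, dec_lr=1e-4,
                    warmup_steps=1000, total_steps=20000):
    # smaller LR for the pretrained encoder, larger for the random decoders
    groups = [
        {"params": model.encoder.parameters(), "lr": enc_lr},
        {"params": [p for n, p in model.named_parameters()
                    if not n.startswith("encoder")], "lr": dec_lr},
    ]
    opt = AdamW(groups)

    def lr_lambda(step):  # linear warm-up, then linear decay
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    return opt, LambdaLR(opt, lr_lambda)
```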

Compared Baselines
We compare BiSPN with several SOTA methods listed as follows.

KECI (Lai et al., 2021): A knowledge-enhanced two-stage extraction model based on span graphs.

MADR (Haq et al., 2023): A pipeline of independent NER and RE models.
PL-Marker (Ye et al., 2022): A span-based method that models the interrelation between spans by packing levitated markers in the encoder.
PFN (Yan et al., 2021): A partition filter network that models two-way interaction between NER and RE subtasks.
UniRE (Wang et al., 2021): A method based on table filling, featured with a unified label space for one-stage joint entity and relation extraction.
Note that we do not compare with SPN (Sui et al., 2020), UniRel (Tang et al., 2022) or QIDN (Tan et al., 2022), since these methods can only extract relation triples and cannot recognize entities uninvolved in any relation.

Main Results
Table 2 summarizes the overall performance of BiSPN and the compared baselines on ACE05, BioRelEx, ADE and Text2DT. In terms of entity extraction, BiSPN performs competitively with or slightly better than SOTA methods on ACE05, BioRelEx and ADE, while outperforming the SOTA method PL-Marker by 0.8 F1 on Text2DT.
In terms of relation extraction, BiSPN boosts SOTA performance by 0.7, 0.5 and 0.5 F1 on BioRelEx, ADE and Text2DT respectively, verifying the effectiveness of our method in knowledge-intensive scenarios. However, although BiSPN outperforms SOTA one-stage methods by 0.6 F1 on ACE05, it is no match for the two-stage method PL-Marker on this general-domain dataset. We look into the reason behind this in Section 4.6.2.

Inference Efficiency
We compare the inference efficiency of KECI, UniRE, PL-Marker and BiSPN on the BioRelEx and Text2DT datasets. For a fair comparison, all experiments are conducted on a server with Intel(R) Xeon(R) E5-2698 CPUs and NVIDIA Tesla V100 GPUs, and the batch size is fixed at 8 during evaluation. As shown in Table 3, BiSPN processes around 40∼60 sentences per second, which is 2 times faster than the SOTA two-stage method PL-Marker. Although UniRE, a SOTA one-stage method, is about 2.5 times faster than BiSPN, its relation extraction performance is uncompetitive with that of BiSPN.

Effects of $\mathcal{L}_{ent \leftrightarrow rel}$ and $\mathcal{L}_{link}$

We conduct an ablation study and qualitative visualization to analyze how the bipartite consistency loss $\mathcal{L}_{ent \leftrightarrow rel}$ and the entity-relation linking loss $\mathcal{L}_{link}$ work. The results of the ablation study are shown in Table 2. Without the bipartite consistency loss, the entity F1 scores drop by 0.5, 1.4, 0.7 and 0.3 and the relation F1 scores drop by 1.5, 0.8, 0.2 and 0.5 on ACE05, BioRelEx, ADE and Text2DT respectively. Without the entity-relation linking loss, the entity F1 scores decrease by 0.2, 0.1, 0.3 and 0 and the relation F1 scores decrease by 0.4, 0.3, 0.2 and 0.3 on the four datasets respectively. After removing both losses, the entity F1 scores decrease by 0.8, 1.8, 1.2 and 0.9 and the relation F1 scores decrease by 2.1, 1.0, 0.7 and 0.8 on the four datasets respectively.
Three conclusions can be derived from these results: 1) both the bipartite consistency loss and the entity-relation linking loss contribute to the overall performance of BiSPN; 2) the bipartite consistency loss is much more effective than the entity-relation linking loss; 3) the effects of the two losses are not orthogonal but still complementary to some degree.
To visualize the attention between entity queries and relation queries, we record the attention weights in the last bidirectional self-attention layer of the shared decoder for a sample from BioRelEx. Since the original weights are bidirectional, we average them over the two directions and normalize the result to obtain the weights for visualization. As shown in Figure 3, without $\mathcal{L}_{ent \leftrightarrow rel}$ and $\mathcal{L}_{link}$, the interaction between entity queries and relation queries is chaotic and unfocused. After applying $\mathcal{L}_{ent \leftrightarrow rel}$ or $\mathcal{L}_{link}$ separately, the relation queries associated with gold relation triples tend to focus on the entity queries associated with gold entities. When the two loss functions are applied together, the relation queries associated with gold relation triples further concentrate on the entity queries whose target entities match their head/tail entities. This phenomenon coincides with the purpose of the two loss functions.

Influence of Knowledge Density
The performance of BiSPN on ACE05 is not as good as its performance on BioRelEx, ADE and Text2DT. We hypothesize that the sparsity of relation triples hinders the learning of BiSPN. As listed in Table 1, the average number of relations per sentence is 0.53 on ACE05; in contrast, the averages are 1.61, 1.60 and 6.39 on the other three datasets.
To verify this hypothesis, we filter samples according to the number of relations they contain and experiment on different versions of the filtered dataset. As shown in Table 4, when the samples without relations are discarded, the performance gap between PL-Marker and BiSPN narrows from 1.0 to 0.4. When further discarding the samples with fewer than 2 relation triples, BiSPN even performs slightly better than PL-Marker. This reveals that the strength of BiSPN emerges in knowledge-intensive scenarios, which is reasonable, since BiSPN works by modeling the interaction among knowledge instances.

Case Study
Figure 4 illustrates two test cases from ACE05 and BioRelEx respectively. In the first case, BiSPN without $\mathcal{L}_{ent \leftrightarrow rel}$ and $\mathcal{L}_{link}$ fails to recognize the WEA entity "more" and the PART-WHOLE relation between "it" and "more", while the full BiSPN successfully extracts them by considering the context in the entity view and relation view concurrently. Likewise, in the second case, BiSPN successfully recognizes the entity "promoter" after $\mathcal{L}_{ent \leftrightarrow rel}$ and $\mathcal{L}_{link}$ are applied. However, it still fails to recognize "RNA polymerase II preinitiation complex", which is complicated and may require domain knowledge to recognize.

Conclusion
In this work, we present BiSPN, a novel joint entity and relation extraction framework based on bipartite set prediction. It generates the entity set and relation set in parallel, avoiding error propagation. To maintain the coherence between the generated entity set and relation set, we come up with two novel loss designs, namely the bipartite consistency loss and the entity-relation linking loss. The former pulls closer the predicted boundary/type distributions of entities and head/tail entities, while the latter enforces interaction between entity queries and relation queries. Extensive experiments demonstrate the advantage of BiSPN in knowledge-intensive scenarios, as well as the effectiveness of the proposed bipartite consistency loss and entity-relation linking loss.

Limitations
As mentioned in Section 4.6.2, the performance of our BiSPN framework can be hindered when the distribution of relation triples in a dataset is overly sparse. This suggests that BiSPN is a better choice for the biomedical and clinical domains than for the general domain, where knowledge sparsity is common.
Another limitation of BiSPN is its reliance on a fixed number of entity/relation queries. Although it is possible to set the number of queries larger so that BiSPN generalizes to longer text inputs, the cost is additional memory and time consumption that grows quadratically. To address this effectively, future work can draw lessons from the field of dynamic neural networks and explore dynamic selection of instance queries.

A Hyperparameter Tuning

For each hyperparameter, we select the value that results in the highest relation F1 on the development set (for ADE, the validation set of its first fold is employed as the development set). After trial, we find it optimal to set the numbers of shared decoder layers, entity decoder layers and relation decoder layers to 4, 1 and 1 for all datasets. The trial intervals and final configuration of the other hyperparameters are shown in Table 5.

B Experiment Results on the SciERC dataset
Here, we append the results of the additional experiment on the SciERC dataset. As shown in Table 6, in terms of entity recognition, BiSPN underperforms the SOTA two-stage method PL-Marker in Ent-F1, but still substantially outperforms SOTA one-stage methods (PFN, UniRE). In terms of relation triple extraction, BiSPN establishes a new SOTA Rel-F1 on SciERC. In terms of inference speed, BiSPN is faster than PL-Marker but slower than the other one-stage methods. The results after ablating the consistency loss and the linking loss verify their effectiveness on this dataset.

Figure 1: The target output of joint entity and relation extraction is essentially an entity set and a relation set that should be consistent with each other. Such coherence is difficult to guarantee when generating the entity/relation sets in parallel. Our work addresses this challenge.

Figure 2: An overview of BiSPN, the proposed joint entity and relation extraction framework, which is capable of generating the target entity set and relation set coherently in one pass.

Figure 3: Visualization of attention between entity queries and relation queries for a sample from BioRelEx. The input text of this sample is "Moreover, the in vitro binding of NF-κB or Sp1 to its target DNA was not affected by the presence of K-12".

Figure 4: Cases from ACE05 and BioRelEx. False negative predictions are in blue.

Table 2: Main results on the ACE05, BioRelEx, ADE and Text2DT datasets. Bold indicates the best score and underline indicates the second-best score.

Table 4: Relation F1 of PL-Marker and BiSPN on versions of ACE05 filtered by the minimum number of relations per sentence. [#train, #test] gives the numbers of samples remaining in each filtered version.

                 ≥0 relations   ≥1 relation   ≥2 relations
[#train, #test]  [10051, 2050]  [2643, 597]   [1203, 302]
PL-Marker        66.5           63.8          57.9
BiSPN            65.5 (-1.0)    63.4 (-0.4)   58.2 (+0.3)

Table 5: Configuration of hyperparameters. lr represents the initial learning rate. $M_e$ and $M_r$ are the numbers of entity queries and relation queries respectively. $\alpha$ and $\beta$ are the weights of the bipartite consistency loss and the entity-relation linking loss respectively.

Table 6: Results on the SciERC dataset.