Synchronous Dual Network with Cross-Type Attention for Joint Entity and Relation Extraction

Joint entity and relation extraction is challenging due to the complex interaction between named entity recognition and relation extraction. Although most existing works jointly train the two tasks through a shared network, they fail to fully exploit the interdependence between entity types and relation types. In this paper, we design a novel synchronous dual network (SDN) with cross-type attention that considers entity types and relation types both separately and interactively. On the one hand, SDN adopts two isomorphic bi-directional type-attention LSTMs to encode entity type enhanced representations and relation type enhanced representations, respectively. On the other hand, SDN explicitly models the interdependence between entity types and relation types via a cross-type attention mechanism. In addition, we propose a new multi-task learning strategy that models the interaction of the two types of information. Experiments on the NYT and WebNLG datasets verify the effectiveness of the proposed model, which achieves state-of-the-art performance.


Introduction
Joint entity and relation extraction is a fundamental and important task in information extraction, providing necessary information for knowledge base construction (Mesquita et al., 2019), question answering (Yu et al., 2017), dialogue systems (Xu et al., 2019), and more. This task can be decomposed into two subtasks, named entity recognition (NER) and relation extraction (RE), which respectively detect text spans with specific types (entities) and semantic relations among those spans (relations) in unstructured text.
Early studies employ pipeline models (Zelenko et al., 2002; Zhou et al., 2005; Chan and Roth, 2011), which first extract all entities in the sentence with an entity model and then feed the extracted entities into a relation model to identify semantic relations between entity pairs. However, pipeline models disregard the correlation between NER and RE (Li and Ji, 2014).
In recent years, joint learning models have gained increasing attention. Among them, multi-task learning methods (Miwa and Bansal, 2016; Katiyar and Cardie, 2017; Fu et al., 2019) are popular: they utilize a shared network to learn common features but make independent predictions for the two tasks. Later, Sun et al. (2020) propose a new multi-task learning method to dynamically learn the interactions between the two tasks, and Lin et al. (2020) propose a joint neural framework, OneIE, to study the interaction of different feature categories through a set of global feature templates. Other methods such as novel tagging schemes (Zheng et al., 2017; Wei et al., 2020) and generative models (Zeng et al., 2018, 2019; Nayak and Ng, 2020; Ye et al., 2021) adopt a unified model to directly extract relational triplets. Although these methods are effective for joint entity and relation extraction, they only apply a shared network or a unified model to capture the connection between NER and RE, without taking into account the interdependence of entity types and relation types.
Intuitively, the relation types of relational triplets are not only relevant to the textual context and entities, but also to entity types (Peng et al., 2020).
Conversely, the entity types of subject and object entities are also constrained by the relation types in the triplets. For example, the instance in Figure 1 contains three entities: "Jackie R. Brown" (PER), "Washington" (LOC), and "United States of America" (LOC). The relation type of ("Jackie R. Brown", "Washington") and ("Jackie R. Brown", "United States of America") is Birth_Place, while the relation type of ("Washington", "United States of America") is Capital_of. When the subject entity type is PER and the object entity type is LOC, the relation type between them may be Birth_Place, but never Capital_of. Conversely, when the relation type is Capital_of, the corresponding entity types do not include PER. In addition, although Sun et al. (2019) adopt a graph convolutional network to handle the joint type inference problem, they fail to discuss the fine-grained correlation between entity types and relation types.
In this paper, we propose a Synchronous Dual Network (SDN) with cross-type attention to separately and interactively capture the vector representations related to entity types and relation types. First, SDN adopts two isomorphic bi-directional type-attention LSTMs to learn two different feature representations: one is the entity type enhanced representation and the other is the relation type enhanced representation. Then the entity type information and the relation type information are introduced into the relation type enhanced representations and the entity type enhanced representations, respectively, to explicitly model the interaction between entity types and relation types via a cross-type attention mechanism. These type-related representations are concatenated for NER and RE via a multi-task learning strategy. The above type-attention LSTM is a general structure for selecting the preferred type distribution. The main idea is to inject all possible types simultaneously via multiple type-related cells built on the standard LSTM, so that our model obtains the preferred type information by training auxiliary tasks.
To summarize, the main contributions of this work are as follows: (1) We design a general type-attention LSTM structure to inject all possible type information, which captures the preferred type features by training the corresponding auxiliary task.
(2) We propose a novel synchronous dual network with cross-type attention, which adopts a cross-type attention mechanism to explicitly model the interdependence between entity types and relation types and to capture the vector representations related to them.
(3) Experiments on two public datasets verify the effectiveness of the multi-task learning strategy that fuses the interaction of the two types of information.

Related Work
Multi-Task Learning Model. Some multi-task learning models (Miwa and Bansal, 2016; Katiyar and Cardie, 2017) learn shared features through parameter sharing and then use them for the two subtasks of entity recognition and relation extraction. Building on this, Sun et al. (2018) enhance the interaction between the two subtasks by optimizing a global loss function, and Sun et al. (2019) apply a graph convolutional network to handle the interaction in type inference. These approaches essentially belong to a pipeline decoder: they first extract entities and then identify the relations of the predicted entities. Later, Fu et al. (2019) and others perform joint learning via a shared network while making independent predictions for each task. Zeng et al. (2020) propose CopyMTL, a multi-task learning framework that enhances the capability of handling multi-token entities. However, these approaches rest on the strong assumption that a shared network is sufficient to capture the correlations between the tasks.
Tagging Model. Zheng et al. (2017) first convert the joint extraction task into a sequence labeling problem and propose a unified tagging scheme. Later, Wei et al. (2020) and others propose different tagging schemes, and Yuan et al. (2020) propose a relation-attentive sequence labeling framework. Yu et al. (2020) adopt a novel decomposition strategy that first recognizes head entities and then extracts the corresponding tail entities and relations.
Generative Model. Zeng et al. (2018) first propose a sequence-to-sequence model with a copy mechanism to generate relational triplets, but it fails to generate multi-word entities. Subsequently, Zeng et al. (2019), Nayak and Ng (2020), and Sui et al. (2020) adopt different encoder-decoder architectures to generate relational triplets. Ye et al. (2021) propose contrastive triple extraction with a generative transformer.

Figure 2: Overall structures. The red parts in (a) represent the calculation of the type-attention mechanism and "TC_m" represents the m-th type-related cell. "ETC_p" and "RTC_q" in (b) represent the p-th entity type cell and the q-th relation type cell, respectively. ⊕ denotes the concatenation operation.

Compared with these works, we propose a new multi-task learning strategy that explicitly models the interdependence between entity types and relation types within one relation, synchronizing their information and making them mutually beneficial.

Type-Attention LSTM
In this section, we introduce the general framework of the Type-Attention LSTM (TA-LSTM). As shown in Figure 2 (a), each TA-LSTM unit is composed of an LSTM unit and a type-attention unit. For each token $w_t$, the LSTM unit obtains the contextual representation $h^c_t$, while the type-attention unit uses scaled dot-product attention (Vaswani et al., 2017) to obtain the type representation $h^l_t$ by integrating $m$ type-related cells that control the type information flow.
We adopt the standard LSTM (Graves and Schmidhuber, 2005) to encode the contextual representation of each token. At each time step $t$ ($t \in [1, \ldots, n]$), the current hidden state $h^c_t$ based on a memory cell $c_t$ is calculated as follows:

$$i_t = \sigma(W_i[x_t; h^c_{t-1}] + b_i), \quad f_t = \sigma(W_f[x_t; h^c_{t-1}] + b_f), \quad o_t = \sigma(W_o[x_t; h^c_{t-1}] + b_o)$$

$$\tilde{c}_t = \tanh(W_c[x_t; h^c_{t-1}] + b_c), \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad h^c_t = o_t \odot \tanh(c_t)$$

where $[W; b]$ are trainable parameters, $x_t \in \mathbb{R}^{d_w}$ is the word embedding of token $w_t$, $\sigma$ is the sigmoid activation function, and $\odot$ denotes element-wise multiplication. $i_t$, $o_t$, and $f_t$ represent the input, output, and forget gates, respectively. The hidden state $h^c_t \in \mathbb{R}^{d_e}$ represents the vector representation with context information.
Type-Attention Mechanism. Given $x_t$ and $h_{t-1}$, the key-value pair of the $k$-th ($k \in [1, \ldots, m]$) type-related cell is computed as:

$$k^{(t)}_k = \sigma(W^K_k[x_t; h_{t-1}] + b^K_k), \quad v^{(t)}_k = \sigma(W^V_k[x_t; h_{t-1}] + b^V_k) \qquad (1)$$

where $[W_k; b_k]$ represent the trainable parameters specific to the $k$-th type-related cell and $\sigma$ is the sigmoid activation function. $k^{(t)}_k \in \mathbb{R}^{d_e}$ and $v^{(t)}_k \in \mathbb{R}^{d_e}$ represent the key and value of the $k$-th type-related cell, respectively.
The above operations are repeated for the $m$ type-related cells, so at time step $t$ we obtain a set of key-value pairs $K^{(t)} = [k^{(t)}_1, \ldots, k^{(t)}_m]$ and $V^{(t)} = [v^{(t)}_1, \ldots, v^{(t)}_m]$. We regard the contextual representation $h^c_t$ as the query. The scaled dot-product attention first computes the dot products of the query with all corresponding keys, divides each by $\sqrt{d_e}$, and applies a softmax function to obtain the weights $\alpha^{(t)}$ on the values:

$$\alpha^{(t)} = \mathrm{softmax}\!\left(\frac{h^c_t\, {K^{(t)}}^{\top}}{\sqrt{d_e}}\right) \qquad (2)$$

Then the type representation $h^l_t$ is computed as a weighted sum of the values:

$$h^l_t = \sum_{k=1}^{m} \alpha^{(t)}_k v^{(t)}_k \qquad (3)$$

Finally, the hidden state $h_t \in \mathbb{R}^{d_e}$ of each TA-LSTM unit is computed as:

$$h_t = h^c_t + h^l_t \qquad (4)$$
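To make the recurrence concrete, the following is a minimal PyTorch sketch of one TA-LSTM step under Equations (1)-(4). The class and variable names are ours rather than from a released implementation, and the additive fusion in the final line reflects our reading of Equation (4).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TALSTMCell(nn.Module):
    """One type-attention LSTM step: a standard LSTM cell plus a scaled
    dot-product attention over m type-related cells (Equations 1-4)."""

    def __init__(self, d_w: int, d_e: int, m: int):
        super().__init__()
        self.lstm = nn.LSTMCell(d_w, d_e)          # contextual part
        # Per-cell key/value projections over [x_t; h_{t-1}] (Eq. 1).
        self.key = nn.Linear(d_w + d_e, m * d_e)
        self.value = nn.Linear(d_w + d_e, m * d_e)
        self.m, self.d_e = m, d_e

    def forward(self, x_t, h_prev, c_prev):
        h_c, c_t = self.lstm(x_t, (h_prev, c_prev))  # h_t^c (context)
        z = torch.cat([x_t, h_prev], dim=-1)
        K = torch.sigmoid(self.key(z)).view(-1, self.m, self.d_e)
        V = torch.sigmoid(self.value(z)).view(-1, self.m, self.d_e)
        # Scaled dot-product attention with h_t^c as the query (Eq. 2).
        scores = torch.einsum('bd,bmd->bm', h_c, K) / self.d_e ** 0.5
        alpha = F.softmax(scores, dim=-1)
        h_l = torch.einsum('bm,bmd->bd', alpha, V)   # h_t^l (Eq. 3)
        h_t = h_c + h_l                              # fused state (Eq. 4)
        return h_t, c_t, alpha
```

A bi-directional TA-LSTM is then obtained in the usual way, by running two such cells over the sentence in opposite directions and concatenating their states.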

Synchronous Dual Network with Cross-Type Attention
Given a sentence $s = [w_1, \ldots, w_n]$ of $n$ words, the joint entity and relation extraction task aims to identify a collection of relation triplets $T = \{(e_i, r, e_j) \mid e_i, e_j \in E, r \in R\}$, where $e_i$, $e_j$, and $r$ represent the subject entity, the object entity, and their semantic relation, respectively. Note that the subject and object entities belong to the set of entities $E = \{e_i\}_{i=1}^{P}$ existing in the sentence, and the relation is selected from a pre-defined set $R$ of $M$ valid relation types.

In this section, we describe how to build a synchronous dual network on top of the proposed TA-LSTM and how entity types and relation types interact through the cross-type attention mechanism for joint entity and relation extraction. As shown in Figure 2 (b), the synchronous dual network adopts two bi-directional TA-LSTMs to encode vector representations with the corresponding type information from the perspectives of entity types and relation types via synchronous dual learning. Synchronous dual learning then adopts the cross-type attention mechanism to model the interaction between entity types and relation types by introducing entity type information into relation type learning and relation type information into entity type learning. Finally, the type-enhanced vector representations are concatenated for joint learning.

Synchronous Dual Learning
We design entity type learning and relation type learning to obtain the entity type enhanced representation $h^e_t$ and the relation type enhanced representation $h^r_t$, respectively.

Entity Type Learning. To encode the entity type information, we first define $p$ entity type cells (ETCs) according to the number of entity types. Following Jia and Zhang (2020), we regard each entity type as a label, such as PER, LOC, ORG, and so on. Entity type learning mainly aims at extracting entity type knowledge from a set of labeled training data by training the entity type cells.
From Equations (1) and (2), we can acquire the attention weights $\alpha^{(t)}$ over the $p$ entity type cells. To explicitly highlight entity type knowledge, we design entity type prediction as an auxiliary task, as shown in Figure 3 (a). Given a sentence $s$ and its corresponding entity types $T^e = [T^e_1, \ldots, T^e_n]$ ($T^e_t \in [\text{O}, \text{PER}, \text{LOC}, \text{ORG}, \ldots]$ and $t \in [1, \ldots, n]$), we regard entity type prediction as a sequence tagging problem by converting the attention scores into the aligned entity type distribution for $w_t$:

$$P(T^e_t = l \mid s) = \alpha^{(t)}_l, \quad l \in [1, \ldots, p]$$

And the negative log-likelihood loss is used for training on the sentence $s$:

$$\mathcal{L}_{ET} = -\sum_{t=1}^{n} \log P(T^e_t = \hat{T}^e_t \mid s)$$

where $\hat{T}^e_t$ represents the gold entity type label.

Relation Type Learning. Similar to entity type learning, $q$ new relation types are designed to correspond to entity types, considering the $M$ valid relation types of the pre-defined set $R$ and the triplet composition. As shown in Figure 3 (b), these new relation types are composed of relation labels and subject or object labels, namely "R_1-S", "R_1-O", ..., "R_M-S", "R_M-O", and "None" ("None" represents the invalid relation type). Thus, the number of new relation types is $q = 2M + 1$.
These new relation types are mainly used to learn the distribution of subject or object types affected by valid relation types. Thus, we design a relation-related TA-LSTM as the other branch of the synchronous dual network to extract the new relation type knowledge.
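As a concrete illustration of this label construction, here is a minimal Python sketch; the relation names in the usage example are placeholders, not labels from the datasets.

```python
def build_relation_type_labels(relations):
    """Build the q = 2 * M + 1 new relation type labels: a subject ("-S")
    and an object ("-O") label per valid relation type, plus "None"."""
    labels = ["None"]
    for r in relations:          # the M valid relation types in R
        labels += [f"{r}-S", f"{r}-O"]
    return labels

# Hypothetical relation names, for illustration only:
print(build_relation_type_labels(["Birth_Place", "Capital_of"]))
# ['None', 'Birth_Place-S', 'Birth_Place-O', 'Capital_of-S', 'Capital_of-O']
```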
At time step $t$, we can likewise acquire the contextual representation and the attention weights $\tilde{\alpha}^{(t)}$ over the $q$ relation type cells through Equations (1)-(4).
Different from entity type prediction, the second auxiliary task, relation type prediction, is a multi-label classification problem, because the same entity may exist in multiple relational triplets (namely overlapping triplets). We adopt multiple identical binary classifiers to detect different relation types by assigning each token a binary tag (0/1) that indicates whether the current token corresponds to a new relation type. So the aligned relation type distribution for $w_t$ is computed as:

$$P(T^r_l = 1 \mid s) = \sigma(\tilde{\alpha}^{(t)}_l), \quad l \in [1, \ldots, q]$$

where $T^r_l$ belongs to the $q$ new relation type labels, such as "R_1-S", "R_1-O", and so on.

Figure 3: The two auxiliary tasks (some words are omitted for simplicity). Entity type prediction and relation type prediction are regarded as different sequence tagging tasks to learn entity types and relation types, respectively. "R1" and "Rn" represent relation labels; "R1-S" and "R1-O" represent the labels of the subject and object entities in the relation triplet (S, R1, O), respectively.
The binary cross-entropy loss is used for training on the sentence $s$:

$$\mathcal{L}_{RT} = -\sum_{t=1}^{n}\sum_{l=1}^{q}\left[\hat{T}^r_{t,l}\log P(T^r_l = 1 \mid s) + (1-\hat{T}^r_{t,l})\log\left(1 - P(T^r_l = 1 \mid s)\right)\right]$$

where $\hat{T}^r_{t,l}$ represents the gold relation type label.
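A minimal sketch of this multi-label objective, assuming the relation-cell attention scores are squashed by a sigmoid as in the formulation above; the function and tensor names are ours.

```python
import torch
import torch.nn.functional as F

def relation_type_loss(rel_scores, gold_tags):
    """Per-token multi-label relation type loss (a sketch of L_RT).

    rel_scores: (n, q) attention scores of the q relation type cells.
    gold_tags:  (n, q) binary tags; gold_tags[t, l] = 1 iff token t
                carries the l-th new relation type label.
    """
    probs = torch.sigmoid(rel_scores)   # independent binary decisions
    return F.binary_cross_entropy(probs, gold_tags.float(), reduction='sum')
```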

Cross-Type Attention Mechanism
Since entity types and relation types are interdependent in the joint entity and relation extraction task, it is important to synchronize their information and make them mutually beneficial. We therefore propose a novel cross-type attention mechanism to model the interaction between entity types and relation types. Given the entity type enhanced representation $h^e_t$ and the set of key-value pairs $\tilde{K}^{(t)}$ and $\tilde{V}^{(t)}$ related to the new relation types, the relation-entity representation $c^e_t$ is computed by Equation (3), introducing relation type information into entity type learning.
Similarly, given $h^r_t$ and the entity type key-value pairs $K^{(t)}$ and $V^{(t)}$, we can obtain the entity-relation representation $c^r_t$. Finally, we obtain the new entity type enhanced representation $\tilde{h}^e_t$ and the new relation type enhanced representation $\tilde{h}^r_t$, which model the interdependence between entity types and relation types, by adding $c^e_t$ and $c^r_t$ into $h^e_t$ and $h^r_t$, respectively.
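A small sketch of this mechanism, showing one direction; the function and variable names are ours.

```python
import torch
import torch.nn.functional as F

def cross_type_attention(query, keys, values, d_e):
    """Scaled dot-product attention in the form of Equation (3), reused
    across the two streams: e.g. query = h_t^e from the entity stream and
    keys/values = the relation type cells' key-value pairs."""
    scores = torch.einsum('bd,bmd->bm', query, keys) / d_e ** 0.5
    alpha = F.softmax(scores, dim=-1)
    return torch.einsum('bm,bmd->bd', alpha, values)

# Relation-aware entity representation (sketch):
#   c_e = cross_type_attention(h_e, K_rel, V_rel, d_e)
#   h_e_new = h_e + c_e   # our reading of "adding c_t^e into h_t^e"
```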

Joint Entity and Relation Extraction
We first concatenate the vector representations $\tilde{h}^e_t$ and $\tilde{h}^r_t$ to obtain the overall representation $\tilde{h}_t$:

$$\tilde{h}_t = \tilde{h}^e_t \oplus \tilde{h}^r_t$$

Then the overall sequence representation $\tilde{H} = [\tilde{h}_1, \ldots, \tilde{h}_n]$ is used for NER and RE.
Named Entity Recognition. NER is a typical sequence labeling task. Here we use the BIESO tagging scheme to recognize entity boundaries accurately. Given a sentence $s$, the probability distribution $y_t$ of a word $w_t$ over these five labels is calculated based on the overall representation $\tilde{h}_t \in \tilde{H}$ as follows:

$$y_t = \mathrm{softmax}(W^e \tilde{h}_t + b^e)$$

where $[W^e; b^e]$ are trainable parameters. The negative log-likelihood loss is used for training on the sentence $s$:

$$\mathcal{L}_{NER} = -\sum_{t=1}^{n} \log P(\hat{y}_t \mid w_t)$$

where $\hat{y}_t$ is the gold boundary label. A decoding sketch for the BIESO scheme follows.
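For concreteness, here is a small helper that decodes a BIESO boundary tag sequence into entity spans; this is a sketch, and ill-formed sequences are simply skipped.

```python
def decode_bieso(tags):
    """Decode a BIESO tag sequence into entity spans [start, end]
    (inclusive indices)."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == 'S':                    # single-token entity
            spans.append((i, i)); start = None
        elif tag == 'B':                  # entity begins
            start = i
        elif tag == 'E' and start is not None:
            spans.append((start, i)); start = None
        elif tag != 'I':                  # 'O' or a stray tag resets
            start = None
    return spans

print(decode_bieso(['B', 'I', 'E', 'O', 'S']))  # [(0, 2), (4, 4)]
```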
Relation Extraction. Following previous work, we classify all relations between pairs of words in the sentence based on the overall representation $\tilde{H}$. Similar to relation type prediction, the relation extraction task is a multi-label classification problem, so we design multiple identical binary classifiers to detect different relations. The probability distribution $y^r_{i,j}$ of the word pair $(w_i, w_j)$ belonging to the relation $r \in R$ is computed as follows:

$$y^r_{i,j} = \sigma\left(W^r \phi\left(W^m [\tilde{h}_i; \tilde{h}_j] + b^m\right) + b^r\right)$$

where $[W^m; b^m; W^r; b^r]$ are trainable parameters and $\phi$ represents the ReLU activation function. The binary cross-entropy loss is used for training on the sentence $s$:

$$\mathcal{L}_{RE} = -\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{r \in R}\left[\hat{y}^r_{i,j}\log y^r_{i,j} + (1-\hat{y}^r_{i,j})\log\left(1-y^r_{i,j}\right)\right]$$

where $\hat{y}^r_{i,j}$ represents the gold relation label.
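A hedged sketch of this word-pair scorer; the module and parameter names are ours, and the exact parameterization in the paper may differ.

```python
import torch
import torch.nn as nn

class PairRelationScorer(nn.Module):
    """Word-pair relation scoring sketch: an MLP over the concatenated
    overall representations of (w_i, w_j), with one sigmoid per relation."""

    def __init__(self, d_h: int, d_m: int, num_rel: int):
        super().__init__()
        self.W_m = nn.Linear(2 * d_h, d_m)     # pair projection
        self.W_r = nn.Linear(d_m, num_rel)     # per-relation logits

    def forward(self, H):                       # H: (n, d_h)
        n = H.size(0)
        pairs = torch.cat([H.unsqueeze(1).expand(n, n, -1),
                           H.unsqueeze(0).expand(n, n, -1)], dim=-1)
        hidden = torch.relu(self.W_m(pairs))    # phi = ReLU
        return torch.sigmoid(self.W_r(hidden))  # y^r_{i,j}: (n, n, |R|)
```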

Training
Given a training instance $s$, the overall training objective of joint entity and relation extraction is:

$$\mathcal{L} = \lambda_{t1}\mathcal{L}_{ET} + \lambda_{t2}\mathcal{L}_{RT} + \lambda_{e}\mathcal{L}_{NER} + \lambda_{r}\mathcal{L}_{RE} + \lambda\lVert\Theta\rVert_2^2$$

where $\lambda_{t1}$, $\lambda_{t2}$, $\lambda_e$, and $\lambda_r$ represent the different task weights, $\lambda$ is the $L_2$ regularization coefficient, and $\Theta$ represents the parameter set.
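A minimal sketch of the weighted objective; the default weight values here are placeholders rather than the paper's tuned settings.

```python
def joint_loss(l_et, l_rt, l_ner, l_re, params,
               lam_t1=1.0, lam_t2=1.0, lam_e=1.0, lam_r=1.0, lam_l2=1e-5):
    """Weighted multi-task objective with L2 regularization (sketch).

    params: an iterable of trainable tensors (e.g. model.parameters()).
    """
    l2 = sum((p ** 2).sum() for p in params)   # ||Theta||^2
    return (lam_t1 * l_et + lam_t2 * l_rt +
            lam_e * l_ner + lam_r * l_re + lam_l2 * l2)
```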

Inference
Following Fu et al. (2019), we adopt the average prediction method to decide whether an extracted triplet is correct. Concretely, we first obtain the entity set $E$ from named entity recognition. Then, given the subject entity $e_i = [w_{\xi_i}, \ldots, w_{\zeta_i}]$ and the object entity $e_j = [w_{\xi_j}, \ldots, w_{\zeta_j}]$, the probability $p^r$ that they hold the $r$-th relation type is calculated as follows:

$$p^r = \frac{1}{|e_i||e_j|}\sum_{u=\xi_i}^{\zeta_i}\sum_{v=\xi_j}^{\zeta_j} y^r_{u,v}$$

where $|e_i|$ and $|e_j|$ represent the lengths of $e_i$ and $e_j$, respectively. The triplet $(e_i, r, e_j)$ is extracted only if $p^r > \theta$, where $\theta$ is a free threshold parameter.
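A sketch of this inference step on top of the word-pair probabilities; the names are ours.

```python
def score_triplet(y, subj_span, obj_span, r):
    """Average prediction over all token pairs of a candidate triplet.

    y: (n, n, |R|) word-pair probability tensor from the RE head.
    subj_span, obj_span: inclusive [start, end] token indices of e_i, e_j.
    """
    si, sj = subj_span
    oi, oj = obj_span
    block = y[si:sj + 1, oi:oj + 1, r]   # all (u, v) pairs for relation r
    return block.mean().item()            # p^r; keep triplet iff p^r > theta
```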

Experiments
In this section, we empirically verify the effectiveness of our proposed SDN on two public datasets.
In addition, we investigate how different components of the model impact the performance of joint entity and relation extraction under different settings.

Experimental Settings
Datasets. We conduct experiments on two public datasets, NYT (Riedel et al., 2010) and WebNLG (Gardent et al., 2017). The NYT dataset is produced by distant supervision, which automatically aligns Freebase with New York Times news articles; it includes 3 entity types (PER, LOC, and ORG) and 24 valid relation types. The WebNLG dataset was originally created for the natural language generation task: given a group of triplets, annotators are asked to write a sentence that contains the information of all triplets in the group. We directly use the version preprocessed by Zeng et al. (2018), which includes 15 entity types and 246 valid relation types. The statistics of NYT and WebNLG are described in Table 1.
Hyperparameters. We initialize the word embeddings with GloVe 300-dimension vectors (Pennington et al., 2014). The dimensions of the hidden states for the TA-LSTM, the entity extraction module, and the relation extraction module are set to 100, 200, and 400, respectively. During training, we use the Adam optimizer with initial learning rates of 1e-3 on NYT and 5e-4 on WebNLG, a maximum of 100 epochs, and a batch size of 30. To avoid overfitting, we apply dropout with a rate of 0.3. In SDN CROSS-TA-LSTM + BERT and SDN TA-LSTM + BERT, we use bert-base-cased to initialize BERT (Devlin et al., 2018), and the initial learning rate for fine-tuning BERT is 1e-5.
Baselines and Evaluation Metrics. We compare our method against three groups of state-of-the-art joint learning methods: i) Multi-task baselines. MRT (Sun et al., 2018) applies a minimum risk training method to highlight the connections between an entity model and a relation model. RIN (Sun et al., 2020) uses a recurrent interaction network to dynamically learn the interactions between the two tasks. SMHSA uses an attention-based joint model to identify overlapping triplets. CopyMTL-One and CopyMTL-Mul (Zeng et al., 2020) utilize a multi-task framework to extract multi-token entities.
ii) Tagging baselines. PA-LSTM-CRF (Dai et al., 2019) uses a position-attention mechanism to model $n$ tag sequences. RSAN (Yuan et al., 2020) utilizes a relation-attentive sequence labeling framework. TPLinker (Wang et al., 2020) designs a handshaking tagging scheme for joint entity and relation extraction.
iii) Generative baselines. OneDecoder and MultiDecoder (Zeng et al., 2018) utilize a Seq2Seq model to generate relational triplets. PNDec (Nayak and Ng, 2020) uses pointer networks within an encoder-decoder model. CGT (Ye et al., 2021) adopts a framework of contrastive triple extraction with a generative transformer.
We adopt Precision (P), Recall (R), and standard micro-F1 ($F_1$) to evaluate performance. A predicted triplet is regarded as correct only if its relation type and the two corresponding entities all match the gold-standard annotation. We report results on the test set at the point where the development set achieves its best result.
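For clarity, this exact-match micro metric can be sketched as follows.

```python
def micro_prf(pred_triplets, gold_triplets):
    """Exact-match micro P/R/F1 over relational triplets: a prediction
    counts only if entities and relation type all match the gold."""
    pred, gold = set(pred_triplets), set(gold_triplets)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(micro_prf({("A", "r1", "B")}, {("A", "r1", "B"), ("C", "r2", "D")}))
# (1.0, 0.5, 0.6666666666666666)
```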

Main Results
As shown in Table 2, SDN CROSS-TA-LSTM is significantly superior to the multi-task baselines (by at least 6.2% P, 0.7% R, and 3.4% $F_1$ on NYT, and 13.5% P, 12.0% R, and 12.8% $F_1$ on WebNLG). This shows that explicitly modeling the interdependence of entity types and relation types in the NER and RE tasks yields more useful representations for joint entity and relation extraction.
Without the pre-trained language model BERT, SDN CROSS-TA-LSTM achieves better performance than the other tagging and generative baselines (by at least 0.8% P, 2.6% R, and 3.5% $F_1$ on NYT, and 5.0% R and 3.4% $F_1$ on WebNLG). This indicates that our model can predict as many accurate triplets as possible while maintaining higher precision.
Generally speaking, generative triple extraction requires a large amount of manual annotation or further constraints to improve precision, whereas our model focuses on extracting more relation triplets. This underlines the importance of entity type and relation type information for joint entity and relation extraction: injecting effective entity and relation type information through the interaction of entity types and relation types is promising for this task.
To more clearly illustrate the effectiveness of our method, we compare several variants: (i) SDN LSTM, which only uses bi-directional LSTMs as encoders, without any type knowledge or cross-type interaction; (ii) SDN TA-LSTM, which uses two bi-directional TA-LSTMs to encode the two kinds of type information, introducing entity and relation type knowledge without cross-type interaction; (iii) SDN CROSS-TA-LSTM, which introduces the type knowledge and also models the interaction between entity and relation types; (iv) SDN TA-LSTM + BERT, which adds BERT to (ii); and (v) SDN CROSS-TA-LSTM + BERT, which adds BERT to (iii). SDN TA-LSTM outperforms SDN LSTM by 2.7% $F_1$ on NYT and 3.3% $F_1$ on WebNLG, indicating the effectiveness of TA-LSTM for capturing more type information. SDN CROSS-TA-LSTM outperforms SDN TA-LSTM by over 1.3% $F_1$ on NYT and over 2.8% $F_1$ on WebNLG, which shows that explicitly modeling the interaction between entity and relation types within one relation can further improve performance. In addition, SDN CROSS-TA-LSTM + BERT is significantly superior to SDN TA-LSTM + BERT (by over 1.5% $F_1$ on NYT and 0.8% $F_1$ on WebNLG), again indicating the effectiveness of the cross-type attention mechanism.

Ablation Experiments
We conduct ablation experiments on the auxiliary tasks using the NYT dataset. The results are listed in Table 3.
When we ablate only $\mathcal{L}_{ET}$, the results on the two subtasks decline significantly (−1.5% $F_1$ on NER and −1.4% $F_1$ on RE, respectively). When we ablate only $\mathcal{L}_{RT}$, the results on NER and RE also decline significantly (over −0.9% $F_1$ on both subtasks). When we ablate both $\mathcal{L}_{ET}$ and $\mathcal{L}_{RT}$, the declines are even larger (−3.4% $F_1$ on NER and −4.6% $F_1$ on RE, respectively). On the one hand, this demonstrates that the two auxiliary tasks of entity type prediction and relation type prediction can highlight useful type information and decrease noise. On the other hand, it indicates that our model relies heavily on both auxiliary tasks to capture accurate type knowledge.

Analysis of Inference Threshold
We conduct analysis experiments to explore the threshold $\theta$ of the inference method. Figure 4 illustrates how performance varies with the inference threshold $\theta$ on the NYT and WebNLG datasets. It can be seen that the threshold effectively adjusts the performance of SDN under different choices of $\theta$. As $\theta$ increases from 0.1 to 0.6, the $F_1$ scores on NYT and WebNLG first increase and then decrease. When $\theta$ is set to 0.4 on NYT and 0.3 on WebNLG, the average inference method achieves the best performance on both datasets. Through a detailed analysis of NYT and WebNLG, we find that this gap stems from the difference in entity length: the average and maximum entity lengths on the NYT test set are 1.4 and 8, respectively, while those on the WebNLG test set are 2.2 and 15. Longer entities make it harder to identify entities and relations in the NER and RE tasks.

Table 4: Examples from the NYT test set. Red, blue, and green represent entities whose types are PER, LOC, and ORG, respectively; incorrect relational triplets are marked.

Case Study
As shown in Table 4, we present two examples from the NYT test set as illustrations. The relation type of "(Gary Chaison (PER), Worcester (LOC))" is "/people/person/place_lived". RIN only identifies "(Gary Chaison, /business/person/company, Clark University)" and "(Worcester, /location/location/contains, Clark University)", but fails to identify "(Gary Chaison, /people/person/place_lived, Worcester)". In addition, RIN does not recognize "(Sudan, /location/country/administrative_divisions, Darfur)" correctly. This is because RIN dynamically learns the interaction of the two subtasks without considering any type information, so some details may be lost. In contrast, our method correctly extracts all relation triplets, which shows that explicitly modeling the interaction between entity types and relation types can synchronize their information and make them mutually beneficial for joint entity and relation extraction.

Conclusion
In this paper, we propose a Synchronous Dual Network (SDN) with cross-type attention for joint entity and relation extraction. First, we use two isomorphic bi-directional type-attention LSTMs as encoders to learn the entity type enhanced representations and the relation type enhanced representations from two different perspectives. Then the entity type information (or the relation type information) is introduced into the relation type enhanced representations (or the entity type enhanced representations) to explicitly model the interaction between entity types and relation types via the cross-type attention mechanism. In addition, the proposed type-attention LSTM is a general structure for obtaining the preferred type distribution. Experiments on two public datasets verify the effectiveness of the proposed model, which achieves state-of-the-art performance.