TDEER: An Efficient Translating Decoding Schema for Joint Extraction of Entities and Relations

Joint extraction of entities and relations from unstructured texts to form factual triples is a fundamental task in constructing a Knowledge Base (KB). A common method is to decode triples by predicting entity pairs to obtain the corresponding relation. However, it remains challenging to handle this task efficiently, especially for the overlapping triple problem. To address this problem, this paper proposes a novel, efficient entity and relation extraction model called TDEER, which stands for Translating Decoding Schema for Joint Extraction of Entities and Relations. Unlike common approaches, the proposed translating decoding schema regards the relation as a translating operation from subject to objects, i.e., TDEER decodes triples as subject + relation → objects. TDEER can naturally handle the overlapping triple problem, because the translating decoding schema can recognize all possible triples, including overlapping and non-overlapping triples. To enhance model robustness, we introduce negative samples to alleviate error accumulation across stages. Extensive experiments on public datasets demonstrate that TDEER produces competitive results compared with the state-of-the-art (SOTA) baselines. Furthermore, a computational complexity analysis indicates that TDEER is more efficient than powerful baselines. In particular, TDEER is about 2 times faster than the recent SOTA models. The code is available at https://github.com/4AI/TDEER.


Introduction
Extraction of entities and relations from unstructured texts is one of the most essential information extraction tasks. It aims to extract entities and their corresponding semantic relations from unstructured texts, usually presented in a triple form of (subject, relation, object), e.g., (Microsoft, co-founder, Bill Gates). It is also a crucial step in building a large-scale KB and plays an important role in the development of web search (Szumlanski and Gomez, 2010), question answering (Fader et al., 2014), biomedical text mining (Huang and Lu, 2016), etc.
Traditional approaches (Zelenko et al., 2003; Chan and Roth, 2011; Rink and Harabagiu, 2010) handle this task in a pipeline manner, i.e., extracting the entities first and then identifying their relations. The pipeline framework simplifies the extraction task, but it ignores the relevance between entity identification and relation prediction. To address this problem, several joint learning models have been proposed, which can be categorized into feature-based models and end-to-end deep models. Feature-based models (Li and Ji, 2014; Ren et al., 2017) introduce a complex feature engineering process and depend profoundly on Natural Language Processing tools for feature extraction. More recently, end-to-end neural network models (Gupta et al., 2016; Zheng et al., 2017; Zeng et al., 2018; Fu et al., 2019; Wei et al., 2020) have become the mainstream method for relation extraction tasks. Such models utilize learned representations from pre-trained language models and are more promising than manually engineered features.
Growing research interest has been devoted to complicated entity and relation extraction problems, such as the overlapping triple problem. Zeng et al. (2018) summarized the overlapping triple problem into three categories, i.e., Normal, SEO (Single Entity Overlap), and EPO (Entity Pair Overlap), which are depicted in Figure 1. Many methods have been proposed to address the overlapping issue, for instance, the encoder-decoder framework (Zeng et al., 2018) and decomposition approaches (Wei et al., 2020). However, such approaches still suffer from setbacks when handling the overlapping triple problem. More specifically, the encoder-decoder framework can only resolve the one-word entity overlapping problem and fails to handle the multi-word entity overlapping problem, while the decomposition approaches suffer from error accumulation between dependent stages. To address these problems, Wang et al. (2020) presented a one-stage method, TPLinker, which transforms the joint extraction task into a token pair linking problem. TPLinker does not contain any inter-dependent stages, hence it can alleviate error accumulation. However, processing all token pairs at the encoder layers incurs high computational complexity, which is an obstacle for TPLinker when encoding long text.
We present a novel framework, TDEER, which jointly extracts entities and relations via a translating decoding schema to handle the overlapping triple problem. More concretely, TDEER interprets the relation as a translating operation from a subject entity to object entities, i.e., it decodes triples by subject + relation → objects. The proposed translating decoding schema can effectively resolve the overlapping triple problem: TDEER iterates over all pairs of subjects and relations to recognize the objects (or no object), hence all possible triples, including overlapping and non-overlapping ones, are considered. We further propose a negative sample strategy to detect and alleviate error propagation between stages, which enables TDEER to achieve higher results. TDEER is also efficient: it first retrieves all possible relations and entities, then uses the distinguished entities and relations to decode triples. By doing this, the search space is reduced, making it more efficient than previous works. The computational complexity of the proposed translating decoding schema is O(n + sr), where n is the sequence length, s is the number of subjects in the input sentence, and r is the number of relations in the input sentence. Extensive experiments illustrate that TDEER achieves better results than SOTA models on most datasets and is competent in handling the overlapping triple problem.
In summary, our contributions are as follows: (1) We propose a novel translating decoding schema for joint extraction of entities and relations from unstructured texts. (2) TDEER can handle the intractable overlapping triple problem effectively and efficiently. (3) Notably, TDEER is about 2 times faster than the current SOTA models.

Related Work
The pipeline approach and joint approach are the two mainstream methods for extracting entities and relations from unstructured texts.
Traditionally, extracting entities and relations to form triples has been studied as two separate, independent tasks: Named Entity Recognition (NER) and Relation Extraction. Mintz et al. (2009) introduced a distant supervision model, and Hoffmann et al. (2011) used a weak supervision method to extract entities and relations. The features of distant and weak supervision approaches are often derived from Natural Language Processing (NLP) tools, so they suffer from the data labeling errors that inevitably exist in such tools. To address this problem, Zeng et al. (2015) employed a multi-instance learning approach to tackle data labeling errors, and Qin et al. (2018) applied reinforcement learning to the extraction of entities and relations. Although pipeline models produce promising results, they neglect the triple-level dependencies between entities and relations. Recently, Zhong and Chen (2021) presented a pipeline approach incorporating entity information for entity and relation extraction.
To exploit the dependencies between entities and relations, multiple joint extraction models have been proposed. Zheng et al. (2017) introduced a unified tagging scheme and transformed the relation extraction problem into a sequence labeling problem. Zeng et al. (2018) applied a sequence-to-sequence model with a copy mechanism to solve the overlapping triple problem. Trisedya et al. (2019) employed the encoder-decoder framework to jointly extract triples from sentences and map them into an existing KB. Fu et al. (2019) applied graph convolutional networks to jointly learn named entities and relations. Dai et al. (2019) presented a unified joint extraction model that tags entity and relation labels directly according to a query word position, which can simultaneously extract all entities and their types. Wei et al. (2020) proposed a cascade binary tagging framework. Wang et al. (2020) formulated joint extraction as a token pair linking problem.
Moreover, some knowledge representation models (Wang et al., 2014; Tu et al., 2017) have been adopted to refine triple extraction models by scoring candidate facts with knowledge graph embeddings. Although some of them also use the "translation" idea, their function differs from ours: in their setting, the "translation" idea is used to construct rank-based knowledge graph embedding models, which cannot be used to extract entities and relations from texts directly. In our setting, the "translation" idea is applied to end-to-end joint extraction of entities and relations from text.
The proposed translating decoding schema is a novel approach that solves the overlapping triple problem effectively and efficiently, which makes our model fundamentally different from previous works.

Methodology
This paper proposes a three-stage model, TDEER. In the first stage, TDEER uses a span-based entity tagging model to extract all subjects and objects. In the second stage, TDEER employs a multi-label classification strategy to detect all relevant relations. In the third stage, TDEER iterates over the pairs of subjects and relations to identify the corresponding objects via the proposed translating decoding schema. Figure 2 shows the overall framework of TDEER. In the subsequent sections, we describe the three stages of TDEER in detail.

Input Layer
The input of TDEER is a sentence T. We pad each sentence to keep a uniform length n across all sentences. For an LSTM-based model, we first map each word into a k-dimensional continuous space to obtain the word embedding t_i ∈ R^k. Then we concatenate all word vectors into a k × n matrix as the model input: t = [t_1, t_2, ..., t_n]. We employ an LSTM over the embedding matrix to produce the latent semantic feature map X:

X = LSTM(t).    (1)

For a BERT-based model, TDEER extracts the feature map via the pre-trained BERT (Devlin et al., 2019) over the text input:

X = BERT(T).    (2)
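As an illustration of the input layer, the following minimal NumPy sketch pads a sentence to length n and builds the n × k embedding matrix. The toy vocabulary, the random embedding table, and zero-vector padding are assumptions for demonstration only, not the paper's actual preprocessing.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_and_pad(tokens, vocab, emb, max_len):
    """Map words to k-dim vectors and zero-pad the sequence to max_len.

    `vocab` and `emb` are illustrative; the paper's LSTM/BERT encoders
    would consume the resulting matrix to produce the feature map X.
    """
    k = emb.shape[1]
    t = np.zeros((max_len, k))
    for i, tok in enumerate(tokens[:max_len]):
        t[i] = emb[vocab.get(tok, 0)]  # unknown words map to index 0
    return t

vocab = {"<unk>": 0, "Bill": 1, "Gates": 2, "founded": 3, "Microsoft": 4}
emb = rng.normal(size=(len(vocab), 8))   # toy embedding table, k = 8
t = embed_and_pad(["Bill", "Gates", "founded", "Microsoft"], vocab, emb, max_len=6)
print(t.shape)  # (6, 8): padded n x k input matrix
```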

Entity Tagging Model
To obtain entities and their positions efficiently, we adopt a span-based tagging model following prior works (Wei et al., 2020). We apply two binary classifiers to predict the start and end positions of entities, respectively. The operations on each token in a sentence are as follows:

p_i^start = σ(W_start x_i + b_start),    (3)
p_i^end = σ(W_end x_i + b_end),    (4)

where p_i^start and p_i^end stand for the probabilities of recognizing the i-th token in the input sequence as the start and end position of an entity, respectively, x_i is the representation of the i-th token in the feature map X, and σ(·) denotes the sigmoid activation function.
The entity tagging model is trained by minimizing the following loss function:

L_entity = - Σ_{i=1}^{n} ( log p_θ^start(y_i^start) + log p_θ^end(y_i^end) ),    (5)

where p_θ^start is the likelihood for the start positions and p_θ^end is the likelihood for the end positions, with p_θ^t(y_i^t) = (p_i^t)^{y_i^t} (1 - p_i^t)^{1 - y_i^t} for t ∈ {start, end}. We apply the entity tagging model to obtain all subjects and objects in one sentence. Detected subjects and objects are denoted as Ω_s and Ω_o, respectively. Each extracted entity is represented as a tuple (start, end).
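The span-tagging step above can be sketched as follows. The weight names `W_start`/`W_end`, the 0.5 threshold, and the nearest-following-end pairing heuristic are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tag_entities(X, W_start, b_start, W_end, b_end, threshold=0.5):
    """Span-based tagging: two per-token binary classifiers (start / end).

    Decodes (start, end) tuples by pairing each predicted start with the
    nearest following predicted end -- a common heuristic, assumed here.
    """
    p_start = sigmoid(X @ W_start + b_start)   # (n,) start probabilities
    p_end = sigmoid(X @ W_end + b_end)         # (n,) end probabilities
    starts = np.where(p_start > threshold)[0]
    ends = np.where(p_end > threshold)[0]
    spans = []
    for s in starts:
        later = ends[ends >= s]
        if len(later):
            spans.append((int(s), int(later[0])))
    return spans

# Toy check: token 0 fires as a start, token 1 as an end -> span (0, 1).
X = np.eye(4)
W_start = np.array([5.0, 0.0, 0.0, 0.0])
W_end = np.array([0.0, 5.0, 0.0, 0.0])
print(tag_entities(X, W_start, -2.0, W_end, -2.0))  # [(0, 1)]
```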

Relation Detector
In general, more than one relation can be detected in a sentence. For example, there are four relations Star In, Direct Movie, Live In, and Capital Of in the sentence in Figure 2. To identify the relevant relations in a sentence, we adopt a multi-label classification strategy. For the BERT-based/LSTM-based model, we project the "[CLS]" token/last output (LO) representation into a relation-detection space for multi-label classification, as follows:

p^rel = σ(W_rel h + b_rel),    (6)

where h is the "[CLS]" (or last output) representation and σ(·) denotes the sigmoid function.
The relation detector minimizes the following binary cross-entropy loss function to detect relations:

L_rel = - Σ_{i=1}^{m} [ y_i log p_i^rel + (1 - y_i) log(1 - p_i^rel) ],    (7)

where m is the number of predefined relations and y_i ∈ {0, 1} indicates the ground-truth label of the i-th relation. We denote the detected relations in a sentence as Ω_r.
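The multi-label detection step can be sketched as below; `W_rel`, the zero bias, and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def detect_relations(h, W_rel, b_rel, threshold=0.5):
    """Multi-label relation detection from the pooled sentence vector h.

    Returns the index set Omega_r of relations whose sigmoid score
    exceeds the threshold (threshold value is an assumption).
    """
    scores = sigmoid(h @ W_rel + b_rel)   # (num_relations,)
    return set(np.where(scores > threshold)[0].tolist())

h = np.array([1.0, 0.0])             # toy pooled "[CLS]" vector
W_rel = np.array([[5.0, -5.0, 5.0],  # 3 predefined relations
                  [0.0,  0.0, 0.0]])
print(detect_relations(h, W_rel, 0.0))  # {0, 2}
```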

Translating Decoding Schema
We iterate over the pairs of detected subjects Ω_s and relations Ω_r to predict the start positions of objects. For each subject and relation pair, we first combine the representations of the subject and the relation. Next, we use the attention mechanism to obtain a selective representation, which is expected to assign higher weights to possible object positions. Finally, we pass the selective representation to a fully-connected layer to obtain the output, i.e., the start positions of objects. More concretely, for the i-th subject in Ω_s and the j-th relation in Ω_r, TDEER takes the averaged span representation between the start and end tokens of the subject as v_sub^i. TDEER maps the relation into a continuous space with the same feature dimension as v_sub^i to produce the relation embedding vector e_rel^j, then applies a fully-connected layer to encode the relation: v_rel^j = FullyConnect(e_rel^j).
TDEER links subject and relation via an addition operation, as follows:

h^{ij} = v_sub^i + v_rel^j.    (8)

We adopt the addition operation because it is intuitive, and it does not change the tensor shape of the inputs, which is convenient for attention computation. Next, TDEER applies the attention mechanism to obtain the selective representation:

A = softmax( Q K^T / √d_k ),    (9)
H' = A V,    (10)
where d_k is the dimension of the attention key, the query Q is derived from the translated representation, and the keys K and values V are derived from the feature map X. Furthermore, TDEER adopts a binary classifier to identify the start positions of objects given the current subject and relation:

p_i^{obj_start} = σ(W_obj h'_i + b_obj),    (11)

where p_i^{obj_start} indicates the probability of identifying the i-th token in the input sequence as the start position of an object entity, and h'_i is the selective representation of the i-th token.
In this stage, TDEER minimizes the following loss function to discern the start positions of object entities:

L_obj = - Σ_{i=1}^{n} [ I(y_i^{obj_start} = 1) log p_i^{obj_start} + I(y_i^{obj_start} = 0) log(1 - p_i^{obj_start}) ],    (12)

where I is the indicator function and y_i^{obj_start} is the ground-truth label. After obtaining the start positions of objects, TDEER takes as the final objects the entities from Ω_o whose start positions match. If no start position matches, there is no triple for the current subject and relation.
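The translating decoding step, subject + relation → objects, can be sketched end-to-end as below. The simplified attention (reusing token features directly as keys and values, with the translated vector as the query) and all toy weights are assumptions made to keep the example self-contained.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_objects(X, v_sub, v_rel, W_obj, b_obj, omega_o, threshold=0.5):
    """Translating decoding sketch: subject + relation -> object starts."""
    q = v_sub + v_rel                      # addition links subject and relation
    d_k = X.shape[1]
    attn = softmax(X @ q / np.sqrt(d_k))   # attention weights over tokens
    H = X * attn[:, None]                  # selective representation
    p = sigmoid(H @ W_obj + b_obj)         # per-token object-start probabilities
    starts = set(np.where(p > threshold)[0].tolist())
    # keep only candidate entities from Omega_o whose start position matched
    return [span for span in omega_o if span[0] in starts]

X = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # 3 toy token features
v_sub, v_rel = np.array([0.0, 2.0]), np.array([0.0, 2.0])
omega_o = [(1, 2), (0, 0)]                 # candidate object spans
print(decode_objects(X, v_sub, v_rel, np.array([0.0, 10.0]), -4.0, omega_o))  # [(1, 2)]
```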

Negative Sample Strategy
Most entity and relation extraction models consisting of multiple components suffer from error accumulation: errors from upstream components propagate to downstream components because of the dependency between them. In TDEER, the translating decoder depends on the entity tagger and the relation detector, hence it may receive erroneous entities or relations from upstream components. Therefore, we introduce a negative sample strategy to detect and alleviate such errors. For each sentence, we produce incorrect triples as negative samples during the training phase by replacing the correct subject/relation with other, inappropriate subjects/relations. We do not assign any objects to negative samples, i.e., the start-position probabilities of Eq.(11) are all expected to be 0. This strategy enables TDEER to handle noisy subject and relation inputs at the decoding phase.
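A sketch of the negative sample generation, under assumed details (a uniform 50/50 choice between corrupting the subject or the relation, and k negatives attempted per gold triple):

```python
import random

def make_negatives(triples, subjects, relations, k=2, seed=0):
    """Build negative (subject, relation) pairs that carry no object target.

    For each gold triple, swap in a wrong subject or relation; the decoder
    is trained to predict all-zero object-start probabilities for these
    pairs. The sampling scheme details here are assumptions.
    """
    rng = random.Random(seed)
    gold = {(s, r) for s, r, _ in triples}
    negatives = []
    for s, r, _ in triples:
        for _ in range(k):
            if rng.random() < 0.5:
                cand = (rng.choice(subjects), r)   # corrupt the subject
            else:
                cand = (s, rng.choice(relations))  # corrupt the relation
            if cand not in gold:
                negatives.append(cand)             # target: no object starts
    return negatives

triples = [("Microsoft", "co-founder", "Bill Gates")]
negs = make_negatives(triples, ["Microsoft", "Seattle"], ["co-founder", "Live In"])
```

Because the corrupted pair is checked against the gold set, no accidental positive pair is ever labeled as negative.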

Joint Training
We jointly train the span-based entity tagging model, the relation detector, and the translating decoder. The joint loss function is defined as follows:

L = α L_entity + β L_rel + λ L_obj,    (13)

where α, β, and λ are constant weights. In our experiments, we set them to 1.0, 1.0, and 5.0, respectively; the values are obtained by grid search on the validation set.
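The weighted combination is trivial to implement; the scalar loss values below are placeholders, while the weights are the paper's grid-searched constants.

```python
def joint_loss(l_entity, l_rel, l_obj, alpha=1.0, beta=1.0, lam=5.0):
    """L = alpha*L_entity + beta*L_rel + lambda*L_obj, with the
    paper's grid-searched weights (1.0, 1.0, 5.0) as defaults."""
    return alpha * l_entity + beta * l_rel + lam * l_obj

print(joint_loss(1.0, 1.0, 1.0))  # 7.0
```

Note that the object-decoding loss is weighted five times more heavily than the other two stages.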

Datasets and Evaluation Metrics
We conduct experiments on widely used datasets.
NYT (Riedel et al., 2010) was produced by a distant supervision method from New York Times news articles. WebNLG was created for Natural Language Generation and adapted by Zeng et al. (2018) for relational triple extraction. For a fair comparison, we use the versions of the two datasets released by Zeng et al. (2018). Apart from evaluating the model on the standard split, we follow Wei et al. (2020) and partition the test sentences according to overlapping category and triple number. Furthermore, we also conduct experiments on NYT11-HRL (Takanobu et al., 2019), in which most test sentences belong to the Normal category, to demonstrate that the proposed model can handle not only the overlapping triple problem but also the general extraction problem. Summary statistics of the adopted public datasets are given in Table 1. We report the standard micro Precision, Recall, and F1-score following the same setting as Fu et al. (2019).

Baselines
We compare the proposed model with the following SOTA models: NovelTagging (Zheng et al., 2017) incorporates both entity and relation roles and models relational triple extraction as a sequence labeling problem; CopyR (Zeng et al., 2018) applies a sequence-to-sequence architecture; GraphRel (Fu et al., 2019) uses graph convolutional networks to jointly learn named entities and relations; OrderCopyR (Zeng et al., 2019) applies reinforcement learning to a sequence-to-sequence model to generate triples; CasRel (Wei et al., 2020) employs a cascade binary tagging framework; TPLinker (Wang et al., 2020) iterates all token pairs and uses matrices to tag token links to recognize relations between token pairs.

Implementation Details
We adopt the Adam (Kingma and Ba, 2015) optimizer. The hyper-parameters are tuned by grid search on the validation set. The learning rate is set to 1e-3/5e-5 and the batch size to 32/8 for the LSTM/BERT backbone, respectively. For the LSTM-based model, we apply 300-dimensional pre-trained GloVe embeddings (Pennington et al., 2014), following previous works (Zeng et al., 2018; Wei et al., 2020). We correct the numbers in the dataset statistics and mark the accurate numbers accordingly.

Main Results
The main results of the proposed models and the baselines are reported in Table 2. CasRel, TPLinker, and TDEER achieve substantial improvements on the NYT and WebNLG datasets over the remaining baselines. In particular, TDEER produces competitive results compared with the previous SOTA model TPLinker, achieving 7 out of 9 best results, and outperforms the baseline models on F1 score on all datasets. From Table 1, we can observe that WebNLG is small in size yet contains a large number of predefined relations. It is difficult to make improvements on WebNLG, as existing models already achieve an F1 score over 90%, exceeding human-level performance. Even so, TDEER achieves around a 1.2% gain, reaching 93.1% on WebNLG against TPLinker, which verifies the effectiveness of the proposed framework. Apart from F1, TDEER also obtains better precision scores than the baselines in most results. Even without a pre-trained language model as the backbone, TDEER LSTM still performs well: it achieves a higher F1 score on NYT than all baseline models except the BERT-based CasRel and TPLinker, and outperforms all baseline models on precision on WebNLG and NYT11-HRL except the BERT-based CasRel. Therefore, the proposed framework is efficacious even without a powerful pre-trained language model. NYT and WebNLG contain a large number of overlapping-triple instances, so the results on these datasets indicate that TDEER can address the overlapping triple problem. Almost all triples in NYT11-HRL belong to the Normal category; TDEER achieves better results than the baselines on NYT11-HRL, which shows that TDEER can also solve the general extraction problem.

Results of Ablation Study
We conduct ablation studies on different strategies to explore the effect of the negative sample strategy. The results show that negative samples built from subjects achieve better results than negative samples built from relations, and that TDEER performs best when combining the two types of negative samples, compared with adopting each strategy individually or using no negative samples. This evidence illustrates that negative samples help alleviate error accumulation. We also conduct ablation studies on TDEER without the relation detector or the attention mechanism; in both cases the results degrade, and the model malfunctions entirely without the relation detector. This evidence suggests that the relation detector and attention are crucial for TDEER.
To investigate the effect of the attention mechanism, we select a sample from the NYT test set that contains the triple (Netherlands, /location/country/administrative_division, Utrecht). We visualize the attention heatmaps of different subject and relation pairs in Figure 3. The heatmaps indicate that when the extracted subject and relation pair is proper, the attention weights on the object positions are higher than elsewhere. If the weights are close to each other, the extracted subject and relation pair cannot be decoded into a valid triple.

Discussion on Triple Numbers
In general, the more triples a sentence contains, the more complicated it is. To explore model performance with respect to sentence complexity, we also conduct experiments on sentences with different numbers of triples. The results are reported in Table 4. From the results, we find that TDEER outperforms the baseline models except on sentences with four triples in NYT. Notably, TDEER achieves a 2.8% gain on NYT and a 0.7% gain on WebNLG against TPLinker for complicated sentences containing five or more triples. This evidence indicates that TDEER is effective at modeling sentences with multiple triples.

Discussion on Overlapping Patterns
To further investigate the performance on different overlapping patterns, we conduct extensive experiments and report the results on NYT and WebNLG in Table 5. The results suggest that TDEER outperforms the baseline models, which demonstrates the advantages of TDEER in processing the overlapping triple problem.
[Figure 3: attention heatmaps for the example sentence "His father is a pulmonary specialist in Utrecht, the Netherlands."]

Discussion on Computation Complexity
Computation efficiency is an important problem that has not received enough attention in most previous works. We compare TDEER with the baselines in terms of computational complexity and inference time on the test set; the results are shown in Table 6. Pipeline approaches usually use NER tools to detect entities, and such tools typically apply the Viterbi algorithm to decode sequences with O(nK^2) complexity, where n denotes the input length and K is the tag size. Pipeline approaches then recognize relations for each entity pair. Thus, the computation complexity of pipeline approaches is O(nK^2 + e^2), where e denotes the number of entities in the input.
Despite the successes of CasRel (Wei et al., 2020) and TPLinker (Wang et al., 2020), they still struggle with computation efficiency. CasRel jointly decodes relations and objects; its computational complexity is O(n + sro), where n is the input length and s/r/o is the number of subjects/relations/objects in the input, respectively. TPLinker iterates all token pairs and uses matrices to tag token links to recognize relations; its main computation overhead lies in the encoder, with O(n^2) complexity. (In Table 6, e denotes the number of entities in the input, s/o stands for the number of subjects/objects, r denotes the number of relations, K denotes the tag size, and n stands for the input length; inference time is the average time the BERT-based models take to process a sample.)
The computation complexity of TDEER is O(n + sr), where n is the input length, s denotes the number of subjects, and r the number of relations in the input sentence. As shown in Table 6, TDEER is 0.7 times faster than CasRel and 1.6 times faster than TPLinker on NYT, and 1.1 times faster than CasRel and 2.1 times faster than TPLinker on WebNLG. Therefore, we can conclude that TDEER is more efficient than the baselines, which makes it competent for constructing a large-scale KB.
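To make the asymptotic comparison concrete, the following back-of-the-envelope cost model contrasts the decoding work of the two schemes; it is purely illustrative, since real runtimes depend on constants, batching, and hardware.

```python
def tdeer_decoder_ops(n, s, r):
    """Rough operation count for TDEER's decoding stage: O(n + s*r)."""
    return n + s * r

def tplinker_ops(n):
    """Rough operation count for TPLinker's token-pair tagging: O(n^2)."""
    return n * n

# A 100-token sentence with 3 subjects and 4 detected relations:
# the translating decoder touches far fewer pairs than token-pair tagging.
print(tdeer_decoder_ops(100, 3, 4), tplinker_ops(100))  # 112 10000
```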

Conclusion & Future work
In this paper, we have proposed a novel translating decoding schema for joint extraction of entities and relations, namely TDEER. It models the relation as a translating operation from subjects to objects, which can handle the overlapping triple problem naturally. We have conducted extensive experiments on widely used datasets to demonstrate the effectiveness and efficiency of the proposed model. The proposed negative sample strategy is used to alleviate the error accumulation problem. Though it is effective, it may increase training time. For future work, we plan to explore more efficient approaches to alleviate error accumulation.