A Partition Filter Network for Joint Entity and Relation Extraction

In joint entity and relation extraction, existing work either sequentially encodes task-specific features, leading to an imbalance in inter-task feature interaction where features extracted later have no direct contact with those extracted first, or encodes entity features and relation features in a parallel manner, meaning that feature representation learning for each task is largely independent of the other except for input sharing. We propose a partition filter network to model two-way interaction between tasks properly, in which feature encoding is decomposed into two steps: partition and filter. In our encoder, we leverage two gates, an entity gate and a relation gate, to segment neurons into two task partitions and one shared partition. The shared partition represents inter-task information valuable to both tasks and is evenly shared across the two tasks to ensure proper two-way interaction. The task partitions represent intra-task information and are formed through the concerted efforts of both gates, ensuring that the encoding of task-specific features is mutually dependent. Experimental results on six public datasets show that our model performs significantly better than previous approaches. In addition, contrary to what previous work has claimed, our auxiliary experiments suggest that relation prediction contributes to named entity prediction in a non-negligible way. The source code can be found at https://github.com/Coopercoppers/PFN.


Introduction
Joint entity and relation extraction intends to simultaneously extract entity and relation facts from a given text to form relational triples (s, r, o). The extracted information supplements many downstream tasks, such as knowledge graph construction (Riedel et al., 2013), question answering (Diefenbach et al., 2018) and text summarization (Gupta and Lehal, 2010).
Existing approaches generally encode task-specific features in one of two ways: sequential encoding or parallel encoding. In sequential encoding, features for different tasks are extracted in a predefined order. In parallel encoding, task-specific features are generated independently using shared input. Compared with sequential encoding, models built on this scheme do not need to worry about the implications of encoding order. For example, Fu et al. (2019) encode entity and relation information separately using common features derived from their GCN encoder. Since both task-specific features are extracted through isolated submodules, this approach falls into the category of parallel encoding. However, both encoding designs fail to model two-way interaction between the NER and RE tasks properly. In sequential encoding, interaction is only unidirectional with a specified order, resulting in different amounts of information being exposed to the NER and RE tasks. In parallel encoding, although encoding order is no longer a concern, interaction is present only in input sharing. To add two-way interaction to feature encoding, we adopt an alternative design: joint encoding. This design encodes task-specific features jointly with a single encoder in which a mutual section enables inter-task communication.
In this work, we instantiate joint encoding with a partition filter encoder. Our encoder first sorts and partitions neurons according to their contribution to each task using entity and relation gates. During this process, two task partitions and one shared partition are formed (see figure 1). Then each task partition is combined with the shared partition to generate task-specific features, filtering out irrelevant information stored in the opposite task partition.
Task interaction in our encoder is achieved in two ways. First, the partitions, especially the task-specific ones, are formed through the concerted efforts of the entity and relation gates, allowing for interaction between the formation of entity and relation features determined by these partitions. Second, the shared partition, which represents information useful to both tasks, is equally accessible to the formation of both task-specific features, ensuring balanced two-way interaction. The contributions of our work are summarized below: 1. We propose the partition filter network, a framework designed specifically for joint encoding. This method is capable of encoding task-specific features and guarantees proper two-way interaction between NER and RE.
2. We conduct extensive experiments on six datasets. The main results show that our method is superior to other baseline approaches, and the ablation study provides insight into what works best for our framework.
3. Contrary to what previous work has claimed, our auxiliary experiments suggest that relation prediction contributes to named entity prediction in a non-negligible way.

Related Work
In recent years, joint entity and relation extraction approaches have focused on tackling the triple overlapping problem and modelling task interaction. Solutions to these issues have been explored in recent works (Zheng et al., 2017; Zeng et al., 2018, 2019; Fu et al., 2019). The triple overlapping problem refers to triples sharing the same entity (SEO, i.e. SingleEntityOverlap) or entities (EPO, i.e. EntityPairOverlap). For example, in "Adam and Joe were born in the USA", since the triples (Adam, birthplace, USA) and (Joe, birthplace, USA) share only one entity, "USA", they are categorized as SEO triples; in "Adam was born in the USA and lived there ever since", the triples (Adam, birthplace, USA) and (Adam, residence, USA) share both entities at the same time and are thus categorized as EPO triples. Generally, there are two ways of tackling the problem. One is through generative methods like seq2seq (Zeng et al., 2018, 2019), where entity and relation mentions can be decoded multiple times in the output sequence; the other is to model each relation separately with sequences, graphs (Fu et al., 2019) or tables (Wang and Lu, 2020). Our method uses relation-specific tables (Miwa and Sasaki, 2014) to handle each relation separately. Task interaction modeling, however, has not been well handled by most previous work. In some previous approaches, task interaction is achieved by having entity and relation prediction share the same features (Tran and Kavuluru, 2019; Wang et al., 2020b). This could be problematic, as information about entities and relations can sometimes be contradictory. Also, as models that use sequential encoding (Bekoulis et al., 2018b; Eberts and Ulges, 2019) or parallel encoding (Fu et al., 2019) lack proper two-way interaction in feature extraction, predictions made on these features suffer from improper interaction.
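As a concrete illustration, the SEO/EPO distinction above can be sketched with a small helper. This is a hypothetical function for exposition, not part of the paper's code:

```python
def overlap_pattern(t1, t2):
    """Label the overlap pattern between two (subject, relation, object) triples."""
    s1, _, o1 = t1
    s2, _, o2 = t2
    if {s1, o1} == {s2, o2}:
        return "EPO"     # EntityPairOverlap: both entities shared
    if {s1, o1} & {s2, o2}:
        return "SEO"     # SingleEntityOverlap: exactly one entity shared
    return "Normal"      # no shared entity

# The two examples from the text:
print(overlap_pattern(("Adam", "birthplace", "USA"), ("Joe", "birthplace", "USA")))   # SEO
print(overlap_pattern(("Adam", "birthplace", "USA"), ("Adam", "residence", "USA")))   # EPO
```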
In our work, the partition filter encoder is built on joint encoding and handles the communication of inter-task information more appropriately, avoiding the problems of sequential and parallel encoding (exposure bias and insufficient interaction), while keeping intra-task information away from the opposite task to mitigate negative transfer between the tasks.

Problem Formulation
Our framework splits joint entity and relation extraction into two sub-tasks: NER and RE. Formally, given an input sequence s = {w_1, ..., w_L} with L tokens, w_i denotes the i-th token in s. For NER, we aim to extract all typed entities, whose set is denoted as S, where ⟨w_i, e, w_j⟩ ∈ S signifies that tokens w_i and w_j are the start and end tokens of an entity of type e ∈ E, with E the set of entity types. For RE, the goal is to identify all head-only triples, whose set is denoted as T, where each triple ⟨w_i, r, w_j⟩ ∈ T indicates that tokens w_i and w_j are the start tokens of the subject and object entities with relation r ∈ R, with R the set of relation types. Combining the results from both NER and RE, we can extract relational triples with complete entity spans.
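To make the decomposition concrete, the following sketch shows how full triples are recovered by joining typed spans from NER with head-only triples from RE on their start tokens. The helper name and data layout are illustrative assumptions, not the paper's code:

```python
def combine(entity_spans, head_triples):
    """entity_spans: set of (start, entity_type, end) from NER.
    head_triples: set of (subj_start, relation, obj_start) from RE.
    Returns triples with complete entity spans."""
    by_start = {}
    for (i, e, j) in entity_spans:
        by_start.setdefault(i, []).append((i, e, j))
    full = []
    for (i, r, m) in head_triples:
        for subj in by_start.get(i, []):
            for obj in by_start.get(m, []):
                full.append((subj, r, obj))
    return full

spans = {(0, "PER", 0), (4, "LOC", 5)}        # e.g. "Adam", "the USA"
heads = {(0, "birthplace", 4)}
print(combine(spans, heads))   # [((0, 'PER', 0), 'birthplace', (4, 'LOC', 5))]
```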

Model
We describe our model design in this section. Our model consists of a partition filter encoder and two task units: an NER unit and an RE unit. The partition filter encoder generates task-specific features, which are sent to the task units as input for entity and relation prediction. We discuss each component in detail in the following sub-sections.

Partition Filter Encoder
Similar to an LSTM, the partition filter encoder is a recurrent feature encoder with information stored in intermediate memories. At each time step, the encoder first divides neurons into three partitions: the entity partition, the relation partition and the shared partition. It then generates task-specific features by selecting and combining these partitions, filtering out information irrelevant to each task. As shown in figure 2, this module is designed specifically to jointly extract task-specific features, strictly following two steps: partition and filter.
Partition This step divides cell neurons into three partitions: two task partitions storing intra-task information, namely the entity partition and the relation partition, and one shared partition storing inter-task information. The neurons to be divided are the candidate cell c̃_t, representing current information, and the previous cell c_{t−1}, representing history information. c_{t−1} is the direct input from the last time step, and c̃_t is calculated in the same manner as in an LSTM:

c̃_t = tanh(Linear([x_t; h_{t−1}]))

where Linear stands for a linear transformation. We leverage an entity gate ẽ and a relation gate r̃, referred to as master gates in Shen et al. (2019), for neuron partition. As illustrated in figure 1, each gate, which represents one specific task, divides neurons into two segments according to their usefulness to the designated task. For example, the entity gate ẽ separates neurons into two partitions: NER-related and NER-unrelated. The shared partition is formed by combining the partition results of both gates. Neurons in the shared partition can be regarded as information valuable to both tasks. In order to model two-way interaction properly, inter-task information in the shared partition is evenly accessible to both tasks (discussed in the filter subsection). In addition, information valuable to only one task is invisible to the opposing task and is stored in the individual task partitions. The gates are calculated using the cummax activation function cummax(·) = cumsum(softmax(·)), whose output can be seen as an approximation of a binary gate of the form (0, . . . , 0, 1, . . . , 1):

ẽ = cummax(Linear([x_t; h_{t−1}]))
r̃ = 1 − cummax(Linear([x_t; h_{t−1}]))

The intuition behind these gates is to identify two cut-off points, displayed as scissors in figure 2, which naturally divide a set of neurons into three segments. As a result, the gates divide neurons into three partitions: the entity partition ρ_e, the relation partition ρ_r and the shared partition ρ_s.
Figure 2: In partition, we first segment neurons into two task partitions and one shared partition. Then in filter, partitions are selected and combined to form task-specific features and shared features, filtering out information irrelevant to each task.
Partitions for the candidate cell c̃_t and the previous cell c_{t−1} are formulated as below (with ∘ denoting elementwise multiplication):

ρ_e = ẽ ∘ (1 − r̃)    (entity partition)
ρ_r = r̃ ∘ (1 − ẽ)    (relation partition)
ρ_s = ẽ ∘ r̃         (shared partition)

Note that the three partitions do not add up to one. This guarantees that in forward message passing some information is discarded, ensuring that the message is not overloaded, similar to the forgetting mechanism in LSTM.
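A minimal numerical sketch of the cummax gates and the resulting partitions. The logits here are toy values; in the model they come from linear transformations of the input and hidden state, and reversing the relation gate via 1 − cummax follows our reading of the master-gate design:

```python
import numpy as np

def cummax(logits):
    # cumsum(softmax(.)): rises monotonically from near 0 to 1,
    # approximating a binary (0, ..., 0, 1, ..., 1) gate.
    z = np.exp(logits - logits.max())
    return np.cumsum(z / z.sum())

logits_e = np.array([-4.0, 6.0, 0.0, -1.0, -2.0])
logits_r = np.array([-3.0, -2.0, -1.0, 6.0, 0.0])

e = cummax(logits_e)        # entity gate
r = 1.0 - cummax(logits_r)  # relation gate (reversed)

rho_e = e * (1 - r)         # neurons useful only to NER
rho_r = r * (1 - e)         # neurons useful only to RE
rho_s = e * r               # shared partition, useful to both tasks

# Elementwise, rho_e + rho_r + rho_s = e + r - e*r <= 1,
# so some information is discarded, matching the note above.
```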
Then, we aggregate the partition information from both target cells: for each of the three partitions, we add up the related information from both cells (e.g. the entity partition combines the NER-related neurons of c̃_t and of c_{t−1}, and analogously for the relation and shared partitions).

Filter We propose three types of memory block: entity memory, relation memory and shared memory, denoted as µ_e, µ_r and µ_s respectively. Task-specific features are formed by combining a task memory with the shared memory while filtering out the memory of the opposite task. The information in all three memories is used to form the cell state c_t, which is then used to generate the hidden state h_t (the hidden and cell states at time step t are the input to the next time step).
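The filter step can be sketched numerically as follows. The concrete combination rule is our reading of the design and is an assumption: each task feature sees its own memory plus the shared memory, never the opposite task's memory, and the cell state is taken as a simple sum of all three memories.

```python
import numpy as np

# Toy memory contents standing in for the partitioned cell states
# (placeholder values, not taken from the paper).
mu_e = np.array([0.8, 0.0, 0.1])   # entity memory
mu_r = np.array([0.0, 0.9, 0.1])   # relation memory
mu_s = np.array([0.2, 0.2, 0.5])   # shared memory

# Cell state aggregates all three memories (assumption: elementwise sum),
# and the hidden state is derived from it as in an LSTM.
c_t = mu_e + mu_r + mu_s
h_t = np.tanh(c_t)

# Task-specific features filter out the opposite task's memory:
h_entity   = np.tanh(mu_e + mu_s)   # NER never sees mu_r
h_relation = np.tanh(mu_r + mu_s)   # RE never sees mu_e
```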

Global Representation
In our model, we employ a unidirectional encoder for feature encoding. The backward encoder of a bidirectional setting is replaced with a task-specific global representation that captures the semantics of future context; empirically, this proves more effective. For each task, the global representation is computed by combining the task-specific features and shared features of the whole sentence.

Task Units
Our model consists of two task units: the NER unit and the RE unit. In the NER unit, the objective is to identify and categorize all entity spans in a given sentence. More specifically, the task is treated as a type-specific table filling problem. Given an entity type set E, for each type k we fill out a table whose element e^k_{ij} represents the probability of words w_i and w_j being the start and end positions of an entity of type k. For each word pair (w_i, w_j), we concatenate the word-level entity features h^e_i and h^e_j, as well as the sentence-level global features h^{ge}, before feeding them into a fully-connected layer with ELU activation to get the entity span representation h^e_{ij}:

h^e_{ij} = ELU(Linear([h^e_i; h^e_j; h^{ge}]))

With the span representation, we predict whether the span is an entity of type k by feeding it into a feed-forward layer:

e^k_{ij} = σ(Linear(h^e_{ij}))

where σ represents the sigmoid activation function.
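A toy version of the span scoring above, with random weights; only the Linear/ELU/sigmoid structure is carried over from the text, and the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy feature dimension

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0)) - 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Word-level entity features for tokens i, j, plus the global features.
h_i, h_j, h_ge = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

# Fully-connected layer with ELU -> span representation h^e_ij.
W1, b1 = rng.normal(size=(d, 3 * d)), np.zeros(d)
h_ij = elu(W1 @ np.concatenate([h_i, h_j, h_ge]) + b1)

# Feed-forward layer with sigmoid -> probability e^k_ij for one type k.
w2, b2 = rng.normal(size=d), 0.0
e_k_ij = sigmoid(w2 @ h_ij + b2)
```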
Computation in the RE unit is mostly symmetrical to the NER unit. Given a set of gold relation triples denoted as T, this unit aims to identify all triples in the sentence. We only predict the starting word of each entity in this unit, as entity span prediction is already covered by the NER unit. Similar to NER, we treat relation extraction as a relation-specific table filling problem. Given a relation label set R, for each relation l ∈ R we fill out a table whose element r^l_{ij} represents the probability of words w_i and w_j being the starting words of the subject and object entities. In this way, we can extract all triples revolving around relation l with one relation table. For each triple (w_i, l, w_j), similar to the NER unit, the triple representation h^r_{ij} and the relation score r^l_{ij} are calculated as follows:

h^r_{ij} = ELU(Linear([h^r_i; h^r_j; h^{gr}]))
r^l_{ij} = σ(Linear(h^r_{ij}))

Training and Inference
For a given training dataset, the loss function L that guides the model during training consists of two parts, L_ner for the NER unit and L_re for the RE unit:

L_ner = Σ_{k∈E} Σ_{i,j} BCE(e^k_{ij}, ê^k_{ij}),   L_re = Σ_{l∈R} Σ_{i,j} BCE(r^l_{ij}, r̂^l_{ij})

where ê^k_{ij} and r̂^l_{ij} are respectively the ground-truth labels of the entity table and the relation table, and e^k_{ij} and r^l_{ij} are the predicted ones. We adopt BCE loss for each task. The training objective is to minimize the loss function L, computed as L_ner + L_re.
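The loss can be sketched as follows. This is a minimal version with toy tables; the exact normalization over table cells is an assumption:

```python
import numpy as np

def bce(pred, gold, eps=1e-12):
    """Binary cross-entropy averaged over all table cells."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(gold * np.log(pred) + (1 - gold) * np.log(1 - pred)).mean())

# Toy 2x2 entity and relation tables: predictions vs. binary gold labels.
e_pred = np.array([[0.9, 0.1], [0.2, 0.8]])
e_gold = np.array([[1.0, 0.0], [0.0, 1.0]])
r_pred = np.array([[0.7, 0.3], [0.1, 0.6]])
r_gold = np.array([[1.0, 0.0], [0.0, 1.0]])

loss = bce(e_pred, e_gold) + bce(r_pred, r_gold)   # L = L_ner + L_re
```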
During inference, we extract relational triples by combining the results of the NER and RE units. For each legitimate triple prediction (s^k_{i,j}, l, o^{k′}_{m,n}), where l is the relation label, k and k′ are the entity type labels, and the indexes i, j and m, n are respectively the start and end indexes of the subject entity s and object entity o, the following conditions must be satisfied: the entity scores e^k_{i,j} and e^{k′}_{m,n} must both exceed λ_e, and the relation score r^l_{i,m} must exceed λ_r. λ_e and λ_r are threshold hyper-parameters for entity and relation prediction, both set to 0.5 without further fine-tuning.
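The decoding conditions can be sketched as follows. The table representation (dictionaries mapping index pairs to scores) and the function name are hypothetical conveniences for this sketch:

```python
def decode(ent_tables, rel_tables, lam_e=0.5, lam_r=0.5):
    """ent_tables[k][(i, j)]: score of span (i, j) being an entity of type k.
    rel_tables[l][(i, m)]: score of tokens i, m starting the subject/object."""
    # Keep only entity spans whose score clears the entity threshold.
    entities = [(k, i, j) for k, tbl in ent_tables.items()
                for (i, j), s in tbl.items() if s >= lam_e]
    triples = []
    for l, tbl in rel_tables.items():
        for (i, m), s in tbl.items():
            if s < lam_r:
                continue
            # Join surviving entities on their start tokens.
            for (k, si, sj) in entities:
                if si != i:
                    continue
                for (k2, oi, oj) in entities:
                    if oi == m:
                        triples.append(((k, si, sj), l, (k2, oi, oj)))
    return triples

ents = {"PER": {(0, 1): 0.9}, "LOC": {(4, 4): 0.8, (6, 7): 0.3}}
rels = {"birthplace": {(0, 4): 0.7, (0, 6): 0.9}}
print(decode(ents, rels))   # [(('PER', 0, 1), 'birthplace', ('LOC', 4, 4))]
```

Note that the high-scoring relation (0, 6) is dropped because the span starting at token 6 fails the entity threshold, which is exactly the joint condition described above.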
Following previous work, we assess our model on NYT/WebNLG under partial match, where only the tail of an entity is annotated. Besides, as entity type information is not annotated in these datasets, we set the type of all entities to a single label "NONE", so entity types are not predicted for them. On ACE05, ACE04, ADE and SciERC, we assess our model under exact match, where both the head and tail of an entity are annotated. For ADE and ACE04, 10-fold and 5-fold cross validation are used respectively, and 15% of the training set is used to construct the development set. For evaluation metrics, we report F1 scores for both NER and RE. In NER, an entity is considered correct only if its type and boundaries are correct. In RE, a triple is considered correct only if the types and boundaries of both entities and their relation type are correct. In addition, we report the Macro-F1 score on ADE and Micro-F1 scores on the other datasets. We choose model parameters based on performance on the development set (the best average F1 score of NER and RE) and report the results on the test set. More details on hyper-parameters can be found in Appendix B.

Table 1 shows the comparison of our model with existing approaches on the partially annotated datasets WebNLG and NYT, under the BERT setting. For RE, our model achieves a 1.7% improvement on WebNLG, but performance on NYT is only slightly better than the previous SOTA TpLinker (Wang et al., 2020b), by a 0.5% margin. We argue that this is because NYT is generated with distant supervision, and its entity and relation annotations are often incomplete or wrong. Compared to TpLinker, the strength of our method is to reinforce two-way interaction between entity and relation. However, when dealing with noisy data, this strength might be counter-productive, as error propagation between the tasks is amplified as well.

Main Result
For NER, our method shows a distinct advantage over baselines that report these figures. Compared to Casrel, a competitive method, our F1 scores are 2.3%/2.5% higher on NYT/WebNLG. This proves that exposing relation information to entity prediction is beneficial.

Ablation Study
In this section, we take a closer look at the effectiveness of our framework in relation extraction, concerning five different aspects: the number of encoder layers, bidirectional versus unidirectional encoding, the encoding scheme, partition granularity and the decoding strategy.

Number of Encoder Layers
Similar to a recurrent neural network, our partition filter encoder can be stacked with an arbitrary number of layers. Here we only examine frameworks with no more than three layers. As shown in table 2, adding layers to the partition filter encoder leads to no improvement in F1 score, showing that one layer is sufficient for encoding task-specific features.
Bidirectional vs. Unidirectional Normally, two partition filter encoders (one in reverse order) would be needed to model interaction between forward and backward context. However, as discussed in section 4.2, our model replaces the backward encoder with a global representation to make future context visible to each word, achieving a similar effect to a bidirectional setting. To find out which works best, we compare the two methods in our ablation study. From table 2, we find that the unidirectional encoder with global representation outperforms the bidirectional encoder without global representation, showing that global representation is more suitable for providing future context than a backward encoder. In addition, when global representation is involved, the unidirectional encoder achieves a similar F1 score to the bidirectional encoder, indicating that global representation alone is enough to capture the semantics of future context.

Encoding Scheme
We replace our partition filter encoder with two LSTM variants to examine its effectiveness. In the parallel setting, two LSTM encoders learn task-specific features separately, and no interaction is allowed except for sharing the same input. In the sequential setting, where only one-way interaction is allowed, entity features generated by the first LSTM encoder are fed into the second one to produce relation features. From table 2, we observe that our partition filter encoder outperforms both LSTM variants by a large margin, proving its effectiveness in modelling two-way interaction over the other two encoding schemes.
Partition Granularity Similar to Shen et al. (2019), we can split neurons into several chunks and perform partition within each chunk. Each chunk shares the same entity gate and relation gate, so the partition results for all chunks remain the same. For example, with a 300-dimension neuron set split into 10 chunks of 30 neurons each, only two 30-dimension gates are needed for neuron partition. We refer to this operation as coarse partition. In contrast, our fine-grained partition can be seen as the special case in which all neurons form a single chunk. We compare our fine-grained partition (chunk size = 300) with coarse partition (chunk size = 10). Table 2 shows that fine-grained partition performs better than coarse partition. This is not surprising, as in coarse partition the assumption of performing the same neuron partition for each chunk might be too strong for the encoder to separate information for each task properly.
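The parameter-sharing difference between coarse and fine-grained partition can be sketched as follows. The gate values are toy placeholders; in the model the gates come from cummax over learned logits:

```python
import numpy as np

d, n_chunks = 300, 10
chunk = d // n_chunks                       # 30 neurons per chunk

# Coarse partition: one 30-dim gate, reused identically for every chunk,
# so the cut-off point is the same inside each chunk.
gate_30 = np.linspace(0.0, 1.0, chunk)
coarse_gate = np.tile(gate_30, n_chunks)

# Fine-grained partition: a single gate over all 300 neurons,
# so the cut-off can fall anywhere in the full neuron set.
fine_gate = np.linspace(0.0, 1.0, d)

assert coarse_gate.shape == fine_gate.shape == (d,)
```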
Decoding Strategy In pipeline-like methods, relation prediction is performed only on entities that the system considers valid in its entity prediction. We argue that a better way for relation prediction is to take all word pairs into account, including invalid ones. We refer to the former strategy as selective decoding and the latter as universal decoding. For selective decoding, we only predict relation scores for entities deemed valid by their entity scores calculated in the NER unit. Table 2 shows that universal decoding, where all negative instances are included, is better than selective decoding. Apart from mitigating error propagation, we argue that universal decoding is similar to contrastive learning, as negative instances help to better identify positive instances through implicit comparison.

Effects of Relation Signal on Entity Recognition
It is widely accepted that entity recognition helps in predicting relations, but the effect of relation signals on entity prediction remains a point of divergence among researchers. Through two auxiliary experiments, we find that the absence of relation signals has a considerable bearing on entity recognition.

Analysis on Entity Prediction of Different Types
In Table 1, the NER performance of our model is consistently better than other baselines except on ACE05, where it falls short by a non-negligible margin. We argue that this can be attributed to the fact that ACE05 contains many entities that do not belong to any triple.
To corroborate our claim, in this section we quantify the performance gap in entity prediction between entities that belong to certain triples and those that have no relation with other entities. The former are referred to as In-triple entities and the latter as Out-of-triple entities. We split the entities into two groups and test the NER performance on each group in ACE05/ACE04/SciERC. In NYT/WebNLG/ADE, since Out-of-triple entities are non-existent, evaluation is not performed on these datasets.
As shown in table 3, there is a huge gap between In-triple and Out-of-triple entity prediction, especially in SciERC, where the diff score reaches 26.6%. We argue that this might be attributed to the fact that entity prediction in SciERC is generally harder, given that it involves identification of scientific terms, and the average length of entities in SciERC is also longer. Another observation is that the diff score is largely attributable to the difference in precision, which means that without guidance from relational signals, our model tends to be over-optimistic about entity prediction.
In addition, compared to PURE (Zhong and Chen, 2021), we find that the overall NER performance is negatively correlated with the percentage of Out-of-triple entities in the dataset. Especially in ACE05, where the performance of our model is relatively weak, over 64% of the entities are Out-of-triple. This phenomenon is a manifestation of a weakness in joint models: joint modeling of NER and RE might be somewhat harmful to entity prediction, as the inference patterns of In-triple and Out-of-triple entities differ, and so does the dynamic between relation information and entity prediction for each group.

Robustness Test on Named Entity Recognition
We use a robustness test to evaluate our model under adverse circumstances, using the domain transformation methods for NER from Wang et al. (2021). The compared baselines are all relation-free models, including BiLSTM-CRF (Huang et al., 2015), BERT (Devlin et al., 2019), TENER (Yan et al., 2019) and Flair-Embeddings (Akbik et al., 2019). Descriptions of the transformation methods can be found in Appendix D. From table 4, we observe that our model is mostly more resilient against input perturbations than the other baselines, especially in the CrossCategory category. This is probably because the relation signals used in our training impose type constraints on entities, so the inference of entity types is less affected by the semantic meaning of the target entity itself and relies more on the (relational) context surrounding it.

Does Relation Signal Help in Predicting Entities?
Contrary to what Zhong and Chen (2021) have claimed, namely that relation signals have minimal effect on entity prediction, we find several clues that suggest otherwise. First, in section 6.1, we observe that In-triple entities are much easier to predict than Out-of-triple entities, which suggests that relation signals are useful for entity prediction. Second, in section 6.2, we perform a robustness test on NER to evaluate our model's resilience to input perturbation, comparing our method, the only joint model, with relation-free baselines. The results suggest that our method is much more resilient under adverse circumstances, which can be at least partially explained by the introduction of relation signals. To sum up, we find that relation signals do have a non-negligible effect on entity prediction.
The reason Zhong and Chen (2021) conclude that relation information has minimal influence on entity prediction is most probably selection bias: the evaluated dataset ACE05 contains a large proportion of Out-of-triple entities (64%), which in essence do not require any relation signal.

Conclusion
In this paper, we encode task-specific features with our newly proposed model, the Partition Filter Network, for joint entity and relation extraction. Instead of extracting task-specific features in a sequential or parallel manner, we employ a partition filter encoder to generate task-specific features jointly, in order to model two-way inter-task interaction properly. We conduct extensive experiments on six datasets to verify the effectiveness of our model. Overall, the experimental results demonstrate that our model is superior to previous baselines in entity and relation prediction. Furthermore, the dissection of several aspects of our model in the ablation study sheds some light on what works best in our framework. Lastly, contrary to what previous work has claimed, our auxiliary experiments suggest that relation prediction contributes to named entity prediction in a non-negligible way.

A Dataset
We evaluate our model on six datasets. NYT (Riedel et al., 2010) is sampled from New York Times news articles and annotated by distant supervision. WebNLG was originally created for the natural language generation task and was applied by Zeng et al. (2018) as a relation extraction dataset. ACE05 and ACE04 (Walker et al., 2006) are collected from various sources, including news articles and online forums. ADE (Gurulingappa et al., 2012) contains medical descriptions of adverse effects of drug use. SciERC (Luan et al., 2018) is collected from 500 AI paper abstracts originally used for scientific knowledge graph construction. Following previous work, we filter out samples containing overlapping entities in ADE, which make up only 2.8% of the whole dataset. Statistics of the datasets can be found in table 5.

B Implementation Details
We leverage pre-trained language models as our embedding layer. Following previous work, the versions we use are bert-base-cased, albert-xxlarge-v1 and scibert-scivocab-uncased. Batch size and learning rate are set to 4/20 and 1e-5/2e-5 for SciERC/others respectively. In order to prevent overfitting, dropout (Srivastava et al., 2014) is applied to our word embeddings and to the entity span and triple representations of the task units (set to 0.1). We use Adam (Kingma and Ba, 2015) to optimize our model parameters and train our model for 100 epochs. Also, to prevent gradient explosion, gradient clipping is applied during training.

C Analysis on Overlapping Pattern and Triple Number
For a more comprehensive evaluation, we assess our model on the NYT/WebNLG datasets across different triple overlapping patterns (see section 2 for a detailed description of these patterns) and across sentences containing different numbers of triples. Since previous work does not compare triple overlapping patterns and triple numbers on ADE/ACE05/ACE04/SciERC, given that EPO triples are non-existent in these datasets, comparison results are not included for them. As shown in figure 3, our model is mostly superior to the other two baselines in all three categories. Interestingly, in the Normal class, our model performs significantly better on WebNLG, but the score on NYT is basically on par with TpLinker. We argue that this is probably because NYT, generated by distant supervision, is much noisier than WebNLG. Besides, sentences with Normal triples are likely to be noisier than sentences with EPO and SEO triples, since there is a higher chance of incomplete annotation. Thus it is unsurprising that no significant improvement is achieved in predicting Normal triples on NYT.
Besides, from figure 4 we observe that our model performs better on sentences with more than five triples on both datasets, where the interaction between entities and relations becomes very complex. The strong performance on those sentences confirms the capability of our model in handling complex entity-relation interaction.

D Details of Robustness Test
Descriptions of the transformation methods used in Table 4 are listed as follows: 1. ConcatSent - Concatenate sentences into a longer one.
2. CrossCategory - Entity swap, swapping entities with ones that can be labeled with different types.
Transformations for RE are not viable for the following reasons: 1. The input is restricted to one triple per sentence.
2. The methods include entity swap, which is already covered in NER.