UniRE: A Unified Label Space for Entity Relation Extraction

Many joint entity relation extraction models set up two separate label spaces for the two sub-tasks (i.e., entity detection and relation classification). We argue that this setting may hinder the information interaction between entities and relations. In this work, we propose to eliminate the different treatment of the two sub-tasks' label spaces. The input of our model is a table containing all word pairs from a sentence. Entities and relations are represented by squares and rectangles in the table. We apply a unified classifier to predict each cell's label, which unifies the learning of the two sub-tasks. For testing, an effective (yet fast) approximate decoder is proposed for finding squares and rectangles in tables. Experiments on three benchmarks (ACE04, ACE05, SciERC) show that, using only half the number of parameters, our model achieves accuracy competitive with the best extractor, and is faster.


Introduction
Extracting structured information from plain texts is a long-lasting research topic in NLP. Typically, it aims to recognize specific entities and relations for profiling the semantics of sentences. An example is shown in Figure 1, where a person entity "David Perkins" and a geography entity "California" have a physical location relation PHYS.
Figure 1: Each cell corresponds to a word pair. Entities are squares on the diagonal; relations are rectangles off the diagonal. Note that PER-SOC is an undirected (symmetric) relation type, while PHYS and ORG-AFF are directed (asymmetric) relation types. The table exactly expresses overlapped relations, e.g., the person entity "David Perkins" participates in two relations, ("David Perkins", "wife", PER-SOC) and ("David Perkins", "California", PHYS). For every cell, the same biaffine model predicts its label. The joint decoder then finds the best squares and rectangles.

Methods for detecting entities and relations can be categorized into pipeline models or joint models. In the pipeline setting, entity models and relation models are independent, with disentangled feature spaces and output label spaces. In the joint setting, on the other hand, some parameter sharing of feature spaces (Miwa and Bansal, 2016; Katiyar and Cardie, 2017) or decoding interactions (Yang and Cardie, 2013) is imposed to explore the common structure of the two tasks. It was believed that joint models could be better, since they can alleviate error propagation among sub-models, have more compact parameter sets, and uniformly encode prior knowledge (e.g., constraints) on both tasks. However, Zhong and Chen (2020) recently showed that, with the help of modern pre-training tools (e.g., BERT), separating the entity and relation models (with independent encoders and pipeline decoding) can surpass existing joint models. They argue that, since the output label spaces of entity and relation models are different, separate encoders can better capture distinct contextual information, avoid potential conflicts between the sub-tasks, and help decoders make more accurate predictions than shared encoders; that is, separate label spaces deserve separate encoders.
In this paper, we pursue a better joint model for entity relation extraction. After revisiting existing methods, we find that although entity models and relation models share encoders, their label spaces are usually still separate (even in models with joint decoders). Therefore, in parallel to Zhong and Chen (2020), we ask: do joint encoders (and decoders) deserve joint label spaces?
The challenge of developing a unified entity-relation label space is that the two sub-tasks are usually formulated as different learning problems (e.g., entity detection as sequence labeling, relation classification as multi-class classification), and their labels are placed on different objects (e.g., words vs. word pairs). One prior attempt (Zheng et al., 2017) handles both sub-tasks with one sequence labeling model, using a compound label set devised to encode both entities and relations. However, the model's expressiveness is sacrificed: it can detect neither overlapping relations (i.e., entities participating in multiple relations) nor isolated entities (i.e., entities not appearing in any relation).
Our key idea for defining a new unified label space is this: if we view Zheng et al. (2017)'s solution as performing relation classification during entity labeling, we can also consider the reverse direction, seeing entity detection as a special case of relation classification. Our new input space is a two-dimensional table, with each entry corresponding to a word pair in the sentence (Figure 1). The joint model assigns labels to each cell from a unified label space (the union of the entity type set and the relation type set). Graphically, entities are squares on the diagonal, and relations are rectangles off the diagonal. This formulation retains full model expressiveness regarding existing entity relation extraction scenarios (e.g., overlapped relations, directed relations, undirected relations). It is also different from current table filling settings for entity relation extraction (Miwa and Sasaki, 2014; Gupta et al., 2016; Zhang et al., 2017; Wang and Lu, 2020), which still have separate label spaces for entities and relations, and treat on-diagonal and off-diagonal entries differently.
Based on the tabular formulation, our joint entity relation extractor performs two actions, filling and decoding. First, filling the table means predicting each word pair's label, which is similar to the arc prediction task in dependency parsing. We adopt the biaffine attention mechanism (Dozat and Manning, 2016) to learn interactions between word pairs. We also impose two structural constraints on the table through structural regularizations. Next, given the table filled with label logits, we devise an approximate joint decoding algorithm to output the final extracted entities and relations. Basically, it efficiently finds split points in the table to identify squares and rectangles (again different from existing table filling models, which still apply certain sequential decoding and fill tables incrementally).
Experimental results on three benchmarks (ACE04, ACE05, SciERC) show that the proposed joint method achieves competitive performance compared with the current state-of-the-art extractor (Zhong and Chen, 2020): it is better on ACE04 and SciERC, and competitive on ACE05. 1 Meanwhile, our new joint model is fast at decoding (10x faster than the exact pipeline implementation, and comparable to an approximate pipeline, which attains lower performance). It also has a more compact parameter set: the shared encoder uses only half the number of parameters compared with the separate encoders of Zhong and Chen (2020).

Task Definition
Given an input sentence s = x_1, x_2, ..., x_|s| (where x_i is a word), this task is to extract a set of entities E and a set of relations R. An entity e is a span (e.span) with a pre-defined type e.type ∈ Y_e (e.g., PER, GPE); the span is a continuous sequence of words. A relation r is a triplet (e_1, e_2, l), where e_1, e_2 are two entities and l ∈ Y_r is a pre-defined relation type describing the semantic relation between the two entities (e.g., the PHYS relation between PER and GPE mentioned before). Here Y_e and Y_r denote the sets of possible entity types and relation types respectively.
We formulate joint entity relation extraction as a table filling task (multi-class classification for each word pair in sentence s), as shown in Figure 1. For the sentence s, we maintain a table T^{|s|×|s|}. For each cell (i, j) in table T, we assign a label y_{i,j} ∈ Y, where Y = Y_e ∪ Y_r ∪ {⊥} (⊥ denotes no relation). For each entity e, the labels of the corresponding cells y_{i,j} (x_i ∈ e.span, x_j ∈ e.span) should be filled with e.type. For each relation r = (e_1, e_2, l), the labels of the corresponding cells y_{i,j} (x_i ∈ e_1.span, x_j ∈ e_2.span) should be filled with l. 2 All other cells should be filled with ⊥.
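To make the tabular formulation concrete, here is a minimal sketch (not the authors' code) that fills a gold table for a toy sentence; the label inventory and the helper `build_gold_table` are hypothetical illustrations of the filling rules above.

```python
import numpy as np

# Hypothetical label inventory for illustration; index 0 is the null label ⊥.
LABELS = ["NULL", "PER", "GPE", "PHYS"]
LABEL_ID = {l: i for i, l in enumerate(LABELS)}

def build_gold_table(n, entities, relations):
    """Fill an n x n table: entity types on square blocks along the
    diagonal, relation types on off-diagonal rectangles, ⊥ elsewhere."""
    table = np.zeros((n, n), dtype=int)
    for start, end, etype in entities:                 # inclusive spans
        table[start:end + 1, start:end + 1] = LABEL_ID[etype]
    for (s1, e1), (s2, e2), rtype in relations:        # directed: (e1, e2)
        table[s1:e1 + 1, s2:e2 + 1] = LABEL_ID[rtype]
    return table

# Toy example: a 2-word PER entity, a 1-word GPE entity, and a PHYS relation.
gold = build_gold_table(
    n=8,
    entities=[(0, 1, "PER"), (7, 7, "GPE")],
    relations=[((0, 1), (7, 7), "PHYS")],
)
```

Note that a directed relation fills only one rectangle; an undirected type such as PER-SOC would fill both rectangles symmetrically.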
In the test phase, decoding entities and relations becomes a rectangle finding problem. Note that solving this problem is not trivial, and we propose a simple but effective joint decoding algorithm to tackle this challenge.

Approach
In this section, we first introduce our biaffine model for the table filling task based on pre-trained language models (Section 3.1). Then we detail the main objective function of the table filling task (Section 3.2) and some constraints imposed on the table in the training stage (Section 3.3). Finally we present the joint decoding algorithm to extract entities and relations (Section 3.4). Figure 2 shows an overview of our model architecture. 3

Biaffine Model
Given an input sentence s, to obtain the contextual representation h_i for each word, we use a pre-trained language model (PLM) as our sentence encoder (e.g., BERT). The output of the encoder is

{h_1, h_2, ..., h_|s|} = PLM({x_1, x_2, ..., x_|s|}),

where x_i is the input representation of each word x_i. Taking BERT as an example, x_i sums the corresponding token, segment, and position embeddings.
To capture long-range dependencies, we also employ cross-sentence context following (Zhong and Chen, 2020), which extends the sentence to a fixed window size W (W = 200 in our default settings).
To better encode direction information of words in table T, we use the deep biaffine attention mechanism (Dozat and Manning, 2016), which achieves impressive results in dependency parsing. Specifically, we employ two dimension-reducing MLPs (multi-layer perceptrons), a head MLP and a tail MLP, on each h_i:

h_i^head = MLP_head(h_i),   h_i^tail = MLP_tail(h_i),

where h_i^head ∈ R^d and h_i^tail ∈ R^d are projection representations, allowing the model to identify the head or tail role of each word. Next, we calculate the scoring vector g_{i,j} ∈ R^{|Y|} of each word pair with the biaffine model,

g_{i,j} = (h_i^head)^T U_1 h_j^tail + U_2 (h_i^head ⊕ h_j^tail) + b,

where U_1 ∈ R^{|Y|×d×d} and U_2 ∈ R^{|Y|×2d} are weight parameters, b ∈ R^{|Y|} is the bias, and ⊕ denotes concatenation.
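As a sanity check of the shapes involved, the biaffine scoring above can be sketched in NumPy; parameter names mirror the equation, and random values stand in for learned weights:

```python
import numpy as np

def biaffine_scores(H_head, H_tail, U1, U2, b):
    """g[i, j] = head_i^T U1 tail_j + U2 (head_i ⊕ tail_j) + b.
    Shapes: H_head, H_tail (n, d); U1 (|Y|, d, d); U2 (|Y|, 2d); b (|Y|,)."""
    n = H_head.shape[0]
    # Bilinear term, computed label-wise with einsum.
    bilinear = np.einsum("id,ydk,jk->ijy", H_head, U1, H_tail)
    # Linear term on the concatenated head/tail representations.
    concat = np.concatenate(
        [np.repeat(H_head[:, None, :], n, axis=1),
         np.repeat(H_tail[None, :, :], n, axis=0)],
        axis=-1)                                  # (n, n, 2d)
    return bilinear + concat @ U2.T + b           # (n, n, |Y|)

rng = np.random.default_rng(0)
n, d, Y = 5, 4, 3
H_head = rng.normal(size=(n, d))
H_tail = rng.normal(size=(n, d))
U1 = rng.normal(size=(Y, d, d))
U2 = rng.normal(size=(Y, 2 * d))
b = rng.normal(size=Y)
g = biaffine_scores(H_head, H_tail, U1, U2, b)
```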

Table Filling
After obtaining the scoring vector g_{i,j}, we feed it into the softmax function to predict the corresponding label, yielding a categorical probability distribution over the label space Y as P(y_{i,j}|s) = Softmax(dropout(g_{i,j})).
In our experiments, we observe that applying dropout to g_{i,j}, similar to de-noising auto-encoding, can further improve performance. 4 We refer to this trick as logit dropout. The training objective is to minimize

L_entry = -(1/|s|^2) Σ_{i=1}^{|s|} Σ_{j=1}^{|s|} log P(y_{i,j} = gold y_{i,j} | s),   (1)

where the gold label y_{i,j} can be read from annotations, as shown in Figure 1.
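The per-cell objective with logit dropout can be sketched as follows; the dropout placement (zeroing logits before the softmax) follows the paper's description, but the implementation details are our assumption:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entry_loss(g, gold, drop_rate=0.0, rng=None):
    """L_entry: mean negative log-likelihood of the gold label over all
    |s|^2 cells. Logit dropout zeroes entries of g before the softmax."""
    if rng is not None and drop_rate > 0:
        keep = rng.random(g.shape) >= drop_rate
        g = g * keep / (1.0 - drop_rate)          # inverted dropout
    p = softmax(g)                                # (n, n, |Y|)
    n = g.shape[0]
    ii, jj = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return -np.mean(np.log(p[ii, jj, gold] + 1e-12))
```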

Constraints
In fact, Equation 1 is based on the assumption that each label is independent. This assumption simplifies the training procedure but ignores some structural constraints: for example, entities and relations correspond to squares and rectangles in the table, which Equation 1 does not encode explicitly. To enhance our model, we propose two intuitive constraints, symmetry and implication, detailed in this section. Here we introduce a new notation P ∈ R^{|s|×|s|×|Y|}, denoting the stack of P(y_{i,j}|s) for all word pairs in sentence s. 5

Figure 2: Overview of our model architecture. One main objective (L_entry) and two additional objectives (L_sym, L_imp) are imposed on the probability tensor P and optimized jointly.

Symmetry We have several tag-level observations from the table. First, the squares corresponding to entities must be symmetric about the diagonal. Second, for symmetric relations, the relation triples (e_1, e_2, l) and (e_2, e_1, l) are equivalent, so the rectangles corresponding to the two counterpart relation triples are also symmetric about the diagonal. As shown in Figure 1, the rectangles corresponding to ("his", "wife", PER-SOC) and ("wife", "his", PER-SOC) are symmetric about the diagonal. We divide the label set Y into a symmetric label set Y_sym and an asymmetric label set Y_asym. The matrix P_{:,:,t} should be symmetric about the diagonal for each label t ∈ Y_sym. We formulate this tag-level constraint as the symmetric loss

L_sym = (1/|s|^2) Σ_{i=1}^{|s|} Σ_{j=1}^{|s|} Σ_{t ∈ Y_sym} |P_{i,j,t} - P_{j,i,t}|.

We list all Y_sym in Table 1 for our adopted datasets.
Implication A key intuition is that if a relation exists, then its two argument entities must also exist: it is impossible for a relation to exist without the two corresponding entities. From the perspective of probability, this implies that the probability of a relation is not greater than the probability of each argument entity. Since we model entity and relation labels in a unified probability space, this idea can be directly used in our model as the implication constraint. We impose it on P: for each word on the diagonal, its maximum probability over the entity type space Y_e must not be lower than the maximum probability of any other word in the same row or column over the relation type space Y_r. We formulate this table-level constraint as the implication loss

L_imp = (1/|s|) Σ_{i=1}^{|s|} [ max_{l ∈ Y_r} ({P_{i,:,l}} ∪ {P_{:,i,l}}) - max_{t ∈ Y_e} P_{i,i,t} ]_*,

where [u]_* = max(u, 0) is the hinge loss. It is worth noting that we do not add a margin in this loss function: since the value of each item is a probability and may be relatively small, setting a large margin is meaningless. Finally, we jointly optimize the three objectives in the training stage as L_entry + L_sym + L_imp. 6
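The two regularizers can be sketched in NumPy as follows; this is our reading of the formulas, and the label index sets passed in are hypothetical:

```python
import numpy as np

def sym_loss(P, sym_labels):
    """L_sym: penalize asymmetry P[i,j,t] != P[j,i,t] for symmetric labels t."""
    diff = P[:, :, sym_labels] - np.transpose(P, (1, 0, 2))[:, :, sym_labels]
    return np.abs(diff).sum(axis=-1).mean()

def imp_loss(P, ent_labels, rel_labels):
    """L_imp: for each word i, the best relation probability in row/column i
    should not exceed the best entity probability at diagonal cell (i, i)."""
    n = P.shape[0]
    ent_max = P[np.arange(n), np.arange(n)][:, ent_labels].max(axis=-1)
    rel_row = P[:, :, rel_labels].max(axis=(1, 2))   # best relation in row i
    rel_col = P[:, :, rel_labels].max(axis=(0, 2))   # best relation in col i
    gap = np.maximum(rel_row, rel_col) - ent_max
    return np.maximum(gap, 0.0).mean()               # hinge, no margin
```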

Decoding
In the testing stage, given the probability tensor P ∈ R^{|s|×|s|×|Y|} of the sentence s, 7 decoding all rectangles (including squares) corresponding to entities or relations remains a non-trivial problem. Since brute-force enumeration of all rectangles is intractable, a new joint decoding algorithm is needed. We expect our decoder to have:

Figure 3: Overview of our joint decoding algorithm. It consists of three steps: span decoding, entity type decoding, and relation type decoding.
• Simple implementation and fast decoding. We permit slight decoding accuracy drops for scalability.
• Strong interactions between entities and relations. When decoding entities, it should take the relation information into account, and vice versa.
We propose a three-step decoding algorithm: first decode spans (entity spans or spans between entities), then decode the entity type of each span, and finally decode the relation type of each entity pair (Figure 3). We consider each cell's probability scores on all labels (both entity labels and relation labels) and predict spans according to a threshold. Then, we predict the entities and relations with the highest scores. This heuristic decoding algorithm is very efficient. Next we detail the entire decoding process; a formal description is given in Appendix A.
Span Decoding One crucial observation about a ground-truth table is that, for an arbitrary entity, its corresponding rows (and columns) are exactly identical (e.g., rows 1 and 2 of Figure 1 are identical). This holds not only for the diagonal entries (entities are squares) but also for the off-diagonal entries: if an entity participates in a relation with another entity, all of its rows (columns) carry that relation label in the same way. In other words, wherever adjacent rows/columns differ, there must be an entity boundary (i.e., one row belongs to the entity and the other does not). Therefore, if our biaffine model is reasonably trained, given a model-predicted table, we can use this property to find the split positions of entity boundaries. As expected, experiments (Figure 4) verify this assumption. We adapt the idea to the 3-dimensional probability tensor P. Specifically, we flatten P ∈ R^{|s|×|s|×|Y|} into a matrix P_row ∈ R^{|s|×(|s|·|Y|)} from the row perspective, and calculate the Euclidean (l2) distances between adjacent rows. Similarly, we calculate the Euclidean distances between adjacent columns using a matrix P_col ∈ R^{(|s|·|Y|)×|s|} from the column perspective, and average the two distances to obtain the final distance. If the distance is larger than the threshold α (α = 1.4 in our default settings), the position is a split position. In this way, we can decode all spans in O(|s|) time.
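The span decoding step can be sketched as follows; the function name and span representation (inclusive boundaries) are our own conventions:

```python
import numpy as np

def decode_spans(P, alpha=1.4):
    """Split the sentence wherever adjacent rows/columns of the flattened
    probability tensor differ by more than alpha (averaged l2 distance).
    Returns a list of inclusive spans (start, end)."""
    n = P.shape[0]
    P_row = P.reshape(n, -1)                         # (n, n*|Y|), row view
    P_col = np.transpose(P, (1, 0, 2)).reshape(n, -1)  # column view
    d_row = np.linalg.norm(P_row[1:] - P_row[:-1], axis=1)
    d_col = np.linalg.norm(P_col[1:] - P_col[:-1], axis=1)
    dist = (d_row + d_col) / 2.0
    splits = [i + 1 for i in range(n - 1) if dist[i] > alpha]
    bounds = [0] + splits + [n]
    return [(bounds[k], bounds[k + 1] - 1) for k in range(len(bounds) - 1)]
```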
Entity Type Decoding Given a span (i, j) from span decoding, 8 we decode the entity type t̂ according to the corresponding square, which is symmetric about the diagonal: t̂ = arg max_{t ∈ Y_e ∪ {⊥}} Avg(P_{i:j, i:j, t}). If t̂ ∈ Y_e, we decode an entity; if t̂ = ⊥, the span (i, j) is not an entity.
Relation Type Decoding After entity type decoding, given an entity e_1 with span (i, j) and another entity e_2 with span (m, n), we decode the relation type l̂ between e_1 and e_2 according to the corresponding rectangle: l̂ = arg max_{l ∈ Y_r ∪ {⊥}} Avg(P_{i:j, m:n, l}). If l̂ ∈ Y_r, we decode a relation (e_1, e_2, l̂); if l̂ = ⊥, e_1 and e_2 have no relation.
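The two type-decoding steps can be sketched together; the helper name and the label-index arguments are ours, and ⊥ is encoded as index 0:

```python
import numpy as np

def decode_types(P, spans, ent_labels, rel_labels, null_label=0):
    """Average the probabilities inside each candidate square/rectangle and
    take the argmax over the relevant label subset plus ⊥ (null_label)."""
    entities = []
    for i, j in spans:
        avg = P[i:j + 1, i:j + 1].mean(axis=(0, 1))
        t = max(list(ent_labels) + [null_label], key=lambda y: avg[y])
        if t != null_label:
            entities.append((i, j, t))
    relations = []
    for (i, j, _) in entities:
        for (m, n, _) in entities:
            if (i, j) == (m, n):
                continue
            avg = P[i:j + 1, m:n + 1].mean(axis=(0, 1))
            l = max(list(rel_labels) + [null_label], key=lambda y: avg[y])
            if l != null_label:
                relations.append(((i, j), (m, n), l))
    return entities, relations
```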
Evaluation Following suggestions in (Taillé et al., 2020), we evaluate Precision (P), Recall (R), and F1 scores with micro-averaging and adopt the Strict Evaluation criterion. Specifically, a predicted entity is correct if its type and boundaries are correct, and a predicted relation is correct if its relation type is correct and the boundaries and types of its two argument entities are correct.

Implementation Details
We tune all hyperparameters based on the averaged entity F1 and relation F1 on the ACE05 development set, then keep the same settings on ACE04 and SciERC. For fair comparison with previous work, we use three pre-trained language models as the sentence encoder and fine-tune them in the training stage: bert-base-uncased (Devlin et al., 2019) and albert-xxlarge-v1 for ACE04 and ACE05, and scibert-scivocab-uncased (Beltagy et al., 2019) for SciERC. For the MLP layer, we set the hidden size to d = 150 and use GELU as the activation function. We use the AdamW optimizer (Loshchilov and Hutter, 2017) with β_1 = 0.9 and β_2 = 0.9, and observe a phenomenon similar to Dozat and Manning (2016): raising β_2 from 0.9 to 0.999 causes a significant drop in final performance. The batch size is 32, and the learning rate is 5e-5 with weight decay 1e-5. We apply a linear warm-up learning rate scheduler with a warm-up ratio of 0.2. We train our model for a maximum of 200 epochs (300 epochs for SciERC) and employ an early-stopping strategy. We perform all experiments on an Intel(R) Xeon(R) W-3175X CPU and an NVIDIA Quadro RTX 8000 GPU.

Table 3 summarizes previous work and our UNIRE on the three datasets. 13 In general, UNIRE achieves the best performance on ACE04 and SciERC and a comparable result on ACE05. Compared with the previous best joint model (Wang and Lu, 2020), our model significantly advances both entity and relation performance, i.e., absolute entity F1 gains of +0.9 and +0.7 and relation F1 gains of +3.4 and +1.7, on ACE04 and ACE05 respectively. Against the best pipeline model (Zhong and Chen, 2020) (the current SOTA), our model achieves superior performance on ACE04 and SciERC and comparable performance on ACE05. Compared with ACE04/ACE05, SciERC is much smaller, so entity performance on SciERC drops sharply. Since Zhong and Chen (2020) is a pipeline method, its relation performance is severely hurt by the poor entity performance.
Nevertheless, our model is less influenced in this case and achieves better performance. Besides, our model achieves better relation performance even with worse entity results on ACE04. In fact, our base model (BERT_BASE) already achieves competitive relation performance, even exceeding prior models based on BERT_LARGE (Li et al., 2019) and ALBERT_XXLARGE (Wang and Lu, 2020). These results confirm that the proposed unified label space is effective for exploring the interaction between entities and relations. Note that all subsequent experimental results on ACE04 and ACE05 are based on BERT_BASE for efficiency.

13 Since Luan et al. (2019a) and Wadden et al. (2019) neglect the argument entity type in relation evaluation and underperform our baseline (Zhang et al., 2020), we do not compare their results here.

Ablation Study
In this section, we analyze the effects of components in UNIRE under different settings (Table 4). In particular, we implement a naive decoding algorithm for comparison, namely "hard decoding", which takes an "intermediate table" as input: the hard form of the probability tensor P output by the biaffine model, i.e., choosing for each cell the class with the highest probability as its label. To find entity squares on the diagonal, it first judges whether the largest square (|s| × |s|) is an entity; the criterion is simply counting the different entity labels appearing in the square and choosing the most frequent one. If the most frequent label is ⊥, it shrinks the square size by 1, repeats the procedure on the two (|s| − 1) × (|s| − 1) squares, and so on. To avoid entity overlapping, an entity is discarded if it overlaps with already-identified entities. To find relations, each entity pair is labeled with the most frequent relation label in the corresponding rectangle.
From the ablation study, we get the following observations.
• When one of the additional losses is removed, the performance declines to varying degrees (lines 2-3). Specifically, the symmetry loss has a significant impact on SciERC (entity and relation performance decrease by 1.1 and 1.4 points respectively), while removing the implication loss clearly harms relation performance on ACE05 (1.0 point). This demonstrates that the structural information incorporated by both losses is useful for this task.
• Logit dropout prevents the model from overfitting, and cross-sentence context provides more contextual information for this task, especially for small datasets like SciERC.
• The "hard decoding" has the worst performance (its relation performance is almost half of the "Default") (line 6). The major reason is that "hard decoding" separately decodes entities and relations. It shows the proposed decoding algorithm jointly considers entities and relations, which is important for decoding.

Inference Speed
Following Zhong and Chen (2020), we evaluate the inference speed of our model (Table 5) on ACE05 and SciERC with the same batch size and pre-trained encoders (BERT_BASE for ACE05 and SciBERT for SciERC). Compared with the pipeline method of Zhong and Chen (2020), we obtain a more than 10× speedup and achieve comparable or even better relation performance with W = 200. Against their approximate version, our inference speed is still competitive but with better performance. If the context window size is set the same as Zhong and Chen (2020) (W = 100), we can further accelerate inference with slight performance drops. Besides, "hard decoding" is much slower than UNIRE, which demonstrates the efficiency of the proposed decoding algorithm.

Impact of Different Threshold α
In Figure 4, the distance between adjacent rows not at an entity boundary ("Non-Ent-Bound") concentrates near 0, while the distance at entity boundaries ("Ent-Bound") is usually greater than 1. This phenomenon verifies the correctness of our span decoding method. We then evaluate performance with regard to the threshold α in Figure 5. 14 Both span and entity performance sharply decrease when α increases from 1.4 to 1.5, while relation performance starts to decline slowly from α = 1.5. The major reason is that relations are so sparse that many entities do not participate in any relation, so the effective threshold for relations is much higher than that for entities. Moreover, we observe a similar phenomenon on ACE04 and SciERC, and α = 1.4 is the best general setting across the three datasets, which shows the stability and generalization of our model.

14 We use an additional metric, "Span F1", to evaluate span performance: the micro-F1 of predicted split positions.

Context Window and Logit Dropout Rate
In Table 4, both cross-sentence context and logit dropout can improve the entity and relation performance. Table 6 shows the effect of different context window size W and logit dropout rate p. The entity and relation performances are significantly improved from W = 100 to W = 200, and drop sharply from W = 200 to W = 300. Similarly, we achieve the best entity and relation performances when p = 0.2. So we use W = 200 and p = 0.2 in our final model.

Error Analysis
We further analyze the remaining errors for relation extraction and present the distribution of five error types in Figure 6: span splitting error (SSE), entity not found (ENF), entity type error (ETE), relation not found (RNF), and relation type error (RTE). The proportion of "SSE" is relatively small, which proves the effectiveness of our span decoding method. Moreover, the proportion of "not found" errors is significantly larger than that of "type" errors for both entities and relations. The primary reason is that table filling suffers from class imbalance, i.e., the number of ⊥ labels is much larger than that of other classes. We leave this imbalanced classification problem for future work. Finally, we give some concrete examples in Figure 7 to verify the robustness of our decoding algorithm. There are some errors in the biaffine model's prediction, such as the cells in the upper left corner (first example) and the upper right corner (second example) of the intermediate table. However, these errors are corrected after decoding, which demonstrates that our decoding algorithm not only recovers all entities and relations but also corrects errors by leveraging the table structure and neighboring cells' information.

Figure 6: Distribution of five relation extraction errors on ACE05 and SciERC test data.

Related Work
Entity relation extraction has been extensively studied over the decades. Existing methods can be roughly divided into two categories according to the adopted label space.
Separate Label Spaces This category studies this task as two separate sub-tasks, entity recognition and relation classification, which are defined in two separate label spaces. One early paradigm is the pipeline method (Zelenko et al., 2003; Miwa et al., 2009), which uses two independent models for the two sub-tasks respectively. Joint methods then handle this task with an end-to-end model to explore more interaction between entities and relations. The most basic joint paradigm, parameter sharing (Miwa and Bansal, 2016; Katiyar and Cardie, 2017), adopts two independent decoders based on a shared encoder. Recent span-based models (Luan et al., 2019b; Wadden et al., 2019) also use this paradigm.
To enhance the connection between the two decoders, many joint decoding algorithms have been proposed, such as the ILP-based joint decoder (Yang and Cardie, 2013), joint MRT (Sun et al., 2018), and GCN-based joint inference. Actually, the table filling method (Miwa and Sasaki, 2014; Gupta et al., 2016; Zhang et al., 2017) is a special case of parameter sharing in table structure. These joint models all focus on various joint algorithms but ignore the fact that they are essentially based on separate label spaces.
Unified Label Space This family of methods aims to unify the two sub-tasks and tackle this task in a unified label space. Entity relation extraction has been converted into a tagging problem (Zheng et al., 2017), a transition-based parsing problem (Wang et al., 2018), and a generation problem with a Seq2Seq framework (Zeng et al., 2018; Nayak and Ng, 2020). We follow this trend and propose a new unified label space. We introduce a 2D table to tackle the overlapping relation problem of Zheng et al. (2017). Also, our model is more versatile, as it does not rely on external expert knowledge to design a complex transition system as in Wang et al. (2018).

Figure 7: Decoding examples. "Gold Table" presents the gold labels. "Intermediate Table" presents the biaffine model's prediction (choosing the label with the highest probability for each cell). "Decoded Table" presents the final results after decoding.

Conclusion
In this work, we extract entities and relations in a unified label space to better mine the interaction between both sub-tasks. We propose a novel table that presents entities and relations as squares and rectangles. Then this task can be performed in two simple steps: filling the table with our biaffine model and decoding entities and relations with our joint decoding algorithm. Experiments on three benchmarks show the proposed method achieves not only state-of-the-art performance but also promising efficiency.