Modeling Joint Entity and Relation Extraction with Table Representation

This paper proposes a history-based structured learning approach that jointly extracts entities and relations in a sentence. We introduce a novel, simple, and flexible table representation of entities and relations. We investigate several feature settings, search orders, and learning methods with inexact search on the table. The experimental results demonstrate that a joint learning approach significantly outperforms a pipeline approach by incorporating global features and by selecting appropriate learning methods and search orders.


Introduction
Extraction of entities and relations from texts has been traditionally treated as a pipeline of two separate subtasks: entity recognition and relation extraction. This separation makes the task easy to deal with, but it ignores underlying dependencies between and within subtasks. First, since entity recognition is not affected by relation extraction, errors in entity recognition are propagated to relation extraction. Second, relation extraction is often treated as a multi-class classification problem on pairs of entities, so dependencies between pairs are ignored. Examples of these dependencies are illustrated in Figure 1. For dependencies between subtasks, a Live in relation requires PER and LOC entities, and vice versa. For in-subtask dependencies, the Live in relation between "Mrs. Tsutayama" and "Japan" can be inferred from the two other relations.
Figure 1 also shows that the task has a flexible graph structure. Unlike other natural language processing (NLP) tasks such as part-of-speech (POS) tagging and dependency parsing, this structure usually does not cover all the words in a sentence, so local constraints are considered to be more important in the task. Joint learning approaches (Yang and Cardie, 2013; Singh et al., 2013) incorporate these dependencies and local constraints in their models; however, most approaches are time-consuming and employ complex structures consisting of multiple models. Li and Ji (2014) recently proposed a history-based structured learning approach that is simpler and more computationally efficient than other approaches. While this approach is promising, its search is still complex and its search order is restricted, partly due to its semi-Markov representation, and thus the potential of history-based learning has not been fully investigated.

Figure 1: An entity and relation example (Roth and Yih, 2004) for the sentence "Mrs. Tsuruyama is from Kumamoto Prefecture in Japan ." with entities PER, LOC, LOC and relations Live_in, Located_in, Live_in. Person (PER) and location (LOC) entities are connected by Live_in and Located_in relations.
In this paper, we introduce an entity and relation table to address the difficulty in representing the task. We propose a joint extraction of entities and relations using a history-based structured learning on the table. This table representation simplifies the task into a table-filling problem, and makes the task flexible enough to incorporate several enhancements that have not been addressed in the previous history-based approach, such as search orders in decoding, global features from relations to entities, and several learning methods with inexact search.

Method
In this section, we first introduce an entity and relation table that is utilized to represent the whole entity and relation structures in a sentence. We then overview our model on the table. We finally explain the decoding, learning, search order, and features in our model.

Entity and relation table
The task we address in this work is the extraction of entities and their relations from a sentence. Entities are typed and may span multiple words. Relations are typed and directed.
We use words to represent entities and relations. We assume entities do not overlap. We employ a BILOU (Begin, Inside, Last, Outside, Unit) encoding scheme that has been shown to outperform the traditional BIO scheme (Ratinov and Roth, 2009), and we will show in §2.3.2 that this scheme induces several label dependencies between words and between words and relations. A label is assigned to a word according to its position relative to its corresponding entity and the type of the entity. Relations are represented with their types and directions. ⊥ denotes a non-relation pair, and → and ← denote left-to-right and right-to-left relations, respectively. Relations are defined on words rather than on entities, since entities are not always given when relations are extracted. Relations on entities are mapped to relations on the last words of the entities.
Based on this representation, we propose an entity and relation table that jointly represents entities and relations in a sentence. Figure 2 illustrates an entity and relation table corresponding to an example in Figure 1. We use only the lower triangular part because the table is symmetric, so the number of cells is n(n + 1)/2 when there are n words in a sentence. With this entity and relation table representation, the joint extraction problem can be mapped to a table-filling problem in that labels are assigned to cells in the table.
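The table construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name build_table, the (first, last, type) entity tuples, the ASCII stand-ins "NONE", "->" and "<-" for ⊥, → and ←, and the reading of → as "the source entity is to the left of the target entity" are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

NO_RELATION = "NONE"  # ASCII stand-in for the non-relation label


def build_table(
    words: List[str],
    entities: List[Tuple[int, int, str]],   # (first word, last word, type), inclusive
    relations: List[Tuple[int, int, str]],  # (source entity idx, target entity idx, type)
) -> Dict[Tuple[int, int], str]:
    n = len(words)
    # Lower triangle only: n * (n + 1) / 2 cells, keyed by (row, column), column <= row.
    table = {(r, c): ("O" if r == c else NO_RELATION)
             for r in range(n) for c in range(r + 1)}
    # Diagonal cells carry BILOU entity labels.
    for first, last, etype in entities:
        if first == last:
            table[(first, first)] = f"U-{etype}"
            continue
        table[(first, first)] = f"B-{etype}"
        for k in range(first + 1, last):
            table[(k, k)] = f"I-{etype}"
        table[(last, last)] = f"L-{etype}"
    # Off-diagonal cells carry relations between the last words of the two entities.
    for src, tgt, rtype in relations:
        a, b = entities[src][1], entities[tgt][1]
        arrow = "->" if a < b else "<-"  # assumed: source left of target means "->"
        table[(max(a, b), min(a, b))] = f"{rtype}{arrow}"
    return table


words = "Mrs. Tsutayama is from Kumamoto Prefecture in Japan .".split()
entities = [(0, 1, "PER"), (4, 5, "LOC"), (7, 7, "LOC")]
relations = [(0, 1, "Live_in"), (1, 2, "Located_in"), (0, 2, "Live_in")]
table = build_table(words, entities, relations)
```

For the nine-token example sentence this gives 9 · 10 / 2 = 45 cells, with entity labels on the diagonal and each relation stored in the cell indexed by the last words of its two entities.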

Model
We tackle the table-filling problem by a history-based structured learning approach that assigns labels to cells one by one. This is mostly the same as the traditional history-based model (Collins, 2002) except for the table representation.
Let x be an input table, Y(x) be all possible assignments to the table, and s(x, y) be a scoring function that assesses the assignment of y ∈ Y(x) to x. With these definitions, we define our model to predict the most probable assignment as follows:
\[ y^* = \arg\max_{y \in Y(x)} s(x, y) \tag{1} \]
This scoring function is a decomposable function, and each decomposed function assesses the assignment of a label to a cell in the table:
\[ s(x, y) = \sum_{i=1}^{|x|} s(x, y, 1, i) \tag{2} \]
Here, |x| denotes the number of cells in the table and i represents an index of a cell in the table, which will be explained in §2.3.1. The decomposed function s(x, y, 1, i) corresponds to the i-th cell. The decomposed function is represented as a linear model, i.e., an inner product of features and their corresponding weights:
\[ s(x, y, 1, i) = w \cdot f(x, y, 1, i) \tag{3} \]
The scoring function is further divided into two functions as follows:
\[ s(x, y, 1, i) = s_{\text{local}}(x, y, i) + s_{\text{global}}(x, y, 1, i) \tag{4} \]
Here, s_local(x, y, i) is a local scoring function that assesses the assignment to the i-th cell without considering other assignments, and s_global(x, y, 1, i) is a global scoring function that assesses the assignment in the context of the 1st to (i − 1)-th assignments. This global scoring function represents the dependencies between entities, between relations, and between entities and relations. Similarly, features f are divided into local features f_local and global features f_global, and they are defined on their target cell and its surrounding contexts. The features will be explained in §2.5. The weights w can also be divided, but they are tuned jointly in learning as shown in §2.4.
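The decomposition into per-cell local and global scores can be illustrated with a small sketch. The feature functions below are toy placeholders invented for this example (the paper's actual features are described in §2.5), cells are 0-indexed, and weights are stored in a plain dictionary.

```python
from typing import Dict, Hashable, Sequence

Features = Dict[Hashable, float]


def f_local(x: Sequence[str], y: Sequence[str], i: int) -> Features:
    """Toy local features: depend only on cell i and its own label."""
    return {("cell+label", x[i], y[i]): 1.0}


def f_global(x: Sequence[str], y: Sequence[str], i: int) -> Features:
    """Toy global features: may look at any label assigned before cell i."""
    if i == 0:
        return {}
    return {("prev+label", y[i - 1], y[i]): 1.0}


def dot(w: Dict[Hashable, float], feats: Features) -> float:
    """Inner product of a sparse feature vector with the weights."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())


def cell_score(w, x, y, i) -> float:
    """Score of cell i: local part plus global part, sharing one weight vector."""
    return dot(w, f_local(x, y, i)) + dot(w, f_global(x, y, i))


def score(w, x, y) -> float:
    """Score of a full assignment: sum of the per-cell scores."""
    return sum(cell_score(w, x, y, i) for i in range(len(x)))


w = {("cell+label", "Japan", "U-LOC"): 2.0, ("prev+label", "O", "U-LOC"): 0.5}
total = score(w, ["in", "Japan"], ["O", "U-LOC"])
```

Because the global part of a cell's score may read every earlier label, the score of a cell depends on the whole history rather than on a fixed-size window.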

Decoding
The scoring function s(x, y, 1, i) in Equation (2) uses all the preceding assignments and does not rely on the Markov assumption, so we cannot employ dynamic programming.
We instead employ a beam search to find the best assignment with the highest score (Collins and Roark, 2004). The beam search assigns labels to cells one by one while keeping the top K best assignments when moving from a cell to the next cell, and it returns the best assignment when labels are assigned to all the cells. The pseudo code for decoding with the beam search is shown in Figure 3.
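A beam search of this kind can be sketched in a few lines. This is a generic illustration rather than a transcription of Figure 3: the callbacks labels_for (legal labels for the next cell given the history) and score_cell (score of one label given the history) are hypothetical, and labels_for is assumed to always return at least one label.

```python
import heapq
from typing import Callable, List, Tuple

History = List[str]


def beam_search(
    num_cells: int,
    labels_for: Callable[[History, int], List[str]],
    score_cell: Callable[[History, int, str], float],
    beam_size: int = 8,
) -> History:
    beam: List[Tuple[float, History]] = [(0.0, [])]
    for i in range(num_cells):
        candidates = []
        for total, history in beam:
            for label in labels_for(history, i):
                # The cell score may depend on every earlier assignment.
                candidates.append((total + score_cell(history, i, label),
                                   history + [label]))
        # Keep only the top-K partial assignments before moving to the next cell.
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1]


def toy_labels(history: History, i: int) -> List[str]:
    return ["A", "B"]


def toy_score(history: History, i: int, label: str) -> float:
    if i == 0:
        return 1.0 if label == "A" else 0.9
    # Label "B" in the second cell pays off only if the first cell was also "B".
    return 5.0 if history[0] == "B" and label == "B" else 0.0


greedy = beam_search(2, toy_labels, toy_score, beam_size=1)
wide = beam_search(2, toy_labels, toy_score, beam_size=2)
```

In the toy example a beam of size 1 commits to "A" in the first cell and misses the higher-scoring assignment that a beam of size 2 recovers, which is the kind of search error that makes the search inexact.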

[Figure 2: The entity and relation table for the example sentence "Mrs. Tsutayama is from Kumamoto Prefecture in Japan ." in Figure 1.]

We explain how to map the table to a sequence (line 2 in Figure 3), and how to calculate possible assignments (line 6 in Figure 3) in the following subsections.

Table-to-sequence mapping
Cells in an input table are originally indexed in two dimensions. To apply our model in §2.2 to the cells, we need to map the two-dimensional table to a one-dimensional sequence. This is equivalent to defining a search order in the table, so we will use the terms "mapping" and "search order" interchangeably.
Since it is infeasible to try all possible mappings, we define six promising static mappings (search orders) as shown in Figure 4. Note that the "left" and "right" directions in the captions refer to positions in the table, not to word order. We define two mappings (Figures 4(a) and 4(b)) with the highest priority on the "up to down" order, which check a sentence forward (from the beginning of a sentence). Similarly, we also define two mappings (Figures 4(c) and 4(d)) with the highest priority on the "right to left" order, which check a sentence backward (from the end of a sentence). We further define two close-first mappings (Figures 4(e) and 4(f)) since entities are easier to find than relations and close relations are easier to find than distant relations.
We also investigate dynamic mappings (search orders) with an easy-first policy (Goldberg and Elhadad, 2010). Dynamic mappings differ from the static mappings above, since we reorder the cells before each decoding. We evaluate the cells using the local scoring function, and assign indices to the cells so that the cells with higher scores have higher priorities. In addition to this naïve easy-first policy, we define two other dynamic mappings that restrict the reordering by combining the easy-first policy with one of the following two policies: entity-first (all entities are detected before relations) and close-first (closer cells are detected before distant cells).
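The idea of a search order as a sort over table cells can be sketched as follows. Since Figure 4 is not reproduced here, close_first is only one plausible reading of a close-first order (diagonal first, then growing distance, left to right), and local_score is a hypothetical callback returning the best local score of a cell; neither is taken from the authors' code.

```python
from typing import Callable, List, Tuple

Cell = Tuple[int, int]  # (row, column) in the lower triangle, column <= row


def cells(n: int) -> List[Cell]:
    return [(r, c) for r in range(n) for c in range(r + 1)]


def close_first(n: int) -> List[Cell]:
    """Static order: entity cells first, then relations between ever more distant words."""
    return sorted(cells(n), key=lambda rc: (rc[0] - rc[1], rc[1]))


def easy_first(n: int, local_score: Callable[[Cell], float],
               entity_first: bool = False) -> List[Cell]:
    """Dynamic order: cells with higher local scores come first.
    With entity_first=True, all diagonal (entity) cells precede relation cells."""
    def key(cell: Cell):
        is_relation = cell[0] != cell[1]
        group = is_relation if entity_first else False
        return (group, -local_score(cell))
    return sorted(cells(n), key=key)


scores = {(0, 0): 0.1, (1, 0): 0.9, (1, 1): 0.5}
static_order = close_first(3)
naive_order = easy_first(2, scores.get)
restricted_order = easy_first(2, scores.get, entity_first=True)
```

Any such ordering yields the one-dimensional cell index i used by the scoring function; the restricted variant shows how the easy-first policy can be combined with an entity-first constraint.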

Label dependencies
To avoid illegal assignments to a table, we have to restrict the possible assignments to the cells according to the preceding assignments. This restriction can also reduce the computational costs.
We consider all the dependencies between cells to allow the assignments of labels to the cells in an arbitrary order. Our representation of entities and relations in §2.1 induces the dependencies between entities and between entities and relations. Tables 1-3 summarize these dependencies on the i-th word w_i in a sentence. We can further utilize dependencies between entity types and relation types if some entity types are involved in a limited number of relation types or vice versa. We note that the dependencies between entity types and relation types include not only words participating in relations but also their surrounding words. For example, the label on w_{i−1} can restrict the types of relations involving w_i. We employ these type dependencies in the evaluation, but we omit them here since they are task-dependent.
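Two of the dependencies that follow directly from the representation in §2.1 can be sketched as filters. This is not a reconstruction of Tables 1-3: it covers only the constraint from the left neighbour's BILOU label and the rule that relations sit on the last words of entities, and both function names are hypothetical.

```python
from typing import List, Optional


def allowed_entity_labels(prev_label: Optional[str], candidates: List[str]) -> List[str]:
    """Keep the candidate labels for word i that are legal after the label of word i-1.
    Here prev_label=None means word i is the first word of the sentence."""
    # B- and I- labels open an entity that the next word has to continue or close.
    inside = prev_label is not None and prev_label[0] in ("B", "I")
    allowed = []
    for label in candidates:
        if inside:
            ok = label[0] in ("I", "L") and label[2:] == prev_label[2:]
        else:
            ok = label[0] in ("B", "U", "O")
        if ok:
            allowed.append(label)
    return allowed


def relation_possible(label_a: Optional[str], label_b: Optional[str]) -> bool:
    """A relation other than the non-relation label needs both words to be last
    words of entities (L- or U-). Here None means the word is not labeled yet,
    so nothing can be ruled out."""
    def may_end_entity(label: Optional[str]) -> bool:
        return label is None or label[0] in ("L", "U")
    return may_end_entity(label_a) and may_end_entity(label_b)


candidates = ["O", "I-PER", "L-PER", "L-LOC", "B-LOC"]
after_begin = allowed_entity_labels("B-PER", candidates)
after_outside = allowed_entity_labels("O", candidates)
```

Filtering candidates this way both prevents illegal tables and shrinks the number of labels scored per cell; a full implementation would also need the symmetric constraints from the right neighbour to support arbitrary search orders.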

Learning
The goal of learning is to minimize errors between predicted assignments y* and gold assignments y_gold by tuning the weights w in the scoring function in Equation (3). We employ a margin-based structured learning approach to tune the weights w. The pseudo code is shown in Figure 5. This approach enhances the traditional structured perceptron (Collins, 2002) in the following ways. Firstly, we incorporate a margin ∆ into the scoring function as follows so that wrong assignments with small differences from gold assignments are penalized (lines 4 and 6 in Figure 5) (Freund and Schapire, 1999):
\[ s'(x, y) = s(x, y) + \Delta(y, y_{\text{gold}}) \tag{5} \]
Similarly to the scoring function s, the margin ∆ is defined as a decomposable function using 0-1 loss as follows:
\[ \Delta(y, y_{\text{gold}}) = \sum_{i=1}^{|x|} \Delta(y, y_{\text{gold}}, i) \tag{6} \]
where the sum runs over all cells of the table and ∆(y, y_gold, i) is a 0-1 loss on the i-th cell. Secondly, we update the weights w based on a max-violation update rule following Huang et al. (2012) (lines 6-7 in Figure 5). Finally, we employ not only perceptron (Collins, 2002) but also AROW (Mejer and Crammer, 2010; Crammer et al., 2013), AdaGrad (Duchi et al., 2011), and DCD-SSVM (Chang and Yih, 2013) for learning methods (line 7 in Figure 5). We employ parameter averaging except for DCD-SSVM. AROW and AdaGrad store additional information for covariance and feature counts respectively, and DCD-SSVM keeps a working set and performs additional updates in each iteration. Due to space limitations, we refer to the papers for the details of the learning methods.

...
4: y* ← best assignment for x using decoding in Figure 3 with s′ in Equation (5)
5: if y* ≠ y_gold then
...
10: end for
11: return w
Figure 5: Margin-based structured learning approach with a max-violation update. update(w, f(x, y_gold, 1, m), f(x, y*, 1, m)) depends on employed learning methods.
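The combination of a margin and a max-violation update can be sketched for the plain perceptron case. This is a simplified illustration under several assumptions: best_prefixes is a hypothetical list holding the top beam item after each cell, feats is a toy per-cell feature function, the margin adds a constant per mismatched cell, and the AROW, AdaGrad, and DCD-SSVM updates as well as parameter averaging are not covered.

```python
from typing import Callable, Dict, Hashable, List, Sequence

Features = Dict[Hashable, float]
Weights = Dict[Hashable, float]


def prefix_features(feats: Callable[[Sequence[str], int], Features],
                    y: Sequence[str], m: int) -> Features:
    """Sum the feature vectors of the first m cells of assignment y."""
    total: Features = {}
    for i in range(m):
        for k, v in feats(y, i).items():
            total[k] = total.get(k, 0.0) + v
    return total


def dot(w: Weights, f: Features) -> float:
    return sum(w.get(k, 0.0) * v for k, v in f.items())


def max_violation_update(w: Weights, feats, y_gold: Sequence[str],
                         best_prefixes: List[Sequence[str]],
                         margin: float = 1.0) -> Weights:
    """Perceptron update on the prefix where the margin-augmented prediction
    beats the gold prefix by the largest amount."""
    worst_m, worst_violation = 0, 0.0
    for m, y_hat in enumerate(best_prefixes, start=1):
        loss = margin * sum(a != b for a, b in zip(y_hat, y_gold[:m]))
        violation = (dot(w, prefix_features(feats, y_hat, m)) + loss
                     - dot(w, prefix_features(feats, y_gold, m)))
        if violation > worst_violation:
            worst_m, worst_violation = m, violation
    if worst_m == 0:
        return w  # no prefix violates the margin, so there is nothing to update
    y_hat = best_prefixes[worst_m - 1]
    for k, v in prefix_features(feats, y_gold, worst_m).items():
        w[k] = w.get(k, 0.0) + v
    for k, v in prefix_features(feats, y_hat, worst_m).items():
        w[k] = w.get(k, 0.0) - v
    return w


def toy_feats(y: Sequence[str], i: int) -> Features:
    return {(i, y[i]): 1.0}


updated = max_violation_update({}, toy_feats, ["A", "B"], [["A"], ["A", "A"]])
unchanged = max_violation_update({}, toy_feats, ["A", "B"], [["A"], ["A", "B"]])
```

Updating on the most violated prefix rather than on the full sequence is what keeps the update valid when the beam has already lost the gold assignment.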

Features
Here, we explain the local features f local and the global features f global introduced in §2.2.

Local features
Our focus is not to exploit useful local features for entities and relations, so we incorporate several features from existing work to realize a reasonable baseline.

Global features
We design global features to represent dependencies among entities and relations. Table 5 summarizes the global features. These global features are activated when all the information is available during decoding. We incorporate label dependency features like traditional sequential labeling for entities. Although our model can include other non-local features between entities (Ratinov and Roth, 2009), we do not include them, expecting that global features on entities and relations can cover them. We design three types of global features for relations. These features are activated when none of the participating relations is ⊥ (non-relation). Features except for the "Crossing" category are similar to global relation features in Li and Ji (2014). We further incorporate global features for both entities and relations. These features are activated when the relation label is not ⊥. These features can act as a bridge between entities and relations.

Evaluation
In this section, we first introduce the corpus and evaluation metrics that we employed for evaluation. We then show the performance on the training data set, explaining the parameters used for the test set evaluation, and show the performance on the test data set.

Local features (Target / Category / Features):
  Word n-grams (n=1,2,3) within a context window size of 2
  Word pair (Relation) / Entity / Entity lexical features of each word
  Word pair (Relation) / Contextual / Word n-grams (n=1,2,3) within a context window size of 2
  Word pair (Relation) / Shortest path / Walk features (word-dependency-word or dependency-word-dependency) on the shortest paths in parsers' outputs; n-grams (n=2,3) of words and dependencies on the paths; n-grams (n=1,2) of token modifier-modifiee pairs on the paths; the length of the paths

Table 5: Global features (Target / Category / Features):
  Combinations of a label and its corresponding entity
  Relation / Entity-sharing / Combinations of two relation labels that share a word (i.e., relations in same columns or same rows in a table); combinations of two relation labels and the shared word; relation shortest path features between non-shared words, augmented by a combination of relation labels and the shared word
  Relation / Cyclic / Combinations of three relation labels that make a cycle
  Relation / Crossing / Combinations of two relation labels that cross each other
  Entity + Relation / Entity-relation / Relation label and the label of its participating entity; relation label and the label and word of its participating entity

Evaluation settings
We used an entity and relation recognition corpus by Roth and Yih (2004). The corpus defines four named entity types (Location, Organization, Person, and Other) and five relation types (Kill, Live In, Located In, OrgBased In, and Work For).
All the entities were words in the original corpus because all the spaces in entities were replaced with slashes. Previous systems (Roth and Yih, 2007; Kate and Mooney, 2010) used these word boundaries as they were, treated the boundaries as given, and focused on the entity classification problem alone. Unlike such systems, we recovered these spaces by replacing these slashes with spaces to evaluate the entity boundary detection performance on this corpus. Due to this replacement and the inclusion of the boundary detection problem, our task is more challenging than the original task, and our results are not comparable with those by the previous systems.
The corpus contains 1,441 sentences that contain at least one relation. Instead of the 5-fold cross validation on the entire corpus used by the previous systems, we split the data set into training (1,153 sentences) and blind test (288 sentences) data sets and developed the system on the training data set. We tuned the hyper-parameters using a 5-fold cross validation on the training data set, and evaluated the performance on the test set.
We prepared a pipeline approach as a baseline. We first trained an entity recognition model using the local and global features, and then trained a relation extraction model using the local features and global features without global "Relation" features in Table 5. We did not employ the global "Relation" features in this baseline since it is common to treat relation extraction as a multi-class classification problem.
We extracted features using the results from two syntactic parsers, Enju (Miyao and Tsujii, 2008) and LRDEP (Sagae and Tsujii, 2007). We employed feature hashing (Weinberger et al., 2009) and limited the feature space to 2^24. The numbers of features greatly varied for categories and targets. They also caused biased predictions that prefer entities to relations in our preliminary experiments. We thus chose to re-scale the features as follows. We normalized local features for each feature category and then for each target. We also normalized global features for each feature category, but we did not normalize them for each target since normalization was impossible during decoding. We instead scaled the global features, and the scaling factor was tuned by using the same 5-fold cross validation above.
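Feature hashing into a fixed space of 2^24 dimensions can be sketched as below. The choice of CRC32 as the hash function and the string feature names are assumptions for this illustration, and the sketch omits the sign hash that some formulations of feature hashing use as well as the normalization and scaling steps described above.

```python
import zlib
from typing import Dict

NUM_BUCKETS = 1 << 24  # feature space limited to 2^24 dimensions


def hash_features(feats: Dict[str, float]) -> Dict[int, float]:
    """Map named features to bucket indices; colliding features add up."""
    hashed: Dict[int, float] = {}
    for name, value in feats.items():
        # zlib.crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(name.encode("utf-8")) % NUM_BUCKETS
        hashed[index] = hashed.get(index, 0.0) + value
    return hashed


hashed = hash_features({"word=Japan": 1.0, "prev_word=in": 1.0})
```

The weight vector can then be a dense array of length 2^24 indexed by bucket, which bounds memory regardless of how many distinct feature strings the extractors produce.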
We used the F1 score on relations with entities as our primary evaluation measure and used it for tuning parameters. In this measure, a relation with two entities is considered correct when the offsets and types of the entities and the type of the relation are all correct. We also evaluated the F1 scores for entities and relations individually on the test data set by checking their corresponding cells. An entity is correct when the offset and type are correct, and a relation is correct when the type is correct and the last words of two entities are correct.
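The primary measure can be illustrated with a short sketch in which a relation is represented together with its two entities, so that it only matches when all offsets and types agree. The tuple layout and the example predictions are invented for this illustration.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]              # (first word, last word, type)
Relation = Tuple[str, Entity, Entity]      # (relation type, source, target)


def precision_recall_f1(gold: Set, predicted: Set) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over two sets of items."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


per: Entity = (0, 1, "PER")
kumamoto: Entity = (4, 5, "LOC")
japan: Entity = (7, 7, "LOC")
gold = {("Live_in", per, kumamoto), ("Located_in", kumamoto, japan)}
# The second prediction has the right relation type but a wrong entity boundary.
predicted = {("Live_in", per, kumamoto), ("Located_in", (5, 5, "LOC"), japan)}
p, r, f1 = precision_recall_f1(gold, predicted)
```

The same function applies to the entity-only and relation-only scores by passing sets of entities or of (type, last word, last word) triples instead.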

Performance on Training Data Set
It is infeasible to investigate all the combinations of the parameters, so we greedily searched for a default parameter setting by using the evaluated results on the training data set. The default parameter setting was the best setting except for the beam size. We show learning curves on the training data set in Figure 6 when we varied each parameter from the default parameter setting. We employed 5-fold cross validation. The default parameter setting used DCD-SSVM as the learning method, entity-first, easy-first as the search order, local and global features, and 8 as the beam size. This section discusses how these parameters affect the performance on the training data set and explains how the parameter setting was selected for the test set.
Figure 6(a) compares the learning methods introduced in §2.4. DCD-SSVM and AdaGrad performed slightly better than perceptron, which has often been employed in history-based structured learning. AROW did not show comparable performance to the others. We ran 100 iterations to find the number of iterations at which the learning curves saturate. Running this many iterations was time-consuming and the performance of DCD-SSVM almost converged after 30 iterations, so we employed 50 iterations for the other evaluations on the training data set. AdaGrad reached its highest performance more quickly than the other learning methods and AROW converged more slowly than the other methods, so we employed 10 iterations for AdaGrad, 90 for AROW, and 50 for the other settings on the test data set.
The performance was improved by widening the beam as in Figure 6(b), but the improvement was gradually diminished as the beam size increased. Since the wider beam requires more training and test time, we chose 8 for the beam size.
Figure 6(c) shows the effects of joint learning as well as features explained in §2.5. We show the performance of the pipeline approach (Pipeline) introduced in §3.1, and the performance with local features alone (Local), local and global features without global "Relation" features in Table 5 (Local+global (−relation)) and all local and global features (Local+global). We note that Pipeline shows the learning curve of relation extraction in the pipeline approach. Features in "Local+global (−relation)" are the same as the features in the pipeline approach, and the result shows that the joint learning approach performed slightly better than the pipeline approach. The incorporation of global "Entity" and "Entity+Relation" features improved the performance as is common with the existing pipeline approaches, and relation-related features further improved the performance.
Static search orders in §2.3.1 also affected the performance as shown in Figure 6(d). The difference between the performances with the best order and worst order was about 0.04 in F1 score, which is statistically significant, and the performance can be worse than the pipeline approach in Figure 6(c). This means improvement by joint learning can be easily cancelled out if we do not carefully consider search order. It is also surprising that the second worst order (Figure 4(b)) is the most intuitive "left-to-right" order, which is closest to the order in Li and Ji (2014) among the six search orders.
Figure 6(e) shows the performance with dynamic search orders. Unfortunately, the easy-first policy did not work well on this entity and relation task, but, with the two enhancements, dynamic orders performed as well as the best static order in Figure 6(d). This shows that entities should be detected earlier than relations on this data set.

Performance on Test Data Set
Table 6 summarizes the performance on the test data set. We employed the default parameter setting explained in §3.2, and compared parameters by changing the parameters shown in the first column. We performed a statistical test using the approximate randomization method (Noreen, 1989) on our primary measure ("Entity+Relation"). The results are almost consistent with the results on the training data set with a few exceptions.
In contrast to the results on the training data set, AdaGrad and AROW performed significantly worse than perceptron and DCD-SSVM, and they performed slightly worse than the pipeline approach. This result shows that DCD-SSVM performs well with inexact search and that the selection of learning methods can significantly affect the entity and relation extraction performance.
The joint learning approach showed a significant improvement over the pipeline approach with relation-related global features, although the joint learning approach alone did not show a significant improvement over the pipeline approach. Unfortunately, no joint learning approach outperformed the pipeline approach in entity recognition. This may be partly because hyper-parameters were tuned to the primary measure. The results on the pipeline approach also indicate that the better performance on entity recognition does not necessarily improve the relation extraction performance.
Search orders also affected the performance, and the worst order (right to left, down to up) and best order (close-first, left to right) were significantly different. The performance of the worst order was worse than that of the pipeline approach, although the difference was not significant. These results show that it is necessary to carefully select the search order for the joint entity and relation extraction task.

Comparison with Other Systems
To compare our model with the other systems (Roth and Yih, 2007; Kate and Mooney, 2010), we evaluated the performance of our model when the entity boundaries were given. Unlike our setting in §3.1, we used the gold entity boundaries encoded in the BILOU scheme and assigned entity labels to the boundaries. We performed 5-fold cross validation on the data set following Roth and Yih (2007), although the split was different from theirs since their splits were not available. We employed the default parameter setting in §3.2 for this comparison. Table 7 shows the evaluation results. Although we cannot directly compare the results, our model performs better than the other models. Compared to Table 6, Table 7 also shows that the inclusion of entity boundary detection degrades the performance by about 0.09 in F-score.

Related Work
Search order in structured learning has been studied in several NLP tasks. Left-to-right and right-to-left orderings have often been investigated in sequential labeling tasks (Kudo and Matsumoto, 2001). The easy-first policy was first introduced by Goldberg and Elhadad (2010) for dependency parsing, and it was successfully employed in several tasks, such as joint POS tagging and dependency parsing (Ma et al., 2012) and co-reference resolution (Stoyanov and Eisner, 2012). Search order, however, has not been a focus in relation extraction tasks.
Named entity recognition (Florian et al., 2003; Nadeau and Sekine, 2007) and relation extraction (Zelenko et al., 2003; Miwa et al., 2009) have often been treated as separate tasks, but there are some previous studies that treat entities and relations jointly in learning. Most studies built joint learning models upon individual models for subtasks, such as Integer Linear Programming (ILP) (Roth and Yih, 2007; Yang and Cardie, 2013) and Card-Pyramid Parsing (Kate and Mooney, 2010). Our approach does not require such individual models, and it can also detect entity boundaries, which these approaches except for Yang and Cardie (2013) did not treat. Other studies (Yu and Lam, 2010; Singh et al., 2013) built global probabilistic graphical models. They need to compute distributions over variables, but our approach does not. Li and Ji (2014) proposed an approach to jointly find entities and relations. They incorporated a semi-Markov chain in representing entities and they defined two actions during search, but our approach does not employ such representation and actions, and thus it is simpler and more flexible for investigating search orders.

Conclusions
In this paper, we proposed a history-based structured learning approach that jointly detects entities and relations. We introduced a novel entity and relation table that jointly represents entities and relations, and showed how the entity and relation extraction task can be mapped to a simple table-filling problem. We also investigated search orders and learning methods that have been fixed in previous research. Experimental results showed that the joint learning approach outperforms the pipeline approach and the appropriate selection of learning methods and search orders is crucial to produce a high performance on this task. As future work, we plan to apply this approach to other relation extraction tasks and explore more suitable search orders for relation extraction tasks. We also plan to investigate the potential of this table representation in other tasks such as semantic parsing and co-reference resolution.

Table 6: Performance of entity and relation extraction on the test data set (precision / recall / F1 score). The † denotes the default parameter setting in §3.2 and ⋆ represents a significant improvement over the underlined "Pipeline" baseline (p<0.05). Labels (a)-(f) correspond to those in Figure 4.
Table 7: Results of entity classification and relation extraction on the data set using the 5-fold cross validation (precision / recall / F1 score).