Learning Relatedness between Types with Prototypes for Relation Extraction

Relation schemas are often pre-defined for each relation dataset. Relation types from different datasets can be related and have overlapping semantics. We hypothesize that we can combine these datasets according to the semantic relatedness between their relation types to overcome the lack of training data. It is often easy to discover the connection between relation types based on relation names or annotation guides, but hard to measure the exact similarity and take advantage of the connection between relation types from different datasets. We propose to use prototypical examples to represent each relation type and to use these examples to augment related types from a different dataset. With this type augmentation, we obtain further improvement on ACE05 over a strong baseline that uses multi-task learning between datasets to obtain better feature representations for relations. We make our implementation publicly available: https://github.com/fufrank5/relatedness


Introduction
Relation extraction identifies specific semantic relationships between two entities within a single sentence. For example, there is a Physical.Located relationship between George Bush and France in the sentence: George Bush traveled to France on Thursday for a summit. Relation extraction is a crucial task for many applications such as knowledge base population.
Relation schemas are mostly pre-defined in existing datasets. The definition of each relation type depends on the annotation guide; there is no clear intrinsic ontology for relation types. In practice, relation types are created based on the interests of each annotation effort.
This leaves datasets with similar, related, or overlapping schemas. For example, the annotation guides for Automatic Content Extraction (ACE) 03-05 changed from year to year. The later Entities, Relations and Events (ERE) dataset is similar in schema but differs in details. Because of the difficulty of annotating relations, these datasets are all individually small and hard to utilize together.
Learning relatedness between relation schemas across different datasets is not an easy task, since no instance-level labels are available for the relatedness. However, we can observe connections between relation types from different datasets based on relation names or annotation guides. We propose to simplify relatedness to a binary notion (related or not) and to use a manual review of relation names to decide the relatedness labels. This provides the prior knowledge that a relation type in one dataset may be more closely related to some types in another dataset than to others. We then design a model to recognize this similarity. We propose to use prototypical examples to represent each relation type, and we rank these representations higher for related types and lower for unrelated types using a pairwise loss function. Our base model is a multi-task learning model that focuses on learning a strong encoder from multiple datasets regardless of their relation schemas. We take a step further and explore utilizing the relatedness between the relation types. Experiments on ACE05 and ERE show that it further boosts performance, especially in low-resource settings.

Related Work
Relation type dependency: There have been a few ways to model the relationships between types in a multi-label relation dataset, where the similarity or dependency can be learned from annotated examples. Surdeanu et al. (2012) used a two-layer hierarchical model: the object-level classifier captures label dependencies, while the mention-level classifier focuses on multi-label classification. Riedel et al. (2013) used a neighborhood model to explicitly model the dependency between labels in a matrix factorization framework. Both models are designed to work on multi-label examples, which require annotation to capture the dependency between labels. Among recent neural methods for relation extraction, most work (Zeng et al., 2015; Lin et al., 2016; Liu et al., 2017) ignores the multi-label setting and does not explicitly model label dependencies. Ye et al. (2017), on the other hand, rank the similarity between the feature representation of an instance and the label embeddings. In addition to ranking positive classes higher than negative ones, they rank positive classes against each other to learn the connections between the positive classes. These methods all require annotated examples to learn the connections. For relation types across different datasets, such annotation does not exist; we attempt to learn the similarity nevertheless, using prototypes of each type.
Multi-task learning: Training on multiple relation datasets at the same time can improve the robustness of the model and reduce annotation cost for relation extraction. Fu et al. (2018) proposed to use a shared encoder to learn more general feature representations. We use a similar multi-task learning base model and incorporate the similarity between the relation schemas to further improve performance.

Relation Model with Multi-task Learning
The majority of neural relation models (Zeng et al., 2014; Nguyen and Grishman, 2015b; Zeng et al., 2015; Lin et al., 2016) encode a sentence into a vector representation with a deep architecture followed by a softmax classifier, while others (dos Santos et al., 2015; Ye et al., 2017) use a function to compute a score between a label embedding and the sentence representation. Inspired by Fu et al. (2018), where a shared encoder helps in the multi-task learning setting, we choose the latter so that all relation types (including those from different datasets) share all model parameters except the label embeddings. Suppose we obtain the sentence representation $\phi(x)$ with a neural architecture. We define the label embedding as $W_l \in \mathbb{R}^D$, a $D$-dimensional vector for each type. We compute the element-wise $L_1$ distance between them and learn a scoring function to estimate the score $S_\theta(x)_l$ for every type $l$:

$$S_\theta(x)_l = W_o^\top \left|\phi(x) - W_l\right| + b_o \qquad (1)$$

where $W_o \in \mathbb{R}^D$ and $b_o \in \mathbb{R}$ are shared for all types.
We do not use the dot product (dos Santos et al., 2015; Ye et al., 2017) as the scoring function because the $L_1$ distance works slightly better in our multi-task learning experiments. The probability of every class is computed as the softmax over the scores. Similar to Fu et al. (2018), we jointly train two relation tasks at the same time with cross-entropy losses:
$$L_{multi} = L_{r1} + \lambda L_{r2} \qquad (2)$$

where $L_{r1}$ and $L_{r2}$ are the cross-entropy losses for the two relation tasks and $\lambda$ is a hyperparameter that controls the relative learning speed between the two tasks. This gives a strong baseline that utilizes the two datasets together.
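As a minimal NumPy sketch of this model's scoring and joint objective, under our reconstruction of Equations 1-2 above; all names (`phi_x`, `label_emb`, `w_o`, `b_o`, `lam`) are ours, and the sign of the learned weights over the $L_1$ distance is left to training:

```python
import numpy as np

def scores(phi_x, label_emb, w_o, b_o):
    """Equation 1: score each relation type l via a shared linear layer
    (w_o, b_o) over the element-wise L1 distance |phi(x) - W_l|.

    phi_x:     (D,)   sentence representation from the encoder
    label_emb: (L, D) one label embedding W_l per relation type
    w_o:       (D,)   weights shared across all types
    b_o:       float  bias shared across all types
    """
    l1 = np.abs(phi_x[None, :] - label_emb)  # (L, D)
    return l1 @ w_o + b_o                    # (L,) one score per type

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def cross_entropy(s, gold):
    # negative log-likelihood of the gold type under the softmax of scores
    return -np.log(softmax(s)[gold] + 1e-12)

def multitask_loss(loss_r1, loss_r2, lam):
    # Equation 2: weighted sum of the two per-dataset cross-entropy losses
    return loss_r1 + lam * loss_r2
```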

Prototypes of Relation Types for Learning Similarity
For each relation type $c$, we randomly select $k$ examples ($S_k$) from the training set as supporting examples. We use the mean of the representations of these examples as the prototype for the relation type:

$$\bar{x}_c = \frac{1}{k} \sum_{x \in S_k} \phi(x) \qquad (3)$$

These prototypes are inspired by Prototypical Networks (Snell et al., 2017). However, in our training procedure the supporting examples are randomly re-selected for every mini-batch, so the prototypes are dynamic during training. We define $S_\theta(\bar{x}_c)_l$ as the similarity score of type $c$'s prototype to type $l$. We hypothesize that if two relation types are similar in semantics, they should obtain a high similarity score. Within a dataset, if the relation types in the schema are mutually exclusive, we would expect each type to score high against itself and low against the other types. Across datasets, the prototypes should obtain high scores for related types and low scores for unrelated types. We use a pairwise ranking loss (dos Santos et al., 2015) to learn this relatedness across the datasets.
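A sketch of Equation 3 with dynamic support sampling; `encoder` stands in for the network computing $\phi(x)$, and all names are our own:

```python
import numpy as np

def sample_support(examples_of_type, k, rng):
    # draw k support examples for one relation type, re-sampled every mini-batch
    idx = rng.choice(len(examples_of_type), size=min(k, len(examples_of_type)),
                     replace=False)
    return [examples_of_type[i] for i in idx]

def prototype(encoder, support):
    """Equation 3: the prototype of a relation type is the mean encoder
    representation of its k support examples."""
    return np.stack([encoder(x) for x in support]).mean(axis=0)  # (D,)
```

Because the support set is re-drawn at every step, the prototypes move with the encoder as training progresses.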
For $l \in L$ and $c \in C$, $S_\theta(\bar{x}_c)_l$ gives the similarity between type $l$ in relation schema $L$ and type $c$ in relation schema $C$. Let $c^+ \in C$ be a class related to $l$ and $c^- \in C$ be a class unrelated to $l$; their similarity scores are $S_\theta(\bar{x}_{c^+})_l$ and $S_\theta(\bar{x}_{c^-})_l$ respectively. We define the pairwise ranking loss as:

$$L_{rank} = \log\left(1 + e^{\gamma\,(m^+ - S_\theta(\bar{x}_{c^+})_l)}\right) + \log\left(1 + e^{\gamma\,(m^- + S_\theta(\bar{x}_{c^-})_l)}\right) \qquad (4)$$

where $m^+$ and $m^-$ are the margins and $\gamma$ is a scaling factor. This loss pushes $S_\theta(\bar{x}_{c^+})_l$ higher for the related pair $(c^+, l)$ and $S_\theta(\bar{x}_{c^-})_l$ lower for the unrelated pair $(c^-, l)$. We manually create a relatedness matrix stating whether each pair of types between $C$ and $L$ is related, based on the definitions of the relation types. At each training step, we pick the highest-scoring $c^-$ among the unrelated types and the lowest-scoring $c^+$ among the related types for type $l$:

$$c^+ = \arg\min_{c \in C^+} S_\theta(\bar{x}_c)_l \qquad (5)$$
$$c^- = \arg\max_{c \in C^-} S_\theta(\bar{x}_c)_l \qquad (6)$$
where $C^-$ are the types unrelated to $l$ and $C^+$ the types related to $l$. In experiments, we use looser margins ($m^+ = 0.5$, $m^- = 0.5$, $\gamma = 1.0$) than dos Santos et al. (2015), since we are learning the relatedness between types rather than classifying individual instances. The ranking loss is jointly trained as an auxiliary task with the main relation tasks:

$$L = L_{r1} + \lambda L_{r2} + \beta L_{rank} \qquad (7)$$

where $\beta$ controls the weight of learning the relatedness. If $\beta$ is too large, it can disrupt the learning of the main relation tasks; with an appropriate weight, it helps augment the label embeddings of the relation types by taking the similarity between them into account.
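A sketch of Equations 4-7 under our reconstruction; `sim_scores` holds $S_\theta(\bar{x}_c)_l$ for every $c \in C$ against a fixed $l$, and the hard-pair selection mirrors Equations 5-6:

```python
import numpy as np

def ranking_loss(s_pos, s_neg, m_pos=0.5, m_neg=0.5, gamma=1.0):
    """Equation 4 (pairwise ranking loss, dos Santos et al., 2015): push the
    related-type score above the margin m_pos and the unrelated-type score
    below -m_neg."""
    return (np.log1p(np.exp(gamma * (m_pos - s_pos)))
            + np.log1p(np.exp(gamma * (m_neg + s_neg))))

def hardest_pair(sim_scores, related_mask):
    """Equations 5-6: lowest-scoring related type (hardest positive) and
    highest-scoring unrelated type (hardest negative) for type l.

    sim_scores:   (C,) similarity of each prototype to type l
    related_mask: (C,) boolean, True where type c is related to l
    """
    c_pos = np.where(related_mask, sim_scores, np.inf).argmin()
    c_neg = np.where(~related_mask, sim_scores, -np.inf).argmax()
    return c_pos, c_neg

def total_loss(loss_r1, loss_r2, loss_rank, lam, beta=0.001):
    # Equation 7: relatedness as a beta-weighted auxiliary objective
    return loss_r1 + lam * loss_r2 + beta * loss_rank
```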

Datasets
We select two datasets with similar relation schemas for this experiment. There is overlap between the relation types of ACE05 and ERE, but the annotation guides differ in details (Aguilar et al., 2014). Thus, we cannot simply merge the training data under the same types; doing so would actually lead to worse results, as it introduces more noise than benefit. Multi-task learning is a better choice in this setting, and we take a step further by learning the similarity between the types at the same time. There are 6 main semantic types in ACE and 5 in ERE. Manual review of the relatedness (related or not) is trivial in this case because the relation names are almost identical for related relation types. In practice, it may take a few minutes to review more complicated relation schemas, but it would still cost significantly less than instance-level annotation in text. For preprocessing, we follow previous work (Gormley et al., 2015; Nguyen and Grishman, 2015a; Fu et al., 2017, 2018) on ACE05. It contains 6 domains: broadcast conversation (bc), broadcast news (bn), telephone conversation (cts), newswire (nw), usenet (un) and weblogs (wl). We use newswire (bn & nw) as the training set, half of bc as the development set, and the other half of bc, cts and wl as the test sets. We follow their split of documents and their split of relation types for asymmetric relations (directionality is taken into account except for the Physical and Personal-Social types). We perform the same preprocessing for the ERE dataset, which contains documents from newswire and discussion forums, following the document split from Fu et al. (2018).
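To make the manual review concrete, the relatedness matrix can be as simple as a lookup table. The ACE05 and ERE main-type names below and their pairing are our illustration based on the near-identical names, not a table reproduced from the paper:

```python
# Hypothetical relatedness matrix between ACE05 main types and ERE main
# types (related or not). The pairings are an assumption from the names.
ACE_TYPES = ["PHYS", "PART-WHOLE", "PER-SOC", "ORG-AFF", "ART", "GEN-AFF"]
ERE_TYPES = ["physical", "partwhole", "personalsocial",
             "orgaffiliation", "generalaffiliation"]

RELATED = {
    "PHYS":       {"physical"},
    "PART-WHOLE": {"partwhole"},
    "PER-SOC":    {"personalsocial"},
    "ORG-AFF":    {"orgaffiliation"},
    "ART":        set(),                 # no like-named ERE counterpart
    "GEN-AFF":    {"generalaffiliation"},
}

def related_mask(ace_type):
    # one boolean row of the matrix, usable by the ranking-loss selection above
    return [ere in RELATED[ace_type] for ere in ERE_TYPES]
```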

Multi-task Learning Baseline
Following previous work (Nguyen and Grishman, 2015a; Fu et al., 2018), we use a similar encoder to obtain the feature representation $\phi(x)$ as our baseline. The input layer is the concatenation of a word embedding, an entity embedding and position embeddings. We use pretrained word2vec (Mikolov et al., 2013) as the word embedding with embedding size $d_w$. The entity embedding and position embeddings are randomly initialized vectors indexed by the entity type of the token and its relative distances to the two arguments of the relation; their sizes are $d_e$ and $d_p$ respectively. Following previous work, we set $d_w, d_e, d_p = 300, 50, 50$. The input layer is followed by a bidirectional RNN with attention and a fully connected layer to match the label embedding size. We use 150 for the RNN state size and 200 for the label embedding size. The output of this encoder is $\phi(x)$, from which we perform classification using the scores of Equation 1.
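A shape-level sketch of this encoder follows. The embedding sizes match the paper, but the plain tanh RNN cell, dot-product attention, and all weight names are our own minimal stand-ins for the actual BiRNN-with-attention architecture:

```python
import numpy as np

# Sizes from the paper: d_w, d_e, d_p = 300, 50, 50; RNN state 150;
# label-embedding size 200. Everything else here is a simplification.
D_W, D_E, D_P = 300, 50, 50
D_IN = D_W + D_E + 2 * D_P      # per token: word + entity + two position embeddings
D_H, D_LBL = 150, 200

rng = np.random.default_rng(0)
W_f = rng.normal(0, 0.1, (D_H, D_IN + D_H))   # forward RNN cell
W_b = rng.normal(0, 0.1, (D_H, D_IN + D_H))   # backward RNN cell
v_att = rng.normal(0, 0.1, 2 * D_H)           # attention scoring vector
W_fc = rng.normal(0, 0.1, (D_LBL, 2 * D_H))   # projection to label-embedding size

def run_rnn(inputs, W):
    h, hs = np.zeros(D_H), []
    for x in inputs:                          # inputs: (T, D_IN)
        h = np.tanh(W @ np.concatenate([x, h]))
        hs.append(h)
    return np.stack(hs)                       # (T, D_H)

def encode(token_inputs):
    """phi(x): BiRNN over the concatenated input embeddings, attention
    pooling over tokens, then a fully connected projection."""
    fwd = run_rnn(token_inputs, W_f)
    bwd = run_rnn(token_inputs[::-1], W_b)[::-1]
    states = np.concatenate([fwd, bwd], axis=1)     # (T, 2*D_H)
    att = np.exp(states @ v_att)
    att /= att.sum()                                # attention weights over tokens
    return W_fc @ (att @ states)                    # (D_LBL,)
```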
In each mini-batch, we randomly select examples from both datasets proportionally to the dataset sizes, so that the model finishes reading both datasets at the same time after every epoch. Because of the difference in the number of examples per dataset within a batch, we set $\lambda = \lambda_d \frac{|D_1|}{|D_1| + |D_2|}$, where $|D_1|$ and $|D_2|$ are the numbers of examples from each dataset in a single batch. In the special case where the two datasets are split from one original dataset, we can set $\lambda_d = 1.0$ so that the two datasets are learned at the same speed. In our case, we use $\lambda = 0.8$ so that the two relation tasks learn at roughly the same speed. As a result, our multi-task model with label embeddings is comparable to Fu et al. (2018) (Table 1) and serves as a strong baseline, since it is already better than training a single relation task. We report all scores as the average of 10 runs for stability.
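A sketch of the proportional batch mixing and the per-batch weight $\lambda$; dataset and parameter names are ours:

```python
import numpy as np

def mixed_batch(d1, d2, batch_size, lambda_d, rng):
    """Fill a mini-batch from both datasets in proportion to their sizes, so
    one epoch over the mixture finishes both datasets at the same time."""
    n1 = round(batch_size * len(d1) / (len(d1) + len(d2)))
    b1 = [d1[i] for i in rng.choice(len(d1), size=n1, replace=False)]
    b2 = [d2[i] for i in rng.choice(len(d2), size=batch_size - n1, replace=False)]
    # lambda = lambda_d * |D1| / (|D1| + |D2|), counted within this batch
    lam = lambda_d * len(b1) / (len(b1) + len(b2))
    return b1, b2, lam
```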

Learning the Relatedness between Two Relation Schemas
By learning the relatedness at the same time (Equations 4 and 7, with $\beta = 0.001$), we obtain better results on the full training set (Table 1). The improvement is more obvious with a smaller training set (Table 2 at 50%). The regularization in previous work does not take relatedness on a type-specific basis into account and fails to obtain a clear improvement over the multi-task baseline; our method is more effective at incorporating additional knowledge from multiple sources. We also set up a low-resource setting where we keep only $N$ examples for each relation type (Figure 1 at $N$ = 10, 20, 30, 40, 50); the negatives are randomly selected according to the pos/neg ratio. We observe larger improvements with less training data. This setting also reflects the skewed distribution in the dataset, where some types have far more examples than others. The $k$ supporting prototype examples are drawn randomly at every step; we use $k = 50$ for the main experiments and $k = N$ for the low-resource settings. Overall, the improvement is substantial, especially in the low-resource settings. It is also worth noting that single-task models in these low-resource settings obtain virtually zero scores without multi-task learning, as there is not enough data to train the encoder. Multi-task learning between the two relation tasks is better than training on a single task and more effective for smaller training sets, and learning the relatedness between the types further improves the model.
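A sketch of how the low-resource subsets might be built, keeping $N$ positives per type and negatives at the original pos/neg ratio; the `NONE` label and field names are our assumptions:

```python
import numpy as np

def low_resource_subset(examples, n_per_type, neg_per_pos, rng, none_label="NONE"):
    """Keep at most N examples per positive relation type, then add randomly
    drawn negatives to preserve the original pos/neg ratio (a sketch of the
    setup behind the N = 10..50 experiments)."""
    by_type = {}
    for ex in examples:
        by_type.setdefault(ex["label"], []).append(ex)
    subset = []
    for label, exs in by_type.items():
        if label == none_label:
            continue
        idx = rng.choice(len(exs), size=min(n_per_type, len(exs)), replace=False)
        subset.extend(exs[i] for i in idx)
    negs = by_type.get(none_label, [])
    n_neg = min(int(len(subset) * neg_per_pos), len(negs))
    idx = rng.choice(len(negs), size=n_neg, replace=False)
    subset.extend(negs[i] for i in idx)
    return subset
```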

Conclusion
We use prototypes of relation types to learn the relatedness between them in a multi-task learning framework. With prior knowledge of the relatedness between relation types, the model obtains further improvement beyond sharing the sentence encoder. The prior knowledge is obtained through a manual review of relation names, which costs significantly less than instance-level annotation in text. In this paper, we simplify relatedness to a binary notion; it would be interesting to further explore the relationships between relation types as a graded rather than binary measure.