A Dependency-Based Neural Network for Relation Classification

Previous research on relation classification has verified the effectiveness of using dependency shortest paths or subtrees. In this paper, we further explore how to make full use of the combination of this dependency information. We first propose a new structure, termed augmented dependency path (ADP), which is composed of the shortest dependency path between two entities and the subtrees attached to that path. To exploit the semantic representation behind the ADP structure, we develop dependency-based neural networks (DepNN): a recursive neural network designed to model the subtrees, and a convolutional neural network to capture the most important features on the shortest path. Experiments on the SemEval-2010 dataset show that our proposed method achieves state-of-the-art results.


Introduction
Relation classification aims to classify the semantic relations between two entities in a sentence. It plays a vital role in robust knowledge extraction from unstructured texts and serves as an intermediate step in a variety of natural language processing applications. Most existing approaches follow a machine learning based framework and focus on designing effective features to obtain better classification performance.
The effectiveness of using dependency relations between entities for relation classification has been reported in previous approaches (Bach and Badaskar, 2007). For example, Suchanek et al. (2006) carefully selected a set of features from tokenization and dependency parsing, and extended some of them to generate high-order features in different ways. Culotta and Sorensen (2004) designed a dependency tree kernel and attached more information, including the part-of-speech tag and word chunking tag, to each node in the tree. Interestingly, Bunescu and Mooney (2005) provided an important insight that the shortest path between two entities in a dependency graph concentrates most of the information for identifying the relation between them. Nguyen et al. (2007) developed these ideas by analyzing multiple subtrees with the guidance of pre-extracted keywords. Previous work showed that the most useful dependency information in relation classification includes the shortest dependency path and dependency subtrees. These two kinds of information serve different functions, and their collaboration can boost the performance of relation classification (see Section 2 for detailed examples). However, how to uniformly and efficiently combine these two components is still an open problem. In this paper, we propose a novel structure named Augmented Dependency Path (ADP), which attaches dependency subtrees to words on a shortest dependency path, and we focus on exploring the semantic representation behind the ADP structure.

(* Contribution during internship at Microsoft Research.)
Recently, deep learning techniques have been widely used in modeling complex structures. This provides us an opportunity to model the ADP structure in a neural network framework. Thus, we propose a dependency-based neural network where two sub-neural networks are used to model shortest dependency paths and dependency subtrees respectively. A convolutional neural network (CNN) is applied over the shortest dependency path, because a CNN is suitable for capturing the most useful features in a flat structure. A recursive neural network (RNN) is used for extracting semantic representations from the dependency subtrees, since an RNN is good at modeling hierarchical structures. To connect these two sub-networks, each word on the shortest path is combined with a representation generated from its subtree, strengthening the semantic representation of the shortest path. In this way, the augmented dependency path is represented as a continuous semantic vector which can be further used for relation classification.

Figure 2: The bold part is the shortest path between the two entities in the undirected version of the dependency tree of "A thief who tried to steal the truck broke the ignition with screwdriver.", and some subtrees are attached to it. The two are combined as an augmented dependency path.
The major contributions of the work presented in this paper are as follows.
1. We extend the shortest dependency path into the augmented dependency path to better model the relation between two entities.
2. We propose a dependency-based neural network, DepNN, to model the augmented dependency path. It combines the advantages of the convolutional neural network and the recursive neural network.
3. We conduct extensive experiments on the SemEval-2010 dataset, and the experimental results show that DepNN outperforms baseline methods and yields a state-of-the-art F1 measure on the relation classification task.

Problem Definition and Motivation
The task of relation classification can be defined as follows. Given a sentence S with a pair of entities e1 and e2 annotated, the task is to identify the semantic relation between e1 and e2 in accordance with a set of predefined relation types. For example, the structure "broke prep-with screwdriver" on the dependency path helps to judge the Instrument-Agency relation.
Although the shortest dependency paths prove useful for relation classification, there is other information on the dependency tree that can be exploited to represent the relation more precisely. For example, Figures 2a and 2b show two instances which have similar shortest dependency paths but belong to different relation types. In this situation, if we only use the shortest dependency paths for judging relation types, it is difficult to distinguish these two instances. However, we notice that the subtrees attached to the shortest dependency paths, such as "dobj→commandment" and "dobj→ignition", can provide supplemental information for relation classification. Based on many observations like this, we propose that these subtrees should be employed and combined with the shortest path to form a more precise structure for classifying relations. This combined structure is called the "augmented dependency path (ADP)", as illustrated in Figure 2.
Next, our goal is to capture the semantic representation of the ADP structure between two entities. The key problem here is how to combine the two components of ADP to incorporate more information. We propose that on the augmented dependency path, a word should be represented by both itself and its attached subtree. This is because the word itself carries its general meaning, while the subtree provides semantic information about how this word functions in this specific sentence. With this idea, we adopt the recursive neural network (RNN), which has proved suitable for modeling hierarchical structures, to build semantic embeddings for the words on the shortest path along with their subtrees. After obtaining these more precise word representations, a convolutional neural network (CNN) can be applied, since it is good at modeling flat structures and can generate a fixed-size vector containing the most relevant features.

Dependency-Based Neural Networks
In this section, we introduce how we use neural network techniques and dependency information to explore the semantic connection between two entities. We name our architecture for modeling ADP structures the dependency-based neural network (DepNN). Figure 3 illustrates DepNN with a concrete example. First, we associate each word w and dependency relation r with vector representations x_w, x_r ∈ R^dim. For each word w on the shortest dependency path, we develop an RNN from its leaf words up to the root to generate a subtree embedding c_w, and concatenate c_w with x_w to serve as the final representation of w.
Next, a CNN is designed to model the shortest dependency path based on the representations of its words and relations. Finally, our framework can efficiently represent the semantic connection between two entities with consideration of more comprehensive dependency information.

Modeling Dependency Subtree
The goal of modeling dependency subtrees is to find an appropriate representation for the words on the shortest path. As mentioned above, we assume that each word w can be interpreted by itself and its children on the dependency subtree. Then, for each word w on the subtree, its word embedding x_w ∈ R^dim and subtree representation c_w ∈ R^dim_c are concatenated to form its final representation p_w ∈ R^(dim+dim_c). For a word that does not have a subtree, we set its subtree representation to c_LEAF. The subtree representation of a word is derived by transforming the representations of its children. During the bottom-up construction of the subtree, each word is associated with a dependency relation, such as dobj in Figure 3.
Take the ADP in Figure 3 for example. We first compute the leaves' representations, such as p_the. Once all leaves are finished, we move to interior nodes whose children have already been processed. In the example, continuing from "the" to its parent "Sabbath", we compute

c_Sabbath = f(W_det · p_the + b),

where f is a non-linear activation function such as tanh, W_det is the transformation matrix associated with the dependency relation det, and b is a bias term. We repeat this process until we reach the root words on the shortest path, in this case "broke". The composition equation for any word w with children Q(w) is

c_w = f( Σ_{q∈Q(w)} W_{R(w,q)} · p_q + b ),

where R(w,q) denotes the dependency relation between word w and its child word q. This process continues recursively from the leaves up to the root words on the shortest path. Each of these words has a vector representation after this stage (p_priests, p_broke and p_work in this example).
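The bottom-up composition above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the dimensions, random initialization, and the relation set are assumptions chosen for the example, and c_LEAF is taken to be a zero vector.

```python
import numpy as np

DIM, DIM_C = 4, 3          # assumed word and subtree embedding sizes
rng = np.random.default_rng(0)

# One transformation matrix W_r per dependency relation (toy relation set).
W = {rel: rng.normal(scale=0.1, size=(DIM_C, DIM + DIM_C))
     for rel in ("det", "dobj", "nsubj")}
b = np.zeros(DIM_C)
C_LEAF = np.zeros(DIM_C)   # subtree representation for words with no children

def subtree_embedding(children):
    """children: list of (relation, p_child) pairs; returns c_w."""
    if not children:
        return C_LEAF
    total = sum(W[rel] @ p_child for rel, p_child in children)
    return np.tanh(total + b)   # c_w = f(sum W_{R(w,q)} p_q + b)

def node_representation(x_w, children):
    """p_w = [x_w ; c_w]: word embedding concatenated with subtree embedding."""
    return np.concatenate([x_w, subtree_embedding(children)])

# A leaf ("the") and its parent composed through the det relation.
p_the = node_representation(rng.normal(size=DIM), [])
p_parent = node_representation(rng.normal(size=DIM), [("det", p_the)])
```

Each node's representation has dim + dim_c components, so the same composition applies uniformly at every level of the subtree.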

Modeling Shortest Dependency Path
To classify the relation between two entities, we further explore the semantic representation behind their shortest dependency path, which can be seen as a sequence of words interspersed with dependency relations. Take the shortest dependency path in the last subsection for example. The sequence S is:

S: [priests nsubj broke prep-with work]
    (w1    r1    w2    r2        w3)

As the convolutional neural network (CNN) is good at capturing the salient features from a sequence of objects, we design a CNN to tackle the shortest dependency path.
A CNN contains a convolution operation over windows of object representations, followed by a pooling operation. As described above, a word w on the shortest path is associated with the representation p_w through modeling its subtree. For a dependency relation r on the shortest path, we set its representation as a vector x_r ∈ R^dim. A sliding window of size k is applied on the sequence. For example, when k = 3, the sliding windows of S are {[r_s w1 r1], [r1 w2 r2], [r2 w3 r_e]}, where r_s and r_e denote the beginning and end of a shortest dependency path between two entities.
We concatenate the k neighboring word (or dependency relation) representations within one window into a new vector. Let X_i ∈ R^(dim·k + dim_c·n_w) be the concatenated representation of the i-th window, where n_w is the number of words in one window. A convolution operation involves a filter W_1 ∈ R^(l×(dim·k + dim_c·n_w)), which operates on X_i to produce a new feature vector L_i with l dimensions (the bias term is ignored for simplicity).
Then W_1 is applied to each possible window in the shortest dependency path to produce a feature map [L_0, L_1, L_2, . . .]. Next, we adopt the widely-used max-over-time pooling operation (Collobert et al., 2011), which retains the most important features, to obtain the final representation L from the feature map. That is, L = max(L_0, L_1, L_2, . . .), taken element-wise.
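The windowing, convolution, and pooling steps can be sketched as follows. All sizes are illustrative assumptions, and for simplicity the relation vectors are zero-padded to the same width as the word representations, which differs from concatenating vectors of mixed widths as in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, DIM_C, L_DIM, K = 4, 3, 5, 3   # assumed sizes; K is the window size

def rel_vec():   # dependency-relation representation x_r, zero-padded
    return np.concatenate([rng.normal(size=DIM), np.zeros(DIM_C)])

def word_vec():  # word representation p_w = [x_w ; c_w]
    return rng.normal(size=DIM + DIM_C)

# Toy sequence mirroring S with boundary markers: r_s w1 r1 w2 r2 w3 r_e
seq = [rel_vec(), word_vec(), rel_vec(), word_vec(), rel_vec(),
       word_vec(), rel_vec()]

W1 = rng.normal(scale=0.1, size=(L_DIM, K * (DIM + DIM_C)))  # the filter

def path_representation(seq):
    # Convolution: one l-dimensional feature vector L_i per window of K items.
    windows = [np.concatenate(seq[i:i + K]) for i in range(len(seq) - K + 1)]
    feats = np.stack([np.tanh(W1 @ x) for x in windows])
    # Max-over-time pooling keeps the strongest response per dimension.
    return feats.max(axis=0)

L = path_representation(seq)
```

The result L has a fixed size l regardless of the path length, which is what allows paths of different lengths to feed the same classifier.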
By this means, we are able to obtain the semantic representation of the ADP with the advantages of both the RNN and the CNN.

Learning
Like other relation classification systems, we also incorporate some lexical-level features which have proved useful for this task: named entity tags and WordNet hypernyms of e1 and e2. We concatenate them with the ADP representation L to produce a combined vector M. We then pass M to a fully connected softmax layer whose output is the probability distribution y over relation labels.
We define the ground-truth label vector t for each instance as a binary vector. If the instance belongs to the i-th type, only t_i is 1 and the other dimensions are set to 0. To learn the parameters, we optimize the cross-entropy error between y and t using stochastic gradient descent (Bottou, 2004).
For each training instance, we define the objective function as

J(θ) = − Σ_i t_i log y_i,

where θ represents the parameters. Gradients are computed using backpropagation (Rumelhart et al., 1988).
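The softmax layer and cross-entropy objective can be sketched as below. The scores and target are dummy values for illustration; the gradient line uses the standard softmax-plus-cross-entropy identity, not anything specific to this paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y, t):
    # J(theta) = -sum_i t_i * log y_i
    return float(-np.sum(t * np.log(y + 1e-12)))

scores = np.array([2.0, 0.5, -1.0])   # M passed through the output layer (toy)
t = np.array([1.0, 0.0, 0.0])         # one-hot target: instance is of type 0
y = softmax(scores)
loss = cross_entropy(y, t)
grad_scores = y - t                    # gradient of J w.r.t. the scores
```

The gradient y − t is what backpropagation pushes down through the CNN and RNN parameters during stochastic gradient descent.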

Experiments
Our experiments are performed on the SemEval-2010 dataset (Hendrickx et al., 2010). The training part of the dataset includes 8000 instances, and the test part includes 2717 instances. Table 2 shows the statistics of the annotated relation types of this dataset. We can see that the distribution of relation types in the test set is similar to that in the training set. The official evaluation metric is the macro-averaged F1-score (excluding Other). We use dependency trees generated by the Stanford Parser (Klein and Manning, 2003) with the "collapsed" option, which regards a preposition as a kind of dependency relation. As de Marneffe and Manning (2008) pointed out, this option is more useful for event relation extraction.

Contributions of different components
We first show the contributions from different components of DepNN. For evaluation, we first design a relation extraction system (named PATH) which only models the shortest dependency path with a CNN. Based on PATH, we consider incorporating the two kinds of lexical features, named entity tags (NER) and WordNet hypernyms (WN), yielding two systems named PATH+WN and PATH+NER respectively. We also add the attached subtrees (SUB), modeled by an RNN, to form the complete augmented dependency path. The NER tags, WordNet hypernyms and subtrees all contribute to the performance by providing supplemental information for words on the shortest path. The experiments show that the subtree information does a better job than the other two kinds of information and can help build more precise representations for words in a sentence. To get a deeper understanding of what semantic information can be captured behind the ADP structure, we will look into our model and analyze it with specific examples. Since the Gigaword embeddings, with their larger corpus and dimensions, can significantly improve the classification performance, the following experiments and analysis are all based on the Gigaword embeddings.

Intuitive Analysis of Shortest Path
We take the output vector of the CNN layer as the distributed representation of a dependency path. In this way, we can calculate the cosine similarity between any two paths and illustrate some paths with high similarity. Table 5 shows three training instances with different relation types and their three most similar paths in the test set.
From Table 5, we can see that our approach can capture the core meaning of the shortest dependency paths. For example, for the Instrument-Agency relation, we infer that the dependency relations "nsubj inv", "dobj" and "prep with" in the dependency path play a main role in the representation, and our model can capture these similar paths. For the Product-Producer relation, our model focuses on representing the structure "nsubj inv verb1 xcomp verb2 dobj" and exploits words like "pencil" and "create" in the path representation. This is clearer for the Message-Topic relation, where the similarity of words like "point", "explore", "address" and "relate" is well learned.

Influence of Attached Subtree
In this subsection, we discuss the role of the attached subtrees (SUB) in relation classification. By comparing the results of DepNN before and after adding the subtrees, we find that the influence of this structure varies across relation types. We can see that the subtree information generally has a positive impact on all the relation types. It is especially salient for the Instrument-Agency and Product-Producer relations. When only the shortest dependency paths are used, these two relation types are easily confused, as they both rely on dependency paths such as ". . . verb prep-by/prep-with/using . . .". But after considering the subtree information, we can better distinguish these two relation types. Figure 4 lists two instances that can be classified correctly only after adding the subtrees. Figure 4a belongs to the Product-Producer relation, which can be reflected by subtree structures like "conj-and→valves" and "amod→manufacturing". Figure 4b belongs to the Instrument-Agency relation, and the subtree structure attached to the word "scaled" provides more supplemental information to the shortest path, as explained above.

Figure 4: ADP of instances that can be classified correctly after adding the subtrees.

Comparison with Baselines
In this subsection, we compare DepNN with several baseline approaches to relation classification.
SVM (Rink and Harabagiu, 2010): This is the top-performing system in SemEval-2010. It depends on human-compiled feature templates and utilizes many external corpora to extract features for an SVM classifier.
MV-RNN (Socher et al., 2012): This model associates each word with a matrix. Based on the constituent parse tree structure, this model finds the path between two entities and learns the distributed representation of their highest parent node through composition in a recursive neural network.
DT-RNN (Socher et al., 2014): This model uses an RNN for modeling dependency trees. It assigns a composition matrix to each dependency relation. Different from our model DepNN, the embedding of each node is a linear combination of its children. The network is trained using the method provided by Iyyer et al. (2014). We average the learned vectors of all nodes, stack the average with the root node's embedding and additional features, and feed them into a softmax classifier.
CNN: Zeng et al. (2014) build a convolutional model to learn a sentence representation over the words in a sentence. To represent each word, they use a special position vector indicating the relative distances of the current input word to the two marked entities, concatenating the position vector with the corresponding word embedding. The sentence representation is then stacked with some lexical features and fed into a softmax classifier.
FCM (Yu et al., 2014): FCM decomposes a sentence into substructures and learns a substructure embedding from each of them. The substructure embeddings in a sentence are then combined via a sum-pooling operation and fed into a softmax classifier.
Table 7 compares DepNN with the baseline approaches. Since many of our baselines are neural network models, it is convenient for them to use some features extracted with external resources or tools to enhance performance. We call these features "additional features" (AF) and list them in the second column. The F1-measures on the SemEval-2010 dataset with and without these additional features are shown in the last two columns.
From Table 7, we can see that DepNN achieves the best result (83.6) with the NER features. SVM achieves a comparable result, though the quality of its feature engineering highly relies on human experience and external NLP resources. MV-RNN models the constituent parse trees with a recursive procedure, and its F1-measures with and without AF are about 1.7 and 4.6 percentage points lower than those of DepNN. This to some extent indicates that our proposed ADP structure is more suitable for the relation classification task. Meanwhile, MV-RNN is very slow to train, since each word is associated with a matrix. Both CNN and FCM use features from the whole sentence and achieve similar performance. DT-RNN is the worst of all baselines, though it also considers information from shortest dependency paths and attached subtrees. As we analyzed, shortest dependency paths and subtrees play different roles in relation classification, but DT-RNN does not distinguish the modeling processes of shortest paths and subtrees, and deems the representation of each node a linear combination of its children.

Related Work
Relation classification is a traditional subproblem of Information Extraction (IE). It aims to detect and classify relations between predefined types of objects in a corpus. These objects can be named entities or marked nominals. Much research has been performed in this field, most of which considers it a supervised multi-class classification task. Depending on the input to the classifier, these approaches can be further divided into feature-based, tree kernel-based and composite kernel-based.
Feature-based methods extract various kinds of linguistic features, including both syntactic features and semantic cues. These features are combined to form a feature vector employed in a maximum entropy (Kambhatla, 2004) or an SVM (Zhou et al., 2005; GuoDong et al., 2005) classifier. Feature-based methods usually need handcrafted features and lack the ability to represent structural information (e.g., parse trees, word order).
Kernel methods use a more natural way of exploring structural features by computing the inner product of two objects in a high-dimensional latent feature space. Zelenko et al. (2003) designed a tree kernel to compute the structural commonality between shallow parse trees by a weighted sum of the number of common subtrees. Culotta and Sorensen (2004) transferred this kernel to a dependency tree and attached more information, including the POS tag and word chunk tag, to each node. Zhou et al. (2007) proposed a context-sensitive convolution tree kernel that used context information beyond the local tree. From another perspective, Bunescu and Mooney (2005) provided an important insight that the shortest path between the two entities concentrates most of the information for identifying the relation between them. Nguyen et al. (2007) used dependency subtrees in a different manner by modeling the subtrees between entities and keywords of certain relations. Zhang et al. (2006) further proposed composite kernels that combine a tree kernel and a feature-based kernel to improve performance.
Recently, Deep Neural Networks (DNN) have been developed to solve the relation classification problem. By associating each word with a distributed representation, DNN models can overcome the sparsity problem in traditional methods and automatically learn appropriate features. Socher et al. (2012) proposed a recursive neural network model that constructs compositional semantics for the minimal constituent of a constituent parse tree including both marked entities. Zeng et al. (2014) used a convolutional neural network over the whole sentence combined with some lexical features. They also pointed out that the position of each word in the sentence is very important for relation classification, and concatenated a special position feature vector with the corresponding word embedding. Yu et al. (2014) proposed the Factor-based Compositional Embedding Model (FCM), which extracts features from the substructures of a sentence and combines them through a sum-pooling layer.

Conclusion
In this paper, we propose to classify relations between entities by modeling the augmented dependency path in a neural network framework. For a given instance, we generate its ADP by combining the shortest path between the two entities and the attached subtrees. We present a novel approach, DepNN, to take advantage of both convolutional and recursive neural networks in modeling this structure.

Figure 1: Sentences and their dependency trees.
We tune hyperparameters including the subtree embedding dimension dim_c and the hidden layer size l. The final settings are shown in Table 3.

Table 4: Performance of DepNN with different components.

From Table 4, we can verify the effectiveness of modeling the shortest dependency path with a CNN, since PATH alone achieves a relatively high result. The experimental results also indicate that both the NER and WordNet features can improve the performance of relation extraction. WordNet seems less useful than NER, which conforms to the results of Yu et al. (2014), since a large number of WordNet hypernyms may cause overfitting.

Table 5: Shortest dependency paths and their closest neighbours in the learned feature space. For example, for the Instrument-Agency relation:
- master nsubj-inv teaches dobj lesson prep-with stick
- analyzer prep-of-inv core nsubj-inv identifies dobj paths vmod using dobj method
- architect nn-inv measures dep Sage prep-with strip
- shop nsubj-inv fixed prep-with method

Table 6: Influence of the subtrees on each relation type. Table 6 shows the F1 measures of each relation type before and after adding the subtrees.

Table 7: Results of evaluation on the SemEval-2010 dataset.