Classifying Relations by Ranking with Convolutional Neural Networks

Relation classification is an important semantic processing task whose state-of-the-art systems still rely on costly handcrafted features. In this work we tackle the relation classification task using a convolutional neural network that performs classification by ranking (CR-CNN). We propose a new pairwise ranking loss function that makes it easy to reduce the impact of artificial classes. We perform experiments using the SemEval-2010 Task 8 dataset, which encodes the task of classifying the relationship between two nominals marked in a sentence. Using CR-CNN, we outperform the state of the art for this dataset and achieve an F1 of 84.1 without using any costly handcrafted features. Our experimental results also provide evidence that: (1) our approach is more effective than a CNN followed by a softmax classifier; (2) omitting the representation of the artificial class Other improves both precision and recall; and (3) using only word embeddings as input features is enough to achieve state-of-the-art results if we consider only the text between the two target nominals.


Introduction
Relation classification is an important Natural Language Processing (NLP) task which is normally used as an intermediate step in many complex NLP applications such as question answering and automatic knowledge base construction. Over the last decade there has been increasing interest in applying machine learning approaches to this task (Zhang, 2004; Qian et al., 2009; Rink and Harabagiu, 2010). One reason for that is the availability of benchmark datasets such as the SemEval-2010 Task 8 dataset (Hendrickx et al., 2010), which encodes the task of classifying the relationship between two nominals marked in a sentence. The following sentence contains an example of the Component-Whole relation between the nominals "introduction" and "book": The [introduction]e1 in the [book]e2 is a summary of what is in the text. Some recent work on relation classification has focused on the use of deep neural networks with the aim of reducing the number of handcrafted features (Socher et al., 2012; Zeng et al., 2014; Yu et al., 2014). However, in order to achieve state-of-the-art results, these approaches still use some features derived from lexical resources such as WordNet or from NLP tools such as dependency parsers and named entity recognizers (NER).
In this work, we propose a new convolutional neural network (CNN), which we name Classification by Ranking CNN (CR-CNN), to tackle the relation classification task. The proposed network learns a distributed vector representation for each relation class. Given an input text segment, the network uses a convolutional layer to produce a distributed vector representation of the text and compares it to the class representations in order to produce a score for each class. We propose a new pairwise ranking loss function that makes it easy to reduce the impact of artificial classes. We perform an extensive number of experiments using the SemEval-2010 Task 8 dataset. Using CR-CNN, and without the need for any costly handcrafted feature, we outperform the state of the art for this dataset and achieve an F1 of 84.1. Our experimental results also provide evidence that: (1) CR-CNN is more effective than a CNN followed by a softmax classifier; (2) omitting the representation of the artificial class Other improves both precision and recall; and (3) using only word embeddings as input features is enough to achieve state-of-the-art results if we consider only the text between the two target nominals.
The remainder of the paper is structured as follows. In Section 2, we discuss previous work on deep neural networks for relation classification and for other NLP tasks. Section 3 details the proposed neural network. In Section 4, we present details about the setup of the experimental evaluation, and then describe the results in Section 5. Section 6 presents our conclusions.

Related Work
Over the years, various approaches have been proposed for relation classification. Most of them treat it as a multi-class classification problem and apply a variety of machine learning techniques to the task in order to achieve high accuracy.
Recently, deep learning has become an attractive area for multiple applications, including computer vision, speech recognition and natural language processing. Researchers have started to apply deep learning to relation classification, which is one of the important topics in NLP. For example, Zeng et al. (2014) propose a CNN-based approach for relation classification. Sentence-level features are learned through a CNN, which has word embeddings and position features as its input. In parallel, lexical features are extracted according to the given nouns. Then sentence-level and lexical features are concatenated into a single vector and fed into a softmax classifier for prediction. This approach achieves state-of-the-art performance on the SemEval-2010 Task 8 dataset. Yu et al. (2014) propose a Factor-based Compositional Embedding Model (FCM) by deriving sentence-level and substructure embeddings from word embeddings, utilizing dependency trees and named entities. It achieves slightly higher accuracy on the same dataset than Zeng et al. (2014), but only when syntactic information is used.
Besides the work on relation classification, there is also work that applies CNNs to other NLP tasks. Kalchbrenner et al. (2014) propose a Dynamic Convolutional Neural Network (DCNN), in which dynamic k-max pooling is used instead of the general max pooling. The proposed network is evaluated on several different tasks, including sentiment prediction and question classification. Dos Santos and Gatti (2014) develop a CNN architecture for sentiment analysis of short texts that jointly uses character-level, word-level and sentence-level information, achieving state-of-the-art results on well-known sentiment analysis benchmarks.
Weston et al. (2014) present work on hashtag prediction using a CNN. During training, hashtags are used as a supervised signal in a pairwise hinge loss function. Experimental results on a large-scale task are reported. Kim (2014) proposes a simple CNN for sentence classification built on top of the well-known word vectors obtained from an unsupervised neural language model trained on billions of words, known as word2vec (Mikolov et al., 2013). A multi-channel variant which combines static word vectors from word2vec and word vectors which are fine-tuned via backpropagation is also proposed. Experiments with different variants are performed on a number of benchmarks for sentence classification, showing that the simple CNN performs remarkably well, with state-of-the-art results in many of the benchmarks. Hu et al. (2014) propose a CNN architecture for hierarchical sentence modeling and, based on that, two architectures for sentence matching. They train the latter networks using a ranking-based loss function on three sentence matching tasks of different nature: sentence completion, response matching and paraphrase identification. The proposed architectures outperform previous work for sentence completion and response matching, while the results are slightly worse than the state of the art in paraphrase identification.

The Proposed Neural Network
Given a sentence x and its target nouns, CR-CNN computes a score for each relation class $c \in C$. For each class $c \in C$, the network learns a distributed vector representation which is encoded as a column in the class embedding matrix $W^{classes}$. As detailed in Figure 1, the only input to the network is the tokenized text of the sentence. In the first step, CR-CNN transforms words into real-valued feature vectors. Next, a convolutional layer is used to construct a distributed vector representation of the sentence, $r_x$. Finally, CR-CNN computes a score for each relation class $c \in C$ by performing a dot product between $r_x$ and the corresponding column of $W^{classes}$.

Word Embeddings
The first layer of the network transforms words into representations that capture syntactic and semantic information about the words. Given a sentence x consisting of N words $x = \{w_1, w_2, \ldots, w_N\}$, every word $w_n$ is converted into a real-valued vector $r^{w_n}$. Therefore, the input to the next NN layer is a sequence of real-valued vectors $emb_x = \{r^{w_1}, r^{w_2}, \ldots, r^{w_N}\}$.
Word representations are encoded by column vectors in an embedding matrix $W^{wrd} \in \mathbb{R}^{d^w \times |V|}$, where V is a fixed-sized vocabulary. Each column $W^{wrd}_i \in \mathbb{R}^{d^w}$ corresponds to the word embedding of the i-th word in the vocabulary. We transform a word w into its word embedding $r^w$ by using the matrix-vector product:
$$r^w = W^{wrd} v^w$$
where $v^w$ is a vector of size $|V|$ which has value 1 at index w and zero in all other positions. The matrix $W^{wrd}$ is a parameter to be learned, and the size of the word embedding $d^w$ is a hyperparameter to be chosen by the user.
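As a minimal sketch of the lookup described above, the following NumPy snippet checks that the one-hot matrix-vector product selects a column of the embedding matrix. The toy vocabulary, the embedding size and the function names are illustrative assumptions, not values from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding size: both are assumptions for illustration.
vocab = ["<pad>", "the", "car", "left", "plant"]
d_w = 4
W_wrd = rng.normal(size=(d_w, len(vocab)))  # one column per vocabulary word


def embed_one_hot(word_index):
    """Compute r^w = W^wrd v^w with an explicit one-hot vector v^w."""
    v = np.zeros(len(vocab))
    v[word_index] = 1.0
    return W_wrd @ v


def embed_lookup(word_index):
    """Equivalent and cheaper: select the column directly."""
    return W_wrd[:, word_index]


sentence = ["the", "car", "left", "the", "plant"]
# emb_x holds one embedding per column, in sentence order.
emb_x = np.stack([embed_lookup(vocab.index(w)) for w in sentence], axis=1)
```

In practice the column selection is what an implementation would use, since building the one-hot vector costs time proportional to the vocabulary size.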

Word Position Embeddings
In the task of relation classification, the information that is most important for determining the class of a relation between two target nouns normally comes from words which are close to the target nouns. Zeng et al. (2014) propose the use of word position embeddings (position features), which help the CNN by indicating how close words are to the target nouns. These features are similar to the position features proposed by Collobert et al. (2011) for the Semantic Role Labeling task.
In this work we also experiment with the word position embeddings (WPE) proposed by Zeng et al. (2014). The WPE is derived from the relative distances of the current word to the targets $noun_1$ and $noun_2$. For instance, in the sentence shown in Figure 1, the relative distances of left to car and plant are -1 and 2, respectively. As in Collobert et al. (2011), each relative distance is mapped to a vector of dimension $d^{wpe}$, which is initialized with random numbers. $d^{wpe}$ is a hyperparameter of the network. Given the vectors $wp_1$ and $wp_2$ for the word w with respect to the targets $noun_1$ and $noun_2$, the position embedding of w is given by the concatenation of these two vectors, $wpe^w = [wp_1, wp_2]$. In the experiments where word position embeddings are used, the word embedding and the word position embedding of each word are concatenated to form the input for the convolutional layer, $emb_x = \{[r^{w_1}, wpe^{w_1}], [r^{w_2}, wpe^{w_2}], \ldots, [r^{w_N}, wpe^{w_N}]\}$.
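The position features can be illustrated with a short sketch. It assumes that the Figure 1 sentence is "the car left the plant", that a relative distance is the target position minus the word position (which reproduces the -1 and 2 quoted above), and that distances are clipped to a fixed range; the clipping, the table size and all names are hypothetical choices for illustration.

```python
import random

random.seed(0)

D_WPE = 3      # d^wpe, illustrative value
MAX_DIST = 10  # assumed clipping range, not taken from the paper

# One randomly initialised vector per relative distance in [-MAX_DIST, MAX_DIST].
position_table = {
    d: [random.uniform(-0.1, 0.1) for _ in range(D_WPE)]
    for d in range(-MAX_DIST, MAX_DIST + 1)
}


def relative_distances(word_idx, e1_idx, e2_idx):
    """Distances from the word to each target (assumed sign: target minus word)."""
    return e1_idx - word_idx, e2_idx - word_idx


def clip(distance):
    return max(-MAX_DIST, min(MAX_DIST, distance))


def wpe(word_idx, e1_idx, e2_idx):
    """wpe^w = [wp_1, wp_2]: concatenation of the two position vectors."""
    d1, d2 = relative_distances(word_idx, e1_idx, e2_idx)
    return position_table[clip(d1)] + position_table[clip(d2)]


tokens = ["the", "car", "left", "the", "plant"]  # assumed Figure 1 sentence
e1_idx, e2_idx = tokens.index("car"), tokens.index("plant")
left_distances = relative_distances(tokens.index("left"), e1_idx, e2_idx)
```

Each word's final input vector would then be its word embedding followed by the 2 × d^wpe values returned by `wpe`.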

Sentence Representation
The next step in the NN consists of creating the distributed vector representation $r_x$ for the input sentence x. The main challenges in this step are the variability in sentence size and the fact that important information can appear at any position in the sentence. In recent work, convolutional approaches have been used to tackle these issues when creating representations for text segments of different sizes (Zeng et al., 2014; Hu et al., 2014; dos Santos and Gatti, 2014) and character-level representations of words of different sizes (dos Santos and Zadrozny, 2014). Here, we use a convolutional layer to compute distributed vector representations of the sentence. The convolutional layer first produces local features around each word in the sentence. Then, it combines these local features using a max operation to create a fixed-sized vector for the input sentence.
Given a sentence x, the convolutional layer applies a matrix-vector operation to each window of k successive words in $emb_x = \{r^{w_1}, r^{w_2}, \ldots, r^{w_N}\}$. Let us define the vector $z_n \in \mathbb{R}^{d^w k}$ as the concatenation of a sequence of k word embeddings, centered on the n-th word:
$$z_n = (r^{w_{n-(k-1)/2}}, \ldots, r^{w_{n+(k-1)/2}})^T$$
Words with indices outside of the sentence boundaries use a common padding embedding.
The convolutional layer computes the j-th element of the vector $r_x \in \mathbb{R}^{d^c}$ as follows:
$$[r_x]_j = \max_{1 \le n \le N} \left[ f\left(W^1 z_n + b^1\right) \right]_j$$
where $W^1 \in \mathbb{R}^{d^c \times d^w k}$ is the weight matrix of the convolutional layer and f is the hyperbolic tangent function. The same matrix is used to extract local features around each word window of the given sentence. The fixed-sized distributed vector representation for the sentence is obtained by taking the max over all word windows. Matrix $W^1$ and vector $b^1$ are parameters to be learned. The number of convolutional units $d^c$ and the size of the word context window k are hyperparameters to be chosen by the user. It is important to note that $d^c$ corresponds to the size of the sentence representation.
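The window construction and the max over windows can be sketched in NumPy as follows. This is an illustrative reading of the description above, assuming an odd window size k and embeddings stored one per column; the function name and the toy dimensions are assumptions.

```python
import numpy as np


def sentence_representation(emb_x, W1, b1, k, pad):
    """Return r_x for a sentence.

    emb_x: (d, N) matrix with one input embedding per column
    W1:    (d_c, d * k) convolution weights, b1: (d_c,) bias
    k:     odd window size, pad: (d,) shared padding embedding
    """
    d, n_words = emb_x.shape
    half = (k - 1) // 2
    pad_block = np.tile(pad[:, None], (1, half))
    padded = np.concatenate([pad_block, emb_x, pad_block], axis=1)

    local_features = []
    for n in range(n_words):
        # z_n: the k embeddings centered on word n, concatenated in order.
        z_n = padded[:, n:n + k].T.reshape(-1)
        local_features.append(np.tanh(W1 @ z_n + b1))

    # Element-wise max over all windows gives a fixed-size vector of length d_c.
    return np.stack(local_features, axis=1).max(axis=1)


rng = np.random.default_rng(0)
d, n_words, k, d_c = 4, 5, 3, 6  # toy sizes, chosen only for illustration
emb_x = rng.normal(size=(d, n_words))
W1 = rng.normal(size=(d_c, d * k))
b1 = rng.normal(size=d_c)
r_x = sentence_representation(emb_x, W1, b1, k, pad=np.zeros(d))
```

The output size depends only on d_c, so sentences of any length map to vectors of the same dimension.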

Class embeddings and Scoring
Given the distributed vector representation of the input sentence x, the network with parameter set θ computes the score for a class label $c \in C$ by using the dot product
$$s_\theta(x)_c = r_x^T [W^{classes}]_c$$
where $W^{classes}$ is an embedding matrix whose columns encode the distributed vector representations of the different class labels, and $[W^{classes}]_c$ is the column vector that contains the embedding of the class c. Note that the number of dimensions in each class embedding must be equal to the size of the sentence representation, which is defined by $d^c$. The embedding matrix $W^{classes}$ is a parameter to be learned by the network. It is initialized by randomly sampling each value from a uniform distribution $U(-r, r)$, where $r = \sqrt{\frac{6}{|C| + d^c}}$.
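A small sketch of the class embedding initialization and the scoring step, under the assumption that class embeddings are stored as columns of a NumPy array; the function names and toy sizes are hypothetical.

```python
import numpy as np


def init_class_embeddings(num_classes, d_c, rng):
    """Sample W^classes from U(-r, r) with r = sqrt(6 / (|C| + d^c))."""
    r = np.sqrt(6.0 / (num_classes + d_c))
    return rng.uniform(-r, r, size=(d_c, num_classes))


def class_scores(r_x, W_classes):
    """s_theta(x)_c = r_x^T [W^classes]_c, computed for every class at once."""
    return r_x @ W_classes


rng = np.random.default_rng(0)
num_classes, d_c = 18, 6  # toy sizes, chosen only for illustration
W_classes = init_class_embeddings(num_classes, d_c, rng)
r_x = rng.normal(size=d_c)
scores = class_scores(r_x, W_classes)
```

Computing all scores as one vector-matrix product is equivalent to taking the dot product with each column separately.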

Training Procedure
Our network is trained by minimizing a pairwise ranking loss function over the training set D. The input for each training round is a sentence x and two different class labels $y^+ \in C$ and $c^- \in C$, where $y^+$ is a correct class label for x and $c^-$ is not. Let $s_\theta(x)_{y^+}$ and $s_\theta(x)_{c^-}$ be respectively the scores for class labels $y^+$ and $c^-$ generated by the network with parameter set θ. We propose a new logistic loss function over these scores in order to train CR-CNN:
$$L = \log\left(1 + \exp\left(\gamma\,(m^+ - s_\theta(x)_{y^+})\right)\right) + \log\left(1 + \exp\left(\gamma\,(m^- + s_\theta(x)_{c^-})\right)\right) \quad (1)$$
where $m^+$ and $m^-$ are margins and γ is a scaling factor that magnifies the difference between the score and the margin and helps to penalize prediction errors more heavily. The first term on the right side of Equation 1 decreases as the score $s_\theta(x)_{y^+}$ increases. The second term on the right side decreases as the score $s_\theta(x)_{c^-}$ decreases. Training CR-CNN by minimizing this loss function has the effect of learning to give scores greater than $m^+$ for the correct class and (negative) scores smaller than $-m^-$ for incorrect classes. In our experiments we set γ to 2, $m^+$ to 2.5 and $m^-$ to 0.5. We use L2 regularization by adding the term $\beta \|\theta\|^2$ to Equation 1. In our experiments we set β to 0.001. We use stochastic gradient descent (SGD) to minimize the loss function with respect to θ.
Like some other ranking approaches that only update two classes/examples at every training round (Weston et al., 2011; Gao et al., 2014), we can efficiently train the network for tasks which have a very large number of classes. This is an advantage over softmax classifiers.
On the other hand, sampling informative negative classes/examples can have a significant impact on the effectiveness of the learned model. In the case of our loss function, the more informative negative classes are the ones with a score larger than $-m^-$. The number of classes in the relation classification dataset that we use in our experiments is small. Therefore, in our experiments, given a sentence x with class label $y^+$, the incorrect class $c^-$ that we choose to perform an SGD step is the one with the highest score among all incorrect classes:
$$c^- = \arg\max_{c \in C;\; c \neq y^+} s_\theta(x)_c$$
For tasks where the number of classes is large, we can fix a number of negative classes to be considered for each example and select the one with the largest score to perform a gradient step. This approach is similar to the one used by Weston et al. (2014) to select negative examples.
We use the backpropagation algorithm to compute gradients of the network. In our experiments, we implement the CR-CNN architecture and the backpropagation algorithm using Theano (Bergstra et al., 2010).
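The behaviour of the ranking loss is easy to check numerically. The sketch below evaluates the two logistic terms for a single (correct, incorrect) score pair, using the margins and scaling factor quoted above as defaults; the helper names are assumptions, and the L2 term is left out.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def ranking_loss(s_pos, s_neg, gamma=2.0, m_pos=2.5, m_neg=0.5):
    """Pairwise ranking loss for one sentence, without L2 regularization.

    s_pos: score of the correct class y+
    s_neg: score of the selected incorrect class c-
    """
    return softplus(gamma * (m_pos - s_pos)) + softplus(gamma * (m_neg + s_neg))


# Exactly on both margins, each term equals log(2).
loss_on_margins = ranking_loss(2.5, -0.5)
```

Raising the correct-class score or lowering the incorrect-class score reduces the loss, and both terms become small once the scores clear their margins.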
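The two negative-class selection strategies described above can be sketched as follows. The function names are hypothetical, and the sampled variant is only one possible reading of "fix a number of negative classes to be considered".

```python
import random


def select_negative_class(scores, correct_class):
    """Return the incorrect class with the highest score."""
    candidates = [c for c in range(len(scores)) if c != correct_class]
    return max(candidates, key=lambda c: scores[c])


def select_negative_class_sampled(scores, correct_class, num_candidates, rng):
    """Variant for many classes: score only a random subset of incorrect classes."""
    pool = [c for c in range(len(scores)) if c != correct_class]
    sampled = rng.sample(pool, min(num_candidates, len(pool)))
    return max(sampled, key=lambda c: scores[c])


scores = [0.2, 1.5, 0.9, -0.4]
hardest = select_negative_class(scores, correct_class=1)
sampled = select_negative_class_sampled(scores, 1, num_candidates=2, rng=random.Random(0))
```

With the exhaustive version every class must be scored for every example, which is affordable only when the label set is small.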

Special Treatment of Artificial Classes
In this work, we consider a class as artificial if it is used to group items that do not belong to any of the actual classes. An example of an artificial class is the class Other in the SemEval-2010 relation classification task. In this task, the artificial class Other is used to indicate that the relation between two nominals does not belong to any of the nine relation classes of interest. Therefore, the class Other is very noisy since it groups many different types of relations that may not have much in common.
An important characteristic of CR-CNN is that it makes it easy to reduce the effect of artificial classes by omitting their embeddings. If the embedding of a class label c is omitted, the embedding matrix $W^{classes}$ does not contain a column vector for c. One of the main benefits of this strategy is that the learning process focuses on the actual classes only. Moreover, since the embedding of the artificial class is omitted, it does not influence the prediction step, i.e., CR-CNN does not produce a score for the artificial class.
In our experiments with the SemEval-2010 relation classification task, when training with a sentence x whose class label is y = Other, the first term on the right side of Equation 1 is set to zero. At prediction time, a relation is classified as Other only if all actual classes have negative scores. Otherwise, it is assigned the class which has the largest score.
The SemEval-2010 Task 8 dataset is already partitioned into 8,000 training instances and 2,717 test instances. We score our systems by using the SemEval-2010 Task 8 official scorer, which computes the macro-averaged F1-scores for the nine actual relations (excluding Other) and takes the directionality into consideration.
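The prediction rule for the artificial class, under which Other is chosen only when every actual class scores negatively, can be sketched as follows. The function name and the example labels are assumptions for illustration.

```python
def predict(scores, class_names, other_label="Other"):
    """Predict a label from the scores of the actual classes only.

    scores[i] is the score of class_names[i]; the artificial class has no score.
    """
    best = max(range(len(scores)), key=lambda c: scores[c])
    if scores[best] < 0:
        # All actual classes have negative scores.
        return other_label
    return class_names[best]


labels = ["Cause-Effect(e1,e2)", "Cause-Effect(e2,e1)"]
all_negative = predict([-0.3, -1.2], labels)
one_positive = predict([-0.3, 0.7], labels)
```

Because the rule depends on the sign of the scores, the margins used during training directly control how many hard examples fall back to Other.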

Word Embeddings Initialization
The word embeddings used in our experiments are initialized by means of unsupervised pre-training. We perform pre-training using the skip-gram NN architecture (Mikolov et al., 2013) available in the word2vec tool. We use the December 2013 snapshot of the English Wikipedia corpus to train word embeddings with word2vec. We preprocess the Wikipedia text using the steps described in dos Santos and Gatti (2014): (1) removal of paragraphs that are not in English; (2) substitution of non-western characters by a special character; (3) tokenization of the text using the tokenizer available with the Stanford POS Tagger (Toutanova et al., 2003); (4) removal of sentences that are less than 20 characters long (including white spaces) or have fewer than 5 tokens; and (5) lowercasing of all words and substitution of each numerical digit by a 0. The resulting clean corpus contains about 1.75 billion tokens.
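Steps (4) and (5) of the preprocessing are simple enough to sketch directly. The snippet below uses whitespace splitting as a stand-in for the Stanford tokenizer, which is an assumption made only to keep the example self-contained; steps (1) to (3) are not covered, and the function names are hypothetical.

```python
import re


def keep_sentence(sentence, tokens):
    """Step (4): drop sentences under 20 characters or with fewer than 5 tokens."""
    return len(sentence) >= 20 and len(tokens) >= 5


def normalize_token(token):
    """Step (5): lowercase the token and replace each digit with 0."""
    return re.sub(r"[0-9]", "0", token.lower())


def preprocess(sentences):
    cleaned = []
    for sentence in sentences:
        tokens = sentence.split()  # stand-in for the real tokenizer
        if keep_sentence(sentence, tokens):
            cleaned.append([normalize_token(t) for t in tokens])
    return cleaned


corpus = ["Hi there.", "Wikipedia had 4 million articles in 2013 ."]
cleaned = preprocess(corpus)
```

Mapping every digit to 0 collapses all numbers of the same length into one token, which keeps the vocabulary small.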

Neural Network Hyper-parameter
We use 4-fold cross-validation to tune the neural network hyperparameters. Learning rates between 0.01 and 0.03 give relatively similar results. The best results are achieved using between 10 and 15 training epochs, depending on the CR-CNN configuration.

Word Position Embeddings and Input Text Span
In the experiments discussed in this section we assess the impact of using word position embeddings (WPE) and also propose a simpler alternative approach that is almost as effective as WPEs. The main idea behind the use of WPEs in the relation classification task is to give the convolutional layer a hint of how close a word is to the target nouns, since we believe that closer words have more impact than distant words.
Here we hypothesize that most of the information needed to classify the relation appears between the two target nouns. Based on this hypothesis, we perform an experiment where the input for the convolutional layer consists of the word embeddings of the word sequence $\{w_{e_1 - 1}, \ldots, w_{e_2 + 1}\}$, where $e_1$ and $e_2$ correspond to the positions of the first and the second target nouns, respectively.
In Table 2 we compare the results of different CR-CNN configurations. The first column indicates whether the full sentence was used (Yes) or only the text span between the target nouns (No). The second column indicates whether WPEs were used. It is clear that the use of WPEs is essential when the full sentence is used, since F1 jumps from 74.3 to 84.1. This effect of WPEs was also reported by Zeng et al. (2014). On the other hand, when using only the text span between the target nouns, the impact of WPEs is much smaller. With this strategy, we achieve an F1 of 82.8 using only word embeddings as input, which is a result as good as the previous state of the art reported in the literature for the SemEval-2010 Task 8 dataset. This experimental result also suggests that, in this task, the CNN works better for short texts.
All experiments reported in the next sections use CR-CNN with the full sentence and WPEs.
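Extracting the reduced input span is a one-line slice. The sketch below assumes 0-based token indices and clamps the span at the sentence boundaries, which is an assumption about how edge cases are handled; the function name is hypothetical.

```python
def span_between_targets(tokens, e1_idx, e2_idx):
    """Return the tokens from position e1 - 1 to position e2 + 1, inclusive."""
    start = max(e1_idx - 1, 0)
    end = min(e2_idx + 1, len(tokens) - 1)
    return tokens[start:end + 1]


tokens = "the introduction in the book is a summary".split()
span = span_between_targets(tokens, e1_idx=1, e2_idx=4)
```

The span keeps one word of context on each side of the target nouns, so both nominals and their immediate neighbours remain visible to the convolutional layer.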

Impact of Omitting the Embedding of the artificial class Other
In this experiment we assess the impact of omitting the embedding of the class Other. As mentioned before, this class is very noisy since it groups many different infrequent relation types. Therefore, its embedding is expected to be difficult to define and can also bring noise into the classification process of the actual classes. In Table 3 we present the results comparing the use and omission of the embedding for the class Other.
The first two rows of results present the official F1, which does not take into account the results for the class Other. We can see that by omitting the embedding of the class Other both precision and recall for the other classes improve, which results in an increase of 1.4 in the F1. These results suggest that the strategy we use in CR-CNN to avoid the noise of artificial classes is effective. In the last two rows of Table 3 we present the results for the class Other. We can note that the precision for the examples classified as Other decreases significantly when the embedding of the class Other is not used. We believe this is due to the strategy we use to classify an instance as Other. Recall that in our approach an example is classified as Other if the scores for all actual classes are negative (see Section 3.6). This causes many difficult cases of the actual classes to be classified as Other, which helps to decrease the precision of the class Other from 60.1 to 52.0 and to increase the precision of the actual classes from 81.3 to 83.7.

CR-CNN versus CNN+Softmax
In this section we report experimental results comparing CR-CNN with CNN+Softmax. In order to make a fair comparison, we implemented a CNN+Softmax and trained it with the same data, word embeddings and WPEs used in CR-CNN. Concretely, our CNN+Softmax takes the output of the convolutional layer, which is the vector $r_x$ in Figure 1, and gives it as input to a softmax classifier. We tune the parameters of CNN+Softmax by using 4-fold cross-validation on the training set. Compared to the hyperparameter values for CR-CNN presented in Table 1, the only difference for CNN+Softmax is the number of convolutional units $d^c$, which is set to 400.
In Table 4 we compare the results of CR-CNN and CNN+Softmax. CR-CNN outperforms CNN+Softmax in both precision and recall, and improves the F1 by 1.6. The third row in Table 4 shows the result reported by Zeng et al. (2014) when only word embeddings and WPEs are used as input to the network (similar to our CNN+Softmax). We believe that the word embeddings they employ are the main factor for their result being much worse than that of our CNN+Softmax. We use word embeddings of size 400 while they use word embeddings of size 50, which were trained using much less unlabeled data than ours.

Comparison with the State-of-the-art
In Table 5 we compare CR-CNN results with results recently published for the SemEval-2010 Task 8 dataset. Rink and Harabagiu (2010) present a support vector machine (SVM) classifier that is fed with a rich (traditional) feature set. This was the best performing system at SemEval-2010 Task 8. Socher et al. (2012) present results for a recursive neural network (RNN) that assigns a matrix-vector representation to every node in a parse tree in order to compose the distributed vector representation for the complete sentence. Their method is named the matrix-vector recursive neural network (MVRNN). Zeng et al. (2014) present results for a CNN+Softmax classifier which employs lexical and sentence-level features. Yu et al. (2014) present results for the Factor-based Compositional Embedding Model (FCM), which derives sentence-level and substructure embeddings from word embeddings utilizing dependency trees and named entities.
CR-CNN using the full sentence, word embeddings and WPEs outperforms all previously reported results and reaches a new state-of-the-art F1 of 84.1. This is a remarkable result since we do not use any complicated features that depend on external lexical resources such as WordNet or on NLP tools such as named entity recognizers (NERs) and dependency parsers.
We can see in Table 5 that CR-CNN also achieves the best result among the systems that use word embeddings as the only input features. The closest result (80.6), which is produced by the FCM system of Yu et al. (2014), is 2.2 F1 points behind the CR-CNN result (82.8).

Most Representative Trigrams for each Relation
Table 6: List of most representative trigrams for each relation type.

Component-Whole
(e1,e2): e1 of the, of the e2, part of the, in the e2, e1 on the
(e2,e1): e2 's e1, with its e1, e2 has a, e2 comprises the, e2 with e1

Content-Container
(e1,e2): was in a, was hidden in, were in a, was inside a, was contained in
(e2,e1): e2 full of, e2 with e1, e2 was full, e2 contained a, e2 with cold

Entity-Destination
(e1,e2): e1 into the, e1 into a, e1 to the, was put inside, imported into the
(e2,e1): -

Entity-Origin
(e1,e2): away from the, derived from a, had left the, derived from an, e1 from the
(e2,e1): the source of, e2 grape e1, e2 butter e1

Instrument-Agency
(e1,e2): are used by, e1 for e2, is used by, trade for e2, with the e2
(e2,e1): with a e1, by using e1, e2 finds a, e2 with a, e2 , who

Member-Collection
(e1,e2): of the e2, in the e2, of this e2, the political e2, e1 collected in
(e2,e1): e2 of e1, of wild e1, of elven e1, e2 of different, of 0000 e1

Message-Topic
(e1,e2): e1 is the, e1 asserts the, e1 that the, on the e2, e1 inform about
(e2,e1): described in the, discussed in the, featured in numerous, discussed in cabinet, documented in two

Product-Producer
(e1,e2): e1 by the, by a e2, of the e2, by the e2, from the e2
(e2,e1): e2 of the, e2 has constructed, e2 's e1, e2 came up, e2 who created

In order to create the results presented in Table 6, we rank the trigrams which were selected as the most representative of any sentence in decreasing order of contribution value. If a trigram appears as the largest contributor for more than one sentence, its contribution value becomes the sum of its contributions for each sentence.
We can see in Table 6 that for most classes, the trigrams that contributed the most to increasing the score are indeed very informative regarding the relation type. As expected, different trigrams play an important role depending on the direction of the relation. For instance, the most informative trigram for Entity-Origin(e1,e2) is "away from the", while the reverse direction of the relation, Entity-Origin(e2,e1) or Origin-Entity, has "the source of" as the most informative trigram. These results are a step towards the extraction of meaningful knowledge from models produced by CNNs.
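The procedure behind Table 6 can be sketched in two parts: finding the trigram that contributes most to a sentence's class score by tracing each position of the sentence representation back to the window that produced its max, and then summing contributions across sentences. The sketch assumes a window size of 3 and access to the per-window activations before max pooling; all names are hypothetical.

```python
from collections import defaultdict

import numpy as np


def most_representative_trigram(H, w_c, trigrams):
    """Find the trigram with the largest contribution to the class score.

    H:        (d_c, N) per-window activations before max pooling
    w_c:      (d_c,) embedding of the class being scored
    trigrams: list of N strings, one per window
    """
    winners = H.argmax(axis=1)  # window that produced each position of r_x
    r_x = H.max(axis=1)
    contributions = np.zeros(H.shape[1])
    # Sum the score terms r_x[j] * w_c[j] over the positions won by each window.
    np.add.at(contributions, winners, r_x * w_c)
    best = int(contributions.argmax())
    return trigrams[best], float(contributions[best])


def rank_trigrams(per_sentence):
    """Sum contributions of repeated trigrams and sort in decreasing order."""
    totals = defaultdict(float)
    for trigram, value in per_sentence:
        totals[trigram] += value
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


H = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
w_c = np.array([1.0, 0.5, 1.0])
top_trigram, top_value = most_representative_trigram(H, w_c, ["a b c", "b c d"])
ranking = rank_trigrams([("x", 1.0), ("y", 1.5), ("x", 1.0)])
```

In the toy example the first window wins positions 0 and 2, so its contribution is 0.9 × 1.0 + 0.7 × 1.0.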

Conclusion
In this work we tackle the relation classification task using a CNN that performs classification by ranking. The main contributions of this work are: (1) the definition of a new state of the art for the SemEval-2010 Task 8 dataset without using any costly handcrafted features; (2) the proposal of a new CNN for classification that uses class embeddings and a new ranking loss function; (3) an effective method to deal with artificial classes by omitting their embeddings in CR-CNN; (4) the demonstration that using only the text between target nominals is almost as effective as using WPEs; and (5) a method to extract from the CR-CNN model the most representative contexts of each relation type. Although we apply CR-CNN to relation classification, this method can be used for any classification task.
The [introduction]e1 in the [book]e2 is a summary of what is in the text.

Figure 1: CR-CNN: a Neural Network for classifying by ranking.
Dataset and Evaluation Metric
We use the SemEval-2010 Task 8 dataset to perform our experiments. This dataset contains 10,717 examples annotated with 9 different relation types and an artificial relation Other, which is used to indicate that the relation in the example does not belong to any of the nine main relation types. The nine relations are Cause-Effect, Component-Whole, Content-Container, Entity-Destination, Entity-Origin, Instrument-Agency, Member-Collection, Message-Topic and Product-Producer. Each example contains a sentence marked with two nominals e1 and e2, and the task consists of predicting the relation between the two nominals taking into consideration the directionality. That means that the relation Cause-Effect(e1,e2) is different from the relation Cause-Effect(e2,e1), as shown in the examples below. More information about this dataset can be found in Hendrickx et al. (2010).
The [war]e1 resulted in other collateral imperial [conquests]e2 as well. ⇒ Cause-Effect(e1,e2)
The [burst]e1 has been caused by water hammer [pressure]e2. ⇒ Cause-Effect(e2,e1)

In Table 1, we show the selected hyperparameter values. Additionally, we use a learning rate schedule that decreases the learning rate λ according to the training epoch t. The learning rate for epoch t, $\lambda_t$, is computed using the equation: $\lambda_t = \lambda / t$.
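As a small illustration of the learning rate schedule, the sketch below assumes the schedule divides the initial rate λ by the epoch number t and that epochs are counted from 1; the function name and the initial rate used in the example are hypothetical.

```python
def learning_rate(initial_rate, epoch):
    """Learning rate for a given epoch, assuming lambda_t = lambda / t with t >= 1."""
    if epoch < 1:
        raise ValueError("epochs are assumed to be numbered from 1")
    return initial_rate / epoch


# Illustrative initial rate, not a value reported in the experiments.
schedule = [learning_rate(0.03, t) for t in range(1, 4)]
```

Under this schedule the rate falls quickly in the first few epochs and then changes slowly.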

Table 2: Comparison of different CR-CNN configurations.

Table 3: Impact of not using an embedding for the artificial class Other.

Table 4: Comparison of results of CR-CNN and CNN+Softmax.

In Table 6, for each relation type we present the five trigrams in the test set which contributed the most to the scores of correctly classified examples. Recall that in CR-CNN, given a sentence x, the score for the class c is computed by $s_\theta(x)_c = r_x^T [W^{classes}]_c$. In order to compute the most representative trigram of a sentence x, we trace back each position in $r_x$ to find the trigram responsible for it. For each trigram t, we compute its particular contribution to the score by summing the terms of the score that use positions in $r_x$ that trace back to t. The most representative trigram in x is the one with the largest contribution to increasing the score.

Table 5: Comparison with results published in the literature.