Distant Supervision for Relation Extraction with Matrix Completion

The essence of distantly supervised relation extraction is that it is an incomplete multi-label classification problem with sparse and noisy features. To tackle the sparsity and noise challenges, we propose solving the classification problem using matrix completion on a factorized matrix of minimized rank. We formulate relation classification as completing the unknown labels of testing items (entity pairs) in a sparse matrix that concatenates training and testing textual features with training labels. Our algorithmic framework is based on the assumption that the rank of the joint item-by-feature and item-by-label matrix is low. We apply two optimization models to recover the underlying low-rank matrix, leveraging the sparsity of the feature-label matrix. The matrix completion problem is then solved by the fixed point continuation (FPC) algorithm, which can find the global optimum. Experiments on two widely used datasets with different dimensions of textual features demonstrate that our low-rank matrix completion approach significantly outperforms the baseline and the state-of-the-art methods.


Introduction
Relation Extraction (RE) is the process of generating structured relation knowledge from unstructured natural language texts. Traditional supervised methods (Zhou et al., 2005; Bach and Badaskar, 2007) on small hand-labeled corpora, such as MUC (http://www.itl.nist.gov/iaui/894.02/relatedprojects/muc/) and ACE (http://www.itl.nist.gov/iad/mig/tests/ace/), can achieve high precision and recall. However, as producing hand-labeled corpora is laborious and expensive, the supervised approach cannot satisfy the increasing demand of building large-scale knowledge repositories with the explosion of Web texts. To address the lack of training data, we consider the distant (Mintz et al., 2009) or weak (Hoffmann et al., 2011) supervision paradigm attractive, and we improve the effectiveness of the paradigm in this paper.
Figure 2: The procedure of noise-tolerant low-rank matrix completion. (Panel labels: Observed Sparse Matrix; Training Items; Testing Items; Noisy Features; Incomplete Labels.) In this scenario, the distantly supervised relation extraction task is transformed into completing the labels for testing items (entity pairs) in a sparse matrix that concatenates training and testing textual features with training labels. We seek to recover the underlying low-rank matrix and to complete the unknown testing labels simultaneously.

The intuition of the paradigm is that one can take advantage of several knowledge bases, such as WordNet, Freebase and YAGO, to automatically label free texts, like Wikipedia and New York Times corpora, based on some heuristic alignment assumptions. An example accounting for the basic but practical assumption is illustrated in Figure 1, in which we know that the two entities (<Barack Obama, U.S.>) are not only involved in the relation instances coming from knowledge bases (President-of(Barack Obama, U.S.) and Born-in(Barack Obama, U.S.)), but also co-occur in several relation mentions appearing in free texts (Barack Obama is the 44th and current President of the U.S. and Barack Obama was born in Honolulu, Hawaii, U.S., etc.). We extract diverse textual features from all those relation mentions and combine them into a rich feature vector labeled by the relation names (President-of and Born-in) to produce a weak training corpus for relation classification.
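As a minimal sketch of the basic alignment assumption described above, the following Python snippet labels an entity pair with every knowledge-base relation it participates in and pools the features of all its mentions into one item. The function name `align`, the tuple layouts and the feature strings are hypothetical illustrations, not the pipeline used in the paper.

```python
from collections import defaultdict

def align(kb_triples, mentions):
    """Pool mention features per entity pair and label the pair with its KB relations.

    kb_triples: iterable of (relation, head, tail)
    mentions:   iterable of (head, tail, set_of_feature_strings)
    """
    relations = defaultdict(set)
    for relation, head, tail in kb_triples:
        relations[(head, tail)].add(relation)

    corpus = {}
    for head, tail, features in mentions:
        pair = (head, tail)
        if pair not in relations:
            continue  # pairs unknown to the knowledge base receive no label here
        item = corpus.setdefault(
            pair, {"features": set(), "labels": set(relations[pair])}
        )
        item["features"].update(features)  # features of all mentions are merged
    return corpus

# Hypothetical toy input mirroring the Barack Obama example.
kb = [("President-of", "Barack Obama", "U.S."), ("Born-in", "Barack Obama", "U.S.")]
mentions = [
    ("Barack Obama", "U.S.", {"path:president_of", "ner:PER-LOC"}),
    ("Barack Obama", "U.S.", {"path:born_in", "ner:PER-LOC"}),
]
corpus = align(kb, mentions)
```

Every mention contributes features to the same item regardless of whether it actually expresses one of the labels, which is how noisy features enter the corpus.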
This paradigm is promising for generating large-scale training corpora automatically. However, it comes up against three technical challenges: sparse features, noisy features and incomplete labels. Figure 1 illustrates the latter two:
• Noisy features. Not all relation mentions express the corresponding relation instances. For example, the second relation mention in Figure 1 does not explicitly describe any relation instance, so features extracted from this sentence can be noisy. Such cases commonly exist in feature extraction.
• Incomplete labels. Similar to noisy features, the generated labels can be incomplete. For example, the fourth relation mention in Figure 1 should have been labeled by the relation Senate-of. However, the incomplete knowledge base does not contain the corresponding relation instance (Senate-of(Barack Obama, U.S.)). Therefore, the distant supervision paradigm may generate incompletely labeled corpora.
In essence, distantly supervised relation extraction is an incomplete multi-label classification task with sparse and noisy features.
In this paper, we formulate the relation extraction task from a novel perspective of using matrix completion with a low-rank criterion. To the best of our knowledge, we are the first to apply this technique to relation extraction with distant supervision. More specifically, as shown in Figure 2, we model the task with a sparse matrix whose rows represent items (entity pairs) and whose columns contain noisy textual features and incomplete relation labels. In this way, relation classification is transformed into a problem of completing the unknown labels for testing items in the sparse matrix that concatenates training and testing textual features with training labels, based on the assumption that the joint item-by-feature and item-by-label matrix is of low rank. The rationale of this assumption is that noisy features and incomplete labels are semantically correlated. The low-rank factorization of the sparse feature-label matrix delivers a low-dimensional, de-correlated representation of features and labels.
We contribute two optimization models, DRMC-b and DRMC-1, which exploit the sparsity to recover the underlying low-rank matrix and to complete the unknown testing labels simultaneously. Moreover, the logistic cost function is integrated into our models to reduce the influence of noisy features and incomplete labels, because it is suitable for binary variables. We also modify the fixed point continuation (FPC) algorithm (Ma et al., 2011) to find the global optimum.
Experiments on two widely used datasets demonstrate that our noise-tolerant approaches outperform the baseline and the state-of-the-art methods. Furthermore, we discuss the influence of feature sparsity, and our approaches consistently achieve better performance than the compared methods under different sparsity degrees.

Related Work
The idea of distant supervision was first proposed in the field of bioinformatics (Craven and Kumlien, 1999). Snow et al. (2004) used WordNet as the knowledge base to discover more hypernym/hyponym relations between entities from news articles. However, both bioinformatic databases and WordNet are maintained by a few experts, and are thus hard to keep up to date.
As we are stepping into the big data era, the explosion of unstructured Web texts stimulates us to build more powerful models that can automatically extract relation instances from large-scale online natural language corpora without hand-labeled annotation. Mintz et al. (2009) adopted Freebase (Bollacker et al., 2008; Bollacker et al., 2007), a large-scale online crowdsourced knowledge base which contains billions of relation instances and thousands of relation names, to distantly supervise a Wikipedia corpus. The basic alignment assumption of this work is that if a pair of entities participate in a relation, all sentences that mention these entities are labeled by that relation name. Then we can extract a variety of textual features and learn a multi-class logistic regression classifier. Inspired by multi-instance learning (Maron and Lozano-Pérez, 1998), Riedel et al. (2010) relaxed the strong assumption and replaced all sentences with at least one sentence. Hoffmann et al. (2011) pointed out that many entity

Model
We apply a new technique from the field of applied mathematics, i.e., low-rank matrix completion with convex optimization. The breakthrough work on this topic was made by Candès and Recht (2009), who proved that most low-rank matrices can be perfectly recovered from an incomplete set of entries. This promising theory has been successfully applied to many active research areas, such as computer vision (Cabral et al., 2011), recommender systems (Rennie and Srebro, 2005) and system control (Fazel et al., 2001). Our models for relation extraction are based on the theoretical framework proposed by Goldberg et al. (2010), which formulated multi-label transductive learning as a matrix completion problem. The new framework for classification enhances the robustness to data noise by penalizing different cost functions for features and labels.

Formulation
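Suppose that we have built a training corpus for relation classification with $n$ items (entity pairs), $d$-dimensional textual features, and $t$ labels (relations), based on the basic alignment assumption proposed by Mintz et al. (2009). Let $X_{train} \in \mathbb{R}^{n \times d}$ and $Y_{train} \in \mathbb{R}^{n \times t}$ denote the feature matrix and the label matrix for training, respectively. The linear classifier we adopt aims to explicitly learn the weight matrix $W \in \mathbb{R}^{d \times t}$ and the bias column vector $b \in \mathbb{R}^{t \times 1}$ with the constraint of minimizing the loss function $l$,

$$\arg\min_{W, b} \; l\left(Y_{train}, \begin{bmatrix} \mathbf{1} & X_{train} \end{bmatrix} \begin{bmatrix} b^{T} \\ W \end{bmatrix}\right), \tag{1}$$

where $\mathbf{1}$ is the all-one column vector. Then we can predict the label matrix $Y_{test} \in \mathbb{R}^{m \times t}$ of $m$ testing items with respect to the feature matrix $X_{test} \in \mathbb{R}^{m \times d}$. Let

$$Z = \begin{bmatrix} X_{train} & Y_{train} \\ X_{test} & Y_{test} \end{bmatrix}.$$

This linear classification problem can be transformed into completing the unobservable entries in $Y_{test}$ by means of the observable entries in $X_{train}$, $Y_{train}$ and $X_{test}$, based on the assumption that the rank of the matrix $Z \in \mathbb{R}^{(n+m) \times (d+t)}$ is low.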
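To make the joint matrix concrete, here is a small NumPy sketch that stacks the training and testing blocks and records which entries are observed. It assumes a ±1 encoding of the binary entries and treats every feature entry as observed. Both choices, and the helper name `build_joint_matrix`, are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def build_joint_matrix(X_train, Y_train, X_test):
    """Stack features and labels into Z and mark which entries are observed."""
    n, d = X_train.shape
    m, t = X_test.shape[0], Y_train.shape[1]
    Z = np.zeros((n + m, d + t))
    Z[:n, :d] = X_train   # training features
    Z[n:, :d] = X_test    # testing features
    Z[:n, d:] = Y_train   # training labels
    observed = np.ones((n + m, d + t), dtype=bool)
    observed[n:, d:] = False  # the Y_test block is what must be completed
    return Z, observed

# Hypothetical toy data: n = 2 training items, m = 1 testing item, d = 3, t = 2.
X_train = np.array([[1.0, -1.0, 1.0], [-1.0, 1.0, 1.0]])
Y_train = np.array([[1.0, -1.0], [-1.0, 1.0]])
X_test = np.array([[1.0, -1.0, -1.0]])
Z, observed = build_joint_matrix(X_train, Y_train, X_test)
```

The boolean mask plays the role of the index sets of observable entries: its feature columns correspond to the observed features and its label columns to the observed training labels.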
The model can be written as

$$\arg\min_{Z \in \mathbb{R}^{(n+m) \times (d+t)}} \operatorname{rank}(Z) \quad \text{s.t.} \quad z_{ij} = x_{ij}, \; \forall (i,j) \in \Omega_X; \quad z_{i(j+d)} = y_{ij}, \; \forall (i,j) \in \Omega_Y, \tag{2}$$

where we use $\Omega_X$ to represent the index set of observable feature entries in $X_{train}$ and $X_{test}$, and $\Omega_Y$ to denote the index set of observable label entries in $Y_{train}$.

Formula (2) is usually impractical for real problems, as the entries in the matrix $Z$ are corrupted by noise. We thus define

$$Z = Z^{*} + E,$$

where $Z^{*}$ is the underlying low-rank matrix and $E$ is the error matrix. The rank function in Formula (2) is non-convex and difficult to optimize. Its surrogate can be the convex nuclear norm $\|Z\|_{*} = \sum_{k} \sigma_k(Z)$ (Candès and Recht, 2009), where $\sigma_k$ is the $k$-th largest singular value of $Z$. To tolerate the noisy entries in the error matrix $E$, we minimize the cost functions $C_x$ and $C_y$ for features and labels respectively, rather than using the hard constraints in Formula (2).
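The contrast between the rank and the nuclear norm can be illustrated numerically. In the sketch below (synthetic data and the helper name `nuclear_norm` are assumptions for illustration), a small perturbation makes a rank-2 matrix full rank, while its nuclear norm barely changes. This is one intuition for why the nuclear norm is a more forgiving objective under noise.

```python
import numpy as np

def nuclear_norm(Z):
    """Sum of the singular values of Z."""
    return np.linalg.svd(Z, compute_uv=False).sum()

rng = np.random.default_rng(0)
# A 6 x 5 matrix of rank 2, built as a product of thin factors.
low_rank = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))
# The same matrix corrupted by small dense noise.
noisy = low_rank + 0.01 * rng.standard_normal((6, 5))

rank_clean = np.linalg.matrix_rank(low_rank)
rank_noisy = np.linalg.matrix_rank(noisy)
norm_gap = abs(nuclear_norm(noisy) - nuclear_norm(low_rank))
```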
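According to Formula (1), $Z^{*} \in \mathbb{R}^{(n+m) \times (d+t)}$ can be represented as $[X^{*}, X^{*}W]$ instead of $[X^{*}, Y^{*}]$, by explicitly modeling the bias vector $b$. Therefore, this convex optimization model is called DRMC-b,

$$\arg\min_{Z, b} \; \mu \|Z\|_{*} + \frac{1}{|\Omega_X|} \sum_{(i,j) \in \Omega_X} C_x(z_{ij}, x_{ij}) + \frac{\lambda}{|\Omega_Y|} \sum_{(i,j) \in \Omega_Y} C_y(z_{i(j+d)} + b_j, y_{ij}), \tag{3}$$

where $\mu$ and $\lambda$ are the positive trade-off weights. More specifically, we minimize the nuclear norm $\|Z\|_{*}$ together with the regularization terms, i.e., the cost functions $C_x$ and $C_y$ for features and labels.
If we implicitly model the bias vector $b$ by prepending an all-one column to the matrix, so that $Z \in \mathbb{R}^{(n+m) \times (1+d+t)}$, we obtain another optimization model called DRMC-1,

$$\arg\min_{Z} \; \mu \|Z\|_{*} + \frac{1}{|\Omega_X|} \sum_{(i,j) \in \Omega_X} C_x(z_{i(j+1)}, x_{ij}) + \frac{\lambda}{|\Omega_Y|} \sum_{(i,j) \in \Omega_Y} C_y(z_{i(j+d+1)}, y_{ij}) \quad \text{s.t.} \quad Z(:, 1) = \mathbf{1}, \tag{4}$$

where $Z(:, 1)$ denotes the first column of $Z$.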
For our relation classification task, both features and labels are binary. We assume that the actual entry $u$ belonging to the underlying matrix $Z^{*}$ is randomly generated via a sigmoid function (Jordan, 1995): $Pr(u|v) = 1/(1 + e^{-uv})$, given the observed binary entry $v$ from the observed sparse matrix $Z$. Then, we can apply the log-likelihood cost function to measure the conditional probability and derive the logistic cost function for $C_x$ and $C_y$,

$$C(u, v) = -\log Pr(u|v) = \log(1 + e^{-uv}).$$

After completing the entries in $Y_{test}$, we adopt the sigmoid function to calculate the conditional probability of relation $r_j$, given entity pair $p_i$ pertaining to $y_{ij}$ in $Y_{test}$,

$$Pr(r_j | p_i) = \frac{1}{1 + e^{-y_{ij}}}, \quad y_{ij} \in Y_{test}.$$

Finally, we can obtain the Top-N predicted relation instances by ranking the values of $Pr(r_j|p_i)$.
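The logistic cost and the final ranking step can be sketched in a few lines of NumPy. The snippet assumes the negative log-likelihood of the sigmoid model, log(1 + e^{-uv}), with observed entries encoded as ±1. The function names and the toy scores are hypothetical.

```python
import numpy as np

def logistic_cost(u, v):
    """log(1 + exp(-u * v)), computed stably via logaddexp."""
    return np.logaddexp(0.0, -u * v)

def relation_probabilities(Y_completed):
    """Sigmoid of the completed real-valued label entries."""
    return 1.0 / (1.0 + np.exp(-Y_completed))

def top_n(Y_completed, n):
    """Return the n (item, relation, probability) triples with highest probability."""
    probs = relation_probabilities(Y_completed)
    flat = np.argsort(-probs, axis=None)[:n]
    rows, cols = np.unravel_index(flat, probs.shape)
    return [(int(i), int(j), float(probs[i, j])) for i, j in zip(rows, cols)]

# Hypothetical completed label block for 2 testing items and 2 relations.
Y_completed = np.array([[2.0, -1.0], [0.5, 3.0]])
ranking = top_n(Y_completed, 2)
```

Because the sigmoid is monotone, ranking by probability is equivalent to ranking by the completed entries themselves; the probabilities only matter when a calibrated score is wanted.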
The matrix rank minimization problem is NP-hard. Therefore, Candès and Recht (2009) suggested using a convex relaxation, nuclear norm minimization, instead. Then, Ma et al. (2011) proposed the fixed point continuation (FPC) algorithm, which is fast and robust. Moreover, Goldfarb and Ma (2011) proved the convergence of the FPC algorithm for solving the nuclear norm minimization problem. We thus adopt and modify the algorithm to find the optima for our noise-tolerant models, i.e., Formulae (3) and (4).

Fixed point continuation for DRMC-b
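Algorithm 1 describes the modified FPC algorithm for solving DRMC-b, which contains two steps for each iteration.

Gradient step: In this step, we infer the matrix gradient $g(Z)$ and the bias vector gradient $g(b)$ as follows,

$$g(z_{ij}) = \begin{cases} \dfrac{1}{|\Omega_X|} \dfrac{-x_{ij}}{1 + e^{x_{ij} z_{ij}}}, & (i, j) \in \Omega_X \\ \dfrac{\lambda}{|\Omega_Y|} \dfrac{-y_{i(j-d)}}{1 + e^{y_{i(j-d)} (z_{ij} + b_{j-d})}}, & (i, j-d) \in \Omega_Y \\ 0, & \text{otherwise} \end{cases}$$

and

$$g(b_j) = \frac{\lambda}{|\Omega_Y|} \sum_{i : (i,j) \in \Omega_Y} \frac{-y_{ij}}{1 + e^{y_{ij} (z_{i(j+d)} + b_j)}}.$$

We use the gradient descent updates $A = Z - \tau_z g(Z)$ and $b = b - \tau_b g(b)$ to gradually find the global minima of the cost function terms in Formula (3), where $\tau_z$ and $\tau_b$ are step sizes.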
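The gradient step can be sketched as follows, assuming the logistic cost log(1 + e^{-uv}) with ±1 entries, a feature term averaged over the observed feature entries, and a label term weighted by λ and averaged over the observed label entries. The boolean masks stand in for the index sets of observed entries. The function name and argument layout are assumptions made for this illustration.

```python
import numpy as np

def drmc_b_gradients(Z, b, X, Y, mask_x, mask_y, lam):
    """Gradients of the DRMC-b cost terms with respect to Z and b.

    Z: (N, d + t) current estimate; b: (t,) bias.
    X: (N, d) and Y: (N, t) hold +/-1 values where observed (any value elsewhere).
    mask_x, mask_y: boolean arrays marking the observed entries.
    """
    d = X.shape[1]
    # d/du log(1 + exp(-u v)) = -v / (1 + exp(u v))
    grad_x = -X / (1.0 + np.exp(X * Z[:, :d]))
    grad_y = -Y / (1.0 + np.exp(Y * (Z[:, d:] + b)))

    g_Z = np.zeros_like(Z)
    g_Z[:, :d] = np.where(mask_x, grad_x, 0.0) / mask_x.sum()
    g_Z[:, d:] = lam * np.where(mask_y, grad_y, 0.0) / mask_y.sum()
    # b_j enters the cost only through z + b_j, so its gradient is the
    # column sum of the label-block gradient.
    g_b = g_Z[:, d:].sum(axis=0)
    return g_Z, g_b
```

A gradient descent update then reads `A = Z - tau_z * g_Z` and `b = b - tau_b * g_b`; unobserved entries receive a zero gradient and are changed only by the subsequent shrinkage step.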
Shrinkage step: The goal of this step is to minimize the nuclear norm $\|Z\|_{*}$ in Formula (3). We first perform the singular value decomposition (SVD) (Golub and Kahan, 1965) of $A$, i.e., $A = U \Sigma V^{T}$, and then cut down each singular value. During the iteration, any negative value in $\Sigma - \tau_z \mu$ is set to zero, so that the rank of the reconstructed matrix $Z$ is reduced, where $Z = U \max(\Sigma - \tau_z \mu, 0) V^{T}$.
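The shrinkage step amounts to soft-thresholding the singular values. A minimal NumPy sketch follows, where the function name `shrink` and the toy matrix are illustrative assumptions and `threshold` plays the role of the product of the step size and µ:

```python
import numpy as np

def shrink(A, threshold):
    """Soft-threshold the singular values of A: U max(S - threshold, 0) V^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s = np.maximum(s - threshold, 0.0)  # negative values are set to zero
    return (U * s) @ Vt                 # scale the columns of U, then recombine

# Toy example: singular values 5, 2, 0.5 become 4, 1, 0.
A = np.diag([5.0, 2.0, 0.5])
Z = shrink(A, 1.0)
```

Singular values below the threshold vanish entirely, which is what lowers the rank of the reconstructed matrix from one iteration to the next.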
To accelerate the convergence, we use a continuation method. $\mu$ is initialized with a large value $\mu_1$, resulting in a fast reduction of the rank at first. Then the convergence slows down as $\mu$ decreases while obeying $\mu_{k+1} = \max(\mu_k \eta_\mu, \mu_F)$, where $\mu_F$ is the final value of $\mu$ and $\eta_\mu$ is the decay parameter.
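The continuation rule produces a geometrically decaying sequence that is clipped at the final value. The short sketch below lists that sequence; the function name and the example numbers are hypothetical, and the decay parameter is assumed to lie strictly between 0 and 1.

```python
def mu_schedule(mu_1, eta_mu, mu_final):
    """List the values mu_1, mu_2, ... with mu_{k+1} = max(mu_k * eta_mu, mu_final)."""
    if not 0.0 < eta_mu < 1.0:
        raise ValueError("eta_mu must lie in (0, 1) for the schedule to terminate")
    values = [mu_1]
    mu = mu_1
    while mu > mu_final:
        mu = max(mu * eta_mu, mu_final)
        values.append(mu)
    return values

schedule = mu_schedule(8.0, 0.5, 1.5)
```

With these toy numbers the sequence is 8, 4, 2, 1.5: each value is halved until the floor is reached.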
For the stopping criterion in inner iterations, we define the relative error to measure the residual of matrix $Z$ between two successive iterations,

$$\frac{\|Z^{k+1} - Z^{k}\|_F}{\max(1, \|Z^{k}\|_F)} \le \varepsilon,$$

where $\varepsilon$ is the convergence threshold.

Algorithm 1: FPC algorithm for solving DRMC-b.
Input: Initial matrix $Z_0$, bias $b_0$; Parameters $\mu$, $\lambda$; Step sizes $\tau_z$, $\tau_b$.
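Putting the gradient step, the shrinkage step, the continuation on µ and the relative-error stopping rule together gives the following rough NumPy sketch of an FPC-style loop for DRMC-b. It is an illustration under assumptions, not the authors' implementation: entries are encoded as ±1, the initial matrix is the observed data with zeros elsewhere, the relative error is measured in the Frobenius norm, and the inner-iteration cap `max_inner` is a hypothetical safeguard.

```python
import numpy as np

def fpc_drmc_b(X, Y, mask_x, mask_y, lam=1.0, tau_z=0.5, tau_b=0.5,
               eta_mu=0.01, mu_final=0.01, eps=1e-4, max_inner=500):
    """FPC-style solver sketch for DRMC-b; returns the completed Z and bias b."""
    d, t = X.shape[1], Y.shape[1]
    # Assumed initialisation: observed entries, zeros for everything unobserved.
    Z = np.hstack([np.where(mask_x, X, 0.0), np.where(mask_y, Y, 0.0)])
    b = np.zeros(t)
    # Start mu at (largest singular value) * eta_mu.
    mu = np.linalg.svd(Z, compute_uv=False)[0] * eta_mu

    while True:
        for _ in range(max_inner):
            # Gradient step on the logistic cost terms.
            g = np.zeros_like(Z)
            gx = -X / (1.0 + np.exp(X * Z[:, :d]))
            gy = -Y / (1.0 + np.exp(Y * (Z[:, d:] + b)))
            g[:, :d] = np.where(mask_x, gx, 0.0) / mask_x.sum()
            g[:, d:] = lam * np.where(mask_y, gy, 0.0) / mask_y.sum()
            A = Z - tau_z * g
            b = b - tau_b * g[:, d:].sum(axis=0)

            # Shrinkage step: soft-threshold the singular values of A.
            U, s, Vt = np.linalg.svd(A, full_matrices=False)
            Z_new = (U * np.maximum(s - tau_z * mu, 0.0)) @ Vt

            # Relative error between two successive iterates.
            rel = np.linalg.norm(Z_new - Z) / max(1.0, np.linalg.norm(Z))
            Z = Z_new
            if rel <= eps:
                break

        if mu <= mu_final:
            return Z, b
        mu = max(mu * eta_mu, mu_final)  # continuation: decay mu toward mu_final
```

The label block `Z[:, d:] + b` restricted to the testing rows then holds the completed scores. A full SVD per iteration is the dominant cost, so a sketch like this is only practical for small matrices.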

Fixed point continuation for DRMC-1
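Algorithm 2 is similar to Algorithm 1 except for two differences. First, there is no bias vector $b$. Second, a projection step is added to enforce the first column of matrix $Z$ to be $\mathbf{1}$. In addition, the matrix gradient $g(Z)$ for DRMC-1 is

$$g(z_{ij}) = \begin{cases} \dfrac{1}{|\Omega_X|} \dfrac{-x_{i(j-1)}}{1 + e^{x_{i(j-1)} z_{ij}}}, & (i, j-1) \in \Omega_X \\ \dfrac{\lambda}{|\Omega_Y|} \dfrac{-y_{i(j-d-1)}}{1 + e^{y_{i(j-d-1)} z_{ij}}}, & (i, j-d-1) \in \Omega_Y \\ 0, & \text{otherwise.} \end{cases}$$

Algorithm 2: FPC algorithm for solving DRMC-1.
Input: Initial matrix $Z_0$; Parameters $\mu$, $\lambda$; Step size $\tau_z$.

Table 2: The range of optimal ranks for DRMC-b and DRMC-1 through five-fold cross validation. The threshold $\theta$ means filtering out the features that appear fewer than $\theta$ times. The values in brackets pertaining to DRMC-b and DRMC-1 are the exact optimal ranks that we choose for the completed matrices on testing sets.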
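One iteration of the DRMC-1 variant can be sketched in the same style: there is no bias vector, the matrix carries an extra leading column, and that column is reset to ones after shrinkage. The ±1 encoding, the placement of the projection after the shrinkage step, and the function name are assumptions of this illustration.

```python
import numpy as np

def drmc1_iteration(Z, X, Y, mask_x, mask_y, lam, tau_z, mu):
    """One gradient + shrinkage + projection pass.

    Z has columns [all-one column | d feature columns | t label columns].
    """
    d = X.shape[1]
    g = np.zeros_like(Z)
    gx = -X / (1.0 + np.exp(X * Z[:, 1:d + 1]))
    gy = -Y / (1.0 + np.exp(Y * Z[:, d + 1:]))
    g[:, 1:d + 1] = np.where(mask_x, gx, 0.0) / mask_x.sum()
    g[:, d + 1:] = lam * np.where(mask_y, gy, 0.0) / mask_y.sum()

    # Shrinkage: soft-threshold the singular values of the gradient update.
    U, s, Vt = np.linalg.svd(Z - tau_z * g, full_matrices=False)
    Z_new = (U * np.maximum(s - tau_z * mu, 0.0)) @ Vt

    Z_new[:, 0] = 1.0  # projection step: restore the all-one first column
    return Z_new
```

Keeping a constant column inside the low-rank matrix lets the factorization absorb the bias, which is why no separate bias update appears here.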

Experiments
In order to conduct reliable experiments, we adjust and estimate the parameters for our approaches, DRMC-b and DRMC-1, and compare them with four other landmark methods (Mintz et al., 2009; Hoffmann et al., 2011; Surdeanu et al., 2012; Riedel et al., 2013) on two public datasets.

Dataset
The two widely used datasets that we adopt are both automatically generated by aligning Freebase to New York Times corpora. The first dataset, NYT'10 (http://iesl.cs.umass.edu/riedel/ecml/), was developed by Riedel et al. (2010), and also used by Hoffmann et al. (2011) and Surdeanu et al. (2012). Three kinds of features, namely, lexical, syntactic and named entity tag features, were extracted from relation mentions. The second dataset, NYT'13 (http://iesl.cs.umass.edu/riedel/data-univSchema/), was released by Riedel et al. (2013), in which they only regarded the lexicalized dependency path between two entities as features. Table 1 shows that the two datasets differ in some main attributes. More specifically, NYT'10 contains much higher dimensional features than NYT'13, but fewer training and testing items.

Parameter setting
In this part, we address the issue of setting parameters: the trade-off weights $\mu$ and $\lambda$, the step sizes $\tau_z$ and $\tau_b$, and the decay parameter $\eta_\mu$.
We set $\lambda = 1$ to make the contributions of the cost function terms for the feature and label matrices equal in Formulae (3) and (4). $\mu$ is assigned a series of values obeying $\mu_{k+1} = \max(\mu_k \eta_\mu, \mu_F)$. We follow the suggestion in (Goldberg et al., 2010) that $\mu$ starts at $\sigma_1 \eta_\mu$, where $\sigma_1$ is the largest singular value of the matrix $Z$. We set $\eta_\mu = 0.01$. The final value of $\mu$, namely $\mu_F$, is equal to 0.01. Ma et al. (2011) revealed that as long as the nonnegative step sizes satisfy the required constraints, the FPC algorithm is guaranteed to converge to a global optimum. Therefore, we set $\tau_z = \tau_b = 0.5$ to satisfy these constraints on both datasets.

Rank estimation
Even though the FPC algorithm converges in an iterative fashion, the value of $\varepsilon$, which varies with different datasets, is difficult to decide. In practice, we record the rank of matrix $Z$ at each round of iteration until it converges at a rather small threshold $\varepsilon = 10^{-4}$. The reason is that we suppose the optimal low-rank representation of the matrix $Z$ conveys the truly effective information about the underlying semantic correlation between the features and the corresponding labels.
We use five-fold cross validation on the validation set and evaluate the performance on each fold with different ranks. At each round of iteration, we obtain a recovered matrix and average the F1 scores from Top-5 to Top-all predicted relation instances to measure the performance. Figure 3 illustrates the curves of average F1 scores. After recording the rank associated with the highest F1 score on each fold, we compute the mean and the standard deviation to estimate the range of the optimal rank for testing. Table 2 lists the range of optimal ranks for DRMC-b and DRMC-1 on NYT'10 and NYT'13. On both datasets, we observe an identical phenomenon: the performance gradually increases as the rank of the matrix declines before reaching the optimum. However, it sharply decreases if we continue reducing the rank below the optimum. An intuitive explanation is that the high-rank matrix contains much noise and the model tends to overfit, whereas a matrix of excessively low rank is more likely to lose principal information and the model tends to underfit.
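The two measurements used for rank estimation can be sketched as follows. The code assumes that predictions are given as a list of correctness flags sorted by descending score, and that the sample standard deviation is intended (the text does not say which variant is used). The function names and the per-fold ranks below are hypothetical.

```python
import statistics

def average_f1(ranked_correct, num_gold, start=5):
    """Average the F1 scores of the Top-start, Top-(start+1), ..., Top-all predictions."""
    if len(ranked_correct) < start or num_gold <= 0:
        raise ValueError("need at least `start` predictions and one gold instance")
    hits = sum(ranked_correct[:start - 1])
    scores = []
    for n in range(start, len(ranked_correct) + 1):
        hits += ranked_correct[n - 1]
        precision, recall = hits / n, hits / num_gold
        f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
        scores.append(f1)
    return sum(scores) / len(scores)

def optimal_rank_range(best_ranks):
    """Mean and sample standard deviation of the best rank found on each fold."""
    return statistics.mean(best_ranks), statistics.stdev(best_ranks)

# Hypothetical best ranks from five folds.
mean_rank, std_rank = optimal_rank_range([40, 44, 42, 46, 38])
```

The range reported for testing would then be the mean plus or minus the standard deviation, from which a single rank is picked.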

Method Comparison
First, we conduct experiments to compare our approaches with Mintz-09 (Mintz et al., 2009), MultiR-11 (Hoffmann et al., 2011), MIML-12 and MIML-at-least-one-12 (Surdeanu et al., 2012) on the NYT'10 dataset. Surdeanu et al. (2012) released the open source code (http://nlp.stanford.edu/software/mimlre.shtml) to reproduce the experimental results of those previous methods. Moreover, their programs can control the feature sparsity degree through a threshold $\theta$, which filters out the features that appear fewer than $\theta$ times. They set $\theta = 5$ in the original code by default. Therefore, we follow their settings and adopt the same way to filter the features. In this way, we guarantee a fair comparison for all methods. Figure 4 (a) shows that our approaches achieve a significant improvement in performance.
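The effect of the threshold can be illustrated with a few lines of Python. This is a sketch of frequency-based filtering in general, not the code released by Surdeanu et al. (2012): the function name is hypothetical, and counting one occurrence per item per feature is an assumption.

```python
from collections import Counter

def filter_rare_features(items, theta):
    """Drop every feature that occurs fewer than theta times across all items."""
    counts = Counter(f for features in items for f in features)
    return [[f for f in features if counts[f] >= theta] for features in items]

# Toy corpus: "a" occurs 3 times, "b" twice, "c" once.
items = [["a", "b"], ["a", "c"], ["a", "b"]]
filtered = filter_rare_features(items, theta=2)
```

Lowering the threshold keeps more rare features, which makes the feature block of the matrix wider and sparser.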
We also perform experiments to compare our approaches with the state-of-the-art NFE-13 (Riedel et al., 2013) and its sub-methods (N-13, F-13 and NF-13) on the NYT'13 dataset. Readers may refer to the website http://www.riedelcastro.org/uschema for the details of those methods; we bypass the description due to the limitation of space. Figure 4 (b) illustrates that our approaches still outperform the state-of-the-art methods. In practical applications, we are also concerned with the precision of the Top-N predicted relation instances. Therefore, we compare the precision of Top-100s, Top-200s and Top-500s for DRMC-1, DRMC-b and the state-of-the-art method NFE-13 (Riedel et al., 2013). Table 3 shows that DRMC-b and DRMC-1 achieve 24.0% and 26.6% precision increments on average, respectively.
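Precision over the Top-N predictions is simply the fraction of correct instances among the N highest-ranked ones. In the minimal sketch below, the function name and the toy flags are illustrative, and the input is assumed to be sorted by descending predicted probability.

```python
def precision_at_n(ranked_correct, n):
    """Fraction of correct predictions among the first n ranked instances."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_correct[:n]
    return sum(top) / len(top) if top else 0.0

# Toy ranking: the first, second and fourth predictions are correct.
flags = [True, True, False, True]
p_at_2 = precision_at_n(flags, 2)
p_at_4 = precision_at_n(flags, 4)
```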

Discussion
We have mentioned that the basic alignment assumption of distant supervision (Mintz et al., 2009) tends to generate noisy (noisy features and incomplete labels) and sparse (sparse features) data. In this section, we discuss how our approaches tackle these natural flaws.
Due to the noisy features and incomplete labels, the underlying low-rank data matrix with truly effective information tends to be corrupted, and the rank of the observed data matrix can be extremely high. Figure 5 demonstrates that the ranks of the data matrices are approximately 2,000 at the initial optimization of DRMC-b and DRMC-1. However, those high ranks result in poor performance. As the ranks decline before approaching the optimum, the performance gradually improves, implying that our approaches filter the noise in the data.
Furthermore, we discuss the influence of feature sparsity on our approaches and the state-of-the-art methods. We relax the feature filtering threshold ($\theta = 4, 3, 2$) in the open source program of Surdeanu et al. (2012) to generate sparser features from the NYT'10 dataset. Figure 6 shows that our approaches consistently outperform the baseline and the state-of-the-art methods under diverse feature sparsity degrees. Table 2 also lists the range of optimal ranks for DRMC-b and DRMC-1 with different $\theta$. We observe that for each approach, the optimal range is relatively stable. In other words, for each approach, the amount of truly effective information about the underlying semantic correlation stays constant for the same dataset, which, to some extent, explains why our approaches are robust to sparse features.

Conclusion and Future Work
In this paper, we contributed two noise-tolerant optimization models, DRMC-b and DRMC-1, for the distantly supervised relation extraction task from a novel perspective. Our models are based on matrix completion with a low-rank criterion. Experiments demonstrated that the low-rank representation of the feature-label matrix can exploit the underlying semantically correlated information for relation classification and is effective in overcoming the difficulties incurred by sparse and noisy features and incomplete labels, so that we achieved significant improvements in performance.
Our proposed models also leave open questions for the distantly supervised relation extraction task. First, they cannot process newly arriving testing items efficiently, as we have to reconstruct the data matrix containing not only the testing items but also all the training items for relation classification, and run the iterative computation again. Second, the volume of the datasets we adopt is relatively small. For future work, we plan to improve our models so that they will be capable of incremental learning on large-scale datasets (Chang, 2011).

Figure 1: Training corpus generated by the basic alignment assumption of distantly supervised relation extraction. The relation instances are the triples related to President Barack Obama in Freebase, and the relation mentions are some sentences describing him in Wikipedia.

Figure 3: Five-fold cross validation for rank estimation on two datasets.

Figure 4: Method comparison on two testing sets.

Figure 5: Precision-Recall curve for DRMC-b and DRMC-1 with different ranks on two testing sets.

Figure 6: Feature sparsity discussion on the NYT'10 testing set. Each row (from top to bottom, $\theta = 4, 3, 2$) illustrates a suite of experimental results. They are, from left to right, five-fold cross validation for rank estimation on DRMC-b and DRMC-1, method comparison, and precision-recall curves with different ranks, respectively.

Table 1: Statistics about the two widely used datasets.