A Shortest-path Method for Arc-factored Semantic Role Labeling

We introduce a Semantic Role Labeling (SRL) parser that finds semantic roles for a predicate together with the syntactic paths linking predicates and arguments. Our main contribution is to formulate SRL in terms of shortest-path inference, under the assumption that the SRL model is restricted to arc-factored features of the syntactic paths behind semantic roles. Overall, our method for SRL is a novel way to exploit larger variability in the syntactic realizations of predicate-argument relations, moving away from pipeline architectures. Experiments show that our approach improves the robustness of the predictions, producing arc-factored models that perform closely to methods using unrestricted syntactic features.


Introduction
Semantic role labeling (SRL) consists of finding the arguments of a predicate and labeling them with semantic roles (Gildea and Jurafsky, 2002; Màrquez et al., 2008). The arguments fill roles that answer questions of the type "who" did "what" to "whom", "how", and "why" for a given sentence predicate. Most approaches to SRL are based on a pipeline strategy, first parsing the sentence to obtain a syntactic tree and then identifying and classifying arguments (Gildea and Jurafsky, 2002; Carreras and Màrquez, 2005).
SRL methods critically depend on features of the syntactic structure, and consequently parsing mistakes can harm the quality of semantic role predictions (Gildea and Palmer, 2002). To alleviate this dependence, previous work has explored k-best parsers (Johansson and Nugues, 2008), combination systems (Surdeanu et al., 2007) or joint syntactic-semantic models (Johansson, 2009; Henderson et al., 2008; Lluís et al., 2013).
In this paper we take a different approach. In our scenario SRL is the end goal, and we assume that syntactic parsing is only an intermediate step to extract features to support SRL predictions. In this setting we define a model that, given a predicate, identifies each of the semantic roles together with the syntactic path that links the predicate with the argument. Thus, following previous work (Moschitti, 2004;Johansson, 2009), we take the syntactic path as the main source of syntactic features, but instead of just conditioning on it, we predict it together with the semantic role. The main contribution of this paper is a formulation of SRL parsing in terms of efficient shortest-path inference, under the assumption that the SRL model is restricted to arc-factored features of the syntactic path linking the argument with the predicate.
Our assumption -that features of an SRL model should factor over dependency arcs- is supported by some empirical frequencies. Table 1 shows the most frequent path patterns on CoNLL-2009 (Hajič et al., 2009) data for several languages, where a path pattern is a sequence of ascending arcs from the predicate to some ancestor, followed by descending arcs to the argument. For English the distribution of path patterns is rather simple: the majority of paths consist of a number of ascending arcs followed by zero or one descending arc. Thus a common strategy in SRL systems, formulated by Xue and Palmer (2004), is to look for arguments among the ancestors of the predicate and their direct descendants. However, in Czech and Japanese data we observe a large portion of paths with two or more descending arcs, which makes it difficult to characterize the syntactic scope in which arguments are found. Also, in the datasets for German, Czech and Chinese the three most frequent patterns cover over 90% of all arguments. In contrast, Japanese exhibits much more variability and a long tail of infrequent types of patterns. In general it is not feasible to capture path patterns manually, and it is not desirable that a statistical system depend on rather sparse non-factored path features. For this reason in this paper we explore arc-factored models for SRL. Our method might be especially useful in applications where we are interested in some target semantic role, e.g., retrieving agent relations for some verb, since it processes semantic roles independently of each other. Our method might also be generalizable to other kinds of semantic relations that strongly depend on syntactic patterns, such as relation extraction in information extraction or discourse parsing.

Arc-factored SRL
We define an SRL parsing model that retrieves predicate-argument relations based on arc-factored syntactic representations of the paths connecting predicates with their arguments. Throughout the paper we assume a fixed sentence x = x_1, ..., x_n and a fixed predicate index p. The SRL output is an indicator vector z, where z_{r,a} = 1 indicates that token a fills role r for predicate p. Our SRL parser performs argmax_{z ∈ Z(x,p)} s(x, p, z), where Z(x, p) defines the set of valid argument structures for p, and s(x, p, z) computes a plausibility score for z given x and p. Our first assumption is that the score function factors over role-argument pairs:

  s(x, p, z) = Σ_{r,a : z_{r,a}=1} s(x, p, r, a) .   (1)

Then we assume two components in the model, one that scores the role-argument pair alone, and another that considers the best (max) syntactic dependency path π that connects the predicate p with the argument a:

  s(x, p, r, a) = s_role(x, p, r, a) + max_π s_syn(x, p, r, a, π) .   (2)

The model does not assume access to the syntactic structure of x; hence in Eq. (2) we locally retrieve the maximum-scoring path for an argument-role pair. A path π is a sequence of dependencies ⟨h, m, l⟩, where h is the head, m the modifier and l the syntactic label. We further assume that the syntactic component factors over the dependencies in the path:

  s_syn(x, p, r, a, π) = Σ_{⟨h,m,l⟩ ∈ π} s_syn(x, p, r, a, ⟨h, m, l⟩) .   (3)

This factorization allows us to employ efficient shortest-path inference, which is the main contribution of this paper and is described in the next section. Note that since paths are locally retrieved per role-argument pair, there is no guarantee that the set of paths across roles forms a (sub)tree.
As a final note, in this paper we follow Lluís et al. (2013) and consider a constrained space of valid argument structures Z(x, p): (a) each role is realized at most once, and (b) each token fills at most one role. As shown by Lluís et al. (2013), this can be efficiently solved as a linear assignment problem as long as the SRL model factors over role-argument pairs, as in Eq. (1).

Figure 1: Graph representing all possible syntactic paths from a single predicate to its arguments. We find in this graph the best SRL using a shortest-path algorithm. Note that many edges are omitted for clarity reasons. We label the nodes and arcs as follows: p is the predicate and source vertex; u_1, ..., u_n are tokens reachable by an ascending path; v_1, ..., v_n are tokens reachable by an ascending path (possibly empty) followed by a descending path (possibly empty); a_{i←j} is an edge related to an ascending dependency from node u_i to node u_j; d_{i→j} is a descending dependency from node v_i to node v_j; 0_{i→i} is a 0-weighted arc that connects the ascending portion of the path ending at u_i with the descending portion of the path starting at v_i.
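Under constraints (a) and (b), decoding reduces to matching roles to tokens so that the summed per-pair scores are maximized. The sketch below is our own minimal illustration, not the authors' implementation: it brute-forces the matching for a toy score matrix (a real system would use a dedicated assignment solver such as the Hungarian algorithm), and the dummy token -1 standing for "role left unfilled" is a convention we introduce here.

```python
from itertools import permutations

def best_assignment(scores):
    """Brute-force the linear assignment of roles to tokens.

    scores[r][a] is the score s(x, p, r, a) for role r filled by
    token a.  Each role is realized at most once and each token
    fills at most one role; a role may also stay unfilled, which we
    model with the dummy token -1 contributing score 0.
    """
    n_roles = len(scores)
    n_tokens = len(scores[0])
    # Pad the candidate list with dummies so every role can stay empty.
    candidates = list(range(n_tokens)) + [-1] * n_roles
    best, best_score = None, float("-inf")
    for perm in permutations(candidates, n_roles):
        total = 0.0
        for r, a in enumerate(perm):
            if a >= 0:
                total += scores[r][a]
        if total > best_score:
            best, best_score = perm, total
    return best, best_score
```

For example, with two roles and two candidate tokens, `best_assignment([[3.0, 1.0], [2.0, 5.0]])` matches role 0 to token 0 and role 1 to token 1; with all-negative scores every role is left unfilled.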

SRL as a Shortest-path Problem
We now focus on solving the maximization over syntactic paths in Eq. (2). We will turn it into a minimization problem that can be solved with a polynomial-time algorithm, in our case a shortest-path method. Assume a fixed argument and role, and define θ_{h,m,l} to be a non-negative penalty for the syntactic dependency ⟨h, m, l⟩ to appear in the predicate-argument path. We describe a shortest-path method that finds the path of arcs with the smallest total penalty:

  min_π Σ_{⟨h,m,l⟩ ∈ π} θ_{h,m,l} .   (4)

We find these paths by appropriately constructing a weighted graph G = (V, E) that represents the problem. Later we show how to adapt the arc-factored model scores to be non-negative penalties, such that the solution to Eq. (4) will be the negative of the maximizer of Eq. (2). It remains only to define the graph construction, where path edges are weighted by the θ penalties. We start by noting that any path from a predicate p to an argument v_i is formed by a number of ascending syntactic arcs followed by a number of descending arcs. The ascending segment connects p to some ancestor q (q might be p itself, which implies an empty ascending segment); the descending segment connects q with v_i (and again might be empty). To compactly represent all these possible paths we define the graph as follows (see Figure 1):

1. Add node p as the source node of the graph.
2. Add nodes u_1, ..., u_n for every token of the sentence except p.

3. Link every pair of these nodes u_i, u_j with a directed edge a_{i←j} weighted by the corresponding ascending arc, namely min_l θ_{j,i,l}. Also add ascending edges from p to every u_i, weighted by min_l θ_{i,p,l}. So far we have a connected component representing all ascending path segments.

4. Add nodes v_1, ..., v_n for every token of the sentence except p, and add edges d_{i→j} between them, weighted by descending arcs, namely min_l θ_{i,j,l}. This adds a second strongly-connected component representing descending path segments.

5. For each i, add an edge from u_i to v_i with weight 0. This ensures that ascending and descending path segments are connected consistently.

6. Add direct descending edges from p to all the v_i nodes to allow for only-descending paths, weighted by min_l θ_{p,i,l}.
Dijkstra's algorithm (Dijkstra, 1959) finds the optimal paths from predicate p to all tokens in time O(V^2) (see Cormen et al. (2009) for an in-depth description). Thus, our method runs this algorithm once for each possible role of the predicate, obtaining the best paths to all arguments at each run.
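The construction in steps 1-6 and the subsequent Dijkstra run can be sketched as follows. This is our own minimal implementation, under the assumption that the penalty table θ has already been minimized over labels, i.e. it is keyed only by (head, modifier) pairs; node names 'p', ('u', i) and ('v', i) mirror Figure 1.

```python
import heapq

def build_path_graph(n, p, theta):
    """Build the weighted digraph of Figure 1 for one role.

    n            -- number of tokens, indexed 0..n-1; p is the predicate index
    theta[(h,m)] -- penalty of the dependency with head h and modifier m,
                    already minimized over labels (min_l theta[h, m, l])
    """
    g = {}
    def add(u, v, w):
        g.setdefault(u, []).append((v, w))
    for (h, m), w in theta.items():
        if m == p and h != p:
            add('p', ('u', h), w)        # step 3: ascend out of the predicate
        elif h == p and m != p:
            add('p', ('v', m), w)        # step 6: direct descending edge from p
        elif h != p and m != p:
            add(('u', m), ('u', h), w)   # step 3: ascending segment
            add(('v', h), ('v', m), w)   # step 4: descending segment
    for i in range(n):
        if i != p:
            add(('u', i), ('v', i), 0.0) # step 5: 0-weighted connector
    return g

def dijkstra(g, source='p'):
    """Single-source shortest paths with a binary heap."""
    dist, heap = {}, [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in dist:
            continue
        dist[u] = d
        for v, w in g.get(u, []):
            if v not in dist:
                heapq.heappush(heap, (d + w, v))
    return dist
```

As a toy example with p = 0 and penalties {(1,0): 1, (2,1): 1, (2,3): 2, (0,3): 5}, the best path to token 3 ascends 0 → 1 → 2 and then descends to 3 with total penalty 4, beating the direct descending edge of penalty 5.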

Adapting and Training Model Scores
The shortest-path problem is undefined if a negative cycle is present in the graph, as we could indefinitely decrease the cost of a path by looping over that cycle. Furthermore, Dijkstra's algorithm requires all arc weights to be non-negative penalties. However, the model in Eq. (3) computes plausibility scores for dependencies, not penalties, and if we set this model to be a standard feature-based linear predictor, it will predict unrestricted real-valued scores.
One approach to map plausibility scores to penalties is to assume a log-linear form for our model. Let us denote by x̄ the tuple ⟨x, p, r, a⟩, which we assume fixed in this section. The log-linear model predicts:

  Pr(⟨h, m, l⟩ | x̄) = exp{ w · f(x̄, ⟨h, m, l⟩) } / Z(x̄) ,   (5)

where f(x̄, ⟨h, m, l⟩) is a feature vector for an arc in the path, w are the parameters, and Z(x̄) is the normalizer. We can turn predictions into non-negative penalties by setting θ_{h,m,l} to be the negative log-probability of ⟨h, m, l⟩; namely θ_{h,m,l} = −w · f(x̄, ⟨h, m, l⟩) + log Z(x̄). Note that log Z(x̄) shifts all values to the non-negative side.
However, log-linear estimation of w is typically expensive, since it requires repeatedly computing feature expectations. Furthermore, our model as defined in Eq. (2) combines arc-factored path scores with path-independent scores, and it is desirable to train these two components jointly. We opt for a mistake-driven training strategy based on the Structured Averaged Perceptron (Collins, 2002), which directly employs shortest-path inference as part of the training process.
To do so, we predict the plausibility score of a dependency directly as w · f(x̄, ⟨h, m, l⟩). To map scores to penalties, we define

  θ_0 = max_{⟨h,m,l⟩} w · f(x̄, ⟨h, m, l⟩)

and we set θ_{h,m,l} = −w · f(x̄, ⟨h, m, l⟩) + θ_0.
Thus, θ_0 has a similar purpose as the log-normalizer Z(x̄) in a log-linear model, i.e., it shifts the negated scores to the non-negative side; but in our version the normalizer is based on the max value, not the sum of exponentiated predictions as in log-linear models. If we set our model function to be s_syn(x̄, ⟨h, m, l⟩) = w · f(x̄, ⟨h, m, l⟩) − θ_0, then the shortest-path method is exact.
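The max-based shift can be sketched in a few lines (our own illustration, with made-up arc scores standing in for the dot products w · f):

```python
def to_penalties(arc_scores):
    """Map unrestricted arc scores w . f(x_bar, <h,m,l>) to
    non-negative penalties theta = theta_0 - score, where theta_0 is
    the maximum score over all candidate arcs.  The best-scoring arc
    gets penalty 0 and every other arc a strictly positive penalty,
    so Dijkstra's non-negativity requirement is satisfied."""
    theta_0 = max(arc_scores.values())
    return {arc: theta_0 - s for arc, s in arc_scores.items()}
```

Minimizing the summed penalties of a path is then exactly maximizing the summed shifted scores w · f − θ_0, which is the model function under which the shortest-path method is exact.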

Experiments
We present experiments using the CoNLL-2009 Shared Task datasets (Hajič et al., 2009), for the verbal predicates of English. Evaluation is based on precision, recall and F_1 over correct predicate-argument relations. Our system uses the feature set of the state-of-the-art system by Johansson (2009), but ignores the features that do not factor over single arcs in the path.
The focus of these experiments is to assess the performance of the shortest-path method with respect to syntactic variability. Rather than running the method with the full set of possible dependency arcs in a sentence, i.e., O(n^2) arcs, we only consider a fraction of the most likely dependencies. To do so we employ a probabilistic dependency-based model, following Koo et al. (2007), that computes the distribution over head-label pairs for a given modifier, Pr(h, l | x, m). Specifically, for each modifier token we only consider the dependencies, or heads, whose probability is above a factor γ of the most likely dependency for the given modifier. Thus, γ = 1 selects only the most likely dependency (similar to a pipeline system, but without enforcing tree constraints), and as γ decreases more dependencies are considered, to the point where γ = 0 would select all possible dependencies. Table 2 shows the ratio of dependencies included with respect to a pipeline system for the development set. As an example, if we set γ = 0.5, for a given modifier we consider the most likely dependency and also the dependencies with probability larger than 1/2 of the probability of the most likely one. In this case the total number of dependencies is 10.3% larger than when considering only the most likely one. Table 3 shows results of the method on development data, when training and testing with different γ values. The general trend is that testing with the most restricted syntactic graph results in the best performance. However, we observe that as we allow for more syntactic variability during training, the results largely improve. Setting γ = 1 for both training and testing gives a semantic F_1 of 75.9. This configuration is similar to a pipeline approach, but considering only factored features. If we instead train with γ = 0.1 and test with γ = 1, the results improve by 1.9 points to a semantic F_1 of 77.8.
When syntactic variability is too large, e.g., γ = 0.01, no improvements are observed.
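The γ-pruning described above can be sketched as follows (the probability values in the example are hypothetical; the real system obtains Pr(h, l | x, m) from the Koo et al. (2007) model):

```python
def prune_heads(head_probs, gamma):
    """For each modifier m, keep the candidate heads whose probability
    is at least gamma times the probability of m's most likely head.

    head_probs[m] is a dict mapping head index -> Pr(h | x, m).
    gamma = 1 keeps only the most likely head (pipeline-like);
    gamma = 0 keeps all candidate heads."""
    kept = {}
    for m, probs in head_probs.items():
        best = max(probs.values())
        kept[m] = {h for h, p in probs.items() if p >= gamma * best}
    return kept
```

For instance, if modifier 1 has candidate heads with probabilities {0: 0.6, 2: 0.35, 3: 0.05}, then γ = 1 keeps only head 0, while γ = 0.5 also keeps head 2 (0.35 ≥ 0.5 · 0.6).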
Finally, Table 4 shows results on the verbal English WSJ test set using our best configuration from the development set. We compare to the state-of-the-art system by Zhao et al. (2009), the top-performing system for the English language in SRL at the CoNLL-2009 Shared Task. We also show the results for a shortest-path system trained and tested with γ = 1. In addition, we include an equivalent pipeline system using all features, both factored and non-factored, as defined by Johansson (2009). We observe that by not being able to capture non-factored features the final performance drops by 1.6 F_1 points.

Conclusions
We have formulated SRL in terms of shortest-path inference. Our model predicts semantic roles together with associated syntactic paths, and assumes an arc-factored representation of the path. This property allows for efficient shortest-path algorithms that, given a predicate and a role, retrieve the most likely argument and its path.

Table 4: Results on the WSJ test set. Factored γ = 1: our shortest-path system trained and tested with γ = 1, similar to a pipeline system but without enforcing tree constraints and restricted to arc-factored features. Factored best: our shortest-path system with the best results from Table 3. Non-factored: an equivalent pipeline system that includes both factored and non-factored features.
In the experimental section we demonstrate the feasibility of the approach. We observe that arc-factored models are in fact more restricted, with a drop in accuracy with respect to unrestricted models. However, we also observe that our method largely improves the robustness of the arc-factored method when training with a degree of syntactic variability. Overall, ours is a simple strategy to bring arc-factored models close to the performance of unrestricted models. Future work should explore further approaches to parsing partial syntactic structure specific to target semantic relations.