Calculating the optimal step of arc-eager parsing for non-projective trees

It is shown that the optimal next step of an arc-eager parser relative to a non-projective dependency structure can be calculated in cubic time, solving an open problem in parsing theory. Applications are in training of parsers by means of a ‘dynamic oracle’.


Introduction
A deterministic transition-based dependency parser is often driven by a classifier that determines the next step, given features extracted from the current configuration. The classifier may be trained on parser configurations and steps that exactly correspond to 'gold' trees from a treebank. However, better accuracy is generally obtained by also including configurations reached by letting the parser stray from the gold trees, so that the classifier learns how best to recover from any mistakes. This is associated with the term dynamic oracle. Moreover, if the parser is projective while the gold trees are non-projective, then it is unavoidable that configurations are considered that do not correspond to the gold trees.
Determining the desired output of the classifier requires calculation of the best next step given an arbitrary configuration and a gold tree. Typically, this is the step that allows the most accurate tree to be reached, in terms of the gold tree.¹ For a gold tree that is projective, the optimal step can be determined in linear time for arc-eager parsing (Goldberg and Nivre, 2012, 2013) and for shift-reduce parsing (Nederhof, 2019). For a non-projective gold tree, the optimal step can be determined for several types of non-projective parsers (Gómez-Rodríguez et al., 2014; Gómez-Rodríguez and Fernández-González, 2015; de Lhoneux et al., 2017; Fernández-González and Gómez-Rodríguez, 2018), as well as for shift-reduce parsing (Nederhof, 2019). However, for arc-eager parsing, the problem has been unsolved until now. Aufrant et al. (2018) propose an approximation of the optimal step, based on the procedure for projective gold trees, and demonstrate the advantages of training a projective parser directly on non-projective trees.
The current paper introduces an exact calculation of the optimal step for arc-eager parsing and a non-projective gold tree, within the same framework as Nederhof (2019), which consists of a generic cubic-time tabular dependency parsing algorithm and a fixed context-free grammar that is applied to a string extracted from the current configuration, with edge weights determined by the gold tree.
For arc-eager parsing, the context-free grammar is considerably more complex than in the case of shift-reduce parsing. This is a consequence of theoretical properties of arc-eager parsing, which we first need to investigate in detail before we can define 'optimality' of the next step.

Dependency structures
Let w = a_1 · · · a_n be a sentence consisting of n tokens, which can be words or punctuation. Where we use indices between 1 and n, we also refer to these as tokens, relying on the assumption that given an index i we can retrieve the actual token a_i. An additional index 0 represents an imaginary token prepended to the sentence.
An unlabeled dependency structure T for w is an unlabeled tree with {0, 1, . . . , n} as the set of nodes, of which 0 is the root. We represent such a tree as a set of edges, each represented as a pair (a, b), where index a is the parent and index b is the child. The descendants of a node are the node itself and the descendants of its children. A dependency structure is projective if the set of descendants of each node in the tree can be written as {i, i + 1, . . . , j − 1, j} for some i and j (0 ≤ i ≤ j ≤ n).
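The projectivity condition can be checked directly from this definition. The following is a small illustrative sketch; the function names are ours, not from the paper:

```python
# Projectivity check: the set of descendants of every node must form a
# contiguous interval {i, i+1, ..., j}. Edges are (parent, child) pairs.

def descendants(tree, node):
    """The node itself plus the descendants of its children."""
    out = {node}
    for parent, child in tree:
        if parent == node:
            out |= descendants(tree, child)
    return out

def is_projective(tree, n):
    """True iff every node's descendants form an interval of indices."""
    for node in range(n + 1):
        d = descendants(tree, node)
        if max(d) - min(d) + 1 != len(d):
            return False
    return True
```

For instance, {(0, 2), (2, 1), (2, 3)} is projective, while {(0, 1), (0, 2), (1, 3)} is not, since the descendants of node 1 are {1, 3}.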
We assume each sentence w = a_1 · · · a_n has a distinguished gold tree T_g. The score of an arbitrary tree T for the same w is defined as |T ∩ T_g|. The accuracy of T is its score divided by n.
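In code, under the same edge-set representation (illustrative only):

```python
# Score = number of edges shared with the gold tree; accuracy = score / n.

def score(t, t_gold):
    return len(t & t_gold)

def accuracy(t, t_gold, n):
    return score(t, t_gold) / n
```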
A dependency parser is usually designed to find a tree that is as accurate as possible, given an input sentence. Such a parser can generally be extended in a natural way to find a labeled dependency structure, which is analogously defined as a labeled tree with root 0. An edge label in such a structure is called a dependency relation. This paper focuses on unlabeled dependency parsing.

Transition-based dependency parsing
Transition-based dependency parsing is commonly formalized in terms of a set of configurations and a finite set of transitions between configurations. For now, a configuration for input sentence w = a_1 · · · a_n is a 3-tuple (α, β, T), where α is the stack, β is the remaining input, and T is a subset of (the set of edges of) a dependency structure. We assume αβ is a subsequence of 0 1 · · · n, and β is more specifically a suffix of 1 · · · n. A transition is a partial function, mapping one configuration to another. A step is one application of a transition. A computation is a sequence of steps, starting with the initial configuration (0, 1 · · · n, ∅) and ending in a final configuration (0, ε, T), where ε denotes the empty string; here T is the resulting tree.
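As a concrete sketch of this formalization, assuming a plain-Python representation (names are illustrative, not from the paper):

```python
# A configuration (alpha, beta, T): the stack alpha as a list of token
# indices (0 at the bottom), the remaining input beta as a list, and T
# as a set of (parent, child) edges.

def initial_config(n):
    """Initial configuration (0, 1 ... n, {})."""
    return ([0], list(range(1, n + 1)), set())

def is_final(config):
    """Final configuration (0, epsilon, T)."""
    alpha, beta, _ = config
    return alpha == [0] and beta == []
```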
A transition may have a precondition, i.e. a condition on the current configuration that must hold for the transition to be applicable. Unrestricted preconditions are less than convenient for our purposes, and therefore we opt for a more uniform framework, in which a stack element is a pair (a, A) consisting of a token a and a label A taken from a fixed set.² To avoid clutter, we write a^A in place of (a, A); this also emphasizes the relation to more traditional formulations of dependency parsing, which are obtained by omitting the superscripts.
A first illustration of this is traditional shift-reduce dependency parsing, defined by the transitions in Table 1, here without labels; alternatively, one may consider there to be only one such label, which is left implicit.³

² There is a close connection to bilexical context-free grammars (Eisner and Satta, 1999), on the basis of which one may alternatively choose to refer to such a label as a 'delexicalized stack symbol', in a kind of lexicalized pushdown automaton.

Table 1: Transitions of shift-reduce dependency parsing.
shift: (α, bβ, T) ⊢_SH (αb, β, T)
reduce left: (αa_1a_2, β, T) ⊢_RL (αa_1, β, T ∪ {(a_1, a_2)})
reduce right: (αa_1a_2, β, T) ⊢_RR (αa_2, β, T ∪ {(a_2, a_1)}), if |α| > 0

This form of parsing suffers from spurious ambiguity, in that left and right children may be attached in different orders. E.g. if token b has left child a_1 and right child a_2, then after a shift of a_1 and b, there may be a reduce right followed by a shift of a_2 followed by a reduce left. Or there may be a shift of a_2 followed by a reduce left followed by a reduce right. This can be resolved by requiring that left children are attached before right children are. In our framework, this left-before-right policy can be enforced by introducing a label C, which is given to a token to signal that it is 'complete' with regard to its left children. Initially, shifted tokens carry label N (for 'no restriction'). The 0 token always has label C. This results in Table 2.
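The transitions of Table 1 can be sketched as partial functions on configurations; a minimal illustration under the list/set representation above, where None signals that the transition is not applicable:

```python
# Sketch of the three transitions of Table 1 (unlabeled shift-reduce
# parsing). Configurations are (stack, remaining input, edge set).

def shift(config):
    alpha, beta, t = config
    if not beta:
        return None
    return (alpha + [beta[0]], beta[1:], t)

def reduce_left(config):
    alpha, beta, t = config
    if len(alpha) < 2:
        return None
    *rest, a1, a2 = alpha
    return (rest + [a1], beta, t | {(a1, a2)})

def reduce_right(config):
    alpha, beta, t = config
    if len(alpha) < 3:   # the side condition |alpha| > 0 of Table 1
        return None
    *rest, a1, a2 = alpha
    return (rest + [a2], beta, t | {(a2, a1)})

# Example computation for a 2-token sentence with tree {(0,1), (1,2)}:
c = ([0], [1, 2], set())
for step in (shift, shift, reduce_left, reduce_left):
    c = step(c)
```

After these four steps, c is the final configuration ([0], [], {(1, 2), (0, 1)}).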
There is a simple one-to-one correspondence between a computation according to Table 2 and a computation according to Table 1 satisfying the left-before-right policy. The difference is merely an application of complete just before a token ceases to be a topmost stack element, either because it is reduced into the token to its left, or because another token is pushed on top. If a token becomes non-topmost and reappears later on top of the stack, after applications of reduce left that give it right children, then it will still have label C, which prevents it from taking further left children.

Table 3: Transitions of arc-eager dependency parsing (Nivre, 2008, p. 525).
shift: (α, bβ, T) ⊢_SH (αb, β, T)
left arc: (αa, bβ, T) ⊢_LA (α, bβ, T ∪ {(b, a)}), if ¬∃a'[(a', a) ∈ T]
right arc: (αa, bβ, T) ⊢_RA (αab, β, T ∪ {(a, b)}), if ¬∃a'[(a', b) ∈ T]
reduce: (αa, β, T) ⊢_RE (α, β, T), if ∃a'[(a', a) ∈ T]

Table 3 is almost verbatim the formulation of arc-eager parsing by Nivre (2008), except that we renamed symbols, and we ignore dependency relations; the formulations by e.g. Nivre (2003, 2004) are largely equivalent. It is easy to see that the condition ¬∃a'[(a', b) ∈ T] for right arc is redundant, as no tokens in the remaining input can obtain parents before they are shifted to the stack.
Taking shift-reduce parsing as starting point, reduce right corresponds roughly to left arc, while the role of reduce left is only partly fulfilled by right arc, which postulates that b is a right child of a, but without removing b as yet, allowing b to take right children, and only later is that b removed by reduce. Here shift-reduce parsing would postpone the decision whether that b is the left or the right child of its parent until all descendants of b have been shifted and reduced into it.
From the perspective of parsing of artificial languages (Sippu and Soisalon-Soininen, 1990), this is counter-intuitive. The conventional wisdom of deterministic parsing is that one should postpone commitment to occurrences of grammar rules (or here, dependency edges) for as long as possible, until enough information is available to resolve any local ambiguity, assuming left-to-right processing of the input string, and the ability to inspect only the top of the stack and the next k tokens of the remaining input, for a fixed, small number k.
Two arguments have been given why arc-eager parsing is nonetheless superior for processing natural language. The first is that this earlier commitment made by right arc, in terms of the earlier creation of the dependency edge, offers additional information about the tree under construction, to better predict the next steps, using some type of classifier. The second argument in favor of arc-eager parsing is that the earlier creation of the dependency edge ensures that the partial tree under construction remains as connected as possible, which may help simultaneous syntactic and semantic processing. See Nivre (2004, 2008) and Damonte et al. (2017) for related discussions.
Next we rephrase arc-eager parsing to use labels to express preconditions, to prepare us for Section 4. Note that a token is transferred from the remaining input to the stack by either shift or right arc. In the former case, it must eventually become a left child of its parent, and in the latter case, it becomes a right child. We use labels L and R for these cases.⁴ In a configuration with set T as third element, existence of a stack element a^L implies ¬∃a'[(a', a) ∈ T], and a^R implies ∃a'[(a', a) ∈ T]. We thereby obtain the first four transitions in Table 4. Token 0 always has label R, and cannot be popped by reduce due to the a_1^X in that transition. The fifth transition will be discussed later.
Arc-eager parsing in either of the above two formulations cannot work in practice. The problem is illustrated in Table 5. In the last configuration, none of the steps is applicable. The situation arises when the remaining input becomes empty while there is an L label anywhere in the stack. Assuming the classifier used for predicting the next step cannot look unboundedly deep into the stack, this problem is unavoidable. One possible fix is to add the unshift transition of Nivre and Fernández-González (2014); see also Honnibal and Johnson (2015). As this causes considerable complications to our framework, we will solve this in another way, reminiscent of Honnibal et al. (2013), which also helps to make a connection with shift-reduce parsing later. Our proposed fix is to allow a reduce even if the top of stack has label L, by means of the fifth transition of Table 4, reduce correct. This transition is not needed during training if only computations are considered that most straightforwardly correspond to gold trees, with left arc applied only to a token that is to become the left child of its parent. This may mean however that, in the case of labeled dependency parsing, the trained classifier has no basis to predict the dependency relation of the edge created by this transition when applied during testing. This can be solved by moving the creation of the edge from right arc to reduce, and by then merging reduce and reduce correct, as in Table 6.
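Under the same representation as before, the five labeled transitions can be sketched as follows. This is our reading of the description above, not a verbatim rendering of Table 4; in particular, the exact form of reduce correct (attaching the popped L-labeled token as a right child of the token beneath it) is an assumption:

```python
# Hedged sketch of arc-eager parsing with L/R labels, plus the
# "reduce correct" repair transition. Stack elements are (token, label)
# pairs; None signals the transition is not applicable.

def shift(config):                      # pushed token will be a left child
    alpha, beta, t = config
    if not beta:
        return None
    return (alpha + [(beta[0], 'L')], beta[1:], t)

def right_arc(config):                  # pushed token is right child of top
    alpha, beta, t = config
    if not beta:
        return None
    a = alpha[-1][0]
    return (alpha + [(beta[0], 'R')], beta[1:], t | {(a, beta[0])})

def left_arc(config):                   # top of stack becomes left child
    alpha, beta, t = config             # of the next input token
    if len(alpha) < 2 or not beta or alpha[-1][1] != 'L':
        return None
    a = alpha[-1][0]
    return (alpha[:-1], beta, t | {(beta[0], a)})

def reduce_(config):                    # pop an R-labeled token
    alpha, beta, t = config
    if len(alpha) < 2 or alpha[-1][1] != 'R':
        return None
    return (alpha[:-1], beta, t)

def reduce_correct(config):             # repair: pop an L-labeled token,
    alpha, beta, t = config             # attaching it to the token beneath
    if len(alpha) < 2 or alpha[-1][1] != 'L':
        return None
    a = alpha[-1][0]
    return (alpha[:-1], beta, t | {(alpha[-2][0], a)})

# Recovery from a wrong shift: token 1 was shifted (label L) although the
# gold tree has it as a right child of 0; reduce_correct still attaches it.
c = ([(0, 'R')], [1, 2], set())
c = shift(c)            # 1 gets label L
c = right_arc(c)        # edge (1, 2)
c = reduce_(c)          # pop 2
c = reduce_correct(c)   # repair: edge (0, 1)
```

After these steps, c is ([(0, 'R')], [], {(1, 2), (0, 1)}), so the gold tree is still reached despite the mistaken shift.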
This formulation at first sight appears to nullify the property that has been argued to give arc-eager parsing an advantage over shift-reduce parsing, namely the early availability of edges connecting a parent and a right child. However, these edges are still identified by investigating which tokens in the stack have label R: their parent is the token immediately to their left in the stack. In other words, a classifier predicting the next step can be made to have access to the exact same feature values as before.
This formulation of arc-eager parsing, like the original formulations (Goldberg and Nivre, 2012, p. 963), allows the same dependency structure to be obtained in two different ways; cf. Table 7. There are few studies that compare parsing accuracy between the two ways of resolving this, by preferring either shift before reduce, or reduce before shift, and some literature suggests the choice is arbitrary,⁵ although the results from one study (Qi and Manning, 2017) suggest the shift-before-reduce policy could be slightly better.

⁵ Cf. "harmless SHIFT-REDUCE conflicts" (Nivre, 2006, p. 98).

One way to enforce the reduce-before-shift policy is to opt for a different division of labor between the stack and the leftmost token of the remaining input, whereby we must shift a node from remaining input to stack before it obtains its first left child, or before it is decided whether it is to be a left child or a right child. Table 8 presents this normalized arc-eager parsing. The shift is now simply the transfer of a token from remaining input to stack, without making a commitment whether it is to become a left or right child, which is indicated by the N label. Where before we had shift and right arc, we now have left child and right child, which commit a token on top of the stack to be a left or right child of its parent, respectively. Where before we had left arc, we now have reduce right, and reduce is more appropriately renamed to reduce left. There is a simple one-to-one correspondence between a computation according to Table 8 and a computation according to Table 6 satisfying the reduce-before-shift policy; what was the top of stack in the case of Table 6 corresponds to the first element underneath the top of stack in the case of Table 8.⁶

One advantage of the normalized formulation is that it clearly reveals the relation to shift-reduce parsing. In particular, instead of complete in Table 2, we have the more specific left child and right child, of which the latter constitutes the early commitment of a token to be a right child, as explained before.

Calculating the optimal step
Assume there are τ transitions, denoted ⊢_1, . . . , ⊢_τ. Let ⊢ represent an application of any of these transitions, and let ⊢* denote the reflexive transitive closure of ⊢. For a given configuration (α, β, T) for input sentence w with gold tree T_g, there are up to τ steps (α, β, T) ⊢_i (α_i, β_i, T_i), i = 1, . . . , τ. For each of these, the score is the maximal ρ_i = |T' ∩ T_g| over final configurations (0, ε, T') with (α_i, β_i, T_i) ⊢* (0, ε, T'). The task is now to compute that ρ_i for each i. This determines which transition to apply next, to eventually obtain the highest-scoring tree, irrespective of any 'incorrect' steps performed in the past, that is, steps that were inconsistent with the gold tree. Because |T ∩ T_g| is the same for all i, and because the value of |T_i ∩ T_g| ≤ 1 with (α, β, ∅) ⊢_i (α_i, β_i, T_i) is easily determined by a single lookup, the remaining problem is to compute σ_i, the maximal |T' ∩ T_g| with (α_i, β_i, ∅) ⊢* (0, ε, T').

⁶ The division of labor between stack and remaining input is also what distinguishes Table 2 from the hybrid model of Kuhlmann et al. (2011).
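For tiny configurations, the score of a reachable final configuration can be checked by brute-force search over all computations. This makes the definition concrete, although it takes exponential time; the cubic-time calculation is the point of this paper. The sketch below uses our illustrative reading of the labeled arc-eager transitions (shift, right arc, left arc, reduce, reduce correct), not the paper's exact tables:

```python
# Brute-force search: the maximal |T' ∩ T_gold| over all final
# configurations reachable from a given configuration. Exponential time,
# for illustration on tiny inputs only.

def successors(config):
    alpha, beta, t = config
    out = []
    if beta:
        # shift: pushed token will be a left child (label L)
        out.append((alpha + [(beta[0], 'L')], beta[1:], t))
        # right arc: pushed token is a right child of the top of stack
        a = alpha[-1][0]
        out.append((alpha + [(beta[0], 'R')], beta[1:], t | {(a, beta[0])}))
    if len(alpha) >= 2:
        top, lab = alpha[-1]
        below = alpha[-2][0]
        if lab == 'L' and beta:
            # left arc: top of stack becomes left child of next input token
            out.append((alpha[:-1], beta, t | {(beta[0], top)}))
        if lab == 'R':
            # reduce
            out.append((alpha[:-1], beta, t))
        if lab == 'L':
            # reduce correct: attach as right child of the token beneath
            out.append((alpha[:-1], beta, t | {(below, top)}))
    return out

def best_score(config, t_gold):
    alpha, beta, t = config
    if len(alpha) == 1 and not beta:        # final configuration
        return len(t & t_gold)
    return max((best_score(c, t_gold) for c in successors(config)),
               default=float('-inf'))
```

Started with an empty edge set, this corresponds to σ_i; note that even after a wrong shift, reduce correct lets the search recover the full gold score.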
As shown by Goldberg and Nivre (2012, 2013), the optimal step can be determined in linear time for (uncorrected) arc-eager parsing, provided the gold tree is projective. The procedure is defined in terms of costs of transitions, rather than in terms of scores. We revisit this in Section 5.
For normalized arc-eager parsing (Table 8) and projective gold trees, the problem appears to be no easier than for shift-reduce parsing, but can still be solved in linear time, by a straightforward refinement of the algorithm by Nederhof (2019), blocking a token from becoming a left child of its parent if its label is R.
Now assume the gold tree may be non-projective. For shift-reduce parsing, Nederhof (2019) presents a cubic-time algorithm for calculating σ_i, generalizing the procedure of Goldberg et al. (2014), which is applicable only to projective trees. The algorithm has a modular design, in terms of a generic tabular dependency parsing algorithm (Eisner and Satta, 1999), plus an explicitly 'split' bilexical context-free grammar (Eisner and Satta, 1999; Eisner, 2000; Johnson, 2007) that encodes computations of shift-reduce parsing. A given configuration is translated to an input string, and weights between pairs of input positions are set according to the existence of edges between corresponding tokens in the gold tree. Exhaustive parsing of the string by the grammar, using an appropriate semiring, yields σ_i.
Here we show that the same framework is applicable to arc-eager parsing. The generic tabular dependency parsing algorithm remains the same, but a new grammar is needed to encode computations of arc-eager parsing. Following Nederhof (2019), nonterminals are either single symbols or pairs of symbols, and rules are of one of a small number of fixed forms, one of which has a terminal a in the right-hand side. The last two forms contain underscores and are shorthand for any rules obtained by consistent substitution of the two underscores; which symbols can be meaningfully substituted is clear from context, as exemplified below.
We start with the normalized form (Table 8), which requires the grammar in Table 9, with the indicated translation from the configuration to an input string. The intuition behind this grammar is similar to the one by Nederhof (2019), but more cases need to be distinguished due to the labels. Grammar symbols R and R_t correspond to tokens in the stack with label R, where R_t specifically means that the token is on top of the stack.

Table 9: Grammar for normalized arc-eager dependency parsing of a string in {r, ℓ}^{k−1} {r_t, ℓ_t, ℓ_b, n} n^m, representing a stack of length k and a remaining input of length m. A label R in the top of stack is translated to r_t, and other occurrences of R are translated to r. A label L in the top of stack is translated to ℓ_t, unless the candidate transition is left child, in which case it is translated to ℓ_b; other occurrences of L are translated to ℓ. A label N in the top of stack is translated to n.

Rules (3)-(4) distinguish the two cases Y ∈ {R, L} of reduce left with X = R. The rules are best read from right to left, as here for example "if the top of stack has label R or L, and if the token underneath has label R, then the latter keeps its label R and becomes the top of stack". Rules (5)-(6) allow for reduce left with a right child that was in the remaining input, or that was the top of stack with label N. In (6), the underscore can be substituted by R or R_t. In (3)-(5), the only meaningful substitution is by R. Rules (10)-(18) are analogous to (1)-(9). Rules (20)-(21) allow any projective parse of the remaining input (as well as of the top of stack if that had label N), and (22)-(24) handle a token in the remaining input taking a left child in the stack, provided it has label L. In (20)-(24), the only meaningful substitution of the underscore is by N.
If a token has already been given label L, then it becoming a right child by (4) or (13) amounts to correcting a mistake made earlier, and may be necessary so the computation does not get stuck (cf. Table 5). If the next transition to be considered is left child however, which puts L in the top of stack, then we do not wish the corresponding token to become a right child; right child should be applied instead. Label L is then translated to ℓ_b, with b for 'blocking' (4) and (13). Figure 1 exemplifies a derivation encoding a computation. A formal proof of correctness is by induction, showing that existence of a subderivation of the grammar implies existence of a corresponding subcomputation with the same score, and vice versa. Cf. the proof sketch by Nederhof (2019) for shift-reduce parsing.
Unnormalized arc-eager parsing (Table 6) requires a different approach, due to the different division of labor between stack and remaining input. We now need to count the number of right children of a token in the stack that were themselves in the stack, up to but not exceeding 1. E.g. rule (5) in Table 10 counts the first right child, but there is no further rule with right-hand side (_, R_1) R to allow a second right child from among the stack elements; other children from the remaining input are allowed, as e.g. by rule (10).
We now also need to observe the chosen policy. With the shift-before-reduce policy, if the candidate transition is reduce, then the first symbol of the remaining input becomes n_p (p for 'policy'). There is a notable absence of a rule with right-hand side N_p (N, _), which means that this token cannot become a left child without first taking a child from the stack as by (39) and (43), because if it were, the policy would be violated: the token should have been shifted, and reduced into its parent on the right, preceding the reduce. There is no restriction on the token becoming a right child, as e.g. by (4).
A strict reduce-before-shift policy implies that a token in the stack should not be reduced into the token to its right if other tokens were previously shifted on top, unless it is to obtain more right children. This is because, by the policy, the reduction should have happened earlier. Alternatively we may opt for a non-strict reduce-before-shift policy that allows us to correct mistakes made earlier. Either variant uses r_p, ℓ_p, R_p and L_p to enforce the policy. E.g. there are no rules with R_p in the right-hand side, effectively blocking a derivation. Here rules (7)-(9) are needed to give an R-labeled stack element at least one right child, which by (14)-(15) allows the token to participate in a full derivation.
In order to compute the score for arc-eager parsing without our correction (i.e. Table 4 without reduce correct), one should omit the rules from Table 10 that correspond to L-labeled tokens becoming right children, i.e. (6), (9), (22), (25). Whether the unshift from Nivre and Fernández-González (2014) and Honnibal and Johnson (2015) can be handled in our framework requires further study.

Calculation for projective trees
If the gold tree is projective, then the problem becomes much easier. Here we assume the formulation of arc-eager parsing as in Table 6. The number σ_i, as defined in Section 4, for a configuration with stack α_i = a_1 · · · a_k and remaining input β_i = b_1 · · · b_m, can be calculated by counting in the first instance:
• the number of gold edges (a_{p−1}, a_p), where 1 < p ≤ k, plus
• the number of gold edges (a_p, b_q), plus
• the number of gold edges (b_p, a_q), such that a_q has label L, plus
• the number of gold edges (b_p, b_q),
but discounting a number of these, as follows.

First, consider the case of the candidate transition being shift. If m = 0, the score becomes −∞, as there is no available parent for the shifted token. If m > 0, we discount a possible gold edge (a_k, b_p) if the rightmost descendant of b_p is b_m, because no projective tree exists in which a_k is a left child while its descendants include the end of the input. We further discount a possible gold edge (a_{k−1}, a_k), because if a_k is to become a right child of a_{k−1}, then the correct step is right arc in place of shift.

Table 10: Grammar for unnormalized arc-eager dependency parsing. With reduce-before-shift, the string is in r {r, r_p, ℓ, ℓ_p}^{k−2} {r, r_p, ℓ, ℓ_p, ℓ_b} n^m, for stack length k and remaining input length m. Now ℓ_b is used if the candidate transition is shift, and a non-bottommost symbol to the left of that becomes r_p or ℓ_p. For a strict reduce-before-shift policy moreover, the second to the (k−2)-th symbols become r_p or ℓ_p, and furthermore the (k−1)-th becomes r_p or ℓ_p if the candidate transition is left arc or reduce, and furthermore the k-th becomes r_p or ℓ_p if the candidate transition is left arc; otherwise, these symbols are r or ℓ. With shift-before-reduce, the string does not contain r_p or ℓ_p, and the first n is replaced by n_p if the candidate transition is reduce.
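The 'first instance' count above, before any discounting, is a single pass over the gold edges. A hedged sketch, with names of our own choosing:

```python
# First-instance count of gold edges matchable from a configuration:
# stack is a list of (token, label) pairs a_1..a_k, remaining is the
# list b_1..b_m, gold is the set of gold (parent, child) edges.

def first_instance_count(stack, remaining, gold):
    toks = [a for a, _ in stack]
    labels = {a: lab for a, lab in stack}
    rem = set(remaining)
    count = 0
    for p, c in gold:
        if p in labels and c in labels and toks.index(p) + 1 == toks.index(c):
            count += 1      # (a_{p-1}, a_p): parent immediately left in stack
        elif p in labels and c in rem:
            count += 1      # (a_p, b_q): child still in the remaining input
        elif p in rem and c in labels and labels[c] == 'L':
            count += 1      # (b_p, a_q): only if a_q has label L
        elif p in rem and c in rem:
            count += 1      # (b_p, b_q): both in the remaining input
    return count
```

The discounts described in the surrounding text would then be subtracted from this count, depending on the candidate transition and the chosen policy.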
Second, if the candidate transition is reduce, we discount up to one gold edge in case of the shift-before-reduce policy, as follows, and as illustrated by Figure 2. Let r be largest such that, for some p > 1, there is a gold edge (b_p, a_r) where a_r has label L, or there is a gold edge (a_r, b_p); if no such gold edge exists, let r = 1. If there is no s (r < s ≤ k) such that a_s has label L and (a_{s−1}, a_s) is not a gold edge, then we discount any gold edge (b_q, b_1).

Figure 2: Discounting of (b_q, b_1) if s does not exist.

The rationale is that if b_1 can be given a child from among the tokens in the stack (by a gold edge or otherwise, and without discounting another gold edge elsewhere), then this justifies postponing the shift until after the reduce. If it cannot be, then b_1 becoming a left child violates the shift-before-reduce policy.
Lastly, if the candidate transition is shift, we discount further gold edges in case of the non-strict reduce-before-shift policy, which requires a_{k−1} to either become a child of some b_p or take some b_p as child, to justify it not having been reduced into a_{k−2} before the shift.⁷ From among the cases to be distinguished, we choose the one that discounts the fewest edges. First, if the label of a_{k−1} is L, we can let it become a left child, but should then discount a possible gold edge (a_{k−2}, a_{k−1}). Second, if there is a gold edge (a_{k−1}, b_p), then no edges need be discounted. Otherwise, we need to find a child b_p of a_{k−1}, for which there are five options, illustrated in Figure 3:

(A) The first option is applicable if a_k has descendants among the remaining input or has a parent b_q (q > 1) among the remaining input. In the former case, choose b_p to be the rightmost among the descendants (but let p = m − 1 if the rightmost descendant is b_m), and in the latter case choose p = 1. In effect we assume non-gold edges (a_{k−1}, b_p) and (b_p, a_k), and consequently we discount any gold edge (b_q, b_p) and any gold edge to a_k.

(B1) If a_k has a parent b_q in the remaining input, choose b_p to be b_q. In effect we assume non-gold edge (a_{k−1}, b_q), and consequently we discount any gold edge (a_r, b_q) with r ≤ k − 2 or any gold edge (b_r, b_q), as well as any gold edges (b_q, a_s) with s ≤ k − 2.

(B2) If a_k does not have a parent in the remaining input, let b_q be the token immediately to the right of the rightmost descendant of a_k among the remaining input (but let q = m if the rightmost descendant is b_m), and let q = 1 if a_k has no descendants among the remaining input. As in (B1), we in effect assume non-gold edge (a_{k−1}, b_q), and discount any gold edge (a_r, b_q) with r ≤ k − 2 or any gold edge (b_u, b_q), as well as any gold edges (b_q, a_s) with s ≤ k − 2.
(C1) and (C2) are similar to (B1) and (B2), but b_p is chosen to be the first ancestor of b_q that does not have a parent in the remaining input (though it may have one in the stack). Much as before, we discount any gold edge (a_r, b_p) with r ≤ k − 2, as well as any gold edges (b_u, a_s) with s ≤ k − 2, where b_u is b_q or b_p or any other token on the path of gold edges from b_q to b_p. One can show that choices of b_p other than in (A), (B1), (B2), (C1), (C2) would entail discounting of at least as many edges.

Aufrant et al. (2018) propose approximating the calculation of the optimal step for a non-projective gold tree, by a procedure defined in terms of costs of transitions, analogous to the procedure by Goldberg and Nivre (2012, 2013), but without taking full account of edges that violate projectivity. Similarly, if the above procedure to calculate scores is applied to a non-projective tree, then an approximation is obtained. The advantage is its simplicity and linear time complexity.

Empirical results
The advantage of 'dynamic oracles' for improving parsing accuracy has been demonstrated before. Our experiments have therefore concentrated on two obvious questions, viz. whether the cubic-time calculation is feasible in practice, and whether the higher time costs are rewarded with a more accurate output, relative to a linear-time approximation of the kind discussed in Section 5.
Considered here is unnormalized arc-eager parsing. The classifier, implemented in Java and DL4J, uses simple features (gold POS of the three rightmost elements of the stack and three leftmost elements of the remaining input, and leftmost and rightmost dependency relations in the topmost two stack elements).
The parser was first trained on configurations corresponding to projectivized gold trees from the German (GSD) corpus of Universal Dependencies v2.2. The trained parser was then applied on the unprojectivized trees, and the optimal step was calculated for each configuration thus visited. Figure 4 presents running time, on a laptop with an Intel i7-7500U processor (4 cores, 2.70 GHz) with 8 GB of RAM. The larger context-free grammar of Table 10, relative to the one for shift-reduce parsing, leads to a higher constant factor in the time complexity. Nonetheless, the calculation is feasible even for long sentences.

Accuracy of the approximation
In 8.0% and 8.1% of the visited configurations, one or more of the values ρ_1, . . . , ρ_4 for the four transitions differed between the exact calculation (Section 4) and the approximation (Section 5), for the shift-before-reduce and non-strict reduce-before-shift policies respectively. However, we are less interested in the absolute values of the scores than in which of them is highest. Note further that more than one may be equal to their maximum. By comparing the sets of transitions with the maximum calculated score, we found that the true set and the approximate set differed for only 0.4% and 0.5% of the total number of configurations, for the two policies respectively. The most frequent errors are listed in Table 11. Somewhat surprisingly, in the great majority of cases, the approximate set was contained in the true set; these cases sum to 89.0% and 87.8% of the total number of errors, respectively. The implication is that if a parser trained with a 'dynamic oracle' does arbitrary tie breaking between multiple optimal transitions, then there are few immediate prospects to improve parsing accuracy by incorporating the exact calculation. The situation may change if future research reveals better alternatives to arbitrary tie breaking.

Conclusions
Our exact calculation of the optimal step solves an open problem in parsing theory. Further research into the application of 'dynamic oracles' is needed to determine whether this can be exploited to improve parsing accuracy.