On Finding the K-best Non-projective Dependency Trees

The connection between the maximum spanning tree in a directed graph and the best dependency tree of a sentence has been exploited by the NLP community. However, for many dependency parsing schemes, an important detail of this approach is that the spanning tree must have exactly one edge emanating from the root. While work has been done to efficiently solve this problem for finding the one-best dependency tree, no research has attempted to extend this solution to finding the K-best dependency trees. This is arguably a more important extension, as a larger proportion of decoded trees will violate the root constraint of dependency trees. Indeed, we show that the rate of root constraint violations increases by an average of 13 times when decoding with K=50 as opposed to K=1. In this paper, we provide a simplification of the K-best spanning tree algorithm of Camerini et al. (1980). Our simplification allows us to obtain a constant-time speed-up over the original algorithm. Furthermore, we present a novel extension of the algorithm for decoding the K-best dependency trees of a graph that are subject to a root constraint.


Introduction
Non-projective, graph-based dependency parsers are widespread in the NLP literature (McDonald et al., 2005; Dozat and Manning, 2017; Qi et al., 2020). However, despite the prevalence of K-best parsing for other parsing formalisms, often in the context of re-ranking (Collins and Koo, 2005; Sangati et al., 2009; Zhu et al., 2015; Do and Rehbein, 2020), and in other areas of NLP (Shen et al., 2004; Huang and Chiang, 2005; Pauls and Klein, 2009; Zhang et al., 2009), we have only found three works that consider K-best non-projective dependency parsing (Hall, 2007; Hall et al., 2007; Agić, 2012).1 All three papers utilize the K-best spanning tree algorithm of Camerini et al. (1980). Despite the general utility of K-best methods in NLP, we suspect that the relative lack of interest in K-best non-projective dependency parsing is due to the implementation complexity and nuances of Camerini et al. (1980)'s algorithm. We make a few changes to Camerini et al. (1980)'s algorithm, which result in both a simpler algorithm and a simpler proof of correctness. Firstly, both algorithms follow the key property that we can find the second-best tree of a graph by removing a single edge from the graph (Theorem 1); this property is used iteratively to enumerate the K-best trees in order. Our approach to finding the second-best tree (see §3) is faster because it performs half as many of the expensive cycle-contraction operations (see §2). Overall, this change is responsible for our 1.39x speed-up (see §4). Secondly, their proof of correctness is based on reasoning about a complicated ordering on the edges in the Kth tree (Camerini et al., 1980, Section 4); our proof side-steps the complicated ordering by directly reasoning over the ancestry relations of the Kth tree. Consequently, our proofs of correctness are considerably simpler and shorter. Throughout the paper, we provide the statements of all lemmas and theorems in the main text, but defer all proofs to the appendix.
1 Our implementation is available at https://github.com/rycolab/spanningtrees.
In addition to simplifying Camerini et al. (1980)'s algorithm, we offer a novel extension. For many dependency parsing schemes, such as the Universal Dependencies (UD) scheme (Nivre et al., 2018), there is a restriction that dependency trees may only have one edge emanating from the root.4 Finding the maximally weighted spanning tree that obeys this constraint was considered by Gabow and Tarjan (1984), who extended the O(N^2) maximum spanning tree algorithm of Tarjan (1977) and Camerini et al. (1979). However, no algorithm exists for K-best decoding of dependency trees subject to a root constraint. As such, we provide the first K-best algorithm that returns dependency trees that obey the root constraint.
4 There are certain exceptions to this, such as the Prague Treebank (Bejček et al., 2013).
To motivate the practical necessity of our extension, consider Fig. 1, which shows the percentage of trees that violate the root constraint when doing one-best and 50-best decoding for 63 languages from the UD treebank (Nivre et al., 2018) using the pre-trained model of Qi et al. (2020).5,6 We find that decoding without the root constraint has a much more extreme effect when decoding the 50-best trees than the one-best tree. Specifically, we observe that on average, the number of violations of the root constraint increased by 13 times, with the worst increase being 44 times. The results thus suggest that finding K-best trees that obey the root constraint from a non-projective dependency parser requires a specialist algorithm. We provide a more detailed results table in App. A, including root constraint violation rates for K = 5, K = 10, and K = 20. Furthermore, we note that the K-best algorithm may also be used for marginalization of latent variables (Correia et al., 2020) and for constructing parsers with global scoring functions (Lee et al., 2016).
5 Zmigrod et al. (2020) conduct a similar experiment for only the one-best tree.
6 We note that Qi et al. (2020) do apply the root constraint for one-best decoding, albeit with a sub-optimal algorithm.

Figure 2 (caption): Edges that are part of both the best tree G^(1) and the best dependency tree G^[1] are marked as thick solid edges. Edges only in G^(1) are dashed and edges only in G^[1] are dotted.

Finding the Best Tree
We consider the study of rooted directed weighted graphs, which we will abbreviate to simply graphs.7 A graph is given by G = (ρ, N, E) where N is a set of N + 1 nodes with a designated root node ρ ∈ N and E is a set of directed weighted edges. Each edge e = (i → j) ∈ E has a weight w(e) ∈ R+. We assume that self-loops are not allowed in the graph (i.e., (i → i) ∉ E). Additionally, we assume our graph is not a multi-graph; therefore, there can exist at most one edge from node i to node j.8 When it is clear from context, we abuse notation and use j ∈ G and e ∈ G for j ∈ N and e ∈ E, respectively. When discussing runtimes, we will assume a fully connected graph (|E| = N^2).9 An arborescence (henceforth called a tree) of G is a subgraph d = (ρ, N, E′) such that E′ ⊆ E and the following is true:

1. For all j ∈ N ∖ {ρ}, |{(i → j) ∈ E′}| = 1.

2. d does not contain any cycles.

7 As we use the algorithm in Zmigrod et al. (2020) as our base algorithm, we borrow their notation wherever convenient.
8 We make this assumption for simplicity; the algorithms presented here will also work with multi-graphs. This might be desirable for decoding labeled dependency trees. However, we note that in most graph-based parsers, such as Qi et al. (2020) and Ma and Hovy (2017), dependency labels are extracted after the unlabeled tree has been decoded.
9 We make this assumption as, in the context of dependency parsing, we generate scores for each possible edge. Furthermore, Tarjan (1977) proves that the runtime of finding the best tree for dense graphs is O(N^2); this is O(|E| log N) in the non-dense case.
Other definitions of trees can also include that there is at least one edge emanating from the root. However, this condition is immediately satisfied by the above two conditions. A dependency tree is a tree in which exactly one edge emanates from the root ρ. The set of all trees and dependency trees in a graph are given by A(G) and D(G), respectively. The weight of a tree is given by the sum of its edge weights, w(d) = Σ_{e ∈ d} w(e).10 This paper concerns finding the K highest-weighted (henceforth called K-best) tree or dependency tree; these are denoted by G^(K) and G^[K], respectively.

Tarjan (1977) and Camerini et al. (1979) provided the details of an O(N^2) algorithm for decoding the one-best tree. This algorithm was extended by Gabow and Tarjan (1984) to find the best dependency tree in O(N^2) time. We borrow the algorithm (and notation) of Zmigrod et al. (2020), who provide an exposition and proofs of these algorithms in the context of non-projective dependency parsing. The pseudocode for finding G^(1) and G^[1] is given in Fig. 3. We briefly describe the key components of the algorithm.11

The greedy graph of G, denoted →G, is formed by selecting the highest-weighted incoming edge of every non-root node. If →G is a tree, then G^(1) = →G; otherwise, →G contains a cycle, which we call a critical cycle. If we encounter a critical cycle in the algorithm, we contract the graph by the critical cycle. A graph contraction G/C by a cycle C replaces the nodes in C by a mega-node c, such that the nodes of G/C are (N ∖ C) ∪ {c}. Furthermore, for each edge e = (i → j) ∈ G:

1. If i ∉ C and j ∈ C, then (i → c) ∈ G/C.

2. If i ∈ C and j ∉ C, then (c → j) ∈ G/C.

3. If i ∉ C and j ∉ C, then e ∈ G/C.

The weights of the edges in G/C are set by a contraction weighting scheme, discussed below.
There also exists a bookkeeping function π such that for all e′ ∈ G/C, π(e′) ∈ G. This bookkeeping function returns the edge in the original graph that led to the creation of the edge in the contracted graph using one of the constructions above. Finding G^(1) is then the task of finding the best tree of the contracted graph, (G/C)^(1). Once this is done, we can stitch back the cycles we contracted.
10 For inference, the weight of a tree often decomposes multiplicatively rather than additively over the edges. One can take the logarithm (or exponential) of the original edge weights to make the weights distribute additively (or multiplicatively).
11 For a more complete and detailed description as well as a proof of correctness, please refer to the original manuscripts.
The stitched tree d ∘ C of a tree d ∈ A(G/C) and the cycle C is the tree made with the edges π(d) (π applied to each edge of d) and →C_j, where C_j is the subgraph of the nodes in C rooted at node j and π(e) = (i → j) for the edge e = (i → c) ∈ d. The contraction weighting scheme means that w(d) = w(d ∘ C) (Georgiadis, 2003). Therefore, G^(1) = (G/C)^(1) ∘ C. The strategy for finding G^[1] is to find the contracted graph for G^(1) and attempt to remove edges emanating from the root. This was first proposed by Gabow and Tarjan (1984). When we consider removing an edge emanating from the root, we are doing this in a possibly contracted graph, and so an edge (ρ → j) may exist multiple times in the graph. We denote by G \\ e the graph G with all edges with the same end-points as e removed. Fig. 2 gives an example of a graph G, its best tree G^(1), and its best dependency tree G^[1].
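To make these definitions concrete, the following is a minimal Python sketch of a dense graph representation, the greedy graph, and critical-cycle detection. It is only an illustration under the assumptions stated in the comments; the names Graph, greedy_graph, and find_critical_cycle are ours and are not taken from our released implementation.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int]  # (i, j) denotes the edge i -> j; node 0 plays the role of rho

class Graph:
    """A rooted, directed, weighted graph over nodes 0..n-1 with root node 0."""
    def __init__(self, n_nodes: int, weights: Dict[Edge, float]):
        self.n = n_nodes        # number of nodes, including the root
        self.w = dict(weights)  # maps (i, j) to w(i -> j); no self-loops

def greedy_graph(g: Graph) -> Dict[int, int]:
    """Select the highest-weighted incoming edge of every non-root node.
    Returns a map j -> i, meaning the edge (i -> j) was selected."""
    best: Dict[int, int] = {}
    for j in range(1, g.n):
        incoming = [(wt, i) for (i, jj), wt in g.w.items() if jj == j]
        if incoming:
            _, i = max(incoming)
            best[j] = i
    return best

def find_critical_cycle(best: Dict[int, int]) -> Optional[List[int]]:
    """Return the nodes of a cycle in the greedy graph (a critical cycle),
    or None if the greedy graph is already a tree."""
    for start in best:
        seen = {start}
        j = start
        while j in best:             # stop if we reach the root, which has no incoming edge
            j = best[j]
            if j == start:
                return sorted(seen)  # every visited node lies on the cycle through start
            if j in seen:
                break                # ran into a cycle that does not contain start
            seen.add(j)
    return None
```

If find_critical_cycle returns None, the greedy graph is already the best tree G^(1); otherwise, the returned cycle is what the algorithm contracts.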
The runtime complexity of finding G^(1) or G^[1] is O(N^2) for dense graphs by using efficient priority queues and sorting algorithms (Tarjan, 1977; Gabow and Tarjan, 1984). We assume this runtime for the remainder of the paper.

Figure 4 (caption): A worked example for Lemma 1. Suppose the edge e of G^(1) entering node 4 is not in G^(2). Then one of the dashed edges in (b) must be in G^(2), as 4 must have an incoming edge. The edges emanating from ρ and 1 make up the set of blue edges b(G, e, G^(1)), while the edge emanating from 3 makes up the set of red edges r(G, e, G^(1)). If an edge e′ ∈ b(G, e, G^(1)) is in G^(2), as in (c), then the solid lines in (c) make a tree and G^(2) differs from G^(1) by exactly one blue edge of e. Otherwise, we know that an edge e′ ∈ r(G, e, G^(1)) is in G^(2), as in (d). However, the solid edges in (d) contain a cycle between 3 and 4 with edges e′ and f. We could break the cycle at 3 and include edge f in our tree, as in (e). However, while the solid edges in (e) make a valid tree, since w(e) > w(e′) and w(f) > w(f′), the tree given by the solid lines of (f) will have a higher weight. This would mean that e ∈ G^(2), which leads to a contradiction. Therefore, we must break the cycle at 4, which leads us to a tree as in (c). Consequently, G^(2) will differ from G^(1) by exactly one blue edge of e.

Finding the Second Best Tree
In the following two sections, we provide a simplified reformulation of Camerini et al. (1980)'s algorithm to find the K-best trees. The simplifications additionally provide a constant-time speed-up over Camerini et al. (1980)'s algorithm. We discuss the differences throughout our exposition.
The underlying concept behind finding the Kth best tree is that G^(K) is the second-best tree G′^(2) of some subgraph G′ ⊆ G. In order to explore the space of subgraphs, we introduce the concept of edge-inclusion and edge-exclusion graphs.
Definition 1 (Edge inclusion and exclusion). For any graph G and edge e ∈ G, the edge-inclusion graph G + e ⊂ G is the graph such that for any d ∈ A(G + e), e ∈ d. Similarly, the edge-exclusion graph G − e ⊂ G is the graph such that for any d ∈ A(G − e), e ∉ d.
When we discuss finding the K-best dependency trees in §5, we implicitly change the above definition to use D(G + e) and D(G − e) instead of A(G + e) and A(G − e) respectively.
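As a small illustration of Definition 1, the sketch below builds G + e and G − e by restricting the weight dictionary of the hypothetical Graph class from the earlier sketch; this is just one possible realization and not necessarily how our released code represents these graphs.

```python
def include_edge(g: Graph, e: Edge) -> Graph:
    """G + e: every tree of the returned graph must use e = (i -> j).
    We force this by deleting every other incoming edge of j."""
    _, j = e
    kept = {(a, b): wt for (a, b), wt in g.w.items() if b != j or (a, b) == e}
    return Graph(g.n, kept)

def exclude_edge(g: Graph, e: Edge) -> Graph:
    """G - e: no tree of the returned graph may use e."""
    kept = {edge: wt for edge, wt in g.w.items() if edge != e}
    return Graph(g.n, kept)
```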
In this section, we specifically focus on finding G^(2); we extend this to finding G^(K) in §4. Finding G^(2) relies on the following fundamental theorem.
Theorem 1. For any graph G, G^(2) = (G − e)^(1) for the edge e ∈ G^(1) given by e = argmax_{e′ ∈ G^(1)} w((G − e′)^(1)).

Theorem 1 states that we can find G^(2) by identifying an edge e ∈ G^(1) such that G^(2) = (G − e)^(1). We next show an efficient method for identifying this edge, as well as the weight of G^(2), without actually having to find G^(2).

Figure 5 (caption): Algorithm next for finding G^(1), the best edge e to delete to find G^(2), and w(G^(2)); the algorithm returns →G, w, and e.

Definition 2 (Blue and red edges). For any graph G, tree d ∈ A(G), and edge e = (i → j) ∈ d, the set of blue edges b(G, e, d) contains the edges (i′ → j) ∈ G, other than e, that can replace e in d without introducing a cycle, and the set of red edges r(G, e, d) contains the remaining edges (i′ → j) ∈ G, i.e., those whose inclusion in place of e would introduce a cycle.12 An example of blue and red edges is given in Fig. 4.
12 We can also define b(G, e, d) as (i′ → j) ∈ b(G, e, d) ⇔ i′ is not a descendant of j in d, and r(G, e, d) as (i′ → j) ∈ r(G, e, d) ⇔ i′ is a descendant of j in d. This equivalence exists as we can only swap the incoming edge to j in d without introducing a cycle if the new edge does not emanate from a descendant of j. The exposition using ancestors and descendants is more similar to the exposition originally presented by Camerini et al. (1980).

Lemma 1. For any graph G, if G^(1) = →G, then G^(2) is obtained from G^(1) by replacing a single edge e ∈ G^(1) with an edge e′ ∈ b(G, e, G^(1)).

Lemma 1 can be understood more clearly by following the worked example in Fig. 4. The moral of Lemma 1 is that in the base case where there are no critical cycles, we only need to examine the blue edges of the greedy graph to find the second-best tree. Furthermore, our second-best tree will only differ from our best tree by exactly one blue edge. Camerini et al. (1980) make use of the concepts of the blue and red edge sets, but rather than consider a base case as in Lemma 1, they propose an ordering in which to visit the edges of the graph. This results in several properties about the possible orderings that require much more complicated proofs.
Definition 3 (Swap cost). For any graph G, tree d ∈ A(G), and edge e ∈ d, the swap cost denotes the minimum change to the tree weight incurred by replacing e with a single edge. It is given by w_{G,d}(e) = min_{e′ ∈ b(G,e,d)} w(e) − w(e′). We will shorthand w_G(e) to mean w_{G,G^(1)}(e).
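Continuing the sketch from §2, and using the descendant-based characterization of blue edges from footnote 12, the blue-edge set and the swap cost of Definition 3 could be computed as follows; blue_edges and swap_cost are illustrative names, and the quadratic-time descendant computation is kept deliberately simple.

```python
import math
from typing import Dict, List, Set

def descendants(tree: Dict[int, int], j: int) -> Set[int]:
    """Nodes whose path to the root in `tree` (a map child -> parent) passes through j."""
    out: Set[int] = set()
    for v in tree:
        u = v
        while u in tree:
            u = tree[u]
            if u == j:
                out.add(v)
                break
    return out

def blue_edges(g: Graph, e: Edge, tree: Dict[int, int]) -> List[Edge]:
    """b(G, e, d): alternative incoming edges of j that can replace e = (i -> j) in d
    without creating a cycle, i.e. edges whose source is not a descendant of j."""
    _, j = e
    desc = descendants(tree, j)
    return [(a, b) for (a, b) in g.w if b == j and (a, b) != e and a not in desc]

def swap_cost(g: Graph, e: Edge, tree: Dict[int, int]) -> float:
    """Definition 3: the minimum of w(e) - w(e') over blue edges e'."""
    blues = blue_edges(g, e, tree)
    if not blues:
        return math.inf  # e cannot be swapped out for any edge
    return min(g.w[e] - g.w[b] for b in blues)
```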

Corollary 1. For any graph G such that G^(1) = →G, G^(2) = (G − e)^(1) and w(G^(2)) = w(G^(1)) − w_G(e), where e is given by e = argmin_{e′ ∈ G^(1)} w_G(e′).
Corollary 1 provides us with a procedure for finding the best edge to remove to find G^(2), as well as its weight, in the base case of G having no critical cycles. We next illustrate what must be done in the recursive case, when a critical cycle exists.
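Before turning to the recursive case, the base case of Corollary 1 can be realized directly with the helpers above: when the greedy graph of G is already a tree, the edge to delete and the weight of G^(2) can be read off the swap costs. The function name below is illustrative, and the sketch assumes the greedy graph is passed in as a map from each non-root node to its chosen parent.

```python
from typing import Dict, Tuple

def second_best_base_case(g: Graph, best: Dict[int, int]) -> Tuple[Edge, float]:
    """Assumes `best` is the greedy graph of g and is a tree, i.e. G^(1) = ->G.
    Returns the edge e such that G^(2) = (G - e)^(1), together with w(G^(2))."""
    w_best = sum(g.w[(i, j)] for j, i in best.items())
    # Corollary 1: choose the tree edge with the smallest swap cost.
    costs = {(i, j): swap_cost(g, (i, j), best) for j, i in best.items()}
    e = min(costs, key=costs.get)
    return e, w_best - costs[e]
```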

Lemma 2. For any graph G with a critical cycle C, either G^(2) = (G/C)^(2) ∘ C (with w(G^(2)) = w((G/C)^(2))), or G^(2) = (G − e)^(1) for some edge e of the critical cycle C that is in G^(1).

Combining Corollary 1 and Lemma 2, we can directly modify opt to find the weight of G^(2) and the edge we must remove to obtain it. We detail this algorithm, called next, in Fig. 5.
Runtime analysis. We know that without lines 5, 6, 9 and 10, next is identical to opt and so will run in O(N^2). We call w at most N + 2 times during a full call of next: N times from lines 5 and 9 combined, once from Line 6, and once from Line 10. To find w, we first need to find the set of blue edges, which can be done in O(N) by computing the reachability graph. Then, we need another O(N) to find the minimizing value. Therefore, next does O(N^2) extra work compared to opt and so retains the runtime of O(N^2). Camerini et al. (1980) require G^(1) to be known ahead of time. This results in having to run the original algorithm in O(N^2) time and then having to do the same amount of work as next because they must still contract the graph. Therefore, next has a constant-time speed-up over its counterpart in Camerini et al. (1980).

Figure 6 (caption): A walk-through of kbest on the example graph of Fig. 2. We start with G^(1), which has a weight of 260, and consider the best edge to remove to find G^(2). Using next, we find that G^(2) = (G − e)^(1) for e = (4 → 3). We then know that either e ∉ G^(3) or e ∈ G^(3). We can push these two possibilities to the queue using two calls to next. We find that G^(3) comes from the graph without e, and also removes the edge e′ = (ρ → 2). We attempt to push two new elements to the queue, but we see that only by including e′ in the graph can we find another tree. We repeat this process until we have found G^(K) or the queue is empty.

Finding the Kth Best Tree
In the previous section, we found an efficient method for finding G^(2). We now utilize this method to efficiently find the K-best trees.
Lemma 3. For any graph G and K > 1, there exists a subgraph G′ ⊆ G and 1 ≤ l < K such that G^(l) = G′^(1) and G^(K) = G′^(2).
Lemma 3 suggests that we can find the K-best trees by only examining the second best trees of subgraphs of G. This idea is formalized as algorithm kbest in Fig. 7. A walk-through of the exploration space using kbest for our example graph in Fig. 2 is shown in Fig. 6.
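The enumeration strategy of kbest can be sketched as a priority queue over subgraphs, as below. The sketch assumes a routine next_fn(G), standing in for next (Fig. 5), that returns the best tree of G, the weight of G's second-best tree, and the edge whose removal yields it; include_edge and exclude_edge are the Definition 1 constructions sketched earlier. This is a schematic rendering of Lemma 3 and the walk-through in Fig. 6, not a faithful copy of our implementation.

```python
import heapq
import itertools
import math
from typing import Dict, Iterator

def kbest(g: Graph, K: int, next_fn) -> Iterator[Dict[int, int]]:
    """Yield up to K trees of g in non-increasing weight order.

    next_fn(G) is assumed to return (best_tree, w2, e): the best tree of G, the weight
    of G's second-best tree (-math.inf if G has only one tree), and the edge e such that
    the second-best tree of G is (G - e)^(1)."""
    counter = itertools.count()  # tie-breaker so the heap never compares Graph objects
    best, w2, e = next_fn(g)
    yield best
    queue = [(-w2, next(counter), g, e)]  # max-heap via negated weights
    for _ in range(K - 1):
        if not queue:
            return
        neg_w, _, sub_g, e = heapq.heappop(queue)
        if math.isinf(neg_w):
            return  # no further trees exist anywhere in the queue
        # The next-best tree overall is the second-best tree of sub_g, i.e. (sub_g - e)^(1).
        g_excl = exclude_edge(sub_g, e)
        tree, w2_excl, e_excl = next_fn(g_excl)
        yield tree
        # Partition the remaining trees of sub_g: those that exclude e ...
        heapq.heappush(queue, (-w2_excl, next(counter), g_excl, e_excl))
        # ... and those that include e (their best is sub_g's best, already yielded).
        g_incl = include_edge(sub_g, e)
        _, w2_incl, e_incl = next_fn(g_incl)
        heapq.heappush(queue, (-w2_incl, next(counter), g_incl, e_incl))
```

Each iteration of the loop performs two calls to next_fn, which is consistent with the O(KN^2) analysis below.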

Runtime analysis.
We call next once at the start of the algorithm; then, in every subsequent iteration, we make two calls to next. As we have K − 1 iterations, the runtime of kbest is O(KN^2). The first call to next in each iteration finds the Kth best tree as well as an edge to remove. Camerini et al. (1980) make one call to opt and two calls to their version of next, which only finds the weight-edge pair of our algorithm. Therefore, kbest has a constant-time speed-up over the original algorithm.13

A short experiment. We empirically measure the constant-time speed-up between kbest and the original algorithm of Camerini et al. (1980). We take the English UD test set (as used for Fig. 1) and find the 10, 20, and 50 best spanning trees using both algorithms.14 We give the results of the experiment in Tab. 1.15 We note that on average kbest leads to a 1.39 times speed-up. This is lower than we anticipated, as we make half as many calls to next as the original algorithm. However, the original next of Camerini et al. (1980) does not need to stitch together the tree, which may explain the slightly smaller speed-up.

Table 1 (caption): Times are given in 10^-2 seconds for the average parse of the K-best spanning trees on the English UD test set (Nivre et al., 2018).

Figure 8 (caption): A walk-through of kbest_dep on the example graph of Fig. 2. We start with G^[1], which has a weight of 210, and consider the best edge to remove to find G^(2). We consider removing the best dependency tree with the same edge emanating from the root, e = (ρ → 1), using next. However, no such dependency tree exists, and so we only need to push the graph G − e. When we next pop from the queue, we see that we have removed root edge e, and so must consider removing the new root edge e′. In this case, no dependency tree exists without e and e′, and so we only push to the queue the result of running next. We repeat this process until we have found G^[K] or the queue is empty.

13 In practice, we maintain a set of edges to include and exclude to save space.
14 Implementations for both versions can be found in our code release (see footnote 1).
15 The experiment was conducted using an Intel(R) Core(TM) i7-7500U processor with 16GB RAM.

Finding the Kth Best Dependency Tree
In this section, we present a novel extension to the algorithm presented thus far that allows us to efficiently find the K-best dependency trees. Recall that we consider dependency trees to be spanning trees with a root constraint such that only one edge may emanate from ρ. Naïvely, we could use kbest where we initialize the queue with (G + e_ρ)^(1) for each e_ρ = (ρ → j) ∈ G. However, this adds an O(N^3) component to our runtime as we have to call opt N times. Instead, our algorithm maintains the same O(KN^2) runtime as the regular K-best algorithm. We begin by noting that we can find the second-best dependency tree by finding either the best dependency tree with a different root edge or the second-best tree with the same root edge.
Lemma 5. For any graph G and K > 1, if e = (ρ → j) ∈ G^[K], then either e is not in any of the (K−1)-best trees, or there exists a subgraph G′ ⊆ G and 1 ≤ l < K such that G^[l] = G′^[1], e ∈ G′^[1], and G^[K] = G′^[2].
Lemma 5 suggests that we can find the K-best dependency trees by examining the second-best dependency trees of subgraphs of G or finding the best dependency tree with a unique root edge. This idea is formalized as algorithm kbest_dep in Fig. 9. A walk-through of the exploration space using kbest_dep for our example graph in Fig. 2 is shown in Fig. 8.

Figure 9 (caption): K-best dependency tree enumeration algorithm kbest_dep. The algorithm initializes by computing G^[1] ← opt(G), yielding G^[1], setting e_ρ to the outgoing edge from ρ in G^[1], and calling ·, w, e ← next(G + e_ρ).
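To make the structure of kbest_dep concrete, here is a schematic Python realization consistent with Lemma 5, the initialization described in the Fig. 9 caption, and the walk-through in Fig. 8. It builds on the earlier sketches and assumes two oracles: opt_dep(G), standing in for the root-constrained opt, which returns the best dependency tree of G and its weight (or None if no dependency tree exists), and next_fn(G) as before. All helper names are illustrative; this is a sketch of the idea, not our released implementation.

```python
import heapq
import itertools
import math
from typing import Dict, Iterator

def kbest_dep(g: Graph, K: int, opt_dep, next_fn) -> Iterator[Dict[int, int]]:
    """Yield up to K dependency trees (trees with exactly one edge leaving the root,
    node 0) of g in non-increasing weight order. A tree is a map child -> parent."""

    def root_edge(tree: Dict[int, int]) -> Edge:
        j = next(v for v, parent in tree.items() if parent == 0)
        return (0, j)

    def restrict_root(graph: Graph, e_rho: Edge) -> Graph:
        # Keep e_rho as the only edge leaving the root: every spanning tree of the
        # result uses e_rho exactly once, so all of its trees are dependency trees.
        return Graph(graph.n, {ed: wt for ed, wt in graph.w.items()
                               if ed[0] != 0 or ed == e_rho})

    def drop_root_edge(graph: Graph, e_rho: Edge) -> Graph:
        # Remove e_rho; the other root edges stay available.
        return Graph(graph.n, {ed: wt for ed, wt in graph.w.items() if ed != e_rho})

    counter = itertools.count()
    queue = []

    def push_new_root(graph: Graph) -> None:
        # Candidate: best dependency tree whose root edge has not been enumerated yet.
        res = opt_dep(graph)
        if res is not None:
            tree, weight = res
            heapq.heappush(queue, (-weight, next(counter), "new-root", graph, tree))

    def push_same_root(graph: Graph) -> None:
        # Candidate: second-best tree of a graph restricted to a single root edge.
        _, w2, e = next_fn(graph)
        heapq.heappush(queue, (-w2, next(counter), "same-root", graph, e))

    first = opt_dep(g)
    if first is None:
        return
    d1, _ = first
    yield d1
    e_rho = root_edge(d1)
    push_new_root(drop_root_edge(g, e_rho))   # a different root edge
    push_same_root(restrict_root(g, e_rho))   # the same root edge, next tree

    for _ in range(K - 1):
        if not queue:
            return
        neg_w, _, kind, graph, payload = heapq.heappop(queue)
        if math.isinf(neg_w):
            return
        if kind == "new-root":
            tree = payload
            yield tree
            e_rho = root_edge(tree)
            push_new_root(drop_root_edge(graph, e_rho))
            push_same_root(restrict_root(graph, e_rho))
        else:  # "same-root": payload is the edge returned by next_fn
            e = payload
            g_excl = exclude_edge(graph, e)
            tree, w2_excl, e_excl = next_fn(g_excl)
            yield tree
            heapq.heappush(queue, (-w2_excl, next(counter), "same-root", g_excl, e_excl))
            g_incl = include_edge(graph, e)
            _, w2_incl, e_incl = next_fn(g_incl)
            heapq.heappush(queue, (-w2_incl, next(counter), "same-root", g_incl, e_incl))
```

This sketch computes each new-root candidate at push time; the exact bookkeeping in Fig. 9 may differ, but each iteration still costs O(N^2).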
Runtime analysis. At the start of the algorithm, we call opt twice and next once. Then, at each iteration, we either make two calls to next, or two calls to opt and one call to next. As both algorithms have a runtime of O(N^2), each iteration has a runtime of O(N^2). Therefore, running K iterations gives a runtime of O(KN^2).

Conclusion
In this paper, we provided a simplification of Camerini et al. (1980)'s O(KN^2) K-best spanning tree algorithm. Furthermore, we provided a novel extension of the algorithm that decodes the K-best dependency trees in O(KN^2). We motivated the need for this new algorithm by showing that regular K-best decoding yields up to 36% of trees violating the root constraint. This is a substantial (up to 44 times) increase in the violation rate compared to decoding the one-best tree, and thus such an algorithm is even more important than in the one-best case. We hope that this paper encourages future research in K-best dependency parsing.