Greed is Good if Randomized: New Inference for Dependency Parsing

Dependency parsing with high-order features results in a provably hard decoding problem. A lot of work has gone into developing powerful optimization methods for solving these combinatorial problems. In contrast, we explore, analyze, and demonstrate that a substantially simpler randomized greedy inference algorithm already suffices for near optimal parsing: a) we analytically quantify the number of local optima that the greedy method has to overcome in the context of first-order parsing; b) we show that, as a decoding algorithm, the greedy method surpasses dual decomposition in second-order parsing; c) we empirically demonstrate that our approach with up to third-order and global features outperforms the state-of-the-art dual decomposition and MCMC sampling methods when evaluated on 14 languages of non-projective CoNLL datasets.¹


Introduction
Dependency parsing is typically guided by parameterized scoring functions that involve rich features exerting refined control over the choice of parse trees. As a consequence, finding the highest scoring parse tree is a provably hard combinatorial inference problem (McDonald and Pereira, 2006). Much of the recent work on parsing has focused on solving these problems using powerful optimization techniques. In this paper, we follow a different strategy, arguing that a much simpler inference strategy suffices. In fact, we demonstrate that a randomized greedy method of inference surpasses the state-of-the-art performance in dependency parsing.
Our choice of a randomized greedy algorithm for parsing follows from a successful track record of such methods in other hard combinatorial problems. These conceptually simple and intuitive algorithms have delivered competitive approximations across a broad class of NP-hard problems ranging from set cover (Hochbaum, 1982) to MAX-SAT (Resende et al., 1997). Their success is predicated on the observation that most realizations of problems are much easier to solve than the worst cases. A simpler algorithm will therefore suffice in typical cases.
Evidence is accumulating that parsing problems may exhibit similar properties. For instance, methods such as dual decomposition offer certificates of optimality when the highest scoring tree is found. Across languages, dual decomposition has been shown to yield a certificate of optimality for the vast majority of sentences (Koo et al., 2010; Martins et al., 2011). These remarkable results suggest that, as a combinatorial problem, parsing appears simpler than its broader complexity class would suggest. Indeed, we show that a simpler inference algorithm already suffices for superior results.
* Both authors contributed equally.
¹ Our code is available at https://github.com/taolei87/RBGParser.
In this paper, we introduce a randomized greedy algorithm that can be easily used with any rich scoring function. Starting with an initial tree drawn uniformly at random, the algorithm makes only local myopic changes to the parse tree in an attempt to climb the objective function. While a single run of the hill-climbing algorithm may indeed get stuck in a locally optimal solution, multiple random restarts can help to overcome this problem. The same algorithm is used both for learning the parameters of the scoring function as well as for parsing test sentences.
The success of a randomized greedy algorithm is tied to the number of local maxima in the search space. When the number is small, only a few restarts will suffice for the greedy algorithm to find the highest scoring parse. We provide an algorithm for explicitly counting the number of local optima in the context of first-order parsing, and demonstrate that the number is typically quite small. Indeed, we find that a first-order parser trained with exact inference or using our randomized greedy algorithm delivers basically the same performance.
We hypothesize that parsing with high-order scoring functions exhibits similar properties. The main rationale is that, even in the presence of high-order features, the resulting scoring function remains first-order dominant. The performance of a simple arc-factored first-order parser is only a few percentage points behind higher-order parsers. The higher-order features in the scoring function offer additional refinement but only a few changes above and beyond the first-order result. As a consequence, most of the arc choices are already determined by a much simpler, polynomial time parser.
We use dual decomposition to show that the greedy method indeed succeeds as an inference algorithm even with higher-order scoring functions. In fact, with second-order features, regardless of which method was used for training, the randomized greedy method outperforms dual decomposition by finding higher scoring trees. For the sentences on which dual decomposition is optimal (obtains a certificate), the greedy method finds the same solution in over 99% of the cases. Our simple inference algorithm is therefore likely to scale to higher-order parsing, and we demonstrate empirically that this is indeed so.
We validate our claim by evaluating the method on the CoNLL dependency benchmark that comprises treebanks from 14 languages.
Averaged across all languages, our method outperforms state-of-the-art parsers, including TurboParser (Martins et al., 2013) and our earlier sampling-based parser. On seven languages, we report the best published results. The method is not sensitive to initialization. In fact, drawing the initial tree uniformly at random results in the same performance as when initialized from a trained first-order distribution. In contrast, sufficient randomization of the starting point is critical. Only a small number of restarts suffices for finding (near) optimal parse trees.

Related Work
Finding Optimal Structure in Parsing The use of rich scoring functions in dependency parsing inevitably leads to the challenging combinatorial problem of finding the maximizing parse. In fact, McDonald and Pereira (2006) demonstrated that the task is provably NP-hard for non-projective second-order parsing. Not surprisingly, approximate inference has been at the center of parsing research. Examples of these approaches include easy-first parsing (Goldberg and Elhadad, 2010), inexact search (Johansson and Nugues, 2007; Zhang and Clark, 2008; Huang et al., 2012; Zhang et al., 2013), partial dynamic programming (Huang and Sagae, 2010) and dual decomposition (Koo et al., 2010; Martins et al., 2011).
Our work is most closely related to the MCMC sampling-based approaches (Nakagawa, 2007). In our earlier work, we developed a method that learns to take guided stochastic steps towards a high-scoring parse. At the heart of that technique are sophisticated samplers for traversing the space of trees. In this paper, we demonstrate that a substantially simpler approach that starts from a tree drawn from the uniform distribution and uses hill-climbing for parameter updates achieves similar or higher performance. Another related greedy inference method has been used for non-projective dependency parsing (McDonald and Pereira, 2006). This method relies on hill-climbing to convert the highest scoring projective tree into its non-projective approximation. Our experiments demonstrate that when hill-climbing is employed as a primary learning mechanism for high-order parsing, it exhibits different properties: the distribution for initialization does not play a major role in the final outcome, while the use of restarts contributes significantly to the quality of the resulting tree.

Greedy Approximations for NP-hard Problems
There is an expansive body of research on greedy approximations for NP-hard problems. Examples of NP-hard problems with successful greedy approximations include the traveling salesman problem (Held and Karp, 1970; Rego et al., 2011), the MAX-SAT problem (Mitchell et al., 1992; Resende et al., 1997) and vertex cover (Hochbaum, 1982). While some greedy methods have poor worst-case complexity, many of them work remarkably well in practice. Despite the apparent simplicity of these algorithms, understanding their properties is challenging: often their "theoretical analyses are negative and inconclusive" (Amenta and Ziegler, 1999; Spielman and Teng, 2001). Identifying conditions under which approximations are provably optimal is an active area of research in computer science theory (Dumitrescu and Tóth, 2013; Jonsson et al., 2013).
In NLP, randomized and greedy approximations have been successfully used across multiple applications, including machine translation and language modeling (Brown et al., 1993;Ravi and Knight, 2010;Daumé III et al., 2009;Moore and Quirk, 2008;Deoras et al., 2011). In this paper, we study the properties of these approximations in the context of dependency parsing.

Preliminaries
Let x be a sentence and T (x) be the set of possible dependency trees over the words in x. We use y ∈ T (x) to denote a dependency tree for x, and y(m) to specify the head (parent) of the modifier word indexed by m in tree y. We also use m to denote the indexed word when there is no ambiguity. In addition, we define T (y, m) as the set of "neighboring trees" of y obtained by changing only the head of the modifier, i.e. y(m).
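As a concrete illustration of this notation, the following minimal Python sketch stores a tree y as a head array and enumerates the neighboring trees T(y, m). The array layout (index 0 as an artificial root with `heads[0] = -1`), the function names, and the choice to count y itself as a member of T(y, m) are assumptions of this sketch, not details taken from the paper.

```python
from typing import Iterator, List

ROOT = 0  # index of the artificial root node (assumption of this sketch)


def is_tree(heads: List[int]) -> bool:
    """heads[m] is the head of word m (1..n); heads[0] is a placeholder.

    Returns True if every word reaches ROOT by following head links,
    i.e. the structure contains no cycle.
    """
    for m in range(1, len(heads)):
        seen = set()
        node = m
        while node != ROOT:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


def neighbors(heads: List[int], m: int) -> Iterator[List[int]]:
    """Enumerate T(y, m): valid trees that differ from y at most in the head of m."""
    for h in range(len(heads)):
        if h == m:
            continue  # a word cannot be its own head
        cand = list(heads)
        cand[m] = h
        if is_tree(cand):
            yield cand


if __name__ == "__main__":
    y = [-1, 2, 0, 2]  # word 2 attaches to the root; words 1 and 3 attach to word 2
    print(list(neighbors(y, 1)))
```

Under this layout, a head change is rejected exactly when the new head is a descendant of m, which is the tree constraint that the greedy steps must maintain.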
The dependency trees are scored according to S(x, y) = θ · φ(x, y), where θ is a vector of parameters and φ(x, y) is a sparse feature vector representation of tree y for sentence x. In this work, φ(x, y) will include up to third-order features as well as a range of global features commonly used in re-ranking methods (Collins, 2000;Charniak and Johnson, 2005;Huang, 2008).
The parameters θ in the scoring function are estimated on the basis of a training set $D = \{(\hat{x}_i, \hat{y}_i)\}_{i=1}^{N}$ of sentences $\hat{x}_i$ and the corresponding gold (target) trees $\hat{y}_i$. We adopt a max-margin framework for this learning problem. Specifically, we aim to find parameter values that score the gold target trees higher than others:
$$\forall i \in \{1, \dots, N\},\ y \in T(\hat{x}_i): \quad S(\hat{x}_i, \hat{y}_i) \ge S(\hat{x}_i, y) + \|\hat{y}_i - y\|_1 - \xi_i,$$
where $\xi_i \ge 0$ is the slack variable (non-zero values are penalized) and $\|\hat{y}_i - y\|_1$ is the Hamming distance between the gold tree $\hat{y}_i$ and a candidate parse y.
In an online learning setup, parameters are updated successively after each sentence. Each update still requires us to find the "strongest violation", i.e., a candidate tree $\tilde{y}$ that scores higher than the gold tree $\hat{y}_i$:
$$\tilde{y} = \arg\max_{y \in T(\hat{x}_i)} \left\{ S(\hat{x}_i, y) + \|\hat{y}_i - y\|_1 \right\}.$$
The parameters are then revised so as to select against the offending $\tilde{y}$. Instead of a standard parameter update based on $\tilde{y}$ as in perceptron, stochastic gradient descent, or passive-aggressive updates, our implementation follows our tensor-based parser, where the first-order parameters are broken up into a tensor. Each tensor component is updated successively in combination with the parameters corresponding to MST features (McDonald et al., 2005) and higher-order features (when included).
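To make the margin constraint concrete, here is a minimal sketch of the cost-augmented search over an explicit candidate list. It assumes an arc-factored score matrix `scores[h][m]` as a stand-in for S(x, y) and a hand-written list of candidate trees; both are simplifications introduced for illustration, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def hamming(gold: Sequence[int], pred: Sequence[int]) -> int:
    """Number of words whose head differs (index 0 is the root slot and is skipped)."""
    return sum(1 for m in range(1, len(gold)) if gold[m] != pred[m])


def arc_score(scores: Sequence[Sequence[float]], heads: Sequence[int]) -> float:
    """First-order stand-in for S(x, y): scores[h][m] scores the arc h -> m."""
    return sum(scores[heads[m]][m] for m in range(1, len(heads)))


def strongest_violation(
    scores: Sequence[Sequence[float]],
    gold: Sequence[int],
    candidates: List[List[int]],
) -> Tuple[List[int], float]:
    """Return the candidate maximizing S(x, y) + Hamming(gold, y) and the
    slack needed to satisfy the margin constraint against that candidate."""
    best = max(candidates, key=lambda y: arc_score(scores, y) + hamming(gold, y))
    slack = arc_score(scores, best) + hamming(gold, best) - arc_score(scores, gold)
    return best, max(0.0, slack)


if __name__ == "__main__":
    # Two words plus root; scores[h][m] for h, m in {0, 1, 2}.
    scores = [[0, 1, 2], [0, 0, 4], [0, 5, 0]]
    gold = [-1, 0, 0]
    trees = [[-1, 0, 0], [-1, 2, 0], [-1, 0, 1]]
    print(strongest_violation(scores, gold, trees))
```

In the paper's setting the candidate set is all of T(x), so the maximization cannot be done by enumeration; that search is the decoding problem the randomized greedy algorithm addresses.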

Algorithm
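During training and testing, the key combinatorial problem we must solve is that of decoding, i.e., finding the highest scoring tree $\tilde{y} \in T(x)$ for each sentence x (or $\hat{x}_i$). In our notation,
$$\tilde{y} = \arg\max_{y \in T(\hat{x}_i)} \left\{ S(\hat{x}_i, y) + \|\hat{y}_i - y\|_1 \right\} \ \text{(train)}, \qquad \tilde{y} = \arg\max_{y \in T(x)} S(x, y) \ \text{(test)}.$$
While the decoding problem with feature sets similar to ours has been shown to be NP-hard, many approximation algorithms work remarkably well. We commence with a motivating example.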
Locality and Parsing One possible reason why greedy or other approximation algorithms work well for dependency parsing is that typical sentences, and therefore the learned scoring functions S(x, y) = θ · φ(x, y), are primarily "local". By this we mean that head-modifier decisions could be made largely without considering the surrounding structure (the context). For example, in English an adjective and a determiner are typically attached to the following noun. We demonstrate the degree of locality in dependency parsing by comparing a first-order tree-based parser to the parser that predicts each head word independently of others. Note that the independent prediction of dependency arcs does not necessarily give rise to a tree. The parameters of the two parsers, the independent prediction and a tree-based parser, are trained separately with the corresponding decoding algorithm but with the same feature set. Table 1 shows that the accuracy of the independent prediction ranges from 79% to 93% on four CoNLL datasets. The results are on par with the first-order structured prediction model. This experiment reinforces the conclusion in Liang et al. (2008), where a local classifier was shown to achieve comparable accuracy to a sequential model (e.g. CRF) in POS tagging and named-entity recognition.

Figure 1: A randomized hill-climbing algorithm for dependency parsing.
Input: parameter θ, sentence x
Output: dependency tree $\tilde{y}$
Randomly initialize tree $y^{(0)}$; t = 0;
repeat
  list = bottom-up node list of $y^{(t)}$;
  for each word m in list do
    $y^{(t+1)} = \arg\max_{y \in T(y^{(t)}, m)} S(x, y)$; t = t + 1;
  end for
until no change in this iteration
return $\tilde{y} = y^{(t)}$;
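The independent head prediction used in the locality experiment can be sketched in a few lines. The score matrix `scores[h][m]` and the helper names are assumptions made for illustration; the point of the example is only that choosing each head by a separate argmax may produce a cycle instead of a tree.

```python
from typing import List, Sequence


def independent_heads(scores: Sequence[Sequence[float]]) -> List[int]:
    """Pick, for every word m, the head h != m with the highest arc score.

    scores[h][m] is the score of arc h -> m; index 0 is the root.
    """
    n = len(scores) - 1
    heads = [-1]  # placeholder for the root
    for m in range(1, n + 1):
        heads.append(max((h for h in range(n + 1) if h != m),
                         key=lambda h: scores[h][m]))
    return heads


def is_tree(heads: Sequence[int]) -> bool:
    """True if every word reaches the root (index 0) without revisiting a node."""
    for m in range(1, len(heads)):
        seen = set()
        node = m
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


if __name__ == "__main__":
    cyclic = [[0, 1, 2], [0, 0, 4], [0, 5, 0]]   # words 1 and 2 prefer each other
    acyclic = [[0, 3, 3], [0, 0, 1], [0, 1, 0]]  # both words prefer the root
    for scores in (cyclic, acyclic):
        heads = independent_heads(scores)
        print(heads, "tree" if is_tree(heads) else "not a tree")
```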

Hill-Climbing with Random Restarts
We build here on the motivating example and explore greedy algorithms as generalizations of purely local decoding. Greedy algorithms break the decoding problem into a sequence of simple local steps, each required to improve the solution. In our case, simple local steps correspond to choosing the head for each modifier word.
We begin with a tree $y^{(0)}$, which can be a sample drawn uniformly from T(x) (Wilson, 1996). Our greedy algorithm then updates $y^{(t)}$ to a better tree $y^{(t+1)}$ by revising the head of one modifier word while maintaining the constraint that the resulting structure is a tree. The modifiers are considered in the bottom-up order relative to the current tree (the word furthest from the root is considered first). We provide an analysis to motivate this bottom-up update strategy in Section 4.1. The algorithm continues until the score can no longer be improved by changing the head of a single word. The resulting tree represents a locally optimal prediction relative to a single-arc greedy algorithm. Figure 1 gives the algorithm in pseudo-code.
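A minimal Python sketch of a single hill-climbing run is given below. It treats the scoring function as an arbitrary callable `score(heads)`, so it is not tied to any particular feature set; the head-array layout, the helper names, and the toy first-order scores in the usage example are assumptions of this sketch rather than the released implementation.

```python
from typing import Callable, List


def is_tree(heads: List[int]) -> bool:
    """True if every word reaches the root (index 0) without a cycle."""
    for m in range(1, len(heads)):
        seen = set()
        node = m
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


def depth(heads: List[int], m: int) -> int:
    """Number of arcs between word m and the root (heads must be a tree)."""
    d = 0
    while m != 0:
        m = heads[m]
        d += 1
    return d


def hill_climb(heads: List[int], score: Callable[[List[int]], float]) -> List[int]:
    """Greedy single-arc hill-climbing from the given starting tree."""
    heads = list(heads)
    n = len(heads) - 1
    changed = True
    while changed:
        changed = False
        # Bottom-up order: the word furthest from the root comes first.
        order = sorted(range(1, n + 1), key=lambda m: -depth(heads, m))
        for m in order:
            best_h, best_s = heads[m], score(heads)
            for h in range(n + 1):
                if h == m or h == heads[m]:
                    continue
                cand = list(heads)
                cand[m] = h
                if is_tree(cand) and score(cand) > best_s:
                    best_h, best_s = h, score(cand)
            if best_h != heads[m]:
                heads[m] = best_h
                changed = True
    return heads


if __name__ == "__main__":
    arc = [[0, 1, 2], [0, 0, 4], [0, 5, 0]]  # toy first-order scores arc[h][m]
    first_order = lambda y: sum(arc[y[m]][m] for m in range(1, len(y)))
    print(hill_climb([-1, 0, 0], first_order))
```

Because every accepted change strictly increases the score and the number of trees is finite, the loop terminates at a tree that no single head change can improve.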
There are many possible variations of the simple randomized greedy hill-climbing algorithm. First, the Wilson sampling algorithm (Wilson, 1996) can be naturally extended to obtain i.i.d. samples from any first-order distribution. Therefore, we could initialize the tree $y^{(0)}$ with a tree from a first-order parser, or draw the initial tree from a first-order distribution other than uniform. However, perhaps surprisingly, as we demonstrate later, little is lost with uniform initialization. Second, since a single run of randomized hill-climbing is relatively cheap and runs are independent of each other, it is easy to execute multiple runs independently in parallel. The final predicted tree is then simply the highest scoring tree across the multiple runs. We demonstrate that only a small number of parallel runs are necessary for near optimal prediction.
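The uniform initialization and the best-of-several-runs selection can be sketched as follows. The sampler is a textbook version of Wilson's loop-erased random walk specialized to the complete graph over the words plus a root; `best_of_restarts` accepts any local search routine through the hypothetical `climb` parameter and runs the restarts sequentially rather than in parallel. All names here are assumptions of this sketch.

```python
import random
from typing import Callable, List


def is_tree(heads: List[int]) -> bool:
    """True if every word reaches the root (index 0) without a cycle."""
    for m in range(1, len(heads)):
        seen = set()
        node = m
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


def sample_uniform_tree(n: int, rng: random.Random) -> List[int]:
    """Draw a dependency tree over n words uniformly at random (Wilson's algorithm)."""
    heads = [-1] * (n + 1)
    in_tree = [False] * (n + 1)
    in_tree[0] = True  # the root is the initial tree
    for start in range(1, n + 1):
        # Random walk until the current tree is hit; overwriting heads[u]
        # on every revisit erases loops implicitly.
        u = start
        while not in_tree[u]:
            heads[u] = rng.choice([v for v in range(n + 1) if v != u])
            u = heads[u]
        # Add the loop-erased path to the tree.
        u = start
        while not in_tree[u]:
            in_tree[u] = True
            u = heads[u]
    return heads


def best_of_restarts(
    n: int,
    score: Callable[[List[int]], float],
    climb: Callable[[List[int]], List[int]],
    restarts: int,
    rng: random.Random,
) -> List[int]:
    """Run `climb` from `restarts` random trees and keep the highest scoring result."""
    best, best_score = None, float("-inf")
    for _ in range(restarts):
        cand = climb(sample_uniform_tree(n, rng))
        s = score(cand)
        if s > best_score:
            best, best_score = cand, s
    return best


if __name__ == "__main__":
    print(sample_uniform_tree(5, random.Random(0)))
```

Sampling from a non-uniform first-order distribution would replace `rng.choice` with a draw proportional to the arc weights.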

First-Order Parsing
We provide here a firmer basis for why the randomized greedy algorithm can be expected to work. While the focus of the rest of the paper is on higher-order parsing, we limit ourselves in this subsection to first-order parsing. The reasons for this are threefold. First, a simple greedy algorithm is already not guaranteed a priori to work in the context of a first-order scoring function. The conclusions from this analysis are therefore likely to carry over to higher-order parsing scenarios as well. Second, a first-order arc-factored scoring provides us an easy way to ascertain when the randomized greedy algorithm indeed found the highest scoring tree. Finally, we are able to count the number of locally optimal solutions for a greedy algorithm in the context of first-order parsing and can therefore relate this property to the success rates of the algorithm.

Reachability
We begin by highlighting a basic property of trees, namely that single arc changes suffice for transforming any tree to any other tree in a small number of steps while maintaining that each intermediate structure is also a tree. In this sense, a target tree is reachable from any starting point using only single arc changes. More formally, let y be any starting tree and y' the desired target. Let $m_1, m_2, \dots, m_n$ be the bottom-up list of words (modifiers) corresponding to tree y, where $m_1$ is the word furthest from the root. We can simply change each head $y(m_i)$ to $y'(m_i)$ in this order i = 1, ..., n. The bottom-up order guarantees that no cycle is introduced with respect to the remaining (yet unmodified) nodes of y. The fact that y' is a valid tree implies no cycle will appear with respect to the already modified nodes. Note that, according to this property, any tree is reachable from any starting point using only k modifications, where k is the number of head differences, i.e. $k = |\{m : y(m) \ne y'(m)\}|$. The result also suggests that it may be helpful to perform the greedy steps in the bottom-up order, a suggestion that we follow in our implementation.
Broadly speaking, we have established that the greedy algorithm is not inherently limited by virtue of its basic steps. Of course, it is a different question whether the scoring function supports such local changes towards the correct target tree.
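The reachability argument can be checked mechanically on small examples. The sketch below rewrites the heads of a starting tree in bottom-up order and records every intermediate structure so that the tree property can be verified after each step; the head-array layout and the function names are assumptions of this illustration.

```python
from typing import List


def is_tree(heads: List[int]) -> bool:
    """True if every word reaches the root (index 0) without a cycle."""
    for m in range(1, len(heads)):
        seen = set()
        node = m
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


def depth(heads: List[int], m: int) -> int:
    """Number of arcs between word m and the root (heads must be a tree)."""
    d = 0
    while m != 0:
        m = heads[m]
        d += 1
    return d


def transform(start: List[int], target: List[int]) -> List[List[int]]:
    """Change one head at a time, deepest word of `start` first.

    Returns the list of intermediate structures; its length equals the
    number of words whose heads differ between the two trees.
    """
    order = sorted(range(1, len(start)), key=lambda m: -depth(start, m))
    current = list(start)
    steps = []
    for m in order:
        if current[m] != target[m]:
            current[m] = target[m]
            steps.append(list(current))
    return steps


if __name__ == "__main__":
    y = [-1, 0, 1, 2]        # chain: root -> 1 -> 2 -> 3
    y_prime = [-1, 2, 3, 0]  # chain: root -> 3 -> 2 -> 1
    for step in transform(y, y_prime):
        print(step, is_tree(step))
```

In this example all three heads differ, so three single-arc changes are needed, and each intermediate structure remains a valid tree.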
Locally Optimal Trees While greedy algorithms are notoriously prone to getting stuck in locally optimal solutions, we establish here that decoding with learned scoring functions involves only a small number of local optima. In our case, a local optimum corresponds to a tree y where no single change of head y(m) results in a higher scoring tree. Clearly, the highest scoring tree is also a local optimum in this sense. If there were many such local optima, finding the one with the highest score would be challenging for a greedy algorithm, even with randomization.

Figure 2: A recursive algorithm for counting local optima for a sentence with words $w_1, \dots, w_n$ (first-order parsing). The algorithm resembles the Chu-Liu-Edmonds algorithm for finding the maximum directed spanning tree (Chu and Liu, 1965).
Function CountOptima(G = (V, E))
$V = \{w_0, w_1, \dots, w_n\}$, where $w_0$ is the root; $E = \{e_{ij}\}$ are the arc scores
Return: the number of local optima
1: Let $y(0) = \emptyset$ and $y(i) = \arg\max_j e_{ji}$;
2: if y is a tree (no cycle) then return 1;
3: Find a cycle $C \subset V$ in y;
4: count = 0;
// contract the cycle
5: create a vertex $w_*$;
6: $\forall j \notin C: e_{*j} = \max_{k \in C} e_{kj}$;
7: for each vertex $w_i \in C$ do
8:   $\forall j \notin C: e_{j*} = e_{ji}$;
9:   $V' = V \cup \{w_*\} \setminus C$;
10:  $E' = E \cup \{e_{*j}, e_{j*} \mid \forall j \notin C\}$;
11:  count += CountOptima(G' = (V', E'));
12: end for
13: return count;

Table 3: Decoding quality comparison between hill-climbing (HC) and dual decomposition (DD). Models are trained either with HC (left) or DD (right). $s_{HC}$ denotes the score of the tree retrieved by HC and $s_{DD}$ gives the analogous score for DD. The columns show the percentage of all test sentences for which one method succeeds in finding a higher or the same score. "Cert" column gives the percentage of sentences for which DD finds a certificate. Columns: Dataset; then, for models trained with HC and with DD respectively: %Cert (DD), $s_{DD} > s_{HC}$, $s_{DD} = s_{HC}$, $s_{DD} < s_{HC}$.

We begin with a worst case analysis and establish a tight upper bound on the number of local optima for a first-order scoring function.
Theorem 1 For any first-order scoring function that factorizes into the sum of arc scores $S(x, y) = \sum_m S_{arc}(y(m), m)$: (a) the number of locally optimal trees is at most $2^{n-1}$ for n words; (b) this upper bound is tight.³
While the number of possible dependency trees is $(n + 1)^{n-1}$ (Cayley's formula), the number of local optima is at most $2^{n-1}$. This is still too many for longer sentences, suggesting that, in the worst case, a randomized greedy algorithm is unlikely to find the highest scoring tree. However, the scoring functions we learn for dependency parsing are considerably easier.
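To give a sense of scale, the two counts in this comparison can be evaluated directly. The sentence lengths used below are illustrative choices, and the function names are hypothetical.

```python
def num_dependency_trees(n: int) -> int:
    """Cayley's formula: number of dependency trees over n words plus a root."""
    return (n + 1) ** (n - 1)


def max_local_optima(n: int) -> int:
    """Worst-case number of local optima for a first-order score (Theorem 1)."""
    return 2 ** (n - 1)


if __name__ == "__main__":
    for n in (5, 10, 24):
        print(f"n={n}: trees={num_dependency_trees(n):.3e}, "
              f"worst-case local optima={max_local_optima(n)}")
```

The bound grows exponentially but far more slowly than the number of trees, which is why the worst case is still impractical for long sentences while remaining a small fraction of the search space.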
Average Case Analysis In contrast to the worst-case analysis above, we will count here the actual number of local optima per sentence for a first-order scoring function learned from data with the randomized greedy algorithm. Figure 2 provides pseudo-code for our counting algorithm. The algorithm is derived by tailoring the proof of Theorem 1 to each sentence. Table 2 shows the empirical number of locally optimal trees estimated by our algorithm across 4 different languages. Decoding with trained scoring functions in the average case is clearly substantially easier than the worst case. For example, on the English test set more than 70% of the sentences have at most 121 locally optimal trees. Since the average sentence length is 24, the discrepancy between the typical number (e.g., 121) and the worst case ($2^{24-1}$) is substantial. As a result, only a small number of restarts is likely to suffice for finding optimal trees in practice.
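The counting procedure of Figure 2 can be sketched in Python as below, together with a brute-force enumerator that is only feasible for very short sentences and serves as a sanity check. The dictionary layout `score[(head, modifier)]`, the tuple used as the contracted vertex, and the assumption that no two candidate heads of a word have equal scores are choices made for this sketch.

```python
from itertools import product
from typing import Dict, Hashable, List, Optional, Tuple

Arc = Tuple[Hashable, Hashable]  # (head, modifier)


def find_cycle(head: Dict[Hashable, Hashable], root: Hashable) -> Optional[List[Hashable]]:
    """Return the nodes of a cycle in the head map, or None if it is a tree."""
    for start in head:
        path: List[Hashable] = []
        node = start
        while node != root and node not in path:
            path.append(node)
            node = head[node]
        if node != root:
            return path[path.index(node):]
    return None


def count_optima(nodes: List[Hashable], score: Dict[Arc, float], root: Hashable = 0) -> int:
    """Recursive count of local optima following Figure 2 (assumes no score ties)."""
    best = {m: max((h for h in nodes if h != m), key=lambda h: score[(h, m)])
            for m in nodes if m != root}
    cycle = find_cycle(best, root)
    if cycle is None:
        return 1
    outside = [v for v in nodes if v not in cycle]
    star = ("contracted", tuple(cycle))  # fresh vertex standing for the cycle
    total = 0
    for w_i in cycle:  # break the cycle at w_i and contract the rest
        new_score = {(h, m): s for (h, m), s in score.items()
                     if h in outside and m in outside}
        for j in outside:
            if j != root:
                new_score[(star, j)] = max(score[(k, j)] for k in cycle)
            new_score[(j, star)] = score[(j, w_i)]
        total += count_optima(outside + [star], new_score, root)
    return total


def brute_force_count(n: int, score: Dict[Arc, float]) -> int:
    """Count trees over words 1..n that no valid single head change improves."""
    def is_tree(heads: List[int]) -> bool:
        for m in range(1, n + 1):
            seen, node = set(), m
            while node != 0:
                if node in seen:
                    return False
                seen.add(node)
                node = heads[node]
        return True

    count = 0
    for combo in product(range(n + 1), repeat=n):
        heads = [-1, *combo]
        if not is_tree(heads):
            continue
        improvable = False
        for m in range(1, n + 1):
            for h in range(n + 1):
                if h != m and score[(h, m)] > score[(heads[m], m)]:
                    cand = list(heads)
                    cand[m] = h
                    improvable = improvable or is_tree(cand)
        if not improvable:
            count += 1
    return count


if __name__ == "__main__":
    score = {(0, 1): 1, (2, 1): 5, (3, 1): 2,
             (0, 2): 2, (1, 2): 4, (3, 2): 1,
             (0, 3): 1, (1, 3): 3, (2, 3): 6}
    print(count_optima([0, 1, 2, 3], score), brute_force_count(3, score))
```

On this three-word example the independent head choices form the cycle {1, 2}; breaking it at word 1 leads to a second cycle in the contracted graph, while breaking it at word 2 yields a tree directly.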
Optimal Decoding We can easily verify whether the randomized greedy algorithm indeed succeeds in finding the highest scoring trees with a learned first-order scoring function. We have established above that there are typically only a small number of locally optimal trees. We would therefore expect the algorithm to work. We show the results in the second part of Table 2. For short sentences of length up to 15, our method finds the global optimum for all the test sentences. Success rates remain high even for longer test sentences.
³ A proof sketch is given in the Appendix.

Higher-Order Parsing
Exact decoding with high-order features is known to be provably hard (McDonald and Pereira, 2006). We begin our analysis here with a second-order (sibling/grandparent) model, and compare our randomized hill-climbing (HC) method to dual decomposition (DD), re-implementing Koo et al. (2010). Table 3 compares decoding quality for the two methods across four languages. Overall, in 97.8% of the sentences, HC obtains the same score as DD, in 1.3% of the cases HC finds a higher scoring tree, and in 0.9% of cases DD results in a better tree. The results follow the same pattern regardless of which method was used to train the scoring function. The average rate of certificates for DD was 92%. In over 99% of these sentences, HC reaches the same optimum.
We expect that these observations about the success of HC carry over to other high-order parsing models for several reasons. First, a large number of arcs are pruned in the initial stage, considerably reducing the search space and minimizing the number of possible locally optimal trees. Second, many dependencies can be determined already with independent arc prediction (see our motivating example above), predictions that are readily achieved with a greedy algorithm. Finally, high-order features represent smaller refinements, i.e., suggest only a few changes above and beyond the dominant first-order scores. Greedy algorithms are therefore likely to be able to leverage at least some of this potential. We demonstrate below that this is indeed so.
Our methods are trained within the max-margin framework. As a result, we are expected to find the highest scoring competing tree for each training sentence (the "strongest violation"). One may question therefore whether possible sub-optimal decoding for some training sentences (finding "a violation" rather than the "strongest violation") impacts the learned parser. To this end, Huang et al. (2012) have established that weaker violations do suffice for separable training sets.

Experimental Setup
Dataset and Evaluation Measures We evaluate our model on CoNLL dependency treebanks for 14 different languages (Buchholz and Marsi, 2006; Surdeanu et al., 2008), using standard training and testing splits. We use part-of-speech tags and the morphological information provided in the corpus. Following standard practice, we use Unlabeled Attachment Score (UAS) excluding punctuation (Koo et al., 2010; Martins et al., 2013) as the evaluation metric in all our experiments.
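For reference, a minimal sketch of UAS with punctuation excluded is shown below. The punctuation test (every character of the token belongs to a Unicode punctuation category) is an assumption of this sketch; the official CoNLL evaluation may define punctuation tokens differently.

```python
import unicodedata
from typing import Sequence


def is_punctuation(token: str) -> bool:
    """Assumed rule: a token is punctuation if all its characters are Unicode punctuation."""
    return bool(token) and all(unicodedata.category(ch).startswith("P") for ch in token)


def uas(gold_heads: Sequence[int], pred_heads: Sequence[int], tokens: Sequence[str]) -> float:
    """Fraction of non-punctuation tokens whose predicted head matches the gold head.

    The three sequences are aligned per word (no root slot).
    """
    scored = [i for i, tok in enumerate(tokens) if not is_punctuation(tok)]
    if not scored:
        return 0.0
    correct = sum(1 for i in scored if gold_heads[i] == pred_heads[i])
    return correct / len(scored)


if __name__ == "__main__":
    tokens = ["The", "cat", "sleeps", "."]
    gold = [2, 3, 0, 3]
    print(uas(gold, [3, 3, 0, 3], tokens))  # one of three scored heads is wrong
```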
Baselines We compare our model with the TurboParser (Martins et al., 2013) and our earlier sampling-based parser. For both parsers, we directly compare with the recent published results on the CoNLL datasets. We also compare our parser against the best published results for the individual languages in our datasets. This comparison set includes four additional parsers: Martins et al. (2011), Koo et al. (2010), Zhang et al. (2013) and our tensor-based parser.
Features We use the same feature templates as in our prior work (Lei et al., 2014).⁴ Figure 3 shows the first- to third-order feature templates that we use in our model. For the global features we use right-branching, coordination, PP attachment, span length, neighbors, valency and non-projective arcs features.
Implementation Details Following standard practices, we train our model using the passive-aggressive online learning algorithm (MIRA) and parameter averaging (Crammer et al., 2006; Collins, 2002). By default we use an adaptive strategy for running the hill-climbing algorithm: for a given sentence we repeatedly run the algorithm in parallel until the best tree does not change for K = 300 consecutive restarts. For each restart, by default we initialize the tree $y^{(0)}$ by sampling from the first-order distribution using the current learned parameter values (and first-order scores). We train our first-order and third-order model for 10 epochs and our full model for 20 epochs for all languages, and report the average performance across three independent runs.
⁴ We refer the readers to our prior work for the detailed definition of each feature template.
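The adaptive stopping rule can be sketched as a small driver loop. This version is sequential rather than parallel, and `sample_init`, `climb`, `score` and the `max_restarts` safety cap are hypothetical parameters introduced for illustration; only the rule of stopping after K restarts without improvement comes from the text.

```python
from typing import Callable, Tuple, TypeVar

Tree = TypeVar("Tree")


def adaptive_restarts(
    sample_init: Callable[[], Tree],
    climb: Callable[[Tree], Tree],
    score: Callable[[Tree], float],
    patience: int = 300,
    max_restarts: int = 100_000,
) -> Tuple[Tree, int]:
    """Restart hill-climbing until the best tree is unchanged for `patience` restarts.

    Returns the best tree found and the number of restarts that were run.
    """
    best, best_score = None, float("-inf")
    stale = 0      # consecutive restarts without improvement
    restarts = 0
    while stale < patience and restarts < max_restarts:
        candidate = climb(sample_init())
        s = score(candidate)
        restarts += 1
        if s > best_score:
            best, best_score = candidate, s
            stale = 0
        else:
            stale += 1
    return best, restarts


if __name__ == "__main__":
    # Toy stand-ins: "trees" are numbers, climbing is the identity.
    stream = iter([1, 3, 2, 3, 5, 9])
    print(adaptive_restarts(lambda: next(stream), lambda y: y, lambda y: y, patience=2))
```

With a patience of 2 the toy run stops after four restarts, before the larger values later in the stream are seen, which illustrates the trade-off that K controls.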

Results
Comparison with the Baselines Table 4 summarizes the results of our model, along with the state-of-the-art baselines. On average across 14 languages, our full model with the tensor component outperforms both TurboParser and the sampling-based parser. The direct comparison with TurboParser is achieved by restricting our model to third-order features, which still outperforms TurboParser (89.10% vs 88.72%). To compare against the sampling-based parser, we employ our model without the tensor component. The two models achieve a similar average performance (89.24% and 89.23% respectively). Since relative parsing performance depends on a target language, we also include comparison with the best published results. The model achieves the best published results for seven languages.

Table 4: [...] Koo et al. (2010), Zhang et al. (2013), [...]. For the third-order model, we use the feature set of TurboParser (Martins et al., 2013). The full model combines features of our sampling-based parser and tensor features.
Table 5: Comparison between different initialization strategies: (a) MAP-1st: only the MAP tree of the first-order score; (b) Uniform: random trees are sampled from the uniform distribution; and (c) Rnd-1st: random trees are sampled from the first-order distribution. For each method, the table shows the average accuracy of the initial tree and the final parsing accuracy.
Another noteworthy comparison concerns first-order parsers. As Table 4 shows, the exact and approximate versions of the first-order parser deliver almost identical performance.
Table 4 shows that the model can effectively utilize high-order features. Comparing the average performance of the model variants, we see that the accuracy on the benchmark languages consistently improves when higher-order features are added. This characteristic of the randomized greedy parser is in line with findings about other state-of-the-art high-order parsers (Martins et al., 2013). Figure 4 breaks down these gains based on the sentence length. As expected, on most languages high-order features are particularly helpful when parsing longer sentences.
Table 5 shows the impact of initialization on the model performance for several languages. We consider three strategies: the MAP estimate of the first-order score from the model, uniform sampling and sampling from the first-order distribution. The accuracy of initial trees varies greatly, ranging from 78.4% for the MAP estimate to 25.9% and 44.5% for the latter randomized strategies. However, the resulting parsing accuracy is not determined by the initial accuracy. In fact, the two sampling strategies result in almost identical parsing performance. While the first-order MAP estimate gives the best initial guess, the overall parsing accuracy of this method lags behind. This result demonstrates the importance of restarts: in contrast to the randomized strategies, the MAP initialization performs only a single run of hill-climbing.

Figure 5: Convergence analysis on Slovene and English datasets. The graph shows the normalized score of the output tree as a function of the number of restarts. The score of each sentence is normalized by the highest score obtained for this sentence after 3,000 restarts. We only show the curves up to 1,000 restarts because they all reach convergence after around 500 restarts.

Convergence Properties
Figure 5 shows the score of the trees retrieved by our full model with respect to the number of restarts, for short and long sentences in English and Slovene. To facilitate the comparison, we normalize the score of each sentence by the maximal score obtained for this sentence after 3,000 restarts. Overall, most sentences converge quickly. This view is also supported by Table 6, which shows the fraction of the sentences that converge within the first 300 restarts. We can see that all the short sentences (length up to 15) reach convergence within the allocated restarts. Perhaps surprisingly, more than 98% of the long sentences also converge within 300 restarts.
Decoding Speed As the number of restarts impacts the parsing accuracy, we can trade performance for speed. Figure 6 shows that the model achieves high performance with acceptable parsing speed. While various system implementation issues such as programming language and computational platform complicate a direct comparison with other parsing systems, our model delivers parsing time roughly comparable to other state-of-the-art graph-based systems (for example, TurboParser and MST parser) and the sampling-based parser.

Conclusions
We have shown that a simple, generally applicable randomized greedy algorithm for inference suffices to deliver state-of-the-art parsing performance. We argued that the effectiveness of such greedy algorithms is contingent on having a small number of local optima in the scoring function. By algorithmically counting the number of locally optimal solutions in the context of first-order parsing, we show that this number is indeed quite small. Moreover, we show that, as a decoding algorithm, the greedy method surpasses dual decomposition in second-order parsing. Finally, we empirically demonstrate that our approach with up to third-order and global features outperforms the state-of-the-art parsers when evaluated on 14 languages of non-projective CoNLL datasets.

Appendix
• Whenever independent selection of the heads results in a valid tree, there is only 1 optimum (Lines 1&2 of the algorithm). Otherwise there must be a cycle C in y (Line 3 of the algorithm).
• We claim that any locally optimal tree y' of the graph G = (V, E) must contain |C| − 1 arcs of the cycle C ⊆ V. This can be shown by contradiction. If y' contains fewer than |C| − 1 arcs of C, then (a) we can construct a tree y'' that contains |C| − 1 arcs; (b) the heads in y'' are strictly better than those in y' over the unused part of the cycle; (c) by reachability, there is a path y' → y'', so y' cannot be a local optimum.
• Any locally optimal tree in G must select an arc in C and reassign it. The rest of the |C| − 1 arcs will then result in a chain.
• By contracting cycle C we obtain a new graph G' of size |G| − |C| + 1 (Lines 5-11 of the algorithm). It is easy to verify (not shown) that any local optimum in G' is a local optimum in G and vice versa.
The theorem follows as a corollary of these steps. To see this, let $F(G_m)$ be the number of local optima in the graph of size m:
$$F(G_m) = \sum_{i=1}^{c} F\left(G^{(i)}_{m-c+1}\right),$$
where $G^{(i)}_{m-c+1}$ is the graph (of size m − c + 1) created by selecting the i-th arc in cycle C and contracting $G_m$ accordingly, and c = |C| is the size of the cycle. Define $\hat{F}(m)$ as the upper bound of $F(G_m)$ for any graph of size m. By the above formula, we know that
$$\hat{F}(m) \le \max_{2 \le c < m} \hat{F}(m - c + 1) \times c.$$
By solving for $\hat{F}(m)$ we get $\hat{F}(m) \le 2^{m-2}$. Since m = n + 1 for a sentence with n words, the upper bound on the number of local optima is $2^{n-1}$.
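The closed form can be checked numerically by unrolling the recurrence with equality. The base case used here, a single local optimum for a graph of size 2 (the root plus one word), is an assumption of this sketch that follows from there being only one possible tree; the function name is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def worst_case_optima(m: int) -> int:
    """Largest value permitted by the recurrence for a graph with m vertices."""
    if m == 2:
        return 1  # root plus one word: exactly one tree
    # A cycle cannot contain the root, so its size c satisfies 2 <= c < m.
    return max(c * worst_case_optima(m - c + 1) for c in range(2, m))


if __name__ == "__main__":
    for m in range(2, 12):
        print(m, worst_case_optima(m), 2 ** (m - 2))
```

The maximum is always attained at c = 2, since c · 2^(m−c−1) is largest for the smallest admissible cycle, which gives the doubling pattern behind the 2^(m−2) bound.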
To show the tightness, for any n > 0, create the graph $G_{n+1}$ with arc scores $e_{ij} = e_{ji} = i$ for any $0 \le i < j \le n$. Note that $w_n \to w_{n-1} \to w_n$ forms a cycle C of size 2. It can be shown by induction on n that $F(G_{n+1}) = F(G_n) \times 2 = 2^{n-1}$.