Exact Decoding for Phrase-Based Statistical Machine Translation

The combinatorial space of translation derivations in phrase-based statistical machine translation is given by the intersection between a translation lattice and a target language model. We replace this intractable intersection by a tractable relaxation which incorporates a low-order upperbound on the language model. Exact optimisation is achieved through a coarse-to-fine strategy with connections to adaptive rejection sampling. We perform exact optimisation with unpruned language models of order 3 to 5 and show search-error curves for beam search and cube pruning on standard test sets. This is the first work to tractably tackle exact optimisation with language models of orders higher than 3.


Introduction
In Statistical Machine Translation (SMT), the task of producing a translation for an input string $x = x_1, x_2, \ldots, x_I$ is typically associated with finding the best derivation $d^*$ compatible with the input under a linear model. In this view, a derivation is a structured output that represents a sequence of steps that covers the input producing a translation. Equation 1 illustrates this decoding process.
$$d^* = \operatorname*{argmax}_{d \in D(x)} f(d) \tag{1}$$
The set $D(x)$ is the space of all derivations compatible with $x$ and supported by a model of translational equivalences (Lopez, 2008). The function $f(d) = \Lambda \cdot H(d)$ is a linear parameterisation of the model (Och, 2003). It assigns a real-valued score (or weight) to every derivation $d \in D(x)$, where $\Lambda \in \mathbb{R}^m$ assigns a relative importance to different aspects of the derivation independently captured by $m$ feature functions $H(d) = \langle H_1(d), \ldots, H_m(d) \rangle \in \mathbb{R}^m$. The fully parameterised model can be seen as a discrete weighted set such that feature functions factorise over the steps in a derivation. That is, $H_k(d) = \sum_{e \in d} h_k(e)$, where $h_k$ is a (local) feature function that assesses steps independently and $d = \langle e_1, e_2, \ldots, e_l \rangle$ is a sequence of $l$ steps. Under this assumption, each step is assigned the weight $w(e) = \Lambda \cdot \langle h_1(e), h_2(e), \ldots, h_m(e) \rangle$. The set $D(x)$ is typically finite; however, it contains a very large number of structures (exponential, or even factorial, in the size of $x$; see §2), making exhaustive enumeration prohibitively slow. Only in very restricted cases are combinatorial optimisation techniques directly applicable (Tillmann et al., 1997; Och et al., 2001), thus it is common to resort to heuristic techniques in order to find an approximation to $d^*$ (Koehn et al., 2003; Chiang, 2007).
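To make the factorisation concrete, the following minimal sketch checks that summing the local step weights w(e) gives the same score as applying Λ once to the aggregated feature vector H(d). The feature names, weights and values are hypothetical and chosen only for illustration; they are not taken from the model described here.

```python
from typing import Dict, List

FeatureVector = Dict[str, float]


def dot(weights: FeatureVector, features: FeatureVector) -> float:
    """Sparse dot product between the weight vector and a feature vector."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def score_by_steps(weights: FeatureVector, steps: List[FeatureVector]) -> float:
    """Sum the local weights w(e) over the steps of a derivation."""
    return sum(dot(weights, step) for step in steps)


def score_globally(weights: FeatureVector, steps: List[FeatureVector]) -> float:
    """Aggregate the feature vector H(d) first, then apply the weights once."""
    totals: FeatureVector = {}
    for step in steps:
        for name, value in step.items():
            totals[name] = totals.get(name, 0.0) + value
    return dot(weights, totals)


# Hypothetical weights and a three-step derivation (illustrative values only).
WEIGHTS = {"phrase_penalty": -1.0, "log_p_fwd": 0.5, "word_penalty": -0.25}
DERIVATION = [
    {"phrase_penalty": 1.0, "log_p_fwd": -2.0, "word_penalty": 2.0},
    {"phrase_penalty": 1.0, "log_p_fwd": -1.0, "word_penalty": 1.0},
    {"phrase_penalty": 1.0, "log_p_fwd": -4.0, "word_penalty": 3.0},
]
```

Both functions return the same score for any derivation, which is what allows weights to be attached to individual steps (edges) rather than to whole derivations.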
Evaluation exercises indicate that approximate search algorithms work well in practice (Bojar et al., 2013). The most popular algorithms provide solutions with unbounded error, thus precisely quantifying their performance requires the development of a tractable exact decoder. To date, most attempts were limited to short sentences and/or somewhat toy models trained with artificially small datasets (Germann et al., 2001;Iglesias et al., 2009;Aziz et al., 2013). Other work has employed less common approximations to the model reducing its search space complexity (Kumar et al., 2006;Chang and Collins, 2011;Rush and Collins, 2011). These do not answer whether or not current decoding algorithms perform well at real translation tasks with state-of-the-art models.
We propose an exact decoder for phrase-based SMT based on a coarse-to-fine search strategy. In a nutshell, we relax the decoding problem with respect to the Language Model (LM) component. This coarse view is incrementally refined based on evidence collected via maximisation. A refinement increases the complexity of the model only slightly, hence dynamic programming remains feasible throughout the search until convergence. We test our decoding strategy with realistic models using standard data sets. We also contribute optimum derivations which can be used to assess future improvements to approximate decoders. In the remaining sections we present the general model (§2), survey contributions to exact optimisation (§3), formalise our novel approach (§4), present experiments (§5) and conclude (§6).

Phrase-based SMT
In phrase-based SMT (Koehn et al., 2003), the building blocks of translation are pairs of phrases (or biphrases). A translation derivation d is an ordered sequence of non-overlapping biphrases which covers the input text in arbitrary order generating the output from left to right. 1 Equation 2 illustrates a standard phrase-based model (Koehn et al., 2003): ψ is a weighted target n-gram LM component, where y is the yield of d; φ is a linear combination of features that decompose over phrase pairs directly (e.g. backward and forward translation probabilities, lexical smoothing, and word and phrase penalties); and δ is an unlexicalised penalty on the number of skipped input words between two adjacent biphrases. The weighted logic program in Figure 1 specifies the fully parameterised weighted set of solutions, which we denote $\langle D(x), f(d) \rangle$. 2 A weighted logic program starts from its axioms and proceeds by exhaustively deducing new items by combination of existing ones, and no deduction happens twice. In Figure 1, a nonterminal item summarises partial derivations (or hypotheses). It is denoted by [C, r, γ] (also known as carry), where: C is a coverage vector, necessary to impose the non-overlapping constraint; r is the rightmost position most recently covered, necessary for the computation of δ; and γ is the last n − 1 words
1 Preventing phrases from overlapping requires an exponential number of constraints (the powerset of x) rendering the problem NP-complete (Knight, 1999).
2 Weighted logics have been extensively used to describe weighted sets (Lopez, 2009), operations over weighted sets (Chiang, 2007; Dyer and Resnik, 2010), and a variety of dynamic programming algorithms (Cohen et al., 2008).
, that is, an instantiated biphrase which covers the span $x_i^{i'}$ and yields $y_j^{j'}$ with weight $\phi_r$. The side condition imposes the non-overlapping constraint ($c_k$ is the kth bit in C).
The antecedents are used to compute the weight of the deduction, and the carry is updated in the consequent (item below the horizontal line). Finally, the rule ACCEPT incorporates the end-of-sentence boundary to complete items. It is perhaps illustrative to understand the set of weighted translation derivations as the intersection between two components: one that is only locally parameterised and contains all translation derivations (a translation lattice or forest), and one that re-ranks the first as a function of the interactions between translation steps. The model of translational equivalences parameterised only with φ is an instance of the former. An n-gram LM component is an instance of the latter.

Hypergraphs
A backward-hypergraph, or simply hypergraph, is a generalisation of a graph where edges have multiple origins and one destination (Gallo et al., 1993). They can represent both finite-state and context-free weighted sets and they have been widely used in SMT (Huang and Chiang, 2007). A hypergraph is defined by a set of nodes (or vertices) V and a weighted set of edges $\langle E, w \rangle$. An edge e connects a sequence of nodes in its tail $t[e] \in V^*$ under a head node $h[e] \in V$ and has weight w(e). A node v is a terminal node if it has no incoming edges, otherwise it is a nonterminal node. The node that has no outgoing edges is called the root; without loss of generality we can assume hypergraphs to have a single root node.
Hypergraphs can be seen as instantiated logic programs. In this view, an item is a template for the creation of nodes, and a weighted deduction rule is a template for edges. The tail of an edge is the sequence of nodes associated with the antecedents, and the head is the node associated with the consequent. Even though the space of weighted derivations in phrase-based SMT is finite-state, using a hypergraph as opposed to a finite-state automaton makes it natural to encode multi-word phrases using tails. We opt for representing the target side of the biphrase as a sequence of terminal nodes, each of which represents a target word.

Beam filling algorithms
Beam search (Koehn et al., 2003) and cube pruning (Chiang, 2007) are examples of state-of-the-art approximate search algorithms. They approximate the intersection between the translation forest and the language model by expanding a limited beam of hypotheses from each nonterminal node. Hypotheses are organised in priority queues according to common traits, and a fast-to-compute heuristic view of outside weights (the cheapest way to complete a hypothesis) lets them compete on a fairer footing. Beam search exhausts a node's possible expansions, scores them, and discards all but the k highest-scoring ones. This process is wasteful in that k is typically much smaller than the number of possible expansions. Cube pruning employs a priority queue at beam filling and computes k high-scoring expansions directly in near best-first order. The parameter k is known as beam size and it controls the time-accuracy trade-off of the algorithm. Heafield et al. (2013a) move away from using the language model as a black box and build a more involved beam filling algorithm. Even though they target approximate search, some of their ideas have interesting connections to ours (see §4). They group hypotheses that share partial language model state (Li and Khudanpur, 2008), reasoning over multiple hypotheses at once. They fill a beam in best-first order by iteratively visiting groups using a priority queue: if the top group contains a single hypothesis, the hypothesis is added to the beam, otherwise the group is partitioned and the parts are pushed back to the queue. More recently, Heafield et al. (2014) applied their beam filling algorithm to phrase-based decoding.
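The priority-queue idea behind cube pruning can be sketched in isolation. The function below lazily enumerates the k best combinations of two score lists that are already sorted in descending order, without scoring every pair. It is a simplified illustration under the assumption that combined scores are purely additive, in which case the enumeration is exactly best-first; in a decoder the LM interaction breaks this monotonicity, which is why cube pruning is only near best-first. The name `k_best_sums` and the toy scores are hypothetical.

```python
import heapq
from typing import List, Tuple


def k_best_sums(xs: List[float], ys: List[float], k: int) -> List[Tuple[float, int, int]]:
    """Return the k highest sums xs[i] + ys[j] as (score, i, j).

    Both input lists must be sorted in descending order.
    """
    if not xs or not ys or k <= 0:
        return []
    # heapq is a min-heap, so scores are negated to pop the best first.
    heap = [(-(xs[0] + ys[0]), 0, 0)]
    seen = {(0, 0)}
    best: List[Tuple[float, int, int]] = []
    while heap and len(best) < k:
        negated, i, j = heapq.heappop(heap)
        best.append((-negated, i, j))
        # Only the two neighbours of a popped cell can become the next best.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(xs) and nj < len(ys) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-(xs[ni] + ys[nj]), ni, nj))
    return best


# Toy log-domain scores (illustrative values only).
LEFT = [-1.0, -2.0, -4.0]
RIGHT = [-0.5, -3.0]
TOP3 = k_best_sums(LEFT, RIGHT, 3)
```

With k much smaller than the number of pairs, only a small frontier of the grid is ever scored, which is the saving over exhaustive beam filling.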

Exact optimisation
Exact optimisation for monotone translation has been done using A* search (Tillmann et al., 1997) and finite-state operations (Kumar et al., 2006). Och et al. (2001) design near-admissible heuristics for A* and decode very short sentences (6-14 words) for a word-based model (Brown et al., 1993) with a maximum distortion strategy (d = 3).
Zaslavskiy et al. (2009) frame phrase-based decoding as an instance of a generalised Travelling Salesman Problem (TSP) and rely on robust solvers to perform decoding. In this view, a salesman graph encodes the translation options, with each node representing a biphrase. Non-overlapping constraints are imposed by the TSP solver, rather than encoded directly in the salesman graph. They decode only short sentences (17 words on average) using a 2-gram LM due to salesman graphs growing too large. Chang and Collins (2011) relax phrase-based models w.r.t. the non-overlapping constraints, which are replaced by soft penalties through Lagrangian multipliers, and intersect the LM component exhaustively. They do employ a maximum distortion limit (d = 4), thus the problem they tackle is no longer NP-complete. Rush and Collins (2011) relax a hierarchical phrase-based model (Chiang, 2005) w.r.t. the LM component. The translation forest and the language model trade their weights (through Lagrangian multipliers) so as to ensure agreement on what each component believes to be the maximum. In both approaches, when the dual converges to a compliant solution, the solution is guaranteed to be optimal. Otherwise, a subset of the constraints is explicitly added and the dual optimisation is repeated. They handle sentences above average length, however, resorting to compact rulesets (10 translation options per input segment) and using only 3-gram LMs.
In the context of hierarchical models, Aziz et al. (2013) work with unpruned forests using upperbounds. Their approach is the closest to ours. They also employ a coarse-to-fine strategy with the OS* framework, and investigate unbiased sampling in addition to optimisation. However, they start from a coarser upperbound with unigram probabilities, and their refinement strategies are based on exhaustive intersections with small n-gram matching automata. These refinements make forests grow unmanageable too quickly. Because of that, they only deal with very short sentences (up to 10 words) and even then decoding is very slow. We design better upperbounds and a more efficient refinement strategy. Moreover, we decode long sentences using language models of order 3 to 5.

Approach

Exact optimisation with OS*
OS* is a unified view of optimisation and sampling which can be seen as a cross between adaptive rejection sampling (Robert and Casella, 2004) and A* optimisation (Hart et al., 1968). In this framework, a complex goal distribution is upperbounded by a simpler proposal distribution for which optimisation (and sampling) is feasible. This proposal is incrementally refined to be closer to the goal until the maximum is found (or until the sampling performance exceeds a certain level). Figure 2 illustrates exact optimisation with OS*. Suppose f is a complex target goal distribution, such that we cannot optimise f, but we can assess $f(d)$ for any given derivation $d$, and let $g^{(0)}$ be an upperbound on $f$. Moreover, suppose that $g^{(0)}$ is simple enough to be optimised efficiently. The algorithm proceeds by solving $d_0 = \operatorname{argmax}_d g^{(0)}(d)$ and computing the quantity $r_0 = f(d_0)/g^{(0)}(d_0)$. If $r_0$ were sufficiently close to 1, then $g^{(0)}(d_0)$ would be sufficiently close to $f(d_0)$ and we would have found the optimum. However, in the illustration $g^{(0)}(d_0) \gg f(d_0)$, thus $r_0 \ll 1$.
At this point the algorithm has concrete evidence to motivate a refinement of $g^{(0)}$ that can lower its maximum, bringing it closer to $f^* = \max_d f(d)$ at the cost of some small increase in complexity. The refined proposal must remain an upperbound to f. To continue with the illustration, suppose $g^{(1)}$ is obtained. The process is repeated until eventually $g^{(t)}(d_t) = f(d_t)$ for some finite t, at which point $d_t$ is the optimum derivation $d^*$ from f and the sequence of upperbounds provides a proof of optimality.

Model
We work with phrase-based models in a standard parameterisation (Equation 2). However, to avoid having to deal with NP-completeness, we constrain reordering to happen only within a limited window given by a notion of distortion limit. We require that the last source word covered by any biphrase must be within d words from the leftmost uncovered source position (Lopez, 2009). This is a widely used strategy and it is in use in the Moses toolkit (Koehn et al., 2007). Nevertheless, the problem of finding the best derivation under the model remains impracticable due to nonlocal parameterisation (namely, the n-gram LM component). The weighted set $\langle D(x), f(d) \rangle$, which represents the objective, is a complex hypergraph which we cannot afford to construct. We propose to construct instead a simpler hypergraph for which optimisation by dynamic programming is feasible. This proxy represents the weighted set $\langle D(x), g^{(0)}(d) \rangle$. Note that this proposal contains exactly the same translation options as in the original decoding problem. The simplification happens only with respect to the parameterisation. Instead of intersecting the complete n-gram LM distribution explicitly, we implicitly intersect a simpler upperbound view of it, where by simpler we mean lower-order.
Equation 3 shows the model we use as a proxy to perform exact optimisation over f. In comparison to Equation 2, the term $\sum_{i=1}^{l} \omega(y[e_i])$ replaces $\psi(y) = \lambda_\psi\, p_{LM}(y)$. While ψ weights the yield y taking into account all n-grams (including those crossing the boundaries of phrases), ω weights edges in isolation. Particularly, $\omega(y[e_i]) = \lambda_\psi\, q_{LM}(y[e_i])$, where $y[e_i]$ returns the sequence of target words (a target phrase) associated with the edge, and $q_{LM}(\cdot)$ is an upperbound on the true LM probability $p_{LM}(\cdot)$ (see §4.3). It is obvious from Equation 3 that our proxy model is much simpler than the original: the only form of nonlocal parameterisation left is the distortion penalty, which is simple enough to represent exactly.
The program in Figure 3 illustrates the construction of $\langle D(x), g^{(0)}(d) \rangle$. A nonterminal item [l, C, r] stores: the leftmost uncovered position l and a truncated coverage vector C (together they track d input positions); and the rightmost position r most recently translated (necessary for the computation of the distortion penalty). Observe how nonterminal items do not store the LM state. The rule ADJACENT expands derivations by concatenation with a biphrase $x_i^{i'} \rightarrow y_j^{j'}$ starting at the leftmost uncovered position i = l. That causes the coverage window to move ahead to the next leftmost uncovered position: $l' = l + \alpha_1(C) + 1$, where $\alpha_1(C)$ returns the number of leading 1s in C, and $C \ll \alpha_1(C) + 1$ represents a left-shift. The rule NON-ADJACENT handles the remaining cases i > l provided that the expansion skips at most d input words, $|r - i + 1| \le d$. In the consequent, the window C is simply updated to record the translation of the input span $i..i'$. In the nonadjacent case, a gap constraint imposes that the resulting item will require skipping no more than d positions before the leftmost uncovered word is translated, $|i - l + 1| \le d$. Finally, note that deductions incorporate the weighted upperbound ω(·), rather than the true LM component ψ(·).
Figure 3: Specification of the initial proposal hypergraph. This program allows the same reorderings as Lopez (2009) (see logic WLd); however, it does not store LM state information and it uses the upperbound LM distribution ω(·). Rule shown: antecedent $[I+1, C, r]$, consequent $[I+1, \emptyset, I+1] : \delta(r, I+1) \otimes \omega(\mathrm{EOS})$, side condition $r \le I$.

LM upperbound and Max-ARPA
Following previous work, we compute an upperbound on n-gram conditional probabilities by precomputing max-backoff weights stored in a "Max-ARPA" table, an extension of the ARPA format (Jurafsky and Martin, 2000).
A standard ARPA table T stores entries $\langle Z, Z.p, Z.b \rangle$, where Z is an n-gram equal to the concatenation Pz of a prefix P with a word z, Z.p is the conditional probability p(z|P), and Z.b is a so-called "backoff" weight associated with Z.
The conditional probability of an arbitrary n-gram p(z|P), whether listed or not, can then be recovered from T by the simple recursive procedure shown in Equation 4, where tail deletes the first word of the string P.
$$p(z|P) = \begin{cases} p(z|\mathrm{tail}(P)) & Pz \notin T \text{ and } P \notin T \\ p(z|\mathrm{tail}(P)) \times P.b & Pz \notin T \text{ and } P \in T \\ Pz.p & Pz \in T \end{cases} \tag{4}$$
The optimistic version (or "max-backoff") q of p is defined as $q(z|P) \equiv \max_H p(z|HP)$, where H varies over all possible contexts extending the prefix P to the left. The Max-ARPA table allows us to compute q(z|P) for arbitrary values of z and P. It is constructed on the basis of the ARPA table T by adding two columns to T: a column Z.q that stores the value q(z|P) and a column Z.m that stores an optimistic version of the backoff weight.
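The backoff recursion of Equation 4 can be sketched as a short recursive lookup. This is a minimal illustration under stated assumptions: the table is a plain dictionary from n-gram tuples to (p, b) pairs, probabilities are stored as plain values rather than the log10 values found in ARPA files, and an unknown word raises an error instead of mapping to an unknown-word token. The names `cond_prob` and `TOY_TABLE` and all values are hypothetical.

```python
from typing import Dict, Tuple

NGram = Tuple[str, ...]
# n-gram -> (conditional probability p, backoff weight b)
ArpaTable = Dict[NGram, Tuple[float, float]]


def cond_prob(table: ArpaTable, z: str, prefix: NGram) -> float:
    """Recover p(z | prefix) from the table following Equation 4."""
    ngram = prefix + (z,)
    if ngram in table:  # case Pz in T
        return table[ngram][0]
    if not prefix:  # nothing left to back off to (assumption: no <unk> entry)
        raise KeyError(f"unknown word: {z!r}")
    shorter = cond_prob(table, z, prefix[1:])  # p(z | tail(P))
    if prefix in table:  # case Pz not in T, P in T
        return shorter * table[prefix][1]
    return shorter  # case Pz not in T, P not in T


# Toy bigram table over the vocabulary {a, b} (illustrative values only).
TOY_TABLE: ArpaTable = {
    ("a",): (0.5, 0.4),
    ("b",): (0.5, 1.0),
    ("a", "b"): (0.8, 1.0),
}
```

For example, `cond_prob(TOY_TABLE, "a", ("a",))` backs off through the unigram and scales it by the backoff weight of the context, while contexts longer than anything listed are shortened from the left until a listed entry is reached.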
These columns are computed offline in two passes by first sorting T in descending order of n-gram length. 12 In the first pass (Algorithm 1), we compute for every entry in the table an optimistic backoff weight m. In the second pass (Algorithm 2), we compute for every entry an optimistic conditional probability q by maximising over one-word history extensions (whose .q fields are already known due to the sorting of T).
The following Theorem holds (see proof below): For an arbitrary n-gram Z = Pz, the probability q(z|P) can be recovered through the procedure shown in Equation 5.
$$q(z|P) = \begin{cases} p(z|P) & Pz \notin T \text{ and } P \notin T \\ p(z|P) \times P.m & Pz \notin T \text{ and } P \in T \\ Pz.q & Pz \in T \end{cases} \tag{5}$$
Note that, if Z is listed in the table, we return its upperbound probability q directly. When the n-gram is unknown, but its prefix is known, we take into account the optimistic backoff weight m of the prefix. On the other hand, if both the n-gram and its prefix are unknown, then no additional context could change the score of the n-gram, in which case q(z|P) = p(z|P).
In the sequel, we will need the following definitions. Suppose $\alpha = y_I^J$ is a substring of $y = y_1^M$. Then $p_{LM}(\alpha) \equiv \prod_{k=I}^{J} p(y_k \mid y_1^{k-1})$ is the contribution of α to the true LM score of y. We then obtain an upperbound $q_{LM}(\alpha)$ to this contribution by defining $q_{LM}(\alpha) \equiv q(y_I \mid \epsilon) \prod_{k=I+1}^{J} q(y_k \mid y_I^{k-1})$, where $\epsilon$ denotes the empty context.
12 If an n-gram is listed in T, then all its substrings must also be listed. Certain pruning strategies may corrupt this property, in which case we make missing substrings explicit.
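The definitions above can be checked by brute force on a toy model. The sketch below does not implement the Max-ARPA table; it assumes a hypothetical bigram LM over two words, computes q(z|P) directly from its definition as a maximum over left extensions (taking the empty extension to be included, which is an assumption), and verifies that q_LM(α) is never below the true contribution of α in any left context. All names and probabilities are illustrative.

```python
import itertools
from typing import Tuple

Words = Tuple[str, ...]

VOCAB = ("a", "b")
UNIGRAM = {"a": 0.6, "b": 0.4}
# (previous word, word) -> p(word | previous word)
BIGRAM = {("a", "a"): 0.1, ("a", "b"): 0.9, ("b", "a"): 0.7, ("b", "b"): 0.3}


def p(z: str, context: Words) -> float:
    """True probability under a toy bigram LM: only the last context word matters."""
    return BIGRAM[(context[-1], z)] if context else UNIGRAM[z]


def q(z: str, context: Words) -> float:
    """Max-backoff q(z | context): best case over all left extensions of the context."""
    if context:  # a bigram model already sees its full history
        return p(z, context)
    return max([p(z, ())] + [p(z, (x,)) for x in VOCAB])


def q_lm(phrase: Words) -> float:
    """Upperbound on the LM contribution of a phrase scored in isolation."""
    total = q(phrase[0], ())
    for k in range(1, len(phrase)):
        total *= q(phrase[k], phrase[:k])
    return total


def p_lm(phrase: Words, left_context: Words) -> float:
    """True LM contribution of a phrase that follows a given left context."""
    total, history = 1.0, left_context
    for word in phrase:
        total *= p(word, history)
        history += (word,)
    return total


def upperbound_holds(max_len: int = 3) -> bool:
    """Check q_lm(phrase) >= p_lm(phrase, context) for all short phrases and contexts."""
    contexts = [c for n in range(3) for c in itertools.product(VOCAB, repeat=n)]
    phrases = [s for n in range(1, max_len + 1) for s in itertools.product(VOCAB, repeat=n)]
    return all(q_lm(s) >= p_lm(s, c) - 1e-12 for s in phrases for c in contexts)
```

Only the first word of the phrase is scored optimistically here, because with a bigram model every later word already has its full history inside the phrase; with higher-order models the first n − 1 words of a phrase are affected.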
Proof of Theorem. Let us first suppose that the length of P is strictly larger than the order n of the language model. Then for any H, p(z|HP) = p(z|P); this is because $HP \notin T$ and $P \notin T$, along with all intermediary strings, hence, by (4), p(z|HP) = p(z|tail(HP)) = p(z|tail(tail(HP))) = ... = p(z|P). Hence q(z|P) = p(z|P), and, because $Pz \notin T$ and $P \notin T$, the theorem is satisfied in this case.
Having established the theorem for |P| > n, we now assume that it is true for |P| > m and prove by induction that it is true for |P| = m. We use the fact that, by the definition of q, we have $q(z|P) = \max_{x \in \Delta} q(z|xP)$. We have three cases to consider. First, suppose that $Pz \notin T$ and $P \notin T$. Then $xPz \notin T$ and $xP \notin T$, hence by induction q(z|xP) = p(z|xP) = p(z|P) for any x, therefore q(z|P) = p(z|P). We have thus proven the first case. Second, suppose that $Pz \notin T$ and $P \in T$. Then, for any x, we have $xPz \notin T$, and: for $xP \notin T$, by induction, q(z|xP) = p(z|xP) = p(z|P), and therefore $\max_{x \in \Delta,\, xP \notin T} q(z|xP) = p(z|P)$; for $xP \in T$, we have $q(z|xP) = p(z|xP) \times xP.m = p(z|P) \times xP.b \times xP.m$. Thus, we have: max
For $xPz \notin T$, $xP \notin T$, we have q(z|xP) = p(z|xP) = p(z|P) = Pz.p, where the last equality is due to the fact that Pz
Algorithm 2
for Pz ∈ T do
    Pz.q ← Pz.p
    for x ∈ ∆ s.t. xP ∈ T do
        if xPz ∈ T then
            Pz.q ← max(Pz.q, xPz.q)
        else
            Pz.q ← max(Pz.q, Pz.p × xP.b × xP.m)
        end if
    end for
end for

Search
The search for the true optimum derivation is illustrated in Algorithm 3. The algorithm takes as input the initial proposal distribution $g^{(0)}(d)$ (see §4.2, Figure 3) and a maximum error $\epsilon$ (which we set to a small constant 0.001 rather than zero, to avoid problems with floating point precision). In line 3 we find the optimum derivation d in $g^{(0)}$ (see §4.5). The variable $g^*$ stores the maximum score w.r.t. the current proposal, while the variable $f^*$ stores the maximum score observed thus far w.r.t. the true model (note that in line 5 we assess the true score of d). In line 6 we start a loop that runs until the error falls below $\epsilon$. This error is the difference (in log-domain) between the proxy maximum $g^*$ and the best true score observed thus far $f^*$. 13 In line 7, we refine the current proposal using evidence from d (see §4.6). In line 9, we update the maximum derivation searching through the refined proposal. In line 11, we keep track of the best score so far according to the true model, in order to compute the updated gap in line 6.
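The control flow described above can be sketched on a toy problem. This is not the decoder itself: derivations are assumed to be a small enumerable set of labels, the proposal is a dictionary of log-domain upper bounds, and the refinement step simply tightens the bound of the current maximiser to its true score, whereas the actual refinement lowers the bound locally for every derivation sharing the offending n-gram context. The function name `optimise` and all scores are hypothetical.

```python
from typing import Callable, Dict, Tuple


def optimise(
    f: Callable[[str], float], g0: Dict[str, float], eps: float = 1e-3
) -> Tuple[str, float, int]:
    """Coarse-to-fine maximisation of f using an upper bound g0 (log-domain scores).

    Returns the best derivation found, its true score, and the number of refinements.
    """
    g = dict(g0)  # current proposal; must satisfy g[d] >= f(d) for every d
    d = max(g, key=g.get)  # maximiser of the proposal
    g_star = g[d]
    best, f_star = d, f(d)  # best true score observed so far
    iterations = 0
    while g_star - f_star > eps:
        g[d] = f(d)  # toy refinement: make the bound exact at d only
        d = max(g, key=g.get)
        g_star = g[d]
        if f(d) > f_star:  # update "best so far"
            best, f_star = d, f(d)
        iterations += 1
    return best, f_star, iterations


# Toy true scores and an initial upper bound (illustrative values only).
TRUE_SCORES = {"d1": -3.0, "d2": -2.0, "d3": -5.0}
INITIAL_BOUND = {"d1": -1.0, "d2": -1.5, "d3": -1.8}
RESULT = optimise(TRUE_SCORES.__getitem__, INITIAL_BOUND)
```

On termination the gap between the proposal maximum and the best true score is at most eps, so the returned derivation is optimal up to that tolerance; stopping earlier would still give a bounded error equal to the current gap.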

Dynamic Programming
Finding the best derivation in a proposal hypergraph is straightforward with standard dynamic programming. We can compute inside weights in the max-times semiring in time O(|V| + |E|) (Goodman, 1999). Once inside weights have been computed, finding the Viterbi derivation starting from the root is straightforward. A simple, though important, optimisation concerns the computation of inside weights. The inside algorithm (Baker, 1979) requires a bottom-up traversal of the nodes in V. To do that, we topologically sort the nodes in V at time t = 0 and maintain a sorted list of nodes as we refine g throughout the search, thus avoiding having to recompute the partial ordering of the nodes at every iteration.
13 Because $g^{(t)}$ upperbounds f everywhere, in optimisation we have a guarantee that the maximum of f must lie in the interval $[f^*, g^*)$ (see Figure 2) and the quantity $g^* - f^*$ is an upperbound on the error that we incur if we early-stop the search at any given time t. This bound provides a principled criterion in trading accuracy for performance (a direction that we leave for future work). Note that most algorithms for approximate search produce solutions with unbounded error.
Algorithm 3 Exact decoding
1: function OPTIMISE($g^{(0)}$, $\epsilon$)
2: [...]
11: [...] update "best so far"
12: end while
13: return $g^{(t)}$, d
14: end function
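The inside computation and Viterbi extraction described in this section can be sketched as follows. This is a minimal illustration, assuming non-negative edge weights combined in the max-times semiring, a node list that is already topologically sorted (tails before heads), and a hypothetical `Edge` record; it is not the data structure used in the experiments.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Edge:
    head: str
    tail: Tuple[str, ...]
    weight: float


def viterbi(order: Sequence[str], edges: Sequence[Edge], root: str) -> Tuple[float, List[Edge]]:
    """Max-times inside weights in one bottom-up pass, then the best derivation."""
    incoming: Dict[str, List[Edge]] = {}
    for edge in edges:
        incoming.setdefault(edge.head, []).append(edge)

    inside: Dict[str, float] = {}
    best_edge: Dict[str, Edge] = {}
    for node in order:  # topological order: every tail is scored before its head
        if node not in incoming:
            inside[node] = 1.0  # terminal node: multiplicative identity
            continue
        inside[node] = 0.0  # identity of max for non-negative weights
        for edge in incoming[node]:
            score = edge.weight
            for child in edge.tail:
                score *= inside[child]
            if score > inside[node]:
                inside[node] = score
                best_edge[node] = edge

    derivation: List[Edge] = []

    def collect(node: str) -> None:
        """Follow back-pointers from a node, emitting edges bottom-up."""
        edge = best_edge.get(node)
        if edge is None:
            return
        for child in edge.tail:
            collect(child)
        derivation.append(edge)

    collect(root)
    return inside[root], derivation


# Toy hypergraph with two terminals, one intermediate node and a root.
NODES = ["a", "b", "X", "R"]
EDGES = [
    Edge("X", ("a",), 0.5),
    Edge("X", ("b",), 0.4),
    Edge("R", ("X", "a"), 0.5),
    Edge("R", ("b",), 0.2),
]
BEST_SCORE, BEST_DERIVATION = viterbi(NODES, EDGES, "R")
```

Each edge is visited once and each node once, matching the O(|V| + |E|) cost quoted above; keeping `order` sorted across refinements avoids re-sorting at every iteration.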

Refinement
If a derivation $d = \operatorname{argmax}_d g^{(t)}(d)$ is such that $g^{(t)}(d) \gg f(d)$, there must be in d at least one n-gram whose upperbound LM weight is far above its true LM weight. We then lower $g^{(t)}$ locally by refining only nonterminal nodes that participate in d. Nonterminal nodes are refined by having their LM states extended one word at a time. For an illustration, assume we are performing optimisation with a bigram LM. Suppose that in the first iteration a derivation $d_0 = \operatorname{argmax}_d g^{(0)}(d)$ is obtained. Now consider an edge in $d_0$ where an empty LM state is made explicit (with an empty string $\epsilon$) and $\alpha y_1$ represents a target phrase. We refine the edge's head $[l_0, C_0, r_0, \epsilon]$ by creating a node based on it, however, with an extended LM state, i.e., $[l_0, C_0, r_0, y_1]$. This motivates a split of the set of incoming edges to the original node, such that, if the target projection of an incoming edge ends in $y_1$, that edge is reconnected to the new node as below.
$$[l, C, r, \epsilon]\; \alpha y_1 \xrightarrow{w} [l_0, C_0, r_0, y_1]$$
The outgoing edges from the new node are reweighted copies of those leaving the original node. That is, outgoing edges such as $[l_0, C_0, r_0, \epsilon]\; y_2 \beta \xrightarrow{w} [l', C', r', \gamma']$ motivate edges such as is a change in LM probability due to an extended context. Figure 4 is the logic program that constructs the refined hypergraph in the general case. In comparison to Figure 3, items are now extended to store an LM state. The input is the original hypergraph $G = \langle V, E \rangle$ and a node $v_0 \in V$ to be refined by left-extending its LM state $\gamma_0$ with the word y. In the program, $u\sigma \xrightarrow{w} v$ with $u, v \in V$ and $\sigma \in \Delta^*$ represents an edge in E. An item $[l, C, r, \gamma]_v$ (annotated with a state $v \in V$) represents a node (in the refined hypergraph) whose signature is equivalent to v (in the input hypergraph). We start with AXIOMS by copying the nodes in G. In COPY, edges from G are copied unless they are headed by $v_0$ and their target projections end in $y\gamma_0$ (the extended context). Such edges are processed by REFINE, which instead of copying them, creates new ones headed by a refined version of $v_0$. Finally, REWEIGHT continues from the refined node with reweighted copies of the edges leaving $v_0$. The weight update represents a change in LM probability (w.r.t. the upperbound distribution) due to an extended context.

Experiments
We used the dataset made available by the Workshop on Statistical Machine Translation (WMT) (Bojar et al., 2013) to train a German-English phrase-based system using the Moses toolkit (Koehn et al., 2007) in a standard setup. For phrase extraction, we used both Europarl (Koehn, 2005) and News Commentaries (NC) totalling about 2.2M sentences. For language modelling, in addition to the monolingual parts of Europarl and NC, we added News-2013 totalling about 25M sentences. We performed language model interpolation and batch-mira tuning (Cherry and Foster, 2012) using newstest2010 (2,849 sentence pairs). For tuning we used cube pruning with a large beam size (k = 5000) and a distortion limit d = 4. Unpruned language models were trained using lmplz (Heafield et al., 2013b) which employs modified Kneser-Ney smoothing (Kneser and Ney, 1995). We report results on newstest2012.
Figure 4: Local intersection via LM right state refinement. The input is a hypergraph $G = \langle V, E \rangle$, a node $v_0 \in V$ singly identified by its carry $[l_0, C_0, r_0, \gamma_0]$ and a left-extension y for its LM context $\gamma_0$. The program copies most of the edges. If a derivation goes through $v_0$ and the string under $v_0$ ends in $y\gamma_0$, the program refines and reweights it.
Our exact decoder produces optimal translation derivations for all the 3,003 sentences in the test set. Table 1 summarises the performance of our novel decoder for language models of order n = 3 to n = 5. For 3-gram LMs we also varied the distortion limit d (from 4 to 6). We report the average time (in seconds) to build the initial proposal, the total decoding time of the algorithm (including build), the number of iterations N before convergence, and the size of the hypergraph at the end of the search (in thousands of nodes and thousands of edges).
It is insightful to understand how different aspects of the initial proposal impact performance. Increasing the translation option limit (tol) leads to $g^{(0)}$ having more edges (this dependency is linear in tol). In this case, the number of nodes is only minimally affected, due to the possibility of a few new segmentations. The maximum phrase length (mpl) introduces in $g^{(0)}$ more configurations of reordering constraints ([l, C] in Figure 3). However, not many more, due to C being limited by the distortion limit d. In practice, we observe little impact on time performance. Increasing d introduces many more permutations of the input, leading to exponentially many more nodes and edges. Increasing the order n of the LM has no impact on $g^{(0)}$, and its impact on the overall search is expressed in terms of a higher number of nodes being locally intersected.
A larger hypergraph, be it due to additional nodes or additional edges, necessarily leads to slower iterations because at each iteration we must compute inside weights in time O(|V| + |E|). The number of nodes has the larger impact on the number of iterations. OS* is very efficient in ignoring hypotheses (edges) that cannot compete for an optimum. For instance, we observe that running time depends linearly on tol only through the computation of inside weights, while the number of iterations is only minimally affected. 17 An increased LM order, for a fixed distortion limit, impacts much more on the number of iterations than on the average running time of a single iteration. Fixing d = 4, the average time per iteration is 0.1 (n = 3), 0.13 (n = 4) and 0.18 (n = 5). Fixing a 3-gram LM, we observe 0.1 (d = 4), 0.17 (d = 5) and 0.31 (d = 6). Note the exponential growth of the latter, due to a proposal encoding exponentially many more permutations. Table 2 shows the average degree of refinement of the nodes in the final proposal. Nodes are shown by level of refinement, where m indicates that they store m words in their carry. The table also shows the number of unique m-grams ever incorporated into the proposal. This table illustrates well how our decoding algorithm moves from a coarse upperbound where every node stores an empty string to a variable-order representation which is sufficient to prove an optimum derivation.
$|E_0| = 178$ with d = 6. Observe the exponential dependency on distortion limit, which also leads to exponentially longer running times.
17 It is possible to reduce the size of the hypergraph throughout the search using the upperbound on the search error $g^* - f^*$ to prune hypotheses that surely do not stand a chance of competing for the optimum (Graehl, 2005). Another direction is to group edges connecting the same nonterminal nodes into one partial edge (Heafield et al., 2013a); this is particularly convenient due to our method only visiting the 1-best derivation from g(d) at each iteration.
Table 2: Average number of nodes (in thousands) whose LM state encodes an m-gram, and average number of unique LM states of order m in the final hypergraph for different n-gram LMs (d = 4 everywhere).
In our approach a complete derivation is optimised from the proxy model at each iteration. We observe that over 99% of these derivations project onto distinct strings. In addition, while the optimum solution may be found early in the search, a certificate of optimality requires refining the proxy until convergence (see §4.1). It turns out that most of the solutions are first encountered as late as in the last 6-10% of the iterations.
We use the optimum derivations obtained with our exact decoder to measure the number of search errors made by beam search and cube pruning with increasing beam sizes (see Table 3). Beam search reaches optimum derivations with beam sizes k ≥ 500 for all language models tested. Cube pruning, on the other hand, still makes mistakes at k = 1000. Table 4 shows translation quality achieved with different beam sizes for cube pruning and compares it to exact decoding. Note that for $k \ge 10^4$ cube pruning converges to optimum derivations in the vast majority of the cases (100% with a 3-gram LM) and translation quality in terms of BLEU is no different from OS*. However, with $k < 10^4$ both model scores and translation quality can be improved. Figure 5 shows a finer view on search errors as a function of beam size for LMs of order 3 to 5 (fixed d = 4). In Figure 6, we fix a 3-gram LM and vary the distortion limit (from 4 to 6). Dotted lines correspond to beam search and dashed lines correspond to cube pruning.
Table 4: Translation quality in terms of BLEU as a function of beam size in cube pruning with language models of order 3 to 5. The bottom row shows BLEU for our exact decoder.

Conclusions and Future Work
We have presented an approach to decoding with unpruned hypergraphs using upperbounds on the language model distribution. The algorithm is an instance of a coarse-to-fine strategy with connections to A* and adaptive rejection sampling known as OS*. We have tested our search algorithm using state-of-the-art phrase-based models employing robust language models. Our algorithm is able to decode all sentences of a standard test set in manageable time, consuming very little memory. We have performed an analysis of search errors made by beam search and cube pruning and found that both algorithms perform remarkably well for phrase-based decoding. In the case of cube pruning, we show that model score and translation quality can be improved for beams k < 10,000.
There are a number of directions that we intend to investigate to speed up our decoder, such as: (1) error-safe pruning based on search error bounds; (2) use of reinforcement learning to guide the decoder in choosing which n-gram contexts to extend; and (3) grouping edges into partial edges, effectively reducing the size of the hypergraph and ultimately computing inside weights in less time.