Algorithms for Acyclic Weighted Finite-State Automata with Failure Arcs

Weighted finite-state automata (WFSAs) are commonly used in NLP. Failure transitions are a useful extension for compactly representing backoffs or interpolation in n-gram models and CRFs, which are special cases of WFSAs. Unfortunately, applying standard algorithms for computing the pathsum requires expanding these compact failure transitions. As a result, naïve computation of the pathsum in acyclic WFSAs with failure transitions runs in O(|Q|²|Σ|) (O(|Q||Σ|) for deterministic WFSAs) while the equivalent algorithm in normal WFSAs runs in O(|E|), where E represents the set of transitions, Q the set of states, and Σ the alphabet. In this work, we present more efficient algorithms for computing the pathsum in sparse acyclic WFSAs, i.e., WFSAs with average out-symbol fraction s ≪ 1. In those, the backward algorithm runs in O(s|Q||Σ|). We propose an algorithm for semiring-weighted automata which runs in O(|E| + s|Σ||Q||T_max| log |Σ|), where |T_max| is the size of the largest connected component of failure transitions. Additionally, we propose faster algorithms for two specific cases. For ring-weighted WFSAs we propose an algorithm with complexity O(|E| + s|Σ||Q||π_max|), where |π_max| denotes the longest path length of failure transitions stemming from q and Σ(q) the set of symbols on the outgoing transitions from q. For semiring-weighted WFSAs whose failure transition topology satisfies a condition exemplified by CRFs, we propose an algorithm with complexity O(|E| + s|Σ||Q| log |Σ|).

Failure transitions are a useful augmentation of standard WFSAs. First introduced in the context of string matching (Aho and Corasick, 1975), they can be used to represent backoff n-gram language models (Allauzen et al., 2003), higher-order CRFs, and variable-order CRFs (VoCRFs; Vieira et al., 2016) in a more compact way. They represent "default" transitions out of states when no other transition is possible. For example, in backoff n-gram language models, a weighted failure transition from a higher-order history to a lower-order history (e.g., from a 4-gram to a 3-gram) is used to back off before reading a word that was rarely observed with the higher-order history, so that it was not worth including a dedicated transition for that word.
The pathsum computes the total weight of all the paths in a WFSA graph, where the weights may fall in any semiring. Examples include finding the highest-weighted path for Viterbi decoding, computing the posterior marginals (inference) in hidden Markov models, and computing the normalizing constant in CRFs. The pathsum is particularly efficient to compute in acyclic WFSAs with the backward algorithm, whose runtime is O(|E|). However, the special semantics of failure transitions mean that the ordinary backward algorithm cannot be applied (nor can the forward algorithm). Failure transitions must first be replaced by normal ones (Alg. 2 below), resulting in the failure-expanded transition set Ē, which can contain up to |Q|²|Σ| transitions. Replacing failure transitions, therefore, undoes the compaction afforded by them. This is especially expensive (|Ē| ≫ |E|) for backoff language models, for example, where each of the many 4-gram states only has explicit transitions in E for symbols a that were observed in training data to follow that 4-gram, but has transitions in Ē for every a ∈ Σ. For example, Penn Treebank tagging has |Σ| = 36 and Czech morphological tagging has |Σ| > 1000 (Hajič and Hladká, 1998). While Allauzen et al. (2003) present an O(n²|Σ||Q|) method to preprocess a (possibly cyclic) n-gram language model WFSA with failure transitions such that the pathsum remains identical, their method only applies to the case of the tropical semiring.
In this paper, we study the problem of efficiently computing the pathsum in WFSAs with failure transitions over general semirings. We specifically focus on acyclic WFSAs,² introducing several algorithms, all based on the backward algorithm, that take advantage of the compact structure induced by the failure transitions. Our improvements are strongest for WFSAs that are sparse in a sense to be defined shortly. We summarise our contributions as follows:
• We present simple baseline algorithms using failure transition removal (§3.1) and memoization (§3.2).
• With some extra work to avoid subtraction (§5), we extend the algorithm to general semirings (§6).

Preliminaries
This section defines WFSAs, the pathsum problem, the backward algorithm, and failure transitions.
Definition 1. A semiring is a 5-tuple W = (K, ⊕, ⊗, 0, 1) where K is a set equipped with operations ⊕ and ⊗, s.t. (K, ⊕, 0) is a commutative monoid, (K, ⊗, 1) is a monoid, ⊗ distributes over ⊕, and 0 is an annihilator for ⊗ (0 ⊗ x = x ⊗ 0 = 0 for all x ∈ K).

Definition 2. A weighted finite-state automaton (WFSA) is a 5-tuple A = ⟨Σ, Q, E, λ, ρ⟩, where Σ is a finite alphabet, Q a finite set of states, E a collection of transitions in Q × Σ × K × Q, λ : Q → K the initial-state weighting function, and ρ : Q → K the final-state weighting function.

² We have also worked out several novel algorithms for cyclic WFSAs with failure transitions. However, the 9-page format of EMNLP submissions meant that we had to save these for a future manuscript. Moreover, we view the split into acyclic and cyclic cases as natural, as our algorithms for cyclic automata are based on the more complex algorithm of Lehmann (1977). While the acyclic algorithms cannot be run on a backoff language model (one of our examples), they can be used, for example, to compute the total language model probability of all paths in an acyclic lattice that may have wildcard arcs and/or failure arcs. Even though the language model is cyclic, its intersection with the lattice becomes acyclic.
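To make Definition 1 concrete, the following minimal sketch encodes a semiring as a small Python record and spot-checks the axioms on a finite sample. The names `Semiring`, `REAL`, `MAX_PLUS`, `NOT_A_SEMIRING`, and `check_axioms` are illustrative assumptions and do not come from the paper; a finite spot-check is of course not a proof.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Generic, TypeVar

K = TypeVar("K")

@dataclass(frozen=True)
class Semiring(Generic[K]):
    plus: Callable[[K, K], K]   # ⊕
    times: Callable[[K, K], K]  # ⊗
    zero: K                     # identity of ⊕, annihilator of ⊗
    one: K                      # identity of ⊗

# Two standard instances (hypothetical names).
REAL = Semiring(lambda x, y: x + y, lambda x, y: x * y, 0.0, 1.0)
MAX_PLUS = Semiring(max, lambda x, y: x + y, float("-inf"), 0.0)
# (max, max) with these constants fails annihilation, so it is rejected.
NOT_A_SEMIRING = Semiring(max, max, float("-inf"), float("-inf"))

def check_axioms(sr, samples):
    """Spot-check the axioms of Definition 1 on a finite sample of K."""
    for x, y, z in product(samples, repeat=3):
        if sr.plus(x, y) != sr.plus(y, x):
            return False
        if sr.plus(sr.plus(x, y), z) != sr.plus(x, sr.plus(y, z)):
            return False
        if sr.times(sr.times(x, y), z) != sr.times(x, sr.times(y, z)):
            return False
        if sr.times(x, sr.plus(y, z)) != sr.plus(sr.times(x, y), sr.times(x, z)):
            return False
        if sr.times(sr.plus(x, y), z) != sr.plus(sr.times(x, z), sr.times(y, z)):
            return False
    for x in samples:
        if sr.plus(x, sr.zero) != x:
            return False
        if sr.times(x, sr.one) != x or sr.times(sr.one, x) != x:
            return False
        if sr.times(x, sr.zero) != sr.zero or sr.times(sr.zero, x) != sr.zero:
            return False
    return True

SAMPLES = [0.0, 1.0, 2.0, -3.0]  # small integers keep float arithmetic exact
```

The max-plus instance has no subtraction for ⊕ = max, which is the situation the aggregator of §5 is designed for.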
To improve readability, we render a transition (q, a, w, q′) as $q \xrightarrow{a/w} q'$. We further define $E(q) \overset{\text{def}}{=} \{e \mid \exists a, w, q' : e = q \xrightarrow{a/w} q' \in E\}$ as the set of outgoing transitions of q ∈ Q, and E(q, a) as those labeled with a ∈ Σ. $\Sigma(q) \overset{\text{def}}{=} \{a \mid E(q, a) \neq \emptyset\}$ denotes the set of transition labels in E(q).
Importantly, we will assume that the graph (Q, E) is acyclic (see footnote 2). Less importantly, our definition of WFSAs does not allow ε-transitions, assuming that they have been eliminated in advance (Mohri, 2002a), which is easy in the acyclic case. Our runtime analyses assume for simplicity that (i) the graph is connected (implying |E| ≥ |Q| − 1) and (ii) that for each q, q′ ∈ Q, E contains at most one transition $q \xrightarrow{a/w} q'$ for any a ∈ Σ. This can always be achieved by replacing "parallel" transitions {q … to the initial and final states of π, respectively. Π(A) denotes the set of all paths in A.
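Definition 5. The pathsum of A is defined as

$$Z(A) \overset{\text{def}}{=} \bigoplus_{\pi \in \Pi(A)} \lambda(p(\pi)) \otimes w_I(\pi) \otimes \rho(n(\pi)), \tag{1}$$

where p(π) and n(π) denote the initial and final states of π and w_I(π) its inner weight. The problem of computing the pathsum is sometimes also referred to as the generalized shortest-distance problem (Mohri, 2002b).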
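Definition 6. The backward value β(q) of a state q ∈ Q is the sum of the inner weights w_I(π) of all paths π starting at q (i.e., with initial state p(π) = q), right-multiplied by ρ(n(π)), i.e.,

$$\beta(q) \overset{\text{def}}{=} \bigoplus_{\pi \in \Pi(A),\ p(\pi) = q} w_I(\pi) \otimes \rho(n(\pi)). \tag{2}$$

We extend this definition to state-symbol pairs (q, a) ∈ Q × Σ as

$$\beta(q, a) \overset{\text{def}}{=} \bigoplus_{q \xrightarrow{a/w} q' \in E(q, a)} w \otimes \beta(q'). \tag{3}$$

The value β(q, a) can be seen as the result of restricting the paths contributing to β(q) to those starting with a ∈ Σ.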
Naïvely computing the pathsum by enumerating all π ∈ Π(A) in an acyclic WFSA would result in an exponential runtime. However, algebraic properties of semirings allow for faster algorithms (Mohri, 2002b). An example is the backward algorithm, a dynamic program which computes backward values and the pathsum in acyclic WFSAs in time O(|E|). It exploits the fact that, in acyclic WFSAs, Q can always be topologically sorted and the backward values can be computed in reverse topological order. This guarantees that the backward values of q's children will have been computed by the time we expand q, meaning that β(q) can be computed as

$$\beta(q) = \rho(q) \oplus \bigoplus_{q \xrightarrow{a/w} q' \in E(q)} w \otimes \beta(q'). \tag{5}$$

The pseudocode is given in Alg. 1. All our algorithms are based on the backward algorithm.
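The backward recursion described above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's Alg. 1: the function names, the edge-tuple layout `(q, a, w, q2)`, and the convention that missing entries of `lam` and `rho` default to the semiring zero are all assumptions made for the example.

```python
from collections import defaultdict

def backward(states_topo, edges, rho, plus, times, zero):
    """Backward values for an acyclic WFSA without failure transitions.

    states_topo: states in topological order (assumed given).
    edges: list of (q, a, w, q2) tuples.
    """
    out = defaultdict(list)
    for q, a, w, q2 in edges:
        out[q].append((a, w, q2))
    beta = {}
    for q in reversed(states_topo):          # children are finished first
        total = rho.get(q, zero)
        for _a, w, q2 in out[q]:
            total = plus(total, times(w, beta[q2]))
        beta[q] = total
    return beta

def pathsum(states_topo, edges, lam, rho, plus, times, zero):
    """Combine initial weights with backward values."""
    beta = backward(states_topo, edges, rho, plus, times, zero)
    z = zero
    for q in states_topo:
        z = plus(z, times(lam.get(q, zero), beta[q]))
    return z

states = [0, 1, 2]
edges = [(0, "a", 0.5, 1), (0, "b", 0.5, 2), (1, "a", 1.0, 2)]

# Real semiring: sums the weights of both paths from state 0 to state 2.
z_real = pathsum(states, edges, {0: 1.0}, {2: 1.0},
                 lambda x, y: x + y, lambda x, y: x * y, 0.0)

# Max-plus semiring: score of the best path (weights read as additive scores).
z_best = pathsum(states, edges, {0: 0.0}, {2: 0.0},
                 max, lambda x, y: x + y, float("-inf"))
```

Each edge is touched once, which is where the O(|E|) runtime comes from.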

Failure Transitions
We consider an extension of WFSAs where any state can have a single fallback state q_ϕ.

Definition 7. A WFSA with failure transitions (WFSA-ϕ) is a 6-tuple A = ⟨Σ, Q, E, λ, ρ, ϕ⟩, where ϕ is a failure function, i.e., a partial function that maps some states q ∈ Q to their fallback state ϕ(q) = q_ϕ.
Fallback states can be represented by transitions $q \xrightarrow{\phi/1} q_\phi$ with a special meaning: they are only traversed upon reading a symbol a ∉ Σ(q) and thus represent a default option used when no ordinary transition is available. This formalization means that every state has at most one fallback state.
We do not include ϕ in Σ or ϕ-transitions in E. We denote the set of ϕ-transitions as E_ϕ and assume that E ∪ E_ϕ still forms an acyclic graph.
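ϕ-transitions can be explicitly represented in a normal WFSA by expansion of ϕ-transitions.

Definition 8. Given an acyclic WFSA-ϕ A = ⟨Σ, Q, E, λ, ρ, ϕ⟩, we introduce the recursively defined failure-expanded transition set as follows

$$\overline{E}(q, a) \overset{\text{def}}{=} \begin{cases} E(q, a) & \text{if } a \in \Sigma(q), \\ \{\, q \xrightarrow{a/w} q' \mid q_\phi \xrightarrow{a/w} q' \in \overline{E}(q_\phi, a) \,\} & \text{if } a \notin \Sigma(q) \text{ and } q \text{ has a fallback state } q_\phi, \\ \emptyset & \text{otherwise,} \end{cases}$$

and the set Ē ⊆ Q × Σ × K × Q as the union of these sets over Q and Σ.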
Ē(q, a) is well-defined due to the assumed acyclicity of E_ϕ. It may be empty. Ē captures all "indirect" transitions which can be made across arbitrarily long paths of only ϕ-transitions. Σ̄(q), analogously to Σ(q), denotes the set of outgoing symbols for q ∈ Q in the failure-expanded WFSA-ϕ.

Definition 9. We define the average out-symbol fraction s of a WFSA as

$$s \overset{\text{def}}{=} \frac{\sum_{q \in Q} |\Sigma(q)|}{|Q|\,|\Sigma|}.$$

s ∈ [0, 1] is a measure of completeness of the WFSA. We correspondingly define s̄, the equivalent in the failure-expanded transition set Ē.
We say informally that a WFSA is Σ-sparse if s ≪ 1, so on average |Σ(q)| ≪ |Σ|. Intuitively, this means that the average state only has outgoing transitions on a few distinct symbols. We will show that the runtime tradeoff between our baseline pathsum algorithm MemoizationBackward (Alg. 3) and later algorithms depends on the difference between s̄ and s. Our algorithms are efficient when s ≪ s̄: intuitively, in the regime where failure expansion would add outgoing transitions for many new symbols.
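As a small illustration of the sparsity measure, the sketch below computes the average out-symbol fraction before and after failure expansion on a toy automaton. The helper names and the toy data are hypothetical; the failure function is assumed to be given as a dict `phi` from a state to its fallback state, and only symbol sets (not weights) are tracked.

```python
def local_symbols(states, edges):
    """Out-symbol set of each state, from edges given as (q, a, w, q2)."""
    symbols = {q: set() for q in states}
    for q, a, _w, _q2 in edges:
        symbols[q].add(a)
    return symbols

def expanded_symbols(states, edges, phi):
    """Out-symbol sets after failure expansion: own symbols plus inherited ones."""
    local = local_symbols(states, edges)
    expanded = {}

    def collect(q):
        if q not in expanded:
            expanded[q] = set(local[q])
            if q in phi:                      # inherit along the failure arc
                expanded[q] |= collect(phi[q])
        return expanded[q]

    for q in states:
        collect(q)
    return expanded

def out_symbol_fraction(states, alphabet, symbols):
    """Average fraction of the alphabet that labels a state's outgoing arcs."""
    return sum(len(symbols[q]) for q in states) / (len(states) * len(alphabet))

states = [0, 1, 2, 3]
alphabet = ["a", "b"]
edges = [(0, "a", 2.0, 3), (1, "a", 3.0, 3), (1, "b", 5.0, 3)]
phi = {0: 1}                                  # state 0 backs off to state 1

s = out_symbol_fraction(states, alphabet, local_symbols(states, edges))
s_bar = out_symbol_fraction(states, alphabet, expanded_symbols(states, edges, phi))
```

Here state 0 inherits the symbol b from its fallback state, so the expanded fraction is larger than the original one.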
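Correcting Eq. (5) to take ϕ-transitions into account, the backward values in a WFSA-ϕ can be computed as

$$\beta(q) = \rho(q) \oplus \bigoplus_{a \in \Sigma} \ \bigoplus_{q \xrightarrow{a/w} q' \in \overline{E}(q, a)} w \otimes \beta(q').$$

Importantly, the following equality holds:

$$\beta(q, a) = \begin{cases} \bigoplus_{q \xrightarrow{a/w} q' \in E(q, a)} w \otimes \beta(q') & \text{if } a \in \Sigma(q), \\ \beta(q_\phi, a) & \text{if } a \notin \Sigma(q) \text{ and } q \text{ has a fallback state } q_\phi, \\ 0 & \text{otherwise.} \end{cases}$$

This follows straight from the definition of ϕ. It states that the backward values of the state-symbol pairs (q, a) in a WFSA-ϕ equal the ones in a normal WFSA if an a-labeled transition can be taken; if not, the backward value is inherited from the fallback state, since the 1-weighted ϕ-transition is taken.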
Connected components of the graph formed by ϕ-transitions of a WFSA-ϕ are trees (specifically, anti-arborescences) since a state can have at most one outgoing ϕ-transition and the WFSA is acyclic. This motivates the following definition.

Definition 10. Let A be an acyclic WFSA-ϕ. A failure tree T is a connected component of the graph formed by ϕ-transitions of A.

An example of a failure tree T is shown in Fig. 1a. We write |T| for the number of states in T, with |T_max| being the number of states in the largest failure tree and |π_max| the number of states in the longest failure path.
T_q denotes the failure tree containing q ∈ Q. We write q ≺ q′ to say that q is a proper ancestor of q′ in T_q, i.e., there is a non-empty ϕ-path from q to q′.

Expanding Failure Transitions
The pathsum of a WFSA-ϕ can be naïvely computed by replacing the ϕ-transitions with normal ones according to the semantics of the ϕ-transitions and running the backward algorithm on the expanded WFSA. Before introducing our contributions, we present this method for pedagogical purposes. While this solution is near-optimal for non-Σ-sparse WFSAs, it can be improved for certain Σ-sparse WFSAs.

Expanding Failure Transitions
Failure expansion is a transformation of an acyclic WFSA-ϕ which replaces the ϕ-transitions while retaining acyclicity. See Alg. 2 for the pseudocode, Fig. 1b for an example of failure expansion, and App. A for an example of how the backward algorithm operates in this setting.
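A minimal sketch of failure expansion, written from the semantics of ϕ-transitions rather than from Alg. 2 itself: a state keeps its own arcs and inherits, for every symbol it lacks, the (already expanded) arcs of its fallback state. The function name `expand_failures`, the edge layout `(q, a, w, q2)`, and the dict `phi` are assumptions of this example, and the recursion assumes failure chains are short enough for Python's call stack.

```python
def expand_failures(states, edges, phi):
    """Return the failure-expanded edge list of an acyclic WFSA with failure arcs."""
    own = {q: {} for q in states}
    for q, a, w, q2 in edges:
        own[q].setdefault(a, []).append((w, q2))

    expanded = {}

    def arcs(q):
        """Symbol -> list of (weight, target) after expansion, memoized per state."""
        if q not in expanded:
            by_symbol = {a: list(ts) for a, ts in own[q].items()}
            if q in phi:
                for a, ts in arcs(phi[q]).items():
                    # Inherit only symbols that q cannot read itself.
                    by_symbol.setdefault(a, list(ts))
            expanded[q] = by_symbol
        return expanded[q]

    return [(q, a, w, q2)
            for q in states
            for a, ts in arcs(q).items()
            for w, q2 in ts]

states = [0, 1, 3]
edges = [(0, "a", 2.0, 3), (1, "a", 3.0, 3), (1, "b", 5.0, 3)]
phi = {0: 1}   # state 0 backs off to state 1 on symbols it cannot read
expanded_edges = expand_failures(states, edges, phi)
```

State 0 gains a copy of state 1's b-arc but keeps its own a-arc, which overrides the one at the fallback state. The expanded list can then be handed to an ordinary backward pass.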

Decomposing the Backward Values
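The algorithms we present in later sections sidestep the need to materialize all additional transitions replacing the failure transition. They are based on a decomposition of the backward values into two components: the local and the failure component. Using Eq. (4), we can split β(q) into

$$\beta(q) = \rho(q) \oplus \beta(q, \Sigma), \tag{10}$$

$$\beta(q, \Sigma) = \beta(q, \Sigma(q)) \oplus \beta(q, \Sigma \setminus \Sigma(q)), \tag{11}$$

where β(q, S) denotes ⊕_{a∈S} β(q, a) for a set of symbols S ⊆ Σ. The two terms on the right-hand side of Eq. (11) can be further expanded as

$$\beta(q, \Sigma(q)) = \bigoplus_{a \in \Sigma(q)} \ \bigoplus_{q \xrightarrow{a/w} q' \in E(q, a)} w \otimes \beta(q'), \tag{12}$$

$$\beta(q, \Sigma \setminus \Sigma(q)) = \bigoplus_{b \in \Sigma \setminus \Sigma(q)} \beta(q_\phi, b), \tag{13}$$

except that the second term is 0 if q has no failure transition (in which case q_ϕ is not defined). β(q, Σ(q)) is exactly the quantity computed by Alg. 1 on line 3; our modifications never change this computation. Rather, all of our algorithms seek to simplify the computation of β(q, Σ ∖ Σ(q)).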
Eq. (13) makes it possible to avoid failure expansion by storing not only β(q) but also the values β(q, a) at each state q. Since q_ϕ will then memoize all needed β(q_ϕ, b) values, the sum (13) becomes easy to compute for any q that may back off to q_ϕ. Passing the summand β(q_ϕ, b) back to q is cheaper than passing back all of the arcs $q_\phi \xrightarrow{b/w} q' \in \overline{E}$ that contribute to that summand, as Alg. 2 does: a nondeterministic WFSA may have multiple such arcs. The pseudocode of this modification is presented in Alg. 3. Notice the additional term β(q, Σ ∖ Σ(q)) on line 10 in Alg. 3, which was not needed in the backward algorithm for ordinary WFSAs. See App. A for a guided example on a small WFSA.
In the general case of non-deterministic WFSAs, failure expansion may have to loop over as many as |Q||Σ ∖ Σ(q)| transitions at each state q. Alg. 3 reduces this to a loop over |Σ ∖ Σ(q)| symbols, which is (s̄ − s)|Σ| on average. The full complexity of Alg. 3 is then O(|E| + (s̄ − s)|Σ||Q|) (similarly to App. B).
The shortcoming of Alg. 3 is that (s̄ − s)|Σ| may still be large. The terms β(q_ϕ, b) must be individually copied back to q as β(q, b) for each b ∈ Σ ∖ Σ(q). Our proposed algorithms in the following subsections avoid the overhead incurred by this copying.
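The memoization idea can be sketched as follows. This is a simplified reading of the approach, not a transcription of Alg. 3: each state stores a dict of its per-symbol backward values, and a state with a fallback copies over the entries for symbols it cannot read itself. The function name, the edge layout, and the requirement that the caller supplies a reverse topological order of the graph including failure arcs are assumptions of this example.

```python
def memoization_backward(rev_topo, edges, phi, rho, plus, times, zero):
    """Backward values with failure arcs, memoizing beta(q, a) at every state.

    rev_topo: states in reverse topological order w.r.t. ordinary and failure arcs.
    """
    own = {q: {} for q in rev_topo}
    for q, a, w, q2 in edges:
        own[q].setdefault(a, []).append((w, q2))

    beta, beta_sym = {}, {}
    for q in rev_topo:
        per_symbol = {}
        # Local component: symbols with explicit arcs at q.
        for a, arcs in own[q].items():
            total = zero
            for w, q2 in arcs:
                total = plus(total, times(w, beta[q2]))
            per_symbol[a] = total
        # Failure component: copy the fallback state's values for missing symbols.
        if q in phi:
            for b, value in beta_sym[phi[q]].items():
                if b not in own[q]:
                    per_symbol[b] = value
        total = rho.get(q, zero)
        for value in per_symbol.values():
            total = plus(total, value)
        beta[q], beta_sym[q] = total, per_symbol
    return beta

edges = [(0, "a", 2.0, 3), (1, "a", 3.0, 3), (1, "b", 5.0, 3)]
beta = memoization_backward([3, 1, 0], edges, {0: 1}, {3: 1.0},
                            lambda x, y: x + y, lambda x, y: x * y, 0.0)
```

The copying loop in the failure component is exactly the per-symbol overhead that the later algorithms try to avoid.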

Runtime
The runtime of Alg. 4 is on the order of the number of calls to line 10, plus |E| to cover all the sums in line 11 (which executes at most once for each q, a pair, thanks to memoization). Every a ∈ Σ(q) results in two such calls, at lines 6 and 8; there is also a possible recursive call at line 12 if a ∈ Σ(q′) for at least one proper ancestor q′ ≺ q in the failure tree (thanks to memoization, this happens at most once per q, a pair, even if there are multiple choices of q′). Thus, the overall runtime is O(|E| + |Σ||Q| min(1, s|π_max|)). We will revisit ring-weighted WFSA-ϕ's in §6.4.

Incrementally Modified Aggregator
The point of Alg. 4 line 9 is to replace some summands of β(q_ϕ, Σ) to get β(q, Σ). When no subtraction operator ⊖ is available (e.g., if ⊕ = max), we can use an aggregation data structure that is designed to efficiently replace individual summands in a sum without using subtraction. For example, a Fenwick tree (Fenwick, 1994) can replace a summand and recompute the sum in O(log N) time, where N is the number of summands. (Fenwick trees are similar to binary heaps; they are reviewed in App. C.) Here we merely give the interface to aggregators:

1: class Aggregator(): ▷ We use γ to refer to an aggregator instance
⋮
5:   def undo(n: ℕ) ▷ Reverts the last n updates

We will represent each sum β(q, Σ) in Alg. 4 as the total value of an aggregator that stores summands β(q, a) for a ∈ Σ. In principle, this aggregator could be obtained by copying the aggregator for β(q_ϕ, Σ) and then modifying some summands (see line 9). However, aggregators are not constant-size data structures, so creating all of these slightly different aggregators would be expensive.

[Footnote: … called in Alg. 4, the memo (for that particular a) may not yet exist and will have to be filled in on demand.]
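One way such an aggregator could look is sketched below, assuming a fixed, non-empty alphabet known at construction time. It stores the summands at the leaves of an array-backed binary tree of ⊕-sums, so replacing a summand recomputes only its O(log N) ancestors, and it keeps a history stack to support `undo`. The class layout and the `get` method are assumptions of this sketch; only `set`, `value`, and `undo` are named in the text.

```python
class Aggregator:
    """Subtraction-free aggregator over a fixed alphabet, with undo."""

    def __init__(self, alphabet, plus, zero):
        self.pos = {a: i for i, a in enumerate(alphabet)}
        self.n = len(self.pos)                 # assumed to be at least 1
        self.plus = plus
        self.tree = [zero] * (2 * self.n)      # leaves live at indices n .. 2n-1
        self.history = []                      # stack of (leaf index, old value)

    def _write(self, i, v):
        """Store v at leaf i and recompute the sums on the path to the root."""
        i += self.n
        self.tree[i] = v
        i //= 2
        while i >= 1:
            self.tree[i] = self.plus(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def get(self, a):
        return self.tree[self.n + self.pos[a]]

    def set(self, a, v):
        i = self.pos[a]
        self.history.append((i, self.tree[self.n + i]))
        self._write(i, v)

    def undo(self, n):
        """Revert the last n calls to set, most recent first."""
        for _ in range(n):
            i, old = self.history.pop()
            self._write(i, old)

    def value(self):
        """Total of all summands; the root of the tree."""
        return self.tree[1]

gamma = Aggregator("abcd", max, float("-inf"))   # ⊕ = max has no subtraction
gamma.set("a", 3)
gamma.set("b", 5)
after_two = gamma.value()
gamma.set("b", 1)            # overwrite the current maximum
after_overwrite = gamma.value()
gamma.undo(1)
after_undo = gamma.value()
```

Because ⊕ is commutative, the order in which the array layout combines leaves does not matter, even when the alphabet size is not a power of two.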
Instead, our strategy will be to use just a single aggregator, for the "current" state q, and make small modifications as we visit different states q′. More precisely, we have one aggregator γ per failure tree, first created at the tree's root. When we step backwards in the failure tree, say from q_ϕ to q, we modify "just a few" summands in γ so that β(q, a) replaces β(q_ϕ, a) for a ∈ Σ(q). This is fast if Σ(q) is small. We can now obtain β(q, Σ) as the aggregator's new total value. To visit other ancestors of q_ϕ, we must first move forward to q_ϕ again, which we do by reverting the modifications.

Definition 11. Aggregator γ represents q ∈ Q if γ(a) = β(q, a), ∀a ∈ Σ.

γ will be updated to represent different states in the failure tree at different times. When γ represents q, it holds that β(q) = ρ(q) ⊕ γ.value(), by (10).
Updates are carried out by the methods in Alg. 5, which move backward and forward in a failure tree. When γ represents q_ϕ, we can call Visit(γ, q) to update γ so that it represents q. At any later time when γ again represents q, we can call Leave(γ, q) to undo this update, so that γ again represents q_ϕ. Each Visit(γ, q) or Leave(γ, q) call runs in time O(|Σ(q)| log |Σ|).

Algorithm 5
1: def Visit(γ, q): ▷ update γ that represented q_ϕ to represent q
2:   for a ∈ Σ(q) :
3:     γ.set(a, β(q, a)) ▷ Use the memoizing β(q, a) from Alg. 4
4: def Leave(γ, q): ▷ update γ that represented q to represent q_ϕ
5:   γ.undo(|Σ(q)|) ▷ revert all the updates made by Visit

Note that Visit(γ, q) accomplishes the same goal as Alg. 4 line 9, but with an extra runtime factor of O(log |Σ|) to avoid subtraction. It may still be faster than the O(|Σ ∖ Σ(q)|) runtime of Alg. 3 line 10, when |Σ(q)| is quite small relative to |Σ|.

A General Backward Algorithm for Acyclic WFSAs with Failure Transitions
Alg. 6 is our most general version of the backward algorithm for computing the pathsum of an acyclic WFSA-ϕ. It makes use of the Aggregator and pseudocode from the previous section.

Algorithm 6
1: def GeneralBackward(A):
2:   for q ∈ ReverseTopological(A) :
3:     T ← T_q ▷ failure tree containing q
4:     if q has no fallback state : ▷ q is root of failure tree
5:       γ_T ← new Aggregator() ▷ New empty aggregator
6:       Visit(γ_T, q); q_T ← q ▷ Initialize γ_T & remember q
7:     else
8:       while q_T is not a descendant of q in T :
9:         Leave(γ_T, q_T); q_T ← (q_T)_ϕ ▷ Descend in T
10:      Visit⁺(γ_T, q, q_T); q_T ← q ▷ Ascend in T
11:    ▷ Now γ_T represents q (thanks to all of the above)
12:    β(q) ← ρ(q) ⊕ γ_T.value()
13:  return ⊕_{q∈Q} λ(q) ⊗ β(q)
14: def Visit⁺(γ, q, q′): ▷ update γ that represented q′ to represent q
15:   if q_ϕ ≠ q′ : Visit⁺(γ, q_ϕ, q′)
16:   Visit(γ, q)

Like Alg. 3, this computes β(q) at all states in reverse topological order. However, it attempts to share work among states q in the same failure tree T, by having them share an aggregator γ_T that currently represents some state q_T ∈ T (in the sense of Definition 11). The algorithm updates the aggregator to represent q, by descending in the failure tree until it reaches a common descendant, and then ascending again until it reaches q.
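The control flow of this procedure can be sketched in Python as below. This is an illustrative reading, not the authors' implementation. To stay short it uses a deliberately simple dict-based aggregator whose `value()` takes linear time (so it does not achieve the stated complexity), tests descendancy by walking failure arcs, and expects the caller to supply a reverse topological order of the graph including failure arcs. All names are hypothetical.

```python
class SimpleAggregator:
    """Dict-based stand-in for an aggregator: same interface, O(N) value()."""

    def __init__(self, plus, zero):
        self.plus, self.zero = plus, zero
        self.vals, self.history = {}, []

    def set(self, a, v):
        self.history.append((a, self.vals.get(a, self.zero)))
        self.vals[a] = v

    def undo(self, n):
        for _ in range(n):
            a, old = self.history.pop()
            self.vals[a] = old

    def value(self):
        total = self.zero
        for v in self.vals.values():
            total = self.plus(total, v)
        return total

def general_backward(rev_topo, edges, phi, rho, plus, times, zero):
    own = {q: {} for q in rev_topo}
    for q, a, w, q2 in edges:
        own[q].setdefault(a, []).append((w, q2))
    beta, beta_sym = {}, {}
    gamma, current = {}, {}      # per failure-tree root: aggregator, represented state

    def root_of(q):
        while q in phi:
            q = phi[q]
        return q

    def on_failure_path(d, q):
        """True if d is q or is reached from q by following failure arcs."""
        while q != d and q in phi:
            q = phi[q]
        return q == d

    def visit(g, q):
        for a, v in beta_sym[q].items():
            g.set(a, v)

    for q in rev_topo:
        beta_sym[q] = {}
        for a, arcs in own[q].items():               # local component
            total = zero
            for w, q2 in arcs:
                total = plus(total, times(w, beta[q2]))
            beta_sym[q][a] = total
        t = root_of(q)
        if q not in phi:                             # q is the root of its tree
            gamma[t] = SimpleAggregator(plus, zero)
            visit(gamma[t], q)
        else:
            while not on_failure_path(current[t], q):    # Leave: descend
                gamma[t].undo(len(beta_sym[current[t]]))
                current[t] = phi[current[t]]
            path, x = [], q
            while x != current[t]:                       # Visit+: ascend to q
                path.append(x)
                x = phi[x]
            for x in reversed(path):
                visit(gamma[t], x)
        current[t] = q
        beta[q] = plus(rho.get(q, zero), gamma[t].value())
    return beta

edges = [("r", "a", 3, "f"), ("r", "b", 5, "f"), ("u", "a", 2, "f"),
         ("v", "b", 7, "f"), ("v", "c", 1, "f"), ("x", "c", 4, "f")]
phi = {"u": "r", "v": "r", "x": "u"}
beta = general_backward(["f", "r", "u", "v", "x"], edges, phi, {"f": 1},
                        lambda x, y: x + y, lambda x, y: x * y, 0)
```

In this toy run, moving from v to x forces one Leave (back to r) followed by two Visits (u, then x), which is the descend-then-ascend pattern described above.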
To make line 8 efficient, we preprocess each failure tree by visiting its states in depth-first order and annotating each state with the time interval during which it is on the stack.⁹ The loop at line 8 continues until the q_T interval contains the q interval.
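The interval annotation can be sketched as follows, assuming the failure function is a dict `phi` from a state to its fallback state. The depth-first search starts at each failure-tree root and moves to the states that back off to the current one; a clock advances on every push and pop. The function names are hypothetical.

```python
def dfs_intervals(states, phi):
    """Annotate every state with the clock interval during which it is on the stack."""
    backers = {q: [] for q in states}       # states whose fallback is q
    roots = []
    for q in states:
        if q in phi:
            backers[phi[q]].append(q)
        else:
            roots.append(q)

    enter, leave, clock = {}, {}, 0
    for root in roots:
        enter[root] = clock
        clock += 1
        stack = [(root, iter(backers[root]))]
        while stack:
            q, pending = stack[-1]
            nxt = next(pending, None)
            if nxt is None:                 # all backers handled: pop q
                leave[q] = clock
                clock += 1
                stack.pop()
            else:                           # push the next backer
                enter[nxt] = clock
                clock += 1
                stack.append((nxt, iter(backers[nxt])))
    return enter, leave

def interval_contains(d, q, enter, leave):
    """True if d's interval contains q's, i.e. d lies on q's failure path (or d == q)."""
    return enter[d] <= enter[q] and leave[q] <= leave[d]

phi = {"u": "r", "v": "r", "x": "u"}
enter, leave = dfs_intervals(["r", "u", "v", "x", "f"], phi)
```

With these annotations, the descendancy test becomes two integer comparisons instead of a walk along failure arcs.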

Runtime
As in Alg. 4 (see §4.1), O(|E|) runtime is needed to sum over the non-failure transitions from each state. The rest of the runtime is dominated by the calls to Visit and Leave. Recall from §5 that visiting or leaving q takes time O(|Σ(q)| log |Σ|). Since a state can be left at most once for each time it is visited, it suffices to count just the visits.
The number of visits to each state depends on the (reverse) topological order used at line 2. In the best case, q iterates over the states of each failure tree in depth-first order, starting at the root. Then Visit is called only on the current iterate q, either as a root (line 6) or as a parent (line 15). Since each state is Visited exactly once, the total runtime is O(|E| + ∑_{q∈Q} |Σ(q)| log |Σ|).

In the worst case, however, each q at line 2 is far in the failure tree from the previous one, forcing q_T to descend all the way to the root and then ascend again to q. This means line 15 visits all states q′ for which q ⪯ q′. The total runtime is therefore O(|E| + ∑_{q′∈Q} |Σ(q′)| ancs(q′) log |Σ|), where ancs(q′) ≝ |{q : q ⪯ q′}| is the number of ancestors of q′ in the failure tree. Renaming the summation variable, we get O(|E| + ∑_{q∈Q} |Σ(q)| ancs(q) log |Σ|).¹⁰

We can get a simpler but looser worst-case bound by increasing ancs(q) to |T_max|, the maximum size of any failure tree. Rewriting this in terms of s, we have bounded the runtime by O(|E| + s|Σ||Q||T_max| log |Σ|), where, however, in the best case we avoid the |T_max| factor.
The worst-case behavior is illustrated by Fig. 2, where the only possible topological order is 1, 2, 3, 4, 5, …. When line 2 iterates over state 5 immediately after state 4, the aggregator must transition from representing state 4 to representing state 5. Note that this involves 2 Visits, as 2 is the height of state 5. If the a arcs were not present in Fig. 2, however, then 1, 2, 4, …, 3, 5, … would also be a topological order, which achieves the best-case behavior of visiting each state only once. Indeed, many topological orders would be available, some more efficient than others.

⁹ During depth-first search, the "clock" is a counter that starts at 0 and advances by 1 on each recursive call or return. Thus, the "time interval" is a pair of small integers.

¹⁰ This can be slightly improved by noting that line 15 never Visits the root of the failure tree, so the ∑_{q∈Q} can omit the root. (Thus, the runtime is O(|E|) on a WFSA without failure arcs.) Although the root is still Visited once at line 6, the cost of that visit can be folded into the |E| term, since the initial creation of the aggregator at the root state q can be accomplished in time only O(|Σ(q)|) without the log factor (see App. C).

Topological Sorting Heuristics
It is desirable to choose a good topological order when one is available. In particular, the "best-case" scenario above is achieved under this condition:

Definition 12. Let A be an acyclic WFSA-ϕ. Given a reverse topological order of the states, we say that q completely precedes q′ if q and all its failure-tree ancestors precede q′ and all its failure-tree ancestors. We say that the order is compatible with the failure trees of A if whenever q, q′ are in the same failure tree¹¹ but have disjoint sets of ancestors, either q completely precedes q′ or vice-versa.
To put this another way, a compatible order of the WFSA states may jump back and forth among failure trees, as needed to achieve a topological ordering, but each failure tree's states will appear in some depth-first order starting at the tree's root, which ensures that each state is Visited just once.
In some backoff architectures such as variable-order conditional random fields (Vieira et al., 2016), it is easy to find a compatible order. In these WFSAs, each failure tree is associated with a position in a fixed input sentence. Simply visit the failure trees from right to left, enumerating each one's states in depth-first order starting at the root.
For the general case, we have developed a topological sorting algorithm that will find a compatible order when one exists. We begin with Kahn's (1962) agenda-based algorithm for finding a reverse topological order. It places all states onto a very simple priority queue in which "ready" states are prioritized at the front of the queue. The next state q to enumerate is obtained by popping this queue, and then the parents of q (that is, its immediate predecessors in the WFSA graph) decrement their counts of unenumerated children (i.e., immediate successors). If a parent state's count reaches 0, then it becomes ready and moves to the front portion of the queue. If the algorithm ever pops a non-ready state, then it throws an exception saying that the WFSA was cyclic.

¹¹ Remark: If we dropped the condition that q, q′ be in the same failure tree, then we would get a stronger compatibility criterion that would let all failure trees share a single aggregator.
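For reference, a plain version of Kahn's algorithm for a reverse topological order can be sketched as below. It is a common queue-based variant without any of the tie-breaking discussed in this section: it keeps only ready states in the queue and detects cycles by counting enumerated states, rather than by popping a non-ready state. Treating failure arcs as ordinary successor links, the function name, and the edge layout are assumptions of this example.

```python
from collections import deque

def reverse_topological(states, edges, phi):
    """Order states so that every successor (via arcs or failure arcs) comes first."""
    successors = {q: set() for q in states}
    for q, _a, _w, q2 in edges:
        successors[q].add(q2)
    for q, fallback in phi.items():
        successors[q].add(fallback)

    parents = {q: [] for q in states}
    for q in states:
        for q2 in successors[q]:
            parents[q2].append(q)

    # Number of not-yet-enumerated successors of each state.
    remaining = {q: len(successors[q]) for q in states}
    ready = deque(q for q in states if remaining[q] == 0)
    order = []
    while ready:
        q = ready.popleft()
        order.append(q)
        for p in parents[q]:
            remaining[p] -= 1
            if remaining[p] == 0:
                ready.append(p)
    if len(order) != len(states):
        raise ValueError("the automaton contains a cycle")
    return order

states = ["x", "v", "u", "r", "f"]
edges = [("r", "a", 3, "f"), ("r", "b", 5, "f"), ("u", "a", 2, "f"),
         ("v", "b", 7, "f"), ("v", "c", 1, "f"), ("x", "c", 4, "f")]
phi = {"u": "r", "v": "r", "x": "u"}
order = reverse_topological(states, edges, phi)
```

A tie-breaking heuristic would replace the first-in-first-out `ready` queue with a structure that prefers some ready states over others.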
Our approach is to modify Kahn's algorithm so as to break ties. Once q's children have been enumerated, Kahn's algorithm is allowed to enumerate q at any time, but our modified version prefers to wait until it would be possible to enumerate q and (eventually) its failure-tree ancestors with a single Visit each. Unfortunately, this test is expensive, so using it would not actually speed up Alg. 6. We therefore omit the details here.
In practice, we recommend using a greedy version of the above algorithm. We do wait to enumerate q until q can be enumerated with a single Visit, but we no longer worry about its ancestors. This greedy heuristic is still guaranteed to find a compatible order if the WFSA has the special property that there are no paths between states in the same failure tree (other than ϕ-paths). Variable-order CRFs do have this property. Fig. 2 does not.
Specifically, we say that a not-yet-enumerated state q ∈ T is cheap if it is a ϕ-parent of the current q_T (that is, q_ϕ = q_T), so that Alg. 6 only has to call Visit(γ_T, q) to update q_T ← q. Modify Kahn's algorithm to prioritize cheap ready states ahead of expensive ready states. Modify Alg. 6 to repeatedly descend at the end of the main loop until q_T has at least one unenumerated ϕ-parent, ensuring that there is a new cheap state in T. The hope is that this cheap state will become ready while it is still cheap (indeed, it may already be ready).

Copying Aggregators
Long Leave-Visit paths can trigger many updates to aggregator γ_T. Such paths can be shortened by splitting the failure tree into multiple smaller trees, each with its own aggregator. When we Visit a state q, we can choose to copy the aggregator from q_ϕ and update only the copy, leaving the old aggregator at q_ϕ. While this incurs a one-time copying cost, we can now split off the failure subtree rooted at q into its own failure tree. Enumerating states in this subtree will now never require visiting q's descendants. The effect is to reduce |T_max| in the analysis of §6.1. App. E presents
• a dynamic splitting heuristic that is sensitive to the actual toposort order (§6.2)
• a static splitting algorithm that uses dynamic programming to choose the optimal set of split states to minimize a worst-case bound
• runtime analysis of an idealized case to show how Alg. 6 uses copying to gracefully degrade into Alg. 3 as the WFSA becomes denser

The Ring Case
In the case of a ring, it is possible to implement a faster aggregator. The aggregator still stores N summands and their total, but no partial sums. It can replace a summand in time O(1) rather than O(log N), by subtracting off the old summand from the total and adding the new one. This eliminates the log |Σ| factor from the runtimes in §6.1. The resulting bound O(|E| + s|Σ||Q||T_max|) for Alg. 6 is still worse than §4.1's bound of O(|E| + |Σ||Q| min(1, s|π_max|)) for Alg. 4. However, the former becomes better when a compatible order is known and the |T_max| can be dropped.
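A constant-time aggregator of this kind can be sketched as follows for ordinary numbers under addition. The class name and the choice to store summands in a dict are assumptions of this example; with floating-point weights, repeated subtraction may also accumulate rounding error, which is why the example uses integers.

```python
class RingAggregator:
    """Aggregator for a ring: keeps the summands and their running total."""

    def __init__(self):
        self.values = {}      # symbol -> current summand (absent means 0)
        self.total = 0
        self.history = []     # stack of (symbol, old summand)

    def value(self):
        return self.total

    def set(self, a, v):
        old = self.values.get(a, 0)
        self.history.append((a, old))
        self.total += v - old          # subtract the old summand, add the new one
        self.values[a] = v

    def undo(self, n):
        """Revert the last n calls to set, most recent first."""
        for _ in range(n):
            a, old = self.history.pop()
            self.total += old - self.values[a]
            self.values[a] = old

gamma = RingAggregator()
gamma.set("a", 3)
gamma.set("b", 5)
total_two = gamma.value()
gamma.set("b", 1)
total_replaced = gamma.value()
gamma.undo(1)
total_restored = gamma.value()
```

Each `set` and each undone update costs O(1), in contrast to the O(log N) update of a tree-based aggregator.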
It is more instructive to compare the tighter bounds of O(|E| + ∑_{q′∈Q} |Σ(q′)| ancs(q′)) for Alg. 6 and O(|E| + ∑_{q∈Q} |⋃_{q′⪯q} Σ(q′)|) for Alg. 4. If a compatible order is known, ancs(q′) can be dropped and the former is better. If not, then either runtime could be better. The former effectively charges each state q for all of the out-symbols at all of its ϕ-descendants q′, while the latter charges q for all of the distinct out-symbols at its ϕ-ancestors q′. The reason for the difference: both algorithms override a descendant's a-arc with an ancestor's a-arc, but to find these descendant-ancestor pairs, Alg. 6 loops over a-arcs at the descendant (pushing subtrahends up from below) while Alg. 4 loops over a-arcs at the ancestor (pulling subtrahends up from above). When different descendant-ancestor paths overlap, the former algorithm shares work between them if the Visit order is good, while the latter shares work between them via memoization.

Table 1: Runtime of computing the failure term by the different algorithms (columns: Alg., O-Cost, Use case). The "use case" column indicates when an algorithm has better complexity than the baseline algorithm, Alg. 3. C_U is the update complexity of the aggregator interface: log |Σ| in the general case (via a Fenwick tree) and 1 in the ring case of §6.4 (via subtraction). Alg. 6⁺ is the runtime for WFSA-ϕ's such as VoCRFs where a compatible state order is known, whereas Alg. 6⁻ is the general worst-case runtime.

Comparison of Algorithms
This work proposed multiple algorithms for computing the pathsum of an acyclic WFSA-ϕ. They are all alternatives to running the backward algorithm (Alg. 1), or its simple improvement by aggregation (Alg. 3), after explicitly expanding failure transitions (Alg. 2). This section summarizes the improvements.
As mentioned in §3.2, we never change the way the local component of the backward values is computed. All algorithms we consider therefore retain the O(|E|) complexity term from expanding the non-ϕ transitions. What differs is the method for computing the failure term β(q, Σ ∖ Σ(q)), i.e., the contribution of the paths starting at q that take q's failure transition. Table 1 compares this term's runtime complexity for all the algorithms discussed.
To keep this in perspective, the benefits of our more sophisticated pathsum algorithms over the basic Alg. 3 only have an actual impact if Alg. 3's failure component complexity O((s̄ − s)|Σ||Q|) dominates the local component O(|E|), where |E| ≥ s|Σ||Q|. In particular, reducing the failure component is only helpful if s̄ ≫ s, so that expanding failure transitions would make the graph denser.

Conclusion
We presented two new algorithms for more efficiently computing the backward values and pathsum of a sparse acyclic semiring-weighted FSA with ϕ-transitions, using the observation that a ϕ-transition from q to q_ϕ means that β(q) is a sparsely modified version of β(q_ϕ). We characterized when the new algorithms are asymptotically faster.

Limitations
This section addresses two main limitations of our work: the assumptions made on the structure of the WFSAs and the applicability of the proposed algorithms in real scenarios.
Acyclicity assumption. We only consider acyclic WFSAs. While this covers interesting use cases such as CRFs, other commonly used instances of WFSAs also contain cycles, e.g., n-gram language models. Furthermore, all our novel algorithms actually assume that E ∪ E_ϕ is acyclic, whereas failure expansion only requires that the resulting Ē is acyclic. The former is a strictly stronger condition; see Fig. 5 below for an example WFSA-ϕ where E ∪ E_ϕ is not acyclic, but Ē is.
Applicability. As seen above, the runtime of Alg. 6 depends on the size of failure trees, with complexity O(|E| + s|Σ||Q||T_max| log |Σ|). In practice, failure trees may be large, or s may be large, which could result in our algorithms performing worse than the naïve approaches. To see this, consider higher-order CRFs with backoff, a useful formalism for sequence tagging in NLP (Vieira et al., 2016), which can be encoded as WFSAs. They were the initial motivation for our proposed algorithms. Although these backoff CRFs do admit a compatible topological order that allows us to avoid the |T_max| factor (§6.2), we inspect them as an example of how large |T_max| can be.
An order-n CRF tagging a sequence of length ℓ can be represented with a WFSA-ϕ in the form of a lattice of ℓ layers. The layers include tag sequences of length ≤ n, meaning that, given a set of tags Σ, each layer contains states representing histories h ∈ {ϵ} ∪ Σ ∪ ⋯ ∪ Σⁿ. This results in O(|Σ|ⁿ) states per layer. Backoff transitions in such models encode transitions to lower-order histories (transitioning from a history of length k to one of length k − 1) whenever a transition to a history of the same order is not possible. It is easy to see that each history of order k could have up to |Σ| incoming ϕ-transitions, connecting it to a large failure tree, which is exponential in size w.r.t. n.

Ethics Statement
We are not aware of any specific social risks created or exacerbated by this work.

B Number of Transitions Added by Failure Expansion
We show §3.1's claim that the number of transitions added by failure expansion (Alg. 2) is (s̄ − s)|Σ||Q| when the input WFSA-ϕ is deterministic.
In the deterministic case, each out-symbol at a state labels exactly one outgoing transition.Hence the number of added transitions for a given state q equals the number of added out-symbols, |Σ(q) ∖ Σ(q)| = |Σ(q)| − |Σ(q)|, where Σ(q) ⊇ Σ(q).Summing over all q ∈ Q, and using definition 9, we get a total number of added transitions of In the general case where the input WFSA-ϕ may be non-deterministic, each added out-symbol may label anywhere from 1 to |Q| added transitions.Thus the total number of added transitions is between (s − s)|Σ||Q| and (s − s)|Σ||Q| 2 .
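The count for the deterministic case can be checked on a toy automaton. The following is a minimal sketch, not the paper's Alg. 2: the function name `count_added_transitions` and the dictionary encoding of out-symbols and fallback states are assumptions made for illustration. It computes each state's out-symbol set after expansion by inheriting from the fallback state, then sums the number of newly added symbols.

```python
def count_added_transitions(out_symbols, fallback):
    """Count out-symbols added by failure expansion in a deterministic WFSA-phi.

    out_symbols: dict mapping each state to the set of symbols on its
                 ordinary (non-failure) outgoing transitions.
    fallback:    dict mapping each state to its fallback state, or None.
    (Both encodings are hypothetical, chosen only for this sketch.)
    """
    expanded = {}

    def expand(q):
        # Out-symbols after expansion: own symbols plus those inherited
        # through the chain of failure transitions.
        if q not in expanded:
            target = fallback.get(q)
            inherited = expand(target) if target is not None else set()
            expanded[q] = out_symbols[q] | inherited
        return expanded[q]

    return sum(len(expand(q)) - len(out_symbols[q]) for q in out_symbols)


# Toy example over the alphabet {a, b, c} with three states.
out_symbols = {0: {"a"}, 1: {"b"}, 2: {"a", "b", "c"}}
fallback = {0: 1, 1: 2, 2: None}
added = count_added_transitions(out_symbols, fallback)
```

Here the average out-symbol fraction is 5/9 before expansion and 1 after it, so the formula predicts (1 − 5/9) · 3 · 3 = 4 added transitions, which matches the direct count.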

C Aggregator Implementation
A Fenwick tree (Fenwick, 1994) is a data structure that stores a sequence $v_1, \ldots, v_N$ and can efficiently return any prefix sum of the form $\bigoplus_{n=1}^{N'} v_n$ for $N' \in [0, N]$, as well as allowing the individual elements $v_n$ to be updated. Each prefix-sum query or element update takes $O(\log N)$ time.
Our aggregator interface in §5 is simpler. It only queries the full sum $\bigoplus_{n=1}^{N} v_n$ (the case $N' = N$). Thus, the order of the elements is not considered by this interface. §6.4 noted that in the special case where subtraction is available (and numerically stable), an aggregator can be implemented even more efficiently without a Fenwick tree, since then it is easy to update the sum in constant time when updating any element. However, subtraction is not guaranteed to be available for arbitrary ⊕ operations (e.g., ⊕ = max).
A Fenwick tree stores the elements $v_n$ at the leaves of a balanced binary tree. Each internal (non-leaf) node stores the ⊕-sum of the values stored at its children. As a result, thanks to the associativity of ⊕, the root of the tree contains the full sum $\bigoplus_{n=1}^{N} v_n$, which can be looked up in $O(1)$ time. An example of a Fenwick tree (in the real semiring) is presented in Fig. 4. Note that we draw the root of a Fenwick tree at the top and consider it to be the ancestor of all other nodes, whereas failure trees had the root as the descendant of all other states. Initial creation of the Fenwick tree takes only $O(N)$ total time by visiting all nodes in bottom-up order and setting each non-leaf node to the ⊕-sum of its children. When a leaf $v_n$ is updated, just its ancestors are recomputed, again in bottom-up order. As there are about $\log N$ ancestors, this update takes $O(\log N)$ total time.
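The tree just described can be sketched in a few lines. This is an illustrative sketch under assumptions that are not in the paper: the class name `TreeAggregator`, 0-based leaf indices, and an array layout in which node `j` has children `2j` and `2j + 1`, padded to a power of two with the ⊕-identity. The operation ⊕ is passed in as a function so that non-invertible choices such as max also work.

```python
class TreeAggregator:
    """Balanced binary tree whose internal nodes hold the ⊕-sum of their children."""

    def __init__(self, values, plus, zero):
        self.plus = plus
        self.size = 1
        while self.size < max(1, len(values)):
            self.size *= 2
        # Leaves occupy positions size .. 2*size - 1; unused leaves hold `zero`.
        self.node = [zero] * (2 * self.size)
        self.node[self.size:self.size + len(values)] = values
        # Bottom-up construction in O(N): each internal node sums its children.
        for j in range(self.size - 1, 0, -1):
            self.node[j] = plus(self.node[2 * j], self.node[2 * j + 1])

    def set(self, n, v):
        """Replace leaf n (0-based) and recompute its O(log N) ancestors."""
        j = self.size + n
        self.node[j] = v
        j //= 2
        while j >= 1:
            self.node[j] = self.plus(self.node[2 * j], self.node[2 * j + 1])
            j //= 2

    def value(self):
        """Full sum over all leaves, read off the root in O(1)."""
        return self.node[1]


# Example with ⊕ = max, where no subtraction is available.
agg = TreeAggregator([3, 1, 4, 1, 5], plus=max, zero=float("-inf"))
```

Overwriting the largest element with a smaller one changes the root only through the recomputed ancestors, which is exactly the case that a running total with subtraction cannot handle.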
Our aggregator is a Fenwick tree that stores $N = |\Sigma|$ elements, where $v_n$ is the value associated with the $n$th element of $\Sigma$. (That is, we identify the possible keys $a \in \Sigma$ with the integers $[1, N]$.) Initially, $v_n = 0$, but may be changed by set. Each call to set takes $O(\log |\Sigma|)$ time; this factor appears in our runtime analysis. To achieve our runtime bounds for sparse WFSAs, we must take care not to spend $O(|\Sigma|)$ time initializing all of the leaves and internal nodes to 0 every time we create an aggregator. Array initialization overhead can always be avoided, using a method from computer science folklore (Aho et al., 1974, exercise 2.12).
Alternatively, we can store values in the Fenwick tree only for those keys for which values have been set. Under this design, the operation set(a: Σ, v: K) must update $v_n \leftarrow v$ where $n$ is the integer index associated with key $a$. To find $n$, the aggregator maintains a hash table that maps keys to consecutive integers. We assume $O(1)$-time hash operations. The first key that is set is mapped to 1, the second is mapped to 2, etc. When a key $a$ is set for the first time (that is, when it is not found in the hash table), $N$ is incremented, the mapping $a \mapsto N$ is added to the hash table, and $v_N = v$ is appended to the Fenwick sequence. The hash table is also consulted by the get operation.
In our application, for an aggregator that represents state $q$, the keys that have been set are $\Sigma(q)$. The design in the previous paragraph therefore reduces $N$ from $|\Sigma|$ to $N = |\Sigma(q)|$. As a result, the factor $O(\log |\Sigma|)$ in our analysis could actually be reduced to $O(\log \max_{q \in Q} |\Sigma(q)|)$.
Note that to obtain this runtime reduction, the undo method must properly undo the changes not only to the Fenwick tree but to the integerizing hash table (see footnote 7). If a call to set in Visit incremented $N$ and added $a \mapsto N$, then the call to undo in Leave must remove $a \mapsto N$ and decrement $N$, thereby keeping $N$ small as desired.
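The interaction between the integerizing hash table and undo can be illustrated in isolation. The sketch below covers only the key-to-index bookkeeping, not the Fenwick tree itself; the class name `KeyIndexer`, the method name `lookup_or_add`, and the use of a log of past calls are assumptions for illustration. Because undo reverses calls in last-in, first-out order, the indices in use always remain the consecutive integers 1 to N.

```python
class KeyIndexer:
    """Maps keys to consecutive integers 1, 2, ... and supports LIFO undo."""

    def __init__(self):
        self.index = {}  # hash table: key -> integer index
        self.log = []    # one entry per call: the key if newly added, else None

    def lookup_or_add(self, key):
        """Return the index of `key`, assigning the next integer on first use."""
        if key in self.index:
            self.log.append(None)
        else:
            self.index[key] = len(self.index) + 1  # N is incremented
            self.log.append(key)
        return self.index[key]

    def undo(self):
        """Reverse the most recent lookup_or_add call."""
        key = self.log.pop()
        if key is not None:
            del self.index[key]  # removes key -> N, so N is decremented


indexer = KeyIndexer()
first = indexer.lookup_or_add("a")
second = indexer.lookup_or_add("b")
again = indexer.lookup_or_add("a")
```

A full aggregator would pair each log entry with the previous leaf value so that the tree contents can be restored in the same step.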

D Weighted ϕ-Transitions
Throughout the main paper, we assumed that all ϕ-transitions have a weight of 1. This simplifying assumption is typically violated by backoff models (e.g., Allauzen et al., 2003). Fortunately, it can be removed with relatively small changes to our equations, algorithms and data structures.
Most simply, a weighted failure transition $q \xrightarrow{\phi / w_\phi} q_\phi$ could be simulated by a path $q \xrightarrow{\phi / 1} q_\varepsilon \xrightarrow{\varepsilon / w_\phi} q_\phi$, where $q_\varepsilon$ is a newly introduced intermediate state with only an ε-transition. We would then have to eliminate the ε-transition as mentioned in §2. In this case, this simply means replacing $q_\varepsilon \xrightarrow{\varepsilon / w_\phi} q_\phi$ in $E$ with the transitions $\{q_\varepsilon \xrightarrow{a / w_\phi \otimes w} q' : q_\phi \xrightarrow{a / w} q' \in E\}$. However, this may be expensive when the original fallback state $q_\phi$ has many outgoing transitions, which is typical in a backoff setting. Copying all of those transitions to a parent as in Alg. 2 (failure expansion) is exactly what the new methods in this paper are designed to avoid. We therefore give direct modifications to our constructions.
Suppose the failure transition for state $q$ has weight $w_\phi$ (that is, $E$ contains $q \xrightarrow{\phi / w_\phi} q_\phi$), where perhaps $w_\phi \neq 1$. Then the second case of Eq. (9) should be modified to set $\beta(q, a) = w_\phi \otimes \beta(q_\phi, a)$ for any $a \notin \Sigma(q)$. Similarly, $w_\phi$ should be incorporated into Eq. (13) as a left multiplier on the terms inherited from $q_\phi$. Finally, the subtraction expression in the right-hand side of Eq. (14) must be left-multiplied by $w_\phi$. In Alg. 2, which constructs the failure-expanded edge set, the update at state $q$ likewise left-multiplies by $w_\phi$ the weight of every transition copied from $q_\phi$. Algs. 3 and 4 undergo straightforward modifications based on the modified Eqs. (9), (13) and (14). When $\beta(q_\phi, b)$ is copied backwards over a ϕ-transition $q \xrightarrow{\phi / w_\phi} q_\phi$, it must be left-multiplied by $w_\phi$ to yield $\beta(q, b)$. This affects Alg. 3 lines 8-9 and Alg. 4 line 12, as well as the purple terms in Alg. 4 line 9. These modifications do not affect the asymptotic runtime complexity.
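The modified second case of Eq. (9) can be illustrated with a small recursive sketch in the real semiring. Everything about the encoding is an assumption made for illustration: the function name `beta_symbol`, the dictionaries `arcs` and `fallback`, and the form of the first case (summing weight times the backward value of the target over the matching transitions), which is assumed here rather than quoted from the paper.

```python
def beta_symbol(q, a, arcs, fallback, beta_state):
    """Symbol-specific backward weight beta(q, a) with weighted failure arcs.

    arcs[q][a]    : list of (weight, target) pairs for symbol a at state q.
    fallback[q]   : (w_phi, q_phi) for the failure arc of q, or None.
    beta_state[t] : backward weight of target state t (assumed precomputed).
    """
    if a in arcs[q]:
        # First case (assumed form): a is an out-symbol of q.
        return sum(w * beta_state[t] for w, t in arcs[q][a])
    if fallback[q] is None:
        return 0.0  # no transition and no fallback: additive identity
    # Second case, modified: left-multiply the fallback's value by w_phi.
    w_phi, q_phi = fallback[q]
    return w_phi * beta_symbol(q_phi, a, arcs, fallback, beta_state)


# A two-level backoff fragment: "h" backs off to "u" with weight 0.5.
arcs = {
    "h": {"x": [(0.5, "end")]},
    "u": {"x": [(0.25, "end")], "y": [(0.25, "end")]},
    "end": {},
}
fallback = {"h": (0.5, "u"), "u": None, "end": None}
beta_state = {"end": 1.0}
```

With these numbers, the symbol `y` is absent at `h`, so its weight is obtained from `u` and scaled by the failure weight, whereas `x` is read directly at `h` and the failure arc is not used.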
Alg. 6 requires more modification. We must extend our aggregator class (§5) with a new method that left-multiplies all elements by a constant:¹⁷

    class Aggregator():    ▷ We use γ to refer to an aggregator instance
        ⋮
        def mult(m: K)    ▷ Left-multiply all elements by m

In Alg. 5, Visit(γ, q) should begin by calling mult($w_\phi$), where $w_\phi$ is the weight of the failure arc from $q$. Consequently, Leave(γ, q) should be modified to undo one more update than before.
How to implement the mult method efficiently?
With both subtraction and division. The subtraction-based aggregator (§6.4) can be modified to still support all operations in $O(1)$ time, provided that the ring $K$ is actually a division ring (noncommutative field), i.e., it supports division by non-0 multipliers. The aggregator maintains an overall multiplier $M$, initially 1, and the call mult(m) replaces $M$ with $m \otimes M$; thus, $M$ is the product of the $I$ multipliers applied so far, $m_I \otimes \cdots \otimes m_1$. As in App. C, we identify each key with an integer index $n$. If $a$ has index $n$, then set(a, v) stores $M^{-1} \otimes v$ into $v_n$.¹⁸ Later, get(a) can return $M \otimes v_n$; since $M$ has been updated in the meantime, this yields the originally set value $v$ left-multiplied by all subsequent multipliers. The aggregator also maintains the total $\bigoplus_{n=1}^{N} v_n$ as $v_n$ values are set or replaced (using subtraction), and the value method returns $M \otimes \bigoplus_{n=1}^{N} v_n$.

With subtraction only. If $K$ does not support division, then the subtraction-based aggregator can be rescued as follows. The aggregator maintains the number $I$ of multipliers applied so far, as well as their product $M$ as before. The function set(a, v) now stores $v$ into $v_n$ and $I$ into $i_n$, and later get(a) returns $M_{i_n} \otimes v_n$, where in general $M_i$ is defined to be the product of multipliers subsequent to $m_i$, that is,
$$M_i = m_I \otimes \cdots \otimes m_{i+1}.$$
The aggregator maintains the current total $S$ that should be returned by value; the mult(m) method left-multiplies this total by $m$, while the method set(a, v) modifies this total by adding $v \ominus \text{get}(a)$ before it updates $(v_n, i_n)$.¹⁹ The difficulty is now in obtaining the partial products $M_i$ without division. This can be done by maintaining $m_1, \ldots, m_I$ in a Fenwick tree.²⁰ This means that mult and get now take time $O(\log I)$ rather than $O(1)$. The effect on §6.4's ring-based version of Alg. 6 is to add $\sum_{q' \in Q} \text{ancs}(q') \log |\pi_{\max}| \le |Q||T_{\max}| \log |\pi_{\max}|$ to the asymptotic runtime expression. This is the same cost as if every state had $\log |\pi_{\max}|$ additional outgoing symbols. $|\pi_{\max}|$ is usually very small.

Without subtraction. For this case, we stored the summands in a Fenwick tree (App. C). Fortunately, it is possible to extend that data structure to support mult in time $O(N)$, where $N$ is the number of elements, without affecting the asymptotic runtime of set, value, or undo. The asymptotic runtimes of Algs. 5-6 will remain unchanged.
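Returning briefly to the first case above (both subtraction and division available), the lazy global multiplier can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the class name `ScaledSumAggregator` is invented, exact rationals from Python's `fractions` module stand in for the division ring, and because rationals commute the sketch does not exercise the left-versus-right distinction that matters in a noncommutative ring. The undo method is omitted.

```python
from fractions import Fraction


class ScaledSumAggregator:
    """O(1) set/get/value/mult using one global multiplier M (rationals only)."""

    def __init__(self):
        self.M = Fraction(1)      # product of all multipliers applied so far
        self.stored = {}          # key -> stored value, i.e. M^{-1} * v at set time
        self.total = Fraction(0)  # sum of the stored (unscaled) values

    def mult(self, m):
        if m == 0:
            raise ValueError("division-based scheme needs non-zero multipliers")
        self.M = m * self.M

    def set(self, key, v):
        new = Fraction(v) / self.M  # store M^{-1} * v
        # Update the running total by subtraction of the replaced value.
        self.total += new - self.stored.get(key, Fraction(0))
        self.stored[key] = new

    def get(self, key):
        return self.M * self.stored.get(key, Fraction(0))

    def value(self):
        return self.M * self.total


agg = ScaledSumAggregator()
agg.set("a", 3)
agg.set("b", 5)
agg.mult(2)      # elements are now 6 and 10
agg.set("a", 1)  # elements are now 1 and 10
```

Storing the value divided by the current multiplier is what lets a later call to mult affect every element without touching any of them.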
In our modified Fenwick tree, the $N$ leaves store unscaled values $u_1, \ldots, u_N \in K$. Each node $j$ (leaf or internal node) stores a multiplier $m_j$ that will be lazily applied to all of the leaves that are descendants of $j$. Thus, the scaled value $v_n$ is found as the product $m_r \otimes m_{j_1} \otimes m_{j_2} \otimes \cdots \otimes m_n \otimes u_n$, where $r, j_1, j_2, \ldots, n$ is the path from the root $r$ to the leaf $n$. Hence the leaves store the elements $v_n$ directly (as they would in an ordinary Fenwick tree) only in the special case where all the multipliers are 1. In general $v_n$ must be computed on demand. The runtime of get is now $O(\log N)$ rather than $O(1)$, but Algs. 5-6 never actually use the get method.

¹⁹ Although get is used internally within this implementation of set, that does not contradict the comment later in this section that get is never called directly by Algs. 5-6.

²⁰ A Fenwick tree supports suffix sums (or indeed, sums over any contiguous subsequence of elements) as efficiently as prefix sums. In our case, the elements in the Fenwick tree are multipliers, so the "sum" operation to be used is ⊗, not ⊕.
The new call mult(m) simply replaces $m_r \leftarrow m \otimes m_r$, which affects all $v_n$ in $O(1)$ total time.
To support fast computation of the total value $\bigoplus_{n=1}^{N} v_n$, we also store partial sums at the nodes, as before. Thus, each node $j$ stores a pair $(m_j, u_j)$. The scaled value of node $j$ is defined to be $m_j \otimes u_j$. When $j$ is an internal node, we ensure as an invariant that $u_j$ is the sum of the scaled values of $j$'s children, updating it whenever $j$'s children change. The value method simply returns the scaled value of the root in $O(1)$ time.
The interesting modification is to the set method. To set $v_n$ to $v$, leaf $n$ is modified to set $(m_n, u_n) \leftarrow (1, v)$. In addition, all of $n$'s ancestors $j$ must be modified to have multipliers $m_j = 1$, so that $v_n = 1 \otimes \cdots \otimes 1 \otimes v = v$ as desired. Before being set to 1, each old multiplier $m_j$ is "pushed down" to its children so that it still affects all leaves of $j$. The method descends from the root $r$ to leaf $n$: it pushes the $m_j$ values out of the way on the way down, updates the leaf at the bottom, and restores the invariant by recomputing the $u_j$ values on the way back up as it returns.

    def set(a: Σ, v: K):
        n ← integer index of key a
        set_desc(n, v, r)    ▷ r is the root of the Fenwick tree
    def set_desc(n: leaf, v: K, j: node):
        ▷ j is an ancestor of n; j's own proper ancestors have multiplier 1; so will j upon return
        if j is a leaf : (m_j, u_j) ← (1, v)    ▷ Since j = n
        else
            for k ∈ children(j) : m_k ← m_j ⊗ m_k    ▷ push m_j down to the children
            m_j ← 1    ▷ m_j has been pushed down
            set_desc(n, v, child of j that is anc. of n)
            u_j ← ⊕_{k ∈ children(j)} m_k ⊗ u_k    ▷ restore the invariant at j

The else clause in set_desc can be rephrased (less readably) to avoid looping twice over children(j):

            k ← the child of j that is an ancestor of n
            m_k ← m_j ⊗ m_k
            set_desc(n, v, k)
            u_j ← m_k ⊗ u_k
            for k′ ∈ siblings(k) :    ▷ in a binary tree, there will be ≤ 1
                m_k′ ← m_j ⊗ m_k′ ;  u_j ← u_j ⊕ (m_k′ ⊗ u_k′)
            m_j ← 1    ▷ m_j has been pushed down and invariant restored at j
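The lazily scaled tree can be sketched in executable form. The following is an illustration under stated assumptions, not the authors' code: the class name `LazyScaledTree`, 0-based leaf indices, and an array layout in which node `j` has children `2j` and `2j + 1` are all choices made here. The semiring operations are passed in as functions, and undo is omitted.

```python
class LazyScaledTree:
    """Each node j stores (m[j], u[j]); its scaled value is m[j] ⊗ u[j]."""

    def __init__(self, n_leaves, plus, times, zero, one):
        self.plus, self.times, self.one = plus, times, one
        self.size = 1
        while self.size < n_leaves:
            self.size *= 2
        self.m = [one] * (2 * self.size)   # lazy multipliers
        self.u = [zero] * (2 * self.size)  # unscaled values / partial sums

    def _scaled(self, j):
        return self.times(self.m[j], self.u[j])

    def mult(self, c):
        """Left-multiply every element by c in O(1): only the root changes."""
        self.m[1] = self.times(c, self.m[1])

    def value(self):
        return self._scaled(1)

    def set(self, n, v):
        self._set_desc(self.size + n, v, 1)

    def _set_desc(self, leaf, v, j):
        if j >= self.size:  # j is a leaf, hence j == leaf
            self.m[j], self.u[j] = self.one, v
            return
        for k in (2 * j, 2 * j + 1):  # push m[j] down to both children
            self.m[k] = self.times(self.m[j], self.m[k])
        self.m[j] = self.one
        # The child of j on the path to `leaf`, found by shifting the leaf index.
        child = leaf >> (leaf.bit_length() - j.bit_length() - 1)
        self._set_desc(leaf, v, child)
        # Restore the invariant: u[j] is the sum of the children's scaled values.
        self.u[j] = self.plus(self._scaled(2 * j), self._scaled(2 * j + 1))

    def get(self, n):
        """Scaled value of leaf n: multipliers applied from the leaf up to the root."""
        j = self.size + n
        v = self.u[j]
        while j >= 1:
            v = self.times(self.m[j], v)
            j //= 2
        return v


tree = LazyScaledTree(4, plus=lambda x, y: x + y, times=lambda x, y: x * y, zero=0, one=1)
tree.set(0, 2)
tree.set(1, 3)
tree.mult(10)   # elements are now 20, 30, 0, 0
tree.set(1, 1)  # elements are now 20, 1, 0, 0
```

The call to set after mult shows the push-down at work: the multiplier 10 moves off the path to leaf 1 and onto its siblings, so leaf 0 keeps its scaled value while leaf 1 receives exactly the newly set value.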

E Tree Splitting Details
Alg. 6 is applicable to any acyclic semiring-weighted WFSA-ϕ. However, updating an Aggregator as it travels within a failure tree incurs an additional worst-case multiplicative runtime factor of $|T_{\max}|$, the size of the biggest failure tree. This section outlines an improvement that lessens this impact by splitting large failure trees into multiple smaller ones.
Alg. 6 destructively updates an aggregator γ when Visiting a state q from q ϕ .This takes time O(|Σ(q)| log |Σ|).In contrast, Alg. 3 can be thought of as non-destructively copying γ to q from q ϕ , which means the work can be saved and does not have to be redone if q is re-Visited later.
This inspires us to hybridize Alg. 6 as follows: Visit(γ, q) in Alg. 5 may optionally copy-and-update γ rather than just updating it. Copying effectively cuts the transition $q \xrightarrow{\phi / w} q_\phi$, making the sub-tree rooted at $q$ a new independent failure tree with its own Aggregator instance. Copy-and-update does incur a one-time cost of $O(|\Sigma|)$,²¹ but now Alg. 6 line 3 will select a smaller failure tree.
However, at what states (if any) should we split each failure tree? The optimal set of splits depends on the topological order (§6.2) used by Alg. 6.

Dynamic splitting heuristics. A simple greedy heuristic would be to split at $q$ upon any call Visit(q) where copy-and-update is estimated to be cheaper than destructive updating, based on the current size of the aggregator and the number of required updates $|\Sigma(q)|$. However, this does not consider the future benefit of having smaller failure trees, and it does not adapt to the topological order.
A more sophisticated dynamic heuristic is for Visit(q) to split at $q$ if not doing so would cause the total time spent so far on all Visit(q) calls²² to exceed the time that it would take to copy-and-update at $q$. (Put another way, it does so if it now realizes in retrospect that it would have been better for the very first Visit(q) call to have invested in copy-and-update.) This ensures that our enhanced Alg. 6 will take at most twice as long as Alg. 3, which always does copy-and-update. It eventually splits any state that is Visited often enough by the chosen topological order, especially if that state is expensive to Visit. On the other hand, if the chosen topological order is compatible so that every state is Visited only once, it will still achieve or outperform the best-case behavior of the original Alg. 6.

²¹ When we copy-and-update, each modification takes only $O(1)$ time, not $O(\log |\Sigma|)$. The strategy is to copy all $O(|\Sigma|)$ elements of γ's Fenwick tree, update some of them, and only then build a new Fenwick tree from the updated elements, which takes only $O(|\Sigma|)$ time: see App. C.

²² Remark: For a given $q$, all such calls do exactly the same work and should take the same amount of time, regardless of other splits.

Static splitting algorithms
We may also consider static methods, which do not adapt to the topological order that is actually used, but optimize to mitigate the worst case. In the runtime analysis of §6.1, the failure tree $T$ contributes $O(f(T) \log |\Sigma|)$ to the failure term in the worst-case runtime of Alg. 6,²³ where $f(T) \overset{\text{def}}{=} \sum_{q \in T} |\Sigma(q)| \, \text{ancs}(q)$.²⁴ We may seek a split that is optimal with respect to this runtime bound. Let $q_1$ be the root of $T$. Suppose we choose to copy-and-update the aggregator when we first visit each of $q_2, \ldots, q_K \in T$, essentially cutting off each state $q_k$ from its fallback state $q_{k,\phi}$. (Here all of the $q_k$ are to be distinct.) This splits $T$ into trees $T_1, \ldots, T_K$, where each $T_k$ is rooted at $q_k$. Then the contribution of these $K$ trees to the asymptotic runtime upper bound is proportional to
$$(K - 1)|\Sigma| + \sum_{k=1}^{K} f(T_k) \log |\Sigma|,$$
where the first term covers the cost of the $K - 1$ copy-and-update operations, and where the factor ancs(q) in the definition of $f(T_k)$ considers only the ancestors of $q$ within $T_k$. Our goal is to choose $K \ge 1$ and $q_2, \ldots, q_K$ to minimize this expression.
We first remark that requiring $K \le 2$ makes it easy to solve the problem in $O(|T|)$ time, assuming that we already know $|\Sigma(q)|$ for each $q \in T$. Define $D_q = \sum_{q' \succ q} |\Sigma(q')|$, the total number of out-symbols at proper descendants of $q$. The improvement $f(T) - (f(T_1) + f(T_2))$ from splitting $T$ at $q_2$ is simply $D_{q_2} \, \text{ancs}(q_2)$. Intuitively, there are $D_{q_2}$ out-symbols that can no longer be encountered when Visit⁺ is called on any of the ancs($q_2$) states in $T_2$.²⁵ This yields an improvement of $-|\Sigma| + D_{q_2} \, \text{ancs}(q_2) \log |\Sigma|$ in the runtime bound. A simple recursion from the root $q_1$ is enough to find $D_q$ and ancs(q) at every state $q$, and thus find the state $q \neq q_1$ that achieves the best improvement in the runtime bound when chosen as $q_2$. If no choice achieves a positive improvement, then we do not split the tree and leave $K = 1$.
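The single-split search can be sketched as follows. This is an illustration under assumptions not fixed by the text: the function names, the encoding of the failure tree as a `fallback` dictionary (with `None` at the root) together with a dictionary `n_out` of out-symbol counts, the choice of base-2 logarithms (the bound is asymptotic, so the base is a free constant), and an iterative traversal in place of the recursion mentioned above.

```python
import math


def tree_stats(fallback, n_out):
    """Return (root, ancs, D) for a failure tree given as state -> fallback state."""
    parents = {q: [] for q in fallback}  # states whose failure arc points to q
    root = None
    for q, f in fallback.items():
        if f is None:
            root = q
        else:
            parents[f].append(q)
    order = [root]
    for q in order:  # the list grows while we iterate: root-first order
        order.extend(parents[q])
    D = {root: 0}
    for q in order[1:]:
        f = fallback[q]
        D[q] = D[f] + n_out[f]  # out-symbols at proper descendants of q
    ancs = {}
    for q in reversed(order):  # leaves first
        ancs[q] = 1 + sum(ancs[p] for p in parents[q])  # q counts as its own ancestor
    return root, ancs, D


def best_single_split(fallback, n_out, alphabet_size):
    """Best state to split at (or None) and the resulting improvement in the bound."""
    root, ancs, D = tree_stats(fallback, n_out)
    log_sigma = math.log2(alphabet_size)
    best_q, best_gain = None, 0.0
    for q in fallback:
        if q == root:
            continue
        gain = -alphabet_size + D[q] * ancs[q] * log_sigma
        if gain > best_gain:
            best_q, best_gain = q, gain
    return best_q, best_gain


# Root "r" with 4 out-symbols; "a" backs off to "r"; "b" and "c" back off to "a".
fallback = {"r": None, "a": "r", "b": "a", "c": "a"}
n_out = {"r": 4, "a": 2, "b": 1, "c": 1}
choice = best_single_split(fallback, n_out, alphabet_size=4)
```

In this toy tree, splitting at `a` removes the root's 4 out-symbols from the 3 states above it, giving an improvement of −4 + 4 · 3 · 2 = 20, which exceeds the improvement of 8 available at either leaf.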
We now present an exact algorithm for the full problem, with no bound on $K$. Roughly speaking, after we split at a state $q'$ (making it the root of its own failure tree), we will also consider splitting again at its ancestors $q$, but we do not make these decisions greedily: we use dynamic programming. The main observation is that if $q$ is currently in a failure tree with root $q' \succ q$ (where either $q' = q_1$ or we previously split at $q'$), then splitting at $q$ will give a further improvement of $-|\Sigma| + (D_q - D_{q'}) \, \text{ancs}(q) \log |\Sigma|$. Denote this quantity by $\Delta_{q|q'}$. We now wish to find the set of states $S = \{q_2, \ldots, q_K\} \subseteq T \setminus \{q_1\}$ that maximizes $\sum_{k=2}^{K} \Delta_{q_k | q'_k}$, where $q'_k$ is the highest state in $\{q_1, \ldots, q_K\}$ that is a proper descendant of $q_k$ (that is, $q'_k \succ q_k$). This sum is the total improvement obtained by splitting at all of $\{q_2, \ldots, q_K\}$, since it is the total that would be obtained by splitting them successively in any reverse topological order.
For each state $q \in T$ and each $q' \succ q$, define
$$\bar{\Delta}_{q|q'} = \max\left(\Delta^{+}_{q|q'}, \; \Delta^{-}_{q|q'}\right) \qquad (15)$$
$$\Delta^{+}_{q|q'} = \Big(\sum_{p} \bar{\Delta}_{p|q}\Big) + \Delta_{q|q'} \qquad (16)$$
$$\Delta^{-}_{q|q'} = \Big(\sum_{p} \bar{\Delta}_{p|q'}\Big) + 0 \qquad (17)$$
where $p$ in the summations ranges over the parents of $q$ (if any) in the failure tree. Here $\bar{\Delta}_{q|q'} \ge 0$ is the maximum total improvement that can be obtained by splitting a failure tree rooted at $q' \succ q$ at any set of states $\preceq q$; $\Delta^{+}_{q|q'}$ is the maximum if this set includes $q$, and $\Delta^{-}_{q|q'}$ is the maximum if this set does not include $q$.²⁶ The optimal split of $T$ then has total improvement $\sum_{p} \bar{\Delta}_{p|q_1}$, where $q_1$ is the root of $T$ and $p$ ranges over its parents.

Tracing back through the derivation of this optimal improvement, one may determine which states were split to obtain it. This is similar to following backpointers in the Viterbi algorithm. Concretely, define
$$\bar{S}_{q|q'} = \begin{cases} S^{+}_{q|q'} & \text{if } \Delta^{+}_{q|q'} > \Delta^{-}_{q|q'} \\ S^{-}_{q|q'} & \text{otherwise} \end{cases} \qquad (18)$$
$$S^{+}_{q|q'} = \Big(\bigcup_{p} \bar{S}_{p|q}\Big) \cup \{q\} \qquad (19)$$
$$S^{-}_{q|q'} = \bigcup_{p} \bar{S}_{p|q'} \qquad (20)$$
For example, $\bar{S}_{q|q'}$ is the optimal set of states $\preceq q$ to split in a failure tree rooted at $q' \succ q$. The optimal set of split points in $T$, not counting the original root $q_1$, is $S = \bigcup_{p} \bar{S}_{p|q_1}$ where, again, $p$ ranges over the parents of $q_1$. Any of the unions written here can be enumerated by (recursively) enumerating the disjoint sets that are unioned together, without any copying to materialize the sets.

²⁶ A similar split into two cases (include $q$ or exclude $q$) is used in the well-known linear-time dynamic programming algorithm for finding the max-weighted independent set of vertices in a tree. In effect, these algorithms label each tree node with a bit saying whether or not to include it, subject to some constraints. They resemble methods for refining the nonterminal labels of a parse tree (Petrov and Klein, 2007).

Concretely, we first work from the leaves down to the root: at each $q$, we compute and memoize all of the $\Delta$ quantities (quantities (15)-(17) for all $q' \succ q$), after first having done so at the parents of $q$. We then enumerate $S$ using the definitions (18)-(20), which recurse from the root back up to the leaves. Thanks to the choice at Eq. (18) based on the $\Delta$ quantities, this recursion enumerates only sets that are actually subsets of $S$. In particular, it enumerates $\bar{S}_{q|q'}$ for just those $q' \succ q$ pairs such that $q'$ is the highest proper descendant of $q$ in the optimal set $\{q_1\} \cup S$.
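The dynamic program and its traceback can be sketched as below. As before, this is a hedged illustration: the function name `optimal_splits`, the `fallback` and `n_out` dictionaries, base-2 logarithms, and the use of memoized recursion (whose depth equals the height of the failure tree) are assumptions of this sketch. Ties between including and excluding a state are broken in favor of not splitting.

```python
import math
from functools import lru_cache


def optimal_splits(fallback, n_out, alphabet_size):
    """Return (total improvement, list of split states) for one failure tree."""
    parents = {q: [] for q in fallback}  # states whose failure arc points to q
    root = None
    for q, f in fallback.items():
        if f is None:
            root = q
        else:
            parents[f].append(q)

    D, ancs = {}, {}

    def fill(q, d):
        # D[q]: out-symbols at proper descendants; ancs[q]: ancestors incl. q.
        D[q] = d
        ancs[q] = 1 + sum(fill(p, d + n_out[q]) for p in parents[q])
        return ancs[q]

    fill(root, 0)
    log_sigma = math.log2(alphabet_size)

    def delta(q, qp):
        # Improvement from splitting at q when its current tree is rooted at qp.
        return -alphabet_size + (D[q] - D[qp]) * ancs[q] * log_sigma

    @lru_cache(maxsize=None)
    def best(q, qp):
        include = sum(best(p, q)[0] for p in parents[q]) + delta(q, qp)
        exclude = sum(best(p, qp)[0] for p in parents[q])
        return (include, True) if include > exclude else (exclude, False)

    def collect(q, qp, out):
        # Traceback: follow the include/exclude decisions from the root upward.
        _, split_here = best(q, qp)
        if split_here:
            out.append(q)
        for p in parents[q]:
            collect(p, q if split_here else qp, out)

    total, splits = 0.0, []
    for p in parents[root]:
        total += best(p, root)[0]
        collect(p, root, splits)
    return total, splits


# A chain c -> b -> a -> r of failure arcs, each state with 4 out-symbols.
chain = {"r": None, "a": "r", "b": "a", "c": "b"}
chain_out = {"r": 4, "a": 4, "b": 4, "c": 4}
result = optimal_splits(chain, chain_out, alphabet_size=4)
```

On this chain the bound before splitting is 80 and the bound after splitting at every non-root state is 44, so the improvement of 36 found by the recursion agrees with a direct evaluation of the cost expression.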
The total runtime is dominated by (15)-(17) and is proportional to the number of $q' \succ q$ pairs in $T$. Summed over all trees $T$, this is just the total height of all states in all failure trees, or equivalently $\sum_{q' \in Q} (\text{ancs}(q') - 1)$. This resembles the failure term in the worst-case runtime of Alg. 6, but is much faster since it eliminates all factors that depend on $|\Sigma|$. Thus, when a compatible order is not known (§6.2), taking the time to optimally split the failure trees may be worth the investment.
Runtime analysis after static splitting. To get a sense of how this improves the worst-case runtime, consider an idealized WFSA-ϕ where every state $q$ has the same number of out-symbols, $|\Sigma(q)| = s|\Sigma|$. Furthermore, relax the runtime bound by replacing ancs(q) in the definition of $f(T_k)$ with its upper bound $|T_k|$, and suppose that $T$ is split into $K$ trees of equal size $|T|/K$. The contribution of the $K$ trees, $(K - 1)|\Sigma| + \sum_{k=1}^{K} f(T_k) \log |\Sigma|$, is then at most
$$(K - 1)|\Sigma| + \frac{s|\Sigma||T|^2 \log |\Sigma|}{K}. \qquad (21)$$
Setting the derivative with respect to $K$ to zero, we find that the optimal $K = |T| \sqrt{s \log |\Sigma|}$. However, for a WFSA with sufficiently dense out-symbols, namely one with $s > \frac{1}{\log |\Sigma|}$, this asks to take $K > |T|$, which is impossible. There the method will have to settle for $K = |T|$, splitting each state into its own failure tree. This makes Alg. 6 reduce to Alg. 3.
Conversely, for a WFSA with sufficiently sparse out-symbols, namely one with $s < \frac{1}{|T_{\max}|^2 \log |\Sigma|}$, the above formula asks to take $K < 1$ for all failure trees. That is also impossible: the method will have to settle for $K = 1$, not splitting $T$ at all. This is the original version of Alg. 6.
In between these two extremes, we can take $K \approx |T| \sqrt{s \log |\Sigma|}$ as proposed above. This makes the bound (21) on the contribution of failure tree $T$ to the runtime become $O(|\Sigma||T| \sqrt{s \log |\Sigma|})$. Note that the square-root term is $< 1$ because we are not too dense, so this may beat Alg. 3. It also beats the original Alg. 6: if we did not split the tree but kept $K = 1$, the expression would give $O(|\Sigma||T|^2 s \log |\Sigma|)$. In short, splitting the tree avoids the quadratic worst-case cost of Alg. 6. To put it another way, by eliminating the worst-case interaction among the $K$ trees, we have reduced from $O(|\Sigma|K^2)$ to $O(|\Sigma|K)$. Recall that $K \ge 1$ since we are not too sparse, so this is again an improvement.

Definition 4. The inner path weight is defined as $w_I(\pi)$ …, the (full) path weight as …

Figure: Example of a failure tree. Its root is node 4. To expand failure transitions, the dashed transitions are added and the ϕ-transitions are removed.

Figure 2: A WFSA-ϕ fragment in which Alg. 6 would perform a large number of updates over the ϕ-transitions.

Figure: Failure-expanded version of the fragment from Fig. 3a.

Figure 5: Example of a WFSA-ϕ where $E \cup E_\phi$ is not acyclic, yet its failure-expanded transition set $\overline{E}$ is.

Tim Vieira, Ryan Cotterell, and Jason Eisner. 2016. Speed-accuracy tradeoffs in tagging with variable-order CRFs and structured sparsity. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.