Efficient Semiring-Weighted Earley Parsing

We present Earley’s (1970) context-free parsing algorithm as a deduction system, incorporating various known and new speed-ups. In particular, our presentation supports a known worst-case runtime improvement from Earley’s (1970) O(N³|G||R|), which is unworkable for the large grammars that arise in natural language processing, to O(N³|G|), which matches the complexity of CKY on a binarized version of the grammar G. Here N is the length of the sentence, |R| is the number of productions in G, and |G| is the total length of those productions. We also provide a version that achieves runtime of O(N³|M|) with |M| ≤ |G| when the grammar is represented compactly as a single finite-state automaton M (this is partly novel). We carefully treat the generalization to semiring-weighted deduction, preprocessing the grammar like Stolcke (1995) to eliminate the possibility of deduction cycles, and further generalize Stolcke’s method to compute the weights of sentence prefixes. We also provide implementation details for efficient execution, ensuring that on a preprocessed grammar, the semiring-weighted versions of our methods have the same asymptotic runtime and space requirements as the unweighted methods, including sub-cubic runtime on some grammars.


Introduction
Earley (1970) was a landmark paper in computer science. Its algorithm was the first to directly parse under an unrestricted context-free grammar in time O(N³), with N being the length of the input string. Furthermore, it is faster for certain grammars because it uses left context to filter its search at each position. It parses unambiguous grammars in O(N²) time and a class of "bounded-state" grammars, which includes all deterministic grammars, in O(N) time. Its artful combination of top-down (goal-driven) and bottom-up (data-driven) inference later inspired a general method for executing logic programs, "Earley deduction" (Pereira and Warren, 1983).
Earley's algorithm parses a sentence incrementally from left to right, optionally maintaining a packed parse forest over the sentence prefix that has been observed so far. This supports online sentence processing (incremental computation of syntactic features and semantic interpretations) and also reveals for each prefix the set of grammatical choices for the next word. It can be attractively extended to compute the probabilities of the possible next words (Jelinek and Lafferty, 1991; Stolcke, 1995). This is a standard way to compute autoregressive language model probabilities under a PCFG to support cognitive modeling (Hale, 2001) and speech recognition (Roark, 2001). Such probabilities could further be combined with those of a large autoregressive language model to form a product-of-experts model. Recent papers (as well as multiple GitHub projects) have made use of a restricted version of this, restricting generation from the language model to only extend the current prefix in ways that are grammatical under an unweighted CFG; then only grammatical text or code will be generated (Shin et al., inter alia).

It is somewhat tricky to implement Earley's algorithm so that it runs as fast as possible. Most importantly, the worst-case runtime should be linear in the size of the grammar, but this property was not achieved by Earley (1970) himself nor by textbook treatments of his algorithm (e.g., Jurafsky and Martin, 2009, §13.4). This is easy to overlook when the grammar is taken to be fixed, so that the grammar constant is absorbed into the O operator, as in the opening paragraph of this paper.
Yet reducing the grammar constant is critical in practice, since natural language grammars can be very large (Dunlop et al., 2010). For example, the Berkeley grammar (Petrov et al., 2006), a learned grammar for the Penn Treebank (PTB) (Marcus et al., 1993), contains over one million productions.
In this reference paper, we attempt to collect the key efficiency tricks and present them declaratively, in the form of a unified deduction system that can be executed with good asymptotic complexity. 3 We obtain further speedups by allowing the grammar to be presented in the form of a weighted finite-state automaton whose paths correspond to the productions, which allows similar productions to share structure and thus to share computation. Previous versions of this trick use a different automaton for each left-hand side nonterminal (Purdom and Brown, 1981;Kochut, 1983;Leermakers, 1989; Perlin, 1991, inter alia); we show how to use a single automaton, which allows further sharing among productions with different left-hand sides.
We carefully generalize our methods to handle semiring-weighted grammars, where the parser must compute the total weight of all trees that are consistent with an observed sentence (Goodman, 1999), or more generally, consistent with the prefix that has been observed so far. Our goal is to ensure that if the semiring operations run in constant time, then semiring-weighted parsing runs in the same time and space as unweighted parsing (up to a constant factor), for every grammar and sentence, including those where unweighted parsing is faster than the worst case. Eisner (2023) shows how to achieve this guarantee for any acyclic deduction system, so we produce such a system by preprocessing the grammar to eliminate cyclic derivations. 4 Intuitively, this means we do not have to sum over infinitely many derivations at runtime (as Goodman (1999) would). We also show how to compute prefix weights, which is surprisingly tricky and requires the semiring to be commutative. Our presentation of preprocessing and prefix weights generalizes and corrects that of Stolcke (1995), who relied on special properties of PCFGs.
Finally, we provide a reference implementation in Cython and empirically demonstrate the value of the speedups.

Footnote 3: There has been no previous unified, formal treatment that is written as a deduction system, to the best of our knowledge. That said, declarative formulations have been presented in other formats in the dissertations of Barthélemy (1993), de la Clergerie (1993), and Nederhof (1994a).

Footnote 4: Our method to remove nullary productions may be a contribution of this paper, as we were unable to find a correct construction in the literature.

Weighted Context-Free Grammars
A context-free grammar (CFG) G is a tuple ⟨N, Σ, R, S⟩ where Σ is a finite set of terminal symbols, N is a finite set of nonterminal symbols with Σ ∩ N = ∅, R is a set of productions from a nonterminal to a sequence of terminals and nonterminals (i.e., R ⊆ N × (Σ ∪ N)*), and S ∈ N is the start symbol. We use lowercase variable names (a, b, . . . ) and uppercase ones (A, B, . . . ) for elements of Σ and N, respectively. We use a Greek letter (ρ, µ, or ν) to denote a sequence of terminals and nonterminals, i.e., an element of (Σ ∪ N)*. Therefore, a production has the form A → ρ. Note that ρ may be the empty sequence ε. We refer to |ρ| ≥ 0 as the arity of the production, |A → ρ| def= 1 + |ρ| as the size of the production, and |G| def= Σ_{(A→ρ)∈R} |A → ρ| as the total size of the CFG. Therefore, if K is the maximum arity of a production, |G| ≤ |R|(1 + K). Productions of arity 0, 1, and 2 are referred to as nullary, unary, and binary productions respectively.
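For concreteness, the definitions above might be represented as follows (an illustrative sketch; the class and variable names are ours, not the paper's):

```python
# Hypothetical representation of a CFG; names are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Production:
    lhs: str      # A, a nonterminal in N
    rhs: tuple    # rho, a sequence over Sigma ∪ N

    @property
    def arity(self):          # |rho|
        return len(self.rhs)

    @property
    def size(self):           # |A -> rho| = 1 + |rho|
        return 1 + len(self.rhs)

R = [
    Production("S", ("NP", "VP")),
    Production("NP", ("Det", "N")),
    Production("NP", ()),     # a nullary production, arity 0
]

G_size = sum(p.size for p in R)   # |G| = sum of production sizes
```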
For a given G, we write µ ⇒ ν to mean that µ ∈ (Σ ∪ N)* can be rewritten into ν ∈ (Σ ∪ N)* by a single production of G. For example, A B ⇒ ρ B expands A into ρ using the production A → ρ. We write *⇒ for the reflexive and transitive closure of this relation, so that µ *⇒ ν means µ can be rewritten into ν by a sequence of 0 or more productions. We refer to ρ as a prefix of ρ µ.
A derivation subtree of G is a finite rooted ordered tree T such that each node is labeled either with a terminal a ∈ Σ, in which case it must be a leaf, or with a nonterminal A ∈ N, in which case R must contain the production A → ρ where ρ is the sequence of labels on the node's 0 or more children. For any A ∈ N, we write T^A for the set of derivation subtrees whose roots have label A, and refer to the elements of T^S as derivation trees. Given a string x ∈ Σ* of length N, we write T^A_x for the set of derivation subtrees in T^A with leaf sequence x. For an input sentence x, its set of derivation trees T_x def= T^S_x is countable and possibly infinite. It is non-empty iff S *⇒ x, with each T ∈ T_x serving as a witness that S *⇒ x, i.e., that G can generate x. We will also consider weighted CFGs (WCFGs), in which each production A → ρ is additionally equipped with a weight w(A → ρ) ∈ W, where W is the set of values of a semiring S def= ⟨W, ⊕, ⊗, 0, 1⟩. Semirings are defined in App. A. We assume that ⊗ is commutative, deferring the trickier non-commutative case to App. K. Any derivation tree T of G can now be given a weight

  w(T) def= ⊗ w(A → ρ)   (1)

where A → ρ ranges over the productions associated with the nonterminal nodes of T. The goal of a weighted recognizer is to find the total weight of all derivation trees of a given input sentence x:

  Z_x def= ⊕_{T ∈ T_x} w(T)   (2)

An ordinary unweighted recognizer is the special case where W is the boolean semiring, so Z_x = true iff S *⇒ x iff T_x ≠ ∅. A parser returns at least one derivation tree from T_x iff T_x ≠ ∅.

[Table 1: the Earley and EarleyFast deduction systems. Domains: i, j, k ∈ {0, . . . , N}; A, B ∈ N ∪ {S′}; a ∈ Σ; ρ, µ, ν ∈ (N ∪ Σ)*.]
As an extension to the weighted recognition problem (2), one may wish to find the prefix weight of a string y ∈ Σ*, which is the total weight of all sentences x = y z ∈ Σ* having that prefix:

  Z̄_y def= ⊕_{z ∈ Σ*} Z_{y z}   (3)

§1 discussed applications of prefix probabilities, the special case of (3) for a probabilistic CFG (PCFG), in which the production weights are rewrite probabilities: W = R≥0 and (∀A ∈ N) Σ_{(A→ρ)∈R} w(A → ρ) = 1.
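The local-normalization condition for a PCFG can be checked directly; a minimal sketch with made-up production names:

```python
from collections import defaultdict

def is_pcfg(productions, weights, tol=1e-9):
    """Check the PCFG condition: for every nonterminal A, the weights
    of its productions A -> rho sum to 1."""
    totals = defaultdict(float)
    for (lhs, rhs), w in zip(productions, weights):
        totals[lhs] += w
    return all(abs(t - 1.0) < tol for t in totals.values())

# Illustrative grammar: NP has two productions whose weights sum to 1.
prods = [("S", ("NP", "VP")), ("NP", ("Det", "N")), ("NP", ("N",))]
```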

Parsing as Deduction
We will describe Earley's algorithm using a deduction system, a formalism that is often employed in the presentation of parsing algorithms (Pereira and Shieber, 1987; Sikkel, 1997), as well as in mathematical logic and programming language theory (Pierce, 2002). Much is known about how to execute (Goodman, 1999), transform (Eisner and Blatz, 2007), and neuralize (Mei et al., 2020) deduction systems.
A deduction system proves items V using deduction rules. Items represent propositions; the rules are used to prove all propositions that are true. A deduction rule is of the form

  U_1  U_2  . . .  U_k
  ──────────────────── EXAMPLE
           V

where EXAMPLE is the name of the rule, the 0 or more items U_1, U_2, . . . above the bar are called antecedents, and the single item V below the bar is called a consequent. Antecedents may also be written to the side of the bar; these are called side conditions and will be handled differently for weighted deduction in §6. Axioms (listed separately) are merely rules that have no antecedents; as a shorthand, we omit the bar in this case and simply write the consequent. A proof tree is a finite rooted ordered tree whose nodes are labeled with items, and where every node is licensed by the existence of a deduction rule whose consequent V matches the label of the node and whose antecedents U_1, U_2, . . . match the labels of the node's children. It follows that the leaves are labeled with axioms. A proof of item V is a proof tree d_V whose root is labeled with V: this shows how V can be deduced from its children, which can be deduced from their children, and so on until axioms are encountered at the leaves. We say V is provable if D_V, which denotes the set of all its proofs, is nonempty.
Our unweighted recognizer determines whether a certain goal item is provable by a certain set of deduction rules from axioms that encode G and x. The deduction system is set up so that this is the case iff S * ⇒ x. The recognizer can employ a forward chaining method (see e.g. Ceri et al., 1990; Eisner, 2023) that iteratively deduces items by applying deduction rules whenever possible to antecedent items that have already been proved; this will eventually deduce all provable items. An unweighted parser extends the recognizer with some extra bookkeeping that lets it return one or more actual proofs of the goal item if it is provable. 6
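The forward-chaining idea can be sketched generically (this is not specific to Earley's system; the function names and the toy rule set are our own):

```python
def forward_chain(axioms, rules):
    """Generic unweighted forward chaining: repeatedly pop an item from
    the agenda, record it as proved, and apply `rules` to deduce new
    items from it together with the items proved so far."""
    proved = set()
    agenda = list(axioms)
    while agenda:
        item = agenda.pop()
        if item in proved:
            continue
        proved.add(item)
        for new in rules(item, proved):
            if new not in proved:
                agenda.append(new)
    return proved

# Toy deduction system: transitive closure of a reachability relation,
# with edges as axioms and one rule composing adjacent pairs.
def reach_rules(item, proved):
    x, y = item
    return [(x, v) for (u, v) in proved if u == y] + \
           [(u, y) for (u, v) in proved if v == x]

closure = forward_chain({("a", "b"), ("b", "c")}, reach_rules)
```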

Earley's Algorithm
Earley's algorithm can be presented as the specific deduction system Earley shown in Table 1 (Sikkel, 1997; Shieber et al., 1995; Goodman, 1999), explained in more detail in App. B. Its proof trees D_goal are in one-to-one correspondence with the derivation trees T_x (a property that we will maintain for our improved deduction systems in §5 and §7). The grammar G is encoded by axioms A → ρ that correspond to the productions of the grammar. The input sentence x is encoded by axioms of the form [j−1, j, x_j], one per input position 1 ≤ j ≤ N.

Footnote 6: Each proved item stores a "backpointer" to the rule that proved it. Equivalently, an item's proofs may be tracked by its weight in a "derivation semiring" (Goodman, 1999).

An item of the form [i, j, A → µ • ν] is derivable only if the grammar G has a production A → µ ν such that µ *⇒ x_{i:j}. Therefore, the • indicates the progress we have made through the production. An item with nothing to the right of the •, e.g., [i, j, A → ρ •], is called complete. The set of all items with a shared right index j is called the item set of j, denoted T_j.
While µ *⇒ x_{i:j} is a necessary condition for [i, j, A → µ • ν] to be provable, it is not sufficient. For efficiency, the Earley deduction system is cleverly constructed so that this item is provable iff it can appear in a proof of the goal item for some input string beginning with x_{0:j}, and thus possibly for x itself. Including [0, 0, S′ → • S] as an axiom in the system effectively causes forward chaining to start looking for a derivation at position 0. Forward chaining will prove the goal item [0, N, S′ → S •] iff S *⇒ x. These two items conveniently pretend that the grammar has been augmented with a new start symbol S′ ∉ N that only rewrites according to the single production S′ → S.
The Earley system employs three deduction rules: PREDICT, SCAN, and COMPLETE. We refer the reader to App. B for a presentation and analysis of these rules, which reveals a total runtime of O(N³|G||R|). App. C outlines how past work improved this runtime. In particular, Graham et al. (1980) presented an unweighted recognizer that is a variant of Earley's, along with implementation details that enable it to run in time O(N³|G|). However, those details were lost in retelling their algorithm as a deduction system (Sikkel, 1997, p. 113).

Our improved system, EarleyFast, introduces new kinds of items, such as [j, j, B → • ⋆] and [j, k, B]. In these items, the constant symbol ⋆ can be regarded as a wildcard that stands for "any sequence ρ." We also use these new items to replace the goal item and the axiom that used S′; the extra S′ symbol is no longer needed. The proofs are essentially unchanged (App. D).
We now describe our new deduction rules for COMP and PRED. (SCAN is unchanged.) We also analyze their runtime, using the same techniques as in App. B.

Predict
We split PRED into two rules: PRED1 and PRED2. The first rule, PRED1, creates an item that gathers together all requests to look for a given nonterminal B starting at a given position j:

  [i, j, A → µ • B ν]
  ─────────────────── PRED1
    [j, j, B → • ⋆]

There are three free choices in the rule: indices i and j, and dotted production A → µ • B ν. Therefore, PRED1 has a total runtime of O(N²|G|).
The second rule, PRED2, expands the item into commitments to look for each specific kind of B:

  [j, j, B → • ⋆]    B → ρ
  ──────────────────────── PRED2
       [j, j, B → • ρ]

PRED2 has two free choices: index j and production B → ρ. Therefore, PRED2 has a runtime of O(N|R|), which is dominated by O(N|G|), and so the two rules together have a runtime of O(N²|G|).

Complete
We speed up COMP in a similar fashion to PRED. We split COMP into two rules: COMP1 and COMP2. The first rule, COMP1, gathers all complete B constituents over a given span into a single item:

  [j, k, B → ρ •]
  ─────────────── COMP1
     [j, k, B]

We have three free choices: indices j and k, and complete production B → ρ with domain size |R|. Therefore, COMP1 has a total runtime of O(N²|R|), or O(N²|G|).
The second rule, COMP2, attaches the resulting complete items to any incomplete items that predicted them:

  [i, j, A → µ • B ν]    [j, k, B]
  ──────────────────────────────── COMP2
        [i, k, A → µ B • ν]

We have four free choices: indices i, j, and k, and dotted production A → µ • B ν. Therefore, COMP2 has a total runtime of O(N³|G|), and so the two rules together have a runtime of O(N³|G|).
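The split rules can be sketched as a small agenda-based recognizer (an illustrative sketch, not the paper's Cython implementation; the item encodings and names are our own, and the naive antecedent matching here scans the whole chart, so it does not attain the O(N³|G|) bound, which additionally requires the indexing discussed in the appendices):

```python
def earley_recognize(grammar, start, x):
    """Unweighted recognizer using the two-step PREDICT/COMPLETE split.
    Item encodings (illustrative):
      (i, j, A, rhs, d)   dotted item [i, j, A -> rhs[:d] . rhs[d:]]
      ("P", j, B)         the prediction item [j, j, B -> . *]
      ("C", i, j, B)      the completion item [i, j, B]
    `grammar` maps each nonterminal to a list of right-hand-side tuples."""
    N = len(x)
    proved, agenda = set(), [("P", 0, start)]
    def push(it):
        if it not in proved:
            agenda.append(it)
    while agenda:
        it = agenda.pop()
        if it in proved:
            continue
        proved.add(it)
        if it[0] == "P":                          # PRED2: expand the wildcard
            _, j, B = it
            for rhs in grammar.get(B, []):
                push((j, j, B, rhs, 0))
        elif it[0] == "C":                        # COMP2: complete item arrives
            _, j, k, B = it
            for other in list(proved):
                if other[0] not in ("P", "C") and other[1] == j:
                    i, _, A, rhs, d = other
                    if d < len(rhs) and rhs[d] == B:
                        push((i, k, A, rhs, d + 1))
        else:
            i, j, A, rhs, d = it
            if d == len(rhs):                     # COMP1: gather into [i, j, A]
                push(("C", i, j, A))
            else:
                sym = rhs[d]
                if sym in grammar:                # PRED1: request a `sym` at j
                    push(("P", j, sym))
                    for other in list(proved):    # COMP2: dotted item arrives
                        if other[0] == "C" and other[1] == j and other[3] == sym:
                            push((i, other[2], A, rhs, d + 1))
                elif j < N and x[j] == sym:       # SCAN a terminal
                    push((i, j + 1, A, rhs, d + 1))
    return ("C", 0, N, start) in proved
```

Note that the two COMP2 branches cover both arrival orders of the antecedents, so the result does not depend on the agenda's pop order.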

Semiring-Weighted Parsing
We have so far presented Earley's algorithm and our improved deduction system in the unweighted case. However, we are often interested in determining not just whether a parse exists, but the total weight of all parses as in equation (2), or the total weight of all parses consistent with a given prefix as in equation (3).
We first observe that by design, the derivation trees of the CFG are in 1-1 correspondence with the proof trees of our deduction system that are rooted at the goal item. Furthermore, the weight of a derivation subtree can be found as the weight of the corresponding proof tree, if the weight w(d V ) of any proof tree d V is defined recursively as follows.
Base case: d_V may be a single node, i.e., V is an axiom. If V has the form A → ρ, then w(d_V) is the weight of the corresponding grammar production, i.e., w(A → ρ). All other axiomatic proof trees of Earley and EarleyFast have weight 1 (footnote 10). Recursive case: If the root node of d_V has child subtrees d_{U_1}, d_{U_2}, . . ., then w(d_V) = w(d_{U_1}) ⊗ w(d_{U_2}) ⊗ · · ·. However, the factors in this product include only the antecedents written above the bar, not the side conditions (see §3).
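The two cases of this recursion can be transcribed directly (a sketch; the tree encoding and names are ours, and side-condition children are simply omitted from the children list since they contribute no factors):

```python
def proof_weight(tree, axiom_weight, times, one):
    """Weight of a proof tree. `tree` is a pair (item, children);
    a leaf (no children) is an axiom whose weight axiom_weight gives."""
    item, children = tree
    if not children:                    # base case: an axiom
        return axiom_weight(item)
    w = one                             # recursive case: ⊗ over children
    for child in children:
        w = times(w, proof_weight(child, axiom_weight, times, one))
    return w

# Real-semiring example: production axioms carry their grammar weight,
# all other axioms weigh 1 (items and weights are made up).
weights = {"S -> NP VP": 0.5, "NP -> dogs": 0.2, "VP -> bark": 0.4}
aw = lambda item: weights.get(item, 1.0)
tree = ("goal", [("S -> NP VP", []),
                 ("np item", [("NP -> dogs", [])]),
                 ("vp item", [("VP -> bark", [])])])
```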
Following Goodman (1999), we may also associate a weight with each item V, denoted β(V), which is the total weight of all its proofs d_V ∈ D_V. By the distributive property, we can obtain that weight as an ⊕-sum over all one-step proofs of V from antecedents. Specifically, each deduction rule that deduces V contributes an ⊕-summand, given by the product β(U_1) ⊗ β(U_2) ⊗ · · · of the weights of its antecedent items (other than side conditions). Now our weighted recognizer can obtain Z_x (the total weight of all derivations of x) as β of the goal item (the total weight of all proofs of that item).

Footnote 10: However, this will not be true in EarleyFSA (§7 below). There the grammar is given by a WFSA, and each axiom corresponding to an arc or final state of this grammar will inherit its weight from that arc or final state. Similarly, if we generalize to lattice parsing, where the input is given by an acyclic WFSA and each proof tree corresponds to a parse of some weighted path from this so-called lattice, then an axiom providing a terminal token should use the weight of the corresponding lattice edge. Then the weight of the proof tree will include the total weight of the lattice path along with the weight of the CFG productions used in the parse.
For an item V of the form [i, j, A → µ • ν], the weight β(V) will consider derivations of nonterminals in µ but not those in ν. We therefore refer to β(V) as an incomplete inside weight. However, ν will come into play in the extension of §6.1.
The deduction systems work for any semiring-weighted CFG. Unfortunately, the forward-chaining algorithm for weighted deduction (Eisner et al., 2005, Fig. 3) may not terminate if the system permits cyclic proofs, where an item can participate in one of its own proofs. In this case, the algorithm will merely approach the correct value of Z_x as it discovers deeper and deeper proofs of the goal item. Cyclicity in our system can arise from sets of unary productions such as A → B and B → A, or from nullary productions. We take the approach of eliminating problematic unary and nullary productions from the weighted grammar without changing Z_x for any x. We provide methods to do this in App. E and App. F respectively. It is important to eliminate nullary productions before eliminating unary cycles, since nullary removal may create new unary productions. The elimination of some productions can increase |G|, but we explain how to limit this effect.
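As one ingredient of nullary elimination, the set of nullable nonterminals (those that can derive ε) can be found by a standard least-fixpoint computation; here is an unweighted sketch (the weighted elimination would additionally have to track the weights of the erased material, as discussed in App. F):

```python
def nullable_nonterminals(productions):
    """Least fixpoint: A is nullable iff some production A -> rho has
    every symbol of rho nullable (vacuously true when rho is empty)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable
```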

Extension to Prefix Weights
Stolcke (1995) showed how to extend Earley's algorithm to compute prefix probabilities under PCFGs, by associating a "forward probability" with each •-item. However, he relied on the property that all nonterminals A have Z_A = 1, where Z_A denotes the free weight of A (see App. F); see Fig. 3 for an alternative derivation and more discussion. His construction also does not handle semiring-weighted grammars.
We generalize by associating with each •-item, instead of a "forward probability," a "prefix outside weight" from the same commutative semiring that is used to weight the grammar productions. Formally, each w(V) will now be a pair (β(V), α(V)), and we combine these pairs in specific ways. Recall from §4 that the item V = [i, j, A → µ • ν] is provable iff it appears in a proof of some sentence beginning with x_{0:j}. For any such proof containing V, its steps can be partitioned as shown in Fig. 1, factoring the proof weight into three factors. Just as the incomplete inside weight β(V) is the total weight of all ways to prove V, the future inside weight Z_ν is the total weight of all ways to prove [i, j, A → µ ν •] from V, and the prefix outside weight α(V) is the total weight of all ways to prove the goal item from [i, j, A → µ ν •]; in both cases we allow any future words x_{j:} as "free" axioms. The future inside weight Z_ν = ⊗_{i : ν_i ∈ N} Z_{ν_i} does not depend on the input sentence. To avoid a slowdown at parsing time, we precompute this product for each suffix ν of each production in R, after using methods in App. F to precompute the free weights Z_A for each nonterminal A.
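The suffix products Z_ν can be precomputed with one right-to-left sweep per production (a sketch, assuming the free weights Z_A are already available; the function name is ours):

```python
def suffix_free_weights(rhs, Z, times, one):
    """For a production right-hand side `rhs`, return a list suf with
    suf[d] = Z_{rhs[d:]}, the ⊗-product of Z[B] over the nonterminals B
    in the suffix rhs[d:].  Terminals contribute the identity `one`.
    `Z` maps each nonterminal to its free weight Z_A."""
    suf = [one] * (len(rhs) + 1)        # Z of the empty suffix is 1
    for d in range(len(rhs) - 1, -1, -1):
        z = Z.get(rhs[d], one)          # terminal symbols: weight 1
        suf[d] = times(z, suf[d + 1])
    return suf
```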
Like β(V), α(V) is obtained as an ⊕-sum over all one-step proofs of V. Typically, each one-step proof increments α(V) by the prefix outside weight of its •-antecedent or •-side condition (for COMP2, the left •-antecedent). As an important exception, when V = [j, j, B → • ⋆], each of its one-step proofs (via PRED1) must combine the prefix outside weight of its antecedent [i, j, A → µ • B ν] with some steps inside the A (including its production) to get all the steps outside the B. The base case is the start axiom, α([0, 0, S → • ⋆]) = 1. Unfortunately, this computation of α(V) is only correct if there is no left-recursion in the grammar. We explain this issue in App. G.1 and fix it by extending the solution of Stolcke (1995, §4.5.1).
The prefix weight of x_{0:j} (j > 0) is computed as an ⊕-sum α([j, j]) over all one-step proofs of the new item [j, j] via a new deduction rule, POS, that is triggered by the consequent of SCAN.


Earley's Algorithm Using an FSA

In this section, we present a generalization of EarleyFast that can parse with any weighted finite-state automaton (WFSA) grammar M in O(N³|M|). Here M is a WFSA (Mohri, 2009) that encodes the CFG productions as follows. For any ρ ∈ (Σ ∪ N)* and any A ∈ N, for M to accept the string ρ Â with weight w ∈ W is tantamount to having the production A → ρ in the CFG with weight w. The grammar size |M| is the number of WFSA arcs. See Fig. 2 for an example. This presentation has three advantages over a CFG. First, M can be compiled from an extended CFG (Purdom and Brown, 1981), which allows user-friendly specifications like NP → Det? Adj* N+ PP* that may specify infinitely many productions with unboundedly long right-hand sides ρ (although M still only describes a context-free language). Second, productions with similar right-hand sides can be partially merged to achieve a smaller grammar and a faster runtime. They may share partial paths in M, which means that a single item can efficiently represent many dotted productions. Third, when ⊗ is non-commutative, only the WFSA grammar formalism allows elimination of nullary rules in all cases (see App. F). Our WFSA grammar is similar to a recursive transition network or RTN grammar (Woods, 1970). Adapting Earley's algorithm to RTNs was discussed by Kochut (1983), Leermakers (1989), and Perlin (1991). Klein and Manning (2001b) used a weighted version for PTB parsing. None of them spelled out a deduction system, however.
Also, an RTN is a collection of productions of the form A → M_A, where for M_A to accept ρ corresponds to having A → ρ in the CFG. Thus an RTN uses one FSA per nonterminal. Our innovation is to use one WFSA for the entire grammar, specifying the left-hand-side nonterminal as a final symbol. Thus, to allow productions A → µ ν and B → µ ν′, our single WFSA can have paths µ ν Â and µ ν′ B̂ that share the µ prefix, as in Fig. 2. This allows our EarleyFSA to match the µ prefix only once, in a way that could eventually result in completing either an A or a B (or both) (footnote 13). A traditional weighted CFG G can be easily encoded as an acyclic WFSA M with |M| = |G|, by creating a weighted path of length k and weight w (footnote 14) for each CFG production of size k and weight w, terminating in a final state, and then merging the initial states of these paths into a single state that becomes the initial state of the resulting WFSA. The paths are otherwise disjoint. Importantly, this WFSA can then be determinized and minimized (Mohri, 1997) to potentially reduce the number of states and arcs (while preserving the total weight of each sequence) and thus speed up parsing (Klein and Manning, 2001b). Among other things, this will merge common prefixes and common suffixes.

Footnote 13: Nederhof (1994b) also shares prefixes between A and B; but there, once paths split to yield separate items, they cannot remerge to share a suffix. We can merge by deriving [j, k, q?] in multiple ways. Our [j, k, q?] does not specify its set of target left-hand sides; FILTER recomputes that set dynamically.

Footnote 14: For example, the production S → NP VP would be encoded as a path of length 3 accepting the sequence NP VP Ŝ. The production's weight may arbitrarily be placed on the first arc of the path, the other arcs having weight 1 (see App. A).
In general, however, the grammar can be specified by any WFSA M-not necessarily deterministic. This could be compiled from weighted regular expressions, or be an encoded Markov model trained on observed productions (Collins, 1999), or be obtained by merging states of another WFSA grammar (Stolcke and Omohundro, 1994) in order to smooth its weights and speed it up.
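The prefix-sharing encoding can be sketched as a trie over the strings ρ Â (weights omitted for brevity; the "^A" marker for the hatted nonterminal and the function name are our own notation):

```python
def cfg_to_fsa(productions):
    """Encode each production A -> rho as the symbol string rho followed
    by a marker "^A" for the hatted nonterminal, and insert these strings
    into a trie so productions with a common right-hand-side prefix share
    arcs.  Returns (arcs, finals) where arcs maps
    state -> {symbol: next_state} and state 0 is the single initial state."""
    arcs = {0: {}}
    finals = set()
    next_state = 1
    for lhs, rhs in productions:
        state = 0
        for sym in tuple(rhs) + ("^" + lhs,):
            if sym not in arcs[state]:
                arcs[state][sym] = next_state
                arcs[next_state] = {}
                next_state += 1
            state = arcs[state][sym]
        finals.add(state)
    return arcs, finals
```

Merging common suffixes as well, as minimization would, is not attempted here; the trie only shares prefixes.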
The WFSA has states Q and weighted arcs (or edges) E, over an alphabet A consisting of Σ ∪ N together with hatted nonterminals like Â. Its initial and final states are denoted by I ⊆ Q and F ⊆ Q, respectively. We denote an arc of the WFSA by (q a⇝ q′) ∈ E where q, q′ ∈ Q and a ∈ A ∪ {ε}. Each such arc corresponds to an axiom with the same weight as the arc. Each q ∈ I corresponds to an axiom whose weight is the initial-state weight of q. The item q ∈ F is true not only if q is a final state but more generally if q has an ε-path of length ≥ 0 to a final state; the item's weight is the total weight of all such ε-paths, where a path's weight includes its final-state weight.
For a state q ∈ Q and symbol A ∈ N, the precomputed side condition q A⇝ ⋆ is true iff there exists a state q′ ∈ Q such that the arc q A⇝ q′ exists in E. Additionally, the precomputed side condition q *A⇝ ⋆ is true iff there exists a path starting from q that eventually reads A. As these are only used as side conditions, they may be given any non-0 weight. (Note that if the WFSA is obtained as described above, it will only have one initial state.)

The EarleyFSA deduction system is given in Table 2. It can be run in time O(N³|M|). It is similar to EarleyFast, where the dotted rules have been replaced by WFSA states. However, unlike a dotted rule, a state does not specify a PREDICTed left-hand-side nonterminal. As a result, when any deduction rule "advances the dot" to a new state q, it builds a provisional item [j, k, q?] that is annotated with a question mark. This mark represents the fact that although q is compatible with several left-hand sides A (those for which q *A⇝ ⋆ is true), the left context x_{0:j} might not call for any of those nonterminals. If it calls for at least one such nonterminal A, then the new FILTER rule will remove the question mark, allowing further progress.
One important practical advantage of this scheme for natural language parsing is that it prevents a large-vocabulary slowdown. In Earley, predicting a noun at position 4 proves an item [4, 4, N → • a] for every production N → a, where a ranges over all nouns in the vocabulary. But EarleyFSA in the corresponding situation will predict only [4, 4, q] where q is the initial state, without yet predicting the next word. If the next input word is [4, 5, happy], then EarleyFSA follows just the happy arcs from q, yielding items of the form [4, 5, q′?] (which will then be FILTERed away since happy is not a noun).
Note that SCAN, COMP1 and COMP2 are ternary, rather than binary as in EarleyFast. For further speed-ups we can apply the fold transform on these rules in a similar manner as before, resulting in binary deduction rules. We present this binarized version in App. I.
As before, we must eliminate unary and nullary rules before parsing; App. J explains how to do this with a WFSA grammar. In addition, although Table 2 allows the WFSA to contain ε-arcs, App. J explains how to eliminate ε-cycles in the WFSA, which could prevent us from converging, for the usual reason that an item [i, j, q] could participate in its own derivation. Afterwards, there is again a nearly acyclic order in which the deduction engine can prove items (as in App. H.1 or App. H.3).
As noted above, we can speed up EarleyFSA by reducing the size of the WFSA. Unfortunately, minimization of general FSAs is NP-hard. However, we can at least seek a minimal deterministic WFSA M′ with |M′| ≤ |M|, in most semirings (Mohri, 2000; Eisner, 2003). The determinization (Aho et al., 1986) and minimization (Aho and Hopcroft, 1974; Revuz, 1992) algorithms for the boolean semiring are particularly well-known. Minimization merges states, which results in merging items, much as when EarleyFast merged items that had different predot symbols (Leermakers, 1992; Nederhof and Satta, 1997; Moore, 2000).
Another advantage of the WFSA presentation of Earley's is that it makes it simple to express a tighter bound on the runtime. Much of the grammar size |G| or |M| is due to terminal symbols that are not used at most positions of the input. Suppose the input is an ordinary sentence (one word at each position, unlike the lattice case in footnote 7), and suppose c is a constant such that no state q has more than c outgoing arcs labeled with the same terminal a ∈ Σ. Then when SCAN tries to extend [i, j, q], it considers at most c arcs. Thus, the O(|M|) factor in our runtime (where |M| = |E|) can be replaced with O(|Q| · c + |E N |), where E N ⊆ E is the set of edges that are not labeled with terminals.

Practical Runtime of Earley's
We empirically measure the runtimes of Earley, EarleyFast, and EarleyFSA. We use the tropical semiring (Pin, 1998) to find the highest-weighted derivation trees. We use two grammars that were extracted from the PTB: Markov-order-2 (M2) and Parent-annotated Markov-order-2 (PM2). For each grammar, we ran our parsers on 100 randomly selected sentences of 5 to 40 words from the PTB test set (mean 21.4, stdev 10.7), although we omitted sentences of length > 25 from the Earley graph as it was too slow (> 3 minutes per sentence). The full results are displayed in App. L. The graph shows that EarleyFast is roughly 20× faster at all sentence lengths. We obtain a further speed-up of 2.5× by switching to EarleyFSA.

Conclusion
In this reference work, we have shown how the runtime of Earley's algorithm is reduced to O(N³|G|) from the naive O(N³|G||R|). We presented this dynamic programming algorithm as a deduction system, which splits prediction and completion into two steps each, in order to share work among related items. To further share work, we generalized Earley's algorithm to work with a grammar specified by a weighted FSA. We demonstrated that these speed-ups are effective in practice. We also provided details for efficient implementation of our deduction system. We showed how to generalize these methods to semiring-weighted grammars by correctly transforming the grammars to eliminate cyclic derivations. We further provided a method to compute the total weight of all sentences with a given prefix under a semiring-weighted CFG.
We intend this work to serve as a clean reference for those who wish to efficiently implement an Earley-style parser or develop related incremental parsing methods. For example, our deduction systems could be used as the starting point for neural models of incremental processing, in which each derivation of an item contributes not only to its weight but also to its representation in a vector space (cf. Drozdov et al.).

Orthogonal to the speed-ups discussed in this work, Earley (1970) described an extension that we do not include here, which filters deduction items using k words of lookahead. (However, we do treat 1-word lookahead and left-corner parsing in App. G.2.) While our deduction system runs in time proportional to the grammar size |G|, this size is measured only after unary and nullary productions have been eliminated from the grammar, which can increase the grammar size as discussed in Apps. E and F.
We described how to compute prefix weights only for EarleyFast, and we gave a prioritized execution scheme only for EarleyFast. The versions for EarleyFSA should be similar.
Computing sentence weights (2) and prefix weights (3) involves a sum over infinitely many trees. In arbitrary semirings, there is no guarantee that such sums can be computed. Computing them requires summing geometric series and, more generally, finding minimal solutions to systems of polynomial equations. See the discussion in App. A and App. F. Non-commutative semirings also present special challenges; see App. K.
A semiring is commutative if additionally ⊗ is commutative. A closed semiring has an additional operator * satisfying the axiom (∀w ∈ W) w* = 1 ⊕ (w ⊗ w*) = 1 ⊕ (w* ⊗ w), which characterizes w* as the sum of the geometric series 1 ⊕ w ⊕ (w ⊗ w) ⊕ ⋯.

As an example that may be of particular interest, Goodman (1999) shows how to construct a (non-commutative) derivation semiring, so that Z_x in equation (2) gives the best derivation (parse tree) along with its weight, or alternatively a representation of the forest of all weighted derivations. This is how a weighted recognizer can be converted to a parser.
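As a concrete (if simplified) illustration of Goodman's idea, one can pair each weight with a backpointer so that ⊕ keeps the better derivation and ⊗ combines subderivations. The sketch below uses our own class name and representation, and keeps only the single best derivation rather than Goodman's full forest-valued construction.

```python
# Simplified sketch of a derivation-valued semiring in the spirit of Goodman
# (1999): values pair a probability with a derivation tree, so that "plus"
# picks the better derivation and "times" combines subderivations. The class
# name and representation are our own; the full construction also supports
# packed forests of all derivations.

class ViterbiDerivation:
    def __init__(self, prob, tree):
        self.prob, self.tree = prob, tree

    def plus(self, other):   # semiring "+": keep the higher-weighted derivation
        return self if self.prob >= other.prob else other

    def times(self, other):  # semiring "x": multiply weights, pair up trees
        return ViterbiDerivation(self.prob * other.prob,
                                 (self.tree, other.tree))
```

Running a weighted recognizer with these operations in place of + and × yields the best parse tree as a by-product of computing its weight.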

B Earley's Original Algorithm as a Deduction System

§4 introduced the deduction system that corresponds to Earley's original algorithm. We explain and analyze it here. Overall, the three rules of this system, Earley (Table 1), correspond to possible steps in a top-down recursive descent parser (Aho et al., 1986):

• SCAN consumes the next single input symbol (the base case of recursive descent);
• PREDICT calls a subroutine to consume an entire constituent of a given nonterminal type by recursively consuming its subconstituents;
• COMPLETE returns from that subroutine.
How then does it differ from recursive descent? Rather like depth-first search, Earley's algorithm uses memoization to avoid redoing work, which avoids exponential-time backtracking and infinite recursion. But like breadth-first search, it pursues possibilities in parallel rather than by backtracking. The steps are invoked not by a backtracking call stack but by a deduction engine, which can deduce new items in any convenient order. The effect on the recursive descent parser is essentially to allow co-routining (Knuth, 1997): execution of a recursive descent subroutine can suspend until further input becomes available or until an ancestor routine has returned and memoized a result thanks to some other nondeterministic execution path.
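To make this correspondence concrete, here is a minimal unweighted Earley recognizer in the textbook style, not the optimized deduction system of this paper. The grammar encoding and the toy example are our own illustration, and the sketch assumes nullary productions have been eliminated, as discussed in App. F.

```python
# A minimal unweighted Earley recognizer, for intuition only: items are
# (lhs, rhs, dot, origin). SCAN advances the dot over the next word, PREDICT
# starts subgoals for the nonterminal after the dot, and COMPLETE returns
# from a finished subgoal. Assumes no nullary productions.

def earley_recognize(grammar, start, words):
    """grammar: dict mapping nonterminal -> list of right-hand sides (tuples)."""
    n = len(words)
    chart = [set() for _ in range(n + 1)]  # chart[k]: items ending at k
    chart[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for k in range(n + 1):
        agenda = list(chart[k])
        while agenda:                       # memoized deduction at position k
            lhs, rhs, dot, i = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:          # PREDICT: look for sym starting at k
                    for r in grammar[sym]:
                        item = (sym, r, 0, k)
                        if item not in chart[k]:
                            chart[k].add(item)
                            agenda.append(item)
                elif k < n and words[k] == sym:   # SCAN: consume one terminal
                    chart[k + 1].add((lhs, rhs, dot + 1, i))
            else:                           # COMPLETE: lhs spans i..k
                for l2, r2, d2, i2 in list(chart[i]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        item = (l2, r2, d2 + 1, i2)
                        if item not in chart[k]:
                            chart[k].add(item)
                            agenda.append(item)
    return any((start, rhs, len(rhs), 0) in chart[n] for rhs in grammar[start])
```

The memoization in `chart` is exactly what prevents the exponential backtracking and infinite left-recursion of naive recursive descent.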

B.1 Predict
To look for constituents of type B starting at position j, using the rule B → ρ, we need to prove [j, j, B → • ρ]. Earley's algorithm imposes [i, j, A → µ • B ν] as a side condition, so that we only start looking if such a constituent B could be combined with some item to its left. 18

Runtime analysis. How many ways are there to jointly instantiate the two antecedents of PRED with actual items? The pair of items is determined by making four choices: 19 indices i and j, each with a domain of size N + 1; a dotted production A → µ • B ν, with domain size |G|; and a production B → ρ, with domain size |R|. Therefore, the number of instantiations of PRED is O(N²|G||R|). That is then PRED's contribution to the runtime of a suitable implementation of forward-chaining deduction, using Theorem 1 of McAllester (2002). 20

B.2 Scan
If we have proved an incomplete item [i, j, A → µ • a ν], we can advance the dot if the next terminal symbol is a:

18 Minnen (1996) and Eisner and Blatz (2007) explain that this side condition is an instance of the "magic sets" technique that filters some unnecessary work from a bottom-up algorithm (Ramakrishnan, 1991).
19 Treating these choices as free and independent is enough to give us an upper bound. In actuality, the choices are not quite independent (for example, any provable item has i ≤ j), but there are no interdependencies that could be exploited to tighten our asymptotic bound.
20 Technically, that theorem also directs us to count the instantiations of just the first antecedent, namely O(|R|

B.3 Complete

Runtime analysis. COMP has five free choices: indices i, j, and k, each with a domain of size N + 1; a dotted production A → µ • B ν, with domain size |G|; and the complete production B → ρ, with domain size |R|. Therefore, COMP contributes O(N³|G||R|) to the runtime.

B.4 Total Space and Runtime
By a similar analysis of free choices, the number of items that the Earley deduction system will be able to prove is O(N²|G|). This is a bound on the space needed by the forward-chaining implementation to store the items that have been proved so far.

C Previous Speed-ups
We briefly discuss past approaches used to improve the asymptotic efficiency of Earley.
Leermakers (1992) noted that in an item of the form [i, j, A → µ • ν], the sequence µ is irrelevant to subsequent deductions. Therefore, he suggested (in effect) replacing µ with a generic placeholder ⋆. This merges items that had only differed in their µ values, so the algorithm processes fewer items. This technique can also be seen in Moore (2000) and Klein and Manning (2001a,b). Importantly, this means that each nonterminal has only one complete item, [j, k, B → ⋆ •], for each span. This effect alone is enough to improve the runtime of Earley's algorithm to O(N³|G| + N²|G||R|). Our §5.2 will give a version of the trick that gets only this effect, by folding the COMPLETE rule. The full version of Leermakers (1992)'s trick is subsumed by our generalized approach in §7.
While the GHR algorithm (a modified version of Earley's algorithm) is commonly known to run in O(N³|G||R|) time, Graham et al. (1980, §3) provide a detailed exploration of the low-level implementation of their algorithm that enables it to run in O(N³|G|) time. This explanation spans 20 pages and includes techniques similar to those mentioned in §5, as well as discussion of data structures. To the best of our knowledge, these details have not been carried forward in subsequent presentations of GHR (Stolcke, 1995; Goodman, 1999). In the deduction system view, we are able to achieve the same runtime quite easily and transparently by folding both COMPLETE (§5.2) and PREDICT (§5.1). In both cases, this eliminates the pairwise interactions between all |G| dotted productions and all |R| complete productions, thereby reducing |G||R| to |G|.

D Correspondence Between Earley and EarleyFast
The proofs of EarleyFast are in one-to-one correspondence with the proofs of Earley.
We show the key steps in transforming between the two styles of proof. Table 3 shows the correspondence between an application of PRED and an application of PRED1 and PRED2 , while Table 4 shows the correspondence between an application of COMP and an application of COMP1 and COMP2.

E Eliminating Unary Cycles
As mentioned in §6, our weighted deduction system requires that we eliminate unary cycles from the grammar. Stolcke (1995, §4.5) addresses the problem of unary production cycles by modifying the deduction rules. 21 He assumes use of the probability semiring, where W = [0, 1], ⊕ = +, and ⊗ = ×. In that case, inverting a single |N| × |N| matrix suffices to compute the total weight of all rewrite sequences A *⇒ B, known as unary chains, for each ordered pair ⟨A, B⟩ ∈ N². 22 His modified rules then ignore the original unary productions and refer to these weights instead.
We take a very similar approach, but instead describe it as a transformation of the weighted grammar, leaving the deduction system unchanged. We generalize from the probability semiring to any closed semiring, that is, any semiring that provides an operator * to compute geometric series sums in closed form (see App. A). In addition, we improve the construction: we do not collapse all unary chains as Stolcke (1995) does, but only those subchains that can appear on cycles. This prevents the grammar size from blowing up more than necessary (recall that the parser's runtime is proportional to grammar size). For example, if the unary productions are A_i → A_{i+1} for all 1 ≤ i < K, then there is no cycle and our transformation leaves these K − 1 productions unchanged, rather than replacing them with K(K − 1)/2 new unary productions that correspond to the possible chains A_i *⇒ A_j.

Given a weighted CFG G = ⟨N, Σ, R, S, w⟩, consider the weighted graph whose vertices are N and whose weighted edges A → B are given by the unary productions A → B. (This graph may include self-loops such as A → A.) Its strongly connected components (SCCs) represent unary production cycles and can be found in linear time (and thus in O(|G|) time). For any A and B in the same SCC, w(A *⇒ B) ∈ W denotes the total weight of all rewrite sequences of the form A *⇒ B (including the 0-length sequence with weight 1, if A = B). For an SCC of size K, there are K² such weights, and they can be found in total time O(K³) by the Kleene-Floyd-Warshall algorithm (Lehmann, 1977; Tarjan, 1981b,a). In the real semiring, this algorithm corresponds to using Gauss-Jordan elimination to invert I − E, where E is the weighted adjacency matrix of the SCC (rather than of the whole graph as in Stolcke (1995)). In the general case, it computes the infinite matrix sum I ⊕ E ⊕ (E ⊗ E) ⊕ ⋯ in closed form, with the help of the * operator of the closed semiring.
We now construct a new grammar G′ = ⟨N′, Σ, R′, S, w′⟩ that has no unary cycles, as follows. For each A ∈ N, our N′ contains two nonterminals, A and Ā. For each ordered pair of nonterminals

22 In a PCFG in which all rule weights are > 0, this total weight is guaranteed finite provided that all nonterminals are generating (footnote 8).

where ρ̄ is a version of ρ in which each nonterminal B has been replaced by B̄. Finally, as a constant-factor optimization, A and Ā may be merged back together if A formed a trivial SCC with no self-loop: that is, remove the weight-1 production A → Ā from R′ and replace all copies of A and Ā with A throughout G′.
Of course, as Aycock and Horspool (2002) noted, this grammar transformation does change the derivations (parse trees) of a sentence, which is also true for the grammar transformation in App. F below. A derivation under the new grammar (with weight w) may represent infinitely many derivations under the old grammar (with total weight w). In principle, if the old weights were in the derivation semiring (see App. A), then w will be a representation of this infinite set. This implies that the * operator in this section, and the polynomial system solver in App. F below, must be able to return weights in the derivation semiring that represent infinite context-free languages.

F Eliminating Nullary Productions
In addition to unary cycles (App. E), we must eliminate nullary productions in order to avoid cyclic proofs, as mentioned in §6. This must be done before eliminating unary cycles, since eliminating nullary productions can create new unary productions. Hopcroft et al. (2007, §7.1.3) explain how to do this in the unweighted case. Stolcke (1995, §4.7.4) sketches a generalization to the probability semiring, but his sketch also uses the non-semiring operations of division and subtraction (and is not clearly correct). We therefore give an explicit general construction.
While we provide a method that handles nullary productions by modifying the grammar, it is also possible to instead modify the algorithm to allow advancing the dot over nullable nonterminals, i.e., nonterminals A such that the grammar allows A *⇒ ε (Aycock and Horspool, 2002).

Our first step, like Stolcke's, is to compute the "null weight" for each A ∈ N. 23 Although a closed semiring does not provide an operator for this summation, these values are a solution to the system (8) of |N| polynomial equations. In the same way, the free weights from equation (4) in §6.1 are a solution to the system (9), which differs only in that ρ is allowed to contain terminal symbols. In both cases, the distributive property of semirings is being used to recursively characterize a sum over what may be infinitely many trees. A solution to system (8) must exist for the sums in equation (2) to be well-defined in the first place. (Similarly, a solution to system (9) must exist for the sums in equations (3) and (4) to be well-defined.) If there are multiple solutions, the desired sum is given by the "minimal" solution, in which as many variables as possible take on the value 0. Often in practice the minimal solution can be found using fixed-point iteration, which initializes all free weights to 0 and then iteratively recomputes them via system (8) (respectively, system (9)) until they no longer change (e.g., at numerical convergence). For example, this is guaranteed to work in the tropical semiring.

Given the null weights e_A ∈ W, we now modify the grammar as follows. We adopt the convention that for a production A → ρ that is not yet in R, we consider its weight to be w(A → ρ) = 0, and increasing this weight by any non-0 amount adds it to R.
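The fixed-point iteration can be sketched in a few lines, here in the real (probability) semiring. The grammar encoding (lhs, rhs, weight) and the example grammar are our own; note that the example's equation e = 0.4 + 0.6e² has two solutions, and the iteration from 0 finds the minimal one, 2/3.

```python
# A minimal sketch of fixed-point iteration for null weights e_A in the real
# (probability) semiring: initialize all e_A to 0 and recompute via the
# polynomial system until (numerical) convergence. A production is encoded
# as (lhs, rhs, weight), with terminals being any symbol not in e.

def null_weights(productions, nonterminals, iters=1000, tol=1e-12):
    e = {A: 0.0 for A in nonterminals}
    for _ in range(iters):
        new = {A: 0.0 for A in nonterminals}
        for lhs, rhs, w in productions:
            val = w
            for sym in rhs:
                if sym in e:
                    val *= e[sym]   # nonterminal: weight of erasing it
                else:
                    val = 0.0       # a terminal can never be erased
                    break
            new[lhs] += val
        if all(abs(new[A] - e[A]) < tol for A in nonterminals):
            return new
        e = new
    return e

# e_A solves e = 0.4 + 0.6 e^2, whose minimal solution is 2/3;
# e_B = 0.5 * e_A^2 = 2/9, since B's other production starts with a terminal.
e = null_weights([('A', (), 0.4), ('A', ('A', 'A'), 0.6),
                  ('B', ('b',), 0.5), ('B', ('A', 'A'), 0.5)], {'A', 'B'})
```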
For each nonterminal B such that e_B ≠ 0, let us assume the existence of an auxiliary nonterminal B≠ε that rewrites just as B does, except that it cannot rewrite as ε. We iterate the following step: as long as we can find a production A → µ B ν in R such that e_B ≠ 0, we modify it to the more restricted version A → µ B≠ε ν (keeping its weight), but to preserve the possibility that B *⇒ ε, we also increase the weight of the shortened production A → µ ν by e_B ⊗ w(A → µ B ν).
A production A → ρ where ρ includes k nonterminals B with e_B ≠ 0 will be gradually split up by the above procedure into 2^k productions, in which each such B has been either specialized to B≠ε or removed. The shortest of these productions is A → ε, whose weight is w(A → ε) = e_A by equation (8). So far we have preserved all weights w(A *⇒ x), provided that the auxiliary nonterminals behave as assumed. For each A we must now remove A → ε from R, and since A can no longer rewrite as ε, we rename all other rules A → ρ to A≠ε → ρ. This closes the loop by defining the auxiliary nonterminals as desired.
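The splitting step can be sketched as follows, in the real semiring. The encoding and names are our own; for a production with k nullable nonterminals, the generator yields the 2^k weighted versions described above.

```python
# Sketch of the production-splitting step: a production A -> rho whose
# right-hand side contains k nullable nonterminals is split into 2^k weighted
# productions, with each nullable B either specialized to B≠ε or dropped
# (multiplying in its null weight e_B). Real semiring; encoding is our own.

from itertools import product

def split_production(lhs, rhs, w, e):
    """e: dict mapping each nullable nonterminal B to its null weight e_B."""
    nullable = [i for i, sym in enumerate(rhs) if sym in e]
    for choice in product([False, True], repeat=len(nullable)):
        drop = dict(zip(nullable, choice))
        weight, new_rhs = w, []
        for i, sym in enumerate(rhs):
            if drop.get(i):
                weight *= e[sym]               # B erased: multiply in e_B
            elif i in drop:
                new_rhs.append(sym + '≠ε')     # specialized, non-erasable B
            else:
                new_rhs.append(sym)            # terminal or non-nullable
        yield lhs, tuple(new_rhs), weight
```

For instance, A → B c B with e_B = 0.5 splits into four productions whose right-hand sides retain, drop, or specialize each copy of B.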
Finally, since S is the start symbol, we add back S → ε (with weight e_S) as well as adding the new rule S → S≠ε (with weight 1). Thus (as in Chomsky Normal Form), the only nullary rule is now S → ε, which may be needed to generate the 0-length sentence. We now have a new grammar with nonterminals N′ = {S} ∪ {B≠ε : B ∈ N}. To simplify the names, we can rename the start symbol S to S′ and then drop the ≠ε subscripts. Also, any nonterminals that only rewrote as ε in the original grammar are no longer generating and can be safely removed (see footnote 8).

G.1 Recursive Chains in Prefix Outside Weights

As mentioned in §6.1, there is a subtle issue that arises if the grammar has left-recursive productions. Consider the left-recursive rule B → B ρ. Using equation (5), the prefix outside weight of the predicted item [j, j, B → • B ρ] will only include the weight corresponding to one application of the rule B → B ρ, but correctness demands that we account for the possibility of applying B → B ρ recursively as well. A well-known technique to remove left recursion is the left-corner transform (Rosenkrantz and Lewis, 1970; Johnson and Roark, 2000). As that may lead to drastic increases in grammar size, however, we instead provide a modification of PRED1 that deals with this technical complication (which adapts Stolcke (1995, §4.5.1) to our improved deduction system and generalizes it to closed semirings). Fig. 3 provides some further intuition on the left-recursion issue.
We require some additional definitions: B is a left child of A iff there exists a rule A → B ρ. The reflexive and transitive closure of the left-child relation is *⇒_L, which was already defined in §2. A nonterminal A is said to be left-recursive if A is a nontrivial left corner of itself, i.e., if A +⇒_L A (meaning that A → B ρ and B *⇒_L A for some B and ρ). A grammar is left-recursive if at least one of its nonterminals is left-recursive.
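The definition of left recursion can be checked mechanically. The sketch below, with our own encoding and a made-up grammar in the test, extracts the left-child relation from the productions and marks a nonterminal left-recursive when it can reach itself via one or more left-child edges.

```python
# Sketch: detecting the left-recursive nonterminals (those with A +=>_L A)
# from the left-child relation, by a naive reachability check over
# left-child edges. Productions are encoded as (lhs, rhs) pairs.

def left_recursive(nonterminals, productions):
    left_child = {A: set() for A in nonterminals}
    for lhs, rhs in productions:             # A -> B rho gives edge A -> B
        if rhs and rhs[0] in nonterminals:
            left_child[lhs].add(rhs[0])

    def reaches(src, goal):                  # one or more edges: +=>_L
        seen, stack = set(), [src]
        while stack:
            a = stack.pop()
            for b in left_child[a]:
                if b == goal:
                    return True
                if b not in seen:
                    seen.add(b)
                    stack.append(b)
        return False

    return {A for A in nonterminals if reaches(A, A)}
```

An SCC-based implementation (e.g., Tarjan's algorithm) would do the same job in linear time; the quadratic version above is only for clarity.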
To deal with left-recursive grammars, we collapse the weights of left-recursive paths similarly to how we collapsed unary cycles (see App. E), and ⊗-multiply them in at the PRED1 step.
We consider the left-corner multigraph: given a weighted CFG G = ⟨N, Σ, R, S, w⟩, its vertices are N and its edges are given by the left-child relations, with one edge for every production. Each edge is associated with a weight equal to the weight of the corresponding production, ⊗-times the free weights of the nonterminals on the right-hand side of the production that are not the left child. For instance, for a production A → B C D, the weight of the corresponding edge in the graph will be w(A → B C D) ⊗ Z_C ⊗ Z_D. This graph's SCCs represent the left-corner relations. For any A and B in the same SCC, w(A *⇒_L B) ∈ W denotes the total weight of all left-corner rewrite sequences of the form A *⇒_L B, including the free weights needed to compute the prefix outside weights. These can, again, be found in O(K³) time with the Kleene-Floyd-Warshall algorithm (Lehmann, 1977; Tarjan, 1981b,a), where K is the size of the SCC. These weights can be precomputed and have no effect on the runtime of the parsing algorithm. We replace PRED1 with the rule PRED1LR, which uses the precomputed weights w(C *⇒_L B) when adding to the prefix outside weight α([j, j, B → • ρ]). Note that the case B = C recovers the standard PRED1, and such rules will always be instantiated since *⇒_L is reflexive. The PRED1LR rule has three side conditions (whose visual layout here is not significant).

Table 5: Explicit formulas for incrementing the prefix outside weights during one-step proofs for EarleyFast, for the general case in which the grammar may be left-recursive, as explained in App. G.1. Note that the prefix outside weights for COMP1 go unused by subsequent proof steps, and thus do not contribute to the prefix weights associated with the input string x. The prefix outside weight α([j, j]) is the desired prefix weight w(S *⇒_L x_{0:j}).
Its consequent will feed into PRED2; the condition µ ≠ ε ensures that the output of PRED2 cannot serve again as a side condition to PRED1, since the recursion from C was already fully computed by the C *⇒_L B item. However, since this condition prevents PRED1LR from predicting anything at the start of the sentence, we must also replace the start axiom [0, 0, S → • ⋆] with a rule, START, that resembles PRED1 and derives the start axiom together with all its left corners. The final formulas for aggregating the prefix outside weights are spelled out explicitly in Table 5. Note that we did not spell out a corresponding prefix weight algorithm for EarleyFSA.

G.2 One-Word Lookahead
Orthogonally to App. G.1, we can optionally extend the left child relation to terminal symbols, saying that a is a left child of A if there exists a rule A → a ρ.
The resulting extended left-corner relation (in its unweighted version) can be used to construct a side condition on PRED1 (or PRED1LR), so that at position j, it does not predict all symbols that are compatible with the left context, but only those that are also compatible with the next input terminal.

Figure 3: To find the prefix weight of the prefix "She never joked and," the algorithm must consider all derivations of all complete sentences with that prefix (§6.1). This figure shows two derivations of one such completion, the ambiguous sentence "She never joked and didn't smile during 2020." The left derivation corresponds to a proof tree for which the unique incomplete item ending at position 4 is [2, 4, VP → VP Conj • VP], and the right derivation corresponds to a proof tree for which the unique incomplete item ending at position 4 is [1, 4, VP → VP Conj • VP]. We colorize the productions associated with these items' incomplete inside weight, future inside weight, and outside weight. The prefix outside weight α([4, 4, VP → • ⋆]) sums the product of these three weights over all derivations at PRED1 (equation (5)). It makes use of free weights to sum over all expansions of any nonterminals that were predicted to follow position 4: in these examples, Z_VP and Z_PP are included in, respectively, the future inside weight and the prefix outside weight of the incomplete antecedent item ([2, 4, VP → VP Conj • VP] or [1, 4, VP → VP Conj • VP]). The example derivations shown both contribute one copy of VP → VP PP to the prefix outside weight. However, since that production is left-recursive, the prefix is also consistent with other completions that use it k times to produce k arbitrary future prepositional phrases, for any k = 0, 1, 2, . . .. To sum over all such possibilities we provide a modified PRED1 in App. G.1. This summation over copies of VP → VP PP (and over the possible expansions of its future PP) is needed both when VP is predicted at position 1 and again when VP is predicted at position 2.
To be precise, PRED1 (or PRED1LR) should only predict B at position j if [j, j+1, a] has been proved and B *⇒_L a (for some a). This is in fact Earley (1970)'s k-word lookahead scheme in the special case k = 1.
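The unweighted extended left-corner relation needed for this side condition can be precomputed from the grammar. The sketch below uses our own encoding and a made-up toy grammar, and assumes nullary productions and problematic unary chains were already eliminated, so the first symbol of each right-hand side is the left child.

```python
# Sketch of the extended left-corner relation used for one-word lookahead:
# for each nonterminal B, the set of terminals a with B *=>_L a, computed by
# a naive closure over the (extended) left-child relation.

def left_corner_terminals(productions, nonterminals):
    first = {A: set() for A in nonterminals}
    changed = True
    while changed:                         # iterate to a fixed point
        changed = False
        for lhs, rhs in productions:
            if not rhs:
                continue
            head = rhs[0]                  # the left child of lhs
            new = first[head] if head in nonterminals else {head}
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first
```

At position j, a nonterminal B is then predicted only if the next input terminal belongs to B's precomputed set.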

G.3 Left-Corner Parsing
Nederhof (1993) and Nederhof (1994b) describe a left-corner parsing technique that we could apply to further speed up Earley's algorithm. This subsumes the one-word lookahead technique of the previous section. Eisner and Blatz (2007) sketched how the technique could be derived automatically.
Normally, if B is a deeply nested left corner of C, then the item A → µ • C ν will trigger a long chain of PREDICT actions that culminates in [j, j, B → • ⋆]. Unfortunately, it may not be possible for this B (or anything predicted from it) to SCAN its first terminal symbol, in which case the work has been wasted.
But recall from App. G.1 that the PRED1LR rule effectively summarizes this long chain of predictions using a precomputed weighted item C *⇒_L B. The left-corner parsing technique simply skips the PREDICT steps and uses C *⇒_L B as a side condition to lazily check, after the fact, that the relevant prediction of a •-initial rule could have been made.
PRED1 is removed, so the method never creates dotted productions of the form A → µ • ν where µ = ε-except for the start item and the items derived from it using PRED2.
In COMP2, a side condition µ ≠ ε is added. For the special case µ = ε, a new version of COMP2 is used whose side conditions include C *⇒_L A (which ensures that EarleyFast would have PREDICTed that item). Note that µ′ = ε is possible in the case where D is the start symbol S. The SCAN rule is split in exactly the same way into µ ≠ ε and µ = ε variants.

H Execution of Weighted EarleyFast
Eisner (2023) presents generic strategies for executing unweighted and weighted deduction systems. We apply these here to solve the weighted recognition and prefix weight problems, by computing the weights of all items that are provable from given grammar and sentence axioms.

H.1 Execution via Two-Pass Algorithms
The Earley and EarleyFast deduction systems are nearly acyclic, thanks to our elimination of unary rule cycles and nullary rules from the grammar. However, cycles in the left-child relation can still create deduction cycles, with [k, k, A → • B X] and [k, k, B → • A Y ] proving each other via PRED or via PRED1 and PRED2.
Weighted deduction can be accomplished for these systems using the generic methods of Eisner (2023, §7). This will detect the left-child cycles at runtime (Tarjan, 1972) and solve the weights to convergence within each strongly connected component (SCC). While solving the SCCs can be expensive in general, it is trivial in our setting since the weights of the items within an SCC do not actually depend on one another: these items serve only as side conditions for one another. Thus, any iterative method will converge immediately.
Alternatively, the deduction system becomes fully acyclic when we eliminate prediction chains as shown in App. G.1. In particular, this modified version of EarleyFast replaces PRED1 with PRED1LR. 24 Using this acyclic deduction system allows a simpler execution strategy: under any acyclic deduction system, a reference-counting strategy (Kahn, 1962) can be applied to find the proved items and then compute their weights in topologically sorted order (Eisner, 2023, §6).
In both cyclic and acyclic cases, the above weighted recognition strategies consume only a constant factor more time and space than their unweighted versions, across all deduction systems and all inputs. 25 For EarleyFast and its acyclic version, this means the runtimes are O(N|G|) for a

24 Recall that eliminating the left-child cycles in advance in this way is needed when one wants to compute weights of the form w(V) = (β(V), α(V)), in which case the items in an SCC do not merely serve as side conditions for one another. The weighted deduction formalism of Eisner (2023) is flexible enough to handle cyclic rules that would correctly define these pairwise weights in terms of one another, but solving the SCCs would no longer be fast.
25 Excluding the time to solve the SCCs in the cyclic case; but for us, the statement holds even when including that time.

H.2 One-Pass Execution via Prioritization
For the acyclic version of the deduction system (App. G.1), an alternative strategy is to use a prioritized agenda to visit the items of the acyclic deduction system in some topologically sorted order (Eisner, 2023, §5). This may be faster in practice than the generic reference-counting strategy because it requires only one pass instead of two. It also remains space-efficient. On the other hand, it requires a priority queue, which adds a term to the asymptotic runtime (worsening it in some cases such as bounded-state grammars). We must associate a priority π(V ) with each item V such that if U is an antecedent or side condition in some rule that proves V , then π(U ) < π(V ). Below, we will present a nontrivial prioritization scheme in which the priorities implicitly take the form of lexicographically ordered tuples.
These priorities can easily be converted to integers in a way that preserves their ordering. Thus, a bucket queue (Dial, 1969) or an integer priority queue (Thorup, 2000) can be used (see Eisner (2023, §5) for details). The added runtime overhead 26 is O(M) for the bucket queue or O(M′ log log M′) for the integer priority queue, where M = O(N²|G|) is the number of distinct priority levels in the set of possible items, and M′ ≤ M is the number of distinct priority levels of the actually proved items, which depends on the grammar and input sentence.
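A bucket queue exploits exactly the monotone setting described here: priorities are small nonnegative integers and the minimum priority ever popped never decreases, so a cursor can sweep the buckets left to right just once. The sketch below uses our own minimal API.

```python
# A minimal bucket queue (Dial, 1969) for monotone priorities: pushes below
# the cursor are forbidden, so the cursor only moves forward, giving O(M)
# total queue overhead for M priority levels.

class BucketQueue:
    def __init__(self, max_priority):
        self.buckets = [[] for _ in range(max_priority + 1)]
        self.cursor = 0           # never moves backward (monotonicity)
        self.size = 0

    def push(self, item, priority):
        assert priority >= self.cursor, "priorities must be monotone"
        self.buckets[priority].append(item)
        self.size += 1

    def pop(self):
        while not self.buckets[self.cursor]:
            self.cursor += 1      # amortized O(max_priority) over the run
        self.size -= 1
        return self.buckets[self.cursor].pop()
```

Ties within a bucket pop in LIFO order, which is one of the arbitrary tie-breaking orders permitted by the prioritization scheme below.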
For EarleyFast with the modifications of App. G.1, we assign the minimum priority to all of the axioms. All other items have one of six forms. The relative priorities of these items are as follows:

• Items with smaller k are visited sooner (left-to-right processing).
• Among items with the same k, items with j < k are visited before items with j = k. Thus, the leftmost antecedent of PRED1LR precedes its consequent.
• Among items with the same k and with j < k, items with larger j are visited sooner. Thus, the rightmost antecedent of COMP2 precedes its consequent in the case i < j, where a narrower item is used to build a wider one.
• Among items of the first two forms with the same k and the same j < k, B is visited sooner than A if A * ⇒ B. This ensures that the rightmost antecedent of COMP2 precedes its consequent in the case i = j and ν = ε, which completes a unary constituent.
To facilitate this comparison, one may assign integers to the nonterminals according to their height in the unweighted graph whose vertices are N and whose edges A → B correspond to the unary productions A → B. (This graph is acyclic once unary cycles have been eliminated by the method of App. E.)

• Remaining ties are broken in the order of the numbered list above. This ensures that the antecedents of COMP1, POS, and PRED2 precede their consequents, and the rightmost antecedent of COMP2 precedes its consequent in the case i = j and ν ≠ ε, which starts a non-unary constituent.
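The height assignment on the acyclic unary graph can be sketched as follows; the encoding and the tiny example grammar are our own.

```python
# Sketch of the "height" assignment on the acyclic unary-production graph
# (after App. E): give each nonterminal an integer so that a unary production
# A -> B implies height[A] > height[B], i.e., B is popped before A.

def unary_heights(nonterminals, unary_edges):
    """unary_edges: pairs (A, B), one per unary production A -> B (acyclic)."""
    children = {A: [] for A in nonterminals}
    for a, b in unary_edges:
        children[a].append(b)

    memo = {}
    def height(a):                 # length of the longest unary chain from a
        if a not in memo:
            memo[a] = max((1 + height(b) for b in children[a]), default=0)
        return memo[a]

    return {A: height(A) for A in nonterminals}
```

Comparing these integers then implements the "B is visited sooner than A if A *⇒ B" tie-breaking rule above.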
To understand the flow of information, notice that the 6 specific items in the numbered list above would be visited in the order shown.

H.3 Pseudocode for Prioritized Algorithms
For concreteness, we now give explicit pseudocode that runs the rules to build all of the items in the correct order. This may be easier to implement than the above reductions to generic methods. It is also slightly more efficient than App. H.2, due to exploiting some properties of our particular system. Furthermore, in this section we handle EarleyFast as well as its acyclic modification. When the flag p is set to true, we carry out the acyclic version, which replaces PRED1 with PRED1LR and START (App. G.1), and also includes POS ( §6.1) to find prefix weights. The algorithm pops (dequeues) and processes items in the same order as App. H.2 (when p is true), except that in this version, axioms of the form B → ρ and [k − 1, k, a] are never pushed (enqueued) or popped but are only looked up in indices. Similarly, the [j, j] items (used to find prefix weights) are never pushed or popped but only proved. Thus, none of these items need priorities.
When an item U is popped, our pseudocode invokes only deduction rules for which U might match the rightmost antecedent (which could be a side condition), or in the case of SCAN or PRED1LR, the leftmost antecedent. In all cases, the other antecedents are either axioms or have lower priorities. While we do not give pseudocode for each rule, invoking a rule on U always checks first whether U actually does match the relevant antecedent. If so, it looks up the possible matches for its other antecedents from among the axioms and the previously proved items. This may allow the rule to prove consequents, which it adds to the queues and indices as appropriate (see below).
The main routine is given as Alg. 1. A queue iteration such as "for k ∈ Q: . . ." iterates over a collection that may change during iteration; it is shorthand for "while Q ≠ ∅: { k = Q.pop(); . . . }." We maintain a dictionary (the chart) that maps items to their weights. Each time an item V is proved by some rule, its weight w(V) is updated accordingly, as explained in §6 and Table 5. The weight is β(V) or (β(V), α(V)) according to whether p is false or true.
Alg. 1 writes C(pattern) to denote the set of all provable items (including axioms) that match pattern. This set will have previously been computed and stored in an index dedicated to the specific invocation of C(pattern) in the pseudocode. The index is another dictionary, with the previously bound variables of the pattern serving as the key. The pseudocode for individual rules also uses indices, to look up antecedents.
When an item V is first proved by a rule and added to the chart, it is also added to all of the appropriate sets in the indices. Prioritization ensures that we do not look up a set until it has converged.
Each dictionary may be implemented as a hash table, in which case lookup takes expected O(1) time under the Uniform Hashing Assumption. An array may also be used for guaranteed O(1) access, although its sparsity may increase the algorithm's asymptotic space requirements. 27

What changes when p is false, other than a few of the rules? Just one change is needed to the prioritization scheme of App. H.2. The EarleyFast deduction system is cyclic, as mentioned in App. H.1, so in this case, we cannot enforce π(U) < π(V) when U and V are an antecedent and consequent of the same rule. We will only be able to guarantee π(U) ≤ π(V), where the = case arises only for PRED1 and PRED2. To achieve this weaker prioritization, we modify our tie-breaking principle from App. H.2 (when p is false) to say that for a given k, all items of the last two forms have equal priority and thus may be popped in any order. 28

When a rule proves a consequent that has the same priority as one of its antecedents, it is possible that the consequent had popped previously. In our case, this happens only for the rule PRED1; crucially, it then does not matter if the new proof changes the consequent's weight, since this consequent is used only as a side condition (to PRED2) and so its weight is ignored. However, to avoid duplicate work, we must take care to avoid re-pushing the consequent now that it has been proved again. 29

Rather than place all the items on a single queue that is prioritized lexicographically as in App. H.2, we use a collection of priority queues that are combined in the pseudocode to have the same effect. They are configured and maintained as follows.
• Q is a priority queue of distinct positions k ∈ {0, . . . , N }, which pop in increasing order. k is added to it upon proving an item of the form [·, k, ·]. Initially Q = {0} due to the start axiom [0, 0, S → • ⋆].
• For each k ∈ Q, P k is a priority queue of distinct positions j ∈ {0, . . . , k}, which pop in decreasing order except that k itself pops last. j is added to it upon proving an item of the form [j, k, ·]. Initially P 0 = {0} due to the start axiom.
• For each j ∈ P k with j < k, N jk is a priority queue of distinct nonterminals B ∈ N, which pop in the height order described in App. H.2 above. B is added to it upon proving an item of the form [j, k, B → ρ • ].
• If p is false, then for each k ∈ Q, S k is a queue of all proved items of the form [k, k, B → • ⋆] or [k, k, B → • ρ]. These items have equal priority so may pop in any order (e.g., LIFO). Initially S 0 contains just the start axiom.

The main loop of the pseudocode (fragment):
  add G, x axioms to dictionaries and queues
  if p : START() ▷ apply START (App. G.1)
  for k ∈ Q : ▷ that is: while Q ̸= ∅, pop into k
    for j ∈ P k :
      for B ∈ N jk :
        . . .
  return 0 ▷ goal item has not been proved

27 Its sparsity need not increase the runtime requirements, however: an uninitialized array can be used to simulate an initialized array with constant overhead. Higham and Schenk (1993) attribute this technique to computer science folklore.
28 Formerly, all items of form 5 already had equal priority, and so did all items of form 6, but the former priority was strictly lower. This worked because there were no prediction chains.
29 Specifically, this discussion implies that in general, when a consequent may have the same priority as an antecedent, we must check whether it has ever been pushed onto the queue, not whether it is currently on the queue. Luckily, this is easily done by checking whether it is a key in the chart.
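To make the popping discipline concrete, here is a minimal Python sketch of one such queue (the class name PositionQueue and the heapq-based encoding are ours; as discussed next, a real implementation might prefer a bucket queue):

```python
import heapq

class PositionQueue:
    """Priority queue of distinct positions j in {0, ..., k} for a fixed k:
    pops in decreasing order, except that j == k pops last."""
    def __init__(self, k):
        self.k = k
        self.heap = []
        self.seen = set()   # ensures each j is pushed at most once

    def push(self, j):
        if j not in self.seen:
            self.seen.add(j)
            # j == k gets the worst key; all other j pop in decreasing order
            key = (1, 0) if j == self.k else (0, -j)
            heapq.heappush(self.heap, (key, j))

    def pop(self):
        return heapq.heappop(self.heap)[1]

    def __bool__(self):
        return bool(self.heap)

# P_5 receives positions as items [j, 5, ·] are proved, in arbitrary order:
P = PositionQueue(k=5)
for j in (5, 2, 4, 0):
    P.push(j)
order = []
while P:
    order.append(P.pop())
# order is [4, 2, 0, 5]: decreasing j, with k = 5 popping last
```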
Transitive consequents added later to a queue always have priority ≥ that of their antecedents that have already popped, so the minimum priority of the queue increases monotonically over time. This monotone property is what makes bucket queues viable in our setting (see Eisner, 2023, §5). In general, our priority queues are best implemented as bucket queues if they are dense, and as binary heaps or integer priority queues if they are sparse.

I Binarized EarleyFSA

Table 6 gives a version of EarleyFSA in which the ternary deduction rules SCAN, COMP1 and COMP2 have been binarized using the fold transform, as promised in §7.
• The SCAN1 and SCAN2 rules, which replace SCAN, introduce and consume new intermediate items of the form [i, j, q a ⇝ ⋆]. The SCAN1 rule sums over possible start positions j for word a. This is only advantageous in the case of lattice parsing (see footnote 7), since for string parsing, the only possible choice of j is k − 1.
• In a similar vein, COMP2A and COMP2B, which replace COMP2, introduce and consume new intermediate items. The COMP2A rule aggregates the different items from i to j that are looking for a B constituent to their immediate right, summing over their possible current states q.
• Similarly, COMP1A introduces new intermediate items that sum over possible final states q ′ .
• We did not bother to binarize the ternary rule FILTER, as there is no binarization that provides an asymptotic speed-up.
There are different ways to binarize inference rules, and in Table 6 we have chosen to binarize SCAN and COMP2 in complementary ways. Our binarization of SCAN is optimized for the common case of a dense WFSA and a sparse sentence, where state q allows many terminal symbols a but the input allows only one (as in string parsing) or a few. SCAN1 finds just the symbols a allowed by the input and SCAN2 looks up only those out-arcs from q. Conversely, our binarization of COMP2 is optimized for the case of a sparse WFSA and a dense parse table: COMP2A finds the small number of incomplete constituents over [i, j] that are looking for a B, and COMP2B looks those up when it finds a complete B constituent, just like EarleyFast.
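To see why the fold transform pays off, consider a toy computation in the real (sum-product) semiring. This is our own illustrative sketch, not the paper's pseudocode: the dictionaries A, arc, and comp stand in for the chart and the WFSA, and the intermediate dictionary mid plays the role of the new intermediate items.

```python
from collections import defaultdict

# Toy weights: A[(i, j, q)] = weight of an incomplete item in state q over (i, j),
# arc[(q, 'B')] = weight of the WFSA arc from q labeled B,
# comp[(j, k)] = weight of a complete B constituent over (j, k).
A = {(0, 1, 'q0'): 0.5, (0, 1, 'q1'): 0.25, (0, 2, 'q0'): 0.125}
arc = {('q0', 'B'): 0.4, ('q1', 'B'): 0.8}
comp = {(1, 3): 0.5, (2, 3): 0.25}

# Unbinarized (ternary) rule: one triple product per (i, j, q, k) combination.
ternary = defaultdict(float)
for (i, j, q), w in A.items():
    for (j2, k), wc in comp.items():
        if j2 == j and (q, 'B') in arc:
            ternary[(i, k)] += w * arc[(q, 'B')] * wc

# Fold transform: first sum out q (as COMP2A does) ...
mid = defaultdict(float)   # intermediate item: span (i, j) looking for a B
for (i, j, q), w in A.items():
    if (q, 'B') in arc:
        mid[(i, j)] += w * arc[(q, 'B')]
# ... then combine the intermediate item with the complete B (as COMP2B does).
binary = defaultdict(float)
for (i, j), w in mid.items():
    for (j2, k), wc in comp.items():
        if j2 == j:
            binary[(i, k)] += w * wc
```

The ternary loop touches every (q, j) pair for each (i, k), while the folded version sums out q once per (i, j); that is what removes a factor from the worst-case runtime.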
It is possible to change each of these binarizations. In particular, binarizing SCAN by first combining [i, j, q] with q a ⇝ q′ (analogously to COMP2A) would be useful when parsing a large or infinite lattice, such as the trie implicit in a neural language model, with a constrained grammar.

J Handling Nullary and Unary Productions in an FSA
As for EarleyFast, EarleyFSA (§7) requires elimination of nullary productions. We can handle nullary productions by directly adapting the construction of App. F to the WFSA case. Indeed, the WFSA version is simpler to express. For each arc q B⇝ q′ such that B ∈ N and e B ̸= 0, we replace the B label of that arc with B̸=ε (preserving the arc's weight), and add a new arc q ε⇝ q′ of weight e B . We then define a new WFSA M′ = (M ∩ ¬M bad ) ∪ M good , where M bad is an unweighted FSA that accepts exactly those strings of the form A (i.e., nullary productions), ¬ takes the unweighted complement, and M good is a WFSA that accepts exactly strings of the form S (with weight e S ) and S̸=ε S (with weight 1). As this construction introduces new ε arcs, it should precede the elimination of ε-cycles.
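The arc-rewriting step can be sketched as follows (a hypothetical representation of our own: arcs are tuples (source, label, target, weight), e[B] stands for e B, and the subsequent intersection with ¬M bad is omitted):

```python
# Each arc is (source_state, label, target_state, weight).
# e maps each nonterminal B to the total weight e_B of B =>* epsilon.
def split_nullable_arcs(arcs, nonterminals, e):
    """Replace each arc q --B--> q' (B nullable) with q --B_ne--> q' of the
    same weight, plus a new epsilon arc q --eps--> q' of weight e[B]."""
    out = []
    for (q, lab, q2, w) in arcs:
        if lab in nonterminals and e.get(lab, 0.0) != 0.0:
            out.append((q, lab + "_ne", q2, w))  # B restricted to non-empty yields
            out.append((q, "eps", q2, e[lab]))   # the case where B derives epsilon
        else:
            out.append((q, lab, q2, w))
    return out

arcs = [(0, "B", 1, 0.9), (1, "a", 2, 1.0)]
new_arcs = split_nullable_arcs(arcs, {"B"}, {"B": 0.1})
# new_arcs: B-arc split into a B_ne arc (weight 0.9) and an eps arc (weight 0.1)
```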
Notice that in the example of App. F where a production A → ρ was replaced with up to 2 k − 1 variants, the WFSA construction efficiently shares structure among these variants. It adds at most k edges at the first step and at most doubles the total number of states through intersection with ¬M bad .
Similarly, we can handle unary productions by directly adapting the construction of App. E to the WFSA case. We first extract all weighted unary rules by intersecting M with the unweighted language {B A : A, B ∈ N } (and determinizing the result so as to combine duplicate rules). Exactly as in App. E, we construct the unary rule graph and compute its SCCs along with weights w A * ⇒ B for all A, B in the same SCC. We modify the WFSA by underlining all hatted nonterminals A and overlining all nonterminals B. Finally, we define our new WFSA grammar (M ∩ ¬M bad ) ∪ M good . Here M bad is an unweighted FSA that accepts exactly those strings of the form B A and M good is a WFSA that accepts exactly strings of the form B A such that A, B are in the same SCC, with weight w A * ⇒ B . Following each construction, nonterminal names can again be simplified as in Apps. E and F.
Table 6: A version of EarleyFSA (Table 2) in which SCAN, COMP1 and COMP2 have been binarized using a fold transform. Domains: i, j, k ∈ {0, . . . , N}; A ∈ N; a ∈ Σ; q, q′ ∈ Q. Since COMP1A does not depend on the input, it can in practice be run during preprocessing, just like the rules that derive other WFSA items such as q * A ⇝ ⋆. See the main text (App. I) for a discussion of alternative binarization schemes.

Finally, §7 mentioned that we must eliminate ε-cycles from the FSA. The algorithm for doing so (Mohri, 2002) is fundamentally the same as our method for eliminating unary rule cycles from a CFG (App. E), but now it operates on the graph whose edges are ε-arcs of the FSA, rather than the graph whose edges are unary rules of the CFG.
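The cycle detection itself is an ordinary strongly-connected-components computation on the graph whose edges are the ε-arcs. A self-contained sketch using Kosaraju's algorithm (the representation is ours; a production implementation might instead use Tarjan's single-pass algorithm or an iterative DFS):

```python
from collections import defaultdict

def sccs(states, eps_arcs):
    """Strongly connected components of the graph whose edges are epsilon-arcs.
    Returns a dict mapping each state to a canonical representative."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    for q, q2 in eps_arcs:
        fwd[q].append(q2)
        bwd[q2].append(q)
    # Pass 1: record states in order of DFS completion.
    order, seen = [], set()
    def dfs1(q):
        seen.add(q)
        for q2 in fwd[q]:
            if q2 not in seen:
                dfs1(q2)
        order.append(q)
    for q in states:
        if q not in seen:
            dfs1(q)
    # Pass 2: DFS on the reversed graph in reverse completion order.
    comp = {}
    def dfs2(q, root):
        comp[q] = root
        for q2 in bwd[q]:
            if q2 not in comp:
                dfs2(q2, root)
    for q in reversed(order):
        if q not in comp:
            dfs2(q, q)
    return comp

# Toy epsilon-cycle: 0 -> 1 -> 0, plus an acyclic arc 1 -> 2.
comp = sccs([0, 1, 2], [(0, 1), (1, 0), (1, 2)])
# States 0 and 1 form one SCC (an epsilon-cycle); state 2 is its own SCC.
```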

K Non-Commutative Semirings
We finally consider the case of non-commutative weight semirings, where the order of multiplication becomes significant.
In this case, in the product (1) that defines the weight of a derivation tree T , the productions should be multiplied in the order of a pre-order traversal of T .
In §3, when we recursively defined the weight w(d V ) of a proof, we took a product over the above-the-bar antecedents of a proof rule. These should be multiplied in the same left-to-right order that is shown in the rule. Our deduction rules are carefully written so that under these conventions, the resulting proof weight matches the weight (1) of the corresponding CFG derivation.
For the same reason, the same left-to-right order should be used in §3 when computing the inside probability β(V ) of an item.
Eliminating nullary productions from a weighted CFG (App. F) is not in general possible in non-commutative semirings. However, if the grammar has no nullary productions or is converted to an FSA before eliminating nullary productions (App. J), then weighted parsing may remain possible.
What goes wrong? The construction in App. F unfortunately reorders the weights in the product (1). Specifically, in the production A → µ B ν, the product should include the weight e B after the weights in the µ subtrees, but our construction made it part of the weight of the modified production A → µ ν and thus moved it before the µ subtrees. This is incorrect when µ ̸ = ε and ⊗ is non-commutative.
The way to rescue the method is to switch to using WFSA grammars (§7). The WFSA grammar breaks each rule up into multiple arcs, whose weights variously fall before, between, and after the weights of its children. When defining the weight of a derivation under the WFSA grammar, we do not simply use a pre-order traversal as in equation (1). The definition is easiest to convey informally through an example. Suppose a derivation tree for A * ⇒ x uses a WFSA path at the root that accepts B C (i.e., the production A → B C) with weight w. Recursively let w B and w C be the weights of the child subderivations, rooted at B and C. Then the overall weight of the derivation of A will not be w ⊗ w B ⊗ w C (prefix order), but rather w 1 ⊗ w B ⊗ w 2 ⊗ w C ⊗ w 3 . Here we have factored the path weight w into w 1 ⊗ w 2 ⊗ w 3 , which are respectively the weights of the subpath up through B (including the initial-state weight), the subpath from there up through C, and the subpath from there to the end (including the final-state weight).
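The effect of this interleaving is easiest to check in the free monoid, where ⊗ is string concatenation and hence non-commutative. The symbolic weights below are our own stand-ins:

```python
# ⊗ is string concatenation here, so the order of factors is observable.
w1, w2, w3 = "w1.", "w2.", "w3."   # factors of the root path weight w
wB, wC = "wB.", "wC."              # weights of the B and C subderivations

prefix_order = (w1 + w2 + w3) + wB + wC   # w ⊗ wB ⊗ wC: wrong interleaving
interleaved  = w1 + wB + w2 + wC + w3     # w1 ⊗ wB ⊗ w2 ⊗ wC ⊗ w3
# The two products differ, so the interleaved definition genuinely matters.
```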
When converting a CFG to an equivalent WFSA grammar (footnote 14), the rule weight always goes at the start of the rule so that the weights are unchanged. However, the nullary elimination procedure for the WFSA (App. J) is able to replace unweighted nonterminals in the middle of a production with weighted ε-arcs. This is the source of its extra power, as well as its greater simplicity compared to App. F.
It really is not possible to fully eliminate nulls within the simpler weighted CFG formalism.
Consider an unambiguous weighted CFG whose productions are S → a S A, S → b S B, S → c, A → ε, B → ε, with respective weights w a , w b , w c , w A , w B . Then a string x = abbc will have Z x given by the mirrored product w a ⊗ w b ⊗ w b ⊗ w c ⊗ w B ⊗ w B ⊗ w A . Within our weighted CFG formalism, there is no way to include the final weights w B ⊗ w B ⊗ w A if we are not allowed to have null constituents in those positions.
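The mirrored product can be verified symbolically by again taking ⊗ to be string concatenation and multiplying production weights in pre-order, per equation (1). The tree encoding below is our own:

```python
# Derivation tree for x = "abbc": each node is (production_weight, [child subtrees]).
wa, wb, wc, wA, wB = "wa.", "wb.", "wc.", "wA.", "wB."
tree = (wa, [                      # S -> a S A
          (wb, [                   # S -> b S B
              (wb, [               # S -> b S B
                  (wc, []),        # S -> c
                  (wB, [])]),      # B -> eps
              (wB, [])]),          # B -> eps
          (wA, [])])               # A -> eps

def preorder_weight(node):
    w, kids = node
    for kid in kids:
        w += preorder_weight(kid)  # string concatenation = non-commutative product
    return w

Z = preorder_weight(tree)
# Z is the mirrored product wa ⊗ wb ⊗ wb ⊗ wc ⊗ wB ⊗ wB ⊗ wA
```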
Even with WFSAs, there is still a problem: in the non-commutative case, we cannot eliminate unary rule cycles (App. J). If we had built a binary A constituent with weight w, then a unary CFG rule A → A with weight w 1 required us to compute the total weight of all derivations of A, by taking a summation of the form w ⊕ (w 1 ⊗ w) ⊕ (w 1 ⊗ w 1 ⊗ w) ⊕ · · · . This factors as (1 ⊕ w 1 ⊕ (w 1 ⊗ w 1 ) ⊕ · · · ) ⊗ w, and unary rule cycle elimination served to precompute the parenthesized sum, which was denoted as w A * ⇒ A , and record it as the weight of a new rule A → A. However, in the non-commutative case, the WFSA path corresponding to A → A might start with weight w 1 and end with weight w 2 . In that case, the necessary summation has the form w ⊕ (w 1 ⊗ w ⊗ w 2 ) ⊕ (w 1 ⊗ w 1 ⊗ w ⊗ w 2 ⊗ w 2 ) ⊕ · · · . Unfortunately, this cannot be factored as before, so we cannot precompute the infinite sums as before. 30 The construction in App. J assumed that we could extract weighted unary rules from the WFSA, with a single consolidated weight at the start of each rule, but consolidating the weight in that way required commutativity.

Figure 4: As all these algorithms are worst-case cubic in N, each curve on these log-log plots is bounded above by a line of slope 3, but the lower lines have better grammar constants. The experiment was conducted using a Cython implementation on an Intel(R) Core(TM) i7-7500U processor with 16GB RAM.
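For contrast, in a commutative (here, real-valued) semiring the parenthesized sum is the familiar star of w 1, which unary rule cycle elimination precomputes in closed form. A small numeric check of that factorization (our own sketch, assuming w1 < 1 so the series converges):

```python
w, w1 = 0.3, 0.5

# Truncation of the infinite sum w ⊕ (w1 ⊗ w) ⊕ (w1 ⊗ w1 ⊗ w) ⊕ ...
truncated = sum(w1 ** n * w for n in range(60))

# Cycle elimination instead precomputes the star of w1 once:
star_w1 = 1.0 / (1.0 - w1)    # 1 ⊕ w1 ⊕ w1⊗w1 ⊕ ... as a closed form
closed = star_w1 * w          # equals the infinite sum
```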

L Runtime Experiment Results
More details on the experiments of §8 appear in Fig. 4.

30 A related problem would appear in trying to generalize the left-corner rewrite weights in App. G.1 to the non-commutative case.